Latency and Bandwidth Requirements of Massively Parallel Programs: FFT as a Case Study
Many theoretical models of parallel computation are based on overly
simplistic assumptions on the performance of the interconnection
network. For example they assume constant latency for any
communication pattern or infinite bandwidth. This paper presents a
case study based on the FFT transpose algorithm, which is mapped on
two families of scalable interconnection networks, the k-ary
n-trees and the k-ary n-cubes. We analyze in depth the network
behavior of a minimal adaptive algorithm for the k-ary n-trees
and three algorithms for the k-ary n-cubes, each offering an
increasing degree of adaptivity: the deterministic routing, a
minimal adaptive routing based on Duato's methodology and the Chaos
routing, a non-minimal adaptive cut-through version of the hot
potato routing. The simulation results collected on topologies with
up to 256 processing nodes show that the k-ary n-trees can
efficiently support the basic FFT algorithm by factoring the
personalized broadcast in a sequence of congestion-free steps.
Though k-ary n-cubes are less favored in terms of bisection
bandwidth, we can narrow the performance gap between the two
interconnection networks by properly interleaving communication and
computation. While in the presence of bandwidth-bound patterns the
communication latency becomes difficult to predict, the global
accepted network bandwidth converges to a fixed value after a
stabilization period, though both adaptive algorithms on the cubes
suffer from post-saturation problems.