Latency and Bandwidth Requirements of Massively Parallel Programs: FFT as a Case Study

Many theoretical models of parallel computation are based on overly simplistic assumptions on the performance of the interconnection network. For example they assume constant latency for any communication pattern or infinite bandwidth. This paper presents a case study based on the FFT transpose algorithm, which is mapped on two families of scalable interconnection networks, the k-ary n-trees and the k-ary n-cubes. We analyze in depth the network behavior of a minimal adaptive algorithm for the k-ary n-trees and three algorithms for the k-ary n-cubes, each offering an increasing degree of adaptivity: the deterministic routing, a minimal adaptive routing based on Duato's methodology and the Chaos routing, a non-minimal adaptive cut-through version of the hot potato routing. The simulation results collected on topologies with up to 256 processing nodes show that the k-ary n-trees can efficiently support the basic FFT algorithm by factoring the personalized broadcast in a sequence of congestion-free steps. Though k-ary n-cubes are less favored in terms of bisection bandwidth, we can narrow the performance gap between the two interconnection networks by properly interleaving communication and computation. While in the presence of bandwidth-bound patterns the communication latency becomes difficult to predict, the global accepted network bandwidth converges to a fixed value after a stabilization period, though both adaptive algorithms on the cubes suffer from post-saturation problems.