Cray T3E 900: The benchmarks were run on the Pittsburgh Supercomputing Center T3E. It has 450 MHz DEC/Compaq Alpha processors, each capable of a peak speed of 900 Mflops. There is assembly code for the important kernels, though it is not particularly optimized for this chip, as it was written for an earlier Alpha processor. There are special prefetching instructions that do speed up the code. There is not enough network bandwidth to allow a 4^4 problem per node to run efficiently, but performance for L >= 6 is quite acceptable, with fine scaling out to 128 nodes (and beyond).
IBM SP Power 3 Winterhawk II: This machine was installed at Indiana University in December, 1999. The nodes are 4-CPU Power3+ (375 MHz). There are 16 nodes, for a total of 64 CPUs. The switch has a 150 MB/s peak bandwidth between nodes, and an aggregate bandwidth of 1.2 GB/s. Tests were run while other jobs were running on the system. Some assembly code for the RS6000 architecture was used. On multinode benchmarks, 4 CPUs were used on each node.
This computer achieves excellent speed for L=8. For smaller L, communication does not seem to be fast enough, so performance is not as good. There is also a rather sharp falloff when L is increased from 8 to 10. The falloff is even more pronounced for larger values of L, and is presumably due to inadequate memory bandwidth as the four processors on each node contend for access.
SGI Origin 2000 (195 MHz nodes): This machine was installed at IU in May, 1997. It has 195 MHz R10000 CPUs. There is no assembly code, and the benchmarks were run on a 64-node machine that was running other jobs. No 64-node benchmark was run here because on the Origin an MPI job that is to run on n CPUs creates n+1 processes, and on a 64-node Origin there is no place to run the 65th process without interfering with the rest of the computation.
Sun Enterprise 10000 (400 MHz nodes): The results presented here were obtained by Art Lazanoff on a machine at Sun. IU obtained a 64-node E10000 in March, 2000. I don't believe I ever got quite the performance seen at Sun. Since our local machine is quite busy and usually fewer than 64 nodes are available for parallel jobs, it is not easy to rerun the benchmarks on a quiet machine.
CANDYCANE (350 MHz Pentium II): CANDYCANE stands for Computers And Network Do Your Computation And Nothing Else. It is a 32-node Beowulf cluster built at the Indiana University Physics Department during Thanksgiving break in 1998. Each node has a single 350 MHz PII processor, 64 MB of memory and a 4.3 GB hard drive. The nodes are connected using an HP ProCurve Switch 4000M. This FastEthernet switch comes populated with 40 ports and has slots that can contain modules to accommodate another 40 ports, but the backplane is not fast enough to support that many simultaneous messages.
The message passing performance of MPI under FastEthernet is insufficient to support good performance for L = 4 or 6 for large numbers of nodes. We begin to get reasonable performance for L >= 8, with performance continuing to increase for larger values of L as the required bandwidth decreases. For the PC architectures, the cache is small and the communication latency is fairly high, so there is no moderate value of L, as on the IBM SP, where performance peaks.
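The statement that the required bandwidth decreases as L grows can be illustrated with a rough surface-to-volume estimate. The sketch below assumes each node holds an L^4 local lattice and exchanges one layer of boundary sites with its neighbours in all eight directions. That communication pattern is an assumption made for illustration, and the function names are hypothetical, not taken from the benchmark code.

```python
# Rough sketch: communication demand relative to computation for an
# L^4 local lattice. ASSUMPTION (not stated in the text): one layer of
# boundary sites is exchanged in each of the 2*4 directions.

def volume_sites(L: int, dims: int = 4) -> int:
    """Number of lattice sites held on one node."""
    return L ** dims

def surface_sites(L: int, dims: int = 4) -> int:
    """Site values sent per exchange: 2 faces per dimension, L^(dims-1) sites per face."""
    return 2 * dims * L ** (dims - 1)

def surface_to_volume(L: int, dims: int = 4) -> float:
    """Ratio of sites communicated to sites computed; equals 2*dims/L."""
    return surface_sites(L, dims) / volume_sites(L, dims)

if __name__ == "__main__":
    for L in (4, 6, 8, 10, 12, 14):
        print(f"L={L:2d}  volume={volume_sites(L):6d}  "
              f"surface={surface_sites(L):5d}  ratio={surface_to_volume(L):.3f}")
```

Under these assumptions the ratio is 8/L, so doubling L from 4 to 8 halves the amount of data communicated per site computed. This is only a scaling argument and says nothing about latency or the actual message sizes used.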
Roadrunner (450 MHz Pentium II): Roadrunner is an AltaTech Cluster at the Albuquerque High Performance Computing Center. It is a 64-node cluster with dual-CPU 450 MHz Pentium II nodes. The AltaTech Cluster has both Myrinet and FastEthernet communication networks, so it is very easy to compare the performance of single vs. dual processors and FastEthernet vs. Myrinet. For our code, the second processor is not a cost-effective addition under FastEthernet, but it is under Myrinet. Overall, FastEthernet is somewhat more cost-effective than Myrinet for a fixed number of nodes, but its performance is lower.
Compaq ES40 (500 MHz Alpha EV6): The Pittsburgh Supercomputing Center has a four-processor Compaq ES40. It has 500 MHz Alpha EV6 chips and a version of MPI that makes use of the shared memory. Each processor has a 4 MB cache. Benchmarks were run both without assembler routines and with our old unoptimized routines. Because the machine is used for interactive logins, the variation between runs has been greater than the variation within a typical run, so the errors are underestimated.
The results are quite interesting. Without assembler, we see that the performance is monotonically decreasing as the problem size is increased, but the biggest drop is going from L = 8 to 10. Going to two nodes, there is a substantial performance penalty, even for the largest sizes, where one might expect that there is enough bandwidth. For the largest sizes, the problem is probably contention for access to memory. For four nodes, we find the best performance for L=8. In that case, there is little contention for memory, but the problem size is large enough that the communication of data between the processors is not as big a bottleneck as for smaller problems. It is strange that for L=10 and 12, the performance per node is better for four nodes than for two.
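For context on the problem sizes involved, the following sketch works out how many bytes of a 4 MB cache are available per lattice site for each L. It assumes that each processor works on its own L^4 lattice and that 1 MB means 2^20 bytes. Both are assumptions, and the text does not say that cache capacity is the cause of the drop between L = 8 and 10; the numbers only show how quickly the per-site budget shrinks.

```python
# Sketch: cache bytes available per lattice site on one processor.
# ASSUMPTIONS: each processor holds an L^4 lattice; 1 MB = 2**20 bytes.
# The 4 MB figure is the per-processor cache size quoted in the text.

CACHE_BYTES = 4 * 2**20

def cache_bytes_per_site(L: int, cache_bytes: int = CACHE_BYTES) -> float:
    """Cache capacity divided evenly over the L^4 sites of the local lattice."""
    return cache_bytes / L**4

if __name__ == "__main__":
    for L in (4, 6, 8, 10, 12):
        print(f"L={L:2d}  sites={L**4:6d}  "
              f"bytes/site={cache_bytes_per_site(L):9.1f}")
```

With these assumptions the budget is 1024 bytes per site at L = 8 and about 419 bytes per site at L = 10. Whether the working set of the benchmark falls between those two values is not something the text establishes.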
Compaq SC Series Prototype: Chuck Schneider, a Compaq employee, was kind enough to run some of our benchmarks on a prototype of the Compaq AlphaServer SC series. This computer has 500 MHz EV6 quad-processor ES40s as the nodes and a Quadrics interconnect. The production machines will have the same interconnect, but will have 667 MHz EV67 chips. In addition, the cache size will be doubled from 4 to 8 MB per processor. The LLNL TeraCluster described below has the 667 MHz chips. The nodes have fully populated memories, which should result in better interleaving. Also, a newer version of the compiler was used than for the benchmarks run at PSC.
Lawrence Livermore National Lab TeraCluster: The LLNL TeraCluster is constructed from Compaq ES40 Alpha-based nodes and uses a Quadrics interconnect. This is probably a prototype or one of the first delivered Compaq AlphaServer SC computers. The system is in the process of being upgraded from 500 MHz to 667 MHz CPUs, and eventually each node should have four processors. Benchmarks were run on nodes of both speeds, both for single and dual processors. In addition, straight C code was compared with our unoptimized assembler kernels. The results are quite impressive.