MILC QCD Code Benchmarks

The MILC QCD codes run on a wide variety of commercial supercomputers. They can also run on any Linux or NT cluster that supports MPI. This page summarizes some of the benchmarks that have been run for the Kogut-Susskind conjugate gradient part of the code.

To get good performance on a parallel computer requires good single-node performance and high enough realized network bandwidth to spread the problem out over many nodes. For our code, memory bandwidth and cache size are quite important for single-node performance. We often run a variety of benchmarks in which both the problem size per CPU and the number of CPUs are varied. We do this because single-node performance is sensitive to the problem size, due to cache effects. Given that single-node performance, we can then ask whether the network performance is sufficient to run domains of that size on many nodes. This is a bit different from the traditional analysis, which takes a fixed problem size and tests scaling as the number of nodes increases. A gzipped tar file containing PBS scripts to run the benchmarks on up to 256 nodes is available here. It should be easy to modify these scripts to run on SMP nodes; when we do, we change "nodes" to "procs" in the names of the output files.
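
To make the measurement procedure concrete, here is a minimal sketch in C with MPI of the kind of timing harness we have in mind. This is our own illustration, not the actual MILC benchmark code: ks_cg_solve is a stand-in stub, and FLOPS_PER_SITE_ITER is a placeholder for the true flop count per site per Kogut-Susskind CG iteration. It reports the mean Megaflops per CPU over five solves with a standard-error estimate, as quoted in the tables below.

    #include <mpi.h>
    #include <stdio.h>
    #include <math.h>

    #define NSOLVES 5                 /* five CG solves per measurement */
    #define FLOPS_PER_SITE_ITER 1.0   /* placeholder: set to the real KS CG flop count */

    /* Stand-in for the real Kogut-Susskind CG solver; the busy loop just
       makes the timing nonzero.  Returns the number of CG iterations. */
    static int ks_cg_solve(void) {
        volatile double x = 0.0;
        int i;
        for (i = 0; i < 10000000; i++) x += 1e-9;
        return 100;
    }

    /* Time NSOLVES solves and report mean Megaflops per CPU with the
       standard error of the mean. */
    static void benchmark(long sites_per_cpu, int rank) {
        double mf[NSOLVES], mean = 0.0, err = 0.0;
        int i;
        for (i = 0; i < NSOLVES; i++) {
            double t0, t1;
            int iters;
            MPI_Barrier(MPI_COMM_WORLD);
            t0 = MPI_Wtime();
            iters = ks_cg_solve();
            t1 = MPI_Wtime();
            mf[i] = FLOPS_PER_SITE_ITER * sites_per_cpu * iters / (t1 - t0) / 1e6;
            mean += mf[i] / NSOLVES;
        }
        for (i = 0; i < NSOLVES; i++)
            err += (mf[i] - mean) * (mf[i] - mean);
        err = sqrt(err / (NSOLVES * (NSOLVES - 1)));
        if (rank == 0)
            printf("%ld sites per CPU: %.1f +/- %.1f MF per CPU\n",
                   sites_per_cpu, mean, err);
    }

    int main(int argc, char **argv) {
        int rank;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        benchmark(8L * 8 * 8 * 8, rank);   /* e.g. an 8^4 domain per CPU */
        MPI_Finalize();
        return 0;
    }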

The benchmarks are presented in a variety of formats. In tabular form, each line represents a problem size of L^4 grid points per CPU. The result is in Megaflops per CPU, with an error estimate based on five conjugate gradient solves. In many cases, there are two sets of results for four CPUs: if there is a group of values at the bottom of the table, those are for L^2 × (2L)^2 problems, and the ones integrated into the rest of the table are for L^3 × (4L). For 8 CPUs, the problem size is L × (2L)^3; for 16 it is (2L)^4, and so on. For additional CPUs, we again double the later dimensions first, with no dimension more than a factor of 2 greater than the others. When the number of CPUs is increased from 2 to 16, the number of directions in which the program would like to communicate simultaneously increases from 2 to 4. Beyond that, no additional messages are required in each iteration, except for the global sums. Thus, we expect performance per CPU to drop most rapidly between 1 and 16 CPUs and then to level off.
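
For concreteness, here is a small C sketch (our own illustration, not part of MILC) of the dimension-doubling rule just described. Each factor-of-2 increase in the CPU count doubles the next dimension in the cycle t, z, y, x; this reproduces the L^2 × (2L)^2 convention for 4 CPUs, whereas the main-table L^3 × (4L) entries instead double t twice.

    #include <stdio.h>

    /* Given the per-CPU linear size L and a power-of-two CPU count,
       compute total lattice dimensions by doubling the later dimensions
       first (t, then z, then y, then x, then t again). */
    static void lattice_dims(int L, int ncpus, int dims[4]) {
        int d;
        for (d = 0; d < 4; d++) dims[d] = L;
        for (d = 3; ncpus > 1; d = (d + 3) % 4) {
            dims[d] *= 2;
            ncpus /= 2;
        }
    }

    int main(void) {
        int dims[4], n;
        for (n = 1; n <= 256; n *= 2) {
            lattice_dims(8, n, dims);   /* e.g. L = 8 */
            printf("%3d CPUs: %2d x %2d x %2d x %2d\n",
                   n, dims[0], dims[1], dims[2], dims[3]);
        }
        return 0;
    }

For example, with L = 8 this gives 8 × 8 × 8 × 16 on 2 CPUs, 8 × 8 × 16 × 16 on 4, 8 × 16 × 16 × 16 on 8, and 16^4 on 16, keeping the local volume fixed at L^4 per CPU throughout.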

There are also axis files that summarize the results graphically, and PostScript files generated from the axis files. The link with the name of the system leads to more information about the corresponding benchmark; some additional benchmarks may also be available there.

Results

Supercomputers

Cray T3E 900 (450 MHz)    table axis PostScript
IBM SP Power3 Winterhawk II (375 MHz)    table axis PostScript
SGI Origin 2000 (195 MHz nodes)    table axis PostScript
Sun E10000 (400 MHz nodes)    table axis PostScript

Linux Clusters (Intel Architecture)

CANDYCANE (350 MHz PII; Fast Ethernet) table axis PostScript
Roadrunner (450 MHz PII)
     Myrinet table axis PostScript
     Fast Ethernet table axis PostScript
     Myrinet (2 processors per node) table axis PostScript
     Fast Ethernet (2 processors per node) table axis PostScript
     Comparison of Fast Ethernet and Myrinet with one CPU per node (PostScript)
     Comparison of Fast Ethernet and Myrinet with two CPUs per node (PostScript)

Unix Clusters (Alpha Architecture)

ES40 (500 MHz EV6)
     No assembly code table axis PostScript
     Assembly code kernels table axis PostScript
Compaq SC Series Prototype
     500 MHz EV6 (no assembler) table axis PostScript
     500 MHz EV6 (assembler) table axis PostScript
Lawrence Livermore National Lab Teracluster
     500 MHz EV6 (no assembler; single CPU) table axis PostScript
     500 MHz EV6 (no assembler; dual CPU) table axis PostScript
     500 MHz EV6 (assembler; single CPU) table axis PostScript
     500 MHz EV6 (assembler; dual CPU) table axis PostScript
     667 MHz EV67 (no assembler; single CPU) table axis PostScript
     667 MHz EV67 (no assembler; dual CPU) table axis PostScript
     667 MHz EV67 (assembler; single CPU) table axis PostScript
     667 MHz EV67 (assembler; dual CPU) table axis PostScript

One final comment: if you give us an account on your computer, we will run the code there and let you know the results.

If you have comments or suggestions, email Steven Gottlieb at sg@indiana.edu.