The MILC QCD codes run on a wide variety of commercial supercomputers. They can also run on any Linux or NT cluster that supports MPI. This page summarizes some of the benchmarks that have been run for the Kogut-Susskind conjugate gradient part of the code.
Good performance on a parallel computer requires both good single-node performance and enough realized network bandwidth to spread the problem over many nodes. For our code, memory bandwidth and cache size largely determine single-node performance. We therefore run a series of benchmarks in which both the problem size per CPU and the number of CPUs are varied, because cache effects make single-node performance sensitive to the problem size. Given the single-node performance at a particular local volume, we can then ask whether the network performance is sufficient to run domains of that size on many nodes. This differs from the traditional analysis, which fixes the total problem size and tests scaling as the number of nodes increases. A gzipped tar file containing PBS scripts to run the benchmarks on up to 256 nodes is available here. It should be easy to modify these scripts to run on SMP nodes; when running on SMP nodes, we change "nodes" to "procs" in the names of the output files.
The benchmarks are presented in a variety of formats. In tabular form, each line represents a problem size of L^4 grid points per CPU. The result is in megaflops per CPU, with an error estimated from five conjugate gradient solves. In many cases, there are two sets of results for four CPUs: if there is a group of values at the bottom of the table, those are for L^2 × (2L)^2 problems, and the ones integrated into the rest of the table are for L^3 × (4L). For 8 CPUs, the problem size is L × (2L)^3; for 16 it is (2L)^4, and so on. For additional CPUs, we again double the later dimensions first, with no dimension more than a factor of 2 greater than the others. As the number of CPUs increases from 2 to 16, the number of directions in which the program would like to communicate simultaneously increases from 2 to 4. Beyond that, no additional messages are required in each iteration, except for the global sums. Thus, we expect performance per CPU to drop more rapidly between 1 and 16 CPUs and then to level off.
There are also axis files that summarize the results graphically, along with PostScript files generated from them. The link with the name of each system leads to more information about the corresponding benchmark; additional benchmarks may also be available there.
One final comment: if you give us an account, we will run the code on your computer and let you know the results.
If you have comments or suggestions, email Steven Gottlieb at firstname.lastname@example.org