Multinode Performance

Some recent benchmarks of multinode performance with various combinations of CPU and network follow. A more comprehensive compilation may be found at the MILC Benchmark Site.

We have multinode benchmarks for the code on a number of architectures.  In each case, the problem size is scaled with the number of processors so that we always have L^4 sites per processor.
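The scaling convention above can be made concrete with a little arithmetic. The Python sketch below is illustrative only: the function names are hypothetical and are not part of the MILC code, and it computes only the total number of lattice sites, not how the lattice is laid out across processors.

```python
def sites_per_cpu(L):
    """Lattice sites held by each processor: L to the fourth power."""
    return L ** 4


def total_sites(L, n_cpus):
    """Total lattice sites when the problem size grows with the CPU count."""
    return sites_per_cpu(L) * n_cpus


# With L fixed, the work per processor stays constant as CPUs are added.
for n_cpus in (1, 4, 256):
    print(f"L=8, {n_cpus:3d} CPUs: {total_sites(8, n_cpus)} sites in total")
```

Because the local volume is held fixed, a drop in Megaflops per CPU as the CPU count grows reflects communication and memory-system costs rather than a shrinking amount of work per processor.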

For the Compaq nodes, we have been able to run a number of tests at the Pittsburgh Supercomputing Center (PSC).  We first present ES40 and ES45 results on up to 4 CPUs, i.e., within a single 4-way SMP node.  For the ES40, we also have results on many processors using the Quadrics network.

We see that the combination of going from our site major to field major code and upgrading the processor from 667 MHz EV67 to 1000 MHz EV68 results in a very substantial speedup. For the largest values of L, we may be seeing total system memory bandwidth playing an important limiting role as the ratio of 4-way SMP to single CPU speeds decreases steadily from L=8 to 14 on the ES45. Scaling is better on the ES40 nodes, but it is well known that it is easier to get good scaling on codes that run slowly on a single CPU.

Comparison of four CPU benchmarks on Compaq ES40 and ES45 nodes. On the ES40, we use the site major MILC code. On the ES45, we use the new field major improvement with prefetching.

 L   ES40 (MF)   ES45 (MF)   ES45/ES40
 4         159         525        3.30
 6         368         667        1.81
 8         419         621        1.48
10         330         601        1.82
12         225         491        2.18
14         160         345        2.15
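The ratio column is simply the ES45 rate divided by the ES40 rate. The short Python sketch below recomputes it from the rates as printed; the variable and function names are hypothetical. Small differences in the last digit are possible, presumably because the listed rates are themselves rounded.

```python
# L: (ES40 MF, ES45 MF, ES45/ES40 ratio as printed in the table)
four_cpu = {
    4: (159, 525, 3.30),
    6: (368, 667, 1.81),
    8: (419, 621, 1.48),
    10: (330, 601, 1.82),
    12: (225, 491, 2.18),
    14: (160, 345, 2.15),
}


def speedup(es40_mf, es45_mf):
    """Ratio of the ES45 rate to the ES40 rate."""
    return es45_mf / es40_mf


recomputed = {L: speedup(es40, es45) for L, (es40, es45, _) in four_cpu.items()}

for L, (_, _, printed) in four_cpu.items():
    print(f"L={L:2d}  recomputed {recomputed[L]:.3f}  printed {printed:.2f}")
```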

Our large scale benchmarks of the MILC code on the TCS prototype at PSC were done before we completed work on the field major inverters. It is unfortunate that we do not have a similar set of benchmarks for that code, but we will present the results for the site major code and remind the reader to look at the previous table to estimate what type of improvement will be seen in the future.

Performance in Megaflops per CPU using the site major code on TCS, the ES40-based Compaq computer at PSC
 L     1 CPU    2 CPUs    4 CPUs    8 CPUs   16 CPUs   32 CPUs   64 CPUs  128 CPUs  256 CPUs
 4       560       334       160       123       101        97        92        89        74
 6       517       444       368       317       280       268       258       236       173
 8       495       464       426       386       361       351       335       302       262
10       393       384       342       308       303       296       268       247       221
12       293       266       225       206       199       198       186       176       153
14       240       203       160       153       149       148       145       139       131
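One way to read this table is to divide each per-CPU rate by the single-CPU rate for the same L, and to multiply the per-CPU rate by the CPU count to get an aggregate rate. The Python sketch below does both. The efficiency definition and the names are assumptions made for illustration; the source does not define a scaling metric.

```python
cpu_counts = [1, 2, 4, 8, 16, 32, 64, 128, 256]

# Megaflops per CPU, site major code on TCS, copied from the table above.
tcs_site_major = {
    4: [560, 334, 160, 123, 101, 97, 92, 89, 74],
    6: [517, 444, 368, 317, 280, 268, 258, 236, 173],
    8: [495, 464, 426, 386, 361, 351, 335, 302, 262],
    10: [393, 384, 342, 308, 303, 296, 268, 247, 221],
    12: [293, 266, 225, 206, 199, 198, 186, 176, 153],
    14: [240, 203, 160, 153, 149, 148, 145, 139, 131],
}


def efficiency(rates):
    """Per-CPU rate at each CPU count relative to the single-CPU rate."""
    return [r / rates[0] for r in rates]


def aggregate_mf(rates):
    """Total Megaflops: per-CPU rate times the number of CPUs."""
    return [n * r for n, r in zip(cpu_counts, rates)]


for L, rates in tcs_site_major.items():
    eff = efficiency(rates)[-1]
    total = aggregate_mf(rates)[-1]
    print(f"L={L:2d}: 256-CPU efficiency {eff:.2f}, aggregate {total} MF")
```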

For the IBM SP we have results on up to 256 CPUs with both site major and field major inverters.  These were obtained on the Indiana University SP using 4-way SMP nodes.  The field major code results in a substantial increase in speed for L > 8, but for reasons not yet understood, it underperforms the site major code for small L and larger numbers of CPUs.

Performance in Megaflops per CPU on the Indiana University IBM SP using the site major code
 L     1 CPU    2 CPUs    4 CPUs    8 CPUs   16 CPUs   32 CPUs   64 CPUs  128 CPUs  256 CPUs
 4       432       305       280       110        84        78        72        64        54
 6       438       382       369       252       208       196       183       171       151
 8       375       342       340       276       239       231       223       204       181
10       235       208       157       138       127       125       122       115       108
12       153       128        81        77        73        73        72        69        67
14       133       112        67        65        63        63        62        61        59

Performance in Megaflops per CPU on the Indiana University IBM SP using the field major code
 L     1 CPU    2 CPUs    4 CPUs    8 CPUs   16 CPUs   32 CPUs   64 CPUs  128 CPUs  256 CPUs
 4       588       353       319        98        68        62        49        52        41
 6       631       515       484       245       170       160       133       132       104
 8       624       548       529       316       230       218       210       176       140
10       579       503       471       292       224       212       203       184       159
12       478       386       266       192       159       154       148       139       127
14       420       293       174       148       131       128       124       117       109
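To see where the field major code falls behind on the IBM SP, the two tables can be compared entry by entry. The Python sketch below, with hypothetical names, reports for each L the smallest CPU count at which the field major rate drops below the site major rate, or None if it never does.

```python
cpu_counts = [1, 2, 4, 8, 16, 32, 64, 128, 256]

# Megaflops per CPU on the Indiana University IBM SP, from the two tables.
site_major = {
    4: [432, 305, 280, 110, 84, 78, 72, 64, 54],
    6: [438, 382, 369, 252, 208, 196, 183, 171, 151],
    8: [375, 342, 340, 276, 239, 231, 223, 204, 181],
    10: [235, 208, 157, 138, 127, 125, 122, 115, 108],
    12: [153, 128, 81, 77, 73, 73, 72, 69, 67],
    14: [133, 112, 67, 65, 63, 63, 62, 61, 59],
}
field_major = {
    4: [588, 353, 319, 98, 68, 62, 49, 52, 41],
    6: [631, 515, 484, 245, 170, 160, 133, 132, 104],
    8: [624, 548, 529, 316, 230, 218, 210, 176, 140],
    10: [579, 503, 471, 292, 224, 212, 203, 184, 159],
    12: [478, 386, 266, 192, 159, 154, 148, 139, 127],
    14: [420, 293, 174, 148, 131, 128, 124, 117, 109],
}


def first_slowdown(L):
    """Smallest CPU count where field major is slower than site major."""
    for n, site, field in zip(cpu_counts, site_major[L], field_major[L]):
        if field < site:
            return n
    return None


for L in site_major:
    print(f"L={L:2d}: field major first slower at {first_slowdown(L)} CPUs")
```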

Some results are available on an Itanium cluster at NCSA.  The nodes run at 733 MHz and the network is Myrinet.  This is an early version of the Titan system at NCSA currently being installed.  We thank Avneesh Pant of NCSA for doing these benchmarks.  We can compare results using one or two CPUs per node.

CPUs  nodes    L=4    L=6    L=8   L=10
   1      1   1067    927    503    387
   2      2    280    533    420    339
   2      1    454    636    331    255
   4      4    235    497    409    335
   4      2    263    513    304    248
   8      8    223    353    285    280
   8      4    211    315    219    194
  16     16    118    292    293    281
  16      8     90    247    224    212
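The comparison between one and two CPUs per node can be made explicit by dividing the two-per-node figure by the one-per-node figure at the same CPU count. The Python sketch below does this for the Itanium numbers; the names are hypothetical. Since only a ratio is taken, the result does not depend on the units of the listed figures.

```python
L_values = [4, 6, 8, 10]

# (CPUs, nodes): listed figures for L = 4, 6, 8, 10
itanium = {
    (1, 1): [1067, 927, 503, 387],
    (2, 2): [280, 533, 420, 339],
    (2, 1): [454, 636, 331, 255],
    (4, 4): [235, 497, 409, 335],
    (4, 2): [263, 513, 304, 248],
    (8, 8): [223, 353, 285, 280],
    (8, 4): [211, 315, 219, 194],
    (16, 16): [118, 292, 293, 281],
    (16, 8): [90, 247, 224, 212],
}


def dual_over_single(cpus, L):
    """Two-CPUs-per-node figure divided by the one-CPU-per-node figure."""
    i = L_values.index(L)
    return itanium[(cpus, cpus // 2)][i] / itanium[(cpus, cpus)][i]


for cpus in (2, 4, 8, 16):
    ratios = ", ".join(f"L={L}: {dual_over_single(cpus, L):.2f}" for L in L_values)
    print(f"{cpus:2d} CPUs  {ratios}")
```

A ratio above 1 means that using both CPUs of a node was faster at that size; a ratio below 1 means that spreading the same number of CPUs over twice as many nodes was faster.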

It is also worthwhile to present some benchmarks from the Platinum cluster at NCSA.  This cluster consists of 512 IBM eServer x330 thin servers with dual 1 GHz Intel Pentium III processors connected by Myrinet.  For the Pentium III, we have not found a substantial speed increase from the field major code.  In fact, the performance is often poorer, especially for large numbers of CPUs.  Results using only one CPU per node are also available, but are not presented here.

Performance in Megaflops per CPU on the NCSA Platinum cluster using the site major code
 L     1 CPU    2 CPUs    4 CPUs    8 CPUs   16 CPUs   32 CPUs   64 CPUs  128 CPUs
 4       433       288       224       118        96        87        74        70
 6       150       106        94        85        77        75        76        74
 8       139        96        94        86        82        80        78        75
10       132        93        91        85        81        81        81        79
12       132        90        90        78        74        73        74        79
14       131        90        91        73        66        71        71        80

Performance in Megaflops per CPU on the NCSA Platinum cluster using the field major code
 L     1 CPU    2 CPUs    4 CPUs    8 CPUs   16 CPUs   32 CPUs   64 CPUs  128 CPUs
 4       430       285       195        87        63        61        57        53
 6       167       111       104        79        69        69        68        66
 8       159       109       107        85        76        76        74        71
10       153       107       104        80        74        73        72        77
12       150       105       103        79        73        72        69        71
14       148       104       102        82        77        72        70        68
