Some recent benchmarks of multinode performance with various
combinations of CPU and network follow. A more comprehensive compilation may
be found at the
MILC Benchmark Site.
We have multinode benchmarks for the code on a number of architectures. 
For each case, the problem size is scaled with the number of processors
so that we always have L^4 sites per processor.
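The scaling rule above can be illustrated with a short sketch. It assumes only what the text states, namely that each processor holds a local lattice of L^4 sites; the function name `total_sites` is hypothetical and is not part of the MILC code.

```python
def total_sites(L: int, n_procs: int) -> int:
    """Global lattice volume when every processor holds L^4 local sites."""
    return n_procs * L**4


if __name__ == "__main__":
    # The per-processor volume stays fixed while the global volume
    # grows linearly with the processor count.
    for L in (4, 8, 14):
        for n_procs in (1, 4, 256):
            print(f"L={L:2d}  procs={n_procs:3d}  total sites={total_sites(L, n_procs)}")
```

Because the work per processor is held constant, any drop in the measured rate as processors are added reflects communication and memory contention rather than a shrinking local problem.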
For the Compaq nodes, we have been able to run a number of tests at the
Pittsburgh Supercomputer Center (PSC).  We first present ES40 and ES45 results
on up to 4 CPUs, i.e., within the 4-way SMP node.  For the ES40, we have
results with many processors using the Quadrics network.
We see that the combination of going from our site major to field major code
and upgrading the processor from 667 MHz EV67 to 1000 MHz EV68 results in
a very substantial speedup. For the largest values of L, we may be seeing
total system memory bandwidth playing an important limiting role as the
ratio of 4-way SMP to single CPU speeds decreases steadily from L=8 to 14
on the ES45. Scaling is better on the ES40 nodes, but it is well known
that it is easier to get good scaling on codes that run slowly on a single
CPU.
Comparison of four CPU benchmarks on Compaq ES40 and ES45 nodes. On the
ES40, we use the site major MILC code. On the ES45, we use the new field major
improvement with prefetching.
| L | ES40 (MF) | ES45 (MF) | ES45/ ES40 |
| 4 | 159 | 525 | 3.30 |
| 6 | 368 | 667 | 1.81 |
| 8 | 419 | 621 | 1.48 |
| 10 | 330 | 601 | 1.82 |
| 12 | 225 | 491 | 2.18 |
| 14 | 160 | 345 | 2.15 |
Our large scale benchmarks of the MILC code on the TCS prototype at PSC were done before we completed work on the field major inverters. We therefore do not have a comparable set of benchmarks for the field major code. We present the results for the site major code and refer the reader to the previous table to estimate the improvement that may be expected.
| L | 1 CPU | 2 CPUs | 4 CPUs | 8 CPUs | 16 CPUs | 32 CPUs | 64 CPUs | 128 CPUs | 256 CPUs |
| 4 | 560 | 334 | 160 | 123 | 101 | 97 | 92 | 89 | 74 |
| 6 | 517 | 444 | 368 | 317 | 280 | 268 | 258 | 236 | 173 |
| 8 | 495 | 464 | 426 | 386 | 361 | 351 | 335 | 302 | 262 |
| 10 | 393 | 384 | 342 | 308 | 303 | 296 | 268 | 247 | 221 |
| 12 | 293 | 266 | 225 | 206 | 199 | 198 | 186 | 176 | 153 |
| 14 | 240 | 203 | 160 | 153 | 149 | 148 | 145 | 139 | 131 |
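One way to read the table above is as a scaling efficiency relative to a single CPU. The sketch below assumes the entries are per-processor rates, which is consistent with the fixed L^4 sites per processor but is not stated explicitly alongside this table; the names `RATES` and `efficiency` are hypothetical.

```python
# Processor counts and rates copied from the table above.
CPUS = [1, 2, 4, 8, 16, 32, 64, 128, 256]
RATES = {
    4: [560, 334, 160, 123, 101, 97, 92, 89, 74],
    6: [517, 444, 368, 317, 280, 268, 258, 236, 173],
    8: [495, 464, 426, 386, 361, 351, 335, 302, 262],
    10: [393, 384, 342, 308, 303, 296, 268, 247, 221],
    12: [293, 266, 225, 206, 199, 198, 186, 176, 153],
    14: [240, 203, 160, 153, 149, 148, 145, 139, 131],
}


def efficiency(L: int) -> list[float]:
    """Rate at each processor count divided by the single-CPU rate.

    Assumes the table entries are per-processor rates.
    """
    base = RATES[L][0]
    return [rate / base for rate in RATES[L]]


if __name__ == "__main__":
    for L in sorted(RATES):
        row = "  ".join(f"{e:4.2f}" for e in efficiency(L))
        print(f"L={L:2d}: {row}")
```

Under that assumption, L=8 retains roughly half of its single-CPU rate at 256 CPUs, while L=4 has already dropped below one third at 4 CPUs.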
For the IBM SP we have results on up to 256 CPUs with both site major and field major inverters.  These were obtained on the Indiana University SP using 4-way SMP nodes.  The field major code results in a substantial increase in speed for L > 8, but for reasons not yet understood, it underperforms the site major code for small L and larger numbers of CPUs.
| L | 1 CPU | 2 CPUs | 4 CPUs | 8 CPUs | 16 CPUs | 32 CPUs | 64 CPUs | 128 CPUs | 256 CPUs |
| 4 | 432 | 305 | 280 | 110 | 84 | 78 | 72 | 64 | 54 |
| 6 | 438 | 382 | 369 | 252 | 208 | 196 | 183 | 171 | 151 |
| 8 | 375 | 342 | 340 | 276 | 239 | 231 | 223 | 204 | 181 |
| 10 | 235 | 208 | 157 | 138 | 127 | 125 | 122 | 115 | 108 |
| 12 | 153 | 128 | 81 | 77 | 73 | 73 | 72 | 69 | 67 |
| 14 | 133 | 112 | 67 | 65 | 63 | 63 | 62 | 61 | 59 |
| L | 1 CPU | 2 CPUs | 4 CPUs | 8 CPUs | 16 CPUs | 32 CPUs | 64 CPUs | 128 CPUs | 256 CPUs |
| 4 | 588 | 353 | 319 | 98 | 68 | 62 | 49 | 52 | 41 |
| 6 | 631 | 515 | 484 | 245 | 170 | 160 | 133 | 132 | 104 |
| 8 | 624 | 548 | 529 | 316 | 230 | 218 | 210 | 176 | 140 |
| 10 | 579 | 503 | 471 | 292 | 224 | 212 | 203 | 184 | 159 |
| 12 | 478 | 386 | 266 | 192 | 159 | 154 | 148 | 139 | 127 |
| 14 | 420 | 293 | 174 | 148 | 131 | 128 | 124 | 117 | 109 |
Some results are available on an Itanium cluster at NCSA.  The nodes run at 733 MHz and the network is Myrinet.  This is an early version of the Titan system at NCSA currently being installed.  We thank Avneesh Pant of NCSA for doing these benchmarks.  We can compare results using one or two CPUs per node.
| CPUs | nodes | L=4 | L=6 | L=8 | L=10 |
| 1 | 1 | 1067 | 927 | 503 | 387 |
| 2 | 2 | 280 | 533 | 420 | 339 |
| 2 | 1 | 454 | 636 | 331 | 255 |
| 4 | 4 | 235 | 497 | 409 | 335 |
| 4 | 2 | 263 | 513 | 304 | 248 |
| 8 | 8 | 223 | 353 | 285 | 280 |
| 8 | 4 | 211 | 315 | 219 | 194 |
| 16 | 16 | 118 | 292 | 293 | 281 |
| 16 | 8 | 90 | 247 | 224 | 212 |
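The one-versus-two CPUs per node comparison can be made explicit by dividing, at each total CPU count, the rate with two CPUs per node by the rate with one CPU per node. The sketch below uses the numbers from the table above; the names `RATES` and `two_vs_one_per_node` are hypothetical, and the interpretation of each entry as a per-processor rate is an assumption.

```python
# Columns of the table above, keyed by (total CPUs, nodes).
L_VALUES = [4, 6, 8, 10]
RATES = {
    (1, 1): [1067, 927, 503, 387],
    (2, 2): [280, 533, 420, 339],
    (2, 1): [454, 636, 331, 255],
    (4, 4): [235, 497, 409, 335],
    (4, 2): [263, 513, 304, 248],
    (8, 8): [223, 353, 285, 280],
    (8, 4): [211, 315, 219, 194],
    (16, 16): [118, 292, 293, 281],
    (16, 8): [90, 247, 224, 212],
}


def two_vs_one_per_node(cpus: int) -> list[float]:
    """Ratio of the two-CPUs-per-node rate to the one-CPU-per-node rate."""
    one_per_node = RATES[(cpus, cpus)]
    two_per_node = RATES[(cpus, cpus // 2)]
    return [two / one for two, one in zip(two_per_node, one_per_node)]


if __name__ == "__main__":
    for cpus in (2, 4, 8, 16):
        ratios = two_vs_one_per_node(cpus)
        row = "  ".join(f"L={L}: {r:4.2f}" for L, r in zip(L_VALUES, ratios))
        print(f"{cpus:2d} CPUs  {row}")
```

A ratio below 1 means that sharing a node between two CPUs costs performance at that lattice size; a ratio above 1 means the shared-node run was faster.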
It is also worthwhile to present some benchmarks from the Platinum cluster at NCSA.  This cluster consists of 512 IBM eServer x330 thin servers with dual 1 GHz Intel Pentium III processors connected by Myrinet.  For the Pentium III, we have not found a substantial speed increase from the field major code.  In fact, the performance is often poorer, especially for large numbers of CPUs.  Results are also available for using only one CPU per node, but are not presented here.
| L | 1 CPU | 2 CPUs | 4 CPUs | 8 CPUs | 16 CPUs | 32 CPUs | 64 CPUs | 128 CPUs |
| 4 | 433 | 288 | 224 | 118 | 96 | 87 | 74 | 70 |
| 6 | 150 | 106 | 94 | 85 | 77 | 75 | 76 | 74 |
| 8 | 139 | 96 | 94 | 86 | 82 | 80 | 78 | 75 |
| 10 | 132 | 93 | 91 | 85 | 81 | 81 | 81 | 79 |
| 12 | 132 | 90 | 90 | 78 | 74 | 73 | 74 | 79 |
| 14 | 131 | 90 | 91 | 73 | 66 | 71 | 71 | 80 |
| L | 1 CPU | 2 CPUs | 4 CPUs | 8 CPUs | 16 CPUs | 32 CPUs | 64 CPUs | 128 CPUs |
| 4 | 430 | 285 | 195 | 87 | 63 | 61 | 57 | 53 |
| 6 | 167 | 111 | 104 | 79 | 69 | 69 | 68 | 66 |
| 8 | 159 | 109 | 107 | 85 | 76 | 76 | 74 | 71 |
| 10 | 153 | 107 | 104 | 80 | 74 | 73 | 72 | 77 |
| 12 | 150 | 105 | 103 | 79 | 73 | 72 | 69 | 71 |
| 14 | 148 | 104 | 102 | 82 | 77 | 72 | 70 | 68 |