Steven Gottlieb, Indiana University
ABSTRACT
During the past year, we have benchmarked and tuned the MILC code on a number of architectures, including Intel Itanium and Pentium IV (PIV), dual-CPU Athlon, and the latest Compaq Alpha nodes (ES45). Results will be presented for many of these, and we shall discuss some simple code changes that can result in a very dramatic speedup of the KS conjugate gradient on processors with more advanced memory systems, such as the PIV, IBM SP and Alpha.
Introduction
Benchmarks presented here are for the Conjugate Gradient algorithm with Kogut-Susskind quarks. They are not just for dslash. They are done within the context of a complete application for creation of gauge fields using the R-algorithm. The application uses even-odd checkerboarding, which reduces possible reuse of data in cache. Even the single CPU benchmarks are done with a fully parallel application that splits the computation within dslash into two stages to accommodate the need to wait for boundary values that might come from another node. This also reduces potential cache reuse. On some of the architectures, we make use of assembly code for basic SU3 arithmetic routines or for prefetching data to cache.
We use Kogut-Susskind quarks for benchmarking because those are the quarks used in our dynamical quark calculations. KS quarks are more demanding than Wilson quarks in terms of memory bandwidth. For example, in the case of single precision matrix-vector multiplication, M × V:
| M | 18 reals = 72 bytes |
| V | 6 reals = 24 bytes |
| real multiplies | 36 |
| real adds | 30 |
| flops | 66 |
| bytes input per flop | 1.45 |
| bytes output per flop | 0.36 |
For Wilson quarks, we deal with two Dirac components at a time, so there are 120 = 72+2*24 bytes of input, and 48 bytes of output. The number of flops is doubled, so there are only 0.91 bytes of input per flop. The ratio for output is unchanged.
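As a rough check of this arithmetic, the following C sketch recomputes the bytes-per-flop ratios for one SU(3) matrix applied to one vector (the KS case) or to two vectors (the Wilson case). The constant and function names are illustrative assumptions and do not come from the MILC code.

```c
#include <assert.h>

/* Illustrative constants; the names are hypothetical, not MILC's. */
enum {
    REAL_BYTES   = 4,   /* single precision */
    SU3_REALS    = 18,  /* 3x3 complex matrix */
    VEC_REALS    = 6,   /* complex 3-vector */
    MATVEC_FLOPS = 66   /* 36 real multiplies + 30 real adds */
};

/*
 * Bytes moved per flop when one SU(3) matrix multiplies n_vec vectors.
 * count_output == 0: input traffic (matrix plus source vectors).
 * count_output != 0: output traffic (result vectors only).
 */
static double bytes_per_flop(int n_vec, int count_output)
{
    assert(n_vec > 0);
    int matrix_bytes = SU3_REALS * REAL_BYTES;          /* 72 */
    int vector_bytes = n_vec * VEC_REALS * REAL_BYTES;  /* 24 per vector */
    int flops = n_vec * MATVEC_FLOPS;
    int bytes = count_output ? vector_bytes
                             : matrix_bytes + vector_bytes;
    return (double)bytes / flops;
}
```

With these definitions, `bytes_per_flop(1, 0)` is 96/66 for KS quarks and `bytes_per_flop(2, 0)` is 120/132 for Wilson quarks, while the output ratio is 24/66 in both cases.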
The Architectures
Since August 2000, MILC has been working with Intel and NCSA under a non-disclosure agreement to tune our code for the Itanium processor (first IA64 chip). In December 2000, we were allowed to report first results without assembly code. Some limited results with assembly code were reported at Linux World last January. We may now talk more freely about results on Itanium.
MILC has had several months of production running on the initial Terascale Computing System (TCS) at the Pittsburgh Supercomputing Center (PSC). It is based on Compaq ES40 nodes that contain 667 MHz EV67 Alpha chips. The full 6 TF computer will be based on 1000 MHz EV68 chips. At the end of March, we were given access to the first ES45 node at PSC that contains that chip. We have been given permission to report results at Lattice 2001.
In March and April, NCSA provided access to a 1.5 GHz Pentium IV system for benchmarking our code. Future testing will be done at Fermilab.
In mid-June, Penguin Computing made a dual-Athlon system available to us for the tests reported on that architecture. This computer had 1.2 GHz Athlon MP processors.
We also report some results on the IBM SP. These have been run on either the IU SP or Blue Horizon at the San Diego Supercomputer Center. They have 4-way and 8-way SMP nodes with 375 MHz Power 3 chips.
Code Changes
The work on the Itanium processor was carried out in conjunction with two Intel engineers, Gautam Doshi and Brian Nickerson. Doshi worked on in-lining and optimizing compiler flags for the C code. Nickerson wrote assembly code that includes prefetching and loop control for looping over sites. Doshi also tested and benchmarked that code.
The MILC code data structure is "site major." That is, there is a structure for each site that contains all the physical variables for that site, and the lattice is an array of site structures. Adding variables to the application is quite easy: one only needs to modify the site structure in lattice.h, and when the lattice is malloc'ed, the new variables will be globally accessible. This spring, we tested performance enhancements from temporary allocations of "field major" variables for the conjugate gradient routines. On chips with wider cache lines, this results in substantial speedups. The gauge fields and necessary vectors are copied to temporary variables that are much better localized in memory. If a cache line contains data not needed for the current site, it is most likely the data required for the next site to be computed, rather than a different physical variable, as would be found in the next bytes of the site structure.
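The difference between the two layouts can be sketched in C as follows. This is a simplified illustration, not the actual contents of lattice.h: the type names `site_t` and `cg_fields_t`, the helper `gather_cg_fields`, and the padding member `other` are all hypothetical stand-ins.

```c
#include <assert.h>
#include <stdlib.h>

typedef struct { float re, im; } complex_f;
typedef struct { complex_f e[3][3]; } su3_matrix_f;  /* 72 bytes */
typedef struct { complex_f c[3]; } su3_vector_f;     /* 24 bytes */

/* Site major: every variable of one site is contiguous in memory. */
typedef struct {
    su3_matrix_f link[4];  /* gauge links in the 4 directions */
    su3_vector_f src;      /* a vector used by the CG */
    double other[32];      /* stand-in for unrelated per-site data */
} site_t;

/* Field major: one contiguous array per physical variable. */
typedef struct {
    size_t nsites;
    su3_matrix_f *link;    /* 4 * nsites matrices, 4 per site */
    su3_vector_f *src;     /* nsites vectors */
} cg_fields_t;

/* Copy what the CG needs into temporary field-major arrays.
 * Returns 0 on success, -1 if allocation fails. */
static int gather_cg_fields(const site_t *lattice, size_t nsites,
                            cg_fields_t *f)
{
    assert(lattice != NULL && f != NULL && nsites > 0);
    f->nsites = nsites;
    f->link = malloc(4 * nsites * sizeof *f->link);
    f->src  = malloc(nsites * sizeof *f->src);
    if (f->link == NULL || f->src == NULL) {
        free(f->link);
        free(f->src);
        f->link = NULL;
        f->src = NULL;
        return -1;
    }
    for (size_t i = 0; i < nsites; i++) {
        for (int dir = 0; dir < 4; dir++)
            f->link[4 * i + dir] = lattice[i].link[dir];
        f->src[i] = lattice[i].src;
    }
    return 0;
}

/* Release the temporary arrays once the CG has finished. */
static void free_cg_fields(cg_fields_t *f)
{
    free(f->link);
    free(f->src);
    f->link = NULL;
    f->src = NULL;
}
```

In the site-major array, consecutive `src` vectors are separated by a full `sizeof(site_t)`, so a cache line fetched for one site mostly carries other variables. In the field-major arrays, the bytes after `src[i]` are `src[i + 1]`.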
During the cluster workshop held at Fermilab in March, I suggested these changes to Dick Foster, of Compaq, who implemented them. He also combined basic MILC matrix and vector prefetch routines into three new routines tailored for the major loops and removed prefetches from MILC Alpha SU3 routine assembler code. I implemented similar changes for the fat-Naik version of the conjugate gradient. These changes have not yet been tried in other parts of the code.
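The idea of prefetching inside a loop over field-major data can be illustrated with a minimal C sketch. This is not the routine written at Compaq or the MILC assembler code. It assumes a GCC-compatible compiler for `__builtin_prefetch`, a prefetch distance of one site, and hypothetical type and function names.

```c
#include <assert.h>
#include <stddef.h>

typedef struct { float re, im; } complex_f;
typedef struct { complex_f e[3][3]; } su3_matrix_f;
typedef struct { complex_f c[3]; } su3_vector_f;

/* out = m * v for one site (complex 3x3 matrix times complex 3-vector). */
static void mat_vec(const su3_matrix_f *m, const su3_vector_f *v,
                    su3_vector_f *out)
{
    for (int i = 0; i < 3; i++) {
        float re = 0.0f, im = 0.0f;
        for (int j = 0; j < 3; j++) {
            re += m->e[i][j].re * v->c[j].re - m->e[i][j].im * v->c[j].im;
            im += m->e[i][j].re * v->c[j].im + m->e[i][j].im * v->c[j].re;
        }
        out->c[i].re = re;
        out->c[i].im = im;
    }
}

/* Loop over field-major arrays, requesting the next site's data
 * before working on the current site. */
static void mat_vec_field(const su3_matrix_f *m, const su3_vector_f *v,
                          su3_vector_f *out, size_t nsites)
{
    assert(m != NULL && v != NULL && out != NULL);
    for (size_t i = 0; i < nsites; i++) {
#if defined(__GNUC__)
        if (i + 1 < nsites) {
            __builtin_prefetch(&m[i + 1]);  /* next matrix */
            __builtin_prefetch(&v[i + 1]);  /* next source vector */
        }
#endif
        mat_vec(&m[i], &v[i], &out[i]);
    }
}
```

Because the arrays are contiguous, the prefetched lines hold exactly the data the next iteration uses. The best prefetch distance depends on the processor and would have to be tuned by measurement.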
Single Node Results
The benchmarks presented here were run on lattices of size L^4. They are all for single precision gauge links and vectors, with dot products accumulated in double precision. The fermion matrix is either for the Kogut-Susskind (KS) or fat-link plus Naik (fat-Naik) action. For production runs, we are using the "Asqtad" action that is correct to order a^2. (The performance of the inverter is independent of the details of the fattening.) We present the results in performance order, starting with the faster processors. Which of Itanium and Alpha gives the best single node performance with the currently available codes depends on L. We shall discuss Itanium first.
<-- Itanium -->
The results in the next table are on Itanium processors with various speeds and memory systems.  The code used for these benchmarks does not include any floating point assembler instructions, but prefetching commands have been inserted to enhance performance.  The first line of the table heading indicates the processor speed.  The second line refers to the speed of the memory bus and whether it is "single pumped" or "double pumped".  The third line indicates the amount of CPU cache and total system memory. 
| | 600 MHz | *667 MHz | 800 MHz |
| | 100x1 | 133x2 | 133x2 |
| | 2MB/1GB | 4MB/4GB | 4MB/2GB |
| L | | | |
| 4 | | | |
| 6 | | | |
| 8 | | | |
| 10 | | | |
| 12 | | | |
| 14 | | | |
*It should be noted that the 667 MHz processor is a pre-production pilot. The two faster processors have a double-pumped front-side bus and also a larger cache. By looking at ratios of results we can see the effects of cache size, external memory speed and processor speed.
| | 667/600 | 800/600 | 800/667 |
| clock ratio | 1.11 | 1.33 | 1.20 |
| L | ratio | ratio | ratio |
| 4 | 1.10 | 1.32 | 1.20 |
| 6 | 1.12 | 1.34 | 1.19 |
| 8 | 2.04 | 2.52 | 1.24 |
| 10 | 2.17 | 2.52 | 1.16 |
| 12 | 1.76 | 1.92 | 1.09 |
| 14 | 1.69 | 1.83 | 1.08 |
For L=4 and 6, we see that performance increases in proportion
to the clock speed. 
However, for L=8 and 10, the larger caches of the 667
and 800 MHz processors give much better performance than the
600 MHz processor achieves.
For these sizes, the two faster processors perform
roughly in the ratio of their clocks. 
For L=12 and 14, access to external
memory is crucial and we see that the double-pumped memory bus
allows the 667 and 800
MHz processors to perform considerably better than the 600 MHz system.
Comparing the two faster processors, we see that there
is only an 8-9% speedup despite the 20% speedup in the clock.
Using Brian Nickerson's assembly code, the single CPU benchmarks are quite
impressive.
| L | |
| 4 | |
| 6 | |
| 8 | |
| 10 | |
| 12 | |
| 14 |
Unfortunately, we don't have results here for the field major code.  It would require some reworking of Brian Nickerson's assembly code to convert to field major variables.
<-- Alpha -->
Now let's turn to the Compaq Alpha on which the field major code was developed.  We see substantial improvements both from the new processor and the reworking of the code.
| L | ES40 old code (MF) | ES45 old code (MF) | ES45 new code (MF) |
| 6 | 517 | 731 | 977 |
| 8 | 495 | 701 | 843 |
| 10 | 395 | 548 | 934 |
| 12 | 249 | 395 | 778 |
| 14 | 253 | 347 | 609 |
<-- IBM SP -->
On the IBM SP, we compare the two codes on the same speed chip and calculate the increase in performance. The site major code showed a substantial drop-off in performance for large values of L. This is greatly improved in the field major code, where L=14 is now faster than L=8 was before.
| L | old code (MF) | new code (MF) | speedup new/old |
| 4 | 512 | 663 | 1.29 |
| 6 | 458 | 705 | 1.54 |
| 8 | 391 | 682 | 1.74 |
| 10 | 215 | 557 | 2.58 |
| 12 | 158 | 528 | 3.35 |
| 14 | 135 | 449 | 3.32 |
<-- Pentium IV -->
The Pentium IV also shows excellent speedup on the field major code.  We expect to be able to test a dual-Pentium IV shortly after this conference.
| L | old code (MF) | new code (MF) | speedup new/old |
| 4 | 591 | 577 | 0.98 |
| 6 | 240 | 503 | 2.10 |
| 8 | 220 | 481 | 2.19 |
| 10 | 208 | 491 | 2.36 |
| 12 | 205 | 480 | 2.34 |
| 14 | 202 | 469 | 2.33 |
We were able to test a dual-Athlon MP system in mid-June.  We don't have any networked systems, so we present results for both single and dual CPU benchmarks here. 
| L | old code; 1 CPU (MF) | old code; 2 CPUs (MF/CPU) | new code; 1 CPU (MF) | new code; 2 CPUs (MF/CPU) |
| 4 | 590 | 464 | 654 | 457 |
| 6 | 203 | 167 | 336 | 251 |
| 8 | 176 | 142 | 298 | 232 |
| 10 | 170 | 134 | 289 | 228 |
| 12 | 165 | 132 | 287 | 239 |
| 14 | 166 | 133 | 281 | 218 |
| L | old code; 1 CPU (MF) | old code; 2 CPUs (MF/CPU) | new code; 1 CPU (MF) | new code; 2 CPUs (MF/CPU) |
| 4 | 468 | 222 | 547 | 330 |
| 6 | 208 | 166 | 314 | 241 |
| 8 | 189 | 154 | 304 | 235 |
| 10 | 183 | 148 | 300 | 248 |
| 12 | 175 | 150 | 300 | 248 |
| 14 | 175 | 149 | 304 | 241 |
Message Passing Speed
The program NetPIPE from the Ames Laboratory is useful for testing network bandwidth.  We present results from NetPIPE using several types of network hardware.  They include: FastEthernet, Myrinet, Quadrics, the IBM SP switch and Wulfkit (or Scali).  For some SMP nodes, we have tested both intra-node and inter-node communication.  For clarity we only show inter-node results here. We also plot the bandwidth required to overlap communication and floating point during the crucial stages of dslash.  The model is very simple.  There is no consideration of processor overhead from passing the message.  The model uses the achieved speed for the processor on matrix times vector operations, but this value can vary with the problem size due to cache effects.  The model result is plotted for a few assumed processor speeds.
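The poster does not spell out the formula behind this model, so the following C sketch is only one plausible reading of it. It assumes that, for one direction of dslash on an L^4 local lattice, a node must receive one single precision 3-vector (24 bytes) per site on an L^3 face in the time it takes to perform one 66-flop matrix-vector multiply per site in the volume. Even-odd checkerboarding halves both counts and leaves the ratio unchanged. The function name is hypothetical.

```c
#include <assert.h>

/*
 * Bandwidth (MB/s, with 1 MB = 10^6 bytes) needed to fully overlap one
 * face transfer with the matrix-vector work on the local volume.
 * mflops: achieved matrix-times-vector speed in Mflop/s (an input
 *         assumption; it varies with L because of cache effects).
 * L:      local lattice extent, so the node holds L^4 sites.
 */
static double required_bandwidth_mb_s(double mflops, int L)
{
    assert(L > 0 && mflops > 0.0);
    double face_bytes   = 24.0 * L * L * L;      /* one vector per face site */
    double volume_flops = 66.0 * L * L * L * L;  /* one mat-vec per site */
    double compute_us   = volume_flops / mflops; /* Mflop/s == flop/us */
    return face_bytes / compute_us;              /* bytes/us == MB/s */
}
```

Under these assumptions the requirement is (24/66) × (Mflop/s) / L, so it grows with processor speed and falls as 1/L. As in the model described above, processor overhead for passing the message is ignored.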
Multinode Benchmarks
We have multinode benchmarks for the code on a number of architectures. 
For each case, the problem size is scaled with the number of processors
so that we always have L^4 sites per processor.
For the Compaq nodes, we have been able to run a number of tests at the
Pittsburgh Supercomputing Center (PSC).  We first present ES40 and ES45 results
on up to 4 CPUs, i.e., within the 4-way SMP node.  For the ES40, we have
results with many processors using the Quadrics network.
We see that the combination of going from our site major to field major code
and upgrading the processor from 667 MHz EV67 to 1000 MHz EV68 results in
a very substantial speedup. For the largest values of L, we may be seeing
total system memory bandwidth playing an important limiting role as the
ratio of 4-way SMP to single CPU speeds decreases steadily from L=8 to 14
on the ES45. Scaling is better on the ES40 nodes, but it is well known
that it is easier to get good scaling on codes that run slowly on a single
CPU.
Comparison of four CPU benchmarks on Compaq ES40 and ES45 nodes. On the
ES40, we use the site major MILC code. On the ES45, we use the new field major
improvement with prefetching.
| L | ES45/ES40 |
| 4 | 3.30 |
| 6 | 1.81 |
| 8 | 1.48 |
| 10 | 1.82 |
| 12 | 2.18 |
| 14 | 2.15 |
Our large scale benchmarks of the MILC code on the TCS prototype at PSC were done before we completed work on the field major inverters. It is unfortunate that we do not have a similar set of benchmarks for that code, but we will present the results for the site major code and remind the reader to look at the previous table to estimate what type of improvement will be seen in the future.
| L | 1 CPU | 2 CPUs | 4 CPUs | 8 CPUs | 16 CPUs | 32 CPUs | 64 CPUs | 128 CPUs | 256 CPUs |
| 4 | 560 | 334 | 160 | 123 | 101 | 97 | 92 | 89 | 74 |
| 6 | 517 | 444 | 368 | 317 | 280 | 268 | 258 | 236 | 173 |
| 8 | 495 | 464 | 426 | 386 | 361 | 351 | 335 | 302 | 262 |
| 10 | 393 | 384 | 342 | 308 | 303 | 296 | 268 | 247 | 221 |
| 12 | 293 | 266 | 225 | 206 | 199 | 198 | 186 | 176 | 153 |
| 14 | 240 | 203 | 160 | 153 | 149 | 148 | 145 | 139 | 131 |
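Since the problem size is scaled with the number of processors, a convenient summary of such a table is the weak-scaling efficiency: the rate on N CPUs divided by the single-CPU rate. The C sketch below computes it for one row. It assumes the table entries are rates per CPU, which the table itself does not state, and the function name is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Weak-scaling efficiency for one row of a benchmark table.
 * rate[0] is the single-CPU rate; rate[i] is the per-CPU rate at the
 * i-th CPU count. Writes eff[i] = rate[i] / rate[0].
 */
static void row_efficiency(const double *rate, size_t n, double *eff)
{
    assert(rate != NULL && eff != NULL && n > 0 && rate[0] > 0.0);
    for (size_t i = 0; i < n; i++)
        eff[i] = rate[i] / rate[0];
}
```

For example, the L=8 row gives 262/495 at 256 CPUs, while the L=4 row gives 74/560, which shows how much more the small local volume suffers from communication.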
For the IBM SP we have results on up to 256 CPUs with both site major and field major inverters.  These were obtained on the Indiana University SP using 4-way SMP nodes.  The field major code results in a substantial increase in speed for L > 8, but for reasons not yet understood, it underperforms the site major code for small L and larger numbers of CPUs.
| L | 1 CPU | 2 CPUs | 4 CPUs | 8 CPUs | 16 CPUs | 32 CPUs | 64 CPUs | 128 CPUs | 256 CPUs |
| 4 | 432 | 305 | 280 | 110 | 84 | 78 | 72 | 64 | 54 |
| 6 | 438 | 382 | 369 | 252 | 208 | 196 | 183 | 171 | 151 |
| 8 | 375 | 342 | 340 | 276 | 239 | 231 | 223 | 204 | 181 |
| 10 | 235 | 208 | 157 | 138 | 127 | 125 | 122 | 115 | 108 |
| 12 | 153 | 128 | 81 | 77 | 73 | 73 | 72 | 69 | 67 |
| 14 | 133 | 112 | 67 | 65 | 63 | 63 | 62 | 61 | 59 |
| L | 1 CPU | 2 CPUs | 4 CPUs | 8 CPUs | 16 CPUs | 32 CPUs | 64 CPUs | 128 CPUs | 256 CPUs |
| 4 | 588 | 353 | 319 | 98 | 68 | 62 | 49 | 52 | 41 |
| 6 | 631 | 515 | 484 | 245 | 170 | 160 | 133 | 132 | 104 |
| 8 | 624 | 548 | 529 | 316 | 230 | 218 | 210 | 176 | 140 |
| 10 | 579 | 503 | 471 | 292 | 224 | 212 | 203 | 184 | 159 |
| 12 | 478 | 386 | 266 | 192 | 159 | 154 | 148 | 139 | 127 |
| 14 | 420 | 293 | 174 | 148 | 131 | 128 | 124 | 117 | 109 |
Some results are available on an Itanium cluster at NCSA.  The nodes run at 733 MHz and the network is Myrinet.  This is an early version of the Titan system at NCSA currently being installed.  We thank Avneesh Pant of NCSA for doing these benchmarks.  We can compare results using one or two CPUs per node.
| CPUs | nodes | L=4 | L=6 | L=8 | L=10 |
| 1 | 1 | 1067 | 927 | 503 | 387 |
| 2 | 2 | 280 | 533 | 420 | 339 |
| 2 | 1 | 454 | 636 | 331 | 255 |
| 4 | 4 | 235 | 497 | 409 | 335 |
| 4 | 2 | 263 | 513 | 304 | 248 |
| 8 | 8 | 223 | 353 | 285 | 280 |
| 8 | 4 | 211 | 315 | 219 | 194 |
| 16 | 16 | 118 | 292 | 293 | 281 |
| 16 | 8 | 90 | 247 | 224 | 212 |
It is also worthwhile to present some benchmarks from the Platinum cluster at NCSA.  This cluster consists of 512 IBM eServer x330 thin servers with dual 1 GHz Intel Pentium III processors connected by Myrinet.  For the Pentium III, we have not found a substantial speed increase from the field major code.  In fact, the performance is often poorer, especially for large numbers of CPUs.  Results are also available for using only one CPU per node, but are not presented here.
| L | 1 CPU | 2 CPUs | 4 CPUs | 8 CPUs | 16 CPUs | 32 CPUs | 64 CPUs | 128 CPUs |
| 4 | 433 | 288 | 224 | 118 | 96 | 87 | 74 | 70 |
| 6 | 150 | 106 | 94 | 85 | 77 | 75 | 76 | 74 |
| 8 | 139 | 96 | 94 | 86 | 82 | 80 | 78 | 75 |
| 10 | 132 | 93 | 91 | 85 | 81 | 81 | 81 | 79 |
| 12 | 132 | 90 | 90 | 78 | 74 | 73 | 74 | 79 |
| 14 | 131 | 90 | 91 | 73 | 66 | 71 | 71 | 80 |
| L | 1 CPU | 2 CPUs | 4 CPUs | 8 CPUs | 16 CPUs | 32 CPUs | 64 CPUs | 128 CPUs |
| 4 | 430 | 285 | 195 | 87 | 63 | 61 | 57 | 53 |
| 6 | 167 | 111 | 104 | 79 | 69 | 69 | 68 | 66 |
| 8 | 159 | 109 | 107 | 85 | 76 | 76 | 74 | 71 |
| 10 | 153 | 107 | 104 | 80 | 74 | 73 | 72 | 77 |
| 12 | 150 | 105 | 103 | 79 | 73 | 72 | 69 | 71 |
| 14 | 148 | 104 | 102 | 82 | 77 | 72 | 70 | 68 |
The last cluster we would like to consider is one that uses Wulfkit or Scali for the interconnect.  We have seen that this interconnect has comparable performance to Myrinet on the NetPIPE benchmark.  Unfortunately, the cluster to which we were given access has older nodes that do not have the same capability as the newer systems.  Each node has two 450 MHz Pentium III processors.
| L | 1 CPU | 2 CPUs | 4 CPUs | 8 CPUs | 16 CPUs | 32 CPUs |
| 4 | 142 | 107 | 86 | 68 | 55 | 48 |
| 6 | 142 | 117 | 76 | 59 | 55 | 51 |
| 8 | 147 | 63 | 63 | 58 | 52 | 52 |
| 10 | 147 | 61 | 60 | 57 | 53 | 53 |
| 12 | 142 | 59 | 59 | 56 | 55 | 54 |
| 14 | 72 | 59 | 59 | 57 | 55 | 55 |
Future Work
There are several additional opportunities to improve code performance. 
For the improved
action codes, the Conjugate Gradient inverter no longer dominates the time
to the extent it used to. 
The techniques used for the CG must be tried on other sections
of the code. 
We must also apply these techniques to our applications that use
Wilson or Clover quarks. 
Finally, with increased single node performance,
we must work on the message passing parts of the code to reduce the loss
of performance there. 
Within the context of the Department of Energy SciDAC Lattice Gauge Theory
project, several MILC members and new postdocs will be working on
code development.
The field major enhancements discussed here will be included in the next
release of the MILC code.
Acknowledgements
Many people outside the MILC collaboration have helped to make this work
possible. 
I am grateful to all of them.
At Intel, they include Gautam Doshi, Robert Fogel and Brian
Nickerson. 
At Compaq, Dick Foster has played an important role.
Keith Murphy at Dolphin Interconnect Solutions arranged for access to
a computer using Scali, and David Garret of Lockheed Martin provided
access and explanations. 
At NCSA, Rob Pennington, Avneesh Pant and Dave McWilliams
have provided early access, advice and support. 
At PSC, R. Reddy, Sergiu Sanielevici, Michael Levine and Ralph Roskies
have done the same. 
At IU, we particularly thank Mary Papakhian. 
Rick Rodriguez of Penguin Computing and Andrew Kretzer of Bold Data are
thanked for access to dual-Athlon systems.
Finally, I would like to thank the Theory Group at Fermilab where this
poster was prepared.