Benchmarking and Tuning the MILC Code on Clusters and Supercomputers
 
 

Steven Gottlieb, Indiana University
MILC Collaboration
  Poster available at: http://physics.indiana.edu/~sg/lattice01/
 

ABSTRACT 

During the past year, we have benchmarked and tuned the MILC code on a number of architectures including Intel Itanium and Pentium IV (PIV), dual-CPU Athlon, and the latest Compaq Alpha nodes (ES45).  Results will be presented for many of these, and we shall discuss some simple code changes that can produce a dramatic speedup of the KS conjugate gradient on processors with more advanced memory systems, such as the PIV, IBM SP and Alpha.

Introduction

Benchmarks presented here are for the Conjugate Gradient algorithm with Kogut-Susskind quarks.  They are not just for dslash.  They are done within the context of a complete application for creation of gauge fields using the R-algorithm.  The application uses even-odd checkerboarding, which reduces the possible reuse of data in cache.  Even the single CPU benchmarks are done with a fully parallel application that splits the computation within dslash into two stages, to accommodate the need to wait for boundary values that might come from another node.  This also reduces potential cache reuse.
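Schematically, the two-stage pattern looks like the sketch below, which uses generic non-blocking MPI calls in place of the MILC gather routines; the buffer size and the work in each stage are placeholders, so it shows only the shape of the computation, not the actual dslash.

#include <mpi.h>

#define FACE_REALS (6 * 8 * 8 * 8)   /* e.g. one face of su3_vectors for L = 8 */

/* Stage 0: start the boundary exchange; stage 1: do the work that needs only
   on-node data; then wait for the messages and finish the boundary sites. */
void two_stage_dslash_sketch(float *send_face, float *recv_face,
                             int fwd_rank, int back_rank)
{
    MPI_Request req[2];

    MPI_Irecv(recv_face, FACE_REALS, MPI_FLOAT, fwd_rank, 0,
              MPI_COMM_WORLD, &req[0]);
    MPI_Isend(send_face, FACE_REALS, MPI_FLOAT, back_rank, 0,
              MPI_COMM_WORLD, &req[1]);

    /* first pass over sites: contributions from on-node neighbors only */

    MPI_Waitall(2, req, MPI_STATUSES_IGNORE);

    /* second pass over sites: contributions that use the gathered
       off-node vectors */
}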

On some of the architectures, we make use of assembly code for basic SU3 arithmetic routines or for prefetching data to cache.

We use Kogut-Susskind quarks for benchmarking because those are the quarks used in our dynamical quark calculations.  KS quarks are more demanding than Wilson quarks in terms of memory bandwidth.  For example, in the case of matrix-vector multiplication:

M × V:

    M = 18 reals = 72 bytes
    V =  6 reals = 24 bytes
    Total input  = 96 bytes
    Total output = 24 bytes

    36 real multiplies + 30 real adds = 66 flops

    Input:  96 bytes / 66 flops ≈ 1.45 bytes/flop
    Output: 24 bytes / 66 flops ≈ 0.36 bytes/flop

For Wilson quarks, we deal with two Dirac components at a time, so there are 120 = 72+2*24 bytes of input, and 48 bytes of output.  The number of flops is doubled, so there are only 0.91 bytes of input per flop.  The ratio for output is unchanged.
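For reference, a minimal C sketch of the matrix-vector kernel behind this counting is shown below; the type and routine names are illustrative rather than the MILC library versions.

typedef struct { float real, imag; } complex_f;
typedef struct { complex_f e[3][3]; } su3_matrix;   /* 18 reals = 72 bytes */
typedef struct { complex_f c[3];    } su3_vector;   /*  6 reals = 24 bytes */

/* c = A * b for a complex 3x3 matrix and 3-vector.  Counting the standard way
   (9 complex multiplies = 36 real multiplies + 18 real adds, plus 6 complex
   adds = 12 real adds) gives the 66 flops quoted above; the zero-initialized
   accumulators below add a few trivial extra additions. */
void mult_su3_mat_vec_sketch(const su3_matrix *a, const su3_vector *b,
                             su3_vector *c)
{
    for (int i = 0; i < 3; i++) {
        float re = 0.0f, im = 0.0f;
        for (int j = 0; j < 3; j++) {
            re += a->e[i][j].real * b->c[j].real
                - a->e[i][j].imag * b->c[j].imag;
            im += a->e[i][j].real * b->c[j].imag
                + a->e[i][j].imag * b->c[j].real;
        }
        c->c[i].real = re;
        c->c[i].imag = im;
    }
}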

The Architectures

Since August 2000, MILC has been working with Intel and NCSA under a non-disclosure agreement to tune our code for the Itanium processor (the first IA-64 chip).  In December 2000, we were allowed to report first results without assembly code.  Some limited results with assembly code were reported at Linux World last January.  We may now talk more freely about results on Itanium.

MILC has had several months of production running on the initial Terascale Computer System at the Pittsburgh Supercomputer Center.  It is based on Compaq ES40 nodes that contain 667 MHz EV67 Alpha chips.  The full 6 TF computer will be based on 1000 MHz EV68 chips.  At the end of March, we were given access to the first ES45 node at PSC that contains that chip.  We have been given permission to report results at Lattice 2001.

In March and April, NCSA provided access to a 1.5 GHz Pentium IV system for benchmarking our code.  Future testing will be done at Fermilab.

In mid-June, Penguin Computing made a dual-Athlon system available to us for the tests reported on that architecture.  This computer had 1.2 GHz Athlon MP processors.

We also report some results on the IBM SP.  These have been run on either the IU SP or Blue Horizon at the San Diego Supercomputer Center.  They have 4-way and 8-way SMP nodes with 375 MHz Power 3 chips.

Code Changes

The work on the Itanium processor was carried out in conjunction with two Intel engineers, Gautam Doshi and Brian Nickerson.  Doshi worked on in-lining and on optimizing compiler flags for the C code.  Nickerson wrote assembly code that includes prefetching and loop control for looping over sites.  Doshi also tested and benchmarked that code.

The MILC code data structure is "site major": there is a structure for each site that contains all of the physical variables for that site, and the lattice is an array of site structures.  Adding variables to the application is quite easy; one only needs to modify the site structure in lattice.h, and when the lattice is malloc'ed the new variables are globally accessible.  This spring, we tested performance enhancements from temporary allocations of "field major" variables for the conjugate gradient routines.  On chips with wider cache lines, this results in substantial speedups.  The gauge fields and necessary vectors are copied to temporary variables that are much better localized in memory.  If a cache line contains data not needed for the current site, it is most likely the data required for the next site to be computed, rather than a different physical variable, as would be found in the next bytes of the site structure.
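A minimal sketch of the two layouts, with hypothetical member names rather than the actual lattice.h definitions, is:

typedef struct { float real, imag; } complex_f;
typedef struct { complex_f e[3][3]; } su3_matrix;   /* 18 reals = 72 bytes */
typedef struct { complex_f c[3];    } su3_vector;   /*  6 reals = 24 bytes */

/* Site major: one struct per site holds every physical variable, so the bytes
   that follow a given variable in memory belong to other variables of the
   same site. */
typedef struct {
    su3_matrix link[4];     /* gauge links in the four directions */
    su3_vector cg_src;      /* CG source vector   */
    su3_vector cg_sol;      /* CG solution vector */
    /* ... many other per-site variables ... */
} site;
site *lattice;              /* lattice = malloc(volume * sizeof(site)); */

/* Field major: temporary contiguous arrays, one per variable, so the bytes
   that follow site i's copy of a variable hold site i+1's copy of the same
   variable -- exactly what a long cache line fetches next. */
su3_matrix *t_links;        /* 4 * volume matrices, copied from the site structs */
su3_vector *t_src, *t_sol;  /* volume vectors each */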

During the cluster workshop held at Fermilab in March, I suggested these changes to Dick Foster of Compaq, who implemented them.  He also combined the basic MILC matrix and vector prefetch routines into three new routines tailored for the major loops, and removed the prefetches from the MILC Alpha SU3 assembler routines.  I implemented similar changes for the fat-Naik version of the conjugate gradient.  These changes have not yet been tried in other parts of the code.
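The idea behind the prefetch routines can be sketched with the GCC __builtin_prefetch intrinsic standing in for the hand-written assembly; the loop below is illustrative, not one of the three tailored routines, and it reuses the types and the 66-flop kernel of the earlier sketch.

void mult_su3_mat_vec_sketch(const su3_matrix *a, const su3_vector *b,
                             su3_vector *c);        /* kernel sketched earlier */

void field_major_mat_vec(const su3_matrix *mat, const su3_vector *src,
                         su3_vector *dest, int n_sites)
{
    for (int i = 0; i < n_sites; i++) {
        if (i + 1 < n_sites) {
            /* request the next site's matrix and source vector while this
               site is being computed */
            __builtin_prefetch(&mat[i + 1], 0, 3);
            __builtin_prefetch(&src[i + 1], 0, 3);
        }
        mult_su3_mat_vec_sketch(&mat[i], &src[i], &dest[i]);
    }
}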

Single Node Results

The benchmarks presented here were run on lattices of size L^4.  They are all for single precision gauge links and vectors, with dot products accumulated in double precision.  The fermion matrix is either for the Kogut-Susskind (KS) or the fat-link plus Naik (fat-Naik) action.  For production runs, we are using the "Asqtad" action, which is correct to order a^2.  (The performance of the inverter is independent of the details of the fattening.)  We present the results in order of performance, starting with the fastest processors.  Whether Itanium or Alpha gives the better single node performance with the currently available codes depends on L.  We shall discuss Itanium first.
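As a small illustration of the mixed precision, a minimal sketch (not the MILC routine) of a single-precision field norm accumulated in double precision is:

#include <stddef.h>

/* Sum of squares of a single-precision field, accumulated in double
   precision so that rounding error does not grow with the lattice volume. */
double magsq_float_field(const float *v, size_t n_reals)
{
    double sum = 0.0;
    for (size_t i = 0; i < n_reals; i++)
        sum += (double)v[i] * (double)v[i];
    return sum;
}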

Itanium

The results in the next table are on Itanium processors with various speeds and memory systems.  The code used for these benchmarks does not include any floating point assembler instructions, but prefetching commands have been inserted to enhance performance.  The first line of the table heading indicates the processor speed.  The second line refers to the speed of the memory bus and whether it is "single pumped" or "double pumped".  The third line indicates the amount of CPU cache and total system memory. 

Speed of site major code on various Itanium processors.  This code does not include assembly code for floating point, but has in-lining and prefetching.
       600 MHz     *667 MHz     800 MHz
       100x1       133x2        133x2
       2MB/1GB     4MB/4GB      4MB/2GB
  L    MF          MF           MF
  4    692         761          916
  6    646         726          867
  8    290         591          732
 10    214         464          539
 12    187         330          359
 14    178         301          326

*It should be noted that the 667 MHz processor is a pre-production pilot.  The two faster processors have a double-pumped front-side bus and a larger cache.  By looking at ratios of results, we can see the effects of cache size, external memory speed and processor speed.

Ratio of performance on various Itanium systems
               667 vs. 600    800 vs. 600    800 vs. 667
  clock ratio     1.11           1.33           1.20
   L
   4              1.10           1.32           1.20
   6              1.12           1.34           1.19
   8              2.04           2.52           1.24
  10              2.17           2.52           1.16
  12              1.76           1.92           1.09
  14              1.69           1.83           1.08

For L=4 and 6, we see that performance increases in proportion to the clock speed.  However, for L=8 and 10, the larger caches of the 667 and 800 MHz processors, compared with the 600 MHz processor, result in much improved performance.  For these sizes, the two faster processors perform roughly in the ratio of their clocks.  For L=12 and 14, access to external memory is crucial, and we see that the double-pumped memory bus allows the 667 and 800 MHz processors to perform considerably better than the 600 MHz system.  Comparing the two faster processors, we see only an 8-9% speedup despite the 20% increase in clock speed.

Using Brian Nickerson's assembly code, the single CPU benchmarks are quite impressive.

  L     MF
  4    1223
  6    1139
  8     938
 10     N/A
 12     508
 14     464

Unfortunately, we don't have results here for the field major code.  It would require some reworking of Brian Nickerson's assembly code to convert to field major variables.

Alpha

Now let's turn to the Compaq Alpha on which the field major code was developed.  We see substantial improvements both from the new processor and the reworking of the code.

Speed on ES40 (667 MHz EV67) using site major variables and on ES45 (1000 MHz EV68) using both site (labelled old) and field major (new) variables
  L    ES40 old code    ES45 old code    ES45 new code
            (MF)             (MF)             (MF)
  6        517              731              977
  8        495              701              843
 10        395              548              934
 12        249              395              778
 14        253              347              609

IBM SP

On the IBM SP, we compare the two codes on the same speed chip and calculate the increase in performance.  The site major code shows a substantial drop-off in performance for large values of L; this is greatly improved in the field major code, where L=14 is now faster than L=8 was before.

Speedup from using field major variables on an IBM SP (375 MHz Power 3)
  L    old code    new code    speedup
         (MF)        (MF)      new/old
  4       512         663        1.29
  6       458         705        1.54
  8       391         682        1.74
 10       215         557        2.58
 12       158         528        3.35
 14       135         449        3.32

Pentium IV

The Pentium IV also shows excellent speedup on the field major code.  We expect to be able to test a dual-Pentium IV shortly after this conference.

Speedup from using field major variables on a 1.5 GHz Pentium 4 system.
  L    old code    new code    speedup
         (MF)        (MF)      new/old
  4       591         577        0.98
  6       240         503        2.10
  8       220         481        2.19
 10       208         491        2.36
 12       205         480        2.34
 14       202         469        2.33

Dual-Athlon

We were able to test a dual-Athlon MP system in mid-June.  We did not have access to a networked system of this type, so we present only single and dual CPU benchmarks here.

Speed of Kogut-Susskind Conjugate Gradient on one and two CPUs on a dual 1.2 GHz Athlon MP system
  L    old code, 1 CPU    old code, 2 CPUs    new code, 1 CPU    new code, 2 CPUs
             (MF)              (MF/CPU)             (MF)              (MF/CPU)
  4         590                  464                654                 457
  6         203                  167                336                 251
  8         176                  142                298                 232
 10         170                  134                289                 228
 12         165                  132                287                 239
 14         166                  133                281                 218

Speed of fat-Naik Conjugate Gradient on one and two CPUs on a dual 1.2 GHz Athlon MP system
  L    old code, 1 CPU    old code, 2 CPUs    new code, 1 CPU    new code, 2 CPUs
             (MF)              (MF/CPU)             (MF)              (MF/CPU)
  4         468                  222                547                 330
  6         208                  166                314                 241
  8         189                  154                304                 235
 10         183                  148                300                 248
 12         175                  150                300                 248
 14         175                  149                304                 241

Message Passing Speed

The program NetPIPE from the Ames Laboratory is useful for testing network bandwidth.  We present results from NetPIPE using several types of network hardware.  They include: FastEthernet, Myrinet, Quadrics, the IBM SP switch and Wulfkit (or Scali).  For some SMP nodes, we have tested both intra-node and inter-node communication.  For clarity we only show inter-node results here. We also plot the bandwidth required to overlap communication and floating point during the crucial stages of dslash.  The model is very simple.  There is no consideration of processor overhead from passing the message.  The model uses the achieved speed for the processor on matrix times vector operations, but this value can vary with the problem size due to cache effects.  The model result is plotted for a few assumed processor speeds.
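To make the model concrete, the following sketch computes the required bandwidth for an assumed one-dimensional cut of the local L^4 volume, 24-byte gathered vectors, and 66 flops per matrix-vector multiply; these constants are illustrative assumptions, not the exact model used for the plots.

#include <stdio.h>

/* Very rough sketch of the estimate: with the L^4 local volume cut in one
   direction, each dslash gathers roughly 2 * 24 * L^3 bytes of su3_vectors
   from the two neighboring nodes, while the processor does about
   8 * 66 * L^4 flops of matrix-vector work (vector additions ignored).
   The bandwidth needed to hide the transfer behind the arithmetic is then
   bytes divided by the compute time. */
static double required_MB_per_s(int L, double mv_mflops)
{
    double bytes   = 2.0 * 24.0 * L * L * L;      /* boundary su3_vectors   */
    double flops   = 8.0 * 66.0 * L * L * L * L;  /* matrix-vector work     */
    double seconds = flops / (mv_mflops * 1.0e6); /* time at achieved speed */
    return bytes / seconds / 1.0e6;
}

int main(void)
{
    for (int L = 4; L <= 14; L += 2)
        printf("L = %2d: %6.2f MB/s needed at 500 MF\n",
               L, required_MB_per_s(L, 500.0));
    return 0;
}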

Multinode Benchmarks

We have multinode benchmarks for the code on a number of architectures.  For each case, the problem size is scaled with the number of processors so that we always have L^4 sites per processor.

For the Compaq nodes, we have been able to run a number of tests at the Pittsburgh Supercomputer Center (PSC).  We first present ES40 and ES45 results on up to 4 CPUs, i.e., within the 4-way SMP node.  For the ES40, we have results with many processors using the Quadrics network.

We see that the combination of going from our site major to field major code and upgrading the processor from 667 MHz EV67 to 1000 MHz EV68 results in a very substantial speedup. For the largest values of L, we may be seeing total system memory bandwidth playing an important limiting role as the ratio of 4-way SMP to single CPU speeds decreases steadily from L=8 to 14 on the ES45. Scaling is better on the ES40 nodes, but it is well known that it is easier to get good scaling on codes that run slowly on a single CPU.

Comparison of four CPU benchmarks on Compaq ES40 and ES45 nodes. On the ES40, we use the site major MILC code. On the ES45, we use the new field major improvement with prefetching.

  L    ES45/ES40
  4      3.30
  6      1.81
  8      1.48
 10      1.82
 12      2.18
 14      2.15

Our large scale benchmarks of the MILC code on the TCS prototype at PSC were done before we completed work on the field major inverters.  Unfortunately, we do not have a comparable set of benchmarks for that code; we present the results for the site major code and refer the reader to the previous table to estimate the improvement that can be expected in the future.

Performance in Megaflops per CPU using the site major code on TCS, the ES40-based Compaq computer at PSC
  L     1     2     4     8    16    32    64   128   256
      CPU  CPUs  CPUs  CPUs  CPUs  CPUs  CPUs  CPUs  CPUs
  4   560   334   160   123   101    97    92    89    74
  6   517   444   368   317   280   268   258   236   173
  8   495   464   426   386   361   351   335   302   262
 10   393   384   342   308   303   296   268   247   221
 12   293   266   225   206   199   198   186   176   153
 14   240   203   160   153   149   148   145   139   131

For the IBM SP we have results on up to 256 CPUs with both site major and field major inverters.  These were obtained on the Indiana University SP using 4-way SMP nodes.  The field major code results in a substantial increase in speed for L > 8, but for reasons not yet understood, it underperforms the site major code for small L and larger numbers of CPUs.

Performance in Megaflops per CPU on the Indiana University IBM SP using the site major code
  L     1     2     4     8    16    32    64   128   256
      CPU  CPUs  CPUs  CPUs  CPUs  CPUs  CPUs  CPUs  CPUs
  4   432   305   280   110    84    78    72    64    54
  6   438   382   369   252   208   196   183   171   151
  8   375   342   340   276   239   231   223   204   181
 10   235   208   157   138   127   125   122   115   108
 12   153   128    81    77    73    73    72    69    67
 14   133   112    67    65    63    63    62    61    59

Performance in Megaflops per CPU on the Indiana University IBM SP using the field major code
  L     1     2     4     8    16    32    64   128   256
      CPU  CPUs  CPUs  CPUs  CPUs  CPUs  CPUs  CPUs  CPUs
  4   588   353   319    98    68    62    49    52    41
  6   631   515   484   245   170   160   133   132   104
  8   624   548   529   316   230   218   210   176   140
 10   579   503   471   292   224   212   203   184   159
 12   478   386   266   192   159   154   148   139   127
 14   420   293   174   148   131   128   124   117   109

Some results are available on an Itanium cluster at NCSA.  The nodes run at 733 MHz, and the network is Myrinet.  This is an early version of the Titan system currently being installed at NCSA.  We thank Avneesh Pant of NCSA for doing these benchmarks.  We can compare results using one or two CPUs per node.

Performance in Megaflops per CPU on the NCSA Itanium cluster (733 MHz nodes, Myrinet) using one or two CPUs per node
  CPUs  nodes   L=4   L=6   L=8  L=10
    1     1    1067   927   503   387
    2     2     280   533   420   339
    2     1     454   636   331   255
    4     4     235   497   409   335
    4     2     263   513   304   248
    8     8     223   353   285   280
    8     4     211   315   219   194
   16    16     118   292   293   281
   16     8      90   247   224   212

It is also worthwhile to present some benchmarks from the Platinum cluster at NCSA.  This cluster consists of 512 IBM eServer x330 thin servers with dual 1 GHz Intel Pentium III processors connected by Myrinet.  For the Pentium III, we have not found a substantial speed increase from the field major code.  In fact, the performance is often poorer, especially for large numbers of CPUs.  Results are also available for using only one CPU per node, but are not presented here.

Performance in Megaflops per CPU on the NCSA Platinum cluster using the site major code
  L     1     2     4     8    16    32    64   128
      CPU  CPUs  CPUs  CPUs  CPUs  CPUs  CPUs  CPUs
  4   433   288   224   118    96    87    74    70
  6   150   106    94    85    77    75    76    74
  8   139    96    94    86    82    80    78    75
 10   132    93    91    85    81    81    81    79
 12   132    90    90    78    74    73    74    79
 14   131    90    91    73    66    71    71    80

Performance in Megaflops per CPU on the NCSA Platinum cluster using the field major code
  L     1     2     4     8    16    32    64   128
      CPU  CPUs  CPUs  CPUs  CPUs  CPUs  CPUs  CPUs
  4   430   285   195    87    63    61    57    53
  6   167   111   104    79    69    69    68    66
  8   159   109   107    85    76    76    74    71
 10   153   107   104    80    74    73    72    77
 12   150   105   103    79    73    72    69    71
 14   148   104   102    82    77    72    70    68

The last cluster we would like to consider is one that uses Wulfkit or Scali for the interconnect.  We have seen that this interconnect has performance comparable to Myrinet on the NetPIPE benchmark.  Unfortunately, the cluster to which we were given access has older nodes that do not have the same capability as the newer systems.  Each node has two 450 MHz Pentium III processors.

Performance in Megaflops per CPU on a computer using the Scali interconnect using the site major code
  L     1     2     4     8    16    32
      CPU  CPUs  CPUs  CPUs  CPUs  CPUs
  4   142   107    86    68    55    48
  6   142   117    76    59    55    51
  8   147    63    63    58    52    52
 10   147    61    60    57    53    53
 12   142    59    59    56    55    54
 14    72    59    59    57    55    55

Future Work

There are several additional opportunities to improve code performance.  For the improved action codes, the Conjugate Gradient inverter no longer dominates the time to the extent it used to.  The techniques used for the CG must be tried on other sections of the code.  We must also apply these techniques to our applications that use Wilson or Clover quarks.  Finally, with increased single node performance, we must work on the message passing parts of the code to reduce the loss of performance there.  Within the context of the Department of Energy SciDAC Lattice Gauge Theory project, several MILC members and new postdocs will be working on code development.

The field major enhancements discussed here will be included in the next release of the MILC code.

Acknowledgements

Many people outside the MILC collaboration have helped to make this work possible.  I am grateful to all of them.  At Intel, they include Gautam Doshi, Robert Fogel and Brian Nickerson.  At Compaq, Dick Foster has played an important role.  Keith Murphy at Dolphin Interconnect Solutions arranged for access to a computer using Scali, and David Garret of Lockheed Martin provided access and explanations.  At NCSA, Rob Pennington, Avneesh Pant and Dave McWilliams have provided early access, advice and support.  At PSC, R. Reddy, Sergiu Sanielevici, Michael Levine and Ralph Roskies have done the same.  At IU, we particularly thank Mary Papakhian.  Rick Rodriguez of Penguin Computing and Andrew Kretzer of Bold Data are thanked for access to dual-Athlon systems.  Finally, I would like to thank the Theory Group at Fermilab, where this poster was prepared.