Single Node Performance

It is easy to waste a lot of money on poor system design. Processor price can rise rapidly with clock speed, but for many codes access to memory is the bottleneck, so a faster processor buys little extra performance.

For a fixed speed, processor price decreases with time, especially for the more expensive CPUs.

For our QCD codes, access to memory is quite important. By comparing 500 MHz and 600 MHz Athlons, we demonstrate that performance does not increase in proportion to the speed of the chip. This is because the speed of memory access is fixed by the front side bus (FSB) speed. My latest cluster uses a 650 MHz Athlon and PC133 memory with a Biostar motherboard. We see that the boost from the faster memory is much less than 33 percent.

L     Athlon 500 MHz    Athlon 600 MHz    Ratio                Athlon 650 MHz
      100 MHz FSB       100 MHz FSB       (clock ratio 1.2)    133 MHz FSB
      (MF)              (MF)                                   (MF)
4     231               276               1.19                 276
6     129               135               1.05                 145
8     97                102               1.05                 113
10    92                97                1.05                 105
12    90                95                1.05                 104
14    89                93                1.06                 102

Comparing the two cases with a 100 MHz FSB, we see that for L = 4, where the problem fits in cache, there is a 19.5% speedup on the faster processor. For all the larger problems, however, the speedup is only 5-6%. We expect that for even faster processors, memory access will become an even greater issue and performance increases will be marginal.
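One way to read these numbers is to ask what fraction of the 20% clock increase actually shows up as measured speedup. The short Python sketch below does that arithmetic for the two 100 MHz FSB columns. The MF figures are copied from the table; the function name `clock_gain_realized` is ours and purely illustrative.

```python
# MF figures copied from the Athlon table (100 MHz FSB columns).
ATHLON_500 = {4: 231, 6: 129, 8: 97, 10: 92, 12: 90, 14: 89}
ATHLON_600 = {4: 276, 6: 135, 8: 102, 10: 97, 12: 95, 14: 93}
CLOCK_RATIO = 600 / 500  # 1.2


def clock_gain_realized(slow_mf, fast_mf, clock_ratio):
    """Fraction of the clock increase that appears as measured speedup.

    1.0 means performance scaled fully with the clock;
    0.0 means the faster clock bought nothing.
    """
    return (fast_mf / slow_mf - 1.0) / (clock_ratio - 1.0)


realized = {
    L: clock_gain_realized(ATHLON_500[L], ATHLON_600[L], CLOCK_RATIO)
    for L in ATHLON_500
}

for L, frac in realized.items():
    speedup = ATHLON_600[L] / ATHLON_500[L]
    print(f"L={L:2d}  speedup={speedup:.3f}  clock gain realized={frac:.0%}")
```

With these figures, nearly all of the clock gain is realized at L = 4, and only about a quarter of it for the larger lattices.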

The chipset can make a big difference!

Several motherboards with support for PC133 memory were tested, but only those using a ServerWorks chipset gave significantly better performance than motherboards with PC100 memory.

L     Gigabyte    Intel    Supermicro    PII            RR PII
      GA6VXE+     CC820    PIIISED       350 MHz        450 MHz
      VIA         I 820    810e          100 MHz FSB    PGI compiler
      (MF)        (MF)     (MF)          (MF)           (MF)
4     186         182      174           113            141
6     106         98       94            83             99
8     81          76       73            72             82
10    76          72       70            70             79
12    76          70       69            70             78
14    73          70       69            70             78

The Intel 815e support chip is somewhat better, but it does not support error correction (ECC). The next table compares an Intel motherboard with two systems that use a ServerWorks chipset. Los Lobos uses Intel 733 MHz chips in IBM NetFinity servers. (I don't know if the motherboards are made by IBM or a third party.) Both Supermicro and Tyan produce motherboards that use ServerWorks chips. The Supermicro costs about $275 and supports dual CPUs.

L     Intel D815EEA2    LosLobos       Supermicro
      933 MHz           733 MHz        370 DLE
      133 MHz FSB       133 MHz FSB    1000 MHz
      (MF)              (MF)           (MF)
4     355               319            413
6     126               140            148
8     117               130            136
10    112               127            131
12    112               127            129
14    111               126            129

CPUs with wider cache lines can help if you structure your code to take advantage of them.

The traditional MILC code data structure is "site major," i.e., there is a structure for each site that contains all the physical variables for that site.  The lattice is an array of site structures.  Adding variables to the application is quite easy.  One only needs to modify the site structure in lattice.h, and when the lattice is malloc'ed, the new variables will be globally accessible.

This Spring, we tested performance enhancements from temporary allocations of "field major" variables for the conjugate gradient routines.  Some site variables are copied to temporary vectors that are much better localized in memory.  If a cache line contains data not needed for the current site, it is most likely the data required for the next site to be computed, rather than a different physical variable, as would be found in the next bytes of the site structure.
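The locality argument can be made concrete with a little byte arithmetic. The Python sketch below counts which cache lines are touched when one vector per site is read across a run of sites, first with the vectors embedded in site structures and then with the vectors packed contiguously. It is only an illustrative model: the 512-byte site structure and the 64-byte cache line are hypothetical values chosen for the example, not measurements of the MILC code or of any machine discussed here, and `cache_line_utilization` is our own helper name.

```python
def cache_line_utilization(n_sites, field_bytes, stride_bytes,
                           offset_bytes=0, line_bytes=64):
    """Fraction of fetched cache-line bytes that belong to the wanted field.

    Site i's copy of the field occupies the byte range
    [offset + i*stride, offset + i*stride + field_bytes).
    """
    lines_touched = set()
    for i in range(n_sites):
        first = offset_bytes + i * stride_bytes
        last = first + field_bytes - 1
        lines_touched.update(range(first // line_bytes, last // line_bytes + 1))
    useful = n_sites * field_bytes
    fetched = len(lines_touched) * line_bytes
    return useful / fetched


VECTOR_BYTES = 3 * 2 * 4   # 3 complex single precision numbers = 24 bytes
SITE_STRUCT_BYTES = 512    # hypothetical size of one site structure
N_SITES = 16

# Site major: consecutive vectors are one whole site structure apart.
site_major = cache_line_utilization(N_SITES, VECTOR_BYTES, SITE_STRUCT_BYTES)
# Field major: consecutive vectors are adjacent in a temporary array.
field_major = cache_line_utilization(N_SITES, VECTOR_BYTES, VECTOR_BYTES)

print(f"site major : {site_major:.1%} of fetched bytes are vector data")
print(f"field major: {field_major:.1%} of fetched bytes are vector data")
```

In this toy model the remainder of each line fetched from a site structure holds other variables of the same site, whereas in the packed array it holds the vectors of the following sites. Wider cache lines make the gap larger.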

Single Node Results

The benchmarks presented here were run on lattices of size L^4. They are all for single precision gauge links and vectors, with dot products accumulated in double precision.  The fermion matrix is for either the Kogut-Susskind (KS) or the fat-link plus Naik (fat-Naik) action.  For production runs, we are using the "Asqtad" action, which is correct to order a^2. (The performance of the inverter is independent of the details of the fattening.)  We present the results in performance order, starting with the faster processors. Which of Itanium and Alpha gives the better single node performance with the currently available codes depends on L. We shall discuss Itanium first.
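To get a feel for how quickly an L^4 problem outgrows a cache, the following Python sketch estimates the bytes of single precision data the inverter touches per lattice. It is a rough estimate under stated assumptions, not a description of the actual MILC memory layout: we assume 4 link matrices per site per set of links (`link_sets=1` for KS, `link_sets=2` if fat and long links are both stored for fat-Naik) and a hypothetical count of 4 work vectors, and we ignore temporaries and everything else in the site structure.

```python
BYTES_PER_COMPLEX = 2 * 4                  # single precision real + imaginary
SU3_MATRIX_BYTES = 9 * BYTES_PER_COMPLEX   # 3x3 complex gauge link = 72 bytes
SU3_VECTOR_BYTES = 3 * BYTES_PER_COMPLEX   # 3-component complex vector = 24 bytes
DIRECTIONS = 4


def working_set_bytes(L, link_sets=1, n_vectors=4):
    """Approximate bytes of links plus vectors on an L^4 lattice.

    link_sets and n_vectors are assumptions made for this estimate.
    """
    sites = L ** 4
    per_site = (link_sets * DIRECTIONS * SU3_MATRIX_BYTES
                + n_vectors * SU3_VECTOR_BYTES)
    return sites * per_site


for L in (4, 6, 8, 10, 12, 14):
    mib = working_set_bytes(L) / 2 ** 20
    print(f"L={L:2d}  sites={L ** 4:6d}  approx. working set={mib:7.2f} MiB")
```

Because the site count grows as L^4, going from L = 4 to L = 8 multiplies the estimate by 16, which is why a modest change in L can move a problem from cache to main memory.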

The results in the next table are on Itanium processors with various speeds and memory systems.  The code used for these benchmarks does not include any floating point assembler instructions, but prefetching commands have been inserted to enhance performance.  The first line of the table heading indicates the processor speed.  The second line refers to the speed of the memory bus and whether it is "single pumped" or "double pumped".  The third line indicates the amount of CPU cache and total system memory. 

Speed of site major code on various Itanium processors.  This code does not include assembly code for floating point, but has in-lining and prefetching.
L     600 MHz    *667 MHz    800 MHz
      100x1      133x2       133x2
      2MB/1GB    4MB/4GB     4MB/2GB
      (MF)       (MF)        (MF)
4     692        761         916
6     646        726         867
8     290        591         732
10    214        464         539
12    187        330         359
14    178        301         326

*It should be noted that the 667 MHz processor is a pre-production pilot. The two faster processors have a double-pumped front side bus and also a larger cache. By looking at ratios of results, we can see the effects of cache size, external memory speed and processor speed.

Ratio of performance on various Itanium systems
L              667 vs. 600    800 vs. 600    800 vs. 667
clock ratio    1.11           1.33           1.20
4              1.10           1.32           1.20
6              1.12           1.34           1.19
8              2.04           2.52           1.24
10             2.17           2.52           1.16
12             1.76           1.92           1.09
14             1.69           1.83           1.08

For L=4 and 6, we see that performance increases in proportion to the clock speed.  However, for L=8 and 10, the larger caches of the 667 and 800 MHz processors result in much better performance than on the 600 MHz processor.  For these sizes, the two faster processors perform roughly in the ratio of their clocks.  For L=12 and 14, access to external memory is crucial, and we see that the double-pumped memory bus allows the 667 and 800 MHz processors to perform considerably better than the 600 MHz system. Comparing the two faster processors, we see that there is only an 8-9% speedup despite the 20% speedup in the clock.
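The ratio table can be regenerated directly from the MF figures, which makes it easy to repeat the comparison for other pairs of systems. The Python sketch below is one way to do it; the figures are copied from the Itanium speed table, and the helper name `ratio_table` is ours.

```python
# MF figures copied from the Itanium speed table, keyed by clock in MHz.
ITANIUM_MF = {
    600: {4: 692, 6: 646, 8: 290, 10: 214, 12: 187, 14: 178},
    667: {4: 761, 6: 726, 8: 591, 10: 464, 12: 330, 14: 301},
    800: {4: 916, 6: 867, 8: 732, 10: 539, 12: 359, 14: 326},
}
PAIRS = [(667, 600), (800, 600), (800, 667)]  # (faster, slower)


def ratio_table(mf, pairs):
    """Performance ratio faster/slower for each lattice size L."""
    sizes = sorted(next(iter(mf.values())))
    return {
        L: [round(mf[fast][L] / mf[slow][L], 2) for fast, slow in pairs]
        for L in sizes
    }


clock = [round(fast / slow, 2) for fast, slow in PAIRS]
ratios = ratio_table(ITANIUM_MF, PAIRS)

print("clock", " ".join(f"{c:5.2f}" for c in clock))
for L, row in ratios.items():
    print(f"{L:5d}", " ".join(f"{r:5.2f}" for r in row))
```

Rows where the performance ratio is well above the clock ratio point to the cache or memory bus difference rather than the processor speed.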

With Brian Nickerson's assembly code, the single CPU benchmarks are quite impressive.

L     MF
4     1223
6     1139
8     938
10    N/A
12    508
14    464

Unfortunately, we don't have results here for the field major code.  It would require some reworking of Brian Nickerson's assembly code to convert to field major variables.

Now let's turn to the Compaq Alpha, on which the field major code was developed.  We see substantial improvements both from the new processor and from the reworking of the code.

Speed on ES40 (667 MHz EV67) using site major variables and on ES45 (1000 MHz EV68) using both site (labelled old) and field major (new) variables
L     ES40        ES45        ES45
      old code    old code    new code
      (MF)        (MF)        (MF)
6     517         731         977
8     495         701         843
10    395         548         934
12    249         395         778
14    253         347         609

On the IBM SP, we compare the two codes on the same speed chip and calculate the increase in performance.  The site major code showed a substantial drop off in performance for large values of L. This is greatly improved in the field major code, where L=14 is now faster than L=8 was before.

Speedup from using field major variables on an IBM SP (375 MHz Power 3)
L     old code    new code    speedup
      (MF)        (MF)        new/old
4     512         663         1.29
6     458         705         1.54
8     391         682         1.74
10    215         557         2.58
12    158         528         3.35
14    135         449         3.32

The Pentium IV also shows excellent speedup with the field major code.  We expect to be able to test a dual-Pentium IV shortly after this conference.

Speedup from using field major variables on a 1.5 GHz Pentium 4 system.
L     old code    new code    speedup
      (MF)        (MF)        new/old
4     591         577         0.98
6     240         503         2.10
8     220         481         2.19
10    208         491         2.36
12    205         480         2.34
14    202         469         2.33

We were able to test a dual-Athlon MP system in mid-June.  We don't have any networked systems, so we present results for both single and dual CPU benchmarks here. 

Speed of Kogut-Susskind Conjugate Gradient on one and two CPUs on a dual 1.2 GHz Athlon MP system
L     old code; 1 CPU    old code; 2 CPUs    new code; 1 CPU    new code; 2 CPUs
      (MF)               (MF/CPU)            (MF)               (MF/CPU)
4     590                464                 654                457
6     203                167                 336                251
8     176                142                 298                232
10    170                134                 289                228
12    165                132                 287                239
14    166                133                 281                218

Speed of fat-Naik Conjugate Gradient on one and two CPUs on a dual 1.2 GHz Athlon MP system
L     old code; 1 CPU    old code; 2 CPUs    new code; 1 CPU    new code; 2 CPUs
      (MF)               (MF/CPU)            (MF)               (MF/CPU)
4     468                222                 547                330
6     208                166                 314                241
8     189                154                 304                235
10    183                148                 300                248
12    175                150                 300                248
14    175                149                 304                241

The Pentium IV seems to be the best commodity chip, but the PCI support on current boards is a problem. For a high-speed network, PCI support is important.

