|
Single Node Performance
It is easy to waste a lot of money on poor system design. Processor price can rise rapidly with speed, but for many codes access to memory is the bottleneck, so a faster processor improves performance only slightly.
For a fixed speed, processor price decreases with time, especially for the more expensive CPUs.
For our QCD codes, access to memory is quite important.
By comparing 500 MHz and 600 MHz Athlons,
we demonstrate that performance does not increase in proportion to the
speed of the chip. This is because memory access speed is fixed by the front side bus (FSB) speed.
My latest cluster uses a 650 MHz Athlon and PC133 memory with a Biostar
motherboard. We see that the boost from the faster memory is much less
than the 33 percent increase in memory clock from PC100 to PC133.
When comparing the two cases with 100 MHz FSB, we see that for L = 4, for which the problem fits in cache, there is a 19.5% speedup on the faster processor, close to the 20% increase in clock speed. But for all the larger problems, the speedup is only 5%. We expect that for even faster processors, memory access will become an even greater issue and performance increases will be marginal.

The chipset can make a big difference! Several motherboards with support for PC133 memory were tested, but only those using a ServerWorks chipset gave significantly better performance than motherboards with PC100 memory.
The Intel 815e support chip is somewhat better, but it does not support error correction (ECC). The next table compares an Intel motherboard with two systems that use a ServerWorks chipset. Los Lobos uses Intel 733 MHz chips in IBM NetFinity servers. (I don't know if the motherboards are made by IBM or a third party.) Both Supermicro and Tyan produce motherboards that use ServerWorks chips. The Supermicro costs about $275 and supports dual CPUs.
CPUs with wider cache lines can help if you structure your code to take advantage of them. The traditional MILC code data structure is "site major," i.e., there is a structure for each site that contains all the physical variables for that site. The lattice is an array of site structures. Adding variables to the application is quite easy: one only needs to modify the site structure in lattice.h, and when the lattice is malloc'ed, the new variables will be globally accessible.

This spring, we tested performance enhancements from temporary allocations of "field major" variables for the conjugate gradient routines. Some site variables are copied to temporary vectors that are much better localized in memory. If a cache line contains data not needed for the current site, it is most likely the data required for the next site to be computed, rather than a different physical variable, as would be found in the next bytes of the site structure.

Single Node Results
The benchmarks presented here were run on lattices of size L^4. They are all for single precision gauge links and vectors, with dot products accumulated in double precision. The fermion matrix is for either the Kogut-Susskind (KS) or the fat-link plus Naik (fat-Naik) action. For production runs, we are using the "Asqtad" action, which is correct to order a^2. (The performance of the inverter is independent of the details of the fattening.)

We present the results in performance order, starting with the faster processors. Whether Itanium or Alpha gives the best single node performance with the currently available codes depends on L. We shall discuss Itanium first.

The results in the next table are on Itanium processors with various speeds and memory systems. The code used for these benchmarks does not include any floating point assembler instructions, but prefetching commands have been inserted to enhance performance. The first line of the table heading indicates the processor speed. The second line refers to the speed of the memory bus and whether it is "single pumped" or "double pumped". The third line indicates the amount of CPU cache and total system memory.
*Note that the 667 MHz processor is a pre-production pilot. The two faster processors have a double-pumped front side bus and also a larger cache. By looking at ratios of results, we can see the effects of cache size, external memory speed and processor speed.
For L=4 and 6, we see that performance increases in proportion
to the clock speed. 
However, for L=8 and 10, the larger caches of the 667
and 800 MHz processors give much better performance than
the 600 MHz processor.
For these sizes, the two faster processors perform
roughly in the ratio of their clocks. 
For L=12 and 14, access to external
memory is crucial and we see that the double-pumped memory bus
allows the 667 and 800
MHz processors to perform considerably better than the 600 MHz system.
Comparing the two faster processors, we see that there
is only an 8-9% speedup despite the 20% increase in clock speed.
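
One way to read a speedup that falls short of the clock ratio is a simple two-term timing model. The sketch below is an illustration only, not something measured in these benchmarks: it assumes run time splits into a part that scales inversely with the clock and a part (memory access) that does not. The function names `clock_bound_fraction` and `predicted_speedup` are hypothetical.

```c
#include <stdio.h>

/*
 * Assumed model (not from the benchmarks): T(f) = c/f + m, where c/f is
 * compute time that scales with clock speed f and m is memory time that
 * does not. Normalize the run time at the slower clock to 1 and let x be
 * the clock-bound fraction of that time. With clock ratio r = f2/f1, the
 * speedup is S = 1 / (x/r + (1 - x)).
 */

/* Speedup predicted by the model for clock ratio r and clock-bound fraction x. */
double predicted_speedup(double clock_ratio, double x)
{
    return 1.0 / (x / clock_ratio + (1.0 - x));
}

/* Invert the model: the clock-bound fraction x implied by an observed speedup. */
double clock_bound_fraction(double clock_ratio, double speedup)
{
    return (1.0 - 1.0 / speedup) / (1.0 - 1.0 / clock_ratio);
}
```

Under this assumed model, an 8-9% speedup for a clock ratio of 1.2 corresponds to a clock-bound fraction between roughly 0.44 and 0.50, i.e., about half of the run time at these sizes would not benefit from a faster clock. A speedup equal to the clock ratio gives a fraction of 1.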
Using Brian Nickerson's assembly code, the single CPU benchmarks are quite
impressive.
Unfortunately, we don't have results here for the field major code. It would require some reworking of Brian Nickerson's assembly code to convert to field major variables.
Now let's turn to the Compaq Alpha on which the field major code was developed.  We see substantial improvements both from the new processor and the reworking of the code.
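
The difference between the two layouts can be sketched in a few lines of C. This is a simplified illustration, not the MILC code itself: the `site` structure here is a hypothetical stand-in for the one in lattice.h (which holds many more variables), and `gather_src`, `norm2_site_major` and `norm2_field_major` are names invented for this example. As in the benchmarks, the data are single precision and the dot product is accumulated in double precision.

```c
#include <stdlib.h>

/* Hypothetical, much reduced site structure (the real one differs). */
typedef struct {
    float link[4][18]; /* stand-in for four SU(3) gauge links          */
    float src[6];      /* stand-in for a color vector (3 complex nums) */
    float dest[6];     /* another per-site vector                      */
} site;

/* Site major: successive src vectors are sizeof(site) bytes apart,
   so each cache line also drags in links and dest that go unused. */
double norm2_site_major(const site *lattice, size_t nsites)
{
    double sum = 0.0; /* accumulate in double precision */
    for (size_t i = 0; i < nsites; i++)
        for (int k = 0; k < 6; k++)
            sum += (double)lattice[i].src[k] * lattice[i].src[k];
    return sum;
}

/* Copy one site variable into a contiguous temporary vector.
   Returns NULL if allocation fails; the caller frees the result. */
float *gather_src(const site *lattice, size_t nsites)
{
    float *tmp = malloc(nsites * 6 * sizeof *tmp);
    if (tmp == NULL)
        return NULL;
    for (size_t i = 0; i < nsites; i++)
        for (int k = 0; k < 6; k++)
            tmp[6 * i + k] = lattice[i].src[k];
    return tmp;
}

/* Field major: the same sum over a contiguous array, so the unused
   part of a cache line belongs to the next site's vector. */
double norm2_field_major(const float *v, size_t nsites)
{
    double sum = 0.0;
    for (size_t j = 0; j < 6 * nsites; j++)
        sum += (double)v[j] * v[j];
    return sum;
}
```

In this sketch the site structure holds 84 floats, so with 4-byte floats the site major loop touches 24 useful bytes out of every 336, while the field major loop reads memory contiguously. The copy in `gather_src` has a cost of its own, which is why it pays off mainly for routines such as the conjugate gradient that reuse the temporary vectors many times.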
On the IBM SP, we compare the two codes on the same speed chip and calculate the increase in performance. The site major code showed a substantial drop-off in performance for large values of L. This is greatly improved in the field major code, where L=14 is faster than L=8 was before.
The Pentium IV also shows excellent speedup on the field major code. We expect to be able to test a dual-Pentium IV shortly after this conference.
We were able to test a dual-Athlon MP system in mid-June.  We don't have any networked systems, so we present results for both single and dual CPU benchmarks here. 
The Pentium IV seems to be the best commodity chip, but the PCI support on current boards is a problem. For a high-speed network, PCI support is important.
|