The STREAM memory benchmark (http://www.cs.virginia.edu/stream) is a widely used synthetic test that measures sustainable memory bandwidth in MB/s. Memory bandwidth has become more important as CPU vendors add more cores to each chip: the ratio of core count to available memory bandwidth can have a huge impact on performance. If a system can't move data from main memory to a CPU core fast enough, the cores sit idle waiting for the data to arrive. This idle time lowers the efficiency of the system and can negate some of the benefit of more cores or faster clock speeds.
STREAM has become quite popular and is one of the de facto benchmarks used to gauge system memory performance. It's included as part of the HPC Challenge test suite (http://icl.cs.utk.edu/hpcc/), a benchmark that tries to measure the performance of a high-performance computing system with multiple metrics, not just clock speed. As in our previous benchmarking, several different compilers were tested (GCC, Intel, and Portland Group) and the one that consistently provided the best results was used for all the data you see below. The compilers didn't differ much in performance, but the Intel compiler showed a slight advantage, so the benchmark was compiled with Intel icc with OpenMP enabled to parallelize the test. All test results are based on running one thread for every physical core: on the Intel Xeon 5500 series (Nehalem) the test was run with 8 threads, and on the AMD Opteron 2400 series (Istanbul) with 12.
Before discussing the results of the benchmarks, a short overview of the processor architectures and memory controllers will provide context for our findings. The system architectures are listed in order of memory performance, from slowest to fastest. You can click on the headings to see a diagram of the CPU and system architecture.
The Xeon 5400 "Harpertown" processor is essentially two dual-core processors on one physical CPU package. Each physical CPU shares a connection to the front side bus (FSB). There is one memory controller per system, and it is part of the system's chipset, called the memory controller hub (MCH). The MCH provides all of the physical access to the system's 667MHz or 800MHz fully buffered DIMMs (FBDIMMs). This shared memory controller and FSB create a bottleneck in memory bandwidth performance.
The Opteron Barcelona and Shanghai integrate four cores on one physical CPU package. In a 2-CPU system, each CPU has a dedicated point-to-point connection to the other CPU, which AMD calls HyperTransport. Each CPU also has a dedicated memory controller with its own bank of dual-channel DDR2 DIMMs. This provides double the number of memory controllers of the Xeon 5400 system and more than doubles the memory bandwidth it can offer. The Barcelona processor was produced at 65nm and supports dual-channel 667MHz DDR2 DIMMs, while the newer Shanghai processor is a 45nm part that supports dual-channel 800MHz DDR2 DIMMs.
The Opteron Istanbul processor is very similar to the Shanghai except for the addition of two extra cores, for a total of six, and increased CPU-to-CPU HyperTransport speeds. There were no major changes to the memory controller: each CPU supports dual-channel 800MHz DDR2 DIMMs.
The Xeon 5500 was a major architectural overhaul of the 5400 series. Its design resembles the Opteron more than the older Xeon: each physical CPU is native quad-core, with a dedicated memory controller and a point-to-point connection between CPUs that Intel calls QuickPath Interconnect (QPI). Each CPU's memory controller provides tri-channel DDR3 memory support. Depending on the processor model and number of DIMMs, the memory speed can be 800, 1066, or 1333MHz.
For the Xeon 5500:
For the Opteron 2400:
While the memory sizes differ slightly, these configurations were selected because 12 DIMMs perform best on the Xeon 5500 series, while 8 DIMMs perform best on the Opteron 2435. The other tests were performed on older, discontinued Advanced Clustering models.
As you can see in the graph above, the Intel Xeon 5500 far exceeds the performance of the previous-generation Xeon architecture, as well as beating out the best of the AMD Opteron systems. Comparing the memory bandwidth of a Xeon 5400 to a 5500, we see an almost 4x improvement: jumping from 9.7GB/s to 37GB/s. Even the slowest memory speed on a Xeon 5500 processor bests the fastest produced by the Opteron by as much as 20%; against the fastest Xeon, the Xeon outperforms the Opteron by over 75%. The Xeon 5500 achieves these much higher memory bandwidth results because of tri-channel instead of dual-channel memory, the increased clock speed of DDR3 (up to 1333MHz), and the fast point-to-point CPU interconnect provided by QuickPath Interconnect.
With the Istanbul processor, AMD didn't change the memory controller, the number of channels, or the DIMM speed, so performance stays about the same. It dips slightly -- most likely due to the extra contention from two more cores per memory controller. The addition of HT3 between CPUs doesn't change the performance of the memory controller itself.
When you add cost per machine into the mix, the results still show the Xeon 5500 series with a clear lead. The Xeon machine as configured has a price of approximately $3,800 while the Opteron is priced at $3,500. This gives the Xeon a rate of 9.8 megabytes per second per dollar vs. 5.9 megabytes per second per dollar for the Opteron: a 66% advantage for the Intel Xeon 5500 series.
If your code is bound by memory bandwidth, the Intel Xeon 5500 series processor is a clear winner. Remember, though, that memory bandwidth isn't everything; it is only one part of overall system performance. If you refer to our previous HPL benchmark, you will see that the two extra cores give the Opteron systems a lead in raw GFLOPS. Of course, these synthetic benchmarks are not always indicative of real-world performance. In upcoming blog posts, we will compare performance on real-world applications, including popular codes for molecular dynamics, computational fluid dynamics, and weather forecasting. We always strongly recommend that our customers benchmark their own code and figure out what hardware best fits their software. If you have suggestions or comments, please send an email to