We haven't seen any molecular dynamics benchmarks on the newest processors: the six-core AMD Opteron (Istanbul) or the four-core Intel Xeon (Nehalem). To get a better understanding of MD code performance on these chips, I took a look at two of the most common MD codes out there: LAMMPS and GROMACS. My results are a little surprising!
I ran an exhaustive set of benchmarks with the latest compilers: GCC 4.4.0, Portland Group 9.0, Intel 11, and the newly minted Open64 release from AMD. On both codes, Intel 11 narrowly beat all other compilers on both AMD and Intel processors. The results below reflect code tuned with -msse3 on the AMD Opteron 2435 and -xSSE4.2 on the Intel Xeon X5550. I made many attempts to find a better compiler and set of optimizations for AMD, but Intel's compiler was consistently at the top of the stack, even on its competitor's platform.
Each set of data represents five runs of the benchmark on each chip, and I include a box plot to show the amount (or lack) of jitter in the results. Overall, the standard deviation in both benchmarks was low, meaning the results hold constant from run to run.
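As a sketch of how run-to-run jitter is quantified, here is the mean and sample standard deviation for a set of five runs. The run times below are hypothetical placeholders, not the actual measurements:

```python
import statistics

# Hypothetical wall-clock times (seconds) for five runs -- NOT the measured data
runs = [151.0, 151.2, 151.4, 151.6, 152.0]

mean = statistics.mean(runs)
stdev = statistics.stdev(runs)  # sample standard deviation across the five runs

print(f"mean = {mean:.1f} s, stdev = {stdev:.2f} s")
```

A standard deviation well under one second on a ~150-second run is what "low jitter" means here.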
The AMD system tested was a Pinnacle 1BA2301 compute blade with dual 2.6GHz Opteron 2435 processors; the Intel system was a Pinnacle 1BX5501 with dual 2.66GHz Xeon X5550 processors. Memory speed is 800MHz DDR2 on Opteron and 1333MHz DDR3 on Xeon.
Neither benchmark uses more than a few hundred MB of RAM per core at any point. At most, some 25% of the RAM of either system was consumed during a run.
Intel Xeon 5500-series processors are capable of “Turbo mode,” a kind of self-overclocking that engages when the chips' cores are under 50% utilized. However, these codes consume all cores, so Turbo does not apply here.
About the Benchmarks
GROMACS' benchmark results are the total seconds it took to run the DPPC simulation included with the gmxbench 3.0 distribution on each platform. GROMACS supports MKL (Intel's Math Kernel Library) directly as a compile-time option, unlike many codes that require an intermediate library like FFTW to access MKL's accelerated functions. However, I found that using MKL directly added only a fraction of a percentage point in performance. Nevertheless, I used GROMACS in MKL-direct mode on both platforms. GROMACS times itself, and that is the number reported; a lower score is better, meaning a given CPU was able to do the same amount of work in less time.
LAMMPS' benchmark results are the number of CPU seconds per atom per time step, as recommended by the LAMMPS developers. LAMMPS does not support MKL directly, so MKL's FFTW2 wrapper interface library was used for acceleration. LAMMPS times itself, and that is the number reported; a lower score is better: a given CPU takes less time per unit of work.
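For reference, the LAMMPS-recommended normalization is simply the loop time divided by the atom count and by the number of time steps. A minimal sketch, where the loop time, atom count, and step count below are illustrative values, not measured ones:

```python
def seconds_per_atom_per_step(loop_time_s, n_atoms, n_steps):
    """Normalize a LAMMPS loop time to CPU seconds per atom per time step."""
    return loop_time_s / (n_atoms * n_steps)

# Illustrative only: a 256,000-atom run over 100 steps taking 55 seconds of loop time
metric = seconds_per_atom_per_step(55.0, 256_000, 100)
print(f"{metric:.2e} s/atom/step")
```

This normalization is what makes runs of different sizes and lengths directly comparable.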
I used MPICH2 1.1 as the MPI layer for all benchmarks.
Neither benchmark benefited from the introduction of hyper-threading on Xeon (running -np 16 in place of -np 8).
Each DPPC run on AMD Opteron was performed with -np 12; on Intel Xeon, with -np 8. The simulation size was held constant at the size included with the gmxbench 3.0 distribution.
Opteron turns in a mean runtime of 151.4 seconds; Xeon, 172.8 seconds. The standard deviations are 0.55 and 0.84 seconds, respectively. Opteron is the clear winner on this benchmark, besting Xeon by 12%. Factoring in the price of each configured system, Opteron is an even bigger winner, offering that additional 12% performance for 8% less cost ($3,500 versus $3,800). This benchmark appears to benefit from the additional cores available on Opteron, with little penalty for the increased simulation fragmentation.
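The performance and price deltas fall straight out of the reported numbers; a quick check:

```python
# Reported mean DPPC run times (seconds) and configured system prices (USD)
opteron_time, xeon_time = 151.4, 172.8
opteron_price, xeon_price = 3500, 3800

# Fraction of run time Opteron saves relative to Xeon
perf_advantage = (xeon_time - opteron_time) / xeon_time
# Fraction of system cost Opteron saves relative to Xeon
price_advantage = (xeon_price - opteron_price) / xeon_price

print(f"Opteron needs {perf_advantage:.0%} less run time at {price_advantage:.0%} lower cost")
```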
As with the GROMACS benchmark, I initially held the simulation size constant, using in.rhodo.scaled with a 2 x 2 x 2 scaling factor for a simulation size of 256,000 atoms. Here LAMMPS begins to offer a more interesting, and muddier, performance picture.
Statistically, Opteron 2435 and Xeon X5550 are tied when the simulation size is held constant: AMD Opteron delivers 2.15E-06 seconds per atom per time step; Intel Xeon, 2.15E-06. The standard deviations are three orders of magnitude smaller. What does this mean? AMD makes up for a less efficient architecture by bringing more cores to bear on the problem. How much less efficient? That's the subject of the next diagram.
LAMMPS recommends scaling the benchmark to the number of cores thrown at the problem: 32,000 atoms per core. For the AMD Opteron test, I therefore scaled the problem to a 2 x 3 x 2 grid, for a 384,000-atom simulation, still running with -np 12.
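The atom counts follow directly from the 32,000-atom in.rhodo base cell and the replication grid:

```python
BASE_ATOMS = 32_000  # atoms in the unscaled in.rhodo cell

def scaled_atoms(nx, ny, nz):
    """Total atoms after replicating the base cell on an nx x ny x nz grid."""
    return BASE_ATOMS * nx * ny * nz

print(scaled_atoms(2, 2, 2))  # 256,000 atoms: the fixed-size runs
print(scaled_atoms(2, 3, 2))  # 384,000 atoms: 32,000 per core across 12 Opteron cores
```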
Once scaled, the benchmark is essentially core-versus-core rather than platform-versus-platform. Each AMD Opteron core delivers 2.62E-05 seconds per atom per time step; each Intel Xeon core, 1.72E-05. Again, the standard deviations are three orders of magnitude smaller. Essentially, this result shows that each Xeon core is 34% more efficient at the same workload size, a deficit the Opteron chip manages to overcome by bringing 50% more cores to the platform.
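The per-core efficiency gap and the core-count compensation can both be checked directly from the reported figures:

```python
opteron_per_core = 2.62e-05  # s/atom/step per Opteron 2435 core (scaled run)
xeon_per_core = 1.72e-05     # s/atom/step per Xeon X5550 core (scaled run)

# Fraction less time each Xeon core needs per unit of work
xeon_efficiency_gain = 1 - xeon_per_core / opteron_per_core

# Opteron counters with 12 cores per node versus Xeon's 8
core_count_advantage = 12 / 8 - 1

print(f"Xeon per-core advantage: {xeon_efficiency_gain:.0%}")
print(f"Opteron core-count advantage: {core_count_advantage:.0%}")
```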
AMD has managed to beat or match Intel with its latest chip offering, even with its shortcomings. Bringing more cores to the fight compensates for a less efficient platform. Considering a total system cost that's 8% lower and performance up to 12% higher on GROMACS, AMD Opteron 2435 is a clear choice. LAMMPS users have a muddier choice: it's unclear that this parity will hold once multi-node MPI scaling is taken into account. With more cores contending for the MPI interconnect, what is now essentially a tie may tip in Intel's favor when scaling to more nodes. I hope to follow up on this question in a later post.
There is one other caveat to this conclusion, however. If your purchase is up against a power limit, Intel Xeon X5500-series chips can attain much higher clock rates in nearly the same power envelope as AMD Opteron 2400-series chips currently can. While you will pay a hefty price for those upper clock rates, avoiding the need to install additional air conditioning and power could make Xeon the better choice.
We'd like to hear your comments; please leave them below, and be sure to let us know if you have a benchmark suite you'd like us to tackle. We're actively looking for new codes to analyze. If you have any questions about these results, feel free to contact me privately as well, or call our number at the top of the page and let us help you decide which platform is right for you. Our test cluster is available to our customers (and potential customers) for benchmarking.