Advanced Clustering personnel spent part of last week at the Intel Developer Forum (IDF) conference to learn the ins and outs of the latest Intel innovations. One highlight was the discussion of the new Intel Xeon Phi accelerator cards. The product has gone by a handful of other names recently: MIC (Many Integrated Cores), Knights Corner, and Knights Ferry.
The Xeon Phi is an add-in accelerator card designed to compete with the Nvidia Tesla line of cards. While the aim to accelerate parallel code is the same, the approach between the Phi and Tesla is very different. The Tesla design has its origins in GPUs designed for 3D graphics and games and has a completely different instruction set and programming model than existing processors. The Xeon Phi is based on the x86 instruction set but optimized for parallel code.
While not all of the specifications of the Xeon Phi have been released, these are the details we have thus far:
- multi-core (at least 50 cores)
- 64-bit cores
- 4 threads per core
- 512-bit-wide vector registers
- Vector Processing Unit (VPU) supporting integer, single-precision, and double-precision floating-point operations
- Fully coherent L2 cache
- Specialized ring interconnect between all cores
More information about the software and how programs are written for the Phi is available as well. Intel is presenting multiple models for using the Phi, including:
- Offload - Certain highly parallel code sections are offloaded to the Phi, while general-purpose computing and MPI calls still run on the host CPU (very similar to the Tesla approach).
- Symmetric - Both the host and Phi are used for all parts of the application. Using something like MPI, the Phi can be treated as just another 'node' in a cluster.
- Hosted - For highly parallel codes, the Xeon Phi runs code independently of the host; all computation is on the accelerator without affecting the host.
Offload programming can be achieved in a number of ways. The simplest to implement is with the use of the Intel MKL math library. Functions in the MKL library that can take advantage of the Phi will be automatically offloaded if a Phi is present in the system. This requires no code changes, just compilation with the latest version of MKL.
You can control the offloading in your own code by issuing a special compiler pragma before the loop you want to offload:
#pragma offload target (mic)
Here is an example code snippet where the following OpenMP loop would be offloaded to the Phi:
#pragma omp parallel for reduction(+:pi)
for (i = 0; i < count; i++) {
    float t = (float)((i+0.5)/count);
    pi += 4.0/(1.0+t*t);
}
pi /= count; /* executes on host */
When building code with the Intel compiler, a single additional flag is all that’s necessary:
icc -offload-build -o hello hello.c
In the above example, if an accelerator is present in the system, the for loop would run on the Phi; if not, the loop would run on the host CPU without the need for re-compilation.
Xeon Phi uOS
The symmetric and hosted models are quite different from other accelerator cards on the market. They are possible because the Xeon Phi actually runs an entire OS on the card. This uOS is based on Linux kernel version 2.6.34 and includes busybox, full Ethernet bridging, an NFS client, MICdirect for InfiniBand HCA access, and complete ssh access.
Building and running code on the Xeon Phi
Code that runs on the Phi can be built using the Intel compiler suite by simply adding the additional flag '-mmic':
icc -mmic -o hello hello.c
The binary can then be copied directly to the Xeon Phi to run (replace node-phi with the IP address or hostname assigned to the card's bridged interface):
scp hello node-phi:/tmp/hello
Then ssh to the Phi and run the code:
ssh node-phi /tmp/hello
Code that runs on the Phi can take advantage of many standard libraries including OpenMP, Intel Math Kernel Library, GNU libc, GNU libstdc++, and even pthreads.
Running MPI codes
Since the Intel Xeon Phi runs its own OS and is fully network accessible, existing MPI code can run on the Phi alongside existing compute nodes. Let's assume you have 2 compute nodes (node01, node02), each with a Xeon Phi installed (node01-phi, node02-phi).
First we build the MPI code (example with Intel MPI):
For the host Xeon processor:
mpiicc -o mpi-hello.x86_64 mpi-hello.c
For the Xeon Phi processor (add the additional '-mmic' argument):
mpiicc -mmic -o mpi-hello.mic mpi-hello.c
Run it as follows:
mpirun -np 1 -host node01 mpi-hello.x86_64 : \
    -np 1 -host node01-phi mpi-hello.mic : \
    -np 1 -host node02 mpi-hello.x86_64 : \
    -np 1 -host node02-phi mpi-hello.mic
As you can see from the above examples, the multiple programming models make the Phi very attractive and should make it easy to get up and running quickly. Depending on your code, a simple recompile could be all that is necessary to take advantage of the Xeon Phi.
We expect to be able to provide more information on pricing and availability in the coming months.
Questions or comments? email