In the period prior to the launch of the Intel® Xeon Phi™ coprocessor, Intel collected questions from developers involved in pilot testing. This document contains some of the most common questions asked. Additional information and Best Known Methods (BKMs) for the Intel Xeon Phi coprocessor can be found here.
The Intel® Compiler reference guides can be found at:
C/C++:
http://software.intel.com/sites/products/documentation/doclib/stdxe/2013/composerxe/compiler/cpp-lin/index.htm
All compiler reference pages cited in this document can be reached from the C/C++ compiler reference page by traversing the table of contents to the specified page.
______________________________________________________________________________________________________
Q) What do I need to run offload code on the Intel Xeon Phi coprocessor?
With Intel® Manycore Platform Software Stack (Intel® MPSS) 2.0, offload code is built as a single "fat" binary that contains everything needed for execution on both the host and the coprocessor.
You can check the shared library dependencies for this binary using the following command:
~# /usr/linux-k1om-4.7/bin/x86_64-k1om-linux-readelf -d ./a.out | grep NEEDED
0x0000000000000001 (NEEDED) Shared library: [libm.so.6]
0x0000000000000001 (NEEDED) Shared library: [libiomp5.so]
0x0000000000000001 (NEEDED) Shared library: [liboffload.so.5]
0x0000000000000001 (NEEDED) Shared library: [libstdc++.so.6]
0x0000000000000001 (NEEDED) Shared library: [libgcc_s.so.1]
0x0000000000000001 (NEEDED) Shared library: [libpthread.so.0]
0x0000000000000001 (NEEDED) Shared library: [libc.so.6]
0x0000000000000001 (NEEDED) Shared library: [libdl.so.2]
Note that the offload compiler automatically loads all these shared library dependencies to the coprocessor. If the compiler is unable to find any of the required libraries, an appropriate warning or suggestion is displayed.
______________________________________________________________________________________________________
Q) Why does the into modifier not work correctly with offload, or result in an error?
The into modifier enables you to transfer data from a variable on the host to another variable located on the coprocessor, and vice versa. When you use into with the in clause, data is copied from the CPU object to the coprocessor object. The alloc_if, free_if, and alloc modifiers apply to the into expression.
Similarly, when you use into with the out clause, data is copied from the coprocessor object to the CPU object. The alloc_if, free_if, and alloc modifiers apply to the out expression. However, there are certain conditions you must fulfill for into to work correctly with an offload (a short sketch follows this list):
- The into modifier is not allowed with inout and nocopy clauses.
- An overlap between the source and destination memory ranges leads to undefined behavior.
- Shape change is not allowed, e.g. transferring from a 1D array to a 2D array.
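If the conditions above are met, a minimal sketch of an in clause with into looks like this (the array names and sizes here are hypothetical):

void into_example() {
    float src[500];   // host-side source
    float dst[1000];  // destination on the coprocessor

    // Copy 500 elements of src into dst, starting at dst[500]
    #pragma offload target(mic:0) in(src[0:500] : into(dst[500:500]))
    {
        // dst[500..999] now holds the transferred data
    }
}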
More information can be found in the compiler reference at:
Key Features > Intel® Many Integrated Core Architecture (Intel® MIC Architecture) > Programming for Intel® MIC Architecture > Offload using a pragma > Moving Data from One Variable to Another.
______________________________________________________________________________________________________
Q) Why does my nocopy modifier not work correctly? Why does it generate a compiler or runtime error?
The operation of the nocopy clause depends on a number of factors. The following conditions must be met to ensure its correct operation (a combined sketch follows the list):
- The coprocessor number must be set when using nocopy. By default, offloads to the coprocessors happen in a round-robin fashion, so it is essential to let the compiler know which coprocessor to use for the offloads. For example,
#pragma offload target(mic:0) nocopy(a:length(10) alloc_if(0) free_if(0))
- All dynamically allocated variables that need to be moved to the coprocessor should be global and declared with the __attribute__((target(mic))) attribute.
- Ensure that the memory used with nocopy has already been allocated and persists, by using the alloc_if and free_if modifiers.
- An alternative to nocopy is using an in/out clause with length set to 0:
#pragma offload target(mic:0) in(a:length(0) alloc_if(0) free_if(0))
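Putting these rules together, a typical allocate/reuse/free sequence might look like the following sketch (the variable name and sizes are hypothetical):

#include <stdlib.h>

__attribute__((target(mic))) float *a;  // global, usable on the coprocessor

void lifecycle_sketch() {
    a = (float *)malloc(10 * sizeof(float));
    // First offload: allocate coprocessor memory, copy data in, keep it allocated
    #pragma offload target(mic:0) in(a:length(10) alloc_if(1) free_if(0))
    { /* work on a */ }
    // Intermediate offloads: reuse the persistent coprocessor data, no transfer
    #pragma offload target(mic:0) nocopy(a:length(10) alloc_if(0) free_if(0))
    { /* work on a */ }
    // Final offload: copy results back and free the coprocessor memory
    #pragma offload target(mic:0) out(a:length(10) alloc_if(0) free_if(1))
    { /* done */ }
    free(a);
}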
For more information on nocopy, please refer to the following web page:
http://software.intel.com/en-us/articles/effective-use-of-the-intel-compilers-offload-features
______________________________________________________________________________________________________
Q) How does statically allocated, stack-allocated, and heap-allocated data persist between offloads?
In brief: statically allocated variables declared with the target attribute exist on the coprocessor for the life of the process, stack variables do not persist between offloads, and heap-allocated data persists only when managed explicitly with the alloc_if and free_if modifiers. Refer to the following link for more details on persistence of data across offloads:
http://software.intel.com/en-us/articles/effective-use-of-the-intel-compilers-offload-features
______________________________________________________________________________________________________
Q) What is the default directory on the coprocessor to which files are written?
For offload code, or native code executed using the micnativeloadex utility:
If a directory for the file I/O is not specified, then the file is written to /tmp/coi_procs/<card #>/<PID>.
For example, if you are offloading to card #1 and the offload is handled by process PID 2929, the default directory on the Intel Xeon Phi coprocessor is /tmp/coi_procs/1/2929.
Note that this numbering of the Intel Xeon Phi coprocessors on the system starts at 1: the first coprocessor, located at 192.168.1.100, is coprocessor #1. This differs from the offload pragma's target specification, which counts from 0 (target(mic:0)).
For native code executed after being copied to the coprocessor using scp:
If the directory is not specified, the file is created in the user's home directory, which is generally /home/userid for a non-root user. If the user is logged in as root (via sudo or otherwise), the file is created in /root, the root home directory.
______________________________________________________________________________________________________
Q) What happens if both host and coprocessor write to / read from the same file?
The same behavior as results from any such NFS conflict: NFS does not coordinate concurrent access, so without application-level synchronization the outcome is undefined.
_____________________________________________________________________________________________________
Q) When I see an array of length(0) in an offload pragma, what does it mean?
Pointers used within offload regions are by default inout, that is, data associated with them is transferred in and out. Sometimes data may be used strictly locally; it is assigned and used on the coprocessor only. The nocopy clause is useful in this case to leave the data unmodified by the offload clauses, and allow the programmer to explicitly manage its contents. In other cases, data is transferred into the location from the CPU, and a subsequent offload may want to either
a) use the same memory allocated and transfer fresh data into it, or
b) keep the same memory and reuse the same data.
For case a), an in clause with length equal to the number of elements is useful. For case b), an in clause with length of 0 can be used to "refresh" the pointer but avoid any data transfer.
The following summarizes how to use in/out/nocopy with the length clause (assuming the coprocessor memory was allocated by an earlier offload with alloc_if(1) free_if(0)):
- in(p:length(n) alloc_if(0) free_if(0)): reuses the coprocessor memory and transfers n fresh elements into it
- in(p:length(0) alloc_if(0) free_if(0)): reuses the coprocessor memory and "refreshes" the pointer without any data transfer
- out(p:length(n) alloc_if(0) free_if(0)): transfers n elements from the coprocessor memory back to the CPU
- nocopy(p): leaves the coprocessor data untouched, for the programmer to manage explicitly
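As a minimal sketch of the two cases above (the names p and n are hypothetical, and p is assumed to have been allocated on the coprocessor by an earlier offload with alloc_if(1) free_if(0)):

__attribute__((target(mic))) float *p; // global pointer, allocated by an earlier offload
int n = 1000;                          // hypothetical element count

// Case a): reuse the coprocessor allocation and transfer fresh data into it
#pragma offload target(mic:0) in(p:length(n) alloc_if(0) free_if(0))
{ /* work on p */ }

// Case b): reuse the coprocessor allocation and its existing contents;
// length(0) refreshes the pointer without transferring any data
#pragma offload target(mic:0) in(p:length(0) alloc_if(0) free_if(0))
{ /* work on p */ }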
_____________________________________________________________________________________________________
Q) What is processor affinity and how do we set it?
Here is a small excerpt from a white paper that will help answer these questions. More information about setting thread affinity through the OpenMP runtime can be found in section 7 below.
1 Why Worry ‘Bout a Thing?
On a single-die Intel® Xeon® processor based system, pinning threads to cores is often only a minor optimization, since the shared L3 cache provides fast inter-thread communication between all the threads. However, on the Intel Xeon Phi coprocessor there is no shared L3 cache, and it is therefore more important to ensure that threads stay near the caches that contain the data they have touched and are not moved around by the OS. The way to achieve that is to force thread affinity.
2 Linux* Affinity Calls
Inside the kernel Linux maintains a cpu_set_t for each thread. This is a set of integers (implemented as a bitset) that contains the logical CPUs on which the thread can be run. When a thread is created (as a result of a fork() or pthread_create() call) it inherits its affinity from its parent (the thread that made the creation call). The logical CPU numbers used here align with those used by the kernel elsewhere, for instance in /proc/cpuinfo.
Threads can change their affinity by using the sched_setaffinity() call, and discover their existing affinity using sched_getaffinity() (documented in the same place). sched_setaffinity() lets you force the affinity to any value you choose; by using it you can escape the affinity you started with, allowing your thread to run on parts of the machine that its parent could not run on. Doing that is rather bad manners unless you really know everything that is running on the machine. (Hint: you probably don't know that even if you think you do!)
3 Mapping Hardware to Logical CPUs
Since the affinity calls all deal with logical CPUs, if we're to get the correct affinity for our threads we need to understand how the kernel's logical CPU enumeration maps onto the physical cores and hardware threads in the Intel Xeon Phi coprocessor. That mapping looks like this:
4 Granularity of Affinity
Remember that the affinity is a set of logical CPUs on which a thread can run. We can therefore restrict a thread either to any of the logical CPUs that map to the same physical core (core affinity), or, more finely, to a specific hardware thread on that core (thread affinity). Mapping to a core allows the kernel more freedom to move the thread, which is potentially useful if one of the hardware threads is taking interrupts. On the other hand, the OS may abuse that freedom, and moving a thread between logical CPUs is not a free operation even when they share all levels of cache.
We have generally observed that binding to thread granularity provides more consistent results, though binding to core level can sometimes give better average performance over many runs. So, “your mileage may vary”, and this may be worth experimenting with.
To set a core affinity you should use CPU_SET() to create a cpu_set_t that contains each of the four logical CPUs that map to the same physical core, and then use sched_setaffinity() to force the appropriate affinity. (Or, if you are creating pthreads yourself, you could use the same cpu_set_t at pthread_create() time.)
To set a thread-level affinity you should create a cpu_set_t with a single logical CPU enabled in it, as in the sketch below.
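A minimal sketch of both granularities, assuming (hypothetically) that logical CPUs 1 through 4 map to one physical core; the actual numbers must be taken from the enumeration discussed in section 3:

#define _GNU_SOURCE
#include <sched.h>

// Core affinity: allow the calling thread to run on any of the four
// logical CPUs that share one physical core.
int pin_to_core(void) {
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(1, &set);
    CPU_SET(2, &set);
    CPU_SET(3, &set);
    CPU_SET(4, &set);
    return sched_setaffinity(0, sizeof(set), &set); // pid 0 = calling thread
}

// Thread affinity: restrict the calling thread to a single hardware thread.
int pin_to_hw_thread(int logical_cpu) {
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(logical_cpu, &set);
    return sched_setaffinity(0, sizeof(set), &set);
}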
5 Pre-existing Affinities
In most circumstances the affinity that is inherited will allow the thread to run on any logical CPU in the machine. However, there are a number of exceptions:
- When a process is executed by the offload mechanism, the affinity is set so that the last physical core in the machine will not be used. Since the Intel parallel runtimes use the number of available logical CPUs in the incoming affinity to determine the correct number of threads to run, this is why an offloaded OpenMP code will, by default, use four fewer threads than the number of available hardware threads in the machine.
- In native mode, the user can use the taskset command to set the initial affinity to a subset of the machine (taskset will be supported in an upcoming release of Intel MPSS).
- In MPI, the MPI system can be used to set the affinity of MPI processes so that each can run on only a subset of the machine.
In each of these cases the affinity mask has been changed to reflect a sensible use of the machine that the process itself cannot easily determine. This is why, when setting affinity by hand, it is polite only to reduce the set of logical CPUs on which a thread can run, not simply to force it to an arbitrary value.
6 Sensible Affinities
Under Intel MPSS many of the kernel services and daemons are affinitized to the “Bootstrap Processor” (BSP), which is the last physical core. This is also where the offload daemon runs the services required to support data transfer for offload. It is therefore generally sensible to avoid using this core for user code. (Indeed, as already discussed, the offload system does that automatically by removing the logical CPUs on the last core from the default affinity of offloaded processes).
7 OpenMP
So far we’ve been talking about affinities at the level of cpu_set_t and system calls. If you’re using OpenMP you can ask the OpenMP runtime to set affinities for you using the KMP_AFFINITY environment variable. If you use the “explicit” form of affinity, you can give the precise set of logical CPUs to which to bind each thread (so you still need to understand the hardware to logical CPU mapping above).
You can find more information at the following compiler reference:
Key Features > OpenMP* Support > OpenMP* Library Support > Thread Affinity Interface (Linux* and Windows*)
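For illustration, a few possible KMP_AFFINITY settings (the proclist values are hypothetical and must match the hardware-to-logical-CPU mapping discussed in section 3):

export KMP_AFFINITY="granularity=fine,compact"    # bind each OpenMP thread to a single hardware thread
export KMP_AFFINITY="granularity=core,balanced"   # bind at core granularity, spreading threads across cores
export KMP_AFFINITY="explicit,proclist=[1,2,3,4],verbose"  # bind explicitly to the listed logical CPUs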
______________________________________________________________________________________________________
Q) What is the consistency of floating-point results using the Intel® Compiler?
OR
Q) Why doesn’t my application always give the same answer?
To learn more about the consistency of floating-point results, the related compiler options, and the differences in floating-point arithmetic between Intel Xeon processors and the Intel Xeon Phi coprocessor, please refer to the following white paper:
http://software.intel.com/sites/default/files/article/326703/floating-point-differences-sept11.pdf
______________________________________________________________________________________________________
Q) Should I use explicit prefetching?
Generally, explicit software prefetching helps most when the data access pattern is convoluted and it is impossible for the compiler to optimize effectively. So if most of your loads are gather instructions, explicit prefetching is a good option to consider.
On the other hand, the compiler heuristics can do a decent job of prefetching the data. The compiler uses all available information to compute the prefetch distance for each loop. The information that is most useful to the compiler, and that it often does not have, is an estimate of the "average" trip count of a loop. In such cases, the best way to provide this information is the loop_count pragma placed just before the loop. Please refer to the documentation of this pragma in the compiler reference guide at:
Compiler Reference > Pragmas > Intel-Specific Pragma Reference > loop_count
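A minimal sketch of the pragma (the trip counts and array names here are hypothetical):

// Tell the compiler the expected trip counts so it can choose a prefetch distance
#pragma loop_count min(16), max(4096), avg(512)
for (int i = 0; i < n; i++) {
    c[i] = a[i] + b[i];
}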
______________________________________________________________________________________________________
Q) How do I perform an asynchronous / non-blocking transfer to Intel Xeon Phi coprocessor?
Asynchronous data transfers, also known as non-blocking transfers, can be made to the coprocessor using the offload_transfer pragma. More information about this pragma can be found at the following compiler reference:
Compiler Reference > Intel-specific Pragma Reference > offload_transfer
Some important things to keep in mind while using offload_transfer (a short sketch follows this list):
- Always explicitly state which coprocessor each offload or offload_transfer will use. For example, #pragma offload target(mic:0) and #pragma offload_transfer target(mic:1) will use coprocessor 0 for the offload and coprocessor 1 for the offload_transfer.
- Always remember to use the alloc_if and free_if modifiers to control memory persistence. In most cases it is more convenient to have a single offload pragma that allocates memory at the start of the program and another offload pragma at the end that frees all the allocated memory. In that scenario, remember that the intermediate offloads and offload_transfers should neither allocate nor free any memory.
- Double buffering can provide improvement when using asynchronous data transfers.
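A minimal sketch of an asynchronous transfer using the signal and wait clauses (the names and sizes are hypothetical, and in_data is assumed to have been allocated on the coprocessor by an earlier offload):

__attribute__((target(mic))) float *in_data; // global, persistent on the coprocessor
int n = 100000;                              // hypothetical element count

// Start a non-blocking transfer; the host thread continues immediately
#pragma offload_transfer target(mic:0) in(in_data:length(n) alloc_if(0) free_if(0)) signal(in_data)

do_other_host_work(); // hypothetical host-side work overlapped with the transfer

// This offload does not start until the signaled transfer completes
#pragma offload target(mic:0) nocopy(in_data:length(n) alloc_if(0) free_if(0)) wait(in_data)
{
    process_on_mic(in_data, n); // hypothetical coprocessor-side function
}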
For more examples as well as Best Known Methods (BKMs) on asynchronous transfers, please refer to the following link:
http://software.intel.com/sites/default/files/article/326700/6.2.1-asynchronous-offload.pdf
______________________________________________________________________________________________________
Q) Is peer-to-peer communication between coprocessors possible in the offload mode without MPI?
We do not support communication between cards in offload mode.
______________________________________________________________________________________________________
Q) If I compile my code for Intel MIC natively, how do I reverse offload some of the computations back to the CPU?
The compiler does not support reverse offload.
______________________________________________________________________________________________________
Q) What environment variables can be used to control and monitor the behavior of code offloaded to the Intel Xeon Phi coprocessor?
Several environment variables exist to control and monitor offload codes on the coprocessor. A small list of useful environment variables can be found at the following compiler reference:
Compilation > Setting Environment Variables
Some other useful environment variables can be found at:
http://software.intel.com/en-us/articles/effective-use-of-the-intel-compilers-offload-features
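For illustration, a few commonly used variables (the values shown are examples only):

export OFFLOAD_REPORT=2         # print timing and data-transfer details for each offload
export MIC_ENV_PREFIX=MIC       # forward variables with the MIC_ prefix to the coprocessor
export MIC_OMP_NUM_THREADS=120  # with the prefix above, sets OMP_NUM_THREADS on the coprocessor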
______________________________________________________________________________________________________
Q) How do I mount NFS volumes onto the coprocessor?
The readme-en.txt for the Intel MPSS release provides most of the details needed to automount NFS filesystems on the card at boot time.
1. The readme mentions an /etc/fstab entry that is added to the card's /etc/fstab to automount the filesystem at boot time. Add this entry to the card's file image on the host, under /opt/intel/mic/filesystem/mic0/etc/fstab.
For the example cited in the readme, add the line below into /opt/intel/mic/filesystem/mic0/etc/fstab:
172.31.1.254:/mic0fs /mic0fs nfs rsize=8192,wsize=8192,nolock,intr 0 0
2. To create the necessary mount point, add an entry into the mic0.filelist file on the host: /opt/intel/mic/filesystem/mic0.filelist
For the example cited in the readme, add the line shown below into mic0.filelist:
dir /mic0fs 755 0 0
If you are running the Gold MPSS release (2.1.4346-16), reboot the card (micctrl -R) and the filesystem should mount at boot time.
For a multi-card configuration, here is one method to set up the same NFS-mounted filesystem on all cards.
1. Under /opt/intel/mic/filesystem/common, create a sub-directory "etc" and place there a copy of the fstab file from one of the "mic#" card filesystem images. Edit the fstab and add the entry for the NFS filesystem. Next, add the mount point and fstab file entries into common.filelist.
For example, create /opt/intel/mic/filesystem/common/etc/fstab with:
devpts /dev/pts devpts defaults 0 0
tmpfs /dev/shm tmpfs defaults 0 0
sysfs /sys sysfs defaults 0 0
proc /proc proc defaults 0 0
host:/micfs /micfs nfs rsize=8192,wsize=8192,nolock,intr 0 0
Create /opt/intel/mic/filesystem/common.filelist containing:
dir /micfs 644 0 0
file /etc/fstab etc/fstab 664 0 0
2. Under /opt/intel/mic/filesystem, remove the fstab file entry from each card's mic#.filelist file. (Optionally, under each card's filesystem image /opt/intel/mic/filesystem/mic# (e.g., /opt/intel/mic/filesystem/mic0), rename the file "etc/fstab".)
3. On the host, add the appropriate entry to /etc/exports.
For example,
/micfs 172.31.0.0/255.255.0.0(rw,no_root_squash)
______________________________________________________________________________________________________
Q) Is OpenCL supported on Intel Xeon Phi coprocessor?
For more information please take a look at:
http://software.intel.com/en-us/blogs/2012/11/12/introducing-opencl-12-for-intel-xeon-phi-coprocessor
______________________________________________________________________________________________________
Q) Can I explicitly allocate memory in an offload?
Yes, you can explicitly allocate memory within an offload. For details regarding persistence of heap allocated memory, please refer to the following:
http://software.intel.com/en-us/articles/effective-use-of-the-intel-compilers-offload-features
______________________________________________________________________________________________________
Q) How do I conditionally compile code only for the Intel Xeon Phi coprocessor?
You can compile Intel MIC architecture-only code by protecting the code using #ifdef __MIC__, e.g.:
#ifdef __MIC__
//Code for Intel MIC architecture goes in here
#endif
Please remember that #includes of certain header files related to the Intel MIC architecture must be protected in this manner.
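For instance (a minimal sketch; immintrin.h is shown only as one example of an architecture-specific header):

#ifdef __MIC__
#include <immintrin.h> // Intel MIC intrinsics are only available when compiling for the coprocessor
#endif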
______________________________________________________________________________________________________
Q) How do I compile code only for the Intel Xeon Phi coprocessor?
Compiling code only for the Intel Xeon Phi coprocessor, also known as native compilation, is done by using the -mmic compiler switch.
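For example (the source and output file names are hypothetical):

~# icc -mmic myapp.c -o myapp.mic

The resulting binary runs only on the coprocessor; copy it over (e.g., with scp) or launch it with the micnativeloadex utility.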
______________________________________________________________________________________________________
Q) Does Intel Xeon Phi coprocessor support third-party tools and libraries?
For the most up-to-date information on third-party tool and library support for the Intel Xeon Phi coprocessor, please check the following page:
http://software.intel.com/en-us/articles/intel-and-third-party-tools-and-libraries-available-with-support-for-intelr-xeon-phitm
______________________________________________________________________________________________________
Q) How do I instantiate and manipulate shared versions of C++ STL vectors using _Cilk_shared and _Cilk_offload?
Shared versions of C++ STL vectors can be instantiated through the use of the shared allocators defined in offload.h. Here is an example using shared allocators:
#include <vector>
#include <offload.h>
#include <stdio.h>
using namespace std;

typedef vector<int, __offload::shared_allocator<int> > shared_vec_int;
_Cilk_shared shared_vec_int * _Cilk_shared v;

_Cilk_shared int test_result() {
    int result = 1;
    for (int i = 0; i < 5; i++) {
        if ((*v)[i] != i) {
            result = 0;
        }
    }
    return result;
}

int main() {
    int result;
    // Placement-new the shared vector into shared memory
    v = new (_Offload_shared_malloc(sizeof(vector<int>))) _Cilk_shared vector<int, __offload::shared_allocator<int> >(5);
    for (int i = 0; i < 5; i++) {
        (*v)[i] = i;
    }
    // Run the check on the coprocessor
    result = _Cilk_offload test_result();
    if (result != 1)
        printf("Failed\n");
    else
        printf("Passed\n");
    return 0;
}
______________________________________________________________________________________________________