Before the launch of the Intel® Xeon Phi™ coprocessor, Intel collected questions from developers who were involved in pilot testing. This document contains some of the most commonly asked questions. Additional information and best-known methods for the Intel Xeon Phi coprocessor can be found here.
The Intel® Compiler reference guides can be found at:
______________________________________________________________________________________________________
Q) How do I profile native applications?
To profile a native application, please follow the steps provided here.
______________________________________________________________________________________________________
Q) Where can I find the description of the hardware performance counters for Intel Xeon Phi coprocessor?
The Intel® VTune™ analyzer provides a short description of the hardware performance counters when you add events to a custom analysis. The description of some common performance counters, as well as metrics, can be found here.
______________________________________________________________________________________________________
Q) Where can I find more about the key features, peak performance and available SKUs of Intel Xeon Phi coprocessors?
Important information about Intel® Xeon Phi™ coprocessor can be found at:
http://download.intel.com/newsroom/kits/xeon/phi/pdfs/Intel-Xeon-Phi_Factsheet.pdf
______________________________________________________________________________________________________
Q) Can hand-written code optimized for SSE or Intel® Advanced Vector Extensions (Intel AVX) work for Intel Xeon Phi coprocessor?
No. The Intel Xeon Phi coprocessor supports neither SSE nor Intel AVX. Furthermore, the techniques used to produce optimal SSE or Intel AVX code need to change when adapting an implementation for the Intel Xeon Phi coprocessor. Code using SSE or Intel AVX assumes a vector length of 128 bits or 256 bits, respectively, while the Intel Xeon Phi coprocessor has a vector width of 512 bits. Thus the algorithm will need to be rewritten to use the wider vectors effectively, whether it is written by hand in intrinsics or in a higher-level language that was structured to enable the compiler to produce the best SSE or Intel AVX code.
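To illustrate the width difference, here is a minimal sketch (function names are illustrative; the two functions target different ISAs and would not be built into the same binary; each assumes n is a multiple of the vector width and the pointers are suitably aligned). The SSE version processes 4 floats per instruction, the Intel Xeon Phi coprocessor version 16:

    #include <immintrin.h>

    /* SSE: 128-bit vectors, 4 floats per instruction.
       Assumes 16-byte-aligned pointers and n a multiple of 4. */
    void add_sse(const float *a, const float *b, float *c, int n)
    {
        for (int i = 0; i < n; i += 4) {
            __m128 va = _mm_load_ps(&a[i]);
            __m128 vb = _mm_load_ps(&b[i]);
            _mm_store_ps(&c[i], _mm_add_ps(va, vb));
        }
    }

    /* Intel Xeon Phi coprocessor: 512-bit vectors, 16 floats per instruction.
       Assumes 64-byte-aligned pointers and n a multiple of 16. */
    void add_phi(const float *a, const float *b, float *c, int n)
    {
        for (int i = 0; i < n; i += 16) {
            __m512 va = _mm512_load_ps(&a[i]);
            __m512 vb = _mm512_load_ps(&b[i]);
            _mm512_store_ps(&c[i], _mm512_add_ps(va, vb));
        }
    }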
______________________________________________________________________________________________________
Q) How can I reduce the memory allocation overhead in an offload?
Tips on minimizing coprocessor memory allocation overhead can be found on the following page under the section “Minimize Coprocessor Memory Allocation Overhead”.
http://software.intel.com/en-us/articles/effective-use-of-the-intel-compilers-offload-features
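As a minimal sketch of that technique (the pointer p and length n are illustrative), allocate the coprocessor copy on the first offload, reuse it on later offloads, and free it only on the last:

    /* First offload: allocate p's buffer on the coprocessor and
       keep it after the offload ends */
    #pragma offload target(mic) in(p : length(n) alloc_if(1) free_if(0))
    {
        /* ... use p ... */
    }

    /* Intermediate offloads: reuse the existing allocation */
    #pragma offload target(mic) in(p : length(n) alloc_if(0) free_if(0))
    {
        /* ... use p ... */
    }

    /* Final offload: reuse one last time, then free the coprocessor copy */
    #pragma offload target(mic) in(p : length(n) alloc_if(0) free_if(1))
    {
        /* ... use p ... */
    }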
______________________________________________________________________________________________________
Q) Is there a way that I can automatically time each individual offload?
Yes, you can automatically time each individual offload by setting the environment variable OFFLOAD_REPORT. You can find out more about the offload report on the following webpage under the section “Environment Variables for Controlling Offload”:
http://software.intel.com/en-us/articles/effective-use-of-the-intel-compilers-offload-features
Alternatively, the compiler reference for OFFLOAD_REPORT can be found at:
Compiler Reference: Setting Environment Variables
Compiler Reference: __Offload_report
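No source changes are required; as a quick illustration (the binary name is hypothetical; level 1 reports offload times, and higher levels add more detail, per the compiler reference above), set the variable in the environment of the host process before launching it:

    OFFLOAD_REPORT=2 ./myapp    # prints timing (and data-transfer) information per offload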
______________________________________________________________________________________________________
Q) How can I improve data transfer rate from the host to the coprocessor in an offload?
The data transfer rate from the host to the coprocessor can be improved by the following:
- On the host, align data to 4KB boundaries for optimal DMA performance over the PCIe bus. To align data, use _mm_malloc() instead of malloc() when allocating it (see the sketch after this list).
- Depending on the use of the alloc_if and free_if modifiers, an offload timing measurement can include the overhead of memory allocation and deallocation on the coprocessor. Using persistent memory eliminates those overheads and also gives more consistent timings. You can find more about persistent memory on the following webpage:
http://software.intel.com/en-us/articles/effective-use-of-the-intel-compilers-offload-features
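For example (a sketch; the variable name and size are illustrative), an aligned host buffer looks like:

    #include <immintrin.h>

    /* Allocate the host buffer on a 4 KB boundary for efficient DMA */
    float *data = (float *)_mm_malloc(n * sizeof(float), 4096);

    /* ... offloads that transfer data ... */

    _mm_free(data);  /* buffers from _mm_malloc() must be freed with _mm_free() */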
______________________________________________________________________________________________________
Q) Where can I find more about the Intel Xeon Phi coprocessor Instruction Set Architecture (ISA)?
The Intel Xeon Phi coprocessor ISA reference can be found at:
http://software.intel.com/sites/default/files/forum/278102/327364001en.pdf
______________________________________________________________________________________________________
Q) Where can I find more documentation about the Performance Monitoring Units (PMUs) in the Intel Xeon Phi coprocessor?
You can learn more about the PMUs in the Intel Xeon Phi coprocessor in the Software Developer’s Guide, which can be found at:
http://software.intel.com/sites/default/files/forum/278102/327364001en.pdf
______________________________________________________________________________________________________
Q) How do I implement memory fences on the Intel Xeon Phi coprocessor?
Since the Intel Xeon Phi coprocessor is an in-order machine, it does not normally require any instructions to enforce the ordering of memory operations; they naturally become globally visible in program order. Therefore it is normally sufficient to implement a memory barrier in compiled code as a simple compiler barrier ('__asm__ __volatile__("":::"memory")'), which ensures that the compiler does not reorder loads and stores over the barrier while generating no code.
The exceptions to this are the Non-Globally-Ordered (NGO) stores. If you explicitly code these using assembly instructions or intrinsics, then you do need to insert a memory fence.
The best-known memory fence implementations are:
- If a store instruction is present at the point where the barrier is required, replace it with an xchg; since xchg is a locked operation (even though it has no lock prefix), it is automatically a full memory fence.
- If there is no convenient store, use lock; addl $0,(%rsp). This is also a locked instruction (so a full memory fence) that has no other effect. Provided that the stack is still in the cache, it seems to complete on the Intel Xeon Phi coprocessor in four cycles, which is much faster than using cpuid (another option that has been suggested).
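Sketches of both fences in GCC-style inline assembly (the helper names are illustrative; the xchg variant folds the fence into a store):

    /* Full fence via a locked add of zero to the stack; no other effect */
    static inline void full_fence(void)
    {
        __asm__ __volatile__("lock; addl $0,(%%rsp)" ::: "memory");
    }

    /* Store that is itself a full fence: xchg with memory is implicitly locked */
    static inline void fenced_store(volatile int *p, int v)
    {
        __asm__ __volatile__("xchgl %0,%1"
                             : "+r"(v), "+m"(*p)
                             :
                             : "memory");
    }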
______________________________________________________________________________________________________
Q) How can I improve software prefetching to get better performance?
Software prefetching basics, guidelines, and best-known methods can be found at:
http://software.intel.com/sites/default/files/article/326703/5.3-prefetching-on-mic-4.pdf
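As a simple illustration of the general technique (a sketch; the prefetch distance of 16 iterations is an assumption to be tuned, and the guide above covers coprocessor-specific prefetch instructions), a streaming loop can fetch upcoming data ahead of its use:

    #include <xmmintrin.h>

    void scale(float *a, const float *b, int n)
    {
        for (int i = 0; i < n; i++) {
            /* Request data ~16 iterations ahead into the L1 cache;
               prefetches past the end of the array do not fault */
            _mm_prefetch((const char *)&b[i + 16], _MM_HINT_T0);
            a[i] = 2.0f * b[i];
        }
    }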
______________________________________________________________________________________________________
Q) Does Intel Xeon Phi coprocessor support the PAPI hardware counter library?
Intel does not currently support the PAPI hardware counter library. You can find some third-party work at:
http://www.eece.maine.edu/~vweaver/projects/mic/
______________________________________________________________________________________________________
Q) What do the performance hotspots in kmp_wait_sleep and kmp_static_yield imply?
kmp_wait_sleep is where a thread waits inside the OpenMP runtime when it has nothing to do. There are a number of scenarios in which a thread has nothing to do.
The most significant cases are:
- Threads could be at an explicit OpenMP barrier in the code, waiting for the others to reach it
- Some threads are waiting for the others at the implicit join barrier found at the end of every parallel region, or at the implicit barrier at the end of a worksharing construct (unless the nowait clause is used)
- The OpenMP thread pool could be waiting for a serial section of code to finish
The first two cases imply a load imbalance. The last case produces a hotspot when an algorithm alternates frequently between parallel and serial execution in a performance-critical area, which usually also amplifies any load imbalance in the parallel portions of the algorithm.
kmp_static_yield is effectively the same place; it is where the runtime delays, and it is called from kmp_wait_sleep.
So a large amount of time in these routines can mean that you have a load imbalance, and/or that you aren’t effectively exploiting all of the threads you have available.
Remember that on the Intel Xeon Phi coprocessor you have a large number of threads, so if you’re using static loop scheduling (the default), you may see significant imbalance even when there is no variance in the time per iteration, in cases that would have been fine on a machine with eight or 16 threads.
For instance, a loop with 256 iterations run on 240 hardware threads is processed in two batches: after the first 240 iterations complete, the remaining 16 are processed. Since the 240 threads provide a total of 480 iteration slots over those two batches, 224 of them are wasted, for a maximum efficiency of 256/480 ≈ 53%.
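As a sketch of one mitigation (the work() function and thread counts are illustrative), choosing a thread count that evenly divides the iteration count removes the idle slots:

    #include <omp.h>

    void work(int i);   /* hypothetical per-iteration payload */

    void run(void)
    {
        /* On 240 threads, 256 iterations take two waves (240 + 16),
           wasting 224 of the 480 thread-slots (~53% efficiency).
           On 128 threads, the same loop runs as exactly two full
           waves of 128, with no idle slots. */
        #pragma omp parallel for num_threads(128)
        for (int i = 0; i < 256; i++)
            work(i);
    }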
______________________________________________________________________________________________________