Where should I get the faster-rcnn.xml model to run the Intel Inference Engine object detection sample?
Intel Deep Learning Inference Engine
Need to Re-register Xeon Phi Under a Different Account
I was told by one of Intel's support staff that I needed to make a post on these forums to help me regarding the following issue.
"I need to register the serial number of my Xeon Phi under a different account. When I registered the serial number, I made a typo in the registered email address, and due to this error I am now having issues resetting the password or changing the email account, since the serial number of the Xeon Phi is tied to the wrong email address and I don't have access to that address. Please advise."
Let me know how I can contact the appropriate staff. Thank you.
Sobel Filter (OpenMP implementation for Knights Landing)
I am trying to implement a parallelized + vectorized version of the Sobel filter in C, with OpenMP pragmas for the parallelization and #pragma simd for vectorization. My input is a 1024 x 1024 .pgm image. I am compiling with the Intel compiler on a Xeon Phi Knights Landing processor using the following command:
icc -qopenmp -O3 -qopt-report3 xeon.c -o xeon
The problems I am facing with the code in general are:
a) When do I parallelize and when do I vectorize? I have a nested loop made up of four for loops -> should I parallelize or vectorize this piece of code?
b) My 'min' and 'max' values are wrong. They are both shared variables and hence prone to race conditions, so I have added a #pragma omp critical around them. However, the values printed for these two variables are still wrong and I have no idea why. I have even added a barrier before the print statement to make sure all threads pass through the critical section before the min and max values get printed.
c) The #pragma omp critical makes my program very, very slow. In fact the execution time is even longer than the sequential runtime. Is there any way to avoid it?
Code:
**mypgm.h**
/* pgm file IO headerfile ------ mypgm.h */

/* Constant declaration */
#define MAX_IMAGEWIDTH  1024
#define MAX_IMAGEHEIGHT 1024
#define MAX_BRIGHTNESS  255 /* Maximum gray level */
#define GRAYLEVEL       256 /* No. of gray levels */
#define MAX_FILENAME    256 /* Filename length limit */
#define MAX_BUFFERSIZE  256

/* Global constant declaration */
/* Image storage arrays */
float image1[MAX_IMAGEWIDTH][MAX_IMAGEHEIGHT] __attribute__((aligned(64))),
      image2[MAX_IMAGEWIDTH][MAX_IMAGEHEIGHT] __attribute__((aligned(64)));
int x_size1, y_size1, /* width & height of image1 */
    x_size2, y_size2; /* width & height of image2 */

/* Prototype declaration of functions */
void load_image_data( );      /* image input */
void save_image_data( );      /* image output */
void load_image_file(char *); /* image input */
void save_image_file(char *); /* image output */

/* Main body of functions */
void load_image_data( )
/* Input of header & body information of pgm file */
/* for image1[ ][ ], x_size1, y_size1 */
{
  char file_name[MAX_FILENAME];
  char buffer[MAX_BUFFERSIZE];
  FILE *fp;     /* File pointer */
  int max_gray; /* Maximum gray level */
  int x, y;     /* Loop variable */

  /* Input file open */
  printf("\n-----------------------------------------------------\n");
  printf("Monochromatic image file input routine \n");
  printf("-----------------------------------------------------\n\n");
  printf(" Only pgm binary file is acceptable\n\n");
  printf("Name of input image file? (*.pgm) : ");
  scanf("%s", file_name);
  fp = fopen(file_name, "rb");
  if (NULL == fp) {
    printf(" The file doesn't exist!\n\n");
    exit(1);
  }
  /* Check of file-type ---P5 */
  fgets(buffer, MAX_BUFFERSIZE, fp);
  if (buffer[0] != 'P' || buffer[1] != '5') {
    printf(" Mistaken file format, not P5!\n\n");
    exit(1);
  }
  /* input of x_size1, y_size1 */
  x_size1 = 0;
  y_size1 = 0;
  while (x_size1 == 0 || y_size1 == 0) {
    fgets(buffer, MAX_BUFFERSIZE, fp);
    if (buffer[0] != '#') {
      sscanf(buffer, "%d %d", &x_size1, &y_size1);
    }
  }
  /* input of max_gray */
  max_gray = 0;
  while (max_gray == 0) {
    fgets(buffer, MAX_BUFFERSIZE, fp);
    if (buffer[0] != '#') {
      sscanf(buffer, "%d", &max_gray);
    }
  }
  /* Display of parameters */
  printf("\n Image width = %d, Image height = %d\n", x_size1, y_size1);
  printf(" Maximum gray level = %d\n\n", max_gray);
  if (x_size1 > MAX_IMAGEWIDTH || y_size1 > MAX_IMAGEHEIGHT) {
    printf(" Image size exceeds %d x %d\n\n", MAX_IMAGEWIDTH, MAX_IMAGEHEIGHT);
    printf(" Please use smaller images!\n\n");
    exit(1);
  }
  if (max_gray != MAX_BRIGHTNESS) {
    printf(" Invalid value of maximum gray level!\n\n");
    exit(1);
  }
  /* Input of image data */
  #pragma simd
  for (y = 0; y < y_size1; y++) {
    #pragma simd
    for (x = 0; x < x_size1; x++) {
      image1[y][x] = (unsigned char)fgetc(fp);
    }
  }
  printf("-----Image data input OK-----\n\n");
  printf("-----------------------------------------------------\n\n");
  fclose(fp);
}

void save_image_data( )
/* Output of image2[ ][ ], x_size2, y_size2 in pgm format */
{
  char file_name[MAX_FILENAME];
  FILE *fp; /* File pointer */
  int x, y; /* Loop variable */

  /* Output file open */
  printf("-----------------------------------------------------\n");
  printf("Monochromatic image file output routine\n");
  printf("-----------------------------------------------------\n\n");
  printf("Name of output image file? (*.pgm) : ");
  scanf("%s", file_name);
  fp = fopen(file_name, "wb");
  /* output of pgm file header information */
  fputs("P5\n", fp);
  fputs("# Created by Image Processing\n", fp);
  fprintf(fp, "%d %d\n", x_size2, y_size2);
  fprintf(fp, "%d\n", MAX_BRIGHTNESS);
  /* Output of image data */
  #pragma simd
  for (y = 0; y < y_size2; y++) {
    #pragma simd
    for (x = 0; x < x_size2; x++) {
      fputc(image2[y][x], fp);
    }
  }
  printf("\n-----Image data output OK-----\n\n");
  printf("-----------------------------------------------------\n\n");
  fclose(fp);
}

void load_image_file(char *filename)
/* Input of header & body information of pgm file */
/* for image1[ ][ ], x_size1, y_size1 */
{
  char buffer[MAX_BUFFERSIZE];
  FILE *fp;     /* File pointer */
  int max_gray; /* Maximum gray level */
  int x, y;     /* Loop variable */

  /* Input file open */
  fp = fopen(filename, "rb");
  if (NULL == fp) {
    printf(" The file doesn't exist!\n\n");
    exit(1);
  }
  /* Check of file-type ---P5 */
  fgets(buffer, MAX_BUFFERSIZE, fp);
  if (buffer[0] != 'P' || buffer[1] != '5') {
    printf(" Mistaken file format, not P5!\n\n");
    exit(1);
  }
  /* input of x_size1, y_size1 */
  x_size1 = 0;
  y_size1 = 0;
  while (x_size1 == 0 || y_size1 == 0) {
    fgets(buffer, MAX_BUFFERSIZE, fp);
    if (buffer[0] != '#') {
      sscanf(buffer, "%d %d", &x_size1, &y_size1);
    }
  }
  /* input of max_gray */
  max_gray = 0;
  while (max_gray == 0) {
    fgets(buffer, MAX_BUFFERSIZE, fp);
    if (buffer[0] != '#') {
      sscanf(buffer, "%d", &max_gray);
    }
  }
  if (x_size1 > MAX_IMAGEWIDTH || y_size1 > MAX_IMAGEHEIGHT) {
    printf(" Image size exceeds %d x %d\n\n", MAX_IMAGEWIDTH, MAX_IMAGEHEIGHT);
    printf(" Please use smaller images!\n\n");
    exit(1);
  }
  if (max_gray != MAX_BRIGHTNESS) {
    printf(" Invalid value of maximum gray level!\n\n");
    exit(1);
  }
  /* Input of image data */
  #pragma simd
  for (y = 0; y < y_size1; y++) {
    #pragma simd
    for (x = 0; x < x_size1; x++) {
      image1[y][x] = (float)fgetc(fp);
    }
  }
  fclose(fp);
}

void save_image_file(char *filename)
/* Output of image2[ ][ ], x_size2, y_size2 */
/* into pgm file with header & body information */
{
  FILE *fp; /* File pointer */
  int x, y; /* Loop variable */

  fp = fopen(filename, "wb");
  /* output of pgm file header information */
  fputs("P5\n", fp);
  fputs("# Created by Image Processing\n", fp);
  fprintf(fp, "%d %d\n", x_size2, y_size2);
  fprintf(fp, "%d\n", MAX_BRIGHTNESS);
  /* Output of image data */
  #pragma simd
  for (y = 0; y < y_size2; y++) {
    #pragma simd
    for (x = 0; x < x_size2; x++) {
      fputc(image2[y][x], fp);
    }
  }
  fclose(fp);
}
**xeon.c**
/* sobel.c */
#include <stdio.h>
#include <stdlib.h>
#include <float.h>
#include <time.h>
#include <omp.h>
#include "mypgm.h"

void sobel_filtering( )
/* Spatial filtering of image data */
/* Sobel filter (horizontal differentiation) */
/* Input: image1[y][x] ---- Output: image2[y][x] */
{
  /* Definition of Sobel filter in horizontal direction */
  float weight[3][3] __attribute__((aligned(64))) = {{ -1, 0, 1 },
                                                     { -2, 0, 2 },
                                                     { -1, 0, 1 }};
  float pixel_value;
  float min, max;
  int x, y, i, j; /* Loop variable */

  /* Maximum values calculation after filtering */
  printf("Now, filtering of input image is performed\n\n");
  min = DBL_MAX;
  max = -DBL_MAX;
  #pragma omp parallel shared(image2, weight, image1, min, max) private(y, x, j, i)
  {
    #pragma omp for collapse(2)
    for (y = 0; y < y_size1; y++) {
      for (x = 0; x < x_size1; x++) {
        image2[y][x] = 0;
      }
    }
    #pragma omp for collapse(2) reduction(+:pixel_value)
    for (y = 1; y < y_size1 - 1; y++) {
      //#pragma simd
      for (x = 1; x < x_size1 - 1; x++) {
        pixel_value = 0.0;
        #pragma simd
        //#pragma omp for collapse(2)
        for (j = -1; j <= 1; j++) {
          #pragma simd
          for (i = -1; i <= 1; i++) {
            pixel_value += weight[j + 1][i + 1] * image1[y + j][x + i];
          }
        }
        image2[y][x] = (float)pixel_value;
        #pragma omp critical
        {
          if (pixel_value < min) min = pixel_value;
          if (pixel_value > max) max = pixel_value;
        }
      }
    }
    #pragma omp barrier
    #pragma omp single
    {
      if ((int)(max - min) == 0) {
        printf("Nothing exists!!!\n\n");
        exit(1);
      }
      printf("%f\n", min);
      printf("%f\n", max);
    }
    /* Generation of image2 after linear transformation */
    #pragma omp for private(x) collapse(2)
    //#pragma simd
    for (y = 1; y < y_size1 - 1; y++) {
      //#pragma simd
      for (x = 1; x < x_size1 - 1; x++) {
        image2[y][x] = MAX_BRIGHTNESS * (image2[y][x] - min) / (max - min);
      }
    }
  } // ends the parallel section
} // end of sobel filtering function

int main( )
{
  load_image_data( ); /* Input of image1 */
  clock_t begin = clock();
  sobel_filtering( ); /* Sobel filter is applied to image1 */
  clock_t end = clock();
  double time_spent = (double)(end - begin) / CLOCKS_PER_SEC;
  printf("\n\nTiming result of multiplication of matrix-vector: %f\n", time_spent);
  save_image_data( ); /* Output of image2 */
  return 0;
}
Knights Landing mesh hop cost X- vs Y-dir
Hi
in "Knights Landing: Second generation Xeon Phi product" by Sodani et al. [1],
the authors state that "One hop on mesh takes one clock in the Y direction and two clocks in the X direction", without further explanation.
As I could not find any additional sources on this topic, it only remains for me to speculate about the reasoning behind it.
Does anyone know something about this issue and may even provide some references?
Thanks in advance,
Michael
[1]: http://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=7453080
Knights Landing's MCDRAM Address Mapping
I am interested in how MCDRAM address mapping happens in Knights Landing. For a given physical address, how does it decide the MCDRAM row, column, bank, and channel? Is there any MCDRAM architecture spec that describes this procedure?
Thanks in advance.
Performance issue when writing back to a thread-local matrix
Hello Everyone,
I have a scientific program that computes integrals, then combines each integral with the input matrix to form the result, and finally writes the result to the output matrix. I found that performance degrades significantly (an 8-10x slowdown) when the result is written back to the output matrix. However, the output matrix is private to the given thread, so it should not be false sharing, right? I use the Intel® Xeon Phi™ coprocessor 5110P.
Here is example of a piece of code:
aKetExRho is the output matrix; its memory is also aligned to 64 bytes, and it is created inside the same thread at the upper level. The function getPtr obtains the proper pointer to write the result back.
abcd is the raw integral result; aBraDenPhi contains the input data to be combined with the raw integral to form the result.
__attribute__((aligned(64))) Double abcd[36];
__attribute__((aligned(64))) Double aBraDenPhi[6];
Double* aKetExRho = matrix_phi::getPtr(colLocBasOffset,iGrid,aKetAtomBlockExRho);
for(UInt j=0; j<6; j++) {
const Double* abcd_ptr = &abcd[j*6];
Double result = ZERO;
result += aBraDenPhi[0]*abcd_ptr[0];
result += aBraDenPhi[1]*abcd_ptr[1];
result += aBraDenPhi[2]*abcd_ptr[2];
result += aBraDenPhi[3]*abcd_ptr[3];
result += aBraDenPhi[4]*abcd_ptr[4];
result += aBraDenPhi[5]*abcd_ptr[5];
aKetExRho[j] += -1.0E0*result;
}
I found that if I comment out the line "aKetExRho[j] += -1.0E0*result;", performance increases significantly. However, the output matrix is private to the thread. How can I solve this problem?
Thank you,
Phoenix
Xeon PHI MIA after flash update 3.8.2
I have been trying to get the Xeon Phi in my Microway Windows 7 SP1 workstation to work with MKL Automatic Offload.
As part of that process I upgraded to MPSS 3.8 and re-flashed; the process completed without error. Of course I rebooted.
However, the Phi card now seems non-functional, and it crashes MKL when I call mkl_mic_enable().
Before upgrade
>micinfo
MicInfo Utility Log
Copyright 2011-2013 Intel Corporation All Rights Reserved.
Created Wed Jun 14 11:34:46 2017
System Info
HOST OS : Windows
OS Version : Microsoft Windows 7 Professi
Driver Version : 3.3.30726.0
MPSS Version : 3.3.30726.0
Host Physical Memory : 32709 MB
Device No: 0, Device Name: mic0
Version
Flash Version : 2.1.02.0390
SMC Firmware Version : 1.16.5078
SMC Boot Loader Version : 1.8.4326
uOS Version : 2.6.38.8+mpss3.3
Device Serial Number : ADKC32800563
Board
Vendor ID : 0x8086
Device ID : 0x225d
Subsystem ID : 0x3608
Coprocessor Stepping ID : 2
PCIe Width : x16
PCIe Speed : 5 GT/s
PCIe Max payload size : 256 bytes
PCIe Max read req size : 512 bytes
Coprocessor Model : 0x01
Coprocessor Model Ext : 0x00
Coprocessor Type : 0x00
Coprocessor Family : 0x0b
Coprocessor Family Ext : 0x00
Coprocessor Stepping : C0
Board SKU : C0PRQ-3120/3140 P/A
ECC Mode : Enabled
SMC HW Revision : Product 300W Active CS
Cores
Total No of Active Cores : 57
Voltage : 1039000 uV
Frequency : 1100000 kHz
After Upgrade
C:\Program Files (x86)\Microsoft Visual Studio 14.0\VC>micinfo
MicInfo Utility Log
Created Fri Jun 16 08:42:07 2017
System Info
HOST OS : Windows
OS Version : Microsoft Windows 7 Professional
Driver Version : 3.8.2.4191
MPSS Version : 3.8.2.4191
Host Physical Memory : 32709 MB
Device No: 0, Device Name: mic0
Version
Flash Version : NotAvailable
SMC Firmware Version : NotAvailable
SMC Boot Loader Version : NotAvailable
Coprocessor OS Version : NotAvailable
Device Serial Number : NotAvailable
Board
Vendor ID : 0x8086
Device ID : 0x225d
Subsystem ID : 0x3608
Coprocessor Stepping ID : 2
PCIe Width : x16
PCIe Speed : 5 GT/s
PCIe Max payload size : 256 bytes
PCIe Max read req size : 512 bytes
Coprocessor Model : 0x01
Coprocessor Model Ext : 0x00
Coprocessor Type : 0x00
Coprocessor Family : 0x0b
Coprocessor Family Ext : 0x00
Coprocessor Stepping : C0
Board SKU : C0PRQ-3120/3140 P/A
ECC Mode : NotAvailable
SMC HW Revision : NotAvailable
Cores
Total No of Active Cores : NotAvailable
Voltage : NotAvailable
Frequency : NotAvailable
Thermal
Fan Speed Control : NotAvailable
Fan RPM : NotAvailable
Fan PWM : NotAvailable
Die Temp : NotAvailable
GDDR
GDDR Vendor : NotAvailable
GDDR Version : NotAvailable
GDDR Density : NotAvailable
GDDR Size : NotAvailable
GDDR Technology : NotAvailable
GDDR Speed : NotAvailable
GDDR Frequency : NotAvailable
GDDR Voltage : NotAvailable
strange behaviour with icpc 2016,2017 and _m512d arithmetic
Hello, the following sample code is compiled with icpc (version 16.0.2, gcc version 4.9.3 compatibility):
#include <stdio.h>
#include <immintrin.h>

/*
inline __m512d operator+(const __m512d val1, __m512d val2) {
  return _mm512_add_pd(val1, val2);
}
*/

int main(int argc, char **argv)
{
  __m512d a = {1, 2, 3, 4, 1, 2, 3, 4}, b = {5, 6, 7, 8, 5, 6, 7, 8}, c;
  c = a + b;
  double *pc = (double *)&c;
  printf("c = %e %e %e %e %e %e %e %e \n",
         pc[0], pc[1], pc[2], pc[3], pc[4], pc[5], pc[6], pc[7]);
  return 1;
}
we get the following error:
$icpc -xMIC-AVX512 main.cpp
main.cpp(14): error: operation not supported for these simd operands
c=a+b;
^
compilation aborted for main.cpp (code 2)
If we uncomment the operator overload, compilation is fine.
If we move to Intel 2017 (icpc version 17.0.0, gcc version 5.4.0 compatibility), we get the same error when the operator overload is commented out (which seems coherent), BUT also when it is not commented out!
So my questions are:
Why does arithmetic with __m512d not work with Intel 2016 and Intel 2017 (it works perfectly with AVX/AVX2 types)?
Why does operator overloading not work with Intel 2017?
Best regards
T. Guignon
KNL cache performance using SIMD intrinsic
Hi
I am very curious about the cache performance of KNL with SIMD intrinsic. I have the following observations.
I wrote a matrix-matrix multiplication program, in two versions. The first one does gemm in a straightforward way, without intrinsics; the second uses intrinsics. The matrices are small, i.e., 16 x 16. I profiled the two versions with VTune and found that the first version has a very small number of L1 cache misses, while the second has several times more.
The first version is compiled with -O1, so it is not vectorized. The second version is fully vectorized, since I use AVX-512 intrinsic instructions. As for runtime, the first version without doubt takes much more time.
The question is: why is the cache miss count so different? The two versions should have the same memory access pattern, and all data (three 16x16 float matrices) should fit in the L1 cache, so there should be only compulsory cache misses.
Could anyone help to explain why?
CentOS 7.3 crashes after installation of MPSS 3.8.2
When I try to install MPSS 3.8.2 for my Xeon Phi 31S1P coprocessor on CentOS 7.3, the system crashes. Is there anything I can do/try, or a possible fix? Any help would be highly appreciated, thank you!
I downloaded mpss-3.8.2 (released April 25, 2017) from the page https://software.intel.com/en-us/articles/intel-manycore-platform-software-stack-mpss#lx38rel and followed the instructions provided in the readme file. As my system's kernel is slightly newer than what the MPSS download provides for, I had to recompile MPSS, which worked fine. I can also install the RPM packages, receiving the following error message (which I am not sure is related to the problem at all):
depmod: ERROR: failed to load symbols from /lib/modules/3.10.0-514.21.2.el7.x86_64/extra/nvidia-uvm.ko: Invalid argument
After having installed the mpss-software, however, I can no longer boot the system (see below).
When I execute "modprobe mic", I get the following error message three times:
NMI watchdog: BUG: soft lockup - CPU#32 stuck for 22s! [modprobe:17376]
After displaying this message three times, the command prompt reappears. I can execute "micctrl --initdefaults" without any messages being displayed.
If I then execute "micctrl -s" I get the error "mic0: reset failed".
If I try "/usr/bin/miccheck", the system freezes completely.
After having installed MPSS, I get the errors below when rebooting the system, i.e., the system cannot boot anymore. I can correct the problem by entering recovery mode and executing the "uninstall.sh" script delivered in the MPSS download. After that, I can reboot the system without problems.
The coprocessor is correctly identified by "lspci" as below and large BAR support has been enabled in the BIOS ("above 4G decoding"):
09:00.0 Co-processor: Intel Corporation Xeon Phi coprocessor 31S1 (rev 11)
---BASIC SYSTEM INFORMATION---
ASUS X99-E WS
Intel Xeon E5-2696V3
64 GB RAM
NVidia GForce 1080
---ERROR LOG WHEN REBOOTING---
[ 12.1884] pcieport 0000:00:02.0 PCIe Bus Error: severity: Uncorrected (Non-Fatal), type=Tansaction Layer, id=0010(Requester-ID)
[ 12.1885] pcieport 0000:00:02.0 device [8086:2f04] error status/mask=000040000/00000000
[ 12.1886] pcieport 0000:00:02.0 [14] Completion Timeout (First)
[ 40.0710] NMI Watchdog: BUG: soft lockup - CPU#0 stuck for 23s! [modprobe:784]
[ 68.0710] NMI Watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [modprobe:784]
[ 72.2060] INFO: rcu_sched self-detected stall on CPU { 0} (t=60001 jiffies g=135 c=134 q=2018)
[ 100.0710] NMI Watchdog: BUG: soft lockup - CPU#0 stuck for 23s! [modprobe:784]
[ 113.4049] ETC timer compensation(-1000000ppm) is much higherthan expected
[ 113.4049] pcieport 0000:00:02.0: device [8086:2f04] error status/mask=000040000/00000000
[ 113.4049] pcieport 0000:00:02.0: [14] Completion Timeout (First)
...
[ 120.8210] mce: [Hardware Error]: CPU 16: Machine Check Exception: 0 Bank 3: fe00000000800400
[ 120.8210] mce: [Hardware Error]: TSC 0 ADDR ffe0000000000000 MISC ffffffff81060ff5
[ 120.8210] mce: [Hardware Error]: PROCESSOR 0:306f2 TIME 1498668525 SOCKET 0 APIC 34 microcode 38
...
[ 120.8210] mce: [Hardware Error]: CPU 22: Machine Check Exception: 5 Bank 18: be200000008c110a
[ 120.8210] mce: [Hardware Error]: RIP !INEXACT! 10:<ffffffff81060fe6> {native_save_halt+0x6/0x10}
[ 120.8210] mce: [Hardware Error]: TSC e627fde4082 ADDR e0900fc0 MISC 74fc381600402086
[ 120.8210] mce: [Hardware Error]: PROCESSOR 0:306f2 TIME 1498661446 SOCKET 0 APIC 9 microcode 38
[ 120.8210] mce: [Hardware Error]: Some CPUs didn't answer in synchronization
[ 120.8210] mce: [Hardware Error]: Machine check: Processor context corrupt
[ 120.8210] Kernel panic - not syncing: Fatal machine check on current CPU
[ 120.8210] Shutting down cpus with NMI
[ 120.8210] Rebooting in 30 seconds..
About installing CentOS on SSD in KNL 7250
Dear ALL
I encountered a problem installing CentOS on an Adams Pass server with a KNL 7250 and an SSD disk of about 800 GiB. I want to install CentOS 7 onto this SSD, using CentOS-7-x86_64-DVD-1611.iso.
The CentOS installation uses a USB flash drive, to which the installation ISO image has been written directly with the dd command in Linux or with UltraISO in Windows.
After powering on the server, the SSD can be seen in the BIOS, but it does not appear on the CentOS Installation Destination screen.
I have tried several approaches but still cannot solve this problem, and therefore I am writing this post to ask for help.
mpirun: command not found
Hi! I am getting a "mpirun: command not found" error on my Xeon Phi card. Could you please help me solve this problem? The response of "which mpirun" on my main processor is: /opt/intel/compilers_and_libraries_2016.1.150/linux/mpi/intel64/bin/mpirun.
Intel MKL performance drop OpenMP vs TBB
Hi everyone,
I tried the below example program on KNL and I am puzzled about the huge performance difference. It computes a small matrix-matrix product using the MKL. In this (naive) example there is a 1000x performance difference when switching from OpenMP to TBB. The file was compiled with
icc -std=c++11 -O3 -xmic-avx512 -mkl -qopenmp tbb_vs_omp.cpp -o omp
icc -std=c++11 -O3 -xmic-avx512 -mkl -tbb tbb_vs_omp.cpp -o tbb
I tried a few things, e.g. using tbb::task_scheduler_init or OpenMP environment variables, but nothing seems to make the TBB version nearly as fast as the OpenMP version, or the OpenMP version as slow. Does anyone know what the problem might be and how to fix it, that is, how to configure TBB? The gap gets smaller when increasing the problem size (only 10x for N=1024).
#include <iostream>
#include <mkl.h>

constexpr size_t N = 64;
constexpr size_t RUNS = 20;

int main()
{
    double* A = (double*)_mm_malloc(N * N * sizeof(double), 64);
    double* B = (double*)_mm_malloc(N * N * sizeof(double), 64);
    double* C = (double*)_mm_malloc(N * N * sizeof(double), 64);

    VSLStreamStatePtr stream;
    vslNewStream(&stream, VSL_BRNG_SFMT19937, 1337);
    vdRngUniform(VSL_RNG_METHOD_UNIFORM_STD, stream, N * N, A, -10, 10);
    vdRngUniform(VSL_RNG_METHOD_UNIFORM_STD, stream, N * N, B, -10, 10);
    vslDeleteStream(&stream);
    std::cout << "Created matrices, N = " << N << ".\n";

    {
        double total = 0.0;
        cblas_dgemm(CBLAS_LAYOUT::CblasColMajor, CBLAS_TRANSPOSE::CblasTrans,
                    CBLAS_TRANSPOSE::CblasNoTrans, N, N, N, 1.0,
                    A, N /* lda */, B, N /* ldb */, 0.0, C, N /* ldc */);
        for (size_t i = 0; i < RUNS; ++i) {
            // A[0] = i;
            double start = dsecnd();
            cblas_dgemm(CBLAS_LAYOUT::CblasColMajor, CBLAS_TRANSPOSE::CblasTrans,
                        CBLAS_TRANSPOSE::CblasNoTrans, N, N, N, 1.0,
                        A, N /* lda */, B, N /* ldb */, 0.0, C, N /* ldc */);
            total += dsecnd() - start;
        }
        std::cout << "Time needed " << total << ", ";
    }
    std::cout << C[0] << '\n';

    _mm_free(A);
    _mm_free(B);
    _mm_free(C);
    return 0;
}
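One way to A/B the two threading backends from a single binary, rather than two differently linked executables (a sketch, on the assumption that relinking against MKL's Single Dynamic Library is acceptable):

```shell
# Link against the single dynamic library instead of -mkl -qopenmp / -mkl -tbb:
icc -std=c++11 -O3 -xmic-avx512 tbb_vs_omp.cpp -lmkl_rt -o both

# mkl_rt reads MKL_THREADING_LAYER at startup, so the backend is a per-run choice:
MKL_THREADING_LAYER=INTEL ./both   # OpenMP-threaded MKL
MKL_THREADING_LAYER=TBB   ./both   # TBB-threaded MKL
```

This rules out link-line differences as the cause and makes it easier to vary only the threading layer while holding everything else fixed.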
Can Intel Xeon Phi get data direct from another PCI device?
Hello,
Can the Intel Xeon Phi be configured to receive data directly from an FPGA board, process it, and send the result to host memory?
I have a large flow of input data and don't want redundant transfers (FPGA board -> Host Memory -> MIC -> Host Memory) over PCIe. I tried hard to find a solution by watching some Intel product videos but wasn't satisfied. I want a more elegant scheme (FPGA board -> MIC -> Host Memory). Is it possible?
Please help me out.
Any help will be appreciated.
Thank you.
mpirun with "-host mic0": error while loading shared libraries: libmkl_intel_lp64.so
When I use "-host mic0" on the host, I get an error that mic0 cannot find the file libmkl_intel_lp64.so.
[yd@yd-ws3 ~]$ mpirun -host mic0 -iface mic0 -n 1 /yd_tools/binaries_mic/yd_binary_mpiicpc_mx008_shnao_iposlm_cmvlf_dwyg
/yd_tools/binaries_mic/yd_binary_mpiicpc_mx008_shnao_iposlm_cmvlf_dwyg: error while loading shared libraries: libmkl_intel_lp64.so: cannot open shared object file: No such file or directory
I need to set LD_LIBRARY_PATH manually with "-env".
[yd@yd-ws3 ~]$ mpirun -host mic0 -iface mic0 -env LD_LIBRARY_PATH /opt/intel/mkl/lib/mic -n 1 /yd_tools/binaries_mic/yd_binary_mpiicpc_mx008_shnao_iposlm_cmvlf_dwyg
Usage: /yd_tools/binaries_mic/yd_binary_mpiicpc_mx008_shnao_iposlm_cmvlf_dwyg [ work path ]
When I run mpirun on the host or on the MIC itself, the error disappears.
[yd@yd-ws3 ~]$ mpirun -n 1 /yd_tools/binaries/yd_binary_mpiicpc_mx008_shnao_iposlm_cmvlf_dwyg
Usage: /yd_tools/binaries/yd_binary_mpiicpc_mx008_shnao_iposlm_cmvlf_dwyg [ work path ]
[yd@yd-ws3 common]$ ssh mic0
[yd@yd-ws3-mic0 ~]$ mpirun -n 1 /yd_tools/binaries_mic/yd_binary_mpiicpc_mx008_shnao_iposlm_cmvlf_dwyg
Usage: /yd_tools/binaries_mic/yd_binary_mpiicpc_mx008_shnao_iposlm_cmvlf_dwyg [ work path ]
[yd@yd-ws3-mic0 ~]$ exit
logout
Connection to mic0 closed.
This is the output of env. There is no LD_LIBRARY_PATH.
[yd@yd-ws3 ~]$ mpirun -host mic0 -iface mic0 -env LD_LIBRARY_PATH /opt/intel/mkl/lib/mic -n 1 /usr/bin/env | grep PATH
PATH=/opt/intel/compilers_and_libraries_2017.4.196/linux/bin/intel64:/opt/intel/compilers_and_libraries_2017.4.196/linux/mpi/intel64/bin:/opt/intel/debugger_2017/gdb/intel64_mic/bin:/usr/local/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/yd_tools/binaries:/home/yd/.local/bin:/home/yd/bin
MANPATH=/opt/intel/man/common:/opt/intel/compilers_and_libraries_2017.4.196/linux/mpi/man:/opt/intel/documentation_2017/en/debugger//gdb-ia/man/:/opt/intel/documentation_2017/en/debugger//gdb-mic/man/:/opt/intel/documentation_2017/en/debugger//gdb-igfx/man/:/usr/local/share/man:/usr/share/man:
LIBRARY_PATH=/opt/intel/compilers_and_libraries_2017.4.196/linux/ipp/lib/intel64:/opt/intel/compilers_and_libraries_2017.4.196/linux/compiler/lib/intel64_lin:/opt/intel/compilers_and_libraries_2017.4.196/linux/mkl/lib/intel64_lin:/opt/intel/compilers_and_libraries_2017.4.196/linux/tbb/lib/intel64/gcc4.7:/opt/intel/compilers_and_libraries_2017.4.196/linux/daal/lib/intel64_lin:/opt/intel/compilers_and_libraries_2017.4.196/linux/daal/../tbb/lib/intel64_lin/gcc4.4
MIC_LD_LIBRARY_PATH=/opt/intel/compilers_and_libraries_2017.4.196/linux/mpi/mic/lib:/opt/intel/compilers_and_libraries_2017.4.196/linux/compiler/lib/mic:/opt/intel/compilers_and_libraries_2017.4.196/linux/ipp/lib/mic:/opt/intel/mic/coi/device-linux-release/lib:/opt/intel/mic/myo/lib:/opt/intel/compilers_and_libraries_2017.4.196/linux/compiler/lib/intel64_lin_mic:/opt/intel/compilers_and_libraries_2017.4.196/linux/mkl/lib/intel64_lin_mic:/opt/intel/compilers_and_libraries_2017.4.196/linux/tbb/lib/mic
MIC_LIBRARY_PATH=/opt/intel/compilers_and_libraries_2017.4.196/linux/mpi/mic/lib:/opt/intel/compilers_and_libraries_2017.4.196/linux/compiler/lib/mic:/opt/intel/compilers_and_libraries_2017.4.196/linux/compiler/lib/intel64_lin_mic:/opt/intel/compilers_and_libraries_2017.4.196/linux/mkl/lib/intel64_lin_mic:/opt/intel/compilers_and_libraries_2017.4.196/linux/tbb/lib/mic
CPATH=/opt/intel/compilers_and_libraries_2017.4.196/linux/ipp/include:/opt/intel/compilers_and_libraries_2017.4.196/linux/mkl/include:/opt/intel/compilers_and_libraries_2017.4.196/linux/tbb/include:/opt/intel/compilers_and_libraries_2017.4.196/linux/daal/include
CLASSPATH=/opt/intel/compilers_and_libraries_2017.4.196/linux/mpi/intel64/lib/mpi.jar:/opt/intel/compilers_and_libraries_2017.4.196/linux/daal/lib/daal.jar
INFOPATH=/opt/intel/documentation_2017/en/debugger//gdb-ia/info/:/opt/intel/documentation_2017/en/debugger//gdb-mic/info/:/opt/intel/documentation_2017/en/debugger//gdb-igfx/info/
I_MPI_CMD=mpirun -host mic0 -iface mic0 -env LD_LIBRARY_PATH /opt/intel/mkl/lib/mic -n 1 /usr/bin/env
How can I avoid setting LD_LIBRARY_PATH manually with "-env"?
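Two common ways to make the setting stick, sketched below (the paths are the ones from the transcripts above; whether they fit your setup is an assumption):

```shell
# Option 1: -genv propagates the variable to every rank on every host,
# so it does not have to be repeated per -host group.
mpirun -genv LD_LIBRARY_PATH /opt/intel/mkl/lib/mic \
       -host mic0 -iface mic0 -n 1 \
       /yd_tools/binaries_mic/yd_binary_mpiicpc_mx008_shnao_iposlm_cmvlf_dwyg

# Option 2: set it once in the coprocessor-side shell profile, so every
# ssh/mpirun session on mic0 inherits it.
ssh mic0 'echo "export LD_LIBRARY_PATH=/opt/intel/mkl/lib/mic:\$LD_LIBRARY_PATH" >> ~/.profile'
```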
OpenCL support for Xeon Phi processor (Knights Landing architecture)
Hi,
Is OpenCL supported on the newer Xeon Phi processors? The most definitive information I found that it's not supported is a post from two years ago (https://software.intel.com/en-us/forums/opencl/topic/697753). Has anything changed since?
Thanks,
Viktor
BUG in XPPSL 1.5.1/2
Hi,
As I couldn't think of any other place where I should post this information to get someone to fix it:
XPPSL version 1.5.1 introduced a bug in its bundled hwloc. This results in wrong process binding with Slurm when the KNL is configured in SNC4 + Flat. In particular, when running e.g. 4 processes per KNL, the first two processes are bound correctly; the third, however, is bound to hwthread id #2 of all 64 cores, and the fourth process to hwthread id #3 of all cores. Thus processes 3 and 4 run one thread on every core instead of 4 threads on only 16 cores. SNC4 + Cache is not affected.
This bug has not been fixed in 1.5.2. I actually wanted to dig into the sources to find the exact bug; however, as it seems Intel still prefers to hide any changes to open source software as much as possible, I gave up and am just informing you this way. A public GitHub page or a simple bug tracker would be useful for users to submit bugs. If such a thing actually exists (and I do not have to register any product to get access), please tell me where I can find it.
Does the MIC really run faster than the CPU?
Hi!
I compared the speed of the CPU and the MIC by running identical C++ programs using OpenMP (both fully occupied during operation). However, in the release build, the CPU (9.7 s) is nearly 3 times faster than the MIC (26.5 s). How come!? If the MIC is actually slower than the CPU, then what is the point of using it?
The testing code is as follows:
#pragma omp parallel for reduction(+:sum)
for(int i=0; i<100000; i++)
for(int j=0; j<100000; j++)
sum += sqrt(sqrt(j^2+1) + sqrt(sqrt(i^2+1)) + 1);
For MIC, I used offload pragma to run the code.
The MIC I used is:
Intel Xeon Phi Coprocessor 7120
The CPU I used is:
Genuine Intel(R) CPU @ 1.80GHz 1.80GHz (2 processor)
Hopefully someone can tell me the reason.
Installation issue: 'modprobe mic' freezes server
Dear all,
I am trying to install MPSS 3.8.2 on OpenSUSE Leap 42.2 (in principle equivalent to SUSE 12.2). It comes with a newer kernel than the one stated in the 'readme.txt' file.
So I proceeded with the instructions found in the 'readme.txt' file. Everything was fine until
# modprobe mic
At that point the machine freezes.
I am using the following kernel:
$ uname -r
$ 4.4.27-2-default
The mic.ko is correctly located at '/lib/modules/4.4.27-2-default/extra/mic.ko'.
For information:
lspci | grep -i Co-processor
05:00.0 Co-processor: Intel Corporation Xeon Phi coprocessor 31S1 (rev 11)
42:00.0 Co-processor: Intel Corporation Xeon Phi coprocessor 5100 series (rev 11)
Ideas are more than welcome...
Intel coprocessor error: No mic cards found
My computer has a Xeon Phi coprocessor. I get the error "No mic cards found or specified in command line". I am using MPSS 3.8.2. My kernel is 2.6.32-696.6.3.el6.x86_64. I have rebuilt the kernel modules from source. I also tried reinstalling MPSS. When I run the micrasd command, the error given is:
Wed Jul 26 09:41:18 2017 MICRAS INFO : Open MCA filter log history.
Wed Jul 26 09:41:18 2017 MICRAS ERROR : No MIC device detected!