Where should I get the faster-rcnn.xml model to run the Intel Inference Engine object detection sample?
Intel Deep Learning Inference Engine
Need to Re-register Xeon Phi Under a Different Account
I was told by one of Intel's support staff that I needed to make a post on these forums to help me regarding the following issue.
"I need to register the serial number of my Xeon Phi under a different account. When I registered the serial number, I made a typo in the registered email address, and due to this error I am now having issues resetting the password or changing the email account, since the serial number of the Xeon Phi is tied to the wrong email address and I don't have access to that address. Please advise."
Let me know how I can contact the appropriate staff. Thank you.
Sobel Filter (OpenMP implementation for Knights Landing)
I am trying to implement a parallelized + vectorized version of the Sobel filter in C, with OpenMP pragmas for the parallelization and #pragma simd for vectorization. My input is a 1024 x 1024 .pgm image. I am compiling with the Intel compiler on a Xeon Phi Knights Landing processor using the following command:
icc -qopenmp -O3 -qopt-report3 xeon.c -o xeon
The problems I am facing with the code in general are:
a) When do I parallelize and when do I vectorize? I have a nested loop made up of four for loops -> should I parallelize or vectorize this piece of code?
b) My 'min' and 'max' values are wrong. They are both shared variables and hence prone to race conditions, so I have added a #pragma omp critical around them. However, the values printed for these two variables are still wrong and I have no idea why. I have even added a barrier before the print statement to make sure all threads pass through the critical section before the min and max values get printed.
c) The #pragma omp critical makes my program very, very slow. In fact the execution time is even longer than the sequential runtime. Is there any way to avoid it?
Code:
**mypgm.h**
/* pgm file IO headerfile ------ mypgm.h */

/* Constant declaration */
#define MAX_IMAGEWIDTH  1024
#define MAX_IMAGEHEIGHT 1024
#define MAX_BRIGHTNESS  255 /* Maximum gray level */
#define GRAYLEVEL       256 /* No. of gray levels */
#define MAX_FILENAME    256 /* Filename length limit */
#define MAX_BUFFERSIZE  256

/* Global constant declaration */
/* Image storage arrays */
float image1[MAX_IMAGEWIDTH][MAX_IMAGEHEIGHT] __attribute__((aligned(64))),
      image2[MAX_IMAGEWIDTH][MAX_IMAGEHEIGHT] __attribute__((aligned(64)));
int x_size1, y_size1, /* width & height of image1 */
    x_size2, y_size2; /* width & height of image2 */

/* Prototype declaration of functions */
void load_image_data( );      /* image input */
void save_image_data( );      /* image output */
void load_image_file(char *); /* image input */
void save_image_file(char *); /* image output */

/* Main body of functions */
void load_image_data( )
/* Input of header & body information of pgm file */
/* for image1[ ][ ], x_size1, y_size1 */
{
  char file_name[MAX_FILENAME];
  char buffer[MAX_BUFFERSIZE];
  FILE *fp;     /* File pointer */
  int max_gray; /* Maximum gray level */
  int x, y;     /* Loop variable */

  /* Input file open */
  printf("\n-----------------------------------------------------\n");
  printf("Monochromatic image file input routine \n");
  printf("-----------------------------------------------------\n\n");
  printf(" Only pgm binary file is acceptable\n\n");
  printf("Name of input image file? (*.pgm) : ");
  scanf("%s", file_name);
  fp = fopen(file_name, "rb");
  if (NULL == fp) {
    printf(" The file doesn't exist!\n\n");
    exit(1);
  }
  /* Check of file-type ---P5 */
  fgets(buffer, MAX_BUFFERSIZE, fp);
  if (buffer[0] != 'P' || buffer[1] != '5') {
    printf(" Mistaken file format, not P5!\n\n");
    exit(1);
  }
  /* input of x_size1, y_size1 */
  x_size1 = 0;
  y_size1 = 0;
  while (x_size1 == 0 || y_size1 == 0) {
    fgets(buffer, MAX_BUFFERSIZE, fp);
    if (buffer[0] != '#') {
      sscanf(buffer, "%d %d", &x_size1, &y_size1);
    }
  }
  /* input of max_gray */
  max_gray = 0;
  while (max_gray == 0) {
    fgets(buffer, MAX_BUFFERSIZE, fp);
    if (buffer[0] != '#') {
      sscanf(buffer, "%d", &max_gray);
    }
  }
  /* Display of parameters */
  printf("\n Image width = %d, Image height = %d\n", x_size1, y_size1);
  printf(" Maximum gray level = %d\n\n", max_gray);
  if (x_size1 > MAX_IMAGEWIDTH || y_size1 > MAX_IMAGEHEIGHT) {
    printf(" Image size exceeds %d x %d\n\n", MAX_IMAGEWIDTH, MAX_IMAGEHEIGHT);
    printf(" Please use smaller images!\n\n");
    exit(1);
  }
  if (max_gray != MAX_BRIGHTNESS) {
    printf(" Invalid value of maximum gray level!\n\n");
    exit(1);
  }
  /* Input of image data */
  #pragma simd
  for (y = 0; y < y_size1; y++) {
    #pragma simd
    for (x = 0; x < x_size1; x++) {
      image1[y][x] = (unsigned char)fgetc(fp);
    }
  }
  printf("-----Image data input OK-----\n\n");
  printf("-----------------------------------------------------\n\n");
  fclose(fp);
}

void save_image_data( )
/* Output of image2[ ][ ], x_size2, y_size2 in pgm format */
{
  char file_name[MAX_FILENAME];
  FILE *fp; /* File pointer */
  int x, y; /* Loop variable */

  /* Output file open */
  printf("-----------------------------------------------------\n");
  printf("Monochromatic image file output routine\n");
  printf("-----------------------------------------------------\n\n");
  printf("Name of output image file? (*.pgm) : ");
  scanf("%s", file_name);
  fp = fopen(file_name, "wb");
  /* output of pgm file header information */
  fputs("P5\n", fp);
  fputs("# Created by Image Processing\n", fp);
  fprintf(fp, "%d %d\n", x_size2, y_size2);
  fprintf(fp, "%d\n", MAX_BRIGHTNESS);
  /* Output of image data */
  #pragma simd
  for (y = 0; y < y_size2; y++) {
    #pragma simd
    for (x = 0; x < x_size2; x++) {
      fputc(image2[y][x], fp);
    }
  }
  printf("\n-----Image data output OK-----\n\n");
  printf("-----------------------------------------------------\n\n");
  fclose(fp);
}

void load_image_file(char *filename)
/* Input of header & body information of pgm file */
/* for image1[ ][ ], x_size1, y_size1 */
{
  char buffer[MAX_BUFFERSIZE];
  FILE *fp;     /* File pointer */
  int max_gray; /* Maximum gray level */
  int x, y;     /* Loop variable */

  /* Input file open */
  fp = fopen(filename, "rb");
  if (NULL == fp) {
    printf(" The file doesn't exist!\n\n");
    exit(1);
  }
  /* Check of file-type ---P5 */
  fgets(buffer, MAX_BUFFERSIZE, fp);
  if (buffer[0] != 'P' || buffer[1] != '5') {
    printf(" Mistaken file format, not P5!\n\n");
    exit(1);
  }
  /* input of x_size1, y_size1 */
  x_size1 = 0;
  y_size1 = 0;
  while (x_size1 == 0 || y_size1 == 0) {
    fgets(buffer, MAX_BUFFERSIZE, fp);
    if (buffer[0] != '#') {
      sscanf(buffer, "%d %d", &x_size1, &y_size1);
    }
  }
  /* input of max_gray */
  max_gray = 0;
  while (max_gray == 0) {
    fgets(buffer, MAX_BUFFERSIZE, fp);
    if (buffer[0] != '#') {
      sscanf(buffer, "%d", &max_gray);
    }
  }
  if (x_size1 > MAX_IMAGEWIDTH || y_size1 > MAX_IMAGEHEIGHT) {
    printf(" Image size exceeds %d x %d\n\n", MAX_IMAGEWIDTH, MAX_IMAGEHEIGHT);
    printf(" Please use smaller images!\n\n");
    exit(1);
  }
  if (max_gray != MAX_BRIGHTNESS) {
    printf(" Invalid value of maximum gray level!\n\n");
    exit(1);
  }
  /* Input of image data */
  #pragma simd
  for (y = 0; y < y_size1; y++) {
    #pragma simd
    for (x = 0; x < x_size1; x++) {
      image1[y][x] = (float)fgetc(fp);
    }
  }
  fclose(fp);
}

void save_image_file(char *filename)
/* Output of image2[ ][ ], x_size2, y_size2 */
/* into pgm file with header & body information */
{
  FILE *fp; /* File pointer */
  int x, y; /* Loop variable */

  fp = fopen(filename, "wb");
  /* output of pgm file header information */
  fputs("P5\n", fp);
  fputs("# Created by Image Processing\n", fp);
  fprintf(fp, "%d %d\n", x_size2, y_size2);
  fprintf(fp, "%d\n", MAX_BRIGHTNESS);
  /* Output of image data */
  #pragma simd
  for (y = 0; y < y_size2; y++) {
    #pragma simd
    for (x = 0; x < x_size2; x++) {
      fputc(image2[y][x], fp);
    }
  }
  fclose(fp);
}
**xeon.c**
/* sobel.c */
#include <stdio.h>
#include <stdlib.h>
#include <float.h>
#include <time.h>
#include <omp.h>
#include "mypgm.h"

void sobel_filtering( )
/* Spatial filtering of image data */
/* Sobel filter (horizontal differentiation) */
/* Input: image1[y][x] ---- Output: image2[y][x] */
{
  /* Definition of Sobel filter in horizontal direction */
  float weight[3][3] __attribute__((aligned(64))) = {{ -1, 0, 1 },
                                                     { -2, 0, 2 },
                                                     { -1, 0, 1 }};
  float pixel_value;
  float min, max;
  int x, y, i, j; /* Loop variable */

  /* Maximum values calculation after filtering */
  printf("Now, filtering of input image is performed\n\n");
  min = DBL_MAX;
  max = -DBL_MAX;
  #pragma omp parallel shared(image2, weight, image1, min, max) private(y, x, j, i)
  {
    #pragma omp for collapse(2)
    for (y = 0; y < y_size1; y++) {
      for (x = 0; x < x_size1; x++) {
        image2[y][x] = 0;
      }
    }
    #pragma omp for collapse(2) reduction(+:pixel_value)
    for (y = 1; y < y_size1 - 1; y++) {
      //#pragma simd
      for (x = 1; x < x_size1 - 1; x++) {
        pixel_value = 0.0;
        #pragma simd
        //#pragma omp for collapse(2)
        for (j = -1; j <= 1; j++) {
          #pragma simd
          for (i = -1; i <= 1; i++) {
            pixel_value += weight[j + 1][i + 1] * image1[y + j][x + i];
          }
        }
        image2[y][x] = (float)pixel_value;
        #pragma omp critical
        {
          if (pixel_value < min) min = pixel_value;
          if (pixel_value > max) max = pixel_value;
        }
      }
    }
    #pragma omp barrier
    #pragma omp single
    {
      if ((int)(max - min) == 0) {
        printf("Nothing exists!!!\n\n");
        exit(1);
      }
      printf("%f\n", min);
      printf("%f\n", max);
    }
    /* Generation of image2 after linear transformation */
    #pragma omp for private(x) collapse(2)
    //#pragma simd
    for (y = 1; y < y_size1 - 1; y++) {
      //#pragma simd
      for (x = 1; x < x_size1 - 1; x++) {
        image2[y][x] = MAX_BRIGHTNESS * (image2[y][x] - min) / (max - min);
      }
    }
  } // ends the parallel section
} // end of sobel filtering function

int main( )
{
  load_image_data( ); /* Input of image1 */
  clock_t begin = clock();
  sobel_filtering( ); /* Sobel filter is applied to image1 */
  clock_t end = clock();
  double time_spent = (double)(end - begin) / CLOCKS_PER_SEC;
  printf("\n\nTiming result of multiplication of matrix-vector: %f\n", time_spent);
  save_image_data( ); /* Output of image2 */
  return 0;
}
Knights Landing mesh hop cost X- vs Y-dir
Hi
in "Knights Landing: Second generation Xeon Phi product" by Sodani et al. [1],
the authors state that "One hop on mesh takes one clock in the Y direction and two clocks in the X direction", without further explanation.
As I could not find any additional sources on this topic, it only remains for me to speculate about the reasoning behind it.
Does anyone know something about this issue and may even provide some references?
Thanks in advance,
Michael
[1]: http://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=7453080
Knights Landing's MCDRAM Address Mapping
I am interested in how MCDRAM address mapping happens in Knights Landing. For a given physical address, how does it decide the MCDRAM row, column, bank, and channel? Is there any MCDRAM architecture spec that describes this procedure?
Thanks in advance.
Performance issue when writing back to a thread-local matrix
Hello Everyone,
I have a scientific program that computes integrals, then combines each integral with the input matrix to form the result, and finally writes the result to the output matrix. I found that performance degrades significantly (an 8-10x slowdown) when the result is written back to the output matrix. However, the output matrix is private to the given thread, so it should not be false sharing, right? I use the Intel® Xeon Phi™ coprocessor 5110P.
Here is example of a piece of code:
aKetExRho is the output matrix; its memory is also aligned to 64 bytes, and it is created inside the same thread at the upper level. The function getPtr obtains the proper pointer to write the result back.
abcd is the raw integral result; aBraDenPhi contains the input data to be combined with the raw integral to form the result.
__attribute__((aligned(64))) Double abcd[36];
__attribute__((aligned(64))) Double aBraDenPhi[6];
Double* aKetExRho = matrix_phi::getPtr(colLocBasOffset,iGrid,aKetAtomBlockExRho);
for(UInt j=0; j<6; j++) {
const Double* abcd_ptr = &abcd[j*6];
Double result = ZERO;
result += aBraDenPhi[0]*abcd_ptr[0];
result += aBraDenPhi[1]*abcd_ptr[1];
result += aBraDenPhi[2]*abcd_ptr[2];
result += aBraDenPhi[3]*abcd_ptr[3];
result += aBraDenPhi[4]*abcd_ptr[4];
result += aBraDenPhi[5]*abcd_ptr[5];
aKetExRho[j] += -1.0E0*result;
}
I found that if I comment out the line "aKetExRho[j] += -1.0E0*result;", performance increases significantly. However, the output matrix is private to the thread. How can I solve this problem?
Thank you,
Phoenix
Xeon PHI MIA after flash update 3.8.2
I have been trying to get the Xeon Phi in my Microway Windows 7 SP1 workstation to work with MKL Automatic Offload.
As part of that process I upgraded to MPSS 3.8 and re-flashed; the process completed without error. Of course I rebooted.
However, the Phi card now seems non-functional, and it crashes MKL when I call mkl_mic_enable().
Before upgrade
>micinfo
MicInfo Utility Log
Copyright 2011-2013 Intel Corporation All Rights Reserved.
Created Wed Jun 14 11:34:46 2017
System Info
HOST OS : Windows
OS Version : Microsoft Windows 7 Professi
Driver Version : 3.3.30726.0
MPSS Version : 3.3.30726.0
Host Physical Memory : 32709 MB
Device No: 0, Device Name: mic0
Version
Flash Version : 2.1.02.0390
SMC Firmware Version : 1.16.5078
SMC Boot Loader Version : 1.8.4326
uOS Version : 2.6.38.8+mpss3.3
Device Serial Number : ADKC32800563
Board
Vendor ID : 0x8086
Device ID : 0x225d
Subsystem ID : 0x3608
Coprocessor Stepping ID : 2
PCIe Width : x16
PCIe Speed : 5 GT/s
PCIe Max payload size : 256 bytes
PCIe Max read req size : 512 bytes
Coprocessor Model : 0x01
Coprocessor Model Ext : 0x00
Coprocessor Type : 0x00
Coprocessor Family : 0x0b
Coprocessor Family Ext : 0x00
Coprocessor Stepping : C0
Board SKU : C0PRQ-3120/3140 P/A
ECC Mode : Enabled
SMC HW Revision : Product 300W Active CS
Cores
Total No of Active Cores : 57
Voltage : 1039000 uV
Frequency : 1100000 kHz
After Upgrade
C:\Program Files (x86)\Microsoft Visual Studio 14.0\VC>micinfo
MicInfo Utility Log
Created Fri Jun 16 08:42:07 2017
System Info
HOST OS : Windows
OS Version : Microsoft Windows 7 Professional
Driver Version : 3.8.2.4191
MPSS Version : 3.8.2.4191
Host Physical Memory : 32709 MB
Device No: 0, Device Name: mic0
Version
Flash Version : NotAvailable
SMC Firmware Version : NotAvailable
SMC Boot Loader Version : NotAvailable
Coprocessor OS Version : NotAvailable
Device Serial Number : NotAvailable
Board
Vendor ID : 0x8086
Device ID : 0x225d
Subsystem ID : 0x3608
Coprocessor Stepping ID : 2
PCIe Width : x16
PCIe Speed : 5 GT/s
PCIe Max payload size : 256 bytes
PCIe Max read req size : 512 bytes
Coprocessor Model : 0x01
Coprocessor Model Ext : 0x00
Coprocessor Type : 0x00
Coprocessor Family : 0x0b
Coprocessor Family Ext : 0x00
Coprocessor Stepping : C0
Board SKU : C0PRQ-3120/3140 P/A
ECC Mode : NotAvailable
SMC HW Revision : NotAvailable
Cores
Total No of Active Cores : NotAvailable
Voltage : NotAvailable
Frequency : NotAvailable
Thermal
Fan Speed Control : NotAvailable
Fan RPM : NotAvailable
Fan PWM : NotAvailable
Die Temp : NotAvailable
GDDR
GDDR Vendor : NotAvailable
GDDR Version : NotAvailable
GDDR Density : NotAvailable
GDDR Size : NotAvailable
GDDR Technology : NotAvailable
GDDR Speed : NotAvailable
GDDR Frequency : NotAvailable
GDDR Voltage : NotAvailable
strange behaviour with icpc 2016,2017 and _m512d arithmetic
Hello, the following sample code is compiled with icpc (version 16.0.2, gcc version 4.9.3 compatibility):
#include <stdio.h>
#include <immintrin.h>

/*
inline __m512d operator+(const __m512d val1, __m512d val2) {
  return _mm512_add_pd(val1, val2);
}
*/

int main(int argc, char **argv)
{
  __m512d a = {1, 2, 3, 4, 1, 2, 3, 4}, b = {5, 6, 7, 8, 5, 6, 7, 8}, c;
  c = a + b;
  double *pc = (double *)&c;
  printf("c = %e %e %e %e %e %e %e %e \n",
         pc[0], pc[1], pc[2], pc[3], pc[4], pc[5], pc[6], pc[7]);
  return 1;
}
we get the following error:
$icpc -xMIC-AVX512 main.cpp
main.cpp(14): error: operation not supported for these simd operands
c=a+b;
^
compilation aborted for main.cpp (code 2)
If we uncomment the operator overload, compilation is fine.
If we move to Intel 2017 (icpc version 17.0.0, gcc version 5.4.0 compatibility), we get the same error when the operator overload is commented out (which seems coherent), BUT also when it is not commented out!
So my questions are:
Why does arithmetic with __m512d not work with Intel 2016 and Intel 2017 (it works perfectly with AVX/AVX2 types)?
Why does operator overloading not work with Intel 2017?
Best regards
T. Guignon
KNL cache performance using SIMD intrinsic
Hi
I am very curious about the cache performance of KNL with SIMD intrinsic. I have the following observations.
I wrote a matrix-matrix multiplication program, in two versions. The first one does gemm in a straightforward way, without intrinsics; the second uses intrinsics. The matrices are small, i.e., 16 x 16. I profiled the two versions with VTune and found that the first version has a very small number of L1 cache misses, while the second has several times more.
The first version is compiled with -O1, so it is not vectorized. The second version is fully vectorized, since I use AVX-512 intrinsic instructions. As for runtime, the first version without doubt takes much more time.
The question is: why is the cache miss count so different? The two versions should have the same memory access pattern, and all data (three 16x16 float matrices) should fit in the L1 cache, so there should be only compulsory cache misses.
Could anyone help to explain why?
CentOS 7.3 crashes after installation of MPSS 3.8.2
When I try to install MPSS 3.8.2 for my Xeon Phi 31S1P coprocessor on CentOS 7.3, the system crashes. Is there anything I can do/try, or a possible fix? Any help would be highly appreciated, thank you!
I downloaded mpss-3.8.2 (released April 25, 2017) from the page https://software.intel.com/en-us/articles/intel-manycore-platform-software-stack-mpss#lx38rel and followed the instructions provided in the readme file. As my system's kernel is slightly newer than what the MPSS download provides for, I had to recompile MPSS, which worked fine. I can also install the RPM packages, receiving the following error message (which I am not sure is related to the problem at all):
depmod: ERROR: failed to load symbols from /lib/modules/3.10.0-514.21.2.el7.x86_64/extra/nvidia-uvm.ko: Invalid argument
After having installed the mpss-software, however, I can no longer boot the system (see below).
When I execute "modprobe mic", I get the following error message three times:
NMI watchdog: BUG: soft lockup - CPU#32 stuck for 22s! [modprobe:17376]
After displaying this message three times, the command prompt reappears. I can execute "micctrl --initdefaults" without any messages being displayed.
If I then execute "micctrl -s" I get the error "mic0: reset failed".
If I try "/usr/bin/miccheck", the system freezes completely.
After having installed MPSS, I get the errors below when rebooting the system, i.e., the system cannot boot anymore. I can correct the problem by entering recovery mode and executing the "uninstall.sh" script delivered in the MPSS download. After that, I can reboot the system without problems.
The coprocessor is correctly identified by "lspci" as below and large BAR support has been enabled in the BIOS ("above 4G decoding"):
09:00.0 Co-processor: Intel Corporation Xeon Phi coprocessor 31S1 (rev 11)
---BASIC SYSTEM INFORMATION---
ASUS X99-E WS
Intel Xeon E5-2696V3
64 GB RAM
NVidia GForce 1080
---ERROR LOG WHEN REBOOTING---
[ 12.1884] pcieport 0000:00:02.0 PCIe Bus Error: severity: Uncorrected (Non-Fatal), type=Tansaction Layer, id=0010(Requester-ID)
[ 12.1885] pcieport 0000:00:02.0 device [8086:2f04] error status/mask=000040000/00000000
[ 12.1886] pcieport 0000:00:02.0 [14] Completion Timeout (First)
[ 40.0710] NMI Watchdog: BUG: soft lockup - CPU#0 stuck for 23s! [modprobe:784]
[ 68.0710] NMI Watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [modprobe:784]
[ 72.2060] INFO: rcu_sched self-detected stall on CPU { 0} (t=60001 jiffies g=135 c=134 q=2018)
[ 100.0710] NMI Watchdog: BUG: soft lockup - CPU#0 stuck for 23s! [modprobe:784]
[ 113.4049] ETC timer compensation(-1000000ppm) is much higherthan expected
[ 113.4049] pcieport 0000:00:02.0: device [8086:2f04] error status/mask=000040000/00000000
[ 113.4049] pcieport 0000:00:02.0: [14] Completion Timeout (First)
...
[ 120.8210] mce: [Hardware Error]: CPU 16: Machine Check Exception: 0 Bank 3: fe00000000800400
[ 120.8210] mce: [Hardware Error]: TSC 0 ADDR ffe0000000000000 MISC ffffffff81060ff5
[ 120.8210] mce: [Hardware Error]: PROCESSOR 0:306f2 TIME 1498668525 SOCKET 0 APIC 34 microcode 38
...
[ 120.8210] mce: [Hardware Error]: CPU 22: Machine Check Exception: 5 Bank 18: be200000008c110a
[ 120.8210] mce: [Hardware Error]: RIP !INEXACT! 10:<ffffffff81060fe6> {native_save_halt+0x6/0x10}
[ 120.8210] mce: [Hardware Error]: TSC e627fde4082 ADDR e0900fc0 MISC 74fc381600402086
[ 120.8210] mce: [Hardware Error]: PROCESSOR 0:306f2 TIME 1498661446 SOCKET 0 APIC 9 microcode 38
[ 120.8210] mce: [Hardware Error]: Some CPUs didn't answer in synchronization
[ 120.8210] mce: [Hardware Error]: Machine check: Processor context corrupt
[ 120.8210] Kernel panic - not syncing: Fatal machine check on current CPU
[ 120.8210] Shutting down cpus with NMI
[ 120.8210] Rebooting in 30 seconds..
About installing CentOS on SSD in KNL 7250
Dear ALL
I encountered a problem installing CentOS on an Adams Pass server with a KNL 7250 and an SSD disk of about 800 GiB. I want to install CentOS 7 onto this SSD, using CentOS-7-x86_64-DVD-1611.iso.
The CentOS installation uses a USB flash drive, to which the installation ISO image has been written directly with the dd command in Linux or with UltraISO in Windows.
After powering on the server, the SSD can be seen in the BIOS, but it does not appear on the CentOS Installation Destination screen.
I have tried several approaches but still cannot solve this problem, and therefore I am writing this post to ask for help.
mpirun: command not found
Hi! I am getting a "mpirun: command not found" error on my Xeon Phi card. Could you please help me solve this problem? The response of "which mpirun" on my main processor is: /opt/intel/compilers_and_libraries_2016.1.150/linux/mpi/intel64/bin/mpirun.
Intel MKL performance drop OpenMP vs TBB
Hi everyone,
I tried the below example program on KNL and I am puzzled about the huge performance difference. It computes a small matrix-matrix product using the MKL. In this (naive) example there is a 1000x performance difference when switching from OpenMP to TBB. The file was compiled with
icc -std=c++11 -O3 -xmic-avx512 -mkl -qopenmp tbb_vs_omp.cpp -o omp
icc -std=c++11 -O3 -xmic-avx512 -mkl -tbb tbb_vs_omp.cpp -o tbb
I tried a few things, e.g. using tbb::task_scheduler_init or OpenMP environment variables, but nothing seems to make the TBB version nearly as fast as the OpenMP version, or the OpenMP version as slow. Does anyone know what the problem might be and how to fix it, that is, how to configure TBB? The gap gets smaller when increasing the problem size (only 10x for N=1024).
#include <iostream>
#include <mkl.h>

constexpr size_t N = 64;
constexpr size_t RUNS = 20;

int main()
{
    double* A = (double*)_mm_malloc(N * N * sizeof(double), 64);
    double* B = (double*)_mm_malloc(N * N * sizeof(double), 64);
    double* C = (double*)_mm_malloc(N * N * sizeof(double), 64);

    VSLStreamStatePtr stream;
    vslNewStream(&stream, VSL_BRNG_SFMT19937, 1337);
    vdRngUniform(VSL_RNG_METHOD_UNIFORM_STD, stream, N * N, A, -10, 10);
    vdRngUniform(VSL_RNG_METHOD_UNIFORM_STD, stream, N * N, B, -10, 10);
    vslDeleteStream(&stream);
    std::cout << "Created matrices, N = " << N << ".\n";

    {
        double total = 0.0;
        cblas_dgemm(CBLAS_LAYOUT::CblasColMajor, CBLAS_TRANSPOSE::CblasTrans,
                    CBLAS_TRANSPOSE::CblasNoTrans, N, N, N, 1.0,
                    A, N /* lda */, B, N /* ldb */, 0.0, C, N /* ldc */);
        for (size_t i = 0; i < RUNS; ++i) {
            // A[0] = i;
            double start = dsecnd();
            cblas_dgemm(CBLAS_LAYOUT::CblasColMajor, CBLAS_TRANSPOSE::CblasTrans,
                        CBLAS_TRANSPOSE::CblasNoTrans, N, N, N, 1.0,
                        A, N /* lda */, B, N /* ldb */, 0.0, C, N /* ldc */);
            total += dsecnd() - start;
        }
        std::cout << "Time needed " << total << ", ";
    }
    std::cout << C[0] << '\n';

    _mm_free(A);
    _mm_free(B);
    _mm_free(C);
    return 0;
}
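One way to A/B the two threading backends from a single binary, rather than two differently linked executables (a sketch, on the assumption that relinking against MKL's Single Dynamic Library is acceptable):

```shell
# Link against the single dynamic library instead of -mkl -qopenmp / -mkl -tbb:
icc -std=c++11 -O3 -xmic-avx512 tbb_vs_omp.cpp -lmkl_rt -o both

# mkl_rt reads MKL_THREADING_LAYER at startup, so the backend is a per-run choice:
MKL_THREADING_LAYER=INTEL ./both   # OpenMP-threaded MKL
MKL_THREADING_LAYER=TBB   ./both   # TBB-threaded MKL
```

This rules out link-line differences as the cause and makes it easier to vary only the threading layer while holding everything else fixed.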
Can Intel Xeon Phi get data direct from another PCI device?
Hello,
Can the Intel Xeon Phi be configured to receive data directly from an FPGA board, process it, and send the result to host memory?
I have a large flow of input data and don't want redundant transfers (FPGA board -> Host Memory -> MIC -> Host Memory) over PCIe. I tried hard to find a solution by watching some Intel product videos but wasn't satisfied. I want a more elegant scheme (FPGA board -> MIC -> Host Memory). Is it possible?
Please help me out.
Any help will be appreciated.
Thank you.
mpirun with "-host mic0": error while loading shared libraries: libmkl_intel_lp64.so
When I use "-host mic0" on the host, I get an error that mic0 cannot find the file libmkl_intel_lp64.so.
[yd@yd-ws3 ~]$ mpirun -host mic0 -iface mic0 -n 1 /yd_tools/binaries_mic/yd_binary_mpiicpc_mx008_shnao_iposlm_cmvlf_dwyg
/yd_tools/binaries_mic/yd_binary_mpiicpc_mx008_shnao_iposlm_cmvlf_dwyg: error while loading shared libraries: libmkl_intel_lp64.so: cannot open shared object file: No such file or directory
I need to set LD_LIBRARY_PATH manually with "-env".
[yd@yd-ws3 ~]$ mpirun -host mic0 -iface mic0 -env LD_LIBRARY_PATH /opt/intel/mkl/lib/mic -n 1 /yd_tools/binaries_mic/yd_binary_mpiicpc_mx008_shnao_iposlm_cmvlf_dwyg
Usage: /yd_tools/binaries_mic/yd_binary_mpiicpc_mx008_shnao_iposlm_cmvlf_dwyg [ work path ]
When I run mpirun on the host or on the MIC itself, the error disappears.
[yd@yd-ws3 ~]$ mpirun -n 1 /yd_tools/binaries/yd_binary_mpiicpc_mx008_shnao_iposlm_cmvlf_dwyg
Usage: /yd_tools/binaries/yd_binary_mpiicpc_mx008_shnao_iposlm_cmvlf_dwyg [ work path ]
[yd@yd-ws3 common]$ ssh mic0
[yd@yd-ws3-mic0 ~]$ mpirun -n 1 /yd_tools/binaries_mic/yd_binary_mpiicpc_mx008_shnao_iposlm_cmvlf_dwyg
Usage: /yd_tools/binaries_mic/yd_binary_mpiicpc_mx008_shnao_iposlm_cmvlf_dwyg [ work path ]
[yd@yd-ws3-mic0 ~]$ exit
logout
Connection to mic0 closed.
This is the output of env. There is no LD_LIBRARY_PATH.
[yd@yd-ws3 ~]$ mpirun -host mic0 -iface mic0 -env LD_LIBRARY_PATH /opt/intel/mkl/lib/mic -n 1 /usr/bin/env | grep PATH
PATH=/opt/intel/compilers_and_libraries_2017.4.196/linux/bin/intel64:/opt/intel/compilers_and_libraries_2017.4.196/linux/mpi/intel64/bin:/opt/intel/debugger_2017/gdb/intel64_mic/bin:/usr/local/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/yd_tools/binaries:/home/yd/.local/bin:/home/yd/bin
MANPATH=/opt/intel/man/common:/opt/intel/compilers_and_libraries_2017.4.196/linux/mpi/man:/opt/intel/documentation_2017/en/debugger//gdb-ia/man/:/opt/intel/documentation_2017/en/debugger//gdb-mic/man/:/opt/intel/documentation_2017/en/debugger//gdb-igfx/man/:/usr/local/share/man:/usr/share/man:
LIBRARY_PATH=/opt/intel/compilers_and_libraries_2017.4.196/linux/ipp/lib/intel64:/opt/intel/compilers_and_libraries_2017.4.196/linux/compiler/lib/intel64_lin:/opt/intel/compilers_and_libraries_2017.4.196/linux/mkl/lib/intel64_lin:/opt/intel/compilers_and_libraries_2017.4.196/linux/tbb/lib/intel64/gcc4.7:/opt/intel/compilers_and_libraries_2017.4.196/linux/daal/lib/intel64_lin:/opt/intel/compilers_and_libraries_2017.4.196/linux/daal/../tbb/lib/intel64_lin/gcc4.4
MIC_LD_LIBRARY_PATH=/opt/intel/compilers_and_libraries_2017.4.196/linux/mpi/mic/lib:/opt/intel/compilers_and_libraries_2017.4.196/linux/compiler/lib/mic:/opt/intel/compilers_and_libraries_2017.4.196/linux/ipp/lib/mic:/opt/intel/mic/coi/device-linux-release/lib:/opt/intel/mic/myo/lib:/opt/intel/compilers_and_libraries_2017.4.196/linux/compiler/lib/intel64_lin_mic:/opt/intel/compilers_and_libraries_2017.4.196/linux/mkl/lib/intel64_lin_mic:/opt/intel/compilers_and_libraries_2017.4.196/linux/tbb/lib/mic
MIC_LIBRARY_PATH=/opt/intel/compilers_and_libraries_2017.4.196/linux/mpi/mic/lib:/opt/intel/compilers_and_libraries_2017.4.196/linux/compiler/lib/mic:/opt/intel/compilers_and_libraries_2017.4.196/linux/compiler/lib/intel64_lin_mic:/opt/intel/compilers_and_libraries_2017.4.196/linux/mkl/lib/intel64_lin_mic:/opt/intel/compilers_and_libraries_2017.4.196/linux/tbb/lib/mic
CPATH=/opt/intel/compilers_and_libraries_2017.4.196/linux/ipp/include:/opt/intel/compilers_and_libraries_2017.4.196/linux/mkl/include:/opt/intel/compilers_and_libraries_2017.4.196/linux/tbb/include:/opt/intel/compilers_and_libraries_2017.4.196/linux/daal/include
CLASSPATH=/opt/intel/compilers_and_libraries_2017.4.196/linux/mpi/intel64/lib/mpi.jar:/opt/intel/compilers_and_libraries_2017.4.196/linux/daal/lib/daal.jar
INFOPATH=/opt/intel/documentation_2017/en/debugger//gdb-ia/info/:/opt/intel/documentation_2017/en/debugger//gdb-mic/info/:/opt/intel/documentation_2017/en/debugger//gdb-igfx/info/
I_MPI_CMD=mpirun -host mic0 -iface mic0 -env LD_LIBRARY_PATH /opt/intel/mkl/lib/mic -n 1 /usr/bin/env
How can I avoid setting LD_LIBRARY_PATH manually with "-env"?
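Two common ways to make the setting stick, sketched below (the paths are the ones from the transcripts above; whether they fit your setup is an assumption):

```shell
# Option 1: -genv propagates the variable to every rank on every host,
# so it does not have to be repeated per -host group.
mpirun -genv LD_LIBRARY_PATH /opt/intel/mkl/lib/mic \
       -host mic0 -iface mic0 -n 1 \
       /yd_tools/binaries_mic/yd_binary_mpiicpc_mx008_shnao_iposlm_cmvlf_dwyg

# Option 2: set it once in the coprocessor-side shell profile, so every
# ssh/mpirun session on mic0 inherits it.
ssh mic0 'echo "export LD_LIBRARY_PATH=/opt/intel/mkl/lib/mic:\$LD_LIBRARY_PATH" >> ~/.profile'
```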
OpenCL support for Xeon Phi processor (Knights Landing architecture)
Hi,
Is OpenCL supported on the newer Xeon Phi processors? The most definitive information I found that it's not supported is a post from two years ago (https://software.intel.com/en-us/forums/opencl/topic/697753). Has anything changed since?
Thanks,
Viktor
BUG in XPPSL 1.5.1/2
Hi,
As I couldn't think of any other place where I should post this information to get someone to fix it:
XPPSL version 1.5.1 introduced a bug in its bundled hwloc. This results in wrong process binding with Slurm when the KNL is configured in SNC4 + Flat. In particular, when running e.g. 4 processes per KNL, the first two processes are bound correctly; the third, however, is bound to hwthread id #2 of all 64 cores, and the fourth process to hwthread id #3 of all cores. Thus processes 3 and 4 run one thread on every core instead of 4 threads on only 16 cores. SNC4 + Cache is not affected.
This bug has not been fixed in 1.5.2. I actually wanted to dig into the sources to find the exact bug; however, as it seems Intel still prefers to hide any changes to open source software as much as possible, I gave up and am just informing you this way. A public GitHub page or a simple bug tracker would be useful for users to submit bugs. If such a thing actually exists (and I do not have to register any product to get access), please tell me where I can find it.
Does the MIC really run faster than the CPU?
Hi!
I compared the speed of the CPU and the MIC by running identical C++ programs using OpenMP (both fully occupied during operation). However, in the release build, the CPU (9.7 s) is nearly 3 times faster than the MIC (26.5 s). How come!? If the MIC is actually slower than the CPU, then what is the point of using it?
The testing code is as follows:
#pragma omp parallel for reduction(+:sum)
for(int i=0; i<100000; i++)
for(int j=0; j<100000; j++)
sum += sqrt(sqrt(j^2+1) + sqrt(sqrt(i^2+1)) + 1);
For MIC, I used offload pragma to run the code.
The MIC I used is:
Intel Xeon Phi Coprocessor 7120
The CPU I used is:
Genuine Intel(R) CPU @ 1.80GHz 1.80GHz (2 processor)
Hopefully someone can tell me the reason.
Installation issue: 'modprobe mic' freezes server
Dear all,
I am trying to install MPSS 3.8.2 on OpenSUSE Leap 42.2 (in principle equivalent to SUSE 12.2). It comes with a newer kernel than the one stated in the 'readme.txt' file.
So I proceeded with the instructions found in the 'readme.txt' file. Everything was fine until
# modprobe mic
At that point the machine freezes.
I am using the following kernel:
$ uname -r
$ 4.4.27-2-default
The mic.ko is correctly located at '/lib/modules/4.4.27-2-default/extra/mic.ko'.
For information:
lspci | grep -i Co-processor
05:00.0 Co-processor: Intel Corporation Xeon Phi coprocessor 31S1 (rev 11)
42:00.0 Co-processor: Intel Corporation Xeon Phi coprocessor 5100 series (rev 11)
Ideas are more than welcome...
Intel coprocessor error: No mic cards found
My computer has a Xeon Phi coprocessor. I get the error "No mic cards found or specified in command line". I am using MPSS 3.8.2. My kernel is 2.6.32-696.6.3.el6.x86_64. I have rebuilt the kernel modules from source. I also tried reinstalling MPSS. When I run the micrasd command, the error given is:
Wed Jul 26 09:41:18 2017 MICRAS INFO : Open MCA filter log history.
Wed Jul 26 09:41:18 2017 MICRAS ERROR : No MIC device detected!