Channel: Intel® Software - Intel® Many Integrated Core Architecture (Intel MIC Architecture)

SMC inaccessible; Cannot get power or thermal readings


I am currently managing a CentOS host system with several Xeon Phi 5100P coprocessors. One of the coprocessors (mic0) is exhibiting issues with accessing the SMC buffers, making it difficult to (1) perform/verify firmware updates via micflash, (2) verify coprocessor operations via miccheck, and (3) get power and thermal information via micsmc. The other coprocessors in the system do not exhibit these issues.

 

Output from "micflash -update -device 0"

No image path specified - Searching: /usr/share/mpss/flash
mic0: Flash image: /usr/share/mpss/flash/EXT_HP2_B1_0390-02.rom.smc
mic0: Flash update started
mic0: Flash update done
mic0: SMC update started
micflash: mic0: SMC update failed: SMC buffer size exceeded (0x1)

mic0: Transitioning to ready state

Please restart host for flash changes to take effect

 

Output from "miccheck -d 0"

MicCheck 3.3-r1
Copyright 2013 Intel Corporation All Rights Reserved

Executing default tests for host
  Test 0: Check number of devices the OS sees in the system ... pass
  Test 1: Check mic driver is loaded ... pass
  Test 2: Check number of devices driver sees in the system ... pass
  Test 3: Check mpssd daemon is running ... pass
Executing default tests for device: 0
  Test 4 (mic0): Check device is in online state and its postcode is FF ... pass
  Test 5 (mic0): Check ras daemon is available in device ... pass
  Test 6 (mic0): Check running flash version is correct ... pass
  Test 7 (mic0): Check running SMC firmware version is correct ... fail
    failed to get thermal information

Status: FAIL
Failure: failed to get thermal information

This failure appears to be caused only by the missing thermal information, not by the firmware version. The output of "micsmc" and "micflash -getversion", checked against mic1, supports this:

 

Output from "micsmc -a mic0"

mic0 (info):
   Device Series: ........... Intel(R) Xeon Phi(TM) coprocessor x100 family
   Device ID: ............... 0x2250
   Number of Cores: ......... 60
   OS Version: .............. 2.6.38.8+mpss3.3
   Flash Version: ........... 2.1.02.0390
   Driver Version: .......... 3.3-1 (<hostname omitted>)
   Stepping: ................ 0x3
   Substepping: ............. 0x0
Error: mic0: while accessing device temperature data: thermal info: RAS: cmd 0x25: Error 0x7: SMC communication error
Error: mic0: while accessing device frequency data: power limits info: RAS: cmd 0x2a: Error 0x7: SMC communication error

mic0 (mem):
   Free Memory: ............. 7404.34 MB
   Total Memory: ............ 7697.61 MB
   Memory Usage: ............ 293.27 MB

mic0 (cores):
   Device Utilization: User:   0.00%,   System:   0.01%,   Idle:  99.99%
   Per Core Utilization (60 cores in use)
<output omitted: mic0 (cores) is okay>

 

Output from "micsmc -a mic1"

mic1 (info):
   Device Series: ........... Intel(R) Xeon Phi(TM) coprocessor x100 family
   Device ID: ............... 0x2250
   Number of Cores: ......... 60
   OS Version: .............. 2.6.38.8+mpss3.3
   Flash Version: ........... 2.1.02.0390
   Driver Version: .......... 3.3-1 (<hostname omitted>)
   Stepping: ................ 0x3
   Substepping: ............. 0x0

mic1 (temp):
   Cpu Temp: ................ 48.00 C
   Memory Temp: ............. 39.00 C
   Fan-In Temp: ............. 31.00 C
   Fan-Out Temp: ............ 39.00 C
   Core Rail Temp: .......... 36.00 C
   Uncore Rail Temp: ........ 38.00 C
   Memory Rail Temp: ........ 38.00 C

mic1 (freq):
   Core Frequency: .......... 1.05 GHz
   Total Power: ............. 103.00 Watts
   Low Power Limit: ......... 257.00 Watts
   High Power Limit: ........ 306.00 Watts
   Physical Power Limit: .... 326.00 Watts

mic1 (mem):
   Free Memory: ............. 7372.31 MB
   Total Memory: ............ 7697.61 MB
   Memory Usage: ............ 325.30 MB

mic1 (cores):
   Device Utilization: User:   0.00%,   System:   0.04%,   Idle:  99.96%
   Per Core Utilization (60 cores in use)
<output omitted>

 

Output of "micinfo -d 0"

MicInfo Utility Log
Copyright 2011-2013 Intel Corporation All Rights Reserved.

Created Wed Sep 24 21:01:13 2014


        System Info
                HOST OS                 : Linux
                OS Version              : 2.6.32-431.23.3.el6.x86_64
                Driver Version          : 3.3-1
                MPSS Version            : 3.3
                Host Physical Memory    : 32846 MB

Device No: 0, Device Name: mic0
micinfo: Failed to get thermal info: RAS: cmd 0x25: Error 0x7: SMC communication error: Success
micinfo: version info failed: RAS: cmd 0x25: Error 0x7: SMC communication error: Success

 

Output of "micinfo -d 1"

MicInfo Utility Log
Copyright 2011-2013 Intel Corporation All Rights Reserved.

Created Wed Sep 24 20:59:51 2014


        System Info
                HOST OS                 : Linux
                OS Version              : 2.6.32-431.23.3.el6.x86_64
                Driver Version          : 3.3-1
                MPSS Version            : 3.3
                Host Physical Memory    : 32846 MB

Device No: 1, Device Name: mic1

        Version
                Flash Version            : 2.1.02.0390
                SMC Firmware Version     : 1.16.5078
                SMC Boot Loader Version  : 1.8.4326
                uOS Version              : 2.6.38.8+mpss3.3
                Device Serial Number     : ADKC32601544

        Board
                Vendor ID                : 0x8086
                Device ID                : 0x2250
                Subsystem ID             : 0x2500
                Coprocessor Stepping ID  : 3
                PCIe Width               : x16
                PCIe Speed               : 5 GT/s
                PCIe Max payload size    : 256 bytes
                PCIe Max read req size   : 512 bytes
                Coprocessor Model        : 0x01
                Coprocessor Model Ext    : 0x00
                Coprocessor Type         : 0x00
                Coprocessor Family       : 0x0b
                Coprocessor Family Ext   : 0x00
                Coprocessor Stepping     : B1
                Board SKU                : B1PRQ-5110P/5120D
                ECC Mode                 : Enabled
                SMC HW Revision          : Product 225W Passive CS

        Cores
                Total No of Active Cores : 60
                Voltage                  : 934000 uV
                Frequency                : 1052631 kHz

        Thermal
                Fan Speed Control        : N/A
                Fan RPM                  : N/A
                Fan PWM                  : N/A
                Die Temp                 : 47 C

        GDDR
                GDDR Vendor              : Elpida
                GDDR Version             : 0x1
                GDDR Density             : 2048 Mb
                GDDR Size                : 7936 MB
                GDDR Technology          : GDDR5
                GDDR Speed               : 5.000000 GT/s
                GDDR Frequency           : 2500000 kHz
                GDDR Voltage             : 1501000 uV

 

miccheck on mic1 passes. The firmware and SMC bootloader on mic1 are up to date, so mic0 should report similar values, assuming micflash did its job on mic0 for both the firmware update (verified via micsmc above and via micflash -getversion -device 0) and the bootloader update (not verified; I don't know how to check it other than with micinfo).
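To make the mic0/mic1 comparison less manual, something like the following could be used. It is only a minimal sketch: it assumes the output of "micinfo -d 0" and "micinfo -d 1" has been saved as text in the format shown above, and the function names (parse_versions, smc_errors, missing_fields) are hypothetical helpers, not part of MPSS.

```python
import re

# Field labels as they appear in the "Version" block of the micinfo output above.
VERSION_FIELDS = (
    "Flash Version",
    "SMC Firmware Version",
    "SMC Boot Loader Version",
    "uOS Version",
)

_FIELD_RE = re.compile(
    r"^\s*(" + "|".join(re.escape(f) for f in VERSION_FIELDS) + r")\s*:\s*(.+?)\s*$"
)


def parse_versions(micinfo_text):
    """Return {field: value} for the version fields found in micinfo output."""
    found = {}
    for line in micinfo_text.splitlines():
        match = _FIELD_RE.match(line)
        if match:
            found[match.group(1)] = match.group(2)
    return found


def smc_errors(tool_text):
    """Return every line that reports an SMC communication error."""
    return [
        line.strip()
        for line in tool_text.splitlines()
        if "SMC communication error" in line
    ]


def missing_fields(good_text, bad_text):
    """List version fields the healthy card reports but the failing card does not."""
    good = parse_versions(good_text)
    bad = parse_versions(bad_text)
    return sorted(field for field in good if field not in bad)
```

Passing the saved mic1 text as good_text and the saved mic0 text as bad_text would list which version fields mic0 fails to report, and smc_errors on the mic0 text would collect the RAS error lines for a support request.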

I consulted a few references, but they were of minimal help.

I hope I don't have to get a replacement for mic0, but it looks like that might be necessary if I want power and thermal readings from it.

