I am currently managing a CentOS host system with several Xeon Phi 5110P coprocessors. One of the coprocessors (mic0) is exhibiting issues with accessing the SMC buffers, making it difficult to (1) perform/verify firmware updates via micflash, (2) verify coprocessor operations via miccheck, and (3) get power and thermal information via micsmc. The other coprocessors in the system do not exhibit these issues.
Output from "micflash -update -device 0"
No image path specified - Searching: /usr/share/mpss/flash
mic0: Flash image: /usr/share/mpss/flash/EXT_HP2_B1_0390-02.rom.smc
mic0: Flash update started
mic0: Flash update done
mic0: SMC update started
micflash: mic0: SMC update failed: SMC buffer size exceeded (0x1)
mic0: Transitioning to ready state
Please restart host for flash changes to take effect
Output from "miccheck -d 0"
MicCheck 3.3-r1
Copyright 2013 Intel Corporation All Rights Reserved

Executing default tests for host
  Test 0: Check number of devices the OS sees in the system ... pass
  Test 1: Check mic driver is loaded ... pass
  Test 2: Check number of devices driver sees in the system ... pass
  Test 3: Check mpssd daemon is running ... pass
Executing default tests for device: 0
  Test 4 (mic0): Check device is in online state and its postcode is FF ... pass
  Test 5 (mic0): Check ras daemon is available in device ... pass
  Test 6 (mic0): Check running flash version is correct ... pass
  Test 7 (mic0): Check running SMC firmware version is correct ... fail
    failed to get thermal information

Status: FAIL
Failure: failed to get thermal information
This failure appears to stem only from the missing thermal information, not from the firmware version itself. The output of "micsmc" and "micflash -getversion" supports this when compared against mic1:
Output from "micsmc -a mic0"
mic0 (info):
   Device Series: ........... Intel(R) Xeon Phi(TM) coprocessor x100 family
   Device ID: ............... 0x2250
   Number of Cores: ......... 60
   OS Version: .............. 2.6.38.8+mpss3.3
   Flash Version: ........... 2.1.02.0390
   Driver Version: .......... 3.3-1 (<hostname omitted>)
   Stepping: ................ 0x3
   Substepping: ............. 0x0

Error: mic0: while accessing device temperature data: thermal info: RAS: cmd 0x25: Error 0x7: SMC communication error
Error: mic0: while accessing device frequency data: power limits info: RAS: cmd 0x2a: Error 0x7: SMC communication error

mic0 (mem):
   Free Memory: ............. 7404.34 MB
   Total Memory: ............ 7697.61 MB
   Memory Usage: ............ 293.27 MB

mic0 (cores):
   Device Utilization: User: 0.00%, System: 0.01%, Idle: 99.99%
   Per Core Utilization (60 cores in use)
   <output omitted: mic0 (cores) is okay>
Output from "micsmc -a mic1"
mic1 (info):
   Device Series: ........... Intel(R) Xeon Phi(TM) coprocessor x100 family
   Device ID: ............... 0x2250
   Number of Cores: ......... 60
   OS Version: .............. 2.6.38.8+mpss3.3
   Flash Version: ........... 2.1.02.0390
   Driver Version: .......... 3.3-1 (<hostname omitted>)
   Stepping: ................ 0x3
   Substepping: ............. 0x0

mic1 (temp):
   Cpu Temp: ................ 48.00 C
   Memory Temp: ............. 39.00 C
   Fan-In Temp: ............. 31.00 C
   Fan-Out Temp: ............ 39.00 C
   Core Rail Temp: .......... 36.00 C
   Uncore Rail Temp: ........ 38.00 C
   Memory Rail Temp: ........ 38.00 C

mic1 (freq):
   Core Frequency: .......... 1.05 GHz
   Total Power: ............. 103.00 Watts
   Low Power Limit: ......... 257.00 Watts
   High Power Limit: ........ 306.00 Watts
   Physical Power Limit: .... 326.00 Watts

mic1 (mem):
   Free Memory: ............. 7372.31 MB
   Total Memory: ............ 7697.61 MB
   Memory Usage: ............ 325.30 MB

mic1 (cores):
   Device Utilization: User: 0.00%, System: 0.04%, Idle: 99.96%
   Per Core Utilization (60 cores in use)
   <output omitted>
Output of "micinfo -d 0"
MicInfo Utility Log
Copyright 2011-2013 Intel Corporation All Rights Reserved.

Created Wed Sep 24 21:01:13 2014

   System Info
      HOST OS : Linux
      OS Version : 2.6.32-431.23.3.el6.x86_64
      Driver Version : 3.3-1
      MPSS Version : 3.3
      Host Physical Memory : 32846 MB

Device No: 0, Device Name: mic0

micinfo: Failed to get thermal info: RAS: cmd 0x25: Error 0x7: SMC communication error: Success
micinfo: version info failed: RAS: cmd 0x25: Error 0x7: SMC communication error: Success
Output of "micinfo -d 1"
MicInfo Utility Log
Copyright 2011-2013 Intel Corporation All Rights Reserved.

Created Wed Sep 24 20:59:51 2014

   System Info
      HOST OS : Linux
      OS Version : 2.6.32-431.23.3.el6.x86_64
      Driver Version : 3.3-1
      MPSS Version : 3.3
      Host Physical Memory : 32846 MB

Device No: 1, Device Name: mic1

   Version
      Flash Version : 2.1.02.0390
      SMC Firmware Version : 1.16.5078
      SMC Boot Loader Version : 1.8.4326
      uOS Version : 2.6.38.8+mpss3.3
      Device Serial Number : ADKC32601544

   Board
      Vendor ID : 0x8086
      Device ID : 0x2250
      Subsystem ID : 0x2500
      Coprocessor Stepping ID : 3
      PCIe Width : x16
      PCIe Speed : 5 GT/s
      PCIe Max payload size : 256 bytes
      PCIe Max read req size : 512 bytes
      Coprocessor Model : 0x01
      Coprocessor Model Ext : 0x00
      Coprocessor Type : 0x00
      Coprocessor Family : 0x0b
      Coprocessor Family Ext : 0x00
      Coprocessor Stepping : B1
      Board SKU : B1PRQ-5110P/5120D
      ECC Mode : Enabled
      SMC HW Revision : Product 225W Passive CS

   Cores
      Total No of Active Cores : 60
      Voltage : 934000 uV
      Frequency : 1052631 kHz

   Thermal
      Fan Speed Control : N/A
      Fan RPM : N/A
      Fan PWM : N/A
      Die Temp : 47 C

   GDDR
      GDDR Vendor : Elpida
      GDDR Version : 0x1
      GDDR Density : 2048 Mb
      GDDR Size : 7936 MB
      GDDR Technology : GDDR5
      GDDR Speed : 5.000000 GT/s
      GDDR Frequency : 2500000 kHz
      GDDR Voltage : 1501000 uV
miccheck passes on mic1. The firmware and SMC bootloader on mic1 are up to date, so mic0 should report similar values, assuming micflash did its job on mic0 for both the firmware update (verified via micsmc above and via "micflash -getversion -device 0") and the bootloader update (not verified; I don't know how to check it except with micinfo).
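To make the mic0 vs. mic1 comparison less manual, here is a minimal Python sketch that parses saved micinfo text and lists the fields a healthy card reports but the suspect card does not. The function names (parse_micinfo, missing_fields) and the sample strings are my own illustration, not part of MPSS; the samples are short excerpts of the output above, and the parser simply assumes the "Key : Value" layout seen there.

```python
import re

# Matches "Key : Value" lines; assumes whitespace on both sides of the colon,
# as in the micinfo output above. Error lines ("micinfo: ...") do not match.
FIELD_RE = re.compile(r"^\s*(.+?)\s+:\s+(.*?)\s*$")


def parse_micinfo(text):
    """Return (fields, errors) parsed from micinfo text output."""
    fields, errors = {}, []
    for line in text.splitlines():
        stripped = line.strip()
        if stripped.startswith("micinfo:"):
            # Lines such as "micinfo: Failed to get thermal info: ..."
            errors.append(stripped)
            continue
        match = FIELD_RE.match(line)
        if match:
            fields[match.group(1)] = match.group(2)
    return fields, errors


def missing_fields(reference, suspect):
    """Keys present in the reference card's fields but absent from the suspect's."""
    return sorted(key for key in reference if key not in suspect)


# Excerpts of the outputs posted above (hypothetical variable names).
MIC0_SAMPLE = """\
System Info
   HOST OS : Linux
   Driver Version : 3.3-1
   MPSS Version : 3.3
Device No: 0, Device Name: mic0
micinfo: Failed to get thermal info: RAS: cmd 0x25: Error 0x7: SMC communication error: Success
micinfo: version info failed: RAS: cmd 0x25: Error 0x7: SMC communication error: Success
"""

MIC1_SAMPLE = """\
System Info
   HOST OS : Linux
   Driver Version : 3.3-1
   MPSS Version : 3.3
Device No: 1, Device Name: mic1
Version
   Flash Version : 2.1.02.0390
   SMC Firmware Version : 1.16.5078
   SMC Boot Loader Version : 1.8.4326
Thermal
   Die Temp : 47 C
"""

if __name__ == "__main__":
    mic0_fields, mic0_errors = parse_micinfo(MIC0_SAMPLE)
    mic1_fields, _ = parse_micinfo(MIC1_SAMPLE)
    print("mic0 errors:", len(mic0_errors))
    print("missing on mic0:", missing_fields(mic1_fields, mic0_fields))
```

In practice the text would come from the saved output of "micinfo -d 0" and "micinfo -d 1" rather than from inline strings. Note that this only shows which fields mic0 fails to report; it cannot tell whether the SMC bootloader on mic0 was actually updated.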
I used these references, but they were of minimal help:
- Flash issues and remedies: https://software.intel.com/en-us/forums/topic/494772
- Flash version too old? https://software.intel.com/en-us/forums/topic/402175
- Cannot monitor MICs with micsmc: https://software.intel.com/en-us/forums/topic/402397
I hope I don't have to get a replacement for mic0, but it looks like that might be necessary if I want power and thermal readings from it.