Dear friends, can you help us with the following problem?
We have an Intel Xeon Phi 5110P installed on an ASUS P8Z77 WS motherboard.
The OS is CentOS 6 with the required kernel version.
We installed MPSS 3.3.2 and tried to bring the Xeon Phi online and to update its flash, but got "reset failed" or timeout messages.
Is the card broken?
Here are the logs showing the details:
1. ifconfig shows the mic0 interface:
[root@171202-1 openflow]# ifconfig
eth4      Link encap:Ethernet  HWaddr 40:16:7E:34:E8:08
          inet addr:192.168.0.66  Bcast:192.168.0.255  Mask:255.255.255.0
          UP BROADCAST MULTICAST  MTU:1500  Metric:1
          RX packets:0 errors:0 dropped:0 overruns:0 frame:0
          TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:0 (0.0 b)  TX bytes:0 (0.0 b)
          Interrupt:18 Memory:f0700000-f0720000

eth5      Link encap:Ethernet  HWaddr 82:50:FD:BC:9A:C7
          inet addr:172.31.1.254  Bcast:172.31.1.255  Mask:255.255.255.0
          inet6 addr: fe80::8050:fdff:febc:9ac7/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:11 errors:0 dropped:0 overruns:0 frame:0
          TX packets:4 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:660 (660.0 b)  TX bytes:288 (288.0 b)
          Interrupt:17 Memory:f0800000-f0820000

lo        Link encap:Local Loopback
          inet addr:127.0.0.1  Mask:255.0.0.0
          inet6 addr: ::1/128 Scope:Host
          UP LOOPBACK RUNNING  MTU:16436  Metric:1
          RX packets:116 errors:0 dropped:0 overruns:0 frame:0
          TX packets:116 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0
          RX bytes:9828 (9.5 KiB)  TX bytes:9828 (9.5 KiB)

mic0      Link encap:Ethernet  HWaddr 82:50:FD:BC:9A:C7
          inet addr:172.31.1.254  Bcast:172.31.1.255  Mask:255.255.255.0
          inet6 addr: fe80::8050:fdff:febc:9ac7/64 Scope:Link
          UP BROADCAST RUNNING  MTU:64512  Metric:1
          RX packets:0 errors:0 dropped:0 overruns:0 frame:0
          TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:0 (0.0 b)  TX bytes:0 (0.0 b)
2. After booting Linux, mpssd was not running (miccheck shows this), so we started it manually.
[root@171202-1 openflow]# miccheck
MicCheck 3.4-r1
Copyright 2013 Intel Corporation All Rights Reserved

Executing default tests for host
  Test 0: Check number of devices the OS sees in the system ... pass
  Test 1: Check mic driver is loaded ... pass
  Test 2: Check number of devices driver sees in the system ... pass
  Test 3: Check mpssd daemon is running ... fail
    mpssd daemon not running

Status: FAIL
Failure: mpssd daemon not running
...
[root@171202-1 openflow]# service mpss start
Starting Intel(R) MPSS: [FAILED]
[root@171202-1 openflow]# mpssd &
[1] 3578
[root@171202-1 openflow]# Error aquiring lockfile /var/lock/mpss: File exists
[root@171202-1 openflow]# ps -A | grep mpss
 3566 pts/0 00:00:00 mpssd
 3578 pts/0 00:00:00 mpssd
 3579 pts/0 00:00:00 mpssd <defunct>
3. miccheck now fails on test 4, "Check device is in online state and its postcode is FF":
[root@171202-1 openflow]# miccheck
MicCheck 3.4-r1
Copyright 2013 Intel Corporation All Rights Reserved

Executing default tests for host
  Test 0: Check number of devices the OS sees in the system ... pass
  Test 1: Check mic driver is loaded ... pass
  Test 2: Check number of devices driver sees in the system ... pass
  Test 3: Check mpssd daemon is running ... pass
Executing default tests for device: 0
  Test 4 (mic0): Check device is in online state and its postcode is FF ... fail
    device is not online: reset failed

Status: FAIL
Failure: A device test failed
4. We dumped the output of lspci -vvv (the complete file lspci_dump.txt is attached).
03:00.0 Co-processor: Intel Corporation Xeon Phi coprocessor 5100 series (rev 11)
        Subsystem: Intel Corporation Device 2500
        Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+
        Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
        Latency: 0, Cache Line Size: 64 bytes
        Interrupt: pin A routed to IRQ 16
        Region 0: Memory at e00000000 (64-bit, prefetchable) [size=8G]
        Region 4: Memory at f0400000 (64-bit, non-prefetchable) [size=128K]
        Capabilities: [44] Power Management version 3
                Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0+,D1-,D2-,D3hot-,D3cold-)
                Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
        Capabilities: [4c] Express (v2) Endpoint, MSI 00
                DevCap: MaxPayload 256 bytes, PhantFunc 0, Latency L0s <4us, L1 <64us
                        ExtTag+ AttnBtn- AttnInd- PwrInd- RBE+ FLReset-
                DevCtl: Report errors: Correctable- Non-Fatal- Fatal- Unsupported-
                        RlxdOrd- ExtTag- PhantFunc- AuxPwr- NoSnoop+
                        MaxPayload 128 bytes, MaxReadReq 512 bytes
                DevSta: CorrErr+ UncorrErr- FatalErr- UnsuppReq+ AuxPwr- TransPend-
                LnkCap: Port #0, Speed 5GT/s, Width x16, ASPM L0s L1, Latency L0 <4us, L1 unlimited
                        ClockPM- Surprise- LLActRep- BwNot-
                LnkCtl: ASPM Disabled; RCB 64 bytes Disabled- Retrain- CommClk+
                        ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
                LnkSta: Speed 5GT/s, Width x16, TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
                DevCap2: Completion Timeout: Range AB, TimeoutDis+, LTR-, OBFF Not Supported
                DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis-, LTR-, OBFF Disabled
                LnkCtl2: Target Link Speed: 5GT/s, EnterCompliance- SpeedDis-
                         Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-
                         Compliance De-emphasis: -6dB
                LnkSta2: Current De-emphasis Level: -6dB, EqualizationComplete-, EqualizationPhase1-
                         EqualizationPhase2-, EqualizationPhase3-, LinkEqualizationRequest-
        Capabilities: [88] MSI: Enable- Count=1/16 Maskable- 64bit+
                Address: 0000000000000000  Data: 0000
        Capabilities: [98] MSI-X: Enable+ Count=16 Masked-
                Vector table: BAR=4 offset=00017000
                PBA: BAR=4 offset=00018000
        Capabilities: [100 v1] Advanced Error Reporting
                UESta:  DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
                UEMsk:  DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
                UESvrt: DLP+ SDES- TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol-
                CESta:  RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr+
                CEMsk:  RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr+
                AERCap: First Error Pointer: 00, GenCap- CGenEn- ChkCap- ChkEn-
        Kernel driver in use: mic
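To make the status bits in the lspci dump easier to read, here is a small Python sketch that lists the flags marked "+" on a given register line. The function name set_flags is our own helper (hypothetical, not part of lspci or MPSS), and it assumes the dump has one register per line, as lspci -vvv prints it. We do not know whether the flags it reports are related to the reset failure.

```python
import re

def set_flags(lspci_text, register):
    """Return the flags marked '+' on the first line starting with '<register>:'.

    Hypothetical helper for reading 'lspci -vvv' output; only the first
    physical line of the register is inspected, continuation lines are ignored.
    """
    prefix = register + ":"
    for line in lspci_text.splitlines():
        line = line.strip()
        if line.startswith(prefix):
            # A flag is a word immediately followed by '+', e.g. "CorrErr+".
            return re.findall(r"(\w+)\+", line[len(prefix):])
    return []

# Excerpt copied from the dump above.
SAMPLE = """\
        DevSta: CorrErr+ UncorrErr- FatalErr- UnsuppReq+ AuxPwr- TransPend-
        LnkSta2: Current De-emphasis Level: -6dB, EqualizationComplete-, EqualizationPhase1-
        CESta: RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr+
"""

if __name__ == "__main__":
    for reg in ("DevSta", "CESta"):
        print(reg, set_flags(SAMPLE, reg))
```

For the excerpt, this reports CorrErr and UnsuppReq for DevSta and NonFatalErr for CESta.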
5. We tried to reset the Xeon Phi with micctrl:
[root@171202-1 openflow]# micctrl -s
mic0: reset failed
[root@171202-1 openflow]# micctrl -rw
mic0: resetting
[Error] Timeout booting MIC, check your installation
6. During the reset, the Linux log file "messages" (the complete file is attached) shows entries like these:
....
Oct 25 13:42:16 171202-1 kernel: mic0: Resetting (Post Code 3C)
Oct 25 13:42:17 171202-1 kernel: mic0: Resetting (Post Code 3d)
Oct 25 13:42:18 171202-1 kernel: mic0: Resetting (Post Code 3d)
Oct 25 13:42:19 171202-1 kernel: mic0: Resetting (Post Code 3d)
Oct 25 13:42:20 171202-1 kernel: mic0: Resetting (Post Code 3d)
Oct 25 13:42:21 171202-1 kernel: mic0: Resetting (Post Code 3E)
Oct 25 13:42:22 171202-1 kernel: mic0: Resetting (Post Code 3E)
Oct 25 13:42:23 171202-1 kernel: mic0: Resetting (Post Code 3E)
Oct 25 13:42:24 171202-1 kernel: mic0: Resetting (Post Code F2)
Oct 25 13:42:24 171202-1 kernel: Reattempting reset after F2/F4 failure
Oct 25 13:42:24 171202-1 kernel: mic0: Transition from state resetting to resetting
Oct 25 13:42:26 171202-1 kernel: mic0: Resetting (Post Code 3C)
Oct 25 13:42:27 171202-1 kernel: mic0: Resetting (Post Code 3d)
Oct 25 13:42:28 171202-1 kernel: mic0: Resetting (Post Code 3d)
Oct 25 13:42:29 171202-1 kernel: mic0: Resetting (Post Code 3d)
Oct 25 13:42:30 171202-1 kernel: mic0: Resetting (Post Code 3d)
Oct 25 13:42:31 171202-1 kernel: mic0: Resetting (Post Code 3E)
Oct 25 13:42:32 171202-1 kernel: mic0: Resetting (Post Code 3E)
Oct 25 13:42:33 171202-1 kernel: mic0: Resetting (Post Code 3E)
Oct 25 13:42:34 171202-1 kernel: mic0: Resetting (Post Code F2)
Oct 25 13:42:34 171202-1 kernel: Reattempting reset after F2/F4 failure
Oct 25 13:42:34 171202-1 kernel: mic0: Transition from state resetting to resetting
Oct 25 13:42:36 171202-1 kernel: mic0: Resetting (Post Code 3C)
Oct 25 13:42:37 171202-1 kernel: mic0: Resetting (Post Code 3d)
Oct 25 13:42:38 171202-1 kernel: mic0: Resetting (Post Code 3d)
Oct 25 13:42:38 171202-1 kernel: mic0: Transition from state resetting to reset failed
Oct 25 13:42:38 171202-1 kernel: MIC 0 RESETFAIL postcode 3d 25651
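To see the pattern in this log at a glance, here is a rough Python sketch that collapses the repeated post codes into one sequence per reset attempt. The function name summarize_post_codes is our own (hypothetical, not an MPSS tool), and it assumes one log entry per line, as in /var/log/messages. It only restates what the log says and does not interpret the post codes.

```python
import re

# Matches entries such as "kernel: mic0: Resetting (Post Code 3d)".
POST_CODE_RE = re.compile(r"(mic\d+): Resetting \(Post Code ([0-9A-Fa-f]{2})\)")

def summarize_post_codes(log_text):
    """Return {device: [[code, ...], ...]} with one inner list per reset attempt.

    Consecutive repeats of a code are collapsed, codes are upper-cased, and a
    new attempt starts when the driver logs "Reattempting reset".
    """
    attempts = {}
    for line in log_text.splitlines():
        if "Reattempting reset" in line:
            # This message names no device, so open a new attempt for each
            # device seen so far (an assumption that fits a single-card host).
            for runs in attempts.values():
                if runs[-1]:
                    runs.append([])
            continue
        match = POST_CODE_RE.search(line)
        if not match:
            continue
        device, code = match.group(1), match.group(2).upper()
        runs = attempts.setdefault(device, [[]])
        if not runs[-1] or runs[-1][-1] != code:
            runs[-1].append(code)
    return attempts

# Shortened excerpt of the log above.
SAMPLE = """\
Oct 25 13:42:16 171202-1 kernel: mic0: Resetting (Post Code 3C)
Oct 25 13:42:17 171202-1 kernel: mic0: Resetting (Post Code 3d)
Oct 25 13:42:18 171202-1 kernel: mic0: Resetting (Post Code 3d)
Oct 25 13:42:21 171202-1 kernel: mic0: Resetting (Post Code 3E)
Oct 25 13:42:24 171202-1 kernel: mic0: Resetting (Post Code F2)
Oct 25 13:42:24 171202-1 kernel: Reattempting reset after F2/F4 failure
Oct 25 13:42:24 171202-1 kernel: mic0: Transition from state resetting to resetting
Oct 25 13:42:26 171202-1 kernel: mic0: Resetting (Post Code 3C)
Oct 25 13:42:27 171202-1 kernel: mic0: Resetting (Post Code 3d)
"""

if __name__ == "__main__":
    for device, runs in summarize_post_codes(SAMPLE).items():
        for number, codes in enumerate(runs, start=1):
            print(device, "attempt", number, "->", " ".join(codes))
```

Run on the full log, this should show the first two attempts going 3C, 3D, 3E, F2 and the third stopping at 3D, where the driver reports RESETFAIL.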
7. We used minicom to connect to /dev/tty/MIC0, but we only get "Initialization modem".
8. micinfo results
MicInfo Utility Log
Created Sat Oct 25 13:53:07 2014

    System Info
        HOST OS : Linux
        OS Version : 2.6.32-431.el6.x86_64
        Driver Version : 3.4-1
        MPSS Version : 3.4
        Host Physical Memory : 32555 MB

Device No: 0, Device Name: mic0

    Version
        Flash Version : NotAvailable
        SMC Firmware Version : NotAvailable
        SMC Boot Loader Version : NotAvailable
        uOS Version : NotAvailable
        Device Serial Number : NotAvailable

    Board
        Vendor ID : 0x8086
        Device ID : 0x2250
        Subsystem ID : 0x2500
        Coprocessor Stepping ID : 3
        PCIe Width : x16
        PCIe Speed : 5 GT/s
        PCIe Max payload size : 128 bytes
        PCIe Max read req size : 512 bytes
        Coprocessor Model : 0x01
        Coprocessor Model Ext : 0x00
        Coprocessor Type : 0x00
        Coprocessor Family : 0x0b
        Coprocessor Family Ext : 0x00
        Coprocessor Stepping : B1
        Board SKU : B1PRQ-5110P/5120D
        ECC Mode : NotAvailable
        SMC HW Revision : NotAvailable

    Cores
        Total No of Active Cores : NotAvailable
        Voltage : NotAvailable
        Frequency : NotAvailable

    Thermal
        Fan Speed Control : NotAvailable
        Fan RPM : NotAvailable
        Fan PWM : NotAvailable
        Die Temp : NotAvailable

    GDDR
        GDDR Vendor : NotAvailable
        GDDR Version : NotAvailable
        GDDR Density : NotAvailable
        GDDR Size : NotAvailable
        GDDR Technology : NotAvailable
        GDDR Speed : NotAvailable
        GDDR Frequency : NotAvailable
        GDDR Voltage : NotAvailable
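Since most micinfo fields come back as NotAvailable, here is a small Python sketch that separates the fields the host could read from those it could not. The function name split_micinfo_fields is our own (hypothetical), and it assumes the output is saved one "name : value" field per line, as micinfo prints it.

```python
def split_micinfo_fields(text):
    """Split micinfo 'name : value' lines into readable and unreadable fields.

    Returns (available, missing): a dict of fields that have a value and a
    list of field names reported as NotAvailable. Lines without ' : '
    (section headers, the timestamp line) are skipped.
    """
    available, missing = {}, []
    for line in text.splitlines():
        name, separator, value = line.partition(" : ")
        if not separator:
            continue
        name, value = name.strip(), value.strip()
        if value == "NotAvailable":
            missing.append(name)
        else:
            available[name] = value
    return available, missing

# Excerpt of the micinfo output above.
SAMPLE = """\
    System Info
        Driver Version : 3.4-1
    Version
        Flash Version : NotAvailable
        uOS Version : NotAvailable
    Board
        Board SKU : B1PRQ-5110P/5120D
"""

if __name__ == "__main__":
    available, missing = split_micinfo_fields(SAMPLE)
    print("readable:", available)
    print("NotAvailable:", missing)
```

In our output, the readable fields are the host information and the PCIe and board identification values. Everything under Version, Cores, Thermal and GDDR is NotAvailable.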
We also tried the Xeon Phi with 64-bit Red Hat Enterprise Linux 7.0 (kernel 3.10.0-123) and got the same result. In addition, we tried it with Microsoft Windows Server 2012 R2 (64-bit): there the MPSS installation fails and rolls back, and the installation log shows that it cannot reset the Xeon Phi either.