Channel: Intel® Software - Intel® Many Integrated Core Architecture (Intel MIC Architecture)

Intel Xeon Phi 5110P - device is not online

Dear friends, can you help us with the following problem?

We have an Intel Xeon Phi 5110P installed on an ASUS P8Z77 WS motherboard.

The OS is CentOS 6 with the required kernel version.

We installed MPSS 3.3.2 and tried to switch the Xeon Phi to online mode and to update its flash, but got "reset failed" or timeout messages.

Is it broken?

Here are the logs showing the details:

1. ifconfig shows the mic0 interface

[root@171202-1 openflow]# ifconfig
eth4      Link encap:Ethernet  HWaddr 40:16:7E:34:E8:08
          inet addr:192.168.0.66  Bcast:192.168.0.255  Mask:255.255.255.0
          UP BROADCAST MULTICAST  MTU:1500  Metric:1
          RX packets:0 errors:0 dropped:0 overruns:0 frame:0
          TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:0 (0.0 b)  TX bytes:0 (0.0 b)
          Interrupt:18 Memory:f0700000-f0720000

eth5      Link encap:Ethernet  HWaddr 82:50:FD:BC:9A:C7
          inet addr:172.31.1.254  Bcast:172.31.1.255  Mask:255.255.255.0
          inet6 addr: fe80::8050:fdff:febc:9ac7/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:11 errors:0 dropped:0 overruns:0 frame:0
          TX packets:4 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:660 (660.0 b)  TX bytes:288 (288.0 b)
          Interrupt:17 Memory:f0800000-f0820000

lo        Link encap:Local Loopback
          inet addr:127.0.0.1  Mask:255.0.0.0
          inet6 addr: ::1/128 Scope:Host
          UP LOOPBACK RUNNING  MTU:16436  Metric:1
          RX packets:116 errors:0 dropped:0 overruns:0 frame:0
          TX packets:116 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0
          RX bytes:9828 (9.5 KiB)  TX bytes:9828 (9.5 KiB)

mic0      Link encap:Ethernet  HWaddr 82:50:FD:BC:9A:C7
          inet addr:172.31.1.254  Bcast:172.31.1.255  Mask:255.255.255.0
          inet6 addr: fe80::8050:fdff:febc:9ac7/64 Scope:Link
          UP BROADCAST RUNNING  MTU:64512  Metric:1
          RX packets:0 errors:0 dropped:0 overruns:0 frame:0
          TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:0 (0.0 b)  TX bytes:0 (0.0 b)

2. After booting Linux, mpssd was not running (miccheck shows this), so we started it.

[root@171202-1 openflow]# miccheck
MicCheck 3.4-r1
Copyright 2013 Intel Corporation All Rights Reserved

Executing default tests for host
  Test 0: Check number of devices the OS sees in the system ... pass
  Test 1: Check mic driver is loaded ... pass
  Test 2: Check number of devices driver sees in the system ... pass
  Test 3: Check mpssd daemon is running ... fail
    mpssd daemon not running

Status: FAIL
Failure: mpssd daemon not running
...
[root@171202-1 openflow]# service mpss start
Starting Intel(R) MPSS:                                    [FAILED]
[root@171202-1 openflow]# mpssd &
[1] 3578
[root@171202-1 openflow]# Error aquiring lockfile /var/lock/mpss: File exists

[root@171202-1 openflow]# ps -A | grep mpss
 3566 pts/0    00:00:00 mpssd
 3578 pts/0    00:00:00 mpssd
 3579 pts/0    00:00:00 mpssd <defunct>

3. miccheck fails on test 4 - "Check device is in online state and its postcode is FF"

[root@171202-1 openflow]# miccheck
MicCheck 3.4-r1
Copyright 2013 Intel Corporation All Rights Reserved

Executing default tests for host
  Test 0: Check number of devices the OS sees in the system ... pass
  Test 1: Check mic driver is loaded ... pass
  Test 2: Check number of devices driver sees in the system ... pass
  Test 3: Check mpssd daemon is running ... pass
Executing default tests for device: 0
  Test 4 (mic0): Check device is in online state and its postcode is FF ... fail
    device is not online: reset failed

Status: FAIL
Failure: A device test failed
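
For anyone who wants to pull the failing tests out of miccheck output like the above automatically, here is a minimal sketch in Python. It assumes the line layout shown in this post ("Test N: description ... pass/fail", with the reason on the following indented line); `failed_tests` and `SAMPLE` are hypothetical names for illustration, not part of MPSS.

```python
import re

# Matches lines such as:
#   "  Test 3: Check mpssd daemon is running ... fail"
#   "  Test 4 (mic0): Check device is in online state and its postcode is FF ... fail"
# Assumption: this layout is taken from the miccheck output in this post only.
TEST_LINE = re.compile(
    r"^\s*Test (?P<num>\d+)(?: \((?P<dev>\w+)\))?: "
    r"(?P<desc>.*?) \.\.\. (?P<result>pass|fail)\s*$"
)


def failed_tests(miccheck_output):
    """Return (test number, device or None, description, reason) per failed test."""
    failures = []
    lines = miccheck_output.splitlines()
    for i, line in enumerate(lines):
        m = TEST_LINE.match(line)
        if m and m.group("result") == "fail":
            # In the output above, the reason is printed on the next line.
            reason = lines[i + 1].strip() if i + 1 < len(lines) else ""
            failures.append(
                (int(m.group("num")), m.group("dev"), m.group("desc"), reason)
            )
    return failures


# Excerpt copied from the second miccheck run in this post.
SAMPLE = """\
Executing default tests for host
  Test 3: Check mpssd daemon is running ... pass
Executing default tests for device: 0
  Test 4 (mic0): Check device is in online state and its postcode is FF ... fail
    device is not online: reset failed
"""

if __name__ == "__main__":
    for failure in failed_tests(SAMPLE):
        print(failure)
```

In practice the text would come from running miccheck and capturing its standard output rather than from a pasted string.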

4. We created a dump of the lspci -vvv command (the complete file lspci_dump.txt is attached).

03:00.0 Co-processor: Intel Corporation Xeon Phi coprocessor 5100 series (rev 11)
	Subsystem: Intel Corporation Device 2500
	Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+
	Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
	Latency: 0, Cache Line Size: 64 bytes
	Interrupt: pin A routed to IRQ 16
	Region 0: Memory at e00000000 (64-bit, prefetchable) [size=8G]
	Region 4: Memory at f0400000 (64-bit, non-prefetchable) [size=128K]
	Capabilities: [44] Power Management version 3
		Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0+,D1-,D2-,D3hot-,D3cold-)
		Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
	Capabilities: [4c] Express (v2) Endpoint, MSI 00
		DevCap:	MaxPayload 256 bytes, PhantFunc 0, Latency L0s <4us, L1 <64us
			ExtTag+ AttnBtn- AttnInd- PwrInd- RBE+ FLReset-
		DevCtl:	Report errors: Correctable- Non-Fatal- Fatal- Unsupported-
			RlxdOrd- ExtTag- PhantFunc- AuxPwr- NoSnoop+
			MaxPayload 128 bytes, MaxReadReq 512 bytes
		DevSta:	CorrErr+ UncorrErr- FatalErr- UnsuppReq+ AuxPwr- TransPend-
		LnkCap:	Port #0, Speed 5GT/s, Width x16, ASPM L0s L1, Latency L0 <4us, L1 unlimited
			ClockPM- Surprise- LLActRep- BwNot-
		LnkCtl:	ASPM Disabled; RCB 64 bytes Disabled- Retrain- CommClk+
			ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
		LnkSta:	Speed 5GT/s, Width x16, TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
		DevCap2: Completion Timeout: Range AB, TimeoutDis+, LTR-, OBFF Not Supported
		DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis-, LTR-, OBFF Disabled
		LnkCtl2: Target Link Speed: 5GT/s, EnterCompliance- SpeedDis-
			 Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-
			 Compliance De-emphasis: -6dB
		LnkSta2: Current De-emphasis Level: -6dB, EqualizationComplete-, EqualizationPhase1-
			 EqualizationPhase2-, EqualizationPhase3-, LinkEqualizationRequest-
	Capabilities: [88] MSI: Enable- Count=1/16 Maskable- 64bit+
		Address: 0000000000000000  Data: 0000
	Capabilities: [98] MSI-X: Enable+ Count=16 Masked-
		Vector table: BAR=4 offset=00017000
		PBA: BAR=4 offset=00018000
	Capabilities: [100 v1] Advanced Error Reporting
		UESta:	DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
		UEMsk:	DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
		UESvrt:	DLP+ SDES- TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol-
		CESta:	RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr+
		CEMsk:	RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr+
		AERCap:	First Error Pointer: 00, GenCap- CGenEn- ChkCap- ChkEn-
	Kernel driver in use: mic 
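
To make the status bits in a dump like this easier to scan, the small Python sketch below lists which flags on a single lspci line are set (marked with "+"). It only assumes the "Name+ / Name-" notation visible in the dump above; `set_flags` is a hypothetical helper name, and the sketch does not try to interpret what any flag means.

```python
import re

# A flag is a name directly followed by "+" (set) or "-" (clear),
# e.g. "CorrErr+" or "DLActive-" in the lspci -vvv dump above.
FLAG = re.compile(r"([A-Za-z0-9]+)([+-])(?=\s|$)")


def set_flags(line):
    """Return the names of all flags marked '+' on one lspci line."""
    # Drop the field label ("DevSta:", "LnkSta:", ...) before scanning.
    _, _, rest = line.partition(":")
    return [name for name, sign in FLAG.findall(rest) if sign == "+"]


if __name__ == "__main__":
    # Two lines copied from the dump in this post.
    devsta = "DevSta:\tCorrErr+ UncorrErr- FatalErr- UnsuppReq+ AuxPwr- TransPend-"
    lnksta = ("LnkSta:\tSpeed 5GT/s, Width x16, TrErr- Train- SlotClk+ "
              "DLActive- BWMgmt- ABWMgmt-")
    print(set_flags(devsta))
    print(set_flags(lnksta))
```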

5. We tried to reset the Xeon Phi with micctrl

[root@171202-1 openflow]# micctrl -s
mic0: reset failed
[root@171202-1 openflow]# micctrl -rw
          mic0: resetting
  [Error] Timeout booting MIC, check your installation

6. During the reset, the Linux log file "messages" (the complete file is attached) shows something like this

....
Oct 25 13:42:16 171202-1 kernel: mic0: Resetting (Post Code 3C)
Oct 25 13:42:17 171202-1 kernel: mic0: Resetting (Post Code 3d)
Oct 25 13:42:18 171202-1 kernel: mic0: Resetting (Post Code 3d)
Oct 25 13:42:19 171202-1 kernel: mic0: Resetting (Post Code 3d)
Oct 25 13:42:20 171202-1 kernel: mic0: Resetting (Post Code 3d)
Oct 25 13:42:21 171202-1 kernel: mic0: Resetting (Post Code 3E)
Oct 25 13:42:22 171202-1 kernel: mic0: Resetting (Post Code 3E)
Oct 25 13:42:23 171202-1 kernel: mic0: Resetting (Post Code 3E)
Oct 25 13:42:24 171202-1 kernel: mic0: Resetting (Post Code F2)
Oct 25 13:42:24 171202-1 kernel: Reattempting reset after F2/F4 failure
Oct 25 13:42:24 171202-1 kernel: mic0: Transition from state resetting to resetting
Oct 25 13:42:26 171202-1 kernel: mic0: Resetting (Post Code 3C)
Oct 25 13:42:27 171202-1 kernel: mic0: Resetting (Post Code 3d)
Oct 25 13:42:28 171202-1 kernel: mic0: Resetting (Post Code 3d)
Oct 25 13:42:29 171202-1 kernel: mic0: Resetting (Post Code 3d)
Oct 25 13:42:30 171202-1 kernel: mic0: Resetting (Post Code 3d)
Oct 25 13:42:31 171202-1 kernel: mic0: Resetting (Post Code 3E)
Oct 25 13:42:32 171202-1 kernel: mic0: Resetting (Post Code 3E)
Oct 25 13:42:33 171202-1 kernel: mic0: Resetting (Post Code 3E)
Oct 25 13:42:34 171202-1 kernel: mic0: Resetting (Post Code F2)
Oct 25 13:42:34 171202-1 kernel: Reattempting reset after F2/F4 failure
Oct 25 13:42:34 171202-1 kernel: mic0: Transition from state resetting to resetting
Oct 25 13:42:36 171202-1 kernel: mic0: Resetting (Post Code 3C)
Oct 25 13:42:37 171202-1 kernel: mic0: Resetting (Post Code 3d)
Oct 25 13:42:38 171202-1 kernel: mic0: Resetting (Post Code 3d)
Oct 25 13:42:38 171202-1 kernel: mic0: Transition from state resetting to reset failed
Oct 25 13:42:38 171202-1 kernel: MIC 0 RESETFAIL postcode 3d 25651 
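
Since the full messages file is long, the reset attempts can be condensed with a short script. The Python sketch below is only an illustration under the assumption that the kernel lines keep the exact wording shown above ("Resetting (Post Code XX)", "Reattempting reset", "RESETFAIL postcode XX"); `summarize_resets` and `SAMPLE_LOG` are hypothetical names. It collapses repeated post codes within each attempt and reports the post code at the final failure, without interpreting what the codes mean.

```python
import re

# Patterns based only on the log wording shown in this post.
POST_CODE = re.compile(r"mic\d+: Resetting \(Post Code ([0-9A-Fa-f]{2})\)")
RESET_FAIL = re.compile(r"RESETFAIL postcode ([0-9A-Fa-f]{2})")
RETRY_MARKER = "Reattempting reset"


def summarize_resets(log_text):
    """Return (attempts, final_code).

    attempts   -- one list of distinct consecutive post codes per reset attempt
    final_code -- post code reported on the RESETFAIL line, or None
    """
    attempts = [[]]
    final_code = None
    for line in log_text.splitlines():
        m = POST_CODE.search(line)
        if m:
            code = m.group(1).upper()  # the log mixes "3d" and "3E"
            if not attempts[-1] or attempts[-1][-1] != code:
                attempts[-1].append(code)
            continue
        if RETRY_MARKER in line:
            attempts.append([])  # the driver starts another attempt
            continue
        m = RESET_FAIL.search(line)
        if m:
            final_code = m.group(1).upper()
    return [a for a in attempts if a], final_code


# Shortened excerpt of the log in this post.
SAMPLE_LOG = """\
Oct 25 13:42:31 171202-1 kernel: mic0: Resetting (Post Code 3E)
Oct 25 13:42:34 171202-1 kernel: mic0: Resetting (Post Code F2)
Oct 25 13:42:34 171202-1 kernel: Reattempting reset after F2/F4 failure
Oct 25 13:42:36 171202-1 kernel: mic0: Resetting (Post Code 3C)
Oct 25 13:42:37 171202-1 kernel: mic0: Resetting (Post Code 3d)
Oct 25 13:42:38 171202-1 kernel: mic0: Resetting (Post Code 3d)
Oct 25 13:42:38 171202-1 kernel: MIC 0 RESETFAIL postcode 3d 25651
"""

if __name__ == "__main__":
    attempts, final_code = summarize_resets(SAMPLE_LOG)
    for number, codes in enumerate(attempts, start=1):
        print(f"attempt {number}: {' -> '.join(codes)}")
    print("post code at failure:", final_code)
```

Run against the attached messages file, this would give one line per reset attempt instead of hundreds of raw log lines.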

7. We used minicom to connect to /dev/tty/MIC0, but got only "Initialization modem"

8. micinfo results

MicInfo Utility Log
Created Sat Oct 25 13:53:07 2014


	System Info
		HOST OS			: Linux
		OS Version		: 2.6.32-431.el6.x86_64
		Driver Version		: 3.4-1
		MPSS Version		: 3.4
		Host Physical Memory	: 32555 MB

Device No: 0, Device Name: mic0

	Version
		Flash Version 		 : NotAvailable
		SMC Firmware Version	 : NotAvailable
		SMC Boot Loader Version	 : NotAvailable
		uOS Version 		 : NotAvailable
		Device Serial Number 	 : NotAvailable

	Board
		Vendor ID 		 : 0x8086
		Device ID 		 : 0x2250
		Subsystem ID 		 : 0x2500
		Coprocessor Stepping ID	 : 3
		PCIe Width 		 : x16
		PCIe Speed 		 : 5 GT/s
		PCIe Max payload size	 : 128 bytes
		PCIe Max read req size	 : 512 bytes
		Coprocessor Model	 : 0x01
		Coprocessor Model Ext	 : 0x00
		Coprocessor Type	 : 0x00
		Coprocessor Family	 : 0x0b
		Coprocessor Family Ext	 : 0x00
		Coprocessor Stepping 	 : B1
		Board SKU 		 : B1PRQ-5110P/5120D
		ECC Mode 		 : NotAvailable
		SMC HW Revision 	 : NotAvailable

	Cores
		Total No of Active Cores : NotAvailable
		Voltage 		 : NotAvailable
		Frequency 		 : NotAvailable

	Thermal
		Fan Speed Control 	 : NotAvailable
		Fan RPM 		 : NotAvailable
		Fan PWM 		 : NotAvailable
		Die Temp		 : NotAvailable

	GDDR
		GDDR Vendor		 : NotAvailable
		GDDR Version		 : NotAvailable
		GDDR Density		 : NotAvailable
		GDDR Size		 : NotAvailable
		GDDR Technology		 : NotAvailable
		GDDR Speed		 : NotAvailable
		GDDR Frequency		 : NotAvailable
		GDDR Voltage		 : NotAvailable 

We also tried the Xeon Phi with Red Hat* Enterprise Linux* 64-bit 7.0 (kernel 3.10.0-123) and got the same result. With Microsoft Windows Server 2012 R2 (64-bit), the MPSS installation fails and rolls back; the installation log shows that it cannot reset the Xeon Phi either.

Attachments:
lspci_dump.txt (61.85 KB)
messages.txt (470.38 KB)
