XRAY is unavailable

FutureGrid Hardware Outage Information

Status: Resolved
Type: Hardware
Impacted systems: xray
Start of outage: Thu, 17 Oct 2013 (All day)
Anticipated end of outage: Tue, 22 Oct 2013, 08:30 EDT

Description

XRAY is currently unavailable, possibly because of a hardware fault. Support personnel are investigating the issue and will post updates when more is known.

UPDATE 22-OCT 08:30: A faulty voltage regulator has been identified and replaced. The system is available.

UPDATE 21-OCT 18:00: XRAY is entirely offline for further troubleshooting and diagnostics.

UPDATE 19-OCT 22:00: XRAY is available at reduced capacity. Four compute blades (16 nodes) are currently disabled.

UPDATE 18-OCT 21:30: Three blades will not boot. Cray engineers suspect failure of the L0 controllers on the blades. Replacements will be shipped. We attempted to disable the three blades and restart the system, but the high-speed interconnect routing is not initializing properly. The problem is under investigation.

UPDATE 18-OCT 10:45: Three blades (affecting 8 compute nodes and 2 service nodes) are offline. The storage I/O node is also offline. A case has been filed with Cray for support.

Resolution

Replaced a shorted voltage regulator module (VRM) on node c0-0c1s3.