Sierra partial storage outage

FutureGrid Hardware Outage Information

Status: Resolved
Type: Hardware
Impacted systems: sierra, storage at SDSC
Start of outage: Thu, 27 Sep 2012, 15:01 PDT
Anticipated end of outage: Tue, 02 Oct 2012, 16:00 PDT

Description

Sierra's storage is once again offline -- we will reboot and do some more testing before we bring it back online.

Update 9/28 11am PST: Sierra has been online again since around 11pm last night, and we have been working on backups of the existing data. We have been unable to get ZFS to re-integrate two replacement disks cleanly, so we are taking fresh backups in case we need to rebuild the storage. The backups will take a good part of today and tomorrow to finish, and the storage occasionally freezes up and needs to be rebooted. We apologize for the inconvenience.

Update 9/30 7am PST: Backups are complete. We will be taking the system offline for a few hours to see if we can resolve the disk re-integration problem described above.

Update 9/30 9am PST: Unable to re-integrate disks. Will need to rebuild the filesystem from backups. Will start after 12pm.

Update 9/30 2pm PST: Rebuild is starting now. It will likely take a few days for all filesystems to be restored.

Update 9/30 10pm PST: Storage has been rebuilt. The compute nodes will be rebooted in the morning to resolve the stale NFS errors.

Resolution

All nodes were rebooted on 10/1, and the Nimbus install has also been updated today. Initial tests indicate that everything is working. Please file a ticket with help@futuregrid.org if anything seems amiss.