Sierra partial storage outage
FutureGrid Hardware Outage Information
- Status: Resolved
- Type: Hardware
- Impacted systems: sierra, storage at SDSC
- Start of outage: Thu, 27 Sep 2012, 15:01 PDT
- Anticipated end of outage: Tue, 02 Oct 2012, 16:00 PDT
Description
Sierra's storage is once again offline -- we will reboot and do some more testing before we bring it back online.
Update 9/28 11am PST: Sierra has been online again since around 11pm last night, and we have been working on backups of the existing data. We have been unable to get ZFS to re-integrate two replacement disks cleanly, so we are taking fresh backups in case we need to rebuild the storage. The backups will take a good part of today and tomorrow to finish, and the storage occasionally freezes up and needs to be rebooted. We apologize for the inconvenience.
Update 9/30 7am PST: Backups are complete. We will be taking the system offline for a few hours to see if we can resolve the disk re-integration problem described above.
Update 9/30 9am PST: Unable to re-integrate disks. Will need to rebuild the filesystem from backups. Will start after 12pm.
Update 9/30 2pm PST: Rebuild is starting now. It will likely take a few days for all filesystems to be restored.
Update 9/30 10pm PST: Storage has been rebuilt. The compute nodes will be rebooted in the morning to resolve the stale NFS file handle errors.
Resolution
All nodes were rebooted on 10/1 and the Nimbus install has also been updated today. From some initial tests, everything seems to be working. Please file a ticket with help@futuregrid.org if anything seems amiss.