Write an incident report about a power outage in our data center that took some servers offline.
Incident Report: Data Center Power Outage
Incident Details
- Incident Type: Data center power outage
- Date/Time: July 8, 2023 at 2:34am PT
- Location: Las Vegas data center, Rack 4A
Description
A power surge caused an electrical fault and fire in Rack 4A of the Las Vegas data center, taking 3 production database servers and 6 application servers offline. The halon fire suppression system activated and extinguished the fire.
Impact
- 8 customer-facing applications were offline for 2 hours and 43 minutes
- Approximately 1.2TB of customer data may have been lost due to database corruption from the hard shutdown of the database servers
- No customer data was exfiltrated or compromised
Resolution
- The affected servers were isolated from the network and thoroughly inspected by the infrastructure team
- Corrupted hard drives were replaced
- Data was restored from the last full backup, taken at 1:00am PT
- Failover to the secondary data center was initiated for disaster recovery
- Root cause was determined to be a faulty power distribution unit in Rack 4A
- The incident was mitigated and all systems were fully restored by 5:17am PT
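The gap between the last full backup and the start of the incident bounds how much written data the restore could not recover. A minimal sketch of that arithmetic, using only the two times stated in this report (the variable names are illustrative, not part of any tooling):

```python
from datetime import datetime

# Times taken from this report (all PT, July 8, 2023).
last_full_backup = datetime(2023, 7, 8, 1, 0)
incident_start = datetime(2023, 7, 8, 2, 34)

# Writes made after the backup and before the hard shutdown are not in the backup.
exposure = incident_start - last_full_backup
exposure_minutes = int(exposure.total_seconds() // 60)

print(f"Unprotected write window: {exposure_minutes} minutes")
```

Under these figures the window is 94 minutes, which is the effective recovery point achieved in this incident.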
Root Cause Analysis
The root cause of the incident was traced to a malfunctioning power distribution unit (PDU) in Rack 4A of the Las Vegas data center. The faulty PDU caused a power surge that led to an electrical fire in the rack, damaging servers and triggering the fire suppression system.
Key contributing factors:
- Lack of redundant power distribution to Rack 4A
- Inadequate real-time monitoring and alerting for electrical systems
- The PDU was past its recommended replacement cycle
Preventative Measures
To avoid similar incidents in the future, the following preventative measures are recommended:
- Upgrade all data center PDUs and implement redundant power distribution to all critical racks
- Set up comprehensive electrical monitoring and alerting to identify anomalies in real time
- Review and enhance data center maintenance procedures, replacing equipment proactively based on the recommended lifecycle
- Conduct regular failover drills to secondary data centers to minimize downtime
- Investigate options for real-time database replication to the secondary data center to reduce the recovery point objective (RPO)
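As an illustration of the electrical monitoring recommendation, the sketch below flags PDU readings that deviate sharply from a rolling baseline. It is a simplified example under stated assumptions, not a description of any existing monitoring system: the function name, the window size, the threshold, the `min_sigma` floor, and the sample readings are all hypothetical.

```python
from statistics import mean, stdev

def find_anomalies(readings, window=5, threshold=3.0, min_sigma=0.1):
    """Return indices of readings that differ from the mean of the preceding
    `window` readings by more than `threshold` standard deviations."""
    anomalies = []
    for i in range(window, len(readings)):
        baseline = readings[i - window:i]
        mu = mean(baseline)
        # Floor the deviation so a perfectly flat baseline does not
        # turn every tiny fluctuation into an alert.
        sigma = max(stdev(baseline), min_sigma)
        if abs(readings[i] - mu) > threshold * sigma:
            anomalies.append(i)
    return anomalies

# Hypothetical PDU current draw in amps, sampled at a fixed interval.
samples = [12.1, 12.0, 12.2, 12.1, 12.0, 12.1, 19.8, 12.2]
print(find_anomalies(samples))  # prints [6]
```

A production setup would more likely rely on the PDU vendor's telemetry and an established alerting platform; the point here is only that a spike such as the one at index 6 can be caught within a single sampling interval.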
Incident Timeline
- 2:34am PT: Power surge and electrical fire occur in Rack 4A
- 2:35am PT: Halon fire suppression activates and extinguishes fire; outage begins impacting applications
- 2:40am PT: Incident detected and Ops team engaged, troubleshooting begins
- 3:15am PT: Affected servers isolated, root cause identified as PDU failure
- 3:50am PT: Decision made to fail over to secondary data center
- 4:30am PT: Data restore from backup initiated
- 5:17am PT: Incident mitigated and all systems restored
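The timeline above can be turned into elapsed-time figures such as time to detect and time to restore. The following sketch does so using only the timestamps listed in this report; the labels and the helper name `minutes_since_start` are illustrative.

```python
from datetime import datetime

# Timeline entries from this report (all times PT on July 8, 2023).
TIMELINE = [
    (datetime(2023, 7, 8, 2, 34), "Power surge and electrical fire"),
    (datetime(2023, 7, 8, 2, 35), "Fire suppressed, outage begins"),
    (datetime(2023, 7, 8, 2, 40), "Incident detected, Ops engaged"),
    (datetime(2023, 7, 8, 3, 15), "Servers isolated, root cause identified"),
    (datetime(2023, 7, 8, 3, 50), "Decision to fail over"),
    (datetime(2023, 7, 8, 4, 30), "Restore from backup initiated"),
    (datetime(2023, 7, 8, 5, 17), "All systems restored"),
]

def minutes_since_start(timeline):
    """Return (label, whole minutes elapsed since the first entry) pairs."""
    start = timeline[0][0]
    return [(label, int((ts - start).total_seconds() // 60)) for ts, label in timeline]

elapsed = minutes_since_start(TIMELINE)
for label, minutes in elapsed:
    print(f"+{minutes:>3} min  {label}")
```

With these timestamps, detection came 6 minutes after the surge, the failover decision at 76 minutes, and full restoration at 163 minutes (2 hours and 43 minutes), matching the downtime stated under Impact.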
Author
John Smith
Senior SRE, Acme Inc.