|
Outage Postmortem

25 July 2014

This website was inaccessible for 54 days, by far the longest outage in its 19-year history. The root problem was that the outage began less than a week into an eight-week vacation in Canada. A confounding problem was identifying the nature of the failure.

Background

My website was hosted on a five-year-old MSI Wind server sitting in our living room. It was connected to the Internet via a Comcast cable connection with a dynamic IP address that changes about once every 18 months. While not production-strength, this setup had been running quite reliably.

Timeline
In the days that followed, work continued to recover missing data from the original drive, update the scripts to run on the new operating system, and migrate to a more durable drive.

Impact

The outage lasted 54 days (plus a couple more days before the news articles were restored). At the rates logged in the days before the crash, this equates to a loss of about 250,000 page views and 55 gigabytes of data transfer. Much of this traffic is recreational in nature, but a significant amount is more important (such as DMP issue 102). Ultimately there was zero data loss.

Lessons

An intermittent hardware failure in the motherboard or processor caused the crash. This is a rare and unexpected failure mode that took a week of debugging to diagnose. Only an entirely redundant server could have mitigated this event. I'll search for another surplus laptop or small server so that in future a system swap can bring the site back online faster.

Difficulties were encountered restoring the news database. The CGI scripts had been written in Perl back in 1999. The most recent version of Perl is not backward compatible and can no longer run these scripts. Rather than update them, I chose to port over a thousand lines of Perl to Python. Migrating towards a single language reduces the dependency footprint of the site and makes resurrections easier.
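As a rough sanity check on the Impact figures above, the quoted totals can be turned back into the daily rates they imply. This is only a back-of-envelope sketch: the three constants come from the text, the function name is made up for illustration, and because the totals are rounded estimates the resulting rates are approximate.

```python
# Back-of-envelope check of the outage impact figures quoted in the Impact section.
# The totals are rounded estimates, so the implied rates are approximate.
OUTAGE_DAYS = 54
LOST_PAGE_VIEWS = 250_000
LOST_GIGABYTES = 55


def implied_daily_rates(days, page_views, gigabytes):
    """Return (page views per day, gigabytes per day) implied by the totals."""
    return page_views / days, gigabytes / days


views_per_day, gb_per_day = implied_daily_rates(
    OUTAGE_DAYS, LOST_PAGE_VIEWS, LOST_GIGABYTES
)
print(f"{views_per_day:,.0f} page views/day, {gb_per_day:.2f} GB/day")
# prints: 4,630 page views/day, 1.02 GB/day
```

In other words, the site was serving roughly 4,600 page views and about one gigabyte per day before the crash.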
|