Neil's News

Outage Postmortem

25 July 2014

This website was inaccessible for 54 days, by far the longest outage in its 19-year history. The root problem was that the outage began less than a week into an eight-week vacation in Canada. A compounding problem was the difficulty of identifying the nature of the failure.

Background

My website was hosted on a five-year-old MSI Wind server sitting in our living room. It was connected to the Internet via a Comcast cable connection. This connection has a dynamic IP address that changes about once every 18 months. While not production-strength, this setup had been running quite reliably.

Timeline

26/May/2014
At 19:58:14 PDT a computer in San Francisco, California loaded a sagittal plane animation of a brain from my website. This was the last logged activity on my server. Quynh and I were at Niagara Falls and were unaware of the crash.
28/May/2014
Assuming that the IP address has simply changed, I contact a neighbour and ask that he connect to my wifi and report back the public IP address. He informs me that the address is unchanged.
9/June/2014
Quynh returns from Canada to California as scheduled and attempts to reboot the server. Reboots fail, reporting "Unable to mount root fs". This appears to be a drive failure. Rebuilding a Linux server from bare metal is beyond Quynh's ken, so it must wait until my return.
13/July/2014
I finally return to California and confirm the apparent drive crash. Swapping in a new drive fails because the server requires SATA whereas my spare drives are both IDE. I order a pair of SATA drives (2 day shipping).
15/July/2014
The new SATA drives arrive. However, I'm unable to install Linux. The server appears flakey. The memory test fails at random locations. I order 1GB of replacement memory (2 day shipping).
17/July/2014
The new memory arrives. However, the flakey behaviour remains unchanged. Therefore the issue is the motherboard or the non-removable Atom processor. The MS-7418 motherboard in my server is non-standard and cannot be replaced. Thus the server is a write-off.
18/July/2014
An old Thinkpad T43p laptop is located in the great scrap heap at Google (MTV-42). Its hard drive is missing, but I temporarily get a USB external drive hooked up, install Ubuntu, and start restoring the website from backups.
19/July/2014
At 20:54:49 PDT a computer in Podolsk, Russia loaded a compressed JavaScript diff library from my website. This was the first logged activity on the replacement server (my own check that the website was up didn't occur until 50 seconds later).

In the days that followed, work continued to recover missing data from the original drive, update the scripts to run on the new operating system, and migrate to a more durable drive.

Impact

The outage lasted 54 days (plus a couple more days before the news articles were restored). At the rates logged in the days before the crash, this equates to a loss of 250,000 page views and 55 gigabytes of data. Much of this traffic is recreational in nature, but a significant amount is more important (such as DMP issue 102).
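
As a rough check of these figures, the sketch below recomputes the outage length from the two log timestamps in the timeline and derives the per-day rates implied by the quoted totals. It assumes PDT is a fixed UTC-7 offset. The per-day rates are back-calculated from the totals rather than read from the logs, and the variable names are purely illustrative.

```python
from datetime import datetime, timedelta, timezone

# Assumption: PDT treated as a fixed UTC-7 offset.
PDT = timezone(timedelta(hours=-7))

# Last request before the crash and first request after recovery (from the timeline).
last_hit = datetime(2014, 5, 26, 19, 58, 14, tzinfo=PDT)
first_hit = datetime(2014, 7, 19, 20, 54, 49, tzinfo=PDT)

outage = first_hit - last_hit
outage_days = outage / timedelta(days=1)  # outage length as a fractional number of days

# Totals quoted in the text; the daily rates below are derived from them.
LOST_PAGE_VIEWS = 250_000
LOST_GIGABYTES = 55

views_per_day = LOST_PAGE_VIEWS / outage_days
gb_per_day = LOST_GIGABYTES / outage_days

print(f"Outage: {outage.days} days, {outage.seconds // 60} minutes")
print(f"Implied rate: {views_per_day:.0f} page views/day, {gb_per_day:.2f} GB/day")
```

By this reckoning the gap between the two log entries is 54 days and 56 minutes, which works out to roughly 4,600 page views and about 1 GB of traffic per day.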

Ultimately there was zero data loss.

Lessons

An intermittent hardware failure in the motherboard or processor caused the crash. This is a rare and unexpected failure mode that took a week of debugging to diagnose. Only an entirely redundant server could have mitigated this event. I'll search for another surplus laptop or small server so that in future a system swap can bring the site back online faster.

Difficulties were encountered restoring the news database. The CGI scripts had been written in Perl back in 1999. The most recent version of Perl is not backward compatible and can no longer run these scripts. Rather than update them, I chose to port more than a thousand lines of Perl to Python. Migrating towards a single language reduces the dependency footprint of the site and makes resurrections easier.

[Server, router, modem, hard drive]


 
-------------------------------------
Legal yada yada: My views do not necessarily represent those of my employer or my goldfish.