Outage Report & Learning Curve


Recently we told you about our move to new server hosting on Amazon Web Services. We transitioned DNN and DerbyLife fully to the new servers just under two weeks ago, although there’s still a lot of work to do to get it all working the way we want.

Saturday night’s influx of site visitors looking for the Bay Area at Rose City bout led to our first outage on the new hosting. This points to one challenge of growth: while AWS has features that will make our sites more reliable and robust even under sudden, massive traffic increases, there’s still a learning curve associated with implementing those features. As with anything in life, sometimes we’re gonna learn the hard way, when something goes wrong.

We’ve identified the causes of Saturday’s outage, and we’re confident that we’ve reconfigured our setup to avoid the same issues in the future. What follows is a breakdown in some fairly geeky gory (gorky?) detail, for those of you who are interested.

I’m going to start by describing the configuration we intend to put in place:

Load Balancer => Autoscaling group => 2x web servers => database server => database server hot spare, plus automatic backup snapshots for all

Once fully implemented, this represents several layers of improved reliability over our old configuration. DNN previously kept both web server and database functions on a single server; DerbyLife split them across two boxes, one for each function. Neither site had any failover capability. Backups were manual and sporadic.
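
For the extra-geeky, here’s roughly what that topology looks like in code. This is just an illustrative sketch using boto3, the AWS SDK for Python — the names, sizes, and AMI below are made-up placeholders, not our actual settings.

```python
import boto3

autoscaling = boto3.client("autoscaling")

# A launch configuration describes what each web server should look like.
autoscaling.create_launch_configuration(
    LaunchConfigurationName="web-server-config",  # hypothetical name
    ImageId="ami-12345678",          # placeholder AMI with the web stack baked in
    InstanceType="m1.small",         # placeholder size
)

# The autoscaling group keeps at least two identical web servers running
# behind the load balancer, and can add more when traffic spikes.
autoscaling.create_auto_scaling_group(
    AutoScalingGroupName="web-servers",           # hypothetical name
    LaunchConfigurationName="web-server-config",
    MinSize=2,
    MaxSize=4,
    AvailabilityZones=["us-east-1a", "us-east-1b"],
    LoadBalancerNames=["site-load-balancer"],     # hypothetical ELB name
)
```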

By late last week, we’d accomplished about half of the overall configuration plan. We’d set up a dedicated database server to feed an identical pair of web servers, which we then placed behind a load balancer to distribute traffic evenly across both. Having two web servers means a failure in one doesn’t take the sites down — when a server fails a (very frequent) health check, the load balancer automatically stops sending incoming connections to it.
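
That health-check behavior isn’t something we wrote ourselves — it’s a couple of settings on the load balancer. A minimal sketch (again boto3, with placeholder names and numbers):

```python
import boto3

elb = boto3.client("elb")  # the classic Elastic Load Balancing API

# Check the front page over HTTP every 30 seconds; after two failed checks
# in a row, the load balancer stops sending traffic to that web server.
elb.configure_health_check(
    LoadBalancerName="site-load-balancer",   # hypothetical name
    HealthCheck={
        "Target": "HTTP:80/",
        "Interval": 30,
        "Timeout": 5,
        "UnhealthyThreshold": 2,
        "HealthyThreshold": 2,
    },
)

# At any time we can ask which attached web servers are in or out of service.
health = elb.describe_instance_health(LoadBalancerName="site-load-balancer")
for state in health["InstanceStates"]:
    print(state["InstanceId"], state["State"])   # "InService" or "OutOfService"
```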

Things came unspooled on Saturday for several reasons. First, in the process of making changes and updates, one of the web servers was removed from the load balancer and (inadvertently) not returned to it… leaving only one active web server. Which should be ok, because that’s just like our previous situation, with just one server for DNN, right?

Well, not exactly. Because we expect to have a minimum of two web servers up, we’ve allocated less CPU and RAM to each than we had on the single server. A single smaller server on its own is fine for routine traffic, but under even a moderate load spike, it quickly reaches its limits and grinds to a halt under the barrage. That happened about a minute before first whistle, leaving no functioning web servers behind the load balancer to serve DNN and DerbyLife.

With the really excellent server management tools provided by AWS, diagnosing and fixing these problems is downright trivial. If, that is, you’re aware that there’s a problem. If you already had the page up for the bout, the video stream continued to work just fine, because it came from livestream.com’s servers and didn’t have to touch DNN’s (failed) servers. We DNN folks all had the page up, video fullscreened, some time before the bout started, because that’s how we roll.

It took just over 20 minutes before one of us noticed that new pageloads weren’t working. I don’t remember exactly what tipped us off — might’ve been a glance at Facebook, might’ve been an idle look at the scores page, might’ve been Grand Poobah trying to enter scores into a Derbymatic interface that wouldn’t come up. In any case, once we noticed, we had the trouble diagnosed in about three minutes. It took perhaps five more minutes to return the second web server to the load balancer, then spin up a *third*, larger web server instance and place it behind the load balancer as well. Total downtime was just under 30 minutes.
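
For the record, both of those fixes boil down to a handful of API calls. Something like this sketch, with made-up instance IDs, AMI, and sizes:

```python
import boto3

ec2 = boto3.client("ec2")
elb = boto3.client("elb")

# Put the second web server back into rotation behind the load balancer.
elb.register_instances_with_load_balancer(
    LoadBalancerName="site-load-balancer",              # hypothetical name
    Instances=[{"InstanceId": "i-0b222222222222222"}],  # placeholder ID
)

# Spin up a third, larger web server from the same image...
new_server = ec2.run_instances(
    ImageId="ami-12345678",     # placeholder AMI
    InstanceType="m1.large",    # one size up from the usual web servers
    MinCount=1,
    MaxCount=1,
)["Instances"][0]

# ...and place it behind the load balancer as well.
elb.register_instances_with_load_balancer(
    LoadBalancerName="site-load-balancer",
    Instances=[{"InstanceId": new_server["InstanceId"]}],
)
```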

Which, if you’re trying to watch a top bout, is just under 30 minutes too long. While our final configuration should handle stresses like this without intervention (by spinning up new web servers automatically if utilization hits a certain threshold), we’ve implemented additional steps to ensure we won’t have a repeat of this outage in the interim:

– Set up additional admin notifications if servers get hammered, now including SMS notifications instead of just email (see the sketch after this list)
– Increase minimum size of each web server
– Regularly check the load balancer setup to make sure all web servers are properly attached to it, particularly going into weekends
– Keep a close eye on server performance during high traffic times; add web server instances when needed
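
Here’s roughly what that new alerting looks like — a hedged sketch using CloudWatch and SNS via boto3. The topic name, addresses, phone number, and 75% threshold are placeholders, not our real values.

```python
import boto3

sns = boto3.client("sns")
cloudwatch = boto3.client("cloudwatch")

# One SNS topic fans a single alert out to every subscribed admin,
# by email and by SMS (the address and number below are fake).
topic_arn = sns.create_topic(Name="server-alerts")["TopicArn"]
sns.subscribe(TopicArn=topic_arn, Protocol="email", Endpoint="admins@example.com")
sns.subscribe(TopicArn=topic_arn, Protocol="sms", Endpoint="+15555550100")

# Fire the alert when average CPU across the web server group stays above
# 75% for two five-minute periods in a row.
cloudwatch.put_metric_alarm(
    AlarmName="web-servers-getting-hammered",
    Namespace="AWS/EC2",
    MetricName="CPUUtilization",
    Dimensions=[{"Name": "AutoScalingGroupName", "Value": "web-servers"}],
    Statistic="Average",
    Period=300,
    EvaluationPeriods=2,
    Threshold=75.0,
    ComparisonOperator="GreaterThanThreshold",
    AlarmActions=[topic_arn],
)
```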

Within a few weeks, we should have a good handle on the appropriate sizing of our baseline configuration, at which point we’ll get the autoscaling set up and let AWS do all the magic for us. While we regret the downtime (especially during such a hotly anticipated and exciting bout!), we’ve been pleasantly surprised by how smoothly this migration has progressed so far. Hopefully this will be our biggest hiccup!
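
For completeness, the “magic” itself is mostly one more API call: a scaling policy on the autoscaling group, triggered by the same kind of CPU alarm as the admin alerts. Again, a sketch with placeholder names and numbers:

```python
import boto3

autoscaling = boto3.client("autoscaling")

# A simple "add one more web server" policy on the (hypothetical) group.
policy = autoscaling.put_scaling_policy(
    AutoScalingGroupName="web-servers",
    PolicyName="scale-out-on-load",
    AdjustmentType="ChangeInCapacity",
    ScalingAdjustment=1,
    Cooldown=300,    # wait five minutes before scaling again
)

# Pointing a CPU alarm like the one above at this policy (that is, using
# AlarmActions=[policy["PolicyARN"]] instead of the SNS topic) is what lets
# AWS add web servers on its own when the threshold is crossed.
print(policy["PolicyARN"])
```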

Stay tuned for more updates on our infrastructure upgrades, and the coming site improvements they’ll enable.