This Monday the core network device (the thing in the first pic) froze up in one of our datacenters, instantly taking out 50% of Kinja's servers by rendering them unreachable.

You can see the point in the graph below where the blue line disappears; that was all those servers vanishing at once (the four humps are from a code deploy that coincidentally went out right before the problem).

You can also see the red line, our other datacenter, immediately pick up some of the slack. That's our CDN, Fastly, automatically sending more traffic to the healthier origin/datacenter.
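In rough terms, this kind of automatic failover amounts to weighted load balancing across origins that skips any origin whose health checks are failing. Here's a minimal sketch of that idea (hypothetical names and weights, not Fastly's actual implementation):

```python
import random

# Each datacenter is an origin with a traffic weight and a health flag.
# Requests are distributed only across origins whose checks are passing.
ORIGINS = {
    "dc1": {"weight": 50, "healthy": True},
    "dc2": {"weight": 50, "healthy": True},
}

def pick_origin(origins):
    """Pick an origin at random by weight, skipping unhealthy ones."""
    healthy = {name: o for name, o in origins.items() if o["healthy"]}
    if not healthy:
        raise RuntimeError("no healthy origins left")
    names = list(healthy)
    weights = [healthy[n]["weight"] for n in names]
    return random.choices(names, weights=weights)[0]

# Normally traffic splits roughly 50/50; once dc1's health checks
# start failing, every request lands on dc2.
ORIGINS["dc1"]["healthy"] = False
assert all(pick_origin(ORIGINS) == "dc2" for _ in range(1000))
```

The real thing layers on health-check intervals, thresholds, and retry logic, but the core behavior is the same: an origin that stops answering stops receiving traffic.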

However, it wasn't without some user impact: for the next 9 minutes, just shy of 9% of requests to our front-end resulted in errors:


A few minutes later one of our Lead Developers noticed something was amiss and mentioned it in Slack:


After some investigation and rapid escalation, I ran the command to force 100% of traffic to our other, working datacenter.

That in turn pushed the remaining 50% of servers as hard as they could go, leading to occasional slow or timed-out responses. So just a few minutes later we increased the default cache times to buy ourselves some safety margin:


From that point on, visitors were unaffected, editors had to wait a few more minutes for post updates to take effect, and developers could not deploy code. You can't even see the impact in New Relic RUM (end-user browser performance):

We immediately reached out to our networking support vendor, Datagram, who dispatched a tech who was onsite within an hour. He troubleshot and restored the device, and everything came back up nicely.


Our database slaves then started doing quadruple duty to catch up to the master (hooray for spare capacity and Fusion-IO):

So they quickly caught up:
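Catch-up is basically a race between the replica's replay rate and the master's incoming write rate; the backlog drains at the difference between the two. A rough model with hypothetical numbers (the "quadruple duty" here is a replica replaying at 4x the write rate):

```python
# The replica drains its backlog at (replay_rate - incoming_write_rate).
def catch_up_seconds(backlog_events, replay_per_sec, writes_per_sec):
    drain_rate = replay_per_sec - writes_per_sec
    if drain_rate <= 0:
        raise ValueError("replica can never catch up at this rate")
    return backlog_events / drain_rate

# e.g. an hour's backlog at 500 writes/s, replayed at 2000/s:
backlog = 500 * 3600
assert catch_up_seconds(backlog, 2000, 500) == 1200.0  # 20 minutes
```

Which is why having spare replay headroom (fast storage like Fusion-IO included) matters: a replica with no headroom over the write rate never catches up at all.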


After that we shifted traffic back onto that datacenter, and we were back to a 50/50 active/active split, with cache times back to normal, by 7pm.

Going forward there are a few things we can do to tighten this up, but overall we're pretty happy; it went a lot better than our last big one.