Kinja OPS

Gawker ops team

good read: "An informal survey of real-world communications failures"

http://queue.acm.org/detail.cfm?id=…

my favorite:

This 90-second network partition caused file servers using Pacemaker and DRBD (Distributed Replicated Block Device) for HA (high availability) failover to declare each other dead, and to issue STONITH (shoot the other node in the head) messages to one another. The network partition delayed delivery of those messages, causing some file-server pairs to believe they were both active. When the network recovered, both nodes shot each other at the same time. With both nodes dead, files belonging to the pair were unavailable.
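To see why the delayed messages are the killer here, this is a toy sketch of the sequence in Python. It is not how Pacemaker or DRBD actually work. The `Node` and `Network` classes, the `on_peer_timeout` handler, and the `fs-a`/`fs-b` names are all made up for illustration, and the only assumption is the one in the quote: fencing requests sent during the partition are held back and delivered when the network heals.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    alive: bool = True
    active: bool = False  # True when the node believes it is serving files


def deliver(target: Node, message: str) -> None:
    # A delivered STONITH request powers the target off, no questions asked.
    if message == "STONITH":
        target.alive = False
        target.active = False


@dataclass
class Network:
    partitioned: bool = False
    queue: list = field(default_factory=list)  # messages held back by the partition

    def send(self, target: Node, message: str) -> None:
        if self.partitioned:
            self.queue.append((target, message))  # delayed, not dropped
        else:
            deliver(target, message)

    def heal(self) -> None:
        # When the partition ends, every delayed message arrives at once.
        self.partitioned = False
        pending, self.queue = self.queue, []
        for target, message in pending:
            deliver(target, message)


def on_peer_timeout(node: Node, peer: Node, network: Network) -> None:
    # Hypothetical failover rule: the peer is silent, so fence it and take over.
    network.send(peer, "STONITH")
    node.active = True


def simulate():
    a = Node("fs-a", active=True)  # primary
    b = Node("fs-b")               # standby
    net = Network(partitioned=True)

    # Each side loses sight of the other and declares it dead.
    on_peer_timeout(a, b, net)
    on_peer_timeout(b, a, net)
    during = (a.active, b.active)  # both believe they are active

    net.heal()
    after = (a.alive, b.alive)     # both fencing requests land
    return during, after


if __name__ == "__main__":
    print(simulate())
```

In this model, `simulate()` returns `((True, True), (False, False))`: both nodes think they are active during the partition, and both are dead once it heals.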
