Failure Detection in the Era of Gray Failures

Jon Currey

Director of Research at HashiCorp

Consul, Nomad and Serf use the SWIM gossip protocol for failure detection. However, in some circumstances, SWIM can produce a large number of false positives, marking healthy nodes as failed. In this talk, we examine our solution to this problem, which we call Lifeguard, and consider its relationship to the problem of gray failures, where systems are neither completely healthy nor completely failed.