Mystery: Revealed

[As promised, here’s the reveal writeup from last week’s mystery post by Max Wolffe. Thanks again, Max!]

Last week Cliff posted a mysterious rainbow waterfall graph showing hosts for a service suddenly receiving no traffic, one after another.

It almost looks like a deployment, except that the traffic drop doesn’t occur for all hosts, and there’s no informed overlay event indicating that a deployment occurred.

Looking at the graphs for a single host, we can see a single large spike in 5xx errors across several of the endpoints that realtime-dispatcher exposes.

Errors across a number of endpoints suggest that something strange happened with a downstream. Let’s look at the downstream graphs to see if that’s true.

realtime-dispatcher uses Espresso as its storage layer. Here are the errors for requests to Espresso from a single host.

Espresso errors lead to publish errors

Publish errors seem to lead to a traffic drop

What are the Espresso errors which spike so suddenly around that time?

com.linkedin.d2.balancer.ServiceUnavailableException: ServiceUnavailableException [_reason=Service: ESPRESSO_MSG is in a bad state (high latency/high error). Dropping request. Cluster: ESPRESSO_MSG, partitionId:0 (16 hosts), _serviceName=ESPRESSO_MSG]]

This error is thrown by D2, which has a mechanism for cutting off traffic to a service that’s experiencing high latency or high error rates. It turns out that the Espresso cluster that realtime-dispatcher uses is getting into a bad state (due to GC, long queries, too much load, etc.). This causes D2’s degrader mechanism to kick in and reduce traffic to that cluster to near zero. As the cluster degrades, realtime-dispatcher itself ends up throwing enough errors for D2 to mark it as unhealthy as well, which is why we see traffic drop immediately after the spike in 5xx errors.
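
To make the degrader idea concrete, here is a minimal sketch of a client-side degrader that raises its drop rate while a downstream looks unhealthy and lowers it again once errors subside. This is a toy model, not D2’s actual algorithm: the class name, thresholds, and step size are all hypothetical, chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class DegraderSketch:
    """Toy client-side degrader. NOT D2's real implementation."""

    high_error_rate: float = 0.5  # hypothetical: above this, shed more traffic
    low_error_rate: float = 0.1   # hypothetical: below this, restore traffic
    step: float = 0.25            # hypothetical: drop-rate change per interval
    drop_rate: float = 0.0        # fraction of requests dropped before sending

    def update(self, calls: int, errors: int) -> float:
        """Adjust the drop rate from one interval's call and error counts."""
        # An interval with no calls is treated as error-free, so traffic
        # is gradually let back in rather than staying cut off forever.
        error_rate = errors / calls if calls else 0.0
        if error_rate > self.high_error_rate:
            self.drop_rate = min(1.0, self.drop_rate + self.step)
        elif error_rate < self.low_error_rate:
            self.drop_rate = max(0.0, self.drop_rate - self.step)
        # Between the two thresholds the drop rate is left unchanged.
        return self.drop_rate


if __name__ == "__main__":
    degrader = DegraderSketch()
    for calls, errors in [(100, 80), (100, 90), (100, 95), (100, 99), (100, 0)]:
        print(calls, errors, degrader.update(calls, errors))
```

Under these assumptions, a few consecutive bad intervals are enough to push the drop rate to 1.0, which would look like traffic to the downstream going to near zero.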

Traffic is then shed to the other nodes, which run into high error rates and shed traffic themselves, one after another. That cascade produces the rainbow waterfall graph from last week’s post.
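
The one-after-another pattern can be illustrated with a small simulation. This is a sketch under strong assumptions, not a model of the real incident: it assumes traffic is split evenly across healthy hosts, that each host starts erroring once its share exceeds a fixed capacity, and that the weakest overloaded host is marked unhealthy first. The host names, capacities, and load figure are made up.

```python
def simulate_cascade(total_load, capacities, first_failure):
    """Return (order in which hosts are marked unhealthy, surviving hosts).

    capacities maps a host name to the load it can take before it starts
    erroring (a hypothetical number). first_failure is the host that some
    outside trigger knocks out first.
    """
    healthy = dict(capacities)
    del healthy[first_failure]
    order = [first_failure]
    while healthy:
        # Shed traffic is spread evenly over the hosts that remain.
        per_host = total_load / len(healthy)
        overloaded = [h for h, cap in healthy.items() if cap < per_host]
        if not overloaded:
            break  # the remaining hosts can absorb the load
        # Assumption: the weakest overloaded host trips first.
        victim = min(overloaded, key=lambda h: healthy[h])
        order.append(victim)
        del healthy[victim]
    return order, sorted(healthy)


if __name__ == "__main__":
    capacities = {
        "host-a": 100, "host-b": 110, "host-c": 140,
        "host-d": 180, "host-e": 400, "host-f": 400,
    }
    order, survivors = simulate_cascade(600, capacities, "host-a")
    print("marked unhealthy, in order:", order)
    print("still serving:", survivors)
```

In this toy setup the cascade stops once the remaining hosts have enough headroom, which is at least consistent with the observation that the traffic drop did not hit every host.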

There are a few things that are still somewhat mysterious to me:

If ESPRESSO_MSG got into a bad state, why didn’t that impact all of the realtime-dispatcher hosts at the same time?
Why didn’t D2 simply route requests to a non-degraded Espresso host, instead of throwing ESPRESSO_MSG ServiceUnavailableExceptions that caused realtime-dispatcher to degrade as well?