Espresso Triage

This week I’ve got a series of inGraphs and a little “color” provided by Nick Brown. These inGraphs are especially timely in light of Greg Banks’ recent SRE [in]con talk on triaging Espresso problems. In fact, the very first question I asked was “Did you find this because you went to the Espresso triage talk at [in]con?” So…let’s dive right in:

Raised to learning-sre from invisualize [link]

Found via espresso-client-triage dynamic dashboard (useful!) [link]

But no significant change in call pattern (less interesting graphs) [link]

With the help of espresso-sre, we found a Kafka thing [link]

Which turned out to be a shard balancing issue with the Espresso cluster, since they added a new storage node [link]

Which aggregates down to this: [link]

Very cool stuff - in particular, the Kafka inGraph is quite striking - and a good practical example of how to use the espresso-client-triage dynamic dashboard to get a better picture of what’s going on (before escalating to espresso-sre). If you’d like to give it a shot for your service, copy-paste the following link:

http://ingraphs.prod.linkedin.com/container/SERVICENAME/?dynamic=espresso-client-triage

…and replace SERVICENAME with…well…with the name of your service.
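
If you'd rather script the substitution (say, to generate links for several services at once), here's a minimal sketch. The function name `triage_dashboard_url` and the service name `my-service` are made up for illustration; the only thing taken from above is the URL template itself.

```python
from urllib.parse import quote

# URL template from the post; SERVICENAME is the placeholder to replace.
TEMPLATE = (
    "http://ingraphs.prod.linkedin.com/container/SERVICENAME/"
    "?dynamic=espresso-client-triage"
)


def triage_dashboard_url(service_name: str) -> str:
    """Return the espresso-client-triage dashboard link for one service.

    Hypothetical helper: it only does the string substitution described
    above and does not check that the service actually exists.
    """
    if not service_name:
        raise ValueError("service_name must be non-empty")
    # Percent-encode the name so unusual characters cannot break the path.
    return TEMPLATE.replace("SERVICENAME", quote(service_name, safe=""))


if __name__ == "__main__":
    # "my-service" is a placeholder, not a real service name.
    print(triage_dashboard_url("my-service"))
```

Swap in your own service name and open the printed link in a browser.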