This week I’ve got a series of inGraphs and a little “color” provided by Nick Brown. These inGraphs are especially timely in light of Greg Banks’ recent SRE [in]con talk on triaging Espresso problems - in fact, the very first question I asked was “did you find this because you went to the Espresso triage talk at [in]con?” So…let’s dive right in:
Raised to learning-sre from invisualize [link]
Found via espresso-client-triage dynamic dashboard (useful!) [link]
But no significant change in call pattern (less interesting graphs) [link]
With the help of espresso-sre, we found a Kafka thing [link]
Which turned out to be a shard balancing issue with the Espresso cluster, since they added a new storage node [link]
Which aggregates down to this: [link]
Very cool stuff - in particular, the Kafka inGraph is quite striking - and a good practical example of how to use the espresso-client-triage dynamic dashboard to get a better picture of what’s going on (before escalating to espresso-sre). If you’d like to give it a shot for your service, copy-paste the following link:
http://ingraphs.prod.linkedin.com/container/SERVICENAME/?dynamic=espresso-client-triage
…and replace SERVICENAME with…well…with the name of your service.
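If you'd rather script the substitution (say, to generate links for a handful of services at once), here is a minimal Python sketch. The base URL and dashboard name come straight from the link above; the function name `triage_dashboard_url` and the example service name `my-service` are made up for illustration, and I'm assuming service names may need URL-escaping.

```python
from urllib.parse import quote

# Both values are copied from the link above.
INGRAPHS_BASE = "http://ingraphs.prod.linkedin.com/container"
DASHBOARD = "espresso-client-triage"


def triage_dashboard_url(service_name: str) -> str:
    """Return the espresso-client-triage dashboard URL for one service.

    Hypothetical helper: it only fills SERVICENAME into the URL template.
    """
    name = service_name.strip()
    if not name:
        raise ValueError("service name must not be empty")
    # Escape the name so unusual characters can't break the path segment.
    return f"{INGRAPHS_BASE}/{quote(name, safe='')}/?dynamic={DASHBOARD}"


if __name__ == "__main__":
    # "my-service" is a placeholder, not a real service.
    print(triage_dashboard_url("my-service"))
```

Swap in your own service names and paste the resulting links into a browser.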