Most folks reading this are probably aware of the recent load testing war-room effort…and if you’re not aware of it, then you’ve either been on extended DTO or haven’t been paying attention. At any rate, the idea is the following: given traffic projections going into 2018, we need to be able to support X traffic (for a particular value of X; seriously, read your email), and we aren’t quite able to do that just yet. This effort has led to a whole pile of capacity uplifts, performance tuning, config changes, and the like…and, naturally, a concomitant pile of interesting inGraphs.
Let’s take a peek at some of these inGraphs. Perhaps worth noting: I literally picked the last 8-9 inGraphs I’ve saved off and pasted them into this post.
First: what load testing looks like with respect to traffic being directed into a particular fabric (in this case, prod-ltx1):
Next up: A few examples of what it looks like when a service is doing Just Fine and then shits all over itself, succumbing to the pressure of the additional load:
There’s a multitude of reasons why a service might see this kind of QPS dropoff. Maybe it’s a substantial increase in latency:
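Little’s Law is the quick intuition for why a latency spike alone is enough to crater QPS: with a fixed number of workers, throughput can’t exceed concurrency divided by per-request latency. A back-of-the-envelope sketch, with invented numbers that aren’t from any of our actual services:

```java
// Back-of-the-envelope via Little's Law (L = λW, so λ = L / W):
// with a fixed worker count, max throughput = workers / latency.
public class LittlesLaw {
    public static void main(String[] args) {
        int workers = 200;                // hypothetical request-handling threads
        double normalLatencySec = 0.050;  // 50 ms per request under normal load
        double loadedLatencySec = 0.400;  // 400 ms once things degrade

        System.out.printf("Max QPS at 50 ms:  %.0f%n", workers / normalLatencySec); // 4000
        System.out.printf("Max QPS at 400 ms: %.0f%n", workers / loadedLatencySec); // 500
        // Same hardware, same thread count: an 8x latency increase is an 8x
        // drop in the QPS ceiling, which reads as a cliff on the graph.
    }
}
```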
…or blowing out a thread pool:
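Here’s a minimal sketch of that failure mode, with made-up pool sizes (real services tune these per service): once every worker is busy and the queue is full, new work gets rejected outright, so offered load keeps climbing while completed QPS flatlines or drops.

```java
import java.util.concurrent.*;
import java.util.concurrent.atomic.AtomicInteger;

public class ThreadPoolBlowout {
    public static void main(String[] args) throws InterruptedException {
        // Hypothetical sizing: 4 workers, queue depth 10, reject when full.
        ExecutorService pool = new ThreadPoolExecutor(
                4, 4, 0L, TimeUnit.MILLISECONDS,
                new ArrayBlockingQueue<>(10),
                new ThreadPoolExecutor.AbortPolicy());

        AtomicInteger rejected = new AtomicInteger();
        for (int i = 0; i < 100; i++) {           // 100 "requests" arrive at once
            try {
                pool.submit(() -> {
                    try { Thread.sleep(100); }    // simulated slow request
                    catch (InterruptedException e) { Thread.currentThread().interrupt(); }
                });
            } catch (RejectedExecutionException e) {
                rejected.incrementAndGet();       // this is the QPS you "lose"
            }
        }
        System.out.println("Rejected: " + rejected.get() + " of 100");
        pool.shutdown();
        pool.awaitTermination(5, TimeUnit.SECONDS);
    }
}
```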
…or blowing out a downstream connection limit (read: Espresso):
…or blowing out an upstream connection limit (read: L1):
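Both flavors of connection-limit blowout reduce to the same miniature: a fixed pool of permits, and once they’re exhausted, callers either block or fail fast. A toy sketch with a hypothetical cap (this is not the actual Espresso or L1 client config):

```java
import java.util.concurrent.Semaphore;
import java.util.concurrent.TimeUnit;

public class ConnectionLimit {
    // Hypothetical cap: 50 connections to a downstream (or from an upstream).
    private static final int MAX_CONNECTIONS = 50;
    private final Semaphore permits = new Semaphore(MAX_CONNECTIONS);

    // Try to check out a connection; fail fast instead of queueing forever.
    public boolean withConnection(Runnable work, long timeoutMs) throws InterruptedException {
        if (!permits.tryAcquire(timeoutMs, TimeUnit.MILLISECONDS)) {
            return false;  // pool exhausted: surfaces as errors and a QPS dropoff
        }
        try {
            work.run();
            return true;
        } finally {
            permits.release();
        }
    }
}
```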
…or maybe there are one or two Bad Actors that decide to go apeshit at the worst possible time:
I suppose the tl;dr of this is: it’s a complex problem space, with many moving parts.
A huge Thank You to everyone who has helped us understand how we’re going to support the anticipated load moving forward.