For those of you who don’t know what redliner is, it’s a tool for estimating the maximum per-node QPS that a given service can handle using empirical evidence. The fuck does that mean? Well, here’s how it works:
-
Redliner selects one node providing the service you want to “redline”
-
Redliner gradually increases the percentage of overall traffic that goes to that particular node
-
After each increase, redliner evaluates the relative health of the service on that node
-
When the health of the service being provided by that node has crossed some threshold, redliner resets to “normal” and emits a result
This result takes the form of a “redline number” - the maximum amount of per-node QPS that redliner thinks a service can handle before it starts to degrade.
Let’s take a look at what this looks like with an inGraph (courtesy of Jim Ockers):
The step-function increases in QPS - and the subsequent return to “normalcy” - show up pretty clearly. Pretty rad, right?
An open question: Something I’m not sure I grok at present is the dips between increases. Maybe redliner has some kind of a temporary backoff between weight increases? Or maybe it’s some implementation detail that introduces time lag between increases? Hell, maybe it’s a bug? I’m not sure…but if anyone knows I’d love to understand it.