How to alert on SLOs

Yuri Grinshteyn
Google Cloud - Community
8 min read · Sep 28, 2020


I’ve spent quite a bit of time looking at defining and configuring SLOs in Service Monitoring. And lately, I’ve been getting lots of questions about what happens next — once the SLO is configured, folks want to know how to use alerting to be notified about potential, imminent, and in-progress SLO violations. Service Monitoring provides SLO error budget burn alerts to accomplish just that, but using these alerts is not always intuitive. I set out to try these out for myself and document what I found along the way. Let’s see what happens!

In theory…

The SRE Workbook has a whole chapter devoted to alerting on SLOs. I don’t think I need to reproduce it here, but there are two classes of considerations that are very important:

  • What is the relationship between your alert and the number of SLO events your service experiences?

— Precision is your “positive predictive value” — what fraction of your alerts will actually indicate an event you need to know about?

— Recall is your “sensitivity” — what fraction of your events will actually result in an alert?

  • What is the relationship between an alert and the event that triggered it?

— Detection time is the time between when the event started and the time of the alert firing

— Reset time is the time between when the event ended and the time of the alert resetting

As you can imagine, your SLO itself AND the alert you configure can affect all of these. It’s impossible to optimize all of these (driving precision and recall to 100% AND detection time and reset time to zero) AND not drown in alerts, most of which will not be actionable. At the same time, it is not practical to wait until you’re out of error budget to alert — advance notice is necessary to be able to take appropriate action to ensure that the service stays within its daily/weekly/monthly error budget. Alerting on error budget burn, rather than on whether your service is within SLO for any specific short time period, is the way to get there.

The next obvious question — just how much of the error budget should be burned before triggering an alert? The most advanced approach recommended in the SRE workbook chapter is to use multiwindow, multi-burn-rate alerts (see section 6). But to get started, I very much like the idea of two separate alerts with different burn rates, following the recommendation of “2% burn in 1 hour and 5% burn in 6 hours”.

Finally — and this is the hard part — how do you identify the exact burn rate and alerting window to get to the level of precision and reset time you need while minimizing operational overhead (responding to unneeded alerts) for your service? From the book:

“For burn rate-based alerts, the time taken for an alert to fire is:

time to fire = (1-SLO) / error ratio * alerting window size * burn rate

The error budget consumed by the time the alert fires is:

budget consumed = burn rate * alerting window size / period”
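
To keep the arithmetic in the rest of this post easy to check, here are those two relationships as a small code sketch (all durations in hours; the helper names are my own, not anything Service Monitoring exposes):

    // The two formulas from the SRE Workbook, with all durations in hours.

    // How long a given error ratio must be sustained before a burn-rate alert
    // with this lookback window and threshold fires.
    function timeToFire(slo, errorRatio, windowHours, burnRate) {
      return ((1 - slo) / errorRatio) * windowHours * burnRate;
    }

    // Fraction of the total error budget consumed by the time the alert fires.
    function budgetConsumed(burnRate, windowHours, periodHours) {
      return (burnRate * windowHours) / periodHours;
    }

    // Rearranging budgetConsumed: the burn-rate threshold to configure for an
    // "X% of budget in Y hours" alert.
    function burnRateThreshold(budgetFraction, periodHours, windowHours) {
      return (budgetFraction * periodHours) / windowHours;
    }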

Let’s look at an example to clarify this. Let’s say you have an availability SLO of 99.9% over a rolling 28-day window. You’d like to alert if you’ve consumed 1% of your error budget in the previous hour. That means you want the alert to trigger if the error rate over the last hour is greater than budget consumed * (1-SLO) * (period/window): in this case, .01 * .001 * 672 = .00672, or an error rate of 0.672%.

However, Service Monitoring doesn’t allow you to specify an error rate — instead, you have to provide a burn rate threshold. The burn rate is the rate at which your service consumes error budget. If the burn rate is 1.0, then at the end of the SLO evaluation period, you will have consumed exactly 100% of your error budget. If the burn rate is 2.0, then at the end of the period you will have consumed 200% of your error budget (or you will have consumed all of your error budget half-way through your evaluation period) — and so on. Using the equation above to calculate the burn rate, we get

burn rate = budget consumed * period / alerting window

In our example:

  • SLO = 99.9%, or .999
  • Alerting window size = 1 hour
  • Budget consumed = 1% or .01
  • Period = 28 days or 672 hours

This means that

burn rate = .01 * 672 / 1 = 6.72

If we configure our alert to have a 60 minute lookback period and a burn rate threshold of 6.72, we can then calculate how quickly we would get an alert based on the error ratio our service experiences. For example, if our error rate goes up to 1%, time to fire = (1-.999) / .01 * 1 * 6.72 = 0.672 hours, or roughly 40 minutes.

We can also work backwards from a desired detection time. If we want the alert to fire right at the 1 hour mark for that same 1% error rate, we solve 1 = (1-.999) / .01 * 1 * burn rate, which means that burn rate = 10. To confirm — time to fire = (1-.999) / .01 * 1 * 10 = 1. Our alert will fire in exactly one hour, though the higher threshold costs some recall: an error rate below 1% will no longer trigger it at all.
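
Checking those numbers with the helpers sketched in the previous section:

    const slo = 0.999;
    const periodHours = 28 * 24; // 672

    // Threshold for "1% of the error budget burned in 1 hour".
    console.log(burnRateThreshold(0.01, periodHours, 1)); // 6.72

    // At a sustained 1% error rate, that alert fires after about 40 minutes...
    console.log(timeToFire(slo, 0.01, 1, 6.72)); // ≈ 0.672 hours

    // ...while a threshold of 10 fires right at the one hour mark.
    console.log(timeToFire(slo, 0.01, 1, 10)); // ≈ 1 hour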

In practice…

I wanted to look at using Service Monitoring to implement a simpler approach (as described in section 5) — two separate alerts with different burn rates, following the recommendation of “2% burn in 1 hour and 5% burn in 6 hours”.

The setup

Service and SLO

To start, I needed a simple application whose error rate I could control precisely without redeploying code. You can see the code for the service here. It’s a pretty basic Node.js Express application that writes log entries for every request and failure. I then configured log-based metrics to count both, and created a service in the Service Monitoring UI:

Next, I needed to define the SLO. Because my service is using two different metrics for the “good” and “bad” filters, I could not figure out how to create such an SLO in the UI. As such, I needed to use the API.
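
In outline, with placeholder project, service, and log-based metric names standing in for mine (the exact filters depend on how your log-based metrics are defined), the create call looks something like this:

    // Sketch of a services.serviceLevelObjectives.create call against the
    // Cloud Monitoring REST API. PROJECT_ID, SERVICE_ID, and the two log-based
    // metric names are placeholders.
    const { GoogleAuth } = require('google-auth-library');

    async function createSlo() {
      const auth = new GoogleAuth({ scopes: 'https://www.googleapis.com/auth/monitoring' });
      const client = await auth.getClient();
      const parent = 'projects/PROJECT_ID/services/SERVICE_ID';

      const res = await client.request({
        url: `https://monitoring.googleapis.com/v3/${parent}/serviceLevelObjectives`,
        method: 'POST',
        data: {
          displayName: '95% availability over a rolling 28 day period',
          goal: 0.95,
          rollingPeriod: '2419200s', // 28 days
          serviceLevelIndicator: {
            requestBased: {
              goodTotalRatio: {
                // Separate log-based metrics for good and bad requests.
                goodServiceFilter:
                  'metric.type="logging.googleapis.com/user/GOOD_REQUESTS_METRIC" resource.type="k8s_container"',
                badServiceFilter:
                  'metric.type="logging.googleapis.com/user/BAD_REQUESTS_METRIC" resource.type="k8s_container"',
              },
            },
          },
        },
      });
      console.log('Created SLO:', res.data.name);
    }

    createSlo().catch(console.error);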

This created an SLO for “95% availability over a rolling 28 day period”. Here it is in the UI:

Fault injection

Next, I needed to inject a desired error rate into the service to be able to validate that increasing the error rate would result in error budget burn and trigger alerts. After some internal discussion, I identified a couple of options:

  • Using Istio’s fault injection capabilities
  • Having my app read an environment variable that could be managed by a ConfigMap

I decided to use the latter. You can see my code here; the gist of it is that I used Node’s process.env.<variable name> to have my code read the variable when servicing the request. The value of the variable itself is set in the ConfigMap — I followed these instructions to create it from a basic .properties file. One additional complication I discovered is that the pod only reads the value of the environment variable at startup — so I needed to delete the pod and have the deployment recreate it whenever I wanted to change the value.
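
For illustration, the core of that approach looks something like the following minimal sketch (not my exact code; ERROR_RATE stands in for the actual variable name):

    // Minimal Express sketch: fail a configurable fraction of requests based on
    // an environment variable that Kubernetes injects from a ConfigMap.
    const express = require('express');
    const app = express();

    app.get('/', (req, res) => {
      // Read the variable on every request. The pod's environment is only
      // populated at startup, though, so changing the ConfigMap still requires
      // deleting the pod and letting the deployment recreate it.
      const errorRate = parseFloat(process.env.ERROR_RATE || '0');

      if (Math.random() < errorRate) {
        // Failure log entry, counted by one log-based metric.
        console.log(JSON.stringify({ severity: 'ERROR', message: 'request failed' }));
        return res.status(500).send('error');
      }

      // Success log entry, counted by the other log-based metric.
      console.log(JSON.stringify({ severity: 'INFO', message: 'request served' }));
      res.send('ok');
    });

    app.listen(process.env.PORT || 8080);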

Alerts

Finally, I was ready to set up my alerts. As discussed earlier, I wanted to test two alerts:

  • 2% error budget burn in 1 hour
  • 5% error budget burn in 6 hours

As mentioned in the “theory” section, Service Monitoring expects two inputs for error budget burn alerts — a lookback duration and a burn rate threshold:

Because I knew I wanted to monitor error budget burn over an hour, the lookback duration was an easy decision — I set it to 60 minutes. However, I needed to calculate my burn rate threshold. I already showed how I calculated it for the “1% in 1 hour” burn example. This time, my inputs were as follows:

  • Alerting window size = 1 hour
  • Budget consumed = 2% or .02
  • Period = 28 days or 672 hours

To calculate burn rate, I used

burn rate = budget consumed * period / alerting window = .02 * 672 / 1 = 13.44

This is the burn rate threshold our alert policy expects:

Next, I needed to calculate the burn rate threshold for the “5% error budget burned in 6 hours” alert. This time:

  • Alerting window size = 6 hours
  • Budget consumed = 5% or .05
  • Period = 28 days or 672 hours

Which means that

burn rate = budget consumed * period / alerting window = .05 * 672 / 6 = 5.6
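
Both thresholds fall out of the same helper from the theory section:

    // Burn-rate thresholds for the two recommended alerts, over a 28-day period.
    console.log(burnRateThreshold(0.02, 672, 1)); // 2% of budget in 1 hour  -> 13.44
    console.log(burnRateThreshold(0.05, 672, 6)); // 5% of budget in 6 hours -> 5.6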

Now, both of my alerts were ready to go. I was ready to introduce errors into the service and test the alerts.

Testing alerts

The first question I needed to answer was — what should the error rate be to trigger the alert in a reasonable time frame? As discussed above:

time to fire = (1-SLO) / error ratio * alerting window size * burn rate

This makes

error ratio = (1 - SLO) * alerting window size * burn rate / time to fire

For the first alert, my values were as follows:

  • Desired time to fire = 1 hour
  • SLO = 95% or .95
  • Alerting window size = 1 hour
  • Burn rate = 13.44
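
Plugging those in:

    // Minimum sustained error ratio for the "2% in 1 hour" alert (threshold
    // 13.44, SLO 95%) to fire within the hour.
    const errorRatio = ((1 - 0.95) * 1 * 13.44) / 1; // 0.672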

This meant that my error ratio would need to be at least 0.672, or 67.2% — that’s pretty high! I recreated the ConfigMap with that value and deleted the pod, which was automatically recreated by the deployment. Almost immediately, I saw a decrease in my SLI:

Because I set my SLO higher than I intended, the service was already out of error budget, but the immediate drop was still exactly what I expected. My service also hadn’t been running for a full month, so I thought the alert would trigger much faster than my calculation suggested: I hadn’t actually accumulated a full period’s worth of error budget at that point, so burning 2% of it would take almost no time at all.

Sure enough, I soon saw the error budget burn rate cross the threshold:

And the alert generated:

Wrapping up…

I’m really glad I took the time to figure this out AND document my journey along the way, and I hope you find this helpful. In particular, I found that the product documentation doesn’t say much about exactly what an error budget burn rate is, and it took some digging to really understand the math that goes into calculating it. Nevertheless, I very much like this approach to setting up alerts on SLOs. Thanks for reading, and please let me know what you think!
