Monitoring: 95th percentile, SLA/SLO/SLI & error budget

When working with metrics, understanding what percentile metrics represent is crucial for interpreting performance data effectively. Let’s dive deeper into the 95th percentile and its relationship to SLA, SLO, and SLI: terms often used interchangeably, but distinct in practice.

What is the 95th Percentile?

The 95th percentile is a common metric for analyzing response times. It essentially means:

  1. Sort all requests by response time.
  2. Discard the slowest 5% (the outliers).
  3. The largest remaining value is the 95th percentile—the point below which 95% of all response times fall.
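The steps above can be sketched in a few lines of Python. The response times are made-up sample data, and the nearest-rank method shown here is the simplest variant; production systems typically interpolate between samples.

```python
import math

def percentile(samples, p):
    """Return the value below which p% of samples fall (nearest-rank method)."""
    ordered = sorted(samples)
    rank = math.ceil(p / 100 * len(ordered))  # position of the p-th percentile
    return ordered[rank - 1]

# 20 illustrative response times in ms: 19 fast requests and one slow outlier.
response_times_ms = [80, 90, 100, 100, 110, 110, 120, 120, 130, 130,
                     140, 140, 150, 150, 160, 160, 170, 170, 180, 2500]

average = sum(response_times_ms) / len(response_times_ms)  # skewed by the outlier
p95 = percentile(response_times_ms, 95)                    # ignores the outlier
```

Here the single 2500 ms outlier drags the average above every other sample, while the 95th percentile still reflects what 19 out of 20 requests actually experienced.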

This metric provides more actionable insight into system behaviour than averages, which can be skewed by a few outliers. For instance, if most of your application’s requests are under 100 ms but a few take seconds, focusing on the 95th percentile helps you prioritise optimisations for the majority of users instead of chasing a handful of slow requests.

However, exposing this metric directly in dashboards is not straightforward. Percentiles require access to raw data, not aggregated values, to ensure accuracy. Many monitoring systems approximate them, but these approximations can distort decision-making, leading to over- or underestimation of performance. In our implementation with Prometheus, the calculation relies on buckets, which divide response times into predefined ranges. To derive meaningful insights, these buckets must be tuned carefully to fit your application’s performance characteristics and timeouts. Misconfigured buckets can skew percentile calculations, making them less effective for monitoring.
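To illustrate why bucket boundaries matter, here is a minimal Python sketch of bucket-based quantile estimation—the same idea behind Prometheus’s histogram_quantile(): interpolate linearly inside the bucket that contains the target rank. The bucket layouts below are assumed examples, not our real configuration.

```python
def bucket_quantile(q, buckets):
    """Estimate a quantile from cumulative histogram buckets, assuming
    observations are spread evenly within each bucket.
    buckets: list of (upper_bound, cumulative_count), sorted by bound."""
    total = buckets[-1][1]
    target = q * total
    prev_bound, prev_count = 0.0, 0
    for bound, count in buckets:
        if count >= target:
            # Interpolate linearly within the bucket holding the target rank.
            fraction = (target - prev_count) / (count - prev_count)
            return prev_bound + fraction * (bound - prev_bound)
        prev_bound, prev_count = bound, count
    return buckets[-1][0]

# Same hypothetical traffic (90 requests under 100 ms, the rest slower),
# observed through two different bucket layouts (bounds in ms):
coarse = [(100, 90), (1000, 100)]             # one wide bucket hides detail
fine   = [(100, 90), (200, 99), (1000, 100)]  # an extra bound near the target

p95_coarse = bucket_quantile(0.95, coarse)  # interpolates across 100-1000 ms
p95_fine = bucket_quantile(0.95, fine)      # interpolates across 100-200 ms
```

With identical traffic, the coarse layout estimates the 95th percentile at 550 ms while the finer layout lands around 156 ms—the same data, distorted purely by bucket choice.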

[Figure: a graph showing the use of these buckets]

Fortunately, we run Mimir, a scalable and efficient backend for Prometheus metrics. Mimir handles these complexities seamlessly, allowing us to query data at scale using functions like histogram_quantile() to extract meaningful percentile metrics from distributed systems.
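As a sketch, the query below has the shape we send to the Prometheus-compatible query API; the metric name http_request_duration_seconds is an assumed example, not one of our actual metrics.

```python
def p95_query(metric="http_request_duration_seconds", window="5m"):
    """Build a PromQL histogram_quantile() query over a histogram metric."""
    return (
        f"histogram_quantile(0.95, "
        f"sum by (le) (rate({metric}_bucket[{window}])))"
    )

query = p95_query()
```

The `sum by (le)` keeps the cumulative bucket label that histogram_quantile() needs while aggregating the per-instance series.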

SLA, SLO, SLI and Error Budget: What’s the Difference?

Service Level Indicator (SLI)
An SLI is a specific measurement, like latency or error rate, that quantifies service performance. Think of it as the “atomic metric” used to evaluate the service. For example:

  • Availability: “requests that return an HTTP 2xx status code.”
  • Latency: “requests that are served below 200ms.”
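A minimal sketch of how these two SLIs could be computed over a batch of requests; the request data is made up.

```python
# Illustrative request log: status code and latency per request.
requests = [
    {"status": 200, "latency_ms": 120},
    {"status": 200, "latency_ms": 95},
    {"status": 500, "latency_ms": 40},
    {"status": 200, "latency_ms": 250},
]

total = len(requests)
# Availability SLI: fraction of requests that return an HTTP 2xx status code.
availability = sum(1 for r in requests if 200 <= r["status"] < 300) / total
# Latency SLI: fraction of requests served below 200 ms.
fast_enough = sum(1 for r in requests if r["latency_ms"] < 200) / total
```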

Service Level Objective (SLO)
The SLO is the objective you set for your SLIs, forming a commitment with your team. If the SLI is “requests that are served below 200ms,” the SLO could be “99% of the requests should be served below 200ms”.

Service Level Agreement (SLA)
An SLA is an external, contractual commitment to customers. It typically outlines the consequences of failing the agreed SLOs. For instance: “If fewer than 99.9% of requests meet the latency objective in a billing cycle, credits will be issued.”

Error Budget
An error budget is a critical measure when working with Service Level Objectives (SLOs). If, for example, your latency SLO is 99%, the error budget is the remaining 1%. This allocation is intentional—it encourages teams to take calculated risks rather than aiming for absolute perfection. A fully utilized error budget signals that you’re pushing your system’s boundaries and innovating effectively.
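For example, with a 99% latency SLO, the arithmetic looks like this (the traffic numbers are illustrative):

```python
# Error-budget consumption against a 99% latency SLO.
slo = 0.99                      # 99% of requests must meet the objective
error_budget = 1 - slo          # the remaining 1% may miss it

total_requests = 1_000_000
slow_requests = 6_000           # requests that missed the latency objective

budget_allowed = error_budget * total_requests     # ~10,000 requests may fail
budget_consumed = slow_requests / budget_allowed   # fraction of budget spent
```

Here 60% of the monthly budget is already spent, a signal to weigh further risky changes against the remaining headroom.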

Designing for an error budget means building fault tolerance into your application. For instance, applications should gracefully handle duplicate requests. This allows clients to retry operations in case of transient failures, such as internal server errors or connectivity issues, without adverse effects. By anticipating and preparing for these scenarios, you can create more resilient systems capable of maintaining high availability, even under unexpected conditions.
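A minimal sketch of the duplicate-request handling described above, using an idempotency key so that client retries are safe; the names and the payment example are hypothetical.

```python
processed = {}  # idempotency_key -> stored result

def handle_payment(idempotency_key, amount):
    """Process a request at most once; replays return the stored result."""
    if idempotency_key in processed:
        return processed[idempotency_key]  # duplicate: no double charge
    result = {"charged": amount}           # the real work happens exactly once
    processed[idempotency_key] = result
    return result

first = handle_payment("req-42", 100)
retry = handle_payment("req-42", 100)  # e.g. client retried after a timeout
```

Because the server keys the work on the idempotency key, a client that never saw the first response can simply retry without risking a second charge.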

Why Do These Metrics Matter?

SLAs, SLOs, SLIs, and error budgets play a crucial role in guiding operational priorities and decision-making. SLIs provide precise measurements of service performance, while SLOs define the expected levels for those metrics. SLAs translate these objectives into customer-facing commitments, often with associated penalties for non-compliance. Error budgets, on the other hand, enable teams to balance reliability with innovation by allowing for acceptable levels of risk and failure.

Together, these metrics ensure that teams focus on delivering consistent performance while maintaining flexibility for growth and experimentation. Monitoring metrics like the 95th percentile can highlight small but impactful performance degradations. By aligning operational strategies with SLAs, SLOs, and error budgets, organizations can uphold technical excellence and meet both user expectations and business objectives.
