Monitoring: dashboards and metrics queries.

Quickly identifying problems is, well, difficult. We have all been there. Metrics give you a general overview of various components, but they can also provide deeper insight into what exactly is going wrong. Having the right graphs available is crucial during an incident.

This blog is part of a Monitoring blog series, where we introduce you to the power of Grafana, metrics, monitoring and so on. Check out the introduction blog post here.

Together with some customers we organize a Cloudbear Day every couple of months, in which we share various tips ‘n tricks with (new) developers for using Grafana with PromQL and Mimir. This has proven to be really useful, as developers start to use the data available within these platforms to rapidly improve their applications. Let's do a short write-up of what we normally present during such a day, here on our blog.

Let's dive into metrics, and answer the following questions:

  • What strategies are there for optimal data visualization?
  • How to effectively query data?

Follow us on LinkedIn to keep up-to-date about the series.

Overview and specifics.

As mentioned in the introduction, having an overview is important. It shows you at a glance what your application is doing, so you are able to spot problems quickly. However, once you have spotted the troubled service or component on the overview, you also need more fine-grained dashboards for inspecting it. With this in mind, we normally deploy:

  • A general overview dashboard.
  • One or more dashboards per service or component.

Having a clear separation of concerns between your dashboards keeps them organized and uncluttered, as dashboards can quickly grow into a mess. Believe me.

Overview

By default we provide an overview dashboard that gives customers insights into their application’s performance and the usage of some services. Think:

  • HTTP response times at the 50th, 90th and 95th percentile;
  • HTTP requests per second, per HTTP status code class (2xx, 3xx, 4xx, 5xx);
  • CPU and memory consumption;
  • Number of queries to Redis, MySQL and similar services;
  • Response times for the same services;
  • Specific graphs that are unique for each service and important for its operation. For example, with MySQL, we show the number of temporary objects.
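
As a rough sketch, the first two panels in this list could be backed by queries like the ones below. The metric names http_request_duration_seconds_bucket and http_requests_total and the status label are assumptions for illustration only; substitute whatever your application or ingress actually exports. Each query goes into its own panel.

```promql
# 95th percentile HTTP response time over the last 5 minutes.
# Assumes a Prometheus histogram named http_request_duration_seconds.
# Change 0.95 to 0.5 or 0.9 for the other percentiles.
histogram_quantile(0.95, sum by (le) (rate(http_request_duration_seconds_bucket[5m])))

# Requests per second, grouped by status code class (2xx, 3xx, 4xx, 5xx).
# Assumes a counter http_requests_total with a "status" label such as "200".
sum by (status_class) (
  label_replace(rate(http_requests_total[5m]), "status_class", "${1}xx", "status", "(.).*")
)
```

The label_replace call copies the first digit of the status code into a new status_class label, so that 200 and 204 end up on the same 2xx line.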

Besides helping you spot problems early, such an overview also works wonders in other ways. For example, you can make it a challenge to keep requests below 1 second, or to keep the CPU/memory consumption low. These improvements can cut costs due to lower resource usage and give visitors a better experience with a more responsive application.

Most of these metrics speak for themselves. One, however, might need some explanation: the percentile metric. We'll dive into that later in this blog.

Specifics

The more specific dashboards are used for debugging problems within, for example, Redis or MySQL. Each of these services should have one or more dashboards, each covering specific components within the service.

The items displayed on these dashboards vary widely, so we normally recommend searching for an official or community-created dashboard and adapting it to your own needs from there. This ensures that you at least have the necessary graphs from the start. We did exactly the same for our MySQL dashboards, to which we have made various small quality-of-life improvements over the years.

Querying.

Maybe I should have put this paragraph before showing you dashboards with fancy graphs? Ah, well, now you are here, so let's talk about actually querying data and making fancy graphs!

There is one page in Grafana that I visit multiple times a day: the “Explore” page! Just as you are exploring PromQL through this blog, the Explore page is the place to be when writing queries.

Let's query some data. We deploy an Object Storage proxy on one of our Kubernetes clusters, and for development purposes I would like to know how much memory it's using. This metric is exposed by the kubelet and is named container_memory_working_set_bytes.

We deploy various other microservices, so we first need to limit the query to specific criteria. If you enter the metric name in the query field, type {} and press Ctrl + Space between the braces, you will get a beautiful dropdown of all possible labels. In this case I'd like to first limit the metrics to a specific Kubernetes namespace, so choose the namespace label. I'll do the same for the pod name, to ensure that I do not include other pods within that namespace. For the pod name I use a regex, so that only pods whose name starts with object-storage- are matched. We end up with the following query:

container_memory_working_set_bytes{namespace="object-storage", pod=~"object-storage-.*"}

Tada! ✨ Your first graph. But wait... you will most likely still see way more lines on the graph than you have pods running. That is because our filters are not complete yet, so the result most likely still contains more series than you need. To debug this more easily, I recommend setting “Format” to Table and “Type” to Instant. Then you'll see a perfect overview of which labels you have, and how you can improve your filtering.
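
If the table is long, a small aggregation can help narrow things down. The sketch below counts the matching series per value of one label; the label name container is an assumption here, so swap in whichever label differs between the rows in your table.

```promql
# Count the matching series per value of the "container" label,
# to see which label values are responsible for the extra lines.
count by (container) (
  container_memory_working_set_bytes{namespace="object-storage", pod=~"object-storage-.*"}
)
```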

Let's add the last filter to only show the object-storage container, and we end up with a beautiful graph. Awesome!
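
With that last filter in place, the query would look roughly like this. The label name container and its value object-storage are assumptions based on the description above; check the Table view for the exact label and value your cluster exposes.

```promql
# Working-set memory of only the object-storage container in each matching pod.
container_memory_working_set_bytes{namespace="object-storage", pod=~"object-storage-.*", container="object-storage"}
```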

That is it! Now you can add this graph to your dashboard using the Add to dashboard button in the upper right corner and customize it to your needs. The legend might not be so useful here? Let's disable it! Or you'd like to show the amount of memory in MiB? Set the unit to bytes (IEC), and Grafana will scale the values for you!