Our monitoring stack: The Grafana Stack

Effective infrastructure management starts with monitoring: you need to know what is happening inside your infrastructure before you can manage it well. In this blog post we walk through the components we use for monitoring.

At Cloudbear, we use The Grafana Stack to gain insight into both our internal systems and those of our customers, which lets us manage every part of the infrastructure with confidence. The Grafana Stack consists of three components: Mimir, Loki and Tempo. We combine these components into a single platform: the Managed Monitoring platform by Cloudbear.

This blog post is the first in our Monitoring blog series, which covers the technical details of the various components. Throughout the series, we will address topics such as:

  • How to query data effectively.
  • Strategies for data visualization.
  • The internal workings of some of the components.

Follow us on LinkedIn to stay up to date with the series.

The Managed Monitoring platform

Cloudbear includes The Grafana Stack in its clients’ infrastructure at no extra charge as part of the Managed Monitoring service. Clients receive pre-configured dashboards designed for their specific infrastructure, and they are free to create and customize their own dashboards to fit their specific needs.

The platform is a fully managed service hosted externally and serves as a central hub for monitoring. The Grafana Agent runs inside the customer’s infrastructure and ships its monitoring data to the Managed Monitoring platform. We chose the Grafana Agent because it uses few resources, which saves costs for our customers while keeping monitoring effective.

More on these workings later in this blog.

A dashboard created by one of Cloudbear’s customers, fully tailored to their needs.

Mimir for Metrics

Mimir is a time-series database for storing metrics: measurable data points that provide insight into system (and occasionally business) behavior. It offers cost-effective long-term storage and scales more easily than its alternatives. Mimir operates on a “push” model, in which external systems push metrics to Mimir. This lets us scale its components based on the volume of incoming metrics, much like how we scale customer applications.

Both we and our customers can query metrics from Mimir using PromQL, the Prometheus query language. Queries give insight into the performance and behavior of systems and customer applications. They can answer questions such as:

  • What is the 95th percentile response time of the application?
  • How many queries is MySQL currently handling?
  • How many temporary tables is MySQL creating?
  • How many requests is Redis currently handling, and what is the hit/miss ratio?
  • For a chat SaaS application, for example:
    • How many chats is my SaaS application handling?
    • What stage is each chat in?
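
As a rough illustration, some of the questions above could map to PromQL expressions like the ones in this Python sketch. The metric names are assumptions based on commonly used exporters (mysqld_exporter, redis_exporter) plus a hypothetical `http_request_duration_seconds` histogram, so your own metrics may be named differently. The base URL is a placeholder, and the `/prometheus` prefix assumes Mimir's default Prometheus-compatible HTTP path.

```python
from urllib.parse import urlencode

# Hypothetical endpoint; replace with your own Mimir address.
MIMIR_BASE_URL = "https://mimir.example.com/prometheus"

# Example PromQL expressions. All metric names are assumptions based on
# commonly used exporters and may differ in your setup.
QUERIES = {
    "p95_response_time": (
        "histogram_quantile(0.95, "
        "sum by (le) (rate(http_request_duration_seconds_bucket[5m])))"
    ),
    "mysql_queries_per_second": "rate(mysql_global_status_queries[5m])",
    "mysql_tmp_tables_per_second": (
        "rate(mysql_global_status_created_tmp_tables[5m])"
    ),
    "redis_hit_ratio": (
        "rate(redis_keyspace_hits_total[5m]) / "
        "(rate(redis_keyspace_hits_total[5m]) "
        "+ rate(redis_keyspace_misses_total[5m]))"
    ),
}


def instant_query_url(name: str, base_url: str = MIMIR_BASE_URL) -> str:
    """Build the URL for an instant query against the Prometheus HTTP API."""
    return f"{base_url}/api/v1/query?{urlencode({'query': QUERIES[name]})}"


if __name__ == "__main__":
    for name in QUERIES:
        print(name, "->", instant_query_url(name))
```

In practice these expressions would usually be typed straight into a Grafana dashboard panel; the URL builder only shows that the same query can also be sent over HTTP.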

These insights help us and our customers improve infrastructure and applications and move towards better performance. Queries also underpin the Service Level Objectives (SLOs) that are part of our Service Level Agreement (SLA). In this framework, rooted in Google’s SRE principles, an SLO defines a threshold for acceptable performance. When performance falls short of that threshold, error budget is consumed; the budget permits a controlled amount of failure within specified limits. This approach helps in designing fault-tolerant systems and applications.
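
To make the error budget idea concrete, here is a minimal arithmetic sketch in Python. The 99.9% target and the 30-day window are illustrative assumptions, not the figures from our SLA: with those numbers, the budget works out to 43.2 minutes of unavailability per window.

```python
def error_budget_minutes(slo_target: float, window_days: int = 30) -> float:
    """Minutes of allowed unavailability in the window for a given SLO target."""
    total_minutes = window_days * 24 * 60
    return total_minutes * (1.0 - slo_target)


def budget_remaining(
    slo_target: float, bad_minutes: float, window_days: int = 30
) -> float:
    """Fraction of the error budget still unspent (negative when overspent)."""
    budget = error_budget_minutes(slo_target, window_days)
    return (budget - bad_minutes) / budget


if __name__ == "__main__":
    # Assumed example values: a 99.9% availability SLO over 30 days.
    budget = error_budget_minutes(0.999)
    print(f"Error budget: {budget:.1f} minutes")
    print(f"Remaining after 10 bad minutes: {budget_remaining(0.999, 10):.0%}")
```

A negative remaining fraction means the budget is overspent, which is typically the signal to prioritize reliability work over new changes.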

Mimir itself is dedicated to ingesting, processing, storing, and querying metrics. Alongside Mimir, we deploy Alertmanager for alerting: predefined rules for all components and customer applications continuously evaluate specific queries, and a query that returns a result triggers an alert. In some cases, customers are notified about certain alerts so that we can investigate together. This helps identify issues such as failing cron jobs or consumers, and gives timely awareness of potential errors in customer applications.

Loki for Logs

Loki is to logs what Mimir is to metrics. It ingests and stores logs, so we can easily query, view and graph them. Loki serves as a central place for viewing all infrastructure and application logs and uses the same “push” model as Mimir. Logs become queryable by parsing them and creating labels based on values from the log lines; these labels can then be queried, just like metrics. This makes it much easier to gain insight into the various components and applications of the infrastructure.
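
The idea of deriving labels from log lines and then selecting on them can be sketched in a few lines of Python. This is a conceptual illustration only, not how Loki or the Grafana Agent is implemented; the log format and the `level` and `service` label names are assumptions made for the example.

```python
import re
from typing import Dict, Iterable, Iterator, Tuple

# Assumed log format for illustration: '<LEVEL> <service> <message>'.
LINE_PATTERN = re.compile(
    r"^(?P<level>[A-Z]+) (?P<service>[\w-]+) (?P<message>.*)$"
)

LabeledLine = Tuple[Dict[str, str], str]


def label_line(line: str) -> LabeledLine:
    """Derive labels from a log line; unparsable lines get an empty label set."""
    match = LINE_PATTERN.match(line)
    if match is None:
        return {}, line
    labels = {"level": match["level"].lower(), "service": match["service"]}
    return labels, line


def select(lines: Iterable[str], **wanted: str) -> Iterator[str]:
    """Yield the lines whose labels match every requested label value."""
    for line in lines:
        labels, raw = label_line(line)
        if all(labels.get(key) == value for key, value in wanted.items()):
            yield raw


if __name__ == "__main__":
    logs = [
        "ERROR checkout payment provider timed out",
        "INFO checkout order 1234 created",
        "ERROR mailer SMTP connection refused",
    ]
    for line in select(logs, level="error", service="checkout"):
        print(line)
```

In Loki itself, a comparable selection would be written as a LogQL stream selector such as `{service="checkout", level="error"}`, assuming those labels exist on the ingested streams.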

In future blog posts, we will dig deeper into topics such as SLAs, SLOs and SLIs, alerts, and the inner workings of Mimir, Loki and Tempo.