We migrated to Grafana’s LGTM stack: here is the story

Moray Baruh
Published in Valensas
Jan 17, 2024


Introduction

LGTM stands for Loki, Grafana, Tempo, and Mimir. It’s Grafana’s stack of tools for collecting and visualizing logs, metrics, and traces in a single, well-integrated platform. At Valensas, we are always on the lookout for new and promising technologies, and LGTM became one of them. We gradually switched our observability stack to LGTM; here is how and why.

Grafana

Grafana is the visualization tool for the whole stack. It allows you to visualize metrics, logs, traces, and much more with its plentiful dashboards and integrations.

We first used Grafana back in 2020 on OpenShift, where it comes bundled with OpenShift Monitoring. Even though we have since mostly switched to Rancher Kubernetes distributions, Grafana remained.

Loki

Loki is the log aggregator of the LGTM stack. Logs are pushed to Loki, indexed based on given labels, and stored in a configured backend. This can be the filesystem, a database (Cassandra, BigTable, DynamoDB), or object storage. The logs can then be queried from Grafana or through Loki’s API using LogQL, a query language similar to Prometheus’ PromQL.
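To give a feel for the similarity, here is a hypothetical LogQL query (the label names and values are made up for this example) that counts error lines per pod over five-minute windows, much like you would combine rate() and sum() in PromQL:

```logql
sum by (pod) (
  rate({namespace="payments", app="checkout"} |= "error" [5m])
)
```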

Back when we were still using OpenShift, we were using OpenShift Logging, which deploys the ELK logging stack (though it seems that Loki is now supported in newer versions). Elasticsearch quickly became an issue as it uses considerable amounts of memory, which we could not afford. One of our more senior developers recommended Grafana Loki with Promtail as the collector.

After trying the setup, it became clear that Loki was a much better option than the ELK stack. It uses fewer resources, and having a single tool for logs and metrics (Grafana) was much more comfortable than having two different ones (Grafana + Kibana). The fact that Loki uses LogQL as its query language, which is very similar to PromQL, was also an advantage as we already knew PromQL. After switching to the Rancher stack, we switched our collector to Rancher Logging (Fluent Bit + Fluentd), but Loki remained.

Grafana Agent

Grafana Agent is the unsung hero of the LGTM stack. It is a multi-purpose collector that gathers and processes logs, metrics, traces, and more, and sends them to the relevant backends. It is easy to configure thanks to its River configuration language, which is reminiscent of Terraform. Its UI shows a nice graph of all the configured pipelines along with any available debug information. We also switched our log collector to Grafana Agent to finalize the migration of our logging stack. It supports many log sources, including journal logs, syslog, Kubernetes Pod logs, and files. This was more than enough for all our use cases.
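As a minimal sketch (the Loki URL is a placeholder for wherever your Loki gateway is exposed), a River pipeline that tails Kubernetes Pod logs and pushes them to Loki looks roughly like this:

```river
// Discover Pods in the cluster.
discovery.kubernetes "pods" {
  role = "pod"
}

// Tail the logs of the discovered Pods.
loki.source.kubernetes "pod_logs" {
  targets    = discovery.kubernetes.pods.targets
  forward_to = [loki.write.default.receiver]
}

// Push the collected logs to Loki (placeholder URL).
loki.write "default" {
  endpoint {
    url = "http://loki-gateway.loki.svc/loki/api/v1/push"
  }
}
```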

Mimir

Mimir is the long-term metric storage of the LGTM stack. It receives Prometheus metrics through its Prometheus remote write API and stores them on object storage. Metrics can be queried through Grafana or Mimir’s API using PromQL. Mimir is Prometheus-compatible, meaning that any application using Prometheus APIs can use Mimir without any change in the codebase.
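Because of that compatibility, pointing an existing Prometheus (or anything else that speaks the remote write protocol) at Mimir is just a matter of adding a remote_write entry; the URL below is a placeholder for wherever Mimir’s gateway is exposed:

```yaml
remote_write:
  - url: http://mimir-gateway.mimir.svc/api/v1/push
```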

When we saw Mimir’s announcement last year, we were very excited: it promised long-term metric management with high scalability. We were already very happy with Loki, which made us open to trying out other Grafana products.

At that time, we were using Prometheus Operator together with Thanos for long-term metric storage. However, we regularly experienced issues with Thanos:

  • Queries were slow, especially for larger time windows, but also for smaller ones.
  • Thanos Ruler’s performance was even worse: it could not evaluate certain rules from kube-prometheus-stack.
  • Thanos Compactor would occasionally halt and stop compaction. This would require manual intervention.

With that in mind, we tried Mimir at the first opportunity. We quickly saw that its performance and reliability were much better than those of Thanos.

The only problem was that we used Prometheus Operator CRDs to create alerting and recording rules as well as discovery targets. This works well with Thanos, but Mimir does not support it. This was very important to us, as many third-party Helm charts rely on these CRDs to configure Prometheus monitoring. This is where Grafana Agent comes to save the day: it can watch PrometheusRule resources and synchronize Mimir’s configuration accordingly.
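In River, this boils down to a single component; the address below is a placeholder for your Mimir gateway or ruler endpoint:

```river
// Watch PrometheusRule CRDs cluster-wide and sync them to Mimir's ruler.
mimir.rules.kubernetes "default" {
  address = "http://mimir-gateway.mimir.svc"
}
```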

Grafana Agent can also discover Prometheus targets from CRDs, scrape them for metrics, and send them to Mimir. It automatically shards the targets, allowing the load to be properly distributed between multiple replicas in the cluster. This way, we no longer had any use for Prometheus either.
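Here is a rough sketch of such a pipeline, with the same placeholder Mimir URL as above; the clustering block is what lets multiple Agent replicas share the discovered targets between them:

```river
// Discover and scrape targets defined by ServiceMonitor CRDs.
prometheus.operator.servicemonitors "default" {
  forward_to = [prometheus.remote_write.mimir.receiver]

  // Distribute scrape targets across Agent replicas.
  clustering {
    enabled = true
  }
}

// Ship the scraped samples to Mimir over the Prometheus remote write API.
prometheus.remote_write "mimir" {
  endpoint {
    url = "http://mimir-gateway.mimir.svc/api/v1/push"
  }
}
```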

Finally, we also configured Grafana Agent to get metrics from our Postgres and Dnsmasq instances and got rid of the standalone Prometheus exporters.
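For Postgres, for example, the Agent’s embedded exporter replaces a standalone postgres_exporter deployment; the connection string below is of course a placeholder, and the samples are forwarded to the same remote write component sketched above:

```river
// Embedded Postgres exporter, replacing a standalone postgres_exporter.
prometheus.exporter.postgres "main" {
  data_source_names = ["postgresql://user:password@postgres.db.svc:5432/postgres?sslmode=disable"]
}

// Scrape the exporter and forward the metrics towards Mimir.
prometheus.scrape "postgres" {
  targets    = prometheus.exporter.postgres.main.targets
  forward_to = [prometheus.remote_write.mimir.receiver]
}
```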

Tempo

Tempo is the tracing backend of the LGTM stack. It collects traces in Jaeger, Zipkin, and OpenTelemetry formats. You can then query the traces from Grafana and visualize the path of every request within a distributed system.

Up until this point, we had been using Jaeger for tracing. Our main problem was that Jaeger can only use Elasticsearch or Cassandra as its storage backend. The resource usage and maintenance costs of both of these databases were quite high, especially since we had no other use case for them. This was a good enough reason for us to try switching to Tempo, especially since we already had a common object storage for Loki, Mimir, and other tools.

As Tempo is fully compatible with Jaeger, we just had to change the collector endpoint to Tempo on our Istio installation and everything worked out of the box.
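As a hypothetical sketch of what that endpoint change can look like in Istio’s mesh configuration (the host name is a placeholder, and it assumes Tempo’s Zipkin-compatible receiver is enabled), the change amounts to pointing the mesh’s tracer at Tempo:

```yaml
meshConfig:
  enableTracing: true
  defaultConfig:
    tracing:
      zipkin:
        address: tempo.tempo.svc:9411
```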

Notes on Deployment

Loki, Mimir and Tempo all have three different modes of deployment:

  • monolithic: all components are run from a single workload
  • simple scalable: read and write operations are separated into different workloads
  • microservices: all components are separated into distinct workloads

This allows anyone to select the best mode depending on their expected load and available resources. Monolithic mode is the best option for quickly trying out the tools or when the data volume is relatively small, while microservices mode is the most performant and scales best for larger volumes of data. The simple scalable mode sits right in between.

We used the monolithic deployment mode for Loki and Tempo, and the microservices mode for Mimir, as Mimir does not have an official Helm chart with monolithic support.

Conclusion

The LGTM stack is very versatile, and we are very happy to have switched. The issues we had experienced with our older stack (mostly around maintainability, performance, resource usage, and reliability) simply went away. Grafana Agent is also a very handy addition: it can be configured to perform a wide variety of tasks very easily and helped us get rid of additional workloads, making our observability stack much simpler.

Our new observability stack
