
A better alternative to the Graphite monitoring tool

--

Graphite is a great open source monitoring solution and time series database (TSDB) for metrics. It was first released in 2008 and has served well since then. Graphite has a simple architecture and isn’t hard to operate at small-to-medium scale. It consists of the following modular parts:

  • Carbon — receives the ingested metrics and forwards them to the storage backend (Whisper).
  • Whisper — stores the metrics received from Carbon on disk.
  • Graphite Webapp — processes queries over the stored metrics and renders graphs for these queries.

These parts are written in Python — a very popular programming language, which is easy to deal with. So, if you are missing some functionality, you can add it yourself by writing some Python code :)

Graphite has a very simple text-based data ingestion protocol — just push newline-delimited metrics in the form <metric_path> <value> <timestamp> over TCP to Carbon and that’s it! For example:

app42.host123.cpu_usage_percent 25.4 1744239509
app42.host123.memory_usage_mb 1523 1744239509
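A client for this protocol fits in a few lines. Here is a minimal sketch in Python; the host and port are assumptions (Carbon's plaintext listener conventionally runs on port 2003):

```python
import socket
import time

def format_metric(path, value, timestamp=None):
    """Format one sample in Graphite's plaintext protocol: <metric_path> <value> <timestamp>."""
    if timestamp is None:
        timestamp = int(time.time())
    return f"{path} {value} {timestamp}\n"

def send_metrics(lines, host="localhost", port=2003):
    """Push newline-delimited samples to Carbon over a single TCP connection."""
    with socket.create_connection((host, port), timeout=5) as sock:
        sock.sendall("".join(lines).encode())

# Usage (assumes Carbon is listening on localhost:2003):
# send_metrics([
#     format_metric("app42.host123.cpu_usage_percent", 25.4, 1744239509),
#     format_metric("app42.host123.memory_usage_mb", 1523, 1744239509),
# ])
```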

Many companies such as Booking.com (the case study), GitHub (the case study), Grammarly, Reddit and Salesforce started using Graphite in their observability departments and were very happy with it.

What’s wrong with Graphite?

While Graphite is a good starting point when you need monitoring for your infrastructure and applications, it may hit bottlenecks and issues at large scale:

  • The default data storage backend for Graphite — Whisper — hits disk IO limits when hundreds of thousands of unique metric paths are stored in it. This is because it stores the ingested samples (data points) for every metric path in a separate file on disk, so the number of files created by Whisper is proportional to the number of unique metric paths. While modern filesystems (such as ext4) may handle millions of files without issues, modern disks (both HDDs and SSDs) have physical limits on the number of read / write operations they can perform per second. HDDs usually perform up to a few hundred IOPS. This means they can store up to a few hundred data points per second if every data point must be persisted to disk with fsync for durability and consistency purposes. SSDs may perform up to a few million IOPS, so in theory they are limited to an ingestion rate of a few million data points per second. In reality, consumer SSDs perform far fewer fsyncs per second than their advertised IOPS, since every fsync requires rewriting a full erase block on the SSD (its size varies from 64KiB to several MiB), even if only a single byte has changed and must be fsync’ed. See this article for the real numbers.
  • Whisper pre-allocates disk space for every newly ingested metric path according to the configured data retention. For example, if Whisper is configured to store samples at a 30-second interval with a retention of 30 days, then it creates a file of 8 bytes/sample * 3600*24*30/30 samples ≈ 675KiB for every seen metric path. This may result in excess disk space usage when old metric paths are substituted with new metric paths at a high rate (aka the high churn rate issue). For example, Whisper needs 10M * 675KiB ≈ 6.3TiB of disk space for 10 million unique metric paths with 30 days retention and a 30-second interval between data points.
  • Aggregate queries over a large number of metric paths may become extremely slow, since Graphite must read data from millions of separate files on disk.
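The disk space math from the pre-allocation point above can be sketched in a few lines of Python (using the figure of 8 bytes per sample and ignoring Whisper's small per-file header, so real files are slightly bigger):

```python
def whisper_file_size(retention_days, interval_seconds, bytes_per_sample=8):
    """Disk space (in bytes) pre-allocated by Whisper for a single metric path."""
    samples = retention_days * 24 * 3600 // interval_seconds
    return samples * bytes_per_sample

per_file = whisper_file_size(retention_days=30, interval_seconds=30)
print(per_file / 1024)                 # → 675.0 (KiB per metric path)
print(per_file * 10_000_000 / 2**40)   # ≈ 6.3 TiB for 10 million metric paths
```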

Solutions

If you are hitting these Graphite issues while trying to store a large number of metrics at a high rate, then the following solutions exist:

  1. Substitute Whisper with a more efficient storage backend such as go-whisper or graphite-clickhouse. The downside is that these backends are harder to configure and manage than Whisper.
  2. Migrate to other observability solutions, which are better optimized for a large number of unique time series (metric paths), such as Prometheus or InfluxDB. The downside is that these backends support neither the Graphite data ingestion protocol nor the Graphite querying API. They also have a slightly different data model, which isn’t compatible with Graphite’s (e.g. they prefer labels / tags over metric path segments). This may require non-trivial effort to migrate the Graphite observability stack to the new stack.
  3. Migrate to VictoriaMetrics. The migration path from Graphite to VictoriaMetrics shouldn’t be hard, since it supports data ingestion via the Graphite plaintext protocol as well as the Graphite querying API. This means you don’t need to change the existing metrics collection and delivery pipelines. You can also continue using your Grafana dashboards with the Graphite datasource in Grafana.

Why VictoriaMetrics is the best alternative to Graphite

VictoriaMetrics is an open source monitoring tool, which can be used as a drop-in replacement for Graphite, since it supports both Graphite’s data ingestion and querying APIs. Unfortunately it cannot read Whisper data from disk, so historical data must be migrated from Whisper to VictoriaMetrics manually. Another option is to ingest data into both Graphite and VictoriaMetrics for the duration of Graphite’s configured retention, then switch off Graphite and drop all its data, since VictoriaMetrics will contain the same data by then.

VictoriaMetrics is optimized for handling hundreds of millions of active time series at an ingestion rate of millions of data points per second on a single node. It stores time series in a small set of files and minimizes the number of IO operations needed for storing and querying time series data. It compresses typical data points from production metrics to less than a byte per sample. See this real-world example. Another example shows how migrating from Graphite to VictoriaMetrics reduced infrastructure costs by more than 10x.

Like Graphite, VictoriaMetrics is very easy to set up and operate, since it consists of a single small executable, which doesn’t need any advanced configs — in fact, it runs well with the default configs (aka zero-config). If a single node isn’t enough for your workload, then you can migrate to the cluster version of VictoriaMetrics.
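As a sketch of how small that setup can be, the following starts a single-node VictoriaMetrics with its Graphite plaintext listener enabled (the binary name, data path and port here are assumptions; the flag names follow VictoriaMetrics’ documented options):

```shell
# Start single-node VictoriaMetrics and accept Graphite plaintext metrics on port 2003
./victoria-metrics-prod -graphiteListenAddr=:2003 -storageDataPath=/var/lib/victoria-metrics

# Existing pipelines can keep pushing metrics in the same Graphite format:
echo "app42.host123.cpu_usage_percent 25.4 $(date +%s)" | nc localhost 2003
```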

--


Written by Aliaksandr Valialkin

Founder and core developer at VictoriaMetrics
