
Prometheus storage: technical terms for humans

Many technical terms are used when referring to Prometheus storage, either local storage or remote storage. New users may be unfamiliar with these terms, which can lead to misunderstandings. This article explains the most commonly used ones.

Time Series

Prometheus stores all the collected data as time series. Each time series has a name. For example:
  • node_cpu_seconds_total — the total number of CPU seconds used
  • node_filesystem_free_bytes — the amount of free space on a filesystem mount point
  • go_memstats_sys_bytes — the amount of memory used by a Go application

In addition to the name, each time series can have an arbitrary number of label="value" pairs. For instance:

  • go_memstats_sys_bytes{instance="foobar:1234",job="node_exporter"}
  • prometheus_http_requests_total{handler="/api/v1/query_range"}

Each time series is uniquely identified by its name plus its set of labels. For example, all of the following are distinct time series:

  • temperature
  • temperature{city="NY"}
  • temperature{city="SF"}
  • temperature{city="SF", unit="Celsius"}
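
A PromQL series selector returns every time series whose labels match the given matchers. As a small illustration using the hypothetical temperature series above, the following selector matches both SF series, with and without the unit label:

temperature{city="SF"}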

Data Point or Sample

A data point (also called a sample) is a (timestamp, value) pair belonging to a particular time series. On every scrape Prometheus appends a new sample to each scraped time series.
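
For illustration, here is what a single sample may look like in the Prometheus text exposition format, with hypothetical values (the trailing number is an optional timestamp in milliseconds):

temperature{city="NY"} 15.3 1636502400000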

Metric Types

  • Gauge — a value that can go up and down at any given time. For example, temperature or memory usage.
  • Counter — a value that starts from 0 and never goes down. For example, request count or distance traveled. There is one exception, called counter reset: the counter resets to 0 when the service exposing the metric is restarted.
  • Summary — maintains a set of pre-configured percentiles for the value.
  • Histogram — maintains a set of counters (aka buckets) for different value ranges; see the example after this list. The set of buckets can be pre-configured (static) or generated dynamically. See this article for details.
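
For example, a histogram is exposed as a _bucket counter per le (less than or equal) boundary, plus _sum and _count series. The values below are made up:

prometheus_http_request_duration_seconds_bucket{handler="/api/v1/query",le="0.1"} 240
prometheus_http_request_duration_seconds_bucket{handler="/api/v1/query",le="0.4"} 248
prometheus_http_request_duration_seconds_bucket{handler="/api/v1/query",le="+Inf"} 250
prometheus_http_request_duration_seconds_sum{handler="/api/v1/query"} 12.7
prometheus_http_request_duration_seconds_count{handler="/api/v1/query"} 250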

High Cardinality

High cardinality means a large number of unique time series. It usually comes from labels with many distinct values, such as user_id or url. High cardinality results in high RAM usage and slow queries.

Prometheus exposes information about the time series with the highest cardinality on the /status page starting from v2.14.0 — see this PR for details.
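
A common way to find the metric names with the largest number of time series is a PromQL query like the following (a sketch; the selector matches all series, so the query may be heavy on large setups):

topk(10, count by (__name__) ({__name__=~".+"}))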

Active time series

An active time series is a time series that has recently received new samples. Prometheus holds active time series in memory, so a large number of them translates into high RAM usage.
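
The current number of active time series can be obtained from the following metric exposed by Prometheus:

prometheus_tsdb_head_series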

Churn rate

Churn rate is the rate at which old time series are substituted by new ones, for example when Kubernetes deployments are rolled over and pods get new names. A high churn rate inflates the total number of stored time series.

Prometheus exposes the prometheus_tsdb_head_series_created_total metric, which can be used for estimating the churn rate using the following PromQL query:

rate(prometheus_tsdb_head_series_created_total[5m])

Starting from v2.10, Prometheus exposes the per-target scrape_series_added metric, which can be used for determining the source of a high churn rate:

sum(sum_over_time(scrape_series_added[5m])) by (job)

See also this article, which explains churn rate in Prometheus in more detail.

Ingestion rate

Prometheus periodically scrapes the configured targets. All the configured targets can be viewed on the /targets page — see for example http://demo.robustperception.io:9090/targets .

Each target exposes metrics. Metric values are appended to the corresponding time series after each scrape. For example, the target http://demo.robustperception.io:9090/metrics exposes 964 metrics:

curl -s http://demo.robustperception.io:9090/metrics | grep -vc '#'
964

Ingestion rate can be calculated from the scrape_samples_scraped metric exposed by Prometheus using the following PromQL query, where dividing by 300 (the number of seconds in the 5m window) converts the sum into samples per second:

sum_over_time(scrape_samples_scraped[5m]) / 300

Additionally, the ingestion rate can be estimated using the following formulas:

  • ingestion_rate = targets_count * metrics_per_target / scrape_interval
  • ingestion_rate = active_time_series / scrape_interval
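
As a quick sketch with made-up numbers: 400 targets, each exposing 1,000 metrics and scraped every 10 seconds, give

ingestion_rate = 400 * 1000 / 10s = 40,000 samples/second

The same setup has 400 * 1000 = 400,000 active time series, so the second formula gives the same 40,000 samples/second.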

Scrape interval

Scrape interval defines how frequently Prometheus scrapes each target. A lower scrape interval results in a higher ingestion rate and higher RAM usage for Prometheus, since more data points must be kept in RAM before they are flushed to disk.
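
As a minimal sketch, the scrape interval is set in prometheus.yml globally and can be overridden per scrape job (the job and target below are hypothetical):

global:
  scrape_interval: 15s

scrape_configs:
  - job_name: node_exporter
    # override the global default for this job
    scrape_interval: 10s
    static_configs:
      - targets: ['localhost:9100']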

Retention

Retention defines how long Prometheus keeps the collected data. The lowest supported retention in Prometheus is 2 hours (2h). Such a low retention can be useful when configuring remote storage for Prometheus. In this case Prometheus simultaneously replicates all the scraped data to local storage and to all the configured remote storage backends. This means that the retention for local storage can be minimal, since all the data is already replicated to remote storage, and the remote storage can be used for querying from Grafana and any other clients that support the Prometheus querying API.
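
A minimal sketch of starting Prometheus with a 2-hour retention, assuming Prometheus v2.8+ where the --storage.tsdb.retention.time flag is available:

prometheus --config.file=prometheus.yml --storage.tsdb.retention.time=2h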

Note that the configured retention must cover time ranges for alerting and recording rules.

Relabeling

Relabeling allows renaming, modifying, or dropping labels, targets, and scraped metrics before they are stored. It is configured via the relabel_configs (applied to discovered targets) and metric_relabel_configs (applied to scraped metrics) sections in prometheus.yml. Relabeling is frequently used for fighting high cardinality by dropping unneeded labels or time series.
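
As a minimal sketch, the following metric_relabel_configs section drops all go_memstats_* series for a scrape job before they are stored (the job name and target are hypothetical):

scrape_configs:
  - job_name: node_exporter
    static_configs:
      - targets: ['localhost:9100']
    metric_relabel_configs:
      # drop Go runtime series to reduce the number of stored time series
      - source_labels: [__name__]
        regex: 'go_memstats_.*'
        action: drop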

Remote Storage

Prometheus can replicate all the scraped data to remote storage backends configured via the remote_write section in prometheus.yml (a minimal example follows the list below). Remote storage is commonly used for:

  • Collecting data from many Prometheus instances into a single remote storage, so all the data can be queried and analyzed together. This is sometimes called global query view.
  • Storing long-term data in remote storage, so the local TSDB in Prometheus can be configured with a low retention in order to occupy a small amount of disk space.
  • Overcoming scalability issues for Prometheus, which cannot automatically scale to multiple nodes. Certain remote storage solutions such as VictoriaMetrics can scale both vertically (i.e. on a single computer) and horizontally (i.e. clustering over multiple computers).
  • Running Prometheus in a Kubernetes cluster with ephemeral storage volumes, which can disappear after a pod restart (aka stateless mode). This is safe, since all the scraped data is immediately replicated to the configured remote storage backends.
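
As a minimal sketch, assuming a VictoriaMetrics instance reachable at victoriametrics:8428 (the hostname is hypothetical), the following prometheus.yml snippet enables replication to remote storage:

remote_write:
  - url: "http://victoriametrics:8428/api/v1/write"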

Conclusion

I hope this article makes the most commonly used technical terms around Prometheus storage easier to understand.

P.S. I’m the author of VictoriaMetrics — an open source, cost-effective remote storage for Prometheus with easy setup and operation. I’d recommend taking a look at it if you use Prometheus at work. And join our Slack chat.
