
How to optimize PromQL and MetricsQL queries

PromQL and MetricsQL are powerful query languages. They allow writing simple queries for building good-looking graphs over time series data. They also allow writing sophisticated queries for SLI / SLO calculations and alerts. But it may be hard to optimize PromQL queries. This article shows how to identify slow PromQL queries, how to understand query costs and how to optimize these queries, so they execute faster and consume less CPU and RAM.

How to determine whether the PromQL query is slow?

Unfortunately, it is impossible to tell whether a PromQL query is slow or fast just by looking at it :( Besides the query itself, PromQL query performance depends on the following factors:

  • The number of time series the query selects (read this article if you aren’t sure what “a time series” means). If the query selects millions of time series, then it may take gigabytes of RAM, a high number of disk IOPS and a big share of CPU time during the execution.
  • The number of raw samples the query needs to scan (read the same article if you aren’t sure what “raw sample” means). This number equals the sum of raw samples for all the matching time series on the selected time range. The number of samples also depends on the interval between samples stored in the database (aka scrape_interval in the Prometheus world): a shorter interval means a higher number of samples.
  • The time range used for building the graph in Grafana. A longer time range means more data must be read from the database. Note that longer time ranges may increase both the number of samples to scan and the number of matching time series if old series are substituted with new series over time (aka series churn rate).
  • The total number of time series in the database on the selected time range. More series usually means slower lookups for the matching series. Note that the number of active time series can be much smaller than the total number of series on the given time range if old series are frequently substituted with new series due to high churn rate.
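The relationship between these factors can be sketched with simple arithmetic. The snippet below is a rough back-of-the-envelope estimate, not something Prometheus or VictoriaMetrics exposes; the function name and the assumption of a constant scrape interval with no gaps are mine.

```python
def estimate_raw_samples(series: int, time_range_s: int, scrape_interval_s: int) -> int:
    """Rough number of raw samples a query must scan.

    Assumes every matching series has one sample per scrape interval
    over the whole time range (no gaps, no churn).
    """
    samples_per_series = time_range_s // scrape_interval_s
    return series * samples_per_series


# 1,000 series over the last hour with a 15s scrape interval
print(estimate_raw_samples(1_000, 3600, 15))

# 100,000 series over the last 24 hours with the same scrape interval
print(estimate_raw_samples(100_000, 24 * 3600, 15))
```

The first case gives 240 thousand samples, while the second gives 576 million, even though the query text could be exactly the same.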

For example, the simple-looking query up may execute in the blink of an eye if it matches a few time series with up to a few thousand samples on the selected time range. On the other hand, the same up query may take tens of seconds and may consume gigabytes of RAM if it matches hundreds of thousands of time series with tens of billions of samples on the selected time range.

Query performance also depends on the query itself. For example, different functions have different performance costs. MetricsQL and PromQL have the following function types:

PromQL and MetricsQL have an additional feature, subqueries, which may slow down the query significantly if used improperly. See how subqueries work in order to understand how they can influence query performance and resource usage.

As you can see, it is quite hard to determine whether a query is fast or slow just by looking at it. The best way to find slow queries is to run them in production and see which ones are slow when executed on real data. VictoriaMetrics provides the following additional options, which may help identify slow queries:

  • Slow query log. VictoriaMetrics logs queries if their execution time exceeds the value passed to the -search.logSlowQueryDuration command-line flag.
  • Query stats. They are available at the /api/v1/status/top_queries page, which shows the most frequently executed queries as well as the slowest queries. See these docs for additional details.

OK, the slow queries are detected. How do you determine why they are slow?

Why is the PromQL query slow?

Let’s recap the most common cases when a PromQL or MetricsQL query can be slow:

  • When it selects a large number of time series.
  • When it selects a large number of raw samples.
  • When individual label filters used in the query match a large number of time series.

So you need to check each case in order to find the one that makes the query slow.

How many time series does the given PromQL query select?

Use the combination of the count and last_over_time functions over the series selector from the query. For example, if the query looks like avg(rate(http_requests_total[5m])) by (job), then the following query will return the number of time series the initial query needs to select:

count(last_over_time(http_requests_total[5m]))

This works for instant queries, such as those used in alerting and recording rules. These queries return values for a single point in time only. If the query is used in Grafana for building a graph on the given time range, then range queries are used instead. They calculate independent results for each point on the graph, so under high churn rate they can select and process many more time series than count(last_over_time(...)) shows on the graph. In this case you should put the selected time range in square brackets to obtain the real number of time series the query needs to select. For example, use the following query to get the number of series the original query touches when building a graph for the last 24 hours:

count(last_over_time(http_requests_total[24h]))

If the count(last_over_time(...)) query returns values smaller than a few thousand, then the initial query has no bottlenecks in the number of selected time series. If the returned number exceeds tens of thousands, then you may be in trouble. Try the following approaches for reducing the number of selected time series:

  • Add more specific label filters to the series selector, so it selects fewer time series.
  • Detect and fix the source of high churn rate. These are usually labels whose values change periodically over time. Such labels can be detected via the /api/v1/status/tsdb page.
  • Reduce the lookbehind window in square brackets or reduce the time range for the graph in Grafana. This may help reduce the number of selected time series under high churn rate.

How many samples does the given PromQL query select?

Use the combination of the sum and count_over_time functions over the series selector from the query. For example, if the query looks like avg(rate(http_requests_total[5m])) by (job), then the following query will return the number of samples the initial query needs to select:

sum(count_over_time(http_requests_total[5m]))

This works for alerting and recording rules. If you need to determine the number of raw samples selected for building a graph in Grafana on the given time range, then the lookbehind window in square brackets must be changed to that time range. For example, the following query returns the number of raw samples needed to build a graph for the last 24 hours:

sum(count_over_time(http_requests_total[24h]))
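Both diagnostic queries, the series count from the previous section and the sample count from this one, follow the same pattern: wrap the series selector and put the window of interest in square brackets. A small helper can generate them for any selector. This is only a convenience sketch; the function name and the dictionary keys are hypothetical.

```python
def diagnostic_queries(selector: str, window: str) -> dict:
    """Build the series-count and sample-count queries for a series selector.

    selector: the series selector taken from the slow query,
              e.g. 'http_requests_total{job="api"}'
    window:   the lookbehind window for instant queries (e.g. '5m')
              or the graph time range for range queries (e.g. '24h')
    """
    return {
        "series": f"count(last_over_time({selector}[{window}]))",
        "samples": f"sum(count_over_time({selector}[{window}]))",
    }


queries = diagnostic_queries("http_requests_total", "24h")
print(queries["series"])
print(queries["samples"])
```

The printed queries can then be pasted into Grafana Explore or any other PromQL-compatible query UI.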

If the sum(count_over_time(...)) query returns values smaller than a few million, then the initial query has no bottlenecks with the number of raw samples to scan. If the returned number exceeds hundreds of millions, then you may need to optimize the query via the following techniques:

  • Reducing the number of selected time series as explained in the previous section.
  • Reducing the lookbehind window in square brackets, so fewer raw samples need to be processed during the query.
  • Reducing the time range for the graph in Grafana.
  • Adjusting the Resolution option for the graph in Grafana, so it requests fewer points for building the graph (which means a larger step). See these docs.

How many samples does the given PromQL query process?

Usually the number of processed raw samples matches the number of selected raw samples. But sometimes the number of processed raw samples may significantly exceed the number of selected raw samples. PromQL and MetricsQL can process the same raw sample multiple times if the following conditions are met:

  • The query is used for building a graph in Grafana.
  • The lookbehind window in square brackets exceeds the interval between points on the graph (Grafana passes this interval via the step query arg to /api/v1/query_range).

For example, suppose the query max_over_time(process_resident_memory_bytes[30m]) is used for building a graph for the last hour. Grafana will then likely pass a step query arg to /api/v1/query_range that is much smaller than the 30 minutes used in the lookbehind window. This means that every raw sample is evaluated many times when calculating max_over_time for each point on the graph. The following MetricsQL query can be used for determining the number of samples the query needs to process:

range_last(
  running_sum(
    sum(
      count_over_time(process_resident_memory_bytes[30m])
    )
  )
)

If the returned number exceeds hundreds of millions, then it is time to optimize the query by reducing the lookbehind window in square brackets. Another optimization option is to instruct Grafana to increase the step query arg by adjusting the Resolution option, so the graph is built from fewer points. See these docs for details.
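To see why the lookbehind window and the step matter so much, here is a rough estimate of selected versus processed samples. It is a simplified model under my own assumptions (a constant scrape interval, no gaps, and one full window evaluated per graph point); the function names are hypothetical and real engines may apply optimizations that this model ignores.

```python
def selected_samples(series: int, time_range_s: int, window_s: int, scrape_interval_s: int) -> int:
    """Raw samples read from storage: each one is read once.

    The first graph point looks back window_s before the start of the range.
    """
    return series * (time_range_s + window_s) // scrape_interval_s


def processed_samples(series: int, time_range_s: int, window_s: int,
                      step_s: int, scrape_interval_s: int) -> int:
    """Raw samples evaluated by the rollup function across all graph points."""
    points = time_range_s // step_s + 1            # points on the graph
    samples_per_window = window_s // scrape_interval_s
    return series * points * samples_per_window


# One series, 1h graph, 30m lookbehind window, 15s scrape interval
print(selected_samples(1, 3600, 1800, 15))           # 360
print(processed_samples(1, 3600, 1800, 15, 15))      # step=15s -> 28920
print(processed_samples(1, 3600, 1800, 60, 15))      # step=60s -> 7320
```

In this model, a 15s step makes the query process about 80 times more samples than it selects, and raising the step to 60s cuts the processed samples roughly fourfold.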

High churn rate and PromQL query performance

Query performance may degrade under high churn rate for the time series. High churn rate increases the total number of time series in the database. This also increases the number of matching time series for each label filter used in the query. This, in turn, slows down the process of finding the matching time series.

This is a common issue for Kubernetes monitoring with frequent deployments. Each new deployment increases series churn rate, since metrics exposed by the deployed objects (pods, endpoints, services) may have new values for deployment-related labels such as deployment id, pod id, image id, etc. The solution is to remove or rewrite these labels with Prometheus relabeling rules, so they don’t change with each deployment. The labels with the most frequently changed values can be detected via the /api/v1/status/tsdb page. See these docs for details.
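The effect of a volatile label on the number of series can be illustrated with a toy model. The label names and values below are made up for the example, and the helper only mimics what dropping a label via relabeling would do to series identity; it is not how any real TSDB counts series.

```python
def count_series(label_sets, drop_labels=()):
    """Count distinct time series (unique label sets) after dropping some labels."""
    unique = set()
    for labels in label_sets:
        kept = tuple(sorted((k, v) for k, v in labels.items() if k not in drop_labels))
        unique.add(kept)
    return len(unique)


# Hypothetical data: 3 deployments of a service with 2 replicas.
# The "pod" label changes on every deployment, "replica" stays stable.
scraped = [
    {"__name__": "http_requests_total", "job": "api",
     "replica": str(r), "pod": f"api-{d}-{r}"}
    for d in range(3)
    for r in range(2)
]

print(count_series(scraped))                         # 6 series with the pod label
print(count_series(scraped, drop_labels=("pod",)))   # 2 series without it
```

Note that dropping a label is only safe when the remaining labels still uniquely identify each scrape target; otherwise samples from different targets would collide in the same series.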

Optimizing complex PromQL queries

PromQL allows writing queries with multiple series selectors. For example, foo + bar contains two series selectors: foo and bar. These series selectors are evaluated independently of each other. This means that the optimization process should be performed independently for each series selector in the query.

There is a common performance issue with complex queries when the series selectors on the left and the right side of a binary operator return different numbers of time series with different sets of labels. PromQL and MetricsQL strip the metric name and then apply the binary operation individually to series with identical sets of labels on the left-hand and right-hand sides, while the other series are just dropped. See these docs for details. This means that series without a matching pair on the other side of the binary operator just waste compute resources (CPU time, RAM, disk IO). The solution is to add common label filters to the series selectors on both sides of the binary operator. This reduces the number of selected time series and, consequently, resource usage. For example, the query foo{instance="x"} + bar can be optimized to foo{instance="x"} + bar{instance="x"}, so the bar selector narrows the search down to time series with the {instance="x"} label.
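The matching rule described above can be modeled in a few lines. This is a simplified sketch of default one-to-one matching (no on, ignoring or group modifiers); the function names and the sample series are illustrative only.

```python
def match_key(labels):
    """Label set used for matching: all labels except the metric name."""
    return frozenset((k, v) for k, v in labels.items() if k != "__name__")


def binary_op_stats(left, right):
    """Count series that find a pair and series that are fetched but dropped."""
    left_keys = {match_key(s) for s in left}
    right_keys = {match_key(s) for s in right}
    matched = left_keys & right_keys
    dropped = len(left_keys - matched) + len(right_keys - matched)
    return {"matched": len(matched), "dropped": dropped}


foo = [{"__name__": "foo", "instance": "x"}]
bar = [{"__name__": "bar", "instance": i} for i in ("x", "y", "z")]

# foo{instance="x"} + bar: two bar series are selected and then thrown away
print(binary_op_stats(foo, bar))            # {'matched': 1, 'dropped': 2}

# foo{instance="x"} + bar{instance="x"}: nothing is wasted
bar_filtered = [s for s in bar if s["instance"] == "x"]
print(binary_op_stats(foo, bar_filtered))   # {'matched': 1, 'dropped': 0}
```

The result of the binary operation is the same in both cases; only the amount of wasted work differs.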

Conclusion

MetricsQL and PromQL query optimization isn’t an easy task. It depends on both the query itself and the queried data. Prometheus and VictoriaMetrics provide some tools for finding slow queries and detecting which part of a query is slow. This article shows common techniques for finding slow parts of a query and provides possible optimization solutions. I hope this article will help you optimize PromQL queries in production.

Founder and core developer at VictoriaMetrics