Evaluating Performance and Correctness: VictoriaMetrics response

Recently, the Evaluating Performance and Correctness article was published by a Prometheus author. The article points out a few data model discrepancies between VictoriaMetrics and Prometheus. It also contains benchmark results showing a poor compression ratio and poor performance for VictoriaMetrics compared to Prometheus. Unfortunately, the original article doesn't support comments, so let's discuss all these issues in the post below.

Bad compression ratio

VictoriaMetrics users report 0.4–0.6 bytes per sample for production data according to the following PromQL query for the exported metrics:

sum(vm_data_size_bytes{type=~"storage/.*"}) / sum(vm_rows{type=~"storage/.*"})

So how can the 4 bytes per sample from Brian's benchmark turn into 0.4 bytes per sample on real-world data? The answer is:

  • The time series from the benchmark are far from real-world data.
  • Real-world measurements usually contain a small number of decimal digits within a limited range. For instance, the usual temperature range is from -460 F to 1000 F with 0.1 F precision, the usual speed range is from 0 m/s to 10K m/s with 1 m/s precision, the usual qps range is from 0 to 1M with 0.1 qps precision, and the usual price range is from $0 to $1M with $0.01 precision.
  • The number of decimal digits becomes even smaller after applying delta-coding, i.e. calculating the difference between adjacent samples. The difference is small for Gauges, since they tend to change slowly. The difference is small for Counters, since their rate is usually limited to a relatively small range. The number of decimal digits for Counters can be reduced further by applying double delta-coding to them.

VictoriaMetrics takes full advantage of these properties of real-world time series. Try storing real-world series into VictoriaMetrics, such as metrics from node_exporter, and enjoy a much better on-disk compression ratio than Prometheus, Thanos or Cortex can provide. VictoriaMetrics compresses real-world node_exporter data up to 7x better than Prometheus — see this benchmark.

High-entropy data (i.e. random numbers with many decimal digits) compresses much worse than typical time series data. So it is recommended to reduce the number of decimal digits for measurements stored in a TSDB in order to improve the compression ratio and reduce disk space usage. vmagent provides the -remoteWrite.roundDigits command-line flag, which allows reducing storage requirements for the data written to VictoriaMetrics.
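The idea behind such rounding can be sketched in a few lines of Go. This is a simplified illustration of rounding to a fixed number of decimal digits after the point, not the actual vmagent implementation:

```go
package main

import (
	"fmt"
	"math"
)

// roundToDecimalDigits rounds f to the given number of decimal digits
// after the point, discarding the noisy trailing digits that hurt
// compression. (A sketch, not the vmagent code.)
func roundToDecimalDigits(f float64, digits int) float64 {
	pow := math.Pow10(digits)
	return math.Round(f*pow) / pow
}

func main() {
	// A noisy measurement with many meaningless trailing digits.
	v := 0.0030884203

	fmt.Println(roundToDecimalDigits(v, 5)) // 0.00309
}
```

Rounded values produce much smaller deltas between adjacent samples, so they compress significantly better.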

It is unclear why Brian decided to use random series instead of real-world series for measuring compression ratio in the article.

Precision loss

Should VictoriaMetrics users worry about this? Mostly no, since:

  • The precision loss can occur only on values with more than 12 significant decimal digits. Such values are rare in the real world. Even summary counters for nanoseconds shouldn't lose precision. Of course, if you work at NASA, then you may need up to 15 decimal digits :)
  • Real-world measurements usually contain a small number of precise leading decimal digits. The rest of the digits are just noise, which has little value because of measurement errors. For instance, the metric mentioned above, go_memstats_gc_cpu_fraction, contains only 4 or 5 precise digits after the point (0.00308 in the best case); all the other digits are just garbage, which worsens the series' compression ratio.

Did you know that Prometheus also loses precision? Try storing 9.234567890123009 in it: the value will be stored as 9.234567890123008. See the verification link. Prometheus, like any solution that works with float64 values, has precision loss issues — see this link.
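The root cause is the float64 format itself: its 53-bit significand holds roughly 15 to 17 significant decimal digits, and anything beyond that is rounded away. A small Go sketch of the effect:

```go
package main

import (
	"fmt"
	"strconv"
)

func main() {
	// Integers above 2^53 can no longer be represented exactly in
	// float64, so 2^53 and 2^53+1 round to the same value.
	a := float64(1 << 53)   // 9007199254740992
	b := float64(1<<53 + 1) // rounds to 9007199254740992 as well
	fmt.Println(a == b)     // true

	// The same rounding hits decimal values with too many significant
	// digits. This prints what is actually stored (the article above
	// reports 9.234567890123008).
	fmt.Println(strconv.FormatFloat(9.234567890123009, 'f', -1, 64))
}
```

Any storage system keeping samples as float64, including Prometheus and VictoriaMetrics, is subject to this rounding.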

Stale timestamps in /api/v1/query results

VictoriaMetrics provides the following command-line flag for this:

-search.latencyOffset duration
    The time when data points become visible in query results after the collection. Too small value can result in incomplete last points for query results (default 1m0s)

Why doesn't Prometheus have a similar option? Because it controls data scraping: the data becomes visible for querying almost immediately after the scrape. VictoriaMetrics receives scraped data from Prometheus instances via the remote_write API. The data can be delayed for extended periods of time before it becomes visible for querying in VictoriaMetrics. Additionally, a non-zero -search.latencyOffset helps avoid issues related to query isolation.

VictoriaMetrics treats vector with a single nameless series as scalar and vice versa

This deviation in behavior between Prometheus and VictoriaMetrics is deliberate: it simplifies using PromQL for users who don't know the difference between a scalar, an instant vector and a range vector in Prometheus. Are there people other than Prometheus developers who know the difference? :)

When developing the PromQL-compatible engine for VictoriaMetrics, I tried to avoid PromQL's rough edges, based on my experience, in order to make a more user-friendly query language with expected behavior. For instance, VictoriaMetrics allows writing rate(q) instead of the frequently used rate(q[$__interval]) in Grafana dashboards. The full list of additional features is available on this page.

Note that VictoriaMetrics is fully backwards-compatible with PromQL, i.e. all valid Prometheus queries should return the same results in VictoriaMetrics. There are a few exceptions, such as more consistent handling of the increase() function: VictoriaMetrics always returns the expected integer value for increase() over a counter without floating-point increases. See also the VictoriaMetrics: PromQL compliance article for more details on intentional discrepancies between PromQL and MetricsQL.
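The increase() discrepancy comes from extrapolation. A simplified Go sketch of the idea (this is not the exact Prometheus formula, which additionally limits extrapolation near window boundaries, and the numbers are illustrative):

```go
package main

import "fmt"

// extrapolatedIncrease scales the raw counter delta observed over
// coveredSeconds up to the full windowSeconds, Prometheus-style.
// (A simplified sketch of the extrapolation idea.)
func extrapolatedIncrease(rawDelta, coveredSeconds, windowSeconds float64) float64 {
	return rawDelta * windowSeconds / coveredSeconds
}

func main() {
	// A counter grew by exactly 9 between the first and last raw
	// samples in the range window, but those samples cover only 40s
	// of a 60s window.
	rawDelta := 9.0

	// Extrapolation yields a non-integer increase for an
	// integer-valued counter.
	fmt.Println(extrapolatedIncrease(rawDelta, 40, 60)) // 13.5

	// VictoriaMetrics aims to return the integer delta instead.
	fmt.Println(rawDelta) // 9
}
```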

Staleness handling

Prometheus stops returning a time series in the following cases:

  • After a special NaN value is found. This value is inserted by Prometheus when the metric disappears from the scrape target or the scrape target becomes unavailable.
  • After 5 minutes of silence since the previous value.

This logic doesn't work for time series with scrape intervals exceeding 5 minutes, since Prometheus mistakenly thinks the time series contains a gap 5 minutes after each scrape.

VictoriaMetrics drops NaNs, but it still detects stale series with much simpler logic and without corner cases: it stops returning data points if they are missing for the last 2 scrape intervals. For instance, if Prometheus scrapes a time series every 10 seconds, then VictoriaMetrics considers the series stale after 20 seconds of missing data points.
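This rule can be expressed in a few lines of Go. The sketch below is a simplified illustration of the rule described above; the actual implementation estimates the scrape interval from the stored samples:

```go
package main

import "fmt"

// isStale reports whether a series should be considered stale at time
// now: no data points during the last 2 scrape intervals.
// All timestamps are in milliseconds. (A sketch, not the actual code.)
func isStale(lastSampleTs, now, scrapeInterval int64) bool {
	return now-lastSampleTs > 2*scrapeInterval
}

func main() {
	const scrapeInterval = 10_000 // 10 seconds

	lastSample := int64(100_000)

	fmt.Println(isStale(lastSample, 115_000, scrapeInterval)) // false: 15s gap
	fmt.Println(isStale(lastSample, 125_000, scrapeInterval)) // true: 25s gap
}
```

Because the rule only looks at sample timestamps, it works identically regardless of how the data was ingested.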

Contrary to Prometheus, staleness handling in VictoriaMetrics correctly handles time series with scrape intervals higher than 5 minutes. Additionally, it works the same way not only for the Prometheus remote_write API, but also for any other supported ingestion method: Graphite plaintext protocol, InfluxDB line protocol, OpenTSDB telnet and HTTP protocols. These ingestion methods know nothing about Prometheus staleness detection, so, obviously, it doesn't work for them.

Update: VictoriaMetrics gained Prometheus-compatible staleness handling in latest releases — see the changelog for details.

Slow time series lookups

The time series lookup performance for Prometheus and VictoriaMetrics can be compared with the following benchmarks:

  • Prometheus:
GOMAXPROCS=1 go test ./tsdb/ -run=111 -bench=BenchmarkHeadPostingForMatchers
  • VictoriaMetrics:
GOMAXPROCS=1 go test ./lib/storage/ -run=111 -bench=BenchmarkHeadPostingForMatchers

Brian updated the performance numbers in the original article after I pointed him to the real numbers via Twitter. Now the article claims VictoriaMetrics is 2x-5x slower than Prometheus on the modified end-to-end tests. Unfortunately, Brian hasn't provided the source code for the updated tests yet. The source code is required to reproduce the tests and to determine why VictoriaMetrics is slower than Prometheus in them.

Usually VictoriaMetrics is much faster than competitors such as InfluxDB and TimescaleDB — see this article for details. VictoriaMetrics users report it is faster on real production data than Prometheus, Cortex and Thanos. They also report that VictoriaMetrics consumes less RAM and disk space compared to Prometheus, Cortex and Thanos. As far as I know, VictoriaMetrics users never go back to Thanos and Cortex. Additionally, they frequently requested a stripped-down Prometheus without local storage, since Prometheus instances usually eat too much RAM in their highly loaded setups. So we created vmagent.

Conclusion

Contrary to the original article, this post can be commented on below, so feel free to leave comments and questions.

P.S. Join our Slack channel and keep up to date with all the news regarding VictoriaMetrics.

P.P.S. See also this VictoriaMetrics vs Prometheus benchmark, which is based on real production data.

Founder and core developer at VictoriaMetrics