When size matters — benchmarking VictoriaMetrics vs Timescale and InfluxDB

Aliaksandr Valialkin
4 min read · Nov 5, 2018


Recently Timescale published the Time Series Benchmark Suite (TSBS), a framework for benchmarking TSDBs. See TSBS on GitHub.

TSBS can:

  • generate the configured number of production-like timeseries;
  • measure insert performance for the generated timeseries;
  • measure select performance for various production-like queries.

The original TSBS supports the following systems:

  • Timescale
  • InfluxDB
  • MongoDB
  • Cassandra

Adding VictoriaMetrics to TSBS

We liked TSBS, so we quickly hacked support for the Prometheus remote write API into TSBS and started using it. Initial results weren’t exciting — VictoriaMetrics was slow on some queries and required a lot of memory during benchmark execution.

The root cause was the remote read API. TSBS had been configured to query Prometheus, which, in turn, queried VictoriaMetrics via the remote read API. This didn’t scale well, since VictoriaMetrics had to prepare and return huge amounts of data to Prometheus on heavy queries like double-groupby.
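In Prometheus configuration terms, the initial setup looked roughly like this (the hostname is a placeholder; the `remote_write` and `remote_read` blocks are standard Prometheus config keys, and the paths shown are illustrative):

```
# prometheus.yml (sketch): TSBS talks to Prometheus, which forwards
# writes to VictoriaMetrics and pulls data back over remote read.
remote_write:
  - url: http://victoriametrics:8428/api/v1/write

remote_read:
  - url: http://victoriametrics:8428/api/v1/read
```

On a heavy query, every raw sample matching the selector has to travel back through that remote read pipe before Prometheus can aggregate it, which explains the memory pressure we observed.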

The solution was to create a PromQL engine directly in VictoriaMetrics, so all the heavy lifting on complex queries could be implemented and optimized inside the engine. The end result is the Extended PromQL engine with full PromQL support plus additional useful features like WITH expressions.
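As an illustration of WITH expressions, they let you factor out repeated label filters instead of copy-pasting them into every selector. A sketch (the instance address is hypothetical; the metrics are standard node_exporter names):

```
WITH (
    commonFilters = {job="node_exporter", instance="host-1:9100"}
)
node_filesystem_size_bytes{commonFilters} - node_filesystem_free_bytes{commonFilters}
```

Without WITH, the `{job=..., instance=...}` filter would have to be repeated in both selectors.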

Benchmark preparation

Which competitors to put against VictoriaMetrics?

That left two competitors — Timescale and InfluxDB.

The following TSBS queries couldn’t be translated to PromQL, so they have been dropped from the benchmark:

  • lastpoint — PromQL cannot return the last point for each time series;
  • groupby-orderby-limit — PromQL doesn’t support order by and limit.

The high-cpu queries have been modified to return the max(usage_user) for each host, since PromQL doesn't support SELECT *. The cpu-max-all queries have been dropped, since they weren't present in benchmark results from Timescale.
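In PromQL terms, the modified high-cpu query looks roughly like the following (the metric and label names follow the TSBS cpu schema, but the exact form here is illustrative):

```
max(usage_user) by (hostname)
```

This returns a single max value per host rather than all cpu columns, which is the closest PromQL equivalent of the original SELECT *-based query.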

The benchmark was run in Google Compute Engine on two n1-standard-8 instances, each with 8 virtual CPUs, 30GB of RAM and a 200GB HDD: one instance for the client (TSBS) and one for the server. Timescale version: 0.12.1; InfluxDB version: 1.6.4.

Benchmark results

Insert performance for a billion datapoints belonging to 40K distinct timeseries:

  • VictoriaMetrics — 1.7M datapoints per second, RAM usage — 0.8GB, data size on HDD — 387MB.
  • InfluxDB — 1.1M datapoints per second, RAM usage — 1.7GB, data size on HDD — 573MB.
  • Timescale — 890K datapoints per second, RAM usage — 0.4GB, data size on HDD — 29GB.

Nothing too interesting here, except that Timescale data occupies a whopping 29GB on HDD. That’s 50x more than InfluxDB and 75x more than VictoriaMetrics. Later we’ll see when this size matters.
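The 50x and 75x figures follow directly from the on-disk sizes above (using 1 GB = 1000 MB):

```python
# On-disk sizes from the insert benchmark, in MB (1 GB = 1000 MB).
timescale_mb = 29_000
influxdb_mb = 573
victoria_mb = 387

print(f"Timescale vs InfluxDB:        {timescale_mb / influxdb_mb:.1f}x")  # → 50.6x
print(f"Timescale vs VictoriaMetrics: {timescale_mb / victoria_mb:.1f}x")  # → 74.9x
```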

Select performance:

  • VictoriaMetrics beats InfluxDB and Timescale on all the queries by a margin of up to 20x. It especially excels at heavy queries that scan many millions of datapoints across thousands of distinct timeseries.
  • InfluxDB is in second place. It beats Timescale on light queries but loses to it by up to 3.5x on heavy queries.
  • Timescale is in third place. Moreover, it was multiple orders of magnitude slower on all the queries when the required data wasn’t in the page cache, while VictoriaMetrics and InfluxDB were only marginally slower in those cases.

See full benchmark results.


Why did Timescale perform so poorly on select queries? The answer lies in its huge data size (29GB) and an on-disk data layout poorly suited to storage with low IOPS. Google Cloud HDDs are limited in IOPS per GB and throughput per GB. A 200GB disk is limited to 150 read operations per second and 24MB/s of read/write throughput. Simple arithmetic shows that loading 29GB into the page cache at 24MB/s takes 20 minutes. VictoriaMetrics would load the same amount of data (a billion datapoints) from the same HDD in 16 seconds.
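The back-of-the-envelope numbers above can be checked directly from the on-disk sizes and the 24MB/s throughput cap (using 1 GB = 1000 MB):

```python
# Worst-case time to load each dataset into the page cache
# at the 24 MB/s HDD throughput limit.
throughput_mb_s = 24

timescale_mb = 29_000  # Timescale on-disk size
victoria_mb = 387      # VictoriaMetrics on-disk size

print(f"Timescale:       {timescale_mb / throughput_mb_s / 60:.0f} min")  # → 20 min
print(f"VictoriaMetrics: {victoria_mb / throughput_mb_s:.0f} s")          # → 16 s
```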

The read throughput limit was hit by Timescale only a few times during the select queries. The rest of the time it was capped at 150 read operations per second. This points to a sub-optimal data layout for low-IOPS storage such as HDD.

Are there any workarounds for Timescale? The easiest is to use more expensive storage with high bandwidth and high IOPS, such as a high-end SSD. Post other workarounds in the comments.


Sometimes size matters :) It may be more expensive than you expect.

TSBS is a great benchmarking tool. It helped us minimize CPU and RAM usage for VictoriaMetrics on production workloads. We are planning to run benchmarks and publish results for higher cardinality (millions of unique timeseries) and a higher number of datapoints (trillions). Stay tuned.

In the meantime, read how we created VictoriaMetrics, the best remote storage for Prometheus.

Reddit thread

HackerNews thread

Update: Docker images with single-server VictoriaMetrics are available here. The corresponding statically linked binaries are available here.

Update#2: Read the next article — High-cardinality TSDB benchmarks: VictoriaMetrics vs TimescaleDB vs InfluxDB.

Update #3: Read yet another article — Measuring vertical scalability for time series databases in Google Cloud.

Update #4: VictoriaMetrics is open source now!