Measuring vertical scalability for time series databases in Google Cloud

Time series databases usually have free single-node versions that cannot scale to multiple nodes, so their performance is limited by the resources of a single machine. Modern machines may have hundreds of CPU cores and terabytes of RAM. Let’s test vertical scalability for InfluxDB, TimescaleDB and VictoriaMetrics on standard machine types in Google Cloud.

Setup

The following TSDB Docker images were used in the test:

InfluxDB and VictoriaMetrics were run with default configs, while the TimescaleDB config was tweaked with this script for each n1-standard machine type in Google Cloud. The script tunes PostgreSQL for the amount of RAM available on each machine type.
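The tuning script itself isn’t reproduced here. As an illustration only, a similar effect can be achieved with the timescaledb-tune tool; the memory and CPU values below correspond to an n1-standard-8 machine and are not the exact settings used in the test:

# Illustrative tuning for n1-standard-8 (8 vCPUs, 30GB RAM); not the exact script from the test
timescaledb-tune --conf-path=/var/lib/postgresql/data/postgresql.conf \
--memory="30GB" --cpus=8 --yes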

Time Series Benchmark Suite (TSBS) was used for the test. The following command was used to generate the test data:

./tsbs_generate_data -use-case="cpu-only" -seed=123 -scale-var=4000 \
-timestamp-start="2019-04-01T00:00:00Z" \
-timestamp-end="2019-04-04T00:00:00Z" \
-log-interval=10s \
-format=${TSDB_TYPE} | gzip > ${TSDB_TYPE}-data-40k.gz

The data was generated for two $TSDB_TYPE values: influx and timescaledb.

VictoriaMetrics was fed the $TSDB_TYPE=influx data, since it supports the Influx line protocol.

The generated data consists of 40K unique time series, and each time series contains 25920 data points. The total number of data points is 25920*40K=1036800000, i.e. roughly one billion.

The following n1-standard machine types were used for both client and server machines:

We’ll see the difference between a vCPU and a CPU later.

The time series databases were configured to store data on a dedicated 2TB zonal standard persistent disk. The disk has the following characteristics:

Data ingestion

The data ingestion tests were run with --workers=2*vCPUs on the client side, which guaranteed full CPU utilization on the TSDB side.
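The exact load commands aren’t shown here. A sketch of how the data could be loaded with the standard TSBS loaders, assuming an n1-standard-8 client (2*8=16 workers); host names, ports and worker counts are illustrative:

# Illustrative commands; host names, ports and worker counts are assumptions.
cat influx-data-40k.gz | gunzip | ./tsbs_load_influx --urls=http://influxdb-host:8086 --workers=16
cat timescaledb-data-40k.gz | gunzip | ./tsbs_load_timescaledb --host=timescaledb-host --workers=16
# VictoriaMetrics accepts the same Influx line protocol on its own write endpoint (port 8428 by default).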

Ingestion rate, thousands of data points / sec (higher is better)

The following notable things are visible on the chart:

Why didn’t TimescaleDB scale on machines with more than 16 vCPUs? The following chart answers that question:

Disk write bandwidth usage, MB/s (lower is better)

As you can see, TimescaleDB writes much more data to the persistent disk than its competitors and hits the disk’s write bandwidth limit of 240MB/s already on the n1-standard-8 machine. I have no idea why TimescaleDB exceeds this limit and reaches 260MB/s instead of 240MB/s.
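The post doesn’t say how the write bandwidth was measured. One simple way to watch it on the server is iostat from the sysstat package; the device name below is an assumption:

# Illustrative; the persistent disk device name is an assumption.
iostat -dmx 1 /dev/sdb   # the wMB/s column shows write bandwidth in MB/s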

The chart contains other notable info:

The final on-disk data sizes after ingesting 1B test data points show that TimescaleDB’s disk usage is far from optimal:

Let’s calculate how many data points each TSDB may store on a 2TB disk using data from the graph above:

TimescaleDB requires 77 (seventy-seven) times more disk space than VictoriaMetrics for storing the same number of data points. In other words, it needs disk space costing $80*77=$6160/month to store the same amount of data that VictoriaMetrics stores on a disk costing $80/month.

While 69 billion data points may look really big at first sight, it isn’t that much. Let’s calculate how many data points are collected from a fleet of 100 nodes with node_exporter installed on each node. Node_exporter exports 500 time series on average, so 100 nodes would export 100*500=50K unique time series. Suppose these time series are scraped at a 10-second interval. This results in an ingestion stream of 5K data points per second. A month’s worth of data would contain 5K*3600*24*30≈13 billion data points. So TimescaleDB would fill the entire 2TB disk in 69/13≈5 months.
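The arithmetic above is easy to double-check from the shell:

# Quick check of the numbers used above
echo $((5000*3600*24*30))   # 12960000000 data points per month, ~13 billion
echo $((69/13))             # ~5 months until a 2TB disk fills up with TimescaleDB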

Both VictoriaMetrics and InfluxDB store data in LSM trees, while TimescaleDB relies on the data storage of PostgreSQL. The charts above clearly show that LSM trees are better suited for storing time series data than the general-purpose storage provided by PostgreSQL.

Querying

TSBS queries may be split into two groups:

The following chart shows rpm (requests per minute) for double-groupby-1 queries performed serially by a single client worker:

double-groupby-1, single client, rpm (higher is better)

Interesting things from the chart:

The next chart shows the maximum possible query rate for double-groupby-1 queries. The corresponding test runs N concurrent client workers, where N is the number of vCPUs on the server. This results in full CPU utilization on the server.

double-groupby-1, clients=vCPUs, rpm (higher is better)

The chart shows good vertical scalability for all the contenders up to n1-standard-32. TimescaleDB keeps scaling on n1-standard-64, while InfluxDB and VictoriaMetrics show no gain in maximum query bandwidth when switching from n1-standard-32 to n1-standard-64. It is likely that TimescaleDB has better optimizations for NUMA nodes. Even so, VictoriaMetrics without NUMA-aware optimizations serves 6.5x more queries than TimescaleDB on the n1-standard-64 machine.
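The exact query benchmark commands aren’t shown here either. A sketch of how the double-groupby-1 queries could be generated and run with TSBS against InfluxDB, assuming the standard tsbs_generate_queries and tsbs_run_queries_influx binaries; the query count, host and worker counts are illustrative:

# Illustrative commands; query count, host and worker counts are assumptions.
./tsbs_generate_queries -use-case="cpu-only" -seed=123 -scale-var=4000 \
-timestamp-start="2019-04-01T00:00:00Z" \
-timestamp-end="2019-04-04T00:00:00Z" \
-queries=1000 -query-type="double-groupby-1" \
-format=influx > influx-queries-double-groupby-1

# single client worker (serial run), then clients=vCPUs for an n1-standard-32 server:
cat influx-queries-double-groupby-1 | ./tsbs_run_queries_influx --urls=http://influxdb-host:8086 --workers=1
cat influx-queries-double-groupby-1 | ./tsbs_run_queries_influx --urls=http://influxdb-host:8086 --workers=32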

Conclusions

Vertical scalability is limited by per-machine resources such as CPU, RAM, storage and network. Users have to switch to cluster solutions when a single-node solution reaches its scalability limits. TimescaleDB, InfluxDB and VictoriaMetrics provide paid cluster solutions for users who outgrow a single node. Guess which cluster solution is the most cost-effective from the infrastructure and licensing point of view :)

Update: VictoriaMetrics is open source now!

Founder and core developer at VictoriaMetrics