Measuring vertical scalability for time series databases in Google Cloud

Time series databases usually offer free single-node versions that cannot scale out to multiple nodes, so their performance is limited by a single computer. Modern machines, however, may have hundreds of CPU cores and terabytes of RAM. Let’s test the vertical scalability of InfluxDB, TimescaleDB and VictoriaMetrics on standard machine types in Google Cloud.

Setup

The following TSDB docker images were used in the test:

InfluxDB and VictoriaMetrics were run with default configs, while the TimescaleDB config was tweaked using this script for each n1-standard machine type in Google Cloud. The script tunes PostgreSQL for the RAM available on each machine type.
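For reference, here is a minimal sketch of this kind of tuning, assuming the commonly used timescaledb-tune utility is available in the TimescaleDB container (the config path below is the image default and may differ in your setup):

# Adjust shared_buffers, work_mem and other PostgreSQL settings for the VM's RAM and vCPU count.
timescaledb-tune --conf-path=/var/lib/postgresql/data/postgresql.conf \
--memory="$(free -g | awk '/Mem:/ {print $2}')GB" \
--cpus="$(nproc)" --yes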

The Time Series Benchmark Suite (TSBS) was used for the test. The following command was used for generating the test data:

./tsbs_generate_data -use-case="cpu-only" -seed=123 -scale-var=4000 \
-timestamp-start="2019-04-01T00:00:00Z" \
-timestamp-end="2019-04-04T00:00:00Z" \
-log-interval=10s \
-format=${TSDB_TYPE} | gzip > ${TSDB_TYPE}-data-40k.gz

The data was generated for the following $TSDB_TYPE values:

  • influx

VictoriaMetrics was fed the $TSDB_TYPE=influx data, since it supports the Influx line protocol.

The generated data consists of 40K unique time series, each containing 25920 data points. The total number of data points is 25920*40K = 1,036,800,000, which may be rounded to one billion.
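A quick sanity check of the arithmetic:

echo $(( 40000 * 25920 ))   # prints 1036800000, i.e. roughly one billion data points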

The following n1-standard machine types were used for both client and server machines:

  • n1-standard-1: 1vCPU, 3.75GB RAM

We’ll see the difference between a vCPU and a real CPU core later.

Time series databases were configured to store data on a dedicated 2TB zonal standard persistent disk (a provisioning sketch follows the list). The disk has the following characteristics:

  • IOPS: 1500 read, 3000 write
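A disk like this can be provisioned and attached with gcloud; a rough sketch, where the disk name, instance name and zone are placeholders:

# Create a 2TB standard (HDD-backed) zonal persistent disk and attach it to the server VM.
gcloud compute disks create tsdb-data --size=2TB --type=pd-standard --zone=us-central1-a
gcloud compute instances attach-disk tsdb-server --disk=tsdb-data --zone=us-central1-a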

Data ingestion

The data ingestion tests were run with the number of client workers set to twice the number of vCPUs (--workers=2*vCPUs). This guaranteed full CPU utilization on the TSDB side.
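As a sketch, an ingestion run for InfluxDB could look like the following, assuming the TSBS loader binaries are built on the client machine; the server address is a placeholder and the batch size is an arbitrary example:

# Stream the pre-generated data into InfluxDB using 2 workers per client vCPU.
cat influx-data-40k.gz | gunzip | ./tsbs_load_influx \
-urls=http://<server-ip>:8086 \
-workers=$(( 2 * $(nproc) )) \
-batch-size=10000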

Ingestion rate, thousands of data points / sec (higher is better)

The following notable things are visible on the chart:

  • Sub-optimal scalability between the n1-standard-1 and n1-standard-2 machine types. The ingestion performance grew by only 1.6x-1.8x when switching from 1 vCPU to 2 vCPUs for all the competitors. This is related to hyper-threaded vCPUs, which aren’t real CPU cores (see the check below the quote). Read more about the pros and cons of hyper-threading on Wikipedia. The following quote is from the official Google Cloud docs:
    vCPU is implemented as a single hardware hyper-thread
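You can verify this directly on a VM:

# On an n1-standard-2 this typically reports 2 vCPUs backed by a single physical core with 2 threads.
lscpu | grep -E 'CPU\(s\)|Thread\(s\) per core|Core\(s\) per socket'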

Why didn’t TimescaleDB scale on machines with more than 16 vCPUs? The following chart answers the question:

Disk write bandwidth usage, MB/s (lower is better)

As you can see, TimescaleDB writes much more data to the persistent disk than its competitors and hits the write bandwidth limit of 240MB/s on the n1-standard-8 machine. I have no idea why TimescaleDB exceeds the limit and reaches 260MB/s instead of 240MB/s.
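Disk write bandwidth on the server side can be observed during ingestion with iostat from the sysstat package; the device name sdb below assumes the persistent disk is attached as the second disk:

# -d: device report, -m: values in MB/s, -x: extended stats, refreshed every second.
iostat -dmx 1 sdb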

The chart contains other notable info:

  • Both VictoriaMetrics and InfluxDB use much less disk write bandwidth than TimescaleDB. This means they can run on cheaper disks with lower I/O bandwidth.

The final on-disk data sizes after ingesting 1B test data points show that TimescaleDB’s disk usage is far from optimal:

Let’s calculate how many data points each TSDB may store on a 2TB disk using data from the graph above:

  • VictoriaMetrics: 2TB / 0.377 bytes per data point = 5.3 trillion data points

TimescaleDB requires 77 (seventy-seven) times more disk space than VictoriaMetrics for storing the same number of data points. In other words, it needs disk space costing $80*77 = $6160/month to store the same amount of data that VictoriaMetrics fits on a disk costing $80/month.
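A quick back-of-the-envelope check of these numbers, using the per-point sizes implied above:

awk 'BEGIN {
  vm = 2e12 / 0.377   # ~5.3 trillion data points fit on a 2TB disk for VictoriaMetrics
  ts = vm / 77        # TimescaleDB needs 77x more space per point, i.e. ~69 billion points
  printf "VictoriaMetrics: %.1f trillion, TimescaleDB: %.0f billion\n", vm/1e12, ts/1e9
}'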

While 69 billion data points may look like a lot at first sight, it isn’t that much. Let’s calculate how many data points are collected from a fleet of 100 nodes with node_exporter installed on each node. Node_exporter exports 500 time series on average, so 100 nodes would export 100*500 = 50K unique time series. Suppose these time series are scraped at a 10-second interval. This results in an ingestion stream of 5K data points per second. A month’s worth of data would contain 5K*3600*24*30 = 13 billion data points. So TimescaleDB would fill the entire 2TB disk in 69/13 ≈ 5 months.
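The same estimate as shell one-liners:

echo $(( 100 * 500 / 10 ))          # 5000 data points ingested per second
echo $(( 5000 * 3600 * 24 * 30 ))   # 12960000000, i.e. ~13 billion data points per month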

Both VictoriaMetrics and InfluxDB store data in LSM trees, while TimescaleDB relies on PostgreSQL’s storage engine. The charts above clearly show that LSM trees are better suited for storing time series data than the general-purpose storage provided by PostgreSQL.

Querying

TSBS queries may be split into two groups:

  • “Instant” queries, which complete in less than 100ms. Such queries have little opportunity to scale on machines with more CPUs and RAM, so their results look similar across machine types. The only notable thing is extremely slow query performance on TimescaleDB when the queried data is missing from the OS page cache. “Extremely slow” means 100x-1000x slower compared to the case when the queried data is in the OS page cache. This means that TimescaleDB scatters the queried data across the entire disk, so many I/O operations must be performed to gather it from the disk.

The following chart shows rpm (requests per minute) for double-groupby-1 queries serially performed by a single client worker:

double-groupby-1, single client, rpm (higher is better)
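For reference, a sketch of how this serial run could be reproduced with TSBS; the flag names follow the data generation command above and may differ between TSBS versions, and the number of queries, file name and server address are placeholders:

./tsbs_generate_queries -use-case="cpu-only" -seed=123 -scale-var=4000 \
-timestamp-start="2019-04-01T00:00:00Z" \
-timestamp-end="2019-04-04T00:00:00Z" \
-queries=1000 -query-type="double-groupby-1" \
-format=influx | gzip > influx-queries-double-groupby-1.gz

# Run the queries serially with a single client worker against InfluxDB.
cat influx-queries-double-groupby-1.gz | gunzip | ./tsbs_run_queries_influx \
-urls=http://<server-ip>:8086 -workers=1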

Interesting things from the chart:

  • InfluxDB has poor vertical scalability for “heavy” queries, while TimescaleDB and VictoriaMetrics have better scalability for such queries.

The next chart shows the maximum possible query rate (rpm) for the double-groupby-1 query. The corresponding test runs N concurrent client workers, where N is the number of vCPUs on the server. This results in full CPU utilization on the server.

double-groupby-1, clients=vCPUs, rpm (higher is better)
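The concurrent run differs from the serial sketch above only in the number of client workers, e.g.:

# Replace the placeholder with the number of vCPUs on the server machine.
cat influx-queries-double-groupby-1.gz | gunzip | ./tsbs_run_queries_influx \
-urls=http://<server-ip>:8086 -workers=<server vCPUs>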

The chart shows good vertical scalability for all the contenders up to n1-standard-32. TimescaleDB keeps scaling on n1-standard-64, while InfluxDB and VictoriaMetrics show no gain in maximum query throughput when switching from n1-standard-32 to n1-standard-64. It is likely that TimescaleDB has better optimizations for multiple NUMA nodes. But even without NUMA-aware optimizations, VictoriaMetrics serves 6.5x more queries than TimescaleDB on the n1-standard-64 machine.

Conclusions

  • Modern time series databases have decent vertical scalability for both data ingestion and querying.

Vertical scalability is limited by per-machine resources: CPU, RAM, storage and network. Users have to switch to cluster solutions when a single-node solution reaches its scalability limits. TimescaleDB, InfluxDB and VictoriaMetrics provide paid cluster solutions for users that outgrow a single node. Guess which cluster solution is the most cost-effective from an infrastructure and licensing point of view :)

Update: VictoriaMetrics is open source now!

Founder and core developer at VictoriaMetrics