Measuring vertical scalability for time series databases in Google Cloud

Setup

./tsbs_generate_data -use-case="cpu-only" -seed=123 -scale-var=4000 \
-timestamp-start="2019-04-01T00:00:00Z" \
-timestamp-end="2019-04-04T00:00:00Z" \
-log-interval=10s \
-format=${TSDB_TYPE} | gzip > ${TSDB_TYPE}-data-40k.gz
where ${TSDB_TYPE} is one of:
  • influx
  • timescaledb
Machine types:
  • n1-standard-1: 1vCPU, 3.75GB RAM
  • n1-standard-2: 2vCPU, 7.5GB RAM
  • n1-standard-4: 4vCPU, 15GB RAM
  • n1-standard-8: 8vCPU, 30GB RAM
  • n1-standard-16: 16vCPU, 60GB RAM
  • n1-standard-32: 32vCPU, 120GB RAM
  • n1-standard-64: 64vCPU, 240GB RAM
Disk:
  • IOPS: 1500 read, 3000 write
  • Throughput: 240MB/s read, 240MB/s write
  • Price: $80 / month
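
The size of the generated data set follows from the flags of the tsbs_generate_data command in this section. The sketch below is a rough estimate only. It assumes that the cpu-only use case emits 10 metrics per host (which would match the "40k" in the output file name) and that the end timestamp is exclusive; neither assumption is stated in the text.

```python
from datetime import datetime, timezone

# Values taken from the tsbs_generate_data flags.
start = datetime(2019, 4, 1, tzinfo=timezone.utc)   # -timestamp-start
end = datetime(2019, 4, 4, tzinfo=timezone.utc)     # -timestamp-end
log_interval_s = 10                                  # -log-interval
hosts = 4000                                         # -scale-var

# Assumption (not stated in the article): 10 metrics per host for cpu-only.
metrics_per_host = 10

series = hosts * metrics_per_host
# Assumption: the time range is half-open, so the end timestamp is excluded.
points_per_series = int((end - start).total_seconds()) // log_interval_s
total_points = series * points_per_series

print(f"time series:       {series:,}")
print(f"points per series: {points_per_series:,}")
print(f"total data points: {total_points:,}")
```

Under these assumptions the data set holds roughly one billion data points.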

Data ingestion

Ingestion rate, thousands of data points / sec (higher is better)
  • Sub-optimal scalability between the n1-standard-1 and n1-standard-2 machine types. Ingestion performance grew by only 1.6x-1.8x when switching from 1vCPU to 2vCPU for all the competitors. This is related to hyper-threaded vCPUs, which aren’t real CPU cores. See the Wikipedia article on hyper-threading for its pros and cons. The following quote is from the official Google Cloud docs:
    vCPU is implemented as a single hardware hyper-thread
  • Almost linear vertical scalability for VictoriaMetrics: ingestion performance scales from 800K data points / sec on a 2vCPU machine to 19M data points / sec on a 64vCPU machine.
  • Sub-linear vertical scalability for InfluxDB. It scales from 320K data points / sec on a 2vCPU machine to 4.4M data points / sec on a 64vCPU machine.
  • TimescaleDB reaches its scalability limit at 2.3M data points / sec on the n1-standard-16 machine and doesn’t scale further.
  • VictoriaMetrics outperforms both InfluxDB and TimescaleDB on every machine type. The performance gap reaches 19100/4410=4.3 times for InfluxDB and 19100/2220=8.6 times for TimescaleDB on the n1-standard-64 machine type. This means a single-node VictoriaMetrics may substitute for a moderately sized cluster built with InfluxDB or TimescaleDB. Put another way, single-node VictoriaMetrics saves infrastructure costs in addition to licensing costs for clustered versions.
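
One way to put numbers on "almost linear" and "sub-linear" is to compare the observed speedup with the ideal speedup implied by the vCPU count. The sketch below uses only the ingestion rates quoted in the list above; the helper name scaling_efficiency is made up for illustration.

```python
def scaling_efficiency(rate_small, rate_large, vcpus_small, vcpus_large):
    """Return (observed speedup, observed speedup / ideal linear speedup)."""
    speedup = rate_large / rate_small
    ideal = vcpus_large / vcpus_small
    return speedup, speedup / ideal

# Ingestion rates in data points per second, 2vCPU vs 64vCPU.
vm_speedup, vm_eff = scaling_efficiency(800e3, 19e6, 2, 64)
influx_speedup, influx_eff = scaling_efficiency(320e3, 4.4e6, 2, 64)

print(f"VictoriaMetrics: {vm_speedup:.2f}x speedup, {vm_eff:.0%} of linear")
print(f"InfluxDB:        {influx_speedup:.2f}x speedup, {influx_eff:.0%} of linear")

# Performance gap on n1-standard-64, in thousands of data points per second.
gap_influx = 19100 / 4410
gap_timescale = 19100 / 2220
print(f"gap vs InfluxDB: {gap_influx:.1f}x, vs TimescaleDB: {gap_timescale:.1f}x")
```

With these inputs VictoriaMetrics keeps about three quarters of the ideal 32x speedup, while InfluxDB keeps less than half.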
Disk write bandwidth usage, MB/s (lower is better)
  • Both VictoriaMetrics and InfluxDB use much less disk write bandwidth than TimescaleDB. This means they may use cheaper disks with lower IO bandwidth than TimescaleDB requires.
  • VictoriaMetrics has the best optimization for disk IO bandwidth usage. It uses only half of the available disk IO bandwidth on n1-standard-64 while accepting 19M data points per second.
Number of data points that fit on a 2TB disk, given the bytes used per data point:
  • VictoriaMetrics: 2TB/0.377(Bytes/data point) = 5.3 trillion
  • InfluxDB: 2TB/0.566(Bytes/data point) = 3.5 trillion
  • TimescaleDB: 2TB/29(Bytes/data point) = 69 billion
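
The capacity figures above are a plain division of disk size by bytes per data point. A minimal sketch of that arithmetic, assuming decimal terabytes (2TB = 2 * 10^12 bytes), which is what the quoted results imply:

```python
DISK_BYTES = 2 * 10**12  # 2TB, decimal units (assumption)

# Bytes per stored data point, as quoted in the list above.
bytes_per_point = {
    "VictoriaMetrics": 0.377,
    "InfluxDB": 0.566,
    "TimescaleDB": 29.0,
}

# Data points that fit on the disk for each database.
capacity = {name: DISK_BYTES / size for name, size in bytes_per_point.items()}

for name, points in capacity.items():
    print(f"{name}: {points / 1e9:,.0f} billion data points")

# How many times more data points VictoriaMetrics fits than TimescaleDB.
vm_vs_timescale = capacity["VictoriaMetrics"] / capacity["TimescaleDB"]
print(f"VictoriaMetrics vs TimescaleDB: {vm_vs_timescale:.0f}x")
```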

Querying

  • “Instant” queries, which are performed in less than 100ms. Such queries have little opportunity to scale on machines with more CPUs and RAM, so their results look similar. The only notable thing is extremely slow query performance on TimescaleDB if the queried data is missing from the OS page cache. “Extremely slow” means 100x-1000x slower than when the queried data is in the OS page cache. This suggests that TimescaleDB scatters the queried data across the entire disk, so many I/O operations must be performed to gather all this data from the disk.
  • “Heavy” queries, which usually require more than a second for execution. Such queries have good opportunities to scale with the number of CPUs, so let’s stick to a query from this group: double-groupby-1. This query scans 12 hours of data for 4K time series. Each hour contains 360 data points, so the query must scan at least 360*12*4K=17.3M data points.
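
The lower bound on scanned data points for double-groupby-1 can be re-derived from the 10s log interval used when generating the data. A small sketch, assuming that "4K" means exactly 4000 time series:

```python
LOG_INTERVAL_S = 10   # -log-interval=10s from the data generation command
HOURS = 12            # time range covered by double-groupby-1
SERIES = 4000         # "4K" time series (assumed to mean 4000)

points_per_hour = 3600 // LOG_INTERVAL_S
scanned_points = points_per_hour * HOURS * SERIES

print(f"points per hour per series: {points_per_hour}")
print(f"minimum points scanned:     {scanned_points:,}")
```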
double-groupby-1, single client, rpm (higher is better)
  • InfluxDB has poor vertical scalability for “heavy” queries, while TimescaleDB and VictoriaMetrics have better scalability for such queries.
  • TimescaleDB performs poorly on n1-standard-1, n1-standard-2 and n1-standard-4, because the stored data doesn’t fit in the OS page cache: the n1-standard-4 machine has only 15GB of RAM, while TimescaleDB’s data occupies 29GB on disk. This translates to heavy disk IO.
  • VictoriaMetrics shows the best vertical scalability with the number of CPU cores — from 62 rpm to 283 rpm, 4.5x scalability.
  • VictoriaMetrics outperforms both contenders on “heavy” queries:
    * InfluxDB by up to 23x
    * TimescaleDB by up to 9x
    Put another way, VictoriaMetrics performs the query in 0.2s, while InfluxDB performs the same query in 4.9s on the n1-standard-64 machine.
  • Performance for “heavy” queries stops scaling starting from n1-standard-32. This may be related to NUMA nodes, where different CPU cores have different latencies when accessing the same RAM regions.
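
With a single client, requests per minute (rpm) and per-query latency are reciprocals of each other, so the rpm and seconds figures in the list above can be cross-checked. The sketch below assumes that the 283 rpm result for VictoriaMetrics and the 4.9s result for InfluxDB come from the same machine type, and it reuses the 17.28M scanned data points estimated for double-groupby-1.

```python
def seconds_per_query(rpm):
    """Average latency for a single client issuing queries back to back."""
    return 60.0 / rpm

def queries_per_minute(latency_s):
    """Inverse of seconds_per_query."""
    return 60.0 / latency_s

vm_latency_s = seconds_per_query(283)    # VictoriaMetrics, best result quoted
influx_rpm = queries_per_minute(4.9)     # InfluxDB, from the quoted 4.9s
gap = 283 / influx_rpm                   # how many times faster VictoriaMetrics is

# Lower bound on scan throughput for the 17.28M-point query.
scanned_points = 360 * 12 * 4000
vm_points_per_s = scanned_points / vm_latency_s

print(f"VictoriaMetrics latency: {vm_latency_s:.2f}s")
print(f"InfluxDB throughput:     {influx_rpm:.1f} rpm")
print(f"gap:                     {gap:.0f}x")
print(f"VictoriaMetrics scans at least {vm_points_per_s / 1e6:.0f}M data points / sec")
```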
double-groupby-1, clients=vCPUs, rpm (higher is better)

Conclusions

  • Modern time series databases have decent vertical scalability for both data ingestion and querying.
  • TimescaleDB quickly reaches the disk bandwidth limit. The limit may be lifted by using more expensive disks with higher read / write bandwidth, such as high-end SSDs.
  • TimescaleDB requires much more storage space than VictoriaMetrics and InfluxDB for storing the same number of data points. The most expensive part of long-term storage for huge amounts of time series data is disk space. It is unclear how TimescaleDB deals with this issue.
  • TimescaleDB may perform poorly on queries touching data missing from the OS page cache, since it looks like it scatters data for a single time series across the entire disk.
  • VictoriaMetrics provides the best vertical scalability for both data ingestion and querying. This means that a free single-node VictoriaMetrics instance may easily substitute for a decent cluster built with InfluxDB or TimescaleDB. This saves money on both infrastructure and license costs.
  • Google Cloud provides good vertically scalable machines and durable disk storage with consistently high IOPS and read / write bandwidth, which are suitable for modern time series databases.

Aliaksandr Valialkin

Founder and core developer at VictoriaMetrics
