Billy: how VictoriaMetrics deals with more than 500 billion rows

What is VictoriaMetrics?

VictoriaMetrics is an open source time series database:

  • It is easy to set up and operate.
  • It is optimized for low resource usage — RAM, CPU, network, disk space and disk IO.
  • It understands PromQL, so it can be used as a drop-in replacement for Prometheus in Grafana.
  • It accepts time series data via popular protocols — Prometheus, Influx, Graphite, OpenTSDB (see the sketch after this list).
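For example, a single-node VictoriaMetrics instance accepts the Influx line protocol on its default HTTP port 8428; a minimal sketch (the billy-server host name comes from the setup below):

# Write one measurement using the Influx line protocol;
# by default it is stored as temperature_value{sensor_id="12345"}.
curl -d 'temperature,sensor_id=12345 value=76.23' http://billy-server:8428/write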

The setup

The NVMe drive was formatted and mounted with the following commands:

mkfs.ext4 -m 0 -E lazy_itable_init=0,lazy_journal_init=0,discard -O 64bit,huge_file,extent -T huge /dev/nvme0n1
mount -o discard,defaults /dev/nvme0n1 /mnt/disks/billy
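A quick sanity check that the disk is mounted as intended (a sketch; device and mount point as above):

mount | grep nvme0n1      # should show ext4 with the discard option
df -h /mnt/disks/billy    # confirms the mount point and available space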
The server's internal IP was mapped to the billy-server host name via /etc/hosts:

internal.ip.of.billy-server billy-server
Each ingested row is a temperature sample from a single sensor:

temperature{sensor_id="12345"} 76.23 123456789
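Such rows can also be pushed in the Prometheus text exposition format; a minimal sketch, assuming the default single-node port 8428:

# Import the sample row; the trailing number is the timestamp in milliseconds.
curl -d 'temperature{sensor_id="12345"} 76.23 123456789' http://billy-server:8428/api/v1/import/prometheus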

Data ingestion

[Figures: CPU usage, network usage (bytes/sec), disk IOPS, disk IO utilization, disk bandwidth, RAM usage, and disk space usage during data ingestion]

Data querying: 130 billion data points

The following PromQL query selects the maximum temperature across all sensors over 90 days (an example of issuing it over the HTTP API follows the list):

max(max_over_time(temperature[90d]))

  • In the first step it selects the maximum value for each time series over 90 days — max_over_time(temperature[90d]).
  • In the second step it selects the maximum value across all the time series — max(...).
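The query can be issued via the Prometheus-compatible HTTP API of VictoriaMetrics; a minimal sketch, assuming the default single-node port 8428:

# Execute the heavy query over 90 days of data.
curl -s 'http://billy-server:8428/api/v1/query' --data-urlencode 'query=max(max_over_time(temperature[90d]))'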
[Figures: disk space, memory, disk bandwidth, disk IOPS, disk IO utilization, CPU usage, and system/iowait CPU usage during querying of 130 billion data points]
The query is processed in two stages (a way to observe both stages is sketched after this list):
  • Streaming the data required for query processing into a temporary file — see the time ranges with growing disk space usage. CPU cores are mostly idle during this stage, since it is mostly disk IO-bound: VictoriaMetrics prepares the data in the temporary file for fast consumption during the second stage.
  • Parallel processing of the data from the temporary file on all the available CPU cores — see the CPU usage spikes on the graph above. This stage is usually CPU-bound. According to the time ranges with high CPU usage on the graph, the scan speed reaches one billion data points per second during this stage.
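Both stages are visible in standard sysstat tools while the heavy query runs; a rough sketch, executed from a second terminal on the server:

iostat -x 1   # stage 1: high utilization on nvme0n1 while CPUs stay mostly idle
mpstat 1      # stage 2: close to 100% busy across all CPU cores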

Data querying: 525.6 billion data points

[Figures: disk space, memory, and disk bandwidth usage during querying of 525.6 billion data points; disk bandwidth is shown with and without data deletion peaks]
Unlike the previous case, the data has to be read from disk on every run (a quick check is sketched after this list):
  • The page cache is too small to hold all the data (384 GB of RAM vs 595 GB of data).
  • Previously cached data is evicted from the page cache by recently read data, so it is missing from the cache on subsequent runs.
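A rough way to compare the page cache size with the on-disk data size (the mount point comes from the setup above):

free -h                    # the buff/cache column shows the page cache size
du -sh /mnt/disks/billy    # on-disk size of the stored data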
[Figures: disk IOPS, disk IO utilization, and CPU usage during querying of 525.6 billion data points]
The CPU usage graph shows the same two query stages as before:
  • Low CPU usage and high disk IO usage while fetching data from the database into a temporary file.
  • High CPU usage while scanning the data from temporary files. The second stage scales to all the available CPU cores; according to the graph above, the scan speed exceeds one billion data points per second on a single node during this stage.

Query performance analysis

Query performance depends on the following factors:
  • The amount of data that needs to be read from the database during the query: smaller amounts mean higher performance. Typical queries in time series applications scan relatively small amounts of data, e.g. the temperature trend for a dozen sensors over a short period of time, so they execute with sub-millisecond latency even on a database with trillions of data points (see the example after this list).
  • Disk bandwidth and IOPS: higher available bandwidth and IOPS improve performance for heavy queries over data that is missing from the OS page cache. Note that recently added and recently queried data is usually available in the page cache, so VictoriaMetrics works fast on low-IOPS, low-bandwidth HDDs for typical queries over recently added data.
  • The number of available CPU cores: more CPU cores improve performance for the second, CPU-bound stage of the query.
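As an illustration of such a lightweight query, the sketch below requests the maximum temperature for a single sensor over the last hour; host and port are the same assumptions as above:

# A typical light query: it touches only a tiny fraction of the stored data.
curl -s 'http://billy-server:8428/api/v1/query' --data-urlencode 'query=max_over_time(temperature{sensor_id="12345"}[1h])'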

Conclusions


Aliaksandr Valialkin, founder and core developer at VictoriaMetrics
