Billy: how VictoriaMetrics deals with more than 500 billion rows

What is VictoriaMetrics?

  • It is easy to set up and operate.
  • It is optimized for low resource usage — RAM, CPU, network, disk space and disk IO.
  • It understands PromQL, so it can be used as a drop-in replacement for Prometheus in Grafana.
  • It accepts time series data via popular protocols — Prometheus, Influx, Graphite and OpenTSDB — as shown in the sketch below.
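For example, a minimal sketch of writing a single sample over the Influx line protocol; the billy-server host and the default single-node port 8428 are assumptions based on the setup described below:

# Push one temperature sample via the Influx line protocol endpoint.
# VictoriaMetrics stores it as temperature_value (measurement name + field name).
curl -d 'temperature,sensor_id=12345 value=76.23' http://billy-server:8428/write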

The setup

The NVMe disk for VictoriaMetrics data was formatted and mounted with the following commands:

mkfs.ext4 -m 0 -E lazy_itable_init=0,lazy_journal_init=0,discard -O 64bit,huge_file,extent -T huge /dev/nvme0n1
mount -o discard,defaults /dev/nvme0n1 /mnt/disks/billy

The billy-server name was mapped to the server's internal IP in /etc/hosts:

internal.ip.of.billy-server billy-server

Each row is a single data point — a metric name with a sensor_id label, a floating-point value and a timestamp:

temperature{sensor_id="12345"} 76.23 123456789
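A sketch of starting the single-node VictoriaMetrics binary against this mount; the flag values are illustrative, not the benchmark's exact settings:

# Keep the data on the NVMe mount prepared above.
# -retentionPeriod without a suffix is counted in months.
./victoria-metrics-prod -storageDataPath=/mnt/disks/billy -retentionPeriod=12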

Data ingestion

[Graphs: CPU usage, network usage (bytes/sec), disk iops, disk IO usage percent, disk bandwidth, RAM usage and disk space usage during data ingestion]
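Ingestion progress can be tracked through the server's own /metrics page; a sketch, assuming the vm_rows_inserted_total counter exposed by the single-node binary:

# Rows ingested so far, broken down by ingestion protocol.
curl -s http://billy-server:8428/metrics | grep vm_rows_inserted_total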

Data querying: 130 billion data points

max(max_over_time(temperature[90d]))
  • First, the inner max_over_time(temperature[90d]) selects the maximum value for each time series over the last 90 days.
  • Then the outer max(...) selects the maximum value across all the time series — see the API call sketch after this list.
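A sketch of running this query via the Prometheus-compatible HTTP API; the endpoint and port are single-node defaults, since the exact invocation isn't shown here:

# Evaluate the instant query over the last 90 days of data.
curl -s http://billy-server:8428/api/v1/query \
  --data-urlencode 'query=max(max_over_time(temperature[90d]))'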
[Graphs: disk space, memory, disk bandwidth, disk iops, disk IO percent, CPU and system/iowait CPU usage during querying of 130 billion data points]
The query is executed in two stages:
  • Streaming the data needed for the query into a temporary file — see the time ranges with growing disk space usage. CPU cores are mostly idle during this stage, since it is disk IO-bound: VictoriaMetrics lays the data out in the temporary file for fast consumption during the second stage. A way to watch this stage from the shell is sketched after this list.
  • Parallel processing of the data from the temporary file on all the available CPU cores — see the CPU usage spikes on the graph above. This stage is usually CPU-bound. According to the time ranges with high CPU usage on the graph above, scan speed reaches one billion data points per second during this stage.
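One hedged way to observe the first stage — assuming the temporary files are created under the tmp subdirectory of -storageDataPath, which is an assumption about the on-disk layout:

# Watch temporary-file growth while the query streams data out of the database.
watch -n 1 du -sh /mnt/disks/billy/tmp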

Data querying: 525.6 billion data points

[Graphs: disk space, memory and disk bandwidth usage during querying of 525.6 billion data points; disk bandwidth shown both with and without data deletion peaks]
The data has to be re-read from disk on every query run:
  • The page cache is too small to hold all the data (384 GB of RAM vs 595 GB of data).
  • Previously cached data is evicted from the page cache by recently read data, so it is missing on subsequent runs.
[Graphs: disk iops, disk IO percentage and CPU usage during querying of 525.6 billion data points]
The same two stages are visible on the graphs:
  • Low CPU usage and high disk IO usage while fetching data from the database into a temporary file.
  • High CPU usage while scanning the data from the temporary file. The second stage scales to all the available CPU cores; scan speed exceeds one billion data points per second on a single node, according to the graph above. Timing a repeated run, as sketched below, makes the page cache effect visible.
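A simple sketch for seeing the page cache effect: time the same heavy query twice (endpoint and port are the single-node defaults):

# The first run is disk-bound; a repeated run speeds up only as far as
# the 384 GB page cache can hold the touched portion of the 595 GB dataset.
time curl -s http://billy-server:8428/api/v1/query \
  --data-urlencode 'query=max(max_over_time(temperature[90d]))' > /dev/null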

Query performance analysis

Query performance depends mainly on three factors:
  • The amount of data that must be read from the database during the query. Less data means higher performance. Typical time series queries scan relatively small amounts of data — for instance, the temperature trend for a dozen sensors over a short period of time — so they execute with sub-millisecond latency even on a database with trillions of data points. A sketch of such a query follows this list.
  • Disk bandwidth and iops. Higher available bandwidth and iops improve performance for heavy queries over data that is missing from the OS page cache. Note that recently added and recently queried data is usually present in the page cache, so VictoriaMetrics stays fast on low-iops, low-bandwidth HDDs for typical queries over recently added data.
  • The number of available CPU cores. More CPU cores improve performance for the second, CPU-bound stage of the query.
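A sketch of such a lightweight query over the range-query API; the sensor_id values, the one-hour range and the step are illustrative:

# Temperature trend for a handful of sensors over the last hour.
curl -s http://billy-server:8428/api/v1/query_range \
  --data-urlencode 'query=temperature{sensor_id=~"12345|12346|12347"}' \
  --data-urlencode "start=$(date -d '1 hour ago' +%s)" \
  --data-urlencode "end=$(date +%s)" \
  --data-urlencode 'step=60s'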

Conclusions

A single VictoriaMetrics node ingested and queried more than 500 billion rows: 525.6 billion data points occupied 595 GB of disk space, heavy queries streamed the needed data into a temporary file and then scanned it at more than one billion data points per second, while typical queries over recently added data were served from the OS page cache.

Aliaksandr Valialkin, founder and core developer at VictoriaMetrics
