How VictoriaMetrics makes instant snapshots for multi-terabyte time series data
VictoriaMetrics is a fast time series database. It supports Prometheus queries (PromQL), and its core has the following prominent features:
- High insert performance on high-cardinality data. See this article for details.
- High select performance when large amounts of data are analyzed. See this article and this spreadsheet for details.
- High compression rate for typical time series data.
- Online instant snapshots without degrading database operations.
Let’s look into how instant snapshots work under the hood of VictoriaMetrics.
A few words about ClickHouse
VictoriaMetrics stores data in data structures similar to the MergeTree table engine from ClickHouse. Let’s refresh our memories about ClickHouse. It is the fastest database for analytical data and for other event streams. It outperforms conventional databases such as PostgreSQL and MySQL by 10x-1000x on typical analytical queries. This means a single ClickHouse server may replace a big cluster of conventional databases. Hurry up to give it a try and reduce operational costs :)
ClickHouse is so fast thanks to its architecture, with the MergeTree table engine at its heart.
What is MergeTree?
MergeTree is a column-oriented table engine built on a data structure similar to a Log Structured Merge Tree. MergeTree properties:
- Data for each column is stored separately. This reduces overhead during column scans, since there is no need to spend resources on reading and skipping data for other columns. This also improves the per-column compression ratio, since individual columns usually contain similar data.
- Rows are sorted by a “primary key”, which may span multiple columns. This allows quick row lookups and range scans by the primary key or by its prefix. There is no unique constraint on the primary key, so multiple rows may share an identical primary key. Additionally, sorting improves the compression ratio, since consecutive sorted rows usually contain similar data.
- Rows are split into moderately sized blocks. Each block consists of per-column sub-blocks. Each block is processed independently. This means close-to-perfect scalability on multi-CPU systems: just feed all the available CPU cores with independent blocks. Block size may be configured, but it is advisable to use sub-blocks with sizes in the range of 64KB-2MB, so they fit CPU caches. This improves performance, since CPU cache access is much faster than RAM access. Additionally, this reduces overhead when only a few rows must be accessed out of a block with many rows.
- Blocks are merged into “parts”. These parts are similar to SSTables from a Log Structured Merge (LSM) tree. ClickHouse merges smaller parts into bigger parts in the background. Unlike a canonical LSM tree, MergeTree doesn’t have strict levels with similarly sized parts. The merge process improves query performance, since fewer parts are inspected with each query. Additionally, the merge process reduces the number of data files, since each part contains a fixed number of files proportional to the number of columns. Merging parts has yet another benefit: a better compression rate, since it brings column data for sorted rows closer together.
- Parts are grouped into partitions by a “partitioning key”. Initially ClickHouse allowed creating per-month partitions on a Date column. Now arbitrary expressions may be used to build the partitioning key. Distinct values of the partitioning key result in separate partitions. This allows fast and easy per-partition data archiving / removal.
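The properties above can be illustrated with a toy model. This is a minimal sketch, not ClickHouse or VictoriaMetrics code: the row layout `(name, timestamp, value)` and the helper names `make_part`, `merge_parts` and `to_columns` are made up for illustration. It shows parts as immutable, sorted runs of rows, a background-style merge of small parts into a bigger one, and the per-column view of a part.

```python
import heapq
from typing import Dict, List, Tuple

# Hypothetical row layout for illustration: (metric name, timestamp, value).
Row = Tuple[str, int, float]
Part = Tuple[Row, ...]


def primary_key(row: Row) -> Tuple[str, int]:
    """The primary key spans two columns: metric name, then timestamp."""
    return (row[0], row[1])


def make_part(rows: List[Row]) -> Part:
    """Build an immutable part: rows sorted by the primary key."""
    return tuple(sorted(rows, key=primary_key))


def merge_parts(parts: List[Part]) -> Part:
    """Merge already sorted parts into one bigger sorted part in a single pass."""
    return tuple(heapq.merge(*parts, key=primary_key))


def to_columns(part: Part) -> Dict[str, list]:
    """Split a part into per-column lists, mimicking column-oriented storage."""
    names, timestamps, values = zip(*part) if part else ((), (), ())
    return {
        "name": list(names),
        "timestamp": list(timestamps),
        "value": list(values),
    }
```

After the merge, equal metric names sit next to each other in the `name` column and timestamps grow monotonically within each name, which is the kind of locality that makes per-column compression effective.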
Impressed by the MergeTree architecture? Time to return to instant snapshots in VictoriaMetrics.
Instant snapshots in VictoriaMetrics
VictoriaMetrics stores time series data in MergeTree-like tables, so it benefits from the features mentioned above. Additionally, it stores an inverted index for fast lookups by the given time series selectors. The inverted index is stored in a mergeset, a data structure built on top of MergeTree ideas and optimized for inverted index lookups.
MergeTree has an additional important property, which wasn’t mentioned above: atomicity. This means the following:
- Newly added parts either appear in the MergeTree in full or don’t appear at all. MergeTree never contains partially created parts. The same applies to the merge process: parts are either fully merged into a new part or the merge fails. There are no partially merged parts in MergeTree. Does this mean that parts appear out of the blue? No. Parts are assembled in temporary directories and then atomically moved to MergeTree. The same applies to the merge: old parts are atomically swapped with the new part when it is ready.
- Part contents in MergeTree never change. Never ever. Parts are immutable. They may only be deleted after being merged into a bigger part.
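The "assemble in a temporary directory, then atomically move" step can be sketched as follows. This is an illustration under assumptions, not the actual VictoriaMetrics or ClickHouse code: the `tmp/` and `parts/` directory layout and the function name `add_part_atomically` are hypothetical. It relies on the fact that renaming a directory within one file system is atomic on POSIX systems.

```python
import os
import tempfile
from typing import Dict


def add_part_atomically(storage_dir: str, part_name: str, files: Dict[str, bytes]) -> str:
    """Assemble a part in a temporary directory, then publish it with one rename.

    The tmp/ and parts/ layout is a hypothetical one chosen for this sketch.
    Both directories live under storage_dir, so the rename never crosses
    file systems.
    """
    tmp_root = os.path.join(storage_dir, "tmp")
    parts_root = os.path.join(storage_dir, "parts")
    os.makedirs(tmp_root, exist_ok=True)
    os.makedirs(parts_root, exist_ok=True)

    # Readers only look at parts/, so a half-written part is never visible.
    tmp_dir = tempfile.mkdtemp(prefix=part_name + "-", dir=tmp_root)
    for file_name, data in files.items():
        with open(os.path.join(tmp_dir, file_name), "wb") as f:
            f.write(data)
            f.flush()
            os.fsync(f.fileno())  # make sure the data is on disk before publishing

    # The part appears under parts/ either completely or not at all.
    final_dir = os.path.join(parts_root, part_name)
    os.rename(tmp_dir, final_dir)
    return final_dir
```

If the process crashes before the rename, only a leftover directory under `tmp/` remains, and it can be discarded on the next start without touching the published parts.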
This means MergeTree is always in a consistent state. Have you already guessed how instant snapshots work? Continue reading if not :)
Many file systems provide such beasts as hard links. A hard-linked file shares the contents of the original file and is indistinguishable from it. Hard links do not occupy additional space on the file system. When the original file is deleted, the hard-linked file becomes the “original”, since it becomes the only file that points to the original file contents. Hard links are created instantly regardless of file size. Bingo!
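The hard link behaviour described above is easy to check with the Python standard library. The sketch below assumes a file system that supports hard links; the file names and the `hard_link_demo` helper are made up for the demonstration.

```python
import os
import tempfile


def hard_link_demo() -> dict:
    """Create a file, hard link it, delete the original, and report what happened."""
    work_dir = tempfile.mkdtemp()
    original = os.path.join(work_dir, "original.bin")
    linked = os.path.join(work_dir, "linked.bin")

    with open(original, "wb") as f:
        f.write(b"time series data")

    # Creating the link does not copy any data, whatever the file size.
    os.link(original, linked)

    # Both names refer to the same underlying file contents.
    same_file = os.path.samefile(original, linked)
    link_count = os.stat(original).st_nlink

    # Deleting the original name leaves the contents reachable via the link.
    os.remove(original)
    with open(linked, "rb") as f:
        surviving = f.read()

    return {"same_file": same_file, "link_count": link_count, "surviving": surviving}
```

The contents are released by the file system only when the last name pointing at them is removed.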
VictoriaMetrics uses hard links to create instant snapshots for time series data and for the inverted index, since both are stored in MergeTree-like data structures. It just atomically hard links all the files from all the parts into a snapshot directory. It is safe, since parts never change. After that the snapshot may be backed up / archived to any storage (for instance, Amazon S3 or Google Cloud Storage) using any suitable tool (for instance, cp for cloud storage). There is no hurry in the archive / backup process: it may take as much time as needed, since it is fully decoupled from further VictoriaMetrics operations. There is no need to compress the snapshot before archiving, since it already contains highly compressed data; VictoriaMetrics provides the best compression rate for time series data.
The newly created snapshot takes no additional space on the file system, since its files point to existing files from MergeTree tables. This holds true even if the snapshot size exceeds 10 terabytes! The snapshot starts occupying additional space only after the original MergeTree merges or deletes the parts linked from the snapshot. So don’t forget to remove old snapshots to free up disk space.
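The idea of snapshotting a directory of immutable parts with hard links can be sketched as below. This is a simplified illustration, not the VictoriaMetrics implementation: the function name `create_snapshot` and the `.tmp` suffix are hypothetical, and the sketch assumes that no parts are added or removed while the directory walk runs, that `snapshot_dir` is on the same file system as `parts_dir`, and that it is located outside of `parts_dir`.

```python
import os
import tempfile


def create_snapshot(parts_dir: str, snapshot_dir: str) -> int:
    """Hard link every file under parts_dir into snapshot_dir.

    Returns the number of hard links created. The snapshot is assembled
    under a temporary name and published with a single rename, so a
    half-built snapshot is never visible under snapshot_dir.
    """
    tmp_dir = snapshot_dir + ".tmp"  # hypothetical naming convention
    linked = 0
    for root, _dirs, files in os.walk(parts_dir):
        rel = os.path.relpath(root, parts_dir)
        target_root = os.path.normpath(os.path.join(tmp_dir, rel))
        os.makedirs(target_root, exist_ok=True)
        for name in files:
            # No data is copied: only a new name for the same contents.
            os.link(os.path.join(root, name), os.path.join(target_root, name))
            linked += 1
    os.rename(tmp_dir, snapshot_dir)
    return linked
```

The cost is proportional to the number of files, not to their size. Once a linked part is removed from `parts_dir` after a merge, its data lives on only through the snapshot, which is when the snapshot starts to account for disk space of its own.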
VictoriaMetrics built instant snapshots on a brilliant idea from the MergeTree table engine in ClickHouse. These snapshots may be created at any time without downtime and without disrupting normal VictoriaMetrics operations.
Snapshot functionality may be evaluated on a single-server Docker image or on a single-server binary. It is available via the following HTTP handlers:
- /snapshot/list: lists available snapshots
- /snapshot/create: creates a new snapshot
- /snapshot/delete?snapshot=…: deletes the given snapshot
Created snapshots are available under the /<-storageDataPath>/snapshots directory, where -storageDataPath is a command-line flag containing the filesystem path where VictoriaMetrics stores data.
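The handlers listed above can be called with any HTTP client. Below is a minimal sketch using only the Python standard library. The handler paths come from the list above; everything else is an assumption for illustration: the helper names, the use of plain GET requests, and the base URL passed by the caller (for example, a local single-server instance). The response body is returned as raw text, since its format is not described here.

```python
from typing import Optional
from urllib.parse import urlencode
from urllib.request import urlopen


def snapshot_url(base_url: str, action: str, snapshot: Optional[str] = None) -> str:
    """Build the URL for one of the snapshot handlers: list, create or delete."""
    if action not in ("list", "create", "delete"):
        raise ValueError(f"unknown snapshot action: {action!r}")
    url = f"{base_url.rstrip('/')}/snapshot/{action}"
    if action == "delete":
        if not snapshot:
            raise ValueError("delete requires a snapshot name")
        url += "?" + urlencode({"snapshot": snapshot})
    return url


def call_snapshot_handler(base_url: str, action: str,
                          snapshot: Optional[str] = None,
                          timeout: float = 30.0) -> str:
    """Call a snapshot handler and return the raw response body as text."""
    with urlopen(snapshot_url(base_url, action, snapshot), timeout=timeout) as resp:
        return resp.read().decode("utf-8")
```

A typical session would call `call_snapshot_handler(base, "create")`, archive the new directory under `snapshots`, and then call the `delete` action with the snapshot name to free up disk space.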
Single-server VictoriaMetrics may be fed with time series data via the following mechanisms:
- Prometheus remote write API — the primary mechanism
- InfluxDB line protocol
- Graphite plaintext protocol: just pass the -graphiteListenAddr command-line flag.
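As an illustration of the last mechanism, the Graphite plaintext protocol sends one sample per line in the form `<metric path> <value> <timestamp>`. The sketch below formats such a line and sends it over TCP. It is only a sketch under assumptions: the helper names are made up, and the host and port must match whatever address was passed via -graphiteListenAddr.

```python
import socket
import time
from typing import Optional


def graphite_line(metric: str, value: float, timestamp: Optional[int] = None) -> bytes:
    """Format one sample as a Graphite plaintext line: "<path> <value> <timestamp>\\n"."""
    ts = int(time.time()) if timestamp is None else int(timestamp)
    return f"{metric} {value} {ts}\n".encode("ascii")


def send_graphite_sample(host: str, port: int, metric: str, value: float,
                         timestamp: Optional[int] = None, timeout: float = 5.0) -> None:
    """Send a single sample to a Graphite plaintext listener over TCP.

    host and port are assumptions supplied by the caller; they must match
    the address the server listens on.
    """
    with socket.create_connection((host, port), timeout=timeout) as conn:
        conn.sendall(graphite_line(metric, value, timestamp))
```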
We are working on a SaaS solution built on top of the VictoriaMetrics core. The SaaS offloads all the operational headache related to time series data storage from your shoulders to ours. Read this article for details and visit our homepage for more info. We are looking for investors and/or brave clients with high volumes of time series data which must be efficiently collected / analyzed, so don’t hesitate to contact us.
P.S. ClickHouse also supports fast snapshots via ALTER TABLE … FREEZE PARTITION.
Update: VictoriaMetrics is open source now, so you can investigate the code to better understand how it makes instant snapshots.