
Speeding up backups for big time series databases

Suppose you have a time series database containing terabytes of data. How do you manage backups for this data? Do you consider it too big to back up and blindly rely on database replication for data safety? Then you are in trouble.

Why doesn’t replication save you from disaster?

Replication is the process of creating multiple copies of the same data on distinct hardware resources and keeping those copies in a consistent state. Replication protects against hardware failures: if a node or disk goes out of service, your data shouldn’t be lost or corrupted, since at least one copy of the data remains. Are we safe? No:

How do you protect against these issues? Use plain old backups.

Plain old backups

There are various options for data backups — nearby HDDs, magnetic tapes, dedicated storage systems, Amazon S3, Google Cloud Storage, etc.

S3 and GCS are the most promising storage options for backups. They are inexpensive, reliable and durable. But they have certain limitations:

Is there a way to overcome these limitations? Yes, if certain conditions are met:

If the database stores all its data according to these conditions, then it is quite easy to set up inexpensive and fast incremental backups on top of S3 or GCS. Full backups can also be sped up by server-side copying of shared immutable files from the old backup to the new one. Both GCS and S3 support server-side object copying. This operation is usually fast for objects of any size within the same bucket, since only metadata is copied.
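Because data files are immutable, planning a backup reduces to set operations on file names. The following sketch (hypothetical code, not taken from any real backup tool) shows how a full backup against a previous backup splits into files that can be server-side copied versus files that must be uploaded over the network:

```go
package main

import (
	"fmt"
	"sort"
)

// planFullBackup decides, for each immutable local data file, whether it
// can be server-side copied from the previous backup (metadata-only copy)
// or must be uploaded over the network. Names are sorted for determinism.
func planFullBackup(local, prevBackup []string) (copyServerSide, upload []string) {
	prev := make(map[string]bool)
	for _, name := range prevBackup {
		prev[name] = true
	}
	for _, name := range local {
		if prev[name] {
			// Shared immutable file: a cheap server-side copy suffices.
			copyServerSide = append(copyServerSide, name)
		} else {
			// New file: it must be transferred over the network.
			upload = append(upload, name)
		}
	}
	sort.Strings(copyServerSide)
	sort.Strings(upload)
	return copyServerSide, upload
}

func main() {
	local := []string{"part1.bin", "part2.bin", "part4.bin"}
	prev := []string{"part1.bin", "part2.bin", "part3.bin"}
	cp, up := planFullBackup(local, prev)
	fmt.Println(cp, up) // [part1.bin part2.bin] [part4.bin]
}
```

Only `part4.bin` crosses the network here; the bulk of the backup is metadata operations, which is why full backups against a previous backup can be almost as cheap as incremental ones.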

Which data structure adheres to the principles mentioned above and can be used as a building block for a time series database? The B-tree, the heart of most databases? LMDB? PGDATA or TOAST from PostgreSQL?

No. All these data structures modify file contents on disk.

Log-structured merge trees and backups

The LSM tree adheres to all the conditions mentioned above:

LSM trees are used for building key-value stores such as LevelDB and RocksDB. These building blocks can be used for creating arbitrarily complex databases:

In theory, all these databases can support incremental backups, provided they store all their data in LSM-like data structures. But how do you back up live data when new files are constantly added and old files are constantly removed from the database? Thanks to file immutability in LSM-like data structures, it is easy to take an instant snapshot via hard links and then back up the data from the snapshot.

Backup tools

Full disclosure: I’m the core developer of VictoriaMetrics, so this section is dedicated to the recently published vmbackup tool. This command-line utility creates VictoriaMetrics backups on S3 and GCS. It takes full advantage of the LSM tree properties mentioned above:

These features can save hours of time and terabytes of network bandwidth when backing up a multi-terabyte time series database.

It is quite easy to set up smart backups with frequent incremental backups and less frequent full backups using server-side copy.

VictoriaMetrics can produce files exceeding 5TB. How does vmbackup handle such files given the 5TB object size limit mentioned at the beginning of the article? And how does it handle network errors near the end of uploading such a big file? The answer is simple: it splits files into 1GB chunks and uploads each chunk independently. So in the worst case, a network error during the upload forces re-transferring only 1GB of data.
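The chunking arithmetic is straightforward. Here is a minimal sketch (my own illustration, not vmbackup's actual implementation) of splitting a file into independent 1GB byte ranges, each of which can be uploaded and retried on its own:

```go
package main

import "fmt"

// chunkSize is 1GB, matching the chunk size described in the article.
const chunkSize = int64(1) << 30

// chunkRanges splits a file of the given size into half-open byte ranges
// [start, end). Each range is uploaded independently, so a network error
// costs at most one chunk's worth of re-transfer.
func chunkRanges(fileSize int64) [][2]int64 {
	var ranges [][2]int64
	for off := int64(0); off < fileSize; off += chunkSize {
		end := off + chunkSize
		if end > fileSize {
			end = fileSize // last chunk may be shorter than 1GB
		}
		ranges = append(ranges, [2]int64{off, end})
	}
	return ranges
}

func main() {
	// A 5TB file becomes 5120 independent 1GB chunks.
	r := chunkRanges(int64(5) << 40)
	fmt.Println(len(r)) // 5120
}
```

This also sidesteps the per-object size limit: each stored chunk is far below 5TB, and an interrupted upload resumes from the failed chunk rather than from the beginning of the file.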