
Speeding up backups for big time series databases

Suppose you have a time series database containing terabytes of data. How do you manage backups for this data? Do you consider it too big to back up and blindly rely on database replication for data safety? Then you are in trouble.

Why doesn't replication save you from disaster?

Replication is the process of creating multiple copies of the same data on distinct hardware resources and keeping those copies in a consistent state. Replication protects against hardware failures — if a node or disk goes out of service, your data shouldn't be lost or corrupted, since at least one copy of the data remains. Are we safe? No:

  • What if you make a mistake during a routine database cluster upgrade or reconfiguration and it leads to data loss? Replication doesn't help in this case either.

Plain old backups

There are various options for storing data backups — nearby HDDs, magnetic tapes, dedicated storage systems, Amazon S3, Google Cloud Storage, etc. But backing up terabytes of data to any of them runs into a few obstacles:

  • Limited network bandwidth, so full backups can take days to complete. For instance, transferring 10TB over a gigabit network at a realistic throughput of ~100MB/s takes about 100,000 seconds, i.e. more than 27 hours.
  • A non-zero probability of network errors. What if a network error occurs at the end of uploading a 10TB file? Spend yet another 27 hours uploading it again?
  • Paid egress traffic from the datacenter where the database is located. Bigger and more frequent backups increase network bandwidth costs.

These problems become manageable if the database stores its data on disk in a backup-friendly way:

  • Files must be immutable, i.e. their contents shouldn't change over time. This allows uploading each file to backup storage only once.
  • New data must go into new files, so incremental backups become cheap — just back up the new files (see the sketch after this list).
  • The total number of files shouldn't be too high. This reduces the overhead and costs of per-file operations and management.
  • Data must be stored in compressed form on disk in order to reduce network bandwidth usage during backups.
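
Here is a minimal Go sketch of the incremental-backup idea these properties enable: only files missing from the remote backup get uploaded. The directory path and the upload callback are hypothetical placeholders — a real implementation would use an S3 or GCS client.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// listLocalFiles returns the set of regular files under dir, keyed by path relative to dir.
func listLocalFiles(dir string) (map[string]struct{}, error) {
	files := make(map[string]struct{})
	err := filepath.Walk(dir, func(path string, info os.FileInfo, err error) error {
		if err != nil {
			return err
		}
		if info.Mode().IsRegular() {
			rel, relErr := filepath.Rel(dir, path)
			if relErr != nil {
				return relErr
			}
			files[rel] = struct{}{}
		}
		return nil
	})
	return files, err
}

// incrementalBackup uploads only the files missing from the remote backup.
// Since data files are immutable, a file that already exists remotely never
// needs to be uploaded again.
func incrementalBackup(localDir string, remote map[string]struct{}, upload func(rel string) error) error {
	local, err := listLocalFiles(localDir)
	if err != nil {
		return err
	}
	for rel := range local {
		if _, ok := remote[rel]; ok {
			continue // already present in the backup
		}
		if err := upload(rel); err != nil {
			return err
		}
	}
	return nil
}

func main() {
	// Pretend the remote backup already contains this file.
	remote := map[string]struct{}{"part-0001.bin": {}}
	// "/var/lib/tsdb/data" is a made-up data directory for the example.
	err := incrementalBackup("/var/lib/tsdb/data", remote, func(rel string) error {
		fmt.Println("uploading", rel) // a real uploader would stream the file to S3/GCS
		return nil
	})
	if err != nil {
		fmt.Fprintln(os.Stderr, "backup failed:", err)
	}
}
```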

Log-structured merge trees and backups

An LSM tree satisfies all the conditions mentioned above:

  • Files are immutable.
  • New data goes into new files.
  • The total number of files remains low thanks to background merging of smaller files into bigger files.
  • Sorted rows usually have a good compression ratio (see the toy write-path sketch below).
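
To illustrate why the LSM approach fits so well, here is a toy Go sketch of the write path, assuming a single in-memory buffer that is flushed into a new sorted, gzip-compressed, immutable file. The file names and on-disk format are made up for illustration and are not VictoriaMetrics internals.

```go
package main

import (
	"compress/gzip"
	"fmt"
	"os"
	"sort"
)

// flushBuffer writes the buffered rows into a brand-new file and never touches
// existing files. Sorting rows before compression improves the compression ratio.
func flushBuffer(rows []string, fileName string) error {
	sort.Strings(rows)
	f, err := os.Create(fileName)
	if err != nil {
		return err
	}
	defer f.Close()
	zw := gzip.NewWriter(f)
	for _, row := range rows {
		if _, err := fmt.Fprintln(zw, row); err != nil {
			zw.Close()
			return err
		}
	}
	return zw.Close()
}

func main() {
	rows := []string{
		`cpu_usage{host="b"} 0.17 1580000000`,
		`cpu_usage{host="a"} 0.42 1580000000`,
	}
	// Each flush produces a new immutable file; a background process would later
	// merge many small files into a bigger one to keep the total file count low.
	if err := flushBuffer(rows, "part-0002.bin.gz"); err != nil {
		fmt.Fprintln(os.Stderr, "flush failed:", err)
	}
}
```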

Backup tools

Full disclosure: I’m the core developer of VictoriaMetrics, so this section is dedicated to the recently published vmbackup tool. This command-line utility creates VictoriaMetrics backups on S3 and GCS. It takes full advantage of the LSM tree properties mentioned above:

  • It supports fast full backups by employing server-side copying of files shared with already existing backups (a rough sketch of the idea follows below).
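
The server-side copy trick can be sketched as follows. This is only an illustration of the idea, not the actual vmbackup implementation: the snapshot file list is split into files that can be copied inside the object storage (no egress traffic) and new files that must be uploaded over the network.

```go
package main

import "fmt"

// backupPlan splits snapshot files into those that can be copied server-side
// within the object storage and those that must be uploaded over the network.
type backupPlan struct {
	serverSideCopy []string // shared with an existing backup: cheap in-storage copy
	upload         []string // new files: must cross the network
}

// planFullBackup compares the snapshot file list with the contents of an
// already existing backup and builds the plan for a fast full backup.
func planFullBackup(snapshotFiles []string, existingBackup map[string]bool) backupPlan {
	var p backupPlan
	for _, f := range snapshotFiles {
		if existingBackup[f] {
			p.serverSideCopy = append(p.serverSideCopy, f)
		} else {
			p.upload = append(p.upload, f)
		}
	}
	return p
}

func main() {
	snapshot := []string{"part-0001.bin", "part-0002.bin", "part-0003.bin"}
	existing := map[string]bool{"part-0001.bin": true, "part-0002.bin": true}
	p := planFullBackup(snapshot, existing)
	fmt.Println("copy server-side:", p.serverSideCopy) // no egress traffic needed
	fmt.Println("upload:", p.upload)                   // only new data crosses the network
}
```

Since most LSM tree files survive between backups unchanged, the server-side copy list is usually much bigger than the upload list, which is what makes full backups fast and cheap.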

Conclusions

  • While replication provides availability during hardware issues, it doesn’t save you from data loss. Use backups.
  • Large backups can be fast and cheap if a proper database is used. I’d recommend VictoriaMetrics :)
  • VictoriaMetrics backups can cover data collected from many Prometheus instances.

