WAL usage looks broken in modern Time Series Databases?


Write-ahead logging (WAL) is a common practice among modern time series databases: Prometheus uses WAL, InfluxDB uses WAL, TimescaleDB transitively uses WAL from PostgreSQL, and Cassandra also uses WAL.

Let’s look into WAL theory. WAL protects against losing recently added data on power loss. All incoming data must be written to the write-ahead log before success is returned to the client. This guarantees that the data can be recovered from the WAL file after power loss. Looks simple and great in theory! What about practice?
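To make the theory concrete, here is a minimal sketch of the "write, fsync, then acknowledge" contract. It is an illustration only: `TinyWAL`, `recover` and the length-prefixed record format are hypothetical names and choices made for this example, not the format of any database mentioned here. It also skips details a real implementation would need, such as checksums and syncing the parent directory.

```python
import os
import struct
import tempfile


class TinyWAL:
    """Hypothetical append-only log: a record is acknowledged only after fsync."""

    def __init__(self, path):
        self.path = path
        self.f = open(path, "ab")

    def append(self, payload: bytes) -> bool:
        # Length-prefixed record, so recovery knows where each record ends.
        self.f.write(struct.pack("<I", len(payload)) + payload)
        self.f.flush()             # process buffer -> OS page cache
        os.fsync(self.f.fileno())  # OS page cache -> storage device
        return True                # only now is it safe to report success

    def close(self):
        self.f.close()


def recover(path):
    """Replay every complete record and stop at the first truncated one."""
    records = []
    with open(path, "rb") as f:
        while True:
            header = f.read(4)
            if len(header) < 4:
                break
            (size,) = struct.unpack("<I", header)
            payload = f.read(size)
            if len(payload) < size:
                break  # torn write at the tail of the log: ignore it
            records.append(payload)
    return records
```

The `os.fsync` call on every `append` is what makes the guarantee hold, and it is also the expensive part.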

Page cache and WAL

What do DB devs do with slow fsync? They relax data safety guarantees in various ways:

  • Prometheus calls fsync only after a big chunk of data (aka segment) is written into WAL, so all the segment data may be lost / corrupted on power loss before fsync. The data may be corrupted if the OS flushes a few pages with the written data to disk, but doesn’t flush the remaining pages. Prometheus fsyncs segments every 2 hours by default, so a lot of data may be corrupted on hardware reset.
  • Cassandra by default calls fsync on WAL only every 10 seconds, so the last 10 seconds of data may be lost / corrupted on power loss. Replication can probably help in this case.
  • InfluxDB by default calls fsync on every write request, so it recommends sending write requests containing 5K-10K data points each in order to alleviate fsync slowness. It also recommends setting wal-fsync-delay to a non-zero value for workloads with a high volume of writes and/or for slow HDDs, in which case data may be lost on power loss.
  • TimescaleDB relies on PostgreSQL’s WAL mechanism, which puts data into WAL buffers in RAM and periodically flushes them to the WAL file. This means that the data from unflushed WAL buffers is lost on power loss or on process crash.
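A rough way to compare these relaxations is to estimate how many acknowledged data points sit in the unsynced window. The sketch below is back-of-the-envelope arithmetic only: the ingestion rate is a made-up number, and the intervals are the defaults as described in the list above.

```python
def points_at_risk(ingest_rate_per_sec: int, fsync_interval_sec: float) -> int:
    """Upper bound on data points acknowledged but not yet fsynced."""
    return int(ingest_rate_per_sec * fsync_interval_sec)


RATE = 100_000  # hypothetical ingestion rate, data points per second

# Intervals as described in the list above.
cassandra_window = points_at_risk(RATE, 10)          # fsync every 10 seconds
prometheus_window = points_at_risk(RATE, 2 * 3600)   # fsync once per 2-hour segment
```

At this assumed rate the windows are 1,000,000 and 720,000,000 data points respectively.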

So, modern TSDBs provide relaxed data safety guarantees: recently inserted data may be lost on power loss. The following questions arise:

  • Don’t these relaxations defeat the main purpose of write ahead logging? IMHO, the answer to this question is “yes”. Sad, but true :(
  • Do better approaches with similar data safety guarantees exist? Yes: SSTable.

SSTable instead of WAL?

A careful reader may notice the difference: the “optimized WAL usage” approach can result in data corruption, while the “write directly to SSTable” approach is vulnerable to process crashes. IMHO, losing recently written data on process crash is less severe than data corruption. A properly implemented database shutdown procedure significantly reduces the risk of data loss. The shutdown procedure is quite simple: stop accepting new data, then flush in-memory buffers to disk, then exit.
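The three shutdown steps can be sketched as follows. This is a toy model under stated assumptions: `BufferedStore` is a hypothetical class, and the `persisted` list stands in for data that has reached disk. It is not how any particular database implements shutdown.

```python
import threading


class BufferedStore:
    """Hypothetical store that buffers writes in RAM and flushes on demand."""

    def __init__(self):
        self._lock = threading.Lock()
        self._accepting = True
        self._buffer = []    # recently written rows, held only in RAM
        self.persisted = []  # stand-in for rows already flushed to disk

    def write(self, row) -> bool:
        with self._lock:
            if not self._accepting:
                return False  # reject writes once shutdown has begun
            self._buffer.append(row)
            return True

    def flush(self):
        with self._lock:
            self.persisted.extend(self._buffer)
            self._buffer.clear()

    def shutdown(self):
        with self._lock:
            self._accepting = False  # step 1: stop accepting new data
        self.flush()                 # step 2: flush in-memory buffers
        # step 3: the caller exits the process after this returns
```

In a real service, `shutdown()` would typically be wired to the termination signal handler so that a normal restart or deploy loses nothing.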

The following databases prefer writing directly into SSTable instead of WAL:

  • ClickHouse. By default it writes incoming data directly to persistent storage in an SSTable-like format. It supports in-memory buffering via Buffer tables.
  • VictoriaMetrics. It buffers incoming data in RAM and periodically flushes it to an SSTable-like data structure on disk. The flush interval is hard-coded to one second.
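The general "buffer in RAM, then flush a sorted immutable part" idea can be sketched as below. This is an assumption-laden illustration and not the on-disk format of ClickHouse or VictoriaMetrics: `flush_part`, `read_part`, the JSON encoding and the `part-NNNNNN.json` naming are all invented for the example. A real implementation would also sync the parent directory and use a compact columnar encoding.

```python
import json
import os
import tempfile


def flush_part(rows, directory, part_id):
    """Write buffered (timestamp, value) rows as one sorted, immutable part file."""
    final_path = os.path.join(directory, f"part-{part_id:06d}.json")
    tmp_path = final_path + ".tmp"
    with open(tmp_path, "w") as f:
        json.dump(sorted(rows), f)  # sorted by timestamp, as an SSTable is sorted by key
        f.flush()
        os.fsync(f.fileno())        # persist the part before it becomes visible
    os.replace(tmp_path, final_path)  # atomic rename: no half-written part is ever visible
    return final_path


def read_part(path):
    """Load a part file back as a list of (timestamp, value) tuples."""
    with open(path) as f:
        return [tuple(row) for row in json.load(f)]
```

A background timer would call `flush_part` with the current buffer once per flush interval. A crash loses at most the rows buffered since the last flush, and parts already on disk are never rewritten in place, so they are not at risk of partial overwrites.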

Conclusions

  • Write-ahead logging tends to consume a significant portion of disk IO bandwidth. Because of this drawback, it is recommended to put WAL on a separate physical disk. The “write directly to SSTable” approach requires less disk IO bandwidth, so a database without WAL can ingest higher volumes of data.
  • WAL may slow down database startup because of a slow recovery step, and may even lead to OOMs and crash loops.

Prometheus, InfluxDB and Cassandra already use LSM-like data structures with SSTables, so they may quickly switch to the new approach. It is not yet clear whether TimescaleDB could use the new approach, since it doesn’t use LSM.

Update: we open-sourced VictoriaMetrics, so you can inspect the code and verify that it doesn’t use WAL :)

Founder and core developer at VictoriaMetrics
