Let’s look into WAL theory. WAL is used as a protection against losing of recently added data on power loss. All the incoming data must be written into write ahead log before returning
success to the client. This guarantees that the data may be recovered from WAL file after power loss. Looks simple and great in theory! What’s in the practice?
Page cache and WAL
Database developers know that the operating system (OS) doesn’t write data to persistent storage on every write syscall. The data is just written into page cache residing in RAM unless direct IO is used. So data
successfully written into WAL file may disappear on power loss. How to deal with this? Either to use cumbersome direct IO or to explicitly tell the OS to flush recently written data from page cache to persistent storage via fsync syscall. But
fsync has a major downside — it is very slow on SSD (1K-10K rps) and extremely slooow on HDD (100 rps). For instance, 1M inserts per second may be easily slowed down to 100 inserts per second with
What do DB devs do with slow
fsync? They relax data safety guarantees in various ways:
- Prometheus calls fsync only after big chunk of data (aka
segment) is written into WAL, so all the
segmentdata may be lost / corrupted on power loss before
fsync. The data may be corrupted if the OS flushes a few pages with the written data to disk, but doesn’t flush the remaining pages. Prometheus
fscync’s segments every 2 hours by default, so a lot of data may be corrupted on hardware reset.
- Cassandra by default calls fsync on WAL only every 10 seconds, so the last 10 seconds of data may be lost / corrupted on power loss. Probably, replication can help in this case.
- InfluxDB by default calls
fsyncon every write request, so it is recommended feeding InfluxDB with write requests containing 5K-10K data points in order to alleviate
fsyncslowness. It recommends setting
wal-fsync-delayto non-zero value for workloads with high volume of writes and/or for slow HDDs, so data may be lost on power loss.
- TimescaleDB relies on PostgreSQL’s WAL mechanism, which puts data into WAL buffers in RAM and periodically flushes them to WAL file. This means that the the data from unflushed WAL buffers is lost on power loss or on process crash.
So, modern TSDBs provide
relaxed data safety guarantees — recently inserted data may be lost on power loss. The following questions arise:
- Don’t these relaxations defeat the main purpose of write ahead logging? IMHO, the answer to this question is “yes”. Sad, but true :(
- Are there better approaches with similar
data safety guaranteesexist? Yes — SSTable.
SSTable instead of WAL?
The idea is simple — just buffer data in memory and atomically flush it into SSTable-like data structure on disk without the need of WAL. The flush may be triggered either by timeout (i.e. every N seconds) or by reaching the maximum buffer size. This gives similar data safety guarantees as the “optimized WAL usage” described in the previous chapter — recently inserted data may be lost on power loss / process crash.
Careful reader may notice the difference — "optimized WAL usage” can result in data corruption, while “write directly to SSTable” approach is vulnerable to process crash. IMHO, recently written data loss on process crash has lower severity comparing to data corruption. Properly implemented database shutdown procedure significantly reduces the risk of data loss. The shutdown procedure is quite simple — stop accepting new data, then flush in-memory buffers to disk, then exit.
The following databases prefer writing directly into SSTable instead of WAL:
- ClickHouse. By default it writes incoming data directly do persistent storage in SSTable-like format. It supports in-memory buffering via Buffer table.
- VictoriaMetrics. It buffers incoming data in RAM and periodically flushes it to SSTable-like data structure on disk. Flush interval is hard-coded to one second.
WAL usage looks broken in modern time-series databases. It doesn’t guarantee data safety for recently inserted data on power loss. WAL has two additional drawbacks:
- Write-ahead logging tends to consume significant portion of disk IO bandwidth. It is recommended to put WAL into a separate physical disk due to this drawback. “Write directly to SSTable” approach requires less disk IO bandwidth, so higher volumes of data may be consumed by the database without WAL.
- WAL may slow down database startup times due to slow recovery step and even may lead to OOMs and crash loops.
Prometheus, InfluxDB and Cassandra already use LSM-like data structures with SSTables, so they may quickly switch to the new approach. It is unclear yet whether TimescaleDB could use the new approach, since it doesn’t use LSM.
Update: we open-sourced VictoriaMetrics, so you can inspect the code and verify that it doesn’t use WAL :)