What is WAL?

Write-Ahead Logging (WAL) is PostgreSQLs mechanism to ensure data durability and crash recovery.

The core principle of WAL is simple

Changes must be recorded in the WAL before corresponding data pages are written to the disk

When a transaction performs an INSERT, UPDATE, DELETE, or TRUNCATE operation, PostgreSQL generates WAL records describing those changes. These records are appended to the WAL stream and stored in WAL segment files located in the pg_wal directory.

When transaction commits, PostgreSQL Writes a commit record to the WAL and flushes the WAL records to durable storage. Once the WAL has been safely persisted, PostgreSQL acknowledges the transaction as committed and returns success to the client.

At this point, the modified table pages and index pages still reside only in memory (shared buffers). They are written to the actual data files later by the PostgreSQL background process. Note that the changes are not persisted to data files but the changes are persisted in WAL files as transaction logs. At this point if database crashes, the data changes are recovered from the WAL files.

Pages that have been modified in shared buffers but not yet written to disk are called dirty pages.

PostgreSQL periodically performs checkpoints. During a checkpoint dirty pages are flushed from shared buffers to the actual data or index files on disk. Checkpointing reduces the amount of WAL that must be replayed during crash recovery.

This design allows PostgreSQL to provide durability without forcing every modified data page to be flushed during each transaction commit.

how-postgres-utilize-index-for-data-retrieval-from-disk

Note - Since database transactions can happen in huge numbers, the WAL is stored in multiple files called WAL segments. By default each segment is 16 MB in size. Internally, PostgreSQL organizes storage into 8 KB pages.

Full Page Writes

While WAL protects committed transactions, another problem can occur during a crash: a data page write can be interrupted halfway through, leaving behind partially writen pages. This is called torn page. To protect againts this PostgreSQL uses Full Page Writes (FPW). After every checkpoint, the first page modication causes PostgreSQL to write the entire 8KB page image into the WAL. This ensures partially written pages are recovered safely and consistently.

Why do we need WAL ?

A natural question is:

Why not simply write every data change directly to the table data files on the disk and what is the need of having a WAL?

The answer is performance.

A transaction may modify multiple table pages and index pages scattered across disk. Flushing all of these pages during every commit would significantly increase the transaction latency.

Instead, PostgreSQL write compact WAL records sequentially to an append only log files. Sequential writes are much faster than repeatedly updating data pages across many location on the disk.

This allows PostgreSQL to acknowledge the commits quickly while deferring expensive data page writes until later.

WAL also provides crash recovery. If PostgreSQL crashes after WAL file has been persisted to the disk but before the corresponding data pages have been written to the disk, recovery replays the WAL records and bring the data files to a consistent state.

Every WAL record is associated with a Log Sequence Number (LSN), which identifies a specific position in the WAL stream. LSNs are used for crash recovery, replication, and Change Data Capture (CDC) systems.

WAL Storage and Segment files

WAL records are stored in PostgreSQL data directory under :

$PG_DATA/pg_wal

Example

root@5c7b10ae9ab3:/var/lib/postgresql/18/docker/pg_wal# ls
000000010000000000000052  00000001000000000000005E  00000001000000000000006A  000000010000000000000076  000000010000000000000082  00000001000000000000008E
000000010000000000000057  000000010000000000000063  00000001000000000000006F  00000001000000000000007B  000000010000000000000087  archive_status
000000010000000000000058  000000010000000000000064  000000010000000000000070  00000001000000000000007C  000000010000000000000088  summaries

These files are known as WAL segments. At first glance, the file names may look random, but they actually encode the WAL timeline ID and segment number.

Let’s inspect the wal location:

SELECT pg_current_wal_lsn();
 pg_current_wal_lsn
--------------------
 0/66D53950

Now let’s ask PostgreSQL which WAL segment contains this position

SELECT pg_walfile_name(pg_current_wal_lsn());
     pg_walfile_name
--------------------------
 000000010000000000000066
(1 row)

PostgreSQL uses LSN to determine:

  1. Which WAL segment file contains the record.
  2. The exact byte position of the record within that segment.

Inspecting WAL records with pg_waldump

WAL segment files are binary and cannot be read directly. PostgreSQL provides the pg_waldump utility to inspect the records stored inside the WAL segment The following command displays WAL records from a specific segment file: pg_waldump 000000010000000000000066

Below is an example of WAL record describing an insertion in a table:

rmgr: Heap
tx: 779

lsn:  0/66000058
prev: 0/65FFFFE8

desc: INSERT
off: 55
flags: 0x08

blkref #0:
  rel: 1663/16384/16389
  blk: 56840

This record tells us that transaction 779 inserted a row in the Heap Table. The record was written at WAL location 0/66000058 and references a page within the relation 16389

If the table contains indexes, PostgreSQL generates additional WAL records describing the corresponding index modification.

rmgr: Btree
tx: 779

lsn:  0/660000C0
prev: 0/66000058

desc: INSERT_LEAF
off: 205

blkref #0:
  rel: 1663/16384/16398
  blk: 16686

This record describes an insertion in btree index. Together, the heap and the index records ensure that PostgreSQL can reconstuct both the table and its indexes during crash recovery.

how-postgres-utilize-index-for-data-retrieval-from-disk

Closing Thoughts

Wrie Ahead Logging is the most fundamental mechanism in PostgreSQL that helps in high performance and data durability by decoupling the transaction commits from expensive data file writes on disk.

The key idea is simple: write changes to WAL files first, persist the WAL files, acknowledge the transaction, and write actual data pages later. If a crash occurs before those pages reach disk, PostgreSQL can recover the database by replaying the WAL records.

Understanding WAL stream also helps understanding the foundation for mechanisms like logical and physical replication, Point In Time Recovery, & Change Data Capture systems. All of these features rely on the WAL stream as the authoritative records of database changes.