When you issue a SQL statement, have you ever wondered what actually happens on the disk? What does Postgres write, how does it manages internal data files and what is really inside those files?

I had bunch of questions that kept nagging me, so I decided to deep dive into Postgres database internals. This post only touch the surface level of how database, tables, and data is persisted on disk in Postgres. There’s lot of ground to cover.

Object Identifiers (OIDs)

Postgres uses OIDs ie Object Identifiers to track each internal objects it creates. Databases, tables, indexes, types, views all of them get an OID the moment they are created. OIDs never change once assigned.

For example, I created a database named testdb and its OID is 16384

SELECT oid, datname FROM pg_database WHERE datname = 'testdb';

  oid  | datname
-------+---------
 16384 | testdb
(1 row)

How Postgres Organizes Data on Disk

Postgres stores everything inside a root directory called PGDATA. Inside it there is a /base sub directory where each database gets its own sub directory. The subdirectory is named after the OID

ls -l $PGDATA/base

drwx------. 2 postgres postgres 8192 Apr 27 18:13 1
drwx------. 2 postgres postgres 8192 Apr 27 18:37 16384
drwx------. 2 postgres postgres 8192 Apr 27 18:11 4
drwx------. 2 postgres postgres 8192 Apr 27 18:13 5

The directory 16384 is testdb Each table or index inside a database gets its own file inside the above mentioned directory Postgres stores data in one file upto 1GB, once the data size grows more than 1GB then postgres creates a new file to store the data For example, if data size of a table users is 10GB then postgres will store users table data in 10 separate files

What is relfilenode in postgres?

The file name on disk for each table or index is called a relfilenode It acts as a mapping between a logical postgres object (like a table) and its physical file on the disk.

SELECT oid, relname, relfilenode FROM pg_class where relname = 'users';

  oid  | relname | relfilenode
-------+---------+-------------
 16435 | users   |       16435
(1 row)

so the users table lives at $PGDATA/base/16384/16435.

When a table or index exceeds 1GB in data size, Postgres automatically creates new segment files relfilenode.1, relfilenode.2 and so on. These are called heap segments for tables and index segments for indexes.

Peeking Inside the Data File

These files are not text files, they are binary files in Postgre’s own internal format

INSERT INTO public.users(name) VALUES('Henry');
INSERT 0 1

You can peek inside the file using hexdump to inspect what the insert statement wrote to disk:

hexdump -C $PGDATA/base/16384/16435

00000000  00 00 00 00 b8 9e 9d 01  00 00 00 00 1c 00 d8 1f  |................|
00000010  00 20 04 20 00 00 00 00  d8 9f 44 00 00 00 00 00  |. . ......D.....|
00000020  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
*
00001fd0  00 00 00 00 00 00 00 00  e9 02 00 00 00 00 00 00  |................|
00001fe0  00 00 00 00 00 00 00 00  01 00 02 00 02 08 18 00  |................|
00001ff0  01 00 00 00 0d 48 65 6e  72 79 00 00 00 00 00 00  |.....Henry......|
00002000

Each data file is divided into 8KB logical pages (blocks). Above output is entire one page

8KB = 8 * 1024 = 8192 bytes in total

  • The first 24 bytes are the page header meta data like the LSN, free space boundaries, and page size.
  • Near the page end you can spot Henry in the ASCII column, that is actual row data. Strings are stored as raw ASCII bytes, which is why they are human readable in the dump. The surrounding bytes are binary data.
  • The data is written from the bottom of the page

Conclusion

This is just a starting point, underneath lies WAL, MVCC, vaccum and many more unexplored concepts shaping how the data evolves on the disk. There is lot of ground to cover here. Understanding these internals turns Postgres from black box into predictable system.