Stormatics
(678) 345-3456
380 Albert St, Melbourne, Australia

Blog Details

  • Stormatics
  • Blog
  • PostgreSQL Internals Part 1: Understanding Database Cluster, Database and Tables

PostgreSQL Internals Part 1: Understanding Database Cluster, Database and Tables

PostgreSQL is one of the most popular and powerful relational database management systems, renowned for its robust features and flexibility. Going deeper into its internals uncovers a complex world though. Understanding the core concepts is much needed for developers, database administrators, and anyone involved in managing or interacting with PostgreSQL databases. 

By understanding the core concepts users can optimize performance, troubleshoot issues effectively, and use the full potential of PostgreSQL for their projects. 

In this blog series, we are going to discuss the core concepts of PostgreSQL internals starting with database clusters, databases, and tables in PostgreSQL.

Let’s dive into the details!

Database Cluster, Database, and Database Objects

A database cluster is a collection of multiple databases managed by a single PostgreSQL server. It can be referred to as a data/base directory.

A database is a collection of database objects. Whereas a database object is a data structure used to store objects such as tables, views, indexes, extensions, Sequences functions, etc. In simple words, anything that we can create or store within a database is a database object

Source: interdb

What is an OID?

Object identifier(OID) is a number represented in unsigned 4-byte integers. Which is used to uniquely identify the object that we create inside the database, i.e, table, column, row, function trigger, etc

An unsigned 4-byte integer has a maximum value of (2^32 − 1) which is 4,294,967,295

When we initialize the data directory(PostgreSQL 16) it creates 3 databases by default postgres, template0, and template1 we can check their respective OIDs with the following query

postgres=# SELECT datname, oid FROM pg_database WHERE datname = 'postgres';
datname  | oid 
----------+-----
postgres |   5
(1 row)
postgres=# SELECT datname, oid FROM pg_database WHERE datname = 'template0';
 datname  | oid 
-----------+-----
template0 |   4
(1 row)
postgres=# SELECT datname, oid FROM pg_database WHERE datname = 'template1';
 datname  | oid 
-----------+-----
template1 |   1
(1 row)

We can also see the same subdirectories inside the data/base directory containing the OIDs of the databases

postgres@9e7108914f4c:~/data/base$ ls -lhrt 
total 12K
drwx------ 2 postgres postgres 4.0K Mar 14 14:08 4
drwx------ 2 postgres postgres 4.0K Mar 14 14:11 5
drwx------ 2 postgres postgres 4.0K Mar 14 14:12 1

Physical structure of database cluster

Source: interdb

In the diagram provided above, we observe two primary components comprising the database cluster.

  • Base: It is the default directory for the database cluster, which contains subdirectories for each database. Each database’s directory is represented by its respective OID. Within the base/DATA_BASE_OID directory, you’ll find specific files for tables, indexes, and other database objects.
  • Tablespace: A distinct physical storage location on disk, separate from the default data directory, used for storing database objects such as tables and indexes. data/pg_tblspc will contain the symlink to the tablespace inside the original data directory

What is relfilenode?

It’s a value stored in the pg_class system table for relation objects (tables, indexes, etc.) It acts as a numeric identifier that points to the actual physical disk file where the data for the relation is stored. For instance, for the table testing, we can see its OID and relfilenode value with the following query

postgres=# select oid,relname,relfilenode from pg_class where relname = 'testing';
 oid  | relname | relfilenode 
-------+---------+-------------
16385 | testing |       16385
(1 row)

Initially, oid and relfilenode can be the same and point to the same location on the disk 

postgres@9e7108914f4c:~/data/base/5$ ls -lhrt 16385
-rw------- 1 postgres postgres 0 Mar 14 20:59 16385

But if we issue REINDEX or TRUNCATE commands value of relfilenode can be changed

postgres=# truncate testing;
TRUNCATE TABLE
postgres=# select oid,relname,relfilenode from pg_class where relname = 'testing';
 oid  | relname | relfilenode 
-------+---------+-------------
16385 | testing |       16397
(1 row)

As we can see TRUNCATE changed the value of relfilenode from 16385 to 16397 and we can see a new file in our filesystem

ls -lhrt 16397
-rw------- 1 postgres postgres 0 Mar 14 21:04 16397

NOTE: The value of relfilenode can be 0 for system catalog objects

16389.1, and 16389.2 files

In PostgreSQL, database objects like tables and indexes are stored in files called segments. Each segment has a predetermined size, and a table may span multiple segments as its data volume increases. By default, if the –with-segsize option isn’t specified during PostgreSQL’s source code compilation, the segment size is usually set to 1GB. When the size of tables and indexes surpasses this limit, PostgreSQL generates additional files named after their respective relfilenode.1, relfilenode.2, and so forth, to accommodate the excess data. This behavior can be seen in example bellow

postgres=# CREATE TABLE my_table (
   id SERIAL PRIMARY KEY,
   number INT
);
CREATE TABLE
postgres=# INSERT INTO my_table (number)
SELECT generate_series(1, 50000000);
INSERT 0 50000000
postgres=# \dt+ my_table
                                    List of relations
Schema |   Name   | Type  |  Owner   | Persistence | Access method |  Size   | Description 
--------+----------+-------+----------+-------------+---------------+---------+-------------
public | my_table | table | postgres | permanent   | heap          | 1729 MB | 
(1 row)
postgres=# select oid,relname,relfilenode from pg_class where relname = 'my_table';
 oid  | relname  | relfilenode 
-------+----------+-------------
16385 | my_table |       16385
(1 row)
postgres@9e7108914f4c:~$ ls -lhrt data/base/5/16385 (Tab for auto-complete)
16385      16385.1  

Free Space Map(FSM) 

Each table and index relation, except for hash indexes, has a Free Space Map (FSM) to keep track of available space in the relation. It stores all free space-related information alongside primary relation

Visibility Map(VM)

The Visibility Map (VM) in PostgreSQL is a system designed to monitor pages within tables and indexes, identifying those that exclusively contain tuples visible to all active transactions. Each table possesses its own visibility map, storing the visibility status of individual pages within the table file. This information determines whether a page contains any dead tuples. During vacuum processing, pages without dead tuples can be efficiently bypassed using the associated Visibility Map (VM).

Source: interdb

In the below diagram, we can see a table has 2 pages and Page 0 contains four tuples. Tuple-2 is determined to be a dead tuple, PostgreSQL will remove it and rearrange the remaining tuples to address fragmentation or bloat. Subsequently, it updates both the Free Space Map (FSM) and Visibility Map (VM) associated with this page. PostgreSQL repeats this process until it reaches the final page.

Source: ashnik

Note: Indexes only have individual free space maps and do not have visibility maps.

Tablespaces 

Tablespace in PostgreSQL is a physical storage area separate from the default data directory. It allows organizing database objects on different storage devices for performance or administrative purposes. We can create a new tablespace with the following query

CREATE TABLESPACE my_tablespace
 OWNER postgres
 LOCATION '/var/lib/postgresql/tbs';
CREATE TABLESPACE

Upon creating a new tablespace on a specified directory, two key actions occur

Firstly, A database and system catalog version-specific directory gets created under the desired location i.e, PG_DATABASE_MAJORVERSION_CATALOGVERSION

ls -lhrt /var/lib/postgresql/tbs/PG_16_202307071/
total 0

The catalog version number in PostgreSQL is an internal mechanism for maintaining data directory compatibility and preventing inconsistencies. Its format is YYYYMMDDN representing the date the number was changed, with N representing a simple counter to accommodate for more than one change on the same date.

We can check the catalog version by 

postgres=# SELECT catalog_version_no FROM pg_control_system();
catalog_version_no 
--------------------
     202307071
(1 row)

Secondly, A symlink to the tablespace directory gets created inside the original data/pg_tbspc directory

postgres@9e7108914f4c:~$ ls -lhrt data/pg_tblspc/
total 0
lrwxrwxrwx 1 postgres postgres 23 Mar 15 01:53 16386 -> /var/lib/postgresql/tbs

Creating a table within a tablespace associated with the PostgreSQL database will result in the creation of a new directory inside the PG_16_202307071 directory, bearing the same OID as the PostgreSQL database, within the data/base directory.

postgres=# CREATE TABLE my_table (
   id SERIAL PRIMARY KEY,
   name VARCHAR(50)
) TABLESPACE my_tablespace;
CREATE TABLE

The OID of the PostgreSQL database is regenerated and remains unchanged.

postgres@9e7108914f4c:~$ ls -lhrt tbs/PG_16_202307071/5/16392 
-rw------- 1 postgres postgres 0 Mar 15 01:57 tbs/PG_16_202307071/5/16392

In this first part of our series, we’ve explored fundamental concepts of database clusters, databases, and tables.

Stay tuned for the second part, where we’ll delve deeper into internal layout of a table and how PostgreSQL performs read and write on a tuple.

1 Comment

  • Azamat

    March 22, 2024 - 9:04 am

    Great! Thanks for the good information.

Leave A Comment