PostgreSQL Internals Part 1: Understanding database cluster, database and tables

PostgreSQL is one of the most popular and powerful relational database management systems, renowned for its robust features and flexibility. Going deeper into its internals uncovers a complex world though. Understanding the core concepts is much needed for developers, database administrators, and anyone involved in managing or interacting with PostgreSQL databases.

By understanding the core concepts users can optimize performance, troubleshoot issues effectively, and use the full potential of PostgreSQL for their projects.

In this blog series, We are going to discuss the core concepts of PostgreSQL internals starting with database clusters, tables, and tablespaces in PostgreSQL.

Let’s dive into the details!

What is a Database Cluster?

A database cluster is a collection of multiple databases managed by a PostgreSQL server running on a single node or instance. It can be referred to as a data/base directory.

Base: It is the default directory for the database cluster, which contains subdirectories for each database. Each database’s directory is represented by its respective OID. Within the base/DATA_BASE_OID directory, you’ll find specific files for tables, indexes, and other database objects.

Tablespace: It is a distinct physical storage location on a disk, separate from the default data directory, used for storing database objects such as tables and indexes. data/pg_tblspc will contain the symlink to the tablespace inside the original data directory. Tablespaces allow you to manage storage by distributing data across multiple disk locations, optimizing performance, and balancing load. They also help segregate data by application, user, or environment, enhancing database structure and security.

What is a Database?

In PostgreSQL, a database is a logical collection of related data objects such as tables, indexes, views, functions, and other database objects. It is managed by a PostgreSQL server instance and provides a structured environment for storing, organizing, and retrieving data. Each database operates independently, with its own set of objects, permissions, and configurations. Although multiple databases can exist within a single PostgreSQL cluster, they do not share data directly with each other. When we install PostgreSQL first time it by default creates a database named Postgres.

What is a Database Object?

A database object in PostgreSQL refers to any distinct item or entity within a database that can be created, manipulated, or managed through SQL commands. These objects represent different aspects of data storage, organization, and functionality. for example

Tables: Structures that store rows of data in a relational format, with columns defining the data types for each field.
Indexes: Objects that improve the speed of data retrieval by creating a quick lookup structure for tables.
Views: Virtual tables that represent a specific query’s result set, allowing for reusable and abstracted queries.
Sequences: Objects that generate unique numeric values, often used for auto-incrementing primary keys.
Schemas: Logical containers for organizing and separating database objects, allowing for better data organization and security.
Functions: Custom scripts or procedures written in SQL or other languages (like PL/pgSQL) that perform specific operations or calculations.
Triggers: Actions automatically executed in response to specific events, such as insertions or updates on a table.
Constraints: Rules or conditions applied to tables to ensure data integrity, like primary keys, foreign keys, and unique constraints.

Each of these database objects plays a unique role in defining, manipulating, and maintaining the data within a PostgreSQL database.

What is an OID?

A short form of Object identifier it is a number represented in unsigned 4-byte integers. Which is used to uniquely identify the object that we create inside the database

An unsigned 4-byte integer has a maximum value of (2^32 − 1) which is 4,294,967,295

When we initialize the data directory(PostgreSQL 16) it creates 3 databases by default postgres, template0, and template1 we can check their respective OIDs with the following query

postgres=# SELECT datname, oid FROM pg_database WHERE datname = 'postgres';
 datname  | oid 
----------+-----
 postgres |   5
(1 row)

We can also see the same subdirectories inside the data/base directory containing the OIDs of the databases

postgres@9e7108914f4c:~/data/base$ ls -lhrt 
total 12K
drwx------ 2 postgres postgres 4.0K Mar 14 14:08 4
drwx------ 2 postgres postgres 4.0K Mar 14 14:11 5
drwx------ 2 postgres postgres 4.0K Mar 14 14:12 1

Similarly, if you have a table inside your database you can check its OID via

postgres=# SELECT oid, relname FROM pg_class WHERE relname = 'test';
  oid  | relname 
-------+---------
16393 | test
(1 row)

NOTE: Creating a table with OID was deprecated in PostgreSQL-12 and it is not supported now

What is relfilenode?

It’s a value stored in the pg_class system table for relation objects (tables, indexes, etc.) It acts as a numeric identifier that points to the actual physical disk file where the data for the relation is stored. For instance, for the table testing, we can see its OID and relfilenode value with the following query

postgres=# select oid,relname,relfilenode from pg_class where relname = 'testing';
  oid  | relname | relfilenode 
-------+---------+-------------
 16385 | testing |       16385
(1 row)

Initially, oid and relfilenode can be the same and point to the same location on the disk

postgres@9e7108914f4c:~/data/base/5$ ls -lhrt 16385
-rw------- 1 postgres postgres 0 Mar 14 20:59 16385

But if we issue REINDEX or TRUNCATE commands value of relfilenode can be changed

postgres=# truncate testing;
TRUNCATE TABLE
postgres=# select oid,relname,relfilenode from pg_class where relname = 'testing';
  oid  | relname | relfilenode 
-------+---------+-------------
 16385 | testing |       16397
(1 row)

As we can see TRUNCATE changed the value of relfilenode from 16385 to 16397 and we can see a new file in our filesystem

ls -lhrt 16397
-rw------- 1 postgres postgres 0 Mar 14 21:04 16397

NOTE: The value of relfilenode can be 0 for system catalog objects

16389.1, and 16389.2 files

In PostgreSQL, database objects like tables and indexes are stored in files called segments. Each segment has a predetermined size, and a table may span multiple segments as its data volume increases. By default, if the –with-segsize option isn’t specified during PostgreSQL’s source code compilation, the segment size is usually set to 1GB. When the size of tables and indexes surpasses this limit, PostgreSQL generates additional files named after their respective relfilenode.1, relfilenode.2, and so forth, to accommodate the excess data. This behavior can be seen in the example below

postgres=# CREATE TABLE my_table (
    id SERIAL PRIMARY KEY,
    number INT
);
CREATE TABLE
postgres=# INSERT INTO my_table (number)
SELECT generate_series(1, 50000000);
INSERT 0 50000000
postgres=# dt+ my_table 
                                     List of relations
 Schema |   Name   | Type  |  Owner   | Persistence | Access method |  Size   | Description 
--------+----------+-------+----------+-------------+---------------+---------+-------------
 public | my_table | table | postgres | permanent   | heap          | 1729 MB | 
(1 row)
postgres=# select oid,relname,relfilenode from pg_class where relname = 'my_table';
  oid  | relname  | relfilenode 
-------+----------+-------------
 16385 | my_table |       16385
(1 row)
postgres@9e7108914f4c:~$ ls -lhrt data/base/5/16385 (Tab for auto-complete)
16385      16385.1

Free Space Map(FSM) files

Each table and index relation, except for hash indexes, has a Free Space Map (FSM) to keep track of available space in the relation’s pages. It stores all free space-related information alongside primary relations. VACUUM and Autovacuum can change or update FSM value when they execute their operations

Visibility Map(VM) files

In PostgreSQL, the system includes a mechanism to monitor pages within tables and indexes, determining which ones contain only tuples that are visible to all ongoing transactions. Each table has its own visibility map, a structure that indicates whether individual pages within the table’s file are entirely clean or contain any dead tuples. This visibility map serves as a quick reference for vacuuming operations, allowing PostgreSQL to skip over pages that don’t require cleanup, thereby improving vacuuming efficiency. With the visibility map, PostgreSQL can optimize maintenance tasks, reducing the time and resources needed to manage the database’s health and performance.

In the below diagram, we can see a table has 2 pages and Page 0 contains four tuples. Tuple-2 is determined to be a dead tuple, PostgreSQL will remove it and rearrange the remaining tuples to address fragmentation or bloat. Subsequently, it updates both the FSM and Visibility Map associated with this page. PostgreSQL repeats this process until it reaches the final page.

Source: ashnik

Note: Indexes only have individual free space maps and do not have visibility maps.

Dead Tuples in PostgreSQL

In PostgreSQL, a dead tuple refers to a row in a table that is no longer needed but has not yet been physically removed from the database. Dead tuples are created as a natural byproduct of PostgreSQL’s Multiversion Concurrency Control (MVCC) system, which allows concurrent transactions to read and write without interfering with each other.

Dead Tuples can be generated by

Updates: When a row is updated, PostgreSQL creates a new version of the row (a new tuple) while keeping the old version. The old version becomes a dead tuple because it’s no longer part of the active dataset, but it’s retained temporarily to ensure transaction consistency and visibility for other transactions that might still reference it.
Deletes: When a row is deleted, it’s not immediately removed from the table. Instead, it becomes a dead tuple, allowing other transactions that might be reading the row to complete without error.

NOTE: Every action we do in PostgreSQL is append-only so we DO NOT perform in-place updates

What are the Tablespaces? and their Benefits

In PostgreSQL, it is a physical storage area separate from the default data directory. It allows organizing database objects on different storage devices for performance or administrative purposes. We can create a new tablespace with the following query

CREATE TABLESPACE my_tablespace
  OWNER postgres
  LOCATION '/var/lib/postgresql/tbs';
CREATE TABLESPACE

Upon creating a new tablespace on a specified directory, two key actions occur

Firstly, A new directory gets created by the name of PG_SERVER_CATALOG on desired location i.e, /var/lib/postgresql/PG_16_202307071

ls -lhrt /var/lib/postgresql/tbs/PG_16_202307071/
total 0

The catalog version number in PostgreSQL is an internal mechanism for maintaining data directory compatibility and preventing inconsistencies. Its format is YYYYMMDDN representing the date the number was changed, with N representing a simple counter to accommodate for more than one change on the same date.

We can check the catalog version by

postgres=# SELECT catalog_version_no FROM pg_control_system();
 catalog_version_no 
--------------------
      202307071 
(1 row)

Secondly, A symlink to the tablespace directory gets created inside the original data/pg_tblspc directory

postgres@9e7108914f4c:~$ ls -lhrt data/pg_tblspc/
total 0
lrwxrwxrwx 1 postgres postgres 23 Mar 15 01:53 16386 -> /var/lib/postgresql/tbs

Creating a table within a tablespace associated with the PostgreSQL database will result in the creation of a new directory inside the PG_16_202307071 directory, bearing the same OID as the PostgreSQL database, within the data/base directory.

postgres=# CREATE TABLE my_table (
    id SERIAL PRIMARY KEY,
    name VARCHAR(50)
) TABLESPACE my_tablespace;
CREATE TABLE

The OID of the PostgreSQL database is regenerated and remains unchanged.

postgres@9e7108914f4c:~$ ls -lhrt tbs/PG_16_202307071/5/16392 
-rw------- 1 postgres postgres 0 Mar 15 01:57 tbs/PG_16_202307071/5/16392

Benefits of Tablespaces

We can assign different storage properties (storage type, filesystem options) to individual tablespaces based on their access patterns and performance requirements.
Tablespaces can provide flexibility for horizontal scaling. You can add additional storage capacity by adding new tablespaces without affecting existing ones
Tablespaces can potentially contribute to a layered security approach. By placing sensitive tables in dedicated tablespaces with stricter access controls, you might add an extra layer of protection

In this first part of our series, we’ve explored fundamental concepts of database clusters, tables, and tablespaces. Read the next part in the PostgreSQL Internals series here.

Comments (3)

Fabrice

July 11, 2024 - 1:25 pm

Thanks for all theses informations. Why visibility map is a separate file and not a flag (hint) in the header of each page of a table?

- Semab Tariq
  
  July 11, 2024 - 4:26 pm
  
  Hi Fabrice
  Thank you for reaching out to us!
  
  Regarding your query about why PostgreSQL uses a separate file to store the visibility map instead of storing it in the header of each page, there are several potential issues with the latter approach:
  
  1) Storing visibility information in the page headers would require reading the entire page into memory each time the visibility information is needed, resulting in increased I/O operations. This would be less efficient compared to reading a compact visibility map file.
  
  2) Vacuuming and other maintenance operations would become more complex and slower, as they would need to inspect the header of each page individually to determine its visibility status.
  
  3) Page headers have limited space, and adding visibility information could lead to increased overhead, reducing the amount of space available for actual data storage.
  
  4) Updating visibility information in the page headers could lead to more frequent page writes and potential contention issues, as multiple transactions might need to update the same page headers concurrently.
  
  Due to these potential issues, it is more efficient and reliable to store the visibility map in a separate file.
  
  I hope this answers your question. If you have any other points in mind, please don’t hesitate to reach out. Thank you!
  
Azamat

March 22, 2024 - 9:04 am

Great! Thanks for the good information.