Transform PostgreSQL into a Columnar Database Using Citus

Columnar databases are transforming the way we handle large datasets by storing data in columns rather than rows. This approach enhances performance, especially for analytical queries, by allowing faster data retrieval and efficient storage. As businesses generate more data than ever, understanding the benefits of columnar databases becomes crucial. In this blog, we’ll explore how these databases work, their advantages over traditional row-based systems, and why they are becoming a popular choice for data-driven organizations.

Columnar databases, or column-oriented databases, are a type of database management system designed to optimize the storage and retrieval of data for analytical queries. Unlike traditional row-oriented databases, which store data sequentially by rows, columnar databases store data sequentially by columns

Row-Based vs. Column-Based Databases

The table below shows the main difference between Columner and Row-based databases

Feature/Aspect	Row-Based Databases	Column-Based Databases
Storage Format	Data is stored sequentially by rows.	Data is stored sequentially by columns.
Query Performance	Optimized for OLTP (transactional) workloads; efficient for frequent inserts, updates, and deletes.	Optimized for OLAP (analytical) workloads; efficient for read-heavy operations and aggregations.
Data Compression	Limited compression opportunities due to diverse data types in rows.	High compression ratios using techniques like run-length and dictionary encoding.
Disk I/O	Reads entire rows, even if only a few columns are needed, leading to higher I/O.	Reads only the necessary columns, reducing disk I/O and improving performance.
Cache Utilization	Less efficient cache utilization due to mixed data types in rows.	Improved cache efficiency by loading only relevant columns into memory.
Use Cases	Suitable for applications with frequent transactions, such as CRM and ERP systems.	Ideal for data warehousing, business intelligence, and analytics applications.
Data Retrieval	Retrieves entire rows, which can be inefficient for queries targeting specific columns.	Retrieves specific columns, enhancing performance for columnar queries.
Update Operations	More efficient for row-level updates and inserts.	Less efficient for frequent updates due to the need to maintain columnar storage.

Benefits of using Columnar Databases

Columnar databases leverage advanced compression techniques such as run-length encoding, dictionary encoding, and delta encoding. These methods take advantage of the homogeneity and repetition of data within columns, resulting in significantly reduced storage requirements and improved I/O performance.
By storing data in columns, these databases can read only the relevant columns needed for a query, minimizing disk I/O. This is particularly beneficial for analytical queries that involve scanning large datasets, leading to faster query execution times.
Columnar storage enhances cache efficiency by allowing more data to be loaded into memory at once. Since only the necessary columns are accessed, the database can make better use of CPU cache, reducing the need for repeated disk access.
Many columnar databases support vectorized query execution, which processes data in batches using SIMD (Single Instruction, Multiple Data) instructions. This approach takes advantage of modern CPU architectures to perform operations on multiple data points simultaneously, further boosting performance.
Columnar databases are specifically designed for OLAP (Online Analytical Processing) tasks, making them ideal for complex queries involving aggregations, filtering, and joins. Their architecture is tailored to handle large-scale data analytics efficiently, providing significant performance advantages over traditional row-oriented databases in these scenarios.

When to use the Columnar Database?

If you are

Running analytical queries on large datasets, like sales or user behavior data.
Monitoring applications like IoT sensor data, financial market data, or server logs.
Managing large-scale data for ETL (Extract, Transform, Load) processes.
Centralized log management, such as analyzing logs from multiple servers or applications (e.g., Elasticsearch, Prometheus).
Storing and querying genetic information (e.g., DNA sequences) for research or medical applications.

If you answered ‘Yes’ to any of the questions above, a columnar database could be a suitable option for you.

Transform PostgreSQL into a Columner Database with Citus

PostgreSQL is a versatile, open-source database known as the “one database for all.” It supports a wide range of use cases, making it ideal for diverse applications. Whether you’re building AI-powered solutions, handling time-series data, or even implementing columnar storage, PostgreSQL offers the flexibility and robustness to handle it all efficiently. With advanced features like JSONB, indexing, PostgreSQL adapts to the needs of modern data-driven applications.

Citus is an open-source extension that transforms PostgreSQL into a distributed database, enabling it to handle large-scale, high-performance workloads. One of its standout features is the ability to implement columnar storage, allowing you to create tables in a columnar format within PostgreSQL. This makes Citus ideal for handling analytics and reporting on massive datasets while still benefiting from PostgreSQL’s rich ecosystem and SQL capabilities. Whether you need scalability, parallelism, or columnar data optimization, Citus offers a powerful solution built on top of PostgreSQL.

Installing the Citus extension (Debian / Ubuntu)

Downloads the script to add the Citus repository for Debian-based systems.

curl https://install.citusdata.com/community/deb.sh > add-citus-repo.sh

Executes the downloaded script to add the Citus repository to your package manager.

sudo bash add-citus-repo.sh

Installs PostgreSQL 16 along with the Citus extension version 12.1 on your system.

sudo apt-get -y install postgresql-16-citus-12.1

Edit postgresql.conf file /etc/postgresql/16/main/postgresql.conf and add citus

Shared_preload_libraries = citus

Restart the PostgreSQL service

sudo service postgresql restart

Connect to psql and create Citus extension

create extension citus;

Create a table with an access method columnar

postgres=# CREATE TABLE events (
event_id BIGINT,
event_timestamp TIMESTAMP,
user_id BIGINT,
event_data JSONB
) using columnar;
CREATE TABLE

postgres=# \d+ events
Table "public.events"
Column | Type | Collation | Nullable | Default | Storage | Compression | Stats target | Description
-----------------+-----------------------------+-----------+----------+---------+----------+-------------+--------------+-------------
event_id | bigint | | | | plain | | |
event_timestamp | timestamp without time zone | | | | plain | | |
user_id | bigint | | | | plain | | |
event_data | jsonb | | | | extended | | |
Access method: columnar

We have now successfully created a columnar table in PostgreSQL, which can be used to perform regular OLAP operations for faster data retrieval.

In upcoming articles, I will explore columnar tables in more detail and demonstrate how they can outperform traditional row-based tables.