NUMA, Linux, and PostgreSQL before libnuma Support

PostgreSQL and NUMA, part 2 of 4

This series covers the specifics of running PostgreSQL on large systems with many processors. My experience is that people often spend months learning the basics when confronted with the problem. This series tries to remove those difficulties by providing a clear background on the topics in question. The hope is that future generations of database engineers and administrators don’t have to spend months figuring things out through trial and error.

This article builds on the previous entry, “An Introduction to NUMA,” and assumes a general working knowledge of NUMA, not only its definition but also its hardware and software implementation.

This entry covers the general interaction of PostgreSQL and Linux on NUMA systems. The topic is complex, so some points are simplified here. Still, this distills the general information about running PostgreSQL 17 and below (or PostgreSQL 18 built without libnuma support) on NUMA systems, with Linux as the reference point. By the end of this blog entry, you should be able to run PostgreSQL on a NUMA system and also understand why the libnuma support in PostgreSQL 18 is so important.

A Quick Recap

NUMA is an architectural approach that localizes parallel memory access by assigning memory to specific memory controllers, each of which serializes access to the memory it controls. A controller can request memory from another controller, but that remote access is slower than local access.

The Linux kernel has several strategies available for allocating memory (including the executable memory for new threads and processes), and these can be tuned for specific processes by an administrator using the numactl program; systemd also supports setting NUMA policies in unit files. The policies include membind, interleave, and prefer, with the default behaving like prefer on the node where the process is running.
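
As a rough sketch of the syntax (some_command is a placeholder for whatever you want to run, and the node numbers depend on your hardware), the three policies look like this with numactl:

    # membind: allocate memory only from node 0
    numactl --membind=0 some_command

    # prefer: allocate from node 0 when possible, fall back to other nodes
    numactl --preferred=0 some_command

    # interleave: spread allocations round robin across all nodes
    numactl --interleave=all some_command

    # Show the current policy and the NUMA layout of this machine
    numactl --show
    numactl --hardware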

The Linux kernel can also migrate processes closer to the memory they access, or migrate memory closer to the processes that repeatedly access it. These migrations can dramatically improve performance. Additionally, when a process is moved, its local allocations may be migrated after it over time so as to restore local memory bandwidth.
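
These migrations are performed by the kernel’s automatic NUMA balancing feature. As a quick sketch of how to check whether it is active on your system (an inspection aid, not a tuning recommendation):

    # 1 = automatic NUMA balancing enabled, 0 = disabled
    cat /proc/sys/kernel/numa_balancing

    # Migration activity shows up in counters such as numa_pages_migrated
    grep -E 'numa|migrate' /proc/vmstat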

PostgreSQL, NUMA, Linux, and Huge Pages

Linux’s default memory page size on most architectures is 4 kB. Pages can be paged out to disk and migrated to other NUMA nodes individually at this granularity. Each page (huge or normal) needs a page table entry for virtual-to-physical address mapping, and recently used translations are cached in the Translation Lookaside Buffer (TLB).

PostgreSQL can use huge pages for shared memory, and the most common huge page size here is 2 MB. This has a number of implications for NUMA systems, which are explored in more detail below. A critical point to note is that with normal pages, it usually takes two Linux pages to hold one PostgreSQL buffer page, while with huge pages, 256 buffer pages fit into a single Linux huge page.

The above assumes the standard 8 kB page size in PostgreSQL, but the same basic dynamics apply with other page sizes as well.
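
To make the arithmetic concrete (assuming 4 kB base pages, 2 MB huge pages, and PostgreSQL’s default 8 kB buffer page):

    # Check the page sizes on this system
    getconf PAGESIZE                   # typically 4096 bytes
    grep Hugepagesize /proc/meminfo    # typically 2048 kB

    # 8 kB buffer page / 4 kB OS page    = 2 OS pages per buffer page
    # 2 MB huge page   / 8 kB buffer page = 256 buffer pages per huge page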

With normal pages, the two OS pages backing each buffer are accessed as a unit and so would likely be migrated together, doubling the cost of migrating pages and dramatically increasing the cost of deciding what to migrate. With huge pages, on the other hand, many buffer pages live in one OS page, which is less likely to be migrated and therefore more likely to remain subject to the penalties of remote access.

Huge pages should not be confused with Transparent Huge Pages, a kernel feature which can use huge pages even when not directly requested. Because there have been some reports of negative performance effects on some workloads and kernel versions, the use of transparent huge pages is generally discouraged.
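
Transparent huge pages can be inspected, and disabled at runtime, as sketched below; whether disabling them is the right call depends on your kernel version and workload:

    # The value in brackets is the active setting, e.g. "always madvise [never]"
    cat /sys/kernel/mm/transparent_hugepage/enabled

    # Disable THP until the next reboot (as root); make it persistent through
    # your distribution's boot configuration if you decide to keep it
    echo never > /sys/kernel/mm/transparent_hugepage/enabled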

How PostgreSQL’s Architecture Intersects with NUMA

PostgreSQL uses a multiprocess architecture in which each connection to the server is handled by its own backend process, while a number of other special processes perform related tasks, including:

  • A background writer which flushes dirty pages to disk
  • A checkpointer which makes sure all dirty pages are flushed, and sets checkpoints in the WAL
  • Various processes for replication
  • Various processes for maintenance tasks such as vacuum and analyze
  • Custom background workers for specific tasks (for example, those used by pg_cron or pg_partman)

Each of these processes allocates its own local memory, and all of them also attach to shared memory, which every process must be able to access.
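
You can see this process structure on a running server. Exact process titles vary a little between versions and configurations, but something like the following shows the postmaster alongside the checkpointer, background writer, and other helper processes (assuming the cluster runs as the postgres user):

    # List the PostgreSQL processes; roles such as "postgres: checkpointer"
    # and "postgres: background writer" appear in the COMMAND column
    ps -u postgres -o pid,command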

Why the Checkpointer is Often a Bottleneck

The checkpointer is responsible for ensuring that data can be recovered after a crash by replaying the write-ahead log from the last checkpoint. It does this by flushing dirty buffers from shared memory to disk and issuing fsync to ensure durability. Writing through the operating system cache is not the problem; rather, the checkpointer has to iterate through all of shared memory, and depending on whether that memory is co-located with the checkpointer or not, this can become quite expensive.

One of the typical problems people see on NUMA systems is the checkpointer running at 100% CPU while everything else is throttled. In this case, the checkpointer is likely accessing memory across the interconnect (attached to other CPU sockets), so most of its time is actually spent waiting for memory. Typically this is resolved by using huge pages.
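
One way to see whether the checkpointer is working against local or remote memory is to look at its per-node memory placement. The following is only a sketch; the pgrep pattern may need adjusting if cluster_name is set, since that changes the process titles:

    # Find the checkpointer's PID
    CKPT_PID=$(pgrep -f 'postgres: checkpointer')

    # Per-NUMA-node breakdown of the memory mapped by that process,
    # including the shared memory segment
    numastat -p "$CKPT_PID"

    # Which CPU (and therefore which node) the checkpointer last ran on
    ps -o pid,psr,comm -p "$CKPT_PID"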

How Huge Pages Affect Checkpointer Performance

As mentioned above, when huge pages are not used, each buffer page spans multiple operating system pages (by default two). This means that each buffer page access maps internally to (usually) two operating system page translations. Opportunities for migrating processes or pages are limited in this configuration, so a critical bottleneck ends up stuck accessing pages over the interconnect between the memory controllers.

With huge pages, however, a series of neighboring buffer pages lives in the same allocation. This allows the operating system either to migrate that page closer to the checkpointer or, more likely, to migrate the checkpointer closer to the pages it is accessing.

The result is that the process which is the bottleneck gets the benefit of much faster access, while query backends are more likely to have to access pages over the interconnect. This removes the bottleneck and improves overall system performance.

How NUMA Policies Affect PostgreSQL

The Linux NUMA policies affect both process placement and memory allocation, and they can have somewhat unexpected outcomes on NUMA systems.

The membind policy

When membind is used, the postmaster and all its children start on the specified NUMA node and all their memory allocations come from that node as well. File cache pages, however, may still be stored elsewhere.

Membind effectively means not spreading the work out to other cores or memory segments. That is acceptable when PostgreSQL is only one of several workloads on the system and using the other cores is not desired. It has the benefit of optimizing performance for small deployments while sacrificing scalability. In this case, all memory access is fast, but performance is limited to the set of cores and RAM attached to the designated memory controller, and by the throughput of that memory controller itself.
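
In practice, pinning an instance to a single node usually combines CPU and memory binding. A sketch, with placeholder paths and node number:

    # Run the whole instance on node 0: CPUs and memory both come from node 0
    numactl --cpunodebind=0 --membind=0 \
        /usr/pgsql-17/bin/pg_ctl -D /var/lib/pgsql/17/data start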

The prefer policy

Under the prefer policy, the postmaster, shared memory, and as many allocations as possible are placed on the specified NUMA node, with memory or processes overflowing to other NUMA nodes as needed. This acts much like the membind policy above, but with more graceful performance degradation when the preferred node is exhausted.

This behavior (preferring the local node) is the default on Linux, and it is one of the reasons why the typical experience of running PostgreSQL on NUMA systems is as bad as it is. People expect a larger server to solve performance problems, but if the server is still mostly limited to a single NUMA node, the extra hardware often makes no difference.

The interleave policy

Under the interleave policy, memory allocations are spread across all specified NUMA nodes round robin, and processes are free to run on any of them. Critically, this spreads out the pages within shared memory. Since normal Linux pages are small enough that buffer pages span multiple Linux pages, a single buffer page will initially be spread across multiple NUMA nodes, ensuring that all buffer access is at least partly remote. While it may be possible to migrate pages over time to avoid this, that is not assured. Huge pages dramatically change this layout: each buffer page lives entirely within one huge page on one node, so access can be localized.

A second important aspect of the interleave policy is that local allocations for backends will also be spread out among the NUMA nodes. Over time, reused memory will be co-located with the backend process, but the same is not true of temporary allocations, for example hash tables used in joins.

Fortunately PostgreSQL is well optimized for memory reuse, so an 8 kB buffer for processing inbound rows is effectively reused for each row. It still means that interleave favors overall memory throughput and an even distribution of shared memory over the performance of joins in particular.

Recommendations

Fortunately, the list of recommendations for running PostgreSQL without NUMA awareness on NUMA systems is quite short: set huge_pages to on and use the interleave NUMA policy, either through systemd or numactl.

Enable Huge Pages

In all PostgreSQL versions, with or without NUMA awareness, PostgreSQL should be configured to use huge pages, and enough huge pages should be pre-allocated to cover all of shared memory. This applies any time PostgreSQL runs on a NUMA system, unless the membind policy is in use; in every other configuration, huge pages are critically important for allowing memory to be co-located with the checkpointer.
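
A minimal sketch of such a setup follows; the paths are placeholders, and the huge page count must come from your own configuration (on PostgreSQL 15 and later the server can report the number it needs):

    # Ask the server how many huge pages shared memory requires
    # (PostgreSQL 15+; run while the server is stopped)
    /usr/pgsql-17/bin/postgres -D /var/lib/pgsql/17/data \
        -C shared_memory_size_in_huge_pages

    # Reserve that many 2 MB huge pages, with a little headroom
    # (4300 is a placeholder; persist the setting in /etc/sysctl.d/ as well)
    sysctl -w vm.nr_hugepages=4300

    # In postgresql.conf, fail to start rather than silently falling back:
    # huge_pages = on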

More information on setting up huge pages, including how to calculate the number required and how to configure them in the operating system, can be found in Umair Shahid’s blog entry “Configuring Linux Huge Pages for PostgreSQL.”

Use the interleave policy

Most of the time, when running PostgreSQL on a server with a large number of cores, we want it to be able to use whatever memory it needs and to make effective use of all the cores. This is best achieved with the interleave policy.
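
With a systemd-managed service this can be done with a drop-in override (the unit name varies by distribution, and NUMAPolicy= requires systemd 243 or later); with a manual startup, numactl works as well. A sketch:

    # systemctl edit postgresql.service, then add:
    # [Service]
    # NUMAPolicy=interleave
    # NUMAMask=0-3    (list your own nodes; recent systemd also accepts "all")

    # Or, when starting manually (placeholder paths):
    numactl --interleave=all /usr/pgsql-17/bin/pg_ctl -D /var/lib/pgsql/17/data start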

This recommendation, however, has no effect on PostgreSQL 18 built with libnuma support.

The next article in the series will cover Tomas Vondra’s patch set which was committed for PostgreSQL 18 and how these patches improve things further.
