June 2026 - Stormatics

MCP For PostgreSQL: Automated Health Checks & Performance Analysis

AI agents are becoming increasingly capable at operational tasks: summarizing logs, analyzing query plans, identifying anomalies, and assisting with incident response. For databases in particular, this creates an obvious opportunity. Much of day-to-day troubleshooting follows repeatable workflows that lend themselves well to automation. As someone who spends most of my time working with PostgreSQL, I find the interesting question isn't whether an LLM can help analyze a slow database. It can. The harder question is how to do that safely. Production databases sit behind layers of controls, processes, and accountability. Access is granted carefully because mistakes are expensive. When an engineer investigates an incident, that trust comes from experience and clearly defined responsibilities. Extending those capabilities to an AI agent raises a different challenge: how do you give it enough access to be useful without giving it enough access to be dangerous? That problem is exactly what Model Context Protocol (MCP) attempts to address. Rather than exposing a database directly to an LLM, MCP introduces a layer of controlled capabilities. Instead of unrestricted access, the model receives a set of predefined tools with well-defined boundaries.
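To make "a set of predefined tools with well-defined boundaries" concrete, here is a minimal sketch in plain Python. It does not use any MCP SDK; the Tool class, the TOOLS registry, the tool names, and resolve_tool are all hypothetical names chosen for illustration. The only idea it shows is that the model picks a tool by name and never supplies SQL of its own.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tool:
    """A predefined capability: a name, a description, and fixed SQL."""
    name: str
    description: str
    sql: str  # fixed read-only statement; the caller cannot change it


# Hypothetical registry: these two tools are illustrative assumptions.
TOOLS = {
    tool.name: tool
    for tool in (
        Tool(
            "database_sizes",
            "Size in bytes of each database",
            "SELECT datname, pg_database_size(datname) FROM pg_database;",
        ),
        Tool(
            "long_running_queries",
            "Ten longest-running active queries",
            "SELECT pid, now() - query_start AS runtime, query "
            "FROM pg_stat_activity WHERE state = 'active' "
            "ORDER BY runtime DESC LIMIT 10;",
        ),
    )
}


def resolve_tool(name: str) -> Tool:
    """Return the registered tool, or refuse anything outside the registry."""
    try:
        return TOOLS[name]
    except KeyError:
        raise PermissionError(f"tool not exposed: {name!r}") from None
```

In a setup like this, the registry would typically be paired with a database role that only has read privileges, so the boundary does not depend on the registry alone.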

PostgreSQL Wait Events: A Production Diagnostic Guide

A query plan tells you what PostgreSQL intended to do. A wait event tells you what it actually did with its time. Most performance work I see in production starts one layer too high, in the EXPLAIN output, when the cheaper answer is already sitting in pg_stat_activity.

This is a refresh of a guide I wrote in late 2024. Two things have changed since then. PostgreSQL 17 added pg_wait_events as a first-class catalog view. PostgreSQL 18 shipped asynchronous I/O and changed how I/O wait should be read in production. Both shifts deserve their own treatment, and I have folded them into this update along with five production cases from work my team did over the last two quarters.
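As a rough illustration of reading pg_stat_activity before reaching for EXPLAIN, the sketch below tallies which wait events active sessions are sitting in. The query text and the summarize_waits helper are assumptions for illustration, not taken from the guide; the rows would come from whatever PostgreSQL driver you already use.

```python
from collections import Counter

# One row per active session that is currently waiting on something.
WAIT_EVENT_QUERY = """
SELECT wait_event_type, wait_event
FROM pg_stat_activity
WHERE state = 'active'
  AND wait_event IS NOT NULL;
"""


def summarize_waits(rows):
    """Count sessions per (wait_event_type, wait_event), most common first.

    `rows` is an iterable of (wait_event_type, wait_event) tuples, as a
    driver would return them for WAIT_EVENT_QUERY.
    """
    counts = Counter((etype, event) for etype, event in rows if event is not None)
    return counts.most_common()
```

A single sample is only a snapshot; running the query repeatedly and accumulating the counts gives a more trustworthy picture of where time goes.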

How To Run Logical Replication On A Production PostgreSQL Cluster

Upgrading PostgreSQL in production is not fun. You have real data, real users, and a version of Postgres that's probably several major releases behind. The obvious approach (dump, upgrade, restore) works fine on a 10GB database on a Saturday night. It doesn't work when you have TBs of data, applications that can't afford hours of downtime, and a team watching the monitoring dashboard at 2am. Logical replication solves this. I’ve run it in production, and it works well, but there are a few practical considerations that are easier to appreciate once you’ve been through the process yourself.

Why Logical Replication

You keep your source cluster running (old Postgres version, production traffic, and all), bring up a destination cluster with the new version, and replicate all data changes while both run in parallel. When the destination has caught up, you cut over. Downtime is seconds to minutes, not hours. Source is your existing cluster (say, Postgres 14). Destination is your new cluster running Postgres 17.

The thing most guides skim over is that logical replication does not copy schema; it only copies data. PostgreSQL replicates INSERT, UPDATE, and DELETE operations, but not your table definitions, indexes, roles, or anything structural. For example, once logical replication is set up, a TRUNCATE on a published source table is replicated to the destination, but the definitions of table1, table2, table3, and so on are not. You have to handle that yourself before replication starts. Miss this, and your subscription will either fail silently or throw confusing errors that don't point you back to the actual problem.

Step 1: The Primary Key Check

Logical replication requires every replicated table to have a replica identity in order to replicate UPDATE and DELETE operations. By default, that's the primary key. Tables without one will still replicate INSERTs, but any UPDATE or DELETE will error on the publisher.

Replica Identity

A published table must have a replica identity configured so that UPDATE and DELETE operations can be replicated, because the replica identity is what allows the correct row to be located on the subscriber. By default, the replica identity is the primary key. A suitable unique index can also be used, subject to certain requirements. If no suitable key exists, the replica identity can be set to FULL, which uses the entire row as the key. Replica identity FULL should be treated as a fallback to be used only when no other option is available, because without a suitable index on the subscriber, the row search can be very inefficient. If a table without a replica identity is added to a publication that replicates UPDATE or DELETE operations, those operations will fail on the publisher. INSERT operations proceed regardless of the replica identity.
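The primary key check can be sketched as a catalog query plus a small classifier. This is a minimal sketch under stated assumptions: the query text, classify, and tables_at_risk are illustrative names, not the article's own script, and the rows are assumed to arrive from your driver as (schema, table, relreplident, has_pk) tuples with relreplident as a one-character string. It treats a table set to use an index as covered and does not verify that the index still exists.

```python
# relreplident codes in pg_class: d = default (primary key, if any),
# n = nothing, f = full row, i = a specific unique index.
CANDIDATE_QUERY = """
SELECT n.nspname AS schema_name,
       c.relname AS table_name,
       c.relreplident,
       EXISTS (SELECT 1
               FROM pg_index i
               WHERE i.indrelid = c.oid
                 AND i.indisprimary) AS has_pk
FROM pg_class c
JOIN pg_namespace n ON n.oid = c.relnamespace
WHERE c.relkind IN ('r', 'p')
  AND n.nspname NOT IN ('pg_catalog', 'information_schema')
ORDER BY 1, 2;
"""


def classify(relreplident: str, has_pk: bool) -> str:
    """Describe the effective replica identity of one table."""
    if relreplident == "f":
        return "full"
    if relreplident == "i":
        return "index"
    if relreplident == "d" and has_pk:
        return "primary key"
    # 'n', or the default setting on a table with no primary key
    return "none"


def tables_at_risk(rows):
    """Tables where a published UPDATE or DELETE would fail on the publisher."""
    return [
        f"{schema}.{table}"
        for schema, table, ident, has_pk in rows
        if classify(ident, has_pk) == "none"
    ]
```

Each table this reports needs a primary key, a suitable unique index set as its replica identity, or REPLICA IDENTITY FULL as the fallback, before it is added to a publication that replicates UPDATE or DELETE.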

When Patroni Silently Deletes Your Replication Slots

If you have ever been in the middle of a database migration and suddenly found your logical replication slots missing, you know how unsettling that feeling is. No obvious error. No warning. Just gone. That is exactly what happened to us while migrating data from a Patroni-managed cluster to a standalone PostgreSQL cluster. What started as a smooth operation turned into a debugging session that taught us something important about how Patroni behaves after a restart.

The Setup

We were running a data migration where the source was a Patroni cluster and the target was another PostgreSQL cluster. Logical replication was our chosen approach, which meant we had created replication slots on the Patroni side to track the changes flowing out. Things were going well at the start. Data was moving, lag was low, and we felt good about the plan.

Something Went Wrong

At some point, the publisher node on the Patroni side got restarted. After the restart, our logical replication slots were simply gone. This was strange. These were not temporary slots. Temporary slots are expected to disappear after a session ends, but permanent slots are supposed to survive restarts. PostgreSQL itself has no reason to drop them.

File Descriptors: The OS Limit That Takes Down PostgreSQL

Most PostgreSQL outages that trace back to file descriptor exhaustion get misread as a database problem. The failure is one layer down: the kernel runs out of file descriptors and PostgreSQL takes the hit. This post covers how that happens under high connection counts, how to read the log sequence when it does, and how to fix it.

What are file descriptors and why PostgreSQL burns through them

In Linux, the kernel represents almost everything as a file descriptor: TCP sockets, open table files, index files, WAL segments, temp files for sorts and joins, log files. Every open() or accept() call increments a counter. The kernel enforces a system-wide ceiling called fs.file-max. When total FD usage across all processes on the machine hits that ceiling, every new open() fails, regardless of which process is asking. There is also a second, separate limit, the per-process ceiling (RLIMIT_NOFILE, controlled by ulimit -n), which caps how many FDs a single process can hold. Exhausting either limit can produce the "out of file descriptors" log message, whether the whole system hits fs.file-max or a single backend hits its per-process ulimit. Both need to be checked during diagnosis.

PostgreSQL is process-based. Each client connection spawns its own OS process. Each backend holds FDs for its client socket, the table and index files it's accessing (managed through PostgreSQL's internal VFD system, capped by max_files_per_process, default 1,000), WAL segments, and any temp files. An idle backend holds 10–15 FDs. An active write backend touching multiple tables with indexes can hold 50–200 or more. The theoretical worst case is max_connections × max_files_per_process. In practice you won't hit that ceiling, but even a fraction of it is dangerous when thousands of connections are open at once.
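The worst-case arithmetic and the two limits can be checked with a short script. This is a minimal sketch, assuming a Linux host: worst_case_fds and read_limits are illustrative helper names, and resource.getrlimit reports the limit of the process running the script, not of the PostgreSQL server. For a running postmaster, /proc/<pid>/limits shows the limits that process actually has.

```python
import resource
from pathlib import Path


def worst_case_fds(max_connections: int, max_files_per_process: int = 1000) -> int:
    """Theoretical upper bound: every backend sitting at its per-process cap."""
    return max_connections * max_files_per_process


def read_limits() -> dict:
    """Collect the per-process and system-wide FD limits visible to this process."""
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    limits = {"nofile_soft": soft, "nofile_hard": hard}

    file_max = Path("/proc/sys/fs/file-max")
    if file_max.exists():
        # System-wide ceiling on open file handles
        limits["fs_file_max"] = int(file_max.read_text().split()[0])

    file_nr = Path("/proc/sys/fs/file-nr")
    if file_nr.exists():
        # file-nr holds three numbers: allocated, unused, maximum
        limits["fs_allocated"] = int(file_nr.read_text().split()[0])

    return limits


if __name__ == "__main__":
    print("worst case for 2000 connections:", worst_case_fds(2000))
    for key, value in read_limits().items():
        print(key, value)
```

Comparing fs_allocated against fs_file_max shows how close the machine is to the system-wide ceiling, while the worst-case product shows how far a large max_connections setting could push it in theory.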

The Night Our Tables Wouldn’t Stop Growing

We were doing everything right. The migration plan was solid, the team was experienced, and we had done this kind of thing before. But somewhere around midnight, someone on the team noticed something strange. Tables on the destination side were ballooning at an unexpected rate with hundreds of gigabytes being used, while the source side tables sat quietly at just a few megabytes. Something was very wrong, and we had no idea what.
