Stormatics

You have a Patroni leader election. You are only halfway to PostgreSQL high availability.

A PostgreSQL primary loses power at 2am. Writes resume in under thirty seconds. The on-call engineer reads the alert in the morning, sees that the cluster healed itself, and goes back to coffee. That is the outcome PostgreSQL high availability is supposed to deliver. A working Patroni cluster, on its own, gets you partway there. The leader election runs. A standby gets promoted. The cluster state in etcd stays consistent. Then the application keeps trying to reach an IP address that now points at the wrong node, the old primary needs a manual rejoin, and the on-call engineer is on a conference bridge instead of in bed.
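One way to close the gap between promotion and reconnection is to stop pointing the application at a single address. As a minimal sketch, assuming the driver is built on libpq (PostgreSQL 10 or later) and using hypothetical hostnames, a multi-host connection string with target_session_attrs=read-write lets new connections find whichever node currently accepts writes:

```python
import psycopg2

# Hostnames are hypothetical placeholders; list every cluster member.
# libpq (PostgreSQL 10+) tries each host in turn and keeps the first
# connection that accepts writes, i.e. the current primary.
DSN = (
    "host=pg-node1,pg-node2,pg-node3 "
    "port=5432 dbname=app user=app_user "
    "target_session_attrs=read-write"
)

conn = psycopg2.connect(DSN)
with conn.cursor() as cur:
    cur.execute("SELECT pg_is_in_recovery()")
    print("in recovery:", cur.fetchone()[0])  # False on the primary
conn.close()
```

This is not a substitute for a proper VIP or proxy layer, but it removes one way for a healed cluster to keep serving errors.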
Read More

The best PostgreSQL databases are boring on purpose

The calmest PostgreSQL deployments in production share one trait. They are boring. Pages stay quiet. Dashboards stay green. The on-call engineer reads a book on Tuesday night. And the people running those databases will tell you, plainly, that boring is the achievement. Think about flying for a minute. The flight everyone wants is the one where the captain says hello, the meal shows up on time, and a few hours later, the wheels touch down in the right city. That flight is boring. It is also a small miracle. Behind that boring flight sits decades of compounded discipline. Pilots with thousands of simulator hours.
Read More

PostgreSQL is Not Slow. Your Queries Are.

A field guide to the seven things that are actually making your database feel slow, and how to stop blaming the wrong suspect. It usually starts with a Slack message: "The app feels slow". This is normally followed by a ticket, then an internal meeting, and finally someone, and there is always someone, saying: "I think we need to switch databases. PostgreSQL can't handle this load."
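Before reaching for a database migration, it is worth seeing where the time actually goes. A minimal sketch, assuming pg_stat_statements is loaded via shared_preload_libraries and the extension has been created in the target database (connection details are placeholders; column names are for PostgreSQL 13+):

```python
import psycopg2

# Surfaces the statements that consume the most cumulative time.
TOP_QUERIES = """
    SELECT calls,
           round(mean_exec_time::numeric, 2)  AS mean_ms,
           round(total_exec_time::numeric, 2) AS total_ms,
           left(query, 60)                    AS query
      FROM pg_stat_statements
     ORDER BY total_exec_time DESC
     LIMIT 10
"""

with psycopg2.connect("dbname=app user=app_user") as conn:  # placeholder DSN
    with conn.cursor() as cur:
        cur.execute(TOP_QUERIES)
        for calls, mean_ms, total_ms, query in cur.fetchall():
            print(f"{calls:>10} calls  {mean_ms:>10} ms avg  {query}")
```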
Read More

How to know when your team needs PostgreSQL specialist support

PostgreSQL is one of the most powerful and reliable open-source relational databases in the world. But even the best technology can start to lag when the team managing it lacks deep expertise. Many engineering teams reach a point where their general-purpose knowledge is simply no longer enough to keep up with growing demands. So how do you know when it's time to bring in a PostgreSQL specialist? The answer lies in recognizing specific patterns in your database's performance, your team's confidence, the frequency of incidents, and the health of your architecture.
Read More

Cost of PostgreSQL performance issues

PostgreSQL is widely adopted because it removes licensing constraints and gives companies like OpenAI, Lovable, and Supabase a reliable foundation for running production systems at scale. However, once deployed, the cost conversation around PostgreSQL shifts away from licensing and toward how efficiently the database supports the workload it is running.
Read More

The 1 GB Limit That Breaks pg_prewarm at Scale

Recently, we encountered a production incident where PostgreSQL 16.8 became unstable, preventing the application from establishing database connections. The same behavior was independently reproduced in a separate test environment, ruling out infrastructure and configuration issues. Further investigation identified the pg_prewarm extension as the source of the problem. This blog post breaks down the failure, the underlying constraint, why it manifests only under specific configurations, and the corresponding short-term mitigation and long-term fix.
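For readers who have not used it, pg_prewarm loads a relation into shared buffers ahead of demand so a restart does not mean a cold cache. A minimal sketch of routine usage, with a hypothetical table name and placeholder connection details:

```python
import psycopg2

# "orders" is a hypothetical table name; pg_prewarm returns the number
# of blocks it read into shared_buffers.
with psycopg2.connect("dbname=app user=app_user") as conn:  # placeholder DSN
    with conn.cursor() as cur:
        cur.execute("CREATE EXTENSION IF NOT EXISTS pg_prewarm")
        cur.execute("SELECT pg_prewarm('orders')")
        print("blocks loaded:", cur.fetchone()[0])
```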
Read More

Three Years of Stormatics: What Building a PostgreSQL Consultancy Looks Like

In three years, Stormatics has grown from a one-person bet on a market gap to a team of eleven serving 35+ customers across 20 countries on five continents. Here is what that journey looked like - including the part where I almost shut the whole thing down. On March 31, 2023, I incorporated Stormatics in Singapore as a private limited company. I had spent over two decades in the PostgreSQL ecosystem - 2ndQuadrant, EDB, OpenSCG, Percona - and for most of that time, I loved the work and the community around it. Starting a company was the furthest thing from my mind.
Read More

PostgreSQL High Availability on OCI: Why Your Failover Passes Every Test But Breaks in Production

If you have built PostgreSQL high availability clusters on AWS or Azure, you have probably gotten comfortable with how virtual IPs work. You assign a VIP, your failover tool moves it, and your application reconnects to the new primary. Clean. Simple. Done. Then you try the same thing on Oracle Cloud Infrastructure and something quietly goes wrong. The cluster promotes. Patroni (or repmgr, or whatever you are using) does its job. The standby becomes the new primary. But the VIP does not follow. Your application keeps sending traffic to the old node, the one that just failed. From the outside, it looks like the database is down. From the inside, everything looks green.
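The usual remedy is to make the failover tooling talk to the OCI control plane rather than relying on ARP to move the address. A sketch of that idea as a Patroni on_role_change callback, assuming the OCI CLI is installed and authenticated on each node; the VNIC OCID and VIP address are hypothetical placeholders:

```python
#!/usr/bin/env python3
"""Patroni callback sketch: claim the VIP on promotion via the OCI API.

Patroni invokes callbacks as: <script> <action> <role> <cluster>.
"""
import subprocess
import sys

VNIC_OCID = "ocid1.vnic.oc1..example"  # hypothetical placeholder
VIP = "10.0.0.100"                     # hypothetical placeholder

def claim_vip():
    # Reassigns the secondary private IP to this node's VNIC; the flag
    # detaches it from whichever VNIC currently holds it.
    subprocess.run(
        [
            "oci", "network", "vnic", "assign-private-ip",
            "--vnic-id", VNIC_OCID,
            "--ip-address", VIP,
            "--unassign-if-already-assigned",
        ],
        check=True,
    )

if __name__ == "__main__":
    action, role = sys.argv[1], sys.argv[2]
    if action == "on_role_change" and role in ("master", "primary"):
        claim_vip()
```

The API call only tells OCI where to route the VIP; the operating system still needs the address configured on the node's interface.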
Read More

pgNow: Instant PostgreSQL Performance Diagnostics in Minutes

pgNow is a lightweight PostgreSQL diagnostic tool developed by Redgate that provides quick visibility into database performance without requiring agents or complex setup. It connects directly to a PostgreSQL instance and delivers real-time insights into query workloads, active sessions, index usage, configuration health, and vacuum activity, helping DBAs quickly identify performance bottlenecks. Because it runs as a simple desktop application, there is nothing to install on the database server itself.
Read More

Thinking of PostgreSQL High Availability as Layers

High availability for PostgreSQL is often treated as a single, big, dramatic decision: “Are we doing HA or not?” That framing pushes teams into two extremes:

- a “hero architecture” that costs a lot and still feels tense to operate, or
- a minimalistic architecture that everyone hopes will just keep running.

A calmer way to design this is to treat HA and DR as layers. You start with a baseline, then add specific capabilities only when your RPO/RTO and budget justify them. Let us walk through the layers from “single primary” to “multi-site DR posture”.

Start with outcomes. Before topology, align on three things:

1. Failure scope
   a. A database host fails
   b. A zone or data center goes away
   c. A full region outage happens
   d. Human error
2. RPO (Recovery Point Objective)
   a. We can tolerate up to 15 minutes of data loss
   b. We want close to zero
3. RTO (Recovery Time Objective)
   a. We can be back in 30 minutes
   b. We want service back in under 2 minutes

Here is my stance (and it saves money!): You get strong availability outcomes by layering in the right order.
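Whatever layer you stop at, RPO and RTO only stay honest if they are measured. A minimal sketch that approximates current data-loss exposure on a streaming standby (the host in the connection string is a placeholder):

```python
import psycopg2

# Run against a standby using streaming replication. Replay lag in bytes
# plus the age of the last replayed commit approximates RPO exposure.
LAG_QUERY = """
    SELECT pg_wal_lsn_diff(pg_last_wal_receive_lsn(),
                           pg_last_wal_replay_lsn()) AS replay_lag_bytes,
           now() - pg_last_xact_replay_timestamp()   AS replay_delay
"""

with psycopg2.connect("host=standby1 dbname=app user=app_user") as conn:  # placeholder host
    with conn.cursor() as cur:
        cur.execute(LAG_QUERY)
        lag_bytes, delay = cur.fetchone()
        print(f"replay lag: {lag_bytes} bytes, last replayed commit: {delay} ago")
```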
Read More