PostgreSQL
Replication
Postgres replicates WAL (Write-Ahead Log) records from a primary to one or more standbys. Standbys can be:
- Physical (streaming) replicas — byte-for-byte copies of the cluster (same major version).
- Logical replicas — row-level change streams to different schemas/versions (publications/subscriptions).
Reads on standbys are possible with hot standby; writes always go to the primary.
Physical (streaming) replication
How it moves data
- Primary generates WAL.
- A walsender process streams WAL to each standby.
- Each standby's walreceiver writes WAL to disk and replays it.
You can also do cascading replication: a standby can act as source for other standbys.
Core setup (modern PG, v12+ style)
On the primary:
- wal_level = replica (or logical if you also need logical).
- max_wal_senders, max_replication_slots sized appropriately.
- Optionally create replication slots to prevent WAL recycling before standbys consume it.
On the standby:
- Take a base backup (pg_basebackup -R is the easy path—creates standby.signal and config).
- Ensure primary_conninfo (conn string to primary) and (optionally) primary_slot_name.
- hot_standby = on for read-only queries.
Promotion: pg_ctl promote or SELECT pg_promote();.
Monitoring: on primary pg_stat_replication; on standby pg_stat_wal_receiver, pg_last_wal_replay_lsn().
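A minimal sketch of that setup, assuming a standby named standby1, a replication user replicator, and a slot named standby1_slot (all placeholders):
-- On the primary (these settings require a restart to take effect):
ALTER SYSTEM SET wal_level = 'replica';
ALTER SYSTEM SET max_wal_senders = 10;
ALTER SYSTEM SET max_replication_slots = 10;
-- Optional: a physical slot so WAL is retained until the standby has consumed it
SELECT pg_create_physical_replication_slot('standby1_slot');
-- On the standby host (shell), clone the primary; -R writes standby.signal and primary_conninfo:
--   pg_basebackup -h primary.example.com -U replicator -D /var/lib/postgresql/data -R -S standby1_slot -X stream
-- Monitoring: on the primary
SELECT application_name, state, sent_lsn, replay_lsn FROM pg_stat_replication;
-- on the standby
SELECT pg_is_in_recovery(), pg_last_wal_replay_lsn();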
Asynchronous vs synchronous
Asynchronous (default): the primary ACKs the commit after its local WAL flush; standbys catch up as they can. Lowest latency, but recent commits that haven't replicated yet can be lost on failover.
Synchronous: the primary waits for one or more standbys to confirm receipt before ACKing the commit. Controlled by two settings:
- synchronous_standby_names — which standbys count and how many:
- Priority style: 'FIRST 1 (node_a, node_b)' → wait for the first available of those.
- Quorum style: 'ANY 2 (node_a, node_b, node_c)' → wait for any 2 (quorum commit).
- synchronous_commit — how far the primary waits:
- remote_apply: standby has applied (visible to queries there).
- on (the default; the "remote flush" level): standby has flushed the WAL to disk.
- remote_write: standby has received and written the WAL, but not yet flushed it to disk.
- local: don’t wait on standbys (acts async).
- off: fastest, but even local durability is reduced (rarely recommended).
Tip: For most HA needs, use synchronous_commit=on with a quorum in synchronous_standby_names.
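A sketch of that tip, assuming node_a/node_b/node_c are the application_name values your standbys report in their primary_conninfo:
-- Quorum commit: wait for any 2 of the 3 named standbys
ALTER SYSTEM SET synchronous_standby_names = 'ANY 2 (node_a, node_b, node_c)';
ALTER SYSTEM SET synchronous_commit = 'on';
SELECT pg_reload_conf();
-- Per-transaction opt-out for low-value writes:
BEGIN;
SET LOCAL synchronous_commit = 'local';
-- ... low-value write here ...
COMMIT;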
Replication slots
- Physical slots: keep WAL until a standby confirms receipt—prevents WAL loss at the cost of possible disk bloat if a standby is down too long.
- Good for stability; monitor pg_replication_slots and disk usage.
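For example, one query (run on the primary) to see how much WAL each slot is still holding back:
SELECT slot_name, slot_type, active,
       pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)) AS retained_wal
FROM pg_replication_slots;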
Read consistency on standbys
- Standbys are eventually consistent.
- With sync + remote_apply, you can guarantee a commit is visible on at least one standby when the primary returns success.
Logical replication
When to use
- Selective table replication.
- Cross-major-version upgrades.
- Integrations/ETL.
- Multi-master-ish patterns (careful with conflicts).
How it works
On the publisher (primary or any instance):
- wal_level = logical
- Create a publication:
CREATE PUBLICATION pub FOR TABLE my_table;
On the subscriber:
- Create a subscription with the publisher's DSN:
CREATE SUBSCRIPTION sub CONNECTION '...' PUBLICATION pub;
Changes are decoded via logical decoding and applied as SQL on the subscriber.
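A slightly fuller sketch; the connection string, table, and publication/subscription names are placeholders, and the table definition must already exist on the subscriber (schema is not replicated):
-- Publisher
ALTER SYSTEM SET wal_level = 'logical';   -- requires a restart
CREATE PUBLICATION pub FOR TABLE my_table;
-- Subscriber (cannot run inside a transaction block)
CREATE SUBSCRIPTION sub
    CONNECTION 'host=primary.example.com dbname=app user=replicator'
    PUBLICATION pub;
-- Monitoring
SELECT * FROM pg_stat_subscription;     -- on the subscriber
SELECT * FROM pg_publication_tables;    -- on the publisher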
Failover & HA
- Manual failover: promote a standby (pg_promote()), repoint clients (VIP/DNS/HAProxy).
- Automated HA: use tooling (Patroni, repmgr, Stolon, Pacemaker/Corosync) for leader election, fencing, and reconfiguring replicas.
- After failover, remaining standbys must follow the new timeline; often easier with HA tooling.
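A manual promotion at its simplest, as a sketch (the data directory path is illustrative):
-- On the standby being promoted:
SELECT pg_promote(wait => true);
-- or from the shell: pg_ctl promote -D /var/lib/postgresql/data
-- Confirm it left recovery and note the new timeline:
SELECT pg_is_in_recovery();                        -- should now return false
SELECT timeline_id FROM pg_control_checkpoint();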
Useful knobs & views
- Primary: pg_stat_replication (per-standby state, write/flush/apply LSNs, lag); settings: synchronous_standby_names, synchronous_commit, wal_level, max_wal_senders, max_replication_slots.
- Standby: pg_stat_wal_receiver; functions: pg_last_wal_receive_lsn(), pg_last_wal_replay_lsn(), pg_is_in_recovery().
Patterns & recommendations
- Most setups: primary + 1–2 async standbys for read scaling and DR.
- For zero-loss HA: add one synchronous standby (or quorum of 2) in the same AZ/region to minimize latency.
- Use replication slots + disk monitoring to avoid WAL gaps or bloat.
- For upgrades with minimal downtime: use logical replication.
- Keep base backups + WAL archiving for point-in-time recovery (PITR).
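For the last point, a minimal WAL-archiving sketch; /archive and /backups are placeholder paths, and in practice a tool such as pgBackRest or WAL-G makes a better archive_command:
ALTER SYSTEM SET archive_mode = 'on';    -- requires a restart
ALTER SYSTEM SET archive_command = 'test ! -f /archive/%f && cp %p /archive/%f';
SELECT pg_reload_conf();
-- Take the base backup that archived WAL will be applied on top of (shell):
--   pg_basebackup -D /backups/base -Ft -z -X fetch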
VACUUM
Overview
In PostgreSQL, VACUUM does several things:
- Reclaims space: marks dead tuples as reusable space inside the table file (though file size won't shrink unless you VACUUM FULL).
- Maintains visibility maps: helps the planner know which pages contain only visible tuples (useful for index-only scans).
- Advances relfrozenxid: protects against transaction ID wraparound by marking old tuples as frozen (always visible).
There are three related commands to know:
- VACUUM (standard): cleans dead tuples and updates metadata, non-blocking.
- VACUUM FULL: rewrites the table to a new file, reclaiming disk space back to the OS (blocks writers).
- ANALYZE: often paired with vacuum to update statistics for the query planner.
Postgres also runs autovacuum in the background, which triggers automatically based on thresholds (dead tuples count, table size, age of transaction IDs).
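A sketch of how to inspect and tune that; the threshold numbers are purely illustrative and users is just an example table name:
-- Dead-tuple counts and last (auto)vacuum times
SELECT relname, n_live_tup, n_dead_tup, last_vacuum, last_autovacuum
FROM pg_stat_user_tables
ORDER BY n_dead_tup DESC
LIMIT 10;
-- Make autovacuum more aggressive on one hot table
ALTER TABLE users SET (autovacuum_vacuum_scale_factor = 0.05,
                       autovacuum_vacuum_threshold = 1000);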
Example
Initial table
CREATE TABLE users (
id SERIAL PRIMARY KEY,
name TEXT
);
INSERT INTO users (name) VALUES
('Alice'), ('Bob'), ('Charlie');
Physical rows (tuples) inside the table file:
[1, "Alice"] visible
[2, "Bob"] visible
[3, "Charlie"] visible
After an UPDATE and DELETE
UPDATE users SET name = 'Bobby' WHERE id = 2;
DELETE FROM users WHERE id = 3;
Now the table file looks like this:
[1, "Alice"] visible
[2, "Bob"] dead tuple (old version)
[2, "Bobby"] visible (new version)
[3, "Charlie"] dead tuple (deleted)
To queries, PostgreSQL shows only Alice and Bobby. But physically, the old version of Bob's row and Charlie's deleted row are still on disk, taking up space.
Run VACUUM
VACUUM users;
What happens:
- Dead tuples (Bob old, Charlie) are marked as free space inside the data pages.
- They are not returned to the OS yet, but PostgreSQL can reuse that space for future inserts/updates.
After vacuum:
[1, "Alice"] visible
[2, "Bobby"] visible
[ ] free space (was old Bob)
[ ] free space (was Charlie)
The table file size stays the same, but internal free slots are ready to be reused.
Run VACUUM FULL
VACUUM FULL users;
This rewrites the whole table into a new compact file:
[1, "Alice"] visible
[2, "Bobby"] visible
What happens:
- File size shrinks.
- All dead rows are gone physically.
- But it locks the table during the rewrite, so it's heavier.
Summary:
- After updates/deletes: old tuples remain until vacuum runs.
- VACUUM: cleans them logically, space reusable.
- VACUUM FULL: actually shrinks the file.
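To observe this on the example table, a few checks (the pg_stat counters update with a small delay from the statistics collector):
-- Physical size on disk: unchanged by plain VACUUM, shrinks after VACUUM FULL
SELECT pg_size_pretty(pg_relation_size('users'));
-- Dead tuples as tracked by statistics
SELECT n_live_tup, n_dead_tup FROM pg_stat_user_tables WHERE relname = 'users';
-- VACUUM VERBOSE reports what it removed
VACUUM (VERBOSE, ANALYZE) users;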
Partitioning
- Native partitioning is declarative (PARTITION BY) since version 10.
- It supports range, list, and hash partitions.
- Old versions used table inheritance + check constraints with constraint_exclusion.
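A minimal declarative example (table name and range bounds are illustrative):
CREATE TABLE measurements (
    id          BIGSERIAL,
    recorded_at TIMESTAMPTZ NOT NULL,
    value       DOUBLE PRECISION
) PARTITION BY RANGE (recorded_at);
CREATE TABLE measurements_2024 PARTITION OF measurements
    FOR VALUES FROM ('2024-01-01') TO ('2025-01-01');
CREATE TABLE measurements_2025 PARTITION OF measurements
    FOR VALUES FROM ('2025-01-01') TO ('2026-01-01');
-- Rows are routed to the matching partition automatically
INSERT INTO measurements (recorded_at, value) VALUES ('2025-06-15 12:00:00+00', 42.0);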
Add a new index in production
Prefer concurrent builds, plan for IO/WAL/lag, precheck uniqueness, monitor with PG views, and clean up redundant indexes. This keeps production safe while improving query performance.
Challenges and how to mitigate them
- Locking & Blocking
- CREATE INDEX (non-concurrent) takes a SHARE lock → blocks all writes (reads still work) on the table until done. Dangerous in prod.
- Prefer CREATE INDEX CONCURRENTLY (CIC): keeps table online for reads/writes, but:
- Holds a SHARE UPDATE EXCLUSIVE lock for the whole build (doesn't block reads/writes, but conflicts with other DDL and VACUUM on the table) and waits out existing transactions at two points.
- Cannot run inside a transaction block.
- Can fail with 'could not create unique index' late in the process.
- Leaves an invalid index behind on error—must DROP INDEX manually.
- Runtime & Resource Load
- Large tables → long builds (hours). Index build is CPU/IO heavy.
- Needs maintenance_work_mem; if too small, spills to disk (temp files) → slower IO.
- Generates lots of WAL → can fill disk and spike replication lag on standbys.
- Bloat & Space
- You need free space for the index itself plus sort/work files.
- On very busy tables, CIC makes multiple passes; dead tuples can cause extra work.
- Consider fillfactor and whether an existing index should be REINDEX CONCURRENTLY instead.
- Replication Concerns
- On streaming replicas, WAL from index build must be applied → replication lag risk.
- On logical replicas, DDL propagation timing matters (tooling-dependent).
- Query Planning Surprises
- New indexes can change plans unexpectedly (better or worse).
- Collations, data distribution, and statistics matter; you might need ANALYZE.
- For expression/partial indexes, only queries matching the exact expression/predicate can use them (see the sketch after this list).
- Unique & Concurrent Builds
- CREATE UNIQUE INDEX CONCURRENTLY must validate uniqueness after building → can fail at the end if duplicates exist.
- Precheck duplicates first, or build a non-unique index and enforce uniqueness later once the data is cleaned up.
- Operational Gotchas
- CIC can be canceled by conflicts (long-running transactions, vacuum conflicts).
- Requires DDL permissions; cannot be wrapped in a transaction (psql autocommit must be on).
- INCLUDE columns (covering indexes) increase size/writes—trade-off vs fewer heap hits.
- Write Amplification
- Every future INSERT/UPDATE must maintain the new index → higher write cost, more WAL, more vacuum work.
- Too many overlapping indexes hurt overall throughput.
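To make the expression/partial point above concrete, a sketch (orders, customer_email, and status are placeholder names):
-- Expression index: only queries written against lower(customer_email) can use it
CREATE INDEX CONCURRENTLY idx_orders_email_lower
    ON orders (lower(customer_email));
--   uses the index:    SELECT * FROM orders WHERE lower(customer_email) = 'a@example.com';
--   does not use it:   SELECT * FROM orders WHERE customer_email = 'a@example.com';
-- Partial index: only queries whose WHERE clause implies the predicate can use it
CREATE INDEX CONCURRENTLY idx_orders_pending
    ON orders (created_at) WHERE status = 'pending';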
Safer rollout checklist
- Design
- Pick the right type: btree (most common), gin/gist (full-text/geo), brin (append-only, large ranges).
- Use partial/expression indexes if your workload is skewed.
- Consider column order and INCLUDE for covering queries.
- Preflight
- Verify disk space: table size × (0.2–1.2) depending on type + temp.
- Set sane maintenance_work_mem for the session.
- Check duplicates if building UNIQUE.
- Ensure no ultra-long transactions are open (pg_stat_activity).
- Build
- Use CREATE INDEX CONCURRENTLY ....
- Set lock_timeout (small) and statement_timeout (generous).
- Schedule in off-peak hours.
- Monitor
- pg_stat_progress_create_index (build phase & tuples processed).
- pg_locks (waiting?), pg_stat_replication (lag), disk usage.
- OS metrics for IO / CPU.
- Validate & Adopt
- ANALYZE the table.
- Observe plans via EXPLAIN (ANALYZE, BUFFERS) and pg_stat_statements.
- Remove redundant/unused indexes after a watch period.
- Rollback
- If plans regress, disable use via query hints (not native in PG) or drop the index.
- For failed CIC leaving invalid indexes: DROP INDEX CONCURRENTLY IF EXISTS ….
Practical snippets
Create concurrently with safety timeouts:
SET lock_timeout = '2s';
SET statement_timeout = '4h';
SET maintenance_work_mem = '2GB';
CREATE INDEX CONCURRENTLY idx_orders_user_created_at
ON orders (user_id, created_at);
Watch progress:
SELECT * FROM pg_stat_progress_create_index;
Check size:
SELECT indexrelname, pg_size_pretty(pg_relation_size(indexrelid))
FROM pg_stat_user_indexes
WHERE indexrelname = 'idx_orders_user_created_at';
If unique, precheck duplicates:
SELECT user_id, created_at, COUNT(*)
FROM orders
GROUP BY 1,2
HAVING COUNT(*) > 1
LIMIT 10;
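If a concurrent build failed, find the leftover invalid indexes (a catalog-query sketch):
SELECT n.nspname AS schema, c.relname AS index_name
FROM pg_index i
JOIN pg_class c ON c.oid = i.indexrelid
JOIN pg_namespace n ON n.oid = c.relnamespace
WHERE NOT i.indisvalid;
Each hit can then be removed with DROP INDEX CONCURRENTLY IF EXISTS, as noted in the rollback step above.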