AI Production Governance: A Maturity Model
By mid-April 2026, the gap between teams shipping stable AI features and teams shipping chaos isn't tools -- it's production governance. Here is how mature teams evaluate, deploy, and roll back.
This archive covers reliability across 18 posts from Jul 2016 to Jan 2026, treating reliability, delivery speed, and cost discipline as one system rather than three separate concerns. The strongest adjacent threads are architecture, sre, and ai. Recurring title motifs include production, ai, outage, and taught.
In 2026, enterprise AI isn't failing because models are bad. It is failing because organizations are building brittle demos instead of bounded, operable systems.
Reliable agents aren't prompted into existence. They're engineered -- with bounded tools, validation at every step, explicit recovery paths, and the same discipline you'd apply to any production system. Here's how I build them in Go.
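As a taste of what "bounded" means in practice, here is a minimal Go sketch of an allowlisted tool registry with per-tool input validation. The Tool type, registry, and invoke helper are illustrative names for this post's summary, not the post's actual code.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// Tool is a hypothetical bounded action an agent may invoke.
// The agent can only call tools from an explicit registry,
// never arbitrary code or shell commands.
type Tool struct {
	Name     string
	Validate func(arg string) error // reject bad input before execution
	Run      func(arg string) (string, error)
}

// registry is the agent's entire action surface. Anything not
// listed here is unreachable by construction.
var registry = map[string]Tool{
	"lookup_order": {
		Name: "lookup_order",
		Validate: func(arg string) error {
			if !strings.HasPrefix(arg, "ord_") {
				return errors.New("argument must be an order ID (ord_*)")
			}
			return nil
		},
		Run: func(arg string) (string, error) {
			return fmt.Sprintf("order %s: shipped", arg), nil
		},
	},
}

// invoke validates, executes, and reports errors explicitly,
// so the caller always has a recovery path.
func invoke(name, arg string) (string, error) {
	t, ok := registry[name]
	if !ok {
		return "", fmt.Errorf("unknown tool %q: refusing to act", name)
	}
	if err := t.Validate(arg); err != nil {
		return "", fmt.Errorf("tool %s rejected input: %w", name, err)
	}
	return t.Run(arg)
}

func main() {
	out, err := invoke("lookup_order", "ord_12345")
	fmt.Println(out, err)

	_, err = invoke("delete_database", "prod") // not in the registry
	fmt.Println(err)
}
```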
Your AI system can return 200 OK and still be wrong, unsafe, or confidently hallucinating. Here's how to detect, contain, and learn from AI incidents -- drawing from the same IR principles that work for traditional systems.
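A minimal sketch of that idea: treat semantic failures as incident signals even when the HTTP layer reports success. The checkResponse guard and ErrSemanticFailure sentinel are hypothetical, and real checks would be domain-specific.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// A 200 from the model API says nothing about whether the answer
// is usable. This check layer treats semantic failures as incidents
// even when the transport succeeded.
var ErrSemanticFailure = errors.New("semantic failure: response unusable")

// checkResponse applies cheap, deterministic guards before the
// output reaches a user.
func checkResponse(answer string, citations []string) error {
	if strings.TrimSpace(answer) == "" {
		return fmt.Errorf("%w: empty answer", ErrSemanticFailure)
	}
	if len(citations) == 0 {
		// An unsupported claim is an incident signal, not a success.
		return fmt.Errorf("%w: no supporting citations", ErrSemanticFailure)
	}
	return nil
}

func main() {
	err := checkResponse("The refund was issued on 2026-01-03.", nil)
	if errors.Is(err, ErrSemanticFailure) {
		// In production: increment a metric, alert on a sustained rate,
		// and fall back to a safe default instead of serving the answer.
		fmt.Println("contained:", err)
	}
}
```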
AI agents that can take actions are fundamentally different from chatbots. The engineering bar must match the blast radius.
Betting on a single model provider is like having a single database with no failover. Here is why multi-model is the only sane production strategy.
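The core of the strategy fits in a few lines: treat providers like replicas and fail over on error. This Go sketch assumes a hypothetical Provider abstraction and a fixed per-attempt timeout.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// Provider is a hypothetical abstraction over a model vendor.
type Provider struct {
	Name string
	Call func(ctx context.Context, prompt string) (string, error)
}

// complete tries providers in priority order: bounded timeout per
// attempt, fail over to the next vendor on any error.
func complete(ctx context.Context, providers []Provider, prompt string) (string, error) {
	var lastErr error
	for _, p := range providers {
		attemptCtx, cancel := context.WithTimeout(ctx, 5*time.Second)
		out, err := p.Call(attemptCtx, prompt)
		cancel()
		if err == nil {
			return out, nil
		}
		lastErr = fmt.Errorf("%s: %w", p.Name, err)
	}
	return "", fmt.Errorf("all providers failed, last error: %w", lastErr)
}

func main() {
	providers := []Provider{
		{Name: "primary", Call: func(ctx context.Context, _ string) (string, error) {
			return "", errors.New("rate limited") // simulated outage
		}},
		{Name: "secondary", Call: func(ctx context.Context, _ string) (string, error) {
			return "fallback answer", nil
		}},
	}
	out, err := complete(context.Background(), providers, "summarize this ticket")
	fmt.Println(out, err)
}
```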
December 7 reminded everyone that us-east-1 is a single point of failure for half the internet. Again. I am annoyed.
Good incident response is not about preventing failure. It is about failing well. Lessons from a decade of on-call, including NATO and telecom-scale operations.
Practical database reliability from running Postgres at the fintech startup and at large enterprises. Includes config examples, migration patterns, and the operational habits that actually prevent outages.
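One such habit, sketched in Go with assumed table names and a placeholder DSN: set lock_timeout on the migration's session so DDL fails fast instead of queueing behind long-running readers and blocking every write.

```go
package main

import (
	"context"
	"database/sql"
	"log"
	"time"

	_ "github.com/jackc/pgx/v5/stdlib" // Postgres driver; any database/sql driver works
)

func main() {
	db, err := sql.Open("pgx", "postgres://localhost:5432/app?sslmode=disable")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
	defer cancel()

	// Run the timeout and the DDL on the same connection,
	// since SET is per-session.
	conn, err := db.Conn(ctx)
	if err != nil {
		log.Fatal(err)
	}
	defer conn.Close()

	if _, err := conn.ExecContext(ctx, "SET lock_timeout = '2s'"); err != nil {
		log.Fatal(err)
	}
	// Fails within 2s if it can't acquire the lock -- retry later
	// instead of stalling production writes.
	if _, err := conn.ExecContext(ctx,
		"ALTER TABLE orders ADD COLUMN IF NOT EXISTS note text"); err != nil {
		log.Fatalf("migration blocked or failed: %v", err)
	}
	log.Println("migration applied")
}
```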
Teams love saying they do chaos engineering. Few actually have hypotheses. Even fewer fix what they find.
Every team says they want zero downtime. Few want to do the boring work that actually gets them there. Here's what that boring work looks like.
Most load tests produce comforting numbers instead of useful answers. Here's what I learned the hard way about getting honest results.
Most SLOs are dashboards nobody acts on. Here's how to pick indicators that reflect real users, set targets grounded in data, and make error budgets actually change how your team ships.
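The arithmetic behind an error budget is simple enough to show inline; the numbers below (99.9% over 30 days) are illustrative, not prescriptive.

```go
package main

import (
	"fmt"
	"time"
)

func main() {
	const sloTarget = 0.999 // 99.9% of requests succeed over the window
	window := 30 * 24 * time.Hour

	// Budget = allowed unreliability over the window (~43 minutes here).
	budget := time.Duration(float64(window) * (1 - sloTarget))
	fmt.Printf("30-day error budget at 99.9%%: %s\n", budget.Round(time.Minute))

	// Burn rate: if 0.5% of requests failed recently, we're burning
	// budget at 5x -- a page-worthy rate, not a dashboard footnote.
	observedErrRate := 0.005
	burnRate := observedErrRate / (1 - sloTarget)
	fmt.Printf("burn rate: %.1fx\n", burnRate)
}
```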
Failure is not an edge case. It is the default state you temporarily hold off with good engineering. A few hard-won rules for building systems that bend instead of shatter.
Hard-won patterns for reliable background job processing -- queues, retries, idempotency, and the failures that taught me to care about all three.
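A compressed sketch of how the three fit together: a unique job ID makes retries safe, and capped backoff keeps a poison message from spinning forever. The in-memory idempotency store here stands in for a real one (a database table or a Redis key).

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// Job carries a unique ID, which is what makes retries safe.
type Job struct {
	ID      string
	Payload string
}

// processed simulates a durable idempotency store.
var (
	mu        sync.Mutex
	processed = map[string]bool{}
)

// handle is idempotent: a redelivered or retried job is a no-op.
func handle(j Job) error {
	mu.Lock()
	if processed[j.ID] {
		mu.Unlock()
		return nil // duplicate delivery; already done
	}
	processed[j.ID] = true
	mu.Unlock()

	fmt.Println("charging card for", j.ID) // the side effect we must not repeat
	return nil
}

// runWithRetry retries transient failures with exponential backoff,
// capped so a poison message can't spin forever.
func runWithRetry(j Job, attempts int) error {
	var err error
	for i := 0; i < attempts; i++ {
		if err = handle(j); err == nil {
			return nil
		}
		time.Sleep(time.Duration(1<<i) * 100 * time.Millisecond)
	}
	return fmt.Errorf("job %s exhausted retries: %w", j.ID, err)
}

func main() {
	j := Job{ID: "job-42", Payload: "invoice"}
	_ = runWithRetry(j, 3)
	_ = runWithRetry(j, 3) // redelivery: idempotency makes it safe
}
```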
Hard-won lessons from designing distributed systems that survive real-world failures -- timeouts, retries, bulkheads, and the operational habits that actually keep things running.
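For instance, a bulkhead plus a timeout in Go might look like the following; the slot count and deadline are placeholder values you would tune from measured behavior.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// A bulkhead caps concurrent calls to one dependency so a slow
// downstream can't absorb every goroutine in the process.
type Bulkhead struct{ slots chan struct{} }

func NewBulkhead(n int) *Bulkhead { return &Bulkhead{slots: make(chan struct{}, n)} }

// Do runs fn with a hard timeout and a concurrency limit.
func (b *Bulkhead) Do(ctx context.Context, fn func(context.Context) error) error {
	select {
	case b.slots <- struct{}{}:
		defer func() { <-b.slots }()
	case <-ctx.Done():
		return fmt.Errorf("bulkhead full: %w", ctx.Err())
	}
	ctx, cancel := context.WithTimeout(ctx, 2*time.Second)
	defer cancel()
	return fn(ctx)
}

func main() {
	bh := NewBulkhead(10) // at most 10 in-flight calls to this dependency
	err := bh.Do(context.Background(), func(ctx context.Context) error {
		select {
		case <-time.After(3 * time.Second): // simulated slow dependency
			return nil
		case <-ctx.Done():
			return ctx.Err()
		}
	})
	if errors.Is(err, context.DeadlineExceeded) {
		fmt.Println("timed out instead of hanging:", err)
	}
}
```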
The SRE hype train has everyone copying Google's playbook without asking whether it fits. Here's what actually matters when you're not running planet-scale infrastructure.
Chaos engineering isn't just for the big players. Here's how a small team can start breaking things deliberately and actually learn from it.
Every pipeline I've built at the fintech startup broke at some point. Here's the design approach that made them recoverable instead of catastrophic.
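The recoverable shape is essentially checkpointing: persist progress after each stage so a crashed run resumes where it stopped instead of reprocessing (or double-writing) everything. This sketch uses an in-memory Checkpoint as a stand-in for a durable store such as a table or an object in S3.

```go
package main

import "fmt"

// Checkpoint records which stages have completed.
type Checkpoint struct{ done map[string]bool }

func (c *Checkpoint) Done(stage string) bool { return c.done[stage] }
func (c *Checkpoint) Mark(stage string)      { c.done[stage] = true }

// runPipeline skips completed stages and stops at the first failure,
// so the next run resumes exactly where this one gave up.
func runPipeline(cp *Checkpoint, stages []string, run func(string) error) error {
	for _, s := range stages {
		if cp.Done(s) {
			fmt.Println("skip (already done):", s)
			continue
		}
		if err := run(s); err != nil {
			return fmt.Errorf("stage %s failed: %w", s, err)
		}
		cp.Mark(s) // in production, persist before moving on
	}
	return nil
}

func main() {
	cp := &Checkpoint{done: map[string]bool{"extract": true}} // previous run got this far
	stages := []string{"extract", "transform", "load"}
	err := runPipeline(cp, stages, func(s string) error {
		fmt.Println("running:", s)
		return nil
	})
	fmt.Println("err:", err)
}
```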
Production incidents show where architecture bends and where it breaks. These lessons focus on designing for failure, limiting blast radius, and making recovery routine.