Writing

Long-form essays on production AI failure patterns, agent reliability, and operational discipline.

Essays on production-AI failure patterns. New posts roughly monthly.

Subscribe via RSS — no newsletter signup yet.

When You Bound an Operation, the Bound Is a Typed Error

production-ai

reliability

error-handling

code-review

Adding a timeout or a cap to a previously-unbounded operation is the easy part. The part teams get wrong is how the bound-exceeded condition surfaces. It is not a return value and it is not a generic Error. It is a typed error, and every bound in a module should use the same one.

2026-05-21

5 min

Your Agent’s Config File Is Executable. Treat It Like One.

production-ai

security

agents

supply-chain

Every coding-agent runtime loads files like AGENTS.md, .cursor/rules, CLAUDE.md, and .git/hooks with full operator-level trust. Those files live in repos anyone can PR to. They get copy-pasted from blog posts. They are not comments. They are code with a wider blast radius than your dependencies.

2026-05-20

4 min

Your Event Loop Already Knows It’s Starving. You’re Just Not Listening.

production-ai

reliability

observability

architecture

Event-loop lag is a measurable, first-class signal. Most systems capture it only incidentally on one code path, so every other timeout during a starvation window gets misattributed to the wrong subsystem. The fix isn’t less load. It’s listening to the signal that already exists.

2026-05-18

5 min

Three Bugs, One Shape: The Failure Signal That Existed and Got Discarded

production-ai

reliability

error-handling

architecture

Across a week of production-AI bug reports, three landed in completely different subsystems and turned out to be the same bug wearing three costumes. The shape: a structured failure signal that existed at one layer, got discarded before anything used it, and left a success record that disagreed with reality.

2026-05-17

8 min

Rate-Limit Detection Belongs at the Header Layer, Not the Error String

production-ai

reliability

error-handling

agents

Parsing a 429 out of an error message is reading at the wrong layer. The structured signal was sitting in the response headers, before the regex ran and before the idle timer won the race.

2026-05-16

4 min

5 Claude Code Patterns That Separate Power Users From Everyone Else

claude-code

agents

production-ai

workflows

mcp

Five patterns I see in Claude Code workflows that actually ship and scale past the first month: skill-driven workflow forks, PreToolUse hooks, subagent task delegation, headless mode + scheduled cron firings, and per-repo settings.json with command-level allowlists.

2026-05-13

4 min

5 Silent-Failure Patterns I Keep Finding in Production AI Systems

production-ai

monitoring

agents

mcp

Production AI systems fail in ways traditional monitoring can’t catch. Here’s the catalog of 5 patterns that keep appearing across every stack — and what to check for.

2026-05-07

12 min

Categories

When You Bound an Operation, the Bound Is a Typed Error

Your Agent’s Config File Is Executable. Treat It Like One.

Your Event Loop Already Knows It’s Starving. You’re Just Not Listening.

Three Bugs, One Shape: The Failure Signal That Existed and Got Discarded

Rate-Limit Detection Belongs at the Header Layer, Not the Error String

5 Claude Code Patterns That Separate Power Users From Everyone Else

5 Silent-Failure Patterns I Keep Finding in Production AI Systems