Posts on evals, failure modes, production agents, and how we scope engagements. Updated monthly.
A buyer's guide to AI automation agencies for ops teams — pricing models, how to tell a studio from a course, 6 archetypes compared, and what "custom" actually costs in 2026.
A head-to-head comparison from a senior engineering studio that ships both — developer ergonomics, cost, latency, tool-use reliability, and a framework for picking the right one per workflow.
The 5 patterns our studio uses to ship Claude Agent SDK agents that survive real traffic — subagents, tool-use retries, skills, orchestration, and evals-as-gate.
The four signals that tell us an ops workflow has outgrown Zapier: branching logic, retries and rate limits, human-in-the-loop review, and real observability. Plus the typed, tested Claude-based replacement we ship when the signals fire.
A plain-English definition of "agentic AI" for ops teams, plus the 6 patterns we actually ship (triage, extraction, research, copilots, orchestrators, and human-in-the-loop review), with cost profiles and the eval hooks that keep each one honest in production.
A walkthrough of the 3-layer eval harness we ship with every production agent: prompt unit tests, property tests on outputs, and nightly drift detection on production traffic. Covers the cost per layer, the specific failure modes each layer catches, and why we use LLM-as-judge only where it earns its keep.