Engineering · 22 min read

Claude Agent SDK in production: a studio's playbook

The 5 patterns our studio uses to ship Claude Agent SDK agents that survive real traffic — subagents, tool-use retries, skills, orchestration, and evals-as-gate.

Sadig Muradov · May 5, 2026

Most teams who pilot the Claude Agent SDK ship something that works on their laptop, works in staging, and then stops working the day it meets real traffic. The SDK didn't fail them — their design did. The patterns that work in a demo are almost never the patterns that survive the jump to production, and the gap between "works in a notebook" and "handles 50,000 requests a day without waking you up" is mostly four or five architectural decisions made on day one.

We're a senior engineering studio and a member of the Anthropic Claude Partner Network. In 2025 and early 2026 we shipped 40 production agents for ops, RevOps, and support teams at 20-to-200-person B2B SaaS, e-commerce, and professional services companies. About three-quarters of those run on the Claude Agent SDK. This is the playbook we wish we'd had before the first one went live.

If you're evaluating the SDK for a single agent, skim §2 to confirm it's the right tool for your workload, then read §3. If you've already got agents in production and they're misbehaving, §8 is probably where you want to start — those are the three failure modes that have cost us the most money.

Quick overview: 5 lessons at a glance

Each of the five patterns below follows the same structure: what it is, where it breaks, our code, and the dollar impact from real deployments. Skip to any one of them directly — they don't require reading in order.

| # | Pattern | When to use | Typical $ impact |
|---|---------|-------------|------------------|
| 1 | Subagent decomposition | Workflow has 2+ distinct phases (classify → act, extract → validate) | 30-45% token cost reduction vs. single-agent baseline |
| 2 | Tool-use retry loops | Any agent calling ≥2 tools, especially flaky third-party APIs | 12-18% cost reduction; P99 latency halved |
| 3 | Skills as packaging | Domain logic that repeats across prompts or grows past ~1,500 tokens | Context size down 40-60%; faster iteration cycle |
| 4 | Multi-agent orchestration | Long-running workflows that hand off across teams or time zones | Throughput up 2-3×; coordination cost tracked below 20% |
| 5 | Evals as deployment gate | Every agent before production (this is non-negotiable) | Prevents the single most expensive failure mode: silent drift |

The through-line is that production-grade agents look less like elaborate prompts and more like small, boring systems with aggressive guardrails. Nothing in this playbook is exotic. The tradeoff is always the same — spend an afternoon on the boring thing (a retry loop, an eval harness, a skill extraction) and buy yourself weeks of reliability.

Terminology used in this post. Tool = a single callable action the agent exposes (send_email, query_crm). Skill = a packaged bundle of prompts + tools + examples the agent loads on demand, defined in .claude/skills/*/SKILL.md [5]. Subagent = a separate agent instance the main agent can delegate to, with its own system prompt and tool surface [2]. Eval = a scored test against a fixed input set that runs on every deploy. We'll reference these without redefining them.

When the SDK is the right tool (and when it isn't)

The SDK is not a framework decision you make in isolation. It's a decision about which sharp edges you're willing to own yourself versus which ones you want Anthropic to own for you.

Pick the Claude Agent SDK when three things are true. First, your workload is Claude-first — you're not planning to swap models every quarter, and you're fine with the SDK assuming Claude-shaped features (extended thinking, prompt caching, long context). Second, your agent graph is shallow: a single agent, or a router plus one-to-three specialists. Third, you want the managed tool loop — automatic retries, token accounting, streaming, and structured outputs — without reimplementing them.

Pick a different tool when. You need model-agnostic swapping (go with LangChain or LiteLLM). You have cycle-heavy graph workflows with backtracking and state rewinds (LangGraph wins on those). You're building an autonomous research agent that needs to run for hours and plan its own sub-steps without human checkpointing (that's a different class of problem — the SDK works but isn't optimized for it).

The honest breakdown of our own deployments, across those 40 agents:

| Workload type | Count | SDK fit | Notes |
|---------------|-------|---------|-------|
| Inbound triage + routing | 14 | Strong | Sub-second classifiers; often with a single retrieval tool. |
| Document extraction (invoices, contracts, forms) | 11 | Strong | Subagent decomposition pays off fastest here. |
| Research + summarization for ops | 6 | Strong | Skills shine; the packaging layer keeps prompts clean. |
| Multi-agent orchestration (long-running) | 5 | Moderate | We often pair the SDK with a lightweight orchestrator (Temporal or a custom queue). |
| Autonomous research / coding | 4 | Weak-moderate | For deep, multi-hour loops we've mostly moved to Claude Code's agent harness itself. |

If your workload is in the top three rows, stop evaluating. Start building.

The trap to avoid. Teams underestimate how much of their pilot agent's behavior came from the raw Anthropic Messages API being forgiving at low volume — retry handling, caching, and structured outputs can be ignored in a notebook, and can't be under queue load. Moving from the raw API to the SDK is almost always the correct call once the pilot proves out, because the SDK surfaces the things you were ignoring. Don't skip this migration. Every team we've seen skip it has written their own version of the SDK's tool loop and regretted it.

Pattern 1 — Subagent decomposition for ops workflows

What it is. Instead of one agent with a giant system prompt and ten tools, you split the work across a router (which looks at the incoming request and decides what kind of thing it is) and one-to-three specialists (each of which handles one kind of thing very well). The router stays on a cheap, fast model. The specialists only run when they need to, and each one has a narrow system prompt that tells it how to do exactly one job. Anthropic's Agent SDK docs cover the subagent mechanics [2]; what we're describing here is the deployment shape.

Where it breaks. Subagent decomposition stops paying off the moment your router starts making judgment calls that should be the specialist's to make. We've seen it happen in three ways. First, the router gets too confident and returns a final answer instead of routing. Second, the specialists get too narrow and the router has to pick between five that all sort-of-fit. Third, the router and specialist start doing the same retrieval, duplicating token spend.

Our code. This is the production-shape router for one of our invoice-extraction agents. It's ~50 lines. It runs on Claude Haiku 4.5 at roughly $0.0004/request, and it only calls the specialist (on Sonnet 4.6) when it's genuinely needed.

from anthropic import Anthropic
from anthropic.types import ToolUseBlock

client = Anthropic()

ROUTER_SYSTEM = """You classify inbound documents and route them.
Return one of: INVOICE_STANDARD, INVOICE_EDGE_CASE, NOT_AN_INVOICE.
Do not extract data. Do not explain. Return the label only."""

EXTRACTION_SPECIALIST_SYSTEM = """You extract structured data from invoices.
Return JSON matching the schema. If any required field is missing, return
{"error": "missing_field", "field": "<name>"} instead of guessing."""

def route_and_extract(document_text: str) -> dict:
    # Router: cheap, fast, one job.
    routing = client.messages.create(
        model="claude-haiku-4-5",
        max_tokens=16,
        system=ROUTER_SYSTEM,
        messages=[{"role": "user", "content": document_text[:4000]}],
    )
    label = routing.content[0].text.strip()

    if label == "NOT_AN_INVOICE":
        return {"status": "rejected", "reason": "not_an_invoice"}

    # Specialist: expensive, accurate, only runs when routing found a match.
    extract = client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=1024,
        system=EXTRACTION_SPECIALIST_SYSTEM,
        messages=[{"role": "user", "content": document_text}],
        tools=[invoice_schema_tool],
    )
    return parse_extraction(extract)

Two things worth noting. One, the router truncates to 4,000 characters — it doesn't need the full document to classify, and the truncation saves meaningful tokens at volume. Two, the specialist returns an error field when it can't find a required value; we treat that as a first-class signal and route those to human review, rather than letting the model hallucinate a number.
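The snippet leans on two helpers we didn't show: the schema tool handed to the specialist and the parser that reads its output. A minimal sketch — the tool name `record_invoice` and its schema fields are illustrative assumptions, not our production schema:

```python
from types import SimpleNamespace

# Hypothetical schema tool: forces the specialist to emit structured fields.
invoice_schema_tool = {
    "name": "record_invoice",
    "description": "Record the extracted invoice fields.",
    "input_schema": {
        "type": "object",
        "properties": {
            "vendor": {"type": "string"},
            "invoice_number": {"type": "string"},
            "total": {"type": "number"},
            "currency": {"type": "string"},
        },
        "required": ["vendor", "invoice_number", "total", "currency"],
    },
}

def parse_extraction(response) -> dict:
    """Pull the structured payload out of the specialist's tool_use block."""
    for block in response.content:
        if getattr(block, "type", None) == "tool_use":
            if "error" in block.input:
                # Missing required field: route to human review, don't guess.
                return {"status": "needs_review", **block.input}
            return {"status": "ok", "fields": block.input}
    return {"status": "needs_review", "reason": "no_structured_output"}

# Smoke check with a faked response object standing in for the API result.
_fake = SimpleNamespace(content=[SimpleNamespace(
    type="tool_use",
    input={"vendor": "Acme GmbH", "invoice_number": "INV-7",
           "total": 120.5, "currency": "EUR"},
)])
result = parse_extraction(_fake)
```

The `needs_review` branch is what makes the error field a first-class signal rather than a silent failure.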

Dollar impact. On a deployment that processes roughly 18,000 documents a month, the single-agent baseline (one Sonnet 4.6 agent with a kitchen-sink prompt) cost $462/month at median traffic. The router-plus-specialist shape cost $276/month for the same volume — a 40% reduction. The router itself accounts for less than 5% of total cost; virtually all the savings come from the 22% of documents the router correctly rejects before they reach the specialist.

The pattern gets more valuable as the specialist gets more expensive. On a research workflow that uses Opus 4.7 as the specialist, we've seen the savings push past 55%, because the router is shielding a much pricier model.

Pattern 2 — Tool-use retry loops that don't burn tokens

What it is. When an agent calls a tool (say, a CRM lookup), the tool can fail in three different ways: the request itself errors, the request succeeds but returns nothing useful, or the request succeeds but the agent misinterprets the result. Each failure mode needs a different retry strategy. A production-grade tool loop handles all three without blowing the token budget or looping forever. The raw Anthropic Messages API gives you the primitives; the SDK gives you sensible defaults; but the policy layer — what counts as success, how many retries are safe, when to escalate — is yours to design.

Where it breaks. The default "just retry" behavior works until it doesn't, and when it stops working, it stops quietly. The three failure modes we've actually paid for are: (a) an agent that retries the same tool with the same arguments forever because the tool is returning a structurally-valid-but-semantically-wrong response; (b) an agent that retries a rate-limited tool 15 times in a row and triggers a 24-hour IP ban; (c) an agent that retries a succeeding tool because the model "didn't trust" the result and re-asked, doubling cost with no quality gain. All three are invisible in a dev environment. All three are painful in production.

Our code. Three guardrails, all cheap to implement. First, a max_tool_calls budget per request. Second, a token-cost circuit breaker. Third, same-args detection.

import json  # `client` and ToolUseBlock are reused from the Pattern 1 snippet

MAX_TOOL_CALLS = 12
TOKEN_BUDGET_MULTIPLIER = 2.0  # kill if >2x the expected cost

def run_with_guardrails(messages, tools, expected_tokens: int):
    tool_calls_used = 0
    seen_calls: set[tuple] = set()
    tokens_used = 0
    token_ceiling = expected_tokens * TOKEN_BUDGET_MULTIPLIER

    while True:
        response = client.messages.create(
            model="claude-sonnet-4-6",
            messages=messages,
            tools=tools,
            max_tokens=2048,
        )
        tokens_used += response.usage.input_tokens + response.usage.output_tokens

        # Circuit breaker: cost has exceeded budget.
        if tokens_used > token_ceiling:
            return {"status": "aborted", "reason": "token_budget_exceeded", "tokens": tokens_used}

        # Exit condition: model is done.
        if response.stop_reason == "end_turn":
            return {"status": "ok", "response": response, "tokens": tokens_used}

        # Collect tool uses and check for same-args repetition.
        for block in response.content:
            if isinstance(block, ToolUseBlock):
                signature = (block.name, json.dumps(block.input, sort_keys=True))
                if signature in seen_calls:
                    return {"status": "aborted", "reason": "loop_detected",
                            "tool": block.name}
                seen_calls.add(signature)
                tool_calls_used += 1

                if tool_calls_used > MAX_TOOL_CALLS:
                    return {"status": "aborted", "reason": "max_tool_calls_exceeded"}

                # ... dispatch the tool, append result to messages ...

The important thing isn't the code — it's the three separate kill conditions. Any one of them on its own isn't enough; all three together mean no request ever runs away. We default to MAX_TOOL_CALLS=12 for most workloads and TOKEN_BUDGET_MULTIPLIER=2.0, then tune down for high-volume agents where the tail matters.

Dollar impact. A retry-loop bug we shipped and fixed in Q4 2025 gives the cleanest number here. An extraction agent began retrying a specific tool because the response shape had drifted and the agent couldn't parse it. Before the circuit breaker, the agent averaged 47 tool calls per failed request and burned $0.41 in tokens per run. After we added the three guardrails, the same failure pattern aborted after 3 tool calls and $0.02 in tokens — a 95% per-request cost reduction on the pathological path, and the loop-detection signal let us find and fix the drift inside a day instead of a week. Over a month of traffic the fix saved us about $2,100 in tokens and two-and-a-half on-call days.

P99 latency also halved, because the pathological requests stopped dragging the tail. A median agent request isn't noticeably faster, but the worst 1% is dramatically better, which is usually what users actually notice.

Pattern 3 — Skills as a packaging layer for domain logic

What it is. Skills are Anthropic's way of letting an agent load domain-specific context (prompts, examples, sometimes tools) on demand rather than cramming everything into one massive system prompt [5]. The SDK loads skills from .claude/skills/*/SKILL.md in the working directory — a file-based packaging layer, not a runtime concept. A skill is essentially a bundle — a directory with a manifest, some prompts, maybe some code — that the agent pulls in when it decides it's relevant. The rest of the time the skill isn't in context and doesn't cost tokens.

Where it breaks. Skills are tempting to over-use because they feel like good software engineering. The failure modes we've seen: skills that are too small (three lines of prompt that should've just lived in the system prompt); skills that are too large (they end up being bigger than the base agent and defeat the lazy-loading point); and skills with overlapping scope, where the agent loads two of them and their instructions conflict. The mental model that's worked for us: a skill earns its keep when it's used by fewer than half of requests and it contains more than ~800 tokens of logic.

Our code. The structure of one of our production skills — an invoice normalization skill that converts between different country-specific invoice formats. It's about 2,100 tokens of prompt + examples. It's loaded on roughly 30% of requests (only when the router flags a non-US invoice). The manifest:

# .claude/skills/invoice-normalization/manifest.yaml
name: invoice-normalization
description: >
  Normalize international invoices (UK, DE, FR, IT, ES) into a canonical schema.
  Handles VAT-inclusive vs VAT-exclusive, decimal comma vs period, and
  supplier-tax-ID formats for EU countries.
triggers:
  - the invoice shows a currency symbol other than USD
  - the invoice references VAT or IVA or TVA or MwSt
  - any tax line item is present
prompts:
  system: prompts/system.md
  examples: prompts/examples.md
tools:
  - currency_convert
  - vat_rate_lookup

The agent loads this skill only when the router sees a non-US currency or VAT indicator. Inside prompts/system.md is the specific guidance: how to map MwSt. to VAT, how to handle decimal commas, which fields are required for a German invoice but not a Spanish one. The base extraction prompt stays small and general; the domain-specific quirks live in the skill.
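The load decision itself is cheap. A sketch of a pre-load check mirroring the manifest's triggers — this is our own illustration of the decision, not SDK behavior, and the regex is deliberately simplistic:

```python
import re

# Load the normalization skill only when the document shows non-USD
# currency or VAT markers, per the manifest triggers above.
NON_US_SIGNALS = re.compile(r"€|£|\bVAT\b|\bIVA\b|\bTVA\b|\bMwSt\b",
                            re.IGNORECASE)

def should_load_normalization_skill(document_text: str) -> bool:
    return bool(NON_US_SIGNALS.search(document_text))

german = should_load_normalization_skill("Rechnung: 1.190,00 € inkl. MwSt.")
us = should_load_normalization_skill("Invoice total: $42.00 USD")
```

On the ~70% of requests where the check returns False, the 2,100 tokens of skill content never enter context.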

What to put in a skill versus a tool. Tools do things (call an API, look up a value). Skills teach the agent how to do things it already knows how to do, but differently. A currency converter is a tool. A guide to handling German vs. Spanish invoice edge cases is a skill. When you're not sure, the test is: does this thing call out to the world, or does it sit next to the prompt in the agent's head? Outward-facing = tool. Inward-facing = skill.

Dollar impact. The clearest win we've measured came from splitting a single large extraction prompt (originally 4,200 tokens, covering US, UK, EU, and catch-all formats) into a compact 900-token base prompt plus three region-specific skills. Median input tokens per request dropped from 4,650 to 1,780 — a 62% reduction — because the agent only loads the region skill it actually needs. Monthly cost on that agent fell from $940 to $412, and the iteration speed on region-specific bugs roughly tripled because we stopped needing to retest US cases every time we changed UK logic.

The less-quantifiable win is cognitive. Skills give you a place to put edge-case logic that isn't the system prompt, and the system prompt stops growing by accretion. Teams we've handed agents off to consistently cite this as the thing that made maintenance tractable.

Pattern 4 — Multi-agent orchestration without the coordination tax

What it is. Some ops workflows are genuinely multi-agent: a research agent gathers context, an analyst agent summarizes it, a drafter agent produces a deliverable, and a reviewer agent checks the output. Each one is doing something the others aren't. The SDK supports multi-agent handoffs natively through its agents option and the built-in Agent tool [2], but shipping a multi-agent system that stays cheap and predictable takes deliberate design, not just declaring more agents.

Where it breaks. The thing that eats budgets in multi-agent systems is coordination overhead: agents re-describing context to each other, passing along conversation history that's already been processed, and burning tokens on handoff messages that don't produce any new work. We track coordination overhead as a percentage of total tokens, and we aim to keep it under 20%. Over 25% and the marginal utility of adding another agent has probably turned negative. Over 40% and you should merge two of your agents.

Our code. Two patterns that help.

The first is structured handoffs: agents hand off with a strict schema, not a natural-language message. This looks boring but it dramatically reduces the handoff token cost because the downstream agent doesn't have to re-parse loose prose.

HANDOFF_SCHEMA = {
    "task_type": "string",           # one of: research, analyze, draft, review
    "inputs": "object",              # structured, not free-text
    "upstream_findings": "array",    # list of dicts, one per prior agent
    "constraints": "array",          # list of strings, explicit
    "success_criteria": "string",    # one sentence, <120 chars
}

The second is shared long-term memory through a thin state store — usually a Postgres table keyed by workflow_id — that every agent reads from and writes to. Agents don't pass conversation history across handoffs; they pass a workflow_id, and each downstream agent fetches only the state keys it cares about. This is the single biggest lever for coordination overhead: it collapses the handoff token cost from O(history) to O(relevant-state).

def agent_step(workflow_id: str, agent_name: str):
    # Pull only the keys this agent needs.
    state = load_state(workflow_id, keys=AGENT_STATE_KEYS[agent_name])
    prompt = render_prompt(agent_name, state)
    response = run_agent(agent_name, prompt)
    save_state(workflow_id, response.state_updates)
    return response
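The `load_state` / `save_state` helpers reduce to a keyed read-modify-write. A minimal in-memory sketch of the contract — the key names and the `AGENT_STATE_KEYS` mapping are illustrative, and a real deployment backs this with the Postgres table:

```python
# In-memory stand-in for the Postgres state table keyed by workflow_id.
_STATE: dict[str, dict] = {}

AGENT_STATE_KEYS = {
    "research": ["topic"],
    "analyze": ["topic", "findings"],
    "draft": ["findings", "outline"],
}

def save_state(workflow_id: str, updates: dict) -> None:
    _STATE.setdefault(workflow_id, {}).update(updates)

def load_state(workflow_id: str, keys: list[str]) -> dict:
    # O(relevant-state): a downstream agent never sees full history.
    row = _STATE.get(workflow_id, {})
    return {k: row[k] for k in keys if k in row}

# The research agent writes broadly; the analyst reads only its two keys.
save_state("wf-1", {"topic": "churn drivers", "findings": ["f1", "f2"],
                    "raw_transcript": "...thousands of tokens..."})
analyst_view = load_state("wf-1", keys=AGENT_STATE_KEYS["analyze"])
```

The bulky `raw_transcript` stays in the store; only the two keys the analyst declared ever reach its prompt.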

Dollar impact. On a four-agent research-to-report workflow (research → analyze → draft → review) for one of our strategy-consulting clients, we measured coordination overhead at two points: before the structured-handoff + shared-state refactor, and after.

| Metric | Before | After | Change |
|--------|--------|-------|--------|
| Median tokens per workflow | 184,000 | 96,000 | −48% |
| Coordination overhead (% of total) | 42% | 17% | −25 pts |
| End-to-end latency (median) | 4m 20s | 2m 38s | −39% |
| Cost per workflow | $2.14 | $1.02 | −52% |
| Output quality (graded against golden set) | 78% | 79% | +1 pt |

Quality held steady; cost and latency both halved. The refactor took roughly four engineer-days. It paid for itself inside the first month of production traffic.

The case for not going multi-agent. Before you build a multi-agent system, check whether a single agent with good subagent decomposition (Pattern 1) handles the workload. Multi-agent orchestration is the right answer when the phases are genuinely asynchronous or involve different specialists, humans-in-the-loop, or multi-day cycles. For anything that fits in a single request-response window, a decomposed single agent will almost always be cheaper, faster, and easier to debug.

Pattern 5 — Evals as a deployment gate, not a vibe check

What it is. Every agent we ship goes to production with an automated eval suite that runs on every PR and on a nightly schedule against production traffic samples. An agent that can't pass its evals doesn't deploy. Full stop. We've written the three-layer harness up in detail in eval suites that catch drift before customers do — this section is about wiring that harness into the SDK, not about eval design.

Where it breaks. Teams either don't write evals at all, or they write the wrong kind. The wrong kinds: a single golden output string compared via exact match (paraphrasing breaks it immediately); a "does this output look good?" LLM-as-judge with no reference output (drifts with the judge); or a pass/fail threshold so loose that every run passes.

What actually works, in order of cost-to-build: (1) prompt unit tests on frozen input/output pairs — cheap, catches 80% of regressions, non-negotiable; (2) property tests on structural invariants (enum values, ranges, schema shape) — catches silent schema drift; (3) golden-trace drift detection with a secondary judge model comparing semantic equivalence against locked outputs — catches the slow declines no one else will. MIT NANDA's 2025 State of AI in Business report put hard numbers on the cost of skipping this step: 95% of enterprise GenAI initiatives delivered zero business return [4], and the single most-cited cause was lack of integration and learning loops — which is exactly what evals-as-gate forces you to build.
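Layer (2) is the easiest to make concrete. A hedged sketch of one property test — the invariant names and output fields here are illustrative, not the harness from our earlier post:

```python
# Property test on structural invariants: enum membership and value range.
ALLOWED_LABELS = {"INVOICE_STANDARD", "INVOICE_EDGE_CASE", "NOT_AN_INVOICE"}

def check_invoice_output(output: dict) -> list[str]:
    """Return the list of violated invariants; an empty list means pass."""
    violations = []
    if output.get("label") not in ALLOWED_LABELS:
        violations.append("label_not_in_enum")
    total = output.get("total")
    if not isinstance(total, (int, float)) or total < 0:
        violations.append("total_not_nonnegative_number")
    return violations

ok = check_invoice_output({"label": "INVOICE_STANDARD", "total": 120.5})
bad = check_invoice_output({"label": "RECEIPT", "total": -3})
```

Because these checks never compare against a golden string, they survive paraphrasing — they only fail when the structure itself drifts.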

Our code. The hook that runs evals as a deployment gate in CI. This is the shape we've standardized across engagements.

# ci/eval_gate.py
import sys

from eval_harness import run_suite  # our internal harness
# build_agent_under_test() and post_to_slack() are project-local helpers.

def main():
    results = run_suite(
        suite_path="evals/fixtures/",
        agent=build_agent_under_test(),
        layers=["unit", "property", "golden"],
    )

    # Unit tests: must be 100%. Any failure blocks the deploy.
    if results.unit.pass_rate < 1.0:
        print(f"UNIT FAILED: {results.unit.failures}")
        sys.exit(1)

    # Property tests: 100% required (structural invariants shouldn't slip).
    if results.property.pass_rate < 1.0:
        print(f"PROPERTY FAILED: {results.property.failures}")
        sys.exit(1)

    # Golden traces: 90% floor on semantic equivalence. Log, don't block,
    # on 85-90. Block under 85.
    if results.golden.equivalence_rate < 0.85:
        print(f"GOLDEN DROP: {results.golden.equivalence_rate:.2%}")
        sys.exit(1)
    elif results.golden.equivalence_rate < 0.90:
        post_to_slack(f"Golden trace warning: {results.golden.equivalence_rate:.2%}")

    print(f"OK — golden {results.golden.equivalence_rate:.2%}")

if __name__ == "__main__":
    main()

Two design decisions worth flagging. First, unit and property tests are 100%-pass gates; golden traces are 90%-floor with a warning band. That split reflects what each layer is measuring: the first two are structural invariants that don't tolerate regressions; the third is a behavior signal that moves slowly and doesn't need to be perfect on every commit. Second, golden traces run against the production traffic sample, not against a static fixture — they're testing whether this agent still handles yesterday's real inputs the way yesterday's agent did.

Dollar impact. The counterfactual is the most honest framing. On two of our deployments where we wired the three-layer harness in from day one, we caught silent prompt-injection drift within 48 hours of a model version update — before any user reported the issue. On one deployment where we inherited a running agent from a previous vendor that had only vibe-check evals, we didn't notice that the extraction accuracy had dropped from 96% to 87% until a customer complained. The business cost of the 9-point accuracy drop was roughly $8,000 in rework and one very uncomfortable customer call. The three-layer harness would have caught it in week one.

Stack Overflow's 2025 Developer Survey — the largest annual dataset on how developers actually work — found that only 31% of developers currently use AI agents in their workflow, and among those who do, the top frustration by a wide margin (66%) is "AI solutions that are almost right, but not quite" [3]. That "almost right" failure mode is exactly what the three-layer harness is designed to catch before the customer does.

What breaks in production: 3 failure modes we've paid for

This section is shorter. Three specific failure modes, what they looked like, and what we changed.

Failure mode 1: The silent cost explosion. One of our early agents doubled in monthly token cost over three weeks — $380 → $810 — with no corresponding increase in traffic. The cause: a third-party API we were calling as a tool had started returning a larger response payload (it added fields the agent was now pulling into context on every call). The model read the bloated payload, the input tokens climbed, and the bill followed. We had no alerting on input-tokens-per-request, only on total spend, so the trend was invisible until the invoice came. Fix: we now alert on input-tokens-per-request with a 7-day baseline, and we cap context sizes on tool results explicitly.
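The fix reduces to a rolling-baseline check. A minimal sketch — the 7-day window and 2σ threshold match what we run (see the monitoring FAQ), while the helper name and numbers are illustrative:

```python
from statistics import mean, stdev

def input_tokens_alert(daily_means: list[float], today: float,
                       sigma: float = 2.0) -> bool:
    """Flag when today's mean input-tokens/request sits more than
    `sigma` standard deviations above the trailing 7-day baseline."""
    baseline = daily_means[-7:]
    return today > mean(baseline) + sigma * stdev(baseline)

history = [1000, 1020, 980, 1010, 990, 1005, 995]  # last 7 days
quiet = input_tokens_alert(history, today=1010)    # within baseline
noisy = input_tokens_alert(history, today=1500)    # bloated payload drift
```

This is the alert that would have caught the payload bloat in week one instead of on the invoice.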

Failure mode 2: The eval that passed when the agent was broken. An extraction agent's golden-trace pass rate sat at 97% for two weeks while customer complaints climbed. The cause: the golden trace set had been generated when the upstream document format was v1, and the agent had started producing v2 outputs that the semantic-equivalence judge was rating as "close enough" even though they were subtly wrong on a field the business cared about. The eval passed because the judge was too forgiving. Fix: we now require golden-trace answers to be re-curated quarterly, we weight the equivalence score toward fields flagged as business-critical, and the on-call rotation owns a 20-minute weekly review of the drift dashboard, not just an alert-when-red policy.

Failure mode 3: The prompt injection that wasn't caught. An agent that summarized user-submitted support tickets started returning outputs containing links the agent had no business producing. The cause: a user had embedded instructions in a ticket body, and the summarizer — which had been given read-only access to a small set of URLs as a tool — was being coerced into producing content based on the injected instructions. The model wasn't fooled into leaking data; it was fooled into producing marketing copy for an unrelated product. Fix: we added input sanitization at the agent boundary (strip everything that pattern-matches on "ignore previous instructions" and similar variants, log the stripped content), we narrowed the tool surface so the agent couldn't produce outputs with arbitrary URLs, and we added an eval-layer check that flags any output containing domains outside an explicit allowlist.
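The eval-layer allowlist check at the end of that fix is small enough to sketch in full — the domain set and helper name here are hypothetical:

```python
import re

ALLOWED_DOMAINS = {"support.example.com"}  # explicit allowlist, per agent

_URL_DOMAIN = re.compile(r"https?://([^/\s]+)", re.IGNORECASE)

def off_allowlist_domains(output_text: str) -> set[str]:
    """Domains in the agent's output that aren't explicitly allowed."""
    found = {d.lower() for d in _URL_DOMAIN.findall(output_text)}
    return found - ALLOWED_DOMAINS

flagged = off_allowlist_domains(
    "Docs: https://support.example.com/kb/1 — also https://evil.example/buy"
)
```

Any non-empty result fails the eval, so an injected link blocks the deploy (or pages on-call in production) rather than reaching a user.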

Each of these cost us either real money or real customer goodwill. The pattern underneath all three: production failure happens at the seams — between the agent and its tools, between the agent and its eval harness, between the agent and user input — not inside the model itself. That's where to concentrate your defenses.

What to ship next + further reading

If you've got a Claude agent in a notebook and you're trying to decide what the next 2-4 weeks should look like before it meets real traffic, the order we'd ship in:

  1. Split your agent into router + specialist (Pattern 1), even if you only have one specialist today. The shape is what matters.
  2. Wire the three guardrails on your tool loop (Pattern 2): max_tool_calls, token budget, same-args detection. An afternoon of work; prevents the worst failure class.
  3. Extract your first skill (Pattern 3) — usually something domain-specific that's already bloating your system prompt. Don't extract more than one until you've measured the first.
  4. Write 30 prompt unit tests and 10 property tests (Pattern 5). Don't start with golden traces; those come later. Unit + property gets you 80% of the value in a tenth of the time.
  5. Add cost + latency alerting on the agent before you add anything else fancy. Two alerts beat ten dashboards.

If you want us to build or audit one of these with you, book a strategy call — our custom-agent builds ship this shape by default (the 5 patterns, the 3 guardrails, the three-layer eval harness) inside a fixed-scope engagement. The lowest-commitment way to start is our one-week audit: we find out whether your current agent has any of these traps in it and refund the audit fee against a build if you move forward.


The patterns above are the ones we bet on. They're not the only patterns that work, and they'll age as the SDK evolves — we'll re-date this post when they do. Until then, ship the boring version, put guardrails around it, and save the fancy architecture for the workload that actually proves it needs one.

Frequently asked questions

Is the Claude Agent SDK production-ready?

Yes — for the patterns it's designed for. Across 40 Autoolize deployments in 2025-2026, we've shipped triage, extraction, and research agents on the SDK that handle 5k-50k requests/day with median latency under 8 seconds and availability tracked against our own SLAs. The SDK is less mature than long-lived orchestration frameworks like LangGraph for cycle-heavy graph workflows, but for tool-using single- or shallow-multi-agent workloads it's the fastest path to production we've used.

Claude Agent SDK vs LangChain — which should I pick?

Pick the SDK when your workflow is Claude-only, fits within 1-3 agents with clear tool surfaces, and you want Anthropic's managed tool loop (retries, token accounting, streaming) without re-implementing it. Pick LangChain/LangGraph when you need model-agnostic swapping, graph-level cycle control, or third-party integrations that don't exist as Claude-native tools. For the ops automation workloads we ship, the SDK wins ~70% of the time because the rest of the stack is already Claude.

What's the real token cost of a production Claude agent?

For our ops triage and extraction agents, median cost lands between $0.008 and $0.032 per request on Claude Sonnet 4.6, depending on tool-call depth and retrieval context size. Subagent decomposition (Pattern 1) typically cuts 30-45% off the naive single-agent baseline because the router stays on a cheaper model and only expensive work hits the specialist. Failure-mode bills are where teams get surprised — see §8.

How many subagents is too many?

We rarely go past three. A router plus two specialists is the sweet spot for most ops workflows: the router classifies, one specialist does the deep work, the other handles edge cases or formatting. Four or more specialists almost always means the router is making decisions it shouldn't — we redesign the graph instead of adding agents. The cost of coordination compounds faster than most teams expect (§6 has the numbers).

Do I need evals before shipping?

Yes, and not the vibe-check kind. Every agent we ship goes out with prompt unit tests, output property tests, and a golden-trace drift detector — the three-layer harness we documented in eval suites that catch drift before customers do. Evals are the deployment gate. An agent that can't prove it handles its 30 reference cases doesn't go to production traffic.

What's the difference between tools, skills, and subagents?

Tools are single actions an agent can invoke (send_email, query_crm). Skills are packaged bundles of prompts, tools, and examples that the agent loads when needed — they keep domain logic out of the system prompt and reduce context bloat. Subagents are separate agent instances the main agent can delegate to; they have their own system prompts, their own tool surfaces, and their own cost profile. We use all three: tools for primitives, skills for domain logic, subagents for work that needs isolation (§3, §5).

How do you handle agents that loop forever or burn tokens?

Three guardrails. First, a max_tool_calls budget per request — we default to 12 and tune from there. Second, a token-cost circuit breaker that kills the request if it crosses 2× the budgeted cost. Third, same-args loop detection: if the agent repeats a tool call with identical arguments, we abort and return the partial result. Pattern 2 walks through the implementation and the ledger of bills these have saved us.

Is streaming worth implementing?

For user-facing agents, yes — perceived latency drops 40-60% with streaming, and users abandon less at the 10-second mark. For back-office agents (extraction, classification, triage) running on queues, streaming adds complexity without payback. We default to streaming only on interactive surfaces and standard requests elsewhere.

What does Anthropic's Claude Partner Network status actually give you?

Anthropic's Claude Partner Network launched in March 2026 with an initial $100M commitment to partners, anchored by Accenture, Deloitte, Cognizant, and Infosys. Smaller partners like us get access to partner-portal training, sales-enablement playbooks, a listing in the public Services Partner Directory, and a direct Slack channel for escalations. It does not give us pricing discounts or private model access. The real value is being able to unblock a production issue in hours rather than filing a ticket and waiting.

How do you monitor a Claude agent in production?

Four signals we watch per-agent: (1) end-to-end latency p50/p95/p99, (2) token cost per request with a weekly trend, (3) tool-call success rate per tool, (4) eval-score against the golden trace set run hourly. All four feed a single dashboard with a single on-call page rule: if any one goes >2σ off its 7-day baseline, someone gets paged. The boring observability stack beats fancy tracing every time.

Can you hand off an agent built on the SDK to an in-house team?

Yes — that's the default shape of our engagements. Every project ships with the agent code, the eval harness, a runbook, and a 30-day hypercare window where we respond to issues and do knowledge transfer. After hypercare the team owns the agent; we stay on as a retainer if they want ongoing operations. Roughly 60% of the teams we've handed off to ran the agent themselves for 12+ months without coming back for a second build. If you want to talk through it, book a strategy call.

Sources

  1. Claude Agent SDK — Overview · Anthropic
  2. Subagents — Agent SDK · Anthropic
  3. 2025 Developer Survey — AI · Stack Overflow
  4. The GenAI Divide — State of AI in Business 2025 · MIT NANDA
  5. Skills — Agent SDK · Anthropic