Agentic AI in Production: What I Learned Shipping 14 Autonomous Agents in 2026
The unglamorous engineering reality behind the year's hottest AI architecture pattern.
Table of Contents
- The day a billing agent refunded $42,000 by accident
- Why 2026 is the year agents actually work
- The architecture that actually shipped
- Tool design is 80% of the job
- The failure modes that bite
- Goal drift on long horizons
- Sycophantic confirmation loops
- Cost explosion on retries
- Tool-name collisions and overloaded verbs
- Hidden state in tool outputs
- What it actually costs
- Evals or you are just vibing
- The org-chart shift nobody is talking about
- What this means for you
The day a billing agent refunded $42,000 by accident
In February 2026, one of our internal agents — a refund-triage bot running on Claude Opus 4.7 — issued a perfectly polite, perfectly catastrophic chain of refunds totaling around $42,000 before our circuit breaker tripped. Nothing in the model was broken. The tools we exposed were broken. That single Tuesday taught me more about agentic AI than the previous twelve months of reading papers.
I have spent the last fourteen months at a Series B fintech shipping autonomous agents into real customer-facing workflows. Fourteen of them are live as of last week. This is what I wish someone had handed me on day one.
Why 2026 is the year agents actually work
Three things converged in the last nine months that finally made agents production-grade:
- Long-context models that actually reason at depth. Claude Opus 4.7 with its 1M-token window (released March 2026) and GPT-5.1's improved tool-call accuracy mean an agent can hold an entire customer history in context without lossy retrieval.
- Standardized tool protocols. Anthropic's MCP (Model Context Protocol) hit v1.2 in April 2026, and OpenAI's Responses API quietly became the default. We stopped writing bespoke JSON-schema glue.
- Evals as a first-class concept. Tools like Braintrust, LangSmith, and Inspect AI now treat evals the way Datadog treats metrics. You cannot ship an agent without them.
If you tried agents in 2024 with GPT-4 and gave up, try again. The error rate on multi-step tool use has dropped from roughly 18% to under 4% on our internal benchmark.
The other less-discussed shift is that pricing finally makes sense. Anthropic's prompt caching, generally available since late 2024 and significantly cheaper as of the April 2026 pricing update, means a multi-step agent that re-reads the same system prompt and tool catalog 30 times per run pays for that context once. On our refund-triage workflow, prompt caching alone cut per-run cost by 47%. Two years ago, that workflow was simply uneconomic.
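If you have not wired this up, the mechanism is a single cache marker on the static prefix. Here is a minimal sketch using the Anthropic TypeScript SDK; the prompt variables, the user message, and the model id are placeholders (the exact id for Opus 4.7 will differ in your account):

```ts
import Anthropic from "@anthropic-ai/sdk";

const client = new Anthropic();

// Stand-ins for the large, static prefix every step of the run re-reads.
const SYSTEM_PROMPT = "You are a refund-triage agent...";
const TOOL_CATALOG = "refund_order: ...\nissue_credit_note: ...";

const response = await client.messages.create({
  model: "claude-opus-4-7", // placeholder model id
  max_tokens: 2048,
  system: [
    {
      type: "text",
      text: `${SYSTEM_PROMPT}\n\n${TOOL_CATALOG}`,
      // Everything up to and including this block becomes a cacheable prefix;
      // later steps in the same run read it back at the cached rate.
      cache_control: { type: "ephemeral" },
    },
  ],
  messages: [{ role: "user", content: "Customer 4821 wants a refund on order 991." }],
});

console.log(response.usage); // cache_read_input_tokens shows the savings per step
```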
The architecture that actually shipped
After three rewrites, our agent stack settled here:
| Layer | What we use | Why |
|---|---|---|
| Model | Claude Opus 4.7 (planning) + Haiku 4.5 (sub-tasks) | Cost split, ~6x cheaper |
| Orchestration | Vercel AI SDK 5 with generateText + tool loops | Streaming, typed tools |
| Tool layer | MCP servers behind an internal gateway | Auth, rate limits, audit |
| Memory | Postgres + pgvector 0.7 | One DB beats four |
| Eval / trace | Braintrust | LLM-as-judge plus humans |
| Guardrails | Custom + Llama Guard 3 | Input and output filtering |
The big lesson: do not buy a framework. We tried LangGraph, CrewAI, and AutoGen. Every one of them became the bottleneck within a month. The Vercel AI SDK 5 stepCountIs stop condition with our own loop in roughly 200 lines of TypeScript outperformed all of them on debuggability.
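For a sense of scale, the skeleton of that loop is genuinely small. Here is a hedged sketch of the shape, written against AI SDK 5's generateText, tool, and stepCountIs as I understand them, not our production code; the tool, the endpoint, and the model id are illustrative:

```ts
import { generateText, stepCountIs, tool } from "ai";
import { anthropic } from "@ai-sdk/anthropic";
import { z } from "zod";

// Illustrative tool; in production these sit behind our MCP gateway.
const orderLookup = tool({
  description: "Fetch an order and its refund history by order id.",
  inputSchema: z.object({ orderId: z.string() }),
  execute: async ({ orderId }) => {
    const res = await fetch(`https://internal.example/orders/${orderId}`); // hypothetical endpoint
    return res.json();
  },
});

const result = await generateText({
  model: anthropic("claude-opus-4-7"), // placeholder model id
  system: "You are a refund-triage agent. Use tools; never guess order state.",
  prompt: "Customer 4821 wants a refund on order 991.",
  tools: { order_lookup: orderLookup },
  // The loop runs tool call -> tool result -> next model turn until the model
  // stops calling tools or this step ceiling trips.
  stopWhen: stepCountIs(25),
});

console.log(result.text, `${result.steps.length} steps`);
```

Everything interesting lives in the tools and the stop conditions, which is exactly why owning the loop beats owning a framework.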
Tool design is 80% of the job
The single highest-leverage skill in agent engineering is writing good tools. Not prompts. Tools.
Things I now do religiously:
- Every tool returns a structured error with a hint field. When refund_order fails because the order is older than 90 days, the tool returns {ok: false, code: "REFUND_WINDOW_EXPIRED", hint: "Use issue_credit_note instead"}. The agent self-corrects on the next step. Without that hint, it loops.
- Idempotency keys on every mutation. Agents retry. Without keys, you get duplicate refunds. Ask me how I know.
- Dry-run modes as a separate tool. refund_order_preview returns what refund_order would do. The planning model calls preview first 94% of the time when both exist.
- Hard caps inside the tool, not the prompt. Prompts get ignored under pressure. The tool itself rejects refunds over $500 unless a human_approved token is present. (A sketch of one such tool follows this list.)
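Here is a minimal sketch of what all of that looks like inside one tool. The field names, the $500 cap, and the in-memory idempotency set are illustrative stand-ins (ours lives in Postgres); the point is that the structured errors, the cap, and the idempotency check live in the tool rather than the prompt:

```ts
type ToolResult =
  | { ok: true; refundId: string }
  | { ok: false; code: string; hint: string };

// Stand-in for a durable idempotency store.
const processedKeys = new Set<string>();

export async function refundOrder(args: {
  orderId: string;
  amountUsd: number;
  idempotencyKey: string;
  humanApproved?: boolean;
}): Promise<ToolResult> {
  // Idempotency: a retried call with the same key cannot issue a second refund.
  if (processedKeys.has(args.idempotencyKey)) {
    return { ok: false, code: "DUPLICATE_REQUEST", hint: "This refund was already issued." };
  }

  // The hard cap lives here, not in the prompt.
  if (args.amountUsd > 500 && !args.humanApproved) {
    return {
      ok: false,
      code: "APPROVAL_REQUIRED",
      hint: "Refunds over $500 require a human_approved token. Request approval before retrying.",
    };
  }

  const order = await lookupOrder(args.orderId);
  if (order.ageInDays > 90) {
    // The hint is what lets the agent self-correct instead of looping.
    return { ok: false, code: "REFUND_WINDOW_EXPIRED", hint: "Use issue_credit_note instead." };
  }

  processedKeys.add(args.idempotencyKey);
  return { ok: true, refundId: await issueRefund(args.orderId, args.amountUsd) };
}

// Hypothetical internal calls, stubbed so the sketch stands alone.
async function lookupOrder(orderId: string) {
  return { ageInDays: 30 };
}
async function issueRefund(orderId: string, amountUsd: number) {
  return "rf_123";
}
```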
The failure modes that bite
A few patterns we now check every agent for before shipping:
Goal drift on long horizons
Past roughly 30 tool calls, even Opus 4.7 starts subtly reframing the original task. Our fix: a small re_anchor step every 10 calls that re-injects the original user goal as a system message. Eval pass rate on 50-step tasks went from 61% to 89%.
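The mechanism is small enough to show inline. This is a sketch of the idea rather than our exact implementation; the Message type and the interval are illustrative:

```ts
type Message = { role: "system" | "user" | "assistant" | "tool"; content: string };

const RE_ANCHOR_EVERY = 10; // illustrative interval

// Called once per loop iteration, before the next model call.
function reAnchor(messages: Message[], originalGoal: string, step: number): Message[] {
  if (step === 0 || step % RE_ANCHOR_EVERY !== 0) return messages;
  return [
    ...messages,
    {
      role: "system",
      content:
        "Reminder. The user's original goal, verbatim:\n" +
        originalGoal +
        "\nIf your current plan no longer serves this goal, stop and re-plan.",
    },
  ];
}
```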
Sycophantic confirmation loops
Agents love saying "I have completed the task." Then you check and they have not. We added a verification tool — assert_state(expected) — that the agent must call before claiming completion. If the assertion fails, the agent keeps working. Cheap, brutal, effective.
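A rough sketch of the shape of that tool, with illustrative state fields and a stubbed system-of-record lookup:

```ts
// Illustrative expected-state shape; real agents assert whatever the task claims to have changed.
type ExpectedState = { orderId: string; refundIssued: boolean; creditNoteIssued: boolean };

export async function assertState(expected: ExpectedState) {
  const actual = await fetchOrderState(expected.orderId);
  const mismatches: string[] = [];
  if (actual.refundIssued !== expected.refundIssued) mismatches.push("refundIssued");
  if (actual.creditNoteIssued !== expected.creditNoteIssued) mismatches.push("creditNoteIssued");

  return mismatches.length === 0
    ? { ok: true as const }
    : {
        ok: false as const,
        code: "STATE_MISMATCH",
        hint: `System of record disagrees on: ${mismatches.join(", ")}. Keep working.`,
      };
}

// Hypothetical system-of-record lookup, stubbed for the sketch.
async function fetchOrderState(orderId: string) {
  return { refundIssued: false, creditNoteIssued: false };
}
```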
Cost explosion on retries
A single misbehaving agent burned $1,800 in tokens overnight retrying a flaky webhook. We now wrap every agent in a per-run budget of dollars and steps, enforced outside the model. Non-negotiable.
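The wrapper itself is boring, which is the point. A sketch with placeholder limits and per-million-token prices passed in by the caller:

```ts
// Per-run budget enforced outside the model. Limits and prices below are placeholders.
class RunBudget {
  private spentUsd = 0;
  private steps = 0;

  constructor(private readonly maxUsd: number, private readonly maxSteps: number) {}

  // Call after every model response, before starting the next step.
  charge(inputTokens: number, outputTokens: number, usdPerMInput: number, usdPerMOutput: number) {
    this.spentUsd +=
      (inputTokens / 1_000_000) * usdPerMInput + (outputTokens / 1_000_000) * usdPerMOutput;
    this.steps += 1;
    if (this.spentUsd > this.maxUsd || this.steps > this.maxSteps) {
      throw new Error(
        `Run budget exceeded: $${this.spentUsd.toFixed(2)} spent, ${this.steps} steps taken`
      );
    }
  }
}

// Usage: one budget per run, killed hard when either limit trips.
const budget = new RunBudget(2.0, 40);
budget.charge(47_000, 3_200, 15, 75); // example token counts; prices are placeholders
```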
Tool-name collisions and overloaded verbs
Early on we had update_order, update_customer, and update_subscription all in the same agent. The model picked the wrong one about 12% of the time on ambiguous prompts. Renaming them to order_update_status, customer_update_contact_info, and subscription_change_plan — verb-last, namespaced, specific — dropped the error rate to under 2%. Models are good at reading careful names. They are bad at reading sloppy ones. This is the single cheapest accuracy win available.
Hidden state in tool outputs
Agents take cues from incidental data. We had an account_lookup tool that returned the customer's lifetime value among 40 other fields. The agent began factoring LTV into refund decisions in ways nobody intended. The fix was returning only what the current task needed. Tool output shape is part of your prompt whether you think about it or not.
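In practice the fix is a projection function per task, not a change to the underlying lookup. An illustrative sketch with made-up field names:

```ts
// Project the raw account record down to only what the refund task needs.
// Field names are illustrative; the real account_lookup response had ~40 fields.
type AccountRecord = Record<string, unknown>;

function toRefundContext(account: AccountRecord) {
  return {
    accountId: account["accountId"],
    refundEligible: account["refundEligible"],
    openDisputes: account["openDisputes"],
    // Deliberately omitted: lifetime value, plan tier, and everything else
    // the model has no business weighing when deciding a refund.
  };
}
```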
What it actually costs
Real numbers from last month's production traffic:
- Average agent run: 18 tool calls, 47K input tokens, 3.2K output tokens
- Cost per run on Opus 4.7 only: about $0.94
- Cost per run with Opus planning + Haiku sub-tasks: about $0.16
- P95 latency: 11 seconds (acceptable for async workflows, not for chat)
If you are doing chat-style agents, latency will eat you alive. Move to background jobs with status polling. Users tolerate 30 seconds for a "researching your refund" spinner. They do not tolerate 11 seconds of dead air mid-conversation.
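The plumbing is ordinary job-queue work. A sketch of the polling flow with hypothetical endpoints and job shape:

```ts
// Enqueue the agent run, then poll. Endpoint paths and the job shape are hypothetical.
async function startRefundTriage(userRequest: string): Promise<string> {
  const res = await fetch("https://internal.example/agent-jobs", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ workflow: "refund_triage", input: userRequest }),
  });
  const { jobId } = await res.json();
  return jobId;
}

async function pollUntilDone(jobId: string, intervalMs = 2000) {
  // The UI shows "researching your refund" while this loops.
  for (;;) {
    const res = await fetch(`https://internal.example/agent-jobs/${jobId}`);
    const job = await res.json();
    if (job.status === "succeeded" || job.status === "failed") return job;
    await new Promise((resolve) => setTimeout(resolve, intervalMs));
  }
}
```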
Evals or you are just vibing
I will say this loudly: if you do not have an eval suite with at least 200 graded examples per agent, you do not have a production agent. You have a demo. We run three eval tiers:
- Unit evals. Single tool call, deterministic. Run on every PR. (A minimal example follows this list.)
- Trajectory evals. Full agent run on a frozen scenario set, graded by a stronger model plus human spot-checks. Run nightly.
- Shadow production. New agent versions run in parallel with the live one for 48 hours. We diff outcomes.
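Here is the tier-1 example referenced above, written as an ordinary Vitest test against the refund tool sketched earlier; the module path and fixture id are hypothetical:

```ts
import { describe, expect, it } from "vitest";
import { refundOrder } from "./tools/refund-order"; // hypothetical module path

// Tier 1: deterministic, single-tool-call eval. Runs on every PR.
describe("refund_order", () => {
  it("rejects out-of-window refunds and points at the right fallback", async () => {
    const result = await refundOrder({
      orderId: "order_120_days_old", // fixture known to be outside the 90-day window
      amountUsd: 40,
      idempotencyKey: "eval-key-1",
    });

    expect(result).toMatchObject({
      ok: false,
      code: "REFUND_WINDOW_EXPIRED",
      hint: expect.stringContaining("issue_credit_note"),
    });
  });
});
```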
Braintrust handles tiers 1 and 2 cleanly. Tier 3 is custom. Without tier 3, model upgrades silently break things. We caught a 7% regression on Opus 4.7 vs 4.5 only because of shadow runs.
A note on LLM-as-judge: trust it, but verify. We sample 10% of judged runs for human review and recalibrate the judge prompt monthly. When we stopped doing this for a quarter, the judge drifted optimistic and shipped a regression that customer support caught before our evals did. Evaluators are also models, and they need their own evals.
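The sampling itself is a few lines. An illustrative sketch, with hypothetical persistence and queue helpers:

```ts
// Route a random slice of LLM-judged runs into a human review queue so judge drift
// shows up as judge/human disagreement. Names and persistence are illustrative.
type JudgedRun = { runId: string; judgeScore: number; judgeRationale: string };

const HUMAN_REVIEW_RATE = 0.1; // the 10% sample mentioned above

export async function recordJudgement(run: JudgedRun) {
  await saveJudgement(run);
  if (Math.random() < HUMAN_REVIEW_RATE) {
    await enqueueHumanReview(run.runId);
  }
}

// Hypothetical persistence and queue helpers, stubbed for the sketch.
async function saveJudgement(run: JudgedRun) {}
async function enqueueHumanReview(runId: string) {}
```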
The org-chart shift nobody is talking about
The most surprising thing about running agents in production is what it does to your team structure. We hired our first "agent reliability engineer" last quarter — essentially an SRE whose pager covers agent runs instead of services. The role exists because agents fail in different ways than services do: not crashes, but quiet wrongness. Latency dashboards do not catch a refund agent that has started subtly over-refunding by 2%.
If you are scaling agents past three or four production workflows, expect to need someone whose entire job is eval design, trace inspection, and quality regression detection. It is a real specialty now. Job postings for "AI engineer" and "agent reliability engineer" tripled on Hacker News' "Who is hiring" between January 2025 and April 2026.
What this means for you
If you are building agents in mid-2026, the playbook is no longer experimental. It looks roughly like this:
- Pick one boring, high-value workflow. Refunds, lead routing, log triage. Not "an AI assistant."
- Write the tools first. Spend a week on tool design before touching prompts.
- Use Opus 4.7 for planning, Haiku 4.5 for grunt work. Do not default to the most expensive model for everything.
- Build evals before you build the agent. Yes, really.
- Set hard budgets and idempotency keys before you go anywhere near production.
- Ship behind a feature flag with a kill switch a non-engineer can pull.
The companies winning with agents right now are not the ones with the cleverest prompts. They are the ones with the most boring, well-instrumented tool layers. That is the entire game.
One last note on team adoption: the engineers who do best with agentic systems are the ones who treat them like junior employees with infinite patience and zero context, not like APIs. You give them clear scope, useful tools, written guardrails, and feedback loops. The mental model that fails is "very large function call." The mental model that works is "very fast new hire who needs the same onboarding everyone else got." Once that frame clicks for a team, the velocity unlocks. We have shipped agents in two weeks that would have been three-month projects under the old framing.