The Agentic Stack Is Breaking. Here's What Replaces It.

[Figure: crumbling legacy automation, siloed scripts, and fragmented agents replaced by an integrated agentic operating system (AOS) with goals, context, and orchestration, plus data, communication, decision, and learning agents]

The demo always works. The agent searches the web, drafts an email, updates the CRM, and sends a Slack notification — all from a single prompt. Founders screenshot it. VCs fund it. Engineers ship it. Then it hits production.


The tool call returns a 429. The agent retries six times, burning $4 in tokens, and still writes "I was unable to complete this task" into a Notion doc nobody asked for. Or worse: it completes the task incorrectly, with enough confident prose that nobody notices until the wrong invoice has gone to the wrong client.


This isn't a model problem. The models have gotten remarkably good. It's a stack problem — and the stack most teams built their agentic systems on in 2023 and 2024 was never designed for the reliability requirements of real work.

What "The Agentic Stack" Actually Means

When engineers talk about the agentic stack, they mean a specific architecture: a frontier LLM at the center, a set of tool definitions it can call (APIs, code interpreters, browser use, database writes), a memory layer (usually a vector store), an orchestration wrapper that routes between steps, and a prompt that instructs the model to "think step by step."


In its basic form, this is what LangChain, AutoGen, CrewAI, and dozens of clones popularized between 2023 and early 2025. The architecture is elegant on a whiteboard. Chain the reasoning. Call the tools. Store the context. Repeat.


The problem is that reliability compounds across those links. In a five-step agentic workflow where each step succeeds 90% of the time and failures are independent, the end-to-end success rate is 0.9^5, or about 59%. At ten steps, it's about 35%. Real pipelines are not 90% reliable per step. They're more like 70-80% on messy real-world inputs, which means multi-step agents frequently produce garbage with great posture.
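
The arithmetic above is easy to reproduce. The sketch below assumes every step fails independently and with the same probability, which is a simplification (real steps are often correlated), and the function name is only illustrative.

```python
def end_to_end_success(per_step: float, steps: int) -> float:
    """Probability that every step succeeds, assuming independent steps."""
    return per_step ** steps


# Compare the optimistic 90% case with the 70-80% range cited for messy inputs.
for per_step in (0.90, 0.80, 0.70):
    for steps in (5, 10):
        rate = end_to_end_success(per_step, steps)
        print(f"{per_step:.0%} per step, {steps} steps: {rate:.1%} end to end")
```

At 80% per step, a five-step run already succeeds only about a third of the time, which is why adding steps to a pipeline is never free.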


BenchAgent, a 2024 evaluation by researchers at Carnegie Mellon, tested 18 LLM-based agents across 96 real-world tasks and found average task completion rates between 33% and 51% — even with GPT-4-class models. The agents weren't failing because the models were dumb. They were failing because the orchestration layer had no recovery logic, tool errors cascaded silently, and context windows bloated to the point where the model lost track of its original objective.

The Three Places Where It Actually Breaks

1. Tool calls are treated as reliable I/O — they're not

Most agentic frameworks wrap tool calls in a thin layer that assumes a clean request-response pattern. If the tool returns an error, the agent either halts, retries blindly, or hallucinates a fake result and continues. None of these outcomes are acceptable in a production system.


Stripe's API returns errors whose type and HTTP status call for different responses. A rate-limit error (HTTP 429) needs a backoff. An invalid_request_error needs the agent to re-read its own input. An authentication failure (HTTP 401) needs a human. The current generation of agent orchestrators largely cannot distinguish between these cases, and most prompts don't tell them to try.
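
One way to picture the distinction above is a small dispatch on the failure class, sitting between the agent and the tool. This is a minimal sketch, not any framework's API: `ToolError`, `classify`, `call_tool`, and the status-to-action mapping are all hypothetical names and policies chosen for illustration.

```python
import random
import time
from dataclasses import dataclass
from enum import Enum
from typing import Callable, Optional


class Action(Enum):
    RETRY_WITH_BACKOFF = "retry_with_backoff"
    REPAIR_INPUT = "repair_input"
    ESCALATE_TO_HUMAN = "escalate_to_human"


class ToolError(Exception):
    """Hypothetical error raised by a tool wrapper; carries the HTTP status."""

    def __init__(self, status: int):
        super().__init__(f"tool call failed with HTTP {status}")
        self.status = status


def classify(error: ToolError) -> Action:
    """Map an HTTP status to a recovery action (an illustrative policy)."""
    if error.status == 429 or error.status >= 500:
        return Action.RETRY_WITH_BACKOFF   # transient: wait and try again
    if error.status in (401, 403):
        return Action.ESCALATE_TO_HUMAN    # credentials: the agent cannot fix this
    return Action.REPAIR_INPUT             # other 4xx: the request itself is wrong


@dataclass
class Outcome:
    ok: bool
    value: object = None
    action: Optional[Action] = None        # what the caller should do on failure
    attempts: int = 0


def call_tool(tool: Callable[[], object], max_retries: int = 3,
              base_delay: float = 0.5,
              sleep: Callable[[float], None] = time.sleep) -> Outcome:
    """Call a tool, retrying only transient failures, with a hard attempt cap."""
    for attempt in range(1, max_retries + 1):
        try:
            return Outcome(ok=True, value=tool(), attempts=attempt)
        except ToolError as err:
            action = classify(err)
            if action is not Action.RETRY_WITH_BACKOFF:
                return Outcome(ok=False, action=action, attempts=attempt)
            if attempt < max_retries:
                # Exponential backoff with jitter before the next attempt.
                sleep(base_delay * 2 ** (attempt - 1) + random.uniform(0, base_delay))
    # Retries exhausted: hand off instead of looping or inventing a result.
    return Outcome(ok=False, action=Action.ESCALATE_TO_HUMAN, attempts=max_retries)
```

The point of the design is that the orchestrator, not the prompt, decides what a failure means. The model only ever sees a typed outcome, so it has no opening to retry blindly or paper over the error.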

2. Memory is expensive, stale, and often wrong

Vector-based memory stores — the dominant pattern — retrieve by semantic similarity, not by what's actually relevant to the current task. Ask an agent to "check what we discussed last Tuesday," and it will surface the three most semantically similar past chunks, which might be from a completely different project thread. The agent will use them confidently.


Beyond retrieval accuracy, there's a cost problem. Full context injection — dumping all memory into every call — is the safer option, but token costs compound fast. A team at Letta (formerly MemGPT) documented that their production agents were spending 60-70% of total token spend on context management alone. For most companies, that's where the unit economics fall apart.
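
To make the retrieval problem concrete, here is a toy sketch of similarity-only lookup next to one possible mitigation: applying hard metadata filters (project, date) before ranking. The mitigation is an assumption added for illustration, not something the teams mentioned here are known to use, and the `Memory` record, two-dimensional embeddings, and `retrieve` function are hypothetical.

```python
import math
from dataclasses import dataclass
from datetime import date
from typing import Optional, Sequence


@dataclass(frozen=True)
class Memory:
    text: str
    project: str
    day: date
    embedding: Sequence[float]   # toy vectors; real ones come from an embedding model


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve(memories, query_embedding, k: int = 3,
             project: Optional[str] = None, day: Optional[date] = None):
    """Rank by similarity, but only among memories that pass the hard filters."""
    candidates = [
        m for m in memories
        if (project is None or m.project == project)
        and (day is None or m.day == day)
    ]
    candidates.sort(key=lambda m: cosine(m.embedding, query_embedding), reverse=True)
    return candidates[:k]


memories = [
    Memory("Agreed to move the invoice run to Fridays", "billing",
           date(2025, 3, 4), (0.6, 0.8)),
    Memory("Discussed invoice templates for the new client", "onboarding",
           date(2025, 2, 18), (1.0, 0.0)),
]
query = (1.0, 0.0)  # stands in for "what did we discuss last Tuesday?"

print(retrieve(memories, query, k=1)[0].project)                        # similarity only
print(retrieve(memories, query, k=1, day=date(2025, 3, 4))[0].project)  # filtered by date
```

Without the filter, the closest vector wins even though it comes from a different thread and a different week. With it, "last Tuesday" becomes a constraint rather than one more word in the query.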

3. Orchestration has no ground truth

The agent decides when it's done. That decision is made by the same LLM that did the work — which means there's no independent verification layer. If the model is slightly off-task, it will report success. If it misread the original instruction, it will confidently complete the wrong objective.


This isn't hypothetical. In early 2025, several teams using AutoGen for code generation workflows found agents marking tasks as "complete" after writing syntactically valid code that failed every unit test. The model had satisfied its own internal criteria. Nobody told it the criteria were wrong.
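
The fix for the code-generation case is to let something other than the model decide what "complete" means. The sketch below is a minimal illustration under stated assumptions: `AgentResult`, `accept`, and `passes_unit_tests` are hypothetical names, the "tests" are two hardcoded checks on a function called `add`, and `exec` is used only to keep the example short.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class AgentResult:
    output: str                # e.g. generated source code
    self_reported_done: bool   # what the model claims about its own work


def accept(result: AgentResult, checks: List[Callable[[str], bool]]) -> bool:
    """Accept only if every external check passes; the model's claim is ignored."""
    return all(check(result.output) for check in checks)


def passes_unit_tests(source: str) -> bool:
    """Run the generated code against known cases (illustration only)."""
    namespace: dict = {}
    try:
        exec(source, namespace)   # in practice, run untrusted code in a sandbox
        add = namespace["add"]
        return add(2, 3) == 5 and add(-1, 1) == 0
    except Exception:
        return False


# Syntactically valid, confidently reported as done, and wrong.
result = AgentResult(output="def add(a, b):\n    return a - b\n",
                     self_reported_done=True)
print(result.self_reported_done, accept(result, [passes_unit_tests]))
```

The orchestrator never consults `self_reported_done` when deciding whether to move on. It is kept only so that disagreements between the model's claim and the checks can be logged.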

What the Next Architecture Looks Like

The teams actually getting agentic systems to work in production aren't using the classic stack. They've converged, largely independently, on four patterns that change the underlying assumptions.

Structured outputs as the contract layer

Instead of asking an LLM to reason freely and extract results from prose, production-grade agentic systems use structured outputs — JSON schemas, typed responses — at every decision boundary. OpenAI's structured outputs feature (released mid-2024) and Anthropic's tool-use protocol both enable this natively now.


The effect is significant. When the model must return a typed object with a defined schema, the failure mode becomes visible: the call either parses or it doesn't. You get hard errors you can handle, not confident nonsense you can't detect. Teams at companies like Notion and Linear have reported dropping their agent error rates by over 40% by moving internal routing decisions to structured output calls rather than free-form chain-of-thought.
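
The "parses or it doesn't" property can be sketched with nothing but the standard library. This is an illustrative contract for a routing decision, not OpenAI's or Anthropic's API: the `RoutingDecision` schema, the allowed routes, and `ContractError` are all assumptions made for the example.

```python
import json
from dataclasses import dataclass

ALLOWED_ROUTES = {"billing", "support", "sales"}   # hypothetical routing targets


class ContractError(ValueError):
    """Raised when model output does not satisfy the expected schema."""


@dataclass(frozen=True)
class RoutingDecision:
    route: str
    confidence: float


def parse_routing_decision(raw: str) -> RoutingDecision:
    """Turn raw model output into a typed object, or fail loudly."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as err:
        raise ContractError(f"not valid JSON: {err}") from err
    if not isinstance(data, dict) or set(data) != {"route", "confidence"}:
        raise ContractError("expected exactly the keys 'route' and 'confidence'")
    route, confidence = data["route"], data["confidence"]
    if not isinstance(route, str) or route not in ALLOWED_ROUTES:
        raise ContractError(f"unknown route: {route!r}")
    # bool is a subclass of int in Python, so exclude it explicitly.
    if isinstance(confidence, bool) or not isinstance(confidence, (int, float)):
        raise ContractError("confidence must be a number")
    if not 0.0 <= confidence <= 1.0:
        raise ContractError("confidence must be between 0 and 1")
    return RoutingDecision(route=route, confidence=float(confidence))


print(parse_routing_decision('{"route": "billing", "confidence": 0.92}'))
try:
    parse_routing_decision("I think this should probably go to billing.")
except ContractError as err:
    print("rejected:", err)
```

A `ContractError` is something the orchestrator can catch, count, and retry on. A paragraph of plausible prose is not.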

Evals as infrastructure, not afterthought

The teams shipping reliable agents treat evaluations the way software teams treat tests — as a prerequisite to shipping, not a post-hoc audit. This means building eval harnesses before the agent, not after.


Braintrust, Langfuse, and a small number of internal eval frameworks are handling this for some teams. But the more important shift is cultural: the question is no longer "does it work in the demo?" but "what is the measured success rate on a representative sample of real inputs, and what does failure look like?"


This sounds obvious. Almost nobody did it in 2023.
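
In its smallest form, an eval harness is a list of representative inputs, a check per input, and a measured success rate that gates the release. The sketch below assumes the agent can be called as a plain function from prompt to output; `EvalCase`, `run_evals`, and the threshold are hypothetical, and tools like Braintrust or Langfuse offer far more than this.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass(frozen=True)
class EvalCase:
    name: str
    prompt: str
    check: Callable[[str], bool]   # returns True when the output is acceptable


def run_evals(agent: Callable[[str], str],
              cases: List[EvalCase]) -> Tuple[float, List[str]]:
    """Return the measured success rate and the names of the failing cases."""
    if not cases:
        raise ValueError("an eval run needs at least one case")
    failures = []
    for case in cases:
        try:
            passed = case.check(agent(case.prompt))
        except Exception:
            passed = False           # a crash counts as a failure, not a skip
        if not passed:
            failures.append(case.name)
    return 1 - len(failures) / len(cases), failures


def release_ready(success_rate: float, threshold: float = 0.95) -> bool:
    """Gate shipping on the measured rate, the way CI gates on tests."""
    return success_rate >= threshold


# Toy agent standing in for a real one: it just upper-cases the prompt.
toy_agent = lambda prompt: prompt.upper()
cases = [
    EvalCase("keeps content", "refund order 17", lambda out: "17" in out),
    EvalCase("upper-cases", "hello", lambda out: out == "HELLO"),
    EvalCase("non-empty", "x", lambda out: len(out) > 0),
    EvalCase("adds greeting", "hi", lambda out: out.startswith("Dear")),
]
rate, failed = run_evals(toy_agent, cases)
print(f"success rate: {rate:.0%}, failing: {failed}, ship: {release_ready(rate)}")
```

The list of failing case names matters as much as the rate: it answers the second half of the question, what failure looks like.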

Separation of planning and execution

One of the cleanest architectural changes emerging in 2025 is decoupling the agent that makes a plan from the agent that executes steps. The planner produces a structured task graph — explicit steps, dependencies, expected outputs, fallback paths — before any tool is called. The executor is then given one step at a time, with no visibility into the broader plan.


This does two things. It makes plans auditable by humans before execution starts. And it prevents the executor from "reasoning its way" into scope creep — the tendency of single-agent systems to expand their own task definition mid-run.
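
A rough sketch of that split, assuming the planner's output has already been parsed into a list of steps: the plan is a dependency graph that can be inspected before anything runs, and the executor receives one instruction plus the outputs of its direct dependencies, nothing else. `Step`, `execution_order`, and `run_plan` are hypothetical names, and the executor here is a stand-in function rather than a model call.

```python
from dataclasses import dataclass
from graphlib import TopologicalSorter
from typing import Callable, Dict, List, Tuple


@dataclass(frozen=True)
class Step:
    id: str
    instruction: str
    depends_on: Tuple[str, ...] = ()


def execution_order(plan: List[Step]) -> List[str]:
    """Validate the plan and return step ids in dependency order."""
    known = {step.id for step in plan}
    for step in plan:
        missing = set(step.depends_on) - known
        if missing:
            raise ValueError(f"step {step.id!r} depends on unknown steps: {sorted(missing)}")
    graph = {step.id: set(step.depends_on) for step in plan}
    return list(TopologicalSorter(graph).static_order())   # raises CycleError on cycles


def run_plan(plan: List[Step],
             executor: Callable[[str, Dict[str, str]], str]) -> Dict[str, str]:
    """Feed the executor one step at a time; it never sees the whole plan."""
    steps = {step.id: step for step in plan}
    results: Dict[str, str] = {}
    for step_id in execution_order(plan):
        step = steps[step_id]
        inputs = {dep: results[dep] for dep in step.depends_on}
        results[step_id] = executor(step.instruction, inputs)
    return results


plan = [
    Step("send", "send email", depends_on=("summarize",)),
    Step("summarize", "summarize ticket", depends_on=("fetch",)),
    Step("fetch", "fetch ticket"),
]
print(execution_order(plan))   # a human can review this before any tool is called
results = run_plan(plan, lambda instruction, inputs: f"{instruction} <- {sorted(inputs)}")
print(results["send"])
```

Because the executor's signature contains only an instruction and its inputs, widening the task mid-run would require changing the plan, and the plan is the part a human has already reviewed.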


Cognition AI (the team behind Devin) and Cohere have both published architectural thinking in this direction. Microsoft's AutoGen 0.4, released in late 2024, made explicit actor-observer separation a core design principle rather than an optional pattern.

Human checkpoints as a feature, not a failure mode

The implicit assumption of most early agentic systems was that human involvement was a sign of failure — the goal was full automation. That assumption created brittle systems that couldn't ask for help when stuck.


The better framing is human-on-the-loop: agents that can identify their own uncertainty, pause at defined confidence thresholds, and surface specific questions rather than either failing silently or bulldozing through. This isn't less automation. It's more reliable automation — one that knows its own limits.


Some of the most mature agentic deployments in 2025 have hardcoded interruption points: before any write to a production database, before any external communication to a customer, before any purchase over a defined threshold. These checkpoints don't slow the system down meaningfully. They prevent the class of failures that make companies stop using agentic systems entirely.
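
Hardcoded interruption points like these amount to a policy check that runs before every side effect. The following is a minimal sketch under stated assumptions: the action kinds, the `prod` naming convention, the 500 purchase limit, and the `ask_human` callback are all hypothetical placeholders for whatever a real deployment uses.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class ProposedAction:
    kind: str            # e.g. "db_write", "customer_email", "purchase", "read"
    target: str
    amount: float = 0.0


PURCHASE_LIMIT = 500.0   # hypothetical threshold, set by policy rather than by the agent


def needs_approval(action: ProposedAction) -> Optional[str]:
    """Return the reason a human must sign off, or None if the action may proceed."""
    if action.kind == "db_write" and action.target.startswith("prod"):
        return "write to a production database"
    if action.kind == "customer_email":
        return "external communication to a customer"
    if action.kind == "purchase" and action.amount > PURCHASE_LIMIT:
        return f"purchase over {PURCHASE_LIMIT:.2f}"
    return None


def execute(action: ProposedAction,
            perform: Callable[[ProposedAction], None],
            ask_human: Callable[[ProposedAction, str], bool]) -> str:
    """Run the action, pausing for approval only at the defined checkpoints."""
    reason = needs_approval(action)
    if reason is not None and not ask_human(action, reason):
        return "blocked"
    perform(action)
    return "done"


performed = []
deny = lambda action, reason: False   # stand-in for a real approval UI
print(execute(ProposedAction("read", "prod_orders"), performed.append, deny))
print(execute(ProposedAction("purchase", "vendor-42", amount=900.0), performed.append, deny))
```

Reads and small purchases pass straight through, so the checkpoint costs nothing on the common path. Only the handful of irreversible actions wait for a person.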

The Platform Layer Is Where This Gets Decided

None of these patterns are trivial to implement from scratch. Structured output handling, eval harnesses, planning/execution separation, and configurable human checkpoints each require engineering investment. For most teams, that investment isn't the core product — it's infrastructure.


This is where the platform layer matters. The next wave of agentic infrastructure isn't about giving teams more powerful models or more tools — it's about giving teams the scaffolding that makes agents reliable enough to trust with real work.

At Amantra.ai, this is exactly the problem we're working on. The platform is built around the premise that agent reliability — not agent capability — is the bottleneck for most enterprise deployments. That means structured orchestration, observable execution, built-in eval loops, and configurable human-in-the-loop checkpoints, without requiring teams to build that scaffolding themselves.


The agents that work aren't smarter. They're better designed.

The One Thing Worth Taking Away

The current moment in agentic AI looks a lot like the early days of microservices: the architecture works, but the tooling for making it production-grade hasn't caught up. Teams that went all-in on microservices before service meshes, distributed tracing, and circuit breakers existed paid a serious operational tax.


The same reckoning is coming for agentic systems — and in many places, it's already here. The teams that will come out ahead aren't the ones with the most ambitious agent designs. They're the ones that treated reliability as a first-class requirement from the start.


The stack is breaking because it was designed for demos, not for work. What replaces it won't look dramatically different on the surface. Underneath, it'll be engineered to actually finish the job.


Amantra.ai builds the orchestration and reliability infrastructure for enterprise agentic deployments. If your team is hitting the production ceiling with the current stack, see how Amantra approaches it.