2026-02-10

Patterns in Agentic AI Systems — What Production Actually Looks Like

The demos are impressive. An agent browses the web, writes code, runs it, fixes its errors, and delivers a working result. In a controlled environment, with the right model and a hand-picked task, this genuinely works. Then you try to ship it and the problems begin.

After building and reviewing a number of agentic systems in production — from document processing pipelines to multi-step customer-facing assistants — I've noticed the same failure modes appear repeatedly. This post is about those patterns: what holds up, what doesn't, and how to think about the engineering decisions that matter most.

What "Agentic" Actually Means

The word gets used loosely. In practice, an agentic system is one where a language model takes actions — calling tools, writing to state, spawning sub-tasks — and makes decisions about which actions to take next based on intermediate results. The feedback loop is the key part. A simple RAG query is not agentic. A system that retrieves context, tries an action, observes the result, and decides what to do next is.

Most real agentic systems exist on a spectrum between single-turn tool use (one decision, one action) and fully autonomous loops (unbounded decisions, arbitrary state). Almost everything interesting lives in the middle.

The Orchestration Problem

The first hard decision is: who drives the loop? There are two broad patterns.

LLM-as-orchestrator: The model decides at each step which tool to call, what arguments to pass, and whether the task is complete. This is flexible and requires little scaffolding, but it leaks control. The model can get stuck in loops, misinterpret tool results, or decide it is done when it isn't.
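
A minimal sketch of this pattern, assuming a hypothetical call_model function that returns either a tool invocation or a final answer (the real wire format varies by provider) and a TOOLS registry mapping names to callables:

    import json

    # Hypothetical stand-ins, not a real SDK: call_model wraps your provider's
    # chat API, TOOLS maps tool names to plain Python callables.
    def call_model(messages: list) -> dict:
        """Returns {"type": "tool_call", "name": ..., "args": {...}}
        or {"type": "final", "content": ...}."""
        raise NotImplementedError

    TOOLS: dict = {}  # name -> callable

    def llm_orchestrated(task: str, max_steps: int = 10) -> str:
        messages = [{"role": "user", "content": task}]
        for _ in range(max_steps):  # hard cap: the model never loops forever
            decision = call_model(messages)
            if decision["type"] == "final":
                return decision["content"]
            # The model chose both the tool and its arguments; code just executes.
            result = TOOLS[decision["name"]](**decision["args"])
            messages.append({"role": "tool", "content": json.dumps(result)})
        return "stopped: exceeded max_steps"  # graceful exit, not an exception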

Code-as-orchestrator: A deterministic script drives the sequence — calling the model only for specific sub-decisions — and uses the model's output to parameterise the next step. This is more predictable and easier to test, but it moves reasoning back to the programmer.
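
For contrast, a code-driven sketch of the same idea applied to a document pipeline: the sequence is fixed in code, and the model is consulted only for the one genuinely ambiguous sub-decision. extract_text, parse_structured_fields, and route_to_pipeline are hypothetical helpers, and call_model is the same stand-in as above:

    def extract_text(document: bytes) -> str: ...          # hypothetical parser
    def parse_structured_fields(text: str) -> dict: ...    # hypothetical schema pass
    def route_to_pipeline(category: str, fields: dict) -> dict: ...  # dispatcher

    def code_orchestrated(document: bytes) -> dict:
        # Deterministic steps: plain code, fully testable, no model involved.
        text = extract_text(document)
        fields = parse_structured_fields(text)

        # The one genuinely ambiguous sub-decision goes to the model.
        category = call_model([{
            "role": "user",
            "content": "Classify as invoice/contract/other:\n" + text[:2000],
        }])["content"]

        # The model's answer parameterises the next deterministic step.
        return route_to_pipeline(category, fields)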

In practice, the right split is almost always more code and less model than your first instinct. Reserve model judgment for tasks that genuinely require it: interpreting ambiguous output, choosing between strategies, or summarising long context. Route everything else through deterministic logic.

State and Memory

Agents need memory. The naive solution — passing the full conversation history as context — fails quickly. Context windows fill up, per-call cost grows with the length of the history (so cumulative cost over a session grows roughly quadratically), and the model starts "forgetting" early instructions as the middle of the context inflates.

A more durable architecture separates memory into layers (a sketch of the wiring follows the list):

  • Working memory: The current task state, tool results, and immediate history. Short, fresh, high-fidelity.
  • Episodic memory: A compressed summary of what has happened so far. Periodically written by the model from working memory.
  • Semantic memory: Long-term facts, user preferences, or domain knowledge. Retrieved via vector search when relevant.
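
A sketch of how the layers might fit together, again using the hypothetical call_model from above for summarisation plus an assumed vector_search function over the semantic store:

    from dataclasses import dataclass, field

    WORKING_LIMIT = 20  # turns of full-fidelity history to keep

    @dataclass
    class AgentMemory:
        working: list = field(default_factory=list)  # recent turns, verbatim
        episodic: str = ""                           # rolling compressed summary
        # semantic memory lives in an external vector store, queried on demand

    def compact(mem: AgentMemory) -> None:
        """When working memory overflows, fold the older turns into the summary."""
        if len(mem.working) <= WORKING_LIMIT:
            return
        keep = WORKING_LIMIT // 2
        old, mem.working = mem.working[:-keep], mem.working[-keep:]
        mem.episodic = call_model([{
            "role": "user",
            "content": f"Update this summary:\n{mem.episodic}\nwith:\n{old}",
        }])["content"]

    def build_context(mem: AgentMemory, query: str) -> list:
        facts = vector_search(query, k=5)  # assumed semantic-memory lookup
        return [
            {"role": "system", "content": f"Summary so far: {mem.episodic}"},
            {"role": "system", "content": f"Relevant facts: {facts}"},
            *mem.working,
        ]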

The retrieval step in semantic memory is itself an algorithmic problem: nearest-neighbour search over high-dimensional embeddings, which is why vector database performance matters. High retrieval latency inside a tight agent loop compounds quickly.
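
At small scale the core operation fits in a few lines: a brute-force cosine-similarity search with numpy. Real vector databases replace this linear scan with approximate indexes (HNSW, IVF) precisely because it stops being fast:

    import numpy as np

    def top_k(query_vec: np.ndarray, index: np.ndarray, k: int = 5) -> np.ndarray:
        """Brute-force nearest neighbours by cosine similarity.

        index is an (n, d) matrix of stored embeddings, query_vec a (d,) vector;
        returns the row indices of the k most similar entries.
        """
        # Normalise so a dot product is cosine similarity.
        index_n = index / np.linalg.norm(index, axis=1, keepdims=True)
        q = query_vec / np.linalg.norm(query_vec)
        sims = index_n @ q                # one full pass over all n vectors
        return np.argsort(-sims)[:k]      # fine at small n, untenable at 10^8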

Tool Design is API Design

The quality of your tool definitions determines more of your agent's behaviour than your system prompt. A tool that does too much creates ambiguity about when to call it. A tool with a poorly described return schema creates parsing errors. A tool with side effects the model doesn't expect creates dangerous surprises.

Design tools like you design public APIs: minimal surface, clear contracts, explicit errors. If a tool can fail, describe what failure looks like and what the agent should do. If a tool has irreversible effects — sending a message, making a payment, writing to a database — add a confirmation step or separate the "plan" from the "execute" action so a human or a guard model can review before committing.
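
One way to separate "plan" from "execute", as a sketch: the model is only ever allowed to produce a plan object, and a distinct, reviewed code path performs the irreversible action. PaymentPlan, review, and execute_payment are illustrative names, not from any particular library:

    from dataclasses import dataclass

    @dataclass(frozen=True)
    class PaymentPlan:
        """What the model is allowed to produce: a description, not an action."""
        recipient: str
        amount_cents: int
        reason: str

    def review(plan: PaymentPlan) -> bool:
        """Human approval or a guard-model check goes here; nothing is sent yet."""
        return plan.amount_cents <= 10_000  # e.g. auto-approve only small amounts

    def execute_payment(plan: PaymentPlan) -> None:
        """The only code path with the irreversible side effect."""
        if not review(plan):
            raise PermissionError(f"Plan rejected: {plan}")
        # ... call the actual payment API here ...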

Failure Modes in Production

The failures I see most often:

Hallucinated tool calls: The model invents tool names or arguments that don't exist in the schema. Validate strictly at the boundary, not inside the tool.
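
Boundary validation can start as simply as this standard-library sketch; in practice a schema library such as pydantic or jsonschema does the same job more thoroughly:

    SCHEMA = {  # illustrative: one entry per tool, argument name -> expected type
        "search_orders": {"customer_id": str, "limit": int},
    }

    def validate_call(name: str, args: dict) -> dict:
        """Reject hallucinated tools, unknown arguments, and wrong types
        before anything reaches tool code."""
        if name not in SCHEMA:
            raise ValueError(f"unknown tool: {name!r}")
        expected = SCHEMA[name]
        unknown = set(args) - set(expected)
        if unknown:
            raise ValueError(f"unknown arguments for {name}: {sorted(unknown)}")
        for key, typ in expected.items():
            if key not in args or not isinstance(args[key], typ):
                raise ValueError(f"argument {key!r} must be a {typ.__name__}")
        return args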

Infinite loops: The agent retries a failing step indefinitely because nothing defines when to stop. Always set a maximum iteration count and a graceful degradation path.
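
A sketch of that exit condition, with a degradation path instead of a bare exception (step and fallback stand in for whatever callables your system provides):

    MAX_ATTEMPTS = 3

    def attempt_with_budget(step, fallback):
        """Try a flaky step a bounded number of times, then degrade gracefully."""
        last_error = None
        for _ in range(MAX_ATTEMPTS):
            try:
                return step()
            except Exception as e:   # a real system would catch narrower errors
                last_error = e
        # Out of budget: escalate to a human, return a partial result, or fall
        # back to a simpler non-agentic path -- anything but another retry.
        return fallback(last_error)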

Cascading context pollution: An early error populates the context with misleading state, and every subsequent step makes it worse. Implement checkpointing so you can roll back to a known-good state.
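
Checkpointing can start as snapshots of the message history at known-good boundaries; a minimal in-memory sketch (a real system would persist these):

    import copy

    class CheckpointedContext:
        """Snapshot the message history at known-good points; roll back on pollution."""

        def __init__(self):
            self.messages: list = []
            self._checkpoints: list = []

        def checkpoint(self) -> None:
            self._checkpoints.append(copy.deepcopy(self.messages))

        def rollback(self) -> None:
            """Discard everything since the last known-good snapshot."""
            if self._checkpoints:
                self.messages = self._checkpoints.pop()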

Over-instrumented systems: Adding more tools on the assumption that more capability helps. In practice it often makes routing harder and the model more likely to pick the wrong action. Start with the fewest tools that can solve the task.

Evaluation is Harder Than Building

A synchronous request-response API is easy to evaluate: it either returns the right answer or it doesn't. An agentic loop produces a trajectory — a sequence of decisions, actions, and intermediate states — and "correct" is often a property of the full trajectory, not just the final output.

Build evaluation into the system from day one. Log every tool call, every model decision, and every intermediate state. Define success metrics in terms the business cares about: task completion rate, steps-to-completion, cost per run, error rate. Use a small, representative set of golden tasks you can run after every model or prompt change.
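
A minimal golden-task harness, as a sketch: run_agent stands in for whatever entry point your system exposes, and each task's check is a predicate over the full logged trajectory (assumed here to be a dict with "steps" and "cost" keys):

    from dataclasses import dataclass
    from typing import Callable

    @dataclass
    class GoldenTask:
        name: str
        prompt: str
        check: Callable[[dict], bool]  # predicate over the full trajectory

    def run_suite(tasks: list, run_agent: Callable[[str], dict]) -> None:
        """Run every golden task and report the metrics the business cares about."""
        for task in tasks:
            trajectory = run_agent(task.prompt)  # assumed to log steps and cost
            status = "PASS" if task.check(trajectory) else "FAIL"
            print(f"{task.name}: {status} "
                  f"steps={len(trajectory['steps'])} cost=${trajectory['cost']:.4f}")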

Agentic systems are not magic. They are software with a probabilistic component, and they need the same engineering discipline — testing, observability, graceful degradation — as any other production system. The AI part is the easy part. The engineering is where the real work is.