This note covers the agent discipline: wrapping an LLM in a loop where it reasons, calls tools, observes the results, and repeats toward a goal — turning a system that answers into one that acts. It spans the agentic loop (ReAct), tool use and function calling, planning, memory, multi-agent orchestration patterns, and the reliability problems unique to multi-step autonomy. It builds directly on retrieval (just one tool the agent can call) and evaluation (which must move to the trajectory level), and excludes model internals and training, covered in other notes.
A bare LLM is inert and forgetful. It maps text to text in a single pass: it takes no actions in the world, keeps nothing between calls, and gets one shot of reasoning before it must answer. Plenty of real tasks need more than that — several dependent steps, external actions like searching, running code, or calling an API, and the ability to adapt based on what each step returns.
An agent is the wrapper that supplies all of it: an LLM placed in a loop, given tools and memory.[3] Three things mark a system as agentic — it uses tools, it runs multi-step loops, and it is goal-directed.[3] But autonomy cuts both ways. Every step the agent takes is another place it can go wrong, and unlike a crashed API call, an agent can fail while looking perfectly confident — returning a clean, well-formatted answer that is simply incorrect.[7] The discipline is about getting useful autonomy without letting that failure surface run away.
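The loop described above can be sketched in a few lines. This is a minimal illustration, not a reference implementation: `scripted_model` is a hypothetical stand-in for an LLM call, the `add` tool and the transcript format are invented for the example, and `max_steps` is an assumed safety cap.

```python
from typing import Callable

# Hypothetical stand-in for an LLM: given the transcript so far, it returns
# either ("call", tool_name, argument) or ("final", answer).
def scripted_model(transcript: list[str]) -> tuple:
    if not any(line.startswith("observation:") for line in transcript):
        return ("call", "add", "2,3")
    return ("final", transcript[-1].removeprefix("observation: "))

# Illustrative tool registry: name -> function taking and returning a string.
TOOLS: dict[str, Callable[[str], str]] = {
    "add": lambda arg: str(sum(int(x) for x in arg.split(","))),
}

def run_agent(goal: str, model=scripted_model, max_steps: int = 5) -> str:
    transcript = [f"goal: {goal}"]
    for _ in range(max_steps):
        decision = model(transcript)          # reason
        if decision[0] == "final":
            return decision[1]                # goal met: stop looping
        _, tool_name, arg = decision
        observation = TOOLS[tool_name](arg)   # act
        transcript.append(f"action: {tool_name}({arg})")
        transcript.append(f"observation: {observation}")  # observe
    return "stopped: step limit reached"

print(run_agent("what is 2 + 3?"))
```

Swapping `scripted_model` for a real model call leaves the loop unchanged. The step cap is what keeps a confused model from running forever.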
Tools are what turn understanding into doing. The model emits a structured tool call with parameters; your code parses it, executes the function, and feeds the result back as the next observation.[3] This is the dividing line: a chatbot responds with text, while an agent executes real actions in external systems through function calling.[2] A "tool" can be a web search, a database query, a code interpreter, or any API.
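The parse, execute, and feed-back cycle can be sketched as follows. The JSON shape (`name` plus `arguments`), the `get_weather` tool, and the observation dictionary are assumptions made for illustration. Real function-calling APIs each define their own wire format.

```python
import json

def get_weather(city: str) -> str:
    # Hypothetical tool; a real one would call an external API.
    return f"Sunny in {city}"

TOOL_REGISTRY = {"get_weather": get_weather}

def execute_tool_call(raw_call: str) -> dict:
    """Parse a JSON tool call, run the tool, and return an observation."""
    try:
        call = json.loads(raw_call)
        name, args = call["name"], call["arguments"]
    except (json.JSONDecodeError, KeyError, TypeError) as exc:
        return {"ok": False, "error": f"malformed tool call: {exc}"}

    tool = TOOL_REGISTRY.get(name)
    if tool is None:
        return {"ok": False, "error": f"unknown tool: {name}"}

    try:
        return {"ok": True, "result": tool(**args)}
    except TypeError as exc:
        # Wrong or missing parameters surface as an explicit failure
        # instead of being passed back to the model as a success.
        return {"ok": False, "error": f"bad arguments: {exc}"}

print(execute_tool_call('{"name": "get_weather", "arguments": {"city": "Oslo"}}'))
```

Returning a structured error as the observation, rather than raising, gives the model a chance to correct its own call on the next step.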
Planning decomposes a complex goal into manageable subtasks.[2] A planner–executor split assigns one role to decide the high-level plan and another to carry out each step and report back.[5] Planning upfront, rather than re-reasoning at every step, can be cheaper and can even let independent steps run in parallel, a meaningful saving when each LLM call costs money.[1] Skipping planning entirely produces a reactive agent: fast, but brittle and prone to drifting on complex tasks.[3]
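A planner–executor split with parallel steps might look like the sketch below. `make_plan` and `execute_step` are hypothetical placeholders for LLM-backed roles, and representing the plan as a list of stages (steps within a stage are independent) is an assumption chosen to keep the example short.

```python
from concurrent.futures import ThreadPoolExecutor

def make_plan(goal: str) -> list[list[str]]:
    # Hypothetical planner: a real one would be a single upfront LLM call.
    # Each inner list is a stage whose steps do not depend on each other.
    return [["search flights", "search hotels"], ["compare total cost"]]

def execute_step(step: str) -> str:
    # Hypothetical executor: a real one would call tools or a model.
    return f"done: {step}"

def run_plan(goal: str) -> list[str]:
    results: list[str] = []
    with ThreadPoolExecutor() as pool:
        for stage in make_plan(goal):
            # Steps inside a stage run concurrently; stages run in order.
            results.extend(pool.map(execute_step, stage))
    return results

print(run_plan("plan a trip"))
```

The planner is consulted once, so the cost of deciding what to do is paid upfront instead of at every step.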
Memory turns a stateless model into one that tracks progress across steps and sessions and can learn from experience[6] — it is how an agent remembers what it already tried and what its goal was. Reflection adds a layer where the agent critiques its own output and revises before proceeding, an optional quality loop that can be bolted onto any pattern.[4]
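As a rough sketch of reflection combined with a simple memory of past attempts: the three callables are hypothetical stand-ins for model calls, and the `max_rounds` cap is an assumption to keep the quality loop bounded.

```python
def reflect_and_revise(draft_fn, critique_fn, revise_fn, max_rounds: int = 3):
    """Draft, critique, and revise until the critique passes or rounds run out."""
    memory = []  # records each attempt and the issue found with it
    draft = draft_fn()
    for _ in range(max_rounds):
        issue = critique_fn(draft)      # None means the critic is satisfied
        memory.append((draft, issue))
        if issue is None:
            break
        draft = revise_fn(draft, issue)
    # If the rounds run out, the last revision is returned without a final critique.
    return draft, memory

# Toy stand-ins for the three roles.
final, attempts = reflect_and_revise(
    draft_fn=lambda: "total = 4",
    critique_fn=lambda d: None if "USD" in d else "missing unit",
    revise_fn=lambda d, issue: d + " USD",
)
print(final)
print(attempts)
```

The `memory` list is what lets later steps see what was already tried and why it was rejected.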
Beyond a single agent, work can be distributed. A router lets the LLM decide which tool or sub-agent to call.[4] A manager pattern has a central agent delegate subtasks to specialists. An orchestrator–worker pattern has a lead agent spawn workers to explore threads in parallel — in one published case this multi-agent setup beat a single agent by about 90% on internal research evaluations.[1] Which pattern fits depends on the task's latency, cost, and reliability budget; the coordination overhead should clearly earn its keep, so simpler is usually safer.[6]
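The router pattern, the simplest of these, can be sketched as below. The keyword-based `classify` function is a deliberately crude stand-in for the LLM's routing decision, and the specialist names are invented for the example.

```python
def classify(request: str) -> str:
    # Stand-in for an LLM routing decision; keywords are illustrative only.
    text = request.lower()
    if "refund" in text:
        return "billing"
    if "error" in text or "crash" in text:
        return "support"
    return "general"

# Hypothetical sub-agents, each reduced to a plain function here.
SPECIALISTS = {
    "billing": lambda req: f"billing agent handles: {req}",
    "support": lambda req: f"support agent handles: {req}",
    "general": lambda req: f"general agent handles: {req}",
}

def route(request: str) -> str:
    label = classify(request)
    # Fall back to the general agent if the router returns an unknown label.
    handler = SPECIALISTS.get(label, SPECIALISTS["general"])
    return handler(request)

print(route("I want a refund"))
```

The fallback matters in practice: a model-driven router can return a label that no specialist is registered for.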
Reliability is what separates a demo from production. Small early errors compound: a wrong inference at step three propagates forward into confident but increasingly wrong downstream reasoning, a butterfly effect across long trajectories.[9] And capability is not consistency. On one retail-agent benchmark, a leading model scored 61% when judged on a single attempt but only 25% when required to succeed across several repeats,[8] a gap that a happy-path demo never shows.
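The difference between single-attempt and repeated-attempt scoring can be made concrete with a small calculation. The run data below is hypothetical, and the two functions are one simple way to compute the two views, not necessarily the exact metric any particular benchmark uses.

```python
# Hypothetical results: task id -> success flag for each repeated run.
RUNS = {
    "a": [True, True, True, True],
    "b": [True, True, False, True],
    "c": [True, False, True, False],
    "d": [False, False, False, False],
}

def single_attempt_rate(runs: dict[str, list[bool]]) -> float:
    # Average success over every individual attempt.
    outcomes = [ok for task_runs in runs.values() for ok in task_runs]
    return sum(outcomes) / len(outcomes)

def all_attempts_rate(runs: dict[str, list[bool]]) -> float:
    # Fraction of tasks that succeeded on every one of their attempts.
    return sum(all(task_runs) for task_runs in runs.values()) / len(runs)

print(single_attempt_rate(RUNS))  # 0.5625
print(all_attempts_rate(RUNS))    # 0.25
```

Only task `a` is solved every time, so the strict score is far below the per-attempt average. For intuition: if attempts were independent with success probability p, the chance of succeeding on all k of them would be p^k, which shrinks quickly as k grows.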
The failures take recognizable shapes, and each needs a different defense.[7]
Compounding errors: a small early mistake amplifies across steps into a badly wrong trajectory.
Silent failure: the agent returns a confident, well-formatted answer that is simply wrong, or treats a tool call with bad parameters as a success.
Goal drift: over a long horizon the agent loses its grip on the original objective.
Runaway loops: the loop spirals into repeated calls, inflating latency and cost without converging.
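One possible defense against the last of these failures is a guard that caps total steps and flags identical repeated tool calls. The class name, thresholds, and stop messages below are assumptions for illustration.

```python
import json
from collections import Counter
from typing import Optional

class LoopGuard:
    """Stops an agent loop on a step budget or on repeated identical calls."""

    def __init__(self, max_steps: int = 10, max_repeats: int = 2):
        self.max_steps = max_steps
        self.max_repeats = max_repeats
        self.steps = 0
        self.seen: Counter = Counter()

    def should_stop(self, tool_name: str, arguments: dict) -> Optional[str]:
        """Return a stop reason, or None if the call may proceed."""
        self.steps += 1
        if self.steps > self.max_steps:
            return "step budget exhausted"
        # Canonicalize arguments so key order does not hide a repeat.
        key = (tool_name, json.dumps(arguments, sort_keys=True))
        self.seen[key] += 1
        if self.seen[key] > self.max_repeats:
            return "repeated identical tool call"
        return None

guard = LoopGuard(max_steps=5, max_repeats=2)
for _ in range(3):
    reason = guard.should_stop("search", {"q": "cheap flights"})
print(reason)
```

Calling `should_stop` before each tool execution turns a silent spiral into an explicit, loggable stop reason.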
Strip away the variations and an agent is one LLM core in a loop, surrounded by the parts that make the loop smarter: a planner to break work down, a memory to hold state, tools to act with, and — optionally — other agents to coordinate.
The throughline: a chatbot maps input to output; an agent closes the loop, turning each observation into the next decision until the goal is met — and everything else (planning, memory, orchestration) exists to make that loop smarter, faster, and more reliable. The discipline knits directly into the rest of the roadmap. Retrieval is just one tool the agent calls, so context engineering becomes about what each step puts in front of the model. Adaptation can produce the specialized models that serve as sub-agents. And evaluation has to level up: you can no longer score only the final answer — you must assess the trajectory, the path the agent took, using the session-, trace-, and span-level evals introduced in the first note.[7]
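Trajectory-level evaluation can be sketched as a set of checks over the recorded steps rather than a single comparison of the final answer. The `Span` record and the four checks are illustrative assumptions, not the schema of any specific tracing tool.

```python
from dataclasses import dataclass

@dataclass
class Span:
    tool: str   # which tool the step called
    ok: bool    # whether the call succeeded

def evaluate_trajectory(spans, final_answer, expected_answer,
                        required_tools, max_steps):
    """Score the path the agent took, not only where it ended up."""
    tools_used = {span.tool for span in spans}
    return {
        "answer_correct": final_answer == expected_answer,
        "required_tools_called": set(required_tools) <= tools_used,
        "no_failed_calls": all(span.ok for span in spans),
        "within_step_budget": len(spans) <= max_steps,
    }

trace = [Span("search", True), Span("lookup_price", False), Span("search", True)]
report = evaluate_trajectory(
    trace, final_answer="42", expected_answer="42",
    required_tools=["search", "lookup_price"], max_steps=5,
)
print(report)
```

In this toy trace the final answer matches, yet `no_failed_calls` is false: the answer was reached despite a failed tool call, which answer-only scoring would never reveal.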
Autonomy multiplies the failure surface. Every extra step or sub-agent is another place for errors to compound, so keep the architecture as simple as the task allows. Plan upfront, cap iterations, and summarize periodically to fight goal drift — add orchestration only when it clearly earns its cost.
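Periodic summarization can be sketched as a compaction step that always keeps the goal pinned at the top of the context. The function name, the `keep_last` window, and the default summarizer (a placeholder for an LLM summary call) are assumptions.

```python
def compact_history(goal, history, keep_last=4, summarize=None):
    """Keep the goal and recent steps verbatim; collapse older steps into a summary."""
    if summarize is None:
        # Placeholder summarizer; a real one would ask a model to summarize.
        summarize = lambda steps: f"{len(steps)} earlier steps omitted"
    split = max(len(history) - keep_last, 0)
    older, recent = history[:split], history[split:]
    if not older:
        return [f"goal: {goal}", *history]
    return [f"goal: {goal}", f"summary: {summarize(older)}", *recent]

steps = ["s1", "s2", "s3", "s4", "s5", "s6"]
print(compact_history("book a trip", steps, keep_last=2))
```

Restating the goal on every compaction is the part aimed at goal drift. The summary itself mainly controls context length and cost.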
A working demo is not a reliable agent. Single happy-path success can hide collapse across repeats, so measure reliability over many runs, not one, and evaluate the path the agent took — not just whether the final output looked right.