You’ve heard the names: LangChain, CrewAI, AutoGen. You’ve seen the diagrams - tool registries, memory layers, multi-agent orchestration, evaluation harnesses - and you may have tried one, set up a tool, written a prompt, started a loop, and then spent an afternoon tracing why the agent was doing something you didn’t tell it to do, inside layers of abstraction that obscured the five lines of actual logic.
This post traces the agent stack back to the model API call - not to be comprehensive, but to show the seams. The stack didn’t get designed; it grew from a series of accidents, and when things grow, you can tell where the accidents are.
What Is an Agent?
Russell and Norvig define an agent as anything that perceives its environment through sensors and acts upon it through actuators, so a thermostat is an agent, and so is a single call to model.chat(messages) - it reads a prompt, produces a completion. Perceive, act.
But nobody calls a single API call an “agent” in the current sense - the word carries extra weight now. When engineers say “LLM agent,” they mean something that can do multi-step reasoning, call tools, plan across steps, adapt to feedback, and remember across iterations, all of which require infrastructure between the model and the world.
Nothing in that list is magical; each one is a design choice with a default implementation you inherited without noticing, and the rest of this post is about which choices got made and which ones got handed down.
Step 1 - Completion
Natural language processing existed as a research discipline long before large language models, working with statistical methods - bags of words, n-grams, the usual toolkit - to classify text, detect spam, extract sentiment, answer questions. Then neural networks pushed the field forward, recurrent models came and went, and eventually the transformer arrived, which among other things could be configured to predict the next token in a sequence. That turned out to be a surprisingly useful framing, and the models that came out of it changed what was possible.
You might remember GPT-2 - the paper came out in February 2019 with only the smallest model version, and over the next few months larger versions followed, the largest at 1.5 billion parameters in November. The release announcement said something along the lines of humankind being under threat if the full model was released, which in 2026 reads as a curious artifact: the model is out, the internet has turned into an even stranger place, but I don’t think the bad things that happened over those years were because of GPT-2. Then came GPT-3 in May 2020, at 175 billion parameters, more than a hundred times larger than the largest GPT-2. The prediction engine underneath was the same throughout; what changed was scale, and with it, what the model could plausibly continue.
Unlike the models you’re used to now, where you ask something and get an answer, these early models were really just predicting text - you gave them the start of a story and they continued it, word by word. If you asked “What’s the weather today?” it might not answer the question but continue it as a line of small talk from some character in some conversation it had seen. You were passing a prefix and the model was continuing that prefix, and the entire game was controlling what sat to the left of the cursor so the continuation you wanted became the most probable next tokens.
There were a lot of open questions about how to make the model do what you asked rather than what it decided to predict on its own, and people developed a number of approaches. One of the most prominent was few-shot prompting: present several examples of a problem and its solution, sometimes with markup so you could parse the response out of the output, then give the model a new problem and expect it to follow the pattern. On top of that you layered a fair amount of parsing logic to handle failure modes - extra line breaks, the model ignoring the markup entirely, the model deciding to give a much longer response than you expected and getting cut off mid-sentence. All of that handling was something you had to build yourself.
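A minimal sketch of that era's workflow, assuming a sentiment task and an `<answer>` tag as the markup. The example data and the names `build_few_shot_prompt` and `parse_answer` are illustrative, not taken from any particular library.

```python
import re

# Hypothetical few-shot examples; any task with short inputs and outputs works.
EXAMPLES = [
    ("I loved this film.", "positive"),
    ("The battery died in an hour.", "negative"),
]


def build_few_shot_prompt(examples, new_input):
    """Lay out solved examples, then leave the last answer open for the model."""
    parts = [
        f"Review: {text}\nSentiment: <answer>{label}</answer>\n"
        for text, label in examples
    ]
    parts.append(f"Review: {new_input}\nSentiment: <answer>")
    return "\n".join(parts)


def parse_answer(completion):
    """Pull the label out of a raw continuation, tolerating common failures."""
    # The prompt already ends with "<answer>", so the label is whatever comes
    # before the closing tag. If the tag is missing (markup ignored, or the
    # response was cut off), fall back to the first non-empty line.
    match = re.search(r"(.*?)</answer>", completion, flags=re.DOTALL)
    if match:
        text = match.group(1)
    else:
        lines = [line for line in completion.splitlines() if line.strip()]
        text = lines[0] if lines else ""
    return text.strip().lower() or None


prompt = build_few_shot_prompt(EXAMPLES, "Great value.")
```

The prompt string would go to a completion endpoint, and `parse_answer` would run on whatever came back. Most of the function is failure handling, which is roughly the proportion people ended up with in practice.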
The entire interface was that single string. You were sculpting text to coax a prediction, and that string was the application logic, the protocol, and the user interface, all collapsed into one artifact. That collapse matters, because every layer that follows - chat templates, structured output, function calling - is just that same prompt template wearing a costume, and the prediction engine underneath never changes.
Step 2 - Chat Templating
Around 2022, chat models arrived (OpenAI’s InstructGPT paper). Underneath, they were the same completion models, fine-tuned to follow a chat pattern - User: ... Chatbot: ... User: ... Chatbot: ... - so if you fed that structure as a prefix, the model would continue with the next assistant turn. You could finally have a conversation without role-playing through the prompt, hunting for few-shot examples to coax the behavior you wanted. And you could expect the model’s output to be a meaningful response rather than free-form continuation.
It wasn’t clean. Models tended to prefix answers with preamble - “Of course I can answer this question, here is the answer” - and the exact phrasing varied enough that you couldn’t regex it out. People tried asking for “no fluff, just the answer,” or wrapping responses in XML tags or Markdown fences. Markdown in particular became the lingua franca for interacting with language models: useful for structuring output, convenient for encoding headers and formatting, and models had seen enough of it during training that they followed the patterns reliably.
On the code side, the API shifted from passing a single string to model.complete() to passing a list of chat turns to model.chat(). The first turn was from the user, the second from the chatbot, the third from the user again, and so on, with a new role appearing along the way: the system prompt. Rather than embedding behavioral instructions in your first user message - “pretend you’re a lawyer” or “be serious” - you now had a dedicated place that models paid attention to, and it persisted even when you sliced or shuffled the conversation history.
    messages = [
        {"role": "system", "content": "You are helpful."},
        {"role": "user", "content": "Hello"},
        {"role": "assistant", "content": "Hi! How can I help?"},
        {"role": "user", "content": "What's 2+2?"},
    ]
    output = model.chat(messages)
Under the hood, the template engine still concatenates everything into one big string with special tokens between turns, and the model still sees only that single string. In Step 1 the completion prompt was the context; now the message array is the context, with an extra layer of transformation in between. That convenience has a hidden cost: the template engine owns the projection from your structured messages to the flat prompt string, and you stop seeing how one becomes the other. The array and the prompt merge in your mind even though they are different objects, and when things go wrong you need to know which side of that seam is responsible: the message array you constructed or the prompt string the model actually reads.
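To make that seam visible, here is a rough sketch of what a template engine does. The turn markers are an assumption, loosely modeled on ChatML-style templates; every model family defines its own special tokens, and `render_chat` is a hypothetical name.

```python
# Hypothetical turn markers, loosely modeled on ChatML-style templates.
# Real models each define their own special tokens; these are assumptions.
TURN_START = "<|im_start|>"
TURN_END = "<|im_end|>"


def render_chat(messages):
    """Flatten a message array into the single string a model would read."""
    parts = [
        f"{TURN_START}{m['role']}\n{m['content']}{TURN_END}\n" for m in messages
    ]
    # Leave an open assistant turn so the model's continuation is the reply.
    parts.append(f"{TURN_START}assistant\n")
    return "".join(parts)


messages = [
    {"role": "system", "content": "You are helpful."},
    {"role": "user", "content": "What's 2+2?"},
]
prompt = render_chat(messages)
print(prompt)
```

Printing the rendered string once, for the model you actually use, is a cheap way to check which side of the seam a problem lives on.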
Step 3 - Structured Output
Engineers dealing with LLMs wanted models to return structured data, not plain text. The early approach was clumsy, if not embarrassing: ask the model nicely, then parse its prose and hope it contained valid JSON.
It worked poorly. The output came back with trailing commas, unescaped quotes inside strings, and Markdown fences wrapped around the JSON. The error-retry loop - “that’s invalid JSON, try again” - burned context tokens on parsing failures.
An anecdote from that period: I was extracting structured content from large document chunks using Llama 2. It had just been released and was much better than anything else I could run locally. The extraction mostly worked, but I kept seeing exceptions in the logs. When I enabled tracing and reproduced the error, I saw that the model first generated some “intro” text, which was expected. Then came the JSON I wanted, which I cut from the response and parsed. But to my surprise, the model then produced “On second thought, I think the previous JSON was wrong, here’s the corrected one” and continued with another JSON body, which confused my parsing scripts. Language models teach you that things can go wrong in surprising ways.
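Defensive parsing for that kind of response can be sketched with the standard library alone. This is not the script from the anecdote, just one way to tolerate a preamble and trailing text; `extract_first_json` is a hypothetical helper, and taking the first object rather than the last is a choice, not a rule.

```python
import json


def extract_first_json(text):
    """Return the first JSON object embedded in text, or None if there is none."""
    decoder = json.JSONDecoder()
    start = text.find("{")
    while start != -1:
        try:
            # raw_decode parses one JSON value starting at `start` and ignores
            # whatever follows, such as prose or a second JSON body.
            value, _end = decoder.raw_decode(text, start)
            return value
        except json.JSONDecodeError:
            # Not valid JSON from this brace; try the next one.
            start = text.find("{", start + 1)
    return None


response = (
    "Sure, here is the data:\n"
    '{"name": "Ada", "tags": ["x"]}\n'
    "On second thought, here is the corrected one:\n"
    '{"name": "Bob"}'
)
print(extract_first_json(response))  # {'name': 'Ada', 'tags': ['x']}
```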
Not long after, model-serving frameworks started implementing syntax-aware decoding, so the model could only produce tokens that matched the syntax definition. That gave us syntactically valid JSON, at first as “whatever it is, it has to be JSON”, and later, with more advanced grammar engines, outputs that matched a provided JSON Schema.
The line between data and code is blurry in computer architecture. Bytes are bytes - interpret them as instructions, and they become a program. Same with structured outputs from a model. Give it a scenario and a description of available functions and their parameters, and the structured output becomes guidance on which functions to call and how. Models quickly learned to produce function calls reliably, and client libraries wrap and invoke them automatically based on the model’s output — a “tool call” is just a JSON object saying “I want to call this function with these arguments.” Once the model reliably emits structured actions, a loop becomes possible: give the model state and available actions, receive which action to call, execute it, feed the result back, get the next one — reliable structured output and function calling are what made the agentic loop practical.
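The “data becomes code” step is small enough to sketch directly. The tools and the `{"name": ..., "arguments": ...}` shape below are assumptions for illustration; real APIs use similar but not identical envelopes.

```python
import json


def get_weather(city):
    """Hypothetical tool; a real one would call a weather service."""
    return f"Sunny in {city}"


def add(a, b):
    """Hypothetical tool that adds two numbers."""
    return a + b


# Registry mapping tool names the model may emit to Python callables.
TOOLS = {"get_weather": get_weather, "add": add}


def dispatch(tool_call_json):
    """Parse a tool call emitted by the model and invoke the matching function."""
    call = json.loads(tool_call_json)
    name = call["name"]
    if name not in TOOLS:
        # Return the error as data so it can be fed back to the model.
        return f"error: unknown tool {name!r}"
    return TOOLS[name](**call.get("arguments", {}))


print(dispatch('{"name": "add", "arguments": {"a": 2, "b": 3}}'))  # 5
```

The registry doubles as an allowlist: the model can only name functions you have put in the dictionary.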
Step 4 - Instruction-Tuned Models
Alongside chat models, there was another development. Not everybody needed long-form conversation - sometimes you just wanted extraction or transformation without the conversational wrapper - so instruct models emerged to fill that gap.
The breakthrough came with FLAN-T5. The idea: train on many datasets for many tasks, each described only by a natural language instruction, rather than on a single dataset for a single task. If you trained on fifteen of twenty datasets, the model would generalize to the five it had never seen - the actual numbers were different, but the result was the same. The model learned to follow instructions as a general skill, rather than memorizing task-specific patterns.
The instruct approach created a clean separation between instructions and data. You could keep the same instruction and swap in different data to get different outputs. That separation powered question-answering engines: “extract addresses from this text,” “summarize this document,” or even “here’s a chat, the character is this, here’s what they remember - what would the character say next?” Feed the instruction, feed the data, get a response.
Over time instruct and chat merged. Models became good at both, and the instruct capability got wrapped into the chat interface. You could frame an instruction as a system message, feed data as user messages, get responses as assistant messages. Chat syntax helped separate participants in a conversation, but the instruct approach was what pushed agentic patterns forward - it made it possible to define behavior through natural language and trust the model to follow it.
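In code, the separation amounts to holding the instruction fixed and varying only the data. A minimal sketch, with invented sample documents and a hypothetical `build_messages` helper:

```python
def build_messages(instruction, data):
    """Keep the instruction in the system turn and swap the data in the user turn."""
    return [
        {"role": "system", "content": instruction},
        {"role": "user", "content": data},
    ]


instruction = "Extract every street address from the text. Reply with one address per line."

# Invented sample inputs; in practice these would be your documents.
documents = [
    "Send the parcel to 12 Baker Street, then call me.",
    "Our office moved from 4 Elm Road to 9 Oak Avenue last year.",
]

# Same instruction, different data: one message array per document.
batches = [build_messages(instruction, doc) for doc in documents]
```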
For the agent loop, that reliability matters because the loop that emerges next is defined entirely by instructions at runtime - a system prompt saying “think about what to do, call a tool, reason about the result” replaces hardcoded state machines and hand-written branching logic, and instruction-tuned models are what made that work in practice.
Step 5 - RAG
There was a problem that didn’t go away with better prompting or bigger models: context windows were finite. Early models could handle 512 tokens. A token isn’t exactly a word - the rough statistic was one token equals 0.75 words - so 512 tokens gave you about 380 words. Later models reached 2048, then more, but the underlying constraint was that memory and compute scaled roughly quadratically with context size. Going from 2048 to 4096 tokens needed four times the memory, not two. Improvements came - memory-efficient attention, flash attention, KV caching - but the quadratic pressure never fully disappeared.
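The arithmetic is easy to check. The sketch below assumes the 0.75 words-per-token rule of thumb and models naive attention as one score per pair of tokens; it ignores everything else a real model allocates.

```python
WORDS_PER_TOKEN = 0.75  # rough rule of thumb, not an exact constant


def approx_words(tokens):
    """Estimate how many words fit in a given token budget."""
    return tokens * WORDS_PER_TOKEN


def attention_scores(tokens):
    """Entries in one full attention matrix: every token attends to every token."""
    return tokens * tokens


print(approx_words(512))                                # 384.0
print(attention_scores(4096) / attention_scores(2048))  # 4.0
```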
When you had a 100-page document and needed to answer a question about it, you couldn’t fit it all in the prompt. One approach was map-reduce: slice the document, process each piece, then combine results. Wasteful, slow, resource-heavy. Another approach came from a separate line of research: models that could turn a chunk of text - a few hundred tokens - into a fixed-size vector, where semantically similar texts produced similar vectors. Sourdough bread and baking cakes would have higher vector similarity than sourdough bread and building construction. Those embedding models required far less compute than large language models, which meant you could process an entire corpus cheaply.
The recipe emerged: slice a document into chunks of 512 or 2048 tokens with some overlap, run each chunk through the embedding model, and store the mapping between chunk, its position in the document, and its vector. When a query arrived, embed the query, do a similarity search, and retrieve the closest chunks. There were plenty of false positives - you often needed to take the top 20 or even top 100 results and then filter further - but it was still better than running the full corpus through an LLM with no optimization.
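The recipe fits in a few functions. In this sketch the embedding is a toy bag-of-words count, standing in for a real embedding model, and chunks are measured in words rather than tokens; both are simplifications so the example runs with no dependencies.

```python
import math
import re
from collections import Counter


def chunk_words(text, size=40, overlap=10):
    """Slice text into overlapping windows of words (a stand-in for token windows)."""
    if overlap >= size:
        raise ValueError("overlap must be smaller than size")
    words = text.split()
    chunks = []
    for start in range(0, len(words), size - overlap):
        chunks.append({"position": start, "text": " ".join(words[start:start + size])})
        if start + size >= len(words):
            break  # the last window already reaches the end of the text
    return chunks


def embed(text):
    """Toy embedding: word counts. A real system would call an embedding model."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def build_index(text, size=40, overlap=10):
    """Store each chunk with its position in the document and its vector."""
    return [{**c, "vector": embed(c["text"])} for c in chunk_words(text, size, overlap)]


def search(index, query, top_k=3):
    """Embed the query and return the top_k most similar chunks."""
    query_vector = embed(query)
    ranked = sorted(index, key=lambda e: cosine(e["vector"], query_vector), reverse=True)
    return ranked[:top_k]


doc = (
    "Sourdough bread needs a starter flour water and time to ferment. " * 3
    + "Concrete foundations need rebar cement gravel and curing time on site. " * 3
)
index = build_index(doc, size=12, overlap=4)
hits = search(index, "how do I bake sourdough bread", top_k=1)
print(hits[0]["position"], hits[0]["text"])
```

Swapping `embed` for a real embedding model and the sorted list for a vector index changes the quality and the speed, not the shape.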
That became RAG: inject the retrieved chunks into the prompt alongside the user’s question, and the model answers from them. The first-pass version was simple - retrieve chunks similar to the query, which works for “find a document about X” and breaks for queries that need reasoning about which document matters before retrieval can happen. Even when retrieval works, the chunks carry no structure about provenance: which document they came from, whether they contradict each other, whether they’re stale. They’re just text fragments in the prompt, and the model has to reconstruct relationships from raw text alone.
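The injection step itself is plain string assembly. A minimal sketch, with a hypothetical `build_rag_messages` helper and an invented system instruction; note that the chunks arrive as numbered text fragments and nothing more.

```python
def build_rag_messages(question, chunks):
    """Inject retrieved chunks into the prompt alongside the user's question."""
    context = "\n\n".join(f"[{i}] {chunk}" for i, chunk in enumerate(chunks, start=1))
    return [
        {
            "role": "system",
            "content": "Answer using only the numbered context passages. "
                       "Say so if they do not contain the answer.",
        },
        {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {question}"},
    ]


rag_messages = build_rag_messages(
    "How long should the dough ferment?",
    ["Ferment the dough for 12 hours.", "Bake at 230 C for 40 minutes."],
)
print(rag_messages[1]["content"])
```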
The implementation details have gotten much more sophisticated - hybrid search, re-ranking, query rewriting, agentic retrieval. But the core pattern hasn’t changed: external knowledge injected into the context window. Every memory system that followed - vector databases, knowledge graphs, state-shaped memory - is the same pattern in different clothes: get the right information into the prompt.
Step 6 - The Loop
Now look at what happens when you put the pieces together. You have multi-turn messages, so you can maintain a conversation. You have structured output, so the model can emit tool calls instead of prose. You have instruction-tuned models, so you can define behavior entirely through the system prompt. The combination is almost mechanical:
    while True:
        output = model.chat(messages)        # model follows instructions
        # keep the model's own turn in the history so it can see what it asked for
        messages.append({"role": "assistant", "content": output})
        action = parse_structured(output)    # extract tool call from response
        if action is None:                   # no tool call = done
            break
        result = execute(action)             # run the tool
        messages.append({"role": "tool", "content": str(result)})
The model receives the full conversation history, considers the instruction in the system prompt, and either calls a tool or gives a final answer. If it calls a tool, you execute that tool, append the result back into the conversation, and go again. The model sees the tool result on the next turn and decides what to do next. The model writes drop table product, gets a SQL error, reads the error, retries with drop table products, and succeeds. It’s the same self-correction loop that makes agents useful and the same one that has produced enough stories of agents accidentally dropping production databases to fill a subreddit.
This is the ReAct pattern - not the 2022 paper specifically, but the pattern that emerges once chat messages, structured output, and instruction following are all in place. The same loop appeared independently in dozens of projects because it’s the obvious thing to write in that situation. It isn’t clever; it just falls out of the capabilities once you stop thinking past the first implementation.
The load-bearing line is messages.append(...), the simplest possible update rule: you take the observation and stick it at the end of the history, with no compression, no restructuring, no judgment about whether this observation should replace something earlier or be dropped entirely - just append.
That rule is the default for a reason: when nobody has a better idea and the context window is big enough to hold the loop, append is all you need. But it’s still a choice. Every other update rule - forgetting, superseding, summarizing - is a deliberate departure from append. You can’t see that those departures are available as long as append is the only rule you’ve ever considered.
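Written as functions, the alternatives look like siblings of append rather than exotic machinery. This is a sketch under assumptions: the history starts with a system message, and tool messages carry a hypothetical `name` key. Summarizing is left out because it needs a model call.

```python
def append_rule(history, observation):
    """Default: keep everything, add the new observation at the end."""
    return history + [observation]


def forget_rule(history, observation, keep_last=4):
    """Forgetting: keep the system prompt plus only the most recent messages."""
    system, rest = history[:1], history[1:]
    return system + (rest + [observation])[-keep_last:]


def supersede_rule(history, observation):
    """Superseding: a new result from the same tool replaces the older one."""
    kept = [
        m for m in history
        if not (m["role"] == "tool" and m.get("name") == observation.get("name"))
    ]
    return kept + [observation]


history = [
    {"role": "system", "content": "s"},
    {"role": "user", "content": "u"},
    {"role": "tool", "name": "read_file", "content": "v1"},
]
obs = {"role": "tool", "name": "read_file", "content": "v2"}

print([m["content"] for m in append_rule(history, obs)])     # ['s', 'u', 'v1', 'v2']
print([m["content"] for m in supersede_rule(history, obs)])  # ['s', 'u', 'v2']
```

Each rule takes the same two arguments and returns a new history, so changing the policy is a one-line change in the loop.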
There’s no framework hiding in those five lines, only a while loop, a model call, and an append - and everything that followed is this structure wrapped in abstractions, with the quality of the wrapping depending on whether it helps you see or hides what you’re already doing.
Step 7 - Framework Explosion
Once the loop worked, the natural reaction was to build something reusable — tool registries solved the problem of managing dozens of tool definitions, memory layers abstracted the growing message list into named concepts like short-term and long-term memory, agent types modularized responsibilities, workflow DSLs let you define sequences and branches visually, and multi-agent orchestration frameworks handled spawning, synchronization, and result reduction. Each layer was built to solve a real problem that emerges when you scale the five-line loop beyond toy examples.
Frameworks earn their place on large problems - if you’re building a workflow with six tools, three agent roles, and a branching approval step, they save you from writing that scaffolding yourself: the tool registry is useful, the state management cuts boilerplate, and tracing and debugging tools matter when things go sideways.
But there’s a cost that compounds with each layer, because the framework owns the message assembly, the tool execution order, how observations get fed back into the history, and the prompt formatting - so those five lines from Step 6 end up buried under configuration objects, middleware chains, and state machine transitions, and when the loop behaves in ways you didn’t expect (and it will), you debug by reading framework documentation and tracing through serialization pipelines instead of five lines of Python.
I don’t use agentic frameworks because wrapping a five-line loop in abstraction means you can’t see the line that’s wrong when it goes wrong — you’re buying configuration, not expressiveness. This isn’t “never use a framework”; it’s “write the loop yourself first, understand the mechanics, then decide whether the framework saves you more than it hides,” and for many problems the answer will be yes, so long as you know what you’re trading.
Why I Write Loops
I was running a long agent that scanned multiple documents for inconsistencies against a set of internal regulations. Initial tests looked fine — the model was finding things. Then over time the agent got worse, and I couldn’t say why. Tracing through the framework, I found it: the chat history compressor was supposed to summarize older turns into a recap and reuse it on the next run, but a config option was off, and instead of carrying the prior summary forward it was re-summarizing the whole history from scratch. The recap drifted, the model kept acting as if it had the full context, and the only signal was the quality drop.
The fix was a config flag, but reaching it required reading compression logic in someone else’s code on a hot path I hadn’t touched, for behavior I couldn’t observe directly. In a handcrafted loop, the same bug would have been a function I’d written myself - a single call I could step through and inspect, and that’s when I stopped configuring frameworks and started writing loops.
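For comparison, a handwritten compressor can be this small. It is a sketch, not the framework's logic: `compress_history` is a hypothetical function, and `toy_summarize` stands in for a model call so the flow can be inspected without one.

```python
def compress_history(prior_summary, new_turns, summarize, keep_recent=2):
    """Fold turns older than the last `keep_recent` into a running recap."""
    split = len(new_turns) - keep_recent
    if split <= 0:
        return prior_summary, list(new_turns)
    older, recent = new_turns[:split], new_turns[split:]
    # The prior recap is passed in explicitly, so it is carried forward
    # rather than rebuilt from scratch on every call.
    recap = summarize(prior_summary, older)
    return recap, recent


def toy_summarize(prior_summary, turns):
    """Stand-in for a model call: joins the text so the result is easy to read."""
    pieces = ([prior_summary] if prior_summary else []) + [t["content"] for t in turns]
    return " | ".join(pieces)


turns = [{"role": "user", "content": f"t{i}"} for i in range(5)]
recap, recent = compress_history("", turns, toy_summarize)
print(recap)  # t0 | t1 | t2

# Next run: the recap from the previous call goes back in as prior_summary.
later = recent + [{"role": "user", "content": "t5"}]
recap2, recent2 = compress_history(recap, later, toy_summarize)
print(recap2)  # t0 | t1 | t2 | t3
```

Whether the prior summary is carried forward is visible in the call signature, which is the part that was hidden behind a config option.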
80 Lines
Here’s a working agent loop - no framework, one dependency. Give it an API key, run it.
The script uses uv, so a working run is uv run agent-loop.py with the right key in the environment. The defaults target OpenRouter’s openrouter/free model, which works with a free key from openrouter.ai. To hit OpenAI instead, set OPENAI_API_KEY, OPENAI_MODEL_NAME=gpt-4o-mini, and an empty OPENAI_BASE_URL. The same three values can be passed as flags.
The default system prompt is a cooking assistant with two fake tools, find_recipes and get_recipe_steps, and the default user query is the kind of thing the loop is built for: “I have chicken thighs, lemons, and thyme in the fridge. What should I make for dinner?” Running it prints two tool calls to the terminal - find_recipes with the ingredients, then get_recipe_steps with one of the returned recipe ids - followed by a final assistant turn with the recipe name, a short intro, and the steps in plain text. That is the loop in action: model emits an action, the tool runs, the result goes back into the conversation, the model decides what to do next.
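The same trace can be reproduced offline. The sketch below is not the script described above: it swaps the model for `ScriptedModel`, a hypothetical stub that replays fixed replies, and it assumes a `{"tool": ..., "args": ...}` JSON shape for tool calls. The tool names follow the post; the recipe data is invented.

```python
import json

# Fake tools with canned data; the names follow the post, the data is invented.
RECIPES = {
    "r1": {
        "name": "Lemon-thyme roast chicken thighs",
        "steps": ["Season the thighs.", "Roast with lemon and thyme."],
    },
}


def find_recipes(ingredients):
    """Pretend search: returns every canned recipe regardless of ingredients."""
    return [{"id": rid, "name": r["name"]} for rid, r in RECIPES.items()]


def get_recipe_steps(recipe_id):
    return RECIPES[recipe_id]["steps"]


TOOLS = {"find_recipes": find_recipes, "get_recipe_steps": get_recipe_steps}


class ScriptedModel:
    """Hypothetical stand-in for a chat model: replays fixed replies in order."""

    def __init__(self, replies):
        self.replies = list(replies)

    def chat(self, messages):
        return self.replies.pop(0)


def parse_structured(output):
    """Treat a reply that parses as a JSON object with a 'tool' key as a tool call."""
    try:
        action = json.loads(output)
    except json.JSONDecodeError:
        return None
    return action if isinstance(action, dict) and "tool" in action else None


def run_agent(model, messages, max_steps=10):
    for _ in range(max_steps):  # bounded, so a confused model cannot loop forever
        output = model.chat(messages)
        messages.append({"role": "assistant", "content": output})
        action = parse_structured(output)
        if action is None:  # no tool call: this is the final answer
            return output
        print("tool call:", action["tool"], action["args"])
        result = TOOLS[action["tool"]](**action["args"])
        messages.append({"role": "tool", "content": json.dumps(result)})
    return None


model = ScriptedModel([
    '{"tool": "find_recipes", "args": {"ingredients": ["chicken thighs", "lemons", "thyme"]}}',
    '{"tool": "get_recipe_steps", "args": {"recipe_id": "r1"}}',
    "Make lemon-thyme roast chicken thighs: season the thighs, then roast with lemon and thyme.",
])
messages = [
    {"role": "system", "content": "You are a cooking assistant."},
    {"role": "user", "content": "I have chicken thighs, lemons, and thyme in the fridge. "
                                "What should I make for dinner?"},
]
answer = run_agent(model, messages)
print(answer)
```

Replacing `ScriptedModel` with a real client is the only change needed to go from this trace to a live one, provided the client returns the reply as text.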
The key line is messages.append(...) - every iteration grows the same list, it’s simple, it works, and the same append that holds the toy example together becomes the bottleneck once the context gets crowded.
You have a working agent loop, and it has exactly the flaw we’ll cover next: at the second turn of the conversation the model is already processing five messages to generate the sixth, and by the tenth turn the same rule is making it reprocess a much larger history on every iteration. Nothing about messages.append is wrong - until the conversation is long enough that the earlier turns have stopped mattering to the model. The next post starts at this append and walks through what breaks first.
