The seven layers under your LLM agent

June 17, 2026 llm, agents, rag, ai, agent-loop

If you’ve followed the LLM framework space recently, it probably feels like the early Node.js ecosystem: a new framework every week promising to make the last one obsolete. But past the news headlines, the stack has been relatively stable, and what looks like constant revolution is really just layers accumulating on top of the same few ideas. You may have tried one of these frameworks, set up a tool, written a prompt, started a loop, then spent an afternoon tracing why the agent was doing something you didn’t tell it to do, inside abstractions that obscured the five lines of actual logic.

What Is an Agent?

Russell and Norvig define an agent as anything that perceives its environment through sensors and acts upon it through actuators, so a thermostat is an agent, and so is a single call to model.chat(messages) - it reads a prompt, produces a completion, perceives and acts in one shot.

But nobody calls a single API call an “agent” in the current sense - the word carries extra weight now. When engineers say “LLM agent,” they mean something that iterates: a thermostat with a PID controller, watching the temperature overshoot and adjusting the next pulse, or the lane-tracking system in a modern car, watching the road through a camera and nudging the wheel back into the lane. Agents loop through their own observations. A single model.chat() call doesn’t, because the action is the last step, and there is no “try again based on what just happened” unless you build the loop yourself.

Each piece of that loop is a design choice, and most of the pieces are ones the field inherited rather than chose.

Step 1 - Completion

Natural language processing existed as a research discipline long before large language models, working with statistical methods - bags of words, n-grams, the usual toolkit - to classify text, detect spam, extract sentiment, answer questions. Then neural networks pushed the field forward, recurrent models came and went, and eventually the transformer arrived, which among other things could be configured to predict the next token in a sequence. That turned out to be a surprisingly useful framing, and the models that came out of it changed what was possible.

You might remember GPT-2 - the paper came out in February 2019 with only the smallest model version, and over the next few months larger versions followed, the largest at 1.5 billion parameters in November. OpenAI’s framing was measured - “an experiment in responsible disclosure,” citing concerns about malicious applications - but the headlines ran toward “humankind is under threat,” and that version is the one that stuck. In 2026 it reads as a curious artifact: the model is out, the internet has turned into an even stranger place, but I don’t think the bad things that happened over those years were because of GPT-2. Then came GPT-3 in May 2020, at 175 billion parameters, more than a hundred times larger than the largest GPT-2. The prediction engine underneath was the same throughout; what changed was scale, and with it, what the model could plausibly continue.

Unlike the models you’re used to now, where you ask something and get an answer, these early models were really just predicting text - you gave them the start of a story and they continued it, word by word. If you asked “What’s the weather today?” it might not answer the question but continue it as a line of small talk from some character in some conversation it had seen. You were passing a prefix and the model was continuing that prefix, and the entire game was controlling what sat to the left of the cursor so the continuation you wanted became the most probable next tokens.

There were a lot of open questions about how to make the model do what you asked rather than what it decided to predict on its own, and people developed a number of approaches. One of the most prominent was few-shot prompting: present several examples of a problem and its solution, sometimes with markup so you could parse the response out of the output, then give the model a new problem and expect it to follow the pattern. On top of that you layered a fair amount of parsing logic to handle failure modes - extra line breaks, the model ignoring the markup entirely, the model deciding to give a much longer response than you expected and getting cut off mid-sentence. All of that handling was something you had to build yourself.
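A minimal sketch of that few-shot pattern, assuming a sentiment task and an <answer> tag as the markup. Both are illustrative choices, not something from a particular library, and the completion call itself is left out because any text-completion endpoint would do.

```python
# Hypothetical solved examples; the task and the <answer> markup are illustrative.
EXAMPLES = [
    ("I loved this film", "positive"),
    ("Worst purchase ever", "negative"),
]

def build_few_shot_prompt(examples, new_input):
    """Lay out solved examples, then leave the last answer open for the model."""
    blocks = [
        f"Review: {text}\nSentiment: <answer>{label}</answer>"
        for text, label in examples
    ]
    # The prompt ends inside an open tag, so the likeliest continuation is a label.
    blocks.append(f"Review: {new_input}\nSentiment: <answer>")
    return "\n\n".join(blocks)

def parse_answer(completion):
    """Pull the label out of a raw continuation, tolerating common failures."""
    head, closed, _rest = completion.partition("</answer>")
    if not closed:
        # No closing tag: the model ignored the markup or was cut off.
        # Fall back to the first non-empty line of the continuation.
        lines = [line.strip() for line in completion.splitlines() if line.strip()]
        head = lines[0] if lines else ""
    label = head.strip().lower()
    return label or None

prompt = build_few_shot_prompt(EXAMPLES, "Not bad at all")
```

The parser is where most of the effort went in practice: the happy path is one line, and everything else handles continuations that did not follow the pattern.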

The biggest part of the process (and the fun) was shaping that prefix. That single string was the application logic, the protocol, and the user interface, all collapsed into one artifact. Every layer that follows - chat templates, structured output, function calling - is that same prompt string with more machinery on top, and the prediction engine underneath never changes.

Step 2 - Chat Templating

Around 2022, chat models arrived (OpenAI’s InstructGPT paper). Underneath, they were the same completion models, fine-tuned to follow a chat pattern - User: ... Chatbot: ... User: ... Chatbot: ... - so if you fed that structure as a prefix, the model would continue with the next assistant turn. You could finally have a conversation without role-playing through the prompt, and without hunting for few-shot examples to induce the behavior you wanted. And you could expect the model’s output to be a meaningful response rather than free-form continuation.

Models tried to sound helpful and tended to prefix answers with a preamble - “Of course I can answer this question, here is the answer” - and the exact phrasing varied enough that you couldn’t regex it out. People tried asking for “no fluff, just the answer,” or wrapping responses in XML tags or Markdown fences. Markdown in particular became the lingua franca for interacting with language models: useful for structuring output, convenient for encoding headers and formatting, and models had seen enough of it during training that they followed the patterns reliably.

On the code side, the API shifted from passing a single string to model.complete() to passing a list of chat turns to model.chat(). The first turn was from the user, the second from the chatbot, the third from the user again, and so on, with a new role appearing along the way: the system prompt. Rather than embedding behavioral instructions in your first user message - “pretend you’re a lawyer” or “be serious” - you now had a dedicated place that models paid attention to, and it persisted even when you sliced or shuffled the conversation history.

messages = [
    {"role": "system", "content": "You are helpful."},
    {"role": "user", "content": "Hello"},
    {"role": "assistant", "content": "Hi! How can I help?"},
    {"role": "user", "content": "What's 2+2?"}
]

output = model.chat(messages)

Under the hood, the template engine still concatenates everything into one big string with special tokens between turns, and the model still only sees that single string. The completion prompt from Step 1 was the context; now the message array is the context, with an extra layer of transformation in between. That convenience comes with a hidden cost: the template engine owns the projection from your structured messages to the flat prompt string, and you stop seeing it. The array and the prompt become the same thing in your mind, even though they’re different - and when things go wrong you need to know which side of that seam is responsible: the message array you constructed or the prompt string the model actually reads.
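To make that seam visible, here is a small sketch of what a template engine does. It assumes ChatML-style markers (<|im_start|> and <|im_end|>), which is one convention among several; real models ship their own templates, so treat the exact tokens as an assumption.

```python
def render_chatml(messages, add_generation_prompt=True):
    """Flatten a message array into the single string the model reads.

    The <|im_start|>/<|im_end|> markers follow the ChatML convention;
    other model families use different special tokens.
    """
    parts = [
        f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n"
        for m in messages
    ]
    if add_generation_prompt:
        # Open an assistant turn so the model's continuation is the reply.
        parts.append("<|im_start|>assistant\n")
    return "".join(parts)

messages = [
    {"role": "system", "content": "You are helpful."},
    {"role": "user", "content": "What's 2+2?"},
]
prompt = render_chatml(messages)
print(prompt)
```

Printing the rendered prompt is a cheap debugging habit: it shows the prefix the model is asked to continue, which is the same game as Step 1.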

Step 3 - Structured Output

Engineers dealing with LLMs wanted models to return structured data, not plain text. The early approach was clumsy, if not embarrassing: ask the model nicely, then parse its prose and hope it contained valid JSON.

It worked quite poorly: models produced trailing commas, unescaped quotes inside strings, and Markdown fences wrapping the JSON blob. (Prompt) engineers tried various forms of “no prose, no Markdown, no preamble, no trailing comma, no extra keys” in their prompts, and the model would find a new way to fail. Some forms of “creative escaping” were very entertaining to read.

My anecdote from that period: I was extracting structured content from large document chunks using Llama 2. It had just been released and was much better than anything I could run locally. The extraction mostly worked, but I kept seeing exceptions in the logs. When I enabled tracing and reproduced the error, I saw that the model first generated some “intro” text, which was expected. Then came the JSON I wanted, which I cut from the response and parsed. But to my surprise, after that the model produced “On second thought, I think the previous JSON was wrong, here’s the corrected one” and continued with another JSON body, which confused my parsing scripts. Language models teach you that things can go wrong in surprising ways.
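A sketch of the kind of parser that episode pushes you toward: scan the prose for every JSON object and let the caller decide which one counts. It uses only the standard library; the function name is my own, and the policy of which object to keep (first, last, or fail on more than one) is an assumption left to the caller.

```python
import json

def extract_json_objects(text):
    """Return every top-level JSON object embedded in free-form text, in order."""
    decoder = json.JSONDecoder()
    found, pos = [], 0
    while True:
        start = text.find("{", pos)
        if start == -1:
            return found
        try:
            # raw_decode parses one JSON value starting at `start` and
            # reports where it ended, ignoring whatever prose follows.
            obj, end = decoder.raw_decode(text, start)
        except json.JSONDecodeError:
            pos = start + 1  # a stray brace in prose; keep scanning
            continue
        found.append(obj)
        pos = end

reply = (
    'Sure, here is the result: {"city": "Oslo"} '
    'On second thought, the previous JSON was wrong: {"city": "Bergen"}'
)
objects = extract_json_objects(reply)
```

With a reply like the one above, taking the first object silently returns the answer the model retracted, so counting the objects before choosing one is worth the extra line.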

Not long after, model serving frameworks started implementing syntax-aware decoding, so the model could only produce tokens that matched the syntax definition. That gave us technically correct JSON outputs, at first as “whatever it is, it has to be JSON”, and later, with more advanced syntax engines, outputs matching a provided JSON Schema.
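The idea behind syntax-aware decoding fits in a toy example. This is a deliberately simplified sketch, not how any serving framework implements it: the “grammar” is just a set of allowed outputs, the vocabulary is made up, and the fake model always ranks tokens in the same order. Real engines mask token logits against a grammar automaton, but the principle is the same: discard any token that would make the output invalid.

```python
import json

# Toy "grammar": the complete outputs we are willing to accept.
ALLOWED_OUTPUTS = ['{"answer": "yes"}', '{"answer": "no"}']

# Hypothetical vocabulary, in the order a chatty model might prefer its tokens.
VOCAB = ["Of course! ", '{"answer": ', '"no"', '"yes"', "}"]

def is_valid_prefix(text):
    """True if `text` can still grow into one of the allowed outputs."""
    return any(full.startswith(text) for full in ALLOWED_OUTPUTS)

def fake_ranked_tokens(prefix):
    """Stand-in for a model: returns candidate tokens, most preferred first."""
    return VOCAB

def constrained_decode(ranked_tokens, max_steps=20):
    out = ""
    for _ in range(max_steps):
        if out in ALLOWED_OUTPUTS:
            return out
        for token in ranked_tokens(out):
            # Take the model's favorite token that keeps the output valid.
            if is_valid_prefix(out + token):
                out += token
                break
        else:
            raise ValueError("no candidate token keeps the output valid")
    raise ValueError("hit max_steps before completing a valid output")

result = constrained_decode(fake_ranked_tokens)
print(json.loads(result))
```

The model still wants to open with a preamble at every step; the decoder never lets that token through.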

The line between data and code is blurry in computer architecture. Bytes are bytes - interpret them as instructions, and they become a program. Same with structured outputs from a model. Give it a scenario and a description of available functions and their parameters, and the structured output becomes guidance on which functions to call and how. Models quickly learned to produce function calls reliably, and client libraries wrap and invoke them automatically based on the model’s output - a “tool call” is just a JSON object saying “I want to call this function with these arguments.” Once the model reliably emits structured actions, a loop becomes possible: give the model state and available actions, receive which action to call, execute it, feed the result back, and the model decides what to do next. The agentic loop isn’t a new invention; it falls out of structured output the moment you wire the result back into the prompt.
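A sketch of that data-to-code step, assuming the model emits an object with name and arguments keys. That shape and the get_weather tool are illustrative; each provider has its own field names for tool calls.

```python
import json

def get_weather(city: str) -> str:
    """Hypothetical tool; a real one would call a weather service."""
    return f"Sunny in {city}"

# Registry mapping the names the model may use to real Python callables.
TOOLS = {"get_weather": get_weather}

def dispatch(raw_call: str) -> str:
    """Interpret a JSON tool call as a function invocation.

    Errors come back as strings rather than exceptions so they can be
    fed to the model as an observation.
    """
    call = json.loads(raw_call)
    name = call.get("name")
    fn = TOOLS.get(name)
    if fn is None:
        return f"error: unknown tool {name!r}"
    try:
        return str(fn(**call.get("arguments", {})))
    except TypeError as exc:  # wrong or missing argument names
        return f"error: bad arguments: {exc}"

print(dispatch('{"name": "get_weather", "arguments": {"city": "Oslo"}}'))
```

Returning errors as text is a design choice that matters once the loop exists: the model can only correct a mistake it gets to read.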

Step 4 - Instruction-Tuned Models

Alongside chat models, there was another development. Not everybody needed long-form conversation - sometimes you just wanted extraction or transformation without the conversational wrapper - so instruct models emerged to fill that gap.

The breakthrough came with FLAN and its successor FLAN-T5. The idea: train on many datasets for many tasks, each described only by a natural language instruction, rather than on a single dataset for a single task. If you trained on fifteen of twenty datasets, the model would generalize to the five it had never seen - the actual numbers were different, but the result was the same. The model learned to follow instructions as a general skill, rather than memorizing task-specific patterns.

The instruct approach created a clean separation between instructions and data. You could keep the same instruction and swap in different data to get different outputs, and that separation powered question-answering engines: “extract addresses from this text,” “summarize this document,” or even “here’s a chat, the character is this, here’s what they remember - what would the character say next?”

Over time instruct and chat merged. Models became good at both, and the instruct capability got wrapped into the chat interface. You could frame an instruction as a system message, feed data as user messages, get responses as assistant messages. Chat syntax helped separate participants in a conversation, but the instruct approach was what pushed agentic patterns forward - it made it possible to define behavior through natural language and trust the model to follow it.
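The instruction/data split maps directly onto the chat format. A minimal sketch, with a made-up instruction and documents: the instruction stays fixed in the system message, and only the user message changes between calls.

```python
def instruct_messages(instruction, data):
    """Frame a fixed instruction as the system turn and the data as the user turn."""
    return [
        {"role": "system", "content": instruction},
        {"role": "user", "content": data},
    ]

# Hypothetical instruction and inputs, for illustration only.
INSTRUCTION = "Extract every street address from the text. Return one per line."
documents = [
    "Ship the order to 12 Harbour Road, then invoice 4 Mill Lane.",
    "The office moved from 9 Elm Street last spring.",
]

# One instruction, many inputs: each list is ready to send to a chat endpoint.
batches = [instruct_messages(INSTRUCTION, doc) for doc in documents]
```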

For the agent loop, that reliability matters because the loop that emerges next is defined entirely by instructions at runtime - a system prompt saying “think about what to do, call a tool, reason about the result” replaces hardcoded state machines and hand-written branching logic, and instruction-tuned models are what made that work in practice.

Step 5 - RAG

There was a problem that didn’t go away with better prompting or bigger models: context windows were finite. Early models could handle 512 tokens. A token isn’t exactly a word - the rough statistic was one token to about 0.75 words - so 512 tokens gave you about 380 words. Later models reached 2048, then more, but the underlying constraint was that memory and compute scaled roughly quadratically with context size. Going from 2048 to 4096 tokens needed four times the memory, not two. Improvements came - memory-efficient attention, flash attention, KV caching - but the quadratic pressure never fully disappeared.
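The arithmetic behind those numbers, as a quick sketch. The 0.75 ratio is the rule of thumb quoted above, and the quadratic term models only the naive attention score matrix; both are simplifications.

```python
WORDS_PER_TOKEN = 0.75  # rough rule of thumb; varies by language and tokenizer

def approx_words(tokens):
    """Estimate how many words fit in a context of `tokens` tokens."""
    return round(tokens * WORDS_PER_TOKEN)

def attention_growth(old_context, new_context):
    """Factor by which an n-by-n attention matrix grows between two context sizes."""
    return (new_context / old_context) ** 2

print(approx_words(512))
print(attention_growth(2048, 4096))  # doubling the context quadruples the matrix
```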

When you had a 100-page document and needed to answer a question about it, you couldn’t fit it all in the prompt. One approach was map-reduce: slice the document, process each piece, then combine results. Wasteful, slow, resource-heavy. Another approach came from a separate line of research: models that could turn a chunk of text - a few hundred tokens - into a fixed-size vector, where semantically similar texts produced similar vectors. Sourdough bread and baking cakes would have higher vector similarity than sourdough bread and building construction. Those embedding models required far less compute than large language models, which meant you could process an entire corpus cheaply.

The recipe emerged: slice a document into chunks of 512 or 2048 tokens with some overlap, run each chunk through the embedding model, and store the mapping between chunk, its position in the document, and its vector. When a query arrived, embed the query, do a similarity search, and retrieve the closest chunks. There were plenty of false positives - you often needed the top 20 or even top 100 results and then had to filter further - but it was still better than running the full corpus through an LLM with no optimization.
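That recipe can be sketched end to end with the standard library. The big assumption is the embedding: a bag-of-words counter stands in for a real embedding model, so it only matches shared words, not meaning. Words also stand in for tokens, and the chunk sizes are tiny so the example stays readable.

```python
import math
from collections import Counter

def chunk(tokens, size, overlap):
    """Split a token list into windows of `size` sharing `overlap` tokens."""
    if not tokens:
        return []
    step = size - overlap  # requires overlap < size
    last_start = max(len(tokens) - overlap, 1)
    return [tokens[i:i + size] for i in range(0, last_start, step)]

def embed(text):
    """Stand-in embedding: word counts. A real model returns a dense vector."""
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def build_index(text, size=8, overlap=2):
    """Store (position, chunk text, vector) for every chunk of the document."""
    pieces = [" ".join(c) for c in chunk(text.split(), size, overlap)]
    return [(pos, piece, embed(piece)) for pos, piece in enumerate(pieces)]

def search(index, query, top_k=3):
    """Embed the query and return the top_k most similar chunks."""
    q = embed(query)
    ranked = sorted(index, key=lambda entry: cosine(q, entry[2]), reverse=True)
    return [(pos, piece) for pos, piece, _vec in ranked[:top_k]]

doc = ("bake the sourdough bread in a hot oven "
       "pour the concrete slab for the building foundation")
index = build_index(doc, size=8, overlap=0)
hits = search(index, "sourdough bread", top_k=1)
```

Swapping embed for a real embedding model changes the quality of the matches but not the shape of the pipeline.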

That became RAG: inject the retrieved chunks into the prompt alongside the user’s question, and the model answers from them. The first-pass version was simple - retrieve chunks similar to the query, which works for “find a document about X” and breaks for queries that need reasoning about which document matters before retrieval can happen. Even when retrieval works, the chunks carry no structure about provenance: which document they came from, whether they contradict each other, whether they’re stale. They’re just text fragments in the prompt, and the model has to reconstruct relationships from raw text alone.
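The injection step itself is little more than string formatting. A sketch of the first-pass version, with a prompt wording of my own choosing; the retrieved chunks arrive as bare strings, which is exactly the missing-provenance problem described above.

```python
def build_rag_prompt(question, chunks):
    """Place retrieved chunks ahead of the question in a single prompt."""
    # The numbering is cosmetic: nothing here records which document a chunk
    # came from or how old it is.
    context = "\n\n".join(f"[{i}] {text}" for i, text in enumerate(chunks, 1))
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )

prompt = build_rag_prompt(
    "How long should sourdough proof?",
    ["Proof the dough overnight.", "Rebar goes in before the pour."],
)
print(prompt)
```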

The implementation details have gotten much more sophisticated - hybrid search, re-ranking, query rewriting, agentic retrieval. But the core pattern hasn’t changed: external knowledge injected into the context window. Every memory system that followed - vector databases, knowledge graphs, state-shaped memory - is the same pattern in different clothes: get the right information into the prompt.

Step 6 - The Loop

Now look at what happens when you put the pieces together. You have multi-turn messages, so you can maintain a conversation. You have structured output, so the model can emit tool calls instead of prose. You have instruction-tuned models, so you can define behavior entirely through the system prompt.

while True:
    output = model.chat(messages)           # model follows instructions
    messages.append({"role": "assistant", "content": output})  # keep the model's own turn
    action = parse_structured(output)       # extract tool call from response
    if action is None:                      # no tool call = done
        break
    result = execute(action)                # run the tool
    messages.append({"role": "tool", "content": str(result)})

The model receives the full conversation history, considers the instruction in the system prompt, and either calls a tool or gives a final answer. If it calls a tool, you execute that tool, append the result back into the conversation, and go again. The model sees the tool result on the next turn and decides what to do next. The model writes drop table product, gets an SQL error, reads the error, retries with drop table products, and succeeds. It’s the same self-correction loop that makes agents useful and the same one that has produced enough stories of agents accidentally dropping production databases to fill a subreddit.
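To watch the self-correction happen without an API key, here is a runnable sketch of the same loop with a scripted stand-in for the model. Everything named here is hypothetical: ScriptedModel replays canned replies, the row_count tool and the TABLES dict fake a database, and the tool-call shape is invented. The sketch also records the model’s own turn before each tool result, which chat APIs generally expect, and caps the number of turns.

```python
import json

class ScriptedModel:
    """Stand-in for an LLM client: replays canned replies in order."""
    def __init__(self, replies):
        self._replies = iter(replies)

    def chat(self, messages):
        return next(self._replies)

TABLES = {"products": 3}  # hypothetical database: table name -> row count

def parse_structured(output):
    """Return the tool call as a dict, or None when the output is plain prose."""
    try:
        call = json.loads(output)
    except json.JSONDecodeError:
        return None
    return call if isinstance(call, dict) and "tool" in call else None

def execute(action):
    table = action["args"]["table"]
    if table not in TABLES:
        return f"error: no such table: {table}"
    return f"{TABLES[table]} rows"

def run(model, messages, max_turns=10):
    for _ in range(max_turns):
        output = model.chat(messages)
        messages.append({"role": "assistant", "content": output})
        action = parse_structured(output)
        if action is None:
            return output  # plain prose means a final answer
        messages.append({"role": "tool", "content": execute(action)})
    raise RuntimeError("ran out of turns")

model = ScriptedModel([
    json.dumps({"tool": "row_count", "args": {"table": "product"}}),   # wrong name
    json.dumps({"tool": "row_count", "args": {"table": "products"}}),  # retry
    "The products table has 3 rows.",
])
messages = [
    {"role": "system", "content": "Use tools to answer questions about the database."},
    {"role": "user", "content": "How many products do we have?"},
]
final = run(model, messages)
```

The scripted replies stand in for what a real model would decide after reading the error; the loop around them is unchanged.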

This is the same loop that drives Aider, OpenHands, and the early coding agents: give the model tools to list files, read them, search them, and write patches, and it iterates against a task until the code compiles or it runs out of turns. Five lines of Python, a handful of tools, and a working terminal - that is the entire system.

This is the ReAct pattern - not the 2022 paper specifically, but the pattern that emerges from the prior three steps, and the same loop that showed up independently in dozens of projects because of how natural it was.

The load-bearing line is messages.append(...), the simplest possible update rule: you take the observation and stick it at the end of the history, with no compression, no restructuring, no judgment about whether this observation should replace something earlier or be dropped entirely - just append.

That rule is the default for a reason: when nobody has a better idea and the context window is big enough to hold the loop, append is all you need. But it’s still a choice. Every other update rule - forgetting, superseding, summarizing - is a deliberate departure from append. You can’t see that those departures are available as long as append is the only rule you’ve ever considered.
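To make the alternatives concrete, here is a sketch of append next to two departures from it, each written as a function from the old history and a new observation to a new history. The function names, the keep_last default, and the name field on tool messages are all assumptions for illustration, and a production version would also need to keep tool calls and their results paired.

```python
def append_rule(history, observation):
    """The default: keep everything, add the new observation at the end."""
    return history + [observation]

def sliding_window_rule(history, observation, keep_last=4):
    """Forgetting: keep the first message (assumed to be the system prompt)
    plus only the most recent `keep_last` messages."""
    system, rest = history[:1], history[1:] + [observation]
    return system + rest[-keep_last:]

def supersede_rule(history, observation):
    """Superseding: a new result from a tool replaces older results from the
    same tool. Relies on a hypothetical `name` field on tool messages."""
    kept = [
        m for m in history
        if not (m["role"] == "tool" and m.get("name") == observation.get("name"))
    ]
    return kept + [observation]

history = [
    {"role": "system", "content": "You edit files."},
    {"role": "user", "content": "Fix the typo in notes.txt"},
    {"role": "tool", "name": "read_file", "content": "teh quick fox"},
]
fresh = {"role": "tool", "name": "read_file", "content": "the quick fox"}

appended = append_rule(history, fresh)       # 4 messages, both file versions
superseded = supersede_rule(history, fresh)  # 3 messages, only the new version
```

All three share one signature, so the update rule becomes a parameter of the loop rather than something hardwired into it.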

There’s no framework hiding in those five lines, only a while loop, a model call, and an append - and everything that followed is this structure wrapped in abstractions, with the quality of the wrapping depending on whether it helps you see or hides what you’re already doing.

Step 7 - Framework Explosion

Once the loop worked, the natural reaction was to build something reusable. Tool registries to manage dozens of definitions, memory layers to abstract the growing message list into named concepts, agent types to modularize roles, workflow DSLs to draw sequences and branches visually, multi-agent orchestrators to handle spawning, synchronization, and result reduction across sub-tasks. Each layer was built to solve a real problem that emerges when you scale the five-line loop beyond toy examples - and frameworks like LangChain, LlamaIndex, and the rest started life as connectors for the data sources and tools you wanted to feed that loop, before they grew agentic abstractions of their own.

Frameworks earn their place on large problems - if you’re building a workflow with six tools, three agent roles, and a branching approval step, they save you from writing that scaffolding yourself: the tool registry is useful, the state management cuts boilerplate, and tracing and debugging tools matter when things go sideways.

But there’s a cost that compounds with each layer, because the framework owns the message assembly, the tool execution order, how observations get fed back into the history, and the prompt formatting - so those five lines from Step 6 end up buried under configuration objects, middleware chains, and state machine transitions, and when the loop behaves in ways you didn’t expect (and it will), you debug by reading framework documentation and tracing through serialization pipelines instead of five lines of Python.

I still write loops by hand, and the reason is visibility. The frameworks are fine for quick starts - tool registries, memory layers, and orchestration out of the box - but when something goes wrong I’d rather step through five lines of Python than read through code I didn’t write. I prefer “non-magical” code, where you know who calls what: writing the loop yourself costs time, but you spend that time understanding the code, not learning a framework’s vocabulary. Most of the time I’d rather know what I’m running.

80 Lines

Here’s a working agent loop - no framework, one dependency (and a second one for convenience). Download it, give it an API key, run it.

uv run redefining-rag/posts/agent-loop.py

The defaults target OpenRouter’s openrouter/free model, which works with a free key from openrouter.ai. To hit OpenAI instead, set OPENAI_API_KEY, OPENAI_MODEL_NAME=gpt-4o-mini, and an empty OPENAI_BASE_URL. The same three values can be passed as flags.

The default system prompt is a cooking assistant with two fake tools, find_recipes and get_recipe_steps, and the default user query is the kind of thing the loop is built for: “I have chicken thighs, lemons, and thyme in the fridge. What should I make for dinner?” Running it prints two tool calls to the terminal - find_recipes with the ingredients, then get_recipe_steps with one of the returned recipe ids - followed by a final assistant turn with the recipe name, a short intro, and the steps in plain text. That is the loop in action: model emits an action, the tool runs, the result goes back into the conversation, the model decides what to do next.

The key line is messages.append(...) - every iteration grows the same list, it’s simple, it works, and the same append that holds the toy example together becomes the bottleneck once the context gets crowded.

You have a working agent loop, and it has exactly the flaw we’ll cover next: at the second turn of the conversation the model is already processing five messages to generate the sixth, and by the tenth turn the same rule is making it reprocess a much larger history on every iteration. messages.append works fine until the conversation grows long enough that the earlier turns stop mattering to the model. The next post starts at this append and walks through what breaks first.
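A back-of-the-envelope sketch of that growth, assuming a fixed starting prompt and a fixed number of tokens added per iteration. The numbers are made up, and the count ignores any prompt caching a provider might apply.

```python
def prompt_tokens_at(step, base_tokens, tokens_per_step):
    """Prompt size the model reads on a given iteration (0-based) under append."""
    return base_tokens + step * tokens_per_step

def total_tokens_processed(steps, base_tokens, tokens_per_step):
    """Sum of prompt sizes over all iterations: earlier turns are re-read each time."""
    return sum(prompt_tokens_at(i, base_tokens, tokens_per_step) for i in range(steps))

# Hypothetical figures: a 200-token starting prompt, 300 tokens added per iteration.
print(prompt_tokens_at(9, 200, 300))         # prompt size on the tenth call
print(total_tokens_processed(10, 200, 300))  # everything read across ten calls
```

The per-call prompt grows linearly, but the total work grows with the square of the number of iterations, the same shape as the attention cost from Step 5.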