Inside the Loop
What actually happens when an agent runs.
The Agent Loop in Detail
Chapter 2 gave you the architecture. Now I want to show you what it actually feels like when an agent runs.
Because there's a gap between understanding the concept (observe, think, act, repeat) and understanding what happens in practice. The concept is clean. Practice is messy.
Let's walk through it.
An agent gets a task. Say, something simple: "Find the cheapest flight from Singapore to Tokyo next month and book it." Sounds straightforward. But watch what actually happens under the hood.
Step 1: The agent reads the task and plans. It breaks this into sub-goals. Search for flights, compare prices, select the cheapest, enter passenger details, complete the booking. Five steps. Each one looks simple in isolation.
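If you wrote that plan down as data, it might look something like this. A minimal sketch; the structure and names are mine, not from any particular framework.

```python
from dataclasses import dataclass, field

@dataclass
class SubGoal:
    description: str
    done: bool = False

@dataclass
class Plan:
    task: str
    steps: list[SubGoal] = field(default_factory=list)

plan = Plan(
    task="Find the cheapest flight from Singapore to Tokyo next month and book it",
    steps=[
        SubGoal("Search for flights"),
        SubGoal("Compare prices"),
        SubGoal("Select the cheapest option"),
        SubGoal("Enter passenger details"),
        SubGoal("Complete the booking"),
    ],
)
```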
Step 2: It takes the first action. Opens a browser, navigates to a flight search engine, enters the route and dates. Already, things can go sideways. The site might have a CAPTCHA. The date picker might be a custom widget that doesn't respond to standard inputs. The page might load slowly and the agent times out.
Humans do this too. You click a button twice because the page felt slow. You accidentally submit a form twice. You end up in a weird state and you don't know why. Agents get stuck in the same ambiguity, except they don't have the gut feel we have from ten thousand hours of browsing.
Step 3: It reads the results. The agent needs to parse what's on the screen. Flight options, prices, layover times, airlines. If the results are in a clean table, great. If they're in a dynamic carousel with "show more" buttons and hover tooltips... the agent might miss half the options.
This is an underrated point. A lot of agent failures are not reasoning failures. They are perception failures. The page is built for human eyes. The information is there, but not in a shape the agent can reliably extract.
Step 4: It decides. Cheapest flight found. Great.
Except the cheapest flight has a 14-hour layover. Maybe it lands at 2am. Maybe it's two separate tickets. Maybe it's non-refundable.
And now we hit the part that's uncomfortable. The instruction said cheapest. It did not say cheapest reasonable. It did not say no red-eyes, no two-stop itineraries, no budget airlines, no 10-hour layovers.
Steps 5 through 10: More actions, more decisions, more things that can break. Enter passenger name, select a seat, enter payment info, handle confirmation. Each step is another opportunity for failure.
Here's the math that keeps agent builders up at night. If each step succeeds 95% of the time (which is optimistic for web interactions), and you have 10 steps, your end-to-end success rate is 0.95^10. That's about 60%. Four out of ten runs fail somewhere.
And that's with 95% per step. Most agents aren't there yet.
How reliability compounds: 95% per step × 10 steps ≈ 59.9% overall.
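You can check the compounding yourself in a couple of lines of Python:

```python
# End-to-end success rate for n independent steps,
# each succeeding with probability p.
for p in (0.95, 0.99, 0.999):
    for n in (10, 20, 50):
        print(f"p={p}, n={n}: {p**n:.1%}")
# p=0.95, n=10 gives 59.9%. Even p=0.99 over 50 steps is only ~60.5%.
```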
State
The thing that makes long tasks hard is not that the agent forgets everything. It’s that it loses track of where it is.
State is the agent’s internal scoreboard for the current run.
What page it's on. What it already clicked. Which flight it selected. What passenger name it entered. Whether it's on attempt one or attempt three. Whether it's in the middle of a form. Whether the last tool call succeeded.
Humans keep a lot of state for free. You glance at the page and you remember. You have that little internal sense of progress.
Agents do not get that for free. State has to be written down explicitly, kept consistent, and updated carefully. Otherwise you get the classic failure pattern:
Repeating steps. Undoing previous progress. Clicking back and forth. Re-entering the same field. Selecting the wrong option because the agent forgot it already filtered.
I've watched agents do this in long sessions and it looks like laziness. It isn't. It's state corruption. The task stops being a straight line and starts being a blob of half-finished attempts.
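Here's roughly what writing state down explicitly looks like. A minimal sketch; the fields are illustrative, borrowed from the flight example.

```python
from dataclasses import dataclass, field

@dataclass
class RunState:
    """Explicit scoreboard for one agent run. Updated after every action."""
    current_page: str | None = None
    clicked: list[str] = field(default_factory=list)  # actions already taken
    selected_flight: str | None = None
    passenger_name_entered: bool = False
    attempt: int = 1
    in_form: bool = False
    last_tool_call_ok: bool | None = None

    def record_click(self, element: str) -> None:
        if element in self.clicked:
            # Guard against the classic failure: repeating a step
            # because the agent forgot it already happened.
            raise RuntimeError(f"Already clicked {element!r}; state may be stale.")
        self.clicked.append(element)
```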
Retries
When a step fails, the agent has to decide: retry or give up?
This sounds like a simple binary, but there's real nuance here. Some failures are transient (the page didn't load, try again). Some failures are structural (the website doesn't support this action). Some failures are ambiguous (the agent clicked a button and nothing visible changed, did it work or not?).
Good retry logic looks like this. Try the same approach once. If it fails again, try a different approach. If that fails, escalate to a human or bail gracefully. Bad retry logic is just... doing the same thing ten times and hoping for a different result.
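Here's that policy as a sketch, assuming you can classify failures as transient or structural (the classification itself is the hard part):

```python
import time

class Transient(Exception): ...   # e.g. the page didn't load
class Structural(Exception): ...  # e.g. the site doesn't support the action

def run_step(primary, fallback, escalate, max_same=2):
    """Try the primary approach, retry once on transient failure,
    then switch approach, then hand off to a human."""
    for attempt in range(max_same):
        try:
            return primary()
        except Transient:
            time.sleep(2 ** attempt)  # brief backoff before the retry
        except Structural:
            break  # no point repeating the same approach
    try:
        return fallback()
    except Exception:
        return escalate()  # bail gracefully: report, don't pretend
```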
The worst kind of failure in agents isn't the loud crash. It's the silent wrong answer. The agent completes all the steps, reports success, and the output is wrong. It booked the wrong date. It entered the wrong passenger name. It selected a non-refundable ticket when you wanted flexible. The agent is confident. You're confident. Everyone finds out later.
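One cheap defense is to verify the result against the original request before reporting success. A rough sketch, with hypothetical field names:

```python
def verify_booking(request: dict, confirmation: dict) -> list[str]:
    """Compare what was asked for against what was actually booked.
    Returns a list of mismatches; empty means the success claim holds."""
    problems = []
    for key in ("date", "passenger_name", "fare_class"):
        if request.get(key) != confirmation.get(key):
            problems.append(
                f"{key}: wanted {request.get(key)!r}, got {confirmation.get(key)!r}"
            )
    return problems
```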
Stopping Conditions
When should an agent stop? This is less obvious than it seems.
The simplest case: the task is done. Flight booked, confirmation received. But what about "research the best CRM tools for a 50-person sales team"? When is that done? After finding 5 options? 10? 20? After reading reviews for each one? After comparing pricing tiers?
Open-ended tasks are where agents struggle most. Without a clear stopping condition, they either stop too early (returning a surface-level answer) or run forever (burning tokens and time on diminishing returns).
The best agent systems build explicit stopping conditions into the task definition. "Find 5 options with pricing" is better than "research CRM tools." Specific beats vague. This applies whether you're giving instructions to an agent or to a new hire. The principle is the same.
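In code, the difference is having a concrete done-predicate at all. A sketch, with a hypothetical search function:

```python
def research_crm_tools(search, max_steps=30, target_options=5):
    """Stop when we have enough options with pricing, or when the
    step budget runs out -- whichever comes first."""
    options = []
    for step in range(max_steps):
        result = search("CRM tools for 50-person sales team", page=step)
        if result and result.get("pricing"):
            options.append(result)
        if len(options) >= target_options:  # explicit stopping condition
            return options
    return options  # budget exhausted: return what we have, flag as partial
```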
Tool Use and Environments
The tools chapter in every agent framework tutorial makes it look clean. Define a function, register it as a tool, the agent calls it when needed. Done.
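And to be fair, the tutorial version really is that small. Something like this, with an illustrative registry rather than any specific framework's API:

```python
TOOLS = {}

def tool(fn):
    """Register a function so the agent can call it by name."""
    TOOLS[fn.__name__] = fn
    return fn

@tool
def search_flights(origin: str, destination: str, date: str) -> list[dict]:
    """Return flight options for a route and date."""
    ...  # call a flight search API

# The model emits {"tool": "search_flights", "args": {...}} and the
# runtime dispatches it: TOOLS[name](**args). Clean. Until production.
```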
In production, tool use is where most of the pain lives.
The Browser Problem
The most common tool agents need is a web browser. And the web was not built for agents.
Think about what browsing actually involves. Clicking buttons that might be behind cookie consent banners. Scrolling to load dynamic content. Handling pop-ups, modals, and overlays. Filling forms that use custom date pickers, dropdown menus, and auto-complete fields. Solving CAPTCHAs (which exist specifically to stop automated access).
Every one of these is a potential failure point. An agent that works perfectly on one site might fail completely on another because the HTML structure is different, or the JavaScript loads differently, or there's a bot detection system that blocks automated browsers.
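Here's a taste of the defensive code this forces on you, sketched with Playwright (a real browser-automation library; the selectors and flow here are illustrative, and every site needs its own):

```python
from playwright.sync_api import sync_playwright, TimeoutError as PWTimeout

def click_search(url: str) -> bool:
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        page = browser.new_page()
        try:
            page.goto(url, timeout=15_000)
            # Dismiss a cookie banner if one is covering the button.
            consent = page.locator("button:has-text('Accept')")
            if consent.count() > 0:
                consent.first.click()
            page.click("#search-button", timeout=5_000)
            return True
        except PWTimeout:
            return False  # slow page or missing element: a transient failure
        finally:
            browser.close()
```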
This is why companies like Browserbase exist. They provide cloud browsers specifically designed for agents. Headless browsers that handle JavaScript rendering, anti-bot protection, and session management. It's infrastructure that sounds boring until you try to build a web-interacting agent without it.
I did a deep dive on TinyFish, which solves a related problem. Most of the web is a mess. Tables that don't render, dynamic content that requires scrolling, pages built for human eyes, not machine parsing. TinyFish uses AI to read the unreadable web. It's the kind of infrastructure that every agent touching the web eventually needs.
The Credential Problem
Here's a question that gets uncomfortable fast: how do you give an agent access to your accounts?
If an agent needs to book a flight, it needs your login credentials. If it needs to send an email, it needs access to your email account. If it needs to make a purchase, it needs your payment information.
The naive approach is to just give the agent your password. This is... obviously bad. But it's what many early agent demos did (and some still do).
The better approaches involve OAuth tokens, API keys, and scoped permissions. Give the agent access to do specific things without giving it the keys to everything. Model Context Protocol (MCP) is emerging as a standard for this. Instead of giving the agent your Notion password, you give it an MCP connection that lets it read and write specific Notion pages. The agent never sees your credentials.
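Conceptually, the difference between the two approaches looks something like this. The fields and scope names are hypothetical, just to show the shape:

```python
# Naive: the agent holds the keys to everything.
naive_credentials = {
    "notion_email": "you@example.com",
    "notion_password": "hunter2",
}

# Scoped: the agent holds a token that can only do specific things,
# and the token can be revoked without changing your password.
scoped_grant = {
    "provider": "notion",
    "token": "<oauth-token-issued-to-the-agent>",
    "scopes": ["pages:read", "pages:write"],   # hypothetical scope names
    "resources": ["workspace/project-notes"],  # limited to specific pages
    "expires_in_seconds": 3600,
}
```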
But even with scoped permissions, the trust question remains. An agent with write access to your email can send emails on your behalf. An agent with access to your calendar can accept meetings. An agent with access to your code repository can push changes. Each permission is a surface area for things to go wrong.
Rate Limits and Costs
Every tool call costs something. An API call to a search engine. A page load in a browser. A database query. A model inference to decide what to do next.
These costs add up in ways that aren't obvious until you're running agents at scale. A single research task might involve 50 web page loads, 20 API calls, and 100+ model inferences. If the agent gets stuck in a retry loop, those numbers multiply fast.
Rate limits are the other hidden constraint. Most APIs limit how many requests you can make per minute or per day. An agent that's working quickly can easily hit these limits, especially if it's running multiple tasks in parallel. Good agent systems handle rate limiting gracefully, queuing requests and backing off when limits are hit. Bad ones just crash.
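Graceful here usually means throttling yourself before the API does it for you. A minimal token-bucket sketch:

```python
import time

class TokenBucket:
    """Allow at most `rate` requests per second, smoothing bursts."""
    def __init__(self, rate: float, capacity: int):
        self.rate, self.capacity = rate, capacity
        self.tokens, self.last = float(capacity), time.monotonic()

    def acquire(self) -> None:
        while True:
            now = time.monotonic()
            self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
            self.last = now
            if self.tokens >= 1:
                self.tokens -= 1
                return
            time.sleep((1 - self.tokens) / self.rate)  # back off until a token frees up

bucket = TokenBucket(rate=2, capacity=5)  # e.g. an API allowing ~2 requests/second
```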
Memory Systems
Memory in agents is one of those problems that sounds simple and is actually incredibly hard. We talked about memory in the previous chapter.
(Figure: the three types of agent memory.)
The Memory Poisoning Problem
Long-term memory sounds great in theory. An agent that remembers your preferences, learns from past interactions, gets better at anticipating your needs over time. But what happens when the memory is wrong?
Imagine an agent that remembers "user always prefers the cheapest option." Maybe that was true for one purchase. Now the agent is booking a flight for a business trip with a client and it picks the budget airline with a connection through three airports. The memory was too aggressive in generalizing from a single data point.
Or worse: what if someone deliberately feeds the agent bad information? Prompt injection through memory is a real attack vector. If an agent stores user feedback and uses it in future sessions, an attacker can shape the agent's future behavior by manipulating what it remembers.
This is why memory systems need careful design. What gets stored, how long it persists, how it can be corrected, and who has access to modify it. Most agent systems today either have no long-term memory (every session starts fresh) or have naive memory (store everything, hope for the best).
The middle ground, storing the right things with the right confidence levels and the right expiration policies, is an unsolved problem. I'm watching a few teams work on this. Nobody has cracked it yet.
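For what it's worth, here's the shape of the thing those teams are circling. A design sketch, not a solved system:

```python
import time
from dataclasses import dataclass

@dataclass
class MemoryRecord:
    claim: str            # e.g. "user prefers the cheapest option"
    evidence_count: int   # how many observations support it
    confidence: float     # 0..1, grows with evidence
    created_at: float
    ttl_seconds: float    # expire rather than generalize forever

    def is_usable(self, min_confidence: float = 0.7) -> bool:
        expired = time.time() - self.created_at > self.ttl_seconds
        return not expired and self.confidence >= min_confidence

# One data point should not become a permanent rule:
m = MemoryRecord("user prefers cheapest option", evidence_count=1,
                 confidence=0.3, created_at=time.time(),
                 ttl_seconds=30 * 24 * 3600)
assert not m.is_usable()  # too weak to act on yet
```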