How to Think About Agents
Frameworks for evaluating what's real.
The Evaluation Problem
Here's something that took me a while to internalize. Most agent benchmarks are essentially broken.
Not broken in the sense that they don't measure anything. They measure something. But what they measure is often disconnected from what actually matters in production.
SWE-bench is a good example. It asks whether a coding agent can resolve issues from GitHub. That sounds close to reality. But a benchmark like this holds a lot constant that production does not. The environment is cleaner. The task framing is more stable. The feedback loop is simplified. There is usually no real penalty for time, for expensive retries, for partial progress, for interruptions, for tool flakiness, for ambiguous requirements.
The result is that SWE-bench scores have become a marketing metric. "We achieved 40% on SWE-bench!" Cool. Does your agent actually help a developer ship code faster? Different question entirely.
The same pattern plays out across the space. Demos look incredible. Benchmarks show improvement. But when you deploy the thing in production, the failure modes are... not what you expected.
I now think of benchmarks as unit tests. Useful for tracking progress and comparing approaches. Dangerous if you confuse them with system performance.
So what should you actually look at?
Real-world success rates. Not on benchmarks, but on the actual tasks your users need. A 95% success rate on demos means roughly a 70% success rate in production, in my experience. Maybe less.
Failure mode analysis. How does it break? Silent failures (wrong answer, confident delivery) are far worse than loud failures (error messages, asking for help). Ask to see the failure cases, not just the success cases.
Time and cost per task. An agent that takes 10 minutes and costs $2 to do something a human can do in 3 minutes is not actually saving money. Do the math.
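Here's that math as a quick sketch. The $60/hour loaded human cost and the two minutes of review time are illustrative assumptions, not numbers from any real deployment:

```python
# Rough per-task comparison for the 10-minute, $2 agent vs. the 3-minute human.
# HUMAN_RATE and REVIEW_MINUTES are illustrative assumptions.

HUMAN_RATE = 60.0        # fully loaded human cost, $/hour (assumed)
REVIEW_MINUTES = 2.0     # human time spent checking the agent's output (assumed)

human_cost = HUMAN_RATE * 3 / 60                      # human does the task directly: $3.00
agent_cost = 2.0 + HUMAN_RATE * REVIEW_MINUTES / 60   # $2 API spend + $2 of review: $4.00

print(f"human: ${human_cost:.2f}  agent: ${agent_cost:.2f}")
# Even before counting the 10 minutes of wall-clock latency, the agent costs more per task.
```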
The Reliability Gap
This is maybe the most important concept in the entire space right now.
There's a massive gap between "works in a demo" and "works 1000 times in a row in production." I'd estimate that gap is where the majority of agent companies are stuck today.
Why is the gap so wide? Three reasons.
Edge cases multiply. A demo tests the happy path. Production hits every edge case in your data, your APIs, your users' behavior. An agent that works 95% of the time fails 50 times out of 1000 runs. Depending on the stakes, that might be totally unacceptable.
Context variability. In a demo, you control the input. In production, users throw things at the agent that no one anticipated. Different phrasings, unexpected data formats, half-completed tasks from previous sessions.
Compounding errors. Agents make decisions sequentially. If each step has a 95% success rate and you have 10 steps, your end-to-end success rate is 0.95^10 ≈ 60%. That's with 95% per step. Most agents aren't even there.
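A quick way to see how fast this compounds, assuming each step fails independently (which is optimistic, since errors in early steps tend to make later steps harder):

```python
# End-to-end success for a sequential agent: p_step ** n_steps,
# assuming each step succeeds independently with probability p_step.

def end_to_end(p_step: float, n_steps: int) -> float:
    return p_step ** n_steps

for p in (0.90, 0.95, 0.99):
    for n in (5, 10, 20):
        print(f"per-step {p:.0%}, {n:2d} steps -> {end_to_end(p, n):.1%} end-to-end")

# per-step 95%, 10 steps -> 59.9% end-to-end, the ~60% figure above.
```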
The companies that are winning are the ones that have figured out how to close this gap for their specific use case. Not in general. For their specific inputs, their specific tool integrations, their specific failure modes.
The Economic Question
When does an agent actually save money vs. a human? This seems straightforward but the math is trickier than most people think.
The naive calculation: "A human costs $X/hour, the agent costs $Y/hour, the agent is cheaper." But this ignores:
- Supervision costs. Someone still needs to check the agent's work, handle its failures, and intervene when it gets stuck. This is often 30-50% of the time savings.
- Error costs. When the agent makes a mistake, what does it cost to fix? In customer support, a bad response might lose a customer. In coding, a bug in production has real downstream costs.
- Setup and maintenance. Prompts need tuning, tools need updating, edge cases need handling. This is ongoing work, not a one-time investment.
The honest math, for most use cases today: agents save 30-50% of the cost, not 90%. They're most cost-effective for high-volume, medium-stakes tasks where the failure cost is low and the volume justifies the setup investment.
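A minimal sketch of that fuller calculation. Every input below is an illustrative assumption; the point is the shape of the formula, not the specific numbers:

```python
# Per-task unit economics for an agent, counting the costs the naive
# comparison leaves out. All inputs below are illustrative assumptions.

def agent_task_cost(
    inference_cost: float,        # model/API spend per task
    supervision_minutes: float,   # human time reviewing or unblocking the agent
    human_rate_per_hour: float,   # fully loaded human cost
    failure_rate: float,          # fraction of tasks the agent gets wrong
    cost_per_failure: float,      # expected cost to detect and fix one failure
    monthly_maintenance: float,   # prompt tuning, tool updates, edge-case handling
    monthly_task_volume: int,     # tasks per month to amortize maintenance over
) -> float:
    supervision = human_rate_per_hour * supervision_minutes / 60.0
    expected_errors = failure_rate * cost_per_failure
    amortized_upkeep = monthly_maintenance / monthly_task_volume
    return inference_cost + supervision + expected_errors + amortized_upkeep

human = 60.0 * 6 / 60.0   # human does the task in 6 minutes at $60/hour -> $6.00
agent = agent_task_cost(
    inference_cost=0.50,
    supervision_minutes=2.0,
    human_rate_per_hour=60.0,
    failure_rate=0.05,
    cost_per_failure=15.0,
    monthly_maintenance=2000.0,
    monthly_task_volume=10_000,
)
print(f"human: ${human:.2f}  agent: ${agent:.2f}  saving: {1 - agent / human:.0%}")
```

With these made-up inputs the saving lands around 40%, which is why the 10x claims deserve scrutiny.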
This will change as models improve. But right now, anyone claiming 10x cost savings is either in a very specific niche or not counting all the costs.
The Closed Loop Advantage
Here's a framework I keep coming back to, and it's central to how I evaluate agent companies.
The companies that win in this space will be the ones that generate their own training data from deployed agents. I call this the closed loop advantage.
It works like this. You deploy an agent. It handles real tasks. Some it gets right, some it gets wrong. The right answers become training data for fine-tuning. The wrong answers (once corrected by humans) become even better training data because they teach the model about the specific failure modes in your domain.
Over time, your agent gets better at exactly the things your users need. Not generic benchmark performance. Specific, domain-relevant performance.
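A minimal sketch of the data side of that loop. The record shape and JSONL output are hypothetical, just to show how successes and corrected failures both feed the fine-tuning set:

```python
# Minimal sketch of the closed loop: log every production task, and turn
# both successes and human-corrected failures into fine-tuning examples.
# The record shape and JSONL format here are hypothetical, for illustration.

import json
from dataclasses import dataclass
from typing import Optional

@dataclass
class TaskRecord:
    task_input: str
    agent_output: str
    succeeded: bool
    human_correction: Optional[str] = None  # filled in when a reviewer fixes a failure

def to_training_example(record: TaskRecord) -> Optional[dict]:
    if record.succeeded:
        # Correct behavior on a real task: a positive example.
        return {"prompt": record.task_input, "completion": record.agent_output}
    if record.human_correction is not None:
        # A corrected failure: teaches the model your domain's specific failure modes.
        return {"prompt": record.task_input, "completion": record.human_correction}
    return None  # uncorrected failure: route to human review, don't train on it

def append_to_dataset(records: list[TaskRecord], path: str = "finetune.jsonl") -> int:
    examples = [ex for r in records if (ex := to_training_example(r)) is not None]
    with open(path, "a") as f:
        for ex in examples:
            f.write(json.dumps(ex) + "\n")
    return len(examples)
```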
This is a genuine moat. It's very hard to replicate because the training data is generated by your users, on your platform, with your specific tool integrations. A competitor starting from scratch doesn't have it.
The companies I'm most excited about are the ones that have this loop running. They might not have the best benchmark scores today. But six months from now, they'll have been learning from real-world usage while everyone else is still tuning prompts.
What to Watch For
I get asked a lot how to tell if an agent company is real vs. vapor. Here's my checklist, roughly in order of importance.
Do they have production deployments? Not beta users. Not pilot programs. Actual paying customers using the agent for real work, at scale. If they can't name customers or share usage numbers, that's a flag.
What's their success rate? And how do they measure it? If they quote benchmark numbers instead of production metrics, dig deeper.
How do they handle failures? Every agent fails. The question is what happens next. Good companies have graceful degradation, human escalation paths, and feedback loops to improve. Bad companies have "it rarely happens."
Is the loop closed? Are they learning from production usage? Do they have a mechanism to get better over time from real-world data? Or are they running the same prompts they wrote six months ago?
What are the unit economics? Does using the agent actually save money at the task level, after accounting for errors and supervision? Or is growth subsidized by venture capital?
None of these questions are easy to answer from the outside. But asking them puts you ahead of 90% of people evaluating agent companies.