The infrastructure that decides whether your agent works on Monday, not just in the demo.
The demo worked.
The agent pulled the right CRM record, summarized the customer, checked recent support tickets, and recommended the right next step. Everyone in the room nodded. It looked ready.
Then Monday happened.
Salesforce returned a field in a slightly different shape. The support API timed out. The customer name in billing did not match the name in Intercom. The agent retried the same action twice. Nobody could explain why it made the recommendation, because the logs only showed the final answer.
The model did not fail.
The harness did.
An agent harness is everything in your agent system that is not the model: tool execution, memory, eval loops, failure recovery, observability, and human-in-the-loop controls.
The model decides what to do. The harness decides whether it is allowed, tracks whether it worked, and handles what happens when it does not.
Most agents do not fail in production because the model is bad. They fail because everything around the model is missing. A tool call returns an unexpected shape. State does not persist between turns. A failure goes unlogged until a customer sees the output. An agent has permission to read billing history and also change a contract field.
That surrounding infrastructure is the agent harness. It is the difference between an agent that works in a demo and one you can trust in production.
In many production agent projects, the harness becomes most of the engineering work. Teams often budget for it like plumbing. They discover it is the product around the model.
Why this matters now
Agents are moving from answering to acting.
A chatbot that gives a mediocre answer is annoying. An agent that updates CRM, emails a customer, escalates an account, changes a workflow, or recommends a renewal action is different. Once the system can act, the operating boundary matters as much as the intelligence.
The early production data points in the same direction. A recent study of 306 production-agent practitioners found that reliability remains the top development challenge, 68% of production agents execute at most 10 steps before requiring human intervention, and 74% still rely primarily on human evaluation. In other words: production agents are not becoming useful because they are fully autonomous. They are becoming useful because teams are putting tighter harnesses around them.
Source: Measuring Agents in Production
The six pieces that matter
Every production harness needs six components, regardless of framework or model.
Tool execution
The harness defines which tools are available, validates every call before it executes, and enforces scope.
Without this layer, the agent has access to whatever you connected it to. That is fine in a prototype. It is not fine in production.
A churn-risk agent should be able to read product usage, support tickets, CRM notes, and billing status. It should not be able to email the customer, change the renewal forecast, or update a contract field without approval.
Tool execution is where you answer:
- Which tools can this agent call?
- Which fields can it read?
- Which systems can it write to?
- Which actions are blocked?
- Which actions require approval?
The model should not be the permission system. The harness should.
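One way to make those answers concrete is an explicit allowlist that every proposed call passes through before anything runs. The sketch below is illustrative only: `ToolGate`, `ToolPolicy`, and the two example tools are hypothetical names, not part of any particular framework.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Any, Callable


class Decision(Enum):
    ALLOW = "allow"
    NEEDS_APPROVAL = "needs_approval"
    BLOCK = "block"


@dataclass(frozen=True)
class ToolPolicy:
    handler: Callable[..., Any]
    allowed_args: frozenset          # argument names the agent may pass
    requires_approval: bool = False  # must a human sign off first?


class ToolGate:
    """Checks every proposed tool call against an explicit allowlist."""

    def __init__(self, policies: dict) -> None:
        self._policies = policies

    def check(self, tool: str, args: dict) -> tuple:
        policy = self._policies.get(tool)
        if policy is None:
            # Anything not registered is blocked by default.
            return Decision.BLOCK, f"tool '{tool}' is not registered for this agent"
        unexpected = set(args) - policy.allowed_args
        if unexpected:
            return Decision.BLOCK, f"unexpected arguments: {sorted(unexpected)}"
        if policy.requires_approval:
            return Decision.NEEDS_APPROVAL, "action requires human approval"
        return Decision.ALLOW, "ok"

    def execute(self, tool: str, args: dict) -> Any:
        decision, reason = self.check(tool, args)
        if decision is not Decision.ALLOW:
            raise PermissionError(f"{tool}: {reason}")
        return self._policies[tool].handler(**args)


# Hypothetical tools for a churn-risk agent; real handlers would call real systems.
def read_usage(account_id: str) -> dict:
    return {"account_id": account_id, "weekly_active_users": 12}


def update_contract_field(account_id: str, field: str, value: str) -> str:
    return f"updated {field} on {account_id}"


gate = ToolGate({
    "read_usage": ToolPolicy(read_usage, frozenset({"account_id"})),
    "update_contract_field": ToolPolicy(
        update_contract_field,
        frozenset({"account_id", "field", "value"}),
        requires_approval=True,
    ),
})
```

Here `gate.check("send_email", {})` is blocked because the tool was never registered, and `update_contract_field` comes back as needing approval. The model can propose anything. The gate decides what runs.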
Memory and state
Short-term memory and long-term memory are different systems.
Short-term memory is what happened in this conversation. Long-term memory is what the agent has learned over days, weeks, or months: customer history, previous recommendations, resolved issues, user preferences, known exceptions, and prior decisions.
Most agents work fine in a demo because the demo is one turn. Production is not one turn. Production is hundreds of turns across weeks, with messy customer data and changing context.
Without persistent state, the agent repeats work, forgets what happened, loses continuity, and makes recommendations without knowing what it already tried.
For a churn-risk agent, memory might include:
- Which accounts were already flagged
- Which CSM reviewed the recommendation
- Which intervention was attempted
- Whether the customer responded
- Whether the risk decreased after the action
Without that state, the agent is not operating. It is guessing again every morning.
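As a rough illustration, the state in the list above fits in a single table. This sketch assumes SQLite is good enough for the job. The `InterventionStore` class, its column names, and the sample account are all made up for the example.

```python
import sqlite3


class InterventionStore:
    """Long-term memory for a risk-flagging agent, backed by SQLite."""

    def __init__(self, path: str = ":memory:") -> None:
        self._db = sqlite3.connect(path)
        self._db.row_factory = sqlite3.Row
        with self._db:
            self._db.execute(
                """
                CREATE TABLE IF NOT EXISTS interventions (
                    account_id         TEXT NOT NULL,
                    flagged_on         TEXT NOT NULL,
                    reviewer           TEXT,
                    action             TEXT,
                    customer_responded INTEGER,
                    risk_decreased     INTEGER,
                    PRIMARY KEY (account_id, flagged_on)
                )
                """
            )

    def record_flag(self, account_id: str, flagged_on: str) -> None:
        # INSERT OR IGNORE keeps a repeated flag on the same day from duplicating.
        with self._db:
            self._db.execute(
                "INSERT OR IGNORE INTO interventions (account_id, flagged_on) "
                "VALUES (?, ?)",
                (account_id, flagged_on),
            )

    def record_outcome(self, account_id: str, flagged_on: str, reviewer: str,
                       action: str, customer_responded: bool,
                       risk_decreased: bool) -> None:
        with self._db:
            self._db.execute(
                "UPDATE interventions SET reviewer = ?, action = ?, "
                "customer_responded = ?, risk_decreased = ? "
                "WHERE account_id = ? AND flagged_on = ?",
                (reviewer, action, int(customer_responded), int(risk_decreased),
                 account_id, flagged_on),
            )

    def history(self, account_id: str) -> list:
        # What the agent should read before recommending anything new.
        rows = self._db.execute(
            "SELECT * FROM interventions WHERE account_id = ? ORDER BY flagged_on",
            (account_id,),
        ).fetchall()
        return [dict(row) for row in rows]


store = InterventionStore()  # pass a file path instead to persist across runs
store.record_flag("acct-42", "2024-03-04")
store.record_outcome("acct-42", "2024-03-04", reviewer="csm-dana",
                     action="exec check-in call",
                     customer_responded=True, risk_decreased=False)
```

Before the agent recommends anything for `acct-42`, it can call `store.history("acct-42")` and see that a check-in call was already tried and did not reduce the risk.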
Eval loop
An eval loop runs the agent against known inputs and expected outputs on a schedule.
In a demo, a human is always watching. In production, nobody watches until something breaks. The eval loop is what watches.
A useful eval loop tests the agent against realistic cases:
- A customer with normal product usage but severe support friction
- A customer with low usage but an active champion
- A renewal account with incomplete billing data
- A support escalation where the customer is high-value but low-risk
- A tool response with a missing field
- A rate-limited API call
- A scenario where the correct action is to ask a human
The point is not to prove the agent is perfect. The point is to know when behavior changes, when quality drops, and when a new model, prompt, tool, or data source made the system worse.
If you cannot test the agent before customers do, you do not have a production agent. You have a live experiment.
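A minimal sketch of such a loop, assuming the agent can be called as a function that takes a case and returns an action name. The cases, field names, and `toy_agent` below are invented stand-ins. A real loop would call the actual agent and run from a scheduler.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class EvalCase:
    name: str
    inputs: dict
    expected_action: str


def run_evals(agent: Callable[[dict], str], cases: list) -> dict:
    """Run every case and report the pass rate plus each mismatch."""
    failures = []
    for case in cases:
        try:
            actual = agent(case.inputs)
        except Exception as exc:  # a crash is a failed case, not a failed eval run
            actual = f"error: {type(exc).__name__}"
        if actual != case.expected_action:
            failures.append({"case": case.name,
                             "expected": case.expected_action,
                             "actual": actual})
    passed = len(cases) - len(failures)
    return {"pass_rate": passed / len(cases) if cases else 0.0,
            "failures": failures}


def regressed(current: float, baseline: float, tolerance: float = 0.02) -> bool:
    """True when the pass rate dropped by more than the tolerance."""
    return current < baseline - tolerance


# Stand-in agent for illustration only.
def toy_agent(inputs: dict) -> str:
    if inputs.get("billing_status") is None:
        return "ask_human"
    if inputs["open_critical_tickets"] >= 3:
        return "escalate"
    return "no_action"


CASES = [
    EvalCase("normal usage, severe support friction",
             {"billing_status": "current", "open_critical_tickets": 4}, "escalate"),
    EvalCase("renewal with incomplete billing data",
             {"billing_status": None, "open_critical_tickets": 0}, "ask_human"),
    EvalCase("low usage, active champion",
             {"billing_status": "current", "open_critical_tickets": 0}, "no_action"),
]

report = run_evals(toy_agent, CASES)
```

Comparing `report["pass_rate"]` against the last known-good baseline with `regressed` is what turns a one-off test into a check on every new model, prompt, or tool change.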
Failure recovery
A tool call times out. An API errors. The model references a field that does not exist. A customer record is duplicated across systems. A data source is stale. A permissions check fails.
In a prototype, those failures become weird outputs.
In production, the harness needs defined recovery paths:
- Retry the call
- Use a fallback source
- Return a partial answer with caveats
- Ask for clarification
- Escalate to a human
- Stop the workflow
- Log the failure for review
“Retry” is not a recovery strategy by itself. Some failures should not be retried. Some should degrade gracefully. Some should block the action entirely.
For example, if an account-health agent cannot reach billing, it may still summarize product usage and support history, but it should not recommend a renewal action. If a customer email draft depends on missing contract terms, the agent should stop and ask a human.
Failure recovery is what prevents one bad API call from becoming a bad customer interaction.
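One way to keep recovery paths defined rather than improvised is a small policy table that maps each failure type to a response. The mapping below is an assumption for illustration. The right answer for each failure depends on the workflow, and the names are hypothetical.

```python
import logging
from enum import Enum

logger = logging.getLogger("harness.recovery")


class Failure(Enum):
    TIMEOUT = "timeout"
    RATE_LIMITED = "rate_limited"
    MISSING_FIELD = "missing_field"
    STALE_DATA = "stale_data"
    DUPLICATE_RECORD = "duplicate_record"
    PERMISSION_DENIED = "permission_denied"


class Recovery(Enum):
    RETRY = "retry"
    FALLBACK_SOURCE = "fallback_source"
    PARTIAL_ANSWER = "partial_answer"
    ASK_HUMAN = "ask_human"
    STOP = "stop"


# (first response, response once retries are exhausted). Illustrative choices only.
POLICY = {
    Failure.TIMEOUT: (Recovery.RETRY, Recovery.PARTIAL_ANSWER),
    Failure.RATE_LIMITED: (Recovery.RETRY, Recovery.PARTIAL_ANSWER),
    Failure.MISSING_FIELD: (Recovery.ASK_HUMAN, Recovery.ASK_HUMAN),
    Failure.STALE_DATA: (Recovery.FALLBACK_SOURCE, Recovery.PARTIAL_ANSWER),
    Failure.DUPLICATE_RECORD: (Recovery.ASK_HUMAN, Recovery.ASK_HUMAN),
    Failure.PERMISSION_DENIED: (Recovery.STOP, Recovery.STOP),  # never retried
}


def choose_recovery(failure: Failure, attempts: int, max_retries: int = 2) -> Recovery:
    """Pick a recovery path; unknown failures stop the workflow."""
    first, after_retries = POLICY.get(failure, (Recovery.STOP, Recovery.STOP))
    chosen = after_retries if first is Recovery.RETRY and attempts >= max_retries else first
    # Every failure is logged for review, whatever the recovery path.
    logger.warning("failure=%s attempts=%d recovery=%s",
                   failure.value, attempts, chosen.value)
    return chosen
```

With this table, a billing timeout is retried twice and then degrades to a partial answer, while a permission failure stops the workflow on the first attempt.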
Observability
Observability means you can see what the agent did and why.
Every tool call. Every retrieved record. Every decision. Every approval. Every failure. Every fallback path. Structured enough to filter by customer, account, agent, tool, time window, or workflow.
Without observability, debugging is guesswork. With observability, you can replay the exact sequence that led to an output.
That matters because production questions sound like this:
- Why did the agent flag this account as at-risk?
- Did it use the latest support ticket?
- Did it read the right billing record?
- Why did it recommend outreach instead of escalation?
- Did a human approve the action?
- Did this behavior change after yesterday’s deploy?
If you cannot answer those questions, you cannot responsibly put the agent near customers, revenue, contracts, or operational systems.
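Answering those questions mostly comes down to recording structured events instead of free-text logs. The sketch below keeps events in memory for brevity. The `TraceEvent` fields and the `TraceLog` class are illustrative assumptions, and a real system would ship the JSON lines to whatever log store it already uses.

```python
import json
import time
from dataclasses import asdict, dataclass, field


@dataclass
class TraceEvent:
    run_id: str
    account_id: str
    kind: str      # e.g. "tool_call", "decision", "approval", "failure"
    name: str      # which tool, which decision, which approval step
    payload: dict  # inputs, outputs, or error details
    ts: float = field(default_factory=time.time)


class TraceLog:
    def __init__(self) -> None:
        self._events = []

    def record(self, event: TraceEvent) -> None:
        self._events.append(event)

    def query(self, **filters) -> list:
        """Filter by any event field, e.g. query(account_id="acct-42")."""
        return [e for e in self._events
                if all(getattr(e, key) == value for key, value in filters.items())]

    def replay(self, run_id: str) -> list:
        """The ordered sequence of steps behind one output."""
        events = sorted(self.query(run_id=run_id), key=lambda e: e.ts)
        return [f"{e.kind}:{e.name}" for e in events]

    def to_jsonl(self) -> str:
        return "\n".join(json.dumps(asdict(e), sort_keys=True) for e in self._events)


log = TraceLog()
log.record(TraceEvent("run-1", "acct-42", "tool_call", "read_support_tickets",
                      {"returned": 3}))
log.record(TraceEvent("run-1", "acct-42", "failure", "read_billing",
                      {"error": "timeout"}))
log.record(TraceEvent("run-1", "acct-42", "decision", "flag_at_risk",
                      {"reason": "3 open critical tickets"}))
log.record(TraceEvent("run-2", "acct-7", "tool_call", "read_usage", {"returned": 1}))
```

`log.replay("run-1")` then shows that the account was flagged after a billing timeout, which is the kind of detail a final answer alone never reveals.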
Human-in-the-loop controls
Some decisions should not be fully automated.
The harness defines which actions require human approval: sending a customer email, updating a contract field, escalating an account, changing a forecast, issuing a refund, or marking a customer as churn risk.
The default should be restrictive. It can loosen as trust builds.
A good human-in-the-loop system is not just a manual checkpoint. It captures the decision, the approver, the reason, and the outcome. That feedback becomes part of the system’s memory and eval set.
The agent can prepare the work. When the action carries real business risk, a human makes the call.
Without this layer, the agent operates at the speed of the model with the judgment of nobody.
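A restrictive default can be as simple as an allowlist of actions that may run unattended, with everything else waiting on a person. In this sketch, `ApprovalQueue`, `AUTO_APPROVED`, and the action names are hypothetical. The executor is whatever actually performs the action in your system.

```python
import itertools
from dataclasses import dataclass
from typing import Callable, Optional

# Restrictive default: only actions listed here run without a human.
AUTO_APPROVED = frozenset({"summarize_account"})


@dataclass
class ApprovalRecord:
    request_id: int
    action: str
    args: dict
    approver: Optional[str] = None
    approved: Optional[bool] = None  # None means still pending
    reason: Optional[str] = None
    outcome: Optional[str] = None


class ApprovalQueue:
    def __init__(self, executor: Callable[[str, dict], str]) -> None:
        self._executor = executor
        self._ids = itertools.count(1)
        self.records = {}

    def propose(self, action: str, args: dict) -> ApprovalRecord:
        record = ApprovalRecord(next(self._ids), action, args)
        self.records[record.request_id] = record
        if action in AUTO_APPROVED:
            record.approved = True
            record.reason = "auto-approved: on the unattended allowlist"
            record.outcome = self._executor(action, args)
        return record  # anything else stays pending until a human decides

    def decide(self, request_id: int, approver: str, approved: bool,
               reason: str) -> ApprovalRecord:
        record = self.records[request_id]
        if record.approved is not None:
            raise ValueError(f"request {request_id} was already decided")
        record.approver, record.approved, record.reason = approver, approved, reason
        record.outcome = (self._executor(record.action, record.args)
                          if approved else "not executed")
        return record


queue = ApprovalQueue(executor=lambda action, args: f"ran {action}")
pending = queue.propose("send_customer_email", {"account_id": "acct-42"})
queue.decide(pending.request_id, approver="csm-dana", approved=False,
             reason="customer already contacted this week")
```

Each record keeps the decision, the approver, the reason, and the outcome, so rejected proposals can later be turned into eval cases.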
The gap between “works in a demo” and “works on Monday” is almost entirely a harness problem.
The harness maturity model
Level 0: Script
Calls a model and maybe one tool. No replay, no evals, no recovery.
This is fine for prototypes and internal experiments. It is not enough for customer-facing work.
Level 1: Controlled tool use
The agent has scoped tools, validates calls, and logs actions.
This is the first real boundary. The agent can do useful work, but its permissions are limited.
Level 2: Observable agent
Every decision can be replayed. Tool calls, inputs, outputs, and failures are visible.
This is where debugging becomes possible.
Level 3: Recoverable agent
The system has retries, fallbacks, graceful degradation, and escalation rules.
This is where the agent stops breaking every time the real world is slightly messy.
Level 4: Governed agent
The harness includes human approvals, policy boundaries, eval loops, and permission-aware memory.
This is where the agent can operate near meaningful business workflows.
Level 5: Production harness
The system has continuous evals, drift detection, audit trails, monitoring, versioned behavior, and clear ownership.
This is where the agent becomes infrastructure, not a demo with better logging.
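For a quick self-assessment, the levels can be read as cumulative: a harness sits at the highest level for which it has every capability at that level and below. That cumulative reading, and the capability names used here, are assumptions made for this sketch.

```python
# Capabilities each level adds, paraphrased from the level descriptions.
LEVEL_REQUIREMENTS = [
    (1, {"scoped_tools", "call_validation", "action_logging"}),
    (2, {"decision_replay"}),
    (3, {"retries", "fallbacks", "escalation_rules"}),
    (4, {"human_approvals", "eval_loop", "permission_aware_memory"}),
    (5, {"continuous_evals", "drift_detection", "audit_trail", "clear_ownership"}),
]


def maturity_level(capabilities: set) -> int:
    """Highest level whose requirements, and all lower ones, are fully met."""
    level = 0
    for candidate, required in LEVEL_REQUIREMENTS:
        if not required <= capabilities:  # a gap at this level caps the score
            break
        level = candidate
    return level


current = {"scoped_tools", "call_validation", "action_logging",
           "decision_replay", "human_approvals"}
level = maturity_level(current)
```

In this example the harness has human approvals but no retries or fallbacks, so it scores Level 2. Under this reading, approvals layered on top of an agent that cannot recover from failures do not count as governance.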
What “production-grade” actually means
The difference between a demo harness and a production harness shows up at 2am on a Thursday when nobody is watching.
Demo: the tool call succeeds, the data is clean, and the customer account looks like the test account.
Production: the tool call returns a 429 because the rate limit changed. The customer’s name in Salesforce does not match the name in Intercom. The agent hits a scenario nobody evaluated against. The answer might still look confident.
Production-grade means the harness handles the unhappy paths, not just the happy one.
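Take the 429. A common way to handle it is a bounded retry with exponential backoff and jitter that honors the server's Retry-After value when one is supplied. The sketch below assumes the tool layer raises a `RateLimited` exception on a 429. That exception and `call_with_backoff` are hypothetical names, not a specific client library's API.

```python
import random
import time
from typing import Callable, Optional


class RateLimited(Exception):
    """Raised by the (hypothetical) tool layer when an API answers HTTP 429."""

    def __init__(self, retry_after: Optional[float] = None) -> None:
        super().__init__("rate limited (HTTP 429)")
        self.retry_after = retry_after  # seconds, from Retry-After if present


def call_with_backoff(call: Callable[[], object], max_attempts: int = 4,
                      base_delay: float = 0.5, max_delay: float = 30.0,
                      sleep: Callable[[float], None] = time.sleep) -> object:
    """Retry a rate-limited call a bounded number of times, then give up."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return call()
        except RateLimited as exc:
            if attempt == max_attempts:
                raise  # out of attempts: hand off to the recovery policy
            if exc.retry_after is not None:
                delay = exc.retry_after  # the server told us how long to wait
            else:
                # Exponential backoff with full jitter, capped at max_delay.
                delay = random.uniform(0, min(max_delay, base_delay * 2 ** (attempt - 1)))
            sleep(delay)
```

The retry is bounded on purpose. Once the attempts run out, the exception propagates so the harness can fall back, degrade, or escalate instead of looping.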
The eval loop runs daily, not once before launch. Failures are categorized, not buried. Someone on the team can explain yesterday’s agent decisions without opening the codebase. Human approval is built into high-risk actions. Tool permissions are scoped before the agent gets access.
The model gives the agent intelligence.
The harness gives it responsibility.
Three questions to ask about your harness
Can you replay any agent decision from the last 24 hours?
If you can pull up the exact tool calls, retrieved data, approval steps, and decision path behind a specific output, your observability is working.
If you would have to rerun the agent and hope it behaves the same way, it is not.
What happens when a tool call fails?
If the answer is only “it retries” or “it errors out,” the recovery layer is incomplete.
A production harness has defined behavior for the failure modes the tool layer can produce: timeout, missing field, permission failure, stale data, duplicate record, unexpected response shape, and rate limit.
Which agent actions require human approval?
If the answer is “none” or “we have not decided,” the boundary is not set.
The default should be more restrictive, not less. It loosens as trust builds.
FAQ
Do I need a harness for a simple, single-turn agent?
Yes, but it can be small. The simpler the agent, the simpler the harness. But “no harness” means no visibility into what the agent did and no recovery when it fails. That may be acceptable in a prototype. It is not acceptable when the output reaches a customer or changes a business system.
Which harness framework should I use?
The framework matters less than the coverage. Whether you build on LangGraph, CrewAI, a custom loop, or a well-structured script, the six components need to be present.
Evaluate frameworks on tool control, eval loops, observability, failure recovery, state management, and human approval flows, not on feature count.
How do I know if my harness is good enough?
Run the three diagnostic questions:
- Can you replay decisions?
- Can you handle failures gracefully?
- Can you name which actions require human approval?
If the answer to all three is yes, you are in reasonable shape. If not, the harness is the next thing to build before shipping the next agent.
What is the minimum viable harness?
Tool execution validation, basic observability, and a defined failure path.
Those three catch the most common production failures. Add eval loops, persistent memory, and human-in-the-loop controls as the agent gets closer to customers, revenue, contracts, or internal systems.
Is the harness part of the agent or separate infrastructure?
Both. Conceptually, it is the operating layer around the agent. Practically, it may be spread across orchestration code, logging, eval systems, permissioning, queues, workflow tools, and human review interfaces.
The important part is not where it lives. The important part is that it exists.