Deploying AI agents to production: what separates production from demo

Building an AI agent demo is easy. You call the API, the model responds, you show it to stakeholders. Everyone is happy.

Then you deploy it to production. A week later a $3,000 invoice arrives because the agent got stuck in a loop and called the model 8,000 times. Or it returns a structured command that does not exist, because a user slipped an instruction in via natural language. Or you simply have no idea what is happening — no logs, no metrics, no context.

These are the problems you will run into. Not if. When.

Here are six things you need to have figured out before you put an AI agent into production.

1. Cost gate — stop a runaway loop before it shows up on the invoice

Every agent that calls an LLM in a cycle is a potential runaway loop. A bug in the exit condition, an unexpected model output, a network timeout — and the agent calls the API hundreds or thousands of times.

The fix is mechanical: a cost gate that throws an exception before the request if the daily budget has been exhausted. Count tokens precisely, including cached ones: depending on the provider, cache writes can cost 1.25× the standard input price and cache reads 0.10×. If those multipliers are not reflected in your budget, you are not seeing the real cost.

Without this, one looping agent burns through your monthly budget overnight. With it, the agent ends gracefully with an exception and an alert.
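A minimal sketch of such a gate, assuming an in-memory counter and made-up per-token prices. `CostGate`, `Usage`, `BudgetExceeded`, and the two base prices are hypothetical names and numbers chosen for illustration; only the 1.25× and 0.10× multipliers come from the text above.

```python
from dataclasses import dataclass

# Hypothetical base prices in USD per million tokens; use your provider's real rates.
INPUT_PRICE_PER_MTOK = 3.00
OUTPUT_PRICE_PER_MTOK = 15.00
CACHE_WRITE_MULTIPLIER = 1.25  # cache writes: 1.25x the standard input price
CACHE_READ_MULTIPLIER = 0.10   # cache reads: 0.10x the standard input price


class BudgetExceeded(RuntimeError):
    """Raised before a request when the daily budget is already spent."""


@dataclass
class Usage:
    input_tokens: int = 0
    output_tokens: int = 0
    cache_write_tokens: int = 0
    cache_read_tokens: int = 0


def call_cost(usage: Usage) -> float:
    """Cost of one call in USD, with cached tokens priced separately."""
    input_equivalent = (
        usage.input_tokens
        + usage.cache_write_tokens * CACHE_WRITE_MULTIPLIER
        + usage.cache_read_tokens * CACHE_READ_MULTIPLIER
    )
    return (
        input_equivalent * INPUT_PRICE_PER_MTOK
        + usage.output_tokens * OUTPUT_PRICE_PER_MTOK
    ) / 1_000_000


class CostGate:
    def __init__(self, daily_budget_usd: float) -> None:
        self.daily_budget_usd = daily_budget_usd
        self.spent_usd = 0.0

    def check(self) -> None:
        """Call before every request; raises once the budget is exhausted."""
        if self.spent_usd >= self.daily_budget_usd:
            raise BudgetExceeded(
                f"spent ${self.spent_usd:.2f} of ${self.daily_budget_usd:.2f}"
            )

    def record(self, usage: Usage) -> None:
        """Call after every response with the usage the provider reported."""
        self.spent_usd += call_cost(usage)
```

In the agent loop, `gate.check()` would run before each model call and `gate.record(...)` after it. A real deployment would also reset the counter daily and keep it in shared storage rather than in process memory; both are left out here to keep the sketch short.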

2. Guardrails — input sanitisation and server-side output validation

An LLM is an open input/output system. That means two risks:

On input: a user can inject instructions in natural language that change the model's behaviour. Prompt injection — not theory, a real attack vector. Sanitise input before it reaches the system prompt.

On output: the model returns a structured result (a JSON command, an entity name, a numeric code). Without validation, sooner or later it will return a value that looks correct but does not exist in the system. We handle this with server-side regex guards: not just schema validation, but actual checks against allowed values. A concrete case: an admin writes free text and the model extracts a structured command from it. Without a guard, a hallucinated command would pass silently into the system.

Guardrails do not slow the agent down. They slow the damage down.
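Both sides can be sketched in a few lines. This is an illustration under assumptions, not the guards described above: the command names, `COMMAND_SHAPE`, `sanitise_user_text`, and `validate_command` are all hypothetical. The input step only strips control characters and caps length, which reduces the attack surface but does not by itself stop prompt injection.

```python
import re

# Hypothetical command set; in a real system it comes from the application itself.
ALLOWED_COMMANDS = {"create_job", "close_job", "assign_user"}
COMMAND_SHAPE = re.compile(r"[a-z_]{1,40}")


class InvalidModelOutput(ValueError):
    """Raised when the model returns a command the system does not know."""


def sanitise_user_text(text: str, max_chars: int = 2000) -> str:
    """Drop control characters (except newline and tab) and cap the length."""
    cleaned = "".join(ch for ch in text if ch in "\n\t" or ch.isprintable())
    return cleaned[:max_chars]


def validate_command(raw: str) -> str:
    """Accept a model-produced command only if it is well-formed AND allowed."""
    candidate = raw.strip()
    if not COMMAND_SHAPE.fullmatch(candidate):
        raise InvalidModelOutput(f"malformed command: {candidate!r}")
    if candidate not in ALLOWED_COMMANDS:
        raise InvalidModelOutput(f"unknown command: {candidate!r}")
    return candidate
```

The regex catches output that is not even shaped like a command; the allowlist catches the more dangerous case of a well-formed command that simply does not exist.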

3. Observability — every LLM call must be auditable

Running a production system with no visibility into LLM calls is flying blind. You do not know where latency lives, what things cost, which configuration is returning bad results.

We log every call with four dimensions: tokens (input/output/cache), latency, cost, and configuration (model, temperature, system prompt version). The data is aggregated per provider and per model.

The result is concrete: you can see that GPT-4o mini is 8× cheaper than GPT-4o on a specific use case and still accurate enough. Or that one endpoint accounts for 60% of total costs. Without numbers those are just impressions.
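One way to structure such a record and its aggregation, as a sketch: `LLMCallRecord` and `aggregate` are hypothetical names, and a production setup would write the records to a log store or metrics backend rather than keep them in a list.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, Iterable, Tuple


@dataclass(frozen=True)
class LLMCallRecord:
    # Configuration
    provider: str
    model: str
    temperature: float
    prompt_version: str
    # Tokens
    input_tokens: int
    output_tokens: int
    cache_tokens: int
    # Latency and cost
    latency_ms: float
    cost_usd: float


def aggregate(records: Iterable[LLMCallRecord]) -> Dict[Tuple[str, str], dict]:
    """Sum calls and cost, and average latency, per (provider, model)."""
    totals = defaultdict(lambda: {"calls": 0, "cost_usd": 0.0, "latency_ms": 0.0})
    for record in records:
        bucket = totals[(record.provider, record.model)]
        bucket["calls"] += 1
        bucket["cost_usd"] += record.cost_usd
        bucket["latency_ms"] += record.latency_ms
    return {
        key: {
            "calls": b["calls"],
            "cost_usd": b["cost_usd"],
            "avg_latency_ms": b["latency_ms"] / b["calls"],
        }
        for key, b in totals.items()
    }
```

Grouping by another field, such as an endpoint name or `prompt_version`, is the same loop with a different key.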

4. Fallback and self-healing — the agent must survive its own errors

AI agents make mistakes. The model occasionally returns invalid JSON. A request fails due to temporary unavailability. The result does not match the schema.

Two things help:

Local inference as a hedge. A simple task (sentiment analysis, for example) does not always need to go to a cloud API. Running the computation locally means no network round trip, no per-call API fee, and no dependency on an external provider. It is a hedge on both availability and cost.

Self-healing on failure. When a request fails or returns invalid output, the agent receives the error message and generates a corrected request. One retry with error context resolves most transient problems. Instead of a crash you get a degraded but functional result.

Graceful degradation — not an unnecessary crash on the first error.
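The retry-with-error-context idea can be sketched as follows. `call_with_self_healing` and its parameters are hypothetical; `call_model` stands for whatever function sends a prompt to the model and returns its text. The sketch only covers invalid JSON, and a network failure would need its own handling around `call_model`.

```python
import json
from typing import Callable, Optional


def call_with_self_healing(
    call_model: Callable[[str], str],
    prompt: str,
    max_retries: int = 1,
    fallback: Optional[dict] = None,
) -> dict:
    """Parse the model's reply as a JSON object, retrying with the error attached."""
    attempt_prompt = prompt
    last_error: Optional[Exception] = None
    for _ in range(max_retries + 1):
        raw = call_model(attempt_prompt)
        try:
            result = json.loads(raw)
            if not isinstance(result, dict):
                raise ValueError("expected a JSON object")
            return result
        except ValueError as exc:  # json.JSONDecodeError is a ValueError subclass
            last_error = exc
            # Feed the error back so the next attempt can correct itself.
            attempt_prompt = (
                f"{prompt}\n\nYour previous reply was rejected: {exc}\n"
                "Return only a valid JSON object."
            )
    if fallback is not None:
        return fallback  # degraded but functional result
    raise RuntimeError(f"output still invalid after retries: {last_error}")
```

Passing a `fallback` turns the final failure into a degraded result instead of an exception; leaving it out keeps the failure loud, which may be the better default for critical paths.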

5. Human-in-the-loop and audit log — AI must not be a black box

Invoice data extraction, legal documents, financial decisions — these are areas where an AI mistake has real consequences. Here, monitoring accuracy is not enough. Neither is 90%.

Human-in-the-loop is an architectural decision, not a workaround. The model proposes, a person confirms. Every decision — proposal and confirmation alike — goes into an audit log with timestamp, user, and context.

Two effects: lower risk on critical data and a better training signal (you can see where the model makes mistakes and why a person corrects it).
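A sketch of the propose-and-confirm flow with an audit trail, assuming an in-memory log. `ReviewQueue`, `AuditEntry`, and the `corrected` flag are hypothetical; a real system would persist the entries in append-only storage.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import List


def _now() -> str:
    return datetime.now(timezone.utc).isoformat()


@dataclass(frozen=True)
class AuditEntry:
    event: str    # "proposal" or "confirmation"
    user: str     # model name for proposals, reviewer for confirmations
    context: dict
    timestamp: str = field(default_factory=_now)


class ReviewQueue:
    def __init__(self) -> None:
        self.audit_log: List[AuditEntry] = []

    def propose(self, proposed: dict, model: str) -> dict:
        """Record what the model proposed before anyone acts on it."""
        self.audit_log.append(
            AuditEntry("proposal", model, {"proposed": dict(proposed)})
        )
        return proposed

    def confirm(self, proposed: dict, final: dict, reviewer: str) -> dict:
        """Record the reviewer's decision and whether it changed the proposal."""
        context = {
            "proposed": dict(proposed),
            "final": dict(final),
            "corrected": proposed != final,
        }
        self.audit_log.append(AuditEntry("confirmation", reviewer, context))
        return final
```

Filtering the log for entries where `corrected` is true gives the list of cases in which a person overrode the model, which is the training signal mentioned above.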

For reference: a municipal RAG chatbot with temperature 0.1 achieves ~90% accuracy on live data. The remaining 10% is the argument for review, not blind trust.

6. Eval before deployment — how do you know a change helped?

Without systematic evaluation, changes to prompts and configuration are shots in the dark. Did you improve accuracy, or did you break it on a different segment of inputs?

Eval does not have to be complicated: a set of representative inputs and expected outputs, run on every change, is enough. We choose a low temperature (0.1) for factual tasks where consistent, repeatable output is more valuable than creativity.
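A harness of that kind can be very small. The sketch below assumes exact-match scoring after trimming and lowercasing, which suits short factual answers; `run_eval` is a hypothetical name, and `agent` stands for any function that maps an input string to the agent's answer.

```python
from typing import Callable, Iterable, Tuple


def run_eval(agent: Callable[[str], str], cases: Iterable[Tuple[str, str]]) -> dict:
    """Run every (input, expected) pair through the agent and report accuracy."""
    failures = []
    total = 0
    for prompt, expected in cases:
        total += 1
        actual = agent(prompt)
        # Exact match after normalising whitespace and case; an assumption,
        # free-form answers need a looser comparison.
        if actual.strip().lower() != expected.strip().lower():
            failures.append({"input": prompt, "expected": expected, "actual": actual})
    accuracy = (total - len(failures)) / total if total else 0.0
    return {"total": total, "accuracy": accuracy, "failures": failures}
```

Running it before and after a prompt change, and comparing the `failures` lists rather than only the accuracy figure, shows whether a fix for one segment of inputs broke another.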

Honest gap: we do not do fine-tuning or MLOps. We optimise at the prompt and architecture level, not at the level of model weights. For most production use cases, that is sufficient.

How it holds together

All of the components above only work when the AI agent is not an isolated chatbot. It has to be integrated into a real system — with data, state, and business logic. The model makes sense when it knows what it is working with: it has the context of the job, the communication history, the data structure from the system. Without that, it is just a text generator.

Cost gate, guardrails, observability, fallback, human-in-the-loop, and eval are the infrastructure around the model. The model itself is just one layer.

FAQ

Do I need to solve everything at once, or can I go step by step?

You can go step by step, but cost gate and observability belong in version one. Without them you have no data to decide what to address next. Add guardrails as soon as the agent is processing user input. Implement human-in-the-loop where a mistake has real consequences.

What if the model I am using today gets discontinued or its price goes up?

That is exactly why we keep model calls behind an abstraction layer — switching the provider or model is then a configuration change, not a refactor. This is one of those places where architecture decides before the problem appears.
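What such an abstraction layer might look like, as a sketch: the agent code calls `complete()` with a `ModelConfig`, and provider-specific code lives in registered backends. All names here are hypothetical, and the backends are plain functions so that no real provider SDK call has to be assumed.

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass(frozen=True)
class ModelConfig:
    provider: str
    model: str
    temperature: float = 0.1


# A backend takes the config and a prompt and returns the model's text.
Backend = Callable[[ModelConfig, str], str]
_BACKENDS: Dict[str, Backend] = {}


def register_backend(provider: str, backend: Backend) -> None:
    """Wire a provider name to the function that actually calls that provider."""
    _BACKENDS[provider] = backend


def complete(config: ModelConfig, prompt: str) -> str:
    """The only entry point the rest of the application uses."""
    try:
        backend = _BACKENDS[config.provider]
    except KeyError:
        raise ValueError(
            f"no backend registered for provider {config.provider!r}"
        ) from None
    return backend(config, prompt)
```

With this shape, moving to a different provider or model means loading a different `ModelConfig` from configuration; the calling code does not change.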

Is AI in production significantly more expensive than in a demo?

A demo calls the model tens of times. Production calls it thousands of times a day. But cost is mostly about what you are counting. Cache reads cost 0.10× the standard price — proper caching can cut the invoice dramatically. Observability tells you where the money is going. Without it you are optimising at random.

If you are preparing an AI agent for production and want to review the architecture before you hit problems in the field, get in touch — we will go through it together.


