AI agents in software development: real experience and war story

A year ago I was skeptical. Not about AI itself — about the way people talked about it. "AI will replace developers." "Just write a prompt." "Copilot will write it for you." Then I started actually using AI agents in my own projects, watching what works and what doesn't, and the picture got more complicated.

This article is not about how great AI is. It is about how we specifically wire it into development, and about one incident that cost us more money than it should have — and taught us to test abstractions differently.

The architecture we use: orchestrator + workers

Our AI workflow has two layers.

A strong model (orchestrator) holds the overall context. It plans. It does reference discovery — walks the code and finds ALL affected places before every change. It reviews other agents' output. It decides what is an edge case and what is a hallucination. This work costs tokens, but it is worth it — the strong model does exactly what it is good at.

Weaker, cheaper models (workers) get isolated, well-defined steps. Add an import to 15 files. Generate a test scaffold from this pattern. Scrape the build results. Rewrite the DTO to the new structure. These steps are well-specifiable, require no cross-file reasoning, and can run in parallel. Workers are cheap and fast — and if one makes a mistake, the orchestrator catches it in review.

It is the same principle as a human team: the senior holds architecture and decisions, others execute well-defined chunks. The difference is speed — parallel workers handle in a minute what one person would take an hour to do.

But it only works when the work is actually well-specified. A vague brief produces a vague result. That is true for a junior developer and it is true for AI.

Reference discovery: a lesson from a real incident

One of the first lessons AI agents taught us was not about AI. It was about discipline.

When wiring a SignalR handler into the frontend, we let an agent make the change. The change looked correct. Tests passed. Three days later it turned out the wiring was working in 3 of 8 places — the agent had quietly skipped the rest because they were not in the same file.

Since then, one rule applies: before every change, we find ALL places affected by it. Only then do we make the change. Agents can search references quickly and thoroughly — but we have to assign that task explicitly. Without it, they optimise for what they can see, not for what exists.

Reference discovery is unglamorous work. It does not make for a good demo video. But it is the difference between a change that works and a change that looks like it works.

Spec before code, not after

The second shift was in the order of work. We used to write specs as after-the-fact documentation — if at all. AI agents made that worse: with AI you can generate code fast, but fast code without a spec is a fast problem.

The rule now: a spec is an agreement before the work starts, not a description after it ends. The spec says what must hold, why, and what decisions stand behind it. Code is just the implementation of agreed behaviour.

The benefit: when an agent proposes a solution that fits the spec, it is a good proposal. When it does not, we know why. Without a spec, every AI hallucination has to be explained against the entire context — and that makes every round of review more expensive.

War story: how `Microsoft.Extensions.AI` silently stole money from us

This is the central story of the article, because it illustrates a class of bugs that AI abstractions do not guard against.

We work with multiple LLM providers simultaneously — Claude, OpenAI, Gemini, Grok — under a single abstraction. Microsoft.Extensions.AI is a reasonable choice for a unified interface: one interface for multiple providers, configuration-level swapping, no vendor lock-in at the code level.

But we noticed token costs were not falling the way they should. We had prompt caching set up for the system message — which with Claude Sonnet and Opus saves up to ~90% on repeated calls. The model works with long system prompts, so caching is a material difference for us.

Except caching was not working. Not obviously — no error message, no warning. Tokens were simply being counted at full price.

We spent a while hunting for a configuration issue. Then we wrote an integration test that compared expected cache hits against actual ones — and the test failed. No cache hits. On any call.

The cause: Microsoft.Extensions.AI at that version was silently dropping the cache_control parameter that the Anthropic API requires to mark tokens for caching. The abstraction ignored it — either it did not support it, or it did not propagate it to the raw request. From the code's perspective everything looked correct. From the provider API's perspective we were sending requests with no instruction to cache.

The fix was straightforward: for this client we bypassed the abstraction and went to the provider's raw endpoint with fine-grained control. Prompt caching has model-specific minimum thresholds — approximately 1024 tokens for Sonnet and Opus, 2048 for Haiku. You have to mark the right blocks, in the right order, at the right minimum length. The abstraction was hiding that detail — and hiding it badly.

Lesson: An abstraction can silently steal behaviour. Testing the happy path is not enough — you have to test the boundary. For a critical integration (cost, security, state), you want to know what actually crosses the wire, not what the high-level API claims.

Today we have an integration test that verifies cache hits match expectations for every provider. It is not enough for the call to succeed. It must succeed the way we intended.

What AI actually speeds up — and what it does not

After a year of working with AI agents I have enough data for a sober comparison.

AI speeds up:

Well-specified, repetitive work. Boilerplate, mappers, tests from patterns, rename across files.
Code exploration. Find where something is defined, what all the references are, what pattern is used elsewhere in the project.
First drafts. READMEs, specs, PR descriptions, migration scripts. Output needs review, but the starting point is faster.

AI does not speed up:

Architecture. Which abstraction is right, how to split responsibilities, where the bottleneck will be at scale — this requires judgment that AI does not have.
Integration. Exactly where two systems touch, what happens in the error case, what the real behaviour of a third party is versus what the documentation says — AI hallucinates, it does not measure.
Diagnostics. Why does production not contain cache hits? Why is latency significantly higher than in dev? AI guesses. A senior engineer traces.

The most dangerous combination is a junior developer and AI without senior review. The output looks correct. It has types, it has tests, it passes PR. And it contains a bug that shows up in production because the model did not know what it did not know — and the junior did not catch it, because AI did not catch it either.

Senior review is the multiplier. Not a check out of distrust — it is the layer of judgment AI does not provide.

Where we are today

In our projects we run with multiple LLM providers under a single abstraction and with our own MCP servers that expose our systems to AI agents — catalogues, workflows, business data. Every LLM call has a budget gate and an audit log. Costs are predictable and measurable, not the result of prompting without thinking.

But the most valuable thing we have learned is not technical. It is discipline: reference discovery before every change, spec before code, integration test for every critical abstraction. With AI or without it.

FAQ

What model do you use as orchestrator?

Currently Claude Opus or Sonnet for orchestration (large context, good reference discovery), cheaper models for isolated steps. The specific choice changes with every new version — we test and reassess.

How do you test that prompt caching actually works?

An integration test that makes two identical calls in sequence, compares cache_read_input_tokens in the response metadata against input_tokens, and verifies the second call had a cache hit. If not, the test fails — and we know before it costs us money.

When do you not recommend AI agents?

For work where the critical judgment lives in context the agent does not have. Security architecture, database migration with production on the line, tuning real-world performance — here AI is at most a helper for exploration, not a decision-maker. And always when the output goes directly to production without senior review.

How We Build Software with AI Agents (and Where It Bit Us)

The architecture we use: orchestrator + workers

Reference discovery: a lesson from a real incident

Spec before code, not after

War story: how `Microsoft.Extensions.AI` silently stole money from us

What AI actually speeds up — and what it does not

Where we are today

FAQ

Facing a similar problem? Get in touch.

The architecture we use: orchestrator + workers

Reference discovery: a lesson from a real incident

Spec before code, not after

War story: how Microsoft.Extensions.AI silently stole money from us

What AI actually speeds up — and what it does not

Where we are today

FAQ

Facing a similar problem? Get in touch.

What Is Vibe Trading

How Long Does Software Development Actually Take (and What Slows It Down)

How Much Does a Custom ERP Cost (and When to Build One)

War story: how `Microsoft.Extensions.AI` silently stole money from us