Why AI models keep fucking up

I know this sounds salty, but if you write code every day you have probably felt this too.

Every new model drop comes with the same story: better reasoning, better coding, better everything. And sure, some of that is true. They are way better than they were in 2023.

But in actual day-to-day work, they still mess up in the most irritating ways.

You ask for one component change, it rewrites half the project. You ask for a bug fix, it sneaks in new dependencies. You ask for a clean UI, it gives you a purple-pink dashboard from 2017.

To be clear, this is not an anti-AI post. I use these tools all the time. This is a “can we stop pretending this is solved” post.

What is actually going on?

Big reason: these models are built to produce likely text, not guaranteed truth.

That is not a hot take. It is what the research says. The hallucination survey by Ji et al. explains this really clearly: the output can sound polished and still be wrong, especially when the prompt is vague or context is missing (paper).

In normal language: if there is a gap, the model will confidently fill it.

And in code, “confidently wrong” is expensive.
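Here is a toy sketch of what “likely, not true” means in practice. The tokens and probabilities are invented for illustration and do not come from any real model, and `pick_next_token` is a hypothetical helper, not a real API. It just shows that picking the most probable continuation says nothing about whether that continuation is right for your repo.

```python
# Toy illustration only: the tokens and probabilities below are invented,
# not taken from any real model.

def pick_next_token(distribution: dict[str, float]) -> str:
    """Greedy decoding: return the single most probable token."""
    return max(distribution, key=distribution.get)


# Hypothetical distribution after the prompt "import { debounce } from '"
# in a project that does NOT have lodash installed.
next_token_probs = {
    "lodash": 0.62,            # very common in public code, so very likely
    "./utils/debounce": 0.21,  # what this imaginary repo actually uses
    "underscore": 0.17,
}

choice = pick_next_token(next_token_probs)
print(choice)  # lodash
```

The popular answer wins, and that is how a “small fix” ends up adding a dependency you never asked for.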

“But benchmarks are improving” - yes, and?

Benchmarks have improved a lot. No argument there.

The original SWE-bench paper was brutal. On real GitHub issues, the best model at the time resolved just 1.96% of them (paper). Newer systems are much stronger, and SWE-bench Verified has scores in a totally different range now (leaderboard).

Still, your private repo is not a neat benchmark environment.

Real projects have weird build scripts, half-documented flows, stale assumptions, and random edge cases no model can infer from one prompt. So even with better model scores, there is still a big gap between “passes eval” and “safe to merge without babysitting.”

Agents are getting more autonomous, but humans are still in the loop

Recent data from Anthropic on real agent usage is actually useful here.

From their 2026 analysis:

median Claude Code turns were around 45 seconds
long-tail autonomous turns got much longer over time
most tool calls still appeared to involve safeguards and human involvement
software engineering made up almost half of observed tool use

(source)

So yes, autonomy is going up. Also yes, people still monitor closely, because when agents go off track, they go off track fast.

Real-world tests show the same thing

Project Vend is a funny example, but the lesson is serious. In phase two, Anthropic gave its AI system better tools, better structure, and better workflows, and it improved. But it still got manipulated, made bad calls, and needed humans to step in (Project Vend phase two).

That is exactly what we see in coding too.

Smarter model plus better tooling helps a lot. It still does not make the system foolproof.

Why this hurts developers more

Because our mistakes are expensive.

If a chatbot gives someone a mediocre draft email, who cares? If a coding agent writes a wrong migration or touches auth logic, now you have an incident.

That is why devs sound more skeptical than everyone else. We are not haters. We are the people cleaning up the aftermath.

How I use AI now (without losing my mind)

What works for me:

Ask for a plan first, code second.
Set hard constraints in the prompt (no new packages, only touch specific files, minimal diff).
Review every file it changed, even if the diff “looks fine.”
Run tests after each meaningful change.
Never auto-ship anything touching auth, data, billing, or infra.

This keeps the speed and cuts most of the chaos.
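A few of these rules can be checked mechanically instead of by eyeball. Below is a minimal sketch, assuming you already have the list of changed paths (for example from `git diff --name-only`). The names `ALLOWED`, `SENSITIVE`, `DEPENDENCY_FILES`, and `review_flags` are all hypothetical, and the path patterns are placeholders you would swap for your own repo layout.

```python
from fnmatch import fnmatch

# Hypothetical policy for one task: adjust to your own repo layout.
ALLOWED = ["src/components/Button.tsx", "src/components/Button.test.tsx"]
SENSITIVE = ["auth/*", "billing/*", "migrations/*", "infra/*"]
DEPENDENCY_FILES = {"package.json", "package-lock.json", "requirements.txt"}


def review_flags(changed_files: list[str]) -> dict[str, list[str]]:
    """Sort changed paths into buckets that need a human look."""
    flags: dict[str, list[str]] = {
        "out_of_scope": [],   # files the prompt never allowed it to touch
        "sensitive": [],      # auth, data, billing, infra
        "dependencies": [],   # possible new packages
    }
    for path in changed_files:
        if path not in ALLOWED:
            flags["out_of_scope"].append(path)
        if any(fnmatch(path, pattern) for pattern in SENSITIVE):
            flags["sensitive"].append(path)
        # Compare only the file name, so nested manifests are caught too.
        if path.rsplit("/", 1)[-1] in DEPENDENCY_FILES:
            flags["dependencies"].append(path)
    return flags


# Example: the agent was asked to change one component.
changed = ["src/components/Button.tsx", "package.json", "auth/session.py"]
flags = review_flags(changed)
needs_human = any(flags.values())  # True if any bucket is non-empty
```

This does not replace reading the diff. It only makes sure the “it also touched auth” surprise shows up before the merge, not after.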

Final line

AI coding tools are useful right now. They are also unreliable right now.

Both things can be true at the same time.

So yeah, models still keep fucking up. The move is not to quit using them. The move is to use them like power tools: fast, helpful, and dangerous if you get careless.