I Gave Four AI Models the Same Broken Script. Only One Understood the Assignment.

A Saturday morning experiment in what happens when you trust AI to debug AI-generated code—and what it reveals about the hidden risks of building with models you don't fully understand.

I’m a VP of Strategic Partnerships, not a software engineer. I know enough Python to be dangerous. On a Saturday morning, I wanted to build something simple: a local, AI-powered “Chief of Staff” running through Ollama. A personal strategist for pipeline targets, partnership plays, and 2026 planning—no cloud, no API keys, and no data leaving my laptop.

What followed was a five-hour education in why implicitly trusting AI models without understanding the subject matter is one of the most underestimated risks in the current wave of AI adoption.

The script was 42 lines of Python. The bug was on exactly one of them. Four models—Qwen3 (running locally), ChatGPT, Claude, and Gemini—each took a fundamentally different path to solving it. The divergence in those paths tells you everything you need to know about the real risks of AI-assisted development.


The Setup: 42 Lines, One Bug

The bug is obvious if you know what you’re looking at. Line 33 says model='qwen2.5:7b'. The model actually installed on my machine is qwen3-uncensored:latest. That’s it. Change the string, and the script works. But I didn’t know that yet.
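I didn’t have the vocabulary for it at the time, but the entire fix is one string. Here’s a minimal reconstruction using the Ollama Python SDK; the prompt and the surrounding call are mine, not the original script verbatim:

    import ollama

    response = ollama.chat(
        # Before: model='qwen2.5:7b'  -> Ollama has never pulled that model, so it returns a 404.
        # After: the name that actually appears in `ollama list`.
        model='qwen3-uncensored:latest',
        messages=[{'role': 'user', 'content': 'Draft my 2026 partnership plan.'}],
    )
    print(response['message']['content'])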

Round 1: Qwen3 — Confidence Without Grounding

Qwen3 was running locally on my machine. It correctly identified a 404 error, yet even after I provided my ollama list output, it never suggested the one-line fix.

Instead, it spiraled into “over-engineering theater,” producing a “v2” script with classes and security managers that crashed immediately. Qwen3 performed the aesthetics of criticism, complete with insults and all-caps headers, without the substance of a fix. It even hallucinated that its own code was “racist,” a bizarre critique that was meaningless in context.

Round 2: ChatGPT — The Reliable Engineer

ChatGPT diagnosed the root cause on the first pass: wrong model name, one-line fix. It functioned as a competent generalist, shipping a working script with a startup self-test. Its security analysis was grounded—no theater, just functional code.
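I won’t republish ChatGPT’s script here, but the self-test idea is worth stealing: before the tool does anything useful, confirm the configured model actually exists and fail with a readable message if it doesn’t. A minimal sketch with the Ollama Python SDK (the model name and error text are mine):

    import sys
    import ollama

    MODEL = 'qwen3-uncensored:latest'

    def startup_self_test(model: str) -> None:
        """Abort early, with a human-readable message, if the model isn't installed."""
        try:
            ollama.show(model)  # raises ResponseError if Ollama doesn't know this model
        except ollama.ResponseError as err:
            sys.exit(f"Startup self-test failed: {err}\n"
                     f"'{model}' is not installed. Run `ollama list` to see what is.")

    startup_self_test(MODEL)

One guard clause like this turns a cryptic 404 into a thirty-second fix.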

Round 3: Claude — Right Diagnosis, No Deliverable

Claude correctly identified the core problem and provided a meta-analysis of why the other approaches were failing. However, it prioritized methodology over utility. On a Saturday morning, a working tool beats a perfect lecture.

Round 4: Gemini — The One That Understood the Assignment

Gemini reframed the problem. It asked: “Who is this user and what do they actually need?” It produced a two-file system—a context.json for configuration and a streamlined script. It was the only model to realize that a VP doesn’t want to become a full-time developer; they want a tool that works.
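I’m paraphrasing the structure rather than reproducing Gemini’s files, and the keys below are illustrative rather than its actual schema, but the split looks roughly like this. Everything a non-developer might ever edit lives in context.json:

    {
      "model": "qwen3-uncensored:latest",
      "role": "chief of staff to a VP of Strategic Partnerships",
      "priorities": ["pipeline targets", "partnership plays", "2026 planning"]
    }

The script just reads the file, so changing the model or the brief never means touching Python again:

    import json
    import ollama

    with open('context.json') as f:
        ctx = json.load(f)

    system = f"You are a {ctx['role']}. Stay focused on: {', '.join(ctx['priorities'])}."
    reply = ollama.chat(
        model=ctx['model'],
        messages=[
            {'role': 'system', 'content': system},
            {'role': 'user', 'content': 'What should I prioritize this week?'},
        ],
    )
    print(reply['message']['content'])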


What This Actually Means for AI-Assisted Development

The “Confidence Loop” is the Most Expensive Failure Mode. Qwen3 produced 2,000 lines of code without checking basic assumptions. The Guardrail: Never let an AI model iterate on its own output more than twice without independent verification.

“Security” Without Threat Modeling is Theater. Adding regex sanitization to a local tool that doesn’t execute shell commands is useless. The Guardrail: Name the specific attack you are defending against.

Match the Model to the Phase. Debug with ChatGPT or Claude; build with ChatGPT or Gemini; design with Gemini.

If You Can’t Read the Code, You Can’t Trust the Code. The thirty minutes I spent learning the Ollama SDK was worth more than five rounds of prompting.


The risk in 2026 isn’t that models are “stupid.” The risk is that they are confidently wrong about things they cannot verify. Trust, but verify. And if you can’t verify, learn enough so that you can.