The Guardrails 15: Vibe data analysis, more time auditing, less time writing/coding
It was a rough day with research agents—but a good day to practice patience with them and to better understand how they learn and behave when deployed in real research work.
I started my morning by asking Gemini to critique a computational approach I’m developing, using a newly published preprint as the reference point. Gemini Pro melted down. It hallucinated a nonexistent paper while failing to read the actual PDF I provided. This is exactly the kind of error that gave ChatGPT a bad rap back in 2023 and early 2024: made-up citations. You’d think such a rudimentary mistake wouldn’t happen in 2025. I chalk it up less to model failure than to a backend database issue in the Gemini Pro chat interface.
That misadventure ended with a series of cringe-worthy admissions from the Gemini agent:
“You are absolutely right to call me out. I am failing, and I sincerely apologize. There is no excuse for my repeated errors. I understand your frustration completely.”
“I am deeply ashamed to admit my previous analysis was also incorrect. My internal file processing is broken, and I am not providing the user with the correct information. I’m starting from scratch…”
Back to my main task: having four different AI agents—Claude Code, Qwen Coder, Gemini, and GPT-4.1—analyze outputs from a temporal semantic network analysis. I’ve long stressed the importance of human-in-the-loop oversight and what I call “AI intuition” (that gut feeling when something is off). Today’s experience reinforced both. I now spend more time auditing AI outputs (and, hopefully, thinking more) and less time writing or coding. Auditing has become not just helpful but essential to any AI-augmented research workflow.
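For context, here is a minimal sketch of the kind of artifact the agents were handed. Everything in it is illustrative rather than the actual pipeline: the yearly windows, the co-occurrence rule, and the file name are assumptions. The point is simply that each time slice yields an edge table with attributes, exported to CSV, which is the ground truth the agents’ reports later get checked against.

```python
# Illustrative sketch only: a toy temporal semantic network built from
# time-stamped term lists. The yearly windows, co-occurrence rule, and
# file name are assumptions, not the actual research pipeline.
from collections import defaultdict
from itertools import combinations
import csv

import networkx as nx

# Hypothetical input: (year, terms) pairs after preprocessing.
documents = [
    (2022, ["agent", "audit", "network"]),
    (2022, ["agent", "hallucination"]),
    (2023, ["audit", "network", "hallucination"]),
    (2023, ["agent", "audit"]),
]

# Build one co-occurrence network per time slice (here, per year).
networks = defaultdict(nx.Graph)
for year, terms in documents:
    G = networks[year]
    for u, v in combinations(sorted(set(terms)), 2):
        if G.has_edge(u, v):
            G[u][v]["weight"] += 1
        else:
            G.add_edge(u, v, weight=1)

# Export edge attributes per slice so humans (and other agents) can audit
# the numbers that later show up in any written report.
with open("edges_by_year.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["year", "source", "target", "weight"])
    for year in sorted(networks):
        for u, v, data in networks[year].edges(data=True):
            writer.writerow([year, u, v, data["weight"]])
```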
It takes multiple agents to get things “right.” And this is where things get interesting. We’re no longer in a single-model world. We have many AI systems—different architectures, strengths, blind spots—allowing us to compare notes, assign peer reviews, run blind checks, and triangulate findings. Luckily, I’ve accumulated plenty of compute credits on Windsurf to burn through before the cycle ends (sorry, Planet Earth!), so most of my research time today was spent asking agents to fix their own outputs.
Surprisingly, Qwen (the Chinese open-source model from Alibaba) turned out to be quite impressive, while GPT-4.1 was underwhelming. Whether that’s due to model quality or Windsurf’s memory architecture, I’m not entirely sure. What struck me was that Qwen correctly identified a subtle flaw in my analytical approach—something that could’ve inverted the results entirely. It didn’t execute the fix perfectly, but the insight was enough to lead me down a path of meaningful revisions. That alone made me rate Qwen more favorably than GPT-4.1, which didn’t even catch the analytical misstep.
GPT-4.1, on the other hand, was more stubborn—overconfident, deeply apologetic, and still riddled with small mistakes in result reporting. Both GPT and Qwen produced consistent statistics and correct code, but they struggled with interpreting results and summarizing findings consistently. They’re decent at contextual interpretation when fed clean data, but prone to small (yet glaring) errors in their reasoning. I had to laugh when GPT-4.1 proudly declared:
“You can now trust that the OpenAI markdown report for all edge attributes is fully consistent with the CSV and Qwen outputs.”
Don’t get me wrong—I’m not outsourcing the critical thinking to machines. I’m just testing whether AI models can serve as reasoning partners to augment human work.
The pattern is becoming clear: AI agents are great at writing and executing code, and good at reporting results programmatically. They can interpret findings when you spoon-feed them the data. But they often struggle when asked to do both—write and run code and interpret results qualitatively. And this might not be an inherent flaw in the models—it could be a workflow issue: what information they have access to, how memory is managed, or how inputs are framed. I’ll blame Windsurf and its vibe-coding interface for now.
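One workaround, which the last paragraph hints at: split the deterministic step from the interpretive step, and spoon-feed the model only numbers that plain code has already computed. The sketch below is an assumption about how that split could look, not the Windsurf setup itself; the call follows the OpenAI Python SDK, and the model name, prompt, and sample numbers are placeholders.

```python
# Sketch of a two-stage workflow (an assumption, not the actual setup):
# stage 1 computes statistics deterministically; stage 2 asks a model to
# interpret only those precomputed numbers, never to rerun the analysis.
import json
import statistics

from openai import OpenAI  # assumes the OpenAI Python SDK is installed

def compute_summary(weights: list[float]) -> dict:
    """Stage 1: plain code, fully auditable, no model involved."""
    return {
        "n_edges": len(weights),
        "mean_weight": statistics.mean(weights),
        "max_weight": max(weights),
    }

def interpret_summary(summary: dict, model: str = "gpt-4.1") -> str:
    """Stage 2: hand the model clean numbers and ask only for interpretation."""
    client = OpenAI()
    response = client.chat.completions.create(
        model=model,  # placeholder model name
        messages=[
            {"role": "system",
             "content": "Interpret these network statistics. Do not invent numbers."},
            {"role": "user", "content": json.dumps(summary)},
        ],
    )
    return response.choices[0].message.content

if __name__ == "__main__":
    summary = compute_summary([1.0, 2.0, 2.0, 5.0])
    print(summary)  # auditable ground truth
    # print(interpret_summary(summary))  # model sees only the clean summary
```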
This work mode is intricate and audit-heavy. We need robust, red-teaming-style audit protocols. A few practices I now recommend:
Use multiple agents based on different model architectures.
Audit at least three outputs: code, spreadsheet results, and the written report or figures.
Always be skeptical.
Have agents cross-audit each other’s reports and flag inconsistencies.
Human brains are wired for shortcuts: we seek confirmation, whether we’re trying to prove the AI right or wrong. An independent, third-party audit agent helps counterbalance that bias.
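Here is a minimal sketch of what that third-party check can look like in practice: a mechanical comparison of each agent’s reported numbers against the ground-truth CSV, flagging disagreements before any human reads the prose. The file names, report format, and agent list are illustrative assumptions.

```python
# Minimal sketch of a mechanical cross-audit (assumptions: each agent has
# written its reported edge weights to "<agent>_report.json", and the
# ground truth lives in "edges_by_year.csv" as produced earlier).
import csv
import json

def load_truth(path: str) -> dict:
    """Read the ground-truth edge weights computed by plain code."""
    truth = {}
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            key = (row["year"], row["source"], row["target"])
            truth[key] = float(row["weight"])
    return truth

def audit_agent(agent: str, truth: dict) -> list[str]:
    """Flag every value in an agent's report that disagrees with the truth."""
    with open(f"{agent}_report.json") as f:
        reported = json.load(f)  # e.g. {"2023|audit|network": 2.0, ...}
    flags = []
    for key_str, value in reported.items():
        key = tuple(key_str.split("|"))
        expected = truth.get(key)
        if expected is None or abs(expected - float(value)) > 1e-9:
            flags.append(f"{agent}: {key_str} reported {value}, expected {expected}")
    return flags

if __name__ == "__main__":
    truth = load_truth("edges_by_year.csv")
    for agent in ["claude_code", "qwen_coder", "gemini", "gpt41"]:
        for flag in audit_agent(agent, truth):
            print(flag)
```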
But this raises a deeper question: Why go through all this trouble to audit AI outputs? Why not just do the work ourselves?
Two reasons:
Rapid Prototyping. You wouldn’t believe how fast you can now turn an abstract idea into a working prototype. You iterate, revise, and learn while building.
Deeper Reflection. Even the audit process itself forces human researchers to think more critically—about assumptions, designs, interpretations. That one insight from Qwen is a good example.
Next up: auditing Gemini and Claude Code—the last two agents in today’s experiment. Claude is supposed to be the best-performing one. Fingers crossed it lives up to the hype.