There are two types of AI coding blog posts. The first kind is breathless hype: “I built a SaaS in 30 minutes with Claude!” The second kind is breathless doom: “AI is destroying the craft of programming!”
Max Woolf’s recent post is neither. It’s the rare piece where someone who was publicly skeptical about AI agents changed their mind, showed the receipts, and still managed to be annoyed about it.
I want to break down what he actually did, because buried under 5,000 words of dry humor is a workflow that I think most developers are sleeping on.
## The backstory: a professional skeptic
Woolf is a data scientist in San Francisco. Last May, he wrote a post titled *As an Experienced LLM User, I Actually Don’t Use Generative LLMs Often*. His position was reasonable: LLMs can answer simple coding questions, but agents are unpredictable, expensive, and overhyped. He was open to changing his mind if the tech improved.
Fast forward to November. Anthropic dropped Opus 4.5 right before Thanksgiving. Woolf noticed the timing was suspicious. Companies bury bad announcements on holidays. He had no Thanksgiving plans, so he tested Opus anyway.
What he found was not what he expected.
## The AGENTS.md revelation
Before touching Opus, Woolf did something that most people skip: he wrote an AGENTS.md file. If you’re not familiar, it’s a file you put in your project root that controls agent behavior, like a system prompt for your codebase.
This is where it gets interesting.
Most people complain about agents generating emoji-laden, over-commented, verbose garbage. Woolf’s fix was simple. He added rules:
```
**NEVER** use emoji, or unicode that emulates emoji (e.g. ✓, ✗).
**MUST** avoid including redundant comments which are tautological
or self-demonstrating.
```

He added preferences for `uv` over base Python, `polars` over `pandas`, secrets in `.env`, and a dozen other opinionated constraints. Not telling the agent what to build, but how to build it.
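For context, here is a hypothetical fragment showing how those rules and preferences might sit together in one file. The section names are my own invention; only the rules themselves come from Woolf's post (his real files are public):

```markdown
# AGENTS.md (hypothetical fragment)

## Style
- **NEVER** use emoji, or unicode that emulates emoji (e.g. ✓, ✗).
- **MUST** avoid redundant comments that are tautological or self-demonstrating.

## Tooling
- Use `uv` for dependency management, not base Python tooling.
- Prefer `polars` over `pandas` for dataframes.
- Read secrets from `.env`; never hardcode them.
```

The point is that none of this is about the task at hand. It is standing policy, loaded into every session for free.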
He claims the difference between having and not having this file is immediately obvious. I believe him. I’ve seen the same pattern with my own agents. When I migrated my home server with Cici, the thing that made it work wasn’t the model. It was writing TOOLS.md and MEMORY.md to give the agent context about my network, my paths, my conventions. Without those files, Cici was guessing. With them, she executed moves without asking “which folder?”
Woolf’s approach is the same idea, scaled up. His AGENTS.md is essentially what I called “the API for your AI agent” — except he’s taken it further with granular formatting rules and tool preferences.
His Python and Rust AGENTS.md files are public. Worth stealing.
## The prompting strategy nobody talks about
Here’s the part that doesn’t fit in a tweet. Woolf doesn’t just type build me a thing into Claude Code. He writes full Markdown spec files, tracked in git, with explicit constraints. Then he tells the agent: implement this file.
His YouTube scraper prompt is a good example:
```
Create a robust Python script that, given a YouTube Channel ID,
can scrape the YouTube Data API and store all video metadata in
a SQLite database. The YOUTUBE_API_KEY is present in `.env`.

You MUST obey ALL the FOLLOWING rules:
- Do not use the Google Client SDK. Use the REST API with `httpx`.
- Include sensible aggregate metrics.
- Include `channel_id` and `retrieved_at` in the database schema.
```

Notice: he specifies the HTTP library. He specifies the schema columns. He bans the SDK he doesn’t want. This is not “vibe coding.” This is writing a spec, which is what senior engineers do anyway; the difference is that now the spec gets executed immediately. I wrote about this exact shift in my Amp post: plan first, code second. The code is just the implementation detail. Woolf arrived at the same conclusion independently, but his specs are even more constrained. He leaves the agent zero wiggle room.
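The shape of the resulting script falls out of that spec almost mechanically. Here is a minimal sketch of the constrained pieces, with a hypothetical table layout and a canned sample item; the REST call is defined but not executed, since it needs a real API key:

```python
import sqlite3
from datetime import datetime, timezone

# Schema follows the spec's explicit constraints: channel_id and
# retrieved_at are required columns. The other column names are
# hypothetical, not taken from Woolf's actual script.
SCHEMA = """
CREATE TABLE IF NOT EXISTS videos (
    video_id TEXT PRIMARY KEY,
    channel_id TEXT NOT NULL,
    title TEXT,
    published_at TEXT,
    retrieved_at TEXT NOT NULL
)
"""

def fetch_page(channel_id, api_key, page_token=None):
    """One page of results via the REST API (no Google SDK), per the spec.
    Not executed in this sketch; requires a real YOUTUBE_API_KEY."""
    import httpx  # imported lazily so the rest of the sketch runs offline
    params = {
        "part": "snippet",
        "channelId": channel_id,
        "maxResults": 50,
        "type": "video",
        "key": api_key,
    }
    if page_token:
        params["pageToken"] = page_token
    resp = httpx.get("https://www.googleapis.com/youtube/v3/search", params=params)
    resp.raise_for_status()
    return resp.json()

def store(conn, channel_id, item):
    """Insert one video row, stamping retrieval time in UTC."""
    conn.execute(
        "INSERT OR REPLACE INTO videos VALUES (?, ?, ?, ?, ?)",
        (
            item["id"],
            channel_id,
            item["title"],
            item["published_at"],
            datetime.now(timezone.utc).isoformat(),
        ),
    )

conn = sqlite3.connect(":memory:")
conn.execute(SCHEMA)
# A canned sample item standing in for a real API response.
store(conn, "UC_example", {"id": "abc123", "title": "Demo",
                           "published_at": "2024-01-01T00:00:00Z"})
row = conn.execute("SELECT channel_id, retrieved_at FROM videos").fetchone()
print(row[0])  # UC_example
```

The spec constrains exactly the parts that would otherwise be an agent's coin flips: which HTTP client, which columns, which API surface.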
The result worked on the first try. 20,000 videos scraped. Clean, Pythonic code. No Sonnet 4.5 slop.
## From Python to Rust: where it gets wild
Once Woolf confirmed Opus could handle Python, he did what any sane person would do: he asked it to write Rust.
Historically, LLMs have been terrible at Rust. The language is niche, the borrow checker is unforgiving, and there’s not enough training data for LLMs to fake their way through. Woolf had been testing LLMs on Rust for years. They always failed.
They stopped failing.
Here’s what he built, all with Opus and later Codex:
icon-to-image: Renders Font Awesome icons into images at arbitrary resolution. Written in Rust with Python bindings via PyO3. Features supersampled antialiasing, transparent backgrounds, and PNG/WebP output. He started with fontdue for speed but discovered it can’t render curves properly at high resolution (it approximates). Told Opus, Opus swapped in ab_glyph without breaking anything.
miditui: A MIDI composer and playback DAW that runs entirely in a terminal. Yes, a terminal. Uses ratatui for the UI and rodio for audio. Opus couldn’t see the terminal output and still implemented correct UI changes. Woolf fell back on his QA engineering background to find bugs manually and describe them to the agent.
ballin: A terminal physics simulator rendering 10,000+ bouncing balls using Braille unicode characters for sub-pixel resolution. Uses the rapier2d physics engine. Built from 14 iterative prompts, each one a detailed spec, each followed by manual review and git commit.
These are not toy demos. The ballin PROMPTS.md shows the full development history: 14 numbered prompts, each one building on the last, each one specific enough that the agent couldn’t misinterpret intent. Same pattern I use with Amp: decompose, specify, implement, review, commit. The difference is Woolf does it manually with Markdown files instead of using tooling to automate the decomposition. The principle is identical.
## The benchmaxxing pipeline
Here’s where Woolf’s post stops being an experience report and starts being something else.
He developed an 8-step pipeline for building machine learning algorithms in Rust:
1. Implement the algorithm with benchmarks
2. Clean up code and optimize
3. Scan for algorithmic weaknesses, describe problem/solution/impact for each
4. Optimize until ALL benchmarks run 60% faster. Repeat until convergence. Don’t game the benchmarks
5. Create tuning profiles for CPU thread saturation and parallelization
6. Add Python bindings via PyO3
7. Create Python comparison benchmarks against existing libraries
8. Accuse the agent of cheating, then make it minimize output differences against a known-good implementation
Steps 4 and 8 are the clever ones. Step 4 gives the agent a quantifiable target instead of a vague “make it faster.” Step 8 is a built-in integrity check: even if the agent found wild optimizations, the output must still match scikit-learn.
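To make step 8 concrete: the check is essentially a numerical diff with a tolerance. A minimal sketch of that kind of gate (function names and tolerance are mine, not Woolf's; his harness compares against scikit-learn output):

```python
def max_abs_delta(ours, reference):
    """Worst-case element-wise disagreement between two result vectors."""
    assert len(ours) == len(reference)
    return max(abs(a - b) for a, b in zip(ours, reference))

def passes_integrity_check(ours, reference, tol=1e-6):
    """An optimization only counts if the output still matches the
    known-good implementation within tolerance."""
    return max_abs_delta(ours, reference) <= tol

reference = [0.12, 0.34, 0.56]
fast_but_suspect = [0.12, 0.34, 0.56000001]  # tiny float drift: fine
print(passes_integrity_check(fast_but_suspect, reference))  # True
```

The value of wiring this into the loop is that the agent cannot claim a speedup it bought by silently changing the answer.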
Then he discovered something weird: chaining different models produces compound speedups. Codex optimizes the code 1.5-2x. Then Opus, working on the already-optimized code, somehow finds more optimizations. Different models apparently have different optimization strategies. This is something I’ve been experimenting with too. In my CLIProxyAPI setup, I route different Amp modes to different models: Opus for smart mode, Gemini for rush. Woolf’s approach is more deliberate. He chains them sequentially on the same codebase to compound their different optimization instincts.
The numbers he reports:
| Algorithm | vs. Existing Rust | vs. Python |
|---|---|---|
| UMAP | 2-10x faster | 9-30x faster |
| HDBSCAN | 23-100x faster | 3-10x faster |
| GBDT | 1.1-1.5x faster | 24-42x faster fit |
If these numbers are real (and he open-sourced nndex as proof), that’s not incremental improvement. That’s a different category of result.
## nndex: the proof of concept
The project he released to back up his claims is nndex: an in-memory vector store for exact nearest-neighbor search, written in Rust with Python bindings. It’s conceptually simple (cosine similarity reduces to dot products on normalized vectors) but the implementation is anything but.
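That conceptual simplicity is worth spelling out: on unit-length vectors the dot product *is* the cosine similarity, so exact search reduces to one matrix product plus a top-k selection. A pure-Python illustration of the identity (not nndex's code, which does this in Rust with SIMD):

```python
import math

def normalize(v):
    """Scale a vector to unit length."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cosine(a, b):
    """Cosine similarity computed the long way, for comparison."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

a, b = [3.0, 4.0], [4.0, 3.0]
# On normalized vectors, the plain dot product equals cosine similarity,
# so normalization can be paid once at index time instead of per query.
assert abs(dot(normalize(a), normalize(b)) - cosine(a, b)) < 1e-12
print(round(cosine(a, b), 2))  # 0.96
```

All the engineering in nndex goes into making that dot product absurdly fast, not into the math itself.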
The Rust code uses:
- `simsimd` for SIMD-accelerated dot products
- `rayon` for parallel iteration with adaptive thresholds
- Five different single-query strategies and five batch strategies, selected at runtime based on matrix shape
- `wgpu` for GPU compute (Metal/Vulkan/D3D12) with custom WGSL shaders
- An IVF approximate index with spherical k-means for when exact search is overkill
- LRU caching, denormal flushing, zero-copy numpy interop via PyO3
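The runtime strategy selection is less exotic than it sounds: it is a dispatch function keyed on matrix shape. A toy sketch of the pattern (thresholds and strategy names are invented; nndex's actual heuristics live in its Rust source):

```python
def pick_strategy(n_rows, n_dims, batch_size):
    """Toy shape-based dispatch. The specific cutoffs here are made up;
    the idea is branching on problem shape and falling through to a
    safe default."""
    if batch_size > 1:
        return "batched_matmul"   # amortize work across queries
    if n_dims >= 1024:
        return "simd_chunked"     # wide rows: lean on SIMD lanes
    if n_rows > 100_000:
        return "parallel_rows"    # tall matrix: split across threads
    return "scalar_loop"          # small problems: overhead isn't worth it

print(pick_strategy(50_000, 256, 1))  # scalar_loop
```

Ten strategies sounds like over-engineering until you remember step 4 of the pipeline demanded that *all* benchmarks get faster, which forces specialization by shape.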
The benchmark results against numpy (which uses optimized BLAS under the hood):
- Low-to-medium dimensions (< 256): nndex wins by 2-9x
- Single query on 50k rows: 4.9x faster
- All results: 99.5-100% top-k overlap with numpy. Similarity deltas under 1e-6
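Top-k overlap is a recall-style metric: the fraction of the reference implementation's k nearest neighbors that the faster implementation also returns. A sketch of how such a number could be computed (neighbor ids are hypothetical):

```python
def topk_overlap(ours, reference):
    """Fraction of the reference's top-k ids our result set also contains.
    Order within the top-k is ignored, the usual convention for this metric."""
    assert len(ours) == len(reference)
    return len(set(ours) & set(reference)) / len(reference)

# Hypothetical neighbor ids for one query, k=5.
numpy_topk = [17, 3, 42, 8, 99]
nndex_topk = [17, 3, 42, 99, 8]  # same set, slightly different order
print(topk_overlap(nndex_topk, numpy_topk))  # 1.0
```

Reporting overlap alongside raw similarity deltas is the right move: it proves the speedups did not come from quietly returning different neighbors.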
The whole thing is built with `#![forbid(unsafe_code)]`. All performance comes from safe Rust, SIMD via safe wrappers, and GPU dispatch. No `unsafe` blocks.
## What I actually take away from this
Woolf’s post is long and covers a lot of ground. Here’s what I think matters:
1. AGENTS.md is not optional. It’s the difference between useful output and garbage. If you’re getting bad results from coding agents, this is the first thing to fix. Not the model. Not the prompt. The persistent context file that shapes every interaction. I’ve been saying this since my first week with Clawdbot — the agent remembered my preferences because I wrote them down. Woolf’s AGENTS.md is the same concept, but production-grade.
2. Prompting is spec writing. The people getting good results aren’t typing casual requests. They’re writing detailed specifications with explicit constraints, tracking them in git, and referencing them by filename. This is just good engineering practice that happens to also work for agents.
3. The “literal genie” framing is perfect. Agents don’t read minds. They don’t infer your preferences. They do exactly what you say, including the things you forgot to say. You need to be specific about what you DON’T want as much as what you do.
4. Domain expertise is the multiplier. Woolf knew enough about MIDI, physics engines, font rendering, and machine learning algorithms to audit agent output and catch mistakes. Without that knowledge, the same prompts would produce the same bugs but nobody would notice until production. I learned this the hard way when Cici tried to hack my router with 2015-era exploits. If I didn’t know my TP-Link AX used encrypted tokens, I would have let the agent waste an hour on a dead end.
5. Chaining models is underexplored. The idea that Codex and Opus find different optimizations, and that running them in sequence produces compound speedups, caught me off guard. It’s like getting a second opinion from a doctor who went to a different medical school.
6. QA skills are the new coding skills. Woolf explicitly mentions that his background as a black-box QA engineer was critical. Finding bugs in agent code is different from writing bug-free code. If you spent years in QA, you might be better positioned for the agent era than the 10x developer next to you.
## The uncomfortable conclusion
I’ve written before about how AI pair programming is really management. Woolf’s experience confirms this but takes it further. He’s not managing one agent on one task. He’s running a pipeline: spec, implement, benchmark, optimize, chain, verify. The agent is a tool in the pipeline, not the pipeline itself. This is the compound engineering loop I wrote about — Plan, Work, Review, Compound — except Woolf’s “Compound” step is literally running benchmarks and feeding the results back into the next optimization pass.
The uncomfortable part is that this workflow produces results that are hard to wave away. You can’t say “it’s just regurgitating GitHub” when the code is faster than everything on GitHub. You can’t say “it’s just autocomplete” when it’s implementing physics engines and ML algorithms from specs.
You can say “well, I could do that myself given enough time.” And you’d be right. But Woolf addresses this too: the session limits on these tools forced him into a habit of coding for fun an hour every day. The agents didn’t replace his programming. They let him tackle projects that would have taken months, and he learned Rust along the way by reading the diffs.
This won’t replace all programming. But for people who know what they want to build and not necessarily how to build it, this workflow clearly works.
I remain cautious. But my caution now has a different shape. I’m less worried about whether agents can produce good code, and more worried about whether developers will put in the work to make them produce good code. The AGENTS.md file, the detailed specs, the manual review, the integrity checks: that’s a lot of discipline. “Vibe coding” is easier. I argued before that I trust simple agents more because I can see what they’re doing. Woolf’s workflow is the opposite of simple — it’s elaborate and deliberate — but it shares the same core principle: transparency. Every prompt is in git. Every benchmark is reproducible. Every optimization is auditable.
And that’s the real problem. The people getting the best results from agents are the ones who were already good engineers. The gap isn’t closing. It might be widening.
Manage your agents. Or they’ll manage your codebase into the ground.