How I Got AI to Optimize Its Own Code
My puzzle generator was too slow. Instead of optimizing it myself, I set up a system where AI could.
My puzzle generator was taking 55+ seconds per puzzle. An hour later, AI had it down to under 1 second - without me writing a single optimization.
Here's how I set it up to optimize its own code.
I recently added Flagdoku (Sudoku meets Minesweeper) to Puzzle Time and needed to generate a full year of puzzles across three difficulties. At 60 seconds per hard puzzle, that would take hours.
The generator logic is pretty complex - it builds irregular regions, places flags while maintaining uniqueness constraints, then verifies the puzzle has exactly one solution and can be solved without guessing. I wasn't going to be able to manually optimize this in any reasonable timeframe (or at all).
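To give a sense of the shape of the work, here's an illustrative sketch of that generation loop. This is not the actual Puzzle Time code; every type and helper name below is made up for illustration:

// Illustrative sketch of the generator's shape, not the actual Puzzle Time code.
// All helper names and signatures below are hypothetical.
type Difficulty = "easy" | "medium" | "hard";

interface Puzzle {
  regions: number[][]; // region id per cell
  flags: boolean[][];  // flag placements
  clues: number[][];   // clue numbers derived from flags
}

declare function buildIrregularRegions(difficulty: Difficulty): number[][];
declare function placeFlags(regions: number[][], difficulty: Difficulty): boolean[][];
declare function deriveClues(regions: number[][], flags: boolean[][]): Puzzle;
declare function hasUniqueSolution(puzzle: Puzzle): boolean;
declare function trimClues(puzzle: Puzzle): Puzzle;
declare function solvableWithoutGuessing(puzzle: Puzzle): boolean;

function generatePuzzle(difficulty: Difficulty, maxAttempts = 200): Puzzle | null {
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    const regions = buildIrregularRegions(difficulty); // carve the grid into irregular regions
    const flags = placeFlags(regions, difficulty);     // place flags under the uniqueness constraints
    const puzzle = deriveClues(regions, flags);

    if (!hasUniqueSolution(puzzle)) continue;          // must have exactly one solution
    const trimmed = trimClues(puzzle);                 // strip clues while preserving uniqueness
    if (!solvableWithoutGuessing(trimmed)) continue;   // must be solvable by deduction alone

    return trimmed;
  }
  return null; // exhausted attempts; caller retries or reports a failure
}

Each rejected attempt throws away all the work above it, which is why slow layout generation and lots of retries compound so badly.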
So I tried something that ended up working really well - I gave AI a way to verify its own work.

Flagdoku
Setting Up the Feedback Loop
The typical way people use AI for coding is pretty linear.
You describe what you want, get some code back, test it yourself, and go back and forth until it works. That's fine for simple things, but performance optimization on complex algorithms doesn't really fit that pattern. There are too many variables and the feedback cycle is too slow when you're the one running tests.
What worked better was creating a system where AI could see the results of its own changes directly.
I started by having Codex wire Flagdoku into my existing puzzle verification script. The script generates puzzles for a range of dates, logs attempts and durations per date, and surfaces failures.
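The harness itself is conceptually simple. A minimal sketch of that kind of script, where generateDailyPuzzle and its signature are assumptions rather than the real code:

// Minimal sketch of a verification harness; generateDailyPuzzle is a hypothetical stand-in.
declare function generateDailyPuzzle(
  date: string,
  difficulty: "easy" | "medium" | "hard"
): { attempts: number } | null;

function verifyRange(startDate: string, days: number): void {
  const failures: string[] = [];
  for (let i = 0; i < days; i++) {
    const date = new Date(Date.parse(startDate) + i * 86_400_000).toISOString().slice(0, 10);
    for (const difficulty of ["easy", "medium", "hard"] as const) {
      const start = Date.now();
      const result = generateDailyPuzzle(date, difficulty);
      const ms = Date.now() - start;
      if (result) {
        console.log(`${date} ${difficulty}: ok in ${ms}ms (${result.attempts} attempts)`);
      } else {
        failures.push(`${date} ${difficulty}`);
        console.log(`${date} ${difficulty}: FAILED after ${ms}ms`);
      }
    }
  }
  if (failures.length > 0) console.log(`failures: ${failures.join(", ")}`);
}

verifyRange("2025-01-01", 30);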
Codex first ran the script to test generation across 30 days of puzzles. When a day was slow or failed, I had it turn on profiling, which produced per-attempt timing breakdowns. Here's a representative slow-before / fast-after comparison:
# before (hard, pre-fix)
[flagdoku:irregular] attempt 1/200 total=3227ms regions=3ms layout=3223ms clues=0ms baseUnique=0ms trim=1ms trimUnique=1ms trimChecks=26
# after (hard, post-fix)
[flagdoku:irregular] attempt 1/200 total=11ms regions=6ms layout=1ms clues=0ms baseUnique=0ms trim=4ms trimUnique=4ms trimChecks=28
With this data, the bottleneck was obvious. Hard mode was spending seconds in layout generation, and retries multiplied the cost. Other checks were tiny by comparison.
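Those per-stage numbers are just wall-clock timings wrapped around each phase of an attempt. A rough sketch of that kind of instrumentation, again with made-up phase functions rather than the real generator:

// Sketch of per-attempt stage timing that produces log lines like the ones above.
// The phase functions are hypothetical placeholders.
declare function carveRegions(): unknown;
declare function generateLayout(regions: unknown): unknown;

function timedAttempt(attempt: number, maxAttempts: number): void {
  const timings: Record<string, number> = {};
  const time = <T>(label: string, fn: () => T): T => {
    const start = Date.now();
    const result = fn();
    timings[label] = Date.now() - start;
    return result;
  };

  const regions = time("regions", () => carveRegions());
  time("layout", () => generateLayout(regions));
  // ...clues, baseUnique, trim, and trimUnique are timed the same way...

  const total = Object.values(timings).reduce((a, b) => a + b, 0);
  const breakdown = Object.entries(timings)
    .map(([label, ms]) => `${label}=${ms}ms`)
    .join(" ");
  console.log(`[flagdoku:irregular] attempt ${attempt}/${maxAttempts} total=${total}ms ${breakdown}`);
}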
Codex could read these logs, implement a fix, rerun the script, and repeat. It ran autonomously for over an hour and successfully optimized the generator!
Why This Actually Works
The core insight is pretty simple: AI has no idea if its code is actually good until something external tells it. When you're manually testing and reporting back, you become the slowest part of the process. You're doing verification work that could be automated.
When AI gets structured, measurable feedback, it can pinpoint exactly where performance is bad. It can confirm whether its fixes actually helped or made things worse. And it can iterate way faster than you could ever test manually.
This is the same reason having good tests makes AI-generated code significantly better. Tests act as an automatic feedback mechanism. AI writes code, runs the tests, sees what failed, and fixes it without needing to ask you anything.
This Works Beyond Puzzle Generation
The same pattern applies pretty much anywhere you need AI to optimize something:
For API performance, you benchmark your endpoints, log the slow ones with context about what made them slow, and let AI analyze the patterns.
For build times, you profile the build process, share the timing breakdown, and iterate on the slowest parts.
For error rates, you log failures with enough context to understand why they happened, then let AI spot the patterns you might miss.
The main lesson connecting all of these: give AI concrete numbers and logs it can read, let it propose changes, re-run the measurement, and repeat until you hit your target.
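As a concrete example of the API case above, the measurement side can be as simple as a benchmark script whose output the AI reads directly. The endpoints, base URL, and threshold here are placeholders, not real services:

// Endpoint benchmark sketch; URLs and thresholds are illustrative placeholders.
const BASE_URL = "http://localhost:3000";
const ENDPOINTS = ["/api/puzzles/today", "/api/leaderboard", "/api/stats"];
const SLOW_P95_MS = 250;

async function benchmark(): Promise<void> {
  for (const path of ENDPOINTS) {
    const samples: number[] = [];
    for (let i = 0; i < 20; i++) {
      const start = Date.now();
      await fetch(`${BASE_URL}${path}`);
      samples.push(Date.now() - start);
    }
    samples.sort((a, b) => a - b);
    const p95 = samples[Math.ceil(samples.length * 0.95) - 1];
    // The AI reads exactly this output, proposes a change, and the script gets re-run.
    console.log(`${path} p95=${p95}ms ${p95 > SLOW_P95_MS ? "SLOW" : "ok"}`);
  }
}

benchmark();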
The Actual Takeaway
The Flagdoku generator went from ~55 seconds for hard puzzles to under 1 second on average. That didn't happen because I suddenly got better at optimization. It happened because I set things up so AI could see exactly what was slow, try fixes, and verify whether they worked.
Most people use AI in a one-shot way. Ask a question, get an answer, move on. The real leverage comes from giving it measurable goals and letting it iterate against them.
You may have heard of the Ralph Wiggum loop lately - it's the same idea. The agent keeps running iterations until a specific criterion is met, then outputs "DONE".
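A minimal version of that kind of loop might look like this. The agent command and prompt here are placeholders, not any specific tool's CLI:

// Run-until-done loop sketch; "my-agent-cli" and the prompt are placeholders.
import { execSync } from "node:child_process";

const PROMPT = "Optimize the generator until the benchmark passes, then print DONE.";
const MAX_ITERATIONS = 50;

for (let i = 1; i <= MAX_ITERATIONS; i++) {
  const output = execSync(`my-agent-cli --prompt "${PROMPT}"`, { encoding: "utf8" });
  console.log(`iteration ${i}:\n${output}`);
  if (output.includes("DONE")) {
    console.log(`criterion met after ${i} iterations`);
    break;
  }
}

Same principle as the profiling loop above: define the success criterion up front, then let the model iterate against it.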