My Agentic Development Workflow

Apr 16

AI agents write good code. Fast, too. The problem is everything around the code: what happens when the agent pushes to the wrong branch, when review tools bill you per cycle, when the same bug comes back because nobody documented the fix. After years of building with agents, the workflow is where I spend most of my attention. The models keep getting better on their own. The process doesn't.

My main tools are Claude Code and OpenAI Codex, and this is the workflow I've built around them. None of this replaces the experience of knowing what you're doing. It just lets you do more with what you already know.

Pay for the best models

For building, I use whatever's latest and greatest. Right now that's Claude Opus 4.6 and GPT-5.4, running through their official harnesses, Claude Code and Codex respectively. Codex seems to produce more reliable results on complex tasks; Opus handles UI/UX more gracefully, especially when armed with the frontend-design skill. In practice, I switch between them depending on what I'm building.

For the applications themselves, you can and should use open-weight models to optimize spend. But for development? Pay for the best. The difference is basically between code that mostly works and code that actually works.

Structured work process

The main structure of the work process is Compound Engineering. Every feature follows the same loop: brainstorm, plan, work, review, compound. The agent brainstorms approaches, requirements and edge cases before writing any code. It then plans the actual implementation against existing patterns and context files. When that is done, it writes code, runs tests, and reviews its own changes against known bugs and project conventions. Finally comes the compound step, which gives the whole approach its name: document what was learned, search for similar patterns across the codebase, update instructions when needed, and add tests that catch the whole category of bug. This way each unit of work aims to make future work easier.

The context files, things like AGENTS.md, bug logs and decision records, grow with every feature. Agents improve over time because the context they work from keeps growing.
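A minimal sketch of what the compound step could leave behind in a bug log. The file path, the entry format and the names Lesson, format_lesson and append_lesson are all assumptions made up for illustration; none of them come from Compound Engineering itself.

```python
from dataclasses import dataclass
from datetime import date
from pathlib import Path


@dataclass
class Lesson:
    """One bug log entry (hypothetical format, not a Compound Engineering schema)."""
    title: str
    root_cause: str
    fix: str
    guard_test: str  # the test that now catches this category of bug


def format_lesson(lesson: Lesson, when: date) -> str:
    """Render a lesson as a plain-text block the agent can read in later sessions."""
    return (
        f"## {when.isoformat()}: {lesson.title}\n"
        f"Root cause: {lesson.root_cause}\n"
        f"Fix: {lesson.fix}\n"
        f"Guard test: {lesson.guard_test}\n"
    )


def append_lesson(log_path: Path, lesson: Lesson, when: date) -> None:
    """Append the lesson to the log, creating the file and its folder if needed."""
    log_path.parent.mkdir(parents=True, exist_ok=True)
    with log_path.open("a", encoding="utf-8") as log:
        log.write(format_lesson(lesson, when) + "\n")
```

The point of keeping the entry this small is that it stays cheap to write after every fix and cheap for the agent to load before the next one.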

Voice input

For input I mainly use Wispr Flow for sending instructions via voice-to-text. I have a key bound on my keyboard for it, so I just press that and waffle on about what I want to happen next. Wispr Flow works especially well because it knows developer lingo and runs whatever you ramble about through a separate AI model to turn it into clear, well-formatted text. The AI sees clean instructions while you can just think out loud.

Tests as the primary gate

Tests are becoming one of the most important parts of the process. In tests, you define what you're expecting, and that expectation holds against every changeset from today to years out. Tests survive context overflows and restarts. They keep the agent on rails, even when it decides to stray and cut corners, and they're the main feedback loop for agentic development.

For the testing process itself, I prefer unit tests during development because feedback speed matters most when the agent is writing code. A test suite that runs in seconds lets it write, test, fix, and repeat without waiting around. Once a bigger chunk of work is done, the agent runs end-to-end tests, and only commits when those pass. All of the same tests also run in CI, in isolation.
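The ordering described above can be sketched as a small gate: fast unit tests first, end-to-end tests only if those pass, and a commit only when both are green. The pytest commands and the tests/unit and tests/e2e paths are placeholders, not the setup from this post; swap in whatever test runner the project uses.

```python
import subprocess
from typing import Callable, Sequence

# Hypothetical commands; substitute the project's real test runners.
UNIT_TESTS = ["pytest", "tests/unit", "-q"]
E2E_TESTS = ["pytest", "tests/e2e", "-q"]

Runner = Callable[[Sequence[str]], int]


def run(cmd: Sequence[str]) -> int:
    """Run a command and return its exit code."""
    return subprocess.run(list(cmd)).returncode


def gate(stages: Sequence[Sequence[str]], runner: Runner = run) -> bool:
    """Run the stages in order and stop at the first failure."""
    for cmd in stages:
        if runner(cmd) != 0:
            return False
    return True


def ready_to_commit(runner: Runner = run) -> bool:
    """Fast unit tests first; the slower end-to-end suite runs only if they pass."""
    return gate([UNIT_TESTS, E2E_TESTS], runner)
```

Stopping at the first failing stage keeps the loop short while the agent is still iterating, which is the whole reason for putting the fast suite first.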

Feature branches

When I’m working on early versions, I often commit straight to the main branch to keep the velocity high. For things that eventually require stability, I switch to feature branches, and the agent is instructed to create a new feature branch and pull request for each new feature.

However, some agents do not always respect the instructions given to them. For example, Claude, left to its own devices, can happily commit and push directly to main, regardless of the instructions given. This can be especially dangerous if you're also deploying automatically from main. To avoid this, lock the branch, either through GitHub branch rules or a git pre-commit hook.
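A minimal sketch of the hook option, written in Python and assuming the protected branch is called main. The function names are illustrative. To use it as a real hook, save it as .git/hooks/pre-commit, make it executable, and end the file with the line shown in the final comment.

```python
#!/usr/bin/env python3
import subprocess
import sys

PROTECTED = {"main"}  # assumption: adjust to the branches you want locked


def current_branch() -> str:
    """Return the checked-out branch name, or an empty string on a detached HEAD."""
    result = subprocess.run(
        ["git", "symbolic-ref", "--short", "-q", "HEAD"],
        capture_output=True,
        text=True,
    )
    return result.stdout.strip()


def check(branch: str) -> int:
    """Return a non-zero exit code when the commit targets a protected branch."""
    if branch in PROTECTED:
        print(
            f"Refusing to commit directly to '{branch}'. Create a feature branch.",
            file=sys.stderr,
        )
        return 1
    return 0


def main() -> int:
    return check(current_branch())


# When installed as a hook, finish the file with:
#     raise SystemExit(main())
```

A local hook can be skipped with git commit --no-verify, so treat it as a guard rail rather than a lock. GitHub branch rules are enforced on the server and are the stronger of the two options.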

Automated review catches what you can't read

There's going to be so much generated code that you can't review it all. I can't read every line Claude writes. Neither can you. So you need automated review. I've written before about why traditional code review can't keep up with AI-generated code.

Tools like Greptile and CodeRabbit are useful, but they quickly get expensive with agentic workflows. As you're pushing changes constantly, every fix triggers another review cycle, and the bill climbs. Ask me how I know… Instead, I run Compound Engineering's review skill locally before anything hits GitHub. This catches issues early and it also means you're not blasting your GitHub Actions minutes with constant codebase churn.

If you need an integrated review process on GitHub, Codex has a more forgiving plan for reviews. Personally, I use it to run another read-through in CI after the local reviews are done and resolved.

In practice, the AGENTS.md file has instructions for the agent to run compound and review steps locally, create a PR when ready, wait for CI and any external review, fix anything flagged, and repeat until clean.

Security through layered work

Code review catches structural issues, but it won't find everything. Security problems in particular tend to slip past tools that only look at code quality. So every now and then, I run a different kind of review.

I instruct agents to red-team against the services I work with. They rotate through common attack patterns first without knowledge of the code, then cross-reference against the actual implementation. Many recent critical vulnerabilities, like the remote code execution in React server actions, weren't found by humans. They came from AI security tools doing exactly this kind of work. So this is something you need to do before someone else gets to it first.

CLI tools, not MCP

When I need to add tools, I tend to reach for CLI tools over MCP. GitHub CLI, Vercel CLI, that kind of thing. The value of tools is in letting the agent pull state from external sources. PR review comments. Test status. Error logs. Deployment state. The agent checks what reviewers said, sees if the preview deployment failed, reads the actual error messages and acts on them.

Why CLI over MCP? CLI tools tend to link more reliably to cloud resources, tying your repository to specific GitHub and Vercel projects without the agent having to figure out on the fly which one it's working with.
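As a rough illustration of pulling that state, here is a sketch that wraps GitHub CLI's gh pr view with its --json flag and condenses the result into a few lines. The helper names are made up, and the field shapes read in summarize are assumptions about the JSON that gh returns, so the code reads them defensively.

```python
import json
import subprocess
from typing import Any


def pr_view_command(pr_number: int) -> list[str]:
    """Build the gh invocation; gh picks the repository from the current git remote."""
    fields = "title,state,reviews,statusCheckRollup"
    return ["gh", "pr", "view", str(pr_number), "--json", fields]


def fetch_pr_state(pr_number: int) -> dict[str, Any]:
    """Run gh and parse its JSON output (requires gh to be installed and logged in)."""
    result = subprocess.run(
        pr_view_command(pr_number), capture_output=True, text=True, check=True
    )
    return json.loads(result.stdout)


def summarize(pr: dict[str, Any]) -> str:
    """Condense reviews and checks into short lines (field shapes are assumed)."""
    lines = [f"{pr.get('title', '?')} [{pr.get('state', '?')}]"]
    for review in pr.get("reviews", []):
        author = (review.get("author") or {}).get("login", "unknown")
        lines.append(f"review by {author}: {review.get('state', '?')}")
    for check in pr.get("statusCheckRollup", []):
        name = check.get("name") or check.get("context") or "unnamed check"
        outcome = (
            check.get("conclusion") or check.get("state") or check.get("status") or "?"
        )
        lines.append(f"check {name}: {outcome}")
    return "\n".join(lines)
```

Because gh resolves the repository from the working directory, the call needs no project identifier, which is the linking behavior described above.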

Production monitoring as the final layer

Even with tests, reviews, and branch protections, stuff will slip through. Code that passes every check can still behave unexpectedly once real users hit it with real data, especially at real scale. So the final layer is monitoring in production.

I mainly use Sentry for monitoring, while the agent also has access to Vercel's observability for performance and logs. When something breaks in production (and something will), I want to know about it fast, and I want the agent to be able to handle it. For this it needs the full context: what failed, what the request looked like, what happened leading up to it. That context gets passed to the agent for the fixes and the next iteration of the codebase.
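A sketch of how that context could be bundled into a single brief for the agent. The ErrorContext fields are a hypothetical shape chosen to mirror the three questions above; they are not the schema of Sentry, Vercel or any other tool.

```python
from dataclasses import dataclass, field


@dataclass
class ErrorContext:
    """Hypothetical shape; not the event schema of any monitoring tool."""
    error: str
    stack_trace: str
    request: str
    breadcrumbs: list[str] = field(default_factory=list)


def to_agent_brief(ctx: ErrorContext, max_breadcrumbs: int = 10) -> str:
    """Turn one production failure into a short text brief for the agent."""
    # Keep only the most recent steps so the brief stays small.
    recent = ctx.breadcrumbs[-max_breadcrumbs:] if max_breadcrumbs > 0 else []
    trail = "\n".join(f"  - {step}" for step in recent) or "  (none recorded)"
    return (
        f"What failed: {ctx.error}\n"
        f"Request: {ctx.request}\n"
        f"Leading up to it:\n{trail}\n"
        f"Stack trace:\n{ctx.stack_trace}"
    )
```

Capping the breadcrumbs is a deliberate trade-off: the agent gets the steps closest to the failure without the brief eating the context window.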

It’s never done

As code production becomes near free through agents, the workflow and process are now what you need to improve. I review my workflow after every major chunk of work. The process, not the code. What slowed things down? Where did the agent get confused? Are test suites getting too heavy? Are there new tools worth trying?

When an agent keeps making the same type of mistake, it usually means my instructions are unclear. To resolve this, I review and update the AGENTS.md and any related documentation with the agent, so that it can write the instructions for itself. Over time, these small adjustments keep compounding.

Give it a few months and your workflow will probably look completely different. I know mine will.

Closing notes

Every piece of this workflow exists because something broke and I had to make sure it wouldn't break the same way again. After months of running it, the interesting part isn't any individual layer. It's how they compound. Tests make review faster. Review makes commits cleaner. Cleaner commits make monitoring quieter. And the context files mean the agent that writes tomorrow's code is smarter than the one that wrote today's.

It's all just plumbing. But the plumbing is what lets you trust the output.

Mikko Tikkanen