Six months ago, in my article "Can AI Code Solo? Testing the Limits of AI Developers," I documented my first experiments with early agentic tools like Roo Code. The potential was visible, but the process was difficult. Today, the landscape has evolved significantly. This follow-up details my journey with a new generation of models and tools—including Claude 4.5, Gemini 3 Pro, and platforms like Antigravity—to explore what has changed and what might come next.

As before, the experiences I share are from the perspective of a hobby programmer exploring these technologies outside of a professional work environment. Your mileage may vary if you're using these tools in a corporate/production setting.

The Token Problem: Stopping Roo Usage

Following the first report in which I detailed my Roo Code usage, I configured a robust multi-agent setup in Roo Code to facilitate test-driven development. This system utilized a custom orchestrator designed to create detailed plans and verify their execution—an approach that, while perhaps more complex than typical usage, was necessary to produce reliable code at the time.
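
For readers who have not used Roo Code, the loop looked conceptually like the sketch below. This is only an illustrative Python outline, not Roo Code's actual configuration; the function names are placeholders for the specialized agent modes the orchestrator delegated work to.

# Illustrative sketch of an orchestrate-implement-verify loop.
# None of these functions belong to Roo Code; they stand in for the
# planning, coding, and testing modes the orchestrator called on.

def plan_tasks(requirement: str) -> list[str]:
    """Placeholder: ask a planning agent to break a requirement into tasks."""
    return [f"Write failing tests for: {requirement}",
            f"Implement code until the tests pass for: {requirement}"]

def implement(task: str) -> str:
    """Placeholder: ask a coding agent to carry out a single task."""
    return f"patch for {task}"

def run_tests() -> bool:
    """Placeholder: run the project's test suite and report success."""
    return True

def orchestrate(requirement: str, max_retries: int = 3) -> None:
    for task in plan_tasks(requirement):
        for _attempt in range(max_retries):
            implement(task)
            if run_tests():  # verifier step: only accept a green build
                break
        else:
            raise RuntimeError(f"Task kept failing verification: {task}")

orchestrate("Add a Wordle solver that filters candidates by feedback")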

My experiments with Roo yielded several compelling projects. Notably, I successfully implemented a diffusion model from scratch—with zero manual coding—and developed multiple Wordle solvers employing varied strategies with mixed success. The process was both enjoyable and illuminating.
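
To give a flavor of what those strategies look like: the simplest solver scores each guess against the hidden word and filters the remaining candidates by that feedback. The sketch below is my own minimal illustration of the idea, not the code the agents produced.

from collections import Counter

def feedback(guess: str, answer: str) -> str:
    """Score a guess: G = right letter/right spot, Y = right letter/wrong spot, B = absent."""
    result = ["B"] * 5
    remaining = Counter()
    for i, (g, a) in enumerate(zip(guess, answer)):
        if g == a:
            result[i] = "G"
        else:
            remaining[a] += 1
    for i, g in enumerate(guess):
        if result[i] == "B" and remaining[g] > 0:
            result[i] = "Y"
            remaining[g] -= 1
    return "".join(result)

def filter_candidates(candidates: list[str], guess: str, pattern: str) -> list[str]:
    """Keep only words that would have produced the observed feedback."""
    return [word for word in candidates if feedback(guess, word) == pattern]

words = ["crane", "crate", "trace", "slate", "brace"]
print(filter_candidates(words, "crane", feedback("crane", "trace")))  # ['trace', 'brace']

Smarter strategies rank each guess by how much it is expected to shrink this candidate list, but the filtering step above is the common core.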

However, despite the technical success, the "token tax" proved unsustainable. Preserving the necessary context across multiple agents required an exorbitant number of tokens. With the costs outweighing the results, I stopped using Roo (with the aforementioned setup) in August and went looking for more efficient alternatives.

Specialized Agents

I shifted my focus to the coding agent in Google AI Studio (ai.dev). With Gemini 3 Pro, its capabilities are much improved. I built an app that generates "nano-banana" calendars and exports them as PDF files; the result was solid and deployable. The platform does have constraints stemming from its commercial structure, however, such as API costs for the apps it produces.
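
For context, the core of the export step is simple in Python. The sketch below draws a month grid using the standard calendar module and reportlab; it is only an illustration of the task, since I cannot say which libraries the generated app actually used.

import calendar
from reportlab.lib.pagesizes import A4
from reportlab.pdfgen import canvas

def export_month(path: str, year: int, month: int) -> None:
    """Draw a simple month grid; generated artwork could be placed above it with drawImage."""
    width, height = A4
    c = canvas.Canvas(path, pagesize=A4)
    c.setFont("Helvetica-Bold", 20)
    c.drawString(50, height - 60, f"{calendar.month_name[month]} {year}")
    c.setFont("Helvetica", 12)
    cell_w = (width - 100) / 7
    cell_h = 30
    top = height - 100
    for row, week in enumerate(calendar.monthcalendar(year, month)):
        for col, day in enumerate(week):
            x, y = 50 + col * cell_w, top - row * cell_h
            c.rect(x, y - cell_h, cell_w, cell_h)
            if day:  # monthcalendar pads with 0 for days outside the month
                c.drawString(x + 5, y - 15, str(day))
    c.showPage()
    c.save()

export_month("calendar.pdf", 2025, 11)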

Within AI Studio/Canvas, I also tested the agent on a browser RTS game (a simple "Starcraft clone") with three units, two resources, and a tech tree of four buildings. Getting started was easy, but iterating was painfully slow: the agent rewrote entire files for every edit instead of applying diffs. The agent is powerful, but still early in its development.
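
For a sense of scale, the entire tech tree fits in a few lines of data. The building names below are stand-ins I chose for illustration, not necessarily what the agent generated.

# Hypothetical tech tree: each building lists its prerequisites.
TECH_TREE = {
    "depot":    [],
    "barracks": ["depot"],
    "forge":    ["depot"],
    "lab":      ["barracks", "forge"],
}

def can_build(building: str, built: set[str]) -> bool:
    """A building becomes available once all of its prerequisites exist."""
    return all(req in built for req in TECH_TREE[building])

print(can_build("lab", {"depot", "barracks"}))           # False: forge missing
print(can_build("lab", {"depot", "barracks", "forge"}))  # True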

Other specialized agents, like gemini-cli and Claude Code, are designed to operate from the command line. While intriguing in principle, I ultimately found limited practical use for them, consistently favoring agents integrated within an IDE that leverage the same underlying models.

Antigravity Tool

In November, Antigravity was released, and it is a far better agent setup than Roo. It supports both Claude and Gemini, and it is currently free, most likely to attract developers. The difference is substantial and the experience is polished. Antigravity appears to be based on, or at least heavily inspired by, Windsurf.

The important part is how the models work together: Gemini 3 Pro is strong at UX and design, while Claude 4.5 is better at logic and at finding bugs. Playing to both strengths, I built a Deep Research Agent with a TUI (text user interface) using Python's textual library in a single evening.

The project runs to about 2,000 lines of code. Unlike my initial experience in May, the code quality is genuinely high, development is faster, and there is far less frustration. Interestingly, the agent is configured in a way that noticeably speeds up each iteration.
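
To show how little boilerplate textual requires, here is a stripped-down skeleton of that kind of TUI. The widget layout is my own simplification, and the actual agent wiring (search, summarization, model calls) is omitted.

from textual.app import App, ComposeResult
from textual.widgets import Header, Footer, Input, RichLog

class ResearchTUI(App):
    """Minimal shell for a research agent: a query box and a scrolling log of findings."""

    BINDINGS = [("q", "quit", "Quit")]

    def compose(self) -> ComposeResult:
        yield Header()
        yield Input(placeholder="Enter a research question...")
        yield RichLog(highlight=True, markup=True)
        yield Footer()

    def on_input_submitted(self, event: Input.Submitted) -> None:
        log = self.query_one(RichLog)
        log.write(f"[bold]Query:[/bold] {event.value}")
        # The real app would dispatch the query to the research agent
        # and stream the findings back into this log.
        log.write("(agent results would appear here)")

if __name__ == "__main__":
    ResearchTUI().run()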

Tracking Progress: Leaderboards

As the tools change, we need a way to measure progress. We are moving past static benchmarks toward live evaluations; I recommend looking at LiveSWE-Bench and LiveBench.

These leaderboards show how models handle real engineering tasks and freshly written questions. The competition at the top is fierce, and models swap positions often, which shows just how fast the state of the art is advancing.

However, be careful: it is very hard to translate these numbers into what the tools are actually like to use. A high score on a chart does not always mean the tool feels good in practice. Benchmarks are artificial; real work is messy. Do not trust the rankings blindly; try the tools yourself to see how they fit your specific work and style. Effectiveness often depends on you: on how clearly you express what you want and how well you guide the agent to fix what does not work.

Conclusion: Are We Obsolete?

What about the Singularity? Will an LLM replace me?

Not yet. Tools like Antigravity, Cursor, and the updated Roo increase productivity, but they still need a pilot. The models are smarter, yet they remain engines: they require a human who knows how to use them effectively. It is possible we are witnessing a transition from programming in high-level languages to programming in English or another natural language. We become more effective, not redundant.

Currently, we are far from replacing any senior developer with the agentic coders we have today. That said, senior developers will be more effective in their jobs when using these tools.

Please keep in mind that this article covers my personal exploration of today's AI coding landscape. The field is vast and evolving quickly, and I'm eager to learn from the community. I invite you to reach out on Twitter and share your own experiences, insights, and favorite tools.