Introduction: My Journey into AI-Powered Development
Can Artificial Intelligence truly write entire software programs without human intervention? Since December 2024, outside of my regular work, I've been on a mission to find out. I wanted to see just how far current AI tools could take the concept of "hands-free" coding. What started as an experiment quickly became a fascinating journey, as the capabilities of these tools evolved at an astonishing pace over just a few months.
In this post, I'll share my experiences: the tools I've tried, the workflows I've adopted, and the surprising results I've achieved. We'll look at what's genuinely possible with AI development today, where the limitations still lie, and what needs to improve. Think of this not as a formal survey, but as a practical snapshot from the trenches of pushing AI coding to its current limits.
Consider this document a time capsule. The AI landscape changes incredibly fast, so this reflects the state of things in May 2025. Hopefully, looking back on this in the future will highlight just how much progress continues to be made.
Early Explorations (December 2024)
My initial foray into AI coding was quite manual. The process revolved around a constant back-and-forth between the AI chat interface and my local development environment. Because the tools lacked built-in execution or file editing, I'd prompt the AI for code, copy the generated snippet, paste it into my editor (like Visual Studio Code), run it locally, and then copy any error messages back to the AI for debugging.
This copy-paste cycle became tedious, especially for larger programs or when the AI forgot previous context, requiring me to re-paste the same code sections repeatedly. There was no concept of state persistence or direct file access for the AI. Despite these hurdles, this basic loop was enough to get some initial working demos off the ground.
- ChatGPT 4o: Showed early promise. It could generate code and even show execution results within its interface, but its relatively limited context window made building longer programs difficult.
- AI Studio with Gemini 2.0 Pro: This was a mind-blowing experience. I asked it to build a Wordle solver using the Proximal Policy Optimization (PPO) algorithm, providing the original PPO paper as context. Impressively, its first response questioned the suitability of PPO for Wordle's action space! Ultimately, it produced a functional PPO trainer (albeit not a great solver) implemented first in PyTorch and later successfully adapted to JAX upon request. The generated code was verifiable and correct.
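For context on what the model had to get right, here is a minimal, hand-written sketch of PPO's clipped surrogate objective in plain NumPy. It is not the code Gemini generated (that involved full PyTorch and JAX training loops); the function and variable names here are purely illustrative.

```python
import numpy as np

def ppo_clip_loss(new_log_probs, old_log_probs, advantages, clip_eps=0.2):
    """Clipped surrogate objective from the PPO paper (Schulman et al., 2017).

    new_log_probs / old_log_probs: log-probabilities of the sampled actions
    under the current and the data-collecting policy; advantages: estimated
    advantage for each (state, action) pair. All are 1-D arrays of equal length.
    """
    ratio = np.exp(new_log_probs - old_log_probs)              # pi_new / pi_old
    clipped_ratio = np.clip(ratio, 1.0 - clip_eps, 1.0 + clip_eps)
    # PPO maximises the element-wise minimum of the clipped and unclipped
    # objectives; negating the mean turns it into a loss to minimise.
    return -np.mean(np.minimum(ratio * advantages, clipped_ratio * advantages))
```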
Refining the Process and a Paradigm Shift (January - April 2025)
The exploration continued with new models and tools emerging:
- Claude 3.5: This model felt even more reliable than Gemini 2.0 Pro for coding tasks. However, daily usage limits and a tendency to run out of context hampered its usefulness for complex projects. While I never completed a full application solely with Claude, its ability to debug code generated by other models was impressive.
- The Paradigm Shift - Agentic Coding: The introduction of agentic coding frameworks marked a significant turning point. I started using Roo Code (a fork of the earlier Cline). These tools employ multiple AI agents collaborating on a project – typically including roles like Coder, Architect, and an Orchestrator (like a project manager) delegating tasks; a toy sketch of this delegation pattern follows after this list.
- Suddenly, working with multi-file projects became feasible. Agents could intelligently decide which files to edit and what changes to make.
- This unlocked the potential to tackle more than just toy problems. However, it introduced a new challenge: cost. While powerful models like Claude were effective, their API token costs were prohibitive for extensive experimentation. During this period, accessing the more affordable Gemini Pro models reliably was difficult due to throttling issues.
- My workflow evolved but still involved some manual steps: generating code changes in Roo Code within VS Code, running the web application, observing errors in the browser's developer console, and instructing Roo Code to fix them.
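To make the division of labour concrete, here is a deliberately simplified, hypothetical sketch of the orchestration idea. It is not Roo Code's actual implementation: `call_llm`, the role prompts, and the task breakdown are placeholders, and real frameworks add tool use, file editing, and approval steps on top.

```python
from dataclasses import dataclass, field
from typing import Callable

@dataclass
class Task:
    role: str          # e.g. "architect" or "coder"
    description: str

@dataclass
class Orchestrator:
    """Toy project-manager agent: breaks a feature into role-specific tasks."""
    shared_context: list[str] = field(default_factory=list)

    def plan(self, feature: str) -> list[Task]:
        # A real orchestrator would ask an LLM to produce this breakdown.
        return [
            Task("architect", f"Outline the modules and files needed for: {feature}"),
            Task("coder", f"Implement the planned changes for: {feature}"),
        ]

    def run(self, feature: str, call_llm: Callable[[str], str]) -> list[str]:
        results = []
        for task in self.plan(feature):
            prompt = (f"You are the {task.role}.\n"
                      f"Task: {task.description}\n"
                      "Context so far:\n" + "\n".join(self.shared_context))
            output = call_llm(prompt)            # one model call per delegated task
            self.shared_context.append(output)   # later roles see earlier results
            results.append(output)
        return results
```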
Hitting Stride with Advanced Models (April - May 2025)
The release of Gemini 2.5 Pro felt like the second major paradigm shift. It offered performance comparable to top-tier models like Claude but at a fraction of the API cost. This made extensive agentic coding economically viable.
With a powerful and affordable model driving the agents, complex tasks became much easier. Gemini 2.5 Pro started "one-shotting" (generating correct code on the first try) many problems. This spurred me to create The Sandbox, a collection of AI-generated projects, as I felt a significant capability threshold had been crossed.
- It effortlessly one-shotted the Mandelbulb rendering project, though I suspected similar examples existed on platforms like ShaderToy.
- The truly mind-blowing moment came with the Flower Constellations project. This involved implementing orbital mechanics described in a technical paper. I fed the agent the European Space Agency's "Flower Constellations" PDF report, pointed it to the specific method, and asked for an implementation. It produced a perfect, working demo in one shot: Flower Constellations Demo. This demonstrated an ability to understand and apply complex, specialized knowledge from external documents.
Moving Beyond Simple Demos: Complex Projects Take Shape
Armed with Roo Code and the Gemini 2.5 Pro backend, I found that building relatively sophisticated applications was becoming increasingly reliable.
- The Mandelbox project took several days to develop via the AI agent. While the core concept might have existed on ShaderToy, I guided the AI to implement specific variations and features I'm confident were novel, making it my first substantial AI-driven project.
- The Wordle Solver demo exemplified a more rigorous approach. I configured Roo Code to follow Test-Driven Development (TDD). The agent first wrote over a hundred unit tests. Then, for each feature, it would add relevant tests, implement the code, and iterate until all tests passed. The final result included over 5,000 lines of documented code, accompanied by AI-generated design specifications, architecture documents, and gap analyses.
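To give a feel for that test-first loop, here is a small, hand-written illustration in Python. The actual Wordle Solver demo is an AI-generated web application, so its real code and tests look different; this only mirrors the shape of the workflow: write a failing test for a behaviour, then implement until it passes.

```python
def score_guess(guess: str, answer: str) -> list[str]:
    """Return Wordle-style feedback: 'green', 'yellow', or 'grey' per letter."""
    feedback = ["grey"] * len(guess)
    remaining = list(answer)
    # First pass: exact matches consume their letters.
    for i, (g, a) in enumerate(zip(guess, answer)):
        if g == a:
            feedback[i] = "green"
            remaining.remove(g)
    # Second pass: right letter, wrong position, if any copies are left.
    for i, g in enumerate(guess):
        if feedback[i] == "grey" and g in remaining:
            feedback[i] = "yellow"
            remaining.remove(g)
    return feedback

def test_score_guess_handles_correct_misplaced_and_absent_letters():
    assert score_guess("CRANE", "CARGO") == ["green", "yellow", "yellow", "grey", "grey"]

def test_score_guess_does_not_double_count_repeated_letters():
    # Only one 'L' in the answer, so the second 'L' in the guess stays grey.
    assert score_guess("LLAMA", "LEMON") == ["green", "grey", "grey", "yellow", "grey"]
```

The two-pass structure (greens first, then yellows against the leftover letters) is exactly the kind of edge case the agent's tests had to pin down before the implementation settled.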
Despite these successes, a key bottleneck remained: the lack of a direct feedback loop between the browser and the coding agent. Manually relaying JavaScript errors from the Chrome developer console back to Roo Code still slowed down the iteration cycle considerably.
Assessing the Current State: How Well Does AI Code?
So, where do things stand? The progress is undeniable, but it's not perfect.
Is it faster than coding yourself? The answer is nuanced. If a project involves technologies you're unfamiliar with, AI can drastically speed up the initial development, getting you to a working prototype much faster than learning everything from scratch. However, the common "80/20 rule" often applies in reverse: AI might handle the first 80% of the project in 1% of the time, but completing the final 20% (debugging, refining, handling edge cases) can take 99% of the effort.
What works exceptionally well is rapid prototyping and exploring new tech stacks. Furthermore, agentic coding, especially with a TDD approach, can lead to robust and well-structured applications. The downside is cost and time. The TDD process inherently consumes more tokens (and thus money) and requires more iterations as tests are written, failed, and passed. However, the resulting quality and maintainability might justify the investment, especially for larger projects. By updating design documents and tests before adding features, the AI agent maintains a better understanding of the overall architecture, improving its chances of success.
The ability of frontier models like Gemini 2.5 Pro to process and implement information from dense technical documents (like the ESA paper) is truly remarkable. It feels like science fiction. Reports from places like Google DeepMind suggest similar feats, such as feeding Gemini 2.5 Pro the original Q-learning paper and having it implement the algorithm to learn Pong autonomously. This level of comprehension and application is astounding.
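For reference, the heart of what has to be extracted from such a paper is the tabular Q-learning update rule. The sketch below is a minimal illustration assuming a Gymnasium-style environment with discrete states and actions; it is not the code from that experiment, and learning Pong in practice needs function approximation (a DQN-style network) rather than a lookup table.

```python
import random
from collections import defaultdict

def q_learning_episode(env, Q, alpha=0.1, gamma=0.99, epsilon=0.1):
    """Run one episode of tabular Q-learning, updating Q in place.

    Assumes a Gymnasium-style environment with hashable discrete observations
    and a discrete action space; Q maps (state, action) pairs to values.
    """
    state, _ = env.reset()
    done = False
    while not done:
        # Epsilon-greedy exploration over the discrete action set.
        if random.random() < epsilon:
            action = env.action_space.sample()
        else:
            action = max(range(env.action_space.n), key=lambda a: Q[(state, a)])
        next_state, reward, terminated, truncated, _ = env.step(action)
        done = terminated or truncated
        # Core update: move Q(s, a) toward the bootstrapped target
        # r + gamma * max_a' Q(s', a').
        best_next = max(Q[(next_state, a)] for a in range(env.action_space.n))
        Q[(state, action)] += alpha * (reward + gamma * best_next - Q[(state, action)])
        state = next_state

def train(env, episodes=500):
    Q = defaultdict(float)                 # unseen (state, action) pairs start at 0
    for _ in range(episodes):
        q_learning_episode(env, Q)
    return Q                               # read off the greedy policy from Q
```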
Another powerful example is the Procedural Planet Editor. I tasked Gemini 2.5 Pro (using its Deep Research capabilities) with surveying planet generation techniques. I then fed this AI-generated report back to the agent and asked it to implement the methods in a demo. The initial version was largely complete, though the final polish, as usual, required significant fine-tuning.
Key Challenges and Considerations
Despite the successes, several significant challenges and "gotchas" remain when relying heavily on AI for development:
- Scalability and Architectural Consistency: While agentic coding creates impressive demos, I'm still unsure about the upper limit of project size it can handle effectively. Without strong, human-guided architectural oversight, there's a risk that agents working on different parts of a large application might develop inconsistent interaction patterns or suboptimal structures simply because they lack a unified vision.
- Monetary Cost: Development using powerful AI models via APIs isn't free. The TDD approach for the Wordle Solver demo, for example, cost around $90. Roughly $20 of that was wasted when the agent got stuck in a debugging loop, repeatedly failing integration tests because it forgot a crucial detail documented elsewhere (the UI used different color conventions than the backend). I eventually had to intervene manually to break the loop.
- Tool Use Failures and Compounding Costs: When the AI agent fails to use a tool correctly (like applying a code diff improperly), it often retries, increasing the conversation history (context) and leading to further, potentially unnecessary, tool calls. This inflates the token count sent to the model on subsequent turns, driving up costs—unless the backend service has very efficient caching. A significant portion of the budget for complex tasks can currently be attributed to these failed attempts. Hopefully, models will improve their internal diff-handling capabilities to mitigate this.
- Context is King: The ability of models to handle long context windows seems crucial. I strongly suspect the success of these demos relies heavily on the extensive context built up during debugging. Models with shorter context lengths consistently struggled with more complex tasks.
- Essential User Supervision: Leaving AI agents completely unsupervised can be risky and expensive. Reports on forums like the RooCode subreddit mention users experiencing costly editing loops (sometimes hundreds of dollars) with both Gemini 2.5 Pro and Claude 3.7 because they allowed the agents to auto-approve all changes without review.
- Automation Pitfalls: An experiment I ran using a Playwright MCP server to give the agent full browser control (reading console logs, taking screenshots autonomously) highlighted issues. While aiming for full automation, it paradoxically increased overhead due to excessive back-and-forth communication compared to human guidance. Furthermore, when tasked with adding a simple animation to *this* page, the agent entered a $5 loop of failed attempts, producing nothing. This demonstrated how, without oversight, agents can pursue unproductive paths at significant cost, reinforcing the need for human judgment, especially when budgets are a concern.
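For anyone curious what that browser feedback loop looks like in code, here is a rough sketch using Playwright's Python API directly. It is not the MCP server setup I ran verbatim (that wiring is omitted); it only shows the idea of collecting console errors and a screenshot that can then be handed back to the coding agent.

```python
from playwright.sync_api import sync_playwright

def capture_page_feedback(url: str, screenshot_path: str = "page.png") -> list[str]:
    """Load a page headlessly, collect console errors, and save a screenshot."""
    errors: list[str] = []
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        page = browser.new_page()
        # Record anything the page logs at error level, e.g. uncaught JS exceptions.
        page.on("console", lambda msg: errors.append(msg.text) if msg.type == "error" else None)
        page.goto(url)
        page.wait_for_load_state("networkidle")
        page.screenshot(path=screenshot_path, full_page=True)
        browser.close()
    return errors
```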
Looking Ahead: The Future of AI in Development
One thing seems certain: AI is rapidly becoming an indispensable tool in the software developer's toolkit. Whether it's writing unit tests, refactoring complex code, generating boilerplate, adding features, or even assisting with system design, Large Language Models (LLMs) are increasingly involved.
Right now, agentic coding excels at creating impressive prototypes and enabling developers to quickly learn and utilize unfamiliar technologies. However, it's still an open question whether today's tools can reliably build and maintain truly large-scale, mission-critical applications without significant human oversight and architectural guidance. Challenges around coordination, cost, context management, and ensuring architectural integrity remain.
But the pace of progress is breathtaking. By the time you read this article, the landscape may have shifted yet again. It's highly likely that using AI assistants throughout the entire software development lifecycle will become standard practice. While AI agents might not fully replace skilled human engineers in the near future, they are undeniably transforming the development process and dramatically amplifying what a single developer can achieve.
If you have thoughts to share on this topic, feel free to join the discussion on this X thread.
Want to explore the projects mentioned? Check out my collection of AI-coded experiments: The Sandbox.
You are reading a version of this article that Gemini rewrote to flow better while keeping all of the original content. The human-written version can be found here.