Intro

Since December 2024, outside of work, I've been exploring how far I can go in getting complete programs written without manually coding anything myself. The idea was to test the boundaries of what AI tools can currently achieve when it comes to hands-free software development. In just a few months, the tools have changed rapidly, and the progress has been nothing short of astonishing.

I'll begin with a brief overview of the tools and approaches I've explored, then dive deeper into how things are working today: what's achievable, what limitations remain, and where the technology still needs to improve. This is less a survey and more a hands-on snapshot of current AI-assisted development from someone trying to push its boundaries.

Consider this entire document to be a time capsule. Things will only improve from here. It is meant as a snapshot in time, to be read in the future to see how much things have progressed since it was written.

December

Workflow at the Time

During this period, the development process was extremely manual and repetitive. Since there were no built-in tools for executing or editing code directly in the chat interfaces, the workflow essentially involved copying and pasting entire blocks of code between the AI interface and a local editor or terminal. When the AI generated code, I would manually paste it into my local development environment, run it, and then copy any error messages or stack traces back into the chat for debugging.

This meant that every iteration required round-tripping between the AI and the local machine. For larger programs, this became tedious fast—especially when the same section of code had to be re-pasted multiple times because the AI couldn't remember prior outputs. The models weren’t able to maintain state across sessions, and there was no way to point them to files directly. Despite these limitations, I still managed to produce working demos using this crude but effective loop.

January-April

April-May

Beyond Toy Problems

How well does it work?

Things are looking good, but they are clearly not perfect.

First off, is it faster than coding something yourself? It depends. If the project involves technologies you're not familiar with, it will almost certainly get you to a working demo far more quickly than shouldering the large overhead of learning those technologies first. However, while the first 80% of the project may take 1% of the time, the last 20% will take 99% of the time using this method.

What works well is clearly getting a prototype up and running very quickly. Agentic coding can also enable very solid designs, because it makes a test-driven approach practical. The downside of this approach is cost: it burns far more tokens, since everything needs to be tested, debugged, improved, and so on, and it takes longer to get anything done because much more is demanded of the agents. Still, I think this may be the best way forward despite the cost, since it produces the most robust outcome. The reason is that when adding new functionality, we would typically update the architecture document, the unit tests, the UI tests, and so on to reflect the change. When the next feature comes along, the model reads these documents and understands the overall architecture before touching the code. This massively improves the chances of success in a large project.
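
As a concrete illustration of how that plays out (the function and file names below are hypothetical, not taken from any of my actual projects): each feature ships with tests that encode the rules written down in the architecture document, and the agent has to keep them green while making its changes.

    # Hypothetical sketch: the architecture doc says business logic lives in
    # pure functions in the core package, and the tests pin those rules down
    # so the agent has to keep them green when it adds features.

    def apply_discount(price: float, percent: float) -> float:
        """Core-layer function; the doc's rule is that prices never go negative."""
        return max(0.0, price * (1.0 - percent / 100.0))

    # tests/test_pricing_contract.py would contain checks like these:
    def test_discount_is_clamped_to_zero():
        assert apply_discount(price=10.0, percent=150) == 0.0

    def test_discount_is_deterministic():
        # Pure function: same inputs, same output, no hidden state.
        assert apply_discount(10.0, 25) == apply_discount(10.0, 25)

    # Called directly here so the sketch runs standalone; normally pytest runs them.
    test_discount_is_clamped_to_zero()
    test_discount_is_deterministic()

Because the conventions live in the repository rather than only in my head, the agent can re-read them on every new feature.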

Another tremendous development is that with the capabilities of today's frontier models, it is possible to take a 160-page technical document written by the European Space Agency, pinpoint a specific method, and get it implemented. This, to me, is absolutely mind-blowing. I believe DeepMind recently showed that they fed the Q-learning paper to Gemini 2.5 Pro and got it to implement the algorithm and train a model to learn how to play Pong. Again, this sounds like absolute science fiction to me, and I can hardly believe we're here.
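
For anyone who hasn't seen it, the core of tabular Q-learning fits in a handful of lines. The sketch below is my own minimal illustration of the update rule, not DeepMind's code; a Pong-playing agent would replace the table with a neural network approximating Q, but the idea is the same.

    import random
    from collections import defaultdict

    # Minimal tabular Q-learning loop. `env` is assumed to expose
    # reset() -> state and step(action) -> (next_state, reward, done).
    def q_learning(env, actions, episodes=500, alpha=0.1, gamma=0.99, epsilon=0.1):
        Q = defaultdict(float)  # Q[(state, action)] -> estimated return
        for _ in range(episodes):
            state, done = env.reset(), False
            while not done:
                # Epsilon-greedy exploration.
                if random.random() < epsilon:
                    action = random.choice(actions)
                else:
                    action = max(actions, key=lambda a: Q[(state, a)])
                next_state, reward, done = env.step(action)
                # Move Q(s, a) toward the bootstrapped target r + gamma * max_a' Q(s', a').
                best_next = max(Q[(next_state, a)] for a in actions)
                Q[(state, action)] += alpha * (reward + gamma * best_next - Q[(state, action)])
                state = next_state
        return Q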

Another example: for the Procedural Planet Editor, I asked Gemini 2.5 Pro with Deep Research to create a comprehensive survey of all planet creation methods. I then fed that report back in and asked it to write the code implementing them and to turn it into a demo! What you see in the final demo started exactly like this, and it was basically done in that first pass. Of course, the last 20% took many more hours of fine-tuning, as usual.
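
To give a flavor of the kind of method such a survey covers (this particular one is my assumption about a typical entry, not necessarily what the report contained), a very common technique is displacing a sphere's vertices with fractal noise:

    import math

    # Fractal Brownian motion: sum several octaves of a base 3D noise function.
    # noise3 can be any smooth noise in [-1, 1] (Perlin, simplex, ...); it is
    # passed in rather than implemented here.
    def fbm(noise3, x, y, z, octaves=5, lacunarity=2.0, gain=0.5):
        total, amplitude, frequency = 0.0, 1.0, 1.0
        for _ in range(octaves):
            total += amplitude * noise3(x * frequency, y * frequency, z * frequency)
            amplitude *= gain
            frequency *= lacunarity
        return total

    def displace_vertex(noise3, vx, vy, vz, radius=1.0, height_scale=0.05):
        # Project the vertex onto the unit sphere, then push it along its
        # normal by the noise value to carve continents and mountains.
        length = math.sqrt(vx * vx + vy * vy + vz * vz)
        nx, ny, nz = vx / length, vy / length, vz / length
        r = radius * (1.0 + height_scale * fbm(noise3, nx, ny, nz))
        return nx * r, ny * r, nz * r

    # Placeholder noise just to make the sketch runnable:
    print(displace_vertex(lambda x, y, z: math.sin(3 * x) * math.cos(3 * y), 1.0, 0.2, 0.3))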

What are the biggest gotchas with this approach?

I'm still uncertain about the size boundary of what's possible to design using agentic coding. It's clear that it's a powerful tool capable of creating impressive demos seamlessly. However, without a strong architectural overview, there's a risk of agents independently creating many varying interaction patterns within a large application, simply due to the absence of clear, consistent guidelines.

Another significant consideration is monetary cost. For instance, developing the Wordle Solver application cost approximately $90 using a test-driven development approach, making it the most expensive demo I've created so far. Around $20 of this budget was wasted on a debugging loop caused by the agent forgetting to document a crucial detail: the UI encoded letter feedback as colors while the backend followed a different convention, so the integration tests kept failing. I ultimately intervened with a hint, which resolved the issue and broke the loop.
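
Conceptually the fix was just an explicit translation layer between the two conventions, pinned down by a test; the names below are invented for illustration rather than taken from the actual project.

    # The UI spoke in colors, the backend in its own feedback codes, and the
    # mapping between them was never written down anywhere the agent could read.
    UI_TO_BACKEND = {
        "green": "correct",   # right letter, right position
        "yellow": "present",  # right letter, wrong position
        "gray": "absent",     # letter not in the word
    }

    def translate_ui_feedback(colors):
        """Convert the UI's color feedback into the backend's convention."""
        return [UI_TO_BACKEND[color] for color in colors]

    # A one-line check that pins the convention down for future changes:
    assert translate_ui_feedback(["green", "gray", "yellow", "gray", "gray"]) == [
        "correct", "absent", "present", "absent", "absent",
    ]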

Additionally, tool use failures often compound monetary costs. Each unsuccessful attempt increases the context size, leading to repeated, unnecessary tool interactions. As the context grows, the token count provided to the model also expands, directly impacting the overall cost—unless the server employs efficient caching with aggressive pricing. Currently, a significant portion of a project's budget can be attributed to these failed tool uses and unsuccessful diffs. Ideally, these expenses will decrease as model providers begin incorporating improved diff-handling capabilities directly within their models.
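
A rough back-of-the-envelope model shows why this compounds; every number below is a made-up assumption rather than any real provider's pricing.

    # Toy model of what repeated failed tool calls cost. All figures are
    # illustrative assumptions, not measurements or actual prices.
    PRICE_PER_MILLION_INPUT_TOKENS = 3.00  # dollars, assumed
    TOKENS_ADDED_PER_FAILURE = 8_000       # failed diff + error output appended to context
    CACHED_TOKEN_DISCOUNT = 0.9            # cached input assumed to cost 10% of full price

    def retry_cost(base_context_tokens, attempts, cached_fraction=0.0):
        """Each retry resends the whole, growing context; caching discounts the repeated part."""
        billable_tokens, context = 0.0, float(base_context_tokens)
        for _ in range(attempts):
            cached = context * cached_fraction
            billable_tokens += (context - cached) + cached * (1.0 - CACHED_TOKEN_DISCOUNT)
            context += TOKENS_ADDED_PER_FAILURE
        return billable_tokens * PRICE_PER_MILLION_INPUT_TOKENS / 1_000_000

    # Five failed attempts on a 100k-token context, without and with aggressive caching:
    print(round(retry_cost(100_000, 5), 2))                       # ~1.74
    print(round(retry_cost(100_000, 5, cached_fraction=0.9), 2))  # ~0.33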

Context is king. I strongly suspect that the main reason these demos function effectively is due to the extensive context accumulated during debugging sessions. Shorter context-length models have consistently shown poorer results when developing complex demos.

User supervision remains essential. For instance, I've seen reports on the RooCode subreddit of both Gemini 2.5 Pro and Claude 3.7 entering costly edit loops, sometimes amounting to hundreds of dollars, because users had set the agent to auto-approve every edit request and let it run unattended.

An experiment attempting to fully automate the Roo Code agent highlighted challenges with both efficiency and behavior boundaries. Using a Playwright MCP server granted the agent the autonomy to modify the page and invoke tools like screenshot, read_console, or get_browser_console_logs. However, this hands-off approach paradoxically introduced overhead, likely resulting in far more back-and-forth interactions than a human-guided process would require. Furthermore, during this same automated session, the agent's behavior underscored how unclear the line is between productive work and ineffective loops. When tasked with creating a subtle animation for this page, it entered a costly cycle of failed attempts, ultimately producing no visible result while spending around $5. I only let this continue out of curiosity about self-correction, but it demonstrates how, without oversight, such agents can pursue unproductive paths, which makes user intervention necessary in budget-sensitive environments.

The Future

Looking ahead, one thing feels increasingly certain: AI is becoming an integral part of modern software development. Whether you're writing tests, refactoring legacy code, adding new features, or even architecting entire systems, large language models (LLMs) are now playing a central role in these tasks.

At this moment, agentic coding shines as a way to create compelling demos and to dive into unfamiliar technologies with unprecedented ease. But despite the impressive progress, it's not yet clear whether these tools can reliably handle the complexity of building and maintaining truly large-scale applications. Current limitations around coordination, architectural consistency, and context management still pose significant challenges.

That said, the pace of advancement is staggering. By the time you read this, the capabilities of these tools may have taken another leap forward. It's entirely plausible that using an AI assistant throughout the software development lifecycle—from inception to deployment—will be not just common, but expected. And while agentic workflows may not yet replace skilled engineers, they are rapidly transforming what a single person can build.

If you'd like to post your thoughts, you can use this thread.

Want to see my collection of AI-coded projects? It's here.