Intro
Since December 2024, outside of work, I've been exploring how far I can go in getting complete programs written without manually coding anything myself. The idea was to test the boundaries of what AI tools can currently achieve when it comes to hands-free software development. In just a few months, the tools have changed rapidly, and the progress has been nothing short of astonishing.
I'll begin with a brief overview of the tools and approaches I've explored, then dive deeper into how things are working today: what's achievable, what limitations remain, and where the technology still needs to improve. This is less a survey and more a hands-on snapshot of current AI-assisted development from someone trying to push its boundaries.
Consider this entire document to be a time capsule. Things will only improve from here. It is meant to be a snapshot in time, to be read in the future to see how much things have progressed since this was written.
December
Workflow at the Time
During this period, the development process was extremely manual and repetitive. Since there were no built-in tools for executing or editing code directly in the chat interfaces, the workflow essentially involved copying and pasting entire blocks of code between the AI interface and a local editor or terminal. When the AI generated code, I would manually paste it into my local development environment, run it, and then copy any error messages or stack traces back into the chat for debugging.
This meant that every iteration required round-tripping between the AI and the local machine. For larger programs, this became tedious fast—especially when the same section of code had to be re-pasted multiple times because the AI couldn't remember prior outputs. The models weren’t able to maintain state across sessions, and there was no way to point them to files directly. Despite these limitations, I still managed to produce working demos using this crude but effective loop.
- Using ChatGPT 4o
- It looked quite promising! The UI allowed me to ask for some code, and the code would be written. It would then show me the results of running the code. However, it was not possible to write long programs due to the relatively short context.
- Using AI Studio with Gemini 2.0 Pro
- This was the first time my mind was blown. I was using AI Studio and asked the model to write a Wordle solver. I provided it with the PPO paper and asked it to use PPO to write the solver. The first reply I got said that this might not be appropriate because of the size of Wordle's action space!
- In the end, I got a trainer which did implement PPO and which did indeed solve Wordle (poorly). Everything was working: I verified the code and it all looked right. It implemented everything in PyTorch, and I was even able to get it to properly use JAX.
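To make the PPO angle concrete, here is a minimal sketch of the clipped-surrogate update applied to a policy that picks guesses from a tiny candidate word list. This is my own illustration, not the code Gemini produced; the word list, state encoding, and advantage values are placeholder assumptions, and the value function and entropy terms of full PPO are omitted.

```python
# Minimal sketch (not the code Gemini generated): a PPO-style clipped update
# for a policy that picks Wordle guesses from a tiny candidate list.
import torch
import torch.nn as nn

WORDS = ["crane", "slate", "pious", "vague", "moist"]  # toy action space
STATE_DIM = 26 * 3   # toy encoding: per-letter status (unknown / absent / present)
CLIP_EPS = 0.2

class Policy(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(STATE_DIM, 64), nn.ReLU(), nn.Linear(64, len(WORDS))
        )
    def forward(self, state):
        return torch.distributions.Categorical(logits=self.net(state))

policy = Policy()
optimizer = torch.optim.Adam(policy.parameters(), lr=3e-4)

def ppo_update(states, actions, old_log_probs, advantages):
    """One clipped-surrogate PPO step over a batch of (state, action) pairs."""
    dist = policy(states)
    log_probs = dist.log_prob(actions)
    ratio = torch.exp(log_probs - old_log_probs)           # pi_new / pi_old
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - CLIP_EPS, 1 + CLIP_EPS) * advantages
    loss = -torch.min(unclipped, clipped).mean()           # maximize the surrogate
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Fake rollout data, just to show the shapes involved.
states = torch.randn(8, STATE_DIM)
with torch.no_grad():
    dist = policy(states)
    actions = dist.sample()
    old_log_probs = dist.log_prob(actions)
advantages = torch.randn(8)  # in a real solver: derived from green/yellow/gray feedback
print("loss:", ppo_update(states, actions, old_log_probs, advantages))
```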
January-April
- Using Claude 3.5
- This model seemed to be even more reliable than Gemini 2.0 Pro. The problems with it were that only a limited number of requests were allowed per day, and it very quickly ran out of context, which was a serious issue.
- I never really managed to write a full application with it, but it seemed very good, and it even debugged some bugs introduced by Gemini 2.0 Pro in AI Studio.
- The Paradigm Shift
- Agentic coding with Roo Code (a fork of Cline) opened up many more ways to use models that would otherwise not be useful due to their shorter contexts. It implemented a series of agents that act as Coders, Architects, or Orchestrators (i.e., project managers who delegate tasks).
- Multi-file projects were suddenly possible. The agents could choose what files to edit, what to change, etc.
- This enabled me to go beyond toy problems, but it was clear that there would be a tradeoff in cost. Claude was good for coding, but its per-token API cost was much too high for "messing around to find out what I can do with agentic coding." It was very frustrating because, at the time, Google didn't really offer a proper way to use Gemini Pro, which would have been the free alternative; the big problem was getting throttled by design and eventually banned and unbanned.
- The new workflow still involved round trips from the browser to Roo Code inside Visual Studio Code, instructing it to fix errors found in the Chrome Developer Tools console.
April-May
Gemini 2.5 Pro was released. This was the second paradigm shift for me because it provided much better pricing than Claude (a fraction of the price) while offering comparable performance.
- I was now able to use a very smart model for all the coding and planning tasks. This model was now one-shotting many problems with great ease. This is when I started putting together The Sandbox, because I started feeling like something important was happening.
- Gemini 2.5 Pro one-shotted the MandelBulb project. I was skeptical because this had definitely been done on ShaderToy before.
- The Lone Star constellation blew my mind because it had not been done before, and Gemini clearly couldn't implement it on its own. However, I downloaded the technical report from the European Space Agency, Flower Constellations - ESA Report (PDF), gave it as context, and asked for an implementation. It one-shotted a perfect version: Flower Constellations Demo.
Beyond Toy Problems
- With Roo Code and the Gemini 2.5 Pro backend, it was possible to write relatively complicated programs seemingly reliably. For example, I got the MandelBox project written by Gemini over a few days. I was not super-convinced, though, because it's clear that this is still within the ShaderToy domain. However, I was able to get Gemini to come up with other fractal variations which are now part of the project, and got it to do very specific things which I am sure were never done exactly like that before. This was my first "big" project.
- The Wordle Solver demo is something which I configured in Roo Code to be test-driven. It started by writing the unit tests (over one hundred of them), and as it implemented various design features, it would add tests before each feature, get the tests to fail, and iterate over and over until all features were completed. Overall, the demo had over 5,000 lines of code, was fully documented, and as artifacts we also had documents with the full specification of the project: a design doc, an architecture doc, a gap analysis doc, etc. A minimal sketch of this test-first loop follows this list.
- Unfortunately, as of today, I still don't have a way to connect Roo Code to the Chrome browser and automatically feed back JavaScript errors. This would greatly improve the turnaround time.
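Here is that sketch: a hypothetical reconstruction of the test-first pattern, not the actual project code. The function name `score_guess` and the color strings are assumed for illustration; in the real workflow the agent wrote tests like these first, watched them fail, and then iterated on the implementation until they passed.

```python
# Hypothetical test-first example (names and conventions are illustrative,
# not taken from the actual Wordle Solver project).
from collections import Counter


def test_score_guess_marks_present_letters_as_yellow():
    # 'r' and 'e' appear in the answer but in the wrong positions.
    assert score_guess("crane", "ruler") == ["gray", "yellow", "gray", "gray", "yellow"]


def test_score_guess_handles_repeated_letters():
    # The single 'e' in the answer is consumed by the exact match at the end,
    # so the earlier 'e's in the guess must be gray.
    assert score_guess("geese", "those") == ["gray", "gray", "gray", "green", "green"]


def score_guess(guess: str, answer: str) -> list[str]:
    """Per-letter feedback ('green' / 'yellow' / 'gray'); the kind of
    implementation the agent iterates toward after the tests above fail."""
    result = ["gray"] * len(guess)
    # Letters of the answer that are not already matched exactly.
    remaining = Counter(a for g, a in zip(guess, answer) if g != a)
    for i, (g, a) in enumerate(zip(guess, answer)):
        if g == a:
            result[i] = "green"
        elif remaining[g] > 0:
            result[i] = "yellow"
            remaining[g] -= 1
    return result
```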
How well does it work?
Things are looking good, but clearly are not perfect.
First off, is it faster than coding something yourself? It depends. If it involves technologies you're not familiar with, then it will most certainly make it possible to get to a working demo very quickly, compared to the large overhead of first learning those technologies. However, while the first 80% of the project may take 1% of the time, the last 20% will take 99% of the time using this method.
What works well is clearly getting a prototype up and running very quickly. Agentic coding can also enable very solid designs because of the ability to implement a test-driven approach. However, the downside of this approach is cost. It is a lot more costly in terms of tokens to go this route, since everything needs to be tested, debugged, improved, etc. This also means that it takes longer to get anything done since a lot more is demanded from the agents. However, I think this may be the best way forward despite the cost, since it provides the most robust outcome. The reason for it is that typically while adding new functionality, we'd change the architecture document to reflect this, change the unit tests to reflect this, change the UI tests to reflect this, etc. When a new feature is added, the model would read these documents, and understand the overall architecture when adding a new feature. This massively improves the chances of success in a large project.
Another tremendous thing that is happening is that with the frontier models' capabilities we have today, it is possible to take a 160-page technical document written by the European Space Agency, pinpoint a specific method, and get it implemented. This, to me, is absolutely mind-blowing. I believe DeepMind recently showed that they fed the Q-learning paper to Gemini 2.5 Pro and got it to implement the algorithm and train a model to learn how to play Pong. Again, this sounds like absolute science fiction to me and I can hardly believe we're here.
Another example: for the Procedural Planet Editor, I asked Gemini 2.5 Pro with Deep Research to create a comprehensive survey of all planet-creation methods. I then fed this report back and asked it to write the code to implement them and create a demo out of it! What you see in the final demo started like this and was basically done. Of course, the last 20%, as usual, took many more hours of fine-tuning.
What are the biggest gotchas with this approach?
I'm still uncertain about the size boundary of what's possible to design using agentic coding. It's clear that it's a powerful tool capable of creating impressive demos seamlessly. However, without a strong architectural overview, there's a risk of agents independently creating many varying interaction patterns within a large application, simply due to the absence of clear, consistent guidelines.
Another significant consideration is monetary cost. For instance, developing the Wordle Solver application cost approximately $90 using the test-driven development approach, making it the most expensive demo I've created so far. Around $20 of this budget was wasted on a debugging loop caused by the agent forgetting to document a crucial detail: the UI used one color-to-letter convention while the backend followed a different one, resulting in continuously failing integration tests. I ultimately intervened to provide a hint, resolving the issue and breaking the loop.
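As a rough reconstruction of that kind of mismatch (the actual names and conventions in the project may have differed), the failing integration boiled down to something like this:

```python
# Hypothetical reconstruction of the convention mismatch (all names are illustrative).
# The backend spoke in letter codes while the UI expected color strings, so
# integration tests kept failing until the mapping was documented and applied.

BACKEND_TO_UI = {"G": "green", "Y": "yellow", "B": "gray"}  # the missing, undocumented mapping

def backend_score(guess: str, answer: str) -> list[str]:
    """Toy backend: emits 'G' / 'Y' / 'B' codes (ignores repeated-letter subtleties)."""
    return ["G" if g == a else ("Y" if g in answer else "B") for g, a in zip(guess, answer)]

def render_tiles(feedback: list[str]) -> list[str]:
    """Toy UI layer: expects 'green' / 'yellow' / 'gray' strings."""
    return [f"tile-{color}" for color in feedback]

# Without BACKEND_TO_UI, the UI receives raw 'G'/'Y'/'B' codes and every tile check fails.
raw = backend_score("crane", "caner")
assert render_tiles([BACKEND_TO_UI[c] for c in raw]) == [
    "tile-green", "tile-yellow", "tile-yellow", "tile-yellow", "tile-yellow",
]
```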
Additionally, tool use failures often compound monetary costs. Each unsuccessful attempt increases the context size, leading to repeated, unnecessary tool interactions. As the context grows, the token count provided to the model also expands, directly impacting the overall cost—unless the server employs efficient caching with aggressive pricing. Currently, a significant portion of a project's budget can be attributed to these failed tool uses and unsuccessful diffs. Ideally, these expenses will decrease as model providers begin incorporating improved diff-handling capabilities directly within their models.
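To see why these failures compound, here is a back-of-the-envelope sketch; the price and token counts are made-up placeholders rather than actual API pricing, and real agents also pay for output tokens and benefit from caching.

```python
# Back-of-the-envelope illustration of how failed tool calls compound cost.
# All numbers are placeholders, not real API pricing.

PRICE_PER_MILLION_INPUT_TOKENS = 1.25    # assumed, for illustration only
BASE_CONTEXT = 50_000                    # tokens already in context (files, docs, history)
TOKENS_ADDED_PER_FAILED_ATTEMPT = 4_000  # failed diff + error message echoed back

def cost_of_attempts(n_attempts: int) -> float:
    """Each retry resends the whole (growing) context, so cost grows quadratically."""
    total_input_tokens = 0
    context = BASE_CONTEXT
    for _ in range(n_attempts):
        total_input_tokens += context                  # whole context sent again
        context += TOKENS_ADDED_PER_FAILED_ATTEMPT     # failed attempt appended
    return total_input_tokens / 1_000_000 * PRICE_PER_MILLION_INPUT_TOKENS

for n in (1, 5, 10, 20):
    print(f"{n:2d} attempts -> ~${cost_of_attempts(n):.2f} in input tokens (no caching)")
```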
Context is king. I strongly suspect that the main reason these demos function effectively is due to the extensive context accumulated during debugging sessions. Shorter context-length models have consistently shown poorer results when developing complex demos.
User supervision remains essential. For instance, I've seen reports on the RooCode subreddit where users encountered situations with both Gemini 2.5 Pro and Claude 3.7 entering costly edit loops, sometimes amounting to hundreds of dollars, because users passively allowed the agents to auto-approve every edit request.
An experiment attempting to fully automate the Roo Code agent highlighted challenges with both efficiency and behavior boundaries. Using a Playwright MCP server granted the agent autonomy to modify the page and invoke tools like `screenshot`, `read_console`, or `get_browser_console_logs`. However, this hands-off approach paradoxically introduced overhead, likely resulting in far more back-and-forth interactions than a human-guided process would require. Furthermore, during this same automated session, the agent's behavior underscored the unclear line between productive work and ineffective loops. When tasked with creating a subtle animation for this page, it entered a costly cycle of failed attempts, ultimately producing no visible result while spending around $5. I only let this continue out of curiosity about self-correction, but it demonstrates how, without oversight, such agents can pursue unproductive paths, necessitating user intervention in budget-sensitive environments.
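For context, the loop being automated looks roughly like the sketch below. This is not the Playwright MCP API; every function is a hypothetical stub standing in for the real tool calls, shown only to illustrate how console errors get fed back into the model until the page is clean.

```python
# Illustrative sketch of the automated browser-feedback loop
# (NOT the Playwright MCP API; every function here is a hypothetical stub).

def apply_edit(files: dict[str, str], suggestion: str) -> dict[str, str]:
    """Stub: in reality the agent edits project files; here we just record the change."""
    return {**files, "main.js": suggestion}

def get_browser_console_logs(files: dict[str, str]) -> list[str]:
    """Stub: in reality this would reload the page and collect console errors."""
    return [] if "fixed" in files.get("main.js", "") else ["TypeError: foo is undefined"]

def ask_model(errors: list[str]) -> str:
    """Stub: in reality this is a model call with the errors appended to the context."""
    return "fixed version of main.js"

def agent_loop(max_iterations: int = 5) -> dict[str, str]:
    files = {"main.js": "initial version"}
    for i in range(max_iterations):
        errors = get_browser_console_logs(files)
        if not errors:
            print(f"clean console after {i} iteration(s)")
            break
        print(f"iteration {i}: feeding {len(errors)} error(s) back to the model")
        files = apply_edit(files, ask_model(errors))
    return files

agent_loop()
```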
The Future
Looking ahead, one thing feels increasingly certain: AI is becoming an integral part of modern software development. Whether you're writing tests, refactoring legacy code, adding new features, or even architecting entire systems, large language models (LLMs) are now playing a central role in these tasks.
At this moment, agentic coding shines as a way to create compelling demos and to dive into unfamiliar technologies with unprecedented ease. But despite the impressive progress, it's not yet clear whether these tools can reliably handle the complexity of building and maintaining truly large-scale applications. Current limitations around coordination, architectural consistency, and context management still pose significant challenges.
That said, the pace of advancement is staggering. By the time you read this, the capabilities of these tools may have taken another leap forward. It's entirely plausible that using an AI assistant throughout the software development lifecycle—from inception to deployment—will be not just common, but expected. And while agentic workflows may not yet replace skilled engineers, they are rapidly transforming what a single person can build.
If you'd like to post your thoughts you can use this thread.
Want to see my collection of AI-coded projects? It's here.