Getting an LLM to write a small, nontrivial program from scratch without human interaction
For only around $5
Lately I've found myself spending more and more time as an errand boy, hauling information back and forth between the compiler and the LLM. In a recent bout of overwhelming laziness, I decided to see if I could find a way to automate this too and, as Benny Hill was fond of saying, cut out the middle man. Somewhat surprisingly, I found it is already possible to a large degree with current LLM technology, even with a "small" (affordable) model like o3-mini, provided sufficiently extreme automated micromanagement is applied. This contrasts with more free-form agent-based approaches, which currently struggle to stay on track for a whole project, especially with "mini" models.
The program in question (the first test case I picked for this approach) is a market data recorder: an application that connects to an exchange's live data stream (in this case Binance websockets) and records the data to parquet files, for entirely non-nefarious purposes. While not particularly difficult, it nevertheless requires the LLM coding it to correctly utilise multiple external libraries (gorilla websockets and parquet), and to write code that correctly interfaces with an external API: something that would take a human at the very least a few hours to write and debug (and much more when doing the same thing for many exchanges). With the relatively simple framework I developed for typed workflows (available here), which uses the "LLM in a loop" approach to force the model to bash its head against the compiler and unit tests (which it's compelled to write), I managed to get o3-mini to write a largely correct* implementation (available here) for a cost of only around $5.10 (1,824,376 tokens in, 826,883 tokens out), significantly cheaper than the human labour needed to produce equivalent code. It took around 2 hours and 40 minutes, quicker than a human (especially given the number of unit tests it wrote), with around 2.6 minutes spent on compilation and 2.5 on running unit tests, 95 compilation failures, 34 test failures, and 23 incorrect JSON responses from the LLM.
The key components of this approach are as follows:
1. Embedding the LLM in the host programming language as a domain-specific language (DSL), so that it can be integrated into control flow and return structured values (via JSON, which are checked and automatically converted to values in the host language).
2. Smart context management in the host language, showing the LLM only what's necessary for the task at hand, which not only reduces costs but also reduces the inaccuracy that results from larger context lengths.
3. Breaking the project up into smaller chunks (a workflow) managed by the host language's control flow.
4. Automatic validation of code quality via compilation and unit test checks, refusing to allow the LLM to progress to the next task until all code compiles and tests pass (miraculously, it never decided to just "delete the failing unit test", which is more than can be said of some human developers).
5. Providing explicit line numbers to the LLM as comments at the start of lines, to facilitate easier editing (I don't know if other tools do this, but if not they really should, as having an LLM manually count lines is a huge waste of processing power/attention, much as it would be for a human).
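As an example of the last point, such a line-numbering pass could be as simple as the following sketch (hypothetical code, not the library's actual implementation; it assumes the target file accepts Go-style // comments):

import qualified Data.Text as T

-- Prefix every line with a 1-based line-number comment, so the LLM can
-- reference edit locations directly instead of counting lines itself.
numberLines :: T.Text -> T.Text
numberLines = T.unlines . zipWith tag [1 :: Int ..] . T.lines
  where
    tag n line = T.pack ("// " ++ show n ++ " ") <> line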
To illustrate the above, the program proceeded roughly as follows. First, have the LLM design an architecture for the project given the spec (just returned as text). Next, have it plan the source files for the project and their dependencies, returning a list of (sourceFile, [dependency]) pairs. This is checked to ensure there are no circular dependencies, and the LLM is forced to retry if it introduced any. Explicitly listing the dependencies allows us to provide only the relevant ones in the initial context when creating a file.
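As a rough illustration of that circularity check (a sketch, not the library's actual code), a simple depth-first search over the planned pairs suffices:

import qualified Data.Map.Strict as M
import qualified Data.Set as S

-- Returns True if the planned (sourceFile, [dependency]) pairs contain a
-- cycle among the planned files; dependencies outside the plan (external
-- libraries) are ignored. A naive DFS is fine for a handful of files.
hasCycle :: [(FilePath, [FilePath])] -> Bool
hasCycle plan = any (go S.empty) (M.keys deps)
  where
    deps = M.fromList plan
    go stack file
      | file `S.member` stack = True
      | otherwise =
          any (go (S.insert file stack)) (M.findWithDefault [] file deps)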
After that, we iterate over all the planned files, and for each file, first get it to write the file (returning a description), and ensure it compiles. Next, the LLM is asked to plan a list of unit tests, and we iterate over them one-by-one, getting the LLM to write them and not allowing it to finish until everything compiles and the tests pass. At each new task, we refresh the context to just what is initially relevant to the task at hand. After this is done, the project is largely complete, requiring only human checking to ensure it actually works as expected and that the unit tests cover everything important.
To assist in this we provide the LLM with multiple "Tools", specified in the context, which it may call with a specific JSON format. We provide a list of AvailableFiles (all relevant files in the project, including documentation on the APIs/libraries used), each with an optional description that the LLM adds upon file creation. It can Open these files, which adds them to the context, and Close them if already open, which removes them from the context. For files in the context we allow it to edit them (based on line numbers), insert into a file at a specific line number, and append to a file (appending can also be used to create a new file). Compiling and running tests is done automatically after any source file modification, and in the case of failure the error message is returned to the LLM. We also provide a "Panic" tool, for when the LLM deems the task completely unsolvable given the resources at hand, which proved very useful in the initial development of the framework.
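In Haskell terms, one can picture the tool vocabulary as a small sum type along these lines (an approximation for illustration only; the real definitions live in Tools.hs and differ in detail):

import Data.Text (Text)

-- Illustrative sketch of the tools the LLM can invoke via JSON.
data Tool
  = OpenFile FilePath               -- add a file to the context
  | CloseFile FilePath              -- remove an already-open file from the context
  | EditFile FilePath Int Int Text  -- replace the given line range with new text
  | InsertInFile FilePath Int Text  -- insert text at a specific line number
  | AppendToFile FilePath Text      -- append to a file (creates it if absent)
  | Panic Text                      -- declare the task unsolvable, with a reason
  deriving (Show)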
For context management, we limit the context sent to the model to the last ten messages, and strategically truncate messages. In particular, we truncate all but the last compile/test failure message (so it only sees the relevant one), and all but the last message from the LLM (so if a tool operation fails due to e.g. a syntax error in the JSON, it can see what it wanted to do and try again with correct syntax, but the context isn't bloated with all of its previous file operations; instead it just sees the current file state via the OpenFiles in the context).
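A sketch of that truncation policy, with hypothetical message types standing in for the library's own:

import qualified Data.Text as T

data Role = User | Assistant deriving (Eq, Show)
data Kind = CompileFail | TestFail | OtherKind deriving (Eq, Show)
data Msg = Msg { role :: Role, kind :: Kind, body :: T.Text } deriving (Show)

trimContext :: [Msg] -> [Msg]
trimContext msgs = zipWith trim [0 :: Int ..] recent
  where
    recent = drop (length msgs - 10) msgs   -- keep only the last ten messages
    isFailure m = kind m `elem` [CompileFail, TestFail]
    isReply m = role m == Assistant
    lastIdx p = case [i | (i, m) <- zip [0 :: Int ..] recent, p m] of
      []      -> Nothing
      matches -> Just (last matches)
    -- Truncate every failure message except the most recent one, and every
    -- LLM reply except the most recent one; leave everything else intact.
    trim i m
      | isFailure m && Just i /= lastIdx isFailure = m { body = T.pack "[older failure truncated]" }
      | isReply m && Just i /= lastIdx isReply     = m { body = T.pack "[older reply truncated]" }
      | otherwise = m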
To assist it in persisting state/intention across calls, we encourage it to write to a journal.txt detailing what it's currently attempting and what it plans to do next, especially when debugging failed unit tests. Quite cleverly, the model decided to use this as the logfile for the application's logs, so that it would always see the most recent unit test logs in the journal without needing to explicitly open them. It did, however, forget to make this logfile path configurable.
We also memoise (cache to disk) the values successfully "returned" by the LLM, so that we can stop and restart the process and it will continue from where it left off. This is useful if it goes off the rails and needs help to recover, which happened a few times during framework development as a result of bugs in the framework that presented unclear/incorrect context to the LLM. It also happened when I was trying to use DeepSeek R1 (over OpenRouter, so potentially quantised): it got confused, for instance, when the closing ) of an import statement was missing and seemed unable to fix the issue, and sometimes couldn't get the Go local import syntax right.
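The idea behind the memoisation is simple; a minimal sketch (using aeson for serialisation, and not the library's actual code) could look like this:

import Data.Aeson (FromJSON, ToJSON, decodeFileStrict, encodeFile)
import System.Directory (doesFileExist)

-- If a cached result exists on disk for this path, decode and return it;
-- otherwise run the action and cache its result, so a restarted run can
-- pick up from where it left off.
memoiseToDisk :: (FromJSON a, ToJSON a) => FilePath -> IO a -> IO a
memoiseToDisk path action = do
  cached <- doesFileExist path
  mVal <- if cached then decodeFileStrict path else pure Nothing
  case mVal of
    Just val -> pure val
    Nothing -> do
      val <- action
      encodeFile path val
      pure val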
I feel that, at least in the short term, a workflow-based approach like the above is optimal for squeezing the most productivity out of LLMs, so I'm open-sourcing the small, awkward library I built in the hope it might prove useful to someone. While free-form agents may be the future, workflows are the present. And even in future, it'll likely be cheaper to use a workflow-based approach (which works with smaller LLMs) than fully autonomous agents. Such agents, if sufficiently human, might even decide to use a workflow-based approach themselves to save time, delegating to smaller LLMs. It's my hope that such workflow approaches, alternatively known as NOn-deterministic exTremely-high-level Programming Languages (NOTPLs), become widely used, such that LLMs will train on them and themselves become better at writing workflows, simplifying my work even further.
One important note: the library I wrote is in Haskell, the most convenient language for writing EDSLs. The actual usage is however relatively simple, by Haskell standards, and I welcome any attempts to tidy it up further. Ideally in future it would provide a full DSL, a simple language that wraps the underlying implementation while providing a nicer user interface, and then more and more programs could be written in/by such NOTPLs, saving a significant amount of time.
The main interface to the library is the following:
runAiFunc ::
  forall bs a b.
  (FromJSON a, ToJSON a, Show a, BS.BuildSystem bs) =>
  Context ->
  [Tools.Tool] ->
  a ->
  (a -> AppM (Either (MsgKind, Text) b)) ->
  RemainingFailureTolerance ->
  AppM b
a is a type parameter, the type of the object we want the LLM to return, which must be convertible to and from JSON. bs is a type representing the build system (essentially an interface), which allows changing the backend used for compiling and running unit tests. b is the type we postprocess the a into (we can just use a for it if we don't need postprocessing).
Context is a struct with the background text for the task, and [Tools.Tool] is a list of tools the task is allowed to use. Currently the library doesn't support externally defined tools, but new tools can easily be added to the library in Tools.hs.
We provide a value of type a to the function as an example/dummy value, to illustrate to the LLM how it should look.
RemainingFailureTolerance is just an integer representing how many syntax errors (the LLM returning syntactically incorrect JSON, etc.) are tolerated before aborting.
The most complex type is (a -> AppM (Either (MsgKind, Text) b)). This is the type of a validator: a function that takes an a as input, and returns either a Text error and error kind, or a value of type b (some post-processed version of a). The result is however wrapped in AppM, a monad, which just means the validation function has access to a state and the ability to do IO; e.g. to validate that a file really exists on disk, or that compilation succeeds and unit tests pass. The runAiFunc function will keep looping until the LLM returns a value for which the validator passes (or RemainingFailureTolerance reaches zero).
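Stripped of the library's types, the loop at the heart of runAiFunc amounts to something like the following sketch (IO stands in for AppM, a raw Text reply stands in for the parsed JSON, and the names are hypothetical):

import qualified Data.Text as T

loopUntilValid
  :: Int                              -- remaining failure tolerance
  -> (Maybe T.Text -> IO T.Text)      -- query the LLM, optionally feeding back the previous error
  -> (T.Text -> IO (Either T.Text b)) -- validate the reply: parse JSON, compile, run tests, ...
  -> IO (Either T.Text b)
loopUntilValid tolerance ask validate = go tolerance Nothing
  where
    go 0 _ = pure (Left (T.pack "failure tolerance exhausted"))
    go n prevErr = do
      reply <- ask prevErr              -- any previous error is included in the new context
      result <- validate reply
      case result of
        Right val -> pure (Right val)   -- validator passed: return the structured value
        Left err  -> go (n - 1) (Just err)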
Other LLMs:
I also attempted to use DeepSeek, but came across a few issues. The biggest is availability: the official DeepSeek provider is under heavy load, so I had to use OpenRouter instead, which dynamically routes to many providers, who may use different levels of quantisation. Overall DeepSeek seemed to do worse at following instructions and at fixing code; in particular, it kept failing to notice when a closing bracket ) was missing in the import list. I'm not sure, however, whether this is just the result of quantisation, as OpenRouter doesn't seem to require DeepSeek providers to list what degree of quantisation they're using. DeepSeek over OpenRouter was generally also significantly more expensive; there are a couple of larger providers, Fireworks and Together.ai, which tended to be more stable/reliable/fast, but charged around 4x more than the official DeepSeek API (and than OpenAI o3-mini-high). There were also some smaller providers with more competitive prices, but they tended to get overloaded very frequently and were rarely available (often returning empty strings as a response).
I tried using Gemini Flash 2.0, but it seemed to struggle too much with returning code as properly escaped JSON, which is forgivable given it's 10x cheaper than o3-mini and not a thinking model. I tried Gemini Flash Thinking too, but the only version available on OpenRouter was the free version, which was heavily throttled to the point of being unusable. Since even o3-mini sometimes gets the JSON escaping wrong, I'm going to look at changing the tool call syntax so the model can use raw text blocks for code, which would remove the need for it to correctly JSON-format/escape the code and hence hopefully significantly decrease the rate of syntax errors.
Other Languages:
I also tried using C++ as a backend, but came across an issue where the LLM would occasionally miss/delete a closing angle bracket, causing the compiler to vomit up so many errors that the context got completely overwhelmed and the LLM couldn't fix it. Given that C++ also took significantly longer to compile and isn't necessary for this project, I decided to go with Go instead. Go seems particularly well suited to LLM usage because of the simplicity of its design and the large, high-quality standard library (meaning the LLM needs to be familiar with fewer external libraries to get a task done).
Footnote:
*I said largely correct because I identified three bugs in the final result: two were technically my fault for not specifying things more clearly, and one is a mistake I'd also have made myself. It didn't explicitly set the parquet page size, row group size and compression type, which is best practice but not something I directly asked of it. For merging snapshots with orderbook updates, it didn't explicitly request a snapshot upon startup, instead waiting for the regular one-snapshot-per-minute timer (technically I didn't explicitly ask it to request an initial snapshot, but it's a sensible thing to do). And most significantly, it had a bug due to this issue. It tried to read from the gorilla websocket with a timeout, so that it could regularly check for a cancellation request from the context, but rather unintuitively the library trashes the connection after a timeout, so after the first timeout all further reads fail. This wasn't caught in the unit tests because the tests only checked whether it could receive a single correct JSON object from the websocket, and so succeeded before the connection timed out. I'd argue that the original developers of gorilla/websocket (who've now abandoned it) deserve more blame than the LLM here, for not providing a non-destructive way to read with a timeout, but fortunately some brave soul has taken over maintainership of the library, so maybe we'll see a fix sometime in future.