Building a 100% LLM-written, standards-compliant HTTP 2.0 server from scratch with Gemini 2.5 Pro
To test the typed LLM workflow library I've been working on, Promptyped, I decided to see if I could get it to build an HTTP 2.0 server from scratch. This is a particularly good task for LLMs, as a clear, robust specification already exists, as do third-party clients and spec-compliance validation tools. That is, the hardest part, deciding exactly what the application should do, has already been done, and all that remains is to actually code it. I chose Go as the implementation language, because it's often described as simple (less rope for the LLM to hang itself with), and it compiles quickly.
It took around 59 hours of API time and $350 of Gemini 2.5 Pro credits to produce the initial implementation, which passed 117/145 (80.7%) of the h2spec HTTP 2.0 compliance tests. The majority of failures were due to the server failing to return an error response when it should, as these cases weren't captured by the integration tests, since the HTTP clients used for testing don't send unusual or invalid inputs to the server. A series of refactoring runs, in which the LLM was shown the h2spec failures and then asked to make and follow a plan to fix them, brought it up to 128/145, then 137/145, then 142/145, then 144/145, and at last 145/145. Finally, after a further 60 API hours and $281 of refactoring, Llmahttap (pronounced lmah-tahp) was born! Weighing in at around 15k lines of source code and 32k lines of unit tests, Llmahttap took in total around 119 hours of API time and $631 of API credits to build, from start to finish. In wall-clock terms it took around two weeks; it would have been around half that if not for the initial Gemini 2.5 Pro rate limit of just 1k requests per day.
I wouldn't recommend anyone use it in production, as there are probably plenty of security issues (the TLS support is poorly tested, since h2spec requires a proper certificate for TLS testing and I don't have one lying around), and it completely lacks HTTP 1.1 support. But the code is interesting to read from the perspective of seeing what a 100% Gemini-built application looks like. It's also a demonstration of the power of structured LLM workflows to achieve results that are difficult (or at least more expensive) to get with a purely free-form agentic approach.
Note that while 100% of the application code is AI-written, only around 99.9% of the unit test code is. A couple of times the model set the component it was testing to log to null and hence made very slow progress on getting the unit tests it wrote to pass, so I manually modified those tests to log to stdout instead, so the log output would be properly visible in the `go test` failure output. And when some unit test files got too large I split them up into multiple smaller files to save time and money, as the model doesn't need five thousand lines of unit tests in context. I also git-reverted to a recent checkpoint a few times when it looked like the model was getting badly off-track, and when it hit a syntax error it was pathologically unable to fix (misplaced braces in a long slice-of-structs literal). Also note that the 100% refers to the HTTP 2.0 spec implementation; for HPACK (a separate spec from the main HTTP 2.0 spec) the Go stdlib is used, and similarly for TLS.
The code was written by a program in a higher-order NOn-DeTerministic Programming Language (NOTPL) called Promptyped, which is a DSL embedded in Haskell. The NOTPL program's rough structure was as follows:
First, with the whole spec in context, generate a list of files that must be created. Alongside each filename, include a high-level description and list that file's dependencies (both project files and spec subsection files), then topologically sort the result so that each file is listed after the files it depends on.
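To make the ordering step concrete, here's a minimal sketch in Go (the server's language; Promptyped itself is Haskell, so this is illustrative only) of topologically sorting a planned file list by its dependencies. The file names and the `plan` structure are invented for the example.

```go
package main

import "fmt"

// topoSort orders the planned files so each appears after the files it
// depends on (Kahn's algorithm). Dependencies that aren't themselves
// planned files (e.g. spec subsections) are kept as context elsewhere
// but ignored for ordering purposes.
func topoSort(deps map[string][]string) ([]string, error) {
	indeg := map[string]int{}
	dependents := map[string][]string{}
	for f := range deps {
		indeg[f] = 0
	}
	for f, ds := range deps {
		for _, d := range ds {
			if _, planned := deps[d]; !planned {
				continue
			}
			indeg[f]++
			dependents[d] = append(dependents[d], f)
		}
	}
	var queue, order []string
	for f, n := range indeg {
		if n == 0 {
			queue = append(queue, f)
		}
	}
	for len(queue) > 0 {
		f := queue[0]
		queue = queue[1:]
		order = append(order, f)
		for _, g := range dependents[f] {
			if indeg[g]--; indeg[g] == 0 {
				queue = append(queue, g)
			}
		}
	}
	if len(order) != len(deps) {
		return nil, fmt.Errorf("dependency cycle among planned files")
	}
	return order, nil
}

func main() {
	plan := map[string][]string{ // hypothetical file plan
		"frame.go":      {},
		"config.go":     {},
		"connection.go": {"frame.go"},
		"server.go":     {"connection.go", "config.go"},
	}
	order, err := topoSort(plan)
	fmt.Println(order, err) // one valid order, e.g. [frame.go config.go connection.go server.go]
}
```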
Then, for each file to be created, generate a detailed list of tasks to be completed, keeping the relevant dependencies for that file in context.
Then, for each task, put the LLM in a code-compile-test loop until the change is complete (as verified by a separate LLM call), the project builds successfully and all unit tests pass.
The refactoring NOTPL's structure was similar, except instead of generating a list of files to create, it generated a list of files that might need changes, then for each such file generated a list of actual changes needed. The Promptyped logs from development are included compressed in the Llmahttap repo, for anyone who’s interested in the exact prompts and responses.
I did encounter one significant issue in the initial NOTPL program. The topological dependency sort didn't place the unit test files directly after the files they tested; instead, many ended up right at the end of the list, and the integration test was placed earlier (it's not ideal to start integration testing when there aren't even any unit tests yet). Note that, as described in my previous post, the framework memoises/caches the results of all tasks on disk, and restarting the program will automatically continue from where it left off. So to address the issue, I manually modified the cached JSON task list on disk to re-order the unit tests so that each was listed right after the files it tested, then restarted. There was also a smaller issue: the model created staticserver.go as part of general end-to-end testing, but then later, in the task to create a static file server, it made a separate static_file_server.go with duplicate functionality. This happened because the task that created staticserver.go couldn't see the overall work plan, so it didn't know what the upcoming static file server file would be called. The fix was to run a separate refactor task afterwards to merge them (and to update the NOTPL program to also give the overall plan as context to individual file tasks).
To achieve this task the Promptyped framework required some significant upgrades in functionality (relative to the version described in the previous post). The key improvements made, in roughly descending order of importance, were as follows.
Having a separate LLM check to verify that each task was actually done. This makes it much harder for the LLM to falsely report completion for an incomplete task.
Rejecting diffs that would lead to syntactically incorrect code (i.e. code that `go fmt` fails to parse). I found Gemini (even 2.5 Pro) would sometimes be completely unable to fix compilation errors caused by imbalanced braces in a large context (e.g. when an off-by-one error in a diff overwrites a function's closing brace, or adds an extra one). Rejecting such diffs avoided this issue, as the LLM never sees the malformed code. Interestingly, in my testing OpenAI models didn't seem to get stuck in the same situation and didn't need this functionality, but I used Gemini 2.5 Pro for the project due to its better long-context handling and significantly cheaper API calls compared to o3. The model is also sent a message showing what the rejected code would have looked like post-diff, to help it see how the diff is wrong.
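As a rough illustration of that syntax gate (the framework itself shells out to `go fmt`; using Go's `go/format` package here is my own stand-in):

```go
package main

import (
	"fmt"
	"go/format"
)

// rejectIfUnparseable returns an error when the post-diff file contents
// don't parse as Go, in which case the diff is discarded and the model
// is shown what the file would have looked like, rather than the bad
// diff ever being applied.
func rejectIfUnparseable(candidate []byte) error {
	_, err := format.Source(candidate) // fails on syntax errors such as unbalanced braces
	return err
}

func main() {
	// An off-by-one diff has duplicated the closing brace of main.
	candidate := []byte("package main\n\nfunc main() {\n\tprintln(\"hello\")\n}\n}\n")
	if err := rejectIfUnparseable(candidate); err != nil {
		fmt.Println("diff rejected:", err)
	}
}
```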
Clearing the message history after 5 failed attempts to produce syntactically correct tool calls. This prevents the LLM from getting stuck continuously trying small variations of the same syntactically invalid approach.
Checking for progress after every 5 diffs that result in failed compilation, and if there's no progress (the same compiler errors persist), running a separate query with a smaller context to suggest a solution. This helps avoid situations where the LLM gets stuck on a stupid hallucination (e.g. it thinks a function is named fooFunc instead of FooFunc, even though the context contains the code showing it is indeed named FooFunc).
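A sketch of what such a progress check might look like, assuming it compares the distinct compiler-error lines of the previous and current failed builds (the real heuristic may differ):

```go
package main

import (
	"fmt"
	"strings"
)

// errorSet collects the distinct "file.go:line:col: message" lines from
// compiler output, so two failed builds can be compared.
func errorSet(output string) map[string]bool {
	set := map[string]bool{}
	for _, line := range strings.Split(output, "\n") {
		line = strings.TrimSpace(line)
		if strings.Contains(line, ".go:") {
			set[line] = true
		}
	}
	return set
}

// noProgress reports whether the latest build failed with exactly the
// same errors as the previous one; in this sketch that's the trigger
// for escalating to a separate, smaller-context "suggest a fix" query.
func noProgress(prev, cur string) bool {
	a, b := errorSet(prev), errorSet(cur)
	if len(a) != len(b) {
		return false
	}
	for line := range a {
		if !b[line] {
			return false
		}
	}
	return true
}

func main() {
	out := "./handler.go:42:10: undefined: fooFunc"
	fmt.Println(noProgress(out, out)) // true: same error again, no progress
}
```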
When the `go test` failure output is longer than e.g. 600 lines, parsing it to extract the tests that actually failed, then re-running just one of them and giving the LLM the truncated output of that run (the first and last 400 lines). This makes things simpler for the LLM by including only the logs from the relevant failed unit test in context, and saves time and money by reducing the overall context size. The names of all the failed unit tests are also passed to the LLM alongside this, so it doesn't incorrectly assume a failing test is passing.
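A sketch of the idea, relying on the real `--- FAIL:` lines and `-run` flag of `go test`, with the orchestration simplified and the test names invented:

```go
package main

import (
	"fmt"
	"os/exec"
	"regexp"
)

// failedTests pulls the names of failing (sub)tests out of `go test`
// output by matching its "--- FAIL: TestName" lines.
func failedTests(output string) []string {
	re := regexp.MustCompile(`(?m)^\s*--- FAIL: (\S+)`)
	var names []string
	for _, m := range re.FindAllStringSubmatch(output, -1) {
		names = append(names, m[1])
	}
	return names
}

// rerunSingle re-runs one failing test in isolation, so that only its
// logs (further truncated to the first and last few hundred lines) need
// to be shown to the LLM, alongside the full list of failing test names.
func rerunSingle(pkg, name string) (string, error) {
	out, err := exec.Command("go", "test", pkg, "-run", "^"+regexp.QuoteMeta(name)+"$").CombinedOutput()
	return string(out), err
}

func main() {
	sample := "--- FAIL: TestHeadersFrame (0.01s)\n--- FAIL: TestSettingsAck (0.00s)\nFAIL"
	fmt.Println(failedTests(sample)) // [TestHeadersFrame TestSettingsAck]
	_ = rerunSingle                  // invoked by the workflow when the output is too long
}
```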
Each time the LLM calls a tool, requiring it to also include a summary of what it's doing, why it's doing it, and what it plans to do next. These summaries are put into an event list alongside other events like compilation successes and unit test failures, and the model is shown the last 70 or so events. This helps the model keep a consistent train of thought/action, and also reduces the chance of it getting stuck in loops (e.g. fixing test A in a way that breaks test B, then fixing test B in a way that breaks test A, and continuing this way indefinitely). It also makes it easier for a human to see the approach the model is taking to the current task without needing to read the code diffs.
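For illustration, an event-list entry might look something like the following; the field names and kinds are my guesses, not Promptyped's actual types:

```go
package events

// Entry is a hypothetical sketch of one item in the rolling event list
// (the model sees roughly the last 70 of these). Tool calls must carry
// the what/why/next summary; compilation and test outcomes show up as
// events alongside them.
type Entry struct {
	Kind     string // e.g. "tool_call", "compile_failed", "tests_passed" (illustrative values)
	Tool     string // which tool was invoked, if any
	What     string // what the model says it is doing
	Why      string // why it is doing it
	NextPlan string // what it plans to do afterwards
	Detail   string // e.g. truncated compiler or test output
}
```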
Keeping all relevant source code in context, minified/"unfocused". I added support for focusing/unfocusing to the framework, such that an unfocused source file shows only type definitions, top-level comments and function headers. This significantly saves space in the context, while still giving the LLM the information it needs to use those types and functions in the files it's currently working on. I automatically focus the most recently modified files, and also give the LLM itself a tool to focus/unfocus files, with the maximum number of focused files limited to 4 (at which point, focusing a new file will unfocus the least recently modified one).
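For Go, an unfocused view can be produced by parsing the file and dropping function bodies. The sketch below is my own illustration of the idea (Promptyped's implementation is in Haskell), keeping type declarations, doc comments and function signatures; the file name in main is hypothetical.

```go
package main

import (
	"go/ast"
	"go/parser"
	"go/printer"
	"go/token"
	"io"
	"os"
)

// unfocus writes a minified view of a Go source file: type and constant
// declarations, top-level comments and function signatures survive, while
// function bodies are dropped. Comments that lived inside dropped bodies
// are removed too, so the printer doesn't scatter them.
func unfocus(path string, out io.Writer) error {
	fset := token.NewFileSet()
	file, err := parser.ParseFile(fset, path, nil, parser.ParseComments)
	if err != nil {
		return err
	}
	type span struct{ from, to token.Pos }
	var dropped []span
	for _, decl := range file.Decls {
		if fn, ok := decl.(*ast.FuncDecl); ok && fn.Body != nil {
			dropped = append(dropped, span{fn.Body.Pos(), fn.Body.End()})
			fn.Body = nil // keep only the signature
		}
	}
	var kept []*ast.CommentGroup
	for _, cg := range file.Comments {
		inside := false
		for _, s := range dropped {
			if cg.Pos() >= s.from && cg.End() <= s.to {
				inside = true
				break
			}
		}
		if !inside {
			kept = append(kept, cg)
		}
	}
	file.Comments = kept
	return printer.Fprint(out, fset, file)
}

func main() {
	if err := unfocus("connection.go", os.Stdout); err != nil { // hypothetical file name
		panic(err)
	}
}
```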
Some other changes that were not strictly necessary, but useful for reducing costs and saving time:
Using editing by regex instead of by line number. I found all LLMs often make mistakes with line numbers, particularly off-by-one errors, even if I include regular line-number comments in the code shown to them. A more reliable approach seems to be to ask the LLM to provide, instead of line numbers, a regex to match the first line and a regex to match the last line, along with the nearest line numbers to each (so that if there are multiple potential matches, the right one is picked, and potential matches too far from that line are rejected). When the LLM gets this wrong, it generally provides a regex that matches nothing at all, so the edit does nothing, which is far less destructive than the common failure mode of line-based editing, where an off-by-one error leads to e.g. a closing brace being accidentally deleted or duplicated.
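A sketch of how such an edit range might be resolved; the 20-line tolerance and the exact behaviour are assumptions on my part:

```go
package main

import (
	"fmt"
	"regexp"
)

// nearestMatch returns the 1-based line number of the regex match whose
// line is closest to the LLM's hint, rejecting matches further away
// than tolerance.
func nearestMatch(lines []string, pattern string, hint, tolerance int) (int, error) {
	re, err := regexp.Compile(pattern)
	if err != nil {
		return 0, err
	}
	best, bestDist := -1, tolerance+1
	for i, line := range lines {
		if !re.MatchString(line) {
			continue
		}
		dist := i + 1 - hint
		if dist < 0 {
			dist = -dist
		}
		if dist < bestDist {
			best, bestDist = i+1, dist
		}
	}
	if best == -1 {
		return 0, fmt.Errorf("no match for %q within %d lines of line %d", pattern, tolerance, hint)
	}
	return best, nil
}

// editRange resolves the [start, end] line range for a modification from
// the two regexes and nearest-line numbers the LLM provided. A failed
// lookup simply makes the edit a no-op, which is far less destructive
// than an off-by-one line edit.
func editRange(lines []string, startRe, endRe string, startHint, endHint int) (int, int, error) {
	const tolerance = 20 // assumed value
	start, err := nearestMatch(lines, startRe, startHint, tolerance)
	if err != nil {
		return 0, 0, err
	}
	end, err := nearestMatch(lines, endRe, endHint, tolerance)
	if err != nil {
		return 0, 0, err
	}
	if end < start {
		return 0, 0, fmt.Errorf("end line %d is before start line %d", end, start)
	}
	return start, end, nil
}

func main() {
	lines := []string{"func parseFrame() {", "\t// ...", "}", "", "func writeFrame() {", "\t// ...", "}"}
	s, e, err := editRange(lines, `^func writeFrame`, `^\}`, 5, 7)
	fmt.Println(s, e, err) // 5 7 <nil>
}
```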
Being very liberal in the tool call syntax accepted from Gemini. It often ignores the syntax given in the prompt and goes with its own, and isn't always consistent about which syntax it uses, so being liberal avoids wasting time and money re-querying it just to make it change the tool syntax. I deliberately avoid using the explicit tool-calling functionality offered by LLM providers, as I find it cleaner to treat the LLM as just a function from Text to Text, and this approach generalises better across providers.
On tool syntax errors, running a separate query with a small context to attempt to correct the syntax. Although it doesn't always work, when it does it avoids unnecessarily re-running big slow queries just because of easily fixable syntax errors.
Using raw-text literal syntax for code rather than JSON. For tools like append-to-file and modify-file, instead of requiring the LLM to include the new text/code as an escaped string in the JSON object, a textBoxName field is used. The model then separately provides a textbox with that name, containing the text in C++ raw string literal syntax. This avoids the need for the model to escape special characters in the JSON, reducing the number of errors it makes.
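As an illustration of why this helps, here's a sketch of extracting the body of a C++-style raw string literal (`R"tag(...)tag"`); the surrounding textbox framing that Promptyped actually uses is omitted, so treat the details as illustrative:

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// rawLiteralBody extracts the contents of a C++-style raw string literal,
// R"tag( ... )tag", so the model can supply code verbatim with no JSON
// escaping. Only the literal itself is parsed here, to show why this form
// sidesteps escaping errors.
func rawLiteralBody(s string) (string, error) {
	start := strings.Index(s, `R"`)
	if start < 0 {
		return "", errors.New(`no R"..." literal found`)
	}
	rest := s[start+2:]
	open := strings.Index(rest, "(")
	if open < 0 {
		return "", errors.New("malformed literal: missing '('")
	}
	tag, body := rest[:open], rest[open+1:]
	closing := ")" + tag + `"`
	end := strings.Index(body, closing)
	if end < 0 {
		return "", fmt.Errorf("missing closing %s", closing)
	}
	return body[:end], nil
}

func main() {
	textbox := `R"code(fmt.Println("quotes, \n and { } need no escaping here"))code"`
	body, err := rawLiteralBody(textbox)
	fmt.Println(body, err)
}
```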
In my view, the success of this task demonstrates that LLMs are already capable of generating non-trivial applications from scratch, given a sufficiently detailed specification and an external means of correctness testing. The human's challenge is then writing a proper specification, and the hardest part: deciding what to build. Maybe structuring the architecture also requires human input; to my eyes, Gemini generally tends to architect code worse than a human does, at least without detailed prompting, although perhaps more expensive LLMs like o3 or Claude Opus can do better.
In the long run, I suspect people will move away from the free-form agent approaches being heavily promoted by LLM providers (who have a huge financial incentive to make you put as much in the context window as possible). Structured workflow/NOTPL approaches with a carefully managed context have the potential to be significantly more efficient in terms of both cost and time (since both are directly proportional to context size), and economic law dictates that the firms with the most cost-effective approach (the lowest cost of production) will eventually win. Even if we had human-level agents capable of coding a task from start to finish entirely unsupervised, it'd likely still be more cost-efficient for them to delegate the coding to a cheaper, dumber model following a structured workflow than to code it all themselves.
The next thing I plan to implement is more granular focusing: the ability to focus specific individual functions, so the LLM can e.g. open a 5k-line unit test file without needing to keep the whole file in context; only the function bodies of the tests it's currently working on are shown in full. I also want a clean way to run and manage multiple tasks simultaneously. And I plan to add support for Haskell as a project language, so people unfamiliar with Haskell can use Promptyped itself to add new custom workflows to Promptyped.