Academic Arbitrage in the LLM Era

LLMs have unquestionably unlocked amazing capabilities in a broad range of fields, including my own area of research (which broadly spans program analysis, fuzzing, and decompilation).

However, this breakneck pace of development is creating a dangerous gap between the perceived SOTA (in academia) and what LLMs are actually capable of with very minimal prompt/context engineering.

As such, there is ripe potential for academic arbitrage: researchers incorporate a powerful frontier LLM (which not-so-secretly already achieves SOTA performance) as a component of a traditional system, and show that their new fusion system achieves SOTA performance.


In an ideal scientific world, these fusion systems bring something new to the table, answer some important questions, or enable future research.

Unfortunately, academic incentives often push researchers in unhelpful directions. There is pressure to build LLM-in-a-box systems that lean heavily on frontier LLMs yet, paradoxically, fail to take advantage of concurrent advances in their capabilities.

This post is a brief introspection on the phenomenon and some thoughts on how we might address it.

Disclaimer: I’m not picking on any particular researcher or paper in this post. It’s rather a commentary on a phenomenon I’ve observed broadly, affecting even my own work.


1. The Race to Publish

In academia, being first matters: early publications attract more citations and face less skepticism during review.

Within this incentive structure, “academic arbitrage” emerges. The moment a frontier LLM can outperform prior SOTA methods on a task, even if only marginally, you can publish the inevitable “Doing X with LLMs” paper.

But waiting until a frontier LLM can easily do X is far too slow. To be first, you must publish as soon as the model is just barely capable of doing X. That usually requires some substantial hand-holding to compensate for the model’s immaturity.

Examples (a hypothetical sketch follows the list):

  • We don’t trust model-x to produce format-z directly, so we bound the type of output it generates and automatically transform it into format-z.
  • We observe that model-x gets confused by our full task, so we split it into smaller subtasks and prompt them individually.
  • The context window of model-x is too small for our task, so we pre-validate the context it needs and just feed it that.
  • We tweak our prompt to avoid failure cases we’re seeing with model-x.
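
To make the pattern concrete, here is a minimal sketch of what this hand-holding often looks like in code. Everything here is hypothetical: call_llm is a stand-in for whatever frontier-model API is being wrapped, and the JSON shape, retry policy, and "format-z" conversion are illustrative rather than drawn from any particular paper.

```python
import json

def call_llm(prompt: str) -> str:
    """Stand-in for whatever frontier-model API the system wraps."""
    raise NotImplementedError

def generate_format_z(task: str, max_retries: int = 3) -> dict:
    # Bound the output: ask for a constrained JSON shape instead of format-z
    # directly, because we don't trust the model to emit format-z verbatim.
    prompt = (
        "Solve the task and reply ONLY with JSON of the form "
        '{"result": <string>, "confidence": <float between 0 and 1>}.\n\n'
        f"Task: {task}"
    )
    for _ in range(max_retries):
        raw = call_llm(prompt)
        try:
            constrained = json.loads(raw)
            # Deterministic post-processing turns the constrained output into
            # the format the rest of the pipeline expects ("format-z" here).
            return {
                "z_payload": constrained["result"],
                "z_score": float(constrained["confidence"]),
            }
        except (json.JSONDecodeError, KeyError, TypeError, ValueError):
            # Tweak the prompt to dodge the failure mode we just observed.
            prompt += "\n\nReminder: reply with valid JSON only, no prose."
    raise RuntimeError("model never produced parseable output")
```

Each of these lines exists to paper over a limitation of model-x; few of them would survive contact with a meaningfully stronger model.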
🀝
Effect
Hand-holding
Systems need extra tricks to compensate for limitations of subpar LLMs.

2. The Need for Novelty

Papers that simply apply a frontier LLM to a new task will almost certainly be rejected for lack of novelty. Exceptions are benchmarking papers or certain types of SoKs (although those may quickly become stale as LLMs continue to improve).

To get published, your paper needs to do something more complex with the LLM. Fuse it with a fuzzer, add verification pipelines, build specially-crafted tools, create different agents that talk to each other, add RAG, add MCP, etc… Large architecture diagrams with robot icons, OpenAI logos, and lots of arrows are refreshing to reviewers and make it clear how your system is different.

🎛️ Effect: Bells and Whistles. Systems need to be more complex to be considered novel.

3. Challenges in Ablation Studies

Ideally, we would compare our systems against a “bare” frontier LLM to demonstrate that our machinery actually adds value beyond what the model can do on its own. But what does that comparison even look like? As a reader, my first instinct is usually to sanity-check whether something like ChatGPT can simply one-shot the task. Yet pitting a chat-interface model against a system with richer context, tool integration, and bespoke scaffolding is fundamentally an apples-to-oranges comparison.

In practice, what we really want is a detailed ablation study. Which context matters? Which tools matter? Which components are essential? Unfortunately, these experiments are expensive and labor-intensive. And when the LLM is deeply entangled with the system’s architecture, disabling a single component often breaks the entire pipeline, making it impossible to cleanly isolate the contribution of each part.
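
For illustration, the ablation we would like to see can be as simple as a leave-one-out sweep over the system’s components. The sketch below is hypothetical: the component names, run_system, and the scoring scheme are placeholders, and the practical obstacle is precisely that run_system often cannot function at all with a component removed.

```python
# Leave-one-out ablation sketch. Component names and run_system are
# hypothetical placeholders, not taken from any particular system.
COMPONENTS = ["retrieval", "tool_use", "verifier", "multi_agent"]

def run_system(task, enabled: set) -> float:
    """Run the pipeline with only the `enabled` components; return a score."""
    raise NotImplementedError

def ablation(tasks) -> dict:
    configs = [set(COMPONENTS), set()]                      # full system, bare LLM
    configs += [set(COMPONENTS) - {c} for c in COMPONENTS]  # drop one at a time
    results = {}
    for cfg in configs:
        scores = [run_system(t, cfg) for t in tasks]
        results[frozenset(cfg)] = sum(scores) / len(scores)
    return results
```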

📊 Effect: Missing Baselines. Systems are not clearly evaluated against "bare" frontier LLMs.

Outcome: LLM-in-a-Box

The unfortunate outcome of these forces is that we end up with systems that are tightly coupled to the particular model used during development due to 🤝 Hand-holding. They are over-engineered due to the need for 🎛️ Bells & Whistles. And due to 📊 Missing Baselines, it is unclear which parts of the system actually matter, or whether there is any added value at all.

In many cases, it’s almost as if we are taking a capable LLM and putting it inside a box. We give it limited access to the outside world through context and let it use a small, curated set of tools. We poke holes in the box as needed to fit the limited capabilities of the immature model. Sometimes we put several LLMs in separate boxes and connect them together with pipes. But these boxes are fixed; there is no room to grow.

The same box that enables an immature model to perform at all becomes a liability for a more powerful model, which cannot fully leverage its capabilities. It puts a hard upper bound on the system’s performance, regardless of how capable the model is. It’s the LLM-era equivalent of the Bitter Lesson.
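
As a caricature, the box might look something like the sketch below. Everything here is hypothetical (the tool whitelist, the context cap, and the call_llm stub); the point is that every limit is hard-coded around whichever model existed when the system was built.

```python
# A caricature of the "LLM-in-a-box" pattern. All names are illustrative.
ALLOWED_TOOLS = ["run_fuzzer", "decompile"]   # frozen when the paper shipped
MAX_CONTEXT_CHARS = 8_000                     # sized for last year's model

def call_llm(prompt: str) -> str:
    """Stand-in for the underlying model API."""
    raise NotImplementedError

def boxed_agent(task: str, context: str) -> str:
    # The only holes in the box: a truncated context window and a fixed,
    # hand-picked tool list. A stronger model gets no extra room to operate.
    context = context[:MAX_CONTEXT_CHARS]
    prompt = (
        f"You may ONLY use these tools: {', '.join(ALLOWED_TOOLS)}.\n"
        f"Context:\n{context}\n\nTask: {task}"
    )
    return call_llm(prompt)
```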


Solutions

The problem is systemic, and I don’t think these sorts of papers will ever go away. My recommendations are specifically for readers and authors in this space:

For Readers

When reading an LLM system paper, try to look beyond the bells and whistles. Sometimes you can identify the core important parts, sometimes not. Try to see whether there is a clear-cut explanation for each system component. Which ones are essential? Which ones exist only to address temporary limitations? When next-generation model-y or model-z comes out, which parts of the system will be obsolete? Which parts will become liabilities? Try to recognize when you’re looking at an LLM that has been shoved into a box.

For Authors

It’s tempting to write these LLM arbitrage papers. They are juicy, easy to write (the LLM already does all the work), and you’ll probably get cited. Try to frame your value-add not around a particular model’s limitations (those will become obsolete), but around a fundamental limitation orthogonal to the advancement of LLMs. Think about how your system will scale as models get more powerful. Avoid falling into the trap of building a really fancy box for your LLM.