Large language models (LLMs) like ChatGPT have quickly become part of everyday life. They help write emails, summarize long documents, and respond to questions in mere seconds. To the person on the other side of the screen, the exchange feels natural, almost conversational.
What happens between a prompt and an answer, however, is far more complex. Most of that process stays out of sight, letting people use these systems without thinking about how the responses are produced.
Inside LLMs, information moves through layers of mathematical operations that are difficult to explain, even for the people who design them.
Recently, Neel Somani released a research project called Symbolic Circuit Distillation that takes a closer look at this hidden process. The project takes a language model and produces a readable computer program that behaves the same way in certain cases.
Instead of guessing what the model might be doing, the method compares a set of candidate human-readable programs against the model and proves that the matching program behaves the same way on the inputs being studied. It builds on recent work from OpenAI known as Sparse Circuits, which simplifies language models by isolating the parts that matter for a specific question.
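To make the idea concrete, here is a minimal sketch of behavioral equivalence checking in Python. It is not the released implementation; the stand-in model_fn, the candidate programs, and the tiny input domain are all hypothetical, chosen only to show how candidates can be compared exhaustively against a model's behavior.

```python
# Minimal sketch of behavioral equivalence checking (hypothetical stand-ins,
# not the Symbolic Circuit Distillation code itself).
from itertools import product

def model_fn(a: int, b: int) -> int:
    """Stand-in for the behavior of a small, isolated piece of a model."""
    return a if a >= b else b

# Candidate human-readable programs to compare against the model.
candidates = {
    "return a": lambda a, b: a,
    "return b": lambda a, b: b,
    "return max(a, b)": lambda a, b: max(a, b),
}

DOMAIN = range(-8, 9)  # small enough to check every input pair exhaustively

for name, program in candidates.items():
    matches = all(program(a, b) == model_fn(a, b) for a, b in product(DOMAIN, DOMAIN))
    if matches:
        print(f"'{name}' reproduces the model's behavior on every input in the domain")
```

On a finite domain, exhaustive checking amounts to a proof of equivalence; the research challenge is providing a comparable guarantee over the far larger input spaces that real models handle.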
For now, the technique works only in simple cases, and Somani is clear about that limitation. Even so, the project reflects his ongoing interest in how language models work. As artificial intelligence becomes more widespread, understanding how these tools behave matters just as much as what they produce.
The Cost of Blind Trust in AI
For many years, limited insight into machine learning systems was treated as a tradeoff for performance. If a model performed well, few people pressed for deeper explanations of how it reached its results. However, as language models became more capable and began to be used in more settings, that approach became harder to sustain.
Somani points to two concerns that often come up in conversations about AI safety. One is reward hacking, which he describes as a model optimizing its objective so aggressively that it begins to undermine other goals or ethical constraints.
The second is scheming, which refers to situations in which a model misrepresents its intentions while pursuing outcomes that are harmful or deceptive.
Both problems are difficult to detect when a system's internal reasoning cannot be observed. Without a clear view of how a model reaches its conclusions, people are left evaluating behavior by outcomes alone.
At the same time, artificial intelligence systems now operate at enormous scale. At companies like Google or OpenAI, models handle billions of requests each day, placing constant pressure on speed, efficiency, and cost. Systems that waste time or computing resources quickly become expensive to maintain.
Somani’s work sits between these competing concerns. Some projects focus on making systems easier to understand, while others aim to make them run more efficiently. In both cases, the emphasis is on reducing how much trust must be placed in systems whose internal operations are difficult to observe.
Improving How Systems Run
Some of the biggest advances in artificial intelligence haven’t come from new models, but from finding better ways to run them. As systems have grown larger, teams have increasingly relied on shared infrastructure tools to manage the complexity of operating them efficiently on graphics processing units (GPUs).
Two commonly used tools, vLLM and SGLang, allow developers to share low-level performance improvements instead of rebuilding them from scratch. One technique these systems use is prefix caching.
At large scale, many requests sent to language models begin the same way. System prompts and templates often repeat, even when the details that follow are different. Prefix caching takes advantage of that overlap by reusing previously computed information rather than recalculating it for every request.
The catch is where that cached information lives. Prefix caching only helps when a request reaches the same graphics processor that already holds the cached data. If the request lands on a different processor, the system has no choice but to redo the work.
Somani’s prototype, called KV Marketplace, addresses that issue by allowing GPUs to share cached information directly with one another. Instead of repeating the same computation, one processor can transfer the data to another over a low-overhead remote direct memory access (RDMA) path.
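The routing logic can be sketched in a few lines of Python. This is a simplified illustration with hypothetical names, not the vLLM fork itself: each worker first checks its own cache, then asks peers for the prefix, and only recomputes as a last resort.

```python
# Simplified sketch of prefix-cache sharing across GPUs (hypothetical names,
# not the actual KV Marketplace code).
from typing import Optional

class CacheNode:
    """Represents one GPU worker holding a prefix -> KV-block cache."""

    def __init__(self, name: str):
        self.name = name
        self.cache: dict[str, bytes] = {}   # prompt prefix -> serialized KV blocks
        self.peers: list["CacheNode"] = []  # other GPU workers it can pull from

    def get_kv(self, prefix: str) -> bytes:
        # 1. Local hit: reuse work this GPU already did.
        if (kv := self.cache.get(prefix)) is not None:
            return kv
        # 2. Peer hit: transfer the cached blocks instead of recomputing them.
        #    (In the prototype this would be an RDMA copy between GPUs.)
        for peer in self.peers:
            if (kv := peer.cache.get(prefix)) is not None:
                self.cache[prefix] = kv
                return kv
        # 3. Miss everywhere: run the prefill computation and cache the result.
        kv = self._compute_kv(prefix)
        self.cache[prefix] = kv
        return kv

    def _compute_kv(self, prefix: str) -> bytes:
        return prefix.encode()  # stand-in for the model's expensive prefill pass
```

In this sketch the only expensive step is _compute_kv; the point of the prototype is to make step 2 cheap enough that a request landing on the "wrong" GPU no longer forces a full recomputation.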
Early experiments showed an improvement of roughly 12.5 percent in latency and throughput. Neel Somani released the prototype as a fork of vLLM and shared it publicly, inviting feedback from others working on GPU systems or inference optimization.
Although the project focuses on performance, it connects to the same underlying theme that runs through his interpretability work. As systems get bigger, inefficiencies tend to surface in different forms, sometimes as wasted computation, and other times as gaps in understanding.
Finding Bugs Before Systems Go Live
A major part of Somani’s work focuses on formal methods, a family of techniques for proving logical facts about programs, such as their correctness. These methods are commonly used in fields like security and privacy, where errors can have serious consequences.
Machine learning has been slower to adopt these tools. Models are usually evaluated by how well they perform, even when the reasons behind that performance are unclear. Somani has spent much of his time exploring whether proof-based techniques can be applied to systems often treated as black boxes.
Much of that work, he says, is driven by his intellectual curiosity and desire to use his skills in ways other researchers aren’t yet exploring.
One example is a project called Cuq, pronounced “kook,” which automatically checks code written for GPUs. GPU code is notoriously difficult to write correctly, and small errors can be hard to detect once systems are deployed at scale.
Cuq applies theoretical checks to catch those mistakes earlier in the development process. The goal isn’t to replace engineers, but to reduce the number of errors that make it into real-world systems.
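As a flavor of what such checks look like, the sketch below uses the open-source Z3 solver from Python to prove a simple property of GPU-style indexing code: that a thread's global index can never fall outside an array's bounds. It illustrates the general approach rather than Cuq's actual implementation, and the sizes are made up.

```python
# Illustration of a formal bounds check (not Cuq itself), using the Z3 solver.
from z3 import Ints, Solver, Or, unsat

BLOCK_SIZE = 256   # threads per block (made-up sizes)
GRID_SIZE = 64     # blocks per grid
N = 16384          # length of the array being indexed

block_id, thread_id = Ints("block_id thread_id")
idx = block_id * BLOCK_SIZE + thread_id   # the usual global-index computation

s = Solver()
# Constraints the hardware guarantees about thread and block ids.
s.add(0 <= thread_id, thread_id < BLOCK_SIZE)
s.add(0 <= block_id, block_id < GRID_SIZE)
# Ask the solver for a counterexample: ids whose index escapes the array.
s.add(Or(idx < 0, idx >= N))

# "unsat" means no counterexample exists, so every access is in bounds.
print("proved in bounds" if s.check() == unsat else "possible out-of-bounds access")
```

Unlike a test suite, the solver's answer covers every possible thread and block id at once, which is what makes this style of checking attractive for code that only fails under rare conditions.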
The same line of thinking appears in Symbolic Circuit Distillation. The project starts with a language model whose behavior can’t be easily explained and asks whether that behavior can be translated into something a person can read, reason about, and verify. In limited cases, Somani’s work shows that it can.
From Academic Proof to Practical Scale
During his time at the University of California, Berkeley, Neel Somani pursued a triple major in mathematics, computer science, and business administration. While there, he contributed to research in type systems, differential privacy, and scalable machine learning frameworks.
One of the projects he worked on was Duet, a formal verifier designed to automatically prove that code meets the formal definition of differential privacy. The work required careful reasoning about both mathematics and implementation, reinforcing the importance of precision at every level of a system.
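For reference, a common formal statement of differential privacy, the (ε, δ) version, bounds how much any single person's record can shift a mechanism's output distribution; proving that code satisfies it for all neighboring inputs is the kind of guarantee a verifier like Duet targets.

```latex
% (\varepsilon, \delta)-differential privacy: for all datasets D, D' differing
% in a single record, and for every set of outputs S,
\Pr\bigl[\mathcal{M}(D) \in S\bigr] \;\le\; e^{\varepsilon}\,\Pr\bigl[\mathcal{M}(D') \in S\bigr] + \delta
```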
After graduating, Somani worked as a quantitative researcher in Citadel’s commodities group, focusing on complex problems in global markets, where small modeling errors can lead to significant financial consequences. Much of his work there applied mixed-integer programming (MIP) optimization to the electricity market.
Alongside his research career, Somani is also the founder of Eclipse, an Ethereum Layer 2 platform powered by the Solana Virtual Machine. Ethereum Layer 2s can provide verifiability guarantees for the code they execute. Eclipse has raised $50 million in Series A funding and has attracted significant attention within the industry.
The work focuses on developing infrastructure that can function reliably at scale, running in parallel with his academic and technical research interests. Outside of his technical work, Somani funds a personal scholarship program for higher education, reflecting his commitment to the next generation of talent.
Leaving Room for Better Ideas
One longer-term idea Neel Somani has explored is whether, for a given input, transformer models that are difficult to explain could be translated into programs that people can read and interpret directly. However, if the initiative proves unworkable or fails to interest the research community, he’s willing to change course.
In the near future, he expects artificial intelligence to raise more questions than it answers. Today, language, image, and video models are still developed largely in isolation, built with different assumptions and techniques. He sees that separation as temporary, with ideas from one area likely to influence progress in others.
Much of that shift, he believes, will depend on how models are trained. Many current systems rely on fairly simple forms of feedback, such as reinforcement learning from human feedback, especially when compared with how people learn through experience.
Instead of viewing this as a shortcoming, Somani sees it as an open problem, one that could give reinforcement learning a larger role or lead to new training approaches altogether.
Limits on memory are another concern. Even very large models can only work with a limited amount of information at one time, which restricts how much context they can use when responding. Somani sees future systems expanding that capacity, allowing models to draw on much larger bodies of text, including entire books or libraries.
Taken together, these ideas raise more questions than predictions. Rather than trying to map out where the field is headed, Somani remains focused on building tools that make today’s systems easier to understand, leaving room for his work to evolve as technology does.