What LLMs Can and Cannot Do

Will the power of Large Language Models (LLMs) continue to increase, or will they reach a plateau soon? Are the capabilities of current frontier models and AI systems already good enough to cause massive disruption as implementations ripple through the economy? Or will the limitations of large deep learning models, trained on massive amounts of tokens (mainly text), constrain their reach?

Emergent Consensus

At first sight, even experts seem to struggle to reach a consensus on these questions. On closer inspection, I think this is no longer true. The predictions of imminent danger from superintelligent AI did not come to pass, shifting the "AI doomer" narrative to a less dramatic, yet still very serious, scenario of massive job loss. On the "AI boomer" side, voices promising Artificial General Intelligence are noticeably subdued, while the focus shifts from models to complex agentic systems that seem to be able to automate (most) knowledge work (very soon?). In other words, after a period of hype, a consensus seems to be forming, albeit slowly. Let's speculate on what that consensus might look like, maybe even answering some of our initial questions along the way.

A World of Text

First, note that LLMs live in a world of symbols ("tokens"), nearly all of them sourced from massive amounts of text graciously copied from the Internet. Today, training an LLM consists of three stages: data collection and preprocessing, pre-training, and post-training. Empirical results, the so-called "scaling laws", show that the capabilities of LLMs scale with three factors: model size, dataset size, and training compute. As more or less all easily obtainable data is already used in pre-training, scaling the remaining factors leads to the current investments in massive data centers. Algorithmic advances can be understood as a less controllable fourth factor, one that recently resulted in "reasoning models". These models are trained by Reinforcement Learning with Verifiable Rewards (RLVR), a sort of "self-play" in which language models increase their performance by trying to find solutions to (often synthetically created) problems that are hard to solve but whose solutions are easy to verify, such as certain mathematical proofs. Even multi-modal models, which can process images, video, and sound in addition to text, are mainly pre-trained on text, for reasons of efficiency and data availability.
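To make the scaling-law claim concrete, here is a minimal sketch of the parametric loss form popularized by the Chinchilla paper (Hoffmann et al., 2022): L(N, D) = E + A/N^α + B/D^β, where N is the number of parameters and D the number of training tokens. The coefficients below are the published Chinchilla fits, quoted from memory, so treat the numbers as illustrative rather than authoritative:

```python
# Sketch: Chinchilla-style scaling law L(N, D) = E + A / N**alpha + B / D**beta,
# where N is the parameter count and D the number of training tokens.
# Coefficients are the fits reported by Hoffmann et al. (2022); illustrative only.

E, A, B = 1.69, 406.4, 410.7
ALPHA, BETA = 0.34, 0.28

def predicted_loss(n_params: float, n_tokens: float) -> float:
    """Predicted pre-training loss for a model with n_params parameters
    trained on n_tokens tokens."""
    return E + A / n_params**ALPHA + B / n_tokens**BETA

# A 70B-parameter model trained on 1.4T tokens (roughly Chinchilla itself)
# vs. the same model trained on 10x the data: the loss keeps falling,
# but with sharply diminishing returns.
print(predicted_loss(70e9, 1.4e12))   # ~1.94
print(predicted_loss(70e9, 14e12))    # ~1.86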

Embodiment

Second, contrast how animals (including us) learn with how LLMs are trained. While animals learn continuously by interacting with an evolving physical and social world, current LLMs are pre-trained on a massive but static corpus of mostly text tokens, then post-trained on a static set of tasks and human (i.e. often commercial) preferences. This naturally leads to book-smart, somewhat biased LLM agents whose "world models" fail at common sense, suggesting you take a walk to the car wash to save on gas. These limits of current LLMs clearly show the importance of "true" understanding gained by "grasping" the real world in a literal sense, as an embodied intelligent being.

In principle, the deep learning paradigm should be sufficient to create such embodied artificial intelligences. In practice, building efficient neural architectures and large datasets for learning from true interactions with the real world, augmented with simulations where possible, is a considerably more complex task than collecting an Internet's worth of text and training on that. Current "world models" trained on video data or video games already demonstrate what is possible beyond LLMs, but they, too, are limited by the lack of true interaction with the physical world.

AI Agents and the Future of Work

Applying these ideas to current and future LLMs, I'd conclude that the "street smarts" of LLM-based AI agents will remain limited in often surprising ways. Their flexibility will stay constrained by what is possible through "in-context learning" for several years to come, meaning that LLM-based AI agents will not be able to truly gain experience on the job. As these agents continue to "live" in a world of text, they should not be able to dependably distinguish between fact and fiction for the foreseeable future, which excludes certain high-stakes use cases and necessitates complex "guard rails". Their "creativity" should remain limited to the combinatorial kind.
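To illustrate what "in-context learning" means in practice, here is a minimal sketch: the task is specified entirely by examples placed in the prompt, and nothing persists once the context is gone. The `complete` function is a hypothetical placeholder for whatever text-completion API you use:

```python
# Sketch of in-context ("few-shot") learning: the task is specified purely
# through examples in the prompt. The model's weights never change, so any
# "experience" evaporates as soon as the context window is cleared.
# `complete` is a hypothetical stand-in for a text-completion API call.

def complete(prompt: str) -> str:
    raise NotImplementedError("plug in your LLM provider's completion call")

FEW_SHOT_PROMPT = """\
Classify the support ticket as 'billing', 'technical', or 'other'.

Ticket: I was charged twice this month.
Label: billing

Ticket: The app crashes when I open settings.
Label: technical

Ticket: {ticket}
Label:"""

def classify(ticket: str) -> str:
    # Every call re-sends the same examples: the "learning" lives in the
    # prompt, not in the model.
    return complete(FEW_SHOT_PROMPT.format(ticket=ticket)).strip()
```

Whatever such an agent appears to learn is bounded by its context window and vanishes between sessions, which is precisely the limitation described above.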

Taking these constraints into account, I think it's safe to say that LLM-based AI agents will not be able to directly replace humans in most roles. Productivity gains should arrive slowly, as organizations will need to change their processes to create roles compatible with the limitations of LLM-based AI agents. Once these roles are created, though, AI agents should be able to automatically explore and synthesize solutions from the existing ideas present in their massive training datasets, leading to interesting, even transformative, results.
