From Symbols to Meaning: How Modern Language Models Really Work

Author: Fernando Benites

AI was supposed to reason like a logician. Instead, it learned to navigate meaning like a reader moving through a library — finding related ideas not by following rules, but by sensing proximity. This article traces that journey, explains what large language models actually do to words, and honestly confronts the question: are they just very eloquent parrots?

That parrot framing is one of two slogans that dominate the popular discussion of these systems; the other is its mirror, “it can think.” Both are wrong in interesting ways. “It’s just autocomplete” misses that the autocomplete operates over a learned geometry of meaning, refined across dozens of attention and feedforward layers, each transforming the representation in non-trivial ways; calling that pattern recognition is accurate, but undersells how much structure the patterns carry, and how much of the work happens in the multi-step composition of those patterns rather than in any single lookup. “It can think” projects a unified deliberative agency onto a system that has none. The scientific literature has been more careful but harder to read; the popular treatment has been readable but lossy. This article tries to land in the middle. It simplifies aggressively (every figure here is a 2-D projection of something tens of thousands of dimensions wide), but it commits to a functional description with concrete consequences for two things educators and practitioners actually do. How to teach about LLMs: the parrot-versus-AGI dichotomy is the wrong frame, and meaning-as-geometry is the right one. How to prompt them: knowing where the answers come from explains why some prompts work, when to ground with retrieval, when to ask for chain-of-thought, and when to stop trusting fluent output that the model has no grounded reason to produce.

Intended for educators, advanced students, and practitioners with some ML background. Technical sections include collapsible deep-dives — reading without them still preserves the spine of the argument. If you want to skip the historical setup, the mechanism story begins in §3 (Attention) and lands at §7 (Debunking the stochastic parrot); the practical, pedagogy-oriented summary is §8.

Table of contents

  1. The expectation: AI as a reasoning machine
    • 1.1 Symbolic AI and the knowledge-engineering dream
    • 1.2 Why the dream collapsed: the brittleness problem
    • 1.3 Connectionism and the first neural-network winter
  2. The statistical turn: meaning as location
    • 2.1 N-grams: language as statistics
    • 2.2 Word embeddings: words as coordinates
  3. Attention: meaning that moves
    • 3.1 The bottleneck problem
    • 3.2 Attention as semantic navigation
    • 3.3 What attention actually does in a transformer
  4. Emergence: when scale becomes something else
    • 4.1 GPT-3 and few-shot learning
    • 4.2 Phase transitions in capability
    • 4.3 From raw predictor to instruction-follower
  5. Thinking tokens: teaching models to slow down
  6. RAG: connecting memory to the world
  7. Debunking the stochastic parrot
    • 7.1 Where the parrot critique lands hardest
    • 7.2 The parrot thesis, fairly stated
    • 7.3 The evidence against pure parroting
    • 7.4 Prediction, planning, and the bounded role of randomness
    • 7.5 When the parrot does appear: hallucination reconsidered
    • 7.6 The middle ground
  8. Implications for educators and practitioners
  9. References
📖 Glossary — one-line definitions of key terms
  • Token — the unit a transformer actually operates on; sub-word pieces produced by a learned tokenizer (BPE, SentencePiece). What looks like “words” to a reader is tokens to the model.
  • Embedding — a learned vector representation of a token. Tokens used in similar contexts end up at nearby points in embedding space.
  • Attention head — one of many parallel attention operations in a transformer layer. Each head can learn to track a different kind of relationship (grammatical, semantic, positional, etc.).
  • Parametric vs non-parametric memory — Parametric: knowledge baked into the model’s weights at training time. Non-parametric: knowledge looked up at inference time (RAG, external tools, search).
  • In-context learning — the model picking up a task pattern from examples shown in the prompt itself, with no weight updates.
  • RLHF — reinforcement learning from human feedback. The alignment technique that turned base completion models into instruction-following assistants.
  • Test-time compute — work the model does per query (thinking tokens, chain-of-thought, search), separate from training cost. The new scaling axis.

1 The expectation: AI as a reasoning machine

1.1 Symbolic AI and the knowledge-engineering dream

When Alan Turing asked in 1950 whether machines could think [1], the implicit model was that of a symbolic reasoner: an entity that manipulates symbols, applies rules, and arrives at correct conclusions by necessity. This expectation shaped the dominant research programme in artificial intelligence for decades.

From the mid-1950s through the 1980s, symbolic AI — sometimes called Good Old-Fashioned AI, or GOFAI (John Haugeland’s coinage [48]) — held that intelligence was computation over explicit representations. Build a large database of facts and rules, then let an inference engine derive new facts. Systems like MYCIN for medical diagnosis (Shortliffe 1974; documented in [2]) demonstrated genuine expertise in narrow domains. The approach had deep intuitive appeal: it resembled how we imagine ourselves to think.

Classroom example — what symbolic AI looked like A symbolic system for diagnosing a fever might contain rules like: IF temperature > 38.5°C AND onset < 3 days AND no recent vaccination THEN suspect infection. A doctor’s expertise was painstakingly extracted, written as rules, and stored in a database. The machine followed the rules; it did not learn them.
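To make the classroom example concrete, here is a minimal sketch of that fever rule as a hand-coded rule base. The `Patient` fields, the `RULES` list, and the `diagnose` helper are hypothetical names chosen for illustration; real expert systems such as MYCIN were far more elaborate.

```python
from dataclasses import dataclass


@dataclass
class Patient:
    temperature_c: float        # body temperature in degrees Celsius
    onset_days: float           # days since symptoms began
    recently_vaccinated: bool


def suspect_infection(p: Patient) -> bool:
    """The fever rule from the text, written by hand rather than learned."""
    return p.temperature_c > 38.5 and p.onset_days < 3 and not p.recently_vaccinated


# The "knowledge base": (conclusion, rule) pairs. Hypothetical structure.
RULES = [("suspect infection", suspect_infection)]


def diagnose(p: Patient) -> list[str]:
    """Return every conclusion whose rule fires for this patient."""
    return [conclusion for conclusion, rule in RULES if rule(p)]


print(diagnose(Patient(39.2, 1, False)))  # the rule fires
print(diagnose(Patient(39.2, 5, False)))  # onset too long ago: nothing fires
```

The second call shows the limitation in miniature: a case outside the written conditions gets no answer at all, and the only fix is for someone to add another rule.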

1.2 Why the dream collapsed: the brittleness problem

The limits appeared as soon as the domains grew larger. Knowledge engineering — the laborious process of extracting and encoding expert knowledge — did not scale. The world contains too many exceptions, too much context-dependence, and too much tacit knowledge that experts cannot articulate. Philosopher Michael Polanyi captured the core problem: “we know more than we can tell” [3].

Ask a symbolic system about a fever in a child who has just returned from the tropics, and it fails — unless someone thought to add that rule. The first AI winter (1974–1980), triggered in the UK by the 1973 Lighthill Report’s critical assessment of the field [18] and in the US by internal DARPA reassessments (the broader Mansfield Amendment’s restrictions on basic research played a contributing rather than a decisive role), saw funding for general AI research collapse. A modest revival followed with the expert-systems boom of the 1980s, but the second AI winter arrived around 1987–1993, when commercial expert-systems companies failed to deliver on their promises and the LISP-machine market collapsed. The question after both winters was no longer whether machines could reason in the symbolic sense, but whether that was even the right target.

1.3 The parallel story: connectionism and the first neural-network winter

The symbolic-AI history is only half the story. Running in parallel — and at the time, in direct competition for funding and intellectual prestige — was the connectionist tradition: the idea that intelligence might emerge from networks of simple, neuron-like units rather than from explicit rules. Frank Rosenblatt’s perceptron, introduced in 1958, was the first practical learning algorithm of this kind. The press greeted it with extraordinary headlines: the New York Times coverage was famously hyperbolic; in widely-quoted phrasing — the exact wording varies across reproductions — the device would in time “walk, talk, see, write, reproduce itself and be conscious of its existence” [41].

The hype outran the technology. By the mid-1960s, neural-network research was already slowing — limited by the computers of the era, by the lack of any training algorithm for networks with more than one layer of weights, and by the rising influence of the symbolic school. In 1969, Marvin Minsky and Seymour Papert published Perceptrons [21], a mathematically rigorous critique of single-layer networks. Their central result was that single-layer perceptrons could not learn even simple non-linearly-separable functions — the XOR function being the canonical example. They conjectured, on the basis of intuition rather than proof, that multi-layer extensions would face similar limits.

The book did not single-handedly kill neural-network research, but it crystallised a perception that connectionism was a dead end. A first neural-network winter followed — in mainstream visibility, at least; the work itself continued underground through the 1970s and early 1980s. The recovery came in stages. In 1982, John Hopfield revived connectionism’s scientific credibility from outside the AI mainstream [27]. In 1986, Rumelhart, Hinton & Williams [22] brought backpropagation into the AI mainstream, demonstrating that multi-layer networks could learn XOR and far more. (Hopfield and Hinton would share the 2024 Nobel Prize in Physics for these foundational connectionist contributions — a fifty-year vindication made explicit.) By 1989, the Universal Approximation Theorem (Cybenko [28]; Hornik et al. [29][30]) had formally refuted Minsky and Papert’s conjecture. The field had lost the better part of a generation; the connectionist programme would dominate from the 2010s onward, once GPU hardware made large networks trainable.

📚 Deep history — The neural-network winter and recovery (the long version)

The XOR block. Imagine plotting four points: (0,0), (0,1), (1,0), (1,1). The OR function returns 1 unless both inputs are 0 — you can draw a straight line separating the “0” point from the three “1” points. AND is similar. But XOR returns 1 only when exactly one input is 1 — and there is no straight line that separates the two “1” points from the two “0” points; they lie on opposite diagonals. A single-layer perceptron, which can only carve up the input space with straight lines, simply cannot represent this. Multi-layer networks with non-linear activation functions can — but no one had a workable algorithm to train them in 1969.
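The XOR block can be checked numerically. The sketch below searches a coarse grid of weights for a single threshold unit (an illustration, not a proof) and then wires up a two-layer network by hand. The grid range and the specific weights are assumptions picked for readability.

```python
from itertools import product

INPUTS = [(0, 0), (0, 1), (1, 0), (1, 1)]
TARGETS = {
    "OR":  [0, 1, 1, 1],
    "AND": [0, 0, 0, 1],
    "XOR": [0, 1, 1, 0],
}


def step(z):
    """Threshold activation: fire (1) only when the weighted sum is positive."""
    return 1 if z > 0 else 0


def unit(w1, w2, b, x):
    """One perceptron-style unit: a single straight line through the input plane."""
    return step(w1 * x[0] + w2 * x[1] + b)


# Candidate weights and biases from -4.0 to 4.0 in steps of 0.5.
GRID = [i / 2 for i in range(-8, 9)]


def single_unit_can_represent(name):
    """True if some (w1, w2, b) on the grid reproduces the target truth table."""
    target = TARGETS[name]
    return any(
        [unit(w1, w2, b, x) for x in INPUTS] == target
        for w1, w2, b in product(GRID, repeat=3)
    )


def two_layer_xor(x):
    """Hidden units compute OR and AND; the output fires for 'OR but not AND'."""
    h_or = unit(1, 1, -0.5, x)
    h_and = unit(1, 1, -1.5, x)
    return step(h_or - h_and - 0.5)


for name in TARGETS:
    print(name, single_unit_can_represent(name))
print([two_layer_xor(x) for x in INPUTS])
```

The grid search finds weights for OR and AND but none for XOR, while the hand-wired hidden layer solves it. What was missing in 1969 was not this construction but an algorithm to find such weights automatically.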

The underground decade — Linnainmaa, Werbos, Grossberg, Kohonen, Fukushima. Through the 1970s and early 1980s, a scattered community continued building. In 1970, the Finnish mathematician Seppo Linnainmaa published, in his master’s thesis, the modern form of reverse-mode automatic differentiation — the mathematical machinery that backpropagation requires — though without reference to neural networks at all [23]. In 1974, Paul Werbos’s Harvard PhD thesis applied this technique specifically to training multi-layer networks [24]. He could not publish it on neural-network grounds for years; symbolic AI was in vogue, and his work on the topic only reached print around 1982. Stephen Grossberg, working deeply outside the AI mainstream and famously combative about it, developed adaptive resonance theory from 1976 onward, addressing how networks could learn continuously without catastrophically forgetting prior knowledge — the “stability-plasticity dilemma” — culminating in the 1987 ART1 architecture with Gail Carpenter [25]. Teuvo Kohonen developed self-organizing maps in the early 1980s — networks that learn topology-preserving representations of high-dimensional data without supervision. In Japan, Kunihiko Fukushima’s 1980 Neocognitron [26] introduced a multilayer convolutional architecture for visual pattern recognition; it is the direct ancestor of every convolutional neural network in use today.

The thaw — Hopfield and the PDP volumes. In 1982, the theoretical physicist John Hopfield published a short paper showing that a simple recurrent network of binary neurons could serve as a content-addressable associative memory, with stored patterns acting as energy minima [27]. The Hopfield network was elegant, easy to analyse, and — crucially — published by someone with the scientific credibility to be heard outside the AI community. It revived interest in connectionist methods within physics and cognitive science. Then in 1986, Rumelhart, Hinton & Williams published the paper [22] that brought backpropagation into the mainstream of AI: a clear, accessible demonstration that multi-layer networks could learn complex tasks, including XOR. The Parallel Distributed Processing volumes that accompanied this work made the case to a broad audience that connectionism was alive, productive, and serious.

The Universal Approximation Theorem — Minsky’s conjecture mathematically refuted. The formal refutation arrived in 1989 — three years after the Rumelhart–Hinton–Williams paper — in the form of the Universal Approximation Theorem. Cybenko proved that any continuous function on a compact subset of ℝ^n can be approximated to arbitrary precision by a feedforward network with a single hidden layer of finite width, given a sigmoid activation function [28]. Hornik, Stinchcombe and White independently proved a more general version the same year [29]; Hornik (1991) extended the result further, showing that the multilayer architecture itself — not the specific choice of activation function — is what gives neural networks their universal approximation capacity [30]. The practical implication is decisive: a neural network of sufficient size can represent any continuous function. UAT settled only the representability question. Minsky and Papert’s other concerns — about training tractability, depth-vs-width trade-offs, and which features hidden layers would actually have to learn — proved more durable; only the practical breakthroughs of the 2010s (hardware, data, modern optimisation) fully turned the connectionist bet into a win.

The lost generation. Minsky and Papert’s intuition about multi-layer networks turned out to be wrong, but the field had already lost the better part of a generation. The connectionist programme would only fully recover with the deep-learning revolution of the 2010s, when GPU hardware finally made it possible to train the kinds of large networks that the underground work of the 1970s and 80s had foreshadowed.

Why this matters for the rest of the story The two AI traditions — symbolic and connectionist — had radically different bets about what mattered for intelligence. The symbolic school bet on explicit representations and inference; the connectionist school bet on learned distributed representations and emergent behaviour from large networks. The symbolic bet dominated from roughly 1970 to the mid-2000s. Everything that follows in this article — embeddings, attention, transformers, emergent capabilities at scale — is the connectionist bet finally paying off. The story of modern AI is not just the rise of deep learning; it is the slow, fifty-year vindication of a tradition that had been largely written off.

[Interactive timeline: symbolic vs connectionist AI. Legend: connectionist; symbolic / GOFAI; AI winters; modern (statistical → LLM).]

2 The statistical turn: meaning as location

From here on, the article narrows from AI in general to language specifically. Three reasons. First, language is where the connectionist comeback hit hardest and most legibly: we can read what the model outputs and judge it directly, in a way we cannot with a latent vision embedding or a robotic control policy. Second, the available training data is effectively free at unprecedented scale; the web is a corpus no other modality has at comparable size. Third, and most relevant to the parrot debate, language is the modality of human reasoning, so the question “can a machine that manipulates language mean anything?” is exactly the question that animated symbolic AI, that animated the Bender & Koller octopus, and that still animates §7. Vision, speech, and robotics followed parallel deep-learning gains through the 2010s; in language the form-vs-meaning argument has its sharpest edge, which is why language modelling is where the rest of this article lives.

2.1 N-grams: language as statistics

The statistical alternative asked a fundamentally different question: instead of encoding what language means, can we learn what language does by observing enormous amounts of it? A language model assigns a probability to every possible sequence of words. The simplest version, the n-gram model, estimates the probability of each word given the preceding few words, and requires nothing beyond counting.
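A bigram model (n = 2) is the smallest instance of this idea: estimate P(next word | previous word) as count(previous, next) / count(previous). The sketch below does exactly that on a made-up toy corpus; the corpus and function name are assumptions for illustration.

```python
from collections import Counter, defaultdict

# Toy corpus, already split into words (an assumption for illustration).
corpus = "the cat sat on the mat . the cat ate the fish .".split()

# bigram_counts[prev][nxt] = how often `nxt` directly follows `prev`.
bigram_counts = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    bigram_counts[prev][nxt] += 1


def next_word_probs(prev):
    """Relative-frequency estimate of P(next | prev); empty if `prev` was never seen."""
    counts = bigram_counts[prev]
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}


print(next_word_probs("the"))
print(next_word_probs("dog"))  # unseen context: the model has nothing to say
```

Nothing here goes beyond counting, which is both the appeal and the limit: any context absent from the corpus yields no estimate, and the model sees only one word back.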

This worked surprisingly well for tasks like speech recognition. Its success also widened the gap between NLP researchers with a math, physics, or ML background and traditional linguists, culminating in a quip often attributed to Frederick Jelinek (variously rendered, and which Jelinek himself later said he regretted): “Every time I fire a linguist, the performance of the speech recogniser improves” [40]. But n-gram models had hard limits: they could not capture dependencies across more than a few words, and, as any linguist would point out, they treated every word as an atom with no internal relationship to any other word. They were also often combined with bag-of-words representations, in which each word gets its own dimension in a high-dimensional space, independent of every other word; words like rabbit and carrot are therefore orthogonal, and the model sees no relation between them.
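The orthogonality problem is easy to see with one-hot vectors, sketched here for a hypothetical three-word vocabulary: every pair of distinct words has a dot product of zero, so rabbit is exactly as unrelated to carrot as it is to hammer.

```python
# Hypothetical three-word vocabulary; each word gets its own dimension.
vocab = ["rabbit", "carrot", "hammer"]


def one_hot(word):
    """Vector with a 1.0 in the word's own dimension and 0.0 elsewhere."""
    return [1.0 if w == word else 0.0 for w in vocab]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


print(dot(one_hot("rabbit"), one_hot("carrot")))  # 0.0
print(dot(one_hot("rabbit"), one_hot("hammer")))  # 0.0
print(dot(one_hot("rabbit"), one_hot("rabbit")))  # 1.0
```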

2.2 Word embeddings: words as coordinates

The intuition is older than the computers that proved it. Zellig Harris’s distributional hypothesis [42] and J.R. Firth’s “you shall know a word by the company it keeps” [43] both proposed, in the 1950s, that meaning could be captured by what surrounds a word — symbolic-era linguists already had the conceptual move. Hinrich Schütze’s 1990s vector-space models gave the idea its first serious computational form [44]. What the neural era added was a learned, dense implementation that finally made the picture work at scale.

The breakthrough that set the stage for modern AI came from a simple idea: what if each word were not an atom, but a point in space? Bengio et al. (2003) showed that a neural network could learn a continuous vector representation — an embedding — for each word, placed so that words used in similar contexts end up nearby [4]. Meaning became geometry.

Mikolov et al. (2013) [19] made this tractable at scale; a companion paper from the same group, Mikolov, Yih & Zweig (2013) [55], produced the result that made linguists stop and pay attention: king − man + woman ≈ queen. Arithmetic on meaning. The model had never been told what “royalty” or “gender” were; those relationships emerged from the geometry of the embedding space, learned entirely from patterns in text. (Subsequent work showed the analogy is more fragile than the canonical demo suggested: Linzen [49] and Nissim et al. [50] noted that the nearest neighbour to king − man + woman is often “king” itself unless the input words are excluded from the search, and similar arithmetic fails on less canonical pairs. The deeper point, that semantic geometry is real, survives; the cleanness of the demo was somewhat oversold.)

Intuition — the library analogy Imagine a vast library where books are not sorted alphabetically, but by meaning. Books about oceans are shelved near books about rivers, which are near books about rain, which are near books about weather. “Warm” and “hot” sit nearby; “warm” and “hammer” sit far apart. A word embedding is exactly this: a location in a meaning-space learned from the pattern of which words tend to appear in the same neighbourhood as which others.

This was a conceptual departure from symbolic AI. Meaning was no longer stored in a database of propositions — it was implicit in the geometry of a high-dimensional space, learned from data. A word’s location in that space encoded something about what it meant, without anyone ever defining it.
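The analogy arithmetic can be sketched with a handful of hand-picked toy vectors. These three-dimensional coordinates are assumptions invented for illustration, not learned embeddings; real models use hundreds of dimensions and the result is noisier. The `exclude_inputs` flag mirrors the common practice of dropping the query words from the candidate set.

```python
import math

# Hand-picked toy vectors (an assumption, not learned from data).
EMB = {
    "king":   [0.9,  0.8, 0.1],
    "queen":  [0.9, -0.7, 0.2],
    "man":    [0.1,  0.9, 0.0],
    "woman":  [0.1, -0.8, 0.1],
    "hammer": [-0.5, 0.1, 0.9],
}


def cosine(u, v):
    """Cosine similarity: 1.0 means same direction, 0.0 means orthogonal."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def analogy(a, b, c, exclude_inputs=True):
    """Return the word whose vector is closest to vec(a) - vec(b) + vec(c)."""
    target = [x - y + z for x, y, z in zip(EMB[a], EMB[b], EMB[c])]
    candidates = [w for w in EMB if not (exclude_inputs and w in (a, b, c))]
    return max(candidates, key=lambda w: cosine(target, EMB[w]))


print(analogy("king", "man", "woman"))
```

With these toy coordinates the answer is "queen" whether or not the inputs are excluded; with real embeddings the exclusion often matters, which is part of why the canonical demo looks cleaner than the underlying geometry.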

[Interactive figure: semantic operations in embedding space. Each analogy pair shares the same offset vector; the model learned these geometric relationships from text alone and was never explicitly taught them.]
📖 Side note — tokens are not words

A real transformer operates not on words but on tokens — sub-word pieces produced by a learned tokenizer (BPE, SentencePiece). A common word like the is one token; a rare word may be split into byte-pairs; numbers and code take unusual splits. Many of the model’s quirks — poor arithmetic on awkwardly-split numerals, the famous “SolidGoldMagikarp” anomalous tokens, systematically uneven quality across languages — trace directly to the tokenizer rather than to the network. The vectors moved through semantic space are token-vectors, not word-vectors. The picture in this article is faithful at the level of meaning; the alignment between tokens and the words we read is where the picture is loosest.

A pre-transformer note. ELMo (Peters et al. 2018) [52] had already produced contextualised word representations using bidirectional LSTMs — so context-sensitivity itself did not require attention. What transformers added, beyond ELMo, was parallelism: every position attending to every other position in one step, instead of sequentially through an RNN. That parallelism, more than context-sensitivity per se, is what made training-at-scale possible.

But static embeddings had one critical flaw: every word had exactly one location. “Bank” (financial institution or riverside?) always mapped to the same point in space, regardless of whether the surrounding sentence talked about loans or rivers. Meaning, in the real world, depends on context. Fixing this required something more dynamic.

3 Attention: meaning that moves

3.1 The bottleneck problem

By 2014, the leading architecture for machine translation used an encoder-decoder design with recurrent neural networks (LSTMs). The encoder read a sentence word by word, compressing it into a single fixed-size vector — like summarising a paragraph into one sentence. The decoder then generated the translation from that summary. The flaw was obvious to anyone who thought about it: forcing everything in a long, complex sentence into a single vector discards information. Long-range dependencies — the meaning at the end of a sentence that depends on context from its beginning — degraded over many sequential processing steps.

Bahdanau, Cho and Bengio (2014; published at ICLR 2015) proposed the key fix: instead of using one compressed summary, allow the decoder at each output step to look back at the entire input and decide which parts are relevant now [6]. This was soft attention — a learned, dynamic spotlight on the input.

3.2 Attention as semantic navigation

Here is the central reframing: attention is not just a mechanism — it is what allows meaning to be context-sensitive. In a static word embedding, every word has one fixed point in the semantic space. Attention allows that point to move depending on context.

Consider the word “bank” again. In a static embedding it sits at one location — ambiguously between financial-institution-space and river-space. With attention, the word “bank” in the sentence “The river bank was flooded” can shift its representation toward the river-related region of semantic space, because the model has learned to let “bank” be strongly influenced by nearby words like “river” and “flooded.” In the sentence “The financial bank was insolvent,” the same word shifts toward the financial-institution region, pulled by “financial” and “insolvent.”

Each layer of a transformer performs this contextual repositioning. A word does not arrive at its final representation in one step — it is iteratively refined over many layers, each one updating every word’s position in semantic space based on the full context of the sentence. By the final layer, “bank” in the river sentence and “bank” in the financial sentence have distinct representations, even though they started from the same point.

The key insight about attention Think of each word as having a fuzzy, provisional meaning. Attention is the process by which that meaning is sharpened and contextualised — pulled toward nearby words that constrain its interpretation. The final representation of a word is not what the word means in the dictionary; it is what this particular word means in this particular sentence, given everything else that surrounds it.

This is why attention was such a fundamental step beyond static embeddings. Embeddings gave words locations. Attention gave those locations the ability to respond to their neighbourhood.
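The "bank" example can be reproduced with a stripped-down version of self-attention. This is a sketch under strong assumptions: the 2-D vectors are hand-picked (axis 0 stands for finance, axis 1 for river), and queries, keys, and values are all the raw embeddings, whereas a real transformer applies learned projections and stacks many layers.

```python
import math

# Toy 2-D "semantic space" (assumed): axis 0 = finance, axis 1 = river.
EMB = {
    "the":       [0.0, 0.0],
    "bank":      [1.0, 1.0],   # ambiguous: halfway between both senses
    "river":     [0.0, 2.0],
    "flooded":   [0.2, 1.8],
    "loan":      [2.0, 0.0],
    "defaulted": [1.8, 0.2],
}


def softmax(xs):
    m = max(xs)                              # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(vectors):
    """Scaled dot-product attention with query = key = value = the input vectors."""
    d = len(vectors[0])
    outputs = []
    for q in vectors:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in vectors]
        weights = softmax(scores)
        outputs.append([sum(w * v[i] for w, v in zip(weights, vectors)) for i in range(d)])
    return outputs


def contextual(word, sentence):
    """Representation of `word` after one attention step over `sentence`."""
    words = sentence.split()
    return self_attention([EMB[w] for w in words])[words.index(word)]


print(contextual("bank", "the river bank flooded"))    # river axis dominates
print(contextual("bank", "the bank loan defaulted"))   # finance axis dominates
```

The same starting vector for "bank" ends up in two different places depending on its neighbours, which is the contextual repositioning described above, reduced to a single layer.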

[Interactive figure: attention as semantic shifting. Dots represent words from the model's semantic space. The highlighted word's position shifts toward the cluster matching its contextual meaning; this is what attention does.]

3.3 What attention actually does in a transformer

Vaswani et al. (2017), the “Attention Is All You Need” paper [5], removed recurrence entirely. Every layer is built around an attention operation, paired with a position-wise feedforward block. The architecture processes all words simultaneously, and every word attends to every other word in the same step. This has two consequences.

First, long-range dependencies are trivial: a word at position 1 and a word at position 100 are just as directly connected as adjacent words. Distance no longer degrades the signal. Second, because there is no sequential processing, the entire computation is parallelisable on modern GPU hardware. This single property — parallelisability — unlocked training at scales previously unthinkable, and scale turned out to change everything.

The transformer uses multi-head attention: running many parallel attention operations simultaneously, each potentially learning to track a different kind of relationship. One head might learn to track grammatical agreement; another might track co-reference (who “she” refers to); another might track semantic similarity. (This clean one-head-per-relationship story is a useful pedagogical simplification; mechanistic interpretability has shown that heads in production transformers are typically polysemantic, with several relationships entangled in the same head and the same relationship distributed across several heads.) The final representation is a combination of all of them. Meaning is not one thing — it is an aggregation of many simultaneous relationships.

Worked example — “The trophy didn’t fit in the suitcase because it was too big” What does “it” refer to — the trophy or the suitcase? Humans resolve this instantly: “it” refers to the trophy, because the trophy is too big to fit. A static word embedding cannot answer this — “it” has one representation regardless of context. A transformer with attention can learn to resolve this because it sees the entire sentence simultaneously. The model has learned, from countless examples, that when “fit…in” describes a containment relationship, the thing that is “too big” is the one that failed to fit, not the container. The representation of “it” shifts accordingly. Winograd schemas [51] — sentences designed specifically to test this kind of reasoning — were considered very hard for AI; transformers handle most of them with high accuracy.

[Interactive figure: how a transformer predicts the next token. Prompt: "the capital of ___ is __", where the first blank is filled in by the reader's selection.]
Heavily simplified: real transformers stack many attention + MLP layers, and the "operator" is distributed across heads and feedforward blocks. The geometry shown is faithful to what Word2Vec demonstrated [19] and what mechanistic interpretability recovers from real circuits [20][32].

A note on the symbolic/connectionist framing: modern transformers blur the dichotomy at the architectural level. Attention is, mechanically, a soft and differentiable form of key-based indexing — queries look up values via learned keys, very much like a fuzzy database query. The classical opposition between rule-following and pattern-matching has partially dissolved; what makes the transformer feel like meaning-navigation also makes it look like differentiable symbol manipulation.
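The "fuzzy database query" comparison can be made concrete by putting a hard dictionary lookup next to a soft, attention-style lookup. This is a minimal sketch: the keys, values, and the `scale` parameter are assumptions chosen for illustration, and real attention uses learned projections rather than fixed vectors.

```python
import math


def softmax(xs):
    m = max(xs)                              # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def soft_lookup(query, keys, values, scale=1.0):
    """Return (blended value, weights): every value contributes in proportion to its match."""
    weights = softmax([scale * dot(query, k) for k in keys])
    dim = len(values[0])
    blended = [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]
    return blended, weights


keys = [[1.0, 0.0], [0.0, 1.0]]
values = [[10.0, 0.0], [0.0, 10.0]]

# Hard lookup: an exact key is required, and a near-miss returns nothing.
hard_table = {tuple(k): v for k, v in zip(keys, values)}
print(hard_table.get((0.5, 0.5)))  # None

# Soft lookup: an exact match behaves almost like the hard lookup...
print(soft_lookup([1.0, 0.0], keys, values, scale=10.0))
# ...while a query between two keys returns a blend of both values.
print(soft_lookup([0.5, 0.5], keys, values, scale=10.0))
```

Because the soft version is differentiable in the query, keys, and values, the lookup itself can be learned by gradient descent, which a dictionary cannot.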

🎥 Watch — 3Blue1Brown visualises attention beautifully

If you want to see this geometry move, Grant Sanderson (3Blue1Brown) animates every step of the transformer in his Neural Networks series. The chapter “But what is a GPT? Visual intro to transformers” builds attention from first principles — query, key, and value as literal vectors moving through space — and pairs unusually well with the semantic-navigation framing above.

4 Emergence: when scale becomes something else

4.1 GPT-3 and few-shot learning

In May 2020, OpenAI published “Language Models are Few-Shot Learners” [7], introducing GPT-3: a transformer with 175 billion parameters — about ten times larger than the previous largest dense language model (Microsoft’s Turing-NLG, 17B). The central finding was not that a bigger model performed better on existing benchmarks. It was that GPT-3 demonstrated a qualitatively new mode of interaction: few-shot prompting.

Instead of fine-tuning the model with labelled examples, a user could simply write a few demonstrations in plain text (“Translate to French: Hello → Bonjour, Goodbye → Au revoir, Thank you → Merci”) and the model would complete the pattern correctly across a wide range of tasks, without any gradient updates. This emerged from pretraining on enough text; no one programmed it.

Example — zero-shot vs few-shot vs fine-tuning Fine-tuning (old approach): show the model 10,000 labelled examples of positive and negative movie reviews, update its weights, deploy a sentiment classifier. Requires data, compute, and a separate model per task. Zero-shot prompting: give the model only the instruction or the bare query, with no demonstrations at all. Few-shot prompting (GPT-3): write in the input “This film was brilliant: Positive. This film was dreadful: Negative. This film was a revelation:” and the model responds “Positive.” No training, no weight updates, no labelled data. The model infers the task from the examples in the prompt.
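Mechanically, a few-shot prompt is just a string. The sketch below assembles the movie-review prompt from the example; `build_few_shot_prompt` is a hypothetical helper, and sending the string to a model is deliberately left out because no particular API is assumed here.

```python
def build_few_shot_prompt(examples, query):
    """Concatenate (text, label) demonstrations, then the unlabelled query."""
    parts = [f"{text}: {label}." for text, label in examples]
    parts.append(f"{query}:")   # the model is expected to continue with a label
    return " ".join(parts)


examples = [
    ("This film was brilliant", "Positive"),
    ("This film was dreadful", "Negative"),
]

few_shot = build_few_shot_prompt(examples, "This film was a revelation")
zero_shot = build_few_shot_prompt([], "This film was a revelation")

print(few_shot)
print(zero_shot)
```

With an empty example list the same function yields a zero-shot prompt: the task is identical, but the model gets no demonstrations to infer the pattern from.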

Behind the qualitative jump was a quantitative pattern. Kaplan et al. (2020) [53] showed that LLM loss decreases as a clean power law in parameters, data, and compute; within the ranges they tested, scaling was predictable. Hoffmann et al. (2022) [54] sharpened this with Chinchilla, demonstrating that models published before 2022 (including GPT-3) were undertrained for their size: for a given compute budget, a smaller model trained on more tokens outperforms a larger model trained on fewer. By 2024–2026 the field had reorganised around these scaling laws, and the operative recipe shifted from “just make it bigger” to “optimise the data-to-parameter ratio, then add test-time compute.”
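A back-of-envelope calculation shows what "undertrained for their size" means. The sketch below relies on assumptions from outside this article: the widely used approximation that training costs about 6 FLOPs per parameter per token, the commonly reported training sizes (GPT-3: 175B parameters on roughly 300B tokens; Chinchilla: 70B parameters on roughly 1.4T tokens), and the rule of thumb of about 20 tokens per parameter that is often read off the Chinchilla results. Treat the outputs as rough orders of magnitude.

```python
import math


def training_flops(params, tokens):
    """Rule-of-thumb estimate: about 6 FLOPs per parameter per training token."""
    return 6 * params * tokens


def compute_optimal(flops, tokens_per_param=20):
    """Split a FLOP budget assuming tokens = tokens_per_param * params (a heuristic)."""
    params = math.sqrt(flops / (6 * tokens_per_param))
    return params, tokens_per_param * params


# Commonly reported figures; assumptions for this sketch.
gpt3 = {"params": 175e9, "tokens": 300e9}
chinchilla = {"params": 70e9, "tokens": 1.4e12}

for name, m in [("GPT-3", gpt3), ("Chinchilla", chinchilla)]:
    ratio = m["tokens"] / m["params"]
    print(f"{name}: {ratio:.1f} tokens per parameter")

# What the heuristic would have suggested for GPT-3's estimated budget.
budget = training_flops(gpt3["params"], gpt3["tokens"])
opt_params, opt_tokens = compute_optimal(budget)
print(f"{opt_params / 1e9:.0f}B parameters on {opt_tokens / 1e12:.2f}T tokens")
```

Under these assumptions GPT-3 saw fewer than two tokens per parameter, and the same budget would have favoured a model of roughly 50B parameters trained on about a trillion tokens.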

4.2 Phase transitions in capability

Wei et al. (2022) documented this systematically under the concept of emergent abilities [8]: capabilities that are absent in smaller models and appear sharply as model scale increases past a threshold. The paper documented over 100 such abilities: multi-step arithmetic, chain-of-thought reasoning, analogical reasoning — tasks that earlier models could not perform at any level, regardless of fine-tuning.

The “phase transition” framing has been contested directly. Schaeffer, Miranda & Koyejo (2023) [36] showed that many of Wei et al.’s original emergent-ability curves disappear under continuous metrics like Brier score — the apparent discontinuity is partly an artefact of all-or-nothing scoring. The debate is unresolved: some capabilities (modular arithmetic, certain logical operations) appear to be genuinely threshold-dependent even on continuous metrics; others smooth out completely. The pedagogically useful position is the dual one — scale produces qualitatively new behaviour, while the suddenness of that emergence depends on how we measure. The water-to-ice metaphor captures something real, but it overstates how clean the transitions are.

Seen through the semantic-space lens: at sufficient scale, the model’s high-dimensional space becomes rich enough to contain implicit representations of concepts like “arithmetic” or “logical inference” — not as explicit rules, but as geometric regularities that the model can exploit when prompted in the right way. This is not the same as a symbolic reasoning engine, but it is not mere pattern-matching either.

[Interactive figure: Scale vs capability (the emergence debate). Tasks shown: 3-digit arithmetic, chain-of-thought, analogical reasoning. A toggle switches the measurement metric.]
Caption: Same underlying smooth capability curve, two metrics. Under exact-match accuracy the curves show sharp jumps and look emergent: below the threshold the model "can't do" the task; just past it, suddenly it can. Multi-token correctness (every token has to be right) amplifies modest probability gains into apparent thresholds (Schaeffer, Miranda & Koyejo, 2023 [36]).
🔍 Deep dive — Is emergence real, or a metric artefact? The Schaeffer critique

Schaeffer, Miranda & Koyejo (2023) [36] argued that apparent discontinuities can be artefacts of the metrics used — with nonlinear metrics, smooth underlying capability curves look discontinuous. This debate is ongoing and pedagogically important: emergence suggests qualitative novelty; smooth extrapolation suggests continuity with earlier systems. The truth is likely mixed: some capabilities are genuinely threshold-dependent; others are metric artefacts.
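The metric-artefact argument can be made concrete with one line of arithmetic. A minimal sketch, assuming (for simplicity, not as a claim about real models) that each answer token is correct independently with the same probability, and using a hypothetical answer length of 10 tokens:

```python
def exact_match(per_token_accuracy, answer_length):
    """Probability that every token of the answer is right, assuming independence."""
    return per_token_accuracy ** answer_length


ANSWER_LENGTH = 10  # hypothetical number of tokens in the answer

for p in (0.50, 0.70, 0.80, 0.90, 0.95, 0.99):
    print(f"per-token accuracy {p:.2f} -> exact-match {exact_match(p, ANSWER_LENGTH):.3f}")
```

Per-token accuracy climbs smoothly from 0.50 to 0.99, yet exact-match accuracy stays near zero for most of that range and then rises steeply. The underlying capability is the same in both columns; only the scoring rule differs.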

4.3 From raw predictor to instruction-follower

GPT-3 (2020) demonstrated emergent few-shot abilities, but interacting with it required prompt-engineering skill — coaxing useful behaviour out of a model trained only to predict the next token of internet text. The qualitative leap most users felt with ChatGPT (November 2022) was not architectural; it was alignment. Ouyang et al. (2022) [39] combined instruction tuning with reinforcement learning from human feedback (RLHF): first, fine-tune on supervised (instruction, good-response) pairs; then, train a reward model on human comparison judgements; finally, use that reward model to RL-optimise the base model toward outputs humans prefer. The underlying GPT-3.5 was not substantially larger than GPT-3 — what changed was the objective. A raw next-token predictor had been turned into a system that tries to do what its user asked. Most of the visible 2022–2024 shift in how LLMs felt to use lives in this single change. Subsequent work has refined the recipe — direct preference optimisation, constitutional AI, RLAIF — but the central story holds: scale gave us the geometry; alignment gave us the interface.
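The reward-model step is usually trained with a pairwise comparison loss: given two responses and a human judgement of which is better, the loss is small when the preferred response receives the higher score. A minimal sketch, with plain floats standing in for the outputs of a hypothetical reward network:

```python
import math


def preference_loss(reward_chosen, reward_rejected):
    """Negative log-probability that the chosen response outranks the rejected one.

    Equivalent to -log(sigmoid(reward_chosen - reward_rejected)).
    """
    margin = reward_chosen - reward_rejected
    return math.log1p(math.exp(-margin))


print(preference_loss(2.0, 0.0))  # reward model agrees with the human: small loss
print(preference_loss(0.0, 0.0))  # reward model is indifferent: loss of log(2)
print(preference_loss(0.0, 2.0))  # reward model disagrees with the human: large loss
```

Only the difference between the two scores matters, so the reward model learns a ranking rather than an absolute scale. The reinforcement-learning stage then pushes the language model toward responses this ranking favours.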

5 Thinking tokens: teaching models to slow down

Early language models answered questions in a single forward pass: input goes in, output comes out, one word at a time with no explicit intermediate reasoning. This is fast, fluent, and often correct — but it fails on tasks that require careful, multi-step thinking. The model essentially has to “know” the answer before it starts writing.

Wei et al. (2022) showed that prompting a model to produce intermediate reasoning steps dramatically improved performance on multi-step problems [9]. In the simplest, zero-shot variant of the technique, instead of asking only “What is 23 × 47?”, you append “Let’s think step by step.” The model then generates: “23 × 40 = 920. 23 × 7 = 161. 920 + 161 = 1081.” Each step becomes part of the model’s context, and the model’s subsequent predictions are conditioned on the full chain of reasoning, including its own intermediate conclusions.

Why “let’s think step by step” actually works In a transformer, the only “working memory” that persists from one generated token to the next is the context window: the text that has been produced so far. When the model writes intermediate reasoning steps, those steps literally become part of the input for subsequent predictions. Writing “920 + 161” puts those numbers in the context; the next prediction is conditioned on them. The model does not have to hold the whole calculation inside a single forward pass: it externalises its computation into text, and then reads that text as input for the next step.
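The feedback loop described above can be sketched in a few lines. `toy_next_step` is a hypothetical, hand-scripted stand-in for a model call; it is not how a transformer computes anything. The only point is that each new step is chosen by looking at what is already in the context:

```python
def toy_next_step(context):
    """Scripted stand-in for a model: picks the next step from what the context already contains."""
    if "23 × 40 = 920" not in context:
        return "23 × 40 = 920."
    if "23 × 7 = 161" not in context:
        return "23 × 7 = 161."
    if "920 + 161" not in context:
        return "920 + 161 = 1081."
    return None  # nothing left to add


context = "What is 23 × 47? Let's think step by step."
while (step := toy_next_step(context)) is not None:
    context = context + " " + step  # the output becomes part of the next input

print(context)
```

The final addition is only possible because the two partial products were written into the context first. Remove the loop and the stand-in would have to produce 1081 with nothing to condition on, which is the single-pass situation described at the start of this section.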

OpenAI’s o1 models (2024) operationalised this at training time using reinforcement learning: the model was trained to generate extended reasoning traces before answering, learning over time which reasoning strategies lead to correct outcomes [10]. These “thinking tokens” can run to thousands of words of private scratchpad computation. The key insight is a separation between train-time scale (model size) and test-time compute (how much thinking to do per answer) — a qualitatively new scaling axis. Since 2024 this has become a general pattern, not an OpenAI-specific one: DeepSeek-R1’s open release (2025), Anthropic’s extended-thinking modes, Google’s Gemini-2.0 reasoning, and the broader RL-on-chain-of-thought paradigm all instantiate the same recipe.

From the semantic-space perspective: thinking tokens are intermediate waypoints. The model navigates through semantic space in multiple steps — each intermediate conclusion landing at a location that constrains where the next conclusion should be — rather than trying to jump from question to answer in one leap across a potentially very large distance in meaning-space.

6 RAG: connecting memory to the world

A language model’s knowledge is encoded in its weights — the billions of numerical parameters adjusted during training. This is parametric memory: implicit, compressed, and frozen at training time. It has two structural weaknesses. It cannot be updated without retraining. And because knowledge is distributed across billions of parameters with no explicit index, the model cannot reliably attribute its answers to specific sources — a core driver of hallucination.

Lewis et al. (2020) proposed Retrieval-Augmented Generation (RAG) [11] as a complementary architecture: combine the language model with an external document store. At inference time, a query retrieves relevant documents, which are placed into the model’s context window alongside the question. The model generates its answer conditioned on both its parametric knowledge and the retrieved evidence.

Example — RAG in a practical setting A law firm deploys an AI assistant over its case files. Without RAG, the model would try to answer questions about specific cases from its training data — which doesn’t include confidential case files, and which may be out of date. With RAG: the user asks “What was the outcome of the Schmidt vs. Hofmann case?”, a retrieval step fetches the relevant case documents, these are added to the model’s context, and the model generates an answer grounded in the actual documents. The firm’s lawyers can check the retrieved documents directly.
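The retrieve-then-generate flow can be sketched without any model at all, because the RAG-specific part is just retrieval plus prompt assembly. In this minimal sketch the retriever is a toy word-overlap ranker (real systems typically use dense embeddings or a search index), the documents are invented placeholders, and the function names are hypothetical:

```python
import re


def tokenize(text):
    """Lower-case the text and return its set of alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(query, documents, k=1):
    """Rank documents by word overlap with the query (toy stand-in for a real retriever)."""
    query_words = tokenize(query)
    ranked = sorted(documents, key=lambda d: len(query_words & tokenize(d)), reverse=True)
    return ranked[:k]


def build_prompt(query, retrieved):
    """Place the retrieved text in the context window ahead of the question."""
    sources = "\n".join(f"[{i + 1}] {doc}" for i, doc in enumerate(retrieved))
    return f"Answer using only the sources below.\n{sources}\nQuestion: {query}\nAnswer:"


documents = [  # invented placeholder documents
    "Office memo: the kitchen is closed for renovation.",
    "Case file Schmidt vs. Hofmann: placeholder text describing the outcome.",
    "Newsletter: three new associates joined the firm.",
]
query = "What was the outcome of the Schmidt vs. Hofmann case?"
top = retrieve(query, documents, k=1)
prompt = build_prompt(query, top)
print(prompt)
```

The numbered sources in the prompt are what make the answer checkable: a reader can compare the generated text against source [1] directly, and updating the knowledge base means editing `documents`, not retraining anything.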

RAG substantially reduces hallucination on knowledge-intensive tasks because the answer can be grounded in retrieved, verifiable text. It also allows the knowledge base to be updated without touching the model weights — significant in domains where information changes rapidly. The distinction between parametric memory (what the model “knows”) and non-parametric memory (what it can look up) is one of the most pedagogically useful concepts in applied AI today.

Note for educators RAG is a powerful teaching example because it makes explicit what is otherwise opaque: the distinction between a model generating from its own internal representations versus being grounded by external sources. Students can inspect the retrieved documents, trace why an answer went wrong (wrong retrieval? model ignored the evidence?), and reason about interventions. It concretises the abstract idea of “where does the model’s knowledge come from.”

7 Debunking the stochastic parrot

7.1 Where the parrot critique lands hardest

Before defending modern language models against the parrot critique, it is worth asking honestly: where on the history we have just traced does the critique actually land? The accusation is that the model “haphazardly stitches together sequences of linguistic forms … without any reference to meaning.” Read carefully, this is not a description of GPT-4. It is a remarkably precise description of the methods that came before.

Consider an n-gram model. It assigns probabilities to word sequences based on raw co-occurrence counts. It has no representation of meaning at all — words are opaque symbols whose only property is how often they appear next to other opaque symbols. When an n-gram model generates text, it is literally doing what Bender’s quote describes: sampling sequences according to probabilistic information about how they combine, with no reference whatsoever to anything outside the corpus. The parrot accusation is not just apt here; it is a textbook description of the mechanism.
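A bigram model (the n = 2 case) fits in a dozen lines, which makes the point vivid. The sketch below is a toy trained on an invented three-sentence corpus; the function names are illustrative:

```python
import random
from collections import Counter, defaultdict


def train_bigram(tokens):
    """Count how often each token follows each other token."""
    counts = defaultdict(Counter)
    for prev, nxt in zip(tokens, tokens[1:]):
        counts[prev][nxt] += 1
    return counts


def sample_next(counts, prev, rng):
    """Draw the next token in proportion to its co-occurrence count."""
    followers = counts[prev]
    words = list(followers)
    weights = [followers[w] for w in words]
    return rng.choices(words, weights=weights, k=1)[0]


def generate(counts, start, length, rng):
    out = [start]
    for _ in range(length):
        out.append(sample_next(counts, out[-1], rng))
    return out


corpus = "i went to the bank . the river bank flooded . the bank closed .".split()
model = train_bigram(corpus)

print(dict(model["the"]))   # what follows "the", with counts
print(dict(model["bank"]))  # one entry for "bank", whatever the sense
print(" ".join(generate(model, "the", 6, random.Random(0))))
```

The whole "model" is a table of counts keyed by opaque strings. The financial bank and the river bank share a single key, and nothing in the data structure could tell them apart.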

Phrase-based statistical machine translation is the same story applied to a useful task. Before neural MT — through the 2000s and into the early 2010s — production translation systems (the IBM models, Moses, Google Translate in its first incarnation) worked by table lookup: large parallel corpora were aligned word-by-word and phrase-by-phrase, producing a translation table that scored each source phrase against candidate target phrases. At decode time, the system searched for the highest-scoring combination of phrase translations, regularised by an n-gram language model on the target side. No representation of meaning entered the pipeline — only co-occurrence counts on the source and on the target. Bank in “I deposited money at the bank” and bank in “the river bank flooded” were the same source token, scored against the same translation candidates; if the surrounding bigram happened to be in the phrase table, the system would disambiguate, otherwise it would default to the more frequent gloss. The famously poor pre-2016 machine-translation output is, in retrospect, exactly what a stochastic parrot of language pairs should produce: locally fluent, globally adrift, and oblivious to whether bank refers to a financial institution or to mud at the river’s edge. The shift to neural MT — Bahdanau, Cho & Bengio (2015) [6] — was the first method in which the representation of bank could be influenced by money or river in the same sentence, regardless of whether the surrounding phrase had been seen verbatim in training. That is the move from form-with-statistics to form-conditioned-on-context, and it is exactly where the parrot critique starts to come apart.

A symbolic AI system looks superficially different — it manipulates explicit rules — but a structurally analogous critique applies. The system has no meaning of its own; it has the meanings that its human authors transcribed into it. MYCIN does not understand infection; it executes rules that someone who understood infection wrote down. Ask it about a case its authors did not anticipate and it fails, because the meaning it manipulates is borrowed, not its own. This is an extension of the parrot metaphor by analogy — the original critique is specifically about statistical reproduction of token sequences, and one could reasonably object that any programmed system trivially “parrots” its programmer. The structural point worth keeping is that every method on this list — n-grams, SMT, symbolic AI, static embeddings, and modern transformers — reproduces something it was given. What separates them is whether the system forms a meaning of its own from what it reproduces.

Even static word embeddings like Word2Vec do not fully escape this. They build a richer representation than n-grams — the geometry of the embedding space encodes semantic relationships — but they do so by averaging over all the contexts in which a word has been seen. The result is a single, fixed location per word, deaf to the specific sentence in which the word appears. “Bank” sits at one point in space whether you are talking about loans or rivers. The representation reflects training-data statistics but does not respond to the meaning of the current input. It is meaning frozen at training time, not meaning produced by the model.
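The difference between a fixed location and a context-sensitive one can be shown with a toy. The 2-D vectors below are invented for illustration, and `context_mixed_vector` is a deliberately crude stand-in for attention (a fixed 50/50 average with the other known words), not the real mechanism:

```python
STATIC = {  # hypothetical 2-D vectors, invented for illustration
    "bank": (0.5, 0.5),
    "money": (1.0, 0.0),
    "river": (0.0, 1.0),
}


def static_vector(word, sentence):
    """Static lookup: the sentence is ignored by design."""
    return STATIC[word]


def context_mixed_vector(word, sentence):
    """Blend the word's own vector with the mean of the other known words in the sentence."""
    own = STATIC[word]
    others = [STATIC[w] for w in sentence if w in STATIC and w != word]
    if not others:
        return own
    mean = tuple(sum(v[i] for v in others) / len(others) for i in range(2))
    return tuple(0.5 * own[i] + 0.5 * mean[i] for i in range(2))


loan = "i deposited money at the bank".split()
flood = "the river bank flooded".split()

print(static_vector("bank", loan), static_vector("bank", flood))
print(context_mixed_vector("bank", loan), context_mixed_vector("bank", flood))
```

The static lookup returns the same point for both sentences. The mixed version moves "bank" toward "money" in one sentence and toward "river" in the other, which is the qualitative change that attention introduced, here reduced to its simplest possible form.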

The spectrum of meaning across methods The history we have traced is, in part, a story of each generation of methods reducing the gap between “what the system manipulates” and “what the words mean.” Symbolic AI manipulated borrowed meanings. N-grams manipulated nothing but symbols. Word embeddings introduced a geometric proxy for meaning. Attention made that geometry context-sensitive. Scale gave the geometry enough resolution to support generalisation. Each step weakened the parrot critique. The question for modern LLMs is whether the geometry has become rich enough, and the context-sensitivity sharp enough, that the critique no longer lands — or whether it merely lands less hard.

This reframing matters because the parrot critique was launched against transformer language models specifically, but its sharpest form actually applies most clearly to the methods that preceded them. The interesting empirical question is not whether some machine learning is parrot-like — for many older methods, this is uncontroversially true — but whether transformers at scale have crossed a threshold that earlier methods did not. That is the question we examine next.

7.2 The parrot thesis, fairly stated

In 2021, Bender, Gebru, McMillan-Major and “Shmargaret Shmitchell” (a pseudonym Margaret Mitchell adopted following her departure from Google’s Ethical AI team) published “On the Dangers of Stochastic Parrots: Can Language Models Be Too Big?” [12] — one of the most influential and most contested papers in AI ethics. Separating the technical claim from the paper’s legitimate concerns about environmental cost, data bias, and labour exploitation, the core argument is this:

The parrot claim (Bender et al., 2021) “A language model is a system for haphazardly stitching together sequences of linguistic forms it has observed in its vast training data, according to probabilistic information about how they combine, but without any reference to meaning.” The model manipulates form — sequences of tokens — without meaning — any connection to the world. It is a stochastic parrot: statistically impressive, semantically empty.

This argument has roots in Harnad’s (1990) symbol grounding problem [35], sharpened recently by Bender and Koller (2020) [13]: symbols acquire meaning only through embodied, causal contact with the world, not through relations to other symbols. A dictionary defines words using other words; on this view, a system trained only on text is trapped inside a hall of mirrors. (Frontier models since 2023 are no longer purely text-internal — they are trained on images, audio, and increasingly video alongside text — which addresses one form of the grounding objection while leaving others intact; we return to this in §7.6.)

Bender & Koller’s [13] cleanest illustration is the octopus thought experiment. A hyperintelligent octopus eavesdrops on the telegraph wire between two humans stranded on separate islands. After enough exposure it learns to produce statistically convincing text in their language. But when one of the humans describes a newly built coconut catapult, or urgently needs to know how to fend off a bear, can the octopus help? Bender and Koller argue no: without ever having seen a coconut, a catapult, or a bear, the octopus has no purchase on what the words refer to. On their account the form/meaning gap is constitutive and cannot be closed by more data of the same kind. (Multimodal models, which take pixels and audio alongside tokens, complicate this thought experiment substantially; we return to that in §7.6.)

This is a coherent philosophical position and should be taken seriously. But as an empirical claim about what large transformer models actually do, it has become increasingly difficult to sustain. The evidence is worth examining carefully.

7.3 The evidence against pure parroting

Generalisation beyond training distribution. A pure parrot can only reproduce patterns it has observed. But large language models demonstrably generalise to inputs that are, by construction, outside their training data. Take the Uniform Bar Examination. GPT-4 reportedly scored at the 90th percentile on first release [14]; Martínez (2024) [45] showed this was an artefact of comparing the model against a pool heavily weighted toward repeat test-takers, and the corrected estimate against first-time takers is closer to the 63rd percentile (around the 48th against actual passers, lower still on the essay sections). Even at the 63rd, far below the original headline, the result remains striking: the exam contains questions the model could not have seen verbatim in training, and pure pattern repetition has no obvious mechanism to produce coherent novel essays on legal hypotheticals. The contested headline number is less important than the underlying fact that the model generalises beyond its training distribution at all. The MATH benchmark [31] tells a similar story. It consists of competition mathematics drawn from AMC, AIME, HMMT, and Putnam (not, despite the common shorthand, IMO-level olympiad). GPT-4 itself scored 42.2% [14], already a substantial leap above the era’s baselines, and frontier reasoning models with extended test-time compute (OpenAI’s o1 [10], Anthropic’s extended-thinking modes, DeepSeek-R1) now exceed 90%. The specific questions had not been seen in training. Handling novel instances of these tasks requires some form of structural generalisation, not just retrieval.

Example — the Winograd schema test “The city councillors refused the demonstrators a permit because they feared violence. Who feared violence?” The answer (“the councillors”) requires understanding that it is plausible for authorities to fear violence from demonstrators, not the reverse — knowledge about the world, not about word sequences. GPT-4 handles these correctly at near-human rates. A pure statistical parrot, associating “they” with the nearest plural noun, would systematically fail.

Emergent internal world models. Li et al. (2022) trained a language model purely on sequences of Othello game moves — no game states, just move notations [15]. They then probed the model’s internal activations and found representations of the game board — initially decoded with nonlinear probes [15], and later shown by Nanda et al. (2023) [37] to be linearly decodable under a “mine vs. theirs” encoding (rather than the original paper’s “black vs. white”). The model had built an internal map of which squares were occupied and by whom, with no supervision on board states. This is a direct counterexample to the claim that LLMs represent only linguistic form. From text-like inputs, a structural understanding of the underlying system emerged.

Mechanistic interpretability evidence. Anthropic’s interpretability research on Claude identified computational circuits — specific subgraphs of the attention and feedforward layers — that implement identifiable operations: fact lookup, name-binding between subject and predicate, multi-step planning in which intermediate conclusions are internally represented before being expressed [20][32][33]. The model is not haphazardly stitching; specific computations correspond to specific internal representations.

Mathematical reasoning. Recent work documents language-model-based systems generating coherent proofs of novel mathematical problems. Trinh et al.’s AlphaGeometry (2024), which pairs a language model with a symbolic deduction engine, solves olympiad geometry problems at a near-medal level without human demonstrations [34]. Mathematical proof is perhaps the hardest test for the parrot: the answer cannot be retrieved from memory, because the problem is new; it must be derived. Generating valid proofs requires, at minimum, something functionally equivalent to understanding axioms and logical inference.

Context-directed extrapolation. Madabushi et al. (2025) propose a precise framing: LLMs perform “context-directed extrapolation from training data priors” [16] — a mechanism that substantially exceeds statistical pattern repetition. The model infers from context which part of its learned structure is relevant, then extrapolates beyond the specific examples seen. This is not human reasoning, but it is not parroting either.

7.4 Prediction, planning, and the bounded role of randomness

There is one technical fact in the parrot argument that does survive scrutiny: a language model genuinely does produce its output one token at a time, and each token is, mechanically, drawn from a probability distribution over the vocabulary. In that narrow sense, the model is “picking the most likely next word”, exactly what a stochastic parrot would do. (Worth noting that production systems often run at temperature 0, where decoding is greedy and there is no sampling at all, and yet the model still produces structured, context-appropriate output. That alone refutes the strict reading of “stochastic” in the parrot label; the substantive question is not whether sampling happens but whether the region being sampled from is meaningful. LLMs are, more deeply, learned approximators of conditional distributions over text; the “stochastic” word is doing a lot of metaphorical work in the parrot critique.) Why, then, does the broader picture not reduce the whole enterprise to elaborate autocomplete?

Two facts answer this — one about planning, one about geometry.

The model plans further ahead than the next token. A 2025 Anthropic interpretability study on Claude 3.5 Haiku specifically — “On the Biology of a Large Language Model” [20] — probed the model while it wrote a rhyming couplet. Given “He saw a carrot and had to grab it,” the model produced “His hunger was like a starving rabbit.” The natural assumption is that the model writes word by word and only at the end scrambles to find a rhyme. The internal traces showed something quite different: before generating any word of the second line, the model had already activated representations of candidate end-words that rhyme with “grab it” — “rabbit” among them — and then composed the entire line to land at the planned word. When the researchers intervened to suppress the “rabbit” representation and inject “habit” instead, the model adapted and produced a sensible line ending in “habit.” This is interpretability evidence consistent with limited forms of multi-token planning — the next-token mechanism remains the output channel, but on the authors’ reading it is serving a longer plan formed before generation begins. The strong reading (“goal-directed generation”) is theirs; replication and extension to other models is ongoing.

[Interactive figure: Multi-token planning (rhyming couplet). Line 1 of the prompt is "He saw a carrot and had to grab it,". Before generating any token of line 2, the model has already pre-activated candidate end-words that rhyme with "grab it"; the reader can intervene on the planned end-word and watch the line reroute. Default line 2, composed to land at the planned word: "His hunger was like a starving rabbit."]
Caption: The next-token mechanism is the output channel; the planning lives in the hidden state, many tokens earlier. Suppress the planned target and the model reroutes, gracefully, to a different but still rhyming word. Activations and lines are illustrative, drawn after Anthropic, On the Biology of a Large Language Model, 2025.

What this means in plain language Imagine writing a sentence with a target word in mind for the end. You do not write each word in isolation — you write each word so that the sentence as a whole arrives at the target naturally. The model does something analogous. The token-by-token prediction is the output mechanism; the model’s internal representations are already conditioned on where the sentence is going. Predicting the next token is the surface; the structure that produces it spans many tokens at once.

The semantic-space framing dissolves much of the randomness concern. Even when the model does sample stochastically from the next-token distribution, the consequences of that sampling are usually bounded. Here is why. The model’s hidden state at each generation step encodes a direction in semantic space — a region of meaning where the output should land. The next-token distribution is, in effect, a probability cloud over words that occupy that region. The randomness chooses which word from the region, but the region itself is determined by everything that has come before. Whether the model outputs “happy,” “joyful,” or “elated” at a particular position usually does not change the meaning of the sentence — these words occupy roughly the same neighbourhood in semantic space. The choice is locally arbitrary but globally directed.
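The "probability cloud over a region" picture corresponds to a softmax over next-token scores. A minimal sketch, with invented scores for four candidate words (three near-synonyms and one unrelated word); the numbers are hypothetical, chosen only to show the shape of the distribution:

```python
import math
import random


def softmax(logits, temperature=1.0):
    """Turn raw scores into probabilities; lower temperature sharpens the distribution."""
    scaled = [x / temperature for x in logits]
    top = max(scaled)
    exps = [math.exp(x - top) for x in scaled]  # subtract the max for numerical stability
    total = sum(exps)
    return [e / total for e in exps]


def pick(words, logits, temperature, rng):
    """Greedy choice at temperature 0, otherwise sample from the softmax."""
    if temperature == 0:
        return words[logits.index(max(logits))]
    probs = softmax(logits, temperature)
    return rng.choices(words, weights=probs, k=1)[0]


words = ["happy", "joyful", "elated", "carburettor"]
logits = [2.0, 1.8, 1.5, -6.0]  # hypothetical scores produced by the preceding context

for word, p in zip(words, softmax(logits)):
    print(f"{word:12s} {p:.4f}")

rng = random.Random(0)
print([pick(words, logits, 1.0, rng) for _ in range(5)])
print(pick(words, logits, 0, rng))
```

Sampling decides which of the three near-synonyms appears, while the context-determined scores have already made the unrelated word vanishingly unlikely. At temperature 0 the choice is the single highest-scoring word, with no randomness involved.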

This is the key insight: the stochastic parrot critique pictures the model as essentially drawing words from a hat weighted by training-data frequency. The reality is closer to drawing from a hat that has been placed in a specific location in semantic space by the entire preceding context. The randomness operates at the level of word-choice within a meaning-region; the meaning-region is selected non-randomly, by the model’s deep representation of what should be said next. This is why the same prompt to the same model produces outputs that differ in wording but generally agree in substance.

A caveat on the metaphor. “One navigable semantic space” is a useful simplification of three things that, mechanically, are not so simple. First, a transformer’s residual stream is not a single Euclidean space: different layers, heads, and even individual neurons read from and write to different subspaces, and meaning is distributed across them rather than living at one tidy address. Second, tokens are not words: common words are usually single tokens, while rarer words are split into several subword pieces (for example by byte-pair encoding); the geometry described in this article is over tokens, and the alignment between tokens and concepts is imperfect. Third, the navigation metaphor obscures a peculiarity of LLM generation: the model is writing the path it is taking as it takes it, so the trajectory is also the output, and the act of writing reshapes what comes next. The picture in this article is faithful as far as it goes; the actual mechanism is higher-dimensional, more distributed, and stranger than any 2D figure can show. In some failure modes (long-tail facts, adversarial prompts, off-distribution domains) the projection breaks down, and low-probability paths diverge into different meanings, not just different wordings.

The reconciliation The parrot critique is technically correct that prediction is word-by-word and probabilistic. It is empirically wrong that this makes the model semantically empty. The probabilistic step is the final, local choice within a region that the model’s deeper representations have already navigated to. Planning operates over many tokens at once; the next-token mechanism merely executes that plan one word at a time. The randomness is a thin layer of variation on top of a deeply structured semantic process.

7.5 When the parrot does appear: hallucination reconsidered

If LLMs are not stochastic parrots in general, there is a precise regime in which they behave as if they were. Hallucination — the generation of fluent, plausible text that is factually incorrect — is best understood as exactly this: the model defaulting to high-probability completions in its training distribution when its parametric memory is absent or ambiguous on the queried fact.

Ask a large language model to name the population of a small Swiss municipality it has never encountered, and it will generate a number that sounds plausible — drawn from the distribution of how population figures are expressed — without any grounding in the actual fact. In this moment, the stochastic parrot characterisation is accurate. The model is producing statistically coherent text without reference to meaning.

Connecting this to the previous subsection: hallucination is what happens when the semantic-space direction is under-determined. When the model has been pulled by the context into a region of semantic space where its representations are sparse — because the relevant facts were not in training, or were ambiguous — the next-token distribution is correspondingly diffuse. The local randomness is no longer cushioned by a strong directional signal. In that specific gap, the parrot-like failure mode appears as a localised regime — not as a general description of how the model otherwise operates. The semantic-space buffer is real, but it is only as strong as the model’s representational density in the region being queried.
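One way to put a number on "diffuse" is the Shannon entropy of the next-token distribution. This is an illustrative choice of measure, not one the cited work prescribes, and the two distributions below are invented:

```python
import math


def entropy_bits(probs):
    """Shannon entropy in bits; higher means the probability mass is spread more evenly."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


peaked = [0.97, 0.01, 0.01, 0.01]   # hypothetical: the context pins the answer down
diffuse = [0.25, 0.25, 0.25, 0.25]  # hypothetical: four candidates look equally plausible

print(f"peaked:  {entropy_bits(peaked):.2f} bits")
print(f"diffuse: {entropy_bits(diffuse):.2f} bits")
```

In the peaked case sampling almost always returns the same token, so randomness barely matters. In the diffuse case every draw is close to a coin toss among candidates, and if those candidates are different population figures rather than synonyms, each draw yields a different "fact".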

Hallucination as localised parrot behaviour The parrot characterisation is most accurate not as a description of transformer models in general, but as a description of a specific failure mode: the model falling back on distributional priors when its parametric memory is insufficient. This is why RAG and chain-of-thought are effective mitigations — RAG supplies the missing factual grounding; chain-of-thought forces the model to make its reasoning explicit and therefore checkable. The parrot appears specifically in the gap between what the model was trained on and what it is being asked to retrieve.

🔍 Deep dive — Three types of hallucination, three different fixes

Three types of hallucination correspond to three distinct failure modes [17]. Fact-conflicting hallucinations arise when generated claims contradict world knowledge — the closest to pure parrot behaviour. Context-conflicting hallucinations occur when the model loses consistency within a long conversation. Input-conflicting hallucinations arise when the model’s output deviates from what the query explicitly specified. Each has a different cause and a different mitigation:

| Type | Mechanism | Mitigation |
| --- | --- | --- |
| Fact-conflicting | Sparse parametric memory; falls back on high-probability prior | Retrieval-augmented generation; ground in retrieved sources; verify against authoritative data |
| Context-conflicting | Long-context drift; conversational state degrades over many turns | Shorter contexts; summarise running state; checkpoint key facts; ground recurring entities |
| Input-conflicting | Instruction-following failure; the model misinterprets or ignores the query | Prompt clarification; chain-of-thought; constrained or structured output |

7.6 The middle ground

The most defensible current position is neither “pure parrot” nor “general intelligence.” LLMs exhibit predictable, controllable capabilities that substantially exceed statistical pattern repetition — they generalise, they build internal representations, they reason when given the right scaffolding — but they do not exhibit robust general cognition of the human kind, and they have specific, systematic failure modes.

| Claim | Under the parrot view | What the evidence shows |
| --- | --- | --- |
| Only reproduces training patterns | Claimed | Generalises out-of-distribution on exams, proofs, novel problems [14][34] |
| No internal world model | Claimed | Builds internal world models (Othello board state) [15][37] |
| Cannot multi-step reason | Claimed | Reasons multi-step with chain-of-thought scaffolding; struggles on some formal tasks |
| Semantically empty operations | Claimed | Implements identifiable circuits (fact lookup, name-binding, inference) [20][32] |
| Greedy next-token prediction with no plan | Claimed | Plans multi-token (rhyming-couplet circuits in Claude 3.5 Haiku) [20] |
| Hallucination as default | Claimed | Hallucinates specifically when parametric memory is sparse [17] |

This matters practically. The parrot framing underestimates LLMs and leads to misplaced complacency (“it’s just autocomplete — it can’t do anything important”). The AGI framing overestimates them and leads to either misplaced fear or uncritical deference. The honest middle ground is: these systems perform sophisticated semantic operations over learned representations, they generalise in ways that matter, and they fail in predictable ways that we can understand and partially mitigate. One genuine empirical concession survives: LLMs do verbatim-memorise non-trivial fractions of their training data — Carlini et al. (2021) [46] and Nasr et al. (2023) [47] demonstrated extractable memorisation in production models. Memorisation and generalisation coexist; this is the strongest form of the parrot critique that holds up empirically, though it does not subsume the rest. Worth noting: humans confabulate in much the same way when memory is sparse — fluent plausible-sounding text in the absence of grounding. The parrot critique is sometimes applied to language models under a stricter standard than we hold ourselves to.

On multimodality. The symbol-grounding objection (Harnad [35]) was that text-only systems are trapped in a hall of mirrors — symbols defined by other symbols, with no causal contact to the world. RAG is one partial reply (the model is grounded in text retrieved from the world). Multimodal frontier models offer a stronger one: vision-language and audio architectures process pixels and waveforms alongside tokens, so the symbols are now indexed against perceptual representations of actual referents. The pure-text framing of the parrot debate is increasingly historical; the live question is shifting from “can a text-only model mean anything?” to “what kind of grounding does a multimodal model actually achieve?”

Bender’s own position, in her own words. It is worth being precise about what Bender herself currently argues — and what this article therefore is and is not engaging. In her 2026 post “Stochastic Parrots: Frequently Unasked Questions” [56], she clarifies that “the target of my criticism is not the models” and that the 2021 paper focused on the risks of ever-larger language models, not on synthetic text generation per se (which post-dated ChatGPT). Her current concerns are sociotechnical: “data theft”, “exploitative labor practices”, “complete disregard for environmental impact”, and “the astonishing willingness of so many to surrender their own power and turn to synthetic text (for which no one is accountable) for all kinds of weighty decisions.” On the technical form-vs-meaning argument from Bender & Koller [13], she explicitly stands by it: models, she maintains, only ever have access to linguistic form and cannot map from language to anything outside language. She calls these systems “synthetic text extruding machines.” None of this article’s evidence in §7.3 — Othello world models, mechanistic-interpretability circuits, multi-token planning — would move Bender, who would read those as elaborate form-shuffling. The debunking in this article is therefore of one specific reading of the parrot critique (the strong, mechanism-level form-only claim, which is testable against §7.3’s evidence). That reading is not the only one available, and it is not the one Bender herself most defends today.

What would change my mind. The position above — that LLMs do something meaningfully more than parrot but meaningfully less than human cognition — is held with moderate confidence. Three findings would push it sharply toward the parrot end. First: systematic failure on a benchmark of genuinely novel symbolic tasks designed to share no surface features with training, with no compensating improvement at scale. Second: interpretability work showing that the “planning” circuits documented in Lindsey et al. [20] are post-hoc rationalisations rather than causally upstream of generation. Third: a demonstration that the apparent OOD-generalisation gains (MATH, bar exam, AlphaGeometry) collapse under the kind of contamination-aware re-testing that has begun to surface for several benchmarks. None of these has happened. If they do, the article above should be read as out of date.

8 Implications for educators and practitioners

The semantic space framing is teachable. The single most powerful reframe for a non-technical audience is to describe language models as navigators of a meaning-space, not rule-followers. Words are locations; context shifts those locations; attention is the mechanism of shifting; the model generates by navigating toward locations that are consistent with everything it has read so far. This is accessible, accurate, and immediately illuminates both what the model does well (nuanced contextual understanding) and what it does badly (facts it was never near in training).
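The "words are locations, context shifts those locations" picture can be made concrete with a toy sketch. The code below is an illustration under loud assumptions, not how a production transformer is implemented: the 2-D coordinates are invented, there are no learned query/key/value projections, and there is only a single step instead of dozens of layers. It shows only the core idea that an ambiguous word's vector moves toward whatever it attends to.

```python
import math

# Toy 2-D "meaning space". The coordinates are invented for illustration:
# axis 0 stands loosely for "nature", axis 1 for "finance".
EMBED = {
    "bank": [0.5, 0.5],   # ambiguous: halfway between the two senses
    "river": [1.0, 0.0],
    "loan": [0.0, 1.0],
}


def softmax(scores):
    """Turn raw similarity scores into weights that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def contextualise(word, context, sharpness=4.0):
    """One simplified attention step (no learned projections).

    The word's new vector is a similarity-weighted average of its own
    vector and the vectors of the context words.
    """
    query = EMBED[word]
    tokens = [word] + list(context)
    weights = softmax([sharpness * dot(query, EMBED[t]) for t in tokens])
    dims = len(query)
    return [sum(w * EMBED[t][i] for w, t in zip(weights, tokens))
            for i in range(dims)]


if __name__ == "__main__":
    print(contextualise("bank", ["river"]))  # [0.75, 0.25]
    print(contextualise("bank", ["loan"]))   # [0.25, 0.75]
```

The same starting vector for "bank" ends up nearer "river" in one context and nearer "loan" in the other, which is the classroom-sized version of "context shifts the location".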

Hallucination is not random noise — and therefore it is addressable. Understanding hallucination as the regime where the model falls back on distributional priors — the parrot regime — gives practitioners a framework for intervention. Retrieve grounding facts (RAG). Force explicit reasoning (chain-of-thought). Verify outputs against sources. Each of these targets a specific mechanism. Telling users “it sometimes makes things up” is less useful than explaining when and why, and what can be done about it.

The parrot regime is where production RAG fails. The most consequential failure in deployed systems is not the model hallucinating in the abstract — it is the model defaulting to its parametric priors when retrieval misses or is silently incomplete. The practical pattern that follows from §7.5 is concrete: instrument retrieval coverage, surface the retrieved evidence to users for verification, and design prompts that fail loudly when grounding is absent rather than confidently confabulating. The sharpened parrot critique is the most useful diagnostic frame practitioners can carry into production debugging.
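One way to read "fail loudly when grounding is absent" is as a gate in front of the prompt builder. The following is a minimal sketch under stated assumptions: `Passage`, `GroundingError`, `build_grounded_prompt`, the `[0, 1]` score range and the `min_score` threshold are all hypothetical names and values chosen for illustration, not part of any particular retrieval library.

```python
from dataclasses import dataclass


@dataclass
class Passage:
    source: str
    text: str
    score: float  # retriever similarity; assumed here to lie in [0, 1]


class GroundingError(RuntimeError):
    """Raised when retrieval is too weak to ground an answer."""


def build_grounded_prompt(question, passages, min_score=0.5, min_passages=1):
    """Return (prompt, usable_passages), or raise if grounding is absent.

    The usable passages are returned alongside the prompt so the caller
    can show them to the user for verification and log retrieval coverage.
    """
    usable = [p for p in passages if p.score >= min_score]
    if len(usable) < min_passages:
        # Fail loudly instead of letting the model answer from its priors.
        raise GroundingError(
            f"only {len(usable)} passage(s) scored >= {min_score}"
        )
    evidence = "\n".join(
        f"[{i}] ({p.source}) {p.text}" for i, p in enumerate(usable, start=1)
    )
    prompt = (
        "Answer using only the numbered evidence below and cite it by number. "
        "If the evidence does not contain the answer, reply exactly: "
        "INSUFFICIENT EVIDENCE.\n\n"
        f"Evidence:\n{evidence}\n\nQuestion: {question}"
    )
    return prompt, usable
```

The threshold value is arbitrary and would need calibrating against a real retriever; the design point is only that a retrieval miss becomes a visible exception and a countable event, not a fluent answer.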

Chain-of-thought as a classroom activity. Ask students to give the same complex question to a language model twice: once directly, and once with “think step by step” appended. Compare the outputs. The difference in quality on reasoning tasks is often dramatic and immediately convincing. Inspect the steps: which are correct? Where does the reasoning break down? This is a live, hands-on demonstration of the difference between parametric retrieval and active reasoning.
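For classes that can run a few lines of code, the comparison can be scripted so that every student sends exactly the same pair of prompts. This is a sketch only: `ask_model` is a hypothetical placeholder for whatever function sends a prompt to the model used in class and returns its reply as a string; no specific vendor API is assumed.

```python
from typing import Callable, Dict

COT_SUFFIX = "Think step by step."


def compare_direct_and_cot(
    question: str, ask_model: Callable[[str], str]
) -> Dict[str, Dict[str, str]]:
    """Ask the same question twice: once as-is, once with the suffix appended."""
    direct_prompt = question
    cot_prompt = f"{question}\n\n{COT_SUFFIX}"
    return {
        "direct": {"prompt": direct_prompt, "answer": ask_model(direct_prompt)},
        "cot": {"prompt": cot_prompt, "answer": ask_model(cot_prompt)},
    }


if __name__ == "__main__":
    # Stand-in model so the script runs offline; replace with a real client.
    def echo_model(prompt: str) -> str:
        return f"(model reply to {len(prompt)} characters of prompt)"

    result = compare_direct_and_cot(
        "A train leaves at 09:40 and arrives at 13:15. How long is the trip?",
        echo_model,
    )
    for mode, record in result.items():
        print(mode, "->", record["answer"])
```

Keeping both prompts in the returned record makes it easy for students to inspect exactly what changed between the two runs before they start judging the reasoning steps.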

The Winograd schema as a discussion anchor. Sentences like the city council example above are designed specifically to test whether a system has world knowledge beyond word statistics. Give a class a set of Winograd schemas; ask them to predict which ones a language model will get right and which it will fail on. Then test it. The results often surprise students in both directions — failures on seemingly simple sentences, successes on apparently hard ones — and the discussion of why is rich.
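The predict-then-test exercise can be recorded with a small tally, sketched below. The sentence pair is the classic councilmen formulation from the Winograd Schema Challenge literature [51], which may differ in wording from the example earlier in this article; `Schema`, `SCHEMAS` and `tally` are hypothetical names introduced here for illustration.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Schema:
    sentence: str
    candidates: Tuple[str, str]
    answer: str  # the referent a human reader picks for "they"


SCHEMAS = [
    Schema(
        "The city councilmen refused the demonstrators a permit "
        "because they feared violence.",
        ("councilmen", "demonstrators"),
        "councilmen",
    ),
    Schema(
        "The city councilmen refused the demonstrators a permit "
        "because they advocated violence.",
        ("councilmen", "demonstrators"),
        "demonstrators",
    ),
]


def tally(schemas, model_answers, class_predictions):
    """Compare the model's choices with what the class expected.

    model_answers[i] is the referent the model chose for schema i;
    class_predictions[i] is True if the class predicted the model
    would get schema i right.
    """
    rows = []
    for schema, chosen, predicted_ok in zip(
        schemas, model_answers, class_predictions
    ):
        model_correct = chosen == schema.answer
        rows.append({
            "sentence": schema.sentence,
            "model_correct": model_correct,
            "surprise": model_correct != predicted_ok,
        })
    return rows
```

The rows flagged as a surprise are the ones worth discussing, in either direction.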

Scale and emergence require epistemic humility. The emergence of qualitatively new capabilities at scale is surprising and not yet fully understood. Honest AI education acknowledges both the remarkable capabilities and the real uncertainty about their mechanisms and limits. Avoid the dismissive (“it’s just statistics”) and the credulous (“it reasons like a person”). The interesting and accurate position is that these systems do something novel that we are still learning to characterise; it is also the most intellectually honest one.

Closing. AI was supposed to reason like a logician. The systems that actually work are readers, navigating a geometric space of meaning that they assembled from their training corpus. The parrot critique was right about the methods that came before. It survives in narrow forms: memorisation, hallucination in sparse regions of training, and the sociotechnical concerns about data, labour, and environment that animate Bender et al.’s broader argument. It is wrong as a description of what large transformers compute at scale. What we have built is neither the logician we expected nor the parrot we feared. What remains is the careful work of figuring out what it actually is: where it generalises, where it confabulates, and how to deploy it without surrendering the judgement it cannot supply on its own.

This article was prepared by the ICE Industrial-AI team at the Institute for Computational Engineering (ICE), Eastern Switzerland University of Applied Sciences (OST). Our team works on applied AI deployments including retrieval-augmented systems for industry partners — so the implicit thesis here (LLMs do real semantic work; the parrot critique is overstated but not vacuous) is one we hold with skin in the game, not from a position of neutral exposition. We welcome feedback and collaboration: ice@ost.ch

9 References

  • [1] Turing, A.M. (1950). Computing Machinery and Intelligence. Mind, 59(236), 433–460.
  • [2] Buchanan, B.G. & Shortliffe, E.H. (Eds.) (1984). Rule-Based Expert Systems: The MYCIN Experiments. Addison-Wesley.
  • [3] Polanyi, M. (1966). The Tacit Dimension. Doubleday.
  • [4] Bengio, Y., Ducharme, R., Vincent, P., & Jauvin, C. (2003). A neural probabilistic language model. JMLR, 3, 1137–1155.
  • [5] Vaswani, A. et al. (2017). Attention Is All You Need. NeurIPS 30. arXiv:1706.03762
  • [6] Bahdanau, D., Cho, K., & Bengio, Y. (2015). Neural Machine Translation by Jointly Learning to Align and Translate. ICLR 2015. arXiv:1409.0473
  • [7] Brown, T. et al. (2020). Language Models are Few-Shot Learners. NeurIPS 33. arXiv:2005.14165
  • [8] Wei, J. et al. (2022). Emergent Abilities of Large Language Models. TMLR. arXiv:2206.07682
  • [9] Wei, J. et al. (2022). Chain-of-Thought Prompting Elicits Reasoning in Large Language Models. NeurIPS 35. arXiv:2201.11903
  • [10] OpenAI (2024). Learning to Reason with LLMs. https://openai.com/index/learning-to-reason-with-llms/
  • [11] Lewis, P. et al. (2020). Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks. NeurIPS 33. arXiv:2005.11401
  • [12] Bender, E.M., Gebru, T., McMillan-Major, A., & Shmitchell, S. (2021). On the Dangers of Stochastic Parrots. FAccT ’21, 610–623. https://doi.org/10.1145/3442188.3445922
  • [13] Bender, E.M. & Koller, A. (2020). Climbing towards NLU: On Meaning, Form, and Understanding in the Age of Data. ACL 2020, 5185–5198.
  • [14] OpenAI (2023). GPT-4 Technical Report. arXiv:2303.08774
  • [15] Li, K. et al. (2022). Emergent World Representations: Exploring a Sequence Model Trained on a Synthetic Task. arXiv:2210.13382
  • [16] Tayyar Madabushi, H., Torgbi, M., & Bonial, C. (2025). Neither Stochastic Parroting nor AGI. arXiv:2505.23323
  • [17] Ji, Z. et al. (2023). Survey of Hallucination in Natural Language Generation. ACM Computing Surveys, 55(12). arXiv:2202.03629
  • [18] Lighthill, J. (1973). Artificial Intelligence: A General Survey. In Artificial Intelligence: A Paper Symposium, Science Research Council, UK.
  • [19] Mikolov, T., Chen, K., Corrado, G., & Dean, J. (2013). Efficient Estimation of Word Representations in Vector Space. ICLR Workshop. arXiv:1301.3781
  • [20] Lindsey, J. et al. (Anthropic, 2025). On the Biology of a Large Language Model. Transformer Circuits Thread. https://transformer-circuits.pub/2025/attribution-graphs/biology.html
  • [21] Minsky, M. & Papert, S. (1969). Perceptrons: An Introduction to Computational Geometry. MIT Press.
  • [22] Rumelhart, D.E., Hinton, G.E., & Williams, R.J. (1986). Learning representations by back-propagating errors. Nature, 323(6088), 533–536.
  • [23] Linnainmaa, S. (1970). The representation of the cumulative rounding error of an algorithm as a Taylor expansion of the local rounding errors. Master’s thesis, University of Helsinki. (First publication of reverse-mode automatic differentiation.)
  • [24] Werbos, P.J. (1974). Beyond Regression: New Tools for Prediction and Analysis in the Behavioral Sciences. PhD thesis, Harvard University.
  • [25] Carpenter, G.A. & Grossberg, S. (1987). A massively parallel architecture for a self-organizing neural pattern recognition machine. Computer Vision, Graphics, and Image Processing, 37, 54–115. (Foundations of ART from Grossberg 1976 onward.)
  • [26] Fukushima, K. (1980). Neocognitron: A self-organizing neural network model for a mechanism of pattern recognition unaffected by shift in position. Biological Cybernetics, 36(4), 193–202.
  • [27] Hopfield, J.J. (1982). Neural networks and physical systems with emergent collective computational abilities. PNAS, 79(8), 2554–2558.
  • [28] Cybenko, G. (1989). Approximation by superpositions of a sigmoidal function. Mathematics of Control, Signals and Systems, 2(4), 303–314.
  • [29] Hornik, K., Stinchcombe, M., & White, H. (1989). Multilayer feedforward networks are universal approximators. Neural Networks, 2(5), 359–366.
  • [30] Hornik, K. (1991). Approximation capabilities of multilayer feedforward networks. Neural Networks, 4(2), 251–257.
  • [31] Hendrycks, D., Burns, C., Kadavath, S., Arora, A., Basart, S., Tang, E., Song, D., & Steinhardt, J. (2021). Measuring Mathematical Problem Solving With the MATH Dataset. NeurIPS Datasets and Benchmarks. arXiv:2103.03874
  • [32] Templeton, A. et al. (Anthropic, 2024). Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet. Transformer Circuits Thread. https://transformer-circuits.pub/2024/scaling-monosemanticity/
  • [33] Olsson, C. et al. (Anthropic, 2022). In-context Learning and Induction Heads. Transformer Circuits Thread. arXiv:2209.11895
  • [34] Trinh, T.H., Wu, Y., Le, Q.V., He, H., & Luong, T. (2024). Solving olympiad geometry without human demonstrations. Nature, 625, 476–482.
  • [35] Harnad, S. (1990). The Symbol Grounding Problem. Physica D, 42, 335–346.
  • [36] Schaeffer, R., Miranda, B., & Koyejo, S. (2023). Are Emergent Abilities of Large Language Models a Mirage? NeurIPS 36. arXiv:2304.15004
  • [37] Nanda, N., Lee, A., & Wattenberg, M. (2023). Emergent Linear Representations in World Models of Self-Supervised Sequence Models. arXiv:2309.00941
  • [38] Hochreiter, S. & Schmidhuber, J. (1997). Long Short-Term Memory. Neural Computation, 9(8), 1735–1780.
  • [39] Ouyang, L. et al. (2022). Training language models to follow instructions with human feedback. NeurIPS 35. arXiv:2203.02155
  • [40] Attributed to Frederick Jelinek (IBM) at a speech-recognition workshop c. 1985; the exact original utterance is undocumented. Widely circulated in the NLP and speech community since; discussed in Hirschberg, J. (1998), “Every Time I Fire a Linguist, My Performance Goes Up”, invited talk, AAAI-98.
  • [41] The New York Times (8 July 1958), “New Navy Device Learns by Doing.” Report on Rosenblatt’s perceptron press conference at the U.S. Office of Naval Research; the quoted wording above is the commonly-cited paraphrase circulated in subsequent accounts. See the Perceptron Wikipedia article for further discussion of the reception.
  • [42] Harris, Z.S. (1954). Distributional structure. Word, 10(2-3), 146–162.
  • [43] Firth, J.R. (1957). A synopsis of linguistic theory, 1930–1955. In Studies in Linguistic Analysis. Oxford: Philological Society.
  • [44] Schütze, H. (1998). Automatic word sense discrimination. Computational Linguistics, 24(1), 97–123.
  • [45] Martínez, E. (2024). Re-evaluating GPT-4’s bar exam performance. Artificial Intelligence and Law. (Argues the published 90th-percentile figure is overstated relative to actual repeat takers of the Uniform Bar Exam.)
  • [46] Carlini, N., Tramèr, F., Wallace, E., Jagielski, M., Herbert-Voss, A., et al. (2021). Extracting Training Data from Large Language Models. USENIX Security Symposium. arXiv:2012.07805
  • [47] Nasr, M., Carlini, N., Hayase, J., Jagielski, M., et al. (2023). Scalable Extraction of Training Data from (Production) Language Models. arXiv:2311.17035
  • [48] Haugeland, J. (1985). Artificial Intelligence: The Very Idea. MIT Press. (Coined the term “GOFAI” for Good Old-Fashioned AI.)
  • [49] Linzen, T. (2016). Issues in Evaluating Semantic Spaces Using Word Analogies. Proc. 1st Workshop on Evaluating Vector-Space Representations for NLP, 13–18. arXiv:1606.07736
  • [50] Nissim, M., van Noord, R., & van der Goot, R. (2020). Fair Is Better than Sensational: Man Is to Doctor as Woman Is to Doctor. Computational Linguistics, 46(2), 487–497. arXiv:1905.09866
  • [51] Levesque, H.J., Davis, E., & Morgenstern, L. (2012). The Winograd Schema Challenge. KR 2012: Principles of Knowledge Representation and Reasoning, 552–561.
  • [52] Peters, M.E., Neumann, M., Iyyer, M., Gardner, M., Clark, C., Lee, K., & Zettlemoyer, L. (2018). Deep contextualized word representations (ELMo). NAACL-HLT 2018, 2227–2237. arXiv:1802.05365
  • [53] Kaplan, J., McCandlish, S., Henighan, T., Brown, T.B., Chess, B., Child, R., Gray, S., Radford, A., Wu, J., & Amodei, D. (2020). Scaling Laws for Neural Language Models. arXiv:2001.08361
  • [54] Hoffmann, J., Borgeaud, S., Mensch, A., Buchatskaya, E., Cai, T., et al. (2022). Training Compute-Optimal Large Language Models (Chinchilla). NeurIPS 35. arXiv:2203.15556
  • [55] Mikolov, T., Yih, W., & Zweig, G. (2013). Linguistic Regularities in Continuous Space Word Representations. NAACL-HLT 2013, 746–751. (The “king − man + woman ≈ queen” demonstration, distinct from the Word2Vec architecture paper [19].)
  • [56] Bender, E.M. (2026). Stochastic Parrots: Frequently Unasked Questions. Medium. https://medium.com/@emilymenonbender/stochastic-parrots-frequently-unasked-questions-49c2e7d22d11