Language Models Show Surprising Grasp of Human Sensory Experience Without Direct Perception
A language model trained only on text correctly predicted that two musical notes an octave apart would sound similar to humans — despite never having heard a single sound.
This remarkable finding, published in the Nature Portfolio journal Scientific Reports under the title “Large language models predict human sensory judgments across six modalities,” emerged from research by CDS Faculty Fellow Ilia Sucholutsky and colleagues, who tested large language models’ ability to make human-like judgments about sensory experiences across six domains: pitch, loudness, colors, consonants, taste, and timbre.
The challenge was figuring out how to study these models’ internal representations of sensory experiences they’ve never had. The researchers turned to methods from cognitive science that were originally developed to study human perception non-invasively. “Just as cognitive scientists can’t directly examine what’s happening inside people’s heads, we can’t look inside closed-source language models like GPT,” Sucholutsky said. “So we borrowed their experimental techniques.”
The key method they adopted was similarity judgments — a classic cognitive science approach where participants rate how similar pairs of stimuli seem. For instance, to understand how humans mentally represent colors, researchers show participants pairs of colors and ask them to rate their similarity. These ratings can then be used to reconstruct the underlying mental organization of colors — which often resembles the familiar color wheel.
The researchers applied this same approach to language models, presenting them with pairs of sensory stimuli encoded as text (like hex codes for colors or frequency values for musical pitches) and asking for similarity ratings. The models showed remarkably human-like judgment patterns, with particularly strong correlations for pitch (r=0.92), loudness (r=0.89), and colors (r=0.84).
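To make the approach concrete, here is a minimal sketch of that kind of protocol. It is not the authors’ code: the prompt wording, the stimuli, and the query_model stand-in (which substitutes an RGB-distance heuristic for a real API call so the script runs end to end) are all illustrative assumptions, and agreement is simply summarized as a Pearson correlation, which is what the reported r values measure.

```python
import re
from itertools import combinations

from scipy.stats import pearsonr

# Colors are presented to the model as text (hex codes), never as pixels.
stimuli = ["#FF0000", "#FFA500", "#FFFF00", "#00FF00", "#0000FF", "#8B00FF"]


def query_model(prompt: str) -> float:
    """Stand-in for a call to a language-model API.

    A real run would send `prompt` to a model such as GPT and parse the
    numeric rating out of its reply. So that this sketch executes, the
    placeholder instead scores similarity from RGB distance.
    """
    a, b = re.findall(r"#[0-9A-Fa-f]{6}", prompt)

    def rgb(h: str) -> list[int]:
        return [int(h[i:i + 2], 16) for i in (1, 3, 5)]

    dist = sum((x - y) ** 2 for x, y in zip(rgb(a), rgb(b))) ** 0.5
    return 1.0 - dist / (3 * 255 ** 2) ** 0.5  # scaled to [0, 1]


def rate_pair(a: str, b: str) -> float:
    """Ask for a similarity rating for one pair of text-encoded stimuli."""
    prompt = (
        f"On a scale from 0 (completely dissimilar) to 1 (identical), "
        f"how similar are the colors {a} and {b}? Respond with a number only."
    )
    return query_model(prompt)


# Rate every pair of stimuli, just as human participants would.
pairs = list(combinations(stimuli, 2))
model_ratings = [rate_pair(a, b) for a, b in pairs]

# `human_ratings` would hold the mean human judgment for each pair, in the
# same order as `pairs`; it is left empty because no real data ships with
# this sketch.
human_ratings: list[float] = []
if human_ratings:
    r, _ = pearsonr(model_ratings, human_ratings)
    print(f"model-human correlation: r = {r:.2f}")
```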
Most strikingly, when analyzing musical pitch similarity, the language models reproduced a key feature of human perception — the “octave effect,” where humans perceive notes that are exactly one octave apart as highly similar. The models also reconstructed the “pitch spiral” that represents how humans mentally organize musical notes.
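Recovering a structure such as the color wheel or the pitch spiral from those ratings can be done with standard tools like multidimensional scaling. The sketch below is again illustrative rather than the paper’s analysis: the similarity matrix is a hand-made placeholder in which octave pairs (C4/C5, E4/E5, G4/G5) are given high values, and scikit-learn’s MDS stands in for whatever embedding method the authors used.

```python
import numpy as np
from sklearn.manifold import MDS

# Example pitch stimuli; C4/C5, E4/E5, and G4/G5 are octave pairs.
labels = ["C4", "E4", "G4", "C5", "E5", "G5"]

# sim[i, j] = similarity rating for stimuli i and j on a 0-1 scale.
# These numbers are a hand-made illustration of the octave effect, not
# data from the study; in practice they would come from the model (or
# human) judgments collected as in the previous sketch.
sim = np.array([
    [1.0, 0.4, 0.5, 0.8, 0.3, 0.4],
    [0.4, 1.0, 0.4, 0.4, 0.8, 0.3],
    [0.5, 0.4, 1.0, 0.5, 0.4, 0.8],
    [0.8, 0.4, 0.5, 1.0, 0.4, 0.5],
    [0.3, 0.8, 0.4, 0.4, 1.0, 0.4],
    [0.4, 0.3, 0.8, 0.5, 0.4, 1.0],
])

# MDS expects dissimilarities; embed into two dimensions here
# (a pitch spiral would typically use three).
dissimilarity = 1.0 - sim
coords = MDS(
    n_components=2, dissimilarity="precomputed", random_state=0
).fit_transform(dissimilarity)

# Octave pairs should land near each other in the recovered map.
for name, (x, y) in zip(labels, coords):
    print(f"{name}: ({x:+.2f}, {y:+.2f})")
```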
For colors, the models not only recreated the familiar color wheel arrangement that emerges from human similarity judgments, but also captured subtle linguistic differences in color perception. When tested in different languages, the models replicated known variations in how English and Russian speakers categorize colors. Perhaps most surprisingly, when testing GPT-4’s multimodal capabilities, the researchers found that having direct visual access to colors didn’t significantly improve the model’s performance compared to using text descriptions alone.
This work is part of a broader research program examining how well language can capture human perceptual experience. In “Words are all you need? Language as an approximation for human similarity judgments,” Sucholutsky and colleagues found that language-based methods could predict human similarity judgments better than models trained directly on images, audio, and video. They also discovered that combining language with other modalities produced the best results for approximating human perception.
The researchers extended this line of inquiry in “Studying the Effect of Globalization on Color Perception using Multilingual Online Recruitment and Large Language Models.” In a study of 2,280 participants speaking 22 different languages, they found that despite increasing global connectivity, different languages maintain distinct ways of categorizing colors. Even highly multilingual AI models preserved these language-specific differences in color representation.
The practical implications of this research are significant. In “Alignment with human representations supports robust few-shot learning,” Sucholutsky demonstrated that AI systems whose internal representations align well with human perception learn more effectively from small amounts of data and are more robust to a range of adversarial attacks.
Taken together, these studies suggest something profound about the relationship between language and perception. By using language models to quantify how much perceptual information can be recovered from text alone, the researchers established a concrete lower bound on that quantity, and it turned out to be higher than previously thought. The fact that language models can reconstruct human perceptual structures like the color wheel and the pitch spiral — without ever seeing colors or hearing sounds — demonstrates that language captures rich information about how we experience the physical world.
This finding is particularly striking given that even providing direct visual input to GPT-4 didn’t improve its performance over text descriptions alone. The results suggest that language may be sufficient for developing human-like sensory representations — an encouraging sign for the future of human-AI communication, as it indicates these systems may be able to understand and discuss our physical world meaningfully despite experiencing it so differently from us.
By Stephen Thomas