TL;DR: How faithfully does an LLM’s text output reflect the semantic geometry encoded in its hidden states? Forced-choice behavioral probing recovers substantially more internal similarity structure than open-ended generation, and behavioral features improve prediction of unseen hidden-state similarities above lexical and cross-model baselines.

Written by Louis Schiekiera

The core question

Cognitive scientists have long inferred semantic structure from observable behavior: show someone the word dog, record what they associate with it (cat, leash, bark), repeat across many cues, and the resulting response patterns sketch an approximate map of an otherwise hidden meaning system (De Deyne et al., 2019). LLMs offer a model system to test this logic. Unlike human participants, a language model’s internal representations are directly accessible alongside its behavioral output. So we can ask a sharper question: when we probe an LLM with word-association tasks, how much of its hidden-state semantic geometry actually shows up in the responses it produces?

That is the focus of our recent preprint, From Associations to Activations, led by Louis Schiekiera. Rather than comparing model behavior to human norms, we compare each model’s behavior to its own layerwise hidden states—treating the model as both the subject and the ground truth.

Conceptual overview of the framework

Figure 1: Overview of the approach. A shared vocabulary feeds two pipelines: (i) layerwise hidden-state extraction produces a hidden-state similarity matrix, and (ii) behavioral association tasks (forced choice or free association) yield a behavioral similarity matrix. Representational similarity analysis (RSA) then quantifies how well the two geometries match.

Experimental setup at a glance

Models under study

We tested eight instruction-tuned decoder-only transformers spanning 7B to 14B parameters: Falcon3, Gemma-2, Llama-3.1, Mistral-7B, Mistral-Nemo, Phi-4, Qwen2.5, and rnj-1. All experiments share a single 5,000-noun vocabulary drawn from the SUBTLEX-US frequency list (Brysbaert et al., 2012).

Two ways to elicit semantic behavior

We borrowed two classic paradigms from psycholinguistics.

Forced choice and free association paradigms

Figure 2: Illustration of the two behavioral paradigms. In forced choice (left), a cue word is paired with a candidate set and the model picks the most related items. In free association (right), the model generates associates from scratch. Both produce cue–response count matrices whose row-wise cosine similarities define behavioral semantic geometries.

Forced choice (FC). Each cue appears with 16 candidate words; the model selects exactly two that are most semantically related. A deterministic shuffle of the remaining vocabulary produces 313 unique candidate sets per cue.
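One plausible reading of the deterministic-shuffle construction is sketched below; the vocabulary, seed, and partial-set handling are illustrative assumptions, not the paper's exact scheme:

```python
import random

def candidate_sets(vocab, cue, set_size=16, seed=0):
    """Partition the non-cue vocabulary into fixed candidate sets via
    one deterministic shuffle (a sketch; the paper's exact scheme may
    differ, e.g. in how the final partial set is topped up to 16)."""
    pool = [w for w in vocab if w != cue]
    rng = random.Random(seed)          # fixed seed -> reproducible sets
    rng.shuffle(pool)
    return [pool[i:i + set_size] for i in range(0, len(pool), set_size)]

vocab = [f"w{i}" for i in range(5000)]   # stand-in for the 5,000-noun list
sets_ = candidate_sets(vocab, cue="w0")
print(len(sets_))                        # ceil(4999 / 16) = 313 sets
```

Note that with 4,999 non-cue words, a straight partition yields 312 full sets plus one 7-item remainder, which matches the reported 313 sets per cue.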

Free association (FA). Each cue is presented alone and the model generates five single-word associates. We repeat this across 126 stochastic runs per cue to accumulate stable response distributions.

Responses from both paradigms are aggregated into sparse cue–response count matrices. We apply positive pointwise mutual information (PPMI) reweighting to down-weight globally frequent responses, then compute cue–cue similarity via cosine between PPMI-weighted row vectors. Altogether, the dataset spans more than 17.5 million trials across both paradigms and all eight models.
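The aggregation pipeline above can be sketched as follows; the toy count matrix and the exact zero-handling are illustrative assumptions:

```python
import numpy as np

def ppmi(counts):
    """Positive PMI reweighting of a cue x response count matrix:
    down-weights globally frequent responses."""
    total = counts.sum()
    joint = counts / total                       # P(cue, response)
    p_cue = joint.sum(axis=1, keepdims=True)     # P(cue)
    p_resp = joint.sum(axis=0, keepdims=True)    # P(response)
    with np.errstate(divide="ignore", invalid="ignore"):
        pmi = np.log(joint / (p_cue * p_resp))
    pmi[~np.isfinite(pmi)] = 0.0                 # zero counts -> 0
    return np.maximum(pmi, 0.0)                  # keep positive part

def cosine_rows(m):
    """Pairwise cosine similarity between row vectors."""
    norms = np.linalg.norm(m, axis=1, keepdims=True)
    norms[norms == 0] = 1.0                      # guard all-zero rows
    unit = m / norms
    return unit @ unit.T

# toy 3-cue x 3-response count matrix (illustrative only)
counts = np.array([[4., 0., 1.],
                   [0., 3., 2.],
                   [5., 0., 0.]])
sim = cosine_rows(ppmi(counts))                  # behavioral similarity
```

In practice the matrices are sparse (5,000 cues by a much larger response vocabulary), so a sparse representation would replace the dense arrays here.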

Extracting hidden-state geometry

For every model and every word in the vocabulary, we pulled layerwise hidden states under four contextual embedding strategies:

  • Averaged — the word embedded in 50 naturally occurring C4 sentences (Raffel et al., 2020), hidden states averaged across contexts (Bommasani et al., 2020).
  • Meaning — a fixed definitional prompt (“What is the meaning of the word {w}?”).
  • Task (FC) — the word embedded in the forced-choice instruction prompt, minus the candidate list.
  • Task (FA) — the word embedded in the free-association instruction prompt.

Cosine similarity between mean-centered layerwise vectors (Ethayarajh, 2019) yields a hidden-state similarity matrix for each model, layer, and extraction strategy.
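A minimal sketch of that step, assuming the layer's hidden states have already been collected into an (n_words × dim) array (variable names are illustrative):

```python
import numpy as np

def hidden_similarity(states):
    """Cosine similarity between mean-centered hidden states.

    states: (n_words, dim) array of layerwise hidden vectors for one
    model, layer, and extraction strategy. Mean-centering removes the
    shared anisotropic component that inflates raw cosine similarities
    in transformer representations (cf. Ethayarajh, 2019).
    """
    centered = states - states.mean(axis=0, keepdims=True)
    norms = np.linalg.norm(centered, axis=1, keepdims=True)
    norms[norms == 0] = 1.0                  # guard degenerate rows
    unit = centered / norms
    return unit @ unit.T

rng = np.random.default_rng(0)
sim = hidden_similarity(rng.normal(size=(6, 8)))  # toy 6-word example
```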

Reference baselines

Three external baselines anchor the comparison: FastText static word vectors (Bojanowski et al., 2017), BERT contextual embeddings (Devlin et al., 2019), and a cross-model consensus geometry that averages hidden-state similarities from all other models—motivated by evidence for a shared semantic subspace across architectures (Huh et al., 2024).

How we measure alignment

We employed three complementary metrics:

  1. RSA (Kriegeskorte et al., 2008; Nili et al., 2014) — Pearson correlation between vectorized upper-triangular entries of the hidden-state and reference similarity matrices, computed per layer.

  2. Nearest-neighbor overlap ($\mathrm{NN@}k$) — fraction of shared $k$-nearest neighbors between hidden-state and reference similarity spaces.

  3. Held-out-words ridge regression — can behavioral similarity predict hidden-state similarities for words that were held out when fitting the regression? This tests generalization beyond lexical baselines and cross-model consensus.
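The first two metrics can be sketched as follows (an illustrative implementation, not the paper's code; tie-breaking in the neighbor ranking is an assumption):

```python
import numpy as np

def rsa(sim_a, sim_b):
    """Pearson correlation between the upper-triangular entries of two
    similarity matrices (the RSA comparison, metric 1)."""
    iu = np.triu_indices_from(sim_a, k=1)    # exclude the diagonal
    return np.corrcoef(sim_a[iu], sim_b[iu])[0, 1]

def nn_overlap(sim_a, sim_b, k=3):
    """Mean fraction of shared k-nearest neighbors per word, self
    excluded (NN@k, metric 2)."""
    def neighbors(sim):
        order = np.argsort(-sim, axis=1)     # most similar first
        return [set(row[row != i][:k]) for i, row in enumerate(order)]
    na, nb = neighbors(sim_a), neighbors(sim_b)
    return float(np.mean([len(a & b) / k for a, b in zip(na, nb)]))
```

Identical geometries score 1.0 on both metrics, giving a quick sanity check on any implementation.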

Key findings

Constrained tasks recover far more internal structure

The gap between paradigms is substantial. Forced-choice behavior aligns with hidden-state geometry far more strongly than free association—consistently, across every model and evaluation metric.

Under the best extraction strategy (Task FC), mean RSA reaches $r = .463$ for forced choice versus only $r = .199$ for free association. Even the weakest FC condition (Averaged extraction, $r = .346$) outperforms the strongest FA condition.

Summary RSA and nearest-neighbor overlap results

Figure 3: Aggregate alignment across models. Left: RSA correlation by layer. Right: nearest-neighbor overlap as a function of neighborhood size $k$ (log scale). Forced-choice behavior (green) tracks hidden-state structure far more closely than free association (red). Cross-model consensus (black) sets the ceiling.

Why does FC win so decisively? Its controlled candidate sets force every response to emerge from an explicit comparison, concentrating observations onto shared supports and producing a denser, less noisy cue–response matrix (Roads & Love, 2021). Free association, by contrast, disperses probability mass across a long tail of idiosyncratic responses, yielding sparser vectors with lower signal-to-noise for recovering geometric structure.

Extraction context shifts where alignment peaks

The choice of how hidden states are extracted determines which layers show the strongest match.

Layerwise RSA for FC and FA under different extraction strategies

Figure 4: Layerwise RSA profiles under different extraction strategies. Task-aligned and meaning-focused prompts peak at earlier, mid-depth layers. Averaging over natural contexts shifts the peak to later layers.

Task-aligned and meaning-based prompts push the model into a comparable semantically focused processing mode, and peak alignment appears at earlier to mid-depth layers—consistent with evidence that core lexical-semantic representations crystallize in intermediate transformer blocks. Averaging over diverse natural contexts, by contrast, mixes senses and topics, diluting the word-level signal and shifting alignment peaks toward the final layers.

The paradigm advantage holds across all eight models

Model-by-model heatmaps confirm that the FC superiority is universal, though its magnitude varies with architecture:

RSA heatmap across models

Figure 5: Per-model RSA heatmaps. Each panel contrasts forced-choice (left sub-panel) and free-association (right sub-panel) behavioral similarity against hidden states, broken down by extraction strategy and summarized across layers.

Behavior predicts hidden structure on unseen words

The held-out regression provides the most stringent test. After controlling for FastText, BERT, and cross-model consensus, adding FC behavioral similarity still improves mean test $R^2$ by $+.022$; FA adds a marginal $+.002$. The full model achieves mean $R^2 = .587$ (baseline: $.569$), peaking at $R^2 = .844$ for Llama-3.1-8B-Instruct.
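A simplified sketch of this held-out evaluation (our reconstruction: the regularization strength, intercept handling, and exact train/test split are assumptions):

```python
import numpy as np

def ridge_fit(X, y, alpha=1.0):
    """Closed-form ridge regression: w = (X'X + alpha*I)^-1 X'y."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(d), X.T @ y)

def heldout_r2(feature_sims, target_sim, test_words, alpha=1.0):
    """Held-out-words ridge evaluation (a sketch of the protocol).

    feature_sims: list of (n, n) predictor similarity matrices
                  (e.g. behavioral, FastText, BERT, consensus).
    target_sim:   (n, n) hidden-state similarity matrix to predict.
    test_words:   word indices whose pairs are excluded from training.
    """
    n = target_sim.shape[0]
    iu = np.triu_indices(n, k=1)
    X = np.stack([s[iu] for s in feature_sims], axis=1)
    y = target_sim[iu]
    # a pair is held out if it touches any held-out word
    test = np.isin(iu[0], test_words) | np.isin(iu[1], test_words)
    w = ridge_fit(X[~test], y[~test], alpha)
    pred = X[test] @ w
    ss_res = np.sum((y[test] - pred) ** 2)
    ss_tot = np.sum((y[test] - y[test].mean()) ** 2)
    return 1.0 - ss_res / ss_tot
```

Holding out words rather than random pairs is what makes the test stringent: every evaluated similarity involves at least one word the regression never saw.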

Ridge regression performance across models

Figure 6: Held-out ridge regression results for all eight models. Bold values indicate $R^2$ for the full predictor set (behavioral + baselines); parenthetical values show the baseline without behavioral features.

This means that behavioral probing captures something about a model’s internal semantic organization that lexical vectors and cross-model structure alone do not—especially when the behavioral measurement is carefully constrained.

Broader implications

Black-box interpretability

When logits and activations are unavailable, behavioral probing remains an important path to interpretability. Forced-choice paradigms are especially promising: their constrained response sets act as structured measurement instruments that concentrate informative signal.

Lessons for cognitive science

Our fully transparent LLM setup lets us rigorously test a foundational cognitive-science assumption—that structured behavior is constrained by, and therefore partially reveals, internal states. The sharp FC–FA divergence demonstrates that whether behavior reveals internal structure depends critically on the measurement protocol. Open-ended tasks are not inherently less informative; they simply distribute responses too thinly for cosine-based geometry recovery. Protocol design is itself a variable.

A shared semantic substrate

One of the most important observations is the strength of cross-model consensus. Similarity structure aggregated from the other seven models explains a large share of variance in any target model’s hidden-state geometry, lending further support to the hypothesis of a common, low-dimensional semantic subspace across diverse LLM architectures (Huh et al., 2024).

References

  • Bojanowski, P., Grave, E., Joulin, A., & Mikolov, T. (2017). Enriching word vectors with subword information. Transactions of the Association for Computational Linguistics, 5, 135–146. doi:10.1162/tacl_a_00051

  • Bommasani, R., Davis, K., & Cardie, C. (2020). Interpreting pretrained contextualized representations via reductions to static embeddings. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics (pp. 4758–4781). doi:10.18653/v1/2020.acl-main.431

  • Brysbaert, M., New, B., & Keuleers, E. (2012). Adding part-of-speech information to the SUBTLEX-US word frequencies. Behavior Research Methods, 44(4), 991–997. doi:10.3758/s13428-012-0190-4

  • De Deyne, S., Navarro, D. J., Perfors, A., Brysbaert, M., & Storms, G. (2019). The Small World of Words: English word association norms for over 12,000 cue words. Behavior Research Methods, 51(3), 987–1006. doi:10.3758/s13428-018-1115-7

  • Devlin, J., Chang, M.-W., Lee, K., & Toutanova, K. (2019). BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of NAACL-HLT 2019 (pp. 4171–4186). doi:10.18653/v1/N19-1423

  • Ethayarajh, K. (2019). How contextual are contextualized word representations? Comparing the geometry of BERT, ELMo, and GPT-2 embeddings. arXiv preprint. arxiv:1909.00512

  • Huh, M., Cheung, B., Wang, T., & Isola, P. (2024). The platonic representation hypothesis. arXiv preprint. arxiv:2405.07987

  • Kriegeskorte, N., Mur, M., & Bandettini, P. A. (2008). Representational similarity analysis—connecting the branches of systems neuroscience. Frontiers in Systems Neuroscience, 2, 4. doi:10.3389/neuro.06.004.2008

  • Nili, H., Wingfield, C., Walther, A., Su, L., Marslen-Wilson, W., & Kriegeskorte, N. (2014). A toolbox for representational similarity analysis. PLoS Computational Biology, 10(4), e1003553. doi:10.1371/journal.pcbi.1003553

  • Raffel, C., Shazeer, N., Roberts, A., Lee, K., Narang, S., Matena, M., Zhou, Y., Li, W., & Liu, P. J. (2020). Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research, 21(140), 1–67.

  • Roads, B. D., & Love, B. C. (2021). Enriching ImageNet with human similarity judgments and psychological embeddings. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (pp. 3547–3557). doi:10.1109/CVPR46437.2021.00355