UBC Theses and Dissertations

The implicit geometry of language: structure, semantics, and dynamics in next-token prediction

Zhao, Yize

Abstract

Language models trained with next-token prediction acquire internal representations that encode rich semantic structure, including syntactic regularities, linear semantic analogies, and a coarse-to-fine progression in learned concepts. While these phenomena are well documented, their origin remains conceptually opaque: how does predicting next-token distributions produce such structure? This thesis tackles this question from first principles. We frame next-token prediction as a soft-label classification problem, where each context is paired with a sparse target distribution over next tokens. Variation in support across contexts creates a structured pattern of label sparsity, captured by a data co-occurrence matrix. To formalize this, we adopt the Unconstrained Features Model (UFM), a minimal setting where logits are parameterized as a product of word and context embeddings and trained with cross-entropy loss plus ridge regularization. We prove that gradient descent drives directional convergence: context embeddings with identical support sets collapse to a shared direction, partitioning the embedding space by support pattern. This process uncovers latent structure in the data, even without explicit semantic supervision. We further show that the singular vectors of the co-occurrence matrix align with interpretable semantic directions in embedding space. Each vector induces a partition over words or contexts based on alignment with a latent concept. These directions emerge sequentially during training, with dominant concepts learned first—yielding a spectral explanation for the progression from coarse to fine-grained structure. Experiments on synthetic and real data confirm these predictions. Trained transformers exhibit the same alignment between embeddings and the singular directions of the co-occurrence matrix. Semantic structure appears in both synthetic tasks and pretrained models. Tracking learning in text data and step-imbalanced MNIST shows coarse distinctions—such as part-of-speech or majority class—emerging early, followed by finer categories, consistent with spectral ordering. Finally, we extend the analysis to settings with class imbalance, showing that reweighting the loss flattens the singular spectrum of the co-occurrence matrix, thereby accelerating the acquisition of underrepresented features. These results demonstrate that semantic structure in next-token prediction arises predictably from the interaction between label structure—particularly its spectral properties—and the implicit bias of optimization.
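The abstract's central construction, the Unconstrained Features Model (UFM) trained with cross-entropy plus ridge regularization on logits formed as a product of word and context embeddings, can be illustrated with a small numerical sketch. The code below is my own illustrative approximation, not the thesis's implementation: the toy soft-label matrix `S`, the embedding dimension, the learning rate, and the regularization strength are all assumed values, chosen so that the directional-collapse and spectral-partition claims are easy to observe on a toy problem.

```python
import numpy as np

rng = np.random.default_rng(0)
num_contexts, num_words, dim = 6, 5, 4

# Sparse soft-label matrix: row i is the target next-token distribution for
# context i.  Contexts 0-2 share one support set {0, 1}; contexts 3-5 share
# another support set {2, 3, 4}.  (Toy data, chosen purely for illustration.)
S = np.array([
    [0.5, 0.5, 0.0, 0.0, 0.0],
    [0.3, 0.7, 0.0, 0.0, 0.0],
    [0.6, 0.4, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.2, 0.4, 0.4],
    [0.0, 0.0, 0.5, 0.3, 0.2],
    [0.0, 0.0, 0.3, 0.3, 0.4],
])

# Unconstrained ("free") embeddings: H for contexts, W for words.
H = 0.1 * rng.standard_normal((num_contexts, dim))
W = 0.1 * rng.standard_normal((num_words, dim))

lam, lr = 1e-3, 0.1  # ridge strength and step size (illustrative values)
for _ in range(50_000):
    logits = H @ W.T
    logits -= logits.max(axis=1, keepdims=True)   # numerical stability
    P = np.exp(logits)
    P /= P.sum(axis=1, keepdims=True)             # row-wise softmax
    G = (P - S) / num_contexts                    # gradient of CE w.r.t. logits
    grad_H = G @ W + lam * H                      # ridge-regularized gradients
    grad_W = G.T @ H + lam * W
    H -= lr * grad_H
    W -= lr * grad_W

# Directional collapse: contexts with identical support align to a shared direction,
# so the within-group cosine is much higher than the cross-group cosine.
Hn = H / np.linalg.norm(H, axis=1, keepdims=True)
print("cosine, same support (contexts 0, 1):     ", round(float(Hn[0] @ Hn[1]), 3))
print("cosine, different support (contexts 0, 3):", round(float(Hn[0] @ Hn[3]), 3))

# Spectral view: singular vectors of the (column-centered) co-occurrence matrix
# induce a partition of contexts; here the top one separates the two support groups.
U, _, _ = np.linalg.svd(S - S.mean(axis=0), full_matrices=False)
print("top left singular vector over contexts:   ", np.round(U[:, 0], 2))
```

Under these assumptions, the sketch mirrors the two claims highlighted in the abstract: contexts with identical support sets converge toward a shared embedding direction, and the singular vectors of the co-occurrence (soft-label) matrix recover the latent grouping without any explicit semantic supervision.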


Rights

Attribution-NonCommercial-NoDerivatives 4.0 International