UBC Theses and Dissertations


Learning dynamics of deep learning -- force analysis of deep neural networks

Ren, Yi (Joshua)

Abstract

This thesis investigates the learning dynamics of deep learning systems through a local, physics-inspired analytical lens. Motivated by the need for fine-grained insight into model behavior, we begin with the step-wise influence that a single training example exerts on a specific observing example during learning. Central to our approach is the proposed AKG decomposition, which dissects this influence into three interpretable components: similarity (K), normalization (A), and prediction gap (G). This decomposition enables an analogy with classical force analysis: the force originates from G, is shaped by A and K, and is ultimately applied to a target object such as the model's confidence, output, hidden representations, or parameters. Building on this foundation, we gradually scale the analysis from individual interactions to cumulative effects over time, akin to tracking an object's motion under multiple forces. We apply the framework to the following problems.

Supervised classification: We study the learning trajectories of examples of varying difficulty and reveal a "zig-zag" pattern that emerges during optimization. Our analysis explains this behavior and inspires a novel knowledge distillation method, Filter-KD, which improves the supervision signal for student models.

Large language model (LLM) finetuning: We extend the framework to account for the autoregressive nature of LLMs and the presence of negative gradients. This unified perspective explains behaviors across finetuning methods such as SFT, DPO, and GRPO, and highlights the critical role of negative gradients. In particular, we identify the "squeezing effect": a counterintuitive phenomenon caused by improperly applied gradient ascent.

Representation learning: We explore the dynamics of hidden features, revealing how adaptation energy and direction influence feature drift. Our analysis yields a provable pattern of feature adaptation in a head-probing-then-finetuning pipeline, offering insights and inspiring several practical strategies.

Simplicity bias and compositional learning: Revisiting foundational questions about why structured representations are learned faster, we apply our framework to a compositional learning setting. Our findings align with principles such as Occam's Razor and the idea of "compression for AGI," offering a novel dynamical explanation rooted in compression and learning speed.
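The abstract does not spell out the decomposition itself. As a rough illustration, a step-wise influence of this kind is often written in empirical neural-tangent-kernel style as follows; the notation below (learning rate \eta, logits z, parameters \theta, predictive distribution \pi) is an assumption for exposition and may differ from the thesis's own:

\Delta \log \pi_t(y \mid x_o) \;\approx\; -\eta \, A_t(x_o) \, K_t(x_o, x_u) \, G_t(x_u) \;+\; O(\eta^2),

where A_t(x_o) = \nabla_z \log \pi_t(\cdot \mid x_o) is the normalization term (the log-softmax Jacobian at the observing example), K_t(x_o, x_u) = \nabla_\theta z_t(x_o) \, \nabla_\theta z_t(x_u)^\top is the kernel-style similarity between the observing example x_o and the updated training example x_u, and G_t(x_u) = \nabla_z \mathcal{L}(x_u) is the prediction gap (for cross-entropy, the model's distribution at x_u minus the one-hot label).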
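As a sanity check of this first-order picture, here is a minimal, self-contained sketch comparing the predicted A-K-G influence with the actual change after one SGD step. It assumes PyTorch and a toy linear softmax classifier; the model, variable names, and learning rate are illustrative assumptions, not the thesis's code.

    import torch
    import torch.nn.functional as F

    # Toy check of the A-K-G structure on a linear softmax classifier.
    # Everything here (model, eta, x_u, y_u, x_o) is illustrative, not from the thesis.
    torch.manual_seed(0)
    C, D, eta = 3, 5, 1e-2                      # classes, input dim, learning rate
    W = torch.randn(C, D, requires_grad=True)   # the only parameter tensor

    def param_jacobian(x):
        # J[c] = d z_c / d vec(W): one row of parameter gradients per logit.
        J = torch.zeros(C, W.numel())
        for c in range(C):
            (g,) = torch.autograd.grad((W @ x)[c], W)
            J[c] = g.flatten()
        return J

    x_u, y_u = torch.randn(D), torch.tensor(1)  # training example and its label
    x_o = torch.randn(D)                        # observing example

    p_u = torch.softmax(W @ x_u, dim=0).detach()
    p_o = torch.softmax(W @ x_o, dim=0).detach()

    G = p_u - torch.eye(C)[y_u]                      # prediction gap: cross-entropy residual
    K = param_jacobian(x_o) @ param_jacobian(x_u).T  # kernel similarity, C x C
    A = torch.eye(C) - p_o                           # d log-softmax / d z at x_o (broadcast over rows)

    predicted = -eta * (A @ K @ G)                   # predicted change in log p(. | x_o)

    # Compare with the change produced by one actual SGD step on (x_u, y_u).
    loss = F.cross_entropy((W @ x_u).unsqueeze(0), y_u.unsqueeze(0))
    (grad_W,) = torch.autograd.grad(loss, W)
    with torch.no_grad():
        before = torch.log_softmax(W @ x_o, dim=0)
        W -= eta * grad_W
        after = torch.log_softmax(W @ x_o, dim=0)
    print(predicted)       # should approximately equal after - before for small eta
    print(after - before)

For this linear model the kernel reduces to (x_o . x_u) times the identity, so the predicted and measured changes agree up to O(eta^2); the decomposition's value in the thesis comes from applying the same bookkeeping to deep, nonlinear networks.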


Rights

Attribution-NonCommercial-NoDerivatives 4.0 International