Neural network training actually modifies parameters only within a small subspace, leaving most directions in parameter space essentially unchanged.
The paper introduces a novel way to understand neural network training by analyzing how final parameters relate to their initial values through the training Jacobian matrix.
-----
https://arxiv.org/abs/2412.07003
🤔 Original Problem:
How neural networks learn during training remains a black box, particularly which parameter changes matter most and how optimization behaves in high-dimensional parameter space.
-----
🔬 Solution in this Paper:
→ The researchers examine the training Jacobian: the Jacobian matrix of the trained network's parameters with respect to their initial values (a toy sketch of this computation appears after this list).
→ They discovered the singular value spectrum of this Jacobian has three distinct regions: chaotic (values > 1), bulk (values ≈ 1), and stable (values < 1).
→ The bulk, spanning about two-thirds of parameter space, consists of directions along which parameters remain virtually unchanged during training.
→ Perturbing parameters along bulk directions barely affects in-distribution predictions but significantly changes out-of-distribution behavior.
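Below is a minimal sketch (not the authors' code) of how such a training Jacobian could be computed for a toy MLP trained with full-batch gradient descent, then partitioned into the three spectral regions. The architecture, data, learning rate, step count, and the ±0.05 "close to 1" threshold are all illustrative assumptions.

```python
import jax
import jax.numpy as jnp

key = jax.random.PRNGKey(0)
kx, ky, kp = jax.random.split(key, 3)
X = jax.random.normal(kx, (64, 8))    # toy inputs
Y = jax.random.normal(ky, (64, 1))    # toy regression targets

D_IN, D_HID, D_OUT = 8, 16, 1
N_PARAMS = D_IN * D_HID + D_HID + D_HID * D_OUT + D_OUT  # 161 parameters

def unflatten(theta):
    """Split a flat parameter vector into MLP weights and biases."""
    i = 0
    W1 = theta[i:i + D_IN * D_HID].reshape(D_IN, D_HID); i += D_IN * D_HID
    b1 = theta[i:i + D_HID]; i += D_HID
    W2 = theta[i:i + D_HID * D_OUT].reshape(D_HID, D_OUT); i += D_HID * D_OUT
    b2 = theta[i:i + D_OUT]
    return W1, b1, W2, b2

def loss(theta):
    W1, b1, W2, b2 = unflatten(theta)
    preds = jnp.tanh(X @ W1 + b1) @ W2 + b2
    return jnp.mean((preds - Y) ** 2)

def train(theta0, steps=500, lr=0.1):
    """Full-batch gradient descent; maps initial params to final params."""
    def step(theta, _):
        return theta - lr * jax.grad(loss)(theta), None
    theta_final, _ = jax.lax.scan(step, theta0, None, length=steps)
    return theta_final

theta0 = 0.1 * jax.random.normal(kp, (N_PARAMS,))

# Training Jacobian: derivative of final parameters w.r.t. initial parameters.
J = jax.jacfwd(train)(theta0)                 # shape (N_PARAMS, N_PARAMS)
svals = jnp.linalg.svd(J, compute_uv=False)

# Partition the spectrum into the three regions described above.
tol = 0.05                                    # "close to 1" threshold (assumption)
print("chaotic:", int((svals > 1 + tol).sum()),
      "bulk:",    int((jnp.abs(svals - 1) <= tol).sum()),
      "stable:",  int((svals < 1 - tol).sum()))
```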
-----
🎯 Key Insights:
→ Training is intrinsically low-dimensional, with most parameter changes happening in a small subspace
→ The bulk subspace is independent of initialization and labels but depends strongly on input data
→ Training linearization remains valid much longer along bulk directions than chaotic ones
→ The bulk overlaps significantly with the nullspace of the parameter-to-function Jacobian on test data (see the sketch after this list)
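To illustrate the nullspace claim in the last bullet, here is a rough continuation of the sketch above (it reuses `unflatten`, `train`, `theta0`, and `J`): it forms the parameter-to-function Jacobian on held-out inputs and checks that bulk directions are moved far less by it than the remaining directions. The test set and the ±0.05 threshold are again assumptions.

```python
import jax
import jax.numpy as jnp

X_test = jax.random.normal(jax.random.PRNGKey(1), (32, 8))   # held-out inputs

def f(theta):
    """Network outputs on the test inputs, flattened: the parameter-to-function map."""
    W1, b1, W2, b2 = unflatten(theta)
    return (jnp.tanh(X_test @ W1 + b1) @ W2 + b2).ravel()

theta_final = train(theta0)
F = jax.jacrev(f)(theta_final)                # parameter-function Jacobian, (32, N_PARAMS)

# Right singular vectors of the training Jacobian, split into bulk vs. the rest.
_, s, Vt = jnp.linalg.svd(J)
bulk_mask = jnp.abs(s - 1) <= 0.05
V_bulk = Vt[bulk_mask].T                      # columns spanning the bulk
V_rest = Vt[~bulk_mask].T                     # chaotic + stable directions

# If the bulk lies (mostly) in F's nullspace, F should barely move bulk vectors.
resp = lambda V: jnp.linalg.norm(F @ V, axis=0).mean()
print("mean |F v| over bulk directions:     ", resp(V_bulk))
print("mean |F v| over remaining directions:", resp(V_rest))
```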
-----
📊 Results:
→ ~3000 out of 4810 singular values are extremely close to one
→ Bulk directions show near-perfect linear behavior across 7 orders of magnitude of perturbation size
→ Training restricted to the bulk's complement performs similarly to unconstrained training
→ Bulk subspaces from different random seeds show high similarity, far above the overlap expected between random subspaces (a rough check follows below)
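As a rough check on the seed-similarity result, the continuation below (reusing `train`, `theta0`, and `N_PARAMS` from the first sketch) recomputes the bulk subspace from a second random initialization and compares the two subspaces via the mean squared cosine of their principal angles, against a random-subspace baseline. The seeds and the overlap metric are my assumptions, not necessarily the paper's exact measure.

```python
import jax
import jax.numpy as jnp

def bulk_basis(theta_init, tol=0.05):
    """Bulk right-singular vectors of the training Jacobian at a given init."""
    _, s, Vt = jnp.linalg.svd(jax.jacfwd(train)(theta_init))
    return Vt[jnp.abs(s - 1) <= tol].T        # shape (N_PARAMS, k)

theta0_b = 0.1 * jax.random.normal(jax.random.PRNGKey(42), (N_PARAMS,))
V_a = bulk_basis(theta0)                      # bulk subspace from the first init
V_b = bulk_basis(theta0_b)                    # bulk subspace from a second init

def overlap(A, B):
    """Mean squared cosine of principal angles between span(A) and span(B)."""
    return float(jnp.linalg.norm(A.T @ B) ** 2 / min(A.shape[1], B.shape[1]))

# Baseline: a random subspace of the same dimension as V_b.
R, _ = jnp.linalg.qr(jax.random.normal(jax.random.PRNGKey(7), (N_PARAMS, V_b.shape[1])))

print("bulk(init A) vs bulk(init B):", overlap(V_a, V_b))
print("bulk(init A) vs random:      ", overlap(V_a, R))
```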