Stealing Part of a Production Language Model
Linear algebra + API probing = exposed LLM architecture secrets
• Extracted ada and babbage projection matrices for <$20 USD
• Confirmed hidden dimensions: 1024 (ada), 2048 (babbage)
• Recovered gpt-3.5-turbo hidden dimension size
• Achieved mean squared error of 10^-4 in weight reconstruction
Original Problem 🔍:
LLMs like ChatGPT and PaLM-2 are black boxes, with little publicly known about their inner workings. Model-stealing attacks aim to extract this hidden information, but prior attacks have been limited to small models.
Solution in this Paper 🧠:
• Introduces first model-stealing attack on production LLMs
• Recovers embedding projection layer (up to symmetries) using API access
• Exploits final layer projection from hidden dimension to logit vector
• Uses targeted queries with logit bias and logprob information (see the logit-recovery sketch after this list)
• Applies linear algebraic techniques to reconstruct logits and weights
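To make the logit-bias step concrete, here is a minimal NumPy simulation, not the authors' code or the real OpenAI API: the vocabulary size, logit values, top-k of 5, and bias of 100 are all made-up stand-ins. The simulated endpoint returns only top-k logprobs; adding a large bias to a chosen token forces it into the top-k, and subtracting the bias and the logprob of an unbiased reference token recovers that token's logit relative to the reference.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab, top_k = 50, 5
true_logits = rng.normal(size=vocab)           # hidden model output for one fixed prompt

def api_topk_logprobs(logit_bias=None):
    """Simulated API: returns {token: logprob} for the top-k tokens,
    after adding the caller-supplied logit bias to the raw logits."""
    z = true_logits.copy()
    if logit_bias:
        for tok, b in logit_bias.items():
            z[tok] += b
    logprobs = z - np.log(np.sum(np.exp(z)))   # log-softmax
    top = np.argsort(logprobs)[-top_k:]
    return {int(t): float(logprobs[t]) for t in top}

# Recover every token's logit relative to a fixed reference token.
ref = int(np.argmax(true_logits))              # reference token: always stays in the top-k
B = 100.0                                      # large bias forces the target token into the top-k
recovered = np.zeros(vocab)
for tok in range(vocab):
    if tok == ref:
        continue
    resp = api_topk_logprobs({tok: B})
    # y_tok - y_ref = (z_tok + B) - z_ref  =>  z_tok - z_ref = y_tok - y_ref - B
    recovered[tok] = resp[tok] - resp[ref] - B

true_rel = true_logits - true_logits[ref]
print("max abs error:", np.max(np.abs(recovered - true_rel)))   # tiny floating-point error
```

Repeating this for many prompts gives the attacker one full (relative) logit vector per prompt, which is the raw material for the linear-algebra step below.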
Key Insights from this Paper 💡:
• Black-box LLMs vulnerable to precise parameter extraction
• Logit bias and logprob API features enable attacks
• Embedding dimension and projection matrix recoverable
• Attack works on OpenAI's ada, babbage, and gpt-3.5-turbo models
• Potential defenses include API restrictions and architectural changes
🧠 The attack works by exploiting the fact that the final layer of a language model projects from the low-dimensional hidden state to a much higher-dimensional logit vector (one entry per vocabulary token).
By making targeted queries to a model's API, the researchers are able to extract:
• The embedding dimension (hidden dimension size) of the model
• The entire projection matrix of the final layer
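A minimal NumPy sketch of that linear-algebra step, using random stand-in matrices with toy sizes rather than the extracted OpenAI weights: because every logit vector equals W·h for some hidden state h, logit vectors collected from many prompts lie in an h-dimensional subspace of the vocabulary-sized logit space. The singular values of the stacked logit matrix therefore reveal the hidden dimension, and its top left singular vectors give the projection matrix up to an invertible h×h transform.

```python
import numpy as np

rng = np.random.default_rng(1)
vocab, hidden, n_queries = 1000, 64, 256       # toy sizes; real models have vocab >> hidden

W = rng.normal(size=(vocab, hidden))           # final projection matrix (unknown to the attacker)
H = rng.normal(size=(hidden, n_queries))       # final hidden states for n_queries prompts
Q = W @ H                                      # logit vectors the attacker reconstructs via the API

# Step 1: the hidden dimension is the numerical rank of Q.
s = np.linalg.svd(Q, compute_uv=False)
est_hidden = int(np.sum(s > 1e-6 * s[0]))      # count singular values above the noise floor
print("estimated hidden dimension:", est_hidden)   # 64

# Step 2: the top-`hidden` left singular vectors span the column space of W,
# i.e. they equal W up to multiplication by some invertible h x h matrix G.
U = np.linalg.svd(Q, full_matrices=False)[0][:, :est_hidden]
G, *_ = np.linalg.lstsq(U, W, rcond=None)      # best-fit G with U @ G ≈ W
print("reconstruction MSE:", np.mean((U @ G - W) ** 2))  # near machine precision in this noiseless toy
```

Against the real API the reconstructed logits are noisy, which is why the reconstruction error quoted above is on the order of 10^-4 rather than machine precision.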