Stealing Part of a Production Language Model
Linear algebra + API probing = exposed LLM architecture secrets
• Extracted ada and babbage projection matrices for <$20 USD
• Confirmed hidden dimensions: 1024 (ada), 2048 (babbage)
• Recovered gpt-3.5-turbo hidden dimension size
• Achieved mean squared error of 10^-4 in weight reconstruction
Original Problem 🔍:
LLMs like ChatGPT and PaLM-2 are black boxes, with little publicly known about their inner workings. Model-stealing attacks aim to extract this hidden information, but prior attacks have been limited to small models.
Solution in this Paper 🧠:
• Introduces first model-stealing attack on production LLMs
• Recovers embedding projection layer (up to symmetries) using API access
• Exploits final layer projection from hidden dimension to logit vector
• Uses targeted queries with logit bias and logprob information (see the logit-recovery sketch after this list)
• Applies linear algebraic techniques to reconstruct logits and weights
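To make the logit-bias step concrete, here is a minimal NumPy simulation, not the authors' code or the real OpenAI API: the vocabulary size, logit values, top-k of 5, and bias of 100 are all made-up stand-ins. The simulated endpoint returns only top-k logprobs; adding a large bias to a chosen token forces it into the top-k, and subtracting the bias and the logprob of an unbiased reference token recovers that token's logit relative to the reference.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab, top_k = 50, 5
true_logits = rng.normal(size=vocab)           # hidden model output for one fixed prompt

def api_topk_logprobs(logit_bias=None):
    """Simulated API: returns {token: logprob} for the top-k tokens,
    after adding the caller-supplied logit bias to the raw logits."""
    z = true_logits.copy()
    if logit_bias:
        for tok, b in logit_bias.items():
            z[tok] += b
    logprobs = z - np.log(np.sum(np.exp(z)))   # log-softmax
    top = np.argsort(logprobs)[-top_k:]
    return {int(t): float(logprobs[t]) for t in top}

# Recover every token's logit relative to a fixed reference token.
ref = int(np.argmax(true_logits))              # reference token: always stays in the top-k
B = 100.0                                      # large bias forces the target token into the top-k
recovered = np.zeros(vocab)
for tok in range(vocab):
    if tok == ref:
        continue
    resp = api_topk_logprobs({tok: B})
    # y_tok - y_ref = (z_tok + B) - z_ref  =>  z_tok - z_ref = y_tok - y_ref - B
    recovered[tok] = resp[tok] - resp[ref] - B

true_rel = true_logits - true_logits[ref]
print("max abs error:", np.max(np.abs(recovered - true_rel)))   # tiny floating-point error
```

Repeating this for many prompts gives the attacker one full (relative) logit vector per prompt, which is the raw material for the linear-algebra step below.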
Key Insights from this Paper 💡:
• Black-box LLMs vulnerable to precise parameter extraction
• Logit bias and logprob API features enable attacks
• Embedding dimension and projection matrix recoverable
• Attack works on OpenAI's ada, babbage, and gpt-3.5-turbo models
• Potential defenses include API restrictions and architectural changes
🧠 The attack works by exploiting the fact that the final layer of a language model projects from the low-dimensional hidden state to a much higher-dimensional logit vector (one entry per vocabulary token).
By making targeted queries to a model's API, the researchers are able to extract:
• The embedding dimension (hidden dimension size) of the model
• The entire projection matrix of the final layer
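A minimal NumPy sketch of that linear-algebra step, using random stand-in matrices with toy sizes rather than the extracted OpenAI weights: because every logit vector equals W·h for some hidden state h, logit vectors collected from many prompts lie in an h-dimensional subspace of the vocabulary-sized logit space. The singular values of the stacked logit matrix therefore reveal the hidden dimension, and its top left singular vectors give the projection matrix up to an invertible h×h transform.

```python
import numpy as np

rng = np.random.default_rng(1)
vocab, hidden, n_queries = 1000, 64, 256       # toy sizes; real models have vocab >> hidden

W = rng.normal(size=(vocab, hidden))           # final projection matrix (unknown to the attacker)
H = rng.normal(size=(hidden, n_queries))       # final hidden states for n_queries prompts
Q = W @ H                                      # logit vectors the attacker reconstructs via the API

# Step 1: the hidden dimension is the numerical rank of Q.
s = np.linalg.svd(Q, compute_uv=False)
est_hidden = int(np.sum(s > 1e-6 * s[0]))      # count singular values above the noise floor
print("estimated hidden dimension:", est_hidden)   # 64

# Step 2: the top-`hidden` left singular vectors span the column space of W,
# i.e. they equal W up to multiplication by some invertible h x h matrix G.
U = np.linalg.svd(Q, full_matrices=False)[0][:, :est_hidden]
G, *_ = np.linalg.lstsq(U, W, rcond=None)      # best-fit G with U @ G ≈ W
print("reconstruction MSE:", np.mean((U @ G - W) ** 2))  # near machine precision in this noiseless toy
```

Against the real API the reconstructed logits are noisy, which is why the reconstruction error quoted above is on the order of 10^-4 rather than machine precision.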