Generative LLMs typically produce text one token at a time. This paper explores methods for directly selecting whole answers from a set of candidates without generating any text.
The paper proposes using the initial "logits" (raw output scores) of an LLM to estimate the probability of each candidate answer.
-----
https://arxiv.org/abs/2501.17338
📌 This paper's method allows skipping the costly autoregressive token generation. It directly estimates probabilities for entire candidate answers. This offers huge speed improvements for tasks with pre-defined answer choices.
📌 By using raw output scores, the method avoids decoding-time issues such as halting and bad output formats, making it useful when models struggle with instruction following.
📌 The approach reveals how much "knowledge" is encoded in the initial, raw LLM outputs, before token-by-token decoding refines them. It opens interesting avenues for LLM interpretability.
----------
Methods Explored in this Paper 🔧:
→ The core idea revolves around "decoding-free generative candidate selection." Instead of generating text token by token, various methods estimate the probability of each candidate answer directly from the initial logits.
→ These methods include taking the logit of the first or last token of each candidate ("First" and "Last"), averaging the logits across all of its tokens ("Average"), or summing them ("Sum"); a minimal sketch follows this list.
→ A baseline that runs full decoding and then matches the generated output to a candidate is also tested.
→ Dense retrieval is also used as a baseline: the question and each answer option are encoded into vectors, and the cosine similarity between the question vector and each option vector serves as its relevance score (second sketch below).
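A minimal sketch of how such first-step candidate scoring could look, assuming a Hugging Face causal LM. This is an illustrative reading of the First/Last/Average/Sum variants described above, not the authors' code, and the model name is a placeholder:

```python
# Sketch: score candidates from the logits of ONE forward pass (no decoding).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder; the paper evaluates other LLMs
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

def score_candidates(question: str, candidates: list[str], method: str = "sum"):
    """Estimate a score for each candidate using only the first-step logits."""
    inputs = tokenizer(question, return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs)
    # Logit vector over the vocabulary at the first output step,
    # i.e., the prediction made right after the last prompt token.
    first_step_logits = out.logits[0, -1, :]

    scores = []
    for cand in candidates:
        token_ids = tokenizer(cand, add_special_tokens=False)["input_ids"]
        cand_logits = first_step_logits[token_ids]
        if method == "first":      # logit of the candidate's first token
            score = cand_logits[0]
        elif method == "last":     # logit of the candidate's last token
            score = cand_logits[-1]
        elif method == "average":  # mean over all candidate tokens
            score = cand_logits.mean()
        else:                      # "sum" over all candidate tokens
            score = cand_logits.sum()
        scores.append(score.item())
    return scores

# Example: pick the highest-scoring option without generating any tokens.
question = "Q: What is the capital of France?\nA:"
options = ["Paris", "London", "Berlin"]
scores = score_candidates(question, options, method="average")
print(options[max(range(len(options)), key=lambda i: scores[i])])
```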
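And a minimal sketch of the dense-retrieval baseline, assuming the sentence-transformers library; the encoder name is a placeholder, not necessarily the one used in the paper:

```python
# Sketch: dense-retrieval baseline via cosine similarity of embeddings.
from sentence_transformers import SentenceTransformer, util

encoder = SentenceTransformer("all-MiniLM-L6-v2")  # placeholder encoder

def retrieval_scores(question: str, candidates: list[str]) -> list[float]:
    """Cosine similarity between the question vector and each candidate vector."""
    q_vec = encoder.encode(question, convert_to_tensor=True)
    c_vecs = encoder.encode(candidates, convert_to_tensor=True)
    return util.cos_sim(q_vec, c_vecs)[0].tolist()

print(retrieval_scores("What is the capital of France?", ["Paris", "London", "Berlin"]))
```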
-----
Key Insights 💡:
→ Estimation methods can be reasonable on challenging tasks or when the base LLM struggles to generate good answers.
→ When full decoding works well, estimation methods do not perform as well, and they are sensitive to the LLM and dataset characteristics.
→ Using the logits from the first output step is the best and most efficient approach. Using all tokens in the candidate answer for estimation is better than just using a few "key" tokens.
-----
Results 📊:
→ Estimation methods outperform full decoding on some tasks using non-instruction-tuned models, with up to a +29.25 increase in recall.
→ Decoding-free methods are much faster than full decoding, showing speedups of 25.1x to 57.6x on tasks with large candidate pools.