
"Inferring from Logits: Exploring Best Practices for Decoding-Free Generative Candidate Selection"

The podcast below on this paper was generated with Google's Illuminate.

Generative LLMs usually generate text token-by-token. This paper explores methods to directly select whole answers from a set of candidates without generating text.

The paper proposes using the initial "logits" (raw output scores) of an LLM to estimate the probability of each candidate answer.

-----

https://arxiv.org/abs/2501.17338

📌 The methods studied here skip costly autoregressive token generation: they directly estimate probabilities for entire candidate answers, which offers large speed improvements on tasks with pre-defined answer choices.

📌 By working with raw output scores, the approach avoids generation issues such as failure to stop and malformed output formats. It is especially useful when models struggle with instruction following.

📌 The approach reveals how much "knowledge" is encoded in the initial, raw LLM outputs, before token-by-token decoding refines them. This opens interesting directions for LLM interpretability.

----------

Methods Explored in this Paper 🔧:

→ The core idea revolves around "decoding-free generative candidate selection." Instead of generating text token by token, various methods estimate the probability of each candidate answer directly from the initial logits.

→ These methods include using the logit of the first or last token of each candidate ("First" and "Last"), averaging the logits across all of its tokens ("Average"), or summing them ("Sum"); see the first sketch after this list.

→ A baseline that performs full decoding and then matches the generated output to a candidate is also tested.

→ Dense retrieval serves as another baseline: the question and each candidate answer are encoded into vectors, and the cosine similarity between the question vector and each answer vector gives a relevance score (see the second sketch below).
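A minimal sketch of decoding-free candidate selection from first-step logits, assuming a single forward pass over the prompt and scoring each candidate by aggregating the logits of its own token ids from that one logit vector. The model name and function names are illustrative, not the paper's exact setup.

```python
# Score candidate answers from the logits of the first output step,
# without any autoregressive decoding. Sketch only, under the assumptions above.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder; any causal LM works
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

def score_candidates(prompt, candidates, mode="sum"):
    """Return one score per candidate, using only the prompt's final-position logits."""
    inputs = tok(prompt, return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs)
    # Logits over the vocabulary at the first generation step (last prompt position).
    first_step_logits = out.logits[0, -1, :]

    scores = []
    for cand in candidates:
        ids = tok(cand, add_special_tokens=False)["input_ids"]
        cand_logits = first_step_logits[ids]  # logits of the candidate's tokens
        if mode == "first":
            s = cand_logits[0]
        elif mode == "last":
            s = cand_logits[-1]
        elif mode == "average":
            s = cand_logits.mean()
        else:  # "sum"
            s = cand_logits.sum()
        scores.append(s.item())
    return scores

prompt = "Q: What is the capital of France? Answer:"
candidates = [" Paris", " London", " Berlin"]
scores = score_candidates(prompt, candidates, mode="average")
print(max(zip(scores, candidates)))  # highest-scoring candidate
```

Because only one forward pass is needed regardless of the candidate pool size, this is where the large speedups over full decoding come from.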
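A minimal sketch of the dense-retrieval baseline: encode the question and each candidate answer, then rank candidates by cosine similarity. The embedding model name is an illustrative choice, not the paper's.

```python
# Dense retrieval baseline: cosine similarity between question and answer embeddings.
from sentence_transformers import SentenceTransformer, util

encoder = SentenceTransformer("all-MiniLM-L6-v2")  # placeholder embedding model

def retrieval_scores(question, candidates):
    q_vec = encoder.encode(question, convert_to_tensor=True)
    cand_vecs = encoder.encode(candidates, convert_to_tensor=True)
    # Cosine similarity between the question vector and every candidate vector.
    return util.cos_sim(q_vec, cand_vecs)[0].tolist()

candidates = ["Paris", "London", "Berlin"]
scores = retrieval_scores("What is the capital of France?", candidates)
print(max(zip(scores, candidates)))
```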

-----

Key Insights 💡:

→ Estimation methods can be reasonable on challenging tasks, or when the base LLM is poor at generating good answers.

→ When full decoding works well, estimation methods do not perform as well. Their performance is sensitive to the LLM and to dataset characteristics.

→ Using the logits from the first output step is the best and most efficient approach. Using all tokens in the candidate answer for estimation is better than just using a few "key" tokens.

-----

Results 📊:

→ Estimation methods outperform full decoding on some tasks with non-instruction-tuned models, with up to a +29.25 increase in recall.

→ Decoding-free methods are much faster than full decoding, showing speedups of 25.1x to 57.6x on tasks with large candidate pools.
