"Unraveling the Capabilities of Language Models in News Summarization"
The podcast below on this paper was generated with Google's Illuminate.
https://arxiv.org/abs/2501.18128
Staying informed about current events is crucial yet challenging given the overwhelming volume of news. Automatic news summarization offers a solution, but how well recent smaller LLMs perform compared to larger models remains unclear.
This paper addresses that gap by benchmarking the news summarization capabilities of 20 recent LLMs, with a focus on smaller models, in zero-shot and few-shot settings.
-----
📌 This benchmark offers practical guidance for selecting an LLM for news summarization. It empirically compares 20 models and reveals trade-offs between model size and summarization quality across datasets.
📌 The study shows that low-quality gold summaries degrade few-shot learning for summarization, underscoring the critical role of high-quality reference data even in prompt-based approaches.
📌 Multi-faceted evaluation combining automatic metrics, human review, and an LLM judge offers a robust assessment of summarization quality, capturing aspects that basic lexical-overlap metrics miss.
----------
Methods Explored in this Paper 🔧:
→ This research benchmarked 20 LLMs for news summarization.
→ The models included both publicly available and proprietary models, such as GPT-3.5-Turbo and Gemini.
→ Three datasets were used: CNN/Daily Mail, Newsroom, and Extreme Summarization (XSum).
→ The study evaluated model performance in zero-shot and three-shot learning settings (a prompt-assembly sketch follows this list).
→ Automatic metrics such as ROUGE, METEOR, and BERTScore were used for evaluation (see the metric sketch below).
→ Human evaluation and an LLM-as-a-judge (Claude 3 Sonnet) were also employed for a more comprehensive assessment (a judge-call sketch follows the list as well).
→ The paper used carefully designed prompts to guide the models.
→ Default generation settings were maintained across all models for a fair comparison.
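As a concrete illustration of the zero-shot vs. three-shot setup, here is a minimal sketch of prompt assembly. The instruction wording and the demonstration pairs are illustrative assumptions, not the paper's actual prompts:

```python
# Hypothetical prompt assembly for zero-shot and three-shot summarization.
# The instruction text and demo pairs are assumptions, not the paper's prompts.

def build_prompt(article: str, demos: list[tuple[str, str]] | None = None) -> str:
    """Zero-shot if demos is None; pass three (article, summary) pairs for 3-shot."""
    parts = ["Summarize the following news article in a few sentences."]
    for demo_article, demo_summary in demos or []:
        parts.append(f"Article: {demo_article}\nSummary: {demo_summary}")
    parts.append(f"Article: {article}\nSummary:")
    return "\n\n".join(parts)

zero_shot = build_prompt("Officials confirmed on Monday that ...")
three_shot = build_prompt(
    "Officials confirmed on Monday that ...",
    demos=[
        ("demo article 1 ...", "demo summary 1 ..."),
        ("demo article 2 ...", "demo summary 2 ..."),
        ("demo article 3 ...", "demo summary 3 ..."),
    ],
)
```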
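The automatic metrics can be computed with the Hugging Face `evaluate` library. This is one common implementation; the default settings below are an assumption rather than the paper's exact configuration:

```python
# Sketch: scoring generated summaries with ROUGE, METEOR, and BERTScore.
# Requires: pip install evaluate rouge_score nltk bert_score
# Uses default metric settings; not necessarily the paper's exact configuration.
import evaluate

predictions = ["model-generated summary ..."]
references = ["gold reference summary ..."]

rouge = evaluate.load("rouge").compute(predictions=predictions, references=references)
meteor = evaluate.load("meteor").compute(predictions=predictions, references=references)
bertscore = evaluate.load("bertscore").compute(
    predictions=predictions, references=references, lang="en"
)

print(f"ROUGE-L: {rouge['rougeL']:.4f}")
print(f"METEOR: {meteor['meteor']:.4f}")
print(f"BERTScore F1: {sum(bertscore['f1']) / len(bertscore['f1']):.4f}")
```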
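For the LLM-as-a-judge step, only the judge model (Claude 3 Sonnet) comes from the paper; the rubric and 1-5 scale in this sketch are assumptions:

```python
# Sketch of an LLM-as-a-judge call via the Anthropic SDK.
# Requires: pip install anthropic, plus ANTHROPIC_API_KEY in the environment.
# The judging rubric and 1-5 scale are assumed, not the paper's exact setup.
import anthropic

client = anthropic.Anthropic()

def judge_summary(article: str, summary: str) -> str:
    prompt = (
        "Rate the following news summary on a 1-5 scale for coherence, "
        "relevance, and faithfulness to the article. Give the scores and "
        "a one-sentence justification.\n\n"
        f"Article:\n{article}\n\nSummary:\n{summary}"
    )
    response = client.messages.create(
        model="claude-3-sonnet-20240229",  # the judge model named in the paper
        max_tokens=256,
        messages=[{"role": "user", "content": prompt}],
    )
    return response.content[0].text
```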
-----
Key Insights 💡:
→ Larger models like GPT-3.5-Turbo and GPT-4 generally outperform smaller models in news summarization tasks.
→ Providing demonstration examples in the few-shot setting did not consistently improve summarization quality; in some cases it even degraded it.
→ This performance drop in few-shot learning is attributed mainly to the low quality of the gold-standard summaries in the datasets.
→ Despite the dominance of larger models, certain smaller models, such as Qwen1.5-7B, SOLAR-10.7B-Instruct-v1.0, Meta-Llama-3-8B, and Zephyr-7B-Beta, showed promising results.
→ These smaller models are competitive alternatives to larger models for news summarization.
-----
Results 📊:
→ GPT-3.5-Turbo achieved top scores in automatic metrics on CNN/DM and Newsroom datasets.
→ The Yi-9B model attained the highest ROUGE-L score of 0.2534 and BERTScore of 0.8884 on the XSum dataset.
→ Gemini-1.5-Pro achieved the highest METEOR score of 0.2923 on the XSum dataset, outperforming GPT models in this metric.
→ Among the smaller models, human evaluators and the LLM judge consistently favored Qwen1.5-7B, Meta-Llama-3-8B, and Zephyr-7B-Beta.