"Unraveling the Capabilities of Language Models in News Summarization"
The podcast below on this paper was generated with Google's Illuminate.
https://arxiv.org/abs/2501.18128
Staying informed about current events is crucial yet challenging given the overwhelming volume of news. Automatic news summarization offers a solution, but how well recent smaller LLMs perform compared to larger models remains unclear.
This paper addresses that gap by benchmarking the news summarization capabilities of 20 recent LLMs, with a focus on smaller models, in zero-shot and few-shot learning settings.
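To make the two prompting settings concrete, here is a minimal sketch of how zero-shot and few-shot (e.g., three-shot) summarization prompts can be assembled. The templates are illustrative assumptions, not the paper's exact prompts.

```python
# Illustrative prompt builders for the two evaluation settings.
# The wording below is an assumption; the paper's exact prompts may differ.

def zero_shot_prompt(article: str) -> str:
    # Zero-shot: the model sees only the task instruction and the article.
    return (
        "Summarize the following news article in a few sentences.\n\n"
        f"Article: {article}\n\nSummary:"
    )

def few_shot_prompt(article: str, demos: list[tuple[str, str]]) -> str:
    # Few-shot: prepend (article, gold summary) pairs, e.g. three for 3-shot.
    # Low-quality gold summaries here can hurt output quality (see Key Insights).
    blocks = [f"Article: {a}\nSummary: {s}" for a, s in demos]
    blocks.append(f"Article: {article}\nSummary:")
    return "Summarize the following news articles in a few sentences.\n\n" + "\n\n".join(blocks)
```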
-----
📌 This benchmark offers practical guidance for choosing an LLM for news summarization: it empirically compares 20 models and reveals trade-offs between model size and summarization quality across datasets.
📌 The study shows that low-quality gold summaries degrade few-shot learning for summarization, underscoring the critical role of high-quality reference data even in prompt-based approaches.
📌 A multi-faceted evaluation combining automatic metrics, human reviews, and an LLM judge gives a robust assessment of summarization quality, capturing aspects that basic lexical-overlap metrics miss.
----------
Methods Explored in this Paper 🧠:
→ This research benchmarked 20 LLMs for news summarization.
→ The models included both publicly available and proprietary models such as GPT-3.5-Turbo and Gemini.
→ Three datasets were used: CNN/Daily Mail, Newsroom, and Extreme Summarization (XSum).
→ The study evaluated model performance in zero-shot and three-shot learning settings.
→ Automatic metrics such as ROUGE, METEOR, and BERTScore were used for evaluation (see the scoring sketch after this list).
→ Human evaluation and an LLM-as-a-judge (Claude 3 Sonnet) were also employed for a comprehensive assessment.
→ The paper used carefully designed prompts to guide the models.
→ Default generation settings were kept across all models for a fair comparison.
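As a concrete companion to the metric setup above, here is a minimal scoring sketch using the Hugging Face `evaluate` library. The library choice and example texts are assumptions; the paper's actual evaluation code is not reproduced here.

```python
# Sketch: scoring generated summaries with ROUGE, METEOR, and BERTScore.
# Requires: pip install evaluate rouge_score nltk bert_score
import evaluate

rouge = evaluate.load("rouge")
meteor = evaluate.load("meteor")
bertscore = evaluate.load("bertscore")

# Hypothetical model outputs and gold references.
predictions = ["The government announced new climate targets on Friday."]
references = ["New climate targets were announced by the government on Friday."]

rouge_res = rouge.compute(predictions=predictions, references=references)
meteor_res = meteor.compute(predictions=predictions, references=references)
bert_res = bertscore.compute(predictions=predictions, references=references, lang="en")

print(f"ROUGE-L:   {rouge_res['rougeL']:.4f}")
print(f"METEOR:    {meteor_res['meteor']:.4f}")
print(f"BERTScore: {sum(bert_res['f1']) / len(bert_res['f1']):.4f}")  # mean F1
```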
-----
Key Insights 💡:
→ Larger models like GPT-3.5-Turbo and GPT-4 generally outperform smaller models in news summarization.
→ Providing demonstration examples in the few-shot setting did not consistently improve summarization quality; in some cases it even made it worse.
→ This performance drop in few-shot learning is attributed mainly to the low quality of the gold-standard summaries in the datasets.
→ Despite the dominance of larger models, certain smaller models, including Qwen1.5-7B, SOLAR-10.7B-Instruct-v1.0, Meta-Llama-3-8B, and Zephyr-7B-Beta, showed promising results.
→ These smaller models stand as competitive alternatives to larger models for news summarization.
-----
Results 📊:
→ GPT-3.5-Turbo achieved the top automatic-metric scores on the CNN/DM and Newsroom datasets.
→ The Yi-9B model attained the highest ROUGE-L score (0.2534) and BERTScore (0.8884) on the XSum dataset.
→ Gemini-1.5-Pro achieved the highest METEOR score (0.2923) on XSum, outperforming the GPT models on this metric.
→ Among the smaller models, human evaluators and the LLM judge consistently favored Qwen1.5-7B, Meta-Llama-3-8B, and Zephyr-7B-Beta (a judging sketch follows below).
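For illustration, here is a minimal LLM-as-a-judge sketch in the spirit of the paper's Claude 3 Sonnet setup, using the Anthropic Python SDK. The rubric, rating scale, and judging prompt are assumptions, not the paper's exact protocol.

```python
# Sketch of an LLM-as-a-judge call (Claude 3 Sonnet as judge).
# Requires: pip install anthropic, plus ANTHROPIC_API_KEY in the environment.
# The rubric below is a hypothetical stand-in for the paper's judging prompt.
import anthropic

client = anthropic.Anthropic()

def judge_summary(article: str, summary: str) -> str:
    prompt = (
        "You are evaluating a news summary. Rate it from 1 (poor) to 5 "
        "(excellent) on faithfulness to the article, coverage of the main "
        "points, and fluency. Reply with the score and a one-sentence reason.\n\n"
        f"Article:\n{article}\n\nSummary:\n{summary}"
    )
    response = client.messages.create(
        model="claude-3-sonnet-20240229",
        max_tokens=200,
        messages=[{"role": "user", "content": prompt}],
    )
    return response.content[0].text
```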


