"Unraveling the Capabilities of Language Models in News Summarization"
The podcast below on this paper was generated with Google's Illuminate.
https://arxiv.org/abs/2501.18128
Staying informed about current events is crucial yet challenging given the overwhelming volume of news. Automatic news summarization offers a solution, but how well recent smaller LLMs perform compared to larger models remains unclear.
This paper addresses that gap by benchmarking the news summarization capabilities of 20 recent LLMs, with a focus on smaller models, in zero-shot and few-shot settings.
-----
📌 This benchmark offers practical guidance for selecting an LLM for news summarization. It empirically compares 20 models and reveals trade-offs between model size and summarization quality across datasets.
📌 The study shows that low-quality gold summaries degrade few-shot learning for summarization, underscoring the critical role of high-quality reference data even in prompt-based approaches.
📌 Multi-faceted evaluation combining automatic metrics, human review, and an LLM judge offers a robust assessment of summarization quality, capturing aspects that basic lexical-overlap metrics miss.
----------
Methods Explored in this Paper 🔧:
→ This research benchmarked 20 LLMs for news summarization.
→ The models included both publicly available and proprietary models, such as GPT-3.5-Turbo and Gemini.
→ Three datasets were used: CNN/Daily Mail, Newsroom, and Extreme Summarization (XSum).
→ The study evaluated model performance in zero-shot and three-shot learning settings (a prompt-assembly sketch follows this list).
→ Automatic metrics such as ROUGE, METEOR, and BERTScore were used for evaluation (see the metric sketch below).
→ Human evaluation and an LLM-as-a-judge (Claude 3 Sonnet) were also employed for a more comprehensive assessment (a judge-call sketch follows the list as well).
→ The paper used carefully designed prompts to guide the models.
→ Default generation settings were maintained across all models for a fair comparison.
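As a concrete illustration of the zero-shot vs. three-shot setup, here is a minimal sketch of prompt assembly. The instruction wording and the demonstration pairs are illustrative assumptions, not the paper's actual prompts:

```python
# Hypothetical prompt assembly for zero-shot and three-shot summarization.
# The instruction text and demo pairs are assumptions, not the paper's prompts.

def build_prompt(article: str, demos: list[tuple[str, str]] | None = None) -> str:
    """Zero-shot if demos is None; pass three (article, summary) pairs for 3-shot."""
    parts = ["Summarize the following news article in a few sentences."]
    for demo_article, demo_summary in demos or []:
        parts.append(f"Article: {demo_article}\nSummary: {demo_summary}")
    parts.append(f"Article: {article}\nSummary:")
    return "\n\n".join(parts)

zero_shot = build_prompt("Officials confirmed on Monday that ...")
three_shot = build_prompt(
    "Officials confirmed on Monday that ...",
    demos=[
        ("demo article 1 ...", "demo summary 1 ..."),
        ("demo article 2 ...", "demo summary 2 ..."),
        ("demo article 3 ...", "demo summary 3 ..."),
    ],
)
```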
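The automatic metrics can be computed with the Hugging Face `evaluate` library. This is one common implementation; the default settings below are an assumption rather than the paper's exact configuration:

```python
# Sketch: scoring generated summaries with ROUGE, METEOR, and BERTScore.
# Requires: pip install evaluate rouge_score nltk bert_score
# Uses default metric settings; not necessarily the paper's exact configuration.
import evaluate

predictions = ["model-generated summary ..."]
references = ["gold reference summary ..."]

rouge = evaluate.load("rouge").compute(predictions=predictions, references=references)
meteor = evaluate.load("meteor").compute(predictions=predictions, references=references)
bertscore = evaluate.load("bertscore").compute(
    predictions=predictions, references=references, lang="en"
)

print(f"ROUGE-L: {rouge['rougeL']:.4f}")
print(f"METEOR: {meteor['meteor']:.4f}")
print(f"BERTScore F1: {sum(bertscore['f1']) / len(bertscore['f1']):.4f}")
```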
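For the LLM-as-a-judge step, only the judge model (Claude 3 Sonnet) comes from the paper; the rubric and 1-5 scale in this sketch are assumptions:

```python
# Sketch of an LLM-as-a-judge call via the Anthropic SDK.
# Requires: pip install anthropic, plus ANTHROPIC_API_KEY in the environment.
# The judging rubric and 1-5 scale are assumed, not the paper's exact setup.
import anthropic

client = anthropic.Anthropic()

def judge_summary(article: str, summary: str) -> str:
    prompt = (
        "Rate the following news summary on a 1-5 scale for coherence, "
        "relevance, and faithfulness to the article. Give the scores and "
        "a one-sentence justification.\n\n"
        f"Article:\n{article}\n\nSummary:\n{summary}"
    )
    response = client.messages.create(
        model="claude-3-sonnet-20240229",  # the judge model named in the paper
        max_tokens=256,
        messages=[{"role": "user", "content": prompt}],
    )
    return response.content[0].text
```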
-----
Key Insights 💡:
→ Larger models like GPT-3.5-Turbo and GPT-4 generally outperform smaller models in news summarization tasks.
→ Providing demonstration examples in the few-shot setting did not consistently improve summarization quality; in some cases it even degraded it.
→ This performance drop in few-shot learning is attributed mainly to the low quality of the gold-standard summaries in the datasets.
→ Despite the dominance of larger models, certain smaller models, such as Qwen1.5-7B, SOLAR-10.7B-Instruct-v1.0, Meta-Llama-3-8B, and Zephyr-7B-Beta, showed promising results.
→ These smaller models are competitive alternatives to larger models for news summarization.
-----
Results 📊:
→ GPT-3.5-Turbo achieved top scores in automatic metrics on CNN/DM and Newsroom datasets.
→ The Yi-9B model attained the highest ROUGE-L score of 0.2534 and BERTScore of 0.8884 on the XSum dataset.
→ Gemini-1.5-Pro achieved the highest METEOR score of 0.2923 on the XSum dataset, outperforming GPT models in this metric.
→ Among the smaller models, human evaluators and the LLM judge consistently favored Qwen1.5-7B, Meta-Llama-3-8B, and Zephyr-7B-Beta.