This paper explores zero-shot prompting and few-shot fine-tuning for document image classification using LLMs, aiming to reduce the need for large annotated datasets.
-----
https://arxiv.org/abs/2412.13859
Original Problem 🤔:
→ Document image classification traditionally requires large annotated datasets, which are costly and time-consuming to create.
→ Existing methods struggle to adapt to new document types or datasets without extensive retraining.
-----
Solution in this Paper 💡:
→ The researchers investigate zero-shot prompting and few-shot fine-tuning for document classification using LLMs.
→ They evaluate various models, including text-based LLMs, embedding models, image-based models, and multi-modal LLMs.
→ The study uses the RVL-CDIP dataset, containing 16 document classes, and explores different training scenarios: zero-shot, one-shot, and few-shot (160, 800, and 1600 samples).
→ They employ Amazon Textract for high-quality OCR and use Low-Rank Adaptation (LoRA) for fine-tuning LLMs.
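
A minimal sketch of what LoRA fine-tuning of a decoder LLM on OCR text could look like with Hugging Face transformers + peft; the checkpoint name, LoRA hyperparameters, and prompt template below are illustrative assumptions, not the paper's exact configuration:

```python
# Sketch: LoRA fine-tuning a causal LLM to generate document class labels.
# Assumes OCR text (e.g., from Amazon Textract) is already extracted.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

MODEL = "mistralai/Mistral-7B-v0.1"   # assumption: any causal LLM checkpoint fits here
tokenizer = AutoTokenizer.from_pretrained(MODEL)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(MODEL, torch_dtype=torch.bfloat16)

# Low-Rank Adaptation: train small rank-decomposition matrices instead of all 7B weights.
lora_cfg = LoraConfig(
    r=8, lora_alpha=16, lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],   # assumption: adapt attention projections only
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_cfg)
model.print_trainable_parameters()         # typically well under 1% of total parameters

def build_example(ocr_text: str, label: str) -> str:
    # Generative formulation: the model learns to emit the class name as text.
    return f"Document text:\n{ocr_text}\n\nDocument class: {label}"
```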
-----
Key Insights from this Paper 🔑:
→ Multi-modal models (GPT-4-Vision) outperform text-only models in zero-shot scenarios (see the prompting sketch after this list)
→ Generative fine-tuning of LLMs shows better performance with very few samples compared to classifier-based approaches
→ Embedding-based methods perform well with more training samples but are outperformed by fine-tuned LLMs
→ Image-based models (Donut) show consistent performance across different test sets but are generally outperformed by text-based methods in few-shot scenarios
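
A minimal sketch of the zero-shot setup with a vision-capable LLM via the OpenAI API; the prompt wording and model name are assumptions rather than the paper's exact prompt:

```python
# Sketch: zero-shot document image classification with a multi-modal LLM.
import base64
from openai import OpenAI

# The 16 RVL-CDIP document classes.
RVL_CDIP_CLASSES = [
    "letter", "form", "email", "handwritten", "advertisement", "scientific report",
    "scientific publication", "specification", "file folder", "news article",
    "budget", "invoice", "presentation", "questionnaire", "resume", "memo",
]

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def classify_page(image_path: str) -> str:
    with open(image_path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode()
    response = client.chat.completions.create(
        model="gpt-4o",  # assumption: stand-in for the GPT-4-Vision model used in the paper
        messages=[{
            "role": "user",
            "content": [
                {"type": "text",
                 "text": "Classify this document image into exactly one of: "
                         + ", ".join(RVL_CDIP_CLASSES)
                         + ". Answer with the class name only."},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{b64}"}},
            ],
        }],
        max_tokens=10,
    )
    return response.choices[0].message.content.strip().lower()
```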
-----
Results 📊:
→ Zero-shot prompting with GPT-4-Vision: 69.9% accuracy on RVL-CDIP-160x5
→ Few-shot fine-tuning of Mistral-7B (generative) with 160 samples: 72.5% accuracy
→ Best overall result: 83.4% accuracy with Mistral-7B (classifier) fine-tuned on 1600 samples
→ These results are promising compared to fully supervised models trained on 320,000 samples (85.0% for BERT)