This paper explores zero-shot prompting and few-shot fine-tuning for document image classification using LLMs, aiming to reduce the need for large annotated datasets.
-----
https://arxiv.org/abs/2412.13859
Original Problem 🤔:
→ Document image classification traditionally requires large annotated datasets, which are costly and time-consuming to create.
→ Existing methods struggle to adapt to new document types or datasets without extensive retraining.
-----
Solution in this Paper 💡:
→ The researchers investigate zero-shot prompting and few-shot fine-tuning for document classification using LLMs.
→ They evaluate various models, including text-based LLMs, embedding models, image-based models, and multi-modal LLMs.
→ The study uses the RVL-CDIP dataset, containing 16 document classes, and explores different training scenarios: zero-shot, one-shot, and few-shot (160, 800, and 1600 samples).
→ They employ Amazon Textract for high-quality OCR and use Low-Rank Adaptation (LoRA) for fine-tuning LLMs.
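
A minimal sketch of what LoRA fine-tuning of a decoder LLM on OCR text could look like with Hugging Face transformers + peft; the checkpoint name, LoRA hyperparameters, and prompt template below are illustrative assumptions, not the paper's exact configuration:

```python
# Sketch: LoRA fine-tuning a causal LLM to generate document class labels.
# Assumes OCR text (e.g., from Amazon Textract) is already extracted.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

MODEL = "mistralai/Mistral-7B-v0.1"   # assumption: any causal LLM checkpoint fits here
tokenizer = AutoTokenizer.from_pretrained(MODEL)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(MODEL, torch_dtype=torch.bfloat16)

# Low-Rank Adaptation: train small rank-decomposition matrices instead of all 7B weights.
lora_cfg = LoraConfig(
    r=8, lora_alpha=16, lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],   # assumption: adapt attention projections only
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_cfg)
model.print_trainable_parameters()         # typically well under 1% of total parameters

def build_example(ocr_text: str, label: str) -> str:
    # Generative formulation: the model learns to emit the class name as text.
    return f"Document text:\n{ocr_text}\n\nDocument class: {label}"
```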
-----
Key Insights from this Paper 🔑:
→ Multi-modal models (GPT-4-Vision) outperform text-only models in zero-shot scenarios (see the prompting sketch after this list)
→ Generative fine-tuning of LLMs shows better performance with very few samples compared to classifier-based approaches
→ Embedding-based methods perform well with more training samples but are outperformed by fine-tuned LLMs
→ Image-based models (Donut) show consistent performance across different test sets but are generally outperformed by text-based methods in few-shot scenarios
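
A minimal sketch of the zero-shot setup with a vision-capable LLM via the OpenAI API; the prompt wording and model name are assumptions rather than the paper's exact prompt:

```python
# Sketch: zero-shot document image classification with a multi-modal LLM.
import base64
from openai import OpenAI

# The 16 RVL-CDIP document classes.
RVL_CDIP_CLASSES = [
    "letter", "form", "email", "handwritten", "advertisement", "scientific report",
    "scientific publication", "specification", "file folder", "news article",
    "budget", "invoice", "presentation", "questionnaire", "resume", "memo",
]

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def classify_page(image_path: str) -> str:
    with open(image_path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode()
    response = client.chat.completions.create(
        model="gpt-4o",  # assumption: stand-in for the GPT-4-Vision model used in the paper
        messages=[{
            "role": "user",
            "content": [
                {"type": "text",
                 "text": "Classify this document image into exactly one of: "
                         + ", ".join(RVL_CDIP_CLASSES)
                         + ". Answer with the class name only."},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{b64}"}},
            ],
        }],
        max_tokens=10,
    )
    return response.choices[0].message.content.strip().lower()
```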
-----
Results 📊:
→ Zero-shot prompting with GPT-4-Vision: 69.9% accuracy on RVL-CDIP-160x5
→ Few-shot fine-tuning of Mistral-7B (generative) with 160 samples: 72.5% accuracy
→ Best overall result: 83.4% accuracy with Mistral-7B (classifier) fine-tuned on 1600 samples
→ These results are promising compared to fully supervised models trained on 320,000 samples (85.0% for BERT)