Simulation-trained small VLMs (Vision-Language Models) outperform giant models in understanding the physical world
Vision-Language Models (VLMs) struggle with physical reasoning tasks. This paper introduces two methods using simulated data to enhance VLMs' ability to understand and predict physical interactions.
https://arxiv.org/abs/2412.08619
Original Problem 🤔:
→ Current VLMs fail to interpret basic spatial relationships and physical interactions
→ They struggle with understanding object attributes and predicting physical events
→ Existing training datasets lack specific annotations of physical events and causal relationships
-----
Solution in this Paper 🔧:
→ Fine-tuning VLMs with physics question-answer pairs generated automatically from simulations (see the sketch after this list)
→ Creating Physics Context Builders (PCBs) - specialized VLMs that generate detailed physical scene descriptions
→ Implementing a multi-agent framework where PCBs provide physics context to assist LLMs
→ Using simulation data to train models without human intervention
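
A rough illustration of the simulation-to-QA idea is below: ground-truth simulator state gets turned directly into question-answer pairs for fine-tuning. The `SimObject` fields and the `make_qa_pairs` helper are hypothetical stand-ins for the simulator output, not the paper's actual pipeline.

```python
# Hypothetical sketch: turning simulator ground truth into physics QA pairs.
# SimObject and make_qa_pairs are illustrative assumptions, not the paper's API.
import json
import random
from dataclasses import dataclass
from typing import Optional

@dataclass
class SimObject:
    name: str
    color: str
    is_stable: bool               # did the object stay upright after the rollout?
    fell_at_step: Optional[int]   # simulation step at which it toppled, if any

def make_qa_pairs(objects: list) -> list:
    """Convert ground-truth scene state into descriptive/predictive QA pairs."""
    qa = []
    for obj in objects:
        # Descriptive question: attribute read directly from the scene graph.
        qa.append({
            "question": f"What color is the {obj.name}?",
            "answer": obj.color,
            "type": "descriptive",
        })
        # Predictive question: outcome read off the physics rollout.
        qa.append({
            "question": f"Will the {obj.name} remain stable?",
            "answer": "yes" if obj.is_stable else "no",
            "type": "predictive",
        })
    return qa

if __name__ == "__main__":
    scene = [SimObject("red block", "red", False, 42),
             SimObject("blue cylinder", "blue", True, None)]
    pairs = make_qa_pairs(scene)
    print(json.dumps(random.sample(pairs, 2), indent=2))
```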
-----
Key Insights 💡:
→ Small VLMs fine-tuned with simple simulation-generated QA pairs outperform much larger models
→ PCBs can enhance LLM reasoning without extensive retraining
→ Simulation-trained models successfully transfer to real-world scenarios
→ Multi-agent framework enables modular integration with commercial LLMs (a minimal sketch follows this list)
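
Here is a minimal sketch of how the multi-agent PCB setup could be wired: the fine-tuned PCB first turns the visual input into a physics-focused scene description, and an unmodified LLM then answers over that text. The `TextModel` protocol, `generate` signature, and `EchoModel` stub are assumptions for illustration, not the paper's interfaces.

```python
# Minimal sketch of the multi-agent idea: a Physics Context Builder (PCB)
# describes the scene, then a general-purpose LLM answers using that context.
# The interfaces below are placeholders, not the paper's actual code.
from typing import Protocol

class TextModel(Protocol):
    def generate(self, prompt: str) -> str: ...

def answer_with_pcb(pcb: TextModel, llm: TextModel,
                    scene_prompt: str, user_question: str) -> str:
    # Step 1: the fine-tuned PCB produces a detailed physical scene description
    # (object attributes, contacts, stability) from the visual input prompt.
    physics_context = pcb.generate(scene_prompt)

    # Step 2: the unchanged commercial LLM reasons over that description,
    # so the large model needs no retraining.
    prompt = (
        "Physical scene description:\n"
        f"{physics_context}\n\n"
        f"Question: {user_question}\n"
        "Answer using only the scene description above."
    )
    return llm.generate(prompt)

class EchoModel:
    """Stand-in model for this sketch; a real PCB/LLM would replace it."""
    def __init__(self, reply: str):
        self.reply = reply
    def generate(self, prompt: str) -> str:
        return self.reply

if __name__ == "__main__":
    pcb = EchoModel("A red block rests half off the table edge, unsupported.")
    llm = EchoModel("No, the block will likely fall.")
    print(answer_with_pcb(pcb, llm,
                          "Describe the physics of this scene.",
                          "Will the block stay on the table?"))
```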
-----
Results 📊:
→ Fine-tuned PaliGemma-3B achieved:
- 92.9% descriptive accuracy
- 94.7% explanatory accuracy
- 83.6% predictive accuracy
- 68.4% counterfactual accuracy
→ PCB integration improved baseline model performance by up to 30% on stability prediction tasks