3D-LLaVA processes point clouds directly to understand and interact with 3D scenes, with no complex pipelines needed
3D-LLaVA introduces a streamlined architecture that processes point clouds directly, using a novel Omni Superpoint Transformer to handle multiple 3D vision-language tasks without complex pipelines.
https://arxiv.org/abs/2501.01163
🌍 Original Problem:
→ Current 3D LMMs rely on complex pipelines with offline feature extraction and task-specific modules, making them cumbersome to deploy and limiting their accessibility
→ Existing solutions struggle with fine-grained scene understanding and flexible human-agent interaction
🔧 Solution in this Paper:
→ 3D-LLaVA uses a minimalist design that takes only point clouds as input
→ The core innovation is the Omni Superpoint Transformer (OST) which serves three functions: visual feature selection, visual prompt encoding, and mask decoding
→ OST selectively retains visual tokens by distinguishing between foreground and background superpoints
→ The system uses distance-adaptive self-attention to guide superpoint queries towards relevant entities
→ A two-stage training process combines instance segmentation supervision with 2D-to-3D knowledge distillation (minimal illustrative sketches of these mechanisms follow below)
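
The token-selection idea can be made concrete with a minimal sketch (hypothetical names, not the authors' code): score each superpoint as foreground or background and keep only the top-scoring superpoint features as visual tokens for the LLM.

```python
# Minimal sketch (not the authors' code) of OST-style visual token selection:
# a linear head scores each superpoint as foreground vs. background, and only
# the top-scoring superpoint features are kept as visual tokens for the LLM.
import torch
import torch.nn as nn

class SuperpointTokenSelector(nn.Module):
    def __init__(self, feat_dim: int, keep_ratio: float = 0.25):
        super().__init__()
        self.score_head = nn.Linear(feat_dim, 1)  # foreground/background logit per superpoint
        self.keep_ratio = keep_ratio              # fraction of superpoints kept (assumed value)

    def forward(self, superpoint_feats: torch.Tensor):
        # superpoint_feats: (num_superpoints, feat_dim)
        scores = self.score_head(superpoint_feats).squeeze(-1)   # (num_superpoints,)
        k = max(1, int(self.keep_ratio * superpoint_feats.size(0)))
        keep_idx = torch.topk(scores, k).indices                 # likely-foreground superpoints
        return superpoint_feats[keep_idx], keep_idx

# Usage: compress per-superpoint features into a compact set of visual tokens.
feats = torch.randn(2048, 256)                        # e.g. 2048 superpoints, 256-dim features
tokens, idx = SuperpointTokenSelector(256)(feats)     # tokens: (512, 256)
```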
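
One plausible reading of distance-adaptive self-attention, sketched below with assumed names (`distance_adaptive_attention` and `tau` are not from the paper): standard attention logits are penalized by the 3D distance between a query's reference position and each superpoint center, steering queries toward nearby, relevant entities.

```python
# Hedged sketch of distance-adaptive self-attention (names are assumptions).
import torch

def distance_adaptive_attention(q, k, v, query_pos, superpoint_pos, tau: float = 1.0):
    # q: (num_queries, d); k, v: (num_superpoints, d)
    # query_pos: (num_queries, 3); superpoint_pos: (num_superpoints, 3)
    logits = (q @ k.t()) / q.size(-1) ** 0.5         # standard scaled dot-product scores
    dist = torch.cdist(query_pos, superpoint_pos)    # pairwise 3D distances
    attn = (logits - dist / tau).softmax(dim=-1)     # nearer superpoints get more weight
    return attn @ v                                  # (num_queries, d)

out = distance_adaptive_attention(
    torch.randn(16, 256), torch.randn(2048, 256), torch.randn(2048, 256),
    torch.randn(16, 3), torch.randn(2048, 3),
)
```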
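
The 2D-to-3D distillation signal could look like the following sketch; the exact objective and stage split are not specified here, so the cosine-alignment form and all tensor names are assumptions.

```python
# Hypothetical sketch of a 2D-to-3D knowledge distillation term: align each
# superpoint feature with a 2D vision-language feature lifted onto that
# superpoint, alongside the instance segmentation supervision mentioned above.
import torch
import torch.nn.functional as F

def distill_loss(superpoint_feats: torch.Tensor, lifted_2d_feats: torch.Tensor) -> torch.Tensor:
    # Both: (num_superpoints, feat_dim); penalize misalignment per superpoint.
    return (1 - F.cosine_similarity(superpoint_feats, lifted_2d_feats, dim=-1)).mean()

loss = distill_loss(torch.randn(2048, 256), torch.randn(2048, 256))
```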
💡 Key Insights:
→ Complex preprocessing pipelines can be eliminated while maintaining high performance
→ A single unified architecture can handle multiple 3D vision-language tasks effectively
→ Direct point cloud processing is feasible without offline feature extraction
📊 Results:
→ Achieved 92.6% CIDEr on the ScanQA benchmark, improving on the previous best by 4.9%
→ Reached 43.3% mIoU on ScanRefer and 42.7% mIoU on Multi3DRefer
→ Outperformed existing methods across multiple benchmarks