Hierarchical transformer combines LiDAR, video and text to track group activities
📚 https://arxiv.org/abs/2410.21108
🎯 Original Problem:
Group Activity Recognition (GAR) struggles with complex multi-agent interactions and occlusions in real-world scenes, especially when models rely on a single modality.
-----
🔧 Solution in this Paper:
• Introduces LiGAR, a LiDAR-guided hierarchical transformer that uses LiDAR geometry as a structural backbone
• Implements a Multi-Scale LiDAR Transformer (MLT) to build hierarchical scene representations
• Uses Cross-Modal Guided Attention (CMGA) to align LiDAR, video, and text features
• Employs an Adaptive Fusion Module (AFM) to dynamically weight each modality's contribution
• Processes information at three scales throughout the pipeline, capturing both fine-grained and scene-level context (illustrative sketches of the three modules follow this list)
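A minimal PyTorch sketch of what the Multi-Scale LiDAR Transformer could look like: self-attention over LiDAR tokens at each scale, with pooling between scales to build the hierarchy. The class name mirrors the paper's MLT, but the layer sizes, pooling scheme, and tensor shapes are illustrative assumptions, not the paper's exact design.

```python
import torch
import torch.nn as nn

class MultiScaleLiDARTransformer(nn.Module):
    """Illustrative MLT: one transformer encoder per scale (assumed config)."""
    def __init__(self, dim=256, heads=8, num_scales=3):
        super().__init__()
        self.encoders = nn.ModuleList([
            nn.TransformerEncoder(
                nn.TransformerEncoderLayer(d_model=dim, nhead=heads,
                                           batch_first=True),
                num_layers=2)
            for _ in range(num_scales)])
        # Strided pooling halves the token count between scales (assumption).
        self.pool = nn.AvgPool1d(kernel_size=2, stride=2)

    def forward(self, lidar_tokens):              # (B, N, dim) point/voxel tokens
        feats, x = [], lidar_tokens
        for enc in self.encoders:
            x = enc(x)                            # self-attention at this scale
            feats.append(x)                       # keep per-scale features
            x = self.pool(x.transpose(1, 2)).transpose(1, 2)  # coarsen tokens
        return feats                              # fine-to-coarse hierarchy
```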
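Cross-Modal Guided Attention can be sketched as cross-attention in which LiDAR features act as queries that pull aligned information out of the video and text tokens. The direction of guidance and the residual fusion below are assumptions for illustration; only the module's name and purpose come from the paper.

```python
import torch
import torch.nn as nn

class CrossModalGuidedAttention(nn.Module):
    """Illustrative CMGA: LiDAR queries attend over video and text tokens."""
    def __init__(self, dim=256, heads=8):
        super().__init__()
        self.attn_video = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.attn_text = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, lidar, video, text):        # each (B, N_mod, dim)
        # LiDAR features query the video stream, then the text stream.
        v, _ = self.attn_video(lidar, video, video)
        t, _ = self.attn_text(lidar, text, text)
        # Residual sum keeps the LiDAR structural signal intact.
        return self.norm(lidar + v + t)
```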
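One simple way to realize the Adaptive Fusion Module's "dynamic weighting of modal contributions" is a learned gate that scores each modality's pooled features and softmax-normalizes the scores into fusion weights. The gating design here is an assumption, not the paper's confirmed implementation.

```python
import torch
import torch.nn as nn

class AdaptiveFusionModule(nn.Module):
    """Illustrative AFM: softmax gate over per-modality feature vectors."""
    def __init__(self, dim=256):
        super().__init__()
        self.gate = nn.Linear(dim, 1)             # one scalar score per modality

    def forward(self, modal_feats):               # list of (B, dim) vectors
        stacked = torch.stack(modal_feats, dim=1)            # (B, M, dim)
        weights = torch.softmax(self.gate(stacked), dim=1)   # (B, M, 1)
        return (weights * stacked).sum(dim=1)     # weighted fusion -> (B, dim)
```

A per-sample gate like this can down-weight a noisy or missing modality on the fly, which is one plausible mechanism behind the robustness claims in the insights below.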
-----
💡 Key Insights:
• LiDAR data significantly improves handling of occlusions and spatial arrangements
• Multi-modal fusion outperforms single-modality approaches
• Hierarchical processing at multiple scales captures both fine details and broader context
• The model maintains high performance even when LiDAR is unavailable at inference time
-----
📊 Results:
• Achieves a 10.6% improvement in F1-score on the JRDB-PAR dataset
• Shows a 5.9% gain in Mean Per-Class Accuracy on the NBA dataset
• Matches or exceeds fully supervised approaches while using only weak supervision
• Maintains performance across diverse scenarios, from sports to surveillance
• Uses only 14.2M parameters, the most parameter-efficient among compared methods