
"LiGAR: LiDAR-Guided Hierarchical Transformer for Multi-Modal Group Activity Recognition"

The podcast on this paper was generated with Google's Illuminate.

A hierarchical transformer combines LiDAR, video, and text to track group activities

📚 https://arxiv.org/abs/2410.21108

🎯 Original Problem:

Group Activity Recognition (GAR) faces challenges in handling complex multi-agent interactions and occlusions in real-world scenarios, especially when relying on single-modality data.

-----

🔧 Solution in this Paper:

• Introduces LiGAR, a LiDAR-guided hierarchical transformer that uses LiDAR as a structural backbone

• Implements a Multi-Scale LiDAR Transformer (MLT) to build hierarchical scene representations

• Uses Cross-Modal Guided Attention (CMGA) to align features across modalities

• Employs an Adaptive Fusion Module (AFM) to dynamically weight each modality's contribution

• Processes information at three scales throughout the pipeline for multi-granular understanding
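The guided-attention and adaptive-fusion ideas above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the feature dimension, head count, and module structure are assumptions, and the fusion here uses simple learned scalar weights per modality.

```python
import torch
import torch.nn as nn

D = 64  # shared feature dimension across modalities (assumed)

class CrossModalGuidedAttention(nn.Module):
    """LiDAR features act as queries; another modality supplies keys/values."""
    def __init__(self, dim):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)

    def forward(self, lidar, other):
        # Align the other modality's features to the LiDAR representation.
        aligned, _ = self.attn(query=lidar, key=other, value=other)
        return aligned

class AdaptiveFusion(nn.Module):
    """Learns one scalar weight per modality, normalized with softmax."""
    def __init__(self, n_modalities):
        super().__init__()
        self.logits = nn.Parameter(torch.zeros(n_modalities))

    def forward(self, feats):  # feats: list of (B, N, D) tensors
        weights = torch.softmax(self.logits, dim=0)
        return sum(w * f for w, f in zip(weights, feats))

# Toy inputs: batch of 2 scenes, 16 tokens per modality.
B, N = 2, 16
lidar = torch.randn(B, N, D)
video = torch.randn(B, N, D)
text = torch.randn(B, N, D)

cmga = CrossModalGuidedAttention(D)
fused = AdaptiveFusion(3)([lidar, cmga(lidar, video), cmga(lidar, text)])
print(fused.shape)  # torch.Size([2, 16, 64])
```

The key design point is the asymmetry: LiDAR always supplies the queries, so the geometry of the point cloud anchors how the other modalities are aligned before fusion.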

-----

💡 Key Insights:

• LiDAR data significantly improves handling of occlusions and spatial arrangements

• Multi-modal fusion performs better than single modality approaches

• Hierarchical processing at multiple scales captures both fine details and broader contexts

• The model maintains high performance even when LiDAR is unavailable at inference time
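The multi-scale insight can be made concrete with a small sketch: coarser scales are formed by pooling groups of fine-grained tokens (e.g. individuals into sub-groups into a scene summary). The pooling factors here are hypothetical, chosen only to illustrate the hierarchy.

```python
import torch

def pool_tokens(x, factor):
    """Average-pool groups of `factor` adjacent tokens into one token."""
    B, N, D = x.shape
    return x.view(B, N // factor, factor, D).mean(dim=2)

# Toy fine-grained features: 2 scenes, 16 person-level tokens, dim 64.
feats = torch.randn(2, 16, 64)

# Three scales: individual, sub-group, whole scene (factors assumed).
scales = [feats, pool_tokens(feats, 4), pool_tokens(feats, 16)]
print([s.shape[1] for s in scales])  # [16, 4, 1]
```

Processing all three levels lets fine-grained detail and global context inform each other, which is what the hierarchy is for.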

-----

📊 Results:

• Achieves 10.6% improvement in F1-score on JRDB-PAR dataset

• Shows 5.9% gain in Mean Per Class Accuracy on NBA dataset

• Matches or exceeds fully supervised approaches while using only weak supervision

• Maintains performance across diverse scenarios from sports to surveillance

• Uses only 14.2M parameters, making it the most parameter-efficient among competing methods
