A lean Bird's Eye View (BEV) transformer for instance prediction slashes self-driving prediction costs while matching state-of-the-art accuracy.
https://arxiv.org/abs/2411.06851
Original Problem 🎯:
Current self-driving systems split detection, tracking, and prediction into separate stages, so errors accumulate across the pipeline. Existing solutions also carry high processing times and parameter counts, making real-world deployment challenging.
-----
Solution in this Paper 🛠️:
→ Introduces a Bird's Eye View (BEV) instance prediction architecture focusing only on instance segmentation and flow prediction
→ Uses EfficientNet-B4 for multi-camera feature extraction and BEV projection
→ Implements two parallel SegFormer-based branches for segmentation and flow prediction
→ Features a flow warping mechanism to track instances across frames
→ Offers two configurations: a full version (13.46M parameters) and a tiny version (7.42M parameters); a minimal sketch of the pipeline follows this list
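To make the two-branch layout concrete, here is a minimal PyTorch sketch under stated assumptions: the camera-to-BEV projection is reduced to a pooled-context stand-in, the SegFormer branches are simplified to plain transformer encoder layers, and all names, channel counts, and grid sizes are illustrative, not the paper's implementation.

```python
import torch
import torch.nn as nn
from torchvision.models import efficientnet_b4

class BEVInstancePredictor(nn.Module):
    """Illustrative sketch of the two-branch BEV pipeline (not the paper's code)."""

    def __init__(self, bev_size=(50, 50), embed_dim=64, n_future=4):
        super().__init__()
        # Shared per-camera feature extractor: EfficientNet-B4 trunk (1792 output channels).
        self.backbone = efficientnet_b4(weights=None).features
        self.reduce = nn.Conv2d(1792, embed_dim, kernel_size=1)
        # Stand-in for the camera-to-BEV projection: pool each camera's features
        # into a context vector and broadcast it onto a learned BEV grid.
        self.bev_query = nn.Parameter(torch.randn(embed_dim, *bev_size))
        # Two parallel SegFormer-style branches, simplified to transformer
        # encoders over flattened BEV tokens.
        def make_branch():
            layer = nn.TransformerEncoderLayer(
                d_model=embed_dim, nhead=4, dim_feedforward=128, batch_first=True)
            return nn.TransformerEncoder(layer, num_layers=2)
        self.seg_branch = make_branch()
        self.flow_branch = make_branch()
        self.seg_head = nn.Linear(embed_dim, n_future)       # per-cell segmentation logits
        self.flow_head = nn.Linear(embed_dim, 2 * n_future)  # per-cell (dx, dy) flow

    def forward(self, images):  # images: (B, n_cams, 3, H, W)
        b, n, c, h, w = images.shape
        feats = self.reduce(self.backbone(images.flatten(0, 1)))
        cam_ctx = feats.mean(dim=(2, 3)).view(b, n, -1).mean(1)        # (B, D)
        bev = self.bev_query.unsqueeze(0) + cam_ctx[:, :, None, None]  # (B, D, Hb, Wb)
        tokens = bev.flatten(2).transpose(1, 2)                        # (B, Hb*Wb, D)
        # Both heads read the same BEV tokens, keeping segmentation and
        # flow aligned per cell.
        return self.seg_head(self.seg_branch(tokens)), self.flow_head(self.flow_branch(tokens))

# Usage: six surround cameras, one sample.
model = BEVInstancePredictor()
seg, flow = model(torch.randn(1, 6, 3, 224, 224))
```

The structural point is that segmentation and flow are predicted in parallel from a shared BEV representation, which is what lets the paradigm drop the separate detection and tracking stages.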
-----
Key Insights 💡:
→ Simplified paradigm focusing on just two tasks (segmentation and flow) can match SOTA performance
→ Efficient transformer architecture can significantly reduce parameters while maintaining accuracy
→ Flow warping at the individual BEV position level minimizes instance association errors (see the sketch after this list)
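To illustrate the position-level warping idea: instance IDs from the previous frame are carried forward by following each BEV cell's predicted flow, so association becomes a per-cell lookup instead of a box-matching step. The function name, the backward-flow convention, and the nearest-neighbor rounding below are assumptions for illustration, not the paper's exact mechanism.

```python
import torch

def warp_instance_ids(prev_ids, flow):
    """Propagate instance IDs one step forward via per-cell flow warping.

    prev_ids: (H, W) integer instance IDs at time t-1.
    flow:     (2, H, W) per-cell displacement (dx, dy) predicted at time t,
              pointing from each cell at t back to its source cell at t-1.
    """
    h, w = prev_ids.shape
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    # Follow the backward flow to the source cell, clamping at the borders.
    src_x = (xs + flow[0].round().long()).clamp(0, w - 1)
    src_y = (ys + flow[1].round().long()).clamp(0, h - 1)
    # Each current cell inherits the ID of the cell its flow points at,
    # so association happens per BEV position rather than per detection box.
    return prev_ids[src_y, src_x]

# Usage: an instance (ID 7) at cell (1, 1) moves to cell (2, 2).
prev = torch.zeros(4, 4, dtype=torch.long); prev[1, 1] = 7
flow = torch.zeros(2, 4, 4); flow[:, 2, 2] = -1.0  # cell (2,2) came from (1,1)
print(warp_instance_ids(prev, flow)[2, 2])  # tensor(7)
```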
-----
Results 📊:
→ Achieves 53.7 VPQ (Video Panoptic Quality; the metric is defined after this list) at short range, outperforming PowerBEV (53.4)
→ Reduces parameters from 39.13M (PowerBEV) to 13.46M (full) and 7.42M (tiny)
→ Cuts latency to 60-63 ms, versus 70 ms for PowerBEV
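For context, VPQ jointly scores segmentation quality and temporal ID consistency. Assuming the standard formulation from the FIERY benchmark, where a true-positive pair must also keep a consistent instance ID over time:

```latex
\mathrm{VPQ} \;=\; \sum_{t=0}^{T}
\frac{\sum_{(p_t,\, q_t) \in TP_t} \mathrm{IoU}(p_t, q_t)}
     {\lvert TP_t \rvert + \tfrac{1}{2}\lvert FP_t \rvert + \tfrac{1}{2}\lvert FN_t \rvert}
```

Here TP_t, FP_t, and FN_t are the true positives, false positives, and false negatives at timestep t, so the metric rewards both accurate segmentation (the IoU term) and stable instance association across frames.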