"Learning Priors of Human Motion With Vision Transformers"
The podcast below on this paper was generated with Google's Illuminate.
https://arxiv.org/abs/2501.18543
Predicting human movement patterns is key for robot navigation and understanding human behavior in urban spaces. However, current methods may not fully capture spatial context for accurate predictions.
This paper proposes a novel neural network architecture using Vision Transformers to better predict human motion patterns from semantic maps.
-----
📌 The Vision Transformer architecture in semapp2 excels at capturing global context from semantic maps. Unlike locally constrained Convolutional Neural Networks, it models long-range spatial relations, enabling better prediction of human motion priors.
📌 The Masked Autoencoder approach in MAE-semapp2 showcases the efficacy of self-supervised learning: masking forces robust feature extraction from semantic maps, improving generalization for motion-prior prediction.
📌 Expanding the number of semantic classes to 13 in semapp2 refines environmental understanding, which translates directly into improved prediction accuracy, as evidenced by lower Kullback-Leibler divergence and Earth Mover's Distance.
----------
Methods Explored in this Paper 🔧:
→ The paper introduces Semantic Map-Aware Pedestrian Prediction 2 (semapp2).
→ semapp2 is a Vision Transformer-based autoencoder.
→ It takes semantic maps as input, with one channel per semantic class (e.g., pedestrian area, road, building).
→ The model predicts human occupancy distribution, velocity, and stop priors.
→ A Masked Autoencoder (MAE) variant of semapp2 is also explored, using a 75 percent masking ratio (see the masking sketch after this list).
→ The model uses Vision Transformer encoder and decoder blocks.
→ The model is trained using a Mean Squared Error (MSE) loss; a minimal model and training sketch follows this list.
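The following is a minimal, hedged sketch of what a semantic-map ViT autoencoder of this kind might look like in PyTorch; it is not the authors' implementation. The class name, layer sizes, the 13 input channels, the 3-channel output layout (occupancy, velocity, stop priors), and the training hyperparameters are illustrative assumptions; only the overall structure (patchified semantic map → ViT encoder/decoder → per-pixel priors, trained with MSE) follows the paper's description.

```python
# Hedged sketch (not the authors' code) of a ViT-style autoencoder that maps a
# multi-channel semantic map to occupancy / velocity / stop prior maps.
# All sizes and names below are illustrative assumptions.
import torch
import torch.nn as nn

class SemanticMapViTAutoencoder(nn.Module):
    def __init__(self, in_channels=13, img_size=224, patch_size=16,
                 dim=256, depth=4, heads=8, out_channels=3):
        super().__init__()
        self.patch_size = patch_size
        self.num_patches = (img_size // patch_size) ** 2
        # Patch embedding: one token per non-overlapping patch of the semantic map.
        self.patch_embed = nn.Conv2d(in_channels, dim, kernel_size=patch_size,
                                     stride=patch_size)
        self.pos_embed = nn.Parameter(torch.zeros(1, self.num_patches, dim))
        enc_layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads,
                                               batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers=depth)
        dec_layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads,
                                               batch_first=True)
        self.decoder = nn.TransformerEncoder(dec_layer, num_layers=depth)
        # Project each token back to a patch of the prior maps
        # (e.g. occupancy, velocity, stop priors as separate output channels).
        self.head = nn.Linear(dim, out_channels * patch_size * patch_size)

    def forward(self, semantic_map):
        B, _, H, W = semantic_map.shape
        tokens = self.patch_embed(semantic_map).flatten(2).transpose(1, 2)  # (B, N, dim)
        tokens = tokens + self.pos_embed
        latent = self.encoder(tokens)
        decoded = self.decoder(latent)
        patches = self.head(decoded)                     # (B, N, C*p*p)
        # Fold the per-patch predictions back into full-resolution prior maps.
        p = self.patch_size
        h, w = H // p, W // p
        out = patches.view(B, h, w, -1, p, p).permute(0, 3, 1, 4, 2, 5)
        return out.reshape(B, -1, H, W)

# One training step with MSE loss, as described in the paper.
model = SemanticMapViTAutoencoder()
opt = torch.optim.AdamW(model.parameters(), lr=1e-4)
semantic_maps = torch.rand(2, 13, 224, 224)   # one channel per semantic class
target_priors = torch.rand(2, 3, 224, 224)    # occupancy / velocity / stop targets
opt.zero_grad()
pred = model(semantic_maps)
loss = nn.functional.mse_loss(pred, target_priors)
loss.backward()
opt.step()
```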
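MAE-semapp2 masks patches before encoding. Below is a hedged sketch of MAE-style random masking at a 75 percent ratio; it is an illustration rather than the paper's code, and the function name and tensor shapes are assumptions.

```python
# Hedged sketch of MAE-style random masking (75% of patch tokens dropped),
# illustrative only: keep the 25% of tokens with the lowest random scores.
import torch

def random_mask(tokens, mask_ratio=0.75):
    """tokens: (B, N, dim). Returns the visible tokens and the kept indices."""
    B, N, dim = tokens.shape
    num_keep = int(N * (1 - mask_ratio))
    noise = torch.rand(B, N)                        # one random score per patch
    keep_idx = noise.argsort(dim=1)[:, :num_keep]   # lowest-score patches are kept
    visible = torch.gather(tokens, 1, keep_idx.unsqueeze(-1).expand(-1, -1, dim))
    return visible, keep_idx
```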
-----
Key Insights 💡:
→ Vision Transformers effectively learn spatial relationships from semantic maps.
→ This spatial understanding improves human motion prior prediction compared to CNN-based methods.
→ Masked Autoencoder based semapp2 shows strong generalization ability.
→ Increasing the number of semantic classes from 9 to 13 refines semantic understanding and improves prediction accuracy.
-----
Results 📊:
→ semapp2 achieves average KL-divergence of 0.46±0.16, reverse KL-divergence of 2.19±1.50 and Earth Mover's Distance of 27.65±19.89 on the Stanford Drone Dataset.
→ MAE-semapp2 achieves average KL-divergence of 0.34±0.21, reverse KL-divergence of 2.19±1.84 and Earth Mover's Distance of 45.77±30.74 on the Stanford Drone Dataset.
→ The ViT-Huge backbone gives the best performance, with KL-divergence 0.31±0.15, reverse KL-divergence 1.69±1.11, and Earth Mover's Distance 39.64±30.16 (a sketch of how such metrics can be computed follows below).
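For reference, these metrics compare a predicted distribution against a ground-truth distribution over the map. Below is a hedged sketch of how KL-divergence, reverse KL-divergence, and Earth Mover's Distance could be computed for 2D occupancy maps; the grid size, the epsilon smoothing, which direction is labeled "reverse", and the use of the POT library (ot.dist, ot.emd2) are assumptions, not the paper's evaluation code.

```python
# Hedged sketch of the evaluation metrics (not the authors' evaluation code).
import numpy as np
import ot  # POT: Python Optimal Transport

def normalize(p, eps=1e-8):
    """Flatten a 2D map into a valid probability distribution."""
    p = p.astype(np.float64).ravel() + eps
    return p / p.sum()

def kl_divergence(p, q):
    """KL(p || q) for discrete distributions over the same grid."""
    return float(np.sum(p * np.log(p / q)))

def earth_movers_distance(p, q, grid_shape):
    """EMD between two distributions on a regular 2D grid (pixel-distance cost)."""
    h, w = grid_shape
    ys, xs = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    coords = np.stack([ys.ravel(), xs.ravel()], axis=1).astype(np.float64)
    cost = ot.dist(coords, coords, metric="euclidean")  # pairwise ground distance
    return float(ot.emd2(p, q, cost))

# Example: compare a predicted occupancy prior against the ground truth.
pred = normalize(np.random.rand(32, 32))
gt = normalize(np.random.rand(32, 32))
print("KL(gt || pred):        ", kl_divergence(gt, pred))
print("reverse KL(pred || gt):", kl_divergence(pred, gt))
print("EMD:                   ", earth_movers_distance(gt, pred, (32, 32)))
```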