
"Gaze-LLE: Gaze Target Estimation via Large-Scale Learned Encoders"

A podcast on this paper was generated with Google's Illuminate.

This paper introduces Gaze-LLE, a novel approach for gaze target estimation using frozen pretrained visual encoders and a lightweight decoder.

-----

https://arxiv.org/abs/2412.09586

🤔 Original Problem:

Existing gaze target estimation methods rely on complex multi-branch architectures with separate encoders for head and scene features, requiring careful feature fusion and large parameter counts.

-----

💡 Solution in this Paper:

→ Gaze-LLE leverages a single frozen DINOv2 encoder to extract scene features.

→ It applies a learned head position embedding to condition on a specific person.

→ A small transformer decoder processes the prompted features to predict gaze targets (a minimal code sketch follows this list).

→ The model uses only 2.8M learnable parameters, compared to 25-135M in prior work.

→ It eliminates the need for separate head branches or auxiliary depth/pose models.
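
To make the pipeline concrete, here is a minimal PyTorch sketch of the idea described in the list above: a frozen DINOv2 backbone, a single learned head-position embedding added to the patch tokens covered by the queried person's head box, and a small transformer decoder that produces a gaze heatmap. The module names, layer sizes, and the `head_mask` input are illustrative assumptions rather than the paper's exact implementation; only the DINOv2 torch.hub entry point is the real library API.

```python
import torch
import torch.nn as nn

class GazeLLESketch(nn.Module):
    """Sketch of the Gaze-LLE idea: frozen DINOv2 features, a learned
    head-position embedding, and a lightweight transformer decoder that
    outputs a gaze-target heatmap. Sizes are illustrative."""

    def __init__(self, d_backbone=768, d_model=256, n_layers=3, out_size=64):
        super().__init__()
        # Frozen pretrained encoder (DINOv2 ViT-B/14 from torch.hub).
        self.backbone = torch.hub.load("facebookresearch/dinov2", "dinov2_vitb14")
        for p in self.backbone.parameters():
            p.requires_grad = False

        # Project backbone patch features to a smaller decoder width.
        self.proj = nn.Linear(d_backbone, d_model)
        # Single learned embedding added at the patches under the head box.
        self.head_pos_emb = nn.Parameter(torch.zeros(d_model))
        # Small transformer over the prompted feature tokens.
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.decoder = nn.TransformerEncoder(layer, num_layers=n_layers)
        # Per-token score reshaped and upsampled into a gaze heatmap.
        self.heatmap_head = nn.Linear(d_model, 1)
        self.out_size = out_size

    def forward(self, image, head_mask):
        """image: (B, 3, H, W); head_mask: (B, N) binary mask over the
        N patch tokens overlapping the queried person's head box."""
        with torch.no_grad():
            feats = self.backbone.forward_features(image)["x_norm_patchtokens"]
        tokens = self.proj(feats)                              # (B, N, d_model)
        # "Prompt" with head position AFTER feature extraction.
        tokens = tokens + head_mask.unsqueeze(-1) * self.head_pos_emb
        tokens = self.decoder(tokens)
        scores = self.heatmap_head(tokens).squeeze(-1)         # (B, N)
        side = int(scores.shape[1] ** 0.5)                     # assumes square patch grid
        heatmap = scores.view(-1, 1, side, side)
        return nn.functional.interpolate(
            heatmap, size=(self.out_size, self.out_size),
            mode="bilinear", align_corners=False,
        )
```

Conditioning on the person after the frozen encoder is what lets a single shared feature map serve every person in the scene, instead of running a separate head branch per person.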

-----

🔑 Key Insights from this Paper:

→ Pretrained visual encoders like DINOv2 already capture relevant gaze cues

→ Complex multi-branch architectures are unnecessary for gaze estimation

→ Head position can be effectively incorporated via prompting after feature extraction

→ A lightweight decoder is sufficient when leveraging strong pretrained features
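
Because the backbone stays frozen, only the small decoder-side modules contribute learnable parameters. A quick hypothetical check, reusing the `GazeLLESketch` from the sketch above (the helper name and learning rate are my own, and the counts will not exactly reproduce the paper's 2.8M figure):

```python
def count_params(model):
    # Split parameter counts into trainable (decoder side) and frozen (backbone).
    trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
    frozen = sum(p.numel() for p in model.parameters() if not p.requires_grad)
    return trainable, frozen

model = GazeLLESketch()
trainable, frozen = count_params(model)
print(f"trainable: {trainable / 1e6:.1f}M, frozen: {frozen / 1e6:.1f}M")

# Only the trainable parameters go to the optimizer; the frozen
# DINOv2 weights are never updated.
optimizer = torch.optim.AdamW(
    (p for p in model.parameters() if p.requires_grad), lr=1e-4
)
```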

-----

📊 Results:

→ State-of-the-art performance on GazeFollow (AUC 0.958) and VideoAttentionTarget (AUC 0.937)

→ Strong cross-dataset generalization without finetuning

→ ~50x fewer learnable parameters than prior methods

→ Faster convergence: <1.5 hours on a single GPU to achieve SotA
