This paper introduces Gaze-LLE, a novel approach to gaze target estimation that pairs a frozen pretrained visual encoder with a lightweight decoder.
-----
https://arxiv.org/abs/2412.09586
🤔 Original Problem:
Existing gaze target estimation methods rely on complex multi-branch architectures with separate encoders for head and scene features, requiring careful feature fusion and a large number of learnable parameters.
-----
💡 Solution in this Paper:
→ Gaze-LLE leverages a single frozen DINOv2 encoder to extract scene features.
→ It applies a learned head position embedding to condition on a specific person.
→ A small transformer decoder processes the prompted features to predict a gaze target heatmap (see the sketch after this list).
→ The model uses only 2.8M learnable parameters, compared to 25-135M in prior work.
→ It eliminates the need for separate head branches or auxiliary depth/pose models.
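Here's a minimal PyTorch sketch of that design. The layer sizes, module names, and the assumption that the frozen backbone returns a spatial patch-feature map are illustrative, not the paper's exact implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class GazeLLESketch(nn.Module):
    """Minimal sketch of the Gaze-LLE idea: one frozen encoder, a learned head
    prompt, and a small transformer decoder. Dimensions are illustrative."""

    def __init__(self, frozen_backbone, backbone_dim=768, d_model=256,
                 num_layers=3, num_heads=8):
        super().__init__()
        self.backbone = frozen_backbone                # e.g. a frozen DINOv2 ViT
        for p in self.backbone.parameters():           # encoder stays frozen
            p.requires_grad = False

        self.proj = nn.Conv2d(backbone_dim, d_model, kernel_size=1)
        self.head_embed = nn.Parameter(torch.zeros(d_model))  # learned head-position embedding
        layer = nn.TransformerEncoderLayer(d_model, num_heads,
                                           dim_feedforward=1024, batch_first=True)
        self.decoder = nn.TransformerEncoder(layer, num_layers=num_layers)
        self.heatmap_head = nn.Linear(d_model, 1)      # per-patch gaze-target score

    def forward(self, image, head_mask):
        # 1) Scene features from the frozen encoder, assumed (B, C, H, W) patch grid
        feats = self.proj(self.backbone(image))        # (B, d_model, H, W)
        B, D, H, W = feats.shape

        # 2) Condition on a specific person: add the learned embedding at head patches
        head_mask = F.interpolate(head_mask, size=(H, W), mode="nearest")
        feats = feats + head_mask * self.head_embed.view(1, D, 1, 1)

        # 3) Lightweight transformer decoder over the patch tokens
        tokens = self.decoder(feats.flatten(2).transpose(1, 2))  # (B, H*W, d_model)

        # 4) Gaze-target heatmap over the scene
        return self.heatmap_head(tokens).transpose(1, 2).view(B, 1, H, W)
```

Only the projection, head embedding, decoder, and heatmap head are trained, which is how the learnable-parameter count stays small while the backbone does the heavy lifting.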
-----
🔑 Key Insights from this Paper:
→ Pretrained visual encoders like DINOv2 already capture relevant gaze cues
→ Complex multi-branch architectures are unnecessary for gaze target estimation
→ Head position can be effectively incorporated via prompting after feature extraction (see the snippet after this list)
→ A lightweight decoder is sufficient when leveraging strong pretrained features
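To make the prompting idea concrete, here is a hypothetical helper that rasterizes a person's head bounding box onto the backbone's patch grid and scales it by the learned embedding; the resulting map is simply added to the frozen scene features, so no head crop or second encoder is needed. The function name and the normalized-xyxy box convention are assumptions for illustration:

```python
import torch


def head_prompt(bbox, grid_hw, head_embed):
    """Illustrative helper (not the paper's exact code): turn a normalized head box
    (x1, y1, x2, y2) into an additive prompt map over an H x W patch grid."""
    H, W = grid_hw
    x1, y1, x2, y2 = bbox
    mask = torch.zeros(1, 1, H, W)
    mask[:, :, int(y1 * H):int(y2 * H) + 1, int(x1 * W):int(x2 * W) + 1] = 1.0
    # Broadcasting gives a (1, d_model, H, W) map: the learned embedding placed
    # at the head's patches, zeros elsewhere.
    return mask * head_embed.view(1, -1, 1, 1)


# Example: a head occupying the upper-middle of a 16x16 patch grid
prompt = head_prompt((0.40, 0.05, 0.60, 0.25), (16, 16), torch.randn(256))
```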
-----
📊 Results:
→ State-of-the-art performance on GazeFollow (AUC 0.958) and VideoAttentionTarget (AUC 0.937)
→ Strong cross-dataset generalization without finetuning
→ 50x fewer parameters than prior methods
→ Faster convergence: reaches SotA in under 1.5 hours of training on a single GPU