AVAgent, proposed in this paper, uses a multi-modal LLM to convert audio and visual data into language descriptions separately (i.e., tool use). It's an LLM-based assistant that implements an agentic workflow to align audio signals with visual content.
📚 https://arxiv.org/abs/2410.23230
Original Problem 🔍:
Audio-visual (AV) data pairs often suffer from background noise and temporal misalignment, which limits the quality of joint representations and hurts downstream performance.
-----
Solution in this Paper 🛠️:
• Introduces AVAgent, an LLM-based assistant implementing an agentic workflow (sketched in code after this list)
• Uses multi-modal LLM to convert AV data into language descriptions
• Plans audio editing actions using predefined filters and augmentations
• Employs Vision-Language Model for evaluation and feedback
• Implements 8 audio editing actions (4 for noise filtering, 4 for audio-visual coordination)
• Utilizes Vicuna-v1.5-7b as base model with LoRA tuning
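Below is a minimal sketch of how such a cyclic tool-use → planning → reflection loop could be wired up. All names here (AVPair, describe_audio, describe_video, plan_edit, apply_action, feedback, and the action lists) are hypothetical stand-ins, not the paper's actual API.

```python
# Hedged sketch of the agentic workflow: tool use, planning, reflection.
# Every identifier below is a hypothetical stand-in for illustration only.

from dataclasses import dataclass
from typing import Optional

# Hypothetical registry of the 8 predefined editing actions:
# 4 noise-filtering actions and 4 coordination (alignment) actions.
NOISE_ACTIONS = ["lowpass", "highpass", "bandpass", "denoise"]
COORD_ACTIONS = ["trim", "pad", "time_shift", "volume_match"]

@dataclass
class AVPair:
    audio_path: str
    video_path: str

def apply_action(pair: AVPair, action: str) -> AVPair:
    # Placeholder: a real implementation would run the chosen audio edit
    # and write the result to a new file.
    edited = pair.audio_path.replace(".wav", f".{action}.wav")
    return AVPair(audio_path=edited, video_path=pair.video_path)

def align(pair: AVPair, llm, vlm, max_steps: int = 3) -> AVPair:
    """Run one AV pair through the cyclic agentic workflow."""
    for _ in range(max_steps):
        # 1) Tool use: convert audio and video separately into language.
        audio_desc = llm.describe_audio(pair.audio_path)
        video_desc = llm.describe_video(pair.video_path)

        # 2) Planning: the LLM chooses one predefined action
        #    (or returns None if the pair already matches).
        action: Optional[str] = llm.plan_edit(
            audio_desc, video_desc, actions=NOISE_ACTIONS + COORD_ACTIONS
        )
        if action is None:
            break
        pair = apply_action(pair, action)

        # 3) Reflection: a vision-language model scores how well the edited
        #    audio now fits the visual content; stop once it is good enough.
        if vlm.feedback(pair.audio_path, pair.video_path) >= 0.9:
            break
    return pair
```

The point of the sketch is the structure: audio and video are described separately before planning, so discrepancies between the two descriptions are what drive the choice of edit, and the VLM feedback closes the loop.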
-----
Key Insights from this Paper 💡:
• Data-centric approach improves AV alignment before representation learning
• Cyclic workflow of tool use, planning, and reflection enhances AV synchronization
• Separate audio-visual processing enables better discrepancy identification
• Predefined actions target both noise reduction and temporal alignment (two illustrative actions are sketched after this list)
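For illustration only, here is what two such predefined actions might look like. The function names, filter band, and shift logic are assumptions (a plausible scipy/numpy rendering), not the paper's implementation.

```python
# Two illustrative editing actions: a band-pass filter for noise reduction
# and a time shift for temporal (audio-visual) alignment.

import numpy as np
from scipy.signal import butter, filtfilt

def bandpass_filter(audio: np.ndarray, sr: int,
                    low_hz: float = 300.0, high_hz: float = 3400.0) -> np.ndarray:
    """Suppress background noise outside the band of interest."""
    b, a = butter(4, [low_hz, high_hz], btype="band", fs=sr)
    return filtfilt(b, a, audio)

def time_shift(audio: np.ndarray, sr: int, shift_sec: float) -> np.ndarray:
    """Shift audio by shift_sec seconds to re-synchronize it with the video."""
    shift = int(round(shift_sec * sr))
    shift = max(-len(audio), min(len(audio), shift))  # clamp to signal length
    out = np.zeros_like(audio)
    if shift >= 0:
        out[shift:] = audio[:len(audio) - shift]
    else:
        out[:shift] = audio[-shift:]
    return out
```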
-----
Results 📊:
• Outperformed previous methods by 2-4% on VGGSound and AudioSet classification
• Achieved higher precision, AP, and F1 scores on Flickr-SoundNet localization
• Improved mIoU and F1 scores on AVSBench segmentation
• Better SDR and SAR metrics on MUSIC and VGGSound separation tasks