"Aligning Audio-Visual Joint Representations with an Agentic Workflow"

The podcast on this paper is generated with Google's Illuminate.

AVAgent, proposed in this paper, uses a multi-modal LLM to convert audio and visual data into language descriptions separately (i.e., tool use).

It's an LLM-based assistant that implements an agentic workflow to align audio signals with visual content.

📚 https://arxiv.org/abs/2410.23230

Original Problem 🔍:

Audio-visual (AV) data pairs often suffer from background noise interference and poor audio-visual synchronization, which limits joint representation quality and downstream performance.

-----

Solution in this Paper 🛠️:

• Introduces AVAgent, an LLM-based assistant implementing an agentic workflow (a minimal sketch follows this list)

• Uses multi-modal LLM to convert AV data into language descriptions

• Plans audio editing actions using predefined filters and augmentations

• Employs Vision-Language Model for evaluation and feedback

• Implements 8 audio editing actions (4 for noise filtering, 4 for coordination)

• Utilizes Vicuna-v1.5-7b as the base model, fine-tuned with LoRA
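
The bullets above describe a cyclic tool-use, planning, and reflection loop. Below is a minimal Python sketch of how such a loop could be orchestrated; the callback names, the alignment-score threshold, and the round budget are illustrative assumptions, not the paper's actual interface.

```python
from dataclasses import dataclass
from typing import Callable, Sequence

# Illustrative sketch of the cyclic tool-use -> planning -> reflection workflow.
# The callback names, threshold, and round budget are assumptions for
# illustration, not the paper's actual implementation.

@dataclass
class AVAgentLoop:
    describe_audio: Callable[[object], str]                 # tool use: MLLM captions the audio
    describe_video: Callable[[object], str]                 # tool use: MLLM captions the video
    plan_edits: Callable[[str, str], Sequence[str]]         # planning: LLM picks predefined edit actions
    apply_edits: Callable[[object, Sequence[str]], object]  # executes the chosen audio edits
    vlm_score: Callable[[object, object], float]            # reflection: VLM rates audio-visual fit

    def align(self, audio, frames, max_rounds: int = 3, threshold: float = 0.9):
        """Caption both modalities, plan audio edits, apply them, and let the
        VLM judge the result; repeat until aligned or the round budget is spent."""
        video_caption = self.describe_video(frames)
        for _ in range(max_rounds):
            if self.vlm_score(audio, frames) >= threshold:
                break  # reflection says the pair is aligned well enough
            audio_caption = self.describe_audio(audio)
            actions = self.plan_edits(audio_caption, video_caption)
            audio = self.apply_edits(audio, actions)
        return audio
```

Keeping the LLM's role limited to captioning, planning, and reflection over a fixed action set is what makes the workflow data-centric: only the audio is edited, while the representation learner downstream stays unchanged.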

-----

Key Insights from this Paper 💡:

• Data-centric approach improves AV alignment before representation learning

• Cyclic workflow of tool use, planning, and reflection enhances AV synchronization

• Separate audio-visual processing enables better discrepancy identification

• Predefined actions target both noise reduction and temporal alignment (toy examples follow this list)
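
As a toy illustration of the two action families, here is what predefined audio edits of each kind could look like; the specific filters, parameters, and function names are assumptions, not the paper's actual set of eight actions.

```python
import numpy as np
from scipy.signal import butter, lfilter

# Toy examples of the two action families: noise filtering and audio-visual
# coordination. Filters, parameters, and names are illustrative guesses,
# not the paper's actual eight editing actions.

def lowpass_filter(audio: np.ndarray, sr: int, cutoff_hz: float = 4000.0) -> np.ndarray:
    """Noise-filtering action: attenuate high-frequency background noise."""
    b, a = butter(4, cutoff_hz / (sr / 2), btype="low")
    return lfilter(b, a, audio)

def normalize_volume(audio: np.ndarray, peak: float = 0.9) -> np.ndarray:
    """Noise-filtering action: rescale the waveform to a target peak level."""
    max_abs = float(np.max(np.abs(audio)))
    return audio if max_abs == 0.0 else audio * (peak / max_abs)

def time_shift(audio: np.ndarray, sr: int, shift_s: float) -> np.ndarray:
    """Coordination action: shift the waveform to better match visual onsets."""
    return np.roll(audio, int(shift_s * sr))

def trim_or_pad(audio: np.ndarray, sr: int, duration_s: float) -> np.ndarray:
    """Coordination action: crop or zero-pad the audio to the clip's duration."""
    target = int(duration_s * sr)
    clipped = audio[:target]
    return np.pad(clipped, (0, target - len(clipped)))
```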

-----

Results 📊:

• Outperformed previous methods by 2-4% on VGGSound and AudioSet classification

• Achieved higher precision, AP, and F1 scores on Flickr-SoundNet localization

• Improved mIoU and F1 scores on AVSBench segmentation

• Better SDR and SAR metrics on MUSIC and VGGSound separation tasks
