The paper addresses the challenge of enabling LLMs to process and generate multimodal content (images, video, audio) without any training for those modalities. It introduces MILS (Multimodal Iterative LLM Solver), a method that leverages LLMs' test-time reasoning capabilities to achieve this.
MILS uses an LLM as a "GENERATOR" to create text outputs, and a multimodal model (like CLIP) as a "SCORER" to evaluate the outputs, iteratively refining the output via feedback.
-----
https://arxiv.org/abs/2501.18096
📌 MILS cleverly bypasses the need for paired multimodal training data. It repurposes existing, powerful LLMs and pre-trained vision-language models, applying them across diverse tasks (captioning, generation, editing) rather than a single benchmark.
📌 The iterative refinement process is key. MILS is not just a one-shot generation. The feedback loop between the LLM GENERATOR and the multimodal SCORER progressively improves output quality.
📌 MILS' ability to invert multimodal inputs into text is a powerful capability. It goes beyond simply combining embeddings and opens possibilities for cross-modal editing and creation.
----------
Methods Explored in this Paper 🔧:
→ MILS is a training-free, iterative optimization approach.
→ It has two core components: a GENERATOR (typically an LLM) and a SCORER (a pre-trained multimodal model like CLIP or SigLIP).
→ The GENERATOR produces candidate text outputs (e.g., captions, prompts). The SCORER evaluates each candidate against the test sample (image, video, or audio) and returns a similarity score.
→ The SCORER's scores are verbalized as text feedback for the GENERATOR, which uses them to produce improved candidates in the next iteration. This repeats until convergence; a minimal sketch follows below.
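Here is a minimal sketch of that loop in Python, assuming CLIP (via Hugging Face transformers) as the SCORER and a hypothetical `llm_generate` stub standing in for any instruction-tuned LLM. It illustrates the feedback mechanism, not the paper's exact implementation:

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def score(image: Image.Image, captions: list[str]) -> list[float]:
    """SCORER: CLIP image-text similarity for each candidate caption."""
    inputs = processor(text=captions, images=image, return_tensors="pt", padding=True)
    with torch.no_grad():
        sims = model(**inputs).logits_per_image[0]  # one image vs. every caption
    return sims.tolist()

def llm_generate(prompt: str, k: int = 8) -> list[str]:
    """GENERATOR (hypothetical stub): return k candidate captions from an LLM."""
    raise NotImplementedError("wrap your LLM API of choice here")

image = Image.open("test.jpg")
candidates = llm_generate("Propose diverse one-sentence captions for an image.")
for step in range(10):  # fixed step budget here; the paper iterates to convergence
    ranked = sorted(zip(score(image, candidates), candidates), reverse=True)
    # Verbalize the scores: this text feedback is all the LLM ever sees.
    feedback = "\n".join(f"{s:.2f}: {c}" for s, c in ranked)
    candidates = llm_generate("Write improved captions, beating these scores:\n" + feedback)
best_caption = max(zip(score(image, candidates), candidates))[1]
```

Note that only the SCORER ever touches pixels; the LLM operates purely on text, which is why no multimodal training of the LLM is needed.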
-----
Key Insights 💡:
→ MILS shows emergent zero-shot capabilities across different tasks (captioning, generation, editing) and modalities (image, video, audio); the LLM itself needs no explicit multimodal training.
→ MILS can also improve text-to-image generation by using the LLM as a "prompt rewriter" that iteratively refines prompts against the SCORER.
→ Because MILS operates without gradient updates, it can invert multimodal inputs into text, enabling unusual tasks like "cross-modal arithmetic" (sketched below).
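As a hedged illustration of that inversion idea, here is a sketch of cross-modal arithmetic. It reuses the hypothetical `llm_generate` stub above and assumes two further hypothetical helpers: `mils_invert` (the MILS loop, run on an input of any modality and returning its best caption) and `text_to_image` (any off-the-shelf text-to-image model):

```python
def cross_modal_add(input_a, input_b):
    """Invert both inputs to text, merge the descriptions, re-render as an image."""
    caption_a = mils_invert(input_a)  # e.g., an image -> text (hypothetical helper)
    caption_b = mils_invert(input_b)  # e.g., an audio clip -> text
    merged = llm_generate(
        f"Combine into a single scene description:\n1. {caption_a}\n2. {caption_b}"
    )[0]
    return text_to_image(merged)  # hypothetical text-to-image call
```

Because all intermediate state is plain text, the "arithmetic" happens in language space rather than in a shared embedding space.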
-----
Results 📊:
→ Image Captioning on MSCOCO: Achieves 8.0 BLEU, 33.3 CIDEr, 15.0 METEOR, and 9.6 SPICE.
→ Video Captioning on MSR-VTT: Achieves 2.3 CIDEr and 14.4 METEOR, beating prior work on METEOR.
→ Audio Captioning on Clotho: Achieves 2.7 BLEU, 23.1 ROUGE, 12.4 METEOR, and 7.6 SPICE.