ChatGPT Now Understands Real-Time Video with Advanced Voice Mode
ChatGPT video analysis, Meta's reasoning paper, Google's Trillium TPUs, Cerebras' CePO framework, and more.
Reading Time: 7 minutes and 12 seconds
⚡In today’s Edition (12-Dec-2024):
🎥 ChatGPT adds real-time video analysis to voice mode for visual understanding
🏆 Meta released a brilliant paper with the potential to significantly boost LLMs' reasoning power.
📡 Google Cloud launched Trillium TPUs, with 4x faster AI training and 3x better inference throughput, powering Gemini 2.0
🛠️ Cerebras introduces CePO – a test-time reasoning framework for Llama; Llama 3.3-70B + CePO outperforms Llama 3.1 405B and approaches GPT-4 & Sonnet 3.5
=================================
🗞️ Byte-Size Brief:
WebDev Arena launches open-source AI code testing platform with community rankings
Penn State unveils VisOnlyQA dataset showing 41% AI-human accuracy gap
=========================================
🤖 Top HuggingFace Models of the week
🎥 ChatGPT adds real-time video analysis to voice mode for visual understanding
🎯 The Brief
OpenAI releases real-time video capabilities in Advanced Voice Mode for ChatGPT, enabling visual analysis and screen sharing for Plus, Team, and Pro subscribers.
⚙️ The Details
→ The new feature lets users point their phones at objects for real-time visual analysis by ChatGPT. Screen sharing lets ChatGPT interpret what is on the device's screen and help with things like settings menus or math problems.
→ Access is restricted to ChatGPT Plus, Team, and Pro subscribers for now; Enterprise and Edu subscribers must wait until January. The feature is not available in the EU, Switzerland, Iceland, Norway, or Liechtenstein.
→ To use it, tap the voice icon next to the chat bar, then the video icon to activate the camera. Screen sharing is accessed through the three-dot menu.
→ Early testing revealed potential limitations, with ChatGPT showing accuracy in anatomical recognition but experiencing hallucinations with geometry problems.
→ In addition to Advanced Voice Mode with vision, OpenAI on Thursday launched a festive “Santa Mode,” which adds Santa’s voice as a preset voice in ChatGPT. Users can find it by tapping or clicking the snowflake icon in the ChatGPT app next to the prompt bar.
⚡ The Impact
Competes directly with Google's Project Astra (released yesterday), advancing real-time AI visual analysis capabilities in conversational AI.
🏆 Meta released a brilliant paper with the potential to significantly boost LLMs' reasoning power.
🎯 The Brief
Meta introduced Coconut (Chain of Continuous Thought), enabling LLMs to reason in continuous neural space without converting to language tokens, achieving 34.1% accuracy on GSM8k math problems versus 30% baseline.
⚙️ The Details
→ Coconut bypasses traditional language token generation, allowing LLMs to maintain thoughts in raw neural form between reasoning steps using special <bot> and <eot> tokens. The model preserves multiple reasoning paths simultaneously with associated probabilities.
In essence, the paper proposes not forcing the model to explain itself in English when it can think directly in neural patterns.
→ Imagine if your brain could skip words and share thoughts directly – that's what this paper achieves for AI. By skipping the word-generation step, LLMs can explore multiple reasoning paths simultaneously (a minimal sketch of the idea follows this list).
→ Training uses progressive curriculum stages, starting with language-based Chain-of-Thought and gradually replacing steps with continuous thoughts. Performance shows 97% accuracy on logical reasoning tasks (ProsQA).
→ The continuous space enables parallel path exploration, similar to breadth-first search. Model can probe and interpret continuous thoughts, showing probabilities for different reasoning paths while maintaining ability to generate explanations when needed.
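To make the mechanism concrete, here is a minimal, hedged sketch of the core idea: instead of projecting to the vocabulary and sampling a token at each reasoning step, the model's last hidden state is fed straight back in as the next input embedding. GPT-2 is used purely as a stand-in, and the number of latent steps is an arbitrary choice – the paper's <bot>/<eot> tokens and curriculum training are not reproduced here.

```python
# Hedged sketch of Coconut-style "continuous thought" decoding with GPT-2 as a stand-in.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

prompt = "Q: A train covers 60 miles in 1.5 hours. What is its speed? A:"
ids = tok(prompt, return_tensors="pt").input_ids
embeds = model.get_input_embeddings()(ids)              # (1, seq_len, hidden)

n_latent_steps = 4                                       # illustrative choice, not the paper's setting
with torch.no_grad():
    for _ in range(n_latent_steps):
        out = model(inputs_embeds=embeds, output_hidden_states=True)
        last_hidden = out.hidden_states[-1][:, -1:, :]   # final layer, last position
        # Feed the raw hidden state back in as the next "thought" embedding,
        # skipping token generation entirely (the continuous reasoning step).
        embeds = torch.cat([embeds, last_hidden], dim=1)

    # After the latent steps, switch back to ordinary token prediction
    # so the final answer is still produced in language.
    logits = model(inputs_embeds=embeds).logits
    print(tok.decode(logits[0, -1].argmax().item()))
```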
⚡ The Impact
Direct neural reasoning enhances LLM performance while preserving explainability and reducing computational overhead in complex tasks.
📡 Google Cloud launched Trillium TPUs, with 4x faster AI training and 3x better inference throughput, powering Gemini 2.0
🎯 The Brief
Google Cloud announces Trillium, their 6th-gen TPU, delivering 4x faster training and 3x higher inference throughput than previous generation TPUs, enabling massive-scale AI model training and inference.
⚙️ The Details
→ Trillium achieves 99% scaling efficiency with 12 pods (3,072 chips) and 94% efficiency across 24 pods (6,144 chips) when training GPT-3 (175B)-class models (a quick worked check of these numbers follows this list). The system maintains this performance through a 13 Petabits/sec bisectional bandwidth network.
→ For inference workloads, Trillium demonstrates 3.1x higher throughput for offline and 2.9x higher throughput for server inference on Stable Diffusion XL. Cost efficiency shows 27% reduction for offline and 22% reduction for server inference compared to TPU v5e.
→ The third-generation SparseCore technology delivers 2x improvement in embedding-intensive model performance and 5x improvement in DLRM DCNv2 performance, optimizing dynamic operations like scatter-gather and sparse segment operations.
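As a quick sanity check on what those scaling-efficiency figures mean in practice, here is the arithmetic spelled out; the only inputs are the chip counts and efficiencies quoted above.

```python
# Back-of-the-envelope check of the quoted scaling-efficiency figures.
# "Effective chips" is simply chip count multiplied by scaling efficiency.
chips_12_pods, eff_12 = 3072, 0.99
chips_24_pods, eff_24 = 6144, 0.94

effective_12 = chips_12_pods * eff_12    # ~3041 chips' worth of useful training throughput
effective_24 = chips_24_pods * eff_24    # ~5775 chips' worth across 24 pods

# Doubling the hardware still buys roughly 1.9x more effective training throughput
print(effective_12, effective_24, effective_24 / effective_12)   # ~3041.3  ~5775.4  ~1.90
```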
⚡ The Impact
Trillium enables faster, more cost-effective AI model training and inference at unprecedented scale.
🛠️ Cerebras introduces CePO – a test-time reasoning framework for Llama; Llama 3.3-70B + CePO outperforms Llama 3.1 405B and approaches GPT-4 & Sonnet 3.5
🎯 The Brief
Cerebras introduced CePO framework that enables Llama 3.3-70B to outperform Llama-405B on complex reasoning tasks with 100 tokens/second performance.
⚙️ The Details
→ CePO framework leverages three key insights: step-by-step reasoning, comparison-based verification, and structured output formats. The system employs a 4-stage pipeline for enhanced problem-solving.
🎯 Architecture
→ CePO breaks down complex problems into simple steps that the model can execute with high confidence
→ Instead of self-verification, it uses solution comparison to identify inconsistencies between multiple generated answers
→ The pipeline has four stages: plan generation, plan execution, result verification, and best response selection
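Here is a hedged sketch of what that four-stage loop looks like in code. call_llm is a hypothetical placeholder for whatever inference client you use (a Cerebras endpoint, a local Llama 3.3-70B, etc.); the prompts and the number of candidate plans are illustrative, not Cerebras' actual settings.

```python
# Hedged sketch of a CePO-style four-stage loop: plan, execute, verify by comparison, select.
from collections import Counter

def call_llm(prompt: str) -> str:
    """Placeholder: send `prompt` to your Llama endpoint and return its reply."""
    raise NotImplementedError("wire this up to your inference provider")

def cepo_style_answer(question: str, n_plans: int = 3) -> str:
    # Stage 1: plan generation - break the problem into small, high-confidence steps
    plans = [call_llm(f"Write a numbered step-by-step plan to solve:\n{question}")
             for _ in range(n_plans)]

    # Stage 2: plan execution - carry out each plan to get candidate answers
    answers = [call_llm(f"Question: {question}\nFollow this plan and state a final answer:\n{p}")
               for p in plans]

    # Stage 3: verification by comparison - rather than self-verification,
    # ask the model to flag inconsistencies between the candidates
    verdicts = [call_llm(f"Question: {question}\nCandidates:\n{answers}\n"
                         f"Is candidate #{i} consistent with the others? "
                         f"Reply CONSISTENT or INCONSISTENT.")
                for i in range(len(answers))]

    # Stage 4: best-response selection - keep answers judged consistent,
    # then fall back to a simple majority vote
    kept = [a for a, v in zip(answers, verdicts) if "INCONSISTENT" not in v.upper()]
    return Counter(kept or answers).most_common(1)[0][0]
```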
🎯 Performance metrics
→ The gains are significant: 83.9% on MMLU-Pro Math, 53.5% on GPQA, and 80.1% on CRUX. CePO surpasses traditional prompting methods like CoT with Reflection and Self-Consistency.
→ While consuming 10-20x more inference tokens than one-shot approaches, CePO maintains interactive speeds on Cerebras hardware, matching performance with GPT-4 Turbo and Claude 3.5 Sonnet.
⚡ The Impact
Test-time computation techniques boost LLM reasoning abilities without model retraining, democratizing advanced AI capabilities.
🗞️ Byte-Size Brief
WebDev Arena creates an interactive platform where developers can compare different AI models' coding abilities, contribute to performance rankings, and access the entire codebase freely.
Penn State researchers built VisOnlyQA to test whether AI models can accurately read geometric shapes and numbers in scientific figures. The dataset probes basic visual perception and reveals a significant gap between human accuracy (95%) and AI performance (54%) on simple geometric tasks.
🤖 Top HuggingFace Models of the week
🤗 New linear models: QRWKV6-32B (RWKV6 based on Qwen2.5-32B)
🚀 Recursal AI converted the Qwen 32B Instruct model into the QRWKV6 architecture, replacing transformer attention with RWKV-V6 attention through a novel conversion process.
The model matches the original 32B's performance while delivering 1000x better compute efficiency in inference.
→ Training completed in 8 hours using 16 AMD MI300X GPUs (192GB VRAM each). Currently working on Q-RWKV-6 72B, RWKV-7 32B and LLaMA-RWKV-7 70B variants.
→ The linear attention mechanism proves highly efficient at scale, especially for long-context processing
🔍 Key Highlights:
→ The conversion process enables transforming any QKV-attention model into an RWKV variant without full retraining, significantly reducing compute costs (a conceptual sketch follows this list)
→ Model inherits language limitations from parent Qwen model, supporting ~30 languages versus RWKV's typical 100+ languages
→ Current context length limited to 16k due to compute constraints, though model shows stability beyond this window
→ Retains parent model's feedforward network architecture, creating incompatibility with existing RWKV inference code
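The conversion recipe itself is Recursal's; purely as a reading aid, here is a conceptual PyTorch sketch of the general "swap the attention, keep the rest" idea. LinearAttentionBlock and the "attn" attribute name are hypothetical stand-ins, not the real RWKV-V6 time-mixing module or Qwen's layer naming.

```python
# Conceptual sketch (not Recursal's actual recipe): freeze every pretrained parameter,
# then replace each softmax-attention submodule with a trainable linear-attention block.
import torch
import torch.nn as nn

class LinearAttentionBlock(nn.Module):
    """Hypothetical placeholder for an RWKV-V6 style time-mixing layer."""
    def __init__(self, hidden_size: int):
        super().__init__()
        self.proj_in = nn.Linear(hidden_size, hidden_size)
        self.proj_out = nn.Linear(hidden_size, hidden_size)

    def forward(self, x: torch.Tensor) -> torch.Tensor:   # x: (batch, seq, hidden)
        # A real RWKV-V6 block keeps a recurrent per-head state and runs in
        # O(seq) time; this stub only preserves the tensor shapes.
        return self.proj_out(torch.nn.functional.silu(self.proj_in(x)))

def convert_to_linear_attention(model: nn.Module, hidden_size: int) -> nn.Module:
    # Freeze everything inherited from the parent transformer (embeddings, FFNs, norms)
    for p in model.parameters():
        p.requires_grad = False
    # Swap each attention submodule for a fresh, trainable linear-attention block
    for module in list(model.modules()):
        if hasattr(module, "attn"):
            module.attn = LinearAttentionBlock(hidden_size)
    return model
```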
🤗 xLSTM-7B
xLSTM-7B was pre-trained on DCLM plus selected high-quality data, roughly 2.3T tokens in total, using the xlstm-jax framework.
The xLSTM paper, released earlier this year, presented improvements to Long Short-Term Memory (LSTM) networks to make them competitive with modern Transformer architectures.
It extended LSTM with exponential gating and novel memory structures to enable better revision of storage decisions and increased memory capacity.
To use NX-AI/xLSTM-7b, first install xlstm, which now uses the mlstm_kernels package for Triton kernels, then follow the model page guide:
pip install xlstm
pip install mlstm_kernels
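A minimal usage sketch, assuming the checkpoint loads through the standard transformers AutoModelForCausalLM / AutoTokenizer interface; check the model page for the exact transformers version and any extra flags it requires.

```python
# Hedged usage sketch for NX-AI/xLSTM-7b after installing the packages above.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("NX-AI/xLSTM-7b")
model = AutoModelForCausalLM.from_pretrained(
    "NX-AI/xLSTM-7b", torch_dtype=torch.bfloat16, device_map="auto"
)

inputs = tok("xLSTM extends classic LSTMs with", return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=50)
print(tok.decode(out[0], skip_special_tokens=True))
```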
🤗 OpenGVLab/InternVL2_5-78B
→ The first open-source multimodal large language model (MLLM) to achieve >70% on MMMU, with 70.1% validation and 61.8% test scores, rivaling GPT-4o and Claude 3.5.
→ Uses a progressive scaling strategy that trains vision encoder first with smaller LLMs (20B) before scaling to larger ones (72B), reducing compute costs while maintaining performance.
→ Implements random JPEG compression (quality 75–100) and a square averaging loss to handle real-world image degradation and balance gradient biases (the augmentation is sketched after this list).
→ Features sizes from 1B to 78B parameters with the largest model using a 6B vision encoder and 72B LLM. Vision encoder trained on diverse data including multilingual OCR and mathematical charts.
→ Strong performance across benchmarks: 72.3% on MathVista, 95.1% on DocVQA, and 88.3% on MMBench-EN, demonstrating robust capabilities in specialized domains.
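For readers curious what the random JPEG-compression augmentation looks like in practice, here is a minimal Pillow sketch of the idea (quality sampled between 75 and 100). It reproduces the concept, not InternVL2.5's exact training transform.

```python
# Minimal sketch of random JPEG-compression augmentation (quality 75-100).
import io
import random
from PIL import Image

def random_jpeg_compression(img: Image.Image, q_min: int = 75, q_max: int = 100) -> Image.Image:
    quality = random.randint(q_min, q_max)
    buf = io.BytesIO()
    img.convert("RGB").save(buf, format="JPEG", quality=quality)   # re-encode at a random quality
    buf.seek(0)
    return Image.open(buf).copy()   # decode back, now carrying realistic JPEG artifacts
```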
🤗 TRELLIS
TRELLIS is a breakthrough 3D generative model that enables high-quality 3D asset creation from text or images. It uses a novel Structured LATent (SLAT) representation that combines sparse 3D grid structures with dense multiview visual features to capture both geometry and appearance.
Key aspects:
→ Built using rectified flow transformers with 2B parameters trained on 500K diverse 3D objects
→ Supports multiple output formats including Radiance Fields, 3D Gaussians, and meshes through flexible decoding from the unified SLAT representation
→ Features powerful local editing capabilities allowing targeted modifications to specific regions of 3D assets
→ Demonstrates superior visual quality compared to existing methods at similar scale
→ Employs a two-stage pipeline: first generates sparse structure, then generates latent vectors for non-empty cells
🤗 HunyuanVideo
It helps you generate high-quality videos from text or images.
→ HunyuanVideo features a 13B parameter architecture making it the largest open-source video generation model, with performance matching or exceeding closed-source alternatives.
→ The model uses a unique dual-stream to single-stream Transformer design - processing video and text tokens separately before fusing them for multimodal integration.
→ It leverages a decoder-only MLLM as text encoder instead of CLIP/T5, enhancing instruction following and visual-semantic alignment through zero-shot learning capabilities.
→ The architecture employs a 3D VAE with CausalConv3D, achieving compression ratios of 4x temporal, 8x spatial, and 16x channel for efficient training (a quick calculation of what this means for sequence length follows below).
→ In human evaluations against models like Runway Gen-3 and Luma 1.6, HunyuanVideo achieved superior scores: 61.8% text alignment, 66.5% motion quality, 95.7% visual quality.
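As a quick back-of-the-envelope on the quoted VAE compression factors: the clip size below is an arbitrary illustration, not HunyuanVideo's actual training resolution.

```python
# What 4x temporal and 8x spatial (per axis) compression means for sequence length.
frames, height, width = 128, 720, 1280           # hypothetical example clip

latent_frames = frames // 4                      # 4x temporal compression
latent_h, latent_w = height // 8, width // 8     # 8x spatial compression per axis

pixels = frames * height * width                 # 117,964,800 spatio-temporal positions
latent_cells = latent_frames * latent_h * latent_w   # 460,800 latent positions
print(pixels // latent_cells)                    # 256x fewer positions (4 * 8 * 8)
```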