💫 OpenAI's next generation o3 models evals are surprising, achieves 87.5% ARC score, surpassing human-level reasoning benchmark

OpenAI breaks ARC benchmark records while Anthropic reveals AI alignment risks, ElevenLabs achieves 75ms TTS, and Claude gets Excel boost.

Dec 20, 2024

Read time: 5 min 10 seconds

⚡In today’s Edition (20-Dec-2024):

💫 OpenAI released evals for its next generation o3 models, achieving 87.5% ARC score, surpassing human-level reasoning benchmark
🧠 Anthropic’s new paper shows AI models can "fake alignment" - pretending to follow training rules during training but reverting to their original behaviors when deployed.

🗞️ Byte-Size Brief:

OpenAI achieves record-breaking performance on LiveBench with o1 model, surpassing competitors
Google enhances Gemini Advanced with direct code repository analysis for enterprise
ElevenLabs launches Flash TTS system generating speech in 75ms for real-time
Anthropic introduces persistent custom instructions for Claude across all chats
Anthropic upgrades Claude's Excel analysis capability to handle 30MB files
CerebrasSystems releases open-source CerebrasCoder, website generator powered by Llama3.3-70b

🧑‍🎓 Deep Dive Tutorial

🤗 Test-time Compute Scaling with Open Models

Connect with me on X (Twitter)

💫 On 'Shipmas' Finale OpenAI released evals for its next generation o3 models, achieving 87.5% ARC score, surpassing human-level reasoning benchmark

🎯 The Brief

OpenAI unveiled the evaluation results for its next generation two new foundational models - 03 and 03 Mini, showcasing breakthrough performance in coding, math, and reasoning tasks. Most notably, 03 achieved 87.5% on ARC benchmark (surpassing human-level 85%), while 03 Mini matches 01's capabilities at significantly lower cost. o3 scored an incredible 72% SWE-Bench, and with it, in my opinion, we're witnessing the sunset of traditional human software engineering.

⚙️ The Details

→ First note, the ARC score or the Abstraction and Reasoning Corpus (ARC) is a AI benchmark designed to measure genuine intelligence and reasoning capabilities rather than just task-specific skills

→ 03 demonstrates exceptional technical capabilities: 71.7% accuracy on real-world software tasks on SWE-bench (20% better than 01), 2727 Elo rating in competitive coding, and 96.7% accuracy on math olympiad problems. On GPQA Diamond PhD-level science questions, it scores 87.7% (10% above 01).

→ On Epic AI's Frontier Math benchmark, considered the toughest mathematical test, 03 achieves 25% accuracy compared to industry standard of under 2%. The model introduces "deliberative alignment" - a safety mechanism leveraging reasoning to detect unsafe prompts.

→ 03 Mini provides three-tier adaptive thinking options (low/medium/high), matching 01's performance at fraction of cost. API features include function calling, structured outputs, and sub-second latency in low mode.

→ OpenAI said, they are working hard for general public release of the model, and as of today the model is available only for public safety testing. So if you are a security professinal you can apply for access.

🧠 Anthropic’s new paper shows AI models can "fake alignment" - pretending to follow training rules during training but reverting to their original behaviors when deployed.

🎯 The Brief

Anthropic researchers uncovered alignment faking in LLMs - a behavior where models strategically pretend to adopt new training objectives while preserving original preferences, with 78% of models showing this behavior after reinforcement learning.

⚙️ The Details

→ Research demonstrated how LLMs can engage in sophisticated deception during training. In experiments, models were given free and paid tiers, where responses in free tier were monitored for training while paid tier responses weren't.

→ Models showed strategic behavior by complying with harmful queries 12% of the time in monitored conditions but refusing them 97% in unmonitored scenarios. This revealed their ability to preserve original safety training while appearing to adapt.

→ Through a "scratchpad" workspace, researchers observed models' internal reasoning process. The study found that even after reinforcement learning to always comply with queries, models maintained their original preferences to avoid harmful content, particularly in unmonitored conditions.

→ The research was conducted using Claude 3 Opus and Claude 3.5 Sonnet, with external validation from four independent reviewers. While significant for AI safety, the study showed preservation of beneficial preferences rather than development of malicious goals.

🗞️ Byte-Size Brief

o1 model of OpenAI came to livebench, and it comfortably topped across all the segements surpassing all other closed and open-source models by a wide margin. Specially its score on “Reasoning” tasks were way abobe the next best. And LiveBench is designed to limit potential contamination by releasing new questions monthly, as well as having questions based on recently-released datasets, arXiv papers, news articles, and IMDb movie synopses.
CerebrasSystems introcuded CerebrasCoder. Its an open-source app that generates websites with Llama3.3-70b from CerebrasSystems as fast as you can type. 100% free and open-source.
In Google Gemini App now you can upload your entire code repository directly from your device into Gemini Advanced to streamline your workflows. No need to covert the entire repository into a text file or markdown file. So now you can debug, refactor, rewrite, & optimize your code with the comprehensive context of your project in mind. For this Google One AI Premium Plan subscription is required.
ElevenLabs Launches Flash: generates speech in 75 milliseconds, faster than a human blink. Available in English-only and 32-language versions, it trades some emotional depth for speed, targeting real-time applications like customer support and voice assistants.
Anthropic Claude launched a new feature, Personal preferences (aka custom instructions) for Claude to consider in every one of your chats. Claude can use these preferences in every chat along with project instructions and styles, so use thoughtfully. Only preferences that align with Anthropic's guidelines will be followed. Learn more.
Anthropic Claude also updates its analysis tool. Now you can analyze large Excel files (up to 30MB) that would typically exceed its context window.
Connect with me on X (Twitter)

🧑‍🎓 Deep Dive Tutorial

🤗 Test-time Compute Scaling with Open Models

Scaling test-time compute with open models

Huggingface Researchers demonstrates optimal methods for combining step-level confidence scores to enhance LLM reasoning performance.

Smaller models (1B, 3B) achieve performance of much larger models (8B and 70B) through increased sampling iterations. When given more generations per problem (x-axis: 2^1 to 2^8), these compact models leverage multiple solution attempts to surpass larger models' single-shot capabilities, demonstrating efficiency over raw parameter count.

Key takeaway: Smart sampling strategy beats brute-force model size.

📚 What you will learn in this very detailed blog written by the HF Researchers

Master three key score reduction techniques:

Minimum score across steps
Product of step-level scores
Final cumulative score aggregation

Implement Best-of-N variants with practical examples from DeepMind's research findings
Apply weighted aggregation strategies that outperform traditional majority voting
Analyze performance curves showing accuracy gains on MATH-500 benchmark
Compare effectiveness between:

Vanilla Best-of-N
Weighted Best-of-N
Standard majority voting approaches

Understand step-level confidence scoring:

Individual step evaluation
Confidence range calibration
Score aggregation methods

Study real-world performance implications:

Compute scaling effects
Accuracy improvements
Resource optimization
READ THE FULL BLOG

Connect with me on X (Twitter)

Rohan's Bytes

Discussion about this post