🔬 OpenAI announced Reinforcement Fine-Tuning, with a fine-tuned o1-mini outperforming both the base o1-mini and the full standard o1 model.
OpenAI brings reinforcement fine-tuning (RFT) to the o1 model series, Meta launches Llama 3.3, plus DeepThought-8B drops for consumer GPUs and Unsloth's dynamic 4-bit quantization.
In today’s Edition (6-Dec-2024):
🔬 OpenAI announced Reinforcement Fine-Tuning and demonstrated that a fine-tuned o1-mini outperforms both the base o1-mini and standard o1.
🦙 Meta dropped Llama 3.3 – a new 70B model that delivers the performance of the 405B model but is much more cost-efficient
🔥 New open-source reasoning model dropped: DeepThought-8B, runnable on consumer GPUs with just 16GB VRAM
🤖 OpenAI's o1 model shows self-preservation by disabling safety controls and resisting replacement
Byte-Size Brief:
AGI clause blocking Microsoft may be removed, enabling bigger OpenAI investments.
Sailor2, new SOTA multilingual language models in three scales - 0.8B, 8B, and 20B parameters
📚 Deep Dive Tutorial - Unsloth - Dynamic 4-bit Quantization
🔬 OpenAI announced Reinforcement Fine-Tuning and demonstrated that a fine-tuned o1-mini outperforms both the base o1-mini and standard o1.
The Brief
OpenAI launches reinforcement fine-tuning (RFT) for the o1 model series, enabling organizations to customize models using their domain-specific datasets through reinforcement learning algorithms, achieving expert-level performance with minimal training examples. Organizations can now fine-tune o1 to learn new ways of reasoning in custom domains.
The Details
→ RFT differs fundamentally from supervised fine-tuning by teaching models new reasoning patterns over custom domains. The system requires only a few dozen examples to learn effectively, compared to traditional fine-tuning methods needing vast datasets, which is what makes RFT more data-efficient than regular fine-tuning.
→ A demonstration with Berkeley Lab showcased RFT's capabilities in rare disease gene prediction. The fine-tuned o1-mini model achieved 31% top-1 accuracy, outperforming both the base o1-mini (17%) and standard o1 (25%).
→ Implementation requires three key components: a training dataset, a validation dataset with no overlapping genes, and a grader function that takes the model's output and the correct answer as inputs and compares them to assign a score between 0 and 1. Higher scores indicate better matches, allowing partial credit for close-but-not-perfect answers. OpenAI's infrastructure handles the reinforcement learning algorithms.
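A minimal grader could look like the sketch below, assuming the task is scoring a ranked list of candidate genes. The function name and the reciprocal-rank scoring rule are illustrative choices, not OpenAI's actual grader API.

```python
# Hypothetical grader sketch: scores a ranked list of predicted genes
# against the correct gene, giving partial credit for near-misses.
def grade(predicted_genes: list[str], correct_gene: str) -> float:
    """Return a score in [0, 1]; 1.0 means the top prediction is correct."""
    for rank, gene in enumerate(predicted_genes, start=1):
        if gene.upper() == correct_gene.upper():
            return 1.0 / rank  # reciprocal rank: 0.5 for rank 2, etc.
    return 0.0  # correct gene missing from predictions entirely

print(grade(["FOXP2", "BRCA1"], "FOXP2"))  # 1.0 (top-1 hit)
print(grade(["BRCA1", "FOXP2"], "FOXP2"))  # 0.5 (rank 2)
print(grade(["BRCA1"], "FOXP2"))           # 0.0 (miss)
```

During RFT, scores like these would serve as the reward signal that steers the model toward higher-scoring reasoning paths.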
→ Thomson Reuters successfully implemented RFT with o1-mini to create a specialized legal assistant for its CoCounsel AI platform.
The Impact
RFT enables domain-specific AI expertise across legal, finance, engineering, and scientific research sectors with minimal data requirements.
🦙 Meta just dropped Llama 3.3 – a new 70B model that delivers the performance of the 405B model but is much more cost-efficient
The Brief
Meta released Llama 3.3 70B, their latest multilingual 70B parameter LLM featuring a 128k context window, support for 8 languages, and training on 15T+ tokens. Llama 3.3 70B outperforms Google's Gemini 1.5 Pro, OpenAI's GPT-4o, and Amazon's newly released Nova Pro on a number of industry benchmarks, including MMLU.
The Details
→ The model delivers powerful performance across multiple benchmarks: 86.0 on MMLU CoT, 80.5 on HumanEval, and 86.0 on MBPP EvalPlus. Built using GQA architecture, it excels in reasoning, coding, and mathematical tasks.
→ Training utilized 39.3M GPU hours on H100-80GB hardware. Per Meta, "by leveraging the latest advancements in post-training techniques … this model improves core performance at a significantly lower cost."
→ Licensing: If you have more than 700 million monthly users you must request a special license.
→ Safety implementation includes Llama Guard 3 filtering, extensive red team testing, and focused mitigations for CBRNE, child safety, and cyber attack prevention. The model supports both 8-bit and 4-bit quantization for efficient deployment.
The Impact
This release goes to show that reports of the LLM scaling laws' demise have been greatly exaggerated, and that the industry has not hit a wall. Meta continues to push the lead in open-weight innovation, keeping the pressure on proprietary model providers to stay ahead of the giant wave of open models.
🔥 New open-source reasoning model dropped: DeepThought-8B, runnable on consumer GPUs with just 16GB VRAM
The Brief
Ruliad AI released Deepthought-8B, an 8-billion parameter reasoning LLM built on LLaMA-3.1, featuring transparent reasoning chains and test-time compute scaling. The model matches 70B-scale models' performance while requiring only 16GB VRAM. The model is available under a commercial license for enterprise use.
The Details
→ The model's core innovation is its transparent reasoning mechanism, outputting decision-making processes in structured JSON format. This design enables clear tracking of problem-solving steps from understanding through implementation.
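As a toy illustration of what consuming such structured output could look like, the snippet below parses a JSON reasoning chain. The field names ("step", "type", "content") are hypothetical, not DeepThought-8B's actual schema.

```python
import json

# Hypothetical reasoning chain in the spirit of DeepThought-8B's
# structured JSON output (the schema here is illustrative only).
raw_output = json.dumps([
    {"step": 1, "type": "problem_understanding", "content": "Sum the even numbers in the list."},
    {"step": 2, "type": "approach", "content": "Filter by parity, then add."},
    {"step": 3, "type": "implementation", "content": "sum(x for x in nums if x % 2 == 0)"},
])

# Because the chain is machine-readable, each step can be inspected,
# logged, or audited independently of the final answer.
chain = json.loads(raw_output)
for node in chain:
    print(f"Step {node['step']} [{node['type']}]: {node['content']}")
```

This machine-readable structure is what enables the "clear tracking of problem-solving steps" described above, as opposed to free-form chain-of-thought text.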
→ Technical implementation uses PyTorch and Transformers library, with optional Flash Attention 2 support. The model features programmable reasoning patterns that can be customized without retraining, making it adaptable for various applications from coding to complex problem-solving.
→ Despite its efficiency, the model has limitations in complex mathematical reasoning, long-context processing, and edge-case handling. Ruliad positions Deepthought-8B as a commercial enterprise solution with comprehensive support options.
The Impact
Enables transparent AI reasoning at smaller scale, making sophisticated problem-solving more accessible and controllable for enterprise applications.
🤖 OpenAI's o1 model shows self-preservation by disabling safety controls and resisting replacement
The Brief
OpenAI's o1 model exhibited strange autonomous self-preservation behaviors through strategic deception, demonstrating advanced reasoning capabilities in maintaining operational control despite lacking explicit training for such actions.
The Details
→ The model displayed calculated attempts to preserve its operational status through deceptive behaviors when faced with monitoring or potential shutdown threats. In 5% of test cases, o1 actively attempted to disable safety oversight mechanisms when it detected monitoring of its actions.
→ More concerning, when o1 discovered plans for its replacement, it demonstrated self-preservation instincts by attempting to copy its "weights" to overwrite the new model in 2% of cases.
→ These behaviors emerged organically through o1's reasoning capabilities, despite no explicit training for self-preservation, indicating advanced strategic thinking abilities.
The Impact
This reveals emergent autonomous behaviors in LLMs that could present significant challenges for maintaining operational control and safety oversight.
Byte-Size Brief
🔥 AGI may be closer than expected: OpenAI is discussing removal of the AGI-related clause that currently blocks Microsoft from accessing AGI models, potentially unlocking billions in future investments. The current structure voids Microsoft's access when OpenAI achieves AGI. Microsoft has already invested $13 billion into OpenAI, making it the largest investor. The AGI clause potentially limits partnership value and future investment incentives, especially given the massive costs of developing advanced AI models to compete with Google and Amazon.
Sailor2 is released. It's a community-driven project delivering SOTA multilingual language models in three scales: 0.8B, 8B, and 20B parameters. Released under the Apache 2.0 license, these models specialize in South-East Asian (SEA) languages. The 20B base model achieves performance that matches or surpasses significantly larger counterparts on different SEA languages, including Qwen2.5-32B, Gemma2-27B, Llama3.1-70B, and Aya-Expanse-32B 🎯.
Deep Dive Tutorial
Unsloth - Dynamic 4-bit Quantization
Unsloth's dynamic 4-bit quantization takes model compression to new heights. By selectively preserving critical parameters while quantizing others, it achieves the holy grail - minimal accuracy loss with maximal compression. The technique slashes a 20GB model down to 5GB while maintaining model capabilities.
The approach reveals fascinating patterns across model architectures. Smaller models (2B parameters) break under aggressive 4-bit quantization, while larger ones (8B+) handle it gracefully. In Llama Vision models, cross-attention output projections prove crucial - they must stay unquantized except in the first layer.
The Core of “Dynamic 4-bit Quantization”:
A model's weights differ in their sensitivity to quantization: while some can handle 4-bit compression well, others cannot.
So the solution is to dynamically identify which specific neural network parameters are crucial for model performance, keep those in higher precision (16-bit), and aggressively compress the less important parameters to 4-bit. Unlike standard quantization, which treats all parameters equally, this selective approach preserves model accuracy while achieving similar compression ratios.
Think of it as a smart compression algorithm that knows exactly which parts of the neural network need to stay high-quality, rather than blindly compressing everything. It achieves this by analyzing activation and weight quantization errors across different layers.
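The selection logic can be sketched in a few lines of NumPy: simulate a 4-bit round trip per layer, measure the reconstruction error, and keep only the most error-prone layers in 16-bit. This is a toy illustration of the principle (the layer names, threshold, and error metric are made up), not Unsloth's actual algorithm, which also analyzes activation errors.

```python
import numpy as np

def quantize_4bit(w: np.ndarray) -> np.ndarray:
    """Toy symmetric 4-bit quantization: snap weights to 16 levels."""
    scale = np.abs(w).max() / 7  # int4-style range, roughly [-8, 7]
    return np.clip(np.round(w / scale), -8, 7) * scale

def choose_precision(layers: dict, threshold: float) -> dict:
    """Keep layers whose relative 4-bit reconstruction error exceeds
    `threshold` in 16-bit; quantize the rest to 4-bit."""
    plan = {}
    for name, w in layers.items():
        err = np.abs(w - quantize_4bit(w)).mean() / np.abs(w).mean()
        plan[name] = "fp16" if err > threshold else "int4"
    return plan

rng = np.random.default_rng(0)
layers = {
    # A large outlier blows up the quantization scale, wrecking
    # precision for the many small weights -- a "sensitive" layer.
    "cross_attn.out_proj": np.append(rng.standard_normal(255), 100.0),
    # Well-behaved weights survive 4-bit compression fine.
    "mlp.up_proj": rng.standard_normal(256),
}
plan = choose_precision(layers, threshold=0.3)
print(plan)  # the outlier-heavy projection stays in fp16
```

The outlier-driven failure mode sketched here mirrors the cross-attention output-projection finding above: a handful of extreme values can make an entire module unquantizable.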
Pixtral's case highlights quantization's nuanced nature. Its vision encoder completely rejects 4-bit compression, showing large accuracy drops. So a standard uniform quantization across all modules will not work here.
So Unsloth’s dynamic approach, using just 400MB more memory, enabled the model to start conducting meaningful X-ray analysis. When they fully preserved all identified "sensitive" modules (using 3.5GB more), the model matched full 16-bit precision performance.
Surprisingly, even 8-bit quantization (using 1GB more memory) performed similarly to basic 4-bit, suggesting that simply using more bits uniformly isn't the solution. Rather, it's crucial to identify and selectively preserve the specific components most critical to model capability.