🚀 Google releases Gemini 2.0 Flash: 2x faster processing with native multimodal outputs
Gemini 2.0 Flash, ChatGPT on Apple devices, the autonomous developer Devin, Google's web-controlling agent, and more.
⚡ In today’s Edition (11-Dec-2024):
🛸 Google released Gemini 2.0 Flash, bringing in multimodal creation, research agents, browser control, and massive compute upgrades
📱 OpenAI launches ChatGPT integration across Apple devices
🤖 Google also unveiled its first-ever AI agent that can take actions on the web
💻 Cognition Labs released Devin, a fully autonomous junior software engineer, at $500 per month
====================================
🗞️ Byte-Size Brief:
Ollama releases JSON schema validation for structured LLM outputs
DeepSeek upgrades open-source model with 8% better math, 5% improved coding
Llama 3.3 Euryale launches open-source model optimized for storytelling
ChatGPT demonstrates real mental health support potential, sparks user testimony
====================================
🧑‍🎓 Deep Dive Tutorial
Detailed technical breakdown of GANs, VAEs and their architectural innovations in generative AI models
🛸 Google released Gemini 2.0 Flash, bringing in multimodal creation, research agents, browser control, and massive compute upgrades
🎯 The Brief
Google releases Gemini 2.0 Flash Experimental, featuring 2x faster processing than 1.5 Pro, enhanced multimodal capabilities, and native tool integration. A new Multimodal Live API handles real-time audio/video streaming with smart interrupt detection. The model has a 1M-token context window and can natively generate interleaved images and text. Available for free in Google AI Studio.
⚙️ The Details
→ Performance upgrades include improved spatial understanding, object identification, and multimodal reasoning. Integration with 8 high-quality voices for text-to-speech output across multiple languages and accents.
→ Native tool use enables parallel Google Search integration and custom function calling. The model can natively execute Google Search queries and run code, and it can integrate with third-party tools through function-calling interfaces. Multi-threaded search allows parallel information gathering from diverse sources, improving answer accuracy through cross-referencing and fact compilation. This parallel architecture significantly reduces latency when compiling comprehensive answers live (a minimal sketch follows after this list).
→ The Multimodal Live API enables real-time applications with audio and video streaming support, featuring natural conversation patterns and voice activity detection. It lets you stream video and audio directly to Gemini 2.0 Flash and get audio back, so you can hold a real-time spoken conversation with the model about what you are seeing (a minimal sketch also follows the list below).
→ Introduces Jules, an AI code agent built on Gemini 2.0 that achieves 51.8% on SWE-bench Verified. It handles Python and JavaScript tasks asynchronously, creates multi-step plans, modifies files, and prepares pull requests directly in GitHub. Enhanced Google Colab features include automated notebook generation for data science workflows. You can apply for the Jules waitlist now; a wider rollout is planned for January 2025.
→ Infrastructure powered by 6th-gen Trillium TPUs, delivering 4x faster training, 3x better inference, and 67% improved energy efficiency.
→ Check out the Behind the Scenes discussion of Gemini 2.0, where Tulsee Doshi, Gemini model product lead, joins host Logan Kilpatrick for a deep dive into the model's multimodal capabilities, native tool use, and Google's approach to shipping experimental models.
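To make the tool-use item above concrete, here is a minimal sketch, assuming the google-genai Python SDK (`pip install google-genai`) and the `gemini-2.0-flash-exp` model name from launch; treat the exact parameter names as illustrative rather than authoritative.

```python
# Minimal sketch: Gemini 2.0 Flash with the built-in Google Search tool.
# Assumes the google-genai Python SDK and a GEMINI_API_KEY env variable.
import os
from google import genai
from google.genai import types

client = genai.Client(api_key=os.environ["GEMINI_API_KEY"])

response = client.models.generate_content(
    model="gemini-2.0-flash-exp",  # experimental model name at launch
    contents="What changed in the latest Gemini release? Cite sources.",
    config=types.GenerateContentConfig(
        # Native tool use: the model decides when to issue Search queries
        # and grounds its answer in the results it retrieves.
        tools=[types.Tool(google_search=types.GoogleSearch())],
    ),
)
print(response.text)
```

And a text-modality sketch of the Multimodal Live API's connect/send/receive session pattern; audio and video streaming follow the same shape with different response modalities. Again, the names follow the early docs and are assumptions, not a definitive implementation.

```python
# Minimal sketch: a text round-trip over the Multimodal Live API.
# The v1alpha api_version was required for the Live API at launch.
import asyncio
import os
from google import genai

client = genai.Client(
    api_key=os.environ["GEMINI_API_KEY"],
    http_options={"api_version": "v1alpha"},
)

async def main():
    # Swap ["TEXT"] for ["AUDIO"] to get spoken responses back.
    config = {"response_modalities": ["TEXT"]}
    async with client.aio.live.connect(
        model="gemini-2.0-flash-exp", config=config
    ) as session:
        await session.send(input="Hello Gemini, can you hear me?", end_of_turn=True)
        async for response in session.receive():
            if response.text:
                print(response.text, end="")

asyncio.run(main())
```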
⚡ The Impact
With all this new power, Gemini 2.0 Flash is almost Perplexity on steroids for research. Its strong multimodality and 1M-token context window are a combination no other frontier model can match right now.
📱 OpenAI launches ChatGPT integration across Apple devices
🎯 The Brief
OpenAI launches ChatGPT integration across Apple devices, bringing three major features: Siri integration, writing tools for documents, and visual intelligence through camera controls. Apple users don't need an OpenAI account to use the ChatGPT integration, though they can pay for upgraded versions of ChatGPT through Apple, and ChatGPT is also reachable from some text menus. OpenAI maintains it will not store data from users' requests and will not use any of that information for model training.
⚙️ The Details
→ The integration works across iPhone, iPad, and Mac. Users can enable ChatGPT through the Apple Intelligence settings, with options for both signed-in and anonymous usage.
→ Users need an iPhone 15 Pro, iPhone 15 Pro Max, or any iPhone 16 model to install and use Apple Intelligence, even though the ChatGPT integration primarily runs on cloud servers. iPads with A17 Pro or M1 chips and later, and Macs with M1 chips and later, can also access the tools.
→ The Siri integration allows automatic handoff to ChatGPT for complex queries, with user confirmation required before information sharing. Writing tools enable document composition, refinement, and summarization using ChatGPT's capabilities.
→ On macOS 15.2, users can activate ChatGPT by double-tapping the Command key. The system processes complete documents, including 49-page PDFs, and supports visual outputs like charts and graphs.
→ The visual intelligence feature enables users to analyze images through camera controls, getting direct ChatGPT insights about viewed content.
⚡ The Impact
Frictionless ChatGPT access across Apple ecosystem streamlines AI assistance for daily tasks and complex analyses. The integration is a major victory for OpenAI as it puts its most important product in front of millions of iPhone users.
🤖 Google also unveiled its first-ever AI agent that can take actions on the web
🎯 The Brief
Google launches Project Mariner, a Gemini-powered AI agent that autonomously navigates the Chrome browser to perform web tasks through an experimental Chrome extension. It only works in the active tab and requires confirmation for sensitive actions. You can apply for the waitlist now.
⚙️ The Details
→ Project Mariner operates through a Chrome extension, taking screenshots and sending them to Gemini in the cloud for processing. The agent performs web actions with roughly 5-second delays between cursor movements, and the active tab must stay in view while it works.
→ Key limitations include inability to handle payments, accept cookies, or sign agreements. The agent can perform tasks like creating shopping carts, finding flights, and searching recipes while maintaining user oversight of its actions.
→ It hit 83.5% on the WebVoyager benchmark (state-of-the-art). WebVoyager is a comprehensive benchmark designed to test AI agents' abilities to navigate and interact with dynamic live websites. The 83.5% score means the agent successfully completed about 537 of 643 tasks, a significant improvement over previous results; for context, the original WebVoyager agent achieved only a 59.1% success rate on the same benchmark.
⚡ The Impact
Web interaction paradigm shift: AI agents mediating user-website interactions could transform how businesses design digital experiences.
💻 Cognition Labs released Devin, a fully autonomous junior software engineer, at $500 per month
🎯 The Brief
Cognition launches Devin, a software development AI assistant, at $500/month with unlimited team access and multi-platform integration capabilities.
⚙️ The Details
→ The package includes no seat limits, Slack integration, IDE extension, API access, and dedicated team onboarding support. Primary interface is through Slack for quick task delegation and bug fixes.
→ The idea behind Devin is to set asynchronous agent coworkers loose on tasks, let them work on many things in parallel, and have them come back to you with results.
→ You tag Devin in Slack and ask it to update something, fix something, and so on. Devin comprises a remote server, a browser interface, a VS Code editing interface, a planner, and a chat interface.
→ Devin specializes in frontend debugging, PR creation, and code refactoring through VSCode extension integration. Tasks work best when under 3 hours, with clear requirements and testing criteria.
→ In real-world testing, Devin completed complex tasks like setting up image generation models and creating web UIs, taking 12-15 minutes per response cycle. It successfully generates pull requests with generally good code quality, but occasionally includes unnecessary dependencies.
→ Demonstrated capabilities through contributions to major open-source projects including Anthropic MCP, Zod, Google's Go client, Llama Index, and nanoGPT. Current limitations include merge conflict handling and occasional need for manual code cleanup.
→ Critical limitations in the development flow: the async Slack-based workflow creates significant delays between iterations, pull request ownership becomes unclear with bot-generated code, and remote debugging proves cumbersome compared to local development.
⚡ The Impact
Devin packages AI-powered development assistance that enables automated code generation, testing, and maintenance across diverse engineering workflows.
🗞️ Byte-Size Brief
Ollama 0.5 introduces JSON schema support for reliable structured outputs. The new feature works with the Python and JavaScript libraries, using JSON schemas to constrain and validate responses, and suits document parsing, image analysis, and consistent API responses (see the sketch at the end of this brief).
DeepSeek released DeepSeek-V2.5-1210, an upgraded version of DeepSeek-V2.5 with improvements across various capabilities: an 8% jump in mathematical accuracy, a 5% improvement on coding benchmarks, and enhanced reasoning, all while maintaining its open-source accessibility.
Llama 3.3 Euryale v2.3, a newly released open-source finetune, is praised for its storytelling and roleplay capabilities, though there are concerns about its tendency to take creative liberties and repeat prior messages. A Reddit post calls it the best model for storytelling/roleplay.
A Reddit post went viral in which a person shares a very emotional story about how "ChatGPT is the only one keeping me from losing my sanity." The author found solace in ChatGPT during a period of profound isolation. Having lost their job, friends, and romantic relationship, they turned to the AI as a maternal figure, seeking the understanding and warmth they deeply missed. Through these digital conversations, they discovered not just emotional comfort but also guidance that helped them navigate toward new professional opportunities.
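For the Ollama item above, here is a minimal sketch of the structured-output flow, assuming the ollama Python package and a locally pulled llama3.1 model; the Pydantic model and its field names are illustrative, not prescribed by Ollama.

```python
# Minimal sketch: Ollama 0.5 structured outputs via a JSON schema.
# Assumes `pip install ollama pydantic` and `ollama pull llama3.1`.
from ollama import chat
from pydantic import BaseModel

class Country(BaseModel):
    name: str
    capital: str
    languages: list[str]

response = chat(
    model="llama3.1",
    messages=[{"role": "user", "content": "Tell me about Canada."}],
    # `format` constrains the reply to this JSON schema instead of free text.
    format=Country.model_json_schema(),
)

# The reply parses and validates against the schema.
country = Country.model_validate_json(response.message.content)
print(country)
```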
🧑‍🎓 Deep Dive Tutorial
Detailed technical breakdown of GANs, VAEs and their architectural innovations in generative AI models
🔥 From GANs to VAEs: Core Architecture Deep Dive
Understanding generative AI requires grasping two fundamental architectures: GANs and VAEs. Both aim to generate realistic data but take radically different approaches.
GANs use an adversarial setup where a generator creates fake data while a discriminator learns to spot fakes. This competitive dynamic drives continuous improvement through a minimax game, forcing the generator to produce increasingly realistic outputs.
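For reference, the minimax game described above is the standard GAN objective from Goodfellow et al.; the VAE, by contrast, maximizes a lower bound on the data likelihood (the ELBO). Both formulas below are the textbook forms, reproduced as a quick anchor for the deep dive:

```latex
% GAN: generator G and discriminator D play a two-player minimax game
\min_G \max_D V(D, G) =
  \mathbb{E}_{x \sim p_{\mathrm{data}}}[\log D(x)]
  + \mathbb{E}_{z \sim p_z}[\log(1 - D(G(z)))]

% VAE: maximize the evidence lower bound (ELBO) on log-likelihood
\log p_\theta(x) \geq
  \mathbb{E}_{q_\phi(z \mid x)}[\log p_\theta(x \mid z)]
  - D_{\mathrm{KL}}\big(q_\phi(z \mid x) \,\|\, p(z)\big)
```

The adversarial term pushes G to fool D, while the ELBO's KL term keeps the VAE's latent distribution close to its prior; that difference is why GANs tend to produce sharper samples and VAEs a smoother, more structured latent space.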
In this detailed blog, you'll learn to:
→ Grasp pretraining fundamentals via language modeling on massive text data
→ Master data cleaning, filtering, and deduplication for high-quality training datasets
→ Understand architecture choices: model size, attention mechanisms, and parameter optimization
→ Implement supervised fine-tuning (SFT) using human demonstrations
→ Apply Reinforcement Learning from Human Feedback (RLHF):
Train reward model on human preferences
Use PPO to optimize policy against reward
Handle preference misalignment and reward hacking
→ Evaluate model performance through:
Perplexity metrics (formula below)
Task-specific benchmarks
Human evaluation protocols
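On the perplexity metric referenced in the evaluation list, the standard definition over a held-out token sequence x_1, ..., x_N is:

```latex
% Perplexity: exponentiated average negative log-likelihood per token
\mathrm{PPL}(X) = \exp\!\left( -\frac{1}{N} \sum_{i=1}^{N} \log p_\theta(x_i \mid x_{<i}) \right)
```

Lower is better: a model with perplexity 20 is, on average, as uncertain as if it were choosing uniformly among 20 tokens at each step.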