Thursday, December 12, 2024

Gemini 2.0: Google’s Most Advanced AI Model Yet

Embracing the Agentic Era of AI 

A Message from Sundar Pichai, CEO of Google and Alphabet

Information is the foundation of human progress. Google has spent over 26 years organizing the world's information and making it accessible and useful. AI is the next frontier in organizing this information across every input and making it accessible via any output to make it truly useful.

Gemini 1.0 and 1.5 were groundbreaking in their multimodality and long context, enabling understanding of information across text, video, images, audio, and code. Millions of developers are currently building with Gemini, and it is helping to reimagine all of Google’s products. 
Over the past year, Google has been investing in more agentic models: models that can understand the world, think multiple steps ahead, and take action under user supervision.

Introducing Gemini 2.0

Gemini 2.0 is Google's most capable model yet, designed for the agentic era. With advances in multimodality, like native image and audio output and native tool use, it enables new AI agents that bring us closer to the vision of a universal assistant.
The experimental Gemini 2.0 Flash model is available to all Gemini users. A new feature called Deep Research, which uses advanced reasoning and long context capabilities to act as a research assistant, is available in Gemini Advanced.

Gemini 2.0 and Search

AI has transformed Google Search. AI Overviews now reach 1 billion people, enabling them to ask new types of questions. The advanced reasoning capabilities of Gemini 2.0 are coming to AI Overviews to handle more complex topics and multi-step questions, including advanced math equations, multimodal queries, and coding.

The Technology Behind Gemini 2.0

Trillium

Gemini 2.0 was built on custom hardware like Trillium, Google's sixth-generation TPUs. TPUs powered 100% of Gemini 2.0 training and inference, and Trillium is now generally available to customers.

Gemini 2.0: From Information Organization to Enhanced Usefulness

Gemini 1.0 focused on organizing and understanding information, while Gemini 2.0 aims to make it significantly more useful.

Gemini 2.0 Flash

Gemini 2.0 Flash, the first model in the Gemini 2.0 family, is available as an experimental release. It is Google’s workhorse model, offering low latency and enhanced performance at scale.

Features and Enhancements

  • Enhanced performance compared to 1.5 Flash, even outperforming 1.5 Pro on key benchmarks at twice the speed.
  • Supports multimodal inputs like images, video, and audio.
  • Supports multimodal output like natively generated images mixed with text and steerable text-to-speech (TTS) multilingual audio.
  • Can natively call tools like Google Search and code execution, as well as third-party user-defined functions.
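The user-defined function calling mentioned above can be sketched as a simple declare-and-dispatch loop. Everything below is illustrative: the `get_weather` function is hypothetical, and the declaration dict only approximates the general shape of function-calling schemas, not the exact Gemini API wire format.

```python
# Sketch of third-party function calling in a Gemini-style tool loop.
# The get_weather tool and the declaration schema are illustrative
# assumptions, not the exact Gemini API wire format.

def get_weather(city: str) -> dict:
    """A hypothetical user-defined tool the model may choose to call."""
    # A real app would query a weather service here.
    return {"city": city, "forecast": "sunny", "high_c": 21}

# A declaration describing the tool to the model (name, purpose, parameters).
get_weather_declaration = {
    "name": "get_weather",
    "description": "Look up the current forecast for a city.",
    "parameters": {
        "type": "object",
        "properties": {"city": {"type": "string"}},
        "required": ["city"],
    },
}

# Map declared tool names to local implementations.
TOOLS = {"get_weather": get_weather}

def dispatch(function_call: dict) -> dict:
    """Route a model-issued function call to the matching local function."""
    fn = TOOLS[function_call["name"]]
    return fn(**function_call["args"])

# Simulate the model deciding to call the declared tool:
result = dispatch({"name": "get_weather", "args": {"city": "Zurich"}})
print(result["forecast"])  # prints: sunny
```

In a real integration, the application sends the declaration with the request, receives a structured function call back from the model, runs `dispatch`, and returns the result to the model for its final answer.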

Availability

  • Available as an experimental model to developers via the Gemini API in Google AI Studio and Vertex AI.
  • Multimodal input and text output are available to all developers.
  • Text-to-speech and native image generation are available to early-access partners.
  • General availability will begin in January, along with more model sizes.
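For developers exploring the experimental model through the Gemini API, a request might look like the following. This is a minimal sketch: the endpoint path and model identifier follow the API's publicly documented pattern at launch, the prompt text is invented, and the payload is only assembled here, never sent (a real call needs an API key from Google AI Studio).

```python
import json

# Gemini API REST pattern (v1beta, generateContent); model id as announced.
API_KEY = "YOUR_API_KEY"  # placeholder, not a real key
MODEL = "gemini-2.0-flash-exp"
url = (
    "https://generativelanguage.googleapis.com/v1beta/"
    f"models/{MODEL}:generateContent?key={API_KEY}"
)

# Minimal text-in request body for generateContent.
body = {
    "contents": [
        {"parts": [{"text": "Summarize agentic AI in one sentence."}]}
    ]
}

print(url)
print(json.dumps(body, indent=2))
```

Sending this body to the URL with an HTTP POST (and a valid key) returns the model's response as JSON; multimodal input adds further `parts` alongside the text.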

Gemini 2.0 in the Gemini App

Gemini users can access a chat-optimized version of 2.0 Flash experimentally by selecting it in the model drop-down on desktop and mobile web. It will be available in the Gemini mobile app soon.

Unlocking Agentic Experiences with Gemini 2.0

Gemini 2.0 Flash’s features, including its native user interface action capabilities, multimodal reasoning, long context understanding, complex instruction following and planning, compositional function-calling, native tool use, and improved latency, enable a new class of agentic experiences.

Project Astra: Agents Using Multimodal Understanding in the Real World

Project Astra

Project Astra is a research prototype exploring the future capabilities of a universal AI assistant, currently being tested on Android phones. Improvements in the latest version built with Gemini 2.0 include:

Improvements in the Latest Version

  • Better dialogue: Ability to converse in multiple languages and in mixed languages, with a better understanding of accents and uncommon words.
  • New tool use: With Gemini 2.0, Project Astra can use Google Search, Lens, and Maps.
  • Better memory: Improved ability to remember things with up to 10 minutes of in-session memory and ability to remember more past conversations.
  • Improved latency: Can understand language at about the latency of human conversation due to new streaming capabilities and native audio understanding.
Google is working to bring these capabilities to products like the Gemini app and other form factors like glasses. A small group will soon begin testing Project Astra on prototype glasses.

Project Mariner: Agents That Can Help You Accomplish Complex Tasks

Project Mariner

Project Mariner is an early research prototype built with Gemini 2.0 exploring the future of human-agent interaction within a web browser. It can understand and reason across information on a browser screen, including pixels and web elements like text, code, images, and forms, and then uses that information via an experimental Chrome extension to complete tasks.

Performance

Project Mariner achieved a state-of-the-art result of 83.5% on the WebVoyager benchmark, which tests agent performance on end-to-end real-world web tasks.

Safety and Responsibility

Project Mariner can only type, scroll, or click in the active browser tab and asks users for final confirmation before sensitive actions, like purchasing something.

Availability

Trusted testers are testing Project Mariner using an experimental Chrome extension, and conversations with the web ecosystem are underway.

Jules: Agents for Developers

Jules

Jules is an experimental AI-powered code agent built with Gemini 2.0 that integrates into a GitHub workflow. It can tackle an issue, develop a plan, and execute it under a developer's direction and supervision. This is part of Google’s goal of building AI agents that are helpful in all domains, including coding.

Agents in Games and Other Domains

Google DeepMind has used games to train AI models in following rules, planning, and logic. Genie 2, an AI model that can create endless playable 3D worlds from a single image, is a recent example of this. 

Agents in Video Games

Google has built agents using Gemini 2.0 that can help users navigate the virtual world of video games. These agents can reason about the game based on screen action, offer suggestions in real-time conversation, and even use Google Search to connect users with gaming knowledge on the web.
Google is collaborating with game developers like Supercell to explore how these agents work across various games.

Agents in the Physical World

Google is experimenting with agents that can help in the physical world by applying Gemini 2.0's spatial reasoning capabilities to robotics.

Building Responsibly in the Agentic Era

Google acknowledges the responsibility that comes with developing new AI technologies, as well as the safety and security questions AI agents raise. It is therefore taking an exploratory and gradual approach to development.

Safety Measures

  • Working with the Responsibility and Safety Committee (RSC) to identify and understand potential risks.
  • Using Gemini 2.0's reasoning capabilities for AI-assisted red teaming, automatically generating evaluations and training data to mitigate risks.
  • Evaluating and training the model across image and audio input and output to improve safety.
  • Exploring mitigations against unintentional sharing of sensitive information with the agent in Project Astra, building in privacy controls for deleting sessions, and researching ways to ensure AI agents are reliable sources of information.
  • Ensuring Project Mariner prioritizes user instructions over third-party prompt injection attempts, preventing malicious instructions and protecting users from fraud and phishing.

Gemini 2.0, AI Agents, and Beyond

The release of Gemini 2.0 Flash and research prototypes exploring agentic possibilities mark a new chapter in the Gemini era. Google is committed to responsibly exploring new possibilities as it builds towards AGI.
