
Google Plans to Combine Gemini and Veo AI Models to Create a Universal Digital Assistant
In a recent appearance on Possible, a podcast co-hosted by LinkedIn co-founder Reid Hoffman, Google DeepMind CEO Demis Hassabis revealed that Google plans to eventually combine its Gemini AI models with its Veo video-generating models. This merger aims to enhance Gemini’s understanding of the physical world, a crucial step toward creating a more sophisticated and context-aware AI.
What Are Gemini and Veo?
Gemini and Veo are two significant AI models developed by Google. Gemini is Google’s foundational AI model, designed to be multimodal from the start: it can process and generate text, images, and even audio, making it highly adaptable to a wide range of tasks. Veo, by contrast, focuses on video generation and analysis. It was originally built to strengthen Google’s video-related capabilities, but with the planned integration into Gemini, its functionality will extend beyond video content alone.
Veo’s video-generating capabilities are largely powered by data derived from YouTube, a platform owned by Google. Hassabis explained that by processing a vast number of YouTube videos, Veo can learn the “physics of the world,” making it adept at understanding real-world scenarios such as object motion and interactions within video content.
A Vision for a Universal Digital Assistant
According to Hassabis, the decision to combine Gemini and Veo stems from Google’s broader vision of developing a universal digital assistant. “We’ve always built Gemini, our foundation model, to be multimodal from the beginning,” Hassabis explained. The goal is to create an AI that can help users not just in the digital realm but also in the real world by combining various types of media (text, audio, images, and video) into a cohesive, context-aware experience.
This approach aligns with the growing trend in the AI industry toward “omni” model systems that can process and synthesize multiple forms of media simultaneously. Google’s Gemini models can already generate not only text and images but also audio. With the planned integration of Veo, these capabilities will expand to include video content, resulting in a more powerful, all-encompassing AI system.
Other AI companies, such as OpenAI and Amazon, are also working on similar multi-modal models. OpenAI’s default model in ChatGPT can now generate images, including artistic styles like Studio Ghibli, and Amazon is expected to launch an “any-to-any” model later this year.
Training Data from YouTube
A key component of this integration is the vast amount of training data required to develop these omni models. Hassabis implied that much of Veo’s video training data comes from YouTube. Google’s access to YouTube’s extensive video library allows Veo to train on a diverse range of content, enabling it to understand the complex dynamics of the physical world. By watching and analyzing a multitude of YouTube videos, Veo can gain insights into the physics of motion, object interactions, and real-world scenarios.
Google’s training practices have come under scrutiny before, with the company previously stating that its models “may be” trained on “some” YouTube content under its agreements with YouTube creators. Google is also believed to have adjusted its terms of service last year to allow more data to be used for AI model training, enabling it to tap into a broader pool of content.
The Future of AI Models
As the development of Gemini and Veo continues, the combination of these two AI models could pave the way for a more powerful, multimodal system that can tackle a wider range of real-world tasks. The envisioned universal digital assistant could have applications in healthcare, entertainment, e-commerce, and beyond, providing smarter, more intuitive interactions with AI.
While a specific timeline for the integration of Gemini and Veo has not been disclosed, Hassabis’s statements indicate that Google is positioning itself to lead the charge in the next phase of AI innovation, one where models can understand and respond to complex stimuli across multiple forms of media.
Conclusion
Google’s plan to combine Gemini and Veo is a bold move toward creating a unified AI that can interact with users in more dynamic and contextually aware ways. By incorporating video analysis and expanding Gemini’s multimodal capabilities, the company is taking a significant step toward realizing the vision of a universal digital assistant, one that can assist in both the digital and physical realms. As Google continues to push the boundaries of AI, the integration of these models will likely have far-reaching implications for industries and users alike.