Google Shakes Up Competition With Multimodal AI

Google's Latest Gemini AI Model Can Process Text, Video, Audio, Images and Code
Google Shakes Up Competition With Multimodal AI
Alphabet CEO Sundar Pichai unveils Gemini. (Image: Google)

Competition for the emerging generative AI market is intensifying, and Google has unveiled an advanced AI tool named Gemini that can process text, video, audio, images and code. Google parent company Alphabet said Gemini is its most capable model yet and has sophisticated multimodal reasoning capabilities.

See Also: Live Webinar | Protecting Your AI: Strategies for Securing AI Systems

Alphabet CEO Sundar Pichai, in the AI tool's curtain-raiser, said Gemini is one of the company's biggest science and engineering efforts and that it represents a leapfrog in the development of artificial intelligence technology. "We believe it has the power to revolutionize our interaction with the world around us," he said.

Will Multimodal Capabilities Give Google an Edge?

Google's latest bet in the AI race significantly outpaces traditional AI systems that are mostly limited to text or voice commands. Gemini brings capabilities that are distinct from ChatGPT, OpenAI's conversational AI system which, released a year ago, gave internet users their first real experience with generative AI.

ChatGPT over the past year has undergone several upgrade cycles and survived grueling tests by developers and AI testers that few AI models have survived beyond the first week. ChatGPT's long-term survival could ultimately rest on its ability to match Gemini's capabilities.

In response to a question about the new Google tool, ChatGPT on Thursday said, "Gemini may be more advanced in scenarios that require handling multiple forms of media simultaneously. However, for pure text-based applications, ChatGPT's specialized capabilities might be more effective."

ChatGPT also said that the two AI systems could complement each other and continue to evolve to enable new use cases.

"In some cases, these models can be seen as complementary rather than directly competitive. For example, in a complex system, Gemini might handle multimedia tasks while ChatGPT manages text-based interactions," it said. "Both models are likely to continue evolving, with new features and capabilities being added over time."

Google executives, on the other hand, are intent on proving that Gemini brings in the best of both worlds. "We designed Gemini to be natively multimodal, pretrained from the start on different modalities. Then we fine-tuned it with additional multimodal data to further refine its effectiveness," said Demis Hassabis, CEO and co-founder of Google DeepMind, the AI division that created Gemini. "This helps Gemini seamlessly understand and reason about all kinds of inputs from the ground up, far better than existing multimodal models - and its capabilities are state of the art in nearly every domain."

A white paper unveiled on Wednesday details the superior performance of the most advanced version of Gemini, which surpassed ChatGPT-4 in various assessments, such as multiple-choice exams and grade-school math. But the document also acknowledges persistent challenges in advancing AI models to attain higher-level reasoning skills.

Google said Gemini demonstrates impressive reasoning capabilities, incorporating advanced programming and code-generation skills and excelling at identifying errors in complex mathematical responses.

Gemini outperforms its forerunner, PaLM 2, achieving a score exceeding 90% on the Massive Multitask Language Understanding benchmark. Gemini has 1.56 trillion parameters, and it has cost Google around $1 billion to $2 billion to build it. Google's generative AI chatbot, Bard, will be upgraded with Gemini. Pichai reportedly said the model will be integrated into Google's search engine at a later date, as well as into its Chrome browser and advertising products globally. "This is a significant milestone in the development of AI, and the start of a new era for us at Google," Hassabis said.

Google Announces 3 Versions of Gemini

Google launched three variants of Gemini, which each serve distinct purposes. Gemini Nano, a lightweight version, caters to native offline use on Android devices such as Pixel 8 Pro. Gemini Pro, a more robust variant, drives numerous Google AI services and serves as the foundation for Bard. Gemini Ultra, the most powerful large language model by Google to date, appears primarily tailored to data centers and enterprise applications. These versions are only available in English for now, but Google may announce additional language support in the future.

About the Author

Smruti Gandhi

Smruti Gandhi

Executive Editor, ISMG

Gandhi has more than a decade of experience in community engagement and incubating industry events. She is extremely proactive in building engagements with communities including CEO, CFOs and CIOs. Prior to joining ISMG, she worked with Dun & Bradstreet and Great Place to Work.

Around the Network

Our website uses cookies. Cookies enable us to provide the best experience possible and help us understand how visitors use our website. By browsing, you agree to our use of cookies.