Today we’re releasing Gemini Embedding 2, our first fully multimodal embedding model built on the Gemini architecture, available in Public Preview through the Gemini API and Vertex AI.
Expanding on our earlier text-only foundation, Gemini Embedding 2 maps text, images, videos, audio, and documents into a single, unified embedding space, and captures semantic intent across more than 100 languages. This simplifies complex pipelines and improves a wide range of multimodal downstream tasks, from Retrieval-Augmented Generation (RAG) and semantic search to sentiment analysis and data clustering.
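Because every modality lands in the same vector space, cross-modal retrieval reduces to nearest-neighbor search over one index. The sketch below illustrates the idea with placeholder vectors; the embedding values and dimensionality are illustrative, not real model output.

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Placeholder embeddings: in practice each vector would come from the model,
# one per item, regardless of whether the item is text, an image, or audio.
rng = np.random.default_rng(0)
text_emb = rng.normal(size=768)          # embedding of a text query
image_embs = rng.normal(size=(3, 768))   # embeddings of three images

# Cross-modal retrieval: rank images against the text query in the shared space.
scores = [cosine_similarity(text_emb, img) for img in image_embs]
best = int(np.argmax(scores))
```

The same ranking loop works unchanged whether the candidates are images, video clips, or documents, which is the practical payoff of a unified space.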
New modalities and flexible output dimensions
The model is based on Gemini and leverages its best-in-class multimodal understanding to create high-quality embeddings across:
- Text: supports an expansive context of up to 8,192 input tokens
- Images: processes up to 6 images per request, supporting PNG and JPEG formats
- Videos: supports up to 120 seconds of video input in MP4 and MOV formats
- Audio: natively ingests and embeds audio data without needing intermediate text transcriptions
- Documents: directly embeds PDFs up to 6 pages long
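The per-modality limits above can be checked client-side before a request is sent. The helper below is a hypothetical sketch: the function name, field names, and structure are ours for illustration, not part of the actual API.

```python
# Hypothetical client-side validator for the limits listed above.
# The names and structure are illustrative, not the real API schema.
LIMITS = {
    "max_images_per_request": 6,
    "max_video_seconds": 120,
    "max_pdf_pages": 6,
}

def validate_request(images: int = 0, video_seconds: float = 0.0,
                     pdf_pages: int = 0) -> list:
    """Return a list of limit violations; an empty list means the request is OK."""
    errors = []
    if images > LIMITS["max_images_per_request"]:
        errors.append(f"too many images: {images} > 6")
    if video_seconds > LIMITS["max_video_seconds"]:
        errors.append(f"video too long: {video_seconds}s > 120s")
    if pdf_pages > LIMITS["max_pdf_pages"]:
        errors.append(f"PDF too long: {pdf_pages} pages > 6")
    return errors
```

Validating early like this avoids a round trip to the service for requests that would be rejected anyway.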
Beyond processing one modality at a time, the model natively understands interleaved input, so you can pass multiple modalities (e.g., image + text) in a single request. This lets the model capture the complex, nuanced relationships between different media types, unlocking more accurate understanding of real-world data.
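Conceptually, an interleaved request is just an ordered list of heterogeneous parts. The sketch below shows that shape with a made-up structure; the dict keys and wire format are assumptions for illustration, not the actual request schema.

```python
# Illustrative interleaved request: one ordered list of parts mixing modalities.
# The dict structure is a sketch, not the exact API wire format.
def make_interleaved_request(parts: list) -> dict:
    """Bundle heterogeneous parts into a single embedding request payload."""
    return {"contents": parts}

request = make_interleaved_request([
    {"type": "image", "uri": "product_photo.jpg"},       # hypothetical image part
    {"type": "text", "text": "red running shoe, side view"},
])
```

Because both parts travel in one request, the model can embed the pair jointly rather than averaging two independent embeddings after the fact.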
