Jina CLIP v2: Multilingual Multimodal Embeddings for Text and Images

Jina-CLIP v2, a 0.9B multimodal embedding model with multilingual support of 89 languages, high image resolution at 512x512, and Matryoshka representations.

jinaai/jina-clip-v2 · Hugging Face
Jina AI - Your Search Foundation, Supercharged.

jina-clip-v2 API is available under the "Embeddings" tab.

Multimodal embeddings enable searching and understanding data across different modalities through a coherent representation. They serve as the backbone of neural information retrieval and multimodal GenAI applications. Today, we're excited to release jina-clip-v2, a new general-purpose multilingual multimodal embedding model built upon jina-clip-v1 and our recently released jina-embeddings-v3, featuring several key improvements:

  • Improved Performance: v2 shows a 3% performance improvement over v1 in both text-image and text-text retrieval tasks. Similar to v1, v2's text encoder can serve as an effective multilingual long-context dense retriever. It performs on par with our frontier model jina-embeddings-v3 (currently the best multilingual embeddings under 1B parameters on MTEB).
  • Multilingual Support: Powered by jina-embeddings-v3 as the text tower, jina-clip-v2 supports 89 languages for multilingual-image retrieval, showing up to 4% improvement compared to nllb-clip-large-siglip on multilingual image retrieval tasks.
  • Higher Image Resolution: v2 now supports 512x512 input image resolution, a significant increase from v1's 224x224. This higher resolution enables better processing of detailed images, improved feature extraction, and more accurate recognition of fine-grained visual elements.
  • Matryoshka Representations: v2 allows users to truncate the output dimensions of both text and image embeddings from 1024 down to 64, reducing storage and processing overhead while maintaining strong performance.
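As a rough illustration of the Matryoshka idea in the last bullet, the sketch below keeps the leading components of a full-size vector and re-normalizes the result. The function name `truncate_embedding` and the random stand-in vector are assumptions made for illustration; they are not part of the model or API, and whether re-normalization is needed depends on the similarity metric you use downstream.

```python
import numpy as np

def truncate_embedding(embedding, dim):
    """Keep the first `dim` components and L2-normalize the result."""
    vec = np.asarray(embedding, dtype=np.float64)
    if not 1 <= dim <= vec.shape[-1]:
        raise ValueError(f"dim must be between 1 and {vec.shape[-1]}")
    truncated = vec[..., :dim]
    norm = np.linalg.norm(truncated, axis=-1, keepdims=True)
    # Guard against division by zero for an all-zero prefix.
    return truncated / np.maximum(norm, 1e-12)

# Stand-in for a 1024-dimensional model output (random values, not real embeddings).
rng = np.random.default_rng(0)
full = rng.normal(size=1024)
small = truncate_embedding(full, 64)
print(small.shape)  # (64,)
```

The same function works on a batch of shape `(n, 1024)` because it slices and normalizes along the last axis.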

Model Architecture

Jina-CLIP v2 combines a text encoder (Jina XLM-RoBERTa, 561M parameters) and a vision encoder (EVA02-L14, 304M parameters) for a total of 865M parameters. The text encoder is also used in jina-embeddings-v3.

jina-clip-v2 is a 0.9B CLIP-style model that combines two powerful encoders: the text encoder Jina XLM-RoBERTa (the backbone of jina-embeddings-v3) and the vision encoder EVA02-L14 (an efficient vision Transformer developed by BAAI). These encoders are jointly trained to create aligned representations of images and text.

| Feature | Text Encoder | Image Encoder |
|---|---|---|
| Base Model | Jina XLM-RoBERTa | EVA02-L |
| Parameters | 561M | 304M |
| Input Specification | 8,192 tokens (max) | 512×512 pixels |
| Min Output Dimensions | 64 | 64 |
| Max Output Dimensions | 1,024 | 1,024 |
| Layers | 24 | 24 |
| Attention Mechanism | FlashAttention2 | xFormers |
| Pooling Strategy | Mean pooling | CLS pooling |
| Additional Features | 89 languages supported | Patch size 14x14 |
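The two pooling strategies listed in the table differ in how a sequence of token (or patch) vectors is reduced to a single embedding. The following is a generic sketch of both, not the model's actual implementation: the function names, the toy shapes, and the assumption that the CLS token sits at position 0 are illustrative.

```python
import numpy as np

def mean_pool(hidden_states, attention_mask):
    """Average token vectors, ignoring padded positions (mask == 0)."""
    mask = attention_mask[..., None].astype(hidden_states.dtype)
    summed = (hidden_states * mask).sum(axis=1)
    counts = np.maximum(mask.sum(axis=1), 1.0)  # avoid dividing by zero
    return summed / counts

def cls_pool(hidden_states):
    """Use the first position's vector as the sequence representation."""
    return hidden_states[:, 0, :]

# Toy batch: 2 sequences, 4 positions, hidden size 3 (made-up numbers).
hidden = np.arange(24, dtype=np.float64).reshape(2, 4, 3)
mask = np.array([[1, 1, 0, 0], [1, 1, 1, 1]])

print(mean_pool(hidden, mask))  # first row averages only the two unmasked tokens
print(cls_pool(hidden))
```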

Cross-Modal Retrieval Performance

Jina CLIP v2 supports 89 languages, with top performance in major languages including Arabic, Chinese, English, French, German, Japanese, Russian, and Spanish. In multilingual image retrieval benchmarks, Jina-CLIP v2 (865M parameters) matches or surpasses NLLB-CLIP-SigLIP, a state-of-the-art CLIP-style model using a pre-trained text encoder from NLLB models. Our model sits between the two NLLB-CLIP-SigLIP versions in terms of size: nllb-siglip-base (507M parameters, 41% smaller than ours) and nllb-siglip-large (1.2B parameters, 39% larger than ours).

English-Only Text and Images

On standard cross-modal retrieval benchmarks (Flickr30k and COCO), jina-clip-v2 demonstrates strong improvements across the board. It achieves state-of-the-art performance of 98.0% on Flickr30k image-to-text retrieval, surpassing both its predecessor and NLLB-CLIP-SigLIP. The model shows consistent gains across all retrieval scenarios, with notable improvements of up to 3.3% over v1 on COCO image-to-text retrieval, while maintaining competitive performance with NLLB-CLIP-SigLIP across different benchmarks and modality directions.
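For readers unfamiliar with the metric used in the tables that follow, here is a minimal sketch of Recall@K computed from a query-by-candidate similarity matrix. It makes a simplifying assumption that each query has exactly one correct candidate at the same index; real benchmarks such as Flickr30k pair each image with several captions, so their evaluation code is more involved. The function name and the toy matrix are illustrative.

```python
import numpy as np

def recall_at_k(similarity, k=5):
    """Fraction of queries whose matching item ranks in the top k.

    similarity[i, j] is the score between query i and candidate j;
    the correct candidate for query i is assumed to be candidate i.
    """
    similarity = np.asarray(similarity)
    # Indices of the k highest-scoring candidates for each query.
    top_k = np.argsort(-similarity, axis=1)[:, :k]
    targets = np.arange(similarity.shape[0])[:, None]
    hits = (top_k == targets).any(axis=1)
    return float(hits.mean())

# Toy scores for 3 queries and 3 candidates (made-up numbers).
toy = [[0.9, 0.1, 0.2],
       [0.8, 0.3, 0.1],
       [0.1, 0.2, 0.7]]
print(recall_at_k(toy, k=1))  # query 1 ranks the wrong candidate first
print(recall_at_k(toy, k=2))
```

Benchmark tables usually report this fraction as a percentage.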

Flickr30k Recall@5 Performance:

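| Task | Model | Score | Δ v1 | Δ NLLB-L |
|---|---|---|---|---|
| Image-to-text | jina-clip-v2 | 98.0 | +1.7% | +0.9% |
| Image-to-text | jina-clip-v1 | 96.4 | - | -0.7% |
| Image-to-text | nllb-siglip-large | 97.1 | - | - |
| Image-to-text | nllb-siglip-base | 95.0 | - | - |
| Text-to-image | jina-clip-v2 | 89.8 | +0.9% | -2.6% |
| Text-to-image | jina-clip-v1 | 89.0 | - | -3.5% |
| Text-to-image | nllb-siglip-large | 92.2 | - | - |
| Text-to-image | nllb-siglip-base | 90.0 | - | - |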

COCO Recall@5 Performance:

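| Task | Model | Score | Δ v1 | Δ NLLB-L |
|---|---|---|---|---|
| Image-to-text | jina-clip-v2 | 81.5 | +3.3% | +2.9% |
| Image-to-text | jina-clip-v1 | 78.9 | - | -0.4% |
| Image-to-text | nllb-siglip-large | 79.2 | - | - |
| Image-to-text | nllb-siglip-base | 77.7 | - | - |
| Text-to-image | jina-clip-v2 | 68.4 | +2.9% | -3.4% |
| Text-to-image | jina-clip-v1 | 66.5 | - | -6.1% |
| Text-to-image | nllb-siglip-large | 70.8 | - | - |
| Text-to-image | nllb-siglip-base | 69.1 | - | - |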
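The Δ columns in the two tables above appear to be relative changes rather than percentage-point differences. This is inferred from the reported numbers, not stated explicitly, and the helper name `relative_delta` is an assumption. A small sketch that reproduces two of the Flickr30k entries:

```python
def relative_delta(score, baseline):
    """Percentage change of `score` relative to `baseline`."""
    return (score - baseline) / baseline * 100.0

# Flickr30k image-to-text scores taken from the table above.
print(f"{relative_delta(98.0, 96.4):+.1f}%")  # v2 vs v1: +1.7%
print(f"{relative_delta(98.0, 97.1):+.1f}%")  # v2 vs nllb-siglip-large: +0.9%
```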

Multilingual Text and Images

On multilingual cross-modal benchmarks, jina-clip-v2 demonstrates robust performance, particularly excelling in image-to-text retrieval, where it outperforms NLLB-SigLIP across all datasets, with up to +3.8% improvement on Crossmodal 3600. While NLLB-SigLIP shows slightly stronger text-to-image retrieval capabilities, the performance gap remains small, at roughly 3% or less.

Image2Text Recall@5 Performance:

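| Benchmark | Model | Score | Δ NLLB-L |
|---|---|---|---|
| Crossmodal 3600 | jina-clip-v2 | 83.23 | +3.8% |
| Crossmodal 3600 | nllb-siglip-large | 80.16 | - |
| Crossmodal 3600 | nllb-siglip-base | 76.56 | - |
| Multilingual MS Coco | jina-clip-v2 | 86.03 | +0.8% |
| Multilingual MS Coco | nllb-siglip-large | 85.37 | - |
| Multilingual MS Coco | nllb-siglip-base | 84.87 | - |
| XTD10 | jina-clip-v2 | 85.98 | +0.7% |
| XTD10 | nllb-siglip-large | 85.41 | - |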

Text2Image Recall@5 Performance:

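| Benchmark | Model | Score | Δ NLLB-L |
|---|---|---|---|
| Crossmodal 3600 | jina-clip-v2 | 81.43 | -0.8% |
| Crossmodal 3600 | nllb-siglip-large | 82.07 | - |
| Crossmodal 3600 | nllb-siglip-base | 79.29 | - |
| Multilingual MS Coco | jina-clip-v2 | 84.87 | -3.1% |
| Multilingual MS Coco | nllb-siglip-large | 87.60 | - |
| Multilingual MS Coco | nllb-siglip-base | 86.23 | - |
| XTD10 | jina-clip-v2 | 85.03 | -3.0% |
| XTD10 | nllb-siglip-large | 87.63 | - |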

Text-Only Dense Retriever Performance

Similar to its predecessor, jina-clip-v2's text encoder can serve as an effective multilingual dense retriever. On the comprehensive Multilingual MTEB benchmarks, it achieves strong performance, reaching 69.86% on retrieval and 67.77% on semantic similarity tasks. These results demonstrate its versatility, performing competitively with our specialized text embedding model jina-embeddings-v3:

| Task | Model | Score | Relative to v3 |
|---|---|---|---|
| Retrieval | jina-clip-v2 | 69.86 | -3.8% |
| Retrieval | jina-embeddings-v3 | 72.59 | - |
| Semantic Similarity | jina-clip-v2 | 67.77 | -2.9% |
| Semantic Similarity | jina-embeddings-v3 | 69.81 | - |

On English tasks, jina-clip-v2 shows consistent improvements over both its predecessor and NLLB-SigLIP, with particularly strong advantages in retrieval performance (nearly double NLLB-SigLIP's score).

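| Task | Model | Score | Relative to v1 |
|---|---|---|---|
| STS | jina-clip-v2 | 81.29 | +0.5% |
| STS | jina-clip-v1 | 80.92 | - |
| STS | nllb-siglip-large | 74.65 | - |
| Retrieval | jina-clip-v2 | 49.33 | +2.1% |
| Retrieval | jina-clip-v1 | 48.33 | - |
| Retrieval | nllb-siglip-large | 24.92 | - |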

Matryoshka Representation Performance

Both the text and image encoders support Matryoshka Representation Learning (MRL), so their output dimensions can be truncated from 1024 down to 64 while maintaining strong performance. In our embedding truncation evaluation, even an aggressive 75% dimensional reduction (from 1024 to 256 dimensions) retained over 99% of performance across text, image, and cross-modal tasks.
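To put the truncation levels in storage terms, the sketch below estimates the raw size of a vector index at different dimensions. It assumes 4-byte float32 values and one million vectors, and it ignores index overhead; those figures and the helper name `index_size_bytes` are illustrative assumptions rather than numbers from the evaluation.

```python
def index_size_bytes(num_vectors, dim, bytes_per_value=4):
    """Raw storage for `num_vectors` embeddings of `dim` values each."""
    return num_vectors * dim * bytes_per_value

# One million embeddings at the full size and two truncated sizes.
for dim in (1024, 256, 64):
    size_gb = index_size_bytes(1_000_000, dim) / 1e9
    reduction = 1 - dim / 1024
    print(f"{dim:>4} dims: {size_gb:.3f} GB ({reduction:.0%} smaller than 1024)")
```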

Image Classification

Across 37 diverse image classification benchmarks, the image encoder shows strong resilience to truncated dimensions. Compressing from 1024 to 64 dimensions (94% reduction) results in only an 8% drop in top-5 accuracy and 12.5% in top-1, highlighting its potential for efficient deployment with minimal performance loss.

For image classification, we used the 19 benchmarks in the VTAB dataset, VOC 2007, SUN397, STL10, Rendered SST2, ObjectNet, MNIST, German Traffic Sign Recognition Benchmark (GTSRB), Fine-Grained Visual Classification of Aircraft (FGVC-Aircraft), FER 2013, Country211, Cars196, ImageNet-A, ImageNet-O, ImageNet1k, ImageNet Sketch, and ImageNet v2.

Cross-Modal Retrieval

Despite a dramatic 94% reduction to just 64 dimensions, cross-modal retrieval using both truncated image and text embeddings remained remarkably robust, preserving 93% of image-to-text and 90% of text-to-image performance.

We used six benchmarks, three of which are multilingual: Crossmodal-3600 (36 languages), Flickr30k (English only), Flickr8k (English only), MS COCO Captions (English only), Multilingual MS COCO Captions (10 languages), and XTD 200 (27 languages).

Text-Only Retrieval

On English-only MTEB benchmarks, 64-dimension text embeddings (compressed from 1024) preserved semantic similarity remarkably well, dropping only 2.1%, while retrieval saw a modest 17.5% decrease.


Getting Started

Via API

The code below shows how to generate embeddings with Python's requests library. Pass a text string together with an image, given either as a base64 string or as a URL, plus your desired dimension size (default 1024, shown as 768 below).

import requests
import numpy as np
from numpy.linalg import norm

# Cosine similarity between two 1-D embedding vectors
def cos_sim(a, b):
    return (a @ b) / (norm(a) * norm(b))

url = 'https://api.jina.ai/v1/embeddings'

headers = {
  'Content-Type': 'application/json',
  'Authorization': 'Bearer <YOUR_JINA_AI_API_KEY>'
}

# One text input and one image input (by URL), embedded in a single request
data = {
  'input': [
     {"text": "Bridge close-shot"},
     {"url": "https://fastly.picsum.photos/id/84/1280/848.jpg?hmac=YFRYDI4UsfbeTzI8ZakNOR98wVU7a-9a2tGF542539s"}],
  'model': 'jina-clip-v2',
  'encoding_type': 'float',
  'dimensions': 768  # truncated output size; the default is 1024
}

response = requests.post(url, headers=headers, json=data)
response.raise_for_status()  # fail early on HTTP errors such as an invalid API key

# Parse the response once; results come back in the same order as the inputs
embeddings = [np.array(item['embedding']) for item in response.json()['data']]
sim = cos_sim(embeddings[0], embeddings[1])
print(f"Cosine text<->image: {sim}")

Remember to replace <YOUR_JINA_AI_API_KEY> with an activated Jina API key. You can get a free API key with a million free tokens from here.
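Once embeddings are available, a common next step is to rank several images against one text query. The sketch below does this with cosine similarity over plain arrays. The function name `rank_by_cosine` and the tiny 3-dimensional placeholder vectors are assumptions for illustration; in practice the inputs would be the embedding arrays returned by the API.

```python
import numpy as np

def rank_by_cosine(query, candidates):
    """Return candidate indices sorted from most to least similar, plus scores."""
    query = np.asarray(query, dtype=np.float64)
    candidates = np.asarray(candidates, dtype=np.float64)
    # Normalize so that the dot product equals cosine similarity.
    query = query / np.linalg.norm(query)
    candidates = candidates / np.linalg.norm(candidates, axis=1, keepdims=True)
    scores = candidates @ query
    order = np.argsort(-scores)
    return order, scores[order]

# Placeholder vectors standing in for one text and three image embeddings.
text_embedding = [1.0, 0.0, 0.0]
image_embeddings = [[0.0, 1.0, 0.0], [0.9, 0.1, 0.0], [0.5, 0.5, 0.0]]

order, scores = rank_by_cosine(text_embedding, image_embeddings)
print(order.tolist())  # [1, 2, 0]
```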

Image Tokens Pricing

Our API counts both text and image tokens. For images, token consumption is based on the number of 512x512 pixel tiles needed to cover the entire image area. Each tile costs 4,000 tokens to process, including partially filled tiles. For optimal cost-efficiency, we recommend that API users resize their images to 512x512 before sending requests.

Image Resolution Required Tiles Token Cost
512x512 1 4,000
720x720 4 16,000
1080x1080 9 36,000
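The table is consistent with counting tiles on a 512-pixel grid, rounding up along each side. Assuming that is how tiles are laid out (an inference from the numbers above, not a documented formula), the cost of an arbitrary image can be estimated as follows. The function and constant names are illustrative.

```python
import math

TILE_SIZE = 512          # tile edge length in pixels
TOKENS_PER_TILE = 4000   # cost per tile, including partially filled tiles

def image_token_cost(width, height):
    """Return (tiles, tokens) for an image of the given pixel size."""
    tiles = math.ceil(width / TILE_SIZE) * math.ceil(height / TILE_SIZE)
    return tiles, tiles * TOKENS_PER_TILE

for size in (512, 720, 1080):
    tiles, tokens = image_token_cost(size, size)
    print(f"{size}x{size}: {tiles} tiles, {tokens:,} tokens")
```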
For square images, resize to 512x512 for best cost-efficiency. For aspect ratio-sensitive tasks, scale the longest edge to 512, center the image, and pad with black. For general purposes, direct 512x512 resizing works well.
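For the aspect-ratio-preserving option, the geometry can be worked out before touching any image library. The sketch below computes the scaled size and the offsets at which the resized image would be pasted onto a black 512x512 canvas. It is a hypothetical helper (`letterbox_geometry` is not part of any Jina tooling), and the actual resizing and padding would be done with an imaging library of your choice.

```python
def letterbox_geometry(width, height, target=512):
    """Scale the longest edge to `target` and center on a target x target canvas.

    Returns (new_width, new_height, left, top), where (left, top) is the
    paste offset of the resized image on the padded canvas.
    """
    scale = target / max(width, height)
    new_w = max(1, round(width * scale))
    new_h = max(1, round(height * scale))
    left = (target - new_w) // 2
    top = (target - new_h) // 2
    return new_w, new_h, left, top

# Landscape image with the same pixel size as the example URL used earlier.
print(letterbox_geometry(1280, 848))  # (512, 339, 0, 86)
```

After padding, the image occupies a single 512x512 tile regardless of its original aspect ratio.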

Via CSP Marketplaces

Jina CLIP v2 is available directly on AWS, Azure and GCP at the prices listed there.

AWS Marketplace: Jina CLIP v2
Microsoft Azure Marketplace
Google Cloud console

Via VectorDB

The vector database to build knowledgeable AI | Pinecone
Multimodal Embeddings | Weaviate
Weaviate’s integration with Jina AI’s APIs allows you to access their models’ capabilities directly from Weaviate.
Jina Embeddings - Qdrant

Conclusion

Building on our jina-clip-v1 release in June, which extended OpenAI's CLIP model with text input up to 8,192 tokens, and the frontier multilingual jina-embeddings-v3, jina-clip-v2 brings three major advances: multilingual support for 89 languages, increased image resolution at 512x512, and Matryoshka representation learning for embeddings that can be truncated to smaller sizes.

CLIP-like models have established themselves as the backbone for general-purpose multimodal applications. With jina-clip-v2, we're taking these capabilities to the next level, breaking down language barriers to deliver more accurate cross-modal understanding and retrieval. We believe this release delivers on the promise of making multimodal search and retrieval both more powerful and more accessible to developers worldwide.