Ich bin ein Berliner: German-English Bilingual Embeddings with 8K Token Length

Jina AI introduces a German/English bilingual embedding model with an 8,192-token context length, designed to support German businesses thriving in the U.S. market.


Berlin, Germany - January 15, 2023 - Echoing JFK's iconic 'Ich bin ein Berliner', we at Jina AI are thrilled to bridge languages in our own way. Today, we're proud to announce our latest innovation: jina-embeddings-v2-base-de, a German/English embedding model. This state-of-the-art bilingual model is a significant stride forward in language representation, with a context length of 8,192 tokens. What sets it apart is its efficiency: it achieves top-tier performance while being only one-seventh the size of comparable models.

Embeddings are crucial for German businesses looking to expand into the U.S. market. According to the German American Business Outlook (GABO) 2022, approximately a third of German companies generate over 20% of their global sales and profits in the U.S., with 93% expecting an increase in U.S. sales. This trend continues as 93% plan to grow their company's U.S. investments in the next three years, with 85% expecting net sales growth and a significant focus on digital transformation. Good embeddings can play a pivotal role in this expansion by facilitating better understanding of customer preferences, enabling more effective communication, and positioning culturally resonant products.

Our breakthrough is particularly beneficial for German businesses looking to implement bilingual applications in English-speaking countries. With jina-embeddings-v2-base-de, we're excited to see how German companies will innovate and thrive in an increasingly connected world.

Model Highlights

  • State-of-the-art Performance: jina-embeddings-v2-base-de consistently ranks at the top in relevant benchmarks and leads among open-source models of similar size.
  • Bilingual Model: This model encodes texts in both German and English, allowing the use of either language as the query or target document in retrieval applications. Texts with equivalent meanings in the two languages are mapped close together in a shared embedding space, forming the basis for multilingual applications.
  • Extended Context: An 8,192-token context length enables jina-embeddings-v2-base-de to support longer texts and document fragments, far surpassing models that only support a few hundred tokens at a time.
  • Compact Size: jina-embeddings-v2-base-de is built for high performance on standard computer hardware. With only 161 million parameters, the entire model is 322MB and fits in the memory of commodity computers. The embeddings themselves are 768 dimensions, a relatively small vector size compared to many models, saving space and run-time for applications.
  • Bias Minimization: Recent research shows that multilingual models without specific language training show strong biases towards English grammatical structures in embeddings. Embedding models should be about capturing meaning and not favor sentence pairs that are merely superficially similar.
  • Seamless Integration: Jina Embeddings v2 models have native integrations with major vector databases, including MongoDB, Qdrant, and Weaviate, as well as RAG and LLM frameworks such as Haystack and LlamaIndex.
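
The size figures in the list above can be checked with a little arithmetic. The sketch below assumes two bytes per parameter (16-bit weights), which is what the stated 161 million parameters and 322MB imply, and assumes embeddings are stored as 32-bit floats. Neither storage format is stated here, and the function names are purely illustrative.

```python
# Back-of-the-envelope sizing for the figures quoted in the highlights.
PARAMS = 161_000_000      # parameter count stated in the text
BYTES_PER_PARAM = 2       # assumption: 16-bit weights
EMBEDDING_DIMS = 768      # vector size stated in the text
BYTES_PER_VALUE = 4       # assumption: vectors stored as float32


def model_size_mb(params=PARAMS, bytes_per_param=BYTES_PER_PARAM):
    """Approximate weight size in megabytes (1 MB = 1,000,000 bytes)."""
    return params * bytes_per_param / 1_000_000


def index_size_mb(num_vectors, dims=EMBEDDING_DIMS, bytes_per_value=BYTES_PER_VALUE):
    """Approximate raw storage for `num_vectors` embeddings, in megabytes."""
    return num_vectors * dims * bytes_per_value / 1_000_000


print(model_size_mb())           # 322.0
print(index_size_mb(1_000_000))  # 3072.0
```

Under these assumptions, a million stored vectors take roughly 3GB before any index overhead, and that figure scales linearly with the number of dimensions.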

Leading Performance in German NLP

We've put jina-embeddings-v2-base-de to the test against four renowned baselines that also support both German and English.

Our benchmarks include the MTEB tasks for English and our own custom benchmark. Given the lack of a comprehensive benchmark suite for German embeddings, we took the initiative to develop our own, inspired by the MTEB. We're proud to share our findings and breakthroughs with you here.

GitHub - jina-ai/mteb-de: MTEB: Massive Text Embedding Benchmark
A table comparing machine learning model performance with four distinct models listed by name, size in MB, and proficiency in German and English tasks

Compact Size, Superior Results

jina-embeddings-v2-base-de demonstrates exceptional performance, especially in German language tasks. It outshines the E5 base model while being less than a third of its size. Moreover, it stands toe-to-toe with the E5 large model, which is seven times larger, showcasing its efficiency and power. This efficiency makes jina-embeddings-v2-base-de a game-changer, particularly when compared to other popular bi- and multilingual embedding models.

Excelling in German-English Cross-Language Retrieval

Our model isn't just about size and efficiency; it's also a top performer in English-German cross-language retrieval tasks. This is evident in its performance in various key benchmarks:

  • WikiCLIR, for English to German retrieval
  • STS17, part of the MTEB evaluation for English to German retrieval
  • STS22, for German to English retrieval, also part of MTEB
  • BUCC, for German to English retrieval, included in MTEB

The results on these benchmarks, all of which except WikiCLIR are part of the MTEB evaluation, underscore the effectiveness of jina-embeddings-v2-base-de in handling complex bilingual tasks.
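
To make the idea of cross-language retrieval concrete, here is a minimal sketch of ranking German documents against an English query by cosine similarity. The three-dimensional vectors are invented for illustration and are not output of jina-embeddings-v2-base-de; real embeddings would have 768 dimensions. The helper names `cosine` and `rank` are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank(query_vec, docs):
    """Return document texts sorted from most to least similar to the query."""
    ordered = sorted(docs, key=lambda d: cosine(query_vec, d[1]), reverse=True)
    return [text for text, _ in ordered]


# Toy vectors, made up for this example (not model output).
docs = [
    ("Die Zinsen sind in diesem Quartal erneut gestiegen.", [0.0, 0.2, 0.9]),
    ("Das Brandenburger Tor steht in Berlin.", [0.9, 0.1, 0.0]),
]
# Stands in for the embedding of "Where is the Brandenburg Gate?"
query = [0.8, 0.2, 0.1]

print(rank(query, docs)[0])  # Das Brandenburger Tor steht in Berlin.
```

With a bilingual model, the query and the documents can be in either language, because both are compared in the same vector space.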

Comparison table of language models with sizes in MB and accuracy percentages for metrics like WikiCLIR and STS17

Get API Access

For enterprise users who value privacy and data compliance, our models, including jina-embeddings-v2-base-de, are accessible via the Jina Embeddings API:

  1. Visit Jina Embeddings API and click on the model dropdown
  2. Select jina-embeddings-v2-base-de
Embedding API
Top-performing, 8192-token context length, $100 for 1.25B tokens, seamless OpenAI alternative, free trial
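
Once the model is selected, a request from code might look like the sketch below. It assumes an OpenAI-style JSON interface, as the "OpenAI alternative" description suggests. The endpoint URL, the payload fields, and the response layout are assumptions rather than details given in this announcement, so check them against the API page before relying on them. The function names are hypothetical.

```python
import json
import urllib.request

# Assumed endpoint and OpenAI-style payload; verify against the API page.
API_URL = "https://api.jina.ai/v1/embeddings"
MODEL = "jina-embeddings-v2-base-de"


def build_request(texts, api_key):
    """Build a POST request asking for embeddings of `texts`."""
    payload = {"model": MODEL, "input": list(texts)}
    return urllib.request.Request(
        API_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",
        },
        method="POST",
    )


def parse_embeddings(response_body):
    """Extract vectors from an assumed OpenAI-style response, ordered by index."""
    items = json.loads(response_body)["data"]
    return [item["embedding"] for item in sorted(items, key=lambda d: d["index"])]


def embed(texts, api_key):
    """Send the request and return one embedding per input text."""
    with urllib.request.urlopen(build_request(texts, api_key)) as resp:
        return parse_embeddings(resp.read())
```

Calling `embed(["Hallo Welt", "Hello world"], api_key)` with a valid key would then return two vectors, one per input, if the assumptions above hold.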

We will make this model available in the AWS SageMaker marketplace for Amazon cloud users and for download on Hugging Face very soon.

Jina 8K Embeddings: The Cornerstone of Diverse AI Applications

Embeddings are crucial for a wide range of AI applications, including information retrieval, data quality control, classification, and recommendation. They are fundamental to enhancing numerous AI tasks.

Jina AI is committed to advancing the state-of-the-art in embedding technology, keeping our core AI components transparent, accessible, and affordable to enterprises of all types and sizes that value privacy and data compliance. In addition to jina-embeddings-v2-base-de, Jina AI has released state-of-the-art embedding models for Chinese and high-performance English monolingual models. This is part of our mission to make AI technology more inclusive and globally applicable.

We value your feedback. Join our community channel to contribute feedback and stay informed about our advancements. Together, we're shaping a more robust and inclusive AI future.
