What Should We Learn From ModernBERT?

With bigger training data, efficient parameter sizing, and a deep-but-thin architecture, ModernBERT sets a direction for future BERT-like models.


Back in 2018, Google dropped BERT, and it was a game-changer for NLP, way before the current LLM wave. Even now, tons of Small Language Models (SLMs) are built on BERT. Released in December 2024, ModernBERT takes what we've learned from recent LLM developments and applies it back to these smaller models. The key moves? Better parameter efficiency, code understanding, and long-context handling.

In this post, we'll break down how ModernBERT stacks up against two models we know inside and out: jina-XLM-RoBERTa (the multilingual backbone behind jina-embeddings-v3) and RoBERTa-large. Let's look at each model:

  • ModernBERT (Dec. 2024) is a recently released SLM, developed collaboratively by Answer.AI, LightOn, and HuggingFace. It leverages modern optimizations like RoPE for an 8,192-token context window and GeGLU layers (sketched just after this list), boosting performance while maintaining efficiency.
  • jina-XLM-RoBERTa (Sept. 2024) is a multilingual text embedding model based on Meta's XLM-RoBERTa. While the original XLM-RoBERTa enhances RoBERTa using the XLM large multilingual dataset, jina-XLM-RoBERTa takes it further with extended context training, RoPE implementation, and FlashAttention-2 support. This model serves as the backbone for jina-embeddings-v3.
  • RoBERTa-large (July 2019), developed by Meta, is an enhanced version of BERT with 355 million parameters. Through extended training, larger datasets, and innovations like dynamic masking, it has achieved impressive results on key benchmarks including GLUE, SQuAD, and RACE. This makes it well-suited for various NLP tasks, from text classification to question answering.
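As a quick refresher on the second of those optimizations: a GeGLU feed-forward block gates the usual GELU activation with a second linear projection. Here's a minimal PyTorch sketch of the idea - the class name and exact shapes are illustrative, not ModernBERT's actual implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GeGLUFeedForward(nn.Module):
    """Gated GELU feed-forward block: out = W_out(GELU(gate) * value)."""
    def __init__(self, hidden_size: int, intermediate_size: int):
        super().__init__()
        # One projection produces both the gate path and the value path.
        self.w_in = nn.Linear(hidden_size, 2 * intermediate_size, bias=False)
        self.w_out = nn.Linear(intermediate_size, hidden_size, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        gate, value = self.w_in(x).chunk(2, dim=-1)
        return self.w_out(F.gelu(gate) * value)

# Hidden and intermediate sizes roughly in ModernBERT-large's range.
ffn = GeGLUFeedForward(hidden_size=1024, intermediate_size=2624)
print(ffn(torch.randn(2, 16, 1024)).shape)  # torch.Size([2, 16, 1024])
```

Compared to a plain two-matrix feed-forward block, the gate adds one extra projection per layer, which is part of why ModernBERT can afford a smaller intermediate dimension.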

By comparing these models across three core aspects, we aim to highlight ModernBERT's effective design choices for fellow model developers and identify key development insights for future BERT-like models. We'll also share our learnings from developing jina-embeddings-v3 and discuss planned improvements for jina-embeddings-v4 and jina-reranker-v3.

ModernBERT's Parameter-Efficiency

Let's first look at ModernBERT's approach to parameter-efficiency - it's bringing in several key insights from recent LLM developments. ModernBERT leverages three core strategies: a deeper but thinner architecture, controlled vocabulary size, and progressive model upscaling from smaller models.

Deep-And-Thin Architecture

ModernBERT-large goes deeper with 28 layers, while jina-XLM-RoBERTa and RoBERTa-large run at 24. But here's the interesting part: despite those extra layers, it stays close to RoBERTa-large in parameter count. jina-XLM-RoBERTa needs more parameters since it handles 89 languages, while the other two focus on English only.

For small LLMs, depth (number of layers) matters more than width (number of hidden units): a deep-and-thin model structure excels at capturing abstract concepts, resulting in superior final performance.

Most of a transformer's parameters come from the attention and fully-connected layers. ModernBERT stays competitive size-wise by going thinner: its feed-forward layers use an intermediate dimension of 2,624 across 28 layers, compared to RoBERTa-large's 4,096 across 24 layers. This deeper-but-thinner setup lets it hit its performance targets without bloating the model.

| | ModernBERT-large | jina-XLM-RoBERTa | RoBERTa-large |
|---|---|---|---|
| Parameters | 400M | 550M | 355M |
| Hidden states | 1,024 | 1,024 | 1,024 |
| Intermediate dims | 2,624 | 4,096 | 4,096 |
| Attention heads | 16 | 16 | 16 |
| Layers | 28 | 24 | 24 |
| Vocabulary size | 50,368 | 250,002 | 50,265 |

This approach lines up with Meta's MobileLLM research, which found that for smaller models, depth matters more than width when it comes to capturing complex patterns and driving performance. Essentially, the ability to process information through more transformer layers proves more valuable than having wider layers for parallel processing.
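To make the trade-off concrete, here's a rough back-of-the-envelope parameter count in Python. It assumes the usual four h×h attention projections per layer, a two-matrix feed-forward block for the RoBERTa models, and a three-matrix GeGLU block for ModernBERT-large, and it ignores biases, layer norms, and position embeddings, so the totals are approximate.

```python
def approx_params(layers, hidden, intermediate, vocab, gated_ffn):
    """Rough transformer parameter count: attention + FFN + token embeddings."""
    attention = 4 * hidden * hidden                      # Q, K, V, and output projections
    ffn_mats = 3 if gated_ffn else 2                     # GeGLU adds a gate projection
    ffn = ffn_mats * hidden * intermediate
    return layers * (attention + ffn) + vocab * hidden   # plus the embedding table

# Deep-and-thin, GeGLU (ModernBERT-large): 28 layers x 2,624 intermediate dims
print(approx_params(28, 1024, 2624, 50_368, gated_ffn=True))    # ~395M (reported: 400M)
# Shallow-and-fat (RoBERTa-large): 24 layers x 4,096 intermediate dims
print(approx_params(24, 1024, 4096, 50_265, gated_ffn=False))   # ~353M (reported: 355M)
# Same architecture, multilingual vocabulary (jina-XLM-RoBERTa)
print(approx_params(24, 1024, 4096, 250_002, gated_ffn=False))  # ~558M (reported: 550M)
```

The deep-and-thin configuration ends up in the same ballpark as the shallow-and-fat one, which matches the totals in the table above.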

Let's look at the data on how this deep-and-thin architecture performs.

Put up against comparable models using the traditional shallow-and-fat architecture, ModernBERT delivers better results on key tasks like retrieval and STS, all while keeping the parameter count similar.

| Task | ModernBERT-large | jina-XLM-RoBERTa | RoBERTa-large |
|---|---|---|---|
| STS12 | 72.6 | 72.7 | 68.9 |
| STS13 | 84.9 | 83.9 | 81.0 |
| STS14 | 77.5 | 77.7 | 74.8 |
| STS15 | 84.8 | 85.8 | 84.1 |
| STS16 | 79.4 | 79.6 | 78.6 |
| STS17 | 87.5 | 87.2 | 87.2 |
| TRECCOVID | 61.1 | 59.6 | 49.3 |
| FiQA | 44.4 | 40.0 | 40.7 |
| NFCorpus | 32.6 | 30.6 | 27.9 |
| SciFact | 68.6 | 65.5 | 63.1 |
| Average | 69.3 | 68.2 | 65.6 |

Take jina-XLM-RoBERTa - it builds on RoBERTa-large's shallow-fat architecture but pumps up the vocabulary from 50K to 250K tokens and trains on more data. Yet ModernBERT still edges it out, suggesting the architectural shift is making a real difference in efficiency.

Vocabulary Size Matters

First, let's understand how vocabulary parameters are counted in transformers. For any transformer, vocabulary parameters = number of distinct tokens × hidden size. Take jina-XLM-RoBERTa: with 250K tokens and 1,024 dimensions, it needs 256M parameters just for vocabulary encoding - before handling any actual language tasks!

In transformers, the first layer maps tokens to hidden states using a weight matrix, namely the vocabulary weights. Consider using all UTF-8 code points (1,112,064) with 1,024 hidden dimensions: you'd need a massive 1,112,064 × 1,024 ≈ 1.1B parameters just for token conversion. While larger LLMs (100B+ parameters) can absorb this overhead, it's a serious constraint for smaller models. This is exactly why we use tokenizers like BPE, which efficiently merge frequently occurring character sequences into single tokens.
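The arithmetic is easy to check. A quick sketch (the UTF-8 figure is purely hypothetical):

```python
hidden = 1024  # hidden size shared by all three models discussed here

print(250_002 * hidden)    # jina-XLM-RoBERTa vocabulary: ~256M parameters
print(50_368 * hidden)     # ModernBERT-large vocabulary: ~51.6M parameters
print(1_112_064 * hidden)  # one token per UTF-8 code point: ~1.14B parameters
```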

But here is the thing: vocabulary weights don't contribute to attention mechanisms - they're just lookup tables. For SLMs working with fixed parameter budgets, a larger vocabulary means fewer parameters available for attention layers, which do the actual language processing. This explains why English-only ModernBERT-large outperforms multilingual jina-XLM-RoBERTa despite being smaller - jina-XLM-RoBERTa allocates more parameters (47%!) to support multiple languages. ModernBERT's focused vocabulary not only improves performance but also speeds up inference, making it particularly effective for resource-constrained applications.

Looking at just the core model parameters (excluding vocabulary weights), ModernBERT actually packs more computational power than its peers: it dedicates 19% more parameters to actual language modeling than jina-XLM-RoBERTa and 15% more than RoBERTa-large!

| Model Specs | ModernBERT-large | jina-XLM-RoBERTa | RoBERTa-large |
|---|---|---|---|
| Language Support | English Only | 89 Languages | English Only |
| Vocab Size | 50.4K | 250K | 50.3K |
| Total Params | 400M | 550M | 355M |
| Vocab Params | 51M | 256M | 51M |
| Vocab Param Ratio | 13% | 47% | 14% |
| Core Model Params | 349M | 294M | 304M |
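The split in this table follows directly from the vocabulary sizes above. Here's a small sketch of the bookkeeping; it uses the rounded total parameter counts, so the results land within a million or so of the table's figures:

```python
hidden = 1024
totals = {"ModernBERT-large": 400e6, "jina-XLM-RoBERTa": 550e6, "RoBERTa-large": 355e6}
vocab_sizes = {"ModernBERT-large": 50_368, "jina-XLM-RoBERTa": 250_002, "RoBERTa-large": 50_265}

core = {}
for model in totals:
    vocab_params = vocab_sizes[model] * hidden          # embedding table size
    core[model] = totals[model] - vocab_params          # what's left for attention + FFN
    print(f"{model}: vocab {vocab_params / totals[model]:.0%} of total, "
          f"core ~{core[model] / 1e6:.0f}M")

# ModernBERT-large's core-parameter edge over the other two models
print(f"{core['ModernBERT-large'] / core['jina-XLM-RoBERTa'] - 1:.0%}")  # ~19%
print(f"{core['ModernBERT-large'] / core['RoBERTa-large'] - 1:.0%}")     # ~15%
```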

Model Upscaling by "Weight Tiling"

In building the jina-BERT-v2 backbone, we found that training SLMs from scratch was resource-intensive and complex. ModernBERT tackles this with a smart initialization approach called weight tiling - essentially bootstrapping ModernBERT-large from the weights of its smaller base version.

This technique isn't entirely new - it builds on DeepMind's work with Gopher and shows up in Microsoft's Phi-2 models too. But its application here is particularly effective for addressing the SLM training bottleneck.

Diagram comparing weight initialization in small vs. large models with labeled sections and steps for layer initialization.
ModernBERT scales from 22 to 28 layers using the Gopher team's depth initialization strategy. For those extra layers (23-28), they initialize each one using weights from the original 22 layers of ModernBERT-base. For each layer's weight matrices, they use Phi-2's center tiling approach. It works like this: they take the ModernBERT-base weights and drop them right in the middle of ModernBERT-large's matrices. For the edges that are still empty? They wrap the original weights around cyclically to fill them in.
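Here's a minimal NumPy sketch of that center-tiling-with-wraparound idea. It illustrates the concept only - the function name and shapes are ours, not ModernBERT's actual initialization code:

```python
import numpy as np

def center_tile(small: np.ndarray, large_shape: tuple[int, int]) -> np.ndarray:
    """Place a small weight matrix at the center of a larger one,
    wrapping the small matrix cyclically to fill the remaining edges."""
    big_r, big_c = large_shape
    small_r, small_c = small.shape
    # Offsets that center `small` inside the large matrix.
    off_r = (big_r - small_r) // 2
    off_c = (big_c - small_c) // 2
    # Index every large-matrix cell back into `small` modulo its size, so the
    # center block is an exact copy and the border wraps around cyclically.
    rows = (np.arange(big_r) - off_r) % small_r
    cols = (np.arange(big_c) - off_c) % small_c
    return small[np.ix_(rows, cols)]

base_weight = np.arange(9, dtype=np.float32).reshape(3, 3)
print(center_tile(base_weight, (5, 5)))
```

The center of the large matrix is an exact copy of the base weights, and the leftover border cells reuse those same weights cyclically instead of being filled with random noise.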

This initialization strategy gives ModernBERT-large a significant advantage - instead of starting cold from scratch, it leverages pre-learned patterns from its smaller counterpart. It's proven particularly effective for scaling up language models in this size range.

We find that a warm started model rapidly recovers from a high initial loss (due to the added parameters) to a loss quite close to that of the base model. We are able to expand 417M parameters by over 3× in size and maintain performance greater than an equivalent fresh model trained from scratch to convergence, implying that the gains were not limited to the start of training. However, at larger sizes, the relative gains achieved at convergence diminish, especially with expansions in width.

The cyclical weight wrapping isn't just a convenience - it aligns well with how attention matrices naturally exhibit periodic patterns. Gopher's research shows this approach really shines for SLMs (sub-9B parameters), though the benefits start tapering off as you move into larger model territories.

ModernBERT's Code Modeling

ModernBERT optimizes for code understanding with its code-friendly tokenizer and training data. This pays off specifically in retrieval tasks. We ran a benchmark on CodeSearchNet - matching text descriptions to code snippets. ModernBERT outperformed both alternatives across the board.

The gap makes sense - neither jina-XLM-RoBERTa nor RoBERTa-large saw programming languages during training. Meanwhile, ModernBERT-large trained on two trillion tokens, including a substantial amount of code. This exposure to programming syntax and patterns gives it a clear advantage in code-related tasks. jina-XLM-RoBERTa edges out RoBERTa-large, likely due to its larger multilingual training data - same architecture, more exposure. Nonetheless, both trail ModernBERT-large significantly.

| Task | ModernBERT-large | jina-XLM-RoBERTa | RoBERTa-large |
|---|---|---|---|
| AdvRetrieval | 0.342 | 0.363 | 0.331 |
| QueryRetrieval.python | 0.521 | 0.530 | 0.525 |
| QueryRetrieval.java | 0.679 | 0.633 | 0.644 |
| QueryRetrieval.javascript | 0.755 | 0.768 | 0.732 |
| QueryRetrieval.php | 0.815 | 0.781 | 0.755 |
| QueryRetrieval.ruby | 0.729 | 0.744 | 0.722 |
| QueryRetrieval.go | 0.833 | 0.809 | 0.796 |
| Retrieval.go | 0.778 | 0.750 | 0.759 |
| Retrieval.java | 0.840 | 0.792 | 0.796 |
| Retrieval.javascript | 0.817 | 0.792 | 0.757 |
| Retrieval.php | 0.852 | 0.805 | 0.796 |
| Retrieval.python | 0.849 | 0.816 | 0.787 |
| Retrieval.ruby | 0.849 | 0.796 | 0.803 |
| Avg. | 0.743 | 0.721 | 0.708 |

The Tokenizer Edge

Let's dig into why ModernBERT handles code so well: it uses the OLMo tokenizer, which was trained on code, rather than a standard BERT/RoBERTa tokenizer.

A tokenizer breaks UTF-8 text into tokens that get mapped to vectors - these are what the model actually processes. During training, it learns to combine frequently occurring character sequences into single tokens. The difference? A standard tokenizer might break init into in + it, missing the programming context, while ModernBERT's code-aware tokenizer keeps init intact as a single token.

Here's where it gets interesting with space handling: ModernBERT preserves Python's leading spaces as single tokens and differentiates between 4 vs 8 spaces - crucial for code structure. Meanwhile, jina-XLM-RoBERTa collapses all continuous spaces into a single _, and RoBERTa-large treats each space as its own token. This means ModernBERT's encoder gets cleaner, more meaningful input when processing code, while the others are working with fractured, less coherent tokens.

ModernBERT's tokenizer preserves Python's leading spaces as single tokens and differentiates between 4 and 8 spaces - crucial for code structure - while the other tokenizers produce fractured, less coherent tokens.
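You can check this yourself with the Hugging Face tokenizers. A quick sketch - the Hub IDs are the public checkpoints, we use the stock XLM-RoBERTa and RoBERTa tokenizers as stand-ins for the jina-XLM-RoBERTa and RoBERTa-large vocabularies, and the exact token strings will vary by tokenizer version:

```python
from transformers import AutoTokenizer

code = "def init():\n        return 1"  # body indented with 8 spaces

for name in [
    "answerdotai/ModernBERT-large",  # code-aware, OLMo-derived tokenizer
    "FacebookAI/xlm-roberta-large",  # same vocabulary family as jina-XLM-RoBERTa
    "FacebookAI/roberta-large",
]:
    tokens = AutoTokenizer.from_pretrained(name).tokenize(code)
    print(f"{name}: {tokens}")
```

The interesting part of the output is the whitespace: ModernBERT encodes the indentation run compactly, while the other two collapse or fragment it.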

ModernBERT's Long-Context Handling

ModernBERT has made significant strides in processing long text, thanks to its extensive training corpus (300B tokens with 8,192-token samples) and advanced techniques like combined global and local attention.

To evaluate long-document handling capabilities, we used the MLDR dataset - a comprehensive long-text benchmark spanning 13 languages. Since ModernBERT currently supports only English, we focused on MLDR's English subset to benchmark ModernBERT against jina-XLM-RoBERTa. While both these models can handle 8K-token inputs, RoBERTa-large was excluded from this benchmark due to its 512-token limit, which is insufficient for long-text analysis.

| | ModernBERT-large | jina-XLM-RoBERTa |
|---|---|---|
| MLDR-en | 0.351 | 0.290 |

ModernBERT's superior performance isn't just due to its extensive long-text training - it's largely thanks to its innovative combination of global and local attention mechanisms. Unlike jina-XLM-RoBERTa, which applies computationally expensive global attention at every layer, ModernBERT takes a more efficient approach: it alternates between global attention (every third layer, with a RoPE theta of 160,000) and local attention (a 128-token sliding window with a RoPE theta of 10,000). This hybrid strategy maintains high performance while dramatically reducing training time.

In ModernBERT, every third layer employs global attention with a RoPE theta of 160,000 and the remaining layers use a 128 token, local sliding window attention with a RoPE theta of 10,000. —— ModernBERT
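Here's a small sketch of what that layer pattern looks like in practice. Which layer indices count as "every third" and the exact mask convention are our assumptions, not ModernBERT's actual code:

```python
import torch

NUM_LAYERS, WINDOW = 28, 128

def layer_attention_config(layer_idx: int) -> dict:
    """Every third layer gets full global attention; the rest use a
    128-token local sliding window, each with its own RoPE theta."""
    if layer_idx % 3 == 0:
        return {"kind": "global", "rope_theta": 160_000}
    return {"kind": "local", "window": WINDOW, "rope_theta": 10_000}

def sliding_window_mask(seq_len: int, window: int) -> torch.Tensor:
    """Boolean mask where True means token i may attend to token j."""
    pos = torch.arange(seq_len)
    return (pos[None, :] - pos[:, None]).abs() <= window // 2

print([layer_attention_config(i)["kind"] for i in range(6)])
# ['global', 'local', 'local', 'global', 'local', 'local']

mask = sliding_window_mask(1024, WINDOW)
print(mask.float().mean())  # fraction of token pairs attended under local attention
```

Only a small fraction of the full quadratic attention matrix is computed in the local layers, which is where most of the long-context savings come from.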

The Bitter Lesson?

The scaling law and the bitter lesson suggest that major performance improvements come primarily from increasing parameter counts and training data. This principle guided our approach of expanding the corpus and using LoRA for task-specific adaptations.

However, ModernBERT's success has revealed that we undervalued the power of architectural optimization. It demonstrates that SLMs can achieve exceptional results through better data-model efficiency, without necessarily scaling up parameters. A recent Stella Embeddings technical report reinforces this finding, indicating that current embedding model training methods can be improved without increasing corpus or model size.

Scaling law of embedding models: average MTEB performance on English tasks plotted against the number of model parameters. Each dot represents an embedding model; the trendline covers all models, with multilingual models highlighted in cyan. jina-embeddings-v3 demonstrates superior performance compared to models of similar size, showing a superlinear improvement over its predecessor, jina-embeddings-v2. The chart was built from the top-100 embedding models on the MTEB leaderboard, excluding those without size information (typically closed-source or proprietary models) as well as obvious trolling submissions.

Moving forward, we anticipate lower computational costs and smaller model sizes as we gain deeper insights into data utilization and adopt ModernBERT's techniques. In the short term, we can implement straightforward improvements outlined in the ModernBERT paper - specifically, integrating more code-related data and adopting a code-friendly tokenizer. More complex changes, like switching to a deep-and-thin architecture or bootstrapping large models from smaller ones, will require building backbone models from scratch - a mid-term initiative.

While ModernBERT's efficiency is remarkable, its text-only limitation points to future challenges. As multimodal embedding models gain popularity, our next challenge is developing smarter, faster, and more capable search foundation models that can handle multimodal inputs. These applications demand even longer context windows - an efficiency challenge that remains to be solved.

Conclusion

Throughout this post, we've explored how ModernBERT advances BERT-family models through three key innovations: its deep-and-thin architecture, optimized tokenizer, and efficient scaling using weight tiling. These improvements enable ModernBERT to deliver outstanding performance in a relatively compact size, surpassing both RoBERTa-large and jina-XLM-RoBERTa across various tasks. ModernBERT demonstrates that architectural improvements can matter more than parameter size, opening doors for more efficient models. Its successful use of weight tiling shows how progressive scaling can reduce training costs while preserving or even boosting performance. Additionally, its compact vocabulary and targeted optimizations suggest growing opportunities for specialized SLMs in resource-limited settings.