LFM2-ColBERT-350M

LFM2-ColBERT-350M is a late interaction retriever with excellent multilingual performance. It allows you to store documents in one language (for example, a product description in English) and retrieve them in many languages with high accuracy.

  • LFM2-ColBERT-350M offers best-in-class accuracy across different languages.
  • Inference speed is on par with models 2.3 times smaller, thanks to the efficient LFM2 backbone.
  • You can use it as a drop-in replacement in your current RAG pipelines to improve performance.

Find more information about LFM2-ColBERT-350M in our blog post.

🚀 Try our demo: https://huggingface.co/spaces/LiquidAI/LFM2-ColBERT

📄 Model details

Late interaction retrievers like LFM2-ColBERT-350M are particularly interesting because they preserve much of the expressivity of re-rankers while retaining the efficiency of bi-encoders. In practice, they can both retrieve documents at scale (like bi-encoders) and rank them in the same pass (like re-rankers).

[Figure: late interaction architecture]

We recommend using this model for various RAG use cases, such as:

  • E-commerce: Find products across many languages with semantic search at scale.
  • On-device semantic search: Ask questions to your phone in natural language to retrieve files, emails, and notes.
  • Enterprise knowledge assistants: Retrieve internal legal, financial, and technical documents in different languages.

Property            LFM2-ColBERT-350M
Total parameters    353,322,752
Layers              25 (18 conv + 6 attn + 1 dense)
Context length      32,768 tokens
Vocabulary size     65,536
Training precision  BF16
License             LFM Open License v1.0

Document length: 512 tokens

Query length: 32 tokens

Output dimensionality: 128 dimensions per token

Similarity function: MaxSim

Supported languages: English, Arabic, Chinese, French, German, Japanese, Korean, and Spanish.

ColBERT(
  (0): Transformer({'max_seq_length': 511, 'do_lower_case': False}) with Transformer model: Lfm2Model 
  (1): Dense({'in_features': 1024, 'out_features': 128, 'bias': False, 'activation_function': 'torch.nn.modules.linear.Identity'})
)
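
For intuition, the MaxSim similarity listed above compares every query token embedding against every document token embedding: each query token keeps the similarity of its best-matching document token, and these maxima are summed into the final score. A minimal sketch with made-up tensors (the shapes below are arbitrary, not taken from the model):

import torch

# Toy example: 4 query tokens and 9 document tokens, each a 128-dim embedding
query_embeddings = torch.randn(4, 128)
document_embeddings = torch.randn(9, 128)

# Token-level similarity matrix of shape (num_query_tokens, num_doc_tokens)
similarity = query_embeddings @ document_embeddings.T

# MaxSim: best-matching document token per query token, summed over the query
score = similarity.max(dim=1).values.sum()
print(score)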

🏃 How to run

Colab link

First, install the PyLate library:

pip install -U pylate

Retrieval

Use this model with PyLate to index and retrieve documents. The index uses FastPLAID for efficient similarity search.

Indexing documents

Load LFM2-ColBERT-350M and initialize the PLAID index, then encode and index your documents:

from pylate import indexes, models, retrieve

# Step 1: Load the ColBERT model
model = models.ColBERT(
    model_name_or_path="LiquidAI/LFM2-ColBERT-350M",
)
model.tokenizer.pad_token = model.tokenizer.eos_token

# Step 2: Initialize the PLAID index
index = indexes.PLAID(
    index_folder="pylate-index",
    index_name="index",
    override=True,  # This overwrites the existing index if any
)

# Step 3: Encode the documents
documents_ids = ["1", "2", "3"]
documents = ["document 1 text", "document 2 text", "document 3 text"]

documents_embeddings = model.encode(
    documents,
    batch_size=32,
    is_query=False,  # Ensure that it is set to False to indicate that these are documents, not queries
    show_progress_bar=True,
)

# Step 4: Add document embeddings to the index by providing embeddings and corresponding ids
index.add_documents(
    documents_ids=documents_ids,
    documents_embeddings=documents_embeddings,
)

Note that you do not have to recreate the index and encode the documents every time. Once you have created an index and added the documents, you can re-use the index later by loading it:

# To load an index, simply instantiate it with the correct folder/name and without overriding it
index = indexes.PLAID(
    index_folder="pylate-index",
    index_name="index",
)

Retrieving top-k documents for queries

Once the documents are indexed, you can retrieve the top-k most relevant documents for a given set of queries. To do so, initialize the ColBERT retriever with the index you want to search, encode the queries, and retrieve the top-k documents to get the matching ids and relevance scores:

# Step 1: Initialize the ColBERT retriever
retriever = retrieve.ColBERT(index=index)

# Step 2: Encode the queries
queries_embeddings = model.encode(
    ["query for document 3", "query for document 1"],
    batch_size=32,
    is_query=True,  # Ensure that it is set to True to indicate that these are queries
    show_progress_bar=True,
)

# Step 3: Retrieve top-k documents
scores = retriever.retrieve(
    queries_embeddings=queries_embeddings,
    k=10,  # Retrieve the top 10 matches for each query
)
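
Each entry of scores corresponds to one query and holds its top-k matches. A minimal sketch of reading the results (in recent PyLate versions each match is a dict with "id" and "score" fields; verify against your installed release):

# Inspect the results: one ranked list per query
queries = ["query for document 3", "query for document 1"]
for query, matches in zip(queries, scores):
    print(query)
    for match in matches:
        print(f"  document {match['id']}: {match['score']:.2f}")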

Reranking

If you only want to use LFM2-ColBERT-350M to rerank the output of a first-stage retrieval pipeline without building an index, you can simply use the rank.rerank function and pass the queries and documents to rerank:

from pylate import rank, models

queries = [
    "query A",
    "query B",
]

documents = [
    ["document A", "document B"],
    ["document 1", "document C", "document B"],
]

documents_ids = [
    [1, 2],
    [1, 3, 2],
]

model = models.ColBERT(
    model_name_or_path="LiquidAI/LFM2-ColBERT-350M",
)

queries_embeddings = model.encode(
    queries,
    is_query=True,
)

documents_embeddings = model.encode(
    documents,
    is_query=False,
)

reranked_documents = rank.rerank(
    documents_ids=documents_ids,
    queries_embeddings=queries_embeddings,
    documents_embeddings=documents_embeddings,
)
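
As with retrieval, rank.rerank returns one ranked list per query, ordered by late-interaction score. A quick way to inspect the output (again assuming the "id"/"score" fields returned by recent PyLate versions):

for query, ranking in zip(queries, reranked_documents):
    print(query)
    for document in ranking:
        print(f"  document {document['id']}: {document['score']:.2f}")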

📈 Performance

Accuracy

We extended the NanoBEIR benchmark to include Japanese and Korean languages. We open-sourced this dataset on Hugging Face at LiquidAI/nanobeir-multilingual-extended for reproducibility. On this NanoBEIR benchmark, LFM2-ColBERT-350M displays significantly stronger multilingual capabilities (especially in German, Arabic, Korean, and Japanese) while maintaining English performance.

[Figure: NanoBEIR NDCG@10 across languages]

Even more interestingly, LFM2-ColBERT-350M is an excellent cross-lingual retriever. This means that it is capable of retrieving documents based on queries from other languages. This is ideal for client-facing applications, like in e-commerce, where a description might be in English but the query is in another language.

LFM2-ColBERT-350M works especially well for English, French, Spanish, Italian, Portuguese, and German, as shown with these NDCG@10 scores on NanoBEIR:

Doc / Query  AR      DE      EN      ES      FR      IT      JA      KO      PT      AVG
AR           0.490   0.288   0.339   0.303   0.304   0.286   0.357   0.338   0.291   33.30%
DE           0.383   0.563   0.547   0.498   0.502   0.489   0.424   0.368   0.486   47.33%
EN           0.416   0.554   0.661   0.553   0.551   0.522   0.477   0.395   0.535   51.82%
ES           0.412   0.514   0.578   0.563   0.547   0.529   0.436   0.394   0.547   50.21%
FR           0.408   0.527   0.573   0.552   0.564   0.537   0.450   0.388   0.549   50.53%
IT           0.395   0.512   0.554   0.535   0.535   0.543   0.439   0.386   0.529   49.20%
JA           0.375   0.365   0.409   0.358   0.345   0.337   0.557   0.491   0.330   39.63%
KO           0.326   0.274   0.310   0.282   0.265   0.266   0.440   0.527   0.271   32.89%
PT           0.402   0.499   0.558   0.545   0.528   0.529   0.436   0.382   0.547   49.17%
AVG          40.07%  45.51%  50.32%  46.54%  46.00%  44.86%  44.62%  40.78%  45.38%

In comparison, GTE-ModernColBERT-v1 consistently gets lower scores when documents and queries are not in the same language:

Doc / Query  AR      DE      EN      ES      FR      IT      JA      KO      PT      AVG
AR           0.309   0.089   0.107   0.089   0.094   0.092   0.070   0.049   0.087   10.96%
DE           0.039   0.499   0.454   0.362   0.393   0.367   0.133   0.061   0.361   29.65%
EN           0.042   0.408   0.680   0.446   0.484   0.420   0.167   0.073   0.438   35.08%
ES           0.044   0.360   0.485   0.525   0.465   0.437   0.149   0.061   0.487   33.48%
FR           0.044   0.381   0.505   0.455   0.546   0.428   0.136   0.057   0.467   33.35%
IT           0.043   0.369   0.449   0.446   0.451   0.516   0.143   0.054   0.448   32.36%
JA           0.031   0.169   0.250   0.172   0.177   0.169   0.459   0.059   0.165   18.35%
KO           0.030   0.134   0.169   0.127   0.133   0.125   0.090   0.368   0.124   14.45%
PT           0.043   0.368   0.479   0.492   0.467   0.448   0.138   0.062   0.530   33.63%
AVG          6.94%   30.84%  39.75%  34.59%  35.68%  33.35%  16.53%  9.37%   34.24%

This makes retrieval much more reliable and allows a single, unified retriever to replace architectures that combine multiple language-specific models.
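
As a quick illustration of cross-lingual scoring, the rank.rerank API shown earlier can score an English document against a non-English query directly; the strings below are made-up examples:

from pylate import models, rank

model = models.ColBERT(model_name_or_path="LiquidAI/LFM2-ColBERT-350M")

# English product description, queried in German (illustrative strings)
queries = ["kabellose Kopfhörer mit Geräuschunterdrückung"]
documents = [["Wireless noise-cancelling headphones with 30-hour battery life."]]
documents_ids = [[1]]

queries_embeddings = model.encode(queries, is_query=True)
documents_embeddings = model.encode(documents, is_query=False)

scores = rank.rerank(
    documents_ids=documents_ids,
    queries_embeddings=queries_embeddings,
    documents_embeddings=documents_embeddings,
)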

Inference speed

Despite being more than twice as large, LFM2-ColBERT-350M matches the query and document encoding throughput of GTE-ModernColBERT-v1 across various batch sizes.

Query encoding was evaluated using realistic query patterns from datasets like MS MARCO and Natural Questions.

[Figure: query encoding throughput across batch sizes]

Document encoding was measured on realistic documents with varying lengths and domains.

[Figure: document encoding throughput across batch sizes]
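
If you want to sanity-check throughput on your own hardware, a rough timing sketch (the document text, count, and batch size are arbitrary; model is the ColBERT model loaded earlier):

import time

documents = ["A reasonably long product description used for benchmarking."] * 256

start = time.perf_counter()
model.encode(documents, batch_size=32, is_query=False, show_progress_bar=False)
elapsed = time.perf_counter() - start

print(f"{len(documents) / elapsed:.1f} documents/sec")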

📬 Contact

If you are interested in custom solutions with edge deployment, please contact our sales team.

Please cite the PyLate library if you use it for inference or training:

@misc{PyLate,
  title={PyLate: Flexible Training and Retrieval for Late Interaction Models},
  author={Chaffin, Antoine and Sourty, Raphaël},
  url={https://github.com/lightonai/pylate},
  year={2024}
}