---
language:
- en
library_name: colbert
pipeline_tag: sentence-similarity
tags:
- information-retrieval
- retrieval
- late-interaction
- ColBERT
license: mit
base_model: colbert-ir/colbertv1.9
---

# Colbert-Finetuned

**ColBERT** (Contextualized Late Interaction over BERT) is a retrieval model that scores queries vs. passages using fine-grained token-level interactions (“late interaction”). This repo hosts a **fine-tuned ColBERT checkpoint** for neural information retrieval.

- **Base model:** `colbert-ir/colbertv1.9`  
- **Library:** [`colbert`](https://github.com/stanford-futuredata/ColBERT) (with Hugging Face backbones)  
- **Intended use:** passage/document retrieval in RAG and search systems

> ℹ️ ColBERT encodes queries and passages into token-level embedding matrices and uses `MaxSim` to compute relevance at search time. It typically outperforms single-vector embedding retrievers while remaining scalable.
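
For intuition, here is a minimal MaxSim sketch in PyTorch (illustrative shapes only; `dim=128` matches ColBERT's default projection size, but this is not the library's internal code):

```python
import torch
import torch.nn.functional as F

def maxsim_score(Q: torch.Tensor, D: torch.Tensor) -> torch.Tensor:
    """Late-interaction relevance: Q is (query_tokens, dim), D is (doc_tokens, dim)."""
    sim = Q @ D.T                        # cosine similarities (rows are L2-normalized)
    return sim.max(dim=1).values.sum()   # best-matching doc token per query token, summed

# Toy example with random, normalized token embeddings.
Q = F.normalize(torch.randn(32, 128), dim=-1)
D = F.normalize(torch.randn(180, 128), dim=-1)
print(maxsim_score(Q, D))
```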

---

## ✨ What’s in this checkpoint

- Fine-tuned ColBERT weights starting from `colbert-ir/colbertv1.9`.
- Trained on **JSONL triples** (`[qid, pid+, pid-]`) together with **TSV** files `queries.tsv` and `collection.tsv` (one ID and text per line); a format and training sketch follows this list.
- Default training hyperparameters are listed below (batch size, lr, doc_maxlen, dim, etc.).
- This checkpoint and the associated contrastive training data are part of the work [`NLKI: A lightweight Natural Language Knowledge Integration Framework for Improving Small VLMs in Commonsense VQA Tasks`](https://arxiv.org/pdf/2508.19724).
- All copyrights for the training data are retained by their original owners; we do not claim ownership.
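
As a rough sketch of the training setup with the `colbert` library's `Trainer` (the paths and hyperparameter values here are illustrative placeholders, not the exact settings used for this checkpoint):

```python
from colbert.infra import Run, RunConfig, ColBERTConfig
from colbert import Trainer

# Expected file formats:
#   queries.tsv:    "qid \t query text"
#   collection.tsv: "pid \t passage text"
#   triples.jsonl:  one JSON array per line, e.g. [0, 7, 12] -> [qid, pid+, pid-]
with Run().context(RunConfig(nranks=1, experiment="my-exp")):
    config = ColBERTConfig(
        bsize=32,          # illustrative values; see the hyperparameter list below
        lr=1e-5,
        doc_maxlen=180,
        dim=128,
        root="/path/to/experiments",
    )
    trainer = Trainer(
        triples="/path/to/triples.jsonl",
        queries="/path/to/queries.tsv",
        collection="/path/to/collection.tsv",
        config=config,
    )
    checkpoint_path = trainer.train(checkpoint="colbert-ir/colbertv1.9")
```
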
---

## 🔧 Quickstart

### Option A — Use with the ColBERT library (recommended)

```python
from colbert.infra import Run, RunConfig, ColBERTConfig
from colbert import Indexer, Searcher
from colbert.data import Queries

# 1) Index your collection (pid \t passage)
with Run().context(RunConfig(nranks=1, experiment="my-exp")):
    cfg = ColBERTConfig(root="/path/to/experiments")
    indexer = Indexer(checkpoint="dutta18/Colbert-Finetuned", config=cfg)
    indexer.index(
        name="my.index",
        collection="/path/to/collection.tsv"  # "pid \t passage text"
    )

# 2) Search with queries (qid \t query)
with Run().context(RunConfig(nranks=1, experiment="my-exp")):
    cfg = ColBERTConfig(root="/path/to/experiments")
    searcher = Searcher(index="my.index", config=cfg)
    queries = Queries("/path/to/queries.tsv")  # "qid \t query text"
    ranking = searcher.search_all(queries, k=20)
    ranking.save("my.index.top20.tsv")
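    # The saved ranking is tab-separated; each row is expected to contain
    # qid, pid, rank, and score (verify against your ColBERT version's output).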