---
language:
  - en
library_name: colbert
pipeline_tag: sentence-similarity
tags:
  - information-retrieval
  - retrieval
  - late-interaction
  - ColBERT
license: mit
base_model: colbert-ir/colbertv1.9
---

# Colbert-Finetuned

ColBERT (Contextualized Late Interaction over BERT) is a retrieval model that scores queries against passages using fine-grained, token-level interactions (“late interaction”). This repo hosts a fine-tuned ColBERT checkpoint for neural information retrieval.

- Base model: colbert-ir/colbertv1.9
- Library: colbert (with Hugging Face backbones)
- Intended use: passage/document retrieval in RAG and search systems

ℹ️ ColBERT encodes queries and passages into token-level embedding matrices and uses MaxSim to compute relevance at search time. It typically outperforms single-vector embedding retrievers while remaining scalable.
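
To make the scoring concrete, below is a minimal NumPy sketch of MaxSim over token embedding matrices. The shapes and the `maxsim_score` helper are illustrative assumptions; the `colbert` library computes this for you at index and search time.

```python
import numpy as np

def maxsim_score(Q: np.ndarray, D: np.ndarray) -> float:
    """Late-interaction score for one query/passage pair.

    Q: (num_query_tokens, dim) query token embeddings, L2-normalized.
    D: (num_doc_tokens, dim) passage token embeddings, L2-normalized.
    """
    sim = Q @ D.T                      # (num_query_tokens, num_doc_tokens) cosine similarities
    return float(sim.max(axis=1).sum())  # best-matching passage token per query token, summed

# Toy example with random unit vectors (hypothetical shapes).
rng = np.random.default_rng(0)
Q = rng.normal(size=(8, 128));   Q /= np.linalg.norm(Q, axis=1, keepdims=True)
D = rng.normal(size=(120, 128)); D /= np.linalg.norm(D, axis=1, keepdims=True)
print(maxsim_score(Q, D))
```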


## ✨ What’s in this checkpoint


## 🔧 Quickstart

### Option A — Use with the ColBERT library (recommended)

```python
from colbert.infra import Run, RunConfig, ColBERTConfig
from colbert import Indexer, Searcher
from colbert.data import Queries

# 1) Index your collection (one "pid \t passage text" per line)
with Run().context(RunConfig(nranks=1, experiment="my-exp")):
    cfg = ColBERTConfig(root="/path/to/experiments")
    indexer = Indexer(checkpoint="dutta18/Colbert-Finetuned", config=cfg)
    indexer.index(
        name="my.index",
        collection="/path/to/collection.tsv"  # "pid \t passage text"
    )

# 2) Search with a queries file (one "qid \t query text" per line)
with Run().context(RunConfig(nranks=1, experiment="my-exp")):
    cfg = ColBERTConfig(root="/path/to/experiments")
    searcher = Searcher(index="my.index", config=cfg)
    queries = Queries("/path/to/queries.tsv")  # "qid \t query text"
    ranking = searcher.search_all(queries, k=20)
    ranking.save("my.index.top20.tsv")
```
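
For ad-hoc lookups without a queries file, the upstream ColBERT library also exposes a single-query `Searcher.search` call. The sketch below assumes that API (query text in, parallel lists of passage ids, ranks, and scores out); verify it against the version of `colbert` you have installed.

```python
from colbert.infra import Run, RunConfig, ColBERTConfig
from colbert import Searcher

with Run().context(RunConfig(nranks=1, experiment="my-exp")):
    cfg = ColBERTConfig(root="/path/to/experiments")
    searcher = Searcher(index="my.index", config=cfg)

    # Returns parallel lists: passage ids, ranks (1-based), and MaxSim scores.
    pids, ranks, scores = searcher.search("what is late interaction?", k=5)
    for pid, rank, score in zip(pids, ranks, scores):
        print(f"rank={rank}  pid={pid}  score={score:.2f}")
```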