BOND-reranker

A cross-encoder reranker model fine-tuned for biomedical ontology entity normalization, designed to work with the BOND (Biomedical Ontology Neural Disambiguation) system.

Model Description

This model is a cross-encoder reranker trained to improve the accuracy of entity normalization by re-ranking candidate ontology terms retrieved by BOND's initial retrieval stage. It takes a query-candidate pair and outputs a relevance score.

Training Framework: Sentence Transformers with cross-encoder architecture

Model Architecture

  • Type: Cross-Encoder
  • Framework: Sentence Transformers
  • Max Sequence Length: 512 tokens
  • Output: Single relevance score per query-candidate pair
  • Parameters: ~110M (based on BiomedBERT-base)

Training Data

The model was trained on biomedical entity normalization data covering multiple ontologies including:

  • MONDO (diseases)
  • HPO (phenotypes)
  • UBERON (anatomy)
  • Cell Ontology (CL)
  • Gene Ontology (GO)
  • Other biomedical ontologies

Training data consists of query-candidate pairs with relevance labels, where queries are biomedical entity mentions and candidates are ontology terms.
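
As an illustration of this pair format, the sketch below builds labeled pairs from one mention, one correct term, and a few incorrect candidates. The function name `build_training_pairs`, the 1.0/0.0 label encoding, and the example strings are assumptions made for illustration; the source does not describe how the actual training set was constructed.

```python
from typing import List, Tuple

Pair = Tuple[str, str, float]  # (query, candidate, relevance label)


def build_training_pairs(query: str, positive: str, negatives: List[str]) -> List[Pair]:
    """Return one positive pair followed by one pair per negative candidate.

    Hypothetical helper: labels are assumed to be 1.0 (relevant) or 0.0 (not relevant).
    """
    pairs: List[Pair] = [(query, positive, 1.0)]
    pairs.extend((query, negative, 0.0) for negative in negatives)
    return pairs


# Illustrative strings only, not taken from the real training data
pairs = build_training_pairs(
    query="disease: heart attack",
    positive="label: myocardial infarction; synonyms: heart attack",
    negatives=["label: heart failure", "label: cardiac arrest"],
)
for query, candidate, label in pairs:
    print(label, "|", query, "->", candidate)
```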

Usage

With BOND Pipeline

from bond.config import BondSettings
from bond.pipeline import BondMatcher

# Configure BOND to use this reranker
settings = BondSettings(
    "model_path",  # Replace with your model path
    enable_reranker=True
)

matcher = BondMatcher(settings=settings)

Direct Usage

import torch
from sentence_transformers import CrossEncoder

# Load model from local path
model = CrossEncoder(
    "model_path",  # Replace with your model path
    device='cuda' if torch.cuda.is_available() else 'cpu'
)

# Example: Rank candidates for a query
query = "cell_type: C_BEST4; tissue: descending colon; organism: Homo sapiens"
candidates = [
    "label: smooth muscle fiber of descending colon; synonyms: non-striated muscle fiber of descending colon",
    "label: smooth muscle cell of colon; synonyms: non-striated muscle fiber of colon",
    "label: epithelial cell of colon; synonyms: colon epithelial cell"
]

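# Rank the candidates for the query; each result holds 'corpus_id', 'score', and 'text'
ranked_results = model.rank(query, candidates, return_documents=True, top_k=3)

print("Top 3 ranked results")

for result in ranked_results:
    # Map the score to (0, 1) with a sigmoid. Depending on the model config and the
    # Sentence Transformers version, CrossEncoder may already apply a sigmoid for
    # single-score models; if the scores are already in [0, 1], print them directly.
    prob = torch.sigmoid(torch.tensor(result['score'])).item()
    print(f"{prob:.8f} - {result['text']}")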
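
The query and candidate strings in the example above follow a `key: value; key: value` layout. A minimal sketch of helpers that produce that layout is shown below. The names `format_fields` and `format_candidate` are hypothetical and not part of BOND, and joining multiple synonyms with a comma is an assumption; check the BOND source for the exact serialization the model was trained on.

```python
from typing import Dict, List


def format_fields(fields: Dict[str, str]) -> str:
    """Join non-empty fields as 'key: value' segments separated by '; '."""
    return "; ".join(f"{key}: {value}" for key, value in fields.items() if value)


def format_candidate(label: str, synonyms: List[str]) -> str:
    """Render an ontology term as 'label: ...; synonyms: ...' (synonyms omitted if empty).

    Assumption: multiple synonyms are comma-separated.
    """
    return format_fields({"label": label, "synonyms": ", ".join(synonyms)})


query = format_fields({
    "cell_type": "C_BEST4",
    "tissue": "descending colon",
    "organism": "Homo sapiens",
})
candidate = format_candidate("epithelial cell of colon", ["colon epithelial cell"])
print(query)
print(candidate)
```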

Performance

This reranker is designed to work as the final stage in the BOND pipeline:

  1. Retrieval: Exact + BM25 + Dense retrieval with LLM expansion
  2. Reranking: This cross-encoder model scores and re-ranks top candidates
  3. Output: Final ranked list of ontology terms

The reranker improves precision by re-scoring the top-k candidates (typically k=100) returned by the initial retrieval stage.
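
The retrieve-then-rerank flow above can be sketched with a pluggable scoring function. This is not BOND's implementation: `rerank` and `token_overlap` are hypothetical names, and the token-overlap scorer is a toy stand-in used only so the example runs without downloading the model.

```python
from typing import Callable, List, Tuple

Scorer = Callable[[str, str], float]


def rerank(query: str, candidates: List[str], scorer: Scorer, top_n: int = 10) -> List[Tuple[str, float]]:
    """Score every retrieved candidate against the query and keep the best top_n."""
    scored = [(candidate, scorer(query, candidate)) for candidate in candidates]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:top_n]


def token_overlap(query: str, candidate: str) -> float:
    """Toy stand-in scorer (Jaccard overlap of lowercase tokens), NOT the cross-encoder."""
    q, c = set(query.lower().split()), set(candidate.lower().split())
    return len(q & c) / len(q | c) if q | c else 0.0


# Pretend these came back from the retrieval stage
retrieved = ["smooth muscle cell of colon", "epithelial cell of colon", "colon"]
reranked = rerank("epithelial cell of descending colon", retrieved, token_overlap, top_n=2)
for candidate, score in reranked:
    print(f"{score:.3f}  {candidate}")
```

With the real model, the scorer would wrap the cross-encoder instead, for example by calling `model.predict` on `(query, candidate)` pairs. Scoring all pairs in one batched call is usually preferable to one call per candidate.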

Evaluation Metrics

Evaluated on a biomedical entity normalization development set:

Metric              Score
Accuracy            97.50%
F1 Score            82.37%
Precision           79.58%
Recall              85.36%
Average Precision   88.67%
Eval Loss           0.230

Best Model: Checkpoint at step 69,500 (epoch 2.28) with a best metric score of 0.9734
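
As a quick consistency check on the reported metrics, F1 is the harmonic mean of precision and recall. The short sketch below recomputes it from the reported precision and recall; the function name `f1_score` is chosen here for illustration and is not tied to any particular library.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


precision, recall = 0.7958, 0.8536  # reported precision and recall
f1 = f1_score(precision, recall)
print(f"F1 = {f1:.2%}")  # prints "F1 = 82.37%"
```

The recomputed value agrees with the reported F1 score of 82.37%.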

Model Files

  • config.json - Model configuration
  • model.safetensors - Model weights in SafeTensors format
  • tokenizer.json - Fast tokenizer
  • vocab.txt - Vocabulary file
  • special_tokens_map.json - Special tokens mapping
  • tokenizer_config.json - Tokenizer configuration

License

Apache 2.0
