BOND-reranker
A cross-encoder reranker model fine-tuned for biomedical ontology entity normalization, designed to work with the BOND (Biomedical Ontology Neural Disambiguation) system.
Model Description
This model is a cross-encoder reranker trained to improve the accuracy of entity normalization by re-ranking candidate ontology terms retrieved by BOND's initial retrieval stage. It takes a query-candidate pair and outputs a relevance score.
Training Framework: Sentence Transformers with cross-encoder architecture
Model Architecture
- Type: Cross-Encoder
- Framework: Sentence Transformers
- Max Sequence Length: 512 tokens
- Output: Single relevance score per query-candidate pair
- Parameters: ~110M (based on BiomedBERT-base)
Training Data
The model was trained on biomedical entity normalization data covering multiple ontologies including:
- MONDO (diseases)
- HPO (phenotypes)
- UBERON (anatomy)
- Cell Ontology (CL)
- Gene Ontology (GO)
- And other biomedical ontologies
Training data consists of query-candidate pairs with relevance labels, where queries are biomedical entity mentions and candidates are ontology terms.
Usage
With BOND Pipeline
from bond.config import BondSettings
from bond.pipeline import BondMatcher
# Configure BOND to use this reranker
settings = BondSettings(
    "model_path",  # Replace with your model path
    enable_reranker=True
)
matcher = BondMatcher(settings=settings)
Direct Usage
import torch
from sentence_transformers import CrossEncoder
# Load model from local path
model = CrossEncoder(
    "model_path",  # Replace with your model path
    device='cuda' if torch.cuda.is_available() else 'cpu'
)
# Example: Rank candidates for a query
query = "cell_type: C_BEST4; tissue: descending colon; organism: Homo sapiens"
candidates = [
    "label: smooth muscle fiber of descending colon; synonyms: non-striated muscle fiber of descending colon",
    "label: smooth muscle cell of colon; synonyms: non-striated muscle fiber of colon",
    "label: epithelial cell of colon; synonyms: colon epithelial cell"
]
# Get ranked results with probabilities
ranked_results = model.rank(query, candidates, return_documents=True, top_k=3)
print("Top 3 ranked results")
for result in ranked_results:
    # Assumes the score is a raw logit; if the model's configured activation
    # is already a sigmoid, use result['score'] directly.
    prob = torch.sigmoid(torch.tensor(result['score'])).item()
    print(f"{prob:.8f} - {result['text']}")
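The query and candidate strings in the example above follow a "field: value; field: value" layout. A minimal sketch of a helper that builds such strings is shown here; `format_fields` is a hypothetical name, and the field names are taken from the example rather than from a documented schema.

```python
def format_fields(fields: dict[str, str]) -> str:
    """Join field/value pairs as 'name: value; name: value' in insertion order."""
    return "; ".join(f"{name}: {value}" for name, value in fields.items())


# Field names below mirror the example strings; they are not a documented schema.
query = format_fields(
    {"cell_type": "C_BEST4", "tissue": "descending colon", "organism": "Homo sapiens"}
)
candidate = format_fields(
    {"label": "epithelial cell of colon", "synonyms": "colon epithelial cell"}
)
print(query)
print(candidate)
```

Building the strings from structured fields keeps the input layout consistent with the example, which may matter if the model was trained on that layout.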
Performance
This reranker is designed to work as the final stage in the BOND pipeline:
- Retrieval: Exact + BM25 + Dense retrieval with LLM expansion
- Reranking: This cross-encoder model scores and re-ranks top candidates
- Output: Final ranked list of ontology terms
The reranker is intended to improve precision by re-scoring the top-k candidates (typically k=100) returned by the initial retrieval stage.
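The re-scoring step can be sketched independently of BOND's internals. In the sketch below, `rerank_top_k` and `token_overlap` are hypothetical names; the token-overlap scorer is only a stand-in so the example runs without model weights, and it is not how the cross-encoder scores pairs.

```python
from typing import Callable, Sequence


def rerank_top_k(
    query: str,
    retrieved: Sequence[str],
    score_pair: Callable[[str, str], float],
    k: int = 100,
) -> list[tuple[str, float]]:
    """Re-score the first k retrieved candidates and sort them by descending score."""
    head = list(retrieved[:k])
    scored = [(candidate, score_pair(query, candidate)) for candidate in head]
    return sorted(scored, key=lambda item: item[1], reverse=True)


def token_overlap(query: str, candidate: str) -> float:
    """Stand-in scorer: fraction of query tokens that also appear in the candidate."""
    query_tokens = set(query.lower().split())
    candidate_tokens = set(candidate.lower().split())
    return len(query_tokens & candidate_tokens) / max(len(query_tokens), 1)


retrieved = ["smooth muscle cell of colon", "epithelial cell of colon", "colon"]
reranked = rerank_top_k("epithelial cell colon", retrieved, token_overlap, k=2)
for candidate, score in reranked:
    print(f"{score:.2f} - {candidate}")
```

With the real model, `score_pair` could wrap a call such as `model.predict([(query, candidate)])[0]` from Sentence Transformers; scoring all pairs in one batched call would be the more efficient choice.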
Evaluation Metrics
Evaluated on a biomedical entity normalization development set:
| Metric | Score |
|---|---|
| Accuracy | 97.50% |
| F1 Score | 82.37% |
| Precision | 79.58% |
| Recall | 85.36% |
| Average Precision | 88.67% |
| Eval Loss | 0.230 |
Best Model: Checkpoint at step 69,500 (epoch 2.28) with best metric score of 0.9734
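As a quick consistency check on the table, the reported F1 score can be recomputed from the reported precision and recall, assuming the standard definition of F1 as their harmonic mean:

```python
# Values copied from the evaluation table, expressed as fractions.
precision = 0.7958
recall = 0.8536

# F1 is the harmonic mean of precision and recall.
f1 = 2 * precision * recall / (precision + recall)
print(f"{f1:.2%}")  # prints 82.37%
```

The result agrees with the 82.37% listed in the table.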
Model Files
- config.json - Model configuration
- model.safetensors - Model weights in SafeTensors format
- tokenizer.json - Fast tokenizer
- vocab.txt - Vocabulary file
- special_tokens_map.json - Special tokens mapping
- tokenizer_config.json - Tokenizer configuration
License
Apache 2.0