Model Card for PULSAR-pbmc
PULSAR (Patient Understanding Leveraging Single-cell universAl Representation) is a multi-scale, multi-cellular foundation model for human peripheral blood mononuclear cells (PBMCs). It transforms a set of single-cell transcriptomes into an interpretable donor-level embedding that preserves single-cell resolution while capturing multicellular composition and coordination.
This repository hosts the aligned PBMC model (PULSAR-aligned), which produces donor embeddings aligned for disease classification. A base model is also available (see Model Sources).
Model Details
Model Description
PULSAR is a hierarchical, multi-scale foundation model for PBMC scRNA-seq that converts an unordered set of single cells into a 512-d donor embedding while preserving single-cell resolution. It integrates three levels: molecular priors from ESM2 protein embeddings, cellular representations from Universal Cell Embeddings (UCE, 1,280-d), and a Multicellular Transformer encoder–decoder trained with a Masked Cell Modeling objective at a high masking ratio. Pretraining proceeds in two stages: first on a pan-tissue CELLxGENE corpus (≈36.2M cells; 6,807 samples), then continual pretraining on blood (≈8.74M cells; 2,588 samples).
The resulting donor embeddings support zero-shot and lightweight-head downstream tasks. These include large-scale reference mapping for disease classification (state-of-the-art accuracy with strong external generalization), regression of plasma proteomics from transcriptomes, forecasting of future outcomes (e.g., RA conversion in ACPA+ individuals and influenza vaccine responsiveness), and individualized cytokine perturbation modeling at the donor, cellular, and gene levels.
A “virtual instrument” conditions on cytokine protein embeddings to transform baseline donor states and, together with the decoder and an optional UCE→expression head, generates perturbed cell distributions and gene programs. Attention over cells provides mechanistic interpretability: it highlights disease- and severity-relevant cell subsets and is enriched for antigen-specific clonotypes in viral infection.
PULSAR thus operationalizes the AI Virtual Cell vision by linking molecular, cellular, and multicellular organization into a unified, transferable representation for precision immunology.
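To make the Masked Cell Modeling idea concrete, here is a minimal sketch of the masking step only: a large fraction of the cells in a donor's set is hidden, and a model would then be trained to reconstruct the hidden cells from the visible ones. The function name `mask_cells`, the 0.9 masking ratio, and the zero-vector placeholder are illustrative assumptions, not values or code from the PULSAR repository.

```python
import numpy as np


def mask_cells(cell_embeddings, mask_ratio=0.9, seed=0):
    """Hide a random subset of cells; return the masked set and a boolean mask.

    cell_embeddings: array of shape (n_cells, dim), e.g. 1,280-d UCE vectors.
    mask_ratio: fraction of cells to hide (0.9 is an assumed, illustrative value).
    """
    rng = np.random.default_rng(seed)
    n_cells = cell_embeddings.shape[0]
    n_masked = int(round(mask_ratio * n_cells))

    # Choose which cells to hide, without replacement.
    masked_idx = rng.choice(n_cells, size=n_masked, replace=False)
    is_masked = np.zeros(n_cells, dtype=bool)
    is_masked[masked_idx] = True

    # Zero vectors stand in for a learned mask token (assumption).
    masked = cell_embeddings.copy()
    masked[is_masked] = 0.0
    return masked, is_masked


# Toy donor: 1,000 cells with random 1,280-d embeddings (not real data).
cells = np.random.default_rng(1).normal(size=(1000, 1280))
masked, is_masked = mask_cells(cells, mask_ratio=0.9)
```

A training loop would compare the model's predictions at the `is_masked` positions against the original rows of `cells`. The loss and architecture are outside the scope of this sketch.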
- Developed by: Kuan Pang (Stanford University, kuanpang@stanford.edu)
- Model type: Transformer
- License: MIT
Model Sources
- Repository: https://github.com/snap-stanford/PULSAR
- Paper: https://www.biorxiv.org/content/10.1101/2025.11.24.685470v1
- Base model: https://huggingface.co/KuanP/PULSAR-pbmc
Uses
Direct Use
- Generate 512-d donor embeddings from PBMC scRNA-seq.
- Use the embeddings for reference mapping/retrieval (kNN) of disease phenotypes.
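The kNN reference mapping mentioned above could look roughly like the following sketch. It assumes donor embeddings have already been computed and are available as NumPy arrays. The function `knn_predict`, the cosine-similarity metric, the choice of `k=5`, and the toy data are all illustrative assumptions rather than the authors' pipeline.

```python
from collections import Counter

import numpy as np


def knn_predict(reference, labels, query, k=5):
    """Label each query donor by majority vote among its k nearest reference donors."""
    # L2-normalise rows so that a dot product equals cosine similarity.
    ref = reference / np.linalg.norm(reference, axis=1, keepdims=True)
    qry = query / np.linalg.norm(query, axis=1, keepdims=True)
    sims = qry @ ref.T  # shape: (n_query, n_reference)

    # Indices of the k most similar reference donors for each query donor.
    top_k = np.argsort(-sims, axis=1)[:, :k]
    return [Counter(labels[i] for i in row).most_common(1)[0][0] for row in top_k]


# Toy data standing in for real 512-d donor embeddings (hypothetical values).
rng = np.random.default_rng(0)
centers = {"healthy": rng.normal(size=512), "disease": rng.normal(size=512)}
reference = np.stack([
    centers[name] + 0.1 * rng.normal(size=512)
    for name in ("healthy", "disease")
    for _ in range(10)
])
labels = ["healthy"] * 10 + ["disease"] * 10
query = np.stack([
    centers["disease"] + 0.1 * rng.normal(size=512),
    centers["healthy"] + 0.1 * rng.normal(size=512),
])

predictions = knn_predict(reference, labels, query, k=5)
```

For large reference atlases, an approximate nearest-neighbor index would likely replace the dense similarity matrix. The brute-force version is kept here for clarity.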
Out-of-Scope Use
The model may not work well for tissue types other than PBMC. Cell-sorted samples are also out of scope.
How to Get Started with the Model
Use the code below to get started with the model.
Training Details
Training Data
Stage-1 pretraining corpus: CZ CELLxGENE Census (LTS 2023-07-25), 36.2M cells, 6,807 samples across 53 tissues and 69 conditions.
Stage-2 continual pretraining (blood focus): 8.736M cells, 2,588 blood/PBMC samples (balanced sexes; broad ages).
More details can be found in the paper and the GitHub repository.
Citation
BibTeX:
@article{pang2025pulsar,
title={PULSAR: a Foundation Model for Multi-scale and Multicellular Biology},
author={Pang, Kuan and Rosen, Yanay and Kedzierska, Kasia and He, Ziyuan and Rajagopal, Abhe and Gustafson, Claire E and Huynh, Grace and Leskovec, Jure},
journal={bioRxiv},
pages={2025--11},
year={2025},
publisher={Cold Spring Harbor Laboratory}
}