Model Card for PULSAR-aligned

PULSAR (Patient Understanding Leveraging Single-cell universAl Representation) is a multi-scale, multi-cellular foundation model for human peripheral blood mononuclear cells (PBMCs). It transforms a set of single-cell transcriptomes into an interpretable donor-level embedding that preserves single-cell resolution while capturing multicellular composition and coordination.

This repo hosts the aligned PBMC model (PULSAR-aligned), which produces donor embeddings aligned for disease classification. A base model is also available (see Model Sources).

Model Details

Model Description

PULSAR is a hierarchical, multi-scale foundation model for PBMC scRNA-seq that converts an unordered set of single cells into a 512-d donor embedding while preserving single-cell resolution. It integrates three levels of representation: molecular priors from ESM2 protein embeddings, cellular representations from Universal Cell Embeddings (UCE, 1,280-d), and a Multicellular Transformer encoder–decoder trained with a high-masking Masked Cell Modeling objective.

Pretraining proceeds in two stages: first on a pan-tissue CELLxGENE corpus (≈36.2M cells; 6,807 samples), then continual pretraining on blood (≈8.74M cells; 2,588 samples).

The resulting donor embeddings support zero-shot and lightweight-head downstream tasks, including:

  • Large-scale reference mapping for disease classification (state-of-the-art accuracy with strong external generalization)
  • Regression of plasma proteomics from transcriptomes
  • Forecasting of future outcomes (e.g., RA conversion in ACPA+ individuals and influenza vaccine responsiveness)
  • Individualized cytokine perturbation modeling across donor, cellular, and gene levels

A “virtual instrument” conditions on cytokine protein embeddings to transform baseline donor states and, together with the decoder and an optional UCE→expression head, generates perturbed cell distributions and gene programs. Attention over cells provides mechanistic interpretability, highlighting disease- and severity-relevant cell subsets and enriching for antigen-specific clonotypes in viral infection.

PULSAR thus operationalizes the AI Virtual Cell vision by linking molecular, cellular, and multicellular organization into a unified, transferable representation for precision immunology.
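To make the Masked Cell Modeling idea more concrete, the sketch below shows only the masking step: a large fraction of one donor's cell embeddings is hidden and the remainder is kept as encoder input. This is an illustration under stated assumptions, not the released training code. The function name `mask_cells`, the mask ratio of 0.9, and the random placeholder data are all hypothetical; the actual masking scheme is described in the paper.

```python
import numpy as np

def mask_cells(cell_embeddings, mask_ratio=0.9, seed=0):
    """Randomly split one donor's cells into visible and masked subsets.

    mask_ratio is a hypothetical value chosen for illustration only.
    """
    rng = np.random.default_rng(seed)
    n_cells = cell_embeddings.shape[0]
    n_masked = int(round(n_cells * mask_ratio))
    order = rng.permutation(n_cells)           # cells are an unordered set
    masked_idx = np.sort(order[:n_masked])     # targets to reconstruct
    visible_idx = np.sort(order[n_masked:])    # input seen by the encoder
    return cell_embeddings[visible_idx], visible_idx, masked_idx

# Placeholder data: 1,000 cells with 1,280-d UCE-like embeddings (random, not real).
cells = np.random.default_rng(1).normal(size=(1000, 1280)).astype(np.float32)
visible, visible_idx, masked_idx = mask_cells(cells, mask_ratio=0.9)
print(visible.shape, masked_idx.shape)
```

In a full training loop, a decoder would then be asked to reconstruct the embeddings at `masked_idx` from the visible cells; that part is omitted here because its details are not given in this card.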

  • Developed by: Kuan Pang (Stanford University, kuanpang@stanford.edu)
  • Model type: Transformer
  • License: MIT

Model Sources

Uses

Direct Use

  • Generate 512-d donor embeddings from PBMC scRNA-seq to:
    • Perform reference mapping/retrieval (kNN) for disease phenotypes
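The kNN reference mapping mentioned above can be sketched as follows, assuming donor embeddings have already been computed and stored as NumPy arrays. The function name `knn_map`, the choice of cosine similarity, and `k=5` are assumptions for illustration, and the arrays are random placeholders rather than PULSAR outputs.

```python
import numpy as np
from collections import Counter

def knn_map(query, reference, reference_labels, k=5):
    """Label each query donor by majority vote among its k nearest
    reference donors, using cosine similarity (an assumed metric)."""
    q = query / np.linalg.norm(query, axis=1, keepdims=True)
    r = reference / np.linalg.norm(reference, axis=1, keepdims=True)
    sims = q @ r.T                               # (n_query, n_reference)
    top_k = np.argsort(-sims, axis=1)[:, :k]     # most similar first
    preds = []
    for row in top_k:
        votes = Counter(reference_labels[i] for i in row)
        preds.append(votes.most_common(1)[0][0])
    return preds, top_k

# Placeholder 512-d embeddings: two synthetic, well-separated groups.
rng = np.random.default_rng(0)
mu = np.ones(512)
reference = np.vstack([mu + rng.normal(size=(20, 512)),
                       -mu + rng.normal(size=(20, 512))])
reference_labels = ["healthy"] * 20 + ["disease"] * 20
query = np.vstack([mu + rng.normal(size=(1, 512)),
                   -mu + rng.normal(size=(1, 512))])

preds, neighbors = knn_map(query, reference, reference_labels, k=5)
print(preds)
```

An odd `k` avoids ties when there are two labels; with more phenotypes, a similarity-weighted vote may be a better choice.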

Out-of-Scope Use

The model may not perform well on tissue types other than PBMC. This limitation also applies to cell-sorted samples.

How to Get Started with the Model

Use the code below to get started with the model.

Training Details

Training Data

Stage-1 pretraining corpus: CZ CELLxGENE Census (LTS 2023-07-25), 36.2M cells, 6,807 samples across 53 tissues and 69 conditions.

Stage-2 continual pretraining (blood focus): 8.736M cells, 2,588 blood/PBMC samples (balanced sexes; broad ages).

More details can be found in the paper and the GitHub repository.

Citation

BibTeX:

@article{pang2025pulsar,
  title={PULSAR: a Foundation Model for Multi-scale and Multicellular Biology},
  author={Pang, Kuan and Rosen, Yanay and Kedzierska, Kasia and He, Ziyuan and Rajagopal, Abhe and Gustafson, Claire E and Huynh, Grace and Leskovec, Jure},
  journal={bioRxiv},
  pages={2025--11},
  year={2025},
  publisher={Cold Spring Harbor Laboratory}
}

Model size: 87.4M params (Safetensors, tensor type F32)

Model tree for KuanP/PULSAR-aligned

  • Base model: KuanP/PULSAR-pbmc
  • Fine-tuned: this model