🛡️ Nemotron PII: Synthesized Data for Privacy-Preserving AI

Community Article · Published October 28, 2025

Authors: Amy Steier, Andre Manoel, Alexa Haushalter, Maarten Van Segbroeck


Open Data for PII Protection

Most AI developers struggle to safely train or evaluate models on sensitive text—emails, chat logs, clinical notes, or legal documents. To help solve that, we're introducing NVIDIA Nemotron-PII, a free, fully synthetic dataset built using NVIDIA NeMo Data Designer. This dataset is paired with GLiNER-PII, a fine-tuned open-source model optimized for PII/PHI detection.

This release showcases a practical, reusable pipeline we use at NVIDIA, one that’s scalable across healthcare, finance, legal, and other enterprise data pipelines:

  • Design privacy-safe training data with NeMo Data Designer.
  • Fine-tune open-source models like GLiNER with high-quality synthetic examples.
  • Deploy those models in production—for example, to detect PII before generating synthetic tabular data using NVIDIA NeMo Safe Synthesizer (available now in early access) or in pre-processing with NVIDIA NeMo Curator.

This approach provides a scalable foundation for de-identification, redaction, and compliance workflows.


What’s in the Dataset?

Nemotron-PII is a high-quality, synthetic dataset designed specifically for training robust PII/PHI detection models:

  • 100K synthetic records (50K train / 50K test).
  • More than 55 PII types including names, SSNs, MRNs, emails, and account numbers.
  • Structured and unstructured formats, spanning forms, logs, emails, and free text.
  • 50+ industries represented, reflecting diverse enterprise contexts.
  • Persona-grounded design leveraging Nemotron-Personas, a collection of synthetic personas grounded in real-world demographic and geographic distributions.
  • Span-level annotations for high-quality Named Entity Recognition (NER) training.
  • License: CC BY 4.0 for free and commercial use.
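If the dataset is published on the Hugging Face Hub, you can pull it down with the datasets library along the lines of the minimal sketch below. The repository ID, split names, and field access shown here are assumptions for illustration; check the dataset card for the exact identifier and schema.

from datasets import load_dataset

# Hypothetical repository ID -- confirm the exact name on the dataset card
ds = load_dataset("nvidia/Nemotron-PII")

# Expecting roughly 50K train and 50K test records
print(ds)

# Inspect a single record; the field names (text plus span-level
# annotations) are illustrative and may differ in the published schema
print(ds["train"][0])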

How We Built It

Built with NeMo Data Designer, this dataset combines statistical grounding with flexible text synthesis to simulate real-world data across industries and formats.

  1. We used structured templates grounded in real-world field distributions to generate realistic data.
  2. We then applied multi-backend language models for free-text augmentation—including Mistral-Small-24B-Instruct-2501.
  3. Finally, we fine-tuned the GLiNER architecture using Nemotron-PII to create GLiNER-PII, optimized for multi-domain privacy detection.

The result? A privacy-first NER model with best-in-class recall and generalization, ready to drop into real-world pipelines or serve as a base for fine-tuning.
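To make the structured-template idea concrete, here is a purely illustrative sketch in plain Python. It is not the NeMo Data Designer API; it only shows how a persona-grounded template can emit both free text and the character-level span annotations that an NER model such as GLiNER is trained on. The persona values and template below are made up for this example.

import random

# Toy persona pool standing in for Nemotron-Personas (illustrative values only)
personas = [
    {"name": "Maria Lopez", "email": "maria.lopez@example.com"},
    {"name": "James Chen", "email": "j.chen@example.org"},
]

# A template expressed as alternating (literal text, PII field) pieces
TEMPLATE = [("Patient ", "name"), (" can be reached at ", "email"), (" for follow-up.", None)]

def render_with_spans(template, persona):
    """Fill the template and record a character-level span for each PII field."""
    text, spans = "", []
    for literal, field in template:
        text += literal
        if field is not None:
            value = persona[field]
            spans.append({"start": len(text), "end": len(text) + len(value), "label": field})
            text += value
    return text, spans

record_text, record_spans = render_with_spans(TEMPLATE, random.choice(personas))
print(record_text)
print(record_spans)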


Who This Data Is For

Whether you’re building a clinical AI application or auditing enterprise logs, Nemotron-PII is designed to accelerate secure development:

  • Healthcare: Redact PHI from clinical notes, lab results, or patient messages.
  • Finance: Identify SSNs, account numbers, or transaction details for auditing.
  • Legal: Protect client identities in filings, contracts, and discovery materials.
  • Enterprise: Scan emails, documents, and internal logs for sensitive info.
  • Cybersecurity: Identify personal details in threat reports or user-generated content.

Why It Matters

Regulations like HIPAA, GDPR, and CCPA require strong data safeguards—yet most teams lack access to clean, scalable datasets to train compliant AI.

Nemotron-PII and GLiNER-PII provide a practical path forward:

  • No real PII or re-identification risk.
  • Enterprise-grade accuracy across domains and formats.
  • Open-weight model for private, auditable deployment.
  • Proven utility in NVIDIA’s own product pipelines.

The impact is already visible in NVIDIA NeMo Safe Synthesizer and NeMo Curator, where the GLiNER-PII model achieves 92% recall and a 64% F1 score for PII and PHI detection, a significant improvement over baseline models.


Scaling Privacy-Preserving AI Pipelines

It all starts with the right data. NVIDIA’s open AI-data stack lets you minimize exposure of sensitive information and optimize the privacy-performance tradeoff.

  1. Use NeMo Data Designer to generate synthetic training samples from scratch that are grounded in real-world statistics.
  2. Fine-tune open-source models like GLiNER with your synthesized NER data.
  3. Use those models in your AI systems to automatically detect and redact PII (a minimal redaction sketch follows this list).
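As a concrete example of step 3, here is a minimal redaction sketch built on the GLiNER-PII prediction output shown later in this post (dicts with start, end, label, and score). The label set, threshold, and placeholder format are illustrative choices, not the only way to deploy the model.

from gliner import GLiNER

# Illustrative label set and threshold; tune both for your own data
LABELS = ["name", "email", "phone_number", "ssn"]
model = GLiNER.from_pretrained("nvidia/gliner-pii")

def redact(text, labels=LABELS, threshold=0.5):
    """Replace every detected entity with a [LABEL] placeholder."""
    entities = model.predict_entities(text, labels, threshold=threshold)
    # Work from the end of the string so earlier character offsets stay valid
    for ent in sorted(entities, key=lambda e: e["start"], reverse=True):
        text = text[:ent["start"]] + f"[{ent['label'].upper()}]" + text[ent["end"]:]
    return text

print(redact("Contact Jane Roe at jane.roe@example.com or (555) 987-6543."))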

Start Building with Nemotron-PII

Nemotron-PII is an example of the kind of dataset you can create with NeMo Data Designer; try it out to design your own datasets for model fine-tuning.

You can experience the value of the GLiNER-PII model firsthand in products like NeMo Safe Synthesizer and NeMo Curator, which leverage GLiNER-PII to automatically detect, redact, and replace sensitive entities.

Whether you’re fine-tuning your own redaction models or looking to validate enterprise pipelines, Nemotron-PII provides a fast, reliable way to get started—with no PII exposure, no licensing restrictions, and full commercial rights.

Get started with just a few lines of code.

First, make sure you have the library installed:

pip install gliner

Now, let's try to find an email, phone number, username, and SSN in a messy block of support-ticket text (the text contains no SSN, so that label should correctly return nothing):

from gliner import GLiNER

# 1. Define the input text
text = "Hi support, I can't log in! My account username is 'johndoe88'. Every time I try, it says \"invalid credentials\". Please reset my password. You can reach me at (555) 123-4567 or johnd@example.com"

# 2. Define the labels we're looking for
labels = ["email", "ssn", "phone_number", "user_name"]

# 3. Load the PII model
model = GLiNER.from_pretrained("nvidia/gliner-pii")

# 4. Run the prediction at the given confidence threshold
entities = model.predict_entities(text, labels, threshold=0.5)

print(entities)

Sample output:

[
  {'start': 52, 'end': 61, 'text': 'johndoe88', 'label': 'user_name', 'score': 0.96},
  {'start': 159, 'end': 173, 'text': '(555) 123-4567', 'label': 'phone_number', 'score': 0.97},
  {'start': 177, 'end': 194, 'text': 'johnd@example.com', 'label': 'email', 'score': 0.98}
]

👉 Explore, test, and integrate these resources into your compliance workflows, and help advance trustworthy AI with synthesized data that is private by design.

Access the Nemotron-PII dataset and GLiNER-PII model today on Hugging Face.

