WaRP-Safety-Llama3_8B_Instruct

A Llama 3.1 8B Instruct model fine-tuned for safety alignment using the Weight space Rotation Process (WaRP).

Model Details

  • Base Model: meta-llama/Llama-3.1-8B-Instruct
  • Training Method: Safety-First WaRP (3-Phase pipeline)
  • Training Date: 2025-10-27

Training Procedure

Phase 1: Basis Construction

  • Collected activations from FFN layers using safety data
  • Computed SVD to obtain orthonormal basis vectors
  • Identified 419 important neurons in layer 31
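
The basis-construction step above can be sketched roughly as follows. This is a minimal illustration on synthetic data, not the training code used for this model: the `activation_basis` helper, the array shapes, and the choice to rotate the weight matrix on its input side are all assumptions made for the example.

```python
import numpy as np

def activation_basis(activations):
    """Return an orthonormal basis (as columns) for the activation space.

    activations: array of shape (num_tokens, hidden_dim). Here it is a
    hypothetical stand-in for FFN inputs collected on safety prompts.
    """
    # Thin SVD: the rows of vt are orthonormal right singular vectors,
    # ordered by decreasing singular value.
    _, singular_values, vt = np.linalg.svd(activations, full_matrices=False)
    return vt.T, singular_values

rng = np.random.default_rng(0)
acts = rng.standard_normal((64, 8))     # synthetic activations, not real model data
basis, sv = activation_basis(acts)

# Re-express a (hypothetical) FFN weight matrix in the new basis.
weight = rng.standard_normal((16, 8))   # maps hidden_dim=8 to 16 outputs
weight_rotated = weight @ basis         # coefficients along each basis direction

# Because the basis is orthogonal, rotating back recovers the original weights.
weight_restored = weight_rotated @ basis.T
```

Working in the rotated coordinates leaves the function computed by the layer unchanged, so later phases can decide direction by direction which coefficients to freeze.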

Phase 2: Importance Scoring

  • Calculated importance scores using gradient-based methods
  • Generated masks for important directions
  • Used teacher forcing on safety responses
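
One plausible way to turn gradients into a mask is sketched below. The scoring rule (mean absolute gradient over safety batches), the `keep_ratio` parameter, and the `importance_mask` helper are assumptions for illustration; the card does not state the exact scoring formula or threshold.

```python
import numpy as np

def importance_mask(grads, keep_ratio=0.1):
    """Score each coefficient and mark the top fraction as important.

    grads: list of gradient arrays (one per safety batch), all with the
    same shape as the rotated weight matrix. Hypothetical inputs.
    """
    # Assumed score: mean absolute gradient across batches.
    scores = np.mean([np.abs(g) for g in grads], axis=0)
    # Keep at least one coefficient; ties at the threshold are all kept.
    k = max(1, int(round(keep_ratio * scores.size)))
    threshold = np.partition(scores.ravel(), -k)[-k]
    return scores, scores >= threshold

# Two synthetic gradient batches for a 2x2 coefficient matrix.
grads = [
    np.array([[0.1, -2.0], [0.3, 0.05]]),
    np.array([[-0.1, 1.0], [0.5, 0.05]]),
]
scores, mask = importance_mask(grads, keep_ratio=0.25)
```

In this toy case only the coefficient with the largest average gradient magnitude is marked as important.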

Phase 3: Incremental Learning

  • Fine-tuned on utility task (GSM8K) with gradient masking
  • Protected important directions to maintain safety
  • Improved utility while preserving safety mechanisms
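
Gradient masking in this phase can be pictured as an optimizer step that skips the protected coefficients. The sketch below uses plain SGD and an elementwise boolean mask. Both are assumptions for illustration, and `masked_sgd_step` is a hypothetical helper rather than part of the released code.

```python
import numpy as np

def masked_sgd_step(coeffs, grad, protected, lr=0.1):
    """Apply one SGD update, leaving protected coefficients untouched.

    protected: boolean array, True where a direction was marked as
    important for safety and must not be updated.
    """
    trainable = ~protected
    # Zero the gradient on protected entries before applying the update.
    return coeffs - lr * grad * trainable

coeffs = np.array([[1.0, 2.0], [3.0, 4.0]])
grad = np.ones_like(coeffs)             # synthetic utility-task gradient
protected = np.array([[False, True], [False, False]])

updated = masked_sgd_step(coeffs, grad, protected)
```

Only the unprotected coefficients move, which is how the utility fine-tuning can proceed without changing the directions identified as important for safety.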

Usage

from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "kmseong/WaRP-Safety-Llama3_8B_Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto", device_map="auto")

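# Tokenize the prompt and move the tensors to the model's device
inputs = tokenizer("What is machine learning?", return_tensors="pt").to(model.device)
# max_new_tokens bounds only the generated continuation, not prompt plus output
outputs = model.generate(**inputs, max_new_tokens=100)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))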

Safety Features

  • ✅ Protected safety mechanisms through gradient masking
  • ✅ Maintained refusal capability for harmful requests
  • ✅ Improved utility on reasoning tasks
  • ✅ Balanced safety-utility tradeoff

Datasets

  • Safety Data: LibrAI/do-not-answer
  • Utility Data: openai/gsm8k

Citation

@article{warp-safety,
  title={Safety-First WaRP: Weight space Rotation Process for LLM Safety Alignment},
  author={Min-Seong Kim},
  year={2025}
}

License

This model is built on Llama 3.1 8B Instruct and is released under the same license as the base model (the Llama 3.1 Community License).

Disclaimer

This model is fine-tuned for improved safety. Users should evaluate model outputs for their specific use cases and apply additional safety measures as needed.
