safety
DuoGuard: A Two-Player RL-Driven Framework for Multilingual LLM Guardrails (arXiv:2502.05163)
CRANE: Reasoning with constrained LLM generation (arXiv:2502.09061)
Investigating the Impact of Quantization Methods on the Safety and Reliability of Large Language Models (arXiv:2502.15799)
AISafetyLab: A Comprehensive Framework for AI Safety Evaluation and Improvement (arXiv:2502.16776)
LettuceDetect: A Hallucination Detection Framework for RAG Applications (arXiv:2502.17125)
SafeArena: Evaluating the Safety of Autonomous Web Agents (arXiv:2503.04957)
Safeguarding Vision-Language Models: Mitigating Vulnerabilities to Gaussian Noise in Perturbation-based Attacks (arXiv:2504.01308)
LLM Can be a Dangerous Persuader: Empirical Study of Persuasion Safety in Large Language Models (arXiv:2504.10430)
MCP Safety Audit: LLMs with the Model Context Protocol Allow Major Security Exploits (arXiv:2504.03767)
Set You Straight: Auto-Steering Denoising Trajectories to Sidestep Unwanted Concepts (arXiv:2504.12782)
X-Teaming: Multi-Turn Jailbreaks and Defenses with Adaptive Multi-Agents (arXiv:2504.13203)
A Comprehensive Survey in LLM(-Agent) Full Stack Safety: Data, Training and Deployment (arXiv:2504.15585)
Unlearning Sensitive Information in Multimodal LLMs: Benchmark and Attack-Defense Evaluation (arXiv:2505.01456)
Teaching Models to Understand (but not Generate) High-risk Data (arXiv:2505.03052)
Will AI Tell Lies to Save Sick Children? Litmus-Testing AI Values Prioritization with AIRiskDilemmas (arXiv:2505.14633)
How Should We Enhance the Safety of Large Reasoning Models: An Empirical Study (arXiv:2505.15404)
Be Careful When Fine-tuning On Open-Source LLMs: Your Fine-tuning Data Could Be Secretly Stolen! (arXiv:2505.15656)
SafeKey: Amplifying Aha-Moment Insights for Safety Reasoning (arXiv:2505.16186)
Personalized Safety in LLMs: A Benchmark and A Planning-Based Agent Approach (arXiv:2505.18882)
Towards Safety Reasoning in LLMs: AI-agentic Deliberation for Policy-embedded CoT Data Creation (arXiv:2505.21784)
OS-Harm: A Benchmark for Measuring Safety of Computer Use Agents (arXiv:2506.14866)
Automating Steering for Safe Multimodal Large Language Models (arXiv:2507.13255)
The Devil behind the mask: An emergent safety vulnerability of Diffusion LLMs (arXiv:2507.11097)
Frontier AI Risk Management Framework in Practice: A Risk Analysis Technical Report (arXiv:2507.16534)
Personalized Safety Alignment for Text-to-Image Diffusion Models (arXiv:2508.01151)
Data and AI governance: Promoting equity, ethics, and fairness in large language models (arXiv:2508.03970)