On Robustness and Reliability of Benchmark-Based Evaluation of LLMs Paper • 2509.04013 • Published Sep 4, 2025
Flaw or Artifact? Rethinking Prompt Sensitivity in Evaluating LLMs Paper • 2509.01790 • Published Sep 1, 2025
SPICE: A Synergistic, Precise, Iterative, and Customizable Image Editing Workflow Paper • 2504.09697 • Published Apr 13, 2025