Measuring what Matters: Construct Validity in Large Language Model Benchmarks Paper • 2511.04703 • Published Nov 3 • 7
Is Multilingual LLM Watermarking Truly Multilingual? A Simple Back-Translation Solution Paper • 2510.18019 • Published Oct 20 • 17
MALT: Improving Reasoning with Multi-Agent LLM Training Paper • 2412.01928 • Published Dec 2, 2024 • 45