PaperBench: Evaluating AI's Ability to Replicate AI Research • Paper • 2504.01848 • Published Apr 2, 2025 • 36
MLE-bench: Evaluating Machine Learning Agents on Machine Learning Engineering • Paper • 2410.07095 • Published Oct 9, 2024 • 7
[Re] Badder Seeds: Reproducing the Evaluation of Lexical Methods for Bias Measurement • Paper • 2206.01767 • Published Jun 3, 2022
Probing LLMs for Joint Encoding of Linguistic Categories • Paper • 2310.18696 • Published Oct 28, 2023 • 1