The Tool Decathlon: Benchmarking Language Agents for Diverse, Realistic, and Long-Horizon Task Execution Paper • 2510.25726 • Published 5 days ago • 42
The Hydra Effect: Emergent Self-repair in Language Model Computations Paper • 2307.15771 • Published Jul 28, 2023 • 19
Does Circuit Analysis Interpretability Scale? Evidence from Multiple Choice Capabilities in Chinchilla Paper • 2307.09458 • Published Jul 18, 2023 • 11