terminal-bench
Browse Terminal-Bench benchmarks and their tasks.
An improved version of Terminal-Bench 2.0, inspired by Z.ai's Terminal-Bench 2.0 Verified.
89 high-quality tasks across software engineering, machine learning, security, data science, and more.
The original Terminal-Bench benchmark: 80 tasks that test an agent's ability to complete work using a terminal.
The next frontier benchmark for terminal agents. Currently in development. Contribute tasks and help shape the future of agent evaluation.
A domain-specific benchmark for scientific computing in terminal environments. Currently in development.
A single-task challenge to write a complete Kimi K2.5 inference engine in one CUDA file of at most 25,000 bytes.
A single-task challenge to make rustc compile programs faster while preserving correctness across the full test suite.
A single-task challenge to implement a pure JS/WASM WebGL 1.0 and 2.0 software renderer.