Terminal-Bench

Contributors

View contributors for each Terminal-Bench benchmark.

89 high-quality tasks across software engineering, machine learning, security, data science, and more.

The original Terminal-Bench benchmark: 80 tasks that test an agent's ability to complete work in a terminal.

The next frontier benchmark for terminal agents. Currently in development: contribute tasks and help shape the future of agent evaluation.

A domain-specific benchmark for scientific computing in terminal environments. Currently in development.

Single-task Terminal-Bench challenges spanning inference engine code golf, Rust compiler speedup, and WASM rendering.