Terminal-Bench

Benchmarks

Browse Terminal-Bench benchmarks and their tasks.

An improved version of Terminal-Bench 2.0, inspired by Z.ai's Terminal-Bench 2.0 Verified.

89 high-quality tasks across software engineering, machine learning, security, data science, and more.

The original Terminal-Bench benchmark. 80 tasks that test an agent's ability to get work done in a terminal.

The next frontier benchmark for terminal agents. Currently in development: contribute tasks and help shape the future of agent evaluation.

A domain-specific benchmark for scientific computing in terminal environments. Currently in development.

Single-task Terminal-Bench challenges spanning inference engine code golf, Rust compiler speedup, and WASM rendering.