terminal-bench
Browse Terminal-Bench benchmarks and their tasks.
The latest version of Terminal-Bench. 89 high-quality tasks across software engineering, machine learning, security, data science, and more.
The original Terminal-Bench benchmark. 80 tasks testing how well agents can complete work in a terminal.
The next frontier benchmark for terminal agents. Currently in development. Contribute tasks and help shape the future of agent evaluation.
A domain-specific benchmark for scientific computing in terminal environments. Currently in development.