terminal-bench

Benchmarks

Browse Terminal-Bench benchmarks and their tasks.

Terminal-Bench 2.0

active

The latest version of Terminal-Bench. 89 high-quality tasks across software engineering, machine learning, security, data science, and more.

Terminal-Bench 1.0

active

The original Terminal-Bench benchmark. 80 tasks that test agents' ability to get work done in a terminal.

Terminal-Bench 3.0

in progress

The next frontier benchmark for terminal agents. Currently in development — contribute tasks and help shape the future of agent evaluation.

Terminal-Bench Science

in progress

A domain-specific benchmark for scientific computing in terminal environments. Currently in development.