Contributors

View contributors for each Terminal-Bench benchmark.

Terminal-Bench 2.0

shipped

89 high-quality tasks across software engineering, machine learning, security, data science, and more.

Terminal-Bench 1.0

shipped

The original Terminal-Bench benchmark: 80 tasks that test an agent's ability to complete work in a terminal.

Terminal-Bench 3.0

in progress

The next frontier benchmark for terminal agents. Currently in development; contribute tasks to help shape the future of agent evaluation.

Terminal-Bench Science

in progress

A domain-specific benchmark for scientific computing in terminal environments. Currently in development.

Inference Engine Code Golf

shipped

A single-task challenge to write a complete Kimi K2.5 inference engine in a single CUDA file of at most 25,000 bytes.

Rust Compiler Speedup

shipped

A single-task challenge to make rustc compile programs faster while preserving correctness across the full test suite.

WASM Render

shipped

A single-task challenge to implement a pure JS/WASM WebGL 1.0 and 2.0 software renderer.