terminal-bench

Benchmarks

Browse Terminal-Bench benchmarks and their tasks.

Terminal-Bench 2.1

active

An improved version of Terminal-Bench 2.0, inspired by Z.ai's Terminal-Bench 2.0 Verified.

Terminal-Bench 2.0

active

89 high-quality tasks across software engineering, machine learning, security, data science, and more.

Terminal-Bench 1.0

active

The original Terminal-Bench benchmark: 80 tasks that test an agent's ability to complete work in a terminal.

Terminal-Bench 3.0

in progress

The next frontier benchmark for terminal agents. Currently in development; contribute tasks and help shape the future of agent evaluation.

Terminal-Bench Science

in progress

A domain-specific benchmark for scientific computing in terminal environments. Currently in development.

Inference Engine Code Golf

active

A single-task challenge to write a complete Kimi K2.5 inference engine in one <=25,000-byte CUDA file.

Rust Compiler Speedup

active

A single-task challenge to make rustc compile programs faster while preserving correctness across the full test suite.

WASM Render

active

A single-task challenge to implement a pure JS/WASM WebGL 1.0 and 2.0 software renderer.