
How to run Terminal-Bench Challenges

Run or adapt long-horizon Terminal-Bench challenge tasks.

Terminal-Bench Challenges are long-horizon, single-task benchmarks. The tasks are authored in Harbor format, so Harbor can fetch, build, run, and verify them directly. For long runs, however, you may find it easier to evaluate them with your own custom, stateful rollout logic, as Anthropic did for the C compiler task.

Source

The challenge tasks live in the terminal-bench-challenges repository and are published as tasks on Harbor Hub:

Run with Harbor

Install Harbor, make sure Docker is running, and run a challenge task.

uv tool install harbor
harbor run \
  -t terminal-bench/inference-engine-codegolf \
  -a claude-code \
  -m anthropic/claude-haiku-4-5 \
  -k 5

To run a different challenge, replace the value passed to -t with any other task from the source list.
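If you plan to run several challenges, it can help to generate the command rather than edit it by hand. The following is a minimal sketch, not part of Harbor: the helper name build_harbor_command and the tasks list are hypothetical, and it uses only the flags shown in the command above (-t, -a, -m, -k), passing their values through unchanged.

```python
import shlex


def build_harbor_command(task, agent="claude-code",
                         model="anthropic/claude-haiku-4-5", k=5):
    """Return the argv list for one `harbor run` call (hypothetical helper).

    The flags mirror the command shown in this guide; no other Harbor
    options are assumed.
    """
    return ["harbor", "run", "-t", task, "-a", agent, "-m", model, "-k", str(k)]


# Only this task name appears in the guide. Add other names from the
# source list yourself; none are assumed here.
tasks = ["terminal-bench/inference-engine-codegolf"]

for task in tasks:
    # Print a copy-pasteable shell command for each task.
    print(shlex.join(build_harbor_command(task)))
```

To launch the runs instead of printing them, you could pass each list to subprocess.run(cmd, check=True), assuming Harbor is installed and Docker is running.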

Custom rollouts

For serious attempts, you may want a custom, stateful, long-running rollout instead of using Harbor as the full framework. The task source and verifier can still serve as the benchmark contract, while your own system handles checkpointing, orchestration, retries, agent memory, and long-running state.

Anthropic's C compiler project is an example of this style of work; their writeup describes using parallel Claude agents on a long-running systems project with programmatic verification.
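As a rough illustration of what a custom rollout has to handle, the sketch below shows a resumable loop with checkpointing, retries, and a verifier check. It is an assumption-laden outline, not Harbor's API and not Anthropic's method: agent_step and verify are hypothetical callables you would supply (for example, one agent turn and a wrapper around the task's verifier), and the JSON state layout is invented for this example.

```python
import json
import os
import tempfile


def load_state(path):
    """Load a checkpoint if one exists, else start fresh (layout is hypothetical)."""
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)
    return {"step": 0, "memory": []}


def save_state(path, state):
    """Write the checkpoint to a temp file, then rename it into place,
    so a crash mid-write cannot leave a truncated checkpoint."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
    os.replace(tmp, path)


def run_rollout(path, agent_step, verify, max_steps=100, max_retries=3):
    """Run agent steps until verify(state) passes or max_steps is reached.

    agent_step(state) returns a JSON-serializable note for agent memory;
    verify(state) returns True when the task's checks pass. Both are
    placeholders for your own agent and verifier integration.
    """
    state = load_state(path)  # resume from the last checkpoint, if any
    while state["step"] < max_steps:
        for attempt in range(max_retries):
            try:
                note = agent_step(state)
                break
            except Exception:
                # Retry transient failures; give up after max_retries.
                if attempt == max_retries - 1:
                    raise
        state["memory"].append(note)
        state["step"] += 1
        save_state(path, state)  # checkpoint after every completed step
        if verify(state):
            return True, state
    return False, state
```

Because the state is saved after each step, restarting the process with the same checkpoint path continues from the last completed step rather than from the beginning.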