How to run Terminal-Bench Challenges

Terminal-Bench Challenges are long-horizon, single-task benchmarks. The tasks are authored in Harbor format, so Harbor can fetch, build, run, and verify them directly. You may nonetheless find it easier to evaluate them with custom, stateful, hand-rolled rollout logic, similar to what Anthropic used for the C compiler task.

Source

The challenge tasks live in the terminal-bench-challenges repository and are published as tasks on Harbor Hub:

Inference Engine Code Golf: GitHub, Harbor Hub
Rust Compiler Speedup: GitHub, Harbor Hub
WASM Render: GitHub, Harbor Hub

Run with Harbor

Install Harbor, make sure Docker is running, and run a challenge task.

uv tool install harbor

harbor run \
  -t terminal-bench/inference-engine-codegolf \
  -a claude-code \
  -m anthropic/claude-haiku-4-5 \
  -k 5

To run a different challenge, replace the value passed to -t with the task name of any other challenge listed under Source.
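
If you want to launch runs from a script, the command above can be assembled programmatically. The following is a minimal sketch, not part of Harbor: the helper names harbor_run_command and run_challenge are hypothetical, and it uses only the flags shown in the command above (-t, -a, -m, -k), passing their values through unchanged.

```python
import shlex
import subprocess
from typing import List


def harbor_run_command(
    task: str,
    agent: str = "claude-code",
    model: str = "anthropic/claude-haiku-4-5",
    k: int = 5,
) -> List[str]:
    """Build the argument list for the `harbor run` invocation shown above.

    Hypothetical helper: it only mirrors the flags used in this guide.
    """
    return ["harbor", "run", "-t", task, "-a", agent, "-m", model, "-k", str(k)]


def run_challenge(task: str) -> None:
    """Run one challenge. Assumes `harbor` is on PATH and Docker is running."""
    cmd = harbor_run_command(task)
    print("Running:", shlex.join(cmd))
    # check=True raises CalledProcessError if harbor exits with a non-zero status.
    subprocess.run(cmd, check=True)
```

Calling run_challenge("terminal-bench/inference-engine-codegolf") would then issue the same command as the one typed by hand above.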

Custom rollouts

For serious attempts, you may want a custom, stateful, long-running rollout instead of using Harbor as the full framework. The task source and verifier can still serve as the benchmark contract, while your own system handles checkpointing, orchestration, retries, agent memory, and long-running state.
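
The split described above, where your own system owns state and the task's verifier stays the contract, might look roughly like the sketch below. Everything in it is an assumption for illustration: RolloutState, run_rollout, and the agent_step and verify callables are hypothetical names, not Harbor APIs. In a real setup, verify would wrap the task's own verifier and agent_step would drive your agent.

```python
import json
import os
import tempfile
from dataclasses import asdict, dataclass, field
from typing import Callable, List


@dataclass
class RolloutState:
    step: int = 0
    notes: List[str] = field(default_factory=list)  # stand-in for agent memory


def save_checkpoint(state: RolloutState, path: str) -> None:
    # Write to a temporary file, then atomically replace the checkpoint,
    # so a crash mid-write never leaves a truncated file behind.
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as f:
        json.dump(asdict(state), f)
    os.replace(tmp_path, path)


def load_checkpoint(path: str) -> RolloutState:
    # Start fresh if no checkpoint exists yet.
    if not os.path.exists(path):
        return RolloutState()
    with open(path) as f:
        return RolloutState(**json.load(f))


def run_rollout(
    agent_step: Callable[[RolloutState], str],
    verify: Callable[[RolloutState], bool],
    checkpoint_path: str,
    max_steps: int,
    max_retries: int = 3,
) -> bool:
    """Resume from the last checkpoint and step until verify passes."""
    state = load_checkpoint(checkpoint_path)
    while state.step < max_steps:
        for attempt in range(max_retries):
            try:
                note = agent_step(state)
                break
            except Exception:
                # Broad catch for the sketch only; re-raise once retries run out.
                if attempt == max_retries - 1:
                    raise
        state.notes.append(note)
        state.step += 1
        save_checkpoint(state, checkpoint_path)
        if verify(state):
            return True
    return False
```

Because the state is persisted after every step, a restarted process that calls run_rollout with the same checkpoint path continues from the last completed step instead of starting over.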

Anthropic's C compiler project is an example of this style of work; their writeup describes using parallel Claude agents on a long-running systems project with programmatic verification.
