How to run Terminal-Bench Challenges
Run or adapt long-horizon Terminal-Bench challenge tasks.
Terminal-Bench Challenges are long-horizon, single-task benchmarks. The tasks are authored in Harbor format, so Harbor can fetch, build, run, and verify them directly. For long runs, though, you may find it easier to evaluate them with your own stateful rollout logic, as Anthropic did for the C compiler task.
Source
The challenge tasks live in the terminal-bench-challenges repository and are published as tasks on Harbor Hub:
- Inference Engine Code Golf: GitHub, Harbor Hub
- Rust Compiler Speedup: GitHub, Harbor Hub
- WASM Render: GitHub, Harbor Hub
Run with Harbor
Install Harbor, make sure Docker is running, and run a challenge task.
uv tool install harbor
harbor run \
  -t terminal-bench/inference-engine-codegolf \
  -a claude-code \
  -m anthropic/claude-haiku-4-5 \
  -k 5
Replace the task with any other challenge from the source list.
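If you launch runs from a script, it can help to build the command in one place so that only the task changes between challenges. The following is a minimal sketch that reuses only the flags shown in the command above; the helper name `harbor_run_argv` and its parameter names are illustrative, not part of Harbor.

```python
import shlex


def harbor_run_argv(task, agent="claude-code",
                    model="anthropic/claude-haiku-4-5", k=5):
    """Build the argv for the `harbor run` command shown above.

    Hypothetical helper: only the flags -t, -a, -m, and -k come from
    the documented command; nothing else about the CLI is assumed.
    """
    return ["harbor", "run", "-t", task, "-a", agent, "-m", model, "-k", str(k)]


if __name__ == "__main__":
    argv = harbor_run_argv("terminal-bench/inference-engine-codegolf")
    # Print the command for inspection. To actually launch it, pass argv
    # to subprocess.run(argv, check=True).
    print(shlex.join(argv))
```

Passing the arguments as a list avoids shell quoting problems if a task or model name ever contains unusual characters.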
Custom rollouts
For serious attempts, you may want a custom, stateful, long-running rollout instead of using Harbor as the full framework. The task source and verifier can still serve as the benchmark contract while your own system handles checkpointing, orchestration, retries, agent memory, and long-running state.
Anthropic's C compiler project is an example of this style of work; their writeup describes using parallel Claude agents on a long-running systems project with programmatic verification.
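To make the division of labor concrete, here is a minimal sketch of a resumable rollout loop. It is not Harbor's API and not Anthropic's implementation: `RolloutState`, `run_rollout`, `agent_step`, and `verify` are hypothetical names, with `agent_step` standing in for one unit of agent work and `verify` standing in for whatever invokes the task's verifier and returns a score.

```python
import json
import os
import tempfile
from dataclasses import asdict, dataclass, field
from pathlib import Path
from typing import Callable, List, Optional


@dataclass
class RolloutState:
    """Everything needed to resume a rollout (hypothetical structure)."""
    step: int = 0
    notes: List[str] = field(default_factory=list)  # stand-in for agent memory
    best_score: Optional[float] = None


def load_state(path: Path) -> RolloutState:
    """Resume from a checkpoint if one exists, otherwise start fresh."""
    if path.exists():
        return RolloutState(**json.loads(path.read_text()))
    return RolloutState()


def save_state(path: Path, state: RolloutState) -> None:
    """Write to a temp file, then rename, so a crash never leaves a partial checkpoint."""
    tmp = path.with_suffix(path.suffix + ".tmp")
    tmp.write_text(json.dumps(asdict(state)))
    os.replace(tmp, path)


def run_rollout(path: Path,
                agent_step: Callable[[RolloutState], str],
                verify: Callable[[], float],
                max_steps: int,
                max_retries: int = 3) -> RolloutState:
    """Run agent steps with retries, verifying and checkpointing after each one."""
    state = load_state(path)
    while state.step < max_steps:
        for attempt in range(max_retries):
            try:
                note = agent_step(state)
                break
            except Exception:
                # Re-raise once the retry budget for this step is spent.
                if attempt == max_retries - 1:
                    raise
        state.notes.append(note)
        score = verify()
        if state.best_score is None or score > state.best_score:
            state.best_score = score
        state.step += 1
        save_state(path, state)
    return state


if __name__ == "__main__":
    checkpoint = Path(tempfile.mkdtemp()) / "state.json"
    # Stub agent and verifier; a real system would call a model and the task verifier.
    final = run_rollout(checkpoint, lambda s: f"did step {s.step}", lambda: 1.0, max_steps=3)
    print(final.step, final.best_score)
```

Because the state is saved after every step, rerunning `run_rollout` with the same checkpoint path continues from the last completed step instead of starting over.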