How to run Terminal-Bench 2.1
Run Terminal-Bench 2.1 with Harbor.
Harbor runs the benchmark from the terminal-bench/terminal-bench-2-1 dataset.
Prerequisites
Install Harbor and make sure Docker is running.
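One way to confirm that the Docker daemon is reachable is to query it with `docker info`. The sketch below wraps that check in a helper; the function name `docker_is_running` is hypothetical and not part of Harbor or Docker.

```shell
# Hypothetical helper: succeeds only if the docker CLI is installed
# and the daemon answers `docker info`.
docker_is_running() {
  command -v docker >/dev/null 2>&1 && docker info >/dev/null 2>&1
}

if docker_is_running; then
  echo "Docker is running"
else
  echo "Docker is not reachable; start it before running Harbor" >&2
fi
```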
uv tool install harbor
harbor --help
Smoke test
Run the oracle solutions on the first five tasks. This confirms that Harbor can download the dataset and start the task containers.
harbor run -d terminal-bench/terminal-bench-2-1 -a oracle -l 5
Run with an agent
harbor run \
-d terminal-bench/terminal-bench-2-1 \
-a claude-code \
-m anthropic/claude-opus-4-1 \
-k 5
To scale with a cloud sandbox, set the provider keys and pass --env.
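A small guard can confirm that both keys are set before a run starts. This is a minimal sketch: `require_env` is a hypothetical helper, not a Harbor command, and it checks only that each variable is non-empty, not that the key is valid. Define it once, then run the check after exporting the keys.

```shell
# Hypothetical helper: returns 1 if any named environment variable
# is unset or empty, and reports each missing name on stderr.
require_env() {
  missing=0
  for name in "$@"; do
    eval "value=\${$name:-}"
    if [ -z "$value" ]; then
      echo "missing environment variable: $name" >&2
      missing=1
    fi
  done
  return "$missing"
}

# The variable names are the two keys this guide exports.
if require_env DAYTONA_API_KEY ANTHROPIC_API_KEY; then
  echo "provider keys are set"
else
  echo "export the keys before running harbor" >&2
fi
```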
export DAYTONA_API_KEY="<your-daytona-api-key>"
export ANTHROPIC_API_KEY="<your-anthropic-api-key>"
harbor run \
-d terminal-bench/terminal-bench-2-1 \
-a claude-code \
-m anthropic/claude-opus-4-1 \
--env daytona \
-n 32 \
-k 5
Custom agents
Pass an import path instead of a built-in agent name.
harbor run \
-d terminal-bench/terminal-bench-2-1 \
--agent-import-path "path.to.agent:SomeAgent" \
-k 5
To run one task from the dataset, add --include-task-name.
harbor run \
-d terminal-bench/terminal-bench-2-1 \
-a claude-code \
-m anthropic/claude-opus-4-1 \
--include-task-name "<task-name>"
Submit to the leaderboard
A new leaderboard submission process is coming soon.
Results
View the Terminal-Bench 2.1 leaderboard, browse the Terminal-Bench 2.1 benchmark, or open the dataset on Harbor Hub.