terminal-bench

First Steps

Running the benchmark and testing your own agent.

Run the benchmark

First, set the API keys for the models you want to use. Refer to the LiteLLM documentation to see which key each model requires.
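As a quick pre-flight check, the sketch below reports which required keys are missing from the environment before a run starts. It assumes an Anthropic model is used, for which LiteLLM reads the ANTHROPIC_API_KEY environment variable. The helper name missing_api_keys is hypothetical and not part of Terminal-Bench.

```python
import os


def missing_api_keys(required, env=None):
    """Return the names in `required` that are unset or empty in `env`."""
    env = os.environ if env is None else env
    return [name for name in required if not env.get(name)]


if __name__ == "__main__":
    # Assumption: an Anthropic model is used, so LiteLLM reads ANTHROPIC_API_KEY.
    missing = missing_api_keys(["ANTHROPIC_API_KEY"])
    if missing:
        print("Missing API keys:", ", ".join(missing))
    else:
        print("All required API keys are set.")
```

For other providers, swap in the variable name that LiteLLM documents for that provider.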

The Terminal-Bench CLI can run many different benchmarks and versions of benchmarks. For example, to run the hello-world task from version 0.1.0 of the terminal-bench-core dataset (the version used for the leaderboard) with the Terminus agent and Claude Sonnet 4:

tb run \
    --dataset-name terminal-bench-core \
    --dataset-version 0.1.0 \
    --agent terminus \
    --model-name anthropic/claude-sonnet-4-20250514 \
    --task-id hello-world

Run tb run --help to see the available commands and options.
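If you launch runs from a script, it can help to assemble the command as an argument list rather than a shell string. The sketch below builds the same invocation shown above using only the flags that appear there. The function name build_tb_run_command is hypothetical, and the snippet only prints the command rather than executing it.

```python
import shlex


def build_tb_run_command(dataset_name, dataset_version, agent, model_name, task_id):
    """Assemble the argument list for the `tb run` invocation shown above."""
    return [
        "tb", "run",
        "--dataset-name", dataset_name,
        "--dataset-version", dataset_version,
        "--agent", agent,
        "--model-name", model_name,
        "--task-id", task_id,
    ]


command = build_tb_run_command(
    dataset_name="terminal-bench-core",
    dataset_version="0.1.0",
    agent="terminus",
    model_name="anthropic/claude-sonnet-4-20250514",
    task_id="hello-world",
)
# Print the command for inspection; pass `command` to subprocess.run to execute it.
print(shlex.join(command))
```

Passing the list to subprocess.run avoids shell quoting issues if a value ever contains spaces.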

Testing a custom agent

Terminal-Bench comes with many popular agents available out of the box. Run tb run --help to see the available agents.

To test a custom agent, implement the BaseAgent interface. If your agent is accessible as a package (e.g. Claude Code or Codex), you can implement the AbstractInstalledAgent interface instead.

path/to/your/agent.py
from pathlib import Path

from terminal_bench.agents import BaseAgent, AgentResult
# or AbstractInstalledAgent if your agent is accessible as a package
from terminal_bench.terminal.tmux_session import TmuxSession
 
class YourCustomAgent(BaseAgent):
    @staticmethod
    def name() -> str:
        return "your-agent-name"
 
    def perform_task(
        self,
        task_description: str,
        session: TmuxSession,
        logging_dir: Path | None = None,
    ) -> AgentResult:
        ...

Once you have implemented your agent, run it with the CLI by passing the --agent-import-path flag instead of the --agent flag. For example:

tb run \
    --dataset-name terminal-bench-core \
    --dataset-version 0.1.0 \
    --agent-import-path path.to.your.agent:YourCustomAgent \
    --task-id hello-world
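The --agent-import-path value has the form module:Class: a dotted module path, a colon, then the class name. The sketch below shows one common way such a string can be resolved in Python. It illustrates the convention only and is an assumption about the mechanism, not Terminal-Bench's actual loader. The helper name load_from_import_path is hypothetical.

```python
import importlib


def load_from_import_path(import_path):
    """Resolve a 'package.module:ClassName' string to the named object."""
    module_name, _, attribute = import_path.partition(":")
    if not module_name or not attribute:
        raise ValueError(f"expected 'module:ClassName', got {import_path!r}")
    module = importlib.import_module(module_name)
    return getattr(module, attribute)


# Demonstrated with a standard-library class so the sketch runs anywhere.
ordered_dict_class = load_from_import_path("collections:OrderedDict")
print(ordered_dict_class.__name__)
```

Under this assumption, the module must be importable from the directory where you invoke tb, so an import error usually points to a wrong dotted path or working directory.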

Getting help

Please reach out to us on Discord if you have any questions or feedback.
