terminal-bench

First Steps

Running the benchmark and testing your own agent.

Run the benchmark

First, set the API keys for the models you want to use. Refer to the LiteLLM documentation to see which environment variable corresponds to each model provider.
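A quick pre-flight check can catch a missing key before a run starts. The sketch below is plain Python and not part of Terminal-Bench: the helper name `missing_api_keys` is hypothetical, and the variable names you pass in depend on your provider (for `anthropic/...` models, LiteLLM reads `ANTHROPIC_API_KEY`).

```python
from __future__ import annotations

import os
from collections.abc import Iterable, Mapping


def missing_api_keys(
    required: Iterable[str],
    env: Mapping[str, str] | None = None,
) -> list[str]:
    """Return the names in `required` that are unset or blank in `env`."""
    # Fall back to the real process environment when no mapping is given.
    env = os.environ if env is None else env
    # Treat whitespace-only values as missing, since they cannot be valid keys.
    return [name for name in required if not env.get(name, "").strip()]
```

Calling `missing_api_keys(["ANTHROPIC_API_KEY"])` before `tb run` returns an empty list when the key is set, and the list of missing names otherwise.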

The Terminal-Bench CLI can run many different benchmarks and versions of benchmarks.

To see the available benchmarks, run tb datasets list.

For example, to run the latest, pre-release terminal-bench-core dataset with the Terminus agent and Claude Sonnet 4:

tb run \
    --dataset terminal-bench-core==head \
    --agent terminus \
    --model-name anthropic/claude-sonnet-4-20250514 \
    --task-id hello-world

Run tb run --help to see the available commands and options.

Testing a custom agent

Terminal-Bench comes with many popular agents available out of the box. Run tb run --help to see the available agents.

To test a custom agent, implement the BaseAgent interface. If your agent is accessible as a package (e.g. Claude Code or Codex), you can implement the AbstractInstalledAgent interface instead.

path/to/your/agent.py
from pathlib import Path

from terminal_bench.agents import BaseAgent, AgentResult
# or AbstractInstalledAgent if your agent is accessible as a package

from terminal_bench.terminal.tmux_session import TmuxSession

class YourCustomAgent(BaseAgent):
    @staticmethod
    def name() -> str:
        return "your-agent-name"

    def perform_task(
        self,
        task_description: str,
        session: TmuxSession,
        logging_dir: Path | None = None,
    ) -> AgentResult:
        ...

Once you have implemented your agent, you can run it with the CLI by including the --agent-import-path flag instead of the --agent flag.

For example:

tb run \
    --dataset terminal-bench-core==head \
    --agent-import-path path.to.your.agent:YourCustomAgent \
    --task-id hello-world
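The `--agent-import-path` value uses the `module.path:ClassName` form, so `path/to/your/agent.py` becomes `path.to.your.agent`. The sketch below shows how such a string is conventionally resolved with `importlib`. It is an illustration under that assumption, not Terminal-Bench's actual loader, and `load_class` is a hypothetical helper.

```python
import importlib


def load_class(import_path: str) -> type:
    """Resolve a 'module.path:ClassName' string to the object it names."""
    module_name, sep, class_name = import_path.partition(":")
    if not sep or not module_name or not class_name:
        raise ValueError(f"expected 'module.path:ClassName', got {import_path!r}")
    # Raises ImportError if the module is not importable from sys.path.
    module = importlib.import_module(module_name)
    try:
        return getattr(module, class_name)
    except AttributeError:
        raise ImportError(
            f"module {module_name!r} has no attribute {class_name!r}"
        ) from None


# Demonstrated with a standard-library class so the snippet runs anywhere.
ordered_dict_cls = load_class("collections:OrderedDict")
```

Running `load_class` on your own import path from the directory where you invoke `tb` is one way to check that the module and class name are spelled correctly before starting a run.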

Getting help

Please reach out to us on Discord if you have any questions or feedback.