First Steps
Running the benchmark and testing your own agent.
Run the benchmark
First, set the API keys for the models you want to use. Refer to the LiteLLM documentation to see which key each model expects.
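A quick pre-flight check can catch a missing key before a run starts. The sketch below is illustrative only: the helper name `missing_key_for` and the prefix table are hypothetical, and the variable names (`ANTHROPIC_API_KEY`, `OPENAI_API_KEY`) are the ones LiteLLM conventionally reads for those providers. Confirm them against the LiteLLM documentation for your model.

```python
import os
from typing import Mapping, Optional

# Assumption: provider prefix -> environment variable that LiteLLM reads.
# Extend this table after checking the LiteLLM docs for your provider.
PROVIDER_KEY_VARS = {
    "anthropic/": "ANTHROPIC_API_KEY",
    "openai/": "OPENAI_API_KEY",
}


def missing_key_for(
    model_name: str, env: Optional[Mapping[str, str]] = None
) -> Optional[str]:
    """Return the name of the unset key variable for a model, or None if nothing is missing."""
    env = os.environ if env is None else env
    for prefix, var in PROVIDER_KEY_VARS.items():
        if model_name.startswith(prefix):
            # An empty string counts as unset.
            return None if env.get(var) else var
    # Unknown provider: this sketch has nothing to check.
    return None


if __name__ == "__main__":
    missing = missing_key_for("anthropic/claude-sonnet-4-20250514")
    if missing:
        print(f"Set {missing} before running the benchmark.")
```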
The Terminal-Bench CLI can run many different benchmarks and versions of benchmarks.
To see the available benchmarks, run tb datasets list.
For example, to run the latest pre-release terminal-bench-core dataset with the Terminus agent and Claude Sonnet 4:
tb run \
    --dataset terminal-bench-core==head \
    --agent terminus \
    --model-name anthropic/claude-sonnet-4-20250514 \
    --task-id hello-world
Run tb run --help to see the available commands and options.
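If you would rather launch runs from a script, the same invocation can be assembled in Python. This is a minimal sketch that uses only the flags shown above. The helpers `build_tb_command` and `run_tb` are hypothetical names, not part of Terminal-Bench, and the sketch assumes the tb executable is on your PATH.

```python
import shlex
import subprocess
from typing import List


def build_tb_command(dataset: str, agent: str, model_name: str, task_id: str) -> List[str]:
    """Assemble the argument list for a single-task `tb run` invocation."""
    return [
        "tb", "run",
        "--dataset", dataset,
        "--agent", agent,
        "--model-name", model_name,
        "--task-id", task_id,
    ]


def run_tb(cmd: List[str]) -> None:
    """Launch the benchmark; raises CalledProcessError on a non-zero exit code."""
    subprocess.run(cmd, check=True)


if __name__ == "__main__":
    cmd = build_tb_command(
        dataset="terminal-bench-core==head",
        agent="terminus",
        model_name="anthropic/claude-sonnet-4-20250514",
        task_id="hello-world",
    )
    # Print the command for review; call run_tb(cmd) to actually start the run.
    print(shlex.join(cmd))
```

Passing the arguments as a list rather than a single shell string avoids quoting problems with values such as terminal-bench-core==head.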
Testing a custom agent
Terminal-Bench comes with many popular agents available out of the box. Run tb run --help to see the available agents.
To test a custom agent, implement the BaseAgent interface. If your agent is accessible as a package (e.g. Claude Code or Codex), you can implement the AbstractInstalledAgent interface instead.
from pathlib import Path

from terminal_bench.agents import BaseAgent, AgentResult
# or AbstractInstalledAgent if your agent is accessible as a package
from terminal_bench.terminal.tmux_session import TmuxSession


class YourCustomAgent(BaseAgent):
    @staticmethod
    def name() -> str:
        # Name that identifies this agent.
        return "your-agent-name"

    def perform_task(
        self,
        task_description: str,
        session: TmuxSession,
        logging_dir: Path | None = None,
    ) -> AgentResult:
        # Your agent logic goes here; it must return an AgentResult.
        ...
Once you have implemented your agent, you can run it with the CLI by passing the --agent-import-path flag instead of the --agent flag.
For example:
tb run \
    --dataset terminal-bench-core==head \
    --agent-import-path path.to.your.agent:YourCustomAgent \
    --task-id hello-world
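The import path has the form module.path:ClassName, so the module must be importable from the environment where tb runs. The sketch below shows how a path of this shape is commonly resolved in Python. It is not necessarily how Terminal-Bench resolves it internally, and `load_agent_class` is a hypothetical helper. It can be handy for checking that your path resolves before starting a run.

```python
import importlib


def load_agent_class(import_path: str) -> type:
    """Resolve a 'module.path:ClassName' string to the class it names."""
    module_name, sep, class_name = import_path.partition(":")
    if not sep or not module_name or not class_name:
        raise ValueError(
            f"Expected 'module.path:ClassName', got {import_path!r}"
        )
    # Raises ModuleNotFoundError if the module is not importable.
    module = importlib.import_module(module_name)
    # Raises AttributeError if the module has no such attribute.
    return getattr(module, class_name)


if __name__ == "__main__":
    # Demonstrated with a standard-library class; substitute your own path,
    # e.g. "path.to.your.agent:YourCustomAgent".
    cls = load_agent_class("pathlib:Path")
    print(cls.__name__)
```

If this raises ModuleNotFoundError for your own path, the package is probably not installed or not on the Python path in the current environment.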
Getting help
Please reach out to us on Discord if you have any questions or feedback.