First Steps
Running the benchmark and testing your own agent.
Run the benchmark
First, set the API keys for the models you want to use. The LiteLLM documentation lists which key each model requires.
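A quick preflight check can catch a missing key before a long run starts. The sketch below assumes keys are supplied as environment variables and that the model name carries a provider prefix such as `anthropic/`. The helper `missing_api_keys` and the small `PROVIDER_KEY_VARS` table are illustrative names and not part of Terminal-Bench or LiteLLM, so consult the LiteLLM documentation for the providers you actually use.

```python
import os

# Illustrative mapping from a provider prefix to the environment variable
# LiteLLM reads for that provider. Extend it for the providers you use.
PROVIDER_KEY_VARS = {
    "anthropic": "ANTHROPIC_API_KEY",
    "openai": "OPENAI_API_KEY",
}


def missing_api_keys(model, env=None):
    """Return the key variables that `model` needs but `env` does not set."""
    env = os.environ if env is None else env
    # Assumption: the provider is the part of the model name before the first "/".
    provider = model.split("/", 1)[0]
    key_var = PROVIDER_KEY_VARS.get(provider)
    if key_var is None:
        # Unknown provider: this sketch has nothing to check.
        return []
    return [] if env.get(key_var) else [key_var]


# Example with an explicit, empty environment so the result is reproducible.
print(missing_api_keys("anthropic/claude-sonnet-4-20250514", env={}))
```

Calling `missing_api_keys(model)` without `env` checks the real process environment.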
The Terminal-Bench CLI can run many different benchmarks and benchmark versions. For example, it can run the terminal-bench-core-v0.1.0 dataset (the one used for the leaderboard) with the Terminus agent and Claude Sonnet 4.
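The exact command is not reproduced here, so the following is only a sketch of how such an invocation might be assembled from a script. Only `tb run` and `--agent` appear in this page. The `--dataset` and `--model` flag names, the way the dataset is spelled on the command line, and the model identifier are assumptions that should be confirmed with `tb run --help` and the LiteLLM model list.

```python
import shlex


def build_run_command(dataset, agent, model):
    """Assemble an argument list for `tb run` (flag names partly assumed)."""
    return [
        "tb", "run",
        "--dataset", dataset,  # assumed flag name
        "--agent", agent,      # flag named in this guide
        "--model", model,      # assumed flag name
    ]


command = build_run_command(
    dataset="terminal-bench-core-v0.1.0",        # name as written in this guide
    agent="terminus",                            # assumed CLI spelling of Terminus
    model="anthropic/claude-sonnet-4-20250514",  # assumed LiteLLM identifier
)

# Print the command for review instead of executing it.
print(shlex.join(command))
```

Once the flags are verified, the list can be passed to `subprocess.run(command, check=True)` to launch the run.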
Run tb run --help to see the available commands and options.
Testing a custom agent
Terminal-Bench comes with many popular agents available out of the box. Run tb run --help to see the available agents.
To test a custom agent, implement the BaseAgent interface. If your agent is available as a package (e.g. Claude Code or Codex), you can implement the AbstractInstalledAgent interface instead.
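To give a feel for what implementing such an interface involves, here is a self-contained sketch that does not import Terminal-Bench. `AgentInterface` and `RecordingSession` are hypothetical stand-ins, and the method names `name`, `perform_task`, and `send_keys` are assumptions about the general shape of the real classes. Check the actual BaseAgent definition in the Terminal-Bench source before writing a real agent.

```python
import shlex
from abc import ABC, abstractmethod


class RecordingSession:
    """Hypothetical stand-in for the terminal session the harness provides."""

    def __init__(self):
        self.sent = []

    def send_keys(self, keys):
        # Record the keystrokes instead of typing them into a real terminal.
        self.sent.append(keys)


class AgentInterface(ABC):
    """Hypothetical stand-in for BaseAgent; not the real Terminal-Bench class."""

    @staticmethod
    @abstractmethod
    def name():
        """Return a short identifier for the agent."""

    @abstractmethod
    def perform_task(self, instruction, session):
        """Work on the task by sending commands to the session."""


class EchoAgent(AgentInterface):
    """Toy agent that echoes the task instruction into the terminal."""

    @staticmethod
    def name():
        return "echo-agent"

    def perform_task(self, instruction, session):
        # Quote the instruction so the shell treats it as a single argument.
        session.send_keys(f"echo {shlex.quote(instruction)}\n")


session = RecordingSession()
EchoAgent().perform_task("list the files", session)
print(session.sent)
```

A real agent would replace the echo with a loop that reads terminal output, queries a model, and sends the next command.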
Once you have implemented your agent, run it with the CLI by passing the --agent-import-path flag instead of the --agent flag.
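As a rough sketch, a script could assemble the custom-agent invocation as shown here. The `--agent-import-path` flag comes from this guide. The `--dataset` flag name, the `module.path:ClassName` value format, and the module `my_agents.echo_agent` are assumptions used for illustration only, so confirm the expected format with `tb run --help`.

```python
import shlex


def build_custom_agent_command(dataset, agent_import_path):
    """Assemble `tb run` arguments for a custom agent (partly assumed flags)."""
    # --agent-import-path takes the place of --agent, so --agent is omitted.
    return [
        "tb", "run",
        "--dataset", dataset,  # assumed flag name
        "--agent-import-path", agent_import_path,
    ]


command = build_custom_agent_command(
    dataset="terminal-bench-core-v0.1.0",                # name as written in this guide
    agent_import_path="my_agents.echo_agent:EchoAgent",  # hypothetical module and class
)

# Print the command for review instead of executing it.
print(shlex.join(command))
```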
Getting help
Please reach out to us on Discord if you have any questions or feedback.