Submitting to the Leaderboard

Implement your agent, or use Terminus

See our integration tutorial if you haven't integrated your agent yet.

If you want to evaluate a model rather than an agent, consider using Terminus.

Evaluate your model or agent

To submit to the leaderboard, you'll need to evaluate your agent on the terminal-bench-core@0.1.1 dataset. If your agent isn't integrated yet, see our integration tutorial first.

Run the following command to evaluate your agent. This example uses the Terminus agent; replace <some-model> with the name of your model:

tb run \
    --agent terminus \
    --model <some-model> \
    --dataset-name terminal-bench-core \
    --dataset-version 0.1.1

Then reach out to us at mikeam@cs.stanford.edu or alex@laude.org and we'll add your results to the leaderboard.

Pro tip: quickly evaluate a new model

If you only want to evaluate a new model that only your API key can access, you don't need to clone the repo or install the package.

Just run the following command (make sure uv is installed):

uvx terminal-bench run -n terminal-bench-core -v 0.1.1 -a terminus -m <your-model>
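
As a rough sketch, the one-liner above could be wrapped in a small shell function that checks for a model name and for uvx before running anything. The names run_eval and DRY_RUN are hypothetical and made up for this example; only the uvx command and its flags come from this page.

```shell
#!/usr/bin/env bash
# Sketch only: run_eval and DRY_RUN are hypothetical names for this example,
# not part of terminal-bench.

run_eval() {
  local model="${1:-}"

  # Refuse to run without a model name.
  if [ -z "$model" ]; then
    echo "usage: run_eval <model-name>" >&2
    return 2
  fi

  # Same command and flags as shown above; only the model name varies.
  set -- uvx terminal-bench run -n terminal-bench-core -v 0.1.1 -a terminus -m "$model"

  # With DRY_RUN=1, print the command instead of running it.
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "$*"
    return 0
  fi

  # uvx ships with uv, so a missing uvx means uv is not installed.
  if ! command -v uvx >/dev/null 2>&1; then
    echo "uvx not found; install uv first" >&2
    return 1
  fi

  "$@"
}
```

After sourcing this in bash, DRY_RUN=1 run_eval my-model prints the command without running it, and run_eval my-model starts the evaluation.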
