Submitting to the Leaderboard
How to submit your agent results to the Terminal-Bench leaderboard.
Implement your agent, or use Terminus
See our integration tutorial if you haven't integrated your agent yet.
If you want to evaluate a model rather than an agent, consider using Terminus.
Evaluate your model or agent
To submit to the leaderboard, you must evaluate your agent on the terminal-bench-core@0.1.1
dataset. If your agent is not integrated yet, follow our integration tutorial first.
Run the following command to evaluate your agent:
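A minimal sketch of what such an invocation could look like, assuming the `tb` CLI with a `run` subcommand and the `--dataset-name`, `--dataset-version`, `--agent`, and `--model` flags. The agent and model values are placeholders, and `eval_cmd` is a hypothetical helper that only prints the command so you can check it before launching a full run:

```shell
# Placeholders: replace with your own agent and model identifiers.
AGENT_NAME="${AGENT_NAME:-terminus}"
MODEL_NAME="${MODEL_NAME:-your-provider/your-model}"

# Assumed CLI shape: tb run with dataset, agent, and model flags.
# eval_cmd is a hypothetical helper; it prints the command without running it.
eval_cmd() {
    printf 'tb run --dataset-name terminal-bench-core --dataset-version 0.1.1 --agent %s --model %s\n' \
        "$AGENT_NAME" "$MODEL_NAME"
}

# Dry run: show the command that would be executed.
eval_cmd

# To actually start the evaluation, uncomment the next line:
# eval "$(eval_cmd)"
```

Check the flag names against `tb run --help` for your installed version before relying on them.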
Then reach out to us at mikeam@cs.stanford.edu or alex@laude.org and we'll add your results to the leaderboard.
Pro tip: quickly evaluate a new model
If you only want to evaluate a new model that your API key alone can access, you don't need to clone the repo or install the package.
Just run the following command (make sure uv is installed):
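As a rough sketch only: `uvx` (installed with uv) can run a CLI from a temporary environment via its `--from` option. The example assumes the package is published as `terminal-bench`, exposes a `tb` entry point, and accepts the flags shown; the model name is a placeholder, and `quick_eval_cmd` is a hypothetical helper that prints the command rather than running it:

```shell
# quick_eval_cmd is a hypothetical helper: it prints an assumed uvx invocation
# for the model passed as its first argument, without executing anything.
quick_eval_cmd() {
    printf 'uvx --from terminal-bench tb run --dataset-name terminal-bench-core --dataset-version 0.1.1 --agent terminus --model %s\n' \
        "$1"
}

# Dry run with a placeholder model name.
quick_eval_cmd "your-provider/your-model"

# To actually run it, export your provider's API key first, then uncomment:
# eval "$(quick_eval_cmd "your-provider/your-model")"
```

The environment variable that holds the API key depends on your model provider.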