We're excited to announce that Terminal-Bench 3.0, the next version of Terminal-Bench, is now in active development.
What We're Building
Our goal for Terminal-Bench 3.0 is 100 diverse tasks, targeting a solve rate of at most 30% for the best models at release. We want tasks that are genuinely difficult: longer-horizon, set in richer environments, and requiring specialized expertise.
Terminal-Bench 2.0 covered software engineering, sys-admin, security, and scientific computing. For Terminal-Bench 3.0, we're expanding to an even wider variety of domains. Any realistic, valuable, and challenging computer task that can be accomplished via the command line and programmatically verified is fair game.
Open Call for Contributions
We're looking for contributors to help build Terminal-Bench 3.0. The merge window is open now through the end of April.
What makes a good task?
- Real compensated computer work — something a human would be paid to do
- Clear instructions and robust verification — a solution passes if and only if it correctly completes the task
- Genuinely difficult — well beyond the current frontier of model ability
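To make the "if and only if" criterion concrete, here is a minimal sketch of an outcome-based check. The example task (writing a file's lines in sorted order), the function name `verify_sorted_output`, and the file layout are all hypothetical and are not the Terminal-Bench task format; CONTRIBUTING.md describes the real structure.

```python
from pathlib import Path


def verify_sorted_output(input_path: Path, output_path: Path) -> bool:
    """Hypothetical check: pass iff the output holds the input's lines, sorted.

    The check inspects only the final state, not how the agent produced it,
    so any correct approach passes and any incorrect result fails.
    """
    if not output_path.exists():
        # No output at all is a failure, not an error.
        return False
    expected = sorted(input_path.read_text().splitlines())
    actual = output_path.read_text().splitlines()
    # Exact equality rejects missing, extra, or misordered lines.
    return actual == expected
```

Comparing against a result computed from the input, instead of checking for a particular command or intermediate file, is one way to avoid both false passes and false failures.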
Strategies for creating challenging tasks include longer-horizon work (e.g. multi-day), richer environments (many microservices, filesystems, databases), and tasks requiring specialized expert-level knowledge.
How to contribute
- Calibrate: Review the Task Proposal Rubric and existing tasks for the bar we're targeting
- Ideate: Draft a task proposal for challenging computer work in your domain of expertise
- Validate: Post your task idea in GitHub Discussions or Discord for maintainer approval
- Create: Fork the repo, implement your task, submit a PR, and iterate with reviewers
For full details, see CONTRIBUTING.md.
Recognition
Because frontier tasks are harder to create, contributors with even one accepted task will be acknowledged in the final release. Task contributors, both organizations and individuals, will be listed in order of the number of tasks accepted into the dataset.
Get Involved
- GitHub: harbor-framework/terminal-bench-3
- Discord: Join #tb-3 for questions and discussion, and #tb-task-proposals for early feedback on task ideas
- Weekly Meeting: Thursdays 9am PT — open to all contributors
We're building the next frontier benchmark for AI agents, and we'd love your help.