Terminal-Bench 3.0 Call for Contributions

We're excited to announce that Terminal-Bench 3.0, the next version of Terminal-Bench, is now in active development.

What We're Building

Our goal for Terminal-Bench 3.0 is 100 diverse tasks, with the best models solving at most 30% of them at release. We want tasks that are genuinely difficult: longer-horizon work, richer environments, and problems that require specialized expertise.
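
As a rough illustration of the difficulty target, the sketch below computes each model's solve rate and checks whether the strongest one stays at or below 30%. This is only one possible reading of the target; the function names, model names, and pass/fail results are hypothetical and are not part of any Terminal-Bench tooling.

```python
def solve_rate(results: dict[str, bool]) -> float:
    """Fraction of tasks solved, given a task-name -> passed mapping."""
    return sum(results.values()) / len(results)


def meets_difficulty_target(
    per_model: dict[str, dict[str, bool]], max_rate: float = 0.30
) -> bool:
    """True if even the best model's solve rate is at or below max_rate."""
    best = max(solve_rate(results) for results in per_model.values())
    return best <= max_rate


# Hypothetical results on a 10-task sample (assumed data, for illustration only).
tasks = [f"task_{i}" for i in range(10)]
per_model = {
    "model_a": {t: i < 2 for i, t in enumerate(tasks)},  # solves 2 of 10
    "model_b": {t: i < 1 for i, t in enumerate(tasks)},  # solves 1 of 10
}
print(meets_difficulty_target(per_model))
```

Under this reading, a task set counts as hard enough only if no single model exceeds the threshold, so adding one strong model can change the outcome.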

Terminal-Bench 2.0 covered software engineering, sys-admin, security, and scientific computing. For Terminal-Bench 3.0, we're expanding to an even wider variety of domains. Any realistic, valuable, and challenging computer task that can be accomplished via the command line and programmatically verified is fair game.

Open Call for Contributions

We're looking for contributors to help build Terminal-Bench 3.0. The merge window is open now through the end of May.

What makes a good task?

Real compensated computer work — something a human would be paid to do
Clear instructions and robust verification — a solution passes if and only if it correctly completes the task
Genuinely difficult — well beyond the current frontier of model ability
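
To make the verification criterion concrete, here is a minimal sketch of an outcome-based check for a deliberately simple, hypothetical task ("write the lines of an input file, sorted, to an output file"). The task, the function name, and the file paths are assumptions made for illustration; this is not the Terminal-Bench task format. The point is that the check inspects the end state rather than how the solution got there, so any correct solution passes and any incorrect one fails.

```python
from pathlib import Path


def verify_sorted_output(input_path: Path, output_path: Path) -> bool:
    """Hypothetical verifier: pass only if output_path holds the sorted lines of input_path."""
    # A missing output file means the task was not completed.
    if not output_path.is_file():
        return False
    expected = sorted(input_path.read_text().splitlines())
    actual = output_path.read_text().splitlines()
    # Exact comparison rejects dropped, duplicated, or misordered lines.
    return actual == expected
```

A check like this does not care which tool or command produced the output, which keeps it from rejecting valid but unexpected approaches.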

Strategies for creating challenging tasks include longer-horizon work (e.g. multi-day), richer environments (many microservices, filesystems, databases), and tasks requiring specialized expert-level knowledge.

How to contribute

Calibrate: Review the Task Proposal Rubric and existing tasks for the bar we're targeting
Ideate: Draft a task proposal for challenging computer work in your domain of expertise
Validate: Post your task idea in GitHub Discussions or Discord for maintainer approval
Create: Fork the repo, implement your task, submit a PR, and iterate with reviewers

For full details, see CONTRIBUTING.md.

Recognition

Because frontier tasks have become more difficult to create, contributors with even one accepted task will be acknowledged in the final release. Task contributors, both organizations and individuals, will be ordered by the number of tasks accepted into the dataset.

Get Involved

GitHub: harbor-framework/terminal-bench-3
Discord: Join #tb-3 for questions and discussion, and #tb-task-proposals for early feedback on task ideas
Weekly Meeting: Thursdays 9am PT — open to all contributors

We're building the next frontier benchmark for AI agents, and we'd love your help.