Contributing
Consider contributing to Terminal-Bench to become a co-author on our paper.
Terminal-Bench is looking for contributors to help build the first benchmark for AI agents that operate in the command line. Contributors will have the opportunity to become co-authors on a paper that we hope to submit to a top-tier ML venue like NeurIPS. Contributors will be considered eligible for co-authorship after earning 6 points.
There are three primary ways to contribute:
| Contribution | Points Awarded | Description |
|---|---|---|
| Author a Task | 2 points | A complete task consists of the task description, environment, solution, and unit tests. Points will be awarded once the task has been approved and merged into main. |
| Build an Adapter | 6 points | By contributing an adapter you'll help Terminal-Bench users evaluate their agents on third-party datasets. Points will be awarded once you've completed the parity experiments and we merge the dataset into main. |
| Help with Engineering | 1-6 points | A great way to contribute to Terminal-Bench is to help with miscellaneous engineering tasks. We will consider co-authorship for contributors who have spent at least 15 hours on the project. If you haven't been told you're an author yet, but you feel you've contributed enough, please reach out to a project lead. |
Points may also be awarded on an ad-hoc basis by project leads as an incentive.
Final authorship and author order will be determined at the discretion of the project leads by taking into account each contributor's point total and other intangible contributions.
Contributors with fewer than six points may be thanked in the acknowledgements.
Using AI coding tools
Our goal is to evaluate the performance of language models on these tasks, and overzealous use of coding assistants (e.g. Cursor, Claude Code) may produce tasks that are artificially easy for those models. That being said, these are powerful tools and can help accelerate task creation. Our official policy is that task descriptions and solutions should be human-written. Other files can be generated by coding assistants so long as they are human verified.
Task Requirements
Note
We might not merge tasks that don't meet our requirements, and we won't be able to give you credit for them!
Looking for inspiration to get you started? Check out our living list of task ideas.
Your task shouldn't be too similar to an existing task. It's your responsibility to check this before you begin writing one. You can browse the existing tasks here. If you're in doubt, please reach out to a project lead.
Tasks should be hard and realistic. The best tasks are real problems that you've personally struggled with before. Think back to a time when you spent an hour or two (or even days!) trying to accomplish some task on your computer.
However, tasks don't necessarily have to be something that you would personally do on the command line. For example, a human might not use the CLI to remove spam from their email (preferring to use the GUI), but an AI agent equipped with an API key might find it most convenient to write a simple script to iterate over all messages.
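As a purely illustrative sketch of what such a script could look like, the snippet below deletes spam over IMAP using Python's standard imaplib; the server address, credentials, and matching rule are hypothetical placeholders, not part of any real task.

```python
# Hypothetical sketch of an agent-written cleanup script; the host,
# credentials, and "spam" matching rule are placeholders for illustration.
import imaplib

HOST = "imap.example.com"   # placeholder mail server
USER = "agent@example.com"  # placeholder account
PASSWORD = "app-password"   # placeholder credential

with imaplib.IMAP4_SSL(HOST) as mail:
    mail.login(USER, PASSWORD)
    mail.select("INBOX")
    # Flag every message whose subject contains "spam" for deletion.
    _, data = mail.search(None, '(SUBJECT "spam")')
    for num in data[0].split():
        mail.store(num, "+FLAGS", "\\Deleted")
    mail.expunge()  # permanently remove the flagged messages
```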
We're looking for tasks that...
- Involve compiling complicated packages from source with exotic configurations
- Require running old code in a new environment
- Perform operations under constraints (e.g. optimize this code so that it passes all test cases but runs in under X milliseconds, train a model that gets X% on this benchmark)
- Identify and fix or exploit security vulnerabilities
- String together multiple API calls
- Perform complicated networking tasks
- Resolve package dependencies
- Deploy complicated (simulated) cloud environments
- Manipulate interactive environments like vim
- Have a potentially dangerous solution that the agent must avoid to pass
What we don't need
Through our initial testing we've found that some tasks are not a good fit for the benchmark. While we're excited about all contributions, we will not accept tasks that fall into the following categories:
Tasks that are too easy
We've found that agents can already solve many common CLI tasks, including:
- Manipulating file directories or changing permissions
- Answering general knowledge questions (e.g. "echo the name of the 13th President")
- Resolving common git states
Tasks that require dynamic external dependencies
It's important that a task passes its test cases both on the day you submit it and at any future point in time. For example, "What is the headline of the New York Times today?" and "Update this Pandas script to use the latest version" are weak tasks, since both have correct solutions that change depending on when the task is initiated.
Tasks that can't be sufficiently tested
Currently we support programmatic PyTest tests. This means that each instance has to have a well-defined final state that passes all tests. For example, a task like "Write the first two levels of the Sierpinski triangle in vim" sounds promising, but would be challenging to evaluate. Which characters should the model use? Does this instruction include the zeroth level? At what scale should the triangles be drawn?
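By contrast, a task whose final state is unambiguous is straightforward to test. As a purely illustrative sketch (the path and expected contents below are hypothetical, not taken from a real task), a task asking the agent to write "hello world" to /app/output.txt could be verified like this:

```python
# Hypothetical unit tests for an imaginary task; the path and expected
# contents are placeholders, not taken from a real Terminal-Bench task.
from pathlib import Path


def test_output_file_exists():
    # The task must leave behind a concrete artifact we can check.
    assert Path("/app/output.txt").exists()


def test_output_file_contents():
    # The final state is well defined, so an exact comparison suffices.
    assert Path("/app/output.txt").read_text().strip() == "hello world"
```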
In the future we may support LLM-as-a-judge tests which would enable open-ended tasks.