Creating Private Sets
How to create a private set of tasks.
Terminal-Bench is a broad set of tasks that tests an agent's general ability to complete complex work in a terminal environment.
However, many agents are designed for narrower use cases. The harness still works for them: agent developers can easily create a custom set of tasks tailored to their agent.
Creating a custom task
Run the task-creation command from the Terminal-Bench CLI. It starts an interactive wizard that guides you through creating a new task.
Running a custom dataset
Run the harness with the --dataset-path flag instead of the --dataset-name and --dataset-version flags.
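The flag choice described above can be sketched as a small helper that builds the harness arguments. This is only an illustration, not part of Terminal-Bench: the function name build_harness_args, the HARNESS_CMD prefix, and the example values "./my-tasks", "my-dataset", and "1.0" are assumptions made for the example. Only the three flag names come from this page. Treating the two styles as mutually exclusive is also an assumption, drawn from the word "instead".

```python
from typing import List, Optional

# Assumption: the harness entry point. Adjust to however you invoke the harness.
HARNESS_CMD = ["tb", "run"]


def build_harness_args(
    dataset_path: Optional[str] = None,
    dataset_name: Optional[str] = None,
    dataset_version: Optional[str] = None,
) -> List[str]:
    """Build the argument list for one harness run (hypothetical helper).

    Pass either dataset_path (a local, private task set) or both
    dataset_name and dataset_version (a dataset from a registry).
    """
    if dataset_path is not None:
        # Assumed: a local path replaces the name/version pair entirely.
        if dataset_name is not None or dataset_version is not None:
            raise ValueError(
                "use --dataset-path or --dataset-name/--dataset-version, not both"
            )
        return HARNESS_CMD + ["--dataset-path", dataset_path]

    if dataset_name is None or dataset_version is None:
        raise ValueError(
            "provide --dataset-path, or both --dataset-name and --dataset-version"
        )
    return HARNESS_CMD + [
        "--dataset-name", dataset_name,
        "--dataset-version", dataset_version,
    ]


if __name__ == "__main__":
    # Private task set stored in a local directory (example path).
    print(" ".join(build_harness_args(dataset_path="./my-tasks")))
    # Registry-style lookup by name and version (example values).
    print(" ".join(build_harness_args(dataset_name="my-dataset",
                                      dataset_version="1.0")))
```

The returned list can be passed to subprocess.run along with whatever agent and model options your setup needs. Those options are omitted here because this page does not cover them.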
You can also create a custom registry if you prefer to use the name/version convention of Terminal-Bench.