Creating Private Sets

Terminal-Bench is a broad set of tasks meant to test the general ability of an agent to perform complex tasks in a terminal environment.

However, many agents are designed for narrower use cases. The harness supports these too: agent developers can create a custom set of tasks tailored to their agent.

Creating a custom task

tb tasks create --task-dir <path>

This starts an interactive wizard that guides you through creating a new task in the given directory.

Running a custom dataset

Run the harness with the --dataset-path flag instead of the --dataset flag.

tb run --dataset-path <path>
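
As a rough sketch, the invocation above can be wrapped in a small script that falls back to a dry run when the harness or the task directory is missing. The directory name `./my-private-tasks` and the `DATASET_DIR` variable are hypothetical placeholders, and a real run may need further flags (such as agent selection) that this page does not cover.

```shell
#!/bin/sh
# Hypothetical location of the private task set; substitute your own path.
DATASET_DIR="${DATASET_DIR:-./my-private-tasks}"

if command -v tb >/dev/null 2>&1 && [ -d "$DATASET_DIR" ]; then
    # tb is installed and the directory exists: point the harness at it.
    # Additional flags may be required; they are omitted here.
    tb run --dataset-path "$DATASET_DIR"
else
    # Dry run: show the command that would be executed.
    echo "would run: tb run --dataset-path $DATASET_DIR"
fi
```

Setting `DATASET_DIR` in the environment before calling the script selects a different task set without editing the script.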

You can also create a custom registry if you prefer to use the name/version convention of Terminal-Bench.
