
Creating Private Sets

How to create a private set of tasks.

Terminal-Bench is a broad set of tasks meant to test the general ability of an agent to perform complex tasks in a terminal environment.

However, many agents are designed for narrower use cases. The harness supports these as well: agent developers can create a custom set of tasks tailored to their agent.

Creating a custom task

tb tasks create --task-dir <path>

This starts an interactive wizard that guides you through creating a new task.

Running a custom dataset

Run the harness with the --dataset-path flag instead of the --dataset-name and --dataset-version flags.

tb run --dataset-path <path>
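
The choice between the two flag styles can be sketched as a small shell helper. This is only an illustration: `tb_run_cmd`, the path `./my-private-tasks`, and the name/version pair `my-dataset 1.0` are hypothetical and not part of Terminal-Bench. The helper prints the command instead of running it, and it leaves out any other options your run may need.

```shell
#!/bin/sh
# Hypothetical helper (not part of Terminal-Bench): prints the `tb run`
# command for either a local dataset path or a registry name/version pair.
# Only the flags named in this page are used.
tb_run_cmd() {
    case "$#" in
        # One argument: treat it as a local dataset path.
        1) printf 'tb run --dataset-path %s\n' "$1" ;;
        # Two arguments: treat them as a dataset name and version.
        2) printf 'tb run --dataset-name %s --dataset-version %s\n' "$1" "$2" ;;
        *) echo "usage: tb_run_cmd <path> | <name> <version>" >&2
           return 1 ;;
    esac
}

# Example values are placeholders; substitute your own.
tb_run_cmd ./my-private-tasks
tb_run_cmd my-dataset 1.0
```

Printing the command rather than executing it lets you check which flags will be passed before starting a run.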

You can also create a custom registry if you prefer to use the name/version convention of Terminal-Bench.
