terminal-bench

Introduction

Overview of the Terminal-Bench agent interface.

Creating a custom agent

Terminal-Bench tasks are complex environments with pre-configured lifecycles, including custom startup and teardown instructions. The Terminal-Bench harness handles this complexity for you, so you can focus on writing your agent.

If you haven't already, make sure to install Terminal-Bench by following our installation guide.

Integration Approaches

There are three approaches to integrating your own agent into Terminal-Bench:

  1. Install and run in the task container
  2. Implement the BaseAgent interface
  3. Implement the MCPAgent interface

Install and run in the task container

The most straightforward approach is to install your agent as a package in the task container.

We've provided an AbstractInstalledAgent class that you can subclass to install and run your agent in the task container. Subclassing it involves defining three things: an installation script, the command that runs your agent, and any environment variables it needs.
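To make those three pieces concrete, here is a standalone sketch. It does not import Terminal-Bench: the base class `InstalledAgentSketch` and its members `install_script`, `run_command`, and `env` are hypothetical names chosen for illustration, not the real `AbstractInstalledAgent` API, and `my-agent` is a made-up package. Check the actual class for the exact hooks to override.

```python
import shlex
from abc import ABC, abstractmethod


class InstalledAgentSketch(ABC):
    """Hypothetical stand-in modeling AbstractInstalledAgent's three pieces."""

    @property
    @abstractmethod
    def install_script(self) -> str:
        """Shell script that installs the agent inside the task container."""

    @abstractmethod
    def run_command(self, task_description: str) -> str:
        """Shell command that launches the agent on one task."""

    @property
    def env(self) -> dict[str, str]:
        """Environment variables the agent needs (none by default)."""
        return {}


class MyCliAgent(InstalledAgentSketch):
    # "my-agent" is a hypothetical package and CLI name.
    @property
    def install_script(self) -> str:
        return "#!/bin/bash\nset -e\npip install my-agent\n"

    def run_command(self, task_description: str) -> str:
        # Quote the description so spaces and quotes survive the shell.
        return f"my-agent --task {shlex.quote(task_description)}"

    @property
    def env(self) -> dict[str, str]:
        return {"MY_AGENT_MODEL": "example-model"}


agent = MyCliAgent()
print(agent.run_command("Fix the failing test"))
```

Quoting the task description with `shlex.quote` matters here because task descriptions are free-form text that will be pasted into a shell inside the container.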

Implement the BaseAgent interface

Implementing the BaseAgent interface is more robust than installing your agent in the task container, but depending on how your agent is built, adapting it can take a bit of surgery.

The BaseAgent interface only requires you to implement one method:

```python
# Method of BaseAgent; Path is from pathlib, TmuxSession and AgentResult come from Terminal-Bench.
def perform_task(
    self,
    task_description: str,
    session: TmuxSession,
    logging_dir: Path | None = None,
) -> AgentResult:
    """
    Perform the task described by `task_description`.

    Args:
        task_description: The description of the task to perform.
        session: The tmux session to send commands to.
        logging_dir: The directory to optionally log your agent's output.

    Returns:
        An `AgentResult` object with the agent's token counts and optionally a
        failure mode for debugging.
    """
    pass
```
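The sketch below shows the general shape of a `perform_task` implementation. So that it runs without Terminal-Bench installed, it defines small stand-ins: `FakeSession` (with a hypothetical `send_keys` method that only records commands) replaces `TmuxSession`, and the `AgentResult` dataclass guesses at field names for the token counts and failure mode. The real classes may differ, so treat every stand-in as an assumption. `plan_commands` is a hypothetical placeholder for your agent's own logic.

```python
from __future__ import annotations

import shlex
from dataclasses import dataclass
from pathlib import Path


@dataclass
class AgentResult:
    # Stand-in for Terminal-Bench's AgentResult; field names are assumptions.
    total_input_tokens: int = 0
    total_output_tokens: int = 0
    failure_mode: str | None = None


class FakeSession:
    """Stand-in for TmuxSession that records commands instead of running them."""

    def __init__(self) -> None:
        self.sent: list[str] = []

    def send_keys(self, keys: str) -> None:
        self.sent.append(keys)


def plan_commands(task_description: str) -> list[str]:
    # Hypothetical placeholder: a real agent would call a model here.
    return ["ls -la", f"echo {shlex.quote(task_description)}"]


class EchoAgent:
    # In Terminal-Bench this class would subclass BaseAgent.
    def perform_task(
        self,
        task_description: str,
        session: FakeSession,
        logging_dir: Path | None = None,
    ) -> AgentResult:
        commands = plan_commands(task_description)
        for command in commands:
            session.send_keys(command)
        if logging_dir is not None:
            # Keep a transcript of what was sent, for later debugging.
            logging_dir.mkdir(parents=True, exist_ok=True)
            (logging_dir / "commands.txt").write_text("\n".join(commands) + "\n")
        # This toy agent calls no model, so it reports zero tokens.
        return AgentResult()


session = FakeSession()
result = EchoAgent().perform_task("hello world", session)
print(session.sent)
```

A real agent would usually loop instead of sending a fixed plan: read the terminal output after each command and let the model choose the next step before returning the result.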

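To integrate your agent by implementing the BaseAgent interface, you'll need to subclass BaseAgent and implement perform_task, using the provided session to run commands and returning an AgentResult.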

Want help?

We are also willing to pair-program with you to integrate your agent. Email mikeam@cs.stanford.edu or alex@laude.org (or both!) if you'd like help.

Implement the MCPAgent interface

Coming soon!

Running a custom agent

Once you have implemented your custom agent, you can run it by using the --agent-import-path flag when running the harness.

```bash
tb run --agent-import-path path.to.your.agent:YourCustomAgent ...
```
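The value passed to `--agent-import-path` has the form `module.path:ClassName`. As a rough illustration of how such a string can be resolved (a generic `importlib` sketch with a hypothetical `load_agent_class` helper, not the harness's actual loader), the module part is imported and the class is looked up on it by name:

```python
import importlib


def load_agent_class(import_path: str) -> type:
    """Resolve a 'package.module:ClassName' string to the class object."""
    module_path, sep, class_name = import_path.partition(":")
    if not sep or not module_path or not class_name:
        raise ValueError(f"expected 'module.path:ClassName', got {import_path!r}")
    module = importlib.import_module(module_path)
    try:
        return getattr(module, class_name)
    except AttributeError:
        raise ValueError(
            f"{class_name!r} not found in module {module_path!r}"
        ) from None


# Demonstrated with a standard-library class; for your agent the string would
# look like "path.to.your.agent:YourCustomAgent".
print(load_agent_class("collections:OrderedDict"))
```

If the harness resolves the path in a similar way, the module likely needs to be importable from wherever you invoke `tb run`, for example by running from your project root or installing your agent as a package.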