Terminus is our research-preview agent for consistently evaluating how well language models can power autonomous agents in the terminal. It is the second-highest-performing agent on Terminal-Bench, outperformed at release only by Claude Code.
Motivation
While building Terminal-Bench, we realized that third-party agents like Codex CLI, Goose, and Claude Code are not the most appropriate scientific instruments for measuring model performance because they:
- Do not operate entirely autonomously. Because of well-founded security concerns, some agents frequently ask for user approval or lack complete control over the terminal (e.g., they can only execute whitelisted commands or cannot launch additional shells). Terminus is an "autonomy-first" agent that follows its initial user directives within its sandboxed environment, with no artificial limitations on its behavior.
- Are not compatible with arbitrary language models. For example, Claude Code is only compatible with Anthropic's models and cannot be used to evaluate models served through other APIs or hosted open-source models. Furthermore, frontier labs like OpenAI may have designed their agents with the specific limitations and capabilities of their own models in mind. We designed Terminus to be agnostic to specific models, and it supports virtually any API or locally hosted model through LiteLLM.
- Operate directly inside the (potentially buggy) environment. Installed agents like Claude Code run directly inside the shell they control. This means that all agent logic, such as control loops, tool execution, and API calls, runs inside the containerized environment. In realistic, complex environments, such as those with broken installation dependencies, minimal resources, or buggy networking, it is challenging to install and run these agents. Terminus connects remotely to its dockerized execution environment through a tmux session, allowing it to execute even in these problematic cases.
Technical Design
Terminus was designed as a high-performance, neutral test bed for language models acting as agents in the terminal. It operates without any intermediate user input or limitations on its control over its Docker container, and it was designed without any particular language model in mind.
Mono-tool Design
Terminus has only a single tool at its disposal: an interactive tmux session running inside its execution environment. While other agents have dedicated tools for editing files, executing bash commands, or downloading files, Terminus operates purely by sending standard keystrokes into the tmux session. This gives it the flexibility to decide on its own how to execute a sub-task such as writing a script to a file: it could echo the contents directly into the file, or it could operate an interactive text editor like vim or emacs. It also means that Terminus can scroll, use arrow keys to navigate a menu, and launch additional shells to accomplish its tasks.
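As a rough illustration of the keystroke-only interface described above, the sketch below builds the `tmux send-keys` and `tmux capture-pane` invocations that a single-tool agent could rely on. The function names, the session name, and the example keystroke sequences are hypothetical and are not taken from the Terminus source; only the tmux subcommands and key names are real.

```python
import subprocess
from typing import List


def send_keys_argv(session: str, keys: List[str]) -> List[str]:
    """Build a `tmux send-keys` command.

    Each item in `keys` is either literal text or a tmux key name
    such as "Enter", "Down", or "C-c".
    """
    return ["tmux", "send-keys", "-t", session, *keys]


def capture_pane_argv(session: str) -> List[str]:
    """Build a `tmux capture-pane` command; -p prints the pane to stdout."""
    return ["tmux", "capture-pane", "-p", "-t", session]


def send_keys(session: str, keys: List[str]) -> None:
    """Send keystrokes to a running tmux session (requires tmux)."""
    subprocess.run(send_keys_argv(session, keys), check=True)


def read_screen(session: str) -> str:
    """Return the text currently visible in the session's pane (requires tmux)."""
    result = subprocess.run(
        capture_pane_argv(session), check=True, capture_output=True, text=True
    )
    return result.stdout


# Hypothetical keystroke sequences: the same primitive covers writing a file
# and navigating an interactive menu with arrow keys.
WRITE_FILE = ["echo 'print(1)' > script.py", "Enter"]
MENU_DOWN_AND_SELECT = ["Down", "Down", "Enter"]
```

Because every action is just a list of keystrokes, no separate file-editing or download tool is needed in this sketch; the model chooses how to reach its goal with whatever programs the shell offers.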
Independent Execution
Terminus "lives" outside the environment it operates in. In our harness, its logic runs in a Python process that is independent of the task's Docker container. This design choice allows it to connect remotely to arbitrary computer environments.
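One way to picture this separation is a host-side Python process that reaches the container's tmux session through `docker exec`. This is a minimal sketch under that assumption; the source does not say which transport the harness actually uses, and the helper names and the container and session names are hypothetical.

```python
import shlex
from typing import List


def in_container(container: str, argv: List[str]) -> List[str]:
    """Wrap a command so that it runs inside a named Docker container."""
    return ["docker", "exec", container, *argv]


def remote_send_keys_argv(container: str, session: str, keys: List[str]) -> List[str]:
    """Build a host-side command that types keys into tmux inside the container."""
    return in_container(container, ["tmux", "send-keys", "-t", session, *keys])


if __name__ == "__main__":
    # Assumed names for illustration only; running the printed command
    # requires Docker and a container with tmux available.
    argv = remote_send_keys_argv("task-container", "agent", ["ls -la", "Enter"])
    print(shlex.join(argv))
```

In this arrangement the container only needs tmux. The control loop and model API calls stay on the host, so broken dependencies or networking inside the container do not prevent the agent from running.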
Autonomy-First
Terminus never asks for user input and instead works independently to complete its task to the best of its ability. Given the current limitations of language models to operate safely, this makes it inappropriate for sensitive environments where the agent could cause irreparable damage. However, since the Terminal-Bench harness is entirely sandboxed, this level of autonomy is appropriate for understanding the capabilities of frontier models. We believe that in the near future agents like Terminus will cross a threshold and be trusted even in non-sandboxed environments. Terminus, and Terminal-Bench more broadly, are designed to signal when this critical point is reached.
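The autonomy-first behavior can be sketched as a loop that alternates between reading the screen and sending keystrokes, with no branch that pauses for a human. Everything here is an assumption made for illustration: the callable signatures, the convention that the model returns `None` when it considers the task done, and the `max_steps` budget are not drawn from the Terminus source.

```python
from typing import Callable, List, Optional


def run_autonomously(
    task: str,
    next_keys: Callable[[str, str], Optional[List[str]]],
    send_keys: Callable[[List[str]], None],
    read_screen: Callable[[], str],
    max_steps: int = 50,
) -> int:
    """Drive a terminal until the model stops or the step budget runs out.

    `next_keys(task, screen)` stands in for a language-model call and returns
    the keystrokes to send, or None when it considers the task finished.
    Returns the number of keystroke batches that were sent.
    """
    for step in range(max_steps):
        screen = read_screen()
        keys = next_keys(task, screen)
        if keys is None:
            return step
        # No confirmation prompt: the keys go straight to the terminal.
        send_keys(keys)
    return max_steps


if __name__ == "__main__":
    # Scripted stand-ins for the model and the terminal.
    script = iter([["ls", "Enter"], ["pwd", "Enter"], None])
    sent: List[List[str]] = []
    steps = run_autonomously(
        "list files", lambda task, screen: next(script), sent.append, lambda: ""
    )
    print(steps, sent)
```

Passing the model and the terminal in as callables keeps the loop independent of any particular model API, in the same spirit as the model-agnostic design described earlier.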