
Warp achieved a new state of the art on Terminal-bench, resolving 52% of tasks in Terminal-bench-core v0.1, a 9-percentage-point gain over the previous state of the art. This marks the second big step forward on Terminal-bench, after Anthropic debuted Opus 4 with a 43% task resolution rate. Read about how Warp hill-climbed Terminal-bench in their latest blog post.
Task completion comparison between Warp and Claude Code (Opus 4). Each agent was given 5 attempts per task; a task counts as resolved if at least one attempt succeeds.
Tasks Warp uniquely solved (12): cartpole-rl-training, configure-git-webserver, cron-broken-network, extract-safely, git-multibranch, hf-model-inference, oom, openssl-selfsigned-cert, pytorch-model-cli.hard, sanitize-git-repo, solana-data, train-fasttext.
Tasks Claude Code (Opus 4) uniquely solved (5): build-tcc-qemu, count-dataset-tokens, decommissioning-service-with-sensitive-data, download-youtube, jupyter-notebook-server.
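For readers who want to reproduce a comparison like the one above, here is a minimal sketch of the scoring logic, assuming each agent's results are a mapping from task name to its five per-attempt pass/fail booleans. The names and data shapes below are illustrative, not the harness's actual API:

```python
from typing import Dict, List

# Hypothetical shape: task name -> per-attempt pass/fail results.
AgentResults = Dict[str, List[bool]]

def solved(attempts: List[bool]) -> bool:
    """A task counts as resolved if any of its attempts succeeded."""
    return any(attempts)

def resolution_rate(results: AgentResults) -> float:
    """Fraction of tasks the agent resolved at least once."""
    return sum(solved(attempts) for attempts in results.values()) / len(results)

def uniquely_solved(a: AgentResults, b: AgentResults) -> set:
    """Tasks that agent A resolved but agent B never did."""
    return {task for task in a if solved(a[task]) and not solved(b.get(task, []))}
```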
Will Terminal-bench be saturated soon?
We're working hard to make sure that doesn't happen!
Terminal-bench itself is a framework for evaluating agents in the terminal. Our initial batch of 80 tasks, Terminal-bench-core v0.1, might saturate soon, although the last 50% of tasks is harder than the first. For example, some tasks that no agent has solved yet involve building the Linux kernel from source (build-linux-kernel-qemu) or reversing a decompression algorithm under constraints (write-compressor).
We built versioning into the Terminal-bench framework precisely so that we can keep releasing batches of harder tasks, extending the benchmark's longevity, and so that others can develop their own benchmarks on the framework. Check out the Terminal-bench registry docs to learn more.
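To make the versioning idea concrete, here is a hypothetical sketch of selecting one versioned batch of tasks from a registry file. The registry.yaml path, its schema, and these helpers are assumptions for illustration, not Terminal-bench's actual registry format:

```python
import yaml  # requires pyyaml

def load_tasks(registry_path: str, dataset: str, version: str) -> list[str]:
    """Return the task IDs belonging to one versioned batch of a dataset."""
    with open(registry_path) as f:
        registry = yaml.safe_load(f)
    return [
        entry["id"]
        for entry in registry["tasks"]
        if entry["dataset"] == dataset and entry["version"] == version
    ]

# e.g. load_tasks("registry.yaml", "terminal-bench-core", "0.1")
```

Pinning runs to an explicit dataset version this way keeps older results comparable even as new, harder batches are released.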
Stay tuned for (or contribute to) Terminal-bench-core v0.2, our next batch of tasks, coming soon!
We're excited to see the community continue to adopt Terminal-bench, and our number one priority is to assist in that adoption, so don't hesitate to reach out.