Since releasing Terminal-Bench, we've added a dozen new tasks, and we're not stopping anytime soon!
This is the first in a series of posts highlighting exciting new tasks. We're continuing to expand the set of tasks to help you test and improve your agents, with an emphasis on realistic, challenging, and valuable tasks.
Parallelize Graph Algorithm: Scientific Computing with UPC
Difficulty: Hard | Domain: High-Performance Computing + Bioinformatics
Contributed by Jason Poulos
Our parallelize-graph task ventures into the specialized world of Unified Parallel C (UPC) programming, requiring agents to implement parallel de Bruijn graph construction for genome assembly. It's based on a real scientific computing problem: assembling a complete genome from tiny pieces of DNA (k-mers).
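To give a feel for the underlying data structure, here is a minimal serial sketch of a textbook de Bruijn graph in Python. It is not the task's reference implementation or input format: the helper names `kmers_of` and `build_de_bruijn` and the toy sequence are assumptions made for illustration. Each k-mer contributes one edge from its (k-1)-length prefix to its (k-1)-length suffix.

```python
from collections import defaultdict


def kmers_of(sequence, k):
    """Return every length-k substring of the sequence, in order (hypothetical helper)."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


def build_de_bruijn(kmers):
    """Map each (k-1)-mer prefix to the (k-1)-mer suffixes that follow it."""
    graph = defaultdict(list)
    for kmer in kmers:
        prefix, suffix = kmer[:-1], kmer[1:]
        graph[prefix].append(suffix)
    return dict(graph)


# Toy input, assumed for illustration only.
kmers = kmers_of("ACGTAC", 3)
graph = build_de_bruijn(kmers)
```

A parallel version has to split this table across threads, which is where the coordination problems the task is testing for come in.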
Why This Challenges AI Agents
This task presents unique difficulties for agents operating in terminal environments:
- Complex Tool Management: Agents must navigate building Berkeley UPC from source, managing SMP configurations, and debugging compilation issues through command-line interfaces.
- Multi-step Verification: Agents must compile, run, and compare outputs across different implementations while managing complex dependencies.
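As a rough illustration of the verification step, the sketch below compares the output of two implementations without caring about line order. It assumes line-oriented output whose order may differ between a serial and a parallel run; the function name `same_lines_any_order` and the sample strings are hypothetical, not taken from the task.

```python
from collections import Counter


def same_lines_any_order(serial_text, parallel_text):
    """Return True when both outputs hold the same non-empty lines, ignoring order."""
    def line_counts(text):
        # Counter keeps duplicates, so a line emitted twice must appear twice in both.
        return Counter(line.strip() for line in text.splitlines() if line.strip())
    return line_counts(serial_text) == line_counts(parallel_text)


# Hypothetical outputs: same contigs, different order.
serial_output = "ACGTAC\nTTGACA\n"
parallel_output = "TTGACA\nACGTAC\n"
outputs_match = same_lines_any_order(serial_output, parallel_output)
```

Counting lines rather than collecting them in a set means a duplicated or dropped line still shows up as a mismatch.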
The combination of complex build systems and domain expertise makes this a great task for Terminal-Bench.
Terminus in Action
Terminus with GPT-4.1 is unable to complete this task. It implements a serial version of the algorithm but fails to parallelize it correctly.
FEAL Linear Cryptanalysis: Breaking Ciphers
Difficulty: Hard | Domain: Cryptography + Security
Contributed by Nicholas Carlini
The feal-linear-cryptanalysis task brings academic cryptanalysis into a hands-on programming challenge, requiring agents to implement a sophisticated known-plaintext attack against a FEAL-like cipher. FEAL is a family of block ciphers published in the 1980s and 1990s. It's known to be vulnerable to linear cryptanalysis, a technique that uses linear approximations of the cipher's behavior to break it. Read more here.
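To make "linear approximation" concrete, here is a small Python sketch that counts how often a linear relation holds for the S-box function as FEAL is usually described, S_x(a, b) = ROL2((a + b + x) mod 256). The task uses a FEAL-like cipher whose internals may differ, so treat the function names (`rol2`, `s_box`, `count_matches`) and the chosen masks as assumptions for illustration, not as the attack the task expects.

```python
def rol2(byte):
    """Rotate an 8-bit value left by two bits."""
    return ((byte << 2) | (byte >> 6)) & 0xFF


def s_box(a, b, x):
    """FEAL-style S-box as commonly described: ROL2((a + b + x) mod 256)."""
    return rol2((a + b + x) & 0xFF)


def parity(value):
    """XOR of all bits in value."""
    return bin(value).count("1") & 1


def count_matches(mask_a, mask_b, mask_out, x):
    """Count input pairs where the masked input parity equals the masked output parity."""
    matches = 0
    for a in range(256):
        for b in range(256):
            lhs = parity(a & mask_a) ^ parity(b & mask_b)
            rhs = parity(s_box(a, b, x) & mask_out)
            matches += lhs == rhs
    return matches


total = 256 * 256
# The lowest bit of a sum has no carry-in, so after the rotation
# output bit 2 always equals a[0] XOR b[0].
exact = count_matches(0x01, 0x01, 0x04, x=0)
# The next bit is disturbed by a carry only when a[0] and b[0] are both 1.
biased = count_matches(0x02, 0x02, 0x08, x=0)
```

A relation that holds for every input, or for noticeably more or fewer than half of them, leaks information about the key bits mixed into those inputs, and linear cryptanalysis exploits that over many known plaintext-ciphertext pairs.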
Why This Challenges AI Agents
Cryptanalysis presents a perfect storm of difficulties for AI agents working in terminal environments:
- Iterative Hypothesis Testing: The attack requires trying different approaches, analyzing results, and adapting strategy based on intermediate findings.
- Bit Manipulation: Success depends on precise understanding of binary operations, bit patterns, and cipher internals that are difficult to verify without domain expertise.
- Multi-stage Problem Solving: Agents must coordinate statistical analysis, key recovery, and verification steps while managing multiple executables and data files.
Terminus in Action
Terminus with GPT-4.1 also fails this task. It is overconfident in its approach, commits to a method that doesn't work, and ultimately signals completion prematurely.
Get Started
Both tasks (and 90 more!) are available in the Terminal-Bench repository.