terminal-bench
science
A Benchmark for Evaluating AI Agents on Complex Real-World Scientific Workflows in the Terminal
What is Terminal-Bench-Science?
Terminal-Bench-Science (TB-Science) is a benchmark for evaluating AI agents on the complex real-world computational workflows that natural scientists run in their research labs. It builds on the success of Terminal-Bench, adopted by frontier labs such as OpenAI, Anthropic, and Google DeepMind, which helped drive rapid progress in AI coding agents by defining what leading labs measure and optimize for. No equivalent exists for science — until now.
Current "AI for Science" benchmarks test textbook knowledge or abstract capabilities like hypothesis generation. They do not measure whether an AI system can execute the end-to-end computational workflows that drive modern research in the natural sciences. TB-Science will close this gap by porting real workflows from leading research labs into executable benchmark tasks, evaluated in containerized environments with deterministic, programmatic verification.
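To make "deterministic, programmatic verification" concrete, here is a minimal sketch of what a task check could look like. It is an illustration under stated assumptions, not the TB-Science task format: the function name `verify_result`, the `result.json` file, and the `binding_energy_ev` quantity are all hypothetical. The idea is that the agent writes its final numbers to a file inside the container, and a script compares them to reference values within a tolerance, so the same output always gets the same verdict.

```python
import json
import math
import os
import tempfile
from pathlib import Path


def verify_result(output_path, expected, rel_tol=1e-6):
    """Return True only if every expected quantity is present and within tolerance.

    Hypothetical checker: `output_path` is a JSON file the agent is assumed to
    write, and `expected` maps quantity names to reference values.
    """
    path = Path(output_path)
    if not path.is_file():
        return False  # the agent never produced the required artifact
    try:
        reported = json.loads(path.read_text())
    except json.JSONDecodeError:
        return False  # output exists but is not valid JSON
    if not isinstance(reported, dict):
        return False
    for key, target in expected.items():
        value = reported.get(key)
        # Reject missing entries, booleans, and non-numeric values.
        if isinstance(value, bool) or not isinstance(value, (int, float)):
            return False
        if not math.isclose(value, target, rel_tol=rel_tol):
            return False
    return True


if __name__ == "__main__":
    # Simulate an agent writing its answer, then run the check.
    workdir = tempfile.mkdtemp()
    result_file = os.path.join(workdir, "result.json")  # hypothetical file name
    with open(result_file, "w") as handle:
        json.dump({"binding_energy_ev": 13.6057}, handle)
    print(verify_result(result_file, {"binding_energy_ev": 13.6057}))
```

A check of this kind needs no human or model-based judging, so it can be rerun in a fresh container and give the same pass/fail result each time.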
Our goal is to catalyze a "Claude Code / Codex for Science" moment by giving natural scientists a direct voice in shaping AI progress. Domain experts contribute real workflows, frontier labs optimize against them, and the resulting advances flow back as more capable AI tools for scientific discovery. The result is a virtuous cycle between the scientists who know what matters and the labs building the next generation of AI.
Figure: The virtuous cycle of AI for science progress. Domain experts contribute complex real-world scientific workflows as tasks; tasks are used to evaluate and rank frontier AI agents/models; frontier labs invest in improving scientific capabilities of their agents/models; improved agents/models accelerate scientific research.
Domains
TB-Science is targeting 100+ benchmark tasks across the natural sciences, spanning the life sciences, physical sciences, earth sciences, and mathematical & computational sciences.
Timeline
Project launch, initial task collection and review
Open contribution call, extensive task collection and review, evaluation runs
Public release and leaderboard, paper submission
Contributing
TB-Science is a community-driven effort — we welcome contributions from domain experts across all natural science disciplines. See for more details.
Contact
For questions, feedback, or if you're interested in contributing, join the #tb-science channel on our Discord, or reach out to Steven Dillmann at stevendi@stanford.edu.
Terminal-Bench-Science is an open academic collaboration hosted by Stanford University and the Laude Institute.