Terminal-Bench-Science

A Benchmark for Evaluating AI Agents on Complex Real-World Scientific Workflows in the Terminal

What is Terminal-Bench-Science?

Terminal-Bench-Science (TB-Science) is a benchmark for evaluating AI agents on the complex real-world computational workflows that natural scientists run in their research labs. It builds on the success of Terminal-Bench, adopted by frontier labs such as OpenAI, Anthropic, and Google DeepMind, which helped drive rapid progress in AI coding agents by defining what leading labs measure and optimize for. No equivalent exists for science — until now.

Current "AI for Science" benchmarks test textbook knowledge or abstract capabilities like hypothesis generation. They do not measure whether an AI system can execute the end-to-end computational workflows that drive modern research in the natural sciences. TB-Science will close this gap by porting real workflows from leading research labs into executable benchmark tasks, evaluated in containerized environments with deterministic, programmatic verification.
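
As a rough illustration of what deterministic, programmatic verification can look like, the sketch below checks a JSON results file written by an agent against reference values within a tolerance. It is not the TB-Science harness: the function name `verify_output`, the `result.json` layout, and the tolerance are all assumptions made for this example.

```python
import json
import math
from pathlib import Path


def verify_output(result_path: Path, expected: dict, rel_tol: float = 1e-6) -> bool:
    """Hypothetical verifier: pass only if every expected quantity is
    present in the agent's JSON output and matches within rel_tol."""
    if not result_path.exists():
        return False  # the agent never produced the output file
    try:
        produced = json.loads(result_path.read_text())
    except json.JSONDecodeError:
        return False  # output is not valid JSON
    if not isinstance(produced, dict):
        return False
    for key, target in expected.items():
        value = produced.get(key)
        # Reject missing keys and non-numeric values (bool is excluded explicitly).
        if isinstance(value, bool) or not isinstance(value, (int, float)):
            return False
        if not math.isclose(value, target, rel_tol=rel_tol):
            return False
    return True
```

Because the check depends only on the file contents and fixed reference values, running it twice on the same output gives the same verdict, which is the property a leaderboard needs.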

Our goal is to catalyze a "Claude Code / Codex for Science" moment by giving natural scientists a direct voice in shaping AI progress. Domain experts contribute real workflows, frontier labs optimize against them, and the resulting advances flow back as more capable AI tools for scientific discovery. The result is a virtuous cycle between the scientists who know what matters and the labs building the next generation of AI.

[Figure: The virtuous cycle of AI for science progress. The natural science community (domain experts from the natural sciences) contributes complex real-world scientific workflows as tasks. The tasks are used to evaluate and rank frontier AI agents and models (agents: Claude Code, Codex, etc.; models: Opus, GPT, Gemini, etc.). Frontier AI labs (Anthropic, OpenAI, Google DeepMind, etc.) invest in improving the scientific capabilities of their agents and models. The improved agents and models accelerate scientific research.]

Domains

TB-Science is targeting 100+ benchmark tasks across the natural sciences, spanning the life sciences, physical sciences, earth sciences, and mathematical & computational sciences.

Timeline

Q1 2026: Project launch, initial task collection and review

Q2 2026: Open contribution call, extensive task collection and review, evaluation runs

Q3 2026: Public release and leaderboard, paper submission

Contributing

TB-Science is a community-driven effort — we welcome contributions from domain experts across all natural science disciplines. See for more details.

Contact

For questions, feedback, or if you're interested in contributing, join the #tb-science channel on our Discord, or reach out to Steven Dillmann at stevendi@stanford.edu.

Terminal-Bench-Science is an open academic collaboration hosted by Stanford University and the Laude Institute.