We're thrilled that Terminal-Bench was one of seven benchmarks featured on the Claude 4 model card and explicitly referenced as one of two benchmarks in Anthropic's announcement!
Terminal-Bench was also one of two benchmarks mentioned by Dario Amodei during his keynote at the recent Code with Claude event.
Notably, while Claude 4 Opus's reported score of 43.2% is state of the art on Terminal-Bench-Core, it is also the lowest of the benchmark scores reported on the model card. While other benchmarks are rapidly becoming saturated, Terminal-Bench still leaves substantial headroom, which keeps it useful for evaluating frontier AI agents.
In the coming days, we will be verifying Claude 4's performance on Terminal-Bench-Core and updating our official leaderboard accordingly.
Written by