# T-Bench Leaderboard
Rank | Agent | Model | Integration | Agent Org | Model Org | Date | Accuracy |
---|---|---|---|---|---|---|---|
1 | T-Agent | gpt-4.1 | API | Stanford | OpenAI | 2025-05-15 | 32.3% |
2 | T-Agent | o3 | API | Stanford | OpenAI | 2025-05-15 | 31.0% |
3 | T-Agent | gemini-2.5-pro | API | Stanford | Google | 2025-05-15 | 28.8% |
4 | Codex | o4-mini | Install | OpenAI | OpenAI | 2025-05-15 | 20.5% |
5 | T-Agent | o4-mini | API | Stanford | OpenAI | 2025-05-15 | 19.0% |
6 | T-Agent | Llama-4-Maverick-17B | API | Stanford | Meta | 2025-05-15 | 16.3% |
7 | Codex | gpt-4.1 | Install | OpenAI | OpenAI | 2025-05-15 | 8.0% |
8 | T-Agent | DeepSeek-R1 | API | Stanford | DeepSeek | 2025-05-15 | 7.5% |
9 | T-Agent | Qwen3-235B | API | Stanford | Alibaba | 2025-05-15 | 5.5% |
Please email mikeam@cs.stanford.edu or alex@laude.org to add your results to the leaderboard.