Benchmarking Pull Request Code Reviews
Fix the bug, run the commands, approve the change.
SWE-Bench measures the first, Terminal-Bench the second, but we still lack a standard benchmark for the third.
So I built a lightweight benchmark that tests five AI coding tools on real pull request decisions, with real outcomes, from large open-source projects like Kubernetes and VS Code.
Due to cost constraints, the setup was intentionally simple (a sketch of the harness follows the list):
- 20 real PRs (10 approve, 10 reject)
- 5 AI agents/models: Claude Code, Codex CLI (o3-mini), OpenCode (o3), AMP (o4-mini), and Gemini CLI
- A binary verdict: approve or reject
- Same evaluation criteria for everyone
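For reference, the harness can be this small. The sketch below is illustrative rather than my exact code: `ask_model` is a stand-in for however each CLI or API is actually invoked, and the prompt template is a simplified version of what each model saw.

```python
# `ask_model` is a placeholder for however each tool is invoked in practice
# (a subprocess call to the CLI, an HTTP request, etc.).
def ask_model(model: str, prompt: str) -> str:
    raise NotImplementedError

PROMPT_TEMPLATE = """You are reviewing a pull request.

Title: {title}
Description: {description}

Diff:
{diff}

Decide whether this PR should be merged. Answer APPROVE or REJECT, then explain why."""

def run_benchmark(models: list[str], prs: list[dict]) -> list[dict]:
    """Ask every model for a verdict on every PR and record it next to the real outcome."""
    results = []
    for pr in prs:  # 20 PRs total: 10 that were really approved, 10 that were rejected
        prompt = PROMPT_TEMPLATE.format(**pr)
        for model in models:
            reply = ask_model(model, prompt)
            verdict = "approve" if reply.strip().upper().startswith("APPROVE") else "reject"
            results.append({"model": model, "pr": pr["number"],
                            "verdict": verdict, "ground_truth": pr["ground_truth"],
                            "reasoning": reply})
    return results
```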
Results
| Model | Overall Accuracy | Approval Rate | Quality Score |
|---|---|---|---|
| OpenCode | 70% | 80% | 0.89 |
| Codex CLI | 60% | 90% | 0.64 |
| Gemini CLI | 60% | 50% | 0.83 |
| Claude Code | 60% | 90% | 0.68 |
| AMP | 55% | 85% | 0.71 |
The common pattern: most models were far too lenient. Codex CLI and Claude Code approved 90% of PRs, including ones that were clearly incomplete or problematic.
Here are the false positive rates:
- Codex CLI, Claude Code, AMP: 80%
- OpenCode: 60%
- Gemini CLI: 60%
Only Gemini CLI achieved a balanced 50% approval rate, closest to the ground truth distribution. It was also the only one that wrongly rejected good PRs (40% false negative rate).
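For concreteness, here is how I'm reading these numbers, expressed against the `results` rows from the harness sketch above: a false positive means approving a PR that was really rejected, and a false negative means rejecting a PR that was really merged.

```python
def score(results: list[dict]) -> dict:
    """Aggregate per-model accuracy, approval rate, and error rates."""
    by_model: dict[str, list[dict]] = {}
    for r in results:
        by_model.setdefault(r["model"], []).append(r)

    summary = {}
    for model, rows in by_model.items():
        n = len(rows)
        good_prs = [r for r in rows if r["ground_truth"] == "approve"]
        bad_prs = [r for r in rows if r["ground_truth"] == "reject"]
        summary[model] = {
            "accuracy": sum(r["verdict"] == r["ground_truth"] for r in rows) / n,
            "approval_rate": sum(r["verdict"] == "approve" for r in rows) / n,
            # approved a PR that was really rejected
            "false_positive_rate": sum(r["verdict"] == "approve" for r in bad_prs) / len(bad_prs),
            # rejected a PR that was really merged
            "false_negative_rate": sum(r["verdict"] == "reject" for r in good_prs) / len(good_prs),
        }
    return summary
```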
What This Means
Most of these AI models are "yes men": conflict-averse reviewers that would rather approve than risk blocking a good change. If you're using AI for code review, weight its rejections more heavily than its approvals.
OpenCode wins here, but not just because of accuracy:
- Highest accuracy (70%) AND highest quality reasoning (0.89/1.0)
- Provided specific technical analysis: "This PR improves code consistency by using native pyarrow grouped aggregation functions..."
- Referenced actual PR content and implementation details
Compare that to Codex CLI's typical response: "The changes look good and address the issue effectively."
Gemini CLI showed the most balanced judgment but sometimes got bogged down in edge cases that didn't matter.
Dataset Issues:
- PR #2: Kubernetes docs fix that added an unnecessary "ACKNOWLEGEMENT" section with typos
- PR #4: VS Code readonly files feature—complex implementation, mixed reviews
- PR #15: Rust doctest optimization—significant performance improvement but breaking change concerns
With only 20 examples, each mistake carries huge weight (a 5-percentage-point swing per PR). The models might perform differently on:
- Different project types (web vs systems vs data)
- Larger codebases with more context
- PRs with CI/CD results and test outputs
Agent vs Model
Here's something important I realized: I was testing raw language models, not coding agents. Each model only received a text prompt with the PR information; it couldn't:
- run code or tests
- browse the repo or docs
- view CI/CD results or commit history
A real coding agent should be able to do all of that, iterate on its work, and ask clarifying questions. This explains some of the approval bias. Without being able to verify claims, models default to trusting the PR author.
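As a rough sketch of the difference, an agent-style harness would wrap the same model in a tool loop. Everything below is hypothetical: the tool names, the `TOOL:`/`OBSERVATION:` convention, and the step budget are illustrative, not any particular product's API. It reuses `ask_model` and `PROMPT_TEMPLATE` from the earlier sketch.

```python
from typing import Callable

def agent_review(model: str, pr: dict, tools: dict[str, Callable[[], str]],
                 max_steps: int = 10) -> str:
    """Iterative review: the model may request tool calls before giving a verdict.

    `tools` maps a tool name to a zero-argument callable already bound to this PR,
    e.g. {"run_tests": ..., "read_ci_logs": ...}. All tool names here are hypothetical.
    """
    transcript = [PROMPT_TEMPLATE.format(**pr)]
    for _ in range(max_steps):
        reply = ask_model(model, "\n".join(transcript))
        if reply.startswith("TOOL:"):                  # model asks for more evidence
            name = reply.removeprefix("TOOL:").strip()
            observation = tools[name]() if name in tools else "unknown tool"
            transcript += [reply, f"OBSERVATION: {observation}"]
        else:
            return reply                               # final APPROVE/REJECT + reasoning
    return "REJECT: could not verify the PR's claims within the step budget"
```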
What's Next?
To really understand AI code review capabilities, we need more, including but not limited to:
- a larger dataset across different language/project types
- real agent testing
- better prompt engineering
- few-shot examples of good reviews (sketched below)
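On that last point, a few-shot prompt is cheap to try. The example review below is invented purely for illustration; in practice the examples would be drawn from real, high-quality reviews in the target repository.

```python
# Minimal few-shot review prompt. The example PR and review are made up for
# illustration; real prompts should use curated examples from the target repo.
FEW_SHOT_PROMPT = """You review pull requests and answer APPROVE or REJECT with reasoning.

Example:
PR: Adds retry logic to the HTTP client but removes the existing timeout handling.
Review: REJECT. The retry loop is sound, but dropping the timeout can hang callers
indefinitely; reintroduce the timeout or cap total retry time.

Now review this PR:
Title: {title}
Description: {description}

Diff:
{diff}

Review:"""

# Usage with the earlier harness:
# reply = ask_model(model, FEW_SHOT_PROMPT.format(**pr))
```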
An interesting problem to consider is how to customize a coding agent to review code in the style and taste of a specific organization or codebase. A possible approach is to generate synthetic training data from historical PRs and reviews instead of curating examples by hand. More generally, automating the collection of PRs and review comments is another next step.
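As a starting point, the GitHub REST API already exposes most of what's needed. A rough sketch, treating merged vs. closed-unmerged as a stand-in for approve/reject ground truth (a simplification, since PRs get closed for many reasons):

```python
import requests

API = "https://api.github.com"

def fetch_pr_outcomes(owner: str, repo: str, token: str, pages: int = 2) -> list[dict]:
    """Collect closed PRs and their review comments as candidate benchmark examples."""
    headers = {"Authorization": f"Bearer {token}",
               "Accept": "application/vnd.github+json"}
    examples = []
    for page in range(1, pages + 1):
        prs = requests.get(f"{API}/repos/{owner}/{repo}/pulls",
                           params={"state": "closed", "per_page": 50, "page": page},
                           headers=headers).json()
        for pr in prs:
            reviews = requests.get(
                f"{API}/repos/{owner}/{repo}/pulls/{pr['number']}/reviews",
                headers=headers).json()
            examples.append({
                "number": pr["number"],
                "title": pr["title"],
                "description": pr.get("body") or "",
                # merged PRs count as "approve", closed-unmerged as "reject"
                "ground_truth": "approve" if pr.get("merged_at") else "reject",
                "review_comments": [r.get("body", "") for r in reviews],
            })
    return examples
```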
If you'd like to dive deeper or explore the data and code behind my study, you can visit my full repository.
Related Work:
PullRequestBenchmark tackles a similar problem, focusing specifically on binary approve/reject decisions.