tiffany sun

Blog

Benchmarking Pull Request Code Reviews

August 3, 2025

I built a lightweight benchmark to test five major AI models on real pull request decisions from large open-source projects such as Kubernetes and VS Code. Most models turned out to be "Yes Men", approving 80-90% of PRs, including problematic ones.
