We Turned 2.5 Years of PR Comments Into an AI Code Reviewer

Written by Kadir Pamukçu
5 min read
Updated Mar 09, 2026

PR volume doubled. Review bandwidth didn’t. Here’s how we built a Claude-powered code reviewer trained on two and a half years of our own GitHub history — and the system we now run on every PR.

The Problem: More Code, Same Hours

When your team ships faster, code review becomes the bottleneck. We found ourselves with roughly double the number of pull requests compared to the previous year, and no realistic way to keep up without letting quality slip or burning people out.

We tried existing tools. CodeRabbit flagged too many low-value issues and didn't respect our team's conventions. BugBot had promising bug detection, but its pricing bundles the tool with Cursor, meaning the cost only makes sense if your whole team uses that editor. We have engineers on different tools, so we were effectively paying for reviews alone.

Neither tool understood our codebase, our patterns, or our standards. So we decided to build something that did.

The Pattern: Extract → Distill → Codify → Apply → Iterate

Before getting into the specifics, here's the architecture at a glance:

  1. Extract your existing PR review history
  2. Distill it into reusable rules using AI sub-agents and human judgment
  3. Codify those rules into a skill file
  4. Apply it automatically on every PR and interactively before one opens
  5. Iterate by feeding new reviews back into the cycle

Everything below is how each step works in practice.

The Insight: Your Review History Is a Gold Mine

Two and a half years of GitHub pull request comments is a remarkable dataset. Every inline comment, every “please use our serializer pattern,” every repeated nitpick about test structure — it’s all there, representing the collective engineering knowledge of your team.

The core idea: instead of writing review rules from scratch, extract them from what we’d already written.

Step 1: Crawl Your PR Review History

We wrote a script to crawl the entire GitHub review history for our repository. The output was a structured dataset of PR names and descriptions, the diffs being reviewed, and every inline comment with its associated code context.

The raw file ran to about 500,000 lines — not something you can drop into a single AI context window.
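The crawl script itself isn't shown in the post; here's a minimal sketch of the approach, assuming the GitHub REST API's pull-request review-comments endpoint (the endpoint path is real; the record shape and function names are illustrative):

```python
import json
import os
import urllib.request

API = "https://api.github.com"


def fetch_review_comments(repo: str, pull_number: int) -> list[dict]:
    """Fetch every inline review comment on one PR, following pagination."""
    token = os.environ.get("GITHUB_TOKEN", "")
    comments, page = [], 1
    while True:
        url = (f"{API}/repos/{repo}/pulls/{pull_number}/comments"
               f"?per_page=100&page={page}")
        req = urllib.request.Request(url, headers={"Authorization": f"Bearer {token}"})
        with urllib.request.urlopen(req) as resp:
            batch = json.load(resp)
        if not batch:
            return comments
        comments.extend(batch)
        page += 1


def to_record(pr: dict, comment: dict) -> dict:
    """Flatten a PR plus one inline comment into a single dataset record."""
    return {
        "pr_title": pr["title"],
        "pr_body": pr.get("body") or "",
        "file": comment["path"],
        "diff_hunk": comment["diff_hunk"],  # code context around the comment
        "comment": comment["body"],
    }
```

Run per PR, one record per inline comment, and the output accumulates into the half-million-line dataset described above.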

Step 2: Divide and Extract Insights with Sub-Agents

We split the review history by domain using file paths as a guide — frontend, backend, serializers, tests, general design patterns. Each domain produced multiple files of roughly 3,000 lines each.
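That split can be sketched as path-prefix bucketing followed by fixed-size chunking. The domain names and the 3,000-line target come from the post; the prefix mapping itself is illustrative:

```python
# Longer prefixes first: dicts preserve insertion order, so "api/serializers"
# is checked before the broader "api/" bucket.
DOMAINS = {
    "frontend/": "frontend",
    "api/serializers": "serializers",
    "tests/": "tests",
    "api/": "backend",
}


def domain_for(path: str) -> str:
    """Map a file path to a review domain; anything unmatched is 'general'."""
    for prefix, domain in DOMAINS.items():
        if path.startswith(prefix):
            return domain
    return "general"


def chunk_lines(lines: list[str], max_lines: int = 3000) -> list[list[str]]:
    """Split one domain's records into chunks small enough for a sub-agent."""
    return [lines[i:i + max_lines] for i in range(0, len(lines), max_lines)]
```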

We ran multiple Claude sub-agents (using Opus) in parallel, each ingesting one chunk of the review history, with a prompt to identify insights that are frequently mentioned, genuinely valuable, and could be codified as a reusable rule.
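The fan-out can be sketched as below. The prompt wording is a reconstruction of the three criteria just described, not the team's actual prompt, and the model call is abstracted behind a `call_model` function that in practice would wrap the Anthropic SDK:

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable

PROMPT = """You are analyzing historical code review comments from the {domain} \
domain of our codebase. Identify insights that are (1) frequently mentioned, \
(2) genuinely valuable, and (3) codifiable as a reusable review rule. For each \
insight, report its category, how often it appears, and supporting PR evidence.

Review history chunk:
{chunk}"""


def build_prompt(domain: str, chunk: str) -> str:
    return PROMPT.format(domain=domain, chunk=chunk)


def extract_insights(call_model: Callable[[str], str],
                     chunks: list[str], domain: str) -> list[str]:
    """Run one sub-agent per chunk in parallel; collect raw insight text."""
    prompts = [build_prompt(domain, c) for c in chunks]
    with ThreadPoolExecutor(max_workers=8) as pool:
        return list(pool.map(call_model, prompts))
```

Note the rate-limit caveat mentioned later in the post: this kind of parallel fan-out can exhaust a subscription account's quota quickly.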

This produced a set of raw insights per domain. We then did a human review pass — deleting low-quality suggestions, merging duplicates, and refining the wording. The AI found the patterns; a human judged their worth.

Step 3: Build a Skill File

The refined insights were compiled into a skill file — a structured markdown document that tells Claude how to review code in our codebase. It covers things like how we use serializers, timestamp and model conventions, test coverage expectations, and frontend state management patterns.

The current skill file is around 11,000 tokens — deliberately lean to leave room in the context window for the actual PR diff and surrounding code.
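A crude way to enforce that budget in CI is a character-count heuristic (the ~4 characters per token ratio is a rough English-prose approximation, not a real tokenizer):

```python
def approx_tokens(text: str) -> int:
    """Rough token estimate: about 4 characters per token for English prose."""
    return len(text) // 4


def check_skill_budget(skill_md: str, budget: int = 11_000) -> None:
    """Fail loudly if the skill file outgrows its context-window budget."""
    tokens = approx_tokens(skill_md)
    if tokens > budget:
        raise ValueError(f"SKILL.md is ~{tokens} tokens, over the {budget}-token budget")
```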

Here’s a sample of the rules (the first three backend, the fourth frontend) to give a sense of the specificity:

  1. When using .first() or .last(), verify an explicit .order_by() is present. Relying on default model ordering is fragile.
  2. When filtering through M2M or reverse FK relations, add .distinct() to prevent duplicate objects from JOINs.
  3. When adding unique_together or UniqueConstraint, verify behavior with nullable fields. MySQL allows multiple NULLs in unique constraints.
  4. When parsing ISO date strings from the backend, use parseISO from date-fns instead of new Date(). new Date("2024-11-08") assumes midnight UTC and shifts the date by the browser's timezone offset, producing off-by-one day bugs.

These aren’t invented rules — they’re distilled from real comments left by real engineers over two and a half years. The AI recognized they kept coming up; the humans decided they were worth keeping.

Here’s how that rule was extracted — a raw insight before it was codified:

Insight 4: Use parseISO from date-fns instead of new Date() for date string parsing

Category: correctness

Novelty: NEW

Frequency: 25+ distinct PRs, across files A, C, D (PRs #1453, #1583, #1587, #1651, #2132, #2456, #2488, #2509, #2513, #2529, #2534, #2672, #2875, #3334, #3355, #5316, #5359, #6114, #6281, #6562, #6595)

Evidence:

PR #2456 (author-name): "[BUG] I think we need to use parseISO(employeeOffData.timeOffStartDate) here. I moved someone to extended leave on 2024-09-22, but I see Sep 21, 2024."

PR #1587 (author-name): "No, the problem is new Date(shift.startDate). For example: new Date('2024-06-12') → Tue Jun 11 2024 17:00:00 GMT-0700. The string is June 12, but the date object is June 11."

Suggested SKILL.md addition:
Under "Frontend Conventions": "Date parsing: Use `parseISO` from `date-fns` for parsing ISO date strings from the backend. NEVER use `new Date(dateString)` for date-only strings (e.g., '2024-11-08') — it assumes midnight UTC and shifts the date by the browser's timezone offset, producing off-by-one day bugs. For `DateTimeField` values that include time (e.g., '2024-11-08T08:00:00Z'), `new Date()` is acceptable but `parseISO` is still preferred for consistency. Do not wrap values in `new Date()` when they are already Date objects."

Step 4: Two Modes — CI and Interactive

1. CI Mode (Automated PR Reviews)

We wired the skill file into a GitHub Actions workflow so that Claude automatically reviews every pull request. It reads the diff and PR description, loads the relevant skill file context, and posts a summary comment and targeted inline comments.

It’s tuned not to post low-priority nitpicks: if the code looks fine, it says so. The goal is signal, not noise. A review costs around $0.02–$0.20 per PR depending on size, and all engineering team reviews run through a $100/month Claude account at no additional charge. Worth noting: during the insight extraction phase, running many sub-agents in parallel hit that account's rate limits within about 15 minutes, something to plan for if you're doing the initial build.
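The post doesn't publish the workflow itself; a minimal sketch of what CI mode could look like follows. The `scripts/ai_review.py` entry point is hypothetical, and this variant uses an API key secret where the post describes a subscription account:

```yaml
name: AI Code Review
on:
  pull_request:
    types: [opened, synchronize]

jobs:
  review:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
        with:
          fetch-depth: 0  # full history so the reviewer can reason about past commits
      - name: Run Claude review
        env:
          ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
          GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
        # hypothetical script: loads SKILL.md, reads the diff and PR description,
        # posts a summary comment plus targeted inline comments
        run: python scripts/ai_review.py --pr "${{ github.event.pull_request.number }}"
```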

2. Interactive Mode (Local, Pre-PR)

For use before opening a PR, there’s a separate interactive skill for local use via Claude Code or similar tools. Rather than silently reviewing a diff, it starts a conversation: What’s the intent of this change? What architectural trade-offs did you consider? Is there product context that affects the approach?

This mode catches issues before they ever appear in a PR — and before a reviewer has to write the same comment for the fourth time.
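The interactive skill isn't published either; as a reconstruction, it could be a short skill file along these lines:

```markdown
# Interactive Pre-PR Review

Before reviewing the diff, ask the author, one question at a time:

1. What is the intent of this change?
2. What architectural trade-offs did you consider?
3. Is there product context that affects the approach?

Then review the working-tree diff against the team rules in SKILL.md,
raising only findings the conversation has not already resolved.
```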


Rethinking Code Ownership

Shipping faster with AI assistance creates a new problem: nobody writes the full code anymore, so nobody naturally owns it, and a shared understanding of the codebase is harder to maintain when it's assembled in pieces.

We’re addressing this with a secondary owner model. Each task or project is assigned a second engineer who reviews code not just to understand it, but to own it — as if they wrote it themselves. At any point in the project, or in the future, that person should be able to step in and take over entirely.

This reframes review entirely: the difference between a reviewer and an owner changes how you approach the work and how you budget time for it. A reviewer scans for issues; an owner builds a mental model.

The AI reviewer supports both roles. For the author, it catches issues before they surface in human review. For the secondary owner, it produces a running record of decisions, patterns, and intent across the codebase — making it easier to build the deep familiarity that real ownership requires.

What It Actually Catches

Because the reviewer loads commit history alongside the diff, it can reason about intent. If a change looks odd in isolation but makes sense given what was committed two weeks ago, it understands that. If a new change quietly contradicts a deliberate decision in a prior commit, it flags the inconsistency.

Some concrete things it catches regularly:

  • Logic errors grounded in history — not just “this looks wrong” but “this conflicts with how this module was intentionally designed in a prior commit”
  • Pattern violations — deviations from serializer conventions, test structure, and frontend state patterns baked into the skill file from real past reviews
  • Missing context in PR descriptions — prompting authors to explain decisions they may have glossed over
  • Silent regressions — places where new code technically works but undermines something established elsewhere

Here’s a real example of a comment it produced:

HIGH: The / 60 here converts the result to hours, but the analogous code in the list method (line 204) passes get_requested_duration() without dividing by 60, meaning it stays in minutes. Both values are assigned to pending_comp_time_duration and serialized identically via EmployeePayPeriodSerializer.

This inconsistency means one of the two call sites returns the wrong unit. Based on commit fb1523ae (“Personal Payroll: use duration instead of hours”), the intent was to switch to minutes everywhere. If that’s correct, the / 60 here should be removed to match line 204.

That comment references a specific commit, identifies the inconsistency across two call sites, infers the original intent from the commit message, and gives a concrete fix. A reviewer skimming a diff would likely miss it entirely.

When the code looks fine, it says so. The signal-to-noise ratio is noticeably better than the off-the-shelf tools we tried.

What Comes Next

A few directions we’re already exploring:

Continuous learning. We’re building an automated insights pipeline: a GitHub Action that runs post-merge, reads every new review and commit, and opens a PR to add or modify entries in an insights document. When an insight appears frequently enough and is validated, it gets promoted into the skill file. This closes the Extract → Distill → Codify loop automatically over time.
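The promotion step could look like the sketch below; the frequency threshold is illustrative, since the post doesn't specify one:

```python
from dataclasses import dataclass, field


@dataclass
class Insight:
    rule: str
    category: str            # e.g. "correctness", "style"
    pr_numbers: set[int] = field(default_factory=set)
    validated: bool = False  # set by a human during the review pass

    @property
    def frequency(self) -> int:
        return len(self.pr_numbers)


def promote(insights: list[Insight], min_frequency: int = 5) -> list[Insight]:
    """Insights seen often enough and human-validated graduate to SKILL.md."""
    return [i for i in insights if i.validated and i.frequency >= min_frequency]
```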

Domain-specific rule files. Splitting rules by domain and running specialized sub-agents would allow richer rule sets without blowing the context window.

Product context. Feeding in product documentation or specs could allow the reviewer to flag not just how code is written, but whether it correctly reflects what was intended.

The full skill files live in our repository. If you spot something the agent is getting wrong, you can open a PR to improve it — a fitting way to iterate on a code review tool.