AutoPR Leaderboard

Comprehensive comparison of AutoPR systems on PRBench-Core and PRBench-Full. Metrics are reported as percentages unless specified.

Evaluation scope. Scores are reported on the test splits of PRBench-Core and PRBench-Full.

📬 Share new results via Qiguang Chen or Zheng Yan, or open an issue on GitHub.

⚠️ Leaderboard entries are curated manually from papers and user submissions. Please double-check any numbers you rely on and let us know if corrections are needed.

Leaderboard Timeline

Curate the milestones.

SOTA Progress (Avg.)

Charts track the running SOTA averages on PRBench-Core and PRBench-Full.

Each curve shows the best Avg. reported up to that date for the selected split.
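The "best Avg. reported up to that date" rule amounts to a running maximum over entries sorted by date. The following is a minimal sketch of that idea, not the leaderboard's actual plotting code; the function name `running_sota`, the `(date, avg)` tuple layout, and the sample values are all hypothetical placeholders.

```python
from datetime import date


def running_sota(entries):
    """Return (date, best_avg_so_far) pairs in chronological order.

    `entries` is an iterable of (date, avg) tuples for one split.
    This layout is an assumption made for illustration only.
    """
    curve = []
    best = float("-inf")
    # Sort by date so that "up to that date" is well defined.
    for day, avg in sorted(entries, key=lambda e: e[0]):
        best = max(best, avg)  # the curve never decreases
        curve.append((day, best))
    return curve


# Made-up placeholder values, not real leaderboard numbers.
example = [
    (date(2025, 3, 1), 50.0),
    (date(2025, 1, 1), 40.0),
    (date(2025, 2, 1), 35.0),
]
curve = running_sota(example)
```

Under this reading, a later entry that scores below the current best leaves the curve flat rather than lowering it.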

PRAgent rows correspond to AutoPR agents built on top of the base models listed directly above them. Human-authored posts provide a reference upper bound collected from the test split.

Superscripts ^R and ^T denote reasoning models and text-only models, respectively. Boldface marks the best score per metric. For models with a ^T superscript evaluated with PRAgent, Gemini-2.5-Pro serves as the vision backbone.