Does Your Paper Really Suck?




By A. Sina Booeshaghi · June 27, 2026

Oded Rechavi, at QED Science, believes that if your paper is not in the top 1% of their QED score then it "sucks". But what is this QED score and what is its purpose? Does it really measure scientific quality? If a paper is not in the top 1%, does it really suck?

These are important questions because scientists are increasingly overwhelmed with the volume of new work posted on preprint servers and published in journals. As a result, traditional quality signals used for triaging papers, such as journal, conference venue, and institution, are becoming less reliable. AI further compounds this problem by making it easy to produce plausible scientific writing at scale. Papers are longer, figures are denser, and the existence of a paper is no longer sufficient evidence that it represents substantial scientific work.

In response, companies like QED Science are building AI tools to help scientists identify quality work. QED uses Large Language Models (LLMs) to review scientific papers and provide AI feedback. Many scientists report that the feedback is useful and often resembles comments received during human peer review.

QED recently released a white paper that goes one step further and describes the "QED Score", a single number that is intended to measure a paper's quality. The QED score is generated by prompting a collection of LLMs to review a paper for "originality" and "validity". The resulting evaluations are combined into a single score, the QED score. In their white paper, the authors claim that the QED score is a "more accurate, faster, and less biased estimate of paper quality than journal rank." The authors present three validation studies, all of which compare the QED score against the SCImago Journal Rank (SJR), a journal-level metric based on citation data. The first study compares QED and SJR against a corpus of expert-assigned labels ("Limited", "Satisfactory", and "Strong"). The second compares QED scores for 2,879 bioRxiv preprints with the SJR of the journals in which those papers were eventually published. The third asks experts to choose between pairs of papers where QED and SJR disagree most strongly.

In this review, I evaluate the evidence supporting the QED score as a measure of scientific quality. While QED clearly provides a much faster review than traditional peer review, I find that the evidence presented does not support the authors' claims that the QED score is a more accurate or less biased measure of scientific quality.

Case study 1 is methodologically opaque and does not effectively demonstrate that the QED score measures quality

In case study 1, the authors obtain a curated dataset of 975 published papers labelled "Limited", "Satisfactory", or "Strong" by a panel of expert reviewers whose identities are not disclosed. Each paper received a label based on validity and originality, the same criteria used to generate the QED score. The authors then asked whether the QED or the SJR score better predicted these labels. QED achieved an AUC of 0.863 versus SJR's 0.804 for distinguishing "Limited" from "Satisfactory + Strong" papers, and 0.782 versus 0.774 for distinguishing "Strong" from "Satisfactory + Limited" papers.

These values cannot be meaningfully interpreted without the underlying data and methodology. The paper does not report the distribution of labels or whether the expert reviewers who generated the benchmark labels were blinded to journal, author, or institutional identity, and the authors provide no data or code to reproduce the analysis. The authors also provide no guarantee that these papers were excluded from the training data of the LLMs used to evaluate them. Therefore, case study 1 does not establish that the QED score accurately measures scientific quality.
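For readers less familiar with the metric, here is a minimal sketch of how the two AUC comparisons in case study 1 could be computed if the labels and scores were available. Everything in it is an assumption for illustration: the `auc` helper, the seven toy papers, and their scores are made up and are not taken from the white paper, whose exact procedure is not disclosed. The sketch uses the rank interpretation of AUC: the probability that a randomly chosen positive paper receives a higher score than a randomly chosen negative one.

```python
from itertools import product


def auc(scores, is_positive):
    """Probability that a random positive outscores a random negative.

    Ties between a positive and a negative count as half a win.
    """
    pos = [s for s, p in zip(scores, is_positive) if p]
    neg = [s for s, p in zip(scores, is_positive) if not p]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = sum(
        1.0 if a > b else 0.5 if a == b else 0.0
        for a, b in product(pos, neg)
    )
    return wins / (len(pos) * len(neg))


# Hypothetical toy data (NOT from the white paper): expert labels and
# a made-up quality score for seven papers.
labels = ["Limited", "Limited", "Satisfactory", "Satisfactory",
          "Satisfactory", "Strong", "Strong"]
scores = [0.20, 0.55, 0.40, 0.70, 0.60, 0.58, 0.90]

# Task A: "Limited" vs "Satisfactory + Strong" (positive = not Limited).
auc_not_limited = auc(scores, [label != "Limited" for label in labels])

# Task B: "Strong" vs "Satisfactory + Limited" (positive = Strong).
auc_strong = auc(scores, [label == "Strong" for label in labels])

print(f"Limited vs rest: {auc_not_limited:.3f}")
print(f"Strong vs rest:  {auc_strong:.3f}")
```

The same scores give a different AUC depending on how the three labels are collapsed into two classes, and the calculation cannot be run at all without the per-paper labels and scores, which is why the reported values cannot be checked independently.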

Case study 2 provides inconsistent evidence that the QED score measures quality

The second case study compares QED scores for 2,879 bioRxiv preprints with the SJR score of the journals where those preprints were eventually published. Across all fields, the authors report a Spearman correlation of 0.63. Within individual fields, however, the correlations ranged from 0.78 (Genetics) to 0.39 (Systems Biology).
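To make the pooled and per-field numbers concrete, the sketch below computes a Spearman correlation across all papers and then separately within each field. It is an illustration under stated assumptions only: the `average_ranks` and `spearman` helpers are written here for the example, the field names "Field A" and "Field B" are placeholders, and the eight data points are invented rather than drawn from the white paper.

```python
from statistics import mean


def average_ranks(values):
    """Return 1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the average ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


# Hypothetical records (NOT from the white paper):
# (field, paper-level score, journal-level score).
records = [
    ("Field A", 1.0, 1.0), ("Field A", 2.0, 2.0),
    ("Field A", 3.0, 3.0), ("Field A", 4.0, 4.0),
    ("Field B", 5.0, 6.0), ("Field B", 6.0, 5.0),
    ("Field B", 7.0, 8.0), ("Field B", 8.0, 7.0),
]

pooled = spearman([q for _, q, _ in records], [s for _, _, s in records])

by_field = {}
for field in sorted({f for f, _, _ in records}):
    paper = [q for f, q, _ in records if f == field]
    journal = [s for f, _, s in records if f == field]
    by_field[field] = spearman(paper, journal)

print(f"pooled: {pooled:.2f}")
for field, rho in by_field.items():
    print(f"{field}: {rho:.2f}")
```

In this toy example the pooled correlation differs from both within-field values, which is why a single pooled figure and the per-field range need to be read together.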

The authors describe the overall agreement as "substantial", but explain weaker agreement in some fields by arguing that the SJR score is a noisy proxy for quality. This argument is internally inconsistent. If the SJR score is a reasonable proxy for scientific quality, then the weaker agreement in some fields suggests that the QED score is a weak proxy for quality in those fields. If the SJR score is a noisy proxy for scientific quality, then agreement with the SJR score cannot be used to validate the QED score. Either way, by the authors' own admission, this analysis does not establish the QED score as an accurate measure of quality.

Case study 3 contains several uncontrolled and...
