Does Your Paper Really Suck?

By A. Sina Booeshaghi · June 27, 2026

Oded Rechavi, at QED Science, believes that if your paper is not in the top 1% of papers by QED score, then it "sucks". But what is the QED score, and what is its purpose? Does it really measure scientific quality? If a paper is not in the top 1%, does it really suck?

These are important questions because scientists are increasingly overwhelmed with the volume of new work posted on preprint servers and published in journals. As a result, traditional quality signals used for triaging papers, such as journal, conference venue, and institution, are becoming less reliable. AI further compounds this problem by making it easy to produce plausible scientific writing at scale. Papers are longer, figures are denser, and the existence of a paper is no longer sufficient evidence that it represents substantial scientific work.

In response, companies like QED Science are building AI tools to help scientists identify quality work. QED uses Large Language Models (LLMs) to review scientific papers and provide AI feedback. Many scientists report that the feedback is useful and often resembles comments received during human peer review.

QED recently released a white paper that goes one step further and describes the "QED Score", a single number intended to measure a paper's quality. The score is generated by prompting a collection of LLMs to review a paper for "originality" and "validity" and then combining the resulting evaluations. In their white paper, the authors claim that the QED score is a "more accurate, faster, and less biased estimate of paper quality than journal rank."

The authors present three validation studies, all of which compare the QED score against the SCImago Journal Rank (SJR), a journal-level metric based on citation data. The first study compares QED and SJR against a corpus of expert-assigned labels ("Limited", "Satisfactory", and "Strong"). The second compares QED scores for 2,879 bioRxiv preprints with the SJR of the journals in which those papers were eventually published. The third asks experts to choose between pairs of papers where QED and SJR disagree most strongly.

In this review, I evaluate the evidence supporting the QED score as a measure of scientific quality. While QED clearly provides a much faster review than traditional peer review, I find that the evidence presented does not support the authors' claims that the QED score is a more accurate or less biased measure of scientific quality.

Case study 1 is methodologically opaque and does not effectively demonstrate that the QED score measures quality

In case study 1, the authors obtain a curated dataset of 975 published papers labelled "Limited", "Satisfactory", or "Strong" by a panel of expert reviewers whose identities are not disclosed. Each paper received a label based on validity and originality, the same criteria used to generate the QED score. The authors then ask whether the QED score or the SJR score better predicts these labels. QED achieved an AUC of 0.863 versus SJR's 0.804 for distinguishing "Limited" from "Satisfactory + Strong" papers, and 0.782 versus 0.774 for distinguishing "Strong" from "Satisfactory + Limited" papers.

These values cannot be meaningfully interpreted without the underlying data and methodology. The white paper does not report the distribution of labels or whether the expert reviewers who generated the benchmark labels were blinded to journal, author, or institutional identity, and the authors provide no data or code to reproduce the analysis. The authors also provide no guarantee that these papers were excluded from the training data of the LLMs used to evaluate them. Therefore, case study 1 does not establish that the QED score accurately measures scientific quality.
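For readers less familiar with the metric, the following is a minimal sketch of how an AUC for the two classification tasks could be computed. It assumes that "AUC" means the area under the ROC curve, which is the usual reading but is an assumption here. The labels and scores are hypothetical toy values, not data from the white paper, and the function name `auc` is illustrative.

```python
from itertools import product


def auc(scores, labels):
    """Area under the ROC curve via pairwise comparison.

    Equals the probability that a randomly chosen positive example
    scores higher than a randomly chosen negative one (ties count 0.5).
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p, n in product(pos, neg)
    )
    return wins / (len(pos) * len(neg))


# Hypothetical toy data: NOT taken from the QED white paper.
expert_labels = ["Limited", "Limited", "Satisfactory",
                 "Satisfactory", "Strong", "Strong"]
toy_scores = [0.2, 0.5, 0.4, 0.7, 0.6, 0.9]

# Task 1: "Limited" (negative) vs "Satisfactory + Strong" (positive)
not_limited = [0 if lab == "Limited" else 1 for lab in expert_labels]
# Task 2: "Strong" (positive) vs "Satisfactory + Limited" (negative)
strong = [1 if lab == "Strong" else 0 for lab in expert_labels]

auc_not_limited = auc(toy_scores, not_limited)
auc_strong = auc(toy_scores, strong)
print(auc_not_limited, auc_strong)
```

On this toy data both tasks give an AUC of 0.875. Recomputing the reported values in the same way would require the per-paper labels and scores, which are not provided.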

Case study 2 provides inconsistent evidence that the QED score measures quality

The second case study compares QED scores for 2,879 bioRxiv preprints with the SJR score of the journals where those preprints were eventually published. Across all fields, the authors report a Spearman correlation of 0.63. Within individual fields, however, the correlations range from 0.78 (Genetics) to 0.39 (Systems Biology).
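As a sketch of the computation involved, the snippet below calculates a Spearman correlation both pooled across fields and within each field. The records are hypothetical toy values (two made-up fields, "A" and "B"), not the bioRxiv data, and all helper names are illustrative.

```python
from statistics import mean


def average_ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average of positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical (field, qed_score, sjr) records: NOT real data.
records = [
    ("A", 10, 1.0), ("A", 20, 2.0), ("A", 30, 3.0), ("A", 40, 4.0),
    ("B", 15, 2.5), ("B", 25, 1.5), ("B", 35, 4.5), ("B", 45, 3.5),
]


def spearman_for(rows):
    return spearman([r[1] for r in rows], [r[2] for r in rows])


rho_pooled = spearman_for(records)
rho_by_field = {
    field: spearman_for([r for r in records if r[0] == field])
    for field in sorted({r[0] for r in records})
}
print(round(rho_pooled, 3), rho_by_field)
```

In this toy example the pooled correlation (about 0.81) sits between the two within-field values (1.0 and 0.6), so a single pooled number can hide how much agreement varies from field to field.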

The authors describe the overall agreement as "substantial", but explain weaker agreement in some fields by arguing that the SJR score is a noisy proxy for quality. This argument is internally inconsistent. If the SJR score is a reasonable proxy for scientific quality, then the weaker agreement in some fields suggests that the QED score is a weak proxy for quality. If the SJR score is a noisy proxy for scientific quality, then agreement with the SJR score cannot be used to validate the QED score. Either way, by the authors' own admission, this analysis does not establish the QED score as an accurate measure of quality.
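The second horn of this dilemma can be illustrated with a small simulation. This is a sketch under strong assumptions, not a model of the actual data: paper quality is a hypothetical latent Gaussian variable, the proxy is quality plus independent Gaussian noise, and the "score" is assumed to track quality perfectly. For simplicity it uses a Pearson rather than a Spearman correlation. All names and parameters are illustrative.

```python
import random
from statistics import mean


def pearson(x, y):
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5


def agreement_with_proxy(noise_sd, n=2000, seed=0):
    """Correlation between a perfect quality score and a noisy proxy."""
    rng = random.Random(seed)
    # Hypothetical latent quality of each paper.
    quality = [rng.gauss(0, 1) for _ in range(n)]
    # Proxy (standing in for a journal-level metric): quality plus noise.
    proxy = [q + rng.gauss(0, noise_sd) for q in quality]
    # A score that measures quality perfectly, by construction.
    perfect_score = quality
    return pearson(perfect_score, proxy)


corr_clean_proxy = agreement_with_proxy(noise_sd=0.3)
corr_noisy_proxy = agreement_with_proxy(noise_sd=3.0)
print(round(corr_clean_proxy, 2), round(corr_noisy_proxy, 2))
```

Under these assumptions, even a score that measures quality perfectly agrees only weakly with the proxy once the proxy is noisy. The level of agreement then reflects the noise in the proxy rather than the accuracy of the score, which is why agreement with a noisy proxy cannot serve as validation.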

Case study 3 contains several uncontrolled and...
