This update follows up on earlier posts (20 Dec. 2023; 31 Jan. and 7 Feb. 2024) reporting on the use of generative AI in the peer review of a research paper.
I received an official response by email from the mediAzioni journal editors on Thursday, 22 February 2024.
Attached to that email were detailed responses from both human reviewers (which I do not provide here). I appreciate that the journal has decided to involve the reviewers, whose identities nevertheless remain undisclosed.
Most notable in the response is that the mediAzioni editors take no explicit position on my allegations that generative AI was used to produce both review reports. They state: “our journal is not the proper place to conduct a dispute on this issue” / “la nostra Rivista non è la sede per condurre un contraddittorio in merito.”
Remarks in both anonymous reviewers’ responses suggest they were unaware that I had repeatedly asked to engage with the reviewers, beginning with my earliest email of 20 December 2023 (“which I strongly urge you to address directly with the authors of the reviews (please share this response with them) as well as with the journal’s editorial/scientific boards due to the serious ethical breach I am alleging”). The reviewers appear to have been notified only after my public statements of 31 January, which contrasts with the editors’ earlier commitments to conduct an investigation.
The reviewers’ responses indicate to me that these texts were written by the humans who used generative AI (likely ChatGPT 3.5) to review my article through a process of individualized AI prompting and post-editing, calling upon their own professional expertise. These evidently human-produced responses, written in the first person, thus provide control samples against which to compare the earlier review reports.
The following grid presents the results of a new round of testing, conducted on 23 February 2024 in Italy, using various free and paid AI detection sites (listed alphabetically below). The two original review reports were tested against four control samples: one AI control (selected output from the ChatGPT simulation tests) and three human controls (my article in its draft version of June 2023 and the two reviewer responses to me). The reviewers’ responses are readily identified as human-written, like my own article and in contrast to their original review reports.
Controlled AI detection tests
AI detector | ChatGPT output | Reviewer 1: peer review report | Reviewer 1: response to author | Reviewer 2: peer review report | Reviewer 2: response to author | My article |
AI Detector Pro | 98% chance AI | 98% chance AI | 2% chance AI | 98% chance AI | 3% chance AI | 2% chance AI |
Content at Scale | “Human probability: X Reads like AI!” “Unfortunately, it reads very machine-like” | “Human probability: Hard to tell” “Unfortunately, it reads very machine-like” | “Human probability: Passes as Human!” | “Human probability: X Reads like AI!” “Unfortunately, it reads very machine-like” | “Human probability: Passes as Human!” | “Human probability: Passes as Human!” |
Copyleaks | 100% AI content | 73.5% AI content | 0% AI content | 60% AI content | 0% AI content | 0% AI content |
Crossplag | ChatGPT 3.5: AI Content Index 80% – 100% ChatGPT 4: AI Content Index 0% – 100% | AI Content Index 100% “This text is mainly written by an AI” | AI Content Index 0% “This text is mainly written by a human” | AI Content Index 82% “This text is mainly written by an AI” | AI Content Index 0% “This text is mainly written by a human” | AI Content Index 0% “This text is mainly written by a human” |
GLTR | higher likelihood | higher likelihood | lower likelihood | higher likelihood | lower likelihood | lower likelihood |
GPTZero | 95% – 98% probability AI depending on text input | Probability Breakdown: Human 24% Mixed 32% AI 44% | Probability Breakdown: Human 93% Mixed 6% AI 0% | Probability Breakdown: Human 83% Mixed 15% AI 2% | Probability Breakdown: Human 90% Mixed 10% AI 0% | Probability Breakdown: Human 91% Mixed 9% AI 0% |
Originality AI (% = confidence level of prediction) | AI Detection Score 0% Original 100% AI | AI Detection Score 0% Original 100% AI | AI Detection Score 22% Original 78% AI | AI Detection Score 0% Original 100% AI | AI Detection Score 1% Original 99% AI | AI Detection Score 81% Original 19% AI |
Sapling | Fake: 98% – 100% | Fake: 100% | Fake: % variable depending on input | Fake: 93.5% | Fake: % variable depending on input | Fake: 0.0% – 2.4% |
Winston AI | ChatGPT 3.5: Human Score 0% ChatGPT 4: Human Score 0% – 7% | Human Score 5% | Human Score 100% | Human Score 32% | Human Score 100% | Human Score 100% |
Notes:
– GPTKit, Scribbr, Undetectable AI, and Writer were initially tested but were excluded from the analysis due to their inability to reliably detect pure ChatGPT output.
– Such tests provide comparative indications only. Originality AI gives false positives for some human control text, outliers indicating lower reliability of this tool. I did not include the Sapling “% fake” predictions for the reviewer responses due to their inconsistency depending on the text inserted.
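As a rough illustration of the contrast in the grid, the short Python sketch below averages the AI-likelihood percentages for the two review reports and for the three human controls, for the three detectors that report a single percentage per text (AI Detector Pro, Copyleaks, Crossplag). The figures are copied from the grid; the names `SCORES` and `summarize` are hypothetical and were not part of the original testing, and the averages are a comparative indication only, not a statistical test.

```python
from statistics import mean

# AI-likelihood percentages copied from the grid (single-valued rows only).
# Column order: Reviewer 1 report, Reviewer 1 response, Reviewer 2 report,
# Reviewer 2 response, my article.
SCORES = {
    "AI Detector Pro": [98, 2, 98, 3, 2],
    "Copyleaks": [73.5, 0, 60, 0, 0],
    "Crossplag": [100, 0, 82, 0, 0],
}
REPORT_COLUMNS = [0, 2]            # the two original peer review reports
HUMAN_CONTROL_COLUMNS = [1, 3, 4]  # the two reviewer responses and my article


def summarize(scores):
    """Return {detector: (mean report score, mean human-control score, gap)}."""
    summary = {}
    for detector, row in scores.items():
        reports = mean(row[i] for i in REPORT_COLUMNS)
        controls = mean(row[i] for i in HUMAN_CONTROL_COLUMNS)
        summary[detector] = (reports, controls, reports - controls)
    return summary


if __name__ == "__main__":
    for detector, (reports, controls, gap) in summarize(SCORES).items():
        print(f"{detector}: reports {reports:.1f}%, "
              f"human controls {controls:.1f}%, gap {gap:.1f} points")
```

Detectors that report ranges, verbal labels, or multi-part breakdowns (Content at Scale, GLTR, GPTZero, Originality AI, Sapling, Winston AI) are left out of this sketch because their results do not reduce to one comparable percentage without further assumptions.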
These new controlled AI detection tests – along with the ChatGPT simulations, the concordance lines with many matching text patterns, my qualitative commentary in the annotated reports, the journal’s response timeline, the unwillingness to de-anonymize, and the unconvincing reviewer responses – only reinforce my position that generative AI was used to produce both peer review reports for my paper.
My case must not be seen as exceptional. I do not mean to single out the anonymous reviewers alone for their conduct or the journal editors for their response, as similar scenarios are surely now playing out all over the place. My public pursuit of this matter has been motivated as much by raising awareness as by establishing accountability.
Nicholas Lo Vecchio
26 February 2024