Machine Evidence II: The Abstract Setting

It’s nice when a prediction comes good, but a pyrrhic pleasure when it’s a prediction of an oncoming problem which then materialises more quickly than anticipated. In June 2019, I wrote Machine Evidence: Trial by AI, a short piece in which I considered the upcoming risk to medical journals and practitioners posed by the increasing quality of machine-generated text from giant language models such as OpenAI’s GPT-2. In it, I generated a couple of abstracts for medical papers. They were not great. They contained some inconsistencies and basic medical errors. But they had at least some surface resemblance to medical paper components, in a way that most AI text generators prior to GPT-2 would have struggled to achieve.

In that piece, I cautioned that these technologies would simplify the process of faking studies or creating articles which attempt to manipulate medical practitioners, warning:

Fakes and hoaxes do slip through. It’s now much easier to generate fake studies, and GPT-2 still has a lot of development to go through. […] First, if you’re a journal editor or peer reviewer, keep your eyes open. It doesn’t take much to imagine a nefarious sod submitting an AI-generated study to see if it will get through your process. It also doesn’t take a lot to imagine unscrupulous folks employing AI in the future to generate lots of fake trials. Or even to write sections of their paper. 

Machine Evidence, 2019

The journey from there, with GPT-2 producing halfway decent material that might slip past someone not paying careful attention, to our current situation has been far faster than most would have anticipated.

We now have GPT-3 and the much-discussed chat interface built on top of it, ChatGPT. The latest update to GPT-3 produces text which is far more coherent and consistent than the material I generated in 2019. The hoaxing has begun in earnest, as has the use of ChatGPT to generate sections of papers, or entire articles.

In a preprint on bioRxiv, Catherine Gao and colleagues showed that abstracts written by ChatGPT were able to pass as human-written under blind review by a team of four scientists. The researchers gathered a sample of fifty journal article abstracts from the most prestigious medical journals – JAMA, the BMJ, the Lancet, Nature Medicine and NEJM – and asked ChatGPT to generate its own abstracts based only on the article titles and journal names. Reviewers blinded to the authorship of each piece were then asked to judge whether it was an original or an AI-generated abstract. The generated abstracts showed a notable ability to fool some reviewers, some of the time:

When given a mixture of original and generated abstracts, blinded human reviewers correctly identified 68% of generated abstracts as being generated by ChatGPT, but incorrectly identified 14% of original abstracts as being generated. Reviewers indicated that it was surprisingly difficult to differentiate between the two.

Gao et al., 2022.

In other words, Gao et al. found that around a third of the time, human reviewers failed to recognise an AI-generated abstract as the work of ChatGPT.

These results were also obtained under detection conditions heavily tilted in the reviewers' favour. The reviewers were given a sample of abstracts and told that some AI-generated abstracts were amongst them; they were therefore particularly primed to seek out any evidence of AI generation. Even under those favourable circumstances, they were unable to identify the AI-generated abstracts in almost one third of cases.

In the wild, where human attentiveness is less assured and reviewers may not be primed to look for the hallmarks of AI authorship, the ability to bypass reviewer scrutiny is likely to be higher still. Moreover, the authors of the study deliberately took none of the measures available to them, such as prompt engineering, to improve the quality and believability of the generated text.
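
For a sense of how little effort that naive generation step involves, here is a minimal sketch of what it might look like today using the OpenAI Python client. The prompt wording, model name and example title are my own illustrative assumptions, not the exact setup Gao et al. used:

```python
# Minimal sketch (illustrative assumptions throughout): ask a model for an
# abstract given nothing but a title and a target journal, with no prompt
# engineering. Requires the `openai` package and an OPENAI_API_KEY.
from openai import OpenAI

client = OpenAI()

def generate_abstract(title: str, journal: str) -> str:
    # Deliberately naive prompt, roughly mirroring the title-plus-journal
    # setup described in the study; real prompt engineering would go further.
    prompt = (
        f"Please write a scientific abstract for the article '{title}' "
        f"in the style of {journal}."
    )
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",  # assumed model; the study used ChatGPT's web interface
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content

# Hypothetical example title, invented for illustration.
print(generate_abstract(
    "Effect of Early Mobilisation on Outcomes after Hip Fracture Surgery",
    "JAMA",
))
```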

The good news for those concerned by this development is that AI detection software identified the ChatGPT-generated abstracts with an extremely high rate of success and without false positives. This suggests that journal editors and conference organisers who want to avoid hoaxes and fake study submissions, or who do not wish to accept ChatGPT-written submissions, will need to supplement their peer review process with AI detection software. Note that traditional plagiarism detection software will fail to detect ChatGPT-generated material; only specific AI detection tools will do the job reliably.
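
As a rough sketch of what such a screening step could look like, the snippet below scores a submitted abstract with the open-source GPT-2 Output Detector, the RoBERTa-based classifier used in the study, via the Hugging Face transformers library. The flagging threshold and the label handling are assumptions to verify against the model card, not settings recommended by Gao et al.:

```python
# Rough sketch of an editorial screening step: score an abstract with the
# open-source GPT-2 Output Detector (a RoBERTa classifier) and flag likely
# machine-generated text. The 0.5 threshold and the "Fake" label name are
# assumptions to check against the model card.
from transformers import pipeline

detector = pipeline(
    "text-classification",
    model="openai-community/roberta-base-openai-detector",
)

def looks_machine_generated(abstract: str, threshold: float = 0.5) -> bool:
    result = detector(abstract, truncation=True)[0]
    fake_prob = result["score"] if result["label"] == "Fake" else 1 - result["score"]
    return fake_prob > threshold

submission = "Background: We assessed the effect of ..."  # abstract text under review
if looks_machine_generated(submission):
    print("Flag for editorial follow-up: likely AI-generated.")
else:
    print("No AI-generation flag raised.")
```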

This will not remain a valid remedy forever, as I have previously pointed out. Using adversarial training, any AI detection software can in theory be used to train a language model to generate text which is not detectable (or is less detectable) by that particular detection algorithm. A detection-generation arms race is a likely outcome.
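
The dynamic is easy to see even without any training. The toy sketch below uses the crudest possible form of the idea, best-of-n selection against the same open-source detector rather than adversarial training proper: generate several candidates with a small open model and keep whichever one the detector is least confident about. The model choices, sampling settings and prompt are all illustrative assumptions:

```python
# Toy illustration of the arms-race dynamic: detector-guided selection, the
# crudest stand-in for adversarial training. Generate several candidates and
# keep the one the detector is least confident is machine-written.
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")
detector = pipeline(
    "text-classification",
    model="openai-community/roberta-base-openai-detector",
)

def fake_probability(text: str) -> float:
    # Probability the detector assigns to the text being machine-generated;
    # the "Fake" label name is taken from the model card.
    result = detector(text, truncation=True)[0]
    return result["score"] if result["label"] == "Fake" else 1 - result["score"]

prompt = "Background: We conducted a randomised controlled trial of"
candidates = generator(
    prompt, max_new_tokens=60, num_return_sequences=5, do_sample=True
)

# Keep the candidate the detector is least sure about.
best = min(candidates, key=lambda c: fake_probability(c["generated_text"]))
print(round(fake_probability(best["generated_text"]), 3))
print(best["generated_text"][:200])
```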

Nor will this be a complete solution for detecting ChatGPT co-authorship. In co-authorship, a human uses ChatGPT to generate components of a piece and then edits and reworks the generated text themselves. These edits will necessarily reduce the confidence with which AI detection software can identify a text as AI-generated. A co-authoring model may be less worrisome in some spaces, but with respect to hoaxing and fakery, it stands a chance of beating even a peer review process reinforced with AI detection tools in the near future.