Machine Evidence: Trial by AI

Take a look at the following snippets from descriptions of clinical trials, thinking about how you’d rate the quality and strength of the evidence that comes from each:


We conducted a clinical trial in which erythropoietin (EPO) administration was administered daily to patients with severe acne resulting in clinically significant skin damage in addition to profound fatigue as seen in the patients’ medical history. Although there were not significant differences between the placebo group and the Epo-Group in terms of skin changes, patients who received EPO had more than double the number of epidermal hyperplasia scores, a more marked increase in the number of lesions on the erythrocyte surface, as well as an increased number of systemic inflammatory mediators, including interleukin (IL)-6, interleukin-2, M-CSF, and TNF-α compared to patients receiving placebo. EPO was administered in the study phase during an average of 5 weeks per patient, and patients with milder acne treated with a more concentrated product had less erythrocyte inflammation and fewer systemic inflammatory mediators. More than 40% of the patients who received EPO in this study showed an improvement in their acne after 8 weeks compared to 25% of those who received placebo.


We conducted a randomized controlled trial to evaluate the effect of supplementation with a combination of the L-ascorbic acid and vitamin C in humans on clinical outcomes. Ninety-three subjects were randomly assigned to one of three study groups. The subjects were assessed at baseline for serum ascorbic acid, bupropion and vitamin C concentrations and for changes in blood pressure and cholesterol. In addition, the subjects were interviewed about their weight and physical activity at baseline. The outcomes were then evaluated in 4×4×4 tests on 4 occasions. All the different study groups were classified in 1 of the following order: L-ascorbic acid (A or B), bupropion (C or D), dietary supplements, or no dietary supplements. L-ascorbic acid and bupropion were not studied in men, women, and elderly (80 to 90 years old).


Objective: Conduct a randomized controlled trial to test the claim that erythromycin, a common antibiotic prescribed for urinary tract infections, decreases mortality in patients with active UTIs.

Design: Fifty three participants who met all relevant inclusion criteria were randomly assigned to erythromycin for 5 days to see whether the combination of erythromycin and gentamicin increases survival for patients with active chronic UTI.

Setting: A tertiary care hospital in rural Texas.

Patients and treatment groups: Twenty patients were included (17 women, 21 men) and 50 control subjects (24 women, 23 men). To include a placebo group, all patients were randomized to either erythromycin (n = 27) or to gentamicin (n = 27).

Main outcome: Baseline survival: The primary outcome was the relative change in survival from the baseline to the end of each 8-week treatment period. This was determined by dividing the total (time to death/week) mortality adjusted for the length of the randomized treatment period by the baseline outcome.

My question for you: at what point did you realise something wasn’t right? Did you make it through to the numbers not adding up in (3)? (“Twenty patients were included (17 women, 21 men)”) Did the last sentence of (2) throw you? (“L-ascorbic acid and bupropion were not studied in men, women, and elderly (80 to 90 years old).”) Or maybe some of the internal inconsistencies in (1)?

All three of these trial descriptions were composed by a neural network, and as far as I can tell, no such trials took place. The neural network in question is OpenAI’s new machine learning model, GPT-2. Machine learning engineer Adam King recently launched TalktoTransformer, a wonderfully straightforward way for anyone to play with GPT-2 and see what it can do. It can do a lot. [update: (03/07/20) – TalktoTransformer has now been taken down by its creator because of the incredible cost of running it as a free service. Huge thanks are due to Adam King for all of his efforts in allowing so many people to engage with this technology. I hope some readers will be able to support his new paid project Inferkit]

The interface just requires you to input any amount of text, and the AI will then generate text to follow on. This kind of program has been around for a while, but it’s inspiring and a little terrifying to see how good GPT-2 is, and how far this has come since the last time I tried to feed my PhD thesis into a recursive neural net and generate some machine philosophy. GPT-2 can emulate poetry, write news articles, and – as we’ve seen – compose clinical trial reports entirely fabricated from its understanding of the highly structured format of the genre. I gave GPT-2 the text you see in bold above, and everything else is its own. GPT-2 also emulates academic practices such as referencing:

In the study, women who had a weight loss of more than 15 kg in five years were significantly more likely to report increasing their physical activity by 15-25 min per week compared to an intervention group that did not receive assistance from a certified trainer. This result provided support for the association between increases in physical activity and reduction in obesity and diabetes.

Here, after inventing a study and over-interpreting it, it provides a link to a real article by Martens et al, entitled “Presence of Chlamydia, Mycoplasma, Ureaplasma, and Other Bacteria in the Upper and Lower Genital Tracts of Fertile and Infertile Populations” – so perhaps not the real source, then. But enough to fool a casual reader who doesn’t follow up every reference?

The researchers also found that pregnancies were significantly less likely to occur if a woman was taking exercise, smoking, or any other form of exercise compared to those taking only a low-impact exercise training program (L.A. et al., 2008). Other studies have demonstrated that high intensity exercise (≥90 km per week and <20 METs/wk) increased the likelihood of a first pregnancy loss on the day of delivery by 24% (Bergström et al., 2000).

I have no idea if these references are real or have been dreamed up by the AI. I did find a 2000 study by Bergström et al., though it was about smoking and periodontal health…

GPT-2 is not just a curiosity. It’s an important phenomenon for the way we think about the evidence we consume. We all know, somewhere in the backs of our minds, that there are fake trials out there, published or otherwise making the rounds. There are cases of researchers who have been exposed after publishing faked data and entire fake studies. It’s hubris at best to think that we catch them all. Fakes and hoaxes do slip through. It’s now much easier to generate fake studies, and GPT-2 still has a lot of development to go through. It’s also far from the only AI working towards replicating the style and content of any given text. An AI tailored to generating fake study reports would probably do it better and more deceptively, especially when some of the glitches (numbers that don’t match, for instance) are patched out.
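The mismatched numbers in snippet (3) are the sort of glitch a few lines of code can flag today. What follows is a minimal sketch in Python, not a real screening tool: the function name `check_counts` is hypothetical, and the figures are copied by hand from the snippet rather than extracted automatically. It also illustrates the worry above, since a fake whose numbers happen to add up would pass straight through.

```python
def check_counts(label, stated_total, subgroups):
    """Return a warning string if the subgroups do not sum to the stated total, else None."""
    observed = sum(subgroups.values())
    if observed != stated_total:
        return f"{label}: stated {stated_total}, but subgroups sum to {observed}"
    return None


# Figures as reported in snippet (3), entered by hand.
checks = [
    ("patients", 20, {"women": 17, "men": 21}),
    ("control subjects", 50, {"women": 24, "men": 23}),
    ("randomized participants", 53, {"erythromycin": 27, "gentamicin": 27}),
]

# Collect a warning for every stated total that its subgroups contradict.
warnings = [w for w in (check_counts(*c) for c in checks) if w is not None]

for w in warnings:
    print(w)
```

All three stated totals in snippet (3) disagree with their subgroups, so each check produces a warning.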

What does this mean for us? First, if you’re a journal editor or peer reviewer, keep your eyes open. It doesn’t take much to imagine a nefarious sod submitting an AI-generated study to see if it will get through your process. It also doesn’t take a lot to imagine unscrupulous folks employing AI in the future to generate lots of fake trials. Or even to write sections of their paper. Ever get bored writing the same old descriptions of the condition your study focuses on? Great, let GPT-2 do it for you:

Cases of chronic rheumatoid arthritis (CRS) are more aggressive, include more localized tissue damage, and require more expensive and lengthy hospitalization compared to most cases.

They can have lifelong damage to nerve tissue including the joints, and are almost always associated with weakened immune systems, increased vulnerability to infection, and increased susceptibility to complications.

How’d it do? It’s still learning, always learning.

The first question has often been the one I posed at the start of this post: how good is the trial? But really we ought to always ask: how good is the evidence that this trial actually took place as described? The procedures we use to verify the details of trials, from protocol through implementation and analysis to report, are highly variable, and they are inconsistently applied and understood. Maybe the first message of GPT-2 is a warning that we should have heeded long ago: if you’re not systematically verifying that the details of a report come from a study that really took place, and that they report real data, then you’re open to being misled by fabrications masquerading as trials. Not only does failure to verify risk the reputation of journals, it risks skewing clinical practice with long-lasting effects.

We shouldn’t ignore the benefits that AI may offer in healthcare. Optimization and categorization are the bread and butter of AI tech, and diagnostics, prognostics and treatment decision-making all can potentially benefit. We’ve seen some of that manifested already. But at the same time, given the immense value of medical evidence, remember that right now AI can be turned towards anyone’s goals.

It’s not just medical evidence that GPT-2 can synthesise. Philosophers of medicine such as myself should watch our backs. I fed TalktoTransformer small snippets of my PhD thesis on Hierarchies of Evidence, and it generated swathes of philosophical musings on evidence, like this one:

I then argue that the content and defensibility of certain philosophical commitments, and of specific theories, is determined by the role of an individual and his or her interactionist conception of human nature. In this understanding of human nature, human beings have been interpreted differently under different levels of reasoning because of human differences in the types of capacities for which they make the judgement as well as their capacity to respond. This interactionist conception also includes a conception of our cognitive and moral capacities as well.


The claim that follows from causality is that there is an empirical requirement for the claim that A causes B. The claim that causality is a necessary property of all things is that it is appropriate for the non-causal property of causality to be a necessary property of all things too.

It may not be Wittgenstein (yet) but it also loves a good list, and had no problem emulating his style to rewrite some of the Tractatus. I challenge you, without peeking at the original, to tell me where exactly Wittgenstein ends and the computer takes over:

1. The world is everything that is the case.
1.1. The world is the totality of facts, not of things.
1.11 The world is determined by the facts, and by these being all the facts.
1.12 For the totality of facts determines both what is the case, and also all that is not the case.
1.13 This totality of facts and all its consequences cannot be known without reference to the sum of its parts.
1.1.3. What is true and true is not a contradiction,but something other than itself.
1.14 Thus, what is a fact is the totality of facts. It can be known only by reference to all the facts. And if the facts receive a priori reference in this manner, they necessarily are true.
1.15 Hence, the totality of facts cannot be known without reference to all things or of the relations between them.
1.16 A proposition is true if its premises are true.
1.17 Therefore, everything true and true is itself.
2. The world contains two sides of the same object.
2.1. The truth of something else cannot exist in itself.
2.2. Therefore, nothing exists in itself.

Answers on a postcard.