All that Glitters is not… Evidence

Recently, I wrote about a new machine learning model called GPT-2 which was conspicuously not released by OpenAI. GPT-2 is a massive language model which can be used to generate often highly convincing text when given a prompt. Using the ‘attention’ framework, the language model is able to produce compelling text which coherently refers back to terms used across the whole piece. This allows it to create text which looks realistic, particularly if you don’t look too closely at the details. The whole thing was considered too potentially dangerous for release by OpenAI, who chose instead to release a much smaller version of the model which isn’t capable of anything like as convincing text generation. They’re now embarking on a staggered release schedule in which they’re slowly ramping up the size of the models that they offer to the public, whilst simultaneously working closely with platforms to anticipate some of the problems that will come along with the ability to rapidly and very cheaply generate large quantities of reasonably persuasive-looking text on pretty much any topic.

In my previous post, I gave a few examples of how this kind of machine learning model might affect the way we interact with medical evidence, and argued that it’s important to pay attention to techniques and procedures to ensure that the reports of studies which we engage with are genuine. Verifying trial reports isn’t vital only because tools like GPT-2 can fake reports quickly; it has always been important, because frauds and fakes are real and we don’t catch them all.

The examples that I gave in my post were generated with the dramatically scaled-back version of GPT-2 which OpenAI had released, and created using a very neat site, TalkToTransformer by Adam King, which allows anyone to prompt the weak GPT-2 instance and see how it would respond. There’s also Huggingface’s WriteWithTransformer which does a similar thing. So the samples I generated weren’t particularly brilliant, but might get by an unfocused reader. The full model, though, might produce far more consistent work.

One thing I didn’t mention, though, is the possibility of a tool being created which could detect AI-generated text and thereby address the problem right from the outset. One such example is GLTR – pronounced ‘Glitter’ – from the MIT-IBM Watson AI Lab and Harvard NLP. GLTR (Giant Language model Test Room) is a tool specifically designed to expose text written by a language model like GPT-2. It does this by testing whether each word in a text sample is amongst the likeliest words which GPT-2 would be choosing from when generating new text. Essentially, it looks at how GPT-2 generates text and asks whether it’s likely that GPT-2 would generate the sample.

GLTR colour-codes words and individual letters accordingly. If you see lots of green and yellow, but very little red and purple, then a language model like GPT-2 would be able to generate that text – it matches the likely predictions of the next word which GPT-2 would make. Where you see plenty of red and purple sprinkled through the text, though, there’s evidence that this was written by a human, as the word choice is consistently veering outside the most probable choices which GPT-2 would select from.
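To make the colour scheme concrete, here is a minimal sketch in Python of the kind of rank-based bucketing described above. It is not GLTR’s actual code. The cut-offs (top 10, top 100, top 1,000) are my assumption about how the colours are assigned, and the function names and the toy probability table are hypothetical. A real version would take its next-word probabilities from GPT-2 rather than from a hand-written dictionary.

```python
from typing import Dict, List

# Assumed cut-offs: a word ranked in the model's top 10 guesses is green,
# top 100 yellow, top 1,000 red, and anything less likely is purple.
BUCKETS = [(10, "green"), (100, "yellow"), (1000, "red")]


def rank_of(word: str, probs: Dict[str, float]) -> int:
    """1-based rank of `word` among the model's predicted next words.

    `probs` maps candidate words to probabilities. A word the model did
    not list at all is ranked just below every listed candidate.
    """
    ordered = sorted(probs, key=probs.get, reverse=True)
    return ordered.index(word) + 1 if word in probs else len(ordered) + 1


def colour_for_rank(rank: int) -> str:
    """Map a 1-based rank to a colour bucket."""
    for limit, colour in BUCKETS:
        if rank <= limit:
            return colour
    return "purple"


def fraction_unlikely(ranks: List[int]) -> float:
    """Share of words that land in the red or purple buckets."""
    if not ranks:
        return 0.0
    unlikely = sum(colour_for_rank(r) in ("red", "purple") for r in ranks)
    return unlikely / len(ranks)
```

On this sketch, a passage whose words mostly sit at ranks below 100 gives a low `fraction_unlikely`, which is the ‘lots of green and yellow’ pattern; a passage with many ranks above 100 gives a higher value, which is the pattern associated with human writing.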

For example, I fed GLTR one of the samples from my previous post, and its analysis quickly identifies it as likely to be written by a language model:

Notice all the green and yellow. The red at the start is located only in the prompt text I gave to TalkToTransformer. By contrast, a paragraph of my own from later in the same post:

Here, we see a lot more hallmarks of human-written text: lots of red and purple as I select words which GPT-2 wouldn’t reckon are all that likely to come next given everything that came before. That’s reassuring for me. But should we be reassured more generally? We’ve got a way to identify AI-generated text and we can test for it.

But it’s not so simple. GLTR is good at recognising text generated by GPT-2, but not so great at recognising text from some other language models, as Janelle Shane of AIWeirdness.com pointed out. GLTR works not so much as a generalised fake text detector, but rather as a specific GPT-2 spotter.

That’s not the only issue. Training bots by facing them off against other bots is by now a staple machine learning approach. It wouldn’t be particularly cumbersome to adjust a language model’s text generation so that it picked out some of the red or purple words more frequently – indeed, GLTR itself could act as the teacher for a language model with that goal in mind. Take a function of the way GLTR scores text and optimize towards fooling it. That might damage the overall coherence of the text and lead to some dodgy results, but if GLTR is right about the patterns exhibited by human-made text, it might also improve how well the language model emulates our writing style.
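As a rough illustration of how cheap the crudest version of this could be, here is a toy Python sketch. It doesn’t train anything against GLTR. It simply nudges a generator to pick, every so often, a word from outside its top-ranked candidates, so that a rank-based detector would see more red and purple. The function names, the `escape_prob` parameter and the top-10 cut-off are all hypothetical choices of mine, and the ‘model’ is reduced to a list of ranks. A serious attempt would instead optimize against the detector’s score directly.

```python
import random
from typing import List


def pick_rank(n_candidates: int, escape_prob: float,
              rng: random.Random, top_k: int = 10) -> int:
    """Choose the 1-based rank of the next word to emit.

    With probability `escape_prob`, deliberately pick a candidate ranked
    below the top `top_k`; otherwise pick from within the top `top_k`.
    """
    if n_candidates > top_k and rng.random() < escape_prob:
        return rng.randint(top_k + 1, n_candidates)
    return rng.randint(1, min(top_k, n_candidates))


def generate_ranks(length: int, n_candidates: int,
                   escape_prob: float, seed: int = 0) -> List[int]:
    """Ranks of the words a toy generator would emit for a whole passage."""
    rng = random.Random(seed)
    return [pick_rank(n_candidates, escape_prob, rng) for _ in range(length)]
```

With `escape_prob` at 0 every word stays within the top 10, which is the tell-tale machine pattern; raising it scatters less likely words through the passage. Picking those words uniformly at random, as this sketch does, would almost certainly wreck coherence, which is exactly the trade-off mentioned above.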

This is a general problem for approaches to algorithmic detection of machine-made content. Any sufficiently accomplished tool to detect fake content becomes the new training partner for an AI to learn to beat detection. Sure, we can have an arms race of detectors and deceivers, but it doesn’t seem likely that successful detection will be a generalised norm, even if it’s possible for certain periods. That might be enough. After all, the text we’re worried about is static. A fake trial report, for instance, won’t get better over time, so as detection tools get better, it’ll get caught and expunged. Whether that erodes trust in reports more generally, and whether we’re able to disseminate retractions well enough to undo any damage, though, might be another matter. That’s assuming we’re paying attention at all. Attention is expensive, the incentives to deceive are high, and the price of deception is coming down fast.