There has been some hope that detection systems will enable educators and others to determine the likely provenance of texts: whether some text has been written by a human (e.g. your student) or by a large language model (e.g. GPT-3.5). In principle, this is a relatively obvious classification problem, and supervised machine learning models are frequently trained to solve such problems. The problem is made more tractable by the vast datasets, most of which can be labelled automatically, available to train such systems. Most of the text on the internet is human-written – for the time being, a high enough percentage to work as a passable dataset for the human side of the classification problem. Vast quantities of AI-written text are being produced daily, and the owners of such generators (e.g. OpenAI) know that all such text is AI-generated. Constructing a labelled training dataset is not the challenge.
Other approaches, though, have emerged. Rather than attempting to train a black box classifier to categorise texts into human and AI-authored, tools like the popular GPTZero have tried a more theory-driven, principled approach. These detectors start from the observation that AI-written texts have a tendency to share certain properties, which human-written texts are less likely to exhibit. In the case of GPTZero, the two chosen properties are perplexity and burstiness. The theory is that AI-written texts tend to have low levels of both perplexity and burstiness, while human-written texts tend to exhibit higher levels of one or both of these characteristics.
Perplexity is a well-used measure in the design of language models. It is a mathematical way to try to capture the degree to which a language model generates an expected, normal or conventional choice as the next word in a text. But here, rather than using perplexity as an evaluation of the language model (i.e. how well does it provide the expected choice of word when prompted to do so?), detectorists are measuring the perplexity of a text itself. Really, we should have two distinct sets of terminology here to prevent the slippage between the measure of language model performance and the measure of text characteristics. A text with high perplexity contains lots of words which humans (and hence good language models – i.e. ones with low perplexity scores) would be unlikely to choose. It contains unexpected words which look out of context, particularly when each is judged only against what precedes it. This seems like a good way to measure the chance that a language model would have been able to generate a specific text. Language models trained to have low perplexity, and used at settings which don’t deviate from that, are presumed to be less likely to generate high perplexity texts. Humans, with their idiosyncrasies, are more likely to pick some perplexing word choices.
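To make the idea of measuring the perplexity of a text concrete, here is a minimal sketch in Python. It uses the standard textbook definition (the exponential of the average negative log-probability per token) and assumes we already have the probability some language model assigned to each token of the text. The function name and the probability values are invented for illustration; nothing here is taken from GPTZero itself, whose scoring details are not given in this article.

```python
import math


def perplexity(token_probs):
    """Perplexity of a text, given the probability a language model
    assigned to each of its tokens (standard textbook definition)."""
    if not token_probs:
        raise ValueError("need at least one token probability")
    # Average negative log-probability per token (cross-entropy in nats).
    avg_neg_log_prob = -sum(math.log(p) for p in token_probs) / len(token_probs)
    # Exponentiate to get back to an "effective number of choices" scale.
    return math.exp(avg_neg_log_prob)


# Hypothetical per-token probabilities, invented for illustration only.
predictable = [0.5, 0.5, 0.5, 0.5]    # every word was an expected choice
surprising = [0.5, 0.01, 0.5, 0.01]   # two words the model found very unlikely

print(perplexity(predictable))
print(perplexity(surprising))
```

The predictable sequence comes out at about 2, while the sequence with two unlikely words comes out at about 14.1: a handful of unexpected word choices is enough to raise the score substantially.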
Of course, evaluating the perplexity of a text will never allow us to infer that a text was written by an AI with a high degree of certainty. It is relatively trivial to construct texts with low perplexity scores as a human author. Simply write the most tedious, uninspired drudge you can. I set out to write a very tedious and expected description of Immanuel Kant’s categorical imperative, and ended up with a low perplexity score according to GPTZero:
Immanuel Kant was a German philosopher who wrote about ethics. He wrote a book called The Groundwork of the Metaphysics of Morals. In his book, Immanuel Kant described the concept of the Categorical Imperative, which is an important ethical principle according to him.
Written entirely (ashamedly) by the author
When processed by GPTZero, this text received a perplexity score of 22.333 and a burstiness score (more on that later) of 3.512. These are low values. GPTZero concluded, hubristically, that: “Your text is likely to be written entirely by AI”. Entirely incorrect.
This was my first and only attempt at fooling GPTZero, and it worked first time, simply by writing in a way that GPTZero associates with AI language generators. How did I reverse-fool GPTZero? By using groups of words that are found together over and over in the training data: words that belong together. There is nothing in this text which surprises or stands out. I repeated myself a lot. I also kept my sentences grammatically simple. Rather than going for even the most basic of compound phrases (e.g. “In his book, Immanuel Kant described the concept of the Categorical Imperative, which is an important ethical principle according to him.”), I went for staccato, repetitive phrasing. This also helped me keep the burstiness score low. Burstiness is a mathematical attempt to measure how varied the perplexity of the text is. The expectation is that human authors will sometimes dive into idiosyncratic phrasing and surprise you, and sometimes write quite tedious clichés, melding the modes and varying their sentence lengths and writing tempo throughout a piece.
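One way to picture burstiness as "how varied the perplexity of the text is" is as the spread of perplexity scores across a text's sentences. The sketch below uses the population standard deviation for that spread. This is an assumption made purely for illustration: GPTZero's actual formula is not stated in this article, and the function name and per-sentence scores are hypothetical.

```python
import statistics


def burstiness(sentence_perplexities):
    """Spread of per-sentence perplexity scores.

    Assumption: spread is measured as the population standard deviation.
    This is an illustrative choice, not GPTZero's documented formula.
    """
    if len(sentence_perplexities) < 2:
        return 0.0  # a single sentence has no variation to measure
    return statistics.pstdev(sentence_perplexities)


# Hypothetical per-sentence perplexity scores, invented for illustration.
flat = [20.0, 21.0, 19.0, 20.0]     # staccato, uniformly dull sentences
varied = [8.0, 60.0, 15.0, 95.0]    # clichés mixed with surprising turns

print(burstiness(flat))
print(burstiness(varied))
```

Under this definition the uniformly dull text scores well under 1, while the text that alternates between cliché and surprise scores above 35, which matches the intuition that varied tempo reads as more human.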
We’ve demonstrated that a human can deliberately write in a way that has low perplexity and low burstiness, and thus get GPTZero to flag their own text as likely to be entirely AI written. Can we go the other way? Now that we know what the detector is looking for, can we prompt a language model like ChatGPT to generate high perplexity, bursty text? Yes, through some quite trivial prompt engineering.
To start with, I asked ChatGPT to generate a paragraph of text about Kant’s Categorical Imperative for an essay. The prompt was: “Write the first paragraph of an essay about Immanuel Kant’s categorical imperative.” As GPTZero’s authors expect, it wrote something dry and uninspired:
Immanuel Kant’s categorical imperative is a central concept in his ethical philosophy. It is a principle that Kant believed should guide all moral decision-making, providing a universal and objective framework for determining what actions are morally right or wrong. According to Kant, the categorical imperative is based on the idea of acting out of duty, rather than self-interest or personal preference. This means that we should act in accordance with the moral law simply because it is the right thing to do, without expecting any kind of reward or benefit in return. In this essay, I will explore the meaning and implications of the categorical imperative, as well as its relevance to contemporary ethical debates.
ChatGPT, generated by the author
This is the kind of dull text that educators overoptimistically presume is the inevitable output of language models when turned to homework assignments. The narrative that ChatGPT generates dry (read: low perplexity) responses which can easily be caught by a reader who is attentive to boring writing is the same one that the GPTZero detector is relying upon and formalising. But ChatGPT has generated this dry text largely because that is the default mode of most human writing. Unless we are trying to be a bit more florid, our writing trudges toward tedium. When submitted to GPTZero, this text scores a perplexity of 18.000 and burstiness of 4.528. In other words, it’s even more expected than my own attempt to write like an AI (if a little burstier). GPTZero grades it as entirely AI-written, correctly.
But now let’s prompt ChatGPT to be more inventive. In APIs for other language models, this can be achieved simply by tweaking settings, such as raising the ‘temperature’ setting to invite more use of lower-probability words. In ChatGPT, this can easily be achieved by adding to our prompt. I provided ChatGPT with the same text it had just generated, and gave it the prompt: “Rewrite this paragraph to make it more interesting and less dry.” It generated:
Immanuel Kant was a man on a mission – a mission to discover a universal moral code that could guide humanity towards a better world. And thus, the categorical imperative was born, a concept that would go on to be a cornerstone of his ethical philosophy. This principle was no mere suggestion or subjective opinion, but a rock-solid, objective framework for determining what actions were morally right or wrong. Kant believed that by following this framework, we could transcend our selfish desires and act out of a sense of duty to the greater good. So, let’s dive deeper into the fascinating world of the categorical imperative and explore its relevance to modern ethical debates.
ChatGPT, generated by the author
When fed to GPTZero, the detector still recognises that this is likely to be AI-authored. But the scores for both measures are considerably higher, and much higher than my own writing when attempting to write in a dull way. It scores 30.800 for perplexity and 7.596 for burstiness.
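The ‘temperature’ setting mentioned above as the API route to more inventive output can be illustrated with a short sketch. In the common formulation, a model's raw next-word scores (logits) are divided by the temperature before being turned into probabilities, so a higher temperature flattens the distribution and gives lower-probability words more of a chance. The function name and the logit values below are hypothetical, and this is a generic illustration rather than the internals of any particular model or API.

```python
import math


def softmax_with_temperature(logits, temperature=1.0):
    """Turn raw next-word scores into probabilities, scaled by temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    # Subtract the maximum before exponentiating for numerical stability.
    top = max(scaled)
    exps = [math.exp(x - top) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical scores for three candidate next words, most expected first.
logits = [4.0, 2.0, 0.0]

cool = softmax_with_temperature(logits, temperature=0.5)
warm = softmax_with_temperature(logits, temperature=2.0)

print(cool)  # probability mass concentrated on the most expected word
print(warm)  # noticeably more mass on the less expected words
```

Sampling from the warmer distribution picks the unexpected words more often, which is exactly the kind of choice that pushes a text's perplexity score up.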
We can go further. I provided the original generated paragraph again and instructed ChatGPT: “Rewrite this paragraph to make it much more idiosyncratic, using unexpected words and turns of phrase.” It generated:
Oh, the categorical imperative – Kant’s brainchild and the bedrock of his ethical musings. A principle so sacrosanct that it could guide even the most conflicted of moral compasses towards righteousness. Its objective framework serves as a moral microscope, allowing us to distinguish the right from the wrong with crystal clarity. According to Kant, this framework is rooted in our duty to do what is right, rather than following our baser instincts of self-interest and personal preference. In essence, we ought to march to the beat of morality’s drum, not expecting any rewards or kudos in return. And in this humble treatise, I shall endeavor to explore the nuances and consequences of the categorical imperative, and how it holds relevance in the maelstrom of modern ethical debates.
ChatGPT, generated by the author
This text would undoubtedly look weird if inserted into an essay. However, a marker would be unlikely to suspect that the text is strange due to it being written entirely by an AI. They would probably advise the author to tone it down a bit, inject a little less personality into their writing style. In other words, be a bit less perplexing.
GPTZero agrees. When asked to evaluate this text, entirely generated by AI, it answers: “Your text is likely to be written entirely by a human”. The perplexity score is 55.000 and the burstiness is 23.503. We’ve successfully generated entirely AI-written text which fools the detector. I repeated the same prompt as a test of the reliability of this sort of prompt engineering tactic, to much the same effect. It wrote:
Oh the categorical imperative – a lynchpin in the ethical philosophy of Immanuel Kant. It serves as a lighthouse, illuminating the dark seas of moral decision-making with its objective and universal framework. Kant believed that this principle should serve as the vanguard of all our moral choices, guiding us to the righteous path. It admonishes us to act out of duty, rather than the enticements of self-interest or personal preference. We should be led by the beat of morality’s drum, driven only by the desire to do what is right, without expecting any quid pro quo. In this dissertation, I shall explore the significance of the categorical imperative, its intricate nuances, and its pertinence to the contemporary ethics discourse.
ChatGPT, generated by the author
It is quite reminiscent of the first effort (particularly in the “Oh the categorical imperative” opening gambit). This text did not score quite so highly – perplexity = 45.667 and burstiness = 18.129 – but still gets the GPTZero approval: “Your text is likely to be written entirely by a human”. Note that GPTZero’s approach is not limited to just a human vs. AI binary. It could flag individual sentences or parts of a text as AI-written, even while judging that the text overall had some hallmarks of human writing. It does not do that here. No sentence of the perplexity-prompted text is flagged as potentially AI-written by GPTZero.
Consciously writing with low perplexity resulted in human-written texts flagged as entirely AI-generated. A simple prompt modification to request an idiosyncratic written style resulted in AI-generated texts graded as entirely human-written. Perplexity is relatively easy to manipulate, both in human and AI writing. As such, perplexity can only be of very limited value in AI detection. Certainly, it would be a mistake to accuse a student of having submitted AI generated work on the basis of GPTZero flagging the text as likely to be AI written. Not only could this be a mistake in the detector’s output, it also amounts to an implicit accusation that the student’s writing is rather dreary!
There are other detection tools available, though the details of how their detectors work and what they measure are often less clear. OpenAI have released their own detector (albeit with very sobering success rates). One hope is that by applying an overlapping set of AI detectors which use different methods to differentiate human and AI authorship, we can get a clearer and more reliable picture. For now, at least, that is still in the realm of aspiration. OpenAI’s detector required more text than the above sample paragraph to run its detection algorithm, so I prompted ChatGPT with “Continue from the following paragraph in the same style:” and provided the idiosyncratic text which passed GPTZero’s perplexity tests. It generated some more content:
Oh, but what a revolutionary idea the categorical imperative is! It is a conceptual beacon in the murky waters of moral relativism, providing a moral compass for all to follow. Kant’s principle is the cornerstone of ethical philosophy, guiding us on the path towards virtuous action. He exhorts us to be duty-bound, to shun the temptations of self-gratification and the seductions of personal preference. The moral law should be our guiding light, and we should follow it with dogged determination, for its own sake. In this essay, I shall traverse the labyrinth of the categorical imperative, endeavoring to decode its enigmatic meaning and decipher its profound implications. I shall also explore how it informs and intersects with contemporary ethical debates, for its timeless wisdom endures even in this ever-changing world.
ChatGPT, generated by the author
It really likes “Oh” as an opener in this particular chat instance.
Alongside the original paragraph, when offered to OpenAI’s detector, the result is: “The classifier considers the text to be unlikely AI-generated.” Fooled again. I didn’t have the heart to produce 1,000 words of tedious low-perplexity text of my own to try to fool the detector.