Four Myths about Generative AI in Education
As a higher education practitioner who writes about and works with Generative AI, I often find myself at conferences, workshops, working groups and panels about the impact of Generative AI in education. Time and again, four distinct but related myths surface amongst colleagues and students alike. These myths contribute to ill-formed policies on the use of GenAI in courses and assessments, and can fuel both erroneous accusations of academic misconduct and overconfidence from academics about the robustness of their assessment design. Let’s debunk each myth in turn.

1) It is feasible and practical to detect the use of AI in writing

There is a wide range of AI detection tools on the market, and some evidence that some of these tools can differentiate between AI-authored and human-authored texts with a reasonable level of consistency. However, it is not possible under current approaches, and will never be possible under those approaches, to establish conclusively that all or part of a text was composed using AI, short of an outright admission by the supposed author. Traditional plagiarism detection software matches material in a text to online sources and can thus identify the likely source of copied elements. AI detection software, by contrast, does not match text against a databank of AI-generated output. Rather, these tools analyse the linguistic properties of a text and estimate how likely it is that an AI system would produce such a text; if an AI system would be likely to produce it, the text is categorised as demonstrating hallmarks of AI authorship.

That a flagged text could have been produced using AI is not good evidence that it actually was, especially given that AI text generators are trained to produce text which resembles commonplace human-authored writing styles. A detection tool flagging a text as likely AI-generated is therefore not strong evidence that it is, and is certainly not adequate grounds for an academic misconduct allegation. Even tools with supposedly low false positive rates will produce a large number of false positives across a sufficiently large student cohort. Given these false positive rates and the unknown frequency of AI use within the student population, it is not even possible to establish a meaningful likelihood that a text is AI-generated on the basis of the most confident judgement of an AI detection tool.
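The base-rate problem here can be made concrete with Bayes' theorem. The numbers below are purely illustrative assumptions, not measured rates for any real detector, but they show how even a seemingly low false positive rate yields many wrongly flagged students:

```python
# Hypothetical illustration of the base-rate problem for AI detectors.
# All numbers are assumptions for illustration, not measured rates.

def positive_predictive_value(prevalence, sensitivity, false_positive_rate):
    """P(text is AI-written | detector flags it), via Bayes' theorem."""
    true_positives = prevalence * sensitivity
    false_positives = (1 - prevalence) * false_positive_rate
    return true_positives / (true_positives + false_positives)

# Suppose 5% of submissions involve AI, the detector catches 90% of them,
# and it wrongly flags 2% of human-written texts.
ppv = positive_predictive_value(prevalence=0.05, sensitivity=0.90,
                                false_positive_rate=0.02)
print(f"P(AI | flagged) = {ppv:.2f}")  # roughly 0.70: ~3 in 10 flags are wrong
# In a cohort of 1000 students, 0.02 * 950 = 19 honest students are flagged.
```

Note that the result depends heavily on the prevalence of AI use, which is precisely what we do not know; with lower prevalence, the proportion of false accusations rises further.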

As I and a great many others have demonstrated, it is relatively trivial for humans to compose texts entirely without the use of an AI which will nonetheless be flagged by detection software as AI-generated. It is also trivial to prompt a language model to write in a way which will not be likely to be detected by these tools. There are a range of tools available which will rewrite AI-generated texts to make AI detection tools less likely to flag the material as AI-generated. It is also reasonably easy for a human to make small modifications and tweaks to an otherwise AI-generated text which will fool many detectors. AI co-authorship, in which a human uses AI as part of their writing process but not to produce the entire text, is likely to remain entirely undetectable. A surprisingly widespread subsidiary version of this myth is that Generative AI tools like ChatGPT themselves can detect whether a text was written by an AI: this is demonstrably false.

Furthermore, humans are not likely to be better AI-writing detectors than these tools. Humans who flag texts as AI-generated may be misled by their own perceptions of the common characteristics of AI-generated texts, which in many cases are outdated (see Myth 2, below). There is evidence that humans are no better than chance at differentiating between human-authored and AI-authored texts, even when the AI-authored texts are produced using older text generators like GPT-3. For instance, Catherine Gao and colleagues, in a preprint on bioRxiv, demonstrated that a blind review team of four scientists could not differentiate between abstracts for scientific papers written by GPT-3 and those written by humans. That a human thinks a text seems AI-generated is not good evidence that it is.

There are only very limited exceptions to the rule that it is not possible for humans or AI tools to offer compelling evidence that a text is AI generated. The most significant comes from the use of certain incongruous phrases that could only reasonably find their way into a paper via the use of AI text generators, such as “As an AI language model, …” and “Regenerate response.” These phrases have been detected in dozens of academic papers, as reported by Guillaume Cabanac.

2) Text produced using Generative AI is bland, repetitive or predictable

One common approach to detecting AI-authored text measures the perplexity and burstiness of the text (see Perplexing Perplexity for further details): roughly, whether the text includes unusual word choices and grammatical constructions, and variability in vocabulary, sentence lengths and structures. These are taken to be hallmarks of human-written text, or at least to be unusual within AI-generated texts. Many people have played around with text generators, found the responses quite bland and predictable, and noted repetition of the prompt text and repetitions within the generated text.

However, these are only hallmarks of the ways in which we have tended to train and use generative AI systems. Generative AI tools are trained on large corpora of human-authored text, and pick up and replicate our patterns of word choice and sentence construction. The most widely used systems are designed for general and widespread use, so tend to default to quite a bland style which is easy to read. Yet it is simple enough to prompt generative AI tools to write in different styles: including such an instruction in the prompt will usually result in text which avoids these common tropes. More specialised models trained or fine-tuned on particular data sets can likewise exhibit very different patterns and properties in the text they produce. That a text is bland, repetitive or predictable is not good evidence that it is AI-generated; perhaps more importantly, that a text involves unusual word choices, variable sentence lengths or complex grammatical constructions is not good evidence that it is not AI-generated.

3) Generative AI tools cannot produce text with accurate citations

A common response to the rise of GenAI has been to put more emphasis on the use and accuracy of citations. This is likely due to the inability of earlier models such as GPT-2 and GPT-3 to produce accurate citations and the frequency with which those earlier models would hallucinate fake citations within generated texts.

However, more recent GenAI systems with access to the internet for search purposes (e.g. Microsoft’s Copilot) are able to find and accurately cite material from the web. This does not mean that such systems will never produce hallucinated citations or mischaracterise the material they cite. But producing accurate citations is no longer a challenge for AI authorship, and thus the presence of accurate citations in a text is not good evidence that it is not AI-generated (nor, of course, is the absence of citations any evidence that the text is AI-generated).

Furthermore, there are a great many specific tools which have now been developed to perform literature searching and citation functions. These are language models which have been fine-tuned and implemented to identify relevant sources and provide accurate citations. Generalising about the capacities of Generative AI tools based only on the most popular tool at the time (e.g. the basic version of ChatGPT) will lead to misconceptions about the capabilities of Generative AI.

4) Generative AI is good for traditional assessments like essays, but weaker for more creative or reflective tasks

Another hope for creating assessments which are in some sense “AI-proof” is to replace traditional assessment formats such as essays with tasks which include more ‘creative’ or ‘reflective’ components. This might include writing an original composition which reflects the student’s learning, such as a poem, play or dialogue. It might include reflective writing about what the student has learned, the approach they have taken, or their learning journey in the course. It might also include some form of diary, log or series of annotations explaining the process by which the work was completed.

The hope that Generative AI tools will be less able to perform these tasks than to generate traditional academic essays is forlorn. While prompting a Generative AI tool to generate poetry, plays, dialogues, reflections, diaries or logs might take a little more creativity on the part of the person using the tool, there is currently no good evidence that Generative AI tools are any worse at these tasks than at traditional academic ones. There may be more models optimised for common tasks like writing blogposts, newspaper articles or academic papers. However, this does not mean that it is impossible, or even particularly difficult, to get simple multipurpose language models to perform these non-traditional tasks, or to create a fine-tuned system specifically to complete them.

There may be more hope for this kind of approach in requiring students to integrate certain experiences or features into tasks. For example, a task which asks a student to write about and reflect upon an in-class debate or simulation might make using Generative AI to create such a report more difficult. However, this does not negate the possibility of using Generative AI to compile such a report. It only undermines the most basic form of Generative AI use here: relying on the language model to generate the entire text. A student can still provide their analysis of or reflection upon the event as part of a more sophisticated prompt, and use a language model to help fill out detail or link this to wider experiences, literatures or ideas.

Last modified: 26/03/24. Written entirely by the human author, as far as you can tell.