AI Detection Tools

With the recent growth in both the performance and accessibility of generative Artificial Intelligence (AI), students are increasingly using this new technology in their academic coursework. While AI can be used to assist learning and writing, such as brainstorming or getting feedback, it can also produce customizable content that passes as human-created, such as full paragraphs and completed essays. This has raised concerns among faculty about identifying which work has been created independently and which has been generated with AI. One method of making that determination is AI detection tools. This review outlines popular AI detection tools: how they function, their reliability, ethical concerns, and peer institutions' approaches. Some of the most popular tools are Turnitin, GPTZero, Originality.ai, and Grammarly.

Summary 

We hope this review of AI detection tools can assist faculty in developing guidelines that align with the college’s academic integrity principles while acknowledging the evolving nature of AI in education. For example: 

  • AI detection tools are not consistently reliable (Elkhatat, Elsaid, & Almeer, 2023). If AI detectors are used, their results should not be treated as evidence in academic integrity cases without additional supporting evidence. 
  • If the results from an AI detector are going to be considered, the issues of equity and bias described in this report should be kept in mind. 
  • Clear policies on AI use, along with instruction in AI literacy, should be provided to students. 

How AI Detection Works 

AI detectors are themselves AI-powered systems that estimate the likelihood that content was written by AI. They recognize patterns in writing style, assess syntax, coherence, and contextual nuances, and produce a probability score that conveys either confidence in AI detection or an estimated percentage of AI composition. The probability score does not confirm or deny AI usage; it merely provides a measure of likelihood. 
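
To make the idea of a probability score concrete, the sketch below (in Python) shows what a detector's output and a cautious reading of it might look like. The `DetectionResult` structure, the thresholds, and the wording are illustrative assumptions, not the interface of Turnitin or any other product.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class DetectionResult:
    # Overall likelihood (0.0-1.0) that the document contains AI-generated text.
    # A statistical estimate, not a finding of fact.
    ai_probability: float
    # Character offsets of passages the model considers likely AI-generated.
    flagged_spans: List[Tuple[int, int]]


def describe(result: DetectionResult) -> str:
    """Translate a probability score into cautious language rather than a verdict."""
    if result.ai_probability >= 0.8:
        return "High likelihood of AI-generated text; corroborating evidence still needed."
    if result.ai_probability >= 0.5:
        return "Possible AI-generated text; a prompt for conversation, not proof."
    return "No strong signal of AI-generated text."


# Hypothetical result with made-up numbers.
print(describe(DetectionResult(ai_probability=0.62, flagged_spans=[(120, 480)])))
```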

For example, Turnitin’s AI writing detection feature gives educators a report with several data points, including an AI writing indicator score, and highlights text segments that Turnitin’s model predicts may have been written by an AI tool. Separately, it highlights AI-generated text that may have been further modified using an AI paraphrasing tool. 

The overall percentage of text detected as AI is detailed in two detection categories in the Submission Breakdown: 

  • AI-generated only - This category detects qualifying text that was likely generated by a Large Language Model (LLM). Text in this category is highlighted in cyan in the submission breakdown bar and in the submission.  
  • AI-generated text that was AI-paraphrased - This category detects qualifying text that was likely AI-generated and then likely modified by an AI paraphrasing tool or AI word spinner, such as Quillbot. Text in this category is highlighted in purple in the submission breakdown bar and in the submission.   

Key Analytical Metrics 

AI detectors assess text using several computational measures, two of the most significant being: 

  • Perplexity: A measure of unpredictability in language. Higher perplexity indicates more variation and nuance, characteristics of human writing. AI-generated text, by contrast, tends to be more predictable, resulting in lower perplexity scores. 
  • Burstiness: The variation in sentence structure and length. Human writing naturally fluctuates between short and long sentences, and between simple and complex ones. AI-generated text, by contrast, often exhibits uniformity, leading to lower burstiness. 

These metrics interact to assess authenticity. High burstiness can increase perplexity, making text harder to predict, whereas low burstiness often signals AI-generated text. AI detectors analyze these patterns to estimate the likelihood of AI authorship. 
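
The sketch below shows how these two metrics might be computed in practice, assuming the Hugging Face transformers library with GPT-2 as a stand-in scoring model and treating burstiness as the coefficient of variation of sentence lengths. Commercial detectors rely on proprietary models and additional signals, so this illustrates the concepts rather than reproducing any vendor's method.

```python
import re
import statistics

import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

# GPT-2 stands in for the (proprietary) scoring model a real detector would use.
tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()


def perplexity(text: str) -> float:
    """Lower perplexity means the model finds the text more predictable."""
    enc = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        out = model(enc.input_ids, labels=enc.input_ids)
    return torch.exp(out.loss).item()


def burstiness(text: str) -> float:
    """Variation in sentence length, expressed as a coefficient of variation."""
    sentences = [s.strip() for s in re.split(r"[.!?]+", text) if s.strip()]
    lengths = [len(s.split()) for s in sentences]
    if len(lengths) < 2:
        return 0.0
    return statistics.pstdev(lengths) / statistics.mean(lengths)


sample = ("The results were mixed. Some sections read smoothly, while others "
          "wandered through long, oddly punctuated digressions that no model "
          "would have predicted. Strange, isn't it?")
print(f"perplexity: {perplexity(sample):.1f}, burstiness: {burstiness(sample):.2f}")
```

Human writing typically yields higher values on both measures; text scoring low on both is what a detector would treat as suspicious, though, as noted above, neither number is proof of AI authorship.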

Reliability 

The key consideration for education is how reliable these tools are, and the answer is not simple. Turnitin claims 98% accuracy, Originality.ai claims 99%, and GPTZero claims 99%. However, these numbers are not fixed. AI detection is currently a cat-and-mouse game: because the technology generating content is advancing rapidly, it makes sense that the accuracy of the technology detecting it would shift as well. Turnitin claimed a false positive rate of less than 1% when its detector was first released, later amended that figure upward, and now does not give an exact number. Even if a detector were 100% accurate today, it is unlikely to be as accurate once a newly trained platform is released. More powerful AI models are more sophisticated and create content that is harder to detect; with each advancement in AI, there will be a slight lag in trained detectors. 

Another factor is how the user generates text. Savvy students are not just copying and pasting output. Accuracy for many of these models drops dramatically with even minor revisions and tweaks. One study reported that detection of AI-generated text fell from 74% to 42% with human modification, and that when AI was asked to transform or paraphrase text previously generated by AI, the accuracy rate dropped to 26%. The quality of the prompt used to generate the text can also complicate detection. If students instruct the AI to mimic their own writing style through provided examples, or give it specific instructions about the style of the output, detection rates drop. Turnitin has also been shown to be more accurate at detecting standalone snippets of AI-generated work than at finding sections embedded in longer works.

Ethical Concerns 

Beyond straightforward issues of plagiarism, there are other concerns, such as creating inequities by giving higher grades to students who have the means and skill to access the more advanced tools.  

As previously discussed, AI detectors analyze sentence structure and complexity. A well-known study by Liang et al. showed that writers with limited vocabulary, for example non-native English speakers, are more likely to have their writing flagged as AI-generated. Even with accuracy rates as high as 99%, running every paper produced at an institution through a detector can result in hundreds of flagged papers, and the mental and emotional toll those accusations take on accused students is not insignificant.  
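
As a back-of-the-envelope illustration of why even a very accurate detector flags many papers at scale, consider the calculation below; the submission count and false positive rate are hypothetical, not institutional data.

```python
# Hypothetical figures: a mid-sized institution and a best-case 1% false positive rate.
submissions = 20_000
false_positive_rate = 0.01

wrongly_flagged = submissions * false_positive_rate
print(f"Expected wrongly flagged papers per year: {wrongly_flagged:.0f}")  # -> 200
```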

There is also speculation that AI detection may not work equally well across disciplines. A write-up by The Washington Post indicates that scientific writing may be more likely to be flagged because of its formulaic nature. AI detectors are looking for predictable or average writing; scientific writing often leaves little room for creative flourish and relies on precise vocabulary, which is easier to replicate and thus gets flagged more often. 

Another major concern is equity. Avoiding detection may simply be a matter of having access to the newest AI models, which often cost money. AI detection, like other AI-based technology, needs to be trained on what to look for. The goal of generative AI is to create writing that is as close to human-generated as possible, so with every iteration it becomes harder to detect, and detection models will always lag slightly behind. Students who can afford access to the latest models will be able to avoid detection, while those using free versions may not.  

Using AI detectors also has the potential to violate the Family Educational Rights and Privacy Act (FERPA). Quinnipiac University points out that, under FERPA, student work requires consent to be shared unless it is shared with a service under contract to the school and bound by additional FERPA regulations. Not all AI detectors fit that description. 

Peer Institutions 

Many of our peer institutions are struggling with the same concerns we face. Vanderbilt University began using Turnitin's AI detection but stopped the service after Turnitin published higher false positive rates than initially reported. Vanderbilt's decision was also influenced by a lack of transparency surrounding how the AI detectors work, implications for non-native English speakers, and privacy issues. Vanderbilt instead shifted its emphasis to citing AI when it is used, AI literacy, and bolstering student-professor relationships. 

Similarly, Connecticut College does not use AI detectors and discourages their use for reasons similar to Vanderbilt's, citing bias and the degradation of trust, favoring instead clear policies on use and improved teaching practices. 

Wesleyan University has access to Turnitin's AI detector but allows professors to turn it on or off via their Moodle sites. The probability score produced by Turnitin is not sufficient to bring to the Honor Board without additional supporting evidence, and scores below 25% are not submissible at all. 

Many institutions point to Vanderbilt's policy and this one from MIT Sloan Teaching and Learning Technologies. Both provide useful lists of references worth consulting.