The AI Text Classifier is a fine-tuned GPT model that predicts how likely it is that a piece of text was generated by AI from a variety of sources, such as ChatGPT.
The classifier is fine-tuned to distinguish between human-written and AI-generated text, using a dataset drawn from multiple sources. The human-written text comes from Wikipedia, the WebText dataset, and human demonstrations collected during InstructGPT training. Each sample of model-written text is paired with a similar sample of human-written text to minimize spurious correlations.
The training batches are balanced, containing equal proportions of AI-generated and human-written text. However, some text labeled as human-written may in fact be AI-generated, because AI-generated content is already prevalent on the internet.
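The pairing and balancing described above can be sketched roughly as follows. This is a minimal illustration, not the actual training pipeline: the function name `balanced_batches`, the `(human_text, ai_text)` pair layout, and the label convention (1 for AI-generated, 0 for human-written) are all assumptions made for the example.

```python
import random
from typing import List, Tuple

# Assumed label convention for this sketch: 1 = AI-generated, 0 = human-written.
Example = Tuple[str, int]


def balanced_batches(
    pairs: List[Tuple[str, str]], batch_size: int, seed: int = 0
) -> List[List[Example]]:
    """Build batches with equal numbers of human and AI samples.

    Each element of `pairs` is (human_text, ai_text). Keeping both halves
    of a pair in the same batch guarantees a 50/50 balance.
    """
    if batch_size <= 0 or batch_size % 2 != 0:
        raise ValueError("batch_size must be a positive even number")

    rng = random.Random(seed)
    shuffled = list(pairs)
    rng.shuffle(shuffled)

    pairs_per_batch = batch_size // 2
    batches: List[List[Example]] = []
    # Leftover pairs that cannot fill a whole batch are dropped.
    for start in range(0, len(shuffled) - pairs_per_batch + 1, pairs_per_batch):
        batch: List[Example] = []
        for human_text, ai_text in shuffled[start:start + pairs_per_batch]:
            batch.append((human_text, 0))
            batch.append((ai_text, 1))
        rng.shuffle(batch)  # avoid a fixed human/AI ordering inside the batch
        batches.append(batch)
    return batches
```

Building batches from whole pairs is one simple way to get exact balance; sampling each class independently would also work but would separate a model-written sample from its human-written counterpart.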
The model has several limitations. It was trained and evaluated primarily on English-language text from the public web and on output from models trained on English text, so performance on non-English text is worse. It is also less reliable on short text and on highly predictable text, such as a list of the first 1,000 prime numbers, where human and AI output would look the same.
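Because the model is less reliable on short text, a caller might decline to score inputs below some length. The sketch below shows one way to do that. The threshold `MIN_CHARS`, the function name `score_if_long_enough`, and the `classify` callable are hypothetical placeholders; no specific cutoff or API is given here.

```python
from typing import Callable, Optional

# Hypothetical threshold for this sketch; no specific number is stated in this document.
MIN_CHARS = 1000


def score_if_long_enough(
    text: str,
    classify: Callable[[str], float],
    min_chars: int = MIN_CHARS,
) -> Optional[float]:
    """Return the classifier's score, or None when the input is too short to trust.

    `classify` stands in for whatever function returns the likelihood that
    `text` is AI-generated; it is a placeholder, not a real API.
    """
    if len(text.strip()) < min_chars:
        return None
    return classify(text)
```

Returning `None` rather than a low-confidence number keeps short inputs from being mistaken for a real prediction.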
The primary use case for this classifier is to confirm whether text submitted as human-written is indeed human-written. It may not perform well on other targets such as student essays, automated disinformation campaigns, or chat transcripts, because neural network classifiers are known to be poorly calibrated outside of their training data.
The classifier's accuracy has been evaluated on a validation set and a challenge set. It reaches an AUC of 0.97 on the validation set and 0.66 on the challenge set, outperforming a previously published classifier. Performance degrades as the size of the language model generating the text increases; that is, larger models produce outputs that more closely resemble human-written text.
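For readers unfamiliar with AUC, it can be read as the probability that a randomly chosen AI-generated sample receives a higher score than a randomly chosen human-written one, so 0.5 is chance and 1.0 is perfect separation. The sketch below computes it directly from that definition. The function name `roc_auc` and the label convention (1 for AI-generated, 0 for human-written) are assumptions for illustration, and this is not the evaluation code behind the reported numbers.

```python
from typing import Sequence


def roc_auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """AUC as the chance that a random AI sample outscores a random human one.

    labels: 1 = AI-generated, 0 = human-written (assumed convention).
    scores: predicted likelihood that each text is AI-generated.
    Tied scores count as half a win.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one sample of each class")

    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Three of the four (AI, human) score pairs are ranked correctly.
print(roc_auc([1, 1, 0, 0], [0.9, 0.4, 0.5, 0.1]))  # 0.75
```

The pairwise loop is quadratic in the number of samples, which is fine for illustration; a rank-based formulation gives the same value more efficiently on large evaluation sets.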