Rijnders, Mike 2023

The most effective features to distinguish humanly-authored and ChatGPT-generated text

Master thesis | Statistics and Data Science (MSc)

open access

Developments in Artificial Intelligence have resulted in the emergence of large language models such as ChatGPT. Such models have led to an increased risk of fraudulent activities. This research therefore aims to determine the most effective features for distinguishing between humanly-authored and ChatGPT-generated text within the scientific domain. A text corpus was constructed consisting of humanly-authored and ChatGPT-generated abstracts based on the titles of scientific papers. Three XGBoost classifiers were built: the first based on Doc2Vec vector embeddings, the second on text-extracted features, and the third combining both. The results underscore the superiority of models incorporating Doc2Vec vector embeddings, while reading time emerged as the most influential feature in predicting whether a text is humanly-authored or ChatGPT-generated in both the text-extracted feature model and the combined model. The combined model had the best performance in terms of accuracy. Nevertheless, the combined model, based on both Doc2Vec vector embeddings and text-extracted features, was still outperformed by the GPTZero model, emphasizing the need for further refinement before it is applied to assess whether a text is humanly-authored or ChatGPT-generated.

Leiden University Student Repository
