Evaluating the Reliability of Generative AI in Distinguishing Machine from Human Text

Authors

Yuhefizar Y, Watrianthos R, Marzuki D

DOI:

https://doi.org/10.56294/dm20251181

Keywords:

Generative Artificial Intelligence, AI Text Detection, Machine Learning, Academic Integrity, ChatGPT, Binary Classification

Abstract

Introduction: The rapid progress of generative AI systems has enabled the creation of remarkably human-like text. Models such as GPT-4, Claude, and Gemini can generate coherent content across a wide range of genres, raising critical concerns about how machine-generated text can be distinguished from human-authored text. This capability poses significant challenges for academic integrity, content authenticity, and the development of reliable detection methods.
Objective: To evaluate the performance and reliability of current AI-based text detection tools in identifying machine-generated content across different text genres, AI models, and writing styles, establishing a comprehensive benchmark for detection capabilities.
Methodology: We systematically evaluated ten commercially available AI detection tools on a curated dataset of 150 text samples, expanded from the original 50. The dataset included human-authored texts, both original and translated; AI-generated content from six advanced models (GPT-3.5, GPT-4, Gemini, Bing, Claude, LLaMA2); and paraphrased variants. Each tool was assessed as a binary classifier using accuracy, precision, recall, F1 score, and confusion matrices, and statistical significance was determined with McNemar's test under a Bonferroni correction.
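For readers who want to reproduce this style of evaluation, the sketch below illustrates how the reported metrics and the paired McNemar comparison can be computed in Python. It is a minimal reconstruction, not the authors' code; the function names, the label convention (1 = AI-generated, 0 = human-authored), and the 45 pairwise comparisons implied by ten tools are our assumptions.

```python
# Minimal sketch (not the authors' code) of the evaluation described above.
# Assumes binary labels: 1 = AI-generated, 0 = human-authored.
import numpy as np
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, confusion_matrix)
from statsmodels.stats.contingency_tables import mcnemar

def evaluate_detector(y_true, y_pred):
    """Binary-classification metrics for a single detection tool."""
    return {
        "accuracy": accuracy_score(y_true, y_pred),
        "precision": precision_score(y_true, y_pred),
        "recall": recall_score(y_true, y_pred),
        "f1": f1_score(y_true, y_pred),
        "confusion_matrix": confusion_matrix(y_true, y_pred),
    }

def compare_detectors(y_true, pred_a, pred_b, n_comparisons=45):
    """McNemar's test on paired predictions from two tools, with a
    Bonferroni correction (C(10,2) = 45 pairwise comparisons for ten tools)."""
    a_ok = np.asarray(pred_a) == np.asarray(y_true)
    b_ok = np.asarray(pred_b) == np.asarray(y_true)
    # 2x2 contingency table of where the two tools agree/disagree
    table = [[np.sum(a_ok & b_ok), np.sum(a_ok & ~b_ok)],
             [np.sum(~a_ok & b_ok), np.sum(~a_ok & ~b_ok)]]
    p_value = mcnemar(table, exact=True).pvalue
    return min(1.0, p_value * n_comparisons)  # Bonferroni-adjusted p-value
```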
Results: Content at Scale achieved the highest accuracy at 88% (95% CI: 84.2-91.8%), followed by Crossplag at 76% and Copyleaks at 70%. Performance varied significantly across text categories, and all tools showed reduced accuracy on texts generated by more recent models such as Claude and LLaMA2. False positive rates ranged from 4% to 32%, raising concerns about applicability in academic contexts. No tool achieved perfect accuracy, and a 12% performance degradation was observed on models released after the initial study design.
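The abstract does not say how the 95% confidence interval around the 88% accuracy was computed. The snippet below shows one common choice, a Wilson score interval on 132 of 150 correct classifications; both the method and the 132/150 split are our assumptions, and the reported interval may have been derived differently.

```python
# Illustration only: one common way to attach a 95% CI to an accuracy
# estimate. The paper does not state its interval method; the Wilson
# score interval and the 132/150 split are our assumptions.
from statsmodels.stats.proportion import proportion_confint

n_samples = 150                        # curated dataset size
n_correct = round(0.88 * n_samples)    # 88% reported accuracy -> 132

low, high = proportion_confint(n_correct, n_samples,
                               alpha=0.05, method="wilson")
print(f"accuracy = {n_correct / n_samples:.1%}, "
      f"95% CI = [{low:.1%}, {high:.1%}]")
```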
Conclusions: Current AI text detection tools achieve moderate to high accuracy but remain imperfect, with considerable variability across AI models and text types. The persistent difficulty of reliable detection, combined with non-trivial false positive rates, calls for cautious deployment in high-stakes settings. These tools should complement, rather than replace, human judgment in academic and professional contexts.




Published

2025-09-26

Section

Original

How to Cite

Yuhefizar Y, Watrianthos R, Marzuki D. Evaluating the Reliability of Generative AI in Distinguishing Machine from Human Text. Data and Metadata [Internet]. 2025 Sep. 26 [cited 2025 Oct. 5];4:1181. Available from: https://dm.ageditor.ar/index.php/dm/article/view/1181