Abstract
Recent advances in large language models (LLMs) such as GPT-4, Claude 3, and Llama 3 have had a significant impact on education. These models perform strongly both at generating writing assignments and at providing direct solutions to problems. However, their misuse has raised concerns about academic integrity, as they can pass most undergraduate-level courses. At the same time, existing AI detectors fail to detect machine-generated code effectively. This gap underscores the importance of developing methods that can detect LLM-generated code, so that academic evaluations accurately reflect students' actual knowledge. In this study, we adopt the DetectGPT text detection methodology proposed by Mitchell et al. (2023) and extend the study of Yang et al. (2023b) to several recent LLMs in order to explore machine-generated code detection. We find that probability curvature-based detection of machine-generated code generalizes to LLMs other than GPT models. However, the method is vulnerable to comment deletion attacks.
References
Anthropic. 2023. Claude 3 family. Accessed: 2023-05-02. https://www.anthropic.com/news/claude-3-family.
Sebastian Bordt and Ulrike von Luxburg. 2023. ChatGPT participates in a computer science exam. CoRR abs/2303.09461. https://doi.org/10.48550/ARXIV.2303.09461.
Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde de Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, Alex Ray, Raul Puri, Gretchen Krueger, Michael Petrov, Heidy Khlaaf, Girish Sastry, et al. 2021. Evaluating large language models trained on code.
Hugging Face. 2023. CodeParrot. Hugging Face Model Repository. Accessed: 2023-05-02. https://huggingface.co/codeparrot/codeparrot.
Sebastian Gehrmann, Hendrik Strobelt, and Alexander M. Rush. 2019. GLTR: Statistical detection and visualization of generated text. In Marta R. Costa-jussà and Enrique Alfonseca, editors, Proceedings of the 57th Conference of the Association for Computational Linguistics, ACL 2019, Florence, Italy, July 28 - August 2, 2019, Volume 3: System Demonstrations. Association for Computational Linguistics, pages 111–116. https://doi.org/10.18653/V1/P19-3019.
LeetCode. 2023. LeetCode. Accessed: 2023-05-02. https://leetcode.com.
Yixin Liu, Kai Zhang, Yuan Li, Zhiling Yan, Chujie Gao, Ruoxi Chen, Zhengqing Yuan, Yue Huang, Hanchi Sun, Jianfeng Gao, Lifang He, and Lichao Sun. 2024. Sora: A review on background, technology, limitations, and opportunities of large vision models.
Niloofar Mireshghallah, Justus Mattern, Sicun Gao, Reza Shokri, and Taylor Berg-Kirkpatrick. 2024. Smaller language models are better black-box machine-generated text detectors.
Eric Mitchell, Yoonho Lee, Alexander Khazatsky, Christopher D. Manning, and Chelsea Finn. 2023. DetectGPT: Zero-shot machine-generated text detection using probability curvature. In Andreas Krause, Emma Brunskill, Kyunghyun Cho, Barbara Engelhardt, Sivan Sabato, and Jonathan Scarlett, editors, International Conference on Machine Learning, ICML 2023, 23-29 July 2023, Honolulu, Hawaii, USA. PMLR, volume 202 of Proceedings of Machine Learning Research, pages 24950–24962. https://proceedings.mlr.press/v202/mitchell23a.html.
OpenAI. 2023. AI text classifier. Accessed: 2023-05-02. https://platform.openai.com/ai-text-classifier.
OpenAI, Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, Red Avila, Igor Babuschkin, Suchir Balaji, Valerie Balcom, Paul Baltescu, Haiming Bao, Mohammad Bavarian, Jeff Belgum, Irwan Bello, Jake Berdine, Gabriel Bernadett-Shapiro, Christopher Berner, Lenny Bogdonoff, Oleg Boiko, Madelaine Boyd, et al. 2024. GPT-4 technical report.
Mike Perkins, Jasper Roe, Darius Tronea, James McGaughran, and Don Hickerson. 2023. Game of tones: Faculty detection of GPT-4 generated content in university assessments. CoRR abs/2305.18081. https://doi.org/10.48550/ARXIV.2305.18081.
Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. 2022. High-resolution image synthesis with latent diffusion models.
Pranab Sahoo, Ayush Kumar Singh, Sriparna Saha, Vinija Jain, Samrat Mondal, and Aman Chadha. 2024. A systematic survey of prompt engineering in large language models: Techniques and applications.
Giriprasad Sridhar, Ranjani H. G., and Sourav Mazumdar. 2023. ChatGPT: A study on its utility for ubiquitous software engineering tasks.
Edward Tian. 2023. GPTZero: An AI text detector. Software.
Jian Wang, Shangqing Liu, Xiaofei Xie, and Yi Li. 2023. Evaluating AIGC detectors on code content.
Frank F. Xu, Uri Alon, Graham Neubig, and Vincent Josua Hellendoorn. 2022. A systematic evaluation of large language models of code. In Deep Learning for Code Workshop. https://openreview.net/forum?id=SLeC6noObLZ.
Xianjun Yang, Liangming Pan, Xuandong Zhao, Haifeng Chen, Linda R. Petzold, William Yang Wang, and Wei Cheng. 2023a. A survey on detection of LLMs-generated content. CoRR abs/2310.16554. https://doi.org/10.48550/ARXIV.2310.16554.
Xianjun Yang, Kexun Zhang, Haifeng Chen, Linda Petzold, William Yang Wang, and Wei Cheng. 2023b. Zero-shot detection of machine-generated codes.
Daoan Zhang, Weitong Zhang, Yu Zhao, Jianguo Zhang, Bing He, Chenchen Qin, and Jianhua Yao. 2023. DNAGPT: A generalized pre-trained tool for versatile DNA sequence analysis tasks.