Framework for evaluating code generation ability of large language models

  • Sangyeop Yeo (Division of Artificial Intelligence, University of Science and Technology) ;
  • Yu-Seung Ma (Division of Artificial Intelligence, University of Science and Technology) ;
  • Sang Cheol Kim (Artificial Intelligence Computing Research Laboratory, Electronics and Telecommunications Research Institute) ;
  • Hyungkook Jun (Artificial Intelligence Computing Research Laboratory, Electronics and Telecommunications Research Institute) ;
  • Taeho Kim (Artificial Intelligence Computing Research Laboratory, Electronics and Telecommunications Research Institute)
  • Received : 2023.08.27
  • Accepted : 2023.12.20
  • Published : 2024.02.20

Abstract

Large language models (LLMs) have revolutionized various applications in natural language processing and exhibit proficiency in generating programming code. We propose a framework for evaluating the code generation ability of LLMs and introduce a new metric, pass-ratio@n, which captures the granularity of accuracy according to the pass rate of test cases. The framework is fully automated, handling the repetitive work of generating prompts, running inference, and executing the generated code. A preliminary evaluation focusing on prompt detail, problem publication date, and difficulty level demonstrates the successful integration of our framework with the LeetCode coding platform and highlights the applicability of the pass-ratio@n metric.
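To make the contrast with the standard pass@n metric concrete, the following is a minimal illustrative sketch, not the paper's exact formula: it assumes pass-ratio@n averages, over n generated solutions, the fraction of test cases each solution passes, whereas pass@n-style scoring only credits solutions that pass every test case.

```python
def pass_ratio_at_n(results):
    """Hypothetical pass-ratio@n: mean fraction of test cases passed.

    results: list of n lists of booleans, one list per generated solution,
    each boolean indicating whether that solution passed one test case.
    """
    if not results:
        return 0.0
    ratios = [sum(r) / len(r) for r in results]
    return sum(ratios) / len(results)


def strict_pass_rate(results):
    """All-or-nothing baseline: fraction of solutions passing every test."""
    if not results:
        return 0.0
    return sum(all(r) for r in results) / len(results)


# Two generated solutions: one passes both tests, one passes only the first.
samples = [[True, True], [True, False]]
print(pass_ratio_at_n(samples))   # 0.75 -> partial credit is visible
print(strict_pass_rate(samples))  # 0.5  -> partial progress is invisible
```

The sketch shows why a pass-rate-based metric is finer-grained: a solution that fails one of many test cases still contributes proportionally instead of being scored as a total failure.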

Acknowledgement

This work was supported by an Institute of Information & Communications Technology Planning & Evaluation (IITP) grant (2022-0-00995, automated reliable source code generation from natural language descriptions, 95%) and a National Research Council of Science & Technology (NST) grant (Global-23-001, SeCode: Collaborative intelligent model for secure program code generator, 5%) funded by the Korea government (MSIT).
