The increasing adoption of programming education necessitates efficient and accurate methods for evaluating students' coding assignments. Traditional manual grading is time-consuming, often inconsistent, and prone to subjective bias. This paper explores the application of large language models (LLMs) to the automated evaluation of programming assignments. LLMs can use advanced natural language processing capabilities to assess code quality, functionality, and adherence to best practices, providing detailed feedback and grades. We demonstrate the effectiveness of LLMs through experiments comparing their performance with human evaluators across various programming tasks. Our study evaluates several LLMs for automated grading. Gemini 1.5 Pro achieves an exact-match accuracy of 86% and a ±1 accuracy of 98%. GPT-4o also performs strongly, with exact-match and ±1 accuracies of 69% and 97%, respectively. Both models correlate highly with human evaluations, indicating their potential for reliable automated grading. In contrast, models such as Llama 3 70B and Mixtral 8x7B exhibit low accuracy and poor alignment with human grading, particularly on problem-solving tasks. These findings suggest that advanced LLMs can support scalable, consistent automated educational assessment, while also enhancing the learning experience by offering personalized, instant feedback that fosters an iterative learning process. LLMs could therefore play a pivotal role in the future of programming education.
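A minimal sketch of the two agreement metrics this abstract reports, exact-match accuracy and ±1 accuracy between LLM-assigned and human-assigned grades. The grade lists below are hypothetical illustrative data, not results from the study.

```python
# Sketch: the two agreement metrics from the abstract, on made-up grade lists.

def exact_match_accuracy(llm_grades, human_grades):
    """Fraction of submissions where the LLM grade equals the human grade."""
    matches = sum(1 for l, h in zip(llm_grades, human_grades) if l == h)
    return matches / len(human_grades)

def within_one_accuracy(llm_grades, human_grades):
    """Fraction of submissions where the LLM grade is within +/-1 point."""
    close = sum(1 for l, h in zip(llm_grades, human_grades) if abs(l - h) <= 1)
    return close / len(human_grades)

if __name__ == "__main__":
    human = [5, 4, 3, 5, 2, 4, 1, 3]   # hypothetical human-assigned grades
    llm   = [5, 4, 2, 5, 3, 4, 1, 4]   # hypothetical LLM-assigned grades
    print(f"exact match: {exact_match_accuracy(llm, human):.0%}")
    print(f"+/-1 accuracy: {within_one_accuracy(llm, human):.0%}")
```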
ISBN (print): 9798400705328
In a typical introductory programming course, grading student-submitted programs involves an autograder that compiles and runs the programs and tests their functionality with predefined test cases, paying no attention to the source code itself. However, in an educational setting, grading based on inspection of the source code is required for two main reasons: (1) awarding partial marks to 'partially correct' code that may fail the test-case check, and (2) awarding marks (or penalties) based on source code quality or specific criteria the instructor may have laid out in the problem statement (e.g. 'implement sorting using bubble-sort'). Grading by studying the source code, however, can be highly time-consuming when the course has a large enrollment. In this paper we present the design and evaluation of an AI assistant for source code grading, which we have named TA Buddy. TA Buddy is powered by Code Llama, a large language model trained especially for code-related tasks, which we fine-tuned on a dataset of graded programs. Given a problem statement, student code submissions, and a grading rubric, TA Buddy can be asked to generate suggested grades, i.e. ratings for the various rubric criteria, for each submission. The human teaching assistant (TA) can then accept or overrule these grades. We evaluated TA Buddy-assisted manual grading against 'pure' manual grading and found that grading time was reduced by 24% while grade agreement between the two conditions remained at 90%.
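A hedged sketch of the kind of rubric-driven prompting an assistant like TA Buddy is described as performing: problem statement, rubric, and student code go in, and one rating per rubric criterion comes out for a human TA to accept or overrule. The rubric contents, prompt wording, and the `query_code_llm` stub are assumptions for illustration, not the actual fine-tuned Code Llama pipeline.

```python
# Sketch only: assembling a rubric-grading prompt and parsing per-criterion ratings.
import json

RUBRIC = {
    "correctness": "Does the code solve the stated problem?",
    "uses_bubble_sort": "Is sorting implemented with bubble sort as required?",
    "readability": "Are names and structure clear?",
}

def build_grading_prompt(problem: str, code: str, rubric: dict) -> str:
    criteria = "\n".join(f"- {name}: {desc}" for name, desc in rubric.items())
    return (
        "You are a teaching assistant. Rate the submission on each rubric "
        "criterion from 0 to 5 and reply as JSON {criterion: rating}.\n\n"
        f"Problem statement:\n{problem}\n\nRubric:\n{criteria}\n\n"
        f"Student submission:\n{code}\n"
    )

def query_code_llm(prompt: str) -> str:
    # Placeholder: wire this to a code-oriented model of your choice.
    return json.dumps({"correctness": 4, "uses_bubble_sort": 5, "readability": 3})

def suggest_grades(problem: str, code: str) -> dict:
    raw = query_code_llm(build_grading_prompt(problem, code, RUBRIC))
    # The human TA reviews these suggested ratings and may accept or overrule them.
    return json.loads(raw)

print(suggest_grades("Implement sorting using bubble-sort.", "def sort(xs): ..."))
```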
In programming courses, providing students with concise and constructive feedback on faulty submissions (programs) is highly desirable. However, providing feedback manually is often time-consuming and tedious. To release tutors from manually constructing concise feedback, researchers have proposed approaches such as CLARA and Refactory that construct feedback automatically. The key to such approaches is to fix a faulty program by making it equivalent to one of its correct reference programs whose overall structure is identical to that of the faulty submission. However, for a newly released assignment, it is likely that there are no correct reference programs at all, let alone correct reference programs sharing an identical structure with the faulty submission. Therefore, in this paper we propose AssignmentMender, which generates concise patches for newly released assignments. The key insight of AssignmentMender is that a faulty submission can be repaired by reusing fine-grained code snippets from other submissions to the same assignment, even when those submissions are themselves faulty. It automatically locates suspicious code in the faulty program and leverages static analysis, with a graph-based matching algorithm, to retrieve reference code from existing submissions. Finally, it generates candidate patches by modifying the suspicious code based on the reference code. Unlike existing approaches, AssignmentMender exploits faulty submissions in addition to bug-free submissions to generate patches. Another advantage of AssignmentMender is that it can leverage submissions whose overall structure differs from that of the to-be-fixed submission. Evaluation results on 128 faulty submissions from 10 assignments show that AssignmentMender improves the state of the art in feedback generation for newly released assignments. A case study involving 40 students and 80 submissions further provides initial evidence that the proposed approach is useful in practice.
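A deliberately simplified toy illustration, not AssignmentMender itself, of the repair loop the abstract describes: replace a suspicious statement with fine-grained snippets borrowed from peer submissions to the same assignment and keep any candidate patch that passes the test cases. All names, snippets, and tests below are invented for the example.

```python
# Toy version of snippet-reuse repair: swap in peer snippets, keep test-passing patches.

FAULTY = ["def absolute(x):", "    return x"]          # suspicious line: index 1
REFERENCE_SNIPPETS = [                                  # mined from peer submissions
    "    return x if x > 0 else -x",
    "    return -x",
    "    return abs(x)",
]
TESTS = [(3, 3), (-4, 4), (0, 0)]

def passes_tests(source_lines):
    namespace = {}
    try:
        exec("\n".join(source_lines), namespace)
        return all(namespace["absolute"](arg) == want for arg, want in TESTS)
    except Exception:
        return False

def candidate_patches(faulty, suspicious_index, snippets):
    for snippet in snippets:
        patched = faulty[:suspicious_index] + [snippet] + faulty[suspicious_index + 1:]
        if passes_tests(patched):
            yield "\n".join(patched)

for patch in candidate_patches(FAULTY, 1, REFERENCE_SNIPPETS):
    print("plausible patch:\n" + patch)
```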
Hardware description languages (HDLs) are pivotal for the development of hardware designs, and programming courses for HDLs are popular in both universities and online course platforms. Like programming assignments in software languages (SLs), those in HDLs also call for automated program repair (APR) techniques to provide personalized feedback to students. However, research on APR techniques targeting HDL programming assignments is still at an early stage. Because the programming mechanism of HDLs differs significantly from that of SLs, the only APR technique targeting HDL programming assignments (i.e., CirFix) contributes a customized repair pipeline, yet the fundamental challenges in the design of HDL-oriented fault localization and patch generation remain unresolved. In this work, we propose Strider, a signal-value-transition-guided defect repair technique that captures the intrinsic features of HDLs. It consists of a time-aware dynamic defect localization approach to precisely localize defects and a signal-value-transition-guided patch synthesis approach to effectively generate fixes. We further construct a dataset of 57 real defects from HDL programming assignments for tool evaluation. The evaluation reveals the overfitting issue of the pioneering tool CirFix and the significant improvement of Strider over CirFix in both effectiveness and efficiency. In particular, Strider correctly fixes 2.3x as many defects as CirFix on the real-defect dataset and is 23x more efficient, generating a correct fix within 5 minutes on average on the synthetic-defect dataset, where CirFix takes around 2 hours on average.
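A rough, heavily simplified sketch of the "time-aware" localization idea mentioned above, not Strider's actual algorithm: compare the observed signal trace of a defective design against a golden trace cycle by cycle and report the earliest divergence as the starting point for defect localization. The signal traces are made up.

```python
# Toy time-aware localization: find the earliest cycle where a signal diverges
# from the golden (expected) trace.

GOLDEN = {"q": [0, 1, 1, 0, 1], "carry": [0, 0, 1, 1, 0]}
OBSERVED = {"q": [0, 1, 1, 0, 1], "carry": [0, 0, 0, 1, 0]}

def first_divergence(golden, observed):
    """Return (cycle, signal) of the earliest mismatching signal value, or None."""
    mismatches = [
        (cycle, signal)
        for signal, values in golden.items()
        for cycle, value in enumerate(values)
        if observed[signal][cycle] != value
    ]
    return min(mismatches) if mismatches else None

print(first_divergence(GOLDEN, OBSERVED))   # -> (2, 'carry')
```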
Clustering of source code is a technique that can help improve feedback in automated program assessment. Grouping code submissions that contain similar mistakes can, for instance, facilitate the identification of students' difficulties so that targeted feedback can be provided. Moreover, solutions with similar functionality but possibly different coding styles or progress levels allow personalized feedback for students stuck at some point, based on a more developed source code, and can even reveal potential cases of plagiarism. However, existing clustering approaches for source code are mostly inadequate for automated feedback generation or assessment systems in programming education: they either give too much emphasis to syntactic program features, rely on expensive computations over pairs of programs, or require previously collected data. This paper introduces an online approach, and an implemented tool named AsanasCluster, to cluster source code submissions to programming assignments. The proposed approach relies on program attributes extracted from semantic graph representations of the source code, including control-flow and data-flow features. The resulting feature vectors are fed into an incremental k-means model, which determines the closest cluster for each solution as it enters the system, in a timely manner, since clustering is an intermediate step for feedback generation in automated assessment. We have conducted a twofold evaluation of the tool to assess (1) its runtime performance and (2) its precision in separating different algorithmic strategies. To this end, we applied our clustering approach to a public dataset of real submissions from undergraduate students to programming assignments, measuring the runtimes for the distinct tasks involved: building a model, identifying the closest cluster to a new observation, and recalculating partitions. As for precision, we partition two groups of programs collected from GitHub. One group contains implementations of two search...
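A brief sketch, under assumed feature vectors, of the online clustering step the abstract describes: incremental k-means (here scikit-learn's MiniBatchKMeans) assigns each incoming submission to its closest cluster and then folds it into the model without re-clustering everything. The numeric values merely stand in for the control-flow and data-flow attributes extracted from the semantic graphs.

```python
# Sketch: incremental k-means over per-submission feature vectors (values invented).
import numpy as np
from sklearn.cluster import MiniBatchKMeans

# Hypothetical feature vectors, e.g. counts of control/data-flow features.
initial_batch = np.array([
    [4, 1, 0, 2],
    [5, 1, 0, 2],
    [1, 3, 2, 0],
    [1, 4, 2, 0],
], dtype=float)

model = MiniBatchKMeans(n_clusters=2, random_state=0)
model.partial_fit(initial_batch)            # build the initial model

new_submission = np.array([[4, 1, 1, 2]], dtype=float)
closest_cluster = model.predict(new_submission)[0]
print(f"new submission assigned to cluster {closest_cluster}")

model.partial_fit(new_submission)           # fold it into the model incrementally
```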
The aim of this paper is to present observations on automatic and semi-automatic assessment of programming assignments used in different e-learning contexts. Teaching programming is an important part of Informatics Engineering, Computer Science, Informatics, Computing, and Information Technology and Communication courses at universities and high schools. Students taking these courses have to demonstrate competence in problem solving and programming by creating working programs. Checking program validity is usually based on testing a program on diverse test cases. Testing for batch-type problems involves creating a set of input data cases, running the program submitted by a contestant on those input cases, analysing the obtained outputs, and so on. Assessment of programming assignments is as complex as testing software systems. Many automatic assessment systems for programming assignments have been created to support teachers in submission assessment. However, balancing the quality and the speed of assessment for programming assignments remains an important problem. The authors investigated the possibilities of an advanced semi-automatic approach to assessment, which can serve as a compromise between manual and automatic assessment. A semi-automatic testing environment for evaluating programming assignments was developed, and its practical use in Lithuania's optional programming maturity examination is presented. The research presented is useful for evaluating the results of engineering education in general, and of informatics/computer engineering education in particular.
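A minimal sketch of the batch-style test-case checking described above: run a submitted program on each predefined input case and compare its output with the expected output, leaving failing cases for manual review in a semi-automatic setting. The toy submission and test cases are invented for illustration.

```python
# Sketch: run a toy submission on predefined input cases and compare outputs.
import subprocess
import sys

SUBMISSION = "print(sum(int(x) for x in input().split()))"   # toy 'student program'
TEST_CASES = [("1 2 3", "6"), ("10 -2", "8"), ("5", "5")]

def run_case(program: str, stdin_text: str) -> str:
    result = subprocess.run(
        [sys.executable, "-c", program],
        input=stdin_text, capture_output=True, text=True, timeout=5,
    )
    return result.stdout.strip()

passed = sum(run_case(SUBMISSION, given) == expected for given, expected in TEST_CASES)
print(f"passed {passed}/{len(TEST_CASES)} test cases")
# In a semi-automatic workflow, failing cases would then be queued for manual review.
```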
ISBN (print): 9798400701399
This PhD research explores the problem of building a system that provides real-time formative feedback on programming assignments given to college/university students. Such a system would maximize learning outcomes while minimizing the effort required from the tutor to construct it. We propose an approach to building such a system and assessing its effectiveness, and outline topics for future research.
ISBN (print): 9798400701382
Recent studies show that AI-driven code generation tools, such as large language models, are able to solve most of the problems usually presented in introductory programming classes. However, it is still unknown how they cope with object-oriented programming assignments, where students are asked to design and implement several interrelated classes (related by composition or inheritance) that follow a set of best practices. Since the majority of the exercises in these tools' training data are written in English, it is also unclear how well they handle exercises published in other languages. In this paper, we report our experience using GPT-3 to solve 6 real-world tasks, written in Portuguese, from an object-oriented programming course at a Portuguese university. Our observations, based on an objective evaluation of the code performed by an open-source automatic assessment tool, show that GPT-3 is able to interpret and handle direct functional requirements, but it tends not to give the best solution in terms of object-oriented design. We perform a qualitative analysis of GPT-3's output and gather a set of recommendations for computer science educators, since we expect students to use and abuse this tool in their academic work.
ISBN (print): 9781595939470
We have designed and implemented game-themed programming assignment modules targeted specifically for adoption in existing introductory programming classes. These assignments are self-contained, so that faculty members with no background in graphics or gaming can selectively pick and choose a subset to combine with their own assignments in existing classes. This paper begins with a survey of previous results. Based on this survey, it summarizes the important considerations when designing materials for selective adoption. The paper then describes our design, implementation, and assessment efforts. The result is a road map that guides faculty members in experimenting with game-themed programming assignments by incrementally adopting and customizing suitable materials for their classes.
ISBN (print): 9781450383264
In recent years, research has increasingly focused on developing intelligent tutoring systems that provide data-driven support for students who need assistance during programming assignments. One goal of such intelligent tutors is to provide students with quality interventions comparable to those human tutors would give. While most studies have focused on generating different forms of on-demand support, such as next-step hints and worked examples, at any given moment during a programming assignment, there is a lack of research on why human tutors would provide different forms of proactive interventions to students in different situations. This information is critical for allowing intelligent programming environments to select the appropriate type of student support at the right moment. In this work, we studied human tutors' reasons for providing interventions during two introductory programming assignments in a block-based environment. Three human tutors evaluated a sample of 86 struggling moments identified from students' log data using a data-driven model. The tutors specified whether and why an intervention was needed (or not) for each struggling moment. We analyzed the expert tags and their consensus discussions and extracted the main reasons that led the experts to decide to intervene: "missing key components to make progress", "using wrong or unnecessary blocks", "misusing needed blocks", "having critical logic errors", "needing confirmation and next steps", and "unclear student intention". We use six case studies to illustrate specific student code-trace examples and the tutors' reasons for intervention. We also discuss the potential types of automatic interventions that could address these cases. Our work sheds light on when and why students might need programming interventions. These insights contribute towards improving the quality of automated, data-driven support in programming learning environments.