Teachers deal with plagiarism on a regular basis, so they try to prevent and detect it, a task complicated by the large size of some classes. Students who cheat often try to hide (obfuscate) their plagiarism, and many similarity detection engines (often called plagiarism detection tools) have been built to help teachers. This article focuses only on plagiarism detection and presents a detailed systematic review of the field of source-code plagiarism detection in academia. The review gives an overview of definitions of plagiarism, plagiarism detection tools, comparison metrics, obfuscation methods, datasets used for comparison, and algorithm types. Perspectives on the meaning of source-code plagiarism detection in academia are presented, together with categorisations of the available detection tools and analyses of their effectiveness. The review also yielded insights about metrics and datasets for quantitative tool comparison and about the categorisation of detection algorithms. In addition, existing classifications of obfuscation methods are expanded, and a new definition of "source-code plagiarism detection in academia" is proposed.
ISBN: 9781450389785 (print)
Authorship identification is essential for detecting the deceptive misuse of others' content and for exposing the owners of anonymous malicious content. While it is widely studied for natural languages, it is rarely considered for programming languages. Accordingly, a PAN@FIRE task named Authorship Identification of Source Code (AI-SOCO) is proposed, focusing on the identification of source-code authors. The dataset consists of crawled source code submitted by the top 1,000 human users with 100 or more correct C++ submissions on the Codeforces online judge platform. The participating systems are asked to predict the author of a given piece of source code from a predefined list of code authors. In total, 60 teams registered on the task's CodaLab page, of which 14 teams submitted 94 runs. The results are surprisingly high, with many teams and baselines exceeding 90% accuracy. These systems used a wide range of models and techniques, from pretrained word embeddings (especially those adapted to handle source code) to stylometric features.
ISBN: 9781450379519 (print)
Computer science education has promised open access around the world, but access is largely determined by which human language you speak. As younger students learn computer science, it is less appropriate to assume that they should learn English beforehand. To that end, we present CodeInternational, the first tool to translate code between human languages. To develop a theory of non-English code and inform our translation decisions, we conduct a study of public code repositories on GitHub. The study is, to the best of our knowledge, the first on human language in code and covers 2.9 million Java repositories. To demonstrate CodeInternational's educational utility, we build an interactive version of the popular English-language Karel reader and translate it into 100 spoken languages. Our translations have already been used in classrooms around the world and represent a first step in an important open CS-education problem.
ISBN: 9789532330953 (print)
Today there are many source-code similarity detection tools. These tools are used for many purposes, one of which is plagiarism detection, the context in which this paper is written. Every time a new tool is developed, its authors want to show that it is better than existing ones, so they perform comparisons. These comparisons are often unfair towards the existing tools, for multiple possible reasons, such as a lack of calibration of the existing tools. Almost all tools have configuration parameters, but these are often not calibrated before the comparison. The paper presents a way of calibrating the tools to make the comparison more objective.
Perspectives of students on what constitutes source-code plagiarism may differ based on their educational background. Surveys have been conducted with home students undertaking computing and joint computing subject degrees at higher education institutions throughout the UK, China, and South Cyprus, and a total of 984 responses have been statistically analysed to determine the common areas of understanding and misunderstanding among students on various topics related to source-code plagiarism. The study identifies those topics which are well understood, and those topics which are not properly understood across the different groups of students, and is the first study which specifically discusses Cypriot student perceptions on source-code plagiarism. This study provides useful information to educators (teaching home and international students) who wish to better inform their students on the issues of plagiarism and source-code plagiarism. Finally, the survey results revealed that although students who were informed about plagiarism better understood what actions constitute plagiarism, some topics were still unclear among students regardless of the students' educational background and whether they had been previously informed about plagiarism.
To enhance the energy efficiency and performance of algorithms with Graphics Processing Unit (GPU) accelerators in source-code development, we consider power efficiency based on data transfer bandwidth and power consumption in key situations. First, a set of primitives is abstracted from program statements. Then, data transfer bandwidth and power consumption at different granularity sizes are considered and mapped onto the appropriate primitives. With these mappings, a programmer can intuitively determine the power efficiency and performance in different running states of a thread. Finally, this intuition enables the programmer to tune the algorithm to achieve the best energy efficiency and performance. Using these power-aware principles, two Fast Fourier Transform (FFT) methods are compared. The mapping between power consumption and primitives is helpful for algorithm tuning at the source-code level.
ISBN: 9781467321686; 9781467321709 (print)
Source-code plagiarism detection is an unfortunate but necessary activity when reviewing assignments in programming courses. Although reasonably easy to fool, string-based comparisons offer a high degree of accuracy with almost no false positives, and a good string similarity metric is usually the length of the longest common subsequence. For two strings, the dynamic programming algorithm for this calculation unfortunately takes quadratic time even if the strings are equal. In this paper we present an algorithm that, given a batch of source-code files, efficiently finds all pairs of similar files by preprocessing the files and then using a fast branch-and-bound algorithm to find only those pairs whose longest common subsequence is indicative of plagiarism.
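As an illustration of the baseline this abstract refers to, the following is a minimal sketch of the textbook quadratic-time dynamic programming computation of longest-common-subsequence length. It is not the paper's preprocessing or branch-and-bound algorithm; the function names and the normalisation by the longer string are assumptions made for this example.

```python
def lcs_length(a: str, b: str) -> int:
    """Length of the longest common subsequence of a and b.

    Classic dynamic programming: O(len(a) * len(b)) time regardless of
    how similar the inputs are, and O(min(len(a), len(b))) extra space.
    """
    # Keep the shorter string on the inner loop so each row stays small.
    if len(a) < len(b):
        a, b = b, a
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        curr = [0]
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                # Extend the subsequence ending just before both characters.
                curr.append(prev[j - 1] + 1)
            else:
                # Drop one character from either string, keep the better result.
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def lcs_similarity(a: str, b: str) -> float:
    """Hypothetical normalised score in [0, 1]: LCS length over the longer input."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0  # two empty files are treated as identical
    return lcs_length(a, b) / longest
```

Because every cell of the table is filled even for identical inputs, comparing all pairs in a large batch this way is costly, which is the motivation the abstract gives for pruning candidate pairs.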
ISBN: 9781538633540 (print)
This paper proposes a method for detecting plagiarism in source code using deep features. The embeddings for programs are obtained using a character-level Recurrent Neural Network (char-RNN), which is pre-trained on Linux kernel source code. Many popular plagiarism detection tools are based on n-gram techniques at the syntactic level. However, these approaches fail to capture the long-term dependencies (non-contiguous interactions) present in source code. In contrast, the proposed deep features capture non-contiguous interaction within n-grams. They are generic in nature, so there is no need to fine-tune the char-RNN model again for the program submissions of each individual problem set. Our experiments show the effectiveness of deep features in the task of classifying assignment program submissions as copy, partial-copy, and non-copy. Comparing our proposed features with handcrafted features (source-code metrics and textual features), we report F1-score improvements of 9.5% for binary classification and 5% for three-way classification.
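To make the contrast with n-gram techniques concrete, here is a minimal sketch of a character n-gram similarity of the kind the abstract describes as a common baseline. It is not the paper's char-RNN method or any specific tool's algorithm; the function names, the default n of 4, and the use of Jaccard overlap are assumptions made for this example.

```python
def char_ngrams(text: str, n: int = 4) -> set[str]:
    """Set of contiguous character n-grams in text."""
    if len(text) < n:
        # Too short for a full n-gram: fall back to the text itself.
        return {text} if text else set()
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def ngram_jaccard(a: str, b: str, n: int = 4) -> float:
    """Jaccard overlap of the two n-gram sets, in [0, 1]."""
    grams_a, grams_b = char_ngrams(a, n), char_ngrams(b, n)
    if not grams_a and not grams_b:
        return 1.0  # two empty inputs are treated as identical
    return len(grams_a & grams_b) / len(grams_a | grams_b)
```

Each n-gram covers only n adjacent characters, so an edit that separates related pieces of code changes which n-grams appear without the measure seeing any link between the pieces. That locality is the limitation the abstract attributes to n-gram approaches.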
The binary-based attestation (BA) mechanism presented by the Trusted Computing Group (TCG) equips an application with the capability of genuinely identifying the configuration of a remote system. However, BA only supports attestation for specific patterns of binary code defined by a trusted party, usually the software vendor, for a particular version of the software. In this paper, we present a Source-Code Oriented Attestation (SCOA) framework that enables custom-built applications to be attested in the TCG attestation architecture. In SCOA, security attributes are bound to the source code of an application instead of its binary code. With a proof chain generated by a Trusted Building System (TBS) to record the building procedure, challengers can determine whether the binary they interact with was genuinely built from a particular set of source code. Moreover, with the security attribute certificates assigned to the source code, they can determine the trustworthiness of the binary. We also present a TBS implementation based on virtualization.