Storing ultra-large amounts of unstructured data (often called objects or blobs) is a fundamental task for several object-based storage engines, data warehouses, data-lake systems, and key-value stores. These systems cannot currently leverage similarities between objects, which could be vital in improving their space and time performance. An important use case in which we can expect the objects to be highly similar is the storage of large-scale versioned source code datasets, such as the Software Heritage Archive (Di Cosmo and Zacchiroli, 2017). This use case is particularly interesting given the extraordinary size (1.5 PiB), the variegated nature, and the high repetitiveness of the corpus at issue. In this paper we discuss and experiment with content- and context-based compression techniques for source-code collections that tailor known and novel tools to this setting, in combination with state-of-the-art general-purpose compressors and the information coming from the Software Heritage Graph. We experiment with our compressors over a random sample of the entire corpus and four large samples of source code files written in different popular languages: C/C++, Java, JavaScript, and Python. We also consider two usage scenarios for our compressors, called the Backup and File-Access scenarios, where the latter adds support for single-file retrieval to the former. As a net result, our experiments show (i) how compressible each language is, (ii) which content- or context-based techniques compress better and are faster to (de)compress, possibly while supporting individual file access, and (iii) the ultimate compressed size that, according to our estimate, our best solution could achieve in storing all the source code written in these languages and available in the Software Heritage Archive: namely, 3 TiB (down from an original total size of 78 TiB, with an average compression ratio of 4%).
In higher education, especially in programming-intensive fields like computer science, safeguarding students' source code is crucial to prevent theft that could impact learning and future careers. Traditional storage solutions like Google Drive are vulnerable to hacking and alterations, highlighting the need for stronger protection. This work explores digital technologies that enhance source code security, with a focus on Blockchain and NFTs. Due to Blockchain's decentralized and immutable nature, NFTs can be used to control code ownership, improving security, traceability, and preventing unauthorized access. This approach effectively addresses existing gaps in protecting academic intellectual property. However, as Bennett et al. highlight, while these technologies have significant potential, challenges remain in large-scale implementation and user acceptance. Despite these hurdles, integrating Blockchain and NFTs presents a promising opportunity to enhance academic integrity. Successful adoption in educational settings may require a more inclusive and innovative strategy.
This paper considers the problem of source code plagiarism by students within the computing disciplines and reports the results of a survey of students in Computing departments in 18 institutions in the U.K. This survey was designed to investigate how well students understand the concept of source code plagiarism and to discover what, if any, specific aspects might cause particular confusion. An analysis of the results was carried out to assess understanding by topic and to discover whether various demographic factors may have an influence on that understanding. Within the survey sample, it appeared that the demographic factors tested did not generally affect students' understanding of source code plagiarism. However, analysis of the data for specific topics revealed that there are several areas of activity where the boundary between acceptable and unacceptable behavior is not clearly understood. These findings have implications for plagiarism education programs.
Efficient detection of plagiarism in students' programming assignments is of great importance to the educational process. This paper presents a clustering-oriented approach to the problem of source code plagiarism. The implemented software, called PDetect, accepts as input a set of program sources and extracts subsets (the clusters of plagiarism) such that each program within a particular subset has been derived from the same original. PDetect proposes the use of an appropriate measure for evaluating plagiarism detection performance and supports the idea of combining different plagiarism detection schemes. Furthermore, a cluster analysis is performed in order to provide information beneficial to the plagiarism detection process. PDetect is designed so that it can be easily adapted to any keyword-based programming language, and it compares favorably with earlier (state-of-the-art) plagiarism detection approaches.
ISBN: (print) 9783319955223; 9783319955216
Plagiarism is currently a growing problem because so many resources are easily accessible, and many papers deal with this topic. New algorithms are constantly being created, but there are currently not many systems that can be used for plagiarism detection. Our aim is to explore plagiarism on a large scale. This paper focuses on selecting an appropriate representation of the source code, which is very important when searching for plagiarism. We give an overview of the current representation possibilities and focus on representing source code using an abstract syntax tree (AST). Comparing tree structures is a time-consuming operation, so we try to find how to represent the AST effectively in order to facilitate comparison. There are two ways to represent the AST: by hashing or by characteristic vectors. We present the experiment and the results on the basis of which we choose the appropriate form of representation.
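The two AST representations named in this abstract can be illustrated with a minimal sketch. It assumes Python's standard `ast` module as the parser and uses hypothetical function names (`characteristic_vector`, `cosine_similarity`, `structural_hash`); the abstract does not say which language, vector features, or hash function the authors use, so none of this should be read as their method.

```python
import ast
import hashlib
from collections import Counter


def characteristic_vector(source: str) -> Counter:
    """Represent a program as counts of AST node types (assumed feature set)."""
    tree = ast.parse(source)
    return Counter(type(node).__name__ for node in ast.walk(tree))


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[key] * b[key] for key in a.keys() & b.keys())
    norm_a = sum(v * v for v in a.values()) ** 0.5
    norm_b = sum(v * v for v in b.values()) ** 0.5
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def structural_hash(source: str) -> str:
    """Hash the shape of the AST, ignoring identifier names and constant values."""

    def shape(node: ast.AST) -> str:
        # Only node types and nesting are serialized, so renaming a variable
        # or changing a literal leaves the hash unchanged.
        children = ",".join(shape(child) for child in ast.iter_child_nodes(node))
        return f"{type(node).__name__}({children})"

    return hashlib.sha256(shape(ast.parse(source)).encode("utf-8")).hexdigest()
```

Hashes give a fast exact-match test on structure, while characteristic vectors allow a graded similarity score; that trade-off is presumably what the experiment in the paper weighs.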
This article presents a proposal for the detection of programming source code similarity in academic environments. The objective of this proposal is to support professors in detecting plagiarism in student homework assignments in introductory computer programming courses. The developed tool, codeSIGHT, is based on a modification of the Greedy String Tiling algorithm. The tool was tested in one theoretical and three real scenarios, detecting similarity in assignments ranging from those that contained code without modifications to assignments containing insertions of procedural instructions inside the main code. The results verified the efficiency of the tool at the first five levels of the plagiarism spectrum for programming code, in addition to supporting suspicions of plagiarism in real scenarios. (c) 2013 Wiley Periodicals, Inc. Comput Appl Eng Educ 23:13-22, 2015.
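For orientation, the basic Greedy String Tiling algorithm can be sketched as below. This is the plain, unoptimized form of the algorithm and not codeSIGHT's modification, which the abstract does not describe. The function names, the `min_match` parameter, and the assumption that programs arrive already tokenized are all illustrative.

```python
from typing import List, Sequence, Tuple

Tile = Tuple[int, int, int]  # (start in a, start in b, length)


def greedy_string_tiling(a: Sequence[str], b: Sequence[str], min_match: int = 3) -> List[Tile]:
    """Cover a and b with non-overlapping common substrings, longest first."""
    if min_match < 1:
        raise ValueError("min_match must be at least 1")
    marked_a = [False] * len(a)
    marked_b = [False] * len(b)
    tiles: List[Tile] = []
    while True:
        max_match = min_match
        matches: List[Tile] = []
        # Phase 1: find all maximal matches of unmarked tokens.
        for i in range(len(a)):
            if marked_a[i]:
                continue
            for j in range(len(b)):
                if marked_b[j]:
                    continue
                k = 0
                while (i + k < len(a) and j + k < len(b)
                       and a[i + k] == b[j + k]
                       and not marked_a[i + k] and not marked_b[j + k]):
                    k += 1
                if k == max_match:
                    matches.append((i, j, k))
                elif k > max_match:
                    matches = [(i, j, k)]
                    max_match = k
        if not matches:
            break
        # Phase 2: turn matches into tiles unless an earlier tile occludes them.
        for i, j, k in matches:
            if any(marked_a[i:i + k]) or any(marked_b[j:j + k]):
                continue
            for t in range(k):
                marked_a[i + t] = True
                marked_b[j + t] = True
            tiles.append((i, j, k))
    return tiles


def similarity(a: Sequence[str], b: Sequence[str], min_match: int = 3) -> float:
    """Fraction of tokens in both sequences that are covered by tiles."""
    if not a and not b:
        return 0.0
    covered = sum(length for _, _, length in greedy_string_tiling(a, b, min_match))
    return 2 * covered / (len(a) + len(b))
```

Because tiles are placed longest first and never overlap, inserting unrelated statements between copied fragments splits one long tile into several shorter ones rather than hiding the copy, provided each fragment is still at least `min_match` tokens long.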
The detection of similarities in source code has applications not only in software re-engineering (to eliminate redundancies) but also in software plagiarism detection. The latter can be a challenging problem since more or less extensive edits may have been performed on the original copy: insertion or removal of useless chunks of code, rewriting of expressions, transposition of code, inlining and outlining of functions, etc. In this paper, we propose a new similarity detection technique based not only on token sequence matching but also on the factorization of the function call graphs. The factorization process merges shared chunks (factors) of code to cope, in particular, with inlining and outlining. The resulting call graph offers a view of the similarities with their nesting relations. It is useful for inferring metrics that quantify similarity at the function level. (C) 2012 Elsevier B.V. All rights reserved.
Source code authorship attribution is the task of identifying who developed a piece of code by learning the programmer's style. It is a critical activity that is used extensively in areas such as computer security, computer law, and plagiarism. This paper investigates source code authorship attribution by capturing natural language aspects of the code rather than using only the minimal set of syntactic and stylistic code features explored in the previous literature. It proposes an evolutionary feature selection model to improve the accuracy of authorship attribution by implementing two language models (uni-gram and bi-gram). The proposed approach uses K-Nearest Neighbor as a classifier and a Genetic Algorithm as a feature selection technique. Two experiments were conducted on a public authorship attribution dataset from GitHub; the experiments include various evolutionary feature selection models. The results obtained in both experiments were compared with related studies and show a significant improvement in terms of accuracy.
The use of Source Code Author Profiles (SCAP) represents a new, highly accurate approach to source code authorship identification that is, unlike previous methods, language independent. While accuracy is clearly a crucial requirement of any author identification method, in cases of litigation regarding authorship, plagiarism, and so on, there is also a need to know why it is claimed that a piece of code is written by a particular author. What is it about that piece of code that suggests a particular author? What features in the code make one author more likely than another? In this study, we describe a means of identifying the high-level features that contribute to source code authorship identification, using the SCAP method as a tool. A variety of features are considered for Java and Common Lisp, and the importance of each feature in determining authorship is measured through a sequence of experiments in which we remove one feature at a time. The results show that, for these programs, comments, layout features, and package-related naming influence classification accuracy, whereas user-defined naming, an obvious programmer-related feature, does not appear to influence accuracy. A comparison is also made between the relative feature contributions in programs written in the two languages. (C) 2007 Elsevier Inc. All rights reserved.
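The abstract does not spell out how SCAP works. As the method is usually described, an author profile is the set of the most frequent byte-level n-grams in that author's code, and an unknown program is attributed to the author whose profile shares the most n-grams with it. The sketch below follows that description; the function names, the default n-gram length, and the default profile size are assumptions chosen for illustration, not values taken from this paper.

```python
from collections import Counter
from typing import Mapping, Set


def profile(code: bytes, n: int = 6, size: int = 1500) -> Set[bytes]:
    """Return the `size` most frequent byte-level n-grams of `code`."""
    grams = Counter(code[i:i + n] for i in range(len(code) - n + 1))
    return {gram for gram, _ in grams.most_common(size)}


def profile_intersection(p1: Set[bytes], p2: Set[bytes]) -> int:
    """Similarity score: the number of n-grams the two profiles share."""
    return len(p1 & p2)


def attribute(unknown: bytes, authors: Mapping[str, bytes], n: int = 6, size: int = 1500) -> str:
    """Pick the author whose profile overlaps most with the unknown program."""
    target = profile(unknown, n, size)
    return max(
        authors,
        key=lambda name: profile_intersection(target, profile(authors[name], n, size)),
    )
```

Because the profile is built from raw bytes, comments and layout contribute n-grams just like identifiers do, which is consistent with the finding above that comments and layout features influence classification accuracy.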
Statement frequency data can inform programming language research and provide a solid basis for frequency-based code analysis. This paper presents an analysis of programming language statement frequency in a large corpus of C, C++, and Java source code comprising more than 54 million lines of code. Across these languages, the top four work-performing statement types are Method/Function Call, Assignment, If, and Return. Compared with studies of FORTRAN, COBOL, and PL/I in the 1970s, the main change is the prevalence of method/function calls. Statement use frequency across languages is remarkably similar, and within each individual language, most statement types have a frequency distribution that occupies a small range. A more detailed examination of assignment and looping statement types shows that many assignments simply involve copying of data and that C++ and Java use for statements more often than C does. Copyright (C) 2014 John Wiley & Sons, Ltd.
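A statement-frequency count of the kind analyzed here can be sketched in a few lines. The paper studies C, C++, and Java; the example below swaps in Python source and Python's standard `ast` module purely because that keeps the sketch self-contained, and the function name and the choice to report call statements under a "Call" label are assumptions.

```python
import ast
from collections import Counter


def statement_frequencies(source: str) -> Counter:
    """Count statement types in Python source, treating bare calls as 'Call'."""
    counts: Counter = Counter()
    for node in ast.walk(ast.parse(source)):
        if not isinstance(node, ast.stmt):
            continue
        # An expression statement whose value is a call, such as `print(s)`,
        # is reported as a call statement rather than a generic expression.
        if isinstance(node, ast.Expr) and isinstance(node.value, ast.Call):
            counts["Call"] += 1
        else:
            counts[type(node).__name__] += 1
    return counts


SAMPLE = """
def total(xs):
    s = 0
    for x in xs:
        if x > 0:
            s = s + x
    print(s)
    return s
"""

frequencies = statement_frequencies(SAMPLE)
for name, count in frequencies.most_common():
    print(f"{name}: {count}")
```

Calls nested inside other statements (for example on the right-hand side of an assignment) are not counted separately in this sketch; a study like the one above would need an explicit rule for such cases.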