Previous studies have shown that there is a non-trivial amount of duplication in source code. This paper analyzes a corpus of 4.5 million non-fork projects hosted on GitHub representing over 428 million files written in Java, C++, Python, and JavaScript. We found that this corpus has a mere 85 million unique files. In other words, 70% of the code on GitHub consists of clones of previously created files. There is considerable variation between language ecosystems. JavaScript has the highest rate of file duplication: only 6% of its files are distinct. Java, on the other hand, has the least duplication: 60% of its files are distinct. Lastly, a project-level analysis shows that between 9% and 31% of the projects consist of at least 80% files that can be found elsewhere. These rates of duplication have implications for systems built on open source software as well as for researchers interested in analyzing large code bases. As a concrete artifact of this study, we have created DejaVu, a publicly available map of code duplicates in GitHub repositories.
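The headline numbers above come from file-level clone detection, which at its simplest reduces to hashing file contents and counting how many files share a hash. The sketch below is a minimal illustration of that idea, not the authors' DejaVu pipeline; the directory layout, extension filter, and hash choice are assumptions.

```python
import hashlib
from collections import defaultdict
from pathlib import Path

def duplication_stats(root: str, extensions=(".java", ".py", ".js", ".cpp")):
    """Group files under `root` by content hash and report how many are exact clones."""
    by_hash = defaultdict(list)
    for path in Path(root).rglob("*"):
        if path.is_file() and path.suffix in extensions:
            digest = hashlib.sha1(path.read_bytes()).hexdigest()
            by_hash[digest].append(path)
    total = sum(len(paths) for paths in by_hash.values())
    unique = len(by_hash)
    duplication_rate = 1 - unique / total if total else 0.0
    return total, unique, duplication_rate

# Example: total, unique, rate = duplication_stats("corpus/")
```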
Building parsers is an essential task in the development of many tools, from software maintenance tools to any kind of business-specific, programmable environment with a command-line interface. Whilst grammars for many programming languages are available, they are very often of limited use because of the wide spread of dialects and variants not covered by the standard grammars. Writing a grammar by hand is clearly feasible; however, it can be a tedious and error-prone task, requiring skills that are not always available. Grammar inference is a possible, though challenging, approach for obtaining suitable grammars from program examples. However, inference from scratch poses serious scalability issues and tends to produce correct but meaningless grammars that are hard to understand and to use for building tools. This paper describes an approach, based on genetic algorithms, for evolving existing grammars towards target (dialect) grammars, inferring the changes from examples written in the dialect. Results obtained from experiments on the inference of C dialect rules show that the algorithm is able to evolve the grammar successfully. Inspections indicated that the changes automatically made to the grammar during its evolution preserved its meaningfulness and were comparable to what a developer could have done by hand.
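A genetic search over grammar variants can be sketched as the usual evolve-and-evaluate loop: mutate candidate rule sets, score each candidate by how many dialect examples it parses, and keep the fittest. The sketch below only illustrates that loop under stated assumptions; `mutate_grammar` and `parses` are hypothetical helpers standing in for the paper's mutation operators and parser generation, not its actual algorithm.

```python
import random

def evolve_grammar(base_grammar, examples, mutate_grammar, parses,
                   population_size=30, generations=100):
    """Evolve `base_grammar` toward a dialect grammar that parses all `examples`.

    `mutate_grammar(g)` returns a modified copy of a grammar; `parses(g, src)`
    reports whether grammar `g` accepts source text `src` (both are assumed helpers).
    """
    def fitness(grammar):
        return sum(parses(grammar, src) for src in examples)

    population = [base_grammar] + [mutate_grammar(base_grammar)
                                   for _ in range(population_size - 1)]
    for _ in range(generations):
        population.sort(key=fitness, reverse=True)
        if fitness(population[0]) == len(examples):
            break  # every dialect example is now accepted
        survivors = population[: population_size // 2]
        population = survivors + [mutate_grammar(random.choice(survivors))
                                  for _ in range(population_size - len(survivors))]
    return max(population, key=fitness)
```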
Due to the growing variety of Android malware, it is important to distinguish between its distinct types. In this paper, we introduce the use of decompiled source code for malicious code classification. This decompiled source code provides deeper analysis opportunities and an understanding of the nature of the malware. Malicious code differs from natural-language text due to the syntax rules of compilers and the effort of attackers to evade potential detection. Hence, we adapt Natural Language Processing-based techniques, under some constraints, for malicious code classification. First, the proposed methodology decompiles the Android Package Kit files; then API calls, keywords, and non-obfuscated tokens are extracted from the source code and categorized into stop-tokens, feature-tokens, and long-tail-tokens. We also introduce the use of generalized N-tokens to represent tokens that are typically less frequent. Our approach was evaluated against baselines using API calls, permissions, and their combination as features, as well as against neural network architectures based on decompiled Android Package Kits. A rigorous evaluation was performed on comprehensive public real-world Android malware datasets, comprising 24,553 apps categorized into 71 families for malicious family classification and 60,000 apps for malicious code detection. Our approach outperformed the baselines in both tasks.
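The pipeline described above is essentially a bag-of-tokens text classifier applied to decompiled code. The sketch below shows one way such a pipeline could look with scikit-learn; the stop-token list, the tokenizing regular expression, and the classifier choice are illustrative assumptions, not the paper's exact feature extraction.

```python
import re
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

STOP_TOKENS = {"public", "private", "void", "return", "new"}  # assumed stop-token list

def tokenize(decompiled_source: str):
    """Split decompiled Java-like source into identifier/API tokens, dropping stop-tokens."""
    tokens = re.findall(r"[A-Za-z_][A-Za-z0-9_.]*", decompiled_source)
    return [t for t in tokens if t not in STOP_TOKENS]

def train_family_classifier(texts, labels):
    """`texts`: decompiled APK sources as strings; `labels`: their malware families."""
    model = make_pipeline(
        CountVectorizer(analyzer=tokenize, min_df=2),  # very rare tokens fall below min_df
        LogisticRegression(max_iter=1000),
    )
    model.fit(texts, labels)
    return model
```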
Static analysis is one of the most popular methods of software code analysis. It allows checking code for compliance with the language specification as well as finding potential vulnerabilities. In this work, a static analysis of a corpus of listings of open-source Python applications is performed. Using the Bandit library, statistics for various categories of potential vulnerabilities are collected, and a ranking table of the vulnerabilities detected in the dataset is constructed. A qualitative analysis of the threats is performed according to their severity, based on the CWE data.
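Bandit can be run from the command line with JSON output, which makes this kind of tally straightforward to reproduce. The sketch below runs Bandit over a directory and aggregates findings by test ID and severity; the directory path is an assumption, and the exact JSON fields may vary between Bandit versions.

```python
import json
import subprocess
from collections import Counter

def bandit_severity_counts(source_dir: str):
    """Run Bandit recursively over `source_dir` and tally issues by (test_id, severity)."""
    # Bandit exits with a non-zero status when it finds issues, so don't use check=True.
    proc = subprocess.run(
        ["bandit", "-r", source_dir, "-f", "json", "-q"],
        capture_output=True, text=True,
    )
    report = json.loads(proc.stdout)
    counts = Counter(
        (issue["test_id"], issue["issue_severity"]) for issue in report["results"]
    )
    return counts.most_common()

# Example:
# for (test, severity), n in bandit_severity_counts("apps_corpus/"):
#     print(test, severity, n)
```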
With the growing awareness of the importance of software maintenance has come a re-evaluation of software maintenance tools. Such tools range from source code analysers to semi-intelligent tools which seek to reconstruct system designs and specification documents from source code. However, it is clear that relying solely upon source code as a basis for reverse engineering has many problems. These problems include poor abstraction, which leads to over-detailed specification models, and the inability to link other parts of a software system, such as documentation and user expertise, to the underlying code. This paper describes the work of the Esprit DOCKET project, which has developed a prototype environment to support the development of a system model linking the user-oriented, business aspects of a system to operational code using a variety of knowledge source inputs: code, documents, and user expertise. The aim is to provide a coherent model that forms the basis for system understanding and supports the software change and evolution process.
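The central idea is a traceability model that ties business-level concepts back to the knowledge sources that evidence them: code, documents, and captured user expertise. The sketch below shows one minimal way such linked sources could be represented; the class and field names are assumptions made for illustration, not the DOCKET model itself.

```python
from dataclasses import dataclass, field

@dataclass
class KnowledgeSource:
    kind: str        # "code", "document", or "expertise"
    identifier: str  # e.g. a module name, a manual section, or an interview note
    content: str

@dataclass
class SystemModelEntity:
    """A business-level concept linked back to the knowledge sources that evidence it."""
    name: str
    sources: list = field(default_factory=list)

    def link(self, source: KnowledgeSource):
        self.sources.append(source)

# Example:
# order = SystemModelEntity("Order processing")
# order.link(KnowledgeSource("code", "billing/order.c", "..."))
# order.link(KnowledgeSource("document", "User manual, section 3.2", "..."))
```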
Querying source code is an essential aspect of a variety of software engineering tasks such as program understanding, reverse engineering, program structure analysis, and program flow analysis. In this paper, we present and demonstrate the use of an algebraic source code query technique that blends expressive power with query compactness. The query framework, Source Code Algebra (SCA), permits users to express complex source code queries and views as algebraic expressions. Queries are expressed against an extensible, object-oriented database that stores program source code. The SCA algebraic approach offers multiple benefits, such as an applicative query language, high expressive power, seamless handling of structural and flow information, clean formalism, and potential for query optimization. We present a case study where SCA expressions are used to query a program in terms of program organization, resource flow, control flow, metrics, and syntactic structure. Our experience with an SCA-based prototype query processor indicates that an algebraic approach to source code queries combines the benefits of expressive power and compact query formulation.
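Algebraic source-code queries compose operators such as selection and projection over a database of program entities. The sketch below mimics that style over an in-memory model of functions; the entity schema, the operator names, and the example query are illustrative assumptions, not SCA's actual operators.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Function:
    name: str
    module: str
    calls: tuple      # names of functions this function calls
    cyclomatic: int   # a simple complexity metric stored with the entity

def select(entities, predicate):
    """Algebraic selection: keep the entities satisfying `predicate`."""
    return [e for e in entities if predicate(e)]

def project(entities, *fields):
    """Algebraic projection: keep only the named attributes of each entity."""
    return [{f: getattr(e, f) for f in fields} for e in entities]

# "Which functions in module 'io' call 'open' and are non-trivial?"
def example_query(functions):
    return project(
        select(functions, lambda f: f.module == "io"
                          and "open" in f.calls
                          and f.cyclomatic > 5),
        "name", "cyclomatic",
    )
```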
Large software systems need to be modified to remain useful. Changes can be performed more easily when their design has been carefully documented. This paper presents an approach to quickly find design patterns that have been implemented in a software system. The devised solution greatly reduces the number of checks performed by organising the search for a design pattern as tree traversals, where candidate classes are carefully positioned into trees. By automatically tagging classes with design pattern roles, we make it easier for developers to reason about large software systems. Our approach can provide documentation that lets developers understand the role each class plays, assess the quality of the code, and obtain assistance in refactoring and enhancing the functionality of the software system.
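Pattern detection of this kind amounts to matching structural role constraints over class relationships. The sketch below tags candidate Observer-pattern roles from a simple class-relationship table; the relationship encoding and the single hard-coded rule are illustrative assumptions, not the paper's tree-based search.

```python
from dataclasses import dataclass, field

@dataclass
class ClassInfo:
    name: str
    methods: set = field(default_factory=set)
    references: set = field(default_factory=set)  # classes this class holds or calls

def tag_observer_roles(classes):
    """Tag classes that structurally look like Subject/Observer roles."""
    roles = {}
    by_name = {c.name: c for c in classes}
    for c in classes:
        # A Subject keeps references to observers and exposes attach/notify methods.
        if {"attach", "notify"} <= c.methods:
            roles[c.name] = "Subject"
            for ref in c.references:
                if ref in by_name and "update" in by_name[ref].methods:
                    roles[ref] = "Observer"
    return roles

# Example:
# subject = ClassInfo("EventBus", {"attach", "notify"}, {"Listener"})
# observer = ClassInfo("Listener", {"update"})
# tag_observer_roles([subject, observer])  # {"EventBus": "Subject", "Listener": "Observer"}
```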
Identifying code duplication in large multi-platform software systems is a challenging problem. This is due to a variety of reasons, including the presence of high-level programming languages and structures interleaved with hardware-dependent low-level resources and assembler code, the use of GUI-based configuration scripts generating commands to compile the system, and the extremely high number of possible configurations. This paper studies the extent and the evolution of code duplication in the Linux kernel. Linux is a large, multi-platform software system; it is open source, and so there are no obstacles to discussing its implementation. In addition, it is decidedly too large to be examined manually: the current Linux kernel release (2.4.18) is about three million lines of code. Nineteen releases, from 2.4.0 to 2.4.18, were processed and analyzed, identifying code duplication among Linux subsystems by means of a metric-based approach. The obtained results support the hypothesis that the Linux system does not contain a significant fraction of duplicated code. Furthermore, code duplication tends to remain stable across releases, suggesting a fairly stable structure that evolves smoothly without any evidence of degradation.
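A metric-based approach to duplicate detection compares functions by a vector of cheap metrics rather than by full text; functions with identical (or very close) vectors become candidate clones. The sketch below computes a crude metric vector for C functions given as strings and groups exact matches; the choice of metrics is an assumption made for illustration, not the fingerprint set used in the study.

```python
import re
from collections import defaultdict

def metric_vector(function_body: str):
    """A crude fingerprint of a C function: size, statements, branch points, call sites."""
    return (
        len(function_body.splitlines()),                               # lines of code
        function_body.count(";"),                                      # rough statement count
        len(re.findall(r"\b(?:if|for|while|case)\b", function_body)),  # branch points
        len(re.findall(r"\w+\s*\(", function_body)),                   # call-like sites
    )

def candidate_clones(functions):
    """Group functions (name -> body) whose metric vectors coincide exactly."""
    groups = defaultdict(list)
    for name, body in functions.items():
        groups[metric_vector(body)].append(name)
    return [names for names in groups.values() if len(names) > 1]
```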
One approach to measuring and managing the complexity of software, as it evolves over time, is to exploit software metrics. Metrics have been used to estimate the complexity of the maintenance effort, to facilitate change impact analysis, and as an indicator for automatically detecting transformations that can improve the quality of a system. However, there has been little effort directed at applying software metrics to the maintenance of grammar-based software applications, such as compilers, editors, program comprehension tools, and embedded systems. In this paper, we adapt the software metrics that are commonly used to measure program complexity and apply them to measuring the complexity of grammar-based software applications. Since the behaviour of a grammar-based application is typically choreographed by the grammar rules, the measure of complexity that our metrics provide can guide maintainers in locating problematic areas in grammar-based applications.
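A simple way to adapt program-complexity metrics to grammars is to treat each production rule like a routine and count its decision points (alternatives, optional and repeated parts), in the spirit of McCabe's cyclomatic complexity. The sketch below scores the rules of a BNF-like grammar held as a dictionary; the grammar encoding and the scoring rule are assumptions made for illustration, not the metrics defined in the paper.

```python
def rule_complexity(alternatives):
    """McCabe-style score for one rule: 1 plus its decision points.

    `alternatives` is a list of right-hand sides, each a list of symbols;
    symbols ending in '?' or '*' are treated as optional/repeated parts.
    """
    decisions = max(len(alternatives) - 1, 0)
    decisions += sum(sym.endswith(("?", "*")) for alt in alternatives for sym in alt)
    return 1 + decisions

def grammar_report(grammar):
    """Rank the nonterminals of a {nonterminal: [rhs, ...]} grammar by complexity."""
    scores = {nt: rule_complexity(alts) for nt, alts in grammar.items()}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

# Example (toy C-like statement grammar):
# grammar = {"stmt": [["if_stmt"], ["while_stmt"], ["expr", ";"]],
#            "if_stmt": [["if", "(", "expr", ")", "stmt", "else_part?"]]}
# grammar_report(grammar)  # [("stmt", 3), ("if_stmt", 2)]
```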