Code comments play an important role in program understanding, and a large number of automatic comment generation methods have been proposed in recent years. To generate better comments, many studies extract a variety of information (e.g., code tokens, AST traversal sequences, API call sequences) from source code as model input. In this study, we find that the bytecode compiled from the source code can provide useful information for comment generation, and we therefore propose using bytecode information to assist comment generation. Specifically, we extract the control flow graph (CFG) from the bytecode and propose a serialization method to obtain a CFG sequence that preserves the program structure. Then, we discuss three methods for introducing bytecode information into different models. We collected 390,000 Java methods from the Maven repository and created a dataset of 101,124 samples after deduplication and preprocessing to evaluate our method. The results show that introducing the information extracted from the bytecode improves the BLEU-4 score of 7 comment generation models.
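As a rough illustration of what serializing a CFG into a sequence can look like, the sketch below linearizes a small graph with a depth-first preorder traversal. The graph, the block labels, and the `serialize_cfg` helper are hypothetical; the abstract does not describe the paper's actual serialization method, which may differ.

```python
def serialize_cfg(cfg, entry):
    """Return basic-block labels in depth-first preorder, starting at `entry`."""
    order, seen, stack = [], set(), [entry]
    while stack:
        node = stack.pop()
        if node in seen:
            continue
        seen.add(node)
        order.append(node)
        # Push successors in reverse so the first successor is visited first.
        stack.extend(reversed(cfg.get(node, [])))
    return order


# Hypothetical CFG of an if/else whose branches rejoin: block -> successors.
example_cfg = {
    "B0": ["B1", "B2"],
    "B1": ["B3"],
    "B2": ["B3"],
    "B3": [],
}
sequence = serialize_cfg(example_cfg, "B0")
print(" ".join(sequence))
```

A preorder walk keeps each branch contiguous in the output, which is one simple way to retain some structure in a flat token sequence.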
Context: Code understandability plays a crucial role in software development, as developers spend between 58% and 70% of their time reading source code. Improving code understandability can enhance productivity and reduce maintenance costs. Problem: Experimental studies aim to establish what makes code more or less understandable in a controlled setting, but ignore that what makes code easier to understand in the real world also depends on extraneous elements such as developers' backgrounds and project culture and guidelines. Not accounting for the influence of these factors may lead to results that are sound but have little external validity. Goal: We aim to investigate how developers improve code understandability during software development through code review comments. Our assumption is that code reviewers are specialists in code quality within a project. Method and Results: We manually analyzed 2,401 code review comments from Java open-source projects on GitHub and found that over 42% of all comments focus on improving code understandability, demonstrating the significance of this quality attribute in code reviews. We further explored a subset of 385 comments related to code understandability and identified eight categories of code understandability concerns, such as incomplete or inadequate code documentation, bad identifiers, and unnecessary code. Among the suggestions to improve code understandability, 83.9% were accepted and integrated into the codebase. Among these, only two (less than 1%) ended up being reverted later. We also identified types of patches that improve code understandability, ranging from simple changes (e.g., removing unused code) to more context-dependent improvements (e.g., replacing method call chains with an existing API). Finally, we investigated the potential coverage of four well-known linters to flag the identified code understandability issues. These linters cover less than 30% of these issues, although some of them could be ea
Static Analysis (SA) and Dynamic Analysis (DA) are complementary techniques for finding web application vulnerabilities. Typically, SA detects more vulnerabilities but reports a higher number of false positives, whereas DA finds fewer vulnerabilities but with better precision. In this paper, we blend SA and DA to simultaneously improve detection and decrease false alarms. Our approach starts with SA to identify an initial set of potential vulnerabilities. Then, the target application is executed to obtain specific runtime information. These data are used to automatically configure the DA, improving its ability to confirm whether the vulnerabilities reported by the SA are indeed exploitable. We evaluated the proposed approach using 49 WordPress plugins with more than 450 SQLi vulnerabilities. Our approach was able to classify 76.7% of the results reported by the SA as either a vulnerability or a false alarm, greatly reducing the usual need for manual work, which is a substantial improvement for security practitioners.
Combining caching with source coding, a hybrid content delivery system further facilitates the shift towards Information-Centric Networks. This is a promising technology heralded as the next phase in network design. However, finding the optimal balance between the source coding gains and the computational complexity is itself an NP-hard problem. By modelling the problem using a path-based approach, this paper outlines iterative algorithms that can be tuned to provide control over this trade-off. A necessary condition of optimality is also derived. This condition can be applied repeatedly to improve the results of the iterative algorithms. The Ant Colony Optimisation family of meta-heuristic algorithms is adapted to solve this problem, providing a benchmark that outperforms the Genetic Algorithm presented in prior work. The iterative algorithms have a larger time complexity than other solutions, but still converge in polynomial time. When combined with the optimality condition, they outperform all algorithms proposed for this problem to date. More specifically, this approach produces results that are found to fall in the 99.97th percentile on average.
Software system documentation is almost always expressed informally in natural language and free text. Examples include requirement specifications, design documents, manual pages, system development journals, error logs, and related maintenance reports. In our seminal 2002 paper, we proposed a method based on information retrieval (IR) to recover traceability links between source code and free text documents. A premise of our work was that programmers use meaningful names for program items, such as functions, variables, types, classes, and methods. The paper paved the way for the adoption of IR in software engineering, opening a new perspective. Reflecting on the past twenty years, we briefly review the many results that have been achieved. However, the emergence of new technologies, such as AI, poses unprecedented challenges.
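The premise about meaningful identifier names can be illustrated with a small sketch: identifiers are split into words and compared with each document by cosine similarity over raw term counts. This is a generic stand-in, not the model from the 2002 paper, and the helper names, code snippet, and requirement texts are all hypothetical.

```python
import math
import re
from collections import Counter


def terms(text):
    """Split camelCase and snake_case identifiers into lowercase words."""
    spaced = re.sub(r"([a-z0-9])([A-Z])", r"\1 \2", text)
    return [word.lower() for word in re.findall(r"[A-Za-z]+", spaced)]


def cosine(a, b):
    """Cosine similarity between two term lists, using raw term counts."""
    va, vb = Counter(a), Counter(b)
    dot = sum(va[t] * vb[t] for t in va)
    norm = math.sqrt(sum(v * v for v in va.values())) * math.sqrt(
        sum(v * v for v in vb.values())
    )
    return dot / norm if norm else 0.0


# Hypothetical code fragment and free-text requirements.
code = "class HttpRequestParser { parseHeader(); }"
docs = {
    "req1": "The parser shall read each header of an HTTP request",
    "req2": "Users can reset their password by email",
}
code_terms = terms(code)
scores = {name: cosine(code_terms, terms(text)) for name, text in docs.items()}
best = max(scores, key=scores.get)
print(best, scores)
```

If identifiers were opaque (for example `a1` or `tmp`), the term overlap would vanish and no link could be ranked, which is why the naming premise matters.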
As technology advances and new features emerge, the demand for Android applications continues to grow, leading to rapid release schedules. These accelerated development timelines push developers to make rushed changes, often resulting in suboptimal design practices, commonly known as code smells. These issues can degrade application quality, drive up maintenance costs, lead to unexpected behaviors, and complicate evolution and re-engineering efforts. While substantial research has focused on identifying Android-specific and object-oriented code smells, comparatively less attention has been devoted to their systematic refactoring and evaluation. This study introduces a web-based technique, validated through a tool specifically developed to detect 20 Android-specific code smells and automatically refactor 10 of them. Our approach surpasses traditional desktop and plugin solutions by providing easy accessibility and cross-platform compatibility and by eliminating setup requirements. When applied to six open-source and two industrial Android applications and evaluated against the ISO/IEC 25010 quality standard, our tool demonstrated considerable improvements: reducing CPU utilization by 15.39%, lowering memory consumption by 12.85%, and enhancing battery efficiency by up to 5.78%. The tool's accuracy, validated through precision, recall, and F-measure metrics, achieved averages of 91.81% precision, 97.77% recall, and a 94.67% F-measure. This study enhances the Android application development lifecycle by offering developers a feasible solution for optimizing CPU efficiency, reducing memory use, and minimizing battery consumption.
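For readers less familiar with the accuracy metrics, the sketch below computes precision, recall, and F-measure from detection counts. The counts are hypothetical and are not taken from the study; only the two averaged percentages in the second part come from the abstract.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall, and F-measure from detection counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical counts for one code smell: 90 correct detections,
# 10 false alarms, 5 missed instances.
p, r, f = precision_recall_f1(tp=90, fp=10, fn=5)
print(f"precision={p:.4f} recall={r:.4f} f1={f:.4f}")

# Harmonic mean of the averaged precision and recall quoted in the abstract.
f_from_averages = 2 * 91.81 * 97.77 / (91.81 + 97.77)
print(f"{f_from_averages:.2f}")
```

The harmonic mean of the two reported averages is about 94.70, slightly above the reported 94.67%. A plausible reason, though the abstract does not say so, is that the F-measure was averaged per smell or per application, and an average of F-measures generally differs from the F-measure of averaged precision and recall.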
Software reverse engineering plays a crucial role in identifying design patterns and reconstructing software architectures by analyzing system implementations and producing abstract representations across multiple layers. This research introduces a novel feature engineering approach that integrates both behavioral and structural analysis of code, resulting in a feature-rich sequential representation. This transformation enables the effective use of transformers and attention mechanisms to detect design patterns in source code. Our results emphasize the importance of context in distinguishing between various design patterns, demonstrating that the proposed sequence format, with its sensitivity to token order, significantly improves the model's capacity to differentiate between similar patterns. By leveraging attention mechanisms, our approach efficiently discards irrelevant code elements, focusing on the most critical features for accurate pattern detection. Additionally, we show that this sequential code representation can be used to augment training data, leading to enhanced model accuracy. Trained on a diverse set of code samples representing all 23 GoF design patterns, sourced from repositories such as GitHub and Bitbucket, our methodology achieved an accuracy of 92%. Evaluation metrics further validate the robustness of the approach. This study underscores the potential of context-driven, feature-engineered representations in advancing design pattern detection and contributes a comprehensive new dataset that supports behavioral code analysis, setting the stage for future research in this area.
Malware detection is a critical issue in software engineering, as malware directly threatens user information security. Existing approaches often focus on a single modality (either source code or binary code) for detection and fail to exploit the complementary information between the two. This limits detection performance, especially in complex and evasive malware scenarios. In this paper, we take Android applications written in Java as our subjects and present a novel fine-grained multimodal fusion method that uses large pre-trained models to combine features from source and binary code for malware detection. For the source code modality, we employ the graphical user interface (GUI) as a framework to segment the source code into snippets, and use a pre-trained programming language model to extract feature representations. For the binary code modality, we convert binary code into grayscale images and fine-tune a pre-trained vision model to extract features indirectly. We then implement cross-modal attention and devise a contrastive loss to align features across modalities, supplementing this with a supervised classification loss to refine the multimodal fusion process specifically for malware detection. Our experiments, conducted using the Data-MD and Data-MC benchmarks, demonstrate that our approach achieves a precision of 0.977 and a recall of 0.984 in detecting malware. This underscores the advantages of using large pre-trained models for feature representation and the fusion of information across different modalities for effective malware detection.
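The binary-to-image step can be pictured with a minimal sketch in which each byte becomes one pixel intensity (0 to 255) and the byte stream is wrapped into fixed-width rows. The function name, the chosen width, and the zero-padding of the last row are assumptions made for illustration; the abstract does not specify the paper's conversion details.

```python
def bytes_to_grayscale(data, width):
    """Map each byte to one pixel and wrap the stream into rows of `width`.

    The final row is zero-padded (an assumption) so that all rows are equal.
    """
    if width <= 0:
        raise ValueError("width must be positive")
    padding = (-len(data)) % width
    padded = data + bytes(padding)
    return [list(padded[i:i + width]) for i in range(0, len(padded), width)]


# Hypothetical 5-byte input wrapped into rows of 4 pixels.
image = bytes_to_grayscale(b"\x00\x7f\xff\x10\x20", width=4)
for row in image:
    print(row)
```

The resulting 2-D array can then be handed to any image model; in practice the width is usually chosen from the file size so that images have a manageable aspect ratio.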
Change classification, today known as Just-in-Time Defect Prediction, is a technique for predicting software bugs at the change level of granularity. Several ideas came together to form change classification: making predictions on code changes, using word-level textual features, applying machine learning classifiers, and leveraging open source code repositories. While change classification has led to a robust line of research, it has not yet had significant industrial adoption. A key recommendation is to explore explainability features so developers can better understand why a prediction is being made. We explore how large language models can advance this work by providing prediction explanations and bug fix suggestions.
We establish a coding theorem and a matching converse theorem for separate encodings and joint decoding of individual sequences using finite-state machines. The achievable rate region is characterized in terms of the Lempel-Ziv (LZ) complexities, the conditional LZ complexities and the joint LZ complexity of the two source sequences. An important feature that is needed to this end, which may be interesting in its own right, is a certain asymptotic form of a chain rule for LZ complexities, which we establish in this work. The main emphasis in the achievability scheme is on the universal decoder and its properties. We then show that the achievable rate region is universally attainable by a modified version of Draper's universal incremental Slepian-Wolf (SW) coding scheme, provided that there exists a low-rate reliable feedback link.
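To make the notion of LZ complexity concrete, the sketch below counts the phrases produced by LZ78 incremental parsing, for a single sequence and for a pair of sequences parsed jointly as a sequence of symbol pairs. Equating "LZ complexity" with the LZ78 phrase count is an assumption here, since the abstract does not give the definition, and the function name and example strings are hypothetical.

```python
def lz78_phrase_count(sequence):
    """Count phrases in the LZ78 incremental parsing of `sequence`.

    Each phrase is the shortest prefix of the remaining input that has not
    yet appeared as a phrase. A trailing, already-seen phrase counts as one.
    """
    phrases = set()
    current = ()
    count = 0
    for symbol in sequence:
        current += (symbol,)
        if current not in phrases:
            phrases.add(current)
            count += 1
            current = ()
    if current:
        count += 1
    return count


x = "abababab"
y = "aaaaaaaa"
c_x = lz78_phrase_count(x)                 # parsing of x alone
c_y = lz78_phrase_count(y)                 # parsing of y alone
c_xy = lz78_phrase_count(list(zip(x, y)))  # joint parsing over symbol pairs
print(c_x, c_y, c_xy)
```

Highly repetitive sequences parse into few, long phrases, so a low phrase count indicates compressibility; the chain rule mentioned in the abstract relates joint, conditional, and marginal quantities of this kind asymptotically.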