Search Results - Inner Mongolia University Library

A Comprehensive Evaluation of Parameter-Efficient Fine-Tuning on Method-Level Code Smell Detection

arXiv, 2024

Authors: Zhang, Beiqi; Liang, Peng; Zhou, Xin; Zhou, Xiyu; Lo, David; Feng, Qiong; Li, Zengyang; Li, Lin

Affiliations: School of Computer Science, Wuhan University, Wuhan, China; School of Computing and Information Systems, Singapore Management University, Singapore; School of Computer Science, Nanjing University of Science and Technology, Nanjing, China; School of Computer Science, Central China Normal University, Wuhan, China; School of Computer Science and Artificial Intelligence, Wuhan University of Technology, Wuhan, China

Code smells, which are suboptimal coding practices that can potentially lead to defects or maintenance issues, can negatively impact the quality of software systems. Most existing code smell detection methods rely on heuristics-based or machine learning (ML) and deep learning (DL)-based techniques. However, these techniques have several drawbacks (e.g., unsatisfactory performance). Large Language Models (LLMs) have garnered significant attention in the software engineering (SE) field, achieving state-of-the-art performance across a wide range of SE tasks. Parameter-Efficient Fine-Tuning (PEFT) methods, which are commonly used to adapt LLMs to specific tasks with fewer parameters and reduced computational resources, have emerged as a promising approach for enhancing the performance of LLMs in various SE tasks. However, LLMs have not yet been explored for code smell detection, and their effectiveness for this task remains unclear. Furthermore, no comprehensive investigation has been conducted on the efficiency of PEFT methods for method-level code smell detection. In this regard, we systematically evaluate the effectiveness of state-of-the-art PEFT methods on both small and large language models (LMs) for method-level code smell detection. To begin, we constructed high-quality Java code smell datasets sourced from GitHub. We then fine-tuned four small LMs and six LLMs using various PEFT techniques, including prompt tuning, prefix tuning, LoRA, and (IA)³, for code smell detection. Our comparison against full fine-tuning revealed that PEFT methods not only achieve comparable or better effectiveness but also consume less peak GPU memory. Our analysis further explored the performance of small LMs versus LLMs in the context of code smell detection. Surprisingly, we found that LLMs did not outperform small LMs in this specific task, suggesting that smaller models may be better suited for method-level code smell detection. We also investigated the impact of varying hyper-param

Keywords: Java programming language
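
As a rough illustration of why a PEFT method such as LoRA trains far fewer parameters than full fine-tuning, the sketch below counts trainable parameters for a single weight matrix. It assumes the standard LoRA formulation, in which a frozen d_out x d_in weight matrix is augmented by two trainable low-rank factors of rank r. The function names, layer size, and rank are hypothetical and are not taken from the paper.

```python
def full_finetune_params(d_out: int, d_in: int) -> int:
    """Trainable parameters when the whole weight matrix is updated."""
    return d_out * d_in


def lora_trainable_params(d_out: int, d_in: int, rank: int) -> int:
    """Trainable parameters for the LoRA factors B (d_out x rank) and A (rank x d_in)."""
    return rank * (d_out + d_in)


# Hypothetical layer size and rank, chosen only for illustration.
d_out, d_in, rank = 4096, 4096, 8
full = full_finetune_params(d_out, d_in)
lora = lora_trainable_params(d_out, d_in, rank)
print(f"full: {full}, LoRA: {lora}, ratio: {lora / full:.4%}")
```

Parameter count is only one contributor to the peak GPU memory mentioned in the abstract; activations and optimizer state also matter, so this count should not be read as a memory estimate.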

TestBench: Evaluating Class-Level Test Case Generation Capability of Large Language Models

arXiv, 2024

Authors: Zhang, Quanjun; Shang, Ye; Fang, Chunrong; Gu, Siqi; Zhou, Jianyi; Chen, Zhenyu

Affiliations: The State Key Laboratory for Novel Software Technology, Nanjing University, China; Huawei Cloud Computing Technologies Co. Ltd., China

In this paper, we introduce TestBench, a benchmark for class-level LLM-based test case generation. We construct a dataset of 108 Java programs from 9 real-world, large-scale projects on GitHub, each representing a different thematic domain. We then design three distinct types of prompts based on context descriptions: self-contained context, full context, and simple context. In addition, we propose a fine-grained evaluation framework that considers five aspects of test cases: syntactic correctness, compilation correctness, test correctness, code coverage rate, and defect detection rate. Furthermore, we propose a heuristic algorithm to repair erroneous test cases generated by LLMs. We evaluate CodeLlama-13b, GPT-3.5, and GPT-4 on TestBench, and our experimental results indicate that larger models demonstrate a greater ability to effectively utilize contextual information and thus generate higher-quality test cases. Smaller models may struggle with the noise introduced by the extensive information contained within the full context. However, when using the simplified version, namely the simple context, which is derived from the full context via abstract syntax tree analysis, the performance of these models improves significantly. Our analysis highlights the current progress and pinpoints future directions to further enhance the effectiveness of models by handling contextual information for test case generation. © 2024, CC BY.

Keywords: Java programming language

LLM-Assisted Static Analysis for Detecting Security Vulnerabilities

arXiv, 2024

Authors: Li, Ziyang; Dutta, Saikat; Naik, Mayur

Affiliations: University of Pennsylvania, United States; Cornell University, United States

Software is prone to security vulnerabilities. Program analysis tools to detect them have limited effectiveness in practice due to their reliance on human-labeled specifications. Large language models (LLMs) have shown impressive code generation capabilities, but they cannot do complex reasoning over code to detect such vulnerabilities, especially since this task requires whole-repository analysis. We propose IRIS, a neuro-symbolic approach that systematically combines LLMs with static analysis to perform whole-repository reasoning for security vulnerability detection. Specifically, IRIS leverages LLMs to infer taint specifications and perform contextual analysis, reducing the need for human specifications and inspection. For evaluation, we curate a new dataset, CWE-Bench-Java, comprising 120 manually validated security vulnerabilities in real-world Java projects. CodeQL, a state-of-the-art static analysis tool, detects only 27 of these vulnerabilities, whereas IRIS with GPT-4 detects 55 (+28) and improves upon CodeQL’s average false discovery rate by 5 percentage points. Furthermore, IRIS identifies 6 previously unknown vulnerabilities which cannot be found by existing tools. © 2024, CC BY-NC-ND.

Keywords: Java programming language

Assessment of Spatial Geometry of the Java Back-Arc Thrust in West Java, Indonesia Based on Geophysical Data

SSRN, 2024

Authors: Amukti, Rian; Bantan, Rashad A.; Aboulela, Hamdy; Handayani, Lina; Aribowo, Sonny

Affiliations: King Abdulaziz University, Marine Geology Department, Jeddah, Saudi Arabia; Bandung, Indonesia

The West Java Back-arc Thrust (WJBT) in the northern part of the Java area has been subjected to numerous seismic events of varying magnitude in the Java Back-arc Thrust system, which are genetically related to the well-known subduction of the Australian Plate beneath Java Island, Indonesia. The study aims to identify and model the subsurface structure of the North West Java area where the active fault is found. This work used accessible data, consisting of gravity, seismic reflection, and well data, to provide a wider lateral view of the geometry of the subsurface fault and to delineate the many other faults in the area under investigation. The gravity data covered the area from north of Bandung to Pamanukan, and seismic and well data were available for the Karawang to Indramayu area. The Bouguer gravity anomaly was used for basin modeling and to analyze density changes for fault identification. The results show an interplay between the uplift of the Bogor Basin and the thrusting system in the Java back-arc region. Since seismic activity will play a vital role in future urban and development strategy in the study area, the characteristics of the fault structure should be taken into consideration to mitigate its effects. © 2024, The Authors. All rights reserved.

Keywords: Java programming language

JavaVFC: Java Vulnerability Fixing Commits from Open-source Software

arXiv, 2024

Authors: Bui, Tan; Tun, Yan Naing; Cheng, Yiran; Irsan, Ivana Clairine; Zhang, Ting; Kang, Hong Jin

Affiliations: School of Computing and Information Systems, Singapore Management University, Singapore; Institute of Information Engineering, Chinese Academy of Sciences, Beijing, China

We present a comprehensive dataset of Java vulnerability-fixing commits (VFCs) to advance research in Java vulnerability analysis. Our dataset, derived from thousands of open-source Java projects on GitHub, comprises two variants: JavaVFC and JavaVFC-EXTENDED. The dataset was constructed through a rigorous process involving heuristic rules and multiple rounds of manual labeling. We initially used keywords to filter candidate VFCs based on commit messages, then refined this keyword set through iterative manual labeling. The final labeling round achieved a precision score of 0.7 among three annotators. We applied the refined keyword set to 34,321 open-source Java repositories with over 50 GitHub stars, resulting in JavaVFC with 784 manually verified VFCs and JavaVFC-EXTENDED with 16,837 automatically identified VFCs. Both variants are presented in a standardized JSONL format for easy access and analysis. This dataset supports various research endeavors, including VFC identification, fine-grained vulnerability detection, and automated vulnerability repair. JavaVFC and JavaVFC-EXTENDED are publicly available at https://***/records/13731781. Copyright © 2024, The Authors. All rights reserved.

Keywords: Java programming language
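
The keyword-based filtering of commit messages and the JSONL output described in the JavaVFC abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' pipeline: the keyword list, the record fields (`sha`, `message`), and the sample commits are all hypothetical, since the abstract does not give the refined keyword set or the dataset schema.

```python
import json

# Hypothetical keyword set; the paper's refined keyword list is not in the abstract.
KEYWORDS = ("vulnerab", "cve-", "security", "xss", "injection")


def is_candidate_vfc(message: str) -> bool:
    """Return True if the commit message contains any keyword (case-insensitive)."""
    lowered = message.lower()
    return any(keyword in lowered for keyword in KEYWORDS)


def to_jsonl(commits: list[dict]) -> str:
    """Serialize the candidate commits as JSONL: one JSON object per line."""
    candidates = [c for c in commits if is_candidate_vfc(c["message"])]
    return "\n".join(json.dumps(c, sort_keys=True) for c in candidates)


# Made-up commits with a placeholder CVE identifier, for illustration only.
commits = [
    {"sha": "a1", "message": "Fix XSS in comment rendering"},
    {"sha": "b2", "message": "Update README"},
    {"sha": "c3", "message": "Patch CVE-0000-0000 in parser"},
]
jsonl = to_jsonl(commits)
print(jsonl)
```

A filter like this only produces candidates; the abstract notes that the JavaVFC variant was additionally verified by hand.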

Arrays in Practice: An Empirical Study of Array Access Patterns on the JVM

arXiv, 2024

Authors: Åkerblom, Beatrice; Castegren, Elias

Affiliations: Department of Computer and System Sciences, Stockholm University, Sweden; Department of Information Technology, Uppsala University, Sweden

The array is a data structure used in a wide range of programs. Its compact storage and constant-time random access make it highly efficient, but arbitrary indexing complicates the analysis of code containing array accesses. Such analyses are important for compiler optimisations such as bounds check elimination. The aim of this work is to gain a better understanding of how arrays are used in real-world programs. While previous work has applied static analyses to understand how arrays are accessed and used, we take a dynamic approach. We empirically examine various characteristics of array usage by instrumenting programs to log all array accesses, allowing for analysis of array sizes, element types, where arrays are accessed from, and to what extent sequences of array accesses form recognizable patterns. The programs in the study were collected from the Renaissance benchmark suite, all running on the Java Virtual Machine. We account for characteristics displayed by the arrays investigated, finding that most arrays are small and are accessed by only one or two classes and by a single thread. On average over the benchmarks, 69.8% of the access patterns consist of uncomplicated traversals. Most of the instrumented classes (over 95%) do not use arrays directly at all. These results come from tracing data covering 3,803,043,390 array accesses made across 168,686 classes. While our analysis has only been applied to the Renaissance benchmark suite, the methodology can be applied to any program running on the Java Virtual Machine. This study, and the methodology in general, can inform future runtime implementations and compiler optimisations. Copyright © 2024, The Authors. All rights reserved.

Keywords: Java programming language
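
To make the idea of recognizing patterns in logged array accesses concrete, the sketch below labels a sequence of accessed indices by its stride. The category names and the classification rule are assumptions made for illustration; the abstract does not describe the paper's actual pattern taxonomy or trace format.

```python
def classify_access_pattern(indices: list[int]) -> str:
    """Label a logged sequence of array indices with a simple pattern name."""
    if len(indices) < 2:
        return "single"
    # Differences between consecutive accessed indices.
    strides = {b - a for a, b in zip(indices, indices[1:])}
    if strides == {1}:
        return "forward traversal"
    if strides == {-1}:
        return "backward traversal"
    if len(strides) == 1 and 0 not in strides:
        return "strided traversal"
    return "irregular"


# Made-up traces, for illustration only.
for trace in ([0, 1, 2, 3], [3, 2, 1], [0, 2, 4], [0, 5, 1]):
    print(trace, "->", classify_access_pattern(trace))
```

Under this rule, repeated accesses to the same index fall into the "irregular" bucket, which is a simplification a real analysis would likely refine.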

The Impact of Mutability on Cyclomatic Complexity in Java

arXiv, 2024

Authors: Bagaev, Marat; Khabibrakhmanova, Alisa; Sabaev, Georgy; Bugayenko, Yegor

Affiliations: HSE, Moscow, Russia; Huawei, Moscow, Russia

In Java, some object attributes are mutable, while others are immutable (with the "final" modifier attached to them). Objects that have at least one mutable attribute may be referred to as "mutable" objects. We suspect that mutable objects have higher McCabe’s Cyclomatic Complexity (CC) than immutable ones. To validate this intuition, we analysed 862,446 Java files from 1,000 open-source GitHub repositories. Our results demonstrated that immutable objects are almost three times less complex than mutable ones. It can therefore be assumed that using more immutable classes could reduce the overall complexity and improve the maintainability of the code base. © 2024, CC BY.

Keywords: Java programming language
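
For readers unfamiliar with McCabe's Cyclomatic Complexity, the sketch below approximates it for a Java method as one plus the number of decision points. This is a token-counting approximation written for illustration, not the measurement tool used in the paper: it does not parse the code, so keywords inside comments or string literals, and the `?` of generic wildcards, would be miscounted.

```python
import re

# Decision-point tokens counted by this approximation; real tools work on the AST.
DECISION_PATTERN = re.compile(r"\b(?:if|for|while|case|catch)\b|&&|\|\||\?")


def approx_cyclomatic_complexity(java_method_source: str) -> int:
    """Approximate McCabe's CC as 1 + the number of decision points found."""
    return 1 + len(DECISION_PATTERN.findall(java_method_source))


# Hypothetical Java method used as input text.
source = """
int clamp(int x, int lo, int hi) {
    if (x < lo) { return lo; }
    if (x > hi) { return hi; }
    return x;
}
"""
print(approx_cyclomatic_complexity(source))  # 3
```

The two `if` statements contribute one decision point each, giving a complexity of 3 for this method.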

Quantifying the benefits of code hints for refactoring deprecated Java APIs

arXiv, 2024

Authors: David, Cristina; Kesseli, Pascal; Kroening, Daniel; Zhang, Hanliang

Affiliations: Bristol, United Kingdom; Zurich, Switzerland; Seattle, WA, United States

When done manually by engineers at Amazon and other companies, refactoring legacy code in order to eliminate uses of deprecated APIs is an error-prone and time-consuming process. In this paper, we investigate to what degree refactorings for deprecated Java APIs can be automated, and quantify the benefit of Javadoc code hints for this task. To this end, we build a symbolic and a neural engine for the automatic refactoring of deprecated APIs. The former is based on type-directed and component-based program synthesis, whereas the latter uses LLMs. We applied our engines to refactor the deprecated methods in the Oracle JDK 15. Our experiments show that code hints are enabling for the automation of this task: even the worst engine correctly refactors 71% of the tasks with code hints, which drops to at best 14% on tasks without. Adding more code hints to Javadoc can hence boost the refactoring of code that uses deprecated APIs. © 2024, CC BY.

Keywords: Java programming language

tabulapdf: An R Package to Extract Tables from PDF Documents

arXiv, 2024

Authors: Sepúlveda, Mauricio Vargas; Leeper, Thomas J.; Paskhalis, Tom; Aristarán, Manuel; Merrill, Jeremy B.; Tigas, Mike

Affiliations: Department of Political Science, University of Toronto; Munk School of Global Affairs and Public Policy, University of Toronto, Canada; Department of Political Science, Trinity College Dublin, Ireland; The Washington Post, United States

tabulapdf is an R package that utilizes the Tabula Java library to import tables from PDF files directly into R. This tool can reduce time and effort in data extraction processes in fields like investigative journalism. It allows for automatic and manual table extraction, the latter facilitated through a Shiny interface that enables manual selection of areas with a computer mouse for data retrieval. © 2024, CC BY.

Keywords: Java programming language