Search results - Inner Mongolia University Library

MRC-VulLoc: Software source code vulnerability localization based on multi-choice reading comprehension

COMPUTERS & SECURITY, 2024, vol. 141

Authors: Tang, Gaigai; Yang, Lin; Zhang, Long; Kuang, Hongyu; Wang, Huiqiang. Affiliations: Henan Univ Sch Software Kaifeng 475004 Peoples R China; Harbin Engn Univ Sch Comp Sci & Technol Harbin 150000 Peoples R China; Acad Mil Sci Inst Syst Engn Natl Key Lab Sci & Technol Informat Syst Secur Beijing 100000 Peoples R China

Recently, automatic vulnerability detection approaches based on machine learning (ML) have outperformed traditional rule-based approaches in terms of detection performance. Existing ML-based approaches typically concentrate on function or line granularity, which fails to realize accurate vulnerability localization and is insufficient to support effective root cause analysis of vulnerabilities. To address this issue, we propose a new approach, named MRC-VulLoc, that maps the multi-choice reading comprehension (MRC) task to the vulnerability localization task at the granularity of the vulnerability triggering path. Initially, we design six large datasets (covering the C/C++ and Java languages) in the form of MRC. Subsequently, we introduce a novel pre-trained vulnerability localization model, combining the effective code semantic comprehension ability of a pre-trained model with the advantages of Bidirectional Long Short-Term Memory Network (Bi-LSTM) and Convolutional Neural Network (CNN) models. Lastly, we conduct experiments to evaluate vulnerability localization with several state-of-the-art MRC approaches and vulnerability detectors. Experimental results demonstrate the effectiveness of the proposed datasets in evaluating MRC approaches for vulnerability localization. Furthermore, MRC-VulLoc achieves higher precision on vulnerability localization than the compared vulnerability detectors.

Keywords: source code; vulnerability localization; machine learning; MRC


Comparing the Pretrained Models of Source Code by Re-pretraining Under a Unified Setup

IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS, 2024, vol. 35, no. 12, pp. 17768-17778

Authors: Niu, Changan; Li, Chuanyi; Ng, Vincent; Luo, Bin. Affiliations: Nanjing Univ Software Inst State Key Lab Novel Software Technol Nanjing 210093 Peoples R China; Univ Texas Dallas Human Language Technol Res Inst Richardson TX 75083 USA

Recent years have seen the successful application of large pretrained models of source code (codePTMs) to code representation learning, which has taken the field of software engineering (SE) from task-specific solutions to task-agnostic generic models. Given their remarkable results, codePTMs are seen as a promising direction in both academia and industry. While a number of codePTMs have been proposed, they are often not directly comparable because they differ in experimental setups such as pretraining dataset, model size, evaluation tasks, and datasets. In this article, we first review the experimental setups used in previous work and propose a standardized setup to facilitate fair comparisons among codePTMs and to explore the impacts of their pretraining tasks. Then, under the standardized setup, we re-pretrain codePTMs using the same model architecture, input modalities, and pretraining tasks as originally declared, and fine-tune each model on each SE evaluation task. Finally, we present the experimental results and discuss comprehensively the relative strengths and weaknesses of different pretraining tasks with respect to each SE task. We hope our view can inspire and advance the future study of more powerful codePTMs.

Keywords: pretraining task; source code; supervised learning


Reducing the Impact of Time Evolution on Source Code Authorship Attribution via Domain Adaptation

ACM TRANSACTIONS ON SOFTWARE ENGINEERING AND METHODOLOGY, 2024, vol. 33, no. 6, pp. 1-27

Authors: Li, Zhen; Zhao, Shasha; Chen, Chen; Chen, Qian. Affiliations: Hebei Univ Sch Cyber Secur & Comp Baoding 071002 Peoples R China; Univ Cent Florida Ctr Res Comp Vis Orlando FL 32816 USA; Univ Texas San Antonio San Antonio TX 78249 USA; Huazhong Univ Sci & Technol Sch Cyber Sci & Engn Wuhan 430074 Peoples R China

Source code authorship attribution is an important problem in practical applications such as plagiarism detection, software forensics, and copyright disputes. Recent studies show that existing methods for source code authorship attribution can be significantly affected by time evolution, leading to a decrease in attribution accuracy year by year. To alleviate the problem of Deep Learning (DL)-based source code authorship attribution degrading in accuracy due to time evolution, we propose a new framework called Time Domain Adaptation (TimeDA), which adds new feature extractors to the original DL-based code attribution framework to enhance the learning ability of the original model on source domain features without requiring new or more source data. Moreover, we employ a centroid-based pseudo-labeling strategy using neighborhood clustering entropy for adaptive learning to improve the robustness of DL-based code authorship attribution. Experimental results show that TimeDA can significantly enhance the robustness of DL-based source code authorship attribution to time evolution, with an average improvement of 8.7% on the Java dataset and 5.2% on the C++ dataset. In addition, TimeDA benefits from employing the centroid-based pseudo-labeling strategy, which significantly reduces the model training time by 87.3% compared to traditional unsupervised domain adaptation methods.

Keywords: authorship attribution; source code; time evolution; deep learning; domain adaptation
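The abstract above mentions a centroid-based pseudo-labeling strategy. As a rough illustration of the general idea only, the sketch below labels unlabeled target samples by their nearest class centroid. It is an assumption-laden simplification: the neighborhood clustering entropy used by TimeDA is not reproduced, and the function names, author labels, and toy feature vectors are all hypothetical.

```python
import math

def centroid(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def pseudo_label(source_features, source_labels, target_features):
    """Give each target vector the label of the nearest source-class centroid."""
    by_label = {}
    for vec, label in zip(source_features, source_labels):
        by_label.setdefault(label, []).append(vec)
    centroids = {label: centroid(vecs) for label, vecs in by_label.items()}
    return [
        min(centroids, key=lambda label: math.dist(vec, centroids[label]))
        for vec in target_features
    ]

# Hypothetical 2-D "style features" for two made-up authors.
source = [[0.0, 0.1], [0.2, 0.0], [1.0, 0.9], [0.8, 1.0]]
labels = ["alice", "alice", "bob", "bob"]
# Later samples whose features have drifted slightly over time.
target = [[0.3, 0.2], [0.7, 0.8]]

pseudo = pseudo_label(source, labels, target)
print(pseudo)  # ['alice', 'bob']
```

In a domain adaptation loop, labels produced this way would typically be fed back as training targets for the unlabeled domain; how TimeDA does this in detail is described in the paper, not here.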


Discovering and exploring cases of educational source code plagiarism with Dolos

SOFTWAREX, 2024, vol. 26

Authors: Maertens, Rien; Van Neyghem, Maarten; Geldhof, Maxiem; Van Petegem, Charlotte; Strijbol, Niko; Dawyndt, Peter; Mesuere, Bart. Affiliation: Univ Ghent Dept Appl Math Comp Sci & Stat Ghent Belgium

Source code plagiarism is a significant issue in educational practice, and educators need user-friendly tools to cope with such academic dishonesty. This article introduces the latest version of Dolos, a state-of-the-art ecosystem of tools for detecting and preventing plagiarism in educational source code. In this new version, the primary focus has been on enhancing the user experience. Educators can now run the entire plagiarism detection pipeline from a new web app in their browser, eliminating the need for any installation or configuration. Completely redesigned analytics dashboards provide an instant assessment of whether a collection of source files contains suspected cases of plagiarism and how widespread plagiarism is within the collection. The dashboards support hierarchically structured navigation to facilitate zooming in and out of suspect cases. Clusters are an essential new component of the dashboard design, reflecting the observation that plagiarism can occur among larger groups of students. To meet various user needs, the Dolos software stack for source code plagiarism detection now includes a self-hostable web app, a JSON application programming interface (API), a command line interface (CLI), a JavaScript library and a preconfigured Docker container. Clear documentation and a free-to-use instance of the web app can be found at https://***. The source code is also available on GitHub.

Keywords: web app; plagiarism; source code; academic dishonesty; cheating; learning analytics; educational data mining; online learning; programming language


On the compressibility of large-scale source code datasets

JOURNAL OF SYSTEMS AND SOFTWARE, 2025, vol. 227

Authors: Boffa, Antonio; Di Cosmo, Roberto; Ferragina, Paolo; Guerra, Andrea; Manzini, Giovanni; Vinciguerra, Giorgio; Zacchiroli, Stefano. Affiliations: Ecole Polytech Fed Lausanne EPFL Lausanne Switzerland; Inria Paris France; Univ Paris Cite Paris France; St Anna Sch Adv Studies Dept EMbeDS Pisa Italy; Univ Pisa Dept Comp Sci Pisa Italy; Inst Polytech Paris LTCI Telecom Paris Palaiseau France

Storing ultra-large amounts of unstructured data (often called objects or blobs) is a fundamental task for several object-based storage engines, data warehouses, data-lake systems, and key-value stores. These systems cannot currently leverage similarities between objects, which could be vital in improving their space and time performance. An important use case in which we can expect the objects to be highly similar is the storage of large-scale versioned source code datasets, such as the Software Heritage Archive (Di Cosmo and Zacchiroli, 2017). This use case is particularly interesting given the extraordinary size (1.5 PiB), the variegated nature, and the high repetitiveness of the at-issue corpus. In this paper we discuss and experiment with content- and context-based compression techniques for source-code collections that tailor known and novel tools to this setting in combination with state-of-the-art general-purpose compressors and the information coming from the Software Heritage Graph. We experiment with our compressors over a random sample of the entire corpus, and four large samples of source code files written in different popular languages: C/C++, Java, JavaScript, and Python. We also consider two scenarios of usage for our compressors, called Backup and File-Access scenario, where the latter adds to the former the support for single file retrieval. As a net result, our experiments show (i) how much "compressible" each language is, (ii) which content- or context-based techniques compress better and are faster to (de)compress by possibly supporting individual file access, and (iii) the ultimate compressed size that, according to our estimate, our best solution could achieve in storing all the source code written in these languages and available in the Software Heritage Archive: namely, in 3 TiB (down from their original 78 TiB total size, with an average compression ratio of 4%).

Keywords: data compression; source code; storage systems; locality-sensitive hashing; Software Heritage; version control systems
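The closing figures of the abstract above can be checked with simple arithmetic. The snippet below is only a sanity check on the two sizes quoted there (3 TiB and 78 TiB), and it assumes that "compression ratio" means compressed size divided by original size; the variable names are illustrative.

```python
original_tib = 78    # total original size quoted in the abstract
compressed_tib = 3   # estimated compressed size quoted in the abstract

# Assumed definition: ratio = compressed size / original size.
ratio = compressed_tib / original_tib
space_saved = 1 - ratio

print(f"compression ratio: {ratio:.1%}")   # compression ratio: 3.8%
print(f"space saved: {space_saved:.1%}")   # space saved: 96.2%
```

Under that definition, 3.8% rounds to the 4% stated in the abstract.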


Employing Blockchain, NFTs, and Digital Certificates for Unparalleled Authenticity and Data Protection in Source Code: A Systematic Review

COMPUTERS, 2025, vol. 14, no. 4, p. 131

Authors: Lopez, Leonardo Juan Ramirez; Ledezma, Genesis Gabriela Morillo. Affiliation: Univ El Bosque Engn Fac Osiris & Bioaxis Res Grp Bogota 111321 Colombia

In higher education, especially in programming-intensive fields like computer science, safeguarding students' source code is crucial to prevent theft that could impact learning and future careers. Traditional storage solutions like Google Drive are vulnerable to hacking and alterations, highlighting the need for stronger protection. This work explores digital technologies that enhance source code security, with a focus on Blockchain and NFTs. Due to Blockchain's decentralized and immutable nature, NFTs can be used to control code ownership, improving security, traceability, and preventing unauthorized access. This approach effectively addresses existing gaps in protecting academic intellectual property. However, as Bennett et al. highlight, while these technologies have significant potential, challenges remain in large-scale implementation and user acceptance. Despite these hurdles, integrating Blockchain and NFTs presents a promising opportunity to enhance academic integrity. Successful adoption in educational settings may require a more inclusive and innovative strategy.

Keywords: blockchain; data protection; digital certificates; non-fungible tokens (NFTs); source code


Meta-Heuristic Guided Feature Optimization for Enhanced Authorship Attribution in Java Source Code

IEEE ACCESS, 2023, vol. 11, pp. 141657-141673

Authors: Al-Ahmad, Bilal; Al-Madi, Nailah; Alzaqebah, Abdullah; Alkhawaldeh, Rami S.; Aldebei, Khaled; Kabir, Md. Faisal; Altaharwa, Ismail; Abu-Faraj, Mua'ad; Aljarah, Ibrahim. Affiliations: Univ Jordan Aqaba 77110 Jordan; St Cloud State Univ Dept Comp Sci & Informat Technol St Cloud MN 56301 USA; Princess Sumaya Univ Technol Amman 11941 Jordan; Al Ahliyya Amman Univ Fac Informat Technol Comp Sci Dept Amman 19628 Jordan; Penn State Univ Harrisburg Sch Sci Engn & Technol Middletown PA 17057 USA; Univ Jordan Amman 11942 Jordan

Source code authorship attribution is the task of identifying who developed a piece of code by learning the programmer's style. It is a critical activity used extensively in areas such as computer security, computer law, and plagiarism detection. This paper investigates source code authorship attribution by capturing natural language aspects of the code rather than using only a minimal set of syntactic and stylistic code features, as explored in the previous literature. It proposes an evolutionary feature selection model to improve the accuracy of authorship attribution by implementing two language models (uni-gram and bi-gram). The proposed approach uses K-Nearest Neighbor as a classifier and Genetic Algorithm as a feature selection technique. Two experiments were conducted on a public Authorship Attribution dataset on GitHub; the experiments include various evolutionary feature selection models. Notably, the results obtained in both experiments were compared with those of related studies and show a significant improvement in terms of accuracy.

Keywords: source coding; feature extraction; genetic algorithms; codes; task analysis; Java; classification algorithms; evolutionary computation; data mining; feature selection; java source code authorship attribution
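To make the pipeline named in the abstract above more concrete, here is a minimal sketch of uni-gram and bi-gram token counts fed to a nearest-neighbour classifier over a selected feature subset. It is not the authors' implementation: the tokenizer, the distance function, the toy snippets, the author names, and the hand-picked `selected` list are all assumptions, and the Genetic Algorithm that would search for a good feature subset is deliberately left out.

```python
import re
from collections import Counter

def ngram_counts(code, n):
    """Count token n-grams; tokens are identifiers, numbers, or single symbols."""
    tokens = re.findall(r"[A-Za-z_]\w*|\d+|\S", code)
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def features(code):
    """Merge uni-gram and bi-gram counts into one sparse feature vector."""
    return ngram_counts(code, 1) + ngram_counts(code, 2)

def distance(a, b, selected):
    """Squared Euclidean distance restricted to the selected features."""
    return sum((a[f] - b[f]) ** 2 for f in selected)

def predict(train, query, selected, k=1):
    """Majority vote among the k nearest training samples."""
    nearest = sorted(train, key=lambda item: distance(item[0], query, selected))[:k]
    return Counter(author for _, author in nearest).most_common(1)[0][0]

# Hypothetical training snippets from two made-up authors.
train = [
    (features("for (int i = 0; i < n; i++) { sum += i; }"), "author_a"),
    (features("int idx = 0; while (idx < n) { total = total + idx; idx = idx + 1; }"), "author_b"),
]
query = features("for (int j = 0; j < m; j++) { acc += j; }")

# Hand-picked feature subset; a genetic algorithm would search for this instead.
selected = [("for",), ("while",), ("+", "="), ("+", "+")]

predicted = predict(train, query, selected)
print(predicted)  # author_a
```

Restricting the distance to `selected` is what makes feature selection matter here: identifier names differ between the snippets, but the chosen loop and operator n-grams capture style rather than naming.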


Sharing practices of software artefacts and source code for reproducible research

INTERNATIONAL JOURNAL OF DATA SCIENCE AND ANALYTICS, 2024, pp. 1-12

Authors: Jean-Quartier, Claire; Jeanquartier, Fleur; Stryeck, Sarah; Simon, Joerg; Soeser, Birgit; Hasani-Mavriqi, Ilire. Affiliations: Graz Univ Technol Res Data Management Graz Austria; Univ Nat Resources & Life Sci Dept Forest & Soil Sci Human Ctr AI Lab Vienna Austria; Res Ctr Pharmaceut Engn GmbH Graz Austria; Graz Univ Technol Inst Interact Syst & Data Sci Graz Austria

While the source code of software and algorithms is an essential component in all fields of modern research involving data analysis and processing steps, it is rarely shared upon publication of results across disciplines. Simple guidelines for generating reproducible source code have been published. Still, code optimization that supports repurposing in different settings is often neglected, and code is even less often registered in catalogues for public reuse. Though all research output should be reasonably curated in terms of reproducibility, it has been shown that researchers are frequently non-compliant with the availability statements in their publications. These statements often do not even include persistent unique identifiers that would allow referencing archives of code artefacts at specific versions and times for long-lasting links to research articles. In this work, we provide an analysis of the current practices of authors in open scientific journals with regard to code availability indications and the FAIR principles applied to code and algorithms. We present common repositories of choice among authors. Results further show disciplinary differences in code availability in scholarly publications over the past years. We advocate proper description, archiving and referencing of source code and methods as part of scientific knowledge, also appealing to editorial boards and reviewers for supervision.

Keywords: source code; reproducibility; FAIR principles; open science; software availability


A comparative study of adversarial training methods for neural models of source code

FUTURE GENERATION COMPUTER SYSTEMS-THE INTERNATIONAL JOURNAL OF ESCIENCE, 2023, vol. 142, no. 1, pp. 165-181

Authors: Li, Zhen; Huang, Xiang; Li, Yangrui; Chen, Guenevere. Affiliations: Hebei Univ Sch Cyber Secur & Comp Baoding 071002 Hebei Peoples R China; Huazhong Univ Sci & Technol Sch Cyber Sci & Engn Wuhan 430074 Hubei Peoples R China; Univ Texas San Antonio Dept Elect & Comp Engn San Antonio TX 78249 USA

Adversarial training has been employed by researchers to protect AI models of source code. However, it is still unknown how adversarial training methods in this field compare to each other in effectiveness and robustness. This study surveys and investigates existing adversarial training methods, and conducts experiments to evaluate these neural models' performance in the domain of source code. First, we examine the process of adversarial training to identify four dimensions that could be used to classify different adversarial training methods into five categories, which are Mixing Directly, Composite Loss, Adversarial Fine-tuning, Min-max + Composite Loss, and Min-max. Second, we conduct empirical evaluations of these classified adversarial training methods under two tasks (i.e., code summarization and code authorship attribution) to determine their effectiveness and robustness. Experimental results indicate that the performance of certain combinations of adversarial training techniques (i.e., min-max with composite loss, or directly-sample with ordinary loss) would be much better than other combinations or other techniques used alone. Our experiments also reveal that the robustness of models under defensive methods can be enhanced by using diverse input data for adversarial training, and that the number of fine-tuning epochs has little or no impact on the model's performance. (c) 2022 Elsevier B.V. All rights reserved.

Keywords: adversarial training; robustness; source code; comparative study


BiAn: Smart Contract Source Code Obfuscation

IEEE TRANSACTIONS ON SOFTWARE ENGINEERING, 2023, vol. 49, no. 9, pp. 4456-4476

Authors: Zhang, Pengcheng; Yu, Qifan; Xiao, Yan; Dong, Hai; Luo, Xiapu; Wang, Xiao; Zhang, Meng. Affiliations: Hohai Univ Coll Comp & Informat Nanjing 210000 Peoples R China; Sun Yat Sen Univ Sch Cyber Sci & Technol Shenzhen 518000 Peoples R China; RMIT Univ Sch Comp Technol Melbourne Vic 3000 Australia; Hong Kong Polytech Univ Dept Comp Hong Kong 999077 Peoples R China

With the rising prominence of smart contracts, security attacks targeting them have increased, posing severe threats to their security and intellectual property rights. Existing simplistic datasets hinder effective vulnerability detection, raising security concerns. To address these challenges, we propose BiAn, a source code level smart contract obfuscation method that generates complex vulnerability test datasets. BiAn protects contracts by obfuscating data flows, control flows, and code layouts, increasing complexity and making it harder for attackers to discover vulnerabilities. Our experiments with buggy contracts showed an average complexity enhancement of approximately 174% after obfuscation. Decompilers Vandal and Gigahorse had total failure rate increments of 38.8% and 40.5% respectively. Obfuscated contracts also decreased vulnerability detection rates in more than 50% of cases for ten widely-used static analysis detection tools.

Keywords: smart contracts; codes; source coding; security; complexity theory; intellectual property; layout; Blockchain; Ethereum; smart contract source code obfuscation

