In programming education, it is crucial to provide instruction tailored to students' proficiency levels. For this purpose, an objective evaluation of each student's coding ability is essential, as is an understanding of the characteristics of source code written by both beginners and advanced students. Previous research has successfully converted the structural information of source code into graphs and assessed coding skills using deep learning, achieving high accuracy. However, it remains unclear which specific structural elements significantly influence these assessments. This study addresses this gap by transforming source code into abstract syntax trees and developing a model that uses Graph Convolutional Networks to classify code as written by beginner or advanced users based on the learned structural information. Furthermore, we apply Integrated Gradients to visualize the decision-making basis of our model and elucidate the structural characteristics that distinguish source code written by beginners from that written by advanced users.
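The pipeline described above (source code → abstract syntax tree → graph → graph convolution) can be sketched in miniature. The following is an illustrative sketch only, not the paper's model: it uses Python's `ast` module to extract nodes and edges, and a single hand-rolled GCN-style propagation step with an invented one-dimensional feature and an identity weight in place of learned parameters.

```python
import ast

def ast_to_graph(source: str):
    """Parse source into an AST and return node type labels plus parent-child edges."""
    tree = ast.parse(source)
    nodes, edges, index = [], [], {}
    for node in ast.walk(tree):
        index[id(node)] = len(nodes)
        nodes.append(type(node).__name__)
    for node in ast.walk(tree):
        for child in ast.iter_child_nodes(node):
            edges.append((index[id(node)], index[id(child)]))
    return nodes, edges

def gcn_layer(features, edges, weight):
    """One GCN-style propagation step: average each node's features with its
    neighbours', then apply a linear transform (weights would normally be learned)."""
    n = len(features)
    neigh = [[i] for i in range(n)]          # self-loops
    for a, b in edges:
        neigh[a].append(b)
        neigh[b].append(a)
    out = []
    for i in range(n):
        agg = [sum(features[j][k] for j in neigh[i]) / len(neigh[i])
               for k in range(len(features[0]))]
        out.append([sum(a * w for a, w in zip(agg, col)) for col in weight])
    return out

nodes, edges = ast_to_graph("def f(x):\n    return x + 1\n")
# toy one-dimensional feature: 1.0 if the node is a FunctionDef, else 0.0
feats = [[1.0 if name == "FunctionDef" else 0.0] for name in nodes]
hidden = gcn_layer(feats, edges, [[1.0]])   # identity weight for illustration
```

In the actual study the per-node features and layer weights are learned, and the node representations are pooled into a graph-level prediction; the sketch only shows how AST structure becomes the adjacency that convolution operates over.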
Source code classification (SCC) is the task of assigning code to different categories according to a criterion such as functionality, programming language, or vulnerability. Many source code arch...
Over the past few years, the software engineering (SE) community has widely employed deep learning (DL) techniques in many source code processing tasks. As in other domains such as computer vision and natural language processing (NLP), state-of-the-art DL techniques for source code processing can still suffer from adversarial vulnerability, where minor code perturbations mislead a DL model's inference. Efficiently detecting such vulnerability to expose the risks at an early stage is an essential step and of great importance for further enhancement. This paper proposes codeBERT-Attack (CBA), a novel black-box, effective, and high-quality adversarial attack method for DL models of source code processing, based on the powerful large pre-trained model codeBERT. CBA locates vulnerable positions through masking and leverages the power of codeBERT to generate naturalness-preserving textual perturbations. We turn codeBERT against both general DL models and codeBERT models fine-tuned for specific downstream tasks, and successfully mislead these victim models into erroneous outputs. In addition, by harnessing the power of codeBERT, CBA can effectively generate adversarial examples that are less perceptible to programmers. Our in-depth evaluation on two typical source code classification tasks (i.e., functionality classification and code clone detection), against the widely adopted LSTM and powerful fine-tuned codeBERT models, demonstrates the advantages of our technique in terms of both effectiveness and efficiency. Furthermore, our results show (1) that pre-training may help codeBERT gain further resilience against perturbations, and (2) that certain pre-training tasks may be beneficial for adversarial robustness.
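The first stage of the attack, locating vulnerable positions through masking, can be sketched as follows. This is a simplified illustration, not the paper's implementation: `toy_victim` is an invented stand-in for a real victim classifier's confidence in its current prediction, and in the actual attack a fine-tuned codeBERT would then propose natural replacement tokens for the most sensitive positions.

```python
def locate_vulnerable_positions(tokens, victim_confidence, top_k=2):
    """Mask one token at a time and rank positions by the victim's confidence drop."""
    base = victim_confidence(tokens)
    drops = []
    for i in range(len(tokens)):
        masked = tokens[:i] + ["<mask>"] + tokens[i + 1:]
        drops.append((base - victim_confidence(masked), i))
    drops.sort(reverse=True)          # biggest drop first
    return [i for _, i in drops[:top_k]]

# Toy black-box victim: "confident" only while the identifier `total` is present.
def toy_victim(tokens):
    return 0.9 if "total" in tokens else 0.4

tokens = ["def", "f", "(", "x", ")", ":", "return", "total"]
positions = locate_vulnerable_positions(tokens, toy_victim, top_k=2)
# the position holding `total` ranks first, since masking it hurts the victim most
```

The key property illustrated is that the attack never inspects the victim's gradients; it only queries the model's output, which is what makes the method black-box.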
Despite the ubiquity of data science, we are far from rigorously understanding how coding in data science is performed. Even though the scientific literature has hinted at the iterative and explorative nature of data science coding, we need further empirical evidence to understand this practice and its workflows in detail. Such understanding is critical to recognise the needs of data scientists and, for instance, inform tooling support. To obtain a deeper understanding of the iterative and explorative nature of data science coding, we analysed 470 Jupyter notebooks publicly available in GitHub repositories. We focused on the extent to which data scientists transition between different types of data science activities, or steps (such as data preprocessing and modelling), as well as the frequency and co-occurrence of such transitions. For our analysis, we developed a dataset with the help of five data science experts, who manually annotated the data science steps for each code cell within the aforementioned 470 notebooks. Using a first-order Markov chain model, we extracted the transitions and analysed the transition probabilities between the different steps. In addition to providing deeper insights into the implementation practices of data science coding, our results provide evidence that the steps in a data science workflow are indeed iterative and reveal specific patterns. We also evaluated the use of the annotated dataset to train machine-learning classifiers to predict the data science step(s) of a given code cell. We investigated the representativeness of the classification by comparing the workflow analysis applied to (a) the predicted data set and (b) the data set labelled by experts, finding an F1-score of about 71% for the 10-class data science step prediction problem.
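Estimating first-order Markov transition probabilities from per-cell step annotations is straightforward to sketch. The notebook sequences and step labels below are invented for illustration; the study's own annotations use its 10-class step scheme.

```python
from collections import Counter, defaultdict

def transition_probabilities(step_sequences):
    """Estimate first-order Markov transition probabilities from
    per-notebook sequences of data-science step labels."""
    counts = defaultdict(Counter)
    for seq in step_sequences:
        for a, b in zip(seq, seq[1:]):       # consecutive cell pairs
            counts[a][b] += 1
    return {a: {b: c / sum(nxt.values()) for b, c in nxt.items()}
            for a, nxt in counts.items()}

# Hypothetical annotated notebooks (one step label per code cell).
notebooks = [
    ["load", "preprocess", "model", "evaluate"],
    ["load", "preprocess", "preprocess", "model"],
]
probs = transition_probabilities(notebooks)
# probs["preprocess"]["preprocess"] > 0 captures the iterative self-loop
# behaviour the study reports for data science workflows.
```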
ISBN: (print) 9783030937331; 9783030937324
Over the years, programmers have improved their programming skills and can now write code in many different languages to solve problems. A large amount of new code is generated all over the world regularly. Since a programming problem can be solved in many different languages, it is quite difficult to identify the problem from the written source code. Therefore, a classification model is needed to help programmers identify problems written in Multi-Programming Languages (MPLs). Such a classification model can help programmers learn programming more effectively. However, deep-learning-based source code classification models are still lacking in the fields of programming education and software engineering. To address this gap, we propose a stacked Bidirectional Long Short-Term Memory (Bi-LSTM) neural-network-based model for classifying source code developed in MPLs. To accomplish this research, we collected a large number of real-world source codes from the Aizu Online Judge (AOJ) system. The proposed model is trained, validated, and tested on the AOJ dataset. Various hyperparameters are fine-tuned to improve the model's performance. Based on the experimental results, the proposed model achieves an accuracy of about 93% and an F1-score of 89.24%. Moreover, it outperforms state-of-the-art models on other evaluation metrics such as precision (90.12%) and recall (89.48%).
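A stacked Bi-LSTM consumes fixed-length sequences of token indices, so source code must first be tokenized, indexed, and padded. The sketch below shows one plausible minimal version of that input pipeline; the tokenizer regex, vocabulary scheme, and sequence length are assumptions, not the paper's actual AOJ preprocessing.

```python
import re

def tokenize(source: str):
    """Split source code into crude identifier/symbol tokens."""
    return re.findall(r"[A-Za-z_]\w*|[^\sA-Za-z_]", source)

def build_vocab(corpus, pad="<pad>", unk="<unk>"):
    """Assign an integer index to every token seen in the training corpus."""
    vocab = {pad: 0, unk: 1}
    for src in corpus:
        for tok in tokenize(src):
            vocab.setdefault(tok, len(vocab))
    return vocab

def encode(source, vocab, max_len=16):
    """Map tokens to indices and pad/truncate to a fixed length --
    the shape an embedding layer feeding a stacked Bi-LSTM expects."""
    ids = [vocab.get(t, vocab["<unk>"]) for t in tokenize(source)]
    return (ids + [vocab["<pad>"]] * max_len)[:max_len]

corpus = ["int main() { return 0; }", "print('hello')"]
vocab = build_vocab(corpus)
x = encode("int x = 0;", vocab, max_len=8)   # unseen tokens map to <unk>
```

Per-language token distributions (e.g., `{`/`;` in C-family code versus indentation-free calls in Python) are exactly the signal the recurrent layers then learn to separate.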
ISBN: (print) 9781665448970
A benchmark is an activity in which developers assess performance (e.g., program execution time) by preparing and running several test cases over a long period. To reasonably assess the performance of method-level code snippets, developers can use a micro benchmark. Some micro benchmarks for JavaScript are provided as online web services (e.g., jsPerf and ***). Developers can easily search such a service for code snippets with better performance. They will then find many similar code snippets for different functions, because the micro benchmark service hosts a collection of versatile method-level code snippets. To find replaceable code snippets with better performance, we tackle the problem of distinguishing similar code snippets for different functions at a granularity finer than the method level in micro benchmark services. This study proposes an approach to collect diverse code snippets with similar functionality. The approach measures the similarity between code snippets assessed in the micro benchmark service using code2Vec, and finds an appropriate threshold for associating code snippets with similar functionality. Using a jsPerf dataset that the authors collected, this study evaluates the usefulness of our approach. Specifically, we collect code snippets related to the most frequent topics assessed in jsPerf, "innerHTML vs removeChild" and "for vs forEach". Consequently, we find that our approach achieves high precision (98% and 92%) in identifying diverse code snippets with similar functionality.
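The core step, thresholding pairwise similarity between snippet embeddings, can be sketched as below. The embedding vectors and snippet names are invented; in the actual approach the vectors come from code2Vec and the threshold is tuned empirically.

```python
import math

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def similar_pairs(vectors, threshold=0.8):
    """Associate snippet pairs whose embeddings exceed the similarity threshold."""
    pairs = []
    names = list(vectors)
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            if cosine(vectors[a], vectors[b]) >= threshold:
                pairs.append((a, b))
    return pairs

# Hypothetical code2Vec-style embeddings for three jsPerf snippets.
vecs = {
    "forEach_v1": [0.9, 0.1, 0.0],
    "forEach_v2": [0.8, 0.2, 0.1],
    "innerHTML":  [0.0, 0.1, 0.9],
}
pairs = similar_pairs(vecs, threshold=0.8)   # groups the two forEach variants
```

Choosing the threshold trades precision for recall: too low and snippets with different functions are conflated, too high and genuinely interchangeable variants are missed, which is why the study searches for an appropriate value against labelled topics.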
In the face of Internet attack threats, malware classification is one of the promising solutions in the fields of intrusion detection and digital forensics. In previous work, researchers performed dynamic analysis or static analysis after reverse engineering. However, malware developers use anti-virtual-machine (VM) and obfuscation techniques to evade malware classifiers. By deploying honeypots, malware source code can be collected and analyzed. Source code analysis provides a better classification for understanding the purpose of attackers and for forensics. In this paper, a novel classification approach is proposed, based on content similarity and directory structure similarity. Such a classification avoids re-analyzing known malware and allocates resources to new malware. Malware classification also lets network administrators know the purpose of attackers. The experimental results demonstrate that the proposed system can classify malware efficiently with a small misclassification ratio, and that its performance is better than VirusTotal's.
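One plausible way to combine the two signals is sketched below, assuming nothing about the paper's actual formulas: content similarity is approximated with `difflib` sequence matching, directory similarity with Jaccard overlap of file paths, and the weighting and threshold are invented for illustration.

```python
from difflib import SequenceMatcher

def content_similarity(files_a, files_b):
    """Average best-match content similarity between two samples' files."""
    scores = [max(SequenceMatcher(None, a, b).ratio() for b in files_b.values())
              for a in files_a.values()]
    return sum(scores) / len(scores)

def directory_similarity(files_a, files_b):
    """Jaccard similarity of the two samples' file paths."""
    pa, pb = set(files_a), set(files_b)
    return len(pa & pb) / len(pa | pb)

def classify(sample, family, w=0.5, threshold=0.7):
    """Weighted combination of the two similarities against a known family."""
    score = (w * content_similarity(sample, family)
             + (1 - w) * directory_similarity(sample, family))
    return score >= threshold, score

# Hypothetical honeypot-collected samples: file path -> file contents.
known = {"src/scan.c": "connect(); brute_force();", "Makefile": "all: scan"}
new   = {"src/scan.c": "connect(); brute_force(); log();", "Makefile": "all: scan"}
same_family, score = classify(new, known)   # a lightly modified variant
```

A variant that only tweaks one source file keeps a near-identical directory layout and mostly identical content, so it lands in the known family without being re-analyzed from scratch.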
ISBN: (digital) 9783319302225; (print) 9783319302225; 9783319302218
Various commercial and open-source tools exist, developed by both industry and academic groups, that are able to detect various types of security bugs in applications' source code. However, most of these tools are prone to non-negligible rates of false positives and false negatives, since they are designed to detect a priori specified types of bugs. Their analysis scalability to large programs is also often an issue. To address these problems, we present a new source code analysis technique based on execution path classification. We developed a prototype tool to test our method's ability to detect different types of information-flow-dependent bugs. Our approach is based on classifying the Risk of likely exploits inside source code execution paths using two measuring functions: Severity and Vulnerability. For an Application Under Test (AUT), we analyze every single pair of input vector and program sink in an execution path, which we call an Information Block (IB). Severity quantifies the danger level of an IB using static analysis and a variation of the Information Gain algorithm. An IB's Vulnerability rank, on the other hand, quantifies how certain the tool is that an exploit exists on a given execution path; the Vulnerability function is based on tainted object propagation. The Risk of each IB is the combination of its computed Severity and Vulnerability measurements through an aggregation operation over two fuzzy sets in a Fuzzy Logic system. An IB is characterized as high risk when both its Severity and Vulnerability rankings are above the low zone. In that case, our prototype tool, called Entroine, reports a detected code exploit. The tool was tested on 45 vulnerable Java programs from NIST's Juliet Test Suite, which implement three different types of exploits. All existing code exploits were detected without any false positives.
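The fuzzy aggregation of Severity and Vulnerability into Risk can be sketched as follows. The triangular membership breakpoints and the min t-norm here are generic fuzzy-logic choices made up for illustration; Entroine's actual fuzzy sets and aggregation operator may differ.

```python
def triangular(x, a, b, c):
    """Triangular membership function on [a, c], peaking at b."""
    if x <= a or x >= c:
        return 0.0
    return (x - a) / (b - a) if x <= b else (c - x) / (c - b)

def high(x):
    # Membership in a fuzzy set "above the low zone" on [0, 1];
    # the 0.3 breakpoint stands in for the low-zone boundary.
    return triangular(x, 0.3, 1.0, 1.7)

def assess_ib(severity, vulnerability):
    """Aggregate the two memberships with the min t-norm; an IB is
    reported only when BOTH rankings sit above the low zone."""
    risk = min(high(severity), high(vulnerability))
    return risk, risk > 0.0

risk_value, reported = assess_ib(0.8, 0.9)   # both above the low zone
```

Using min as the aggregator encodes the conjunctive rule from the text: a high Severity alone (a dangerous sink with no evidence of tainted input) or a high Vulnerability alone never produces a report, which is what keeps the false-positive rate down.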
ISBN: (print) 9789897581403
Recent advances in static and dynamic program analysis have resulted in tools capable of detecting various types of security bugs in Applications under Test (AUTs). However, any such analysis is designed for a priori specified types of bugs, is characterized by some rate of false positives or even false negatives, and has certain scalability limitations. We present a new analysis and source code classification technique, and a prototype tool, aiming to aid code reviews in the detection of general information-flow-dependent bugs. Our approach is based on classifying the criticality of likely exploits in the source code using two measuring functions, namely Severity and Vulnerability. For an AUT, we analyse every single pair of input vector and program sink in an execution path, which we call an Information Block (IB). A classification technique is introduced for quantifying the Severity (danger level) of an IB by static analysis and computation of its Entropy Loss. An IB's Vulnerability is quantified using a tainted object propagation analysis along with a Fuzzy Logic system. Possible exploits are then characterized with respect to their Risk by combining the computed Severity and Vulnerability measurements through an aggregation operation over two fuzzy sets. An IB is characterized as high risk when both its Severity and Vulnerability rankings are above the low zone. In that case, a detected code exploit is reported by our prototype tool, called Entroine. The effectiveness of the approach has been tested by analysing 45 Java programs from NIST's Juliet Test Suite, which implement three different common weakness exploits. All existing code exploits were detected without any false positives.
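The Entropy Loss (information gain) computation underlying Severity can be sketched generically. The labels and the partition below are invented toy data; how Entroine actually derives its partitions from static analysis is not shown here.

```python
import math

def entropy(labels):
    """Shannon entropy of a label multiset, in bits."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n)
                for c in (labels.count(v) for v in set(labels)))

def entropy_loss(labels, split):
    """Information gain: entropy before a split minus the weighted
    entropy of the resulting partitions."""
    n = len(labels)
    after = sum(len(part) / n * entropy(part) for part in split)
    return entropy(labels) - after

# Hypothetical observations at one sink: 1 = exploitable input, 0 = benign.
labels = [1, 1, 0, 0]
split = [[1, 1], [0, 0]]   # e.g., partitioned by a predicate on the input vector
gain = entropy_loss(labels, split)
# A perfectly separating predicate removes all uncertainty (gain = 1 bit),
# marking the IB as highly informative about exploitability, hence more severe.
```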