binary code similarity detection (BCSD) has numerous applications, including malware detection, vulnerability search, plagiarism detection, and patch identification. Recent studies have demonstrated that with the rapi...
详细信息
ISBN:
(纸本)9798350381993;9798350382006
binary code similarity detection (BCSD) has numerous applications, including malware detection, vulnerability search, plagiarism detection, and patch identification. Recent studies have demonstrated that with the rapid progress of machine learning (ML) techniques, various BCSD approaches based on machine learning have exhibited stronger performance than traditional methods. However, current ML-based BCSD approaches tend to ignore the issue of training samples, and most ML-based BCSD approaches are based on supervised learning, which is suffered from the labelling difficulties. To mitigate these issues, we propose ConFunc: a function-level binary code similarity detection framework based on contrastive learning. Performance evaluation shows that ConFunc enhances the Mean Reciprocal Rank (MRR) and Recall rates (Recall@1) of baseline models by fully harnessing the potential of the data. Additionally, ConFunc demonstrates stronger performance in scenarios with scarce data, achieving the baseline model's performance on the entire dataset using only 10% of the complete dataset. In real-world patch identification and vulnerability search tasks, ConFunc consistently outperforms other baseline models in MRR and Recall@10.
Security of real-world cyber systems has drawn a lot of attention in recent years, especially when machine learning techniques are widely deployed into different layers of cyber systems. With the technology of machine...
详细信息
Security of real-world cyber systems has drawn a lot of attention in recent years, especially when machine learning techniques are widely deployed into different layers of cyber systems. With the technology of machine learning, especially adversarial machine learning techniques, the attacks and defenses in cyber systems have shown a lot of new characteristics. In this dissertation, two major works regarding the attacks and defenses in real world cyber systems including dynamic spectrum sensing systems and High Performance Computing (HPC) systems and software systems are discussed. In the first work, we revisit this security vulnerability of cooperative spectrum sensing as an adversarial machine learning problem and propose a novel learning-empowered framework named Learning-Evaluation-Beating (LEB) to mislead fusion centers. Given the gap between the new LEB attack and existing defenses, we introduced a non-invasive and parallel method named influence-limiting defense sided with existing defenses to defend against LEB-based or other similar attacks. In the second work, we offer a novel perspective, treating the anomaly detection in HPC systems based on log files as a sequential decision process, and further applying reinforcement learning techniques to detect anomalies or malicious users. Start from there, we also provide a binary code similarity detection-based method that can be applied to a more general scenario in software systems through utilizing Recurrent Neural Network (RNN) and Siamese Neural Network to detect malwares from the binaries generated by the processor that executing the program.
binary function embedding models are applicable to various downstream tasks within IoT device software systems and have demonstrated advantages in numerous binary analysis tasks, such as vulnerability (homologous) fun...
详细信息
In the field of cybersecurity, the ability to compute similarity scores at the function level for binarycode is of utmost importance. Considering that a single binary file may contain an extensive amount of functions...
详细信息
In the field of cybersecurity, the ability to compute similarity scores at the function level for binarycode is of utmost importance. Considering that a single binary file may contain an extensive amount of functions, an effective learning framework must exhibit both high accuracy and efficiency when handling substantial volumes of data. Nonetheless, conventional methods encounter several limitations. Firstly, accurately annotating different pairs of functions with appropriate labels poses a significant challenge, thereby making it difficult to employ supervised learning methods without risk of overtraining. Secondly, while SOTA models often rely on pretrained encoders or fine-grained graph comparison techniques, these approaches suffer from drawbacks related to time and memory consumption. Thirdly, the momentum update algorithm utilized in graph -based contrastive learning models can result in information leakage. Surprisingly, none of the existing articles address this issue. This research focuses on addressing the challenges associated with large-scale binary code similarity detection (BCSD). To overcome the aforementioned problems, we propose GraphMoCo: a graph momentum contrast model that leverages multimodal structural information for efficient binary function representation learning on a large scale. We adopt an unsupervised learning strategy. Our approach eliminates the need for manual labeling. By leveraging the intrinsic structural information at multiple levels of the binarycode, our model could achieve higher accuracy with a simple CNN -based model. By introducing the preshuffle mechanism, the issue of information leakage in graph momentum update algorithm is mitigated. The evaluation results indicate that GraphMoCo exhibits superior performance compared to SOTA approaches in the function pair search task, showing an average improvement of 7% on AUC, and 10% on MRR and Recall@1. Furthermore, GraphMoCo achieves a MAP of 0.93 on the more challenging data
binary vulnerability detection is a significant area of research in computer security. The existing methods for detecting binary vulnerabilities primarily rely on binarycodesimilarity analysis, detecting vulnerabili...
详细信息
binary vulnerability detection is a significant area of research in computer security. The existing methods for detecting binary vulnerabilities primarily rely on binarycodesimilarity analysis, detecting vulnerabilities by comparing the similarities embedded in binarycodes. Recently, Transformer-based models have achieved significant progress in this field, leveraging their advantage in handling sequential data to better understand the semantics of assembly code. However, to prevent the out-of-vocabulary (OOV) problems, assembly code typically needs to be normalized, which would lose some important numerical and jump information. In this paper, we propose HAformer, a Transformer-based model, which semantically fuses hexadecimal machine codes and assembly codes to extract richer semantic information from binarycodes. By incorporating the hexadecimal machine code and a newly designed assembly code normalization method, HAformer can alleviate the problem of numerical information loss caused by traditional assembly code normalization, thereby addressing the issue of OOV. Evaluation results demonstrate that our HAformer outperforms the baseline method in the Recall@1 metric by 16.9%, 25.5% and 19.2% in cross-optimization level, cross-compiler and cross-architecture environments, respectively. In real-world vulnerability detection experiments, HAformer exhibits the highest accuracy.
Analyzing the behavior of cryptographic functions in stripped binaries is a challenging but essential task, which is crucial in software security fields such as malware analysis and legacy code inspection. However, th...
详细信息
Analyzing the behavior of cryptographic functions in stripped binaries is a challenging but essential task, which is crucial in software security fields such as malware analysis and legacy code inspection. However, the inherent high logical complexity of cryptographic algorithms makes their analysis more difficult than that of ordinary code, and the general absence of symbolic information in binaries exacerbates this challenge. Existing methods for cryptographic algorithm identification frequently rely on data or structural pattern matching, which limits their generality and effectiveness while requiring substantial manual effort. In response to these challenges, we present FoC (Figure out the Cryptographic functions), a novel framework that leverages large language models (LLMs) to identify and analyze cryptographic functions in stripped *** FoC, we first build an LLM-based generative model (FoC-BinLLM) to summarize the semantics of cryptographic functions in natural language form, which is intuitively readable to analysts. Subsequently, based on the semantic insights provided by FoC-BinLLM, we further develop a binary code similarity detection model (FoC-Sim), which allows analysts to effectively retrieve similar implementations of unknown cryptographic functions from a library of known cryptographic functions. The predictions of generative model like FoC-BinLLM are inherently difficult to reflect minor alterations in binarycode, such as those introduced by vulnerability patches. In contrast, the change-sensitive representations generated by FoC-Sim compensate for the shortcomings to some extent. To support the development and evaluation of these models, and to facilitate further research in this domain, we also construct a comprehensive cryptographic binary dataset and introduce an automatic method to create semantic labels for extensive binary functions. Our evaluation results are promising. FoC-BinLLM outperforms ChatGPT by 14.61% on the ROUGE-L score,
暂无评论