Existing binary code similarity detection (BCSD) methods often overlook the actual execution information and local semantic details of programs, leading to suboptimal performance in assembly code semantic representati...
Cross-platform binary code similarity detection aims to determine whether two or more pieces of binary code are similar. Existing approaches that combine control flow graph (CFG)-based function representation with graph convolutional network (GCN)-based similarity analysis are the best-performing ones. However, because of the large amount of convolutional computation and the loss of structural information, convolutional networks inevitably bring problems such as high overhead and occasional inaccuracy. To address these issues, we propose a fast cross-platform binary code similarity detection framework that uses natural language processing (NLP) for basic block embedding and an inductive graph neural network (GNN) for function representation, simulating the extraction of structural and temporal features. The GNN's node-centric, small-batch training is well suited to large CFGs and can greatly reduce computational overhead. Various NLP basic block embedding models and GNNs are evaluated. Experimental results show that the scheme with long short-term memory (LSTM) for basic block embedding and inductive learning-based GraphSAGE (GAE) for function representation outperforms state-of-the-art work. Our framework requires only 45% overhead, improving efficiency significantly with a small performance trade-off.
Cross-platform binary code similarity detection determines whether a pair of binary functions from different platforms are similar, and it plays an important role in many areas. Traditional methods compute similarity by intersecting platform-independent characteristic strands or by matching control flow graphs (CFGs), and they fall short in efficiency and scalability. Existing deep-learning-based methods improve efficiency but have low accuracy and still use manually constructed features. To address these problems, this manuscript proposes a cross-platform binary code similarity detection method based on neural machine translation (NMT) and graph embedding. We train an NMT model and a graph embedding model to automatically extract two parts of the semantics of the binary code and represent them as a high-dimensional vector, called an embedding. The similarity of two binary functions can then be measured by the distance between their corresponding embeddings. We implement a prototype named SimInspector. Our comparative experiments show that SimInspector outperforms the state-of-the-art approach, Gemini, by about 6% in similarity detection accuracy while maintaining good efficiency.
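The last step of that abstract, comparing two functions by the distance between their embeddings, can be illustrated with a minimal sketch. The abstract does not say which distance SimInspector uses, so cosine similarity, the 0.8 threshold, and the toy vectors below are all assumptions made for illustration.

```python
import math

def cosine_similarity(u, v):
    """Cosine similarity between two equal-length embedding vectors."""
    if len(u) != len(v):
        raise ValueError("embeddings must have the same dimension")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # treat an all-zero embedding as dissimilar to everything
    return dot / (norm_u * norm_v)

def is_similar(u, v, threshold=0.8):
    """Hypothetical decision rule: similar if cosine similarity >= threshold."""
    return cosine_similarity(u, v) >= threshold

# Toy embeddings (made up): the same function compiled for two platforms,
# plus an unrelated function.
emb_x86 = [0.9, 0.1, 0.4]
emb_arm = [0.8, 0.2, 0.5]
emb_other = [-0.3, 0.9, -0.1]

print(is_similar(emb_x86, emb_arm))
print(is_similar(emb_x86, emb_other))
```

In practice the threshold would be tuned on a validation set rather than fixed in advance.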
Binary code similarity detection (BCSD) has been applied in many fields, such as malware detection, plagiarism detection, and vulnerability search. Existing solutions to the BCSD problem usually either compare specific features derived from the control flow graphs (CFGs) of functions or compute embedding vectors of binary functions and solve the problem with deep learning algorithms. In this paper, we take a different research perspective and propose a new, lightweight method for the cross-version BCSD problem based on multiple features. It transforms binary functions into vectors and signals, then computes a similarity coefficient and a correlation coefficient to solve the cross-version BCSD problem. Without relying on function CFGs, deep learning algorithms, or other related attributes, our method works directly on the raw bytes of each binary and can serve as an alternative for coping with the complex situations found in real-world environments. We implement the method and evaluate it on a custom dataset of about 423,282 samples. The results show that the method performs well on cross-version BCSD, with recall reaching 96.63%, which is almost the same as the state-of-the-art static solution.
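To make the raw-bytes idea concrete, here is a rough sketch of computing a similarity coefficient and a correlation coefficient directly from function bytes. The abstract does not define its features, so the specific choices here (Jaccard similarity over byte bigrams, Pearson correlation over byte histograms) and the sample byte strings are assumptions, not the paper's method.

```python
import math

def byte_histogram(raw):
    """Turn raw bytes into a 256-dimensional count vector."""
    hist = [0] * 256
    for value in raw:
        hist[value] += 1
    return hist

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # correlation is undefined for a constant sequence
    return cov / math.sqrt(var_x * var_y)

def jaccard_ngrams(a, b, n=2):
    """Jaccard similarity coefficient over the sets of byte n-grams."""
    grams_a = {a[i:i + n] for i in range(len(a) - n + 1)}
    grams_b = {b[i:i + n] for i in range(len(b) - n + 1)}
    union = grams_a | grams_b
    if not union:
        return 0.0
    return len(grams_a & grams_b) / len(union)

# Made-up byte strings: two versions of one function and an unrelated blob.
v1 = bytes.fromhex("554889e54883ec10c745fc00000000c9c3")
v2 = bytes.fromhex("554889e54883ec20c745fc01000000c9c3")
unrelated = bytes(range(200, 217))

print(jaccard_ngrams(v1, v2), pearson(byte_histogram(v1), byte_histogram(v2)))
print(jaccard_ngrams(v1, unrelated))
```

Both scores need only a single pass over the bytes, which is the kind of cost profile a method that avoids disassembly and CFG recovery is aiming for.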
Widespread code reuse allows vulnerabilities to proliferate across a vast variety of firmware. There is an urgent need to detect such vulnerable code effectively and efficiently. By measuring code similarity, AI-based binary code similarity detection is applied to detect vulnerable code at scale. Existing studies have proposed various function features to capture commonality for similarity detection. Nevertheless, the significant syntactic variability induced by the diversity of IoT hardware architectures diminishes the accuracy of binary code similarity detection. In our earlier study and the tool Asteria, we adopted a Tree-LSTM network to summarize function semantics as function commonality, and the evaluation indicated advanced performance. However, it still raises utility concerns because of excessive time costs and inadequate precision when searching for bugs in large-scale firmware. To this end, we propose a novel deep learning enhancement architecture that incorporates domain knowledge-based pre-filtration and re-ranking modules, and we develop a prototype named ASTERIA-PRO based on Asteria. The pre-filtration module eliminates dissimilar functions, thus reducing the subsequent deep learning model calculations. The re-ranking module boosts the rankings of vulnerable functions among the candidates generated by the deep learning model. Our evaluation indicates that the pre-filtration module cuts calculation time by 96.9%, and the re-ranking module improves MRR and Recall by 23.71% and 36.4%, respectively. By incorporating these modules, ASTERIA-PRO outperforms existing state-of-the-art approaches in the bug search task by a significant margin. Furthermore, our evaluation shows that equipping baseline methods with the pre-filtration and re-ranking modules significantly improves their precision. We conduct a large-scale real-world firmware bug search, in which ASTERIA-PRO detects 1,482 vulnerable functions with a high precision of 91.65%.
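MRR and Recall, the two ranking metrics reported there, can be computed as in the following sketch. It uses the standard definitions with one correct target per query; the query lists are made up, and the abstract does not state which cutoff k its Recall figure uses.

```python
def mean_reciprocal_rank(ranked_lists, targets):
    """Average of 1/rank of the correct item; a missing item contributes 0."""
    total = 0.0
    for ranked, target in zip(ranked_lists, targets):
        if target in ranked:
            total += 1.0 / (ranked.index(target) + 1)
    return total / len(targets)

def recall_at_k(ranked_lists, targets, k):
    """Fraction of queries whose correct item appears in the top k."""
    hits = sum(1 for ranked, target in zip(ranked_lists, targets)
               if target in ranked[:k])
    return hits / len(targets)

# Three hypothetical searches for the same vulnerable function "a";
# each list is the candidate ranking returned for one query.
rankings = [["a", "b", "c"], ["b", "a", "c"], ["c", "b", "a"]]
targets = ["a", "a", "a"]

print(mean_reciprocal_rank(rankings, targets))
print(recall_at_k(rankings, targets, k=1))
```

A re-ranking module that moves the true match from third to first place raises both metrics without changing the candidate set, which is why the two are reported together.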
Binary code similarity detection is an effective analysis technique for vulnerability, bug, and plagiarism detection in software whose source code cannot be obtained. The recent proliferation of IoT devices has also increased the demand for similarity detection across different architectures. However, there are currently few examples of feature extraction methods based on neural machine translation (NMT) models being applied to similarity detection at the basic block level across different architectures. In this research, we propose new methods that extract features faster and detect similarities across architectures more accurately than existing NMT-based methods for basic block feature extraction. We assume that the intermediate representation of an NMT model that has learned to translate basic blocks across architectures captures the semantics of the instructions in the basic block. Hence, we adopt this intermediate representation as the basic block features. We then apply the linear transformation used in bilingual word embedding to align the embedding spaces of basic blocks across architectures. This enables similarity detection at the basic block level across architectures with higher accuracy than the distance learning method that existing research uses to align the embedding spaces. In the evaluation experiment, we compare Precision at k (P@k) on the same dataset against existing methods, and our method achieves the highest accuracy, 92%. In addition, we compare the time required for feature extraction on GPUs and find that our method is up to 16 times faster.
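The alignment step can be sketched as follows. The abstract only says that a linear transformation from bilingual word embedding is used; the orthogonal Procrustes solution shown here is one common choice in that literature and is an assumption, as are the function names, the synthetic data, and the use of P@1 as the simplest instance of P@k.

```python
import numpy as np

def fit_orthogonal_map(source, target):
    """Orthogonal W minimising ||source @ W - target||_F (Procrustes via SVD)."""
    u, _, vt = np.linalg.svd(source.T @ target)
    return u @ vt

def precision_at_1(mapped, target):
    """Fraction of rows whose nearest target row (by cosine) is its true pair."""
    a = mapped / np.linalg.norm(mapped, axis=1, keepdims=True)
    b = target / np.linalg.norm(target, axis=1, keepdims=True)
    nearest = (a @ b.T).argmax(axis=1)
    return float((nearest == np.arange(len(mapped))).mean())

# Synthetic stand-ins for basic block features from two architectures:
# the "x86" space is an exact rotation of the "arm" space.
rng = np.random.default_rng(0)
arm = rng.normal(size=(50, 4))
true_rotation, _ = np.linalg.qr(rng.normal(size=(4, 4)))
x86 = arm @ true_rotation

mapping = fit_orthogonal_map(arm, x86)
print(precision_at_1(arm @ mapping, x86))
```

Because the map is fitted in closed form from paired examples, it needs no iterative training, unlike a learned distance metric.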
ISBN: (print) 9783031565823; 9783031565830
Cross-architecture binary code similarity detection plays an important role in different security domains. In view of the low accuracy and poor scalability of existing cross-architecture detection technologies, we propose Optir-SBERT, the first technique to detect cross-architecture binary code similarity based on optimized LLVM IR. We also design a new dataset, binaryIR, which is more diverse and provides a benchmark for subsequent research based on LLVM IR. In cross-architecture binary code similarity detection, the accuracy of Optir-SBERT reaches 94.38%, of which optimization contributes 3.99%. In vulnerability detection, the average accuracy of Optir-SBERT reaches 93.9%, of which optimization contributes 7%. These results are better than those of existing state-of-the-art (SOTA) cross-architecture detection technologies. To improve the efficiency of vulnerability detection in realistic scenarios, we introduce a file-level vulnerability identification mechanism on top of Optir-SBERT. The resulting model, Optir-SBERT-F, saves 45.36% of the detection time at the cost of a slight decrease in the detection F value, which greatly improves the efficiency of vulnerability detection.
ISBN: (print) 9783031157776; 9783031157769
Binary code similarity detection (BCSD), whose task is to detect the similarity of two binary functions without access to the source code, has many applications in computer security. Recently, deep learning methods have shown better efficiency, accuracy, and potential in BCSD. Most of them minimize loss with a Siamese network while ignoring some of its shortcomings. In this paper, we introduce the idea of contrastive learning into graph neural networks and experimentally demonstrate that training graph models with contrastive learning is significantly better than Siamese training. In addition, we find that, among various graph neural networks, Principal Neighbourhood Aggregation for Graph Nets (PNA) has the best ability to extract the structural information of a control flow graph (CFG).
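A contrastive objective differs from a pairwise Siamese loss in that each function is scored against every other function in the batch at once. The sketch below shows an InfoNCE-style loss as one example of this; the abstract does not name its exact loss, so the formula, the temperature of 0.1, and the assumption of L2-normalised embeddings are illustrative choices.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def info_nce_loss(anchors, positives, temperature=0.1):
    """InfoNCE-style loss: positives[i] is the match for anchors[i];
    every other positives[j] in the batch acts as a negative.
    Embeddings are assumed to be L2-normalised."""
    total = 0.0
    for i, anchor in enumerate(anchors):
        logits = [dot(anchor, p) / temperature for p in positives]
        peak = max(logits)  # subtract the max for a stable log-sum-exp
        log_sum = peak + math.log(sum(math.exp(x - peak) for x in logits))
        total += log_sum - logits[i]
    return total / len(anchors)

# Toy batch of two functions, each with two variants (made-up embeddings).
variant_a = [[1.0, 0.0], [0.0, 1.0]]
variant_b_matched = [[1.0, 0.0], [0.0, 1.0]]
variant_b_swapped = [[0.0, 1.0], [1.0, 0.0]]

print(info_nce_loss(variant_a, variant_b_matched))
print(info_nce_loss(variant_a, variant_b_swapped))
```

The loss is low when each anchor is closest to its own variant and high when another function in the batch is closer, so larger batches supply more negatives per update.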
Binary code similarity detection (BCSD) plays a crucial role in various computer security applications, including vulnerability detection, malware detection, and software component analysis. With the development of the Internet of Things (IoT), binaries are built for many different instruction set architectures, which requires BCSD approaches that are robust across architectures. In this study, we propose a novel IoT-oriented binary code similarity detection approach. Our approach leverages a customized transformer-based language model with disentangled attention to capture relative position information. To mitigate out-of-vocabulary (OOV) challenges in the language model, we introduce a base-token prediction pre-training task aimed at capturing basic semantics for unseen tokens. During function embedding generation, we integrate directed jumps, data dependency, and address adjacency to capture multiple block relations. We then assign different weights to the different relations and use multi-layer graph convolutional networks (GCNs) to generate function embeddings. We implemented a prototype named IoTSim. Our experimental results show that the proposed block relation matrix improves IoTSim by a large margin. With a pool size of 103, IoTSim achieves a recall@1 of 0.903 across architectures, outperforming the state-of-the-art approaches Trex, SAFE, and PalmTree.
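The function embedding step (several block relations, one weight per relation, then multi-layer GCN) can be sketched roughly as below. This is not IoTSim's implementation: the fixed relation weights, row normalisation, mean pooling, random layer weights, and the toy three-block graph are all assumptions chosen to keep the example small.

```python
import numpy as np

def combine_relations(relations, weights):
    """Weighted sum of per-relation adjacency matrices, plus self-loops."""
    n = relations[0].shape[0]
    combined = np.eye(n)
    for adjacency, weight in zip(relations, weights):
        combined = combined + weight * adjacency
    return combined

def gcn_layer(adjacency, features, weight):
    """One step: row-normalise, aggregate neighbours, project, apply ReLU."""
    degree = adjacency.sum(axis=1, keepdims=True)  # >= 1 thanks to self-loops
    return np.maximum((adjacency / degree) @ features @ weight, 0.0)

def embed_function(relations, weights, block_embeddings, layer_weights):
    """Stack GCN layers over the combined graph, then mean-pool the blocks."""
    adjacency = combine_relations(relations, weights)
    hidden = block_embeddings
    for layer_weight in layer_weights:
        hidden = gcn_layer(adjacency, hidden, layer_weight)
    return hidden.mean(axis=0)

# Toy function with three basic blocks and three relation types.
jumps = np.array([[0, 1, 1], [0, 0, 1], [0, 0, 0]], dtype=float)
data_dep = np.array([[0, 0, 1], [0, 0, 0], [0, 0, 0]], dtype=float)
addr_adj = np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]], dtype=float)

rng = np.random.default_rng(0)
blocks = rng.normal(size=(3, 4))  # stand-in for language-model block embeddings
layers = [rng.normal(size=(4, 4)), rng.normal(size=(4, 2))]

function_embedding = embed_function(
    [jumps, data_dep, addr_adj], [1.0, 0.5, 0.25], blocks, layers)
print(function_embedding.shape)
```

In a trained model the layer weights, and possibly the relation weights, would be learned rather than sampled or fixed by hand.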
The widespread reuse of open-source code amplifies the impact of vulnerabilities. Current vulnerability detection methods predominantly rely on binary code similarity comparison, which involves disassembling binaries to obtain assembly code or control flow graphs. These methods depend on specific disassembly tools and complex preprocessing, which limits their applicability and detection speed. This paper proposes UniBin, a vulnerability detection method based on a multilayer Transformer encoder. By employing bidirectional LM, unidirectional LM, and sequence-to-sequence LM tasks on both binary and assembly code during the pre-training phase, UniBin learns richer semantic information from binary machine code, enabling efficient similarity comparison without disassembly and mitigating the limitations of disassembly. We cross-compile 55 widely used open-source C projects as datasets. After 52 hours of pre-training and 8 hours of fine-tuning, UniBin reaches an average accuracy of 98.3% in similarity detection across compilation conditions, outperforming the state-of-the-art method. For search tasks across optimization options with a pool size of 1000, the Recall@1 metric improves by 28.2% (from 67.9% to 87.1%). UniBin eliminates the dependency on specific disassembly tools and improves end-to-end binary analysis speed by over 36%. In real-world vulnerability detection tasks, UniBin detects all vulnerable functions with the lowest false positive rate, 0.16%.