The Node Package Manager (npm) registry contains millions of JavaScript packages that are widely shared among developers worldwide. However, npm has also been abused by attackers to spread malicious packages, highlighting the importance of detecting malicious npm packages. Existing malicious npm package detectors suffer from, among other things, high false positive and/or false negative rates. In this paper, we propose a novel Malicious npm Package Detector (MalPacDetector), which leverages a Large Language Model (LLM) to automatically and dynamically generate features (rather than asking experts to define them manually). To evaluate the effectiveness of MalPacDetector and existing detectors, we construct a new npm package dataset, which overcomes the weaknesses of existing datasets (e.g., a small number of examples and a high repetition rate of malicious fragments). The experimental results show that MalPacDetector outperforms existing detectors, achieving a false positive rate of 1.3% and a false negative rate of 7.5%. In particular, MalPacDetector detects 39 previously unknown malicious packages, which are confirmed by the npm security team.
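The paper's exact prompting and classification pipeline is not reproduced here, but the general idea of LLM-generated features can be sketched as follows; the `query_llm` stub and the indicator list are hypothetical placeholders, not MalPacDetector's actual features:

```python
import json

def query_llm(prompt: str) -> str:
    """Hypothetical LLM client; plug in a real API call here."""
    raise NotImplementedError

# Illustrative behavioral indicators an LLM could be asked to score.
# MalPacDetector generates its features dynamically; this fixed list
# is only a placeholder for the shape of the idea.
INDICATORS = [
    "runs code from an install hook (preinstall/postinstall)",
    "reads environment variables and sends them over the network",
    "downloads and evaluates remote code at runtime",
    "decodes obfuscated strings and passes them to eval",
]

def extract_features(package_source: str) -> list[float]:
    """Ask the LLM to rate each indicator from 0 (absent) to 1 (present)."""
    prompt = (
        "For the npm package source below, rate each behavior from 0 to 1 "
        "and reply with a JSON list of numbers only.\n"
        + "\n".join(f"- {ind}" for ind in INDICATORS)
        + "\n\nSource:\n" + package_source
    )
    return json.loads(query_llm(prompt))

def flag_package(package_source: str, threshold: float = 0.5) -> bool:
    """Average the indicator scores; a real detector would instead train
    a classifier over the generated features."""
    scores = extract_features(package_source)
    return sum(scores) / len(scores) >= threshold
```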
Modern storage systems typically replicate data on multiple servers to provide high reliability and availability. However, most commercially deployed datastores fail to offer low latency, high throughput, and strong consistency at the same time. This paper presents Whale, a Remote Direct Memory Access (RDMA) based primary-backup replication system for in-memory datastores. Whale achieves both low latency and strong consistency by decoupling metadata multicasting from data replication for all backup nodes, and by using an optimistic commitment mechanism to respond to client write requests earlier. Whale achieves high throughput by propagating writes from the primary node to backup nodes asynchronously via RDMA-optimized chain replication. To further reduce the cost of data replication, we design a log-structured datastore to fully exploit the advantages of one-sided RDMA and Persistent Memory (PM). We implement Whale on a cluster equipped with PM and InfiniBand RDMA networks. Experimental results show that Whale achieves much higher throughput and lower latency than state-of-the-art replication protocols.
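Whale's RDMA-level mechanics cannot be captured in a few lines, but the control flow the abstract describes, multicasting small metadata synchronously, acknowledging the client optimistically, and replicating data down the chain asynchronously, can be sketched in plain Python (all class and method names are illustrative):

```python
import queue
import threading

class Backup:
    def __init__(self):
        self.meta, self.store = {}, {}

    def record_metadata(self, seq, key):
        # Small metadata recorded on every backup; lets a backup detect
        # (and later fetch) a write whose data has not yet arrived.
        self.meta[seq] = key

    def apply(self, seq, key, value, rest):
        self.store[key] = value
        if rest:                       # forward down the chain
            rest[0].apply(seq, key, value, rest[1:])

class Primary:
    def __init__(self, backups):
        self.backups = backups         # ordered replication chain
        self.log = []                  # simplified log-structured store
        self.pending = queue.Queue()   # writes awaiting data replication
        threading.Thread(target=self._replicate, daemon=True).start()

    def write(self, key, value):
        seq = len(self.log)
        self.log.append((key, value))
        for b in self.backups:         # 1. multicast metadata synchronously
            b.record_metadata(seq, key)
        self.pending.put((seq, key, value))
        return "ACK"                   # 2. optimistic early acknowledgement

    def _replicate(self):
        # 3. propagate full data asynchronously along the chain
        #    (Whale uses one-sided RDMA writes into persistent memory).
        while True:
            seq, key, value = self.pending.get()
            if self.backups:
                self.backups[0].apply(seq, key, value, self.backups[1:])
```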
ISBN (print): 9798331314385
Matrix multiplication (MM) is pivotal in fields from deep learning to scientific computing, driving the quest for improved computational efficiency. Accelerating MM encompasses strategies such as complexity reduction, parallel and distributed computing, hardware acceleration, and approximate computing techniques, notably Approximate Matrix Multiplication (AMM) algorithms. Amidst growing concerns over the resource demands of large language models (LLMs), AMM has garnered renewed focus. However, understanding of the nuances that govern AMM's effectiveness remains incomplete. This study delves into AMM by examining algorithmic strategies, operational specifics, dataset characteristics, and their application in real-world tasks. Through comprehensive testing across diverse datasets and scenarios, we analyze how these factors affect AMM's performance, uncovering that the choice of AMM approach significantly influences the balance between efficiency and accuracy, with factors like memory access playing a pivotal role. Additionally, dataset attributes are shown to be vital for the success of AMM in applications. Our results advocate for tailored algorithmic approaches and careful strategy selection to enhance AMM's effectiveness. To aid in the practical application and ongoing research of AMM, we introduce LibAMM, a toolkit offering a wide range of AMM algorithms, benchmarks, and tools for experiment management. LibAMM aims to facilitate research and application in AMM, guiding future developments towards more adaptive and context-aware computational solutions.
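As one concrete member of the AMM family such a toolkit covers, the classic column-row sampling algorithm of Drineas et al. approximates A @ B by sampling k outer products with probability proportional to their norms; a minimal NumPy sketch:

```python
import numpy as np

def sampled_amm(A, B, k, rng=None):
    """Unbiased approximation of A @ B from k sampled outer products."""
    rng = rng or np.random.default_rng()
    n = A.shape[1]
    # Sample column i of A (and row i of B) with probability proportional
    # to ||A[:, i]|| * ||B[i, :]||, which minimizes the expected error.
    norms = np.linalg.norm(A, axis=0) * np.linalg.norm(B, axis=1)
    p = norms / norms.sum()
    idx = rng.choice(n, size=k, p=p)
    C = np.zeros((A.shape[0], B.shape[1]))
    for i in idx:
        C += np.outer(A[:, i], B[i, :]) / (k * p[i])  # rescale for unbiasedness
    return C

A, B = np.random.rand(64, 256), np.random.rand(256, 64)
err = np.linalg.norm(A @ B - sampled_amm(A, B, k=64)) / np.linalg.norm(A @ B)
print(f"relative Frobenius error: {err:.3f}")
```

Raising k trades compute for accuracy, which is exactly the efficiency/accuracy balance the study examines.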
Third-party libraries (TPLs) play a crucial role in software development. TPL recommender systems can help software developers promptly find useful TPLs. A number of TPL recommendation approaches have been proposed, among which graph neural network (GNN)-based recommendation attracts the most attention. However, GNN-based approaches generate node representations through multiple convolutional aggregations, which is prone to introducing noise and leads to the over-smoothing issue. In addition, due to the high sparsity of labelled data, node representations may be biased in real-world scenarios. To address these issues, this paper presents a TPL recommendation method named Implicit Supervision-assisted Graph Collaborative Filtering (ISGCF). Specifically, it takes App-TPL interaction relationships as input and employs a popularity-debiasing method to generate denoised App and TPL graphs. This reduces the noise introduced during graph convolution and alleviates the over-smoothing issue. ISGCF also employs a novel implicitly-supervised loss function to exploit the labelled data and learn enhanced node representations. Extensive experiments on a large-scale real-world dataset demonstrate that ISGCF achieves a significant performance advantage over other state-of-the-art TPL recommendation methods in Recall, NDCG, and MAP. The experiments also validate the superiority of ISGCF in mitigating the over-smoothing problem.
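ISGCF's exact debiasing and loss formulations are in the paper; as a rough illustration of one ingredient, popularity-aware (degree-normalized) graph convolution with layer averaging, a generic LightGCN-style sketch might look like this (not ISGCF's actual method):

```python
import numpy as np

def degree_normalized_propagation(R, emb_app, emb_tpl, layers=3):
    """LightGCN-style propagation over an App-TPL interaction matrix R
    (shape: apps x TPLs), with symmetric inverse-sqrt degree normalization
    to damp the influence of overly popular nodes. A generic sketch only."""
    d_app = np.maximum(R.sum(axis=1), 1.0)
    d_tpl = np.maximum(R.sum(axis=0), 1.0)
    R_hat = R / np.sqrt(np.outer(d_app, d_tpl))
    outs_app, outs_tpl = [emb_app], [emb_tpl]
    for _ in range(layers):
        # Simultaneous update: each side aggregates the other's embeddings.
        emb_app, emb_tpl = R_hat @ emb_tpl, R_hat.T @ emb_app
        outs_app.append(emb_app)
        outs_tpl.append(emb_tpl)
    # Averaging the per-layer outputs is a common guard against
    # over-smoothing from stacking many convolutions.
    return np.mean(outs_app, axis=0), np.mean(outs_tpl, axis=0)
```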
ISBN (print): 9798350323481
Streaming graph processing needs to evaluate continuous queries in a timely manner. Prior systems suffer from massive redundant computations due to the irregular order in which vertices affected by updates are processed. To address this issue, we propose ACGraph, a novel streaming graph processing approach for monotonic graph algorithms. It maintains dependence trees at runtime and processes affected vertices in top-to-bottom order within the hierarchy of the dependence trees, thus normalizing the state propagation order and coalescing multiple propagations to the same vertices. Experimental results show that ACGraph reduces the number of updates by 50% on average, and achieves a speedup of 1.75–7.43× over state-of-the-art systems.
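As a simplified illustration of the scheduling idea, processing affected vertices in increasing dependence-tree depth so that multiple propagations to the same vertex coalesce, consider an incremental shortest-path style monotonic computation (a sketch, not ACGraph's implementation):

```python
import heapq

def propagate(graph, dist, depth, affected):
    """Re-evaluate vertices affected by graph updates in top-to-bottom
    dependence-tree order (shallower tree depth first), for a monotonic
    shortest-path style computation.
    graph: {u: [(v, weight), ...]}; dist/depth describe the current tree."""
    heap = [(depth[v], v) for v in affected]
    heapq.heapify(heap)
    while heap:
        _, u = heapq.heappop(heap)
        for v, w in graph.get(u, []):
            if dist[u] + w < dist[v]:      # monotonic update rule
                dist[v] = dist[u] + w
                depth[v] = depth[u] + 1    # v now hangs under u in the tree
                heapq.heappush(heap, (depth[v], v))
    return dist
```

Because a vertex is popped only at its current depth, updates from multiple ancestors tend to be merged into one re-evaluation instead of triggering repeated downstream propagation.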
Serverless platforms typically adopt an early-binding approach for function sizing, requiring developers to specify an immutable size for each function within a workflow beforehand. Accounting for potential runtime va...
In this paper, we propose efficient distributed algorithms for three holistic aggregation functions on random regular graphs, which are good candidates for the network topology in next-generation data centers. The three holistic aggregation functions are SELECTION (select the k-th largest or smallest element), DISTINCT (query the count of distinct elements), and MODE (query the most frequent element). We design three basic techniques, namely Pre-order Network Partition, Pairwise-independent Random Walk, and Random Permutation Delivery, and devise the algorithms based on these techniques. The round complexity of the distributed SELECTION algorithm is Θ(log N), which meets the lower bound, where N is the number of nodes and each node holds one numeric element. The round complexities of the distributed DISTINCT and MODE algorithms are O(log^3 N / log log N) and O(log^2 N log log N), respectively. All of our results break the lower bounds obtained on general graphs, and all of our distributed algorithms are based on the CONGEST model, which restricts each node to sending only O(log N) bits on each edge in one round under synchronous communication.
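The CONGEST-model algorithms themselves are intricate; the round structure of pivot-based selection, where each round aggregates one global count and shrinks the candidate set by a constant factor in expectation, can be illustrated with a centralized simulation (the random-walk pivot choice and per-round network aggregation are abstracted away):

```python
import random

def distributed_selection(node_values, k):
    """Simulate selecting the k-th smallest of N per-node elements.
    Each loop iteration stands for one communication round in which the
    network aggregates a global count against the chosen pivot."""
    candidates, rounds = list(node_values), 0
    while True:
        rounds += 1
        pivot = random.choice(candidates)   # stand-in for random-walk sampling
        below = [x for x in candidates if x < pivot]
        equal = [x for x in candidates if x == pivot]
        if k <= len(below):                 # answer lies strictly below pivot
            candidates = below
        elif k <= len(below) + len(equal):  # pivot is the k-th smallest
            return pivot, rounds
        else:                               # discard pivot and everything below
            k -= len(below) + len(equal)
            candidates = [x for x in candidates if x > pivot]

values = random.sample(range(10_000), 1_000)
element, rounds = distributed_selection(values, k=100)
print(element, sorted(values)[99], rounds)  # element equals the true answer
```

The expected number of rounds is O(log N), matching the flavor of the paper's Θ(log N) bound, though the real algorithm must also implement each round within the CONGEST bandwidth limit.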
Starting a container requires building a container image layer by layer if the required image is not available. However, image building involves downloading a large amount of data, which significantly delays the deve...
ISBN (digital): 9798331535100
ISBN (print): 9798331535117
Pre-training a language model and then fine-tuning it has been shown to be an efficient and effective technique for a wide range of code intelligence tasks, such as code generation, code summarization, and vulnerability detection. However, pre-training language models on a large-scale code corpus is computationally expensive. Fortunately, many off-the-shelf Pre-trained Code Models (PCMs), such as CodeBERT, CodeT5, CodeGen, and Code Llama, have been released publicly. These models acquire general code understanding and generation capability during pre-training, which enhances their performance on downstream code intelligence tasks. With an increasing number of these public pre-trained models, selecting the most suitable one to reuse for a specific task is essential. In this paper, we systematically investigate the reusability of PCMs. We first explore three intuitive model selection methods that select by size, by training data, or by brute-force fine-tuning. Experimental results show that these straightforward techniques either perform poorly or incur high costs. Motivated by these findings, we explore learning-based model selection strategies that utilize pre-trained models without altering their parameters. Specifically, we train proxy models to gauge the performance of pre-trained models, and we measure the distribution deviation between a model's latent features and the task's labels, using their closeness as an indicator of model transferability. We conduct experiments on 100 widely used open-source PCMs for code intelligence tasks, with sizes ranging from 42.5 million to 3 billion parameters. The results demonstrate that learning-based selection methods reduce selection time to 100 seconds, compared to 2,700 hours with brute-force fine-tuning, with less than 6% performance degradation across related tasks.
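The paper's proxy-model and distribution-deviation selectors are not reproduced here; a minimal sketch of the shared recipe, scoring each frozen PCM by how well a cheap linear probe maps its features to task labels, might look like this (`extract_features` is a stub for running the frozen model):

```python
import numpy as np

def probe_score(features, labels, reg=1e-3):
    """Fit a closed-form ridge-regression linear probe on frozen features
    and return its training accuracy as a cheap transferability signal
    (a stand-in for the paper's proxy-model and deviation measures)."""
    X = np.hstack([features, np.ones((len(features), 1))])  # append bias
    Y = np.eye(int(labels.max()) + 1)[labels]               # one-hot labels
    W = np.linalg.solve(X.T @ X + reg * np.eye(X.shape[1]), X.T @ Y)
    return float((np.argmax(X @ W, axis=1) == labels).mean())

def select_model(models, extract_features, task_X, task_y):
    """Rank frozen pre-trained models without any fine-tuning.
    extract_features(model, inputs) -> (n, d) array is assumed to run
    the frozen model and return its latent features."""
    scores = {name: probe_score(extract_features(m, task_X), task_y)
              for name, m in models.items()}
    best = max(scores, key=scores.get)
    return best, scores
```

Because no model parameters are updated, the cost per candidate is one forward pass plus a small linear solve, which is what makes selection over 100 PCMs feasible in seconds rather than hours.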