ISBN (digital): 9798350326581
ISBN (print): 9798350326598
Personalized recommendation systems have become one of the most important Internet services nowadays. A critical challenge in training and deploying recommendation models is their high memory capacity and bandwidth demands, with the embedding layers occupying hundreds of GBs to TBs of storage. The advent of memory disaggregation technology and Compute Express Link (CXL) provides a promising solution for memory capacity scaling. However, relocating memory-intensive embedding layers to CXL memory incurs noticeable performance degradation due to its limited transmission bandwidth, which is significantly lower than the host memory bandwidth. To address this, we introduce ReCXL, a CXL memory disaggregation system that utilizes near-memory processing (NMP) for scalable, efficient recommendation model training. ReCXL features a unified, hardware-efficient NMP architecture that processes the entire embedding training within CXL memory, minimizing data transfers over the bandwidth-limited CXL link and enhancing internal bandwidth utilization. To further improve performance, ReCXL incorporates software-hardware co-optimizations, including sophisticated dependency-free prefetching and fine-grained update scheduling, to maximize hardware utilization. Evaluation results show that ReCXL outperforms the CPU-GPU baseline and naïve CXL memory by $7.1\times$–$10.6\times$ ($9.4\times$ on average) and $12.7\times$–$31.3\times$ ($22.6\times$ on average), respectively.
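For readers unfamiliar with why embedding layers dominate memory traffic, the sketch below illustrates the sparse gather and scatter-update access pattern that ReCXL keeps inside CXL memory. The table size, batch shape, and plain-SGD update rule are illustrative assumptions, not ReCXL's actual design.

```python
import numpy as np

# Illustrative sketch of the embedding-training access pattern that near-memory
# processing targets: reads and writes touch only a scattered subset of rows.
NUM_ROWS, DIM = 100_000, 64                 # assumed table shape, for illustration
table = np.random.rand(NUM_ROWS, DIM).astype(np.float32)

def forward(indices):
    """Gather: irregular reads of only the looked-up embedding rows."""
    return table[indices]                   # shape: (batch, DIM)

def backward_sgd(indices, grad, lr=0.01):
    """Scatter-update: only the touched rows are written back.
    np.add.at accumulates correctly when an index repeats within a batch."""
    np.add.at(table, indices, -lr * grad)

batch = np.random.randint(0, NUM_ROWS, size=256)
emb = forward(batch)                        # 256 scattered row reads
backward_sgd(batch, np.ones_like(emb))      # 256 scattered row writes
```

Keeping both phases near the memory avoids shipping the scattered rows over the bandwidth-limited CXL link on every training step.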
Genome graphs analysis has emerged as an effective means to enable mapping DNA fragments (known as reads) to the reference genome. It replaces the traditional linear reference with a graph-based representation to augment the genetic variations and diversity information, significantly improving the quality of genotyping. An in-depth characterization of genome graphs analysis uncovers that it is bottlenecked by the irregular seed index accesses and the intensive alignment operations, stressing both the memory system and computing. Based on these observations, we propose MeG2, a lightweight, commodity-DRAM-compliant, processing-in-memory architecture to accelerate genome graphs analysis. MeG2 integrates the capabilities of both near-memory processing and bitwise in-situ computation. Specifically, MeG2 leverages the low access latency of near-memory processing with an index-centric offload mechanism to alleviate the irregular memory accesses in the seeding procedure, and harnesses the row-parallel capacity of in-situ computation with a distance-aware technique to exploit the intensive computational parallelism in the alignment process. Results show that MeG2 outperforms CPU-, GPU-, and ASIC-based genome graphs analysis solutions by 502× (30.2×), 272× (15.1×), and 5.5× (8.3×) for short (long) reads, while reducing energy consumption by 1628× (85.6×), 1443× (77.1×), and 7.8× (11.7×), respectively. We also demonstrate that MeG2 offers significant improvements over existing PIM-based genome sequence analysis accelerators.
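To make the seeding bottleneck concrete, here is a minimal sketch of a k-mer seed index: each query probes an effectively random location, which is the irregular access pattern MeG2 offloads to near-memory processing. The hash-table layout and toy sequences are assumptions; production genome-graph tools use more compact index structures.

```python
from collections import defaultdict

K = 5  # assumed k-mer length, for illustration

def build_index(reference: str) -> dict:
    """Map every k-mer of the reference to its positions."""
    index = defaultdict(list)
    for pos in range(len(reference) - K + 1):
        index[reference[pos:pos + K]].append(pos)
    return index

def seed(read: str, index: dict) -> list:
    """Return (read_offset, ref_position) seed hits; each probe touches
    an unpredictable index location -- the pattern NMP helps with."""
    hits = []
    for off in range(len(read) - K + 1):
        for pos in index.get(read[off:off + K], []):
            hits.append((off, pos))
    return hits

index = build_index("ACGTACGTGACCTGAACGT")
print(seed("CGTGACC", index))   # [(0, 5), (1, 6), (2, 7)]
```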
Serverless platforms typically adopt an early-binding approach for function sizing, requiring developers to specify an immutable size for each function within a workflow beforehand. Accounting for potential runtime va...
Federated Learning (FL) is increasingly adopted in edge computing scenarios, where a large number of heterogeneous clients operate under constrained or sufficient resources. The iterative training process in conventio...
Recent years have witnessed significant progress in developing deep learning-based models for automated code completion. Examples of such models include CodeGPT and StarCoder. These models are typically trained on a large amount of source code collected from open-source communities such as GitHub. Although using source code from GitHub has been a common practice for training deep-learning-based models for code completion, it may induce legal and ethical issues such as copyright infringement. In this paper, we investigate the legal and ethical issues of current neural code completion models by answering the following question: Is my code used to train your neural code completion model? To this end, we tailor a membership inference approach (termed CodeMI), originally crafted for classification tasks, to the more challenging task of code completion. In particular, since the target code completion models perform as opaque black boxes, preventing access to their training data and parameters, we opt to train multiple shadow models to mimic their behavior. The posteriors acquired from these shadow models are subsequently employed to train a membership classifier. The membership classifier can then be used to deduce the membership status of a given code sample from the output of a target code completion model. We comprehensively evaluate the effectiveness of this adapted approach across a diverse array of neural code completion models (i.e., LSTM-based, CodeGPT, CodeGen, and StarCoder). Experimental results reveal that the LSTM-based and CodeGPT models suffer from the membership leakage issue, which can be easily detected by our proposed membership inference approach with accuracies of 0.842 and 0.730, respectively. Interestingly, our experiments also show that the data membership of current large language models of code, e.g., CodeGen and StarCoder, is difficult to detect, leaving ample space for further improvement. Finally, we also …
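The shadow-model pipeline the abstract describes can be sketched end to end: featurize a model's posteriors over a code sample, train a binary attack classifier on shadow-model outputs with known membership, then query it against the black-box target. Everything below (the feature choice, the classifier, and the synthetic posteriors standing in for real shadow-model outputs) is an illustrative assumption, not the authors' exact CodeMI method.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

def posterior_features(probs: np.ndarray) -> np.ndarray:
    """Summarize per-token posteriors for one code sample. Training members
    tend to receive higher, peakier confidence than non-members."""
    return np.array([probs.mean(), probs.max(), -np.log(probs + 1e-9).mean()])

# Stand-in shadow-model outputs: members get slightly higher confidence.
member_probs = rng.beta(8, 2, size=(500, 32))       # 500 samples x 32 tokens
nonmember_probs = rng.beta(5, 3, size=(500, 32))

X = np.vstack([np.apply_along_axis(posterior_features, 1, member_probs),
               np.apply_along_axis(posterior_features, 1, nonmember_probs)])
y = np.array([1] * 500 + [0] * 500)                 # 1 = training member

attack = LogisticRegression().fit(X, y)             # membership classifier

# Attack time: query the black-box target, featurize its posteriors,
# and predict whether the code sample was in its training set.
query = np.apply_along_axis(posterior_features, 1, rng.beta(8, 2, (1, 32)))
print("member probability:", attack.predict_proba(query)[0, 1])
```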
ISBN (digital): 9798350317152
ISBN (print): 9798350317169
Transactional stream processing engines (TSPEs) have gained increasing attention due to their capability of processing real-time stream applications with transactional semantics. However, TSPEs remain susceptible to system failures and power outages. Existing TSPEs mainly focus on performance improvement but still face a significant challenge in guaranteeing fault tolerance while offering high-performance services. We revisit commonly used fault tolerance approaches in stream processing and database systems, and find that these approaches do not work well on TSPEs due to complex data dependencies. In this paper, we propose a novel TSPE called MorphStreamR to achieve fast failure recovery while guaranteeing low performance overhead at runtime. The key idea of MorphStreamR is to record intermediate results of resolved dependencies at runtime, thus eliminating data dependencies and improving task parallelism during failure recovery. MorphStreamR further mitigates the runtime overhead by selectively tracking data dependencies and incorporating workload-aware log commitment. Experimental results show that MorphStreamR can reduce the recovery time by up to 3.1× while experiencing much less performance slowdown at runtime, compared with other applicable fault tolerance approaches.
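A toy illustration of the key idea: if a transaction logs the resolved value it read at runtime, recovery can recompute every transaction independently instead of replaying them serially to re-derive cross-transaction dependencies. The state model and log format below are assumptions for illustration, not MorphStreamR's actual design.

```python
from concurrent.futures import ThreadPoolExecutor

state = {"x": 0}
resolved_log = []            # (txn_id, key, value_read), appended at runtime

def run_txn(txn_id, delta):
    v = state["x"]                          # this read depends on prior txns
    resolved_log.append((txn_id, "x", v))   # record the resolved input
    state["x"] = v + delta

deltas = [3, 5, 7]
for i, d in enumerate(deltas):
    run_txn(i, d)

# Recovery: inputs are already resolved in the log, so each transaction's
# output can be recomputed independently -- here, in parallel threads.
def replay(entry, delta):
    txn_id, key, v = entry
    return txn_id, v + delta

with ThreadPoolExecutor() as pool:
    outputs = list(pool.map(replay, resolved_log, deltas))
print(outputs)   # each txn recomputed without waiting on its predecessors
```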
ISBN (digital): 9798350383508
ISBN (print): 9798350383515
Graph convolutional networks (GCNs) have achieved enormous success in learning structural information from unstructured data. As graphs become increasingly large, distributed training for GCNs is severely prolonged by frequent cross-worker communications. Existing efforts to improve training efficiency often come at the expense of GCN performance, while the communication overhead persists. In this paper, we propose PSC-GCN, a holistic pipelined framework for distributed GCN training with communication-efficient sampling and inclusion-aware caching, to address the communication bottleneck while ensuring satisfactory model performance. Specifically, we devise an asynchronous pre-fetching scheme to retrieve stale statistics (features, embeddings, gradients) of boundary nodes in advance, such that embedding aggregation and model update are pipelined with statistics transmission. To reduce communication volume and mitigate the staleness effect, we introduce a variance-reduction-based sampling policy, which prioritizes inner nodes over boundary ones to reduce the access frequency to remote neighbors, thus mitigating cross-worker statistics exchange. Complementing graph sampling, a feature caching module is co-designed to buffer hot nodes with high inclusion probability, ensuring that frequently sampled nodes are available in local memory. Extensive evaluations on real-world datasets show the superiority of PSC-GCN over state-of-the-art methods: it reduces training time by 72%-80% without sacrificing model accuracy.
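The two co-designed ideas can be sketched in a few lines: bias neighbor sampling toward worker-local (inner) nodes to cut cross-worker traffic, and cache the remote nodes whose inclusion probability is highest. The weights, threshold, and probability estimates below are illustrative assumptions, not PSC-GCN's actual policy.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_neighbors(neighbors, is_local, fanout=5, local_boost=3.0):
    """Sample `fanout` neighbors, up-weighting local ones so remote
    neighbors are fetched less often (but remain reachable)."""
    w = np.where(is_local, local_boost, 1.0)
    k = min(fanout, len(neighbors))
    return rng.choice(neighbors, size=k, replace=False, p=w / w.sum())

def build_cache(inclusion_prob, capacity):
    """Buffer the remote nodes most likely to be sampled again."""
    order = np.argsort(-inclusion_prob)
    return set(order[:capacity].tolist())

neighbors = np.arange(10)
is_local = np.array([True] * 6 + [False] * 4)
print(sample_neighbors(neighbors, is_local))

incl = rng.random(100)          # estimated per-node inclusion probability
hot_cache = build_cache(incl, capacity=16)
```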
ISBN (digital): 9798350317152
ISBN (print): 9798350317169
Directed Acyclic Graph (DAG)-based blockchains (a.k.a. distributed ledgers) have become prevalent for supporting highly concurrent applications. Their inherent parallel data structure accelerates block generation significantly, shifting the bottleneck from performance to storage scalability. An intuitive solution is to apply state sharding, which divides the entire ledger (i.e., transactions and states) into multiple shards. While each node then stores only a proportional share of transactions, it faces the challenges of storing cross-shard transactions and ensuring their processing consistency. In this paper, we propose SharDAG, a new mechanism that leverages adaptive sharding for DAG-based blockchains to achieve high performance and strong consistency. The key idea of SharDAG is to exploit a unique characteristic of DAG-based blockchains, silent assets, and to design a lightweight processing mechanism based on avatar account caching. Furthermore, we design a Byzantine-resilient cross-shard verification mechanism with a theoretically optimal number of participating nodes, which guarantees the consistency and security of avatar account aggregation. Our comprehensive evaluations on real-world workloads demonstrate that SharDAG achieves up to 3.8× throughput improvement compared to the state-of-the-art and reduces the storage overhead of cross-shard transactions.
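A toy sketch of the avatar-account idea: a shard executes a cross-shard transaction against a locally cached "avatar" of the remote account and defers aggregation back to the home shard. The data structures and aggregation rule are illustrative assumptions; in SharDAG the aggregation step is protected by the Byzantine-resilient verification mechanism described above.

```python
class Shard:
    def __init__(self, shard_id):
        self.shard_id = shard_id
        self.accounts = {}      # accounts homed in this shard
        self.avatars = {}       # cached deltas for remote accounts

    def apply(self, account, delta, home_shard):
        if home_shard == self.shard_id:
            self.accounts[account] = self.accounts.get(account, 0) + delta
        else:
            # Cross-shard: record against the avatar, no sync round needed.
            self.avatars[account] = self.avatars.get(account, 0) + delta

    def aggregate_from(self, other):
        """Home shard folds in a verified avatar delta from another shard."""
        for account, delta in other.avatars.items():
            self.accounts[account] = self.accounts.get(account, 0) + delta
        other.avatars.clear()

s0, s1 = Shard(0), Shard(1)
s0.apply("alice", +100, home_shard=0)   # local transaction
s1.apply("alice", -30, home_shard=0)    # cross-shard, hits the avatar
s0.aggregate_from(s1)
print(s0.accounts)                      # {'alice': 70}
```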
Stream Learning (SL) requires models that can quickly adapt to continuously evolving data, posing significant challenges in both computational efficiency and learning accuracy. Effective data selection is critical in ...
Modern scientific applications predominantly run on large-scale computing platforms, necessitating collaboration between scientific domain experts and high-performance computing (HPC) experts. While domain experts are often skilled in customizing domain-specific scientific computing routines, which often involve various matrix computations, HPC experts are essential for achieving efficient execution of these computations on large-scale platforms. This process often involves utilizing complex parallel computing libraries tailored to specific matrix computation scenarios. However, the intricate programming procedure and the need for deep understanding of both application domains and HPC pose significant challenges to the widespread adoption of scientific computing. In this research, we observe that matrix computations can be transformed into equivalent graph representations, and that by utilizing graph processing engines, HPC experts can be freed from the burden of implementing efficient scientific computations. Based on this observation, we introduce a graph-engine-based scientific computing (Graph for Science) paradigm, which provides a unified graph programming interface, enabling domain experts to promptly implement various types of matrix computations. The proposed paradigm leverages the underlying graph processing engine to achieve efficient execution, eliminating the need for HPC expertise in programming large-scale scientific applications. We evaluate the performance of the developed graph compute engine on three typical scientific computing routines. Our results demonstrate that the graph-engine-based scientific computing paradigm achieves performance comparable to the best-performing implementations based on existing parallel computing libraries and bespoke implementations. Importantly, the paradigm greatly simplifies the development of scientific computations on large-scale platforms, reducing the programming difficulty for scientists and facilitating …
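The matrix-to-graph equivalence the paradigm builds on is easy to verify on a small case: sparse matrix-vector multiply y = A·x can be expressed as message passing, where each nonzero A[i, j] is an edge j → i carrying weight A[i, j]. The edge-list representation and the scatter/gather loop below are illustrative, not the engine's actual API.

```python
import numpy as np

edges = [            # (src j, dst i, weight A[i, j]) for each nonzero
    (0, 0, 2.0), (1, 0, 1.0),
    (0, 1, 4.0), (2, 1, 3.0),
    (1, 2, 5.0),
]
x = np.array([1.0, 2.0, 3.0])

y = np.zeros(3)
for src, dst, w in edges:        # scatter: one message per edge
    y[dst] += w * x[src]         # gather: sum incoming messages per vertex

A = np.array([[2.0, 1.0, 0.0],
              [4.0, 0.0, 3.0],
              [0.0, 5.0, 0.0]])
assert np.allclose(y, A @ x)     # graph formulation matches the SpMV
print(y)                         # [ 4. 13. 10.]
```

This is exactly the vertex-centric pattern graph engines optimize for, which is why a unified graph interface can cover many matrix routines.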