The delta-based accumulative iterative computation (DAIC) model has been proposed to support iterative algorithms in either a synchronous or an asynchronous way. However, the synchronous and asynchronous DAIC models each perform well only under certain conditions and perform poorly otherwise, due to high synchronization cost or to many redundant activations, respectively. As a result, the overall performance of both DAIC models suffers from the serious network jitter and load jitter caused by multi-tenancy in the cloud. In this paper, we develop a system, namely Hyblter, to guarantee the performance of iterative algorithms under different conditions. Through an adaptive execution-model selection scheme, it can efficiently switch between the synchronous and asynchronous DAIC models to adapt to changing conditions, always getting the best performance in the cloud. Experimental results show that our approach can improve the performance of current solutions by up to 39.0%.
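To make the idea concrete, here is a minimal, self-contained sketch of delta-based accumulative iteration on PageRank, with a switch between a synchronous superstep and an asynchronous worklist pass. The switching rule, the toy graph, and all names are illustrative assumptions, not Hyblter's actual policy or API.

```python
from collections import deque

DAMPING, EPS = 0.85, 1e-9

def sync_round(graph, value, delta):
    # One synchronous superstep: every vertex with a pending delta
    # folds it into its value and scatters a damped share downstream.
    new_delta = {v: 0.0 for v in graph}
    for v, d in delta.items():
        if abs(d) < EPS:
            continue
        value[v] += d
        share = DAMPING * d / max(len(graph[v]), 1)
        for u in graph[v]:
            new_delta[u] += share
    return new_delta

def async_pass(graph, value, delta, budget):
    # Asynchronous worklist pass: propagate deltas eagerly, without a
    # global barrier, for at most `budget` vertex activations.
    work = deque(v for v, d in delta.items() if abs(d) >= EPS)
    while work and budget > 0:
        v = work.popleft()
        budget -= 1
        d, delta[v] = delta[v], 0.0
        if abs(d) < EPS:
            continue
        value[v] += d
        share = DAMPING * d / max(len(graph[v]), 1)
        for u in graph[v]:
            delta[u] += share
            work.append(u)
    return delta

graph = {0: [1], 1: [2], 2: [0], 3: [0, 2]}   # toy directed graph
value = {v: 0.0 for v in graph}
delta = {v: 1 - DAMPING for v in graph}       # initial DAIC deltas

for step in range(200):
    # Hypothetical switching policy: run synchronously while deltas are
    # large and dense, asynchronously once they are small and scattered.
    if sum(abs(d) for d in delta.values()) < 0.1:
        delta = async_pass(graph, value, delta, budget=32)
    else:
        delta = sync_round(graph, value, delta)
    if max(abs(d) for d in delta.values()) < EPS:
        break

print({v: round(x, 4) for v, x in value.items()})
```

The accumulative formulation is what makes the switch cheap: because each vertex's value is just the sum of the deltas it has absorbed, either engine can pick up the same (value, delta) state where the other left off.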
Although many graph processing systems have been proposed, graphs in the real world are often dynamic, and it is important to keep the results of graph computation up to date. Incremental computation has been demonstrated to be an efficient solution for updating calculated results. Recently, many incremental graph processing systems have been proposed to handle dynamic graphs in an asynchronous way, and they achieve better performance than synchronous ones. However, these solutions still suffer from sub-optimal convergence speed due to their slow propagation of vertex states that are important to convergence, and due to poor locality. To solve these problems, we propose a novel graph processing framework. It introduces a dynamic partition method to gather the important vertices for high locality, and then uses a priority-based scheduling algorithm that assigns them a higher priority for an effective processing order. In this way, it is able to reduce the number of updates and increase locality, thereby reducing the convergence time. Experimental results show that our method reduces the number of updates by 30% and the total execution time by 35%, compared with state-of-the-art systems.
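The following is a minimal sketch of the kind of priority-based scheduling the abstract describes, assuming the priority of a vertex is the magnitude of its pending delta; the seed deltas, priority metric, and names are illustrative assumptions, not the paper's exact design.

```python
import heapq

DAMPING, EPS = 0.85, 1e-9

def prioritized_update(graph, value, delta):
    # Max-heap via negated key: the vertex with the largest pending
    # |delta| (most important to convergence) is processed first.
    heap = [(-abs(d), v) for v, d in delta.items() if abs(d) >= EPS]
    heapq.heapify(heap)
    updates = 0
    while heap:
        _, v = heapq.heappop(heap)
        d, delta[v] = delta[v], 0.0
        if abs(d) < EPS:
            continue                      # stale heap entry, skip
        value[v] += d
        updates += 1
        share = DAMPING * d / max(len(graph[v]), 1)
        for u in graph[v]:
            delta[u] += share
            if abs(delta[u]) >= EPS:
                heapq.heappush(heap, (-abs(delta[u]), u))
    return updates

graph = {0: [1], 1: [2], 2: [0], 3: [0, 2]}
value = {v: 1.0 for v in graph}           # pretend previously converged
graph[3].append(1)                        # incremental change: edge 3 -> 1
delta = {v: 0.0 for v in graph}
delta[3] = 0.15                           # hypothetical seed delta
print("updates:", prioritized_update(graph, value, delta))
```

Processing large deltas first means each update carries more of the remaining error, which is why a priority order can converge in fewer total updates than FIFO scheduling.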
A hybrid pull-push computational model can provide compelling results over either single model for processing real-world graphs. The parallelism and pipelining of FPGAs make them a promising platform for processing different stages of graph computation. However, considering the limited on-chip resources and the streamlined pipeline computation, the efficiency of the hybrid model on FPGAs often suffers due to the well-known random-access behavior of graph processing. In this paper, we present a hybrid graph processing system on FPGAs which can achieve the best of both models. Our approach on FPGAs is unique and novel as follows. First, we propose to use edge blocks (consisting of edges with the same destination vertex set), which allow edges to be accessed sequentially at block granularity for locality while still preserving parallelism. Owing to the independence of blocks, in the sense that all edges in an inactive block are associated with inactive vertices, this also enables invalid blocks to be skipped, reducing redundant accesses. Second, we consider a large number of vertices and their associated edge blocks to maintain a predictable execution pipeline. We also present a technique to switch models in advance with few stalls using their state information. Our evaluation on a wide variety of graph algorithms over many real-world graphs shows that our approach achieves up to 3.69x speedup over state-of-the-art FPGA-based graph processing systems.
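As a host-side illustration of the edge-block idea (the paper's design is an FPGA pipeline, so the block size, layout, and toy computation below are illustrative assumptions only), this sketch groups edges into destination-sorted blocks and skips any block whose source vertices are all inactive:

```python
BLOCK = 4   # edges per block: an illustrative granularity

def build_edge_blocks(edges):
    # Sort edges by destination, then cut into fixed-size blocks so
    # each block can be streamed sequentially from memory.
    edges = sorted(edges, key=lambda e: e[1])
    return [edges[i:i + BLOCK] for i in range(0, len(edges), BLOCK)]

def process(blocks, active, dist):
    # One pass: a block whose source vertices are all inactive is
    # skipped wholesale, without touching its edges.
    new_active = set()
    for blk in blocks:
        if not any(src in active for src, _ in blk):
            continue                      # skip the invalid block
        for src, dst in blk:
            if src in active and dist[src] + 1 < dist[dst]:
                dist[dst] = dist[src] + 1 # push a shorter distance
                new_active.add(dst)
    return new_active

# Toy BFS-style computation over a small edge list:
edges = [(0, 1), (0, 2), (1, 3), (2, 3), (3, 4), (4, 0)]
dist = {v: float("inf") for v in range(5)}
dist[0], active = 0, {0}
blocks = build_edge_blocks(edges)
while active:
    active = process(blocks, active, dist)
print(dist)
```

The block-level activity test is the key point: one check replaces per-edge checks, so inactive regions of the graph cost almost nothing while the edges inside a live block still stream sequentially.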
Deep learning has gained tremendous success in various fields, while training deep neural networks (DNNs) is very compute-intensive, which has led to numerous deep learning frameworks that aim to offer better usability and higher performance to deep learning practitioners. TensorFlow and PyTorch are the two most popular frameworks: TensorFlow is more prominent in industry, while PyTorch is more appealing in academia. However, the two frameworks differ greatly owing to opposite design philosophies: static versus dynamic computation graphs. TensorFlow is regarded as more performance-friendly, as it has more opportunities to perform optimizations with a full view of the computation graph. However, there are also claims that PyTorch is sometimes faster than TensorFlow, which confuses end-users about the choice between them. In this paper, we carry out analytical and experimental analyses to unravel the mystery of the single-GPU training-speed comparison between TensorFlow and PyTorch. To ensure that our investigation is as comprehensive as possible, we carefully select seven popular neural networks covering computer vision, speech recognition, and natural language processing (NLP). The contributions of this work are two-fold. First, we conduct detailed benchmarking experiments on TensorFlow and PyTorch and analyze the reasons for their performance difference; this provides guidance for end-users choosing between the two frameworks. Second, we identify some key factors that affect performance, which can direct end-users to write their models more efficiently.
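The paper's exact harness is not reproduced here, but a fair single-GPU timing loop in PyTorch generally needs warm-up iterations and explicit synchronization, since CUDA kernel launches are asynchronous. A minimal sketch follows; the model, sizes, and iteration counts are arbitrary choices for illustration.

```python
import time
import torch
import torch.nn as nn

device = "cuda" if torch.cuda.is_available() else "cpu"
model = nn.Sequential(nn.Linear(1024, 4096), nn.ReLU(),
                      nn.Linear(4096, 10)).to(device)
opt = torch.optim.SGD(model.parameters(), lr=0.01)
loss_fn = nn.CrossEntropyLoss()
x = torch.randn(256, 1024, device=device)
y = torch.randint(0, 10, (256,), device=device)

def step():
    # One full training step: forward, backward, parameter update.
    opt.zero_grad()
    loss = loss_fn(model(x), y)
    loss.backward()
    opt.step()

for _ in range(10):                 # warm-up (allocator, cuDNN autotune)
    step()
if device == "cuda":
    torch.cuda.synchronize()        # drain pending asynchronous kernels
t0 = time.perf_counter()
for _ in range(100):
    step()
if device == "cuda":
    torch.cuda.synchronize()        # wait before reading the clock
print(f"{(time.perf_counter() - t0) / 100 * 1e3:.2f} ms/step")
```

Timing without the synchronize calls measures only kernel launch overhead, which is one common source of misleading cross-framework comparisons.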
Any mistaken maintenance of a complicated, distributed grid can bring unpredictable disaster. Here we focus on the system availability issues caused by service dependencies during maintenance in the grid. A novel...
Existing FPGA-based graph accelerators, typically designed for static graphs, rarely handle dynamic graphs that often involve substantial graph updates (e.g., edge/node insertion and deletion) over time. In this paper...
Due to the sparsity of RDF data, RDF storage approaches using a triple table or binary file rarely achieve both high storage efficiency and high query performance. To achieve the goal of decreasing storage space and improving the efficien...
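The paper's own approach is cut off above; purely as background on the trade-off it names, here is a minimal sketch contrasting a naive triple table (full scan per lookup) with a permutation index that spends extra space to speed up queries. All data and names are illustrative.

```python
from collections import defaultdict

# A triple table stores every RDF statement as one (s, p, o) row.
triples = [
    ("alice", "knows", "bob"),
    ("alice", "age", "30"),
    ("bob", "knows", "carol"),
]

def scan(s=None, p=None, o=None):
    # Naive triple-table query: every lookup scans the whole table.
    return [t for t in triples
            if (s is None or t[0] == s)
            and (p is None or t[1] == p)
            and (o is None or t[2] == o)]

# Permutation index (predicate-first): more storage, direct lookup.
pos_index = defaultdict(list)
for s_, p_, o_ in triples:
    pos_index[p_].append((s_, o_))

print(scan(p="knows"))          # full scan of the triple table
print(pos_index["knows"])       # indexed lookup by predicate
```

Replicating the data across several such permutations is what inflates storage, which is the tension between space and query speed that the abstract targets.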
The nonuniform memory access (NUMA) architecture has been used extensively in data centers. Most previous works used single-threaded multiprogrammed workloads to study the performance of NUMA systems, w...
Temporal Graph Neural Network (TGNN) has attracted much research attention because it can capture the dynamic nature of complex networks. However, existing solutions suffer from redundant computation overhead and exce...
With the merits of high productivity and ease of use, high-level synthesis (HLS) tools bring hope for fast FPGA-based architecture development. However, their usability and popularity are still limited due to lack of su...