The delta-based accumulative iterative computation (DAIC) model has been proposed to support iterative algorithms in either a synchronous or an asynchronous way. However, the synchronous and asynchronous DAIC models each perform well only under certain conditions and perform poorly otherwise, due to high synchronization cost or to many redundant activations, respectively. As a result, the overall performance of both DAIC models suffers from the serious network jitter and load jitter caused by multi-tenancy in the cloud. In this paper, we develop a system, namely Hyblter, to guarantee the performance of iterative algorithms under different conditions. Through an adaptive execution-model selection scheme, it can efficiently switch between the synchronous and asynchronous DAIC models to adapt to changing conditions, always getting the best performance in the cloud. Experimental results show that our approach can improve the performance of current solutions by up to 39.0%.
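To make the idea concrete, here is a minimal, self-contained sketch of delta-based accumulative iteration on PageRank, with a switch between a synchronous superstep and an asynchronous worklist pass. The switching rule, the toy graph, and all names are illustrative assumptions, not Hyblter's actual policy or API.

```python
from collections import deque

DAMPING, EPS = 0.85, 1e-9

def sync_round(graph, value, delta):
    # One synchronous superstep: every vertex with a pending delta
    # folds it into its value and scatters a damped share downstream.
    new_delta = {v: 0.0 for v in graph}
    for v, d in delta.items():
        if abs(d) < EPS:
            continue
        value[v] += d
        share = DAMPING * d / max(len(graph[v]), 1)
        for u in graph[v]:
            new_delta[u] += share
    return new_delta

def async_pass(graph, value, delta, budget):
    # Asynchronous worklist pass: propagate deltas eagerly, without a
    # global barrier, for at most `budget` vertex activations.
    work = deque(v for v, d in delta.items() if abs(d) >= EPS)
    while work and budget > 0:
        v = work.popleft()
        budget -= 1
        d, delta[v] = delta[v], 0.0
        if abs(d) < EPS:
            continue
        value[v] += d
        share = DAMPING * d / max(len(graph[v]), 1)
        for u in graph[v]:
            delta[u] += share
            work.append(u)
    return delta

graph = {0: [1], 1: [2], 2: [0], 3: [0, 2]}   # toy directed graph
value = {v: 0.0 for v in graph}
delta = {v: 1 - DAMPING for v in graph}       # initial DAIC deltas

for step in range(200):
    # Hypothetical switching policy: run synchronously while deltas are
    # large and dense, asynchronously once they are small and scattered.
    if sum(abs(d) for d in delta.values()) < 0.1:
        delta = async_pass(graph, value, delta, budget=32)
    else:
        delta = sync_round(graph, value, delta)
    if max(abs(d) for d in delta.values()) < EPS:
        break

print({v: round(x, 4) for v, x in value.items()})
```

The accumulative formulation is what makes the switch cheap: because each vertex's value is just the sum of the deltas it has absorbed, either engine can pick up the same (value, delta) state where the other left off.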
Although many graph processing systems have been proposed, graphs in the real world are often dynamic, and it is important to keep the results of graph computation up to date. Incremental computation has been demonstrated to be an efficient solution for updating calculated results. Recently, many incremental graph processing systems have been proposed to handle dynamic graphs in an asynchronous way, and they achieve better performance than synchronous ones. However, these solutions still suffer from sub-optimal convergence speed due to their slow propagation of vertex states that are important to convergence, and due to poor locality. To solve these problems, we propose a novel graph processing framework. It introduces a dynamic partition method to gather the important vertices for high locality, and then uses a priority-based scheduling algorithm that assigns them a higher priority for an effective processing order. In this way, it is able to reduce the number of updates and increase locality, thereby reducing the convergence time. Experimental results show that our method reduces the number of updates by 30% and the total execution time by 35%, compared with state-of-the-art systems.
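The following is a minimal sketch of the kind of priority-based scheduling the abstract describes, assuming the priority of a vertex is the magnitude of its pending delta; the seed deltas, priority metric, and names are illustrative assumptions, not the paper's exact design.

```python
import heapq

DAMPING, EPS = 0.85, 1e-9

def prioritized_update(graph, value, delta):
    # Max-heap via negated key: the vertex with the largest pending
    # |delta| (most important to convergence) is processed first.
    heap = [(-abs(d), v) for v, d in delta.items() if abs(d) >= EPS]
    heapq.heapify(heap)
    updates = 0
    while heap:
        _, v = heapq.heappop(heap)
        d, delta[v] = delta[v], 0.0
        if abs(d) < EPS:
            continue                      # stale heap entry, skip
        value[v] += d
        updates += 1
        share = DAMPING * d / max(len(graph[v]), 1)
        for u in graph[v]:
            delta[u] += share
            if abs(delta[u]) >= EPS:
                heapq.heappush(heap, (-abs(delta[u]), u))
    return updates

graph = {0: [1], 1: [2], 2: [0], 3: [0, 2]}
value = {v: 1.0 for v in graph}           # pretend previously converged
graph[3].append(1)                        # incremental change: edge 3 -> 1
delta = {v: 0.0 for v in graph}
delta[3] = 0.15                           # hypothetical seed delta
print("updates:", prioritized_update(graph, value, delta))
```

Processing large deltas first means each update carries more of the remaining error, which is why a priority order can converge in fewer total updates than FIFO scheduling.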
A hybrid pull-push computational model can provide compelling results over either single model for processing real-world graphs. The parallelism and pipelining of FPGAs make them a promising platform for processing different stages of graph computation. However, considering the limited on-chip resources and the streamlined pipeline computation, the efficiency of the hybrid model on FPGAs often suffers due to the well-known random-access behavior of graph processing. In this paper, we present a hybrid graph processing system on FPGAs which can achieve the best of both models. Our approach on FPGAs is unique and novel as follows. First, we propose to use edge blocks (consisting of edges with the same destination vertex set), which allow edges to be accessed sequentially at block granularity for locality while still preserving parallelism. Owing to the independence of blocks, in the sense that all edges in an inactive block are associated with inactive vertices, this also enables invalid blocks to be skipped, reducing redundant accesses. Second, we consider a large number of vertices and their associated edge blocks to maintain a predictable execution pipeline. We also present a technique to switch models in advance with few stalls using their state information. Our evaluation on a wide variety of graph algorithms over many real-world graphs shows that our approach achieves up to 3.69x speedup over state-of-the-art FPGA-based graph processing systems.
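As a host-side illustration of the edge-block idea (the paper's design is an FPGA pipeline, so the block size, layout, and toy computation below are illustrative assumptions only), this sketch groups edges into destination-sorted blocks and skips any block whose source vertices are all inactive:

```python
BLOCK = 4   # edges per block: an illustrative granularity

def build_edge_blocks(edges):
    # Sort edges by destination, then cut into fixed-size blocks so
    # each block can be streamed sequentially from memory.
    edges = sorted(edges, key=lambda e: e[1])
    return [edges[i:i + BLOCK] for i in range(0, len(edges), BLOCK)]

def process(blocks, active, dist):
    # One pass: a block whose source vertices are all inactive is
    # skipped wholesale, without touching its edges.
    new_active = set()
    for blk in blocks:
        if not any(src in active for src, _ in blk):
            continue                      # skip the invalid block
        for src, dst in blk:
            if src in active and dist[src] + 1 < dist[dst]:
                dist[dst] = dist[src] + 1 # push a shorter distance
                new_active.add(dst)
    return new_active

# Toy BFS-style computation over a small edge list:
edges = [(0, 1), (0, 2), (1, 3), (2, 3), (3, 4), (4, 0)]
dist = {v: float("inf") for v in range(5)}
dist[0], active = 0, {0}
blocks = build_edge_blocks(edges)
while active:
    active = process(blocks, active, dist)
print(dist)
```

The block-level activity test is the key point: one check replaces per-edge checks, so inactive regions of the graph cost almost nothing while the edges inside a live block still stream sequentially.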
Deep learning has gained tremendous success in various fields, while training deep neural networks (DNNs) is very compute-intensive, which has led to numerous deep learning frameworks that aim to offer better usability and higher performance to deep learning practitioners. TensorFlow and PyTorch are the two most popular frameworks: TensorFlow is more prominent in industry, while PyTorch is more appealing in academia. However, the two frameworks differ greatly owing to opposite design philosophies: static versus dynamic computation graphs. TensorFlow is regarded as more performance-friendly, as it has more opportunities to perform optimizations with a full view of the computation graph. However, there are also claims that PyTorch is sometimes faster than TensorFlow, which confuses end-users about the choice between them. In this paper, we carry out analytical and experimental analyses to unravel the mystery of the single-GPU training-speed comparison between TensorFlow and PyTorch. To ensure that our investigation is as comprehensive as possible, we carefully select seven popular neural networks covering computer vision, speech recognition, and natural language processing (NLP). The contributions of this work are two-fold. First, we conduct detailed benchmarking experiments on TensorFlow and PyTorch and analyze the reasons for their performance difference; this provides guidance for end-users choosing between the two frameworks. Second, we identify some key factors that affect performance, which can direct end-users to write their models more efficiently.
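The paper's exact harness is not reproduced here, but a fair single-GPU timing loop in PyTorch generally needs warm-up iterations and explicit synchronization, since CUDA kernel launches are asynchronous. A minimal sketch follows; the model, sizes, and iteration counts are arbitrary choices for illustration.

```python
import time
import torch
import torch.nn as nn

device = "cuda" if torch.cuda.is_available() else "cpu"
model = nn.Sequential(nn.Linear(1024, 4096), nn.ReLU(),
                      nn.Linear(4096, 10)).to(device)
opt = torch.optim.SGD(model.parameters(), lr=0.01)
loss_fn = nn.CrossEntropyLoss()
x = torch.randn(256, 1024, device=device)
y = torch.randint(0, 10, (256,), device=device)

def step():
    # One full training step: forward, backward, parameter update.
    opt.zero_grad()
    loss = loss_fn(model(x), y)
    loss.backward()
    opt.step()

for _ in range(10):                 # warm-up (allocator, cuDNN autotune)
    step()
if device == "cuda":
    torch.cuda.synchronize()        # drain pending asynchronous kernels
t0 = time.perf_counter()
for _ in range(100):
    step()
if device == "cuda":
    torch.cuda.synchronize()        # wait before reading the clock
print(f"{(time.perf_counter() - t0) / 100 * 1e3:.2f} ms/step")
```

Timing without the synchronize calls measures only kernel launch overhead, which is one common source of misleading cross-framework comparisons.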
Any mistaken maintenance of a complicated, distributed grid can bring unpredictable disaster. Here we focus on the system availability issues caused by service dependencies during maintenance in the grid. A novel...
Existing FPGA-based graph accelerators, typically designed for static graphs, rarely handle dynamic graphs that often involve substantial graph updates (e.g., edge/node insertion and deletion) over time. In this paper...
Due to the sparsity of RDF data, RDF storage approaches using a triple table or binary file rarely achieve both high storage efficiency and high query performance. To achieve the goal of decreasing storage space and improving the efficien...
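The paper's own approach is cut off above; purely as background on the trade-off it names, here is a minimal sketch contrasting a naive triple table (full scan per lookup) with a permutation index that spends extra space to speed up queries. All data and names are illustrative.

```python
from collections import defaultdict

# A triple table stores every RDF statement as one (s, p, o) row.
triples = [
    ("alice", "knows", "bob"),
    ("alice", "age", "30"),
    ("bob", "knows", "carol"),
]

def scan(s=None, p=None, o=None):
    # Naive triple-table query: every lookup scans the whole table.
    return [t for t in triples
            if (s is None or t[0] == s)
            and (p is None or t[1] == p)
            and (o is None or t[2] == o)]

# Permutation index (predicate-first): more storage, direct lookup.
pos_index = defaultdict(list)
for s_, p_, o_ in triples:
    pos_index[p_].append((s_, o_))

print(scan(p="knows"))          # full scan of the triple table
print(pos_index["knows"])       # indexed lookup by predicate
```

Replicating the data across several such permutations is what inflates storage, which is the tension between space and query speed that the abstract targets.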
The nonuniform memory access (NUMA) architecture has been used extensively in data centers. Most previous works used single-threaded multiprogrammed workloads to study the performance of NUMA systems, w...
Temporal Graph Neural Network (TGNN) has attracted much research attention because it can capture the dynamic nature of complex networks. However, existing solutions suffer from redundant computation overhead and exce...
With the merits of high productivity and ease of use, high-level synthesis (HLS) tools bring hope for fast FPGA-based architecture development. However, their usability and popularity are still limited due to lack of su...