ISBN:
(Print) 9781665481069
Distributed shared memory (DSM) systems can handle data-intensive applications and have recently been receiving more attention. A majority of existing DSM implementations are based on write-invalidation (WI) protocols, which achieve sub-optimal performance when the cache size is small. Specifically, the vast majority of invalidation messages become useless when evictions are frequent. This problem is troublesome given the scarcity of memory resources in data centers. To this end, we propose Falcon, a self-invalidation protocol that eliminates invalidation messages. It relies on per-operation timestamps to achieve the global memory order required by sequential consistency (SC). Furthermore, we conduct a comprehensive comparison of the two protocols with an emphasis on the impact of cache size. We also implement both protocols atop a recent DSM system, Grappa. The evaluation shows that the optimal protocol improves the performance of a KV database by 27% and a graph processing application by 71.4% over the vanilla cache-free scheme.
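As a rough illustration of the self-invalidation idea, the C++ sketch below caches values under a fixed-length lease derived from a per-operation logical clock: when the lease expires, the reader silently discards its copy and re-fetches from the home node, so no invalidation messages are needed. The class names, the lease policy, and the toy home-node store are illustrative assumptions, not Falcon's actual design.

```cpp
// A minimal single-node sketch of timestamp-driven self-invalidation.
// SelfInvCache, kLease, and the home-node callbacks are illustrative
// assumptions, not the protocol's real interface.
#include <cstdint>
#include <iostream>
#include <unordered_map>

struct CacheLine {
    uint64_t value;
    uint64_t lease_until; // logical timestamp after which the copy is stale
};

class SelfInvCache {
    std::unordered_map<uint64_t, CacheLine> lines_; // addr -> cached copy
    uint64_t now_ = 0;                              // per-operation logical clock
    static constexpr uint64_t kLease = 8;           // assumed fixed lease length

public:
    // Every operation advances the logical clock, giving the global order
    // that sequential consistency requires.
    uint64_t read(uint64_t addr, uint64_t (*fetch_home)(uint64_t)) {
        ++now_;
        auto it = lines_.find(addr);
        if (it != lines_.end() && it->second.lease_until >= now_)
            return it->second.value;          // lease still valid: local hit
        // Lease expired: self-invalidate and re-fetch from the home node.
        // No invalidation message was ever needed.
        uint64_t v = fetch_home(addr);
        lines_[addr] = {v, now_ + kLease};
        return v;
    }

    void write(uint64_t addr, uint64_t v, void (*write_home)(uint64_t, uint64_t)) {
        ++now_;
        write_home(addr, v);                  // writes go to the home node
        lines_[addr] = {v, now_ + kLease};    // refresh the local copy's lease
    }
};

// Toy "home node" backing store for the sketch.
static std::unordered_map<uint64_t, uint64_t> home;
static uint64_t fetch(uint64_t a) { return home[a]; }
static void store(uint64_t a, uint64_t v) { home[a] = v; }

int main() {
    SelfInvCache c;
    c.write(0x10, 42, store);
    std::cout << c.read(0x10, fetch) << "\n"; // 42, served from the local copy
}
```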
ISBN:
(Print) 9781665440660
The Euler tour technique is a classical tool for designing parallel graph algorithms, originally proposed for the PRAM model. We ask whether it can be adapted to run efficiently on GPUs. We focus on two established applications of the technique: (1) finding lowest common ancestors (LCA) of pairs of nodes in trees, and (2) finding bridges in undirected graphs. In our experiments, we compare theoretically optimal algorithms based on the Euler tour technique against simpler heuristics that are expected to perform particularly well on typical instances. We show that the Euler tour-based algorithms not only fulfill their theoretical promises and outperform the practical heuristics on hard instances, but also perform on par with them on easy instances.
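To make the technique concrete, here is a small sequential C++ sketch of the classical Euler tour construction the paper builds on: each undirected tree edge becomes two directed arcs, and the successor of arc (u,v) is (v,w), where w is the neighbor that follows u in v's cyclic adjacency list. A GPU implementation would compute all successors in parallel and rank the resulting linked list; the sequential walk and all names here are only illustrative.

```cpp
// Sequential Euler tour construction on a small tree.
#include <cstdio>
#include <map>
#include <utility>
#include <vector>

int main() {
    // Tree edges: 0-1, 0-2, 1-3 (undirected), stored as adjacency lists.
    std::vector<std::vector<int>> adj = {{1, 2}, {0, 3}, {0}, {1}};

    // pos[{v,u}] = index of u in adj[v], so we can find "the neighbor after u".
    std::map<std::pair<int, int>, int> pos;
    for (int v = 0; v < (int)adj.size(); ++v)
        for (int i = 0; i < (int)adj[v].size(); ++i)
            pos[{v, adj[v][i]}] = i;

    // successor(u,v) = (v, next neighbor of v after u, cyclically).
    auto succ = [&](std::pair<int, int> arc) {
        auto [u, v] = arc;
        int i = (pos[{v, u}] + 1) % (int)adj[v].size();
        return std::make_pair(v, adj[v][i]);
    };

    // Following the successor pointers from arc (0,1) visits every
    // directed arc exactly once: that walk is the Euler tour.
    std::pair<int, int> arc = {0, 1};
    for (int k = 0; k < 2 * 3; ++k) { // 2 * (#edges) arcs in total
        std::printf("(%d,%d) ", arc.first, arc.second);
        arc = succ(arc);
    }
    std::printf("\n"); // (0,1) (1,3) (3,1) (1,0) (0,2) (2,0)
}
```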
ISBN:
(Print) 9781665435772
With parallel and distributed computing (PDC) now widespread, modern computing programs must incorporate PDC within the curriculum. The ACM and IEEE Computer Society's Computer Science curricular guidelines have recommended exposure to PDC concepts since 2013. More recently, a variety of initiatives have made PDC curricular content, lectures, and labs freely available to undergraduate computer science programs. Despite these efforts, progress in ensuring that computer science students graduate with sufficient PDC exposure has been uneven. This paper discusses the impact of ABET's revised criteria, which have required exposure to PDC for the accreditation of computer science programs since 2018. The authors reviewed 20 top ABET-accredited computer science programs and analyzed how they covered the required PDC components in their curricula. Using their own institutions as case studies, the authors examine in detail how three different ABET-accredited computer science programs covered PDC using different approaches while still meeting the PDC requirements of the ABET criteria. The paper also shows how the ACM/IEEE Computer Society curricular guidelines for computer engineering and software engineering programs, along with ABET accreditation criteria, can cover PDC.
ISBN:
(Digital) 9798350371284
ISBN:
(Print) 9798350371291
Distributed deep learning framework tools should aim at high efficiency in the training and inference of distributed exascale deep learning algorithms. There are three major challenges in this endeavor: scalability, adaptivity, and efficiency. Any future framework will need to adapt to a variety of heterogeneous hardware and network environments, and will thus be required to scale from a single compute node up to large clusters. Further, it should integrate efficiently with popular frameworks such as TensorFlow, PyTorch, etc. This paper proposes a dynamic hybrid (hierarchical) distribution structure for distributed deep learning that takes advantage of flexible synchronization on both centralized and decentralized architectures, implementing multi-level fine-grained parallelism on distributed platforms. It is scalable as the number of compute nodes increases, and can also adapt to various compute capabilities, memory structures, and communication costs.
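The following toy C++ sketch illustrates the general shape of such a hybrid scheme: gradients are averaged synchronously within each group of workers (the centralized level), and group leaders then average among themselves (the decentralized level). The group layout, the scalar "gradients", and the two-level averaging are assumptions made for illustration; they are not the paper's actual implementation.

```cpp
// Two-level hierarchical gradient averaging, sketched with scalars.
#include <iostream>
#include <numeric>
#include <vector>

// Average a set of per-worker gradients (one double per worker, for brevity).
static double average(const std::vector<double>& g) {
    return std::accumulate(g.begin(), g.end(), 0.0) / g.size();
}

int main() {
    // Two groups of workers, e.g. two machines with several GPUs each.
    std::vector<std::vector<double>> groups = {{1.0, 3.0}, {5.0, 7.0}};

    // Level 1 (centralized within a group): synchronous intra-group reduce.
    std::vector<double> leader_grad;
    for (const auto& g : groups) leader_grad.push_back(average(g));

    // Level 2 (decentralized across groups): leaders average among
    // themselves. In a real system this step could be asynchronous gossip;
    // here it is a single all-to-all average for clarity.
    double global = average(leader_grad);

    std::cout << "global gradient = " << global << "\n"; // 4.0
}
```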
ISBN:
(Print) 9781728168760
Cache partitioning in tile-based CMP architectures is a challenging problem because of i) the need to determine capacity allocations with low computational overhead and ii) the need to place allocations close to where they are used, in order to reduce access latency. Although previous solutions have addressed the problems of reducing computational overhead and incorporating locality-awareness, they suffer from the overheads of centrally determining allocations. In this paper, we propose DELTA, a novel distributed and locality-aware cache partitioning solution that works by exchanging asynchronous challenges among cores. The distributed nature of the algorithm, coupled with its low computational complexity, allows for frequent reconfigurations at negligible cost and enables the scheme to be implemented directly in hardware. The allocation algorithm is supported by an enforcement mechanism that enables locality-aware placement of data. We evaluate DELTA on 16- and 64-core tiled CMPs with multi-programmed workloads. Our evaluation shows that DELTA improves performance by 9% and 16%, respectively, on average, compared to an unpartitioned shared last-level cache.
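A minimal software sketch of challenge-based capacity exchange in this spirit is shown below: a core challenges a victim for one cache way, and the way changes hands only if the challenger's marginal utility exceeds the victim's. The diminishing-returns utility model and all names are assumptions for illustration, not DELTA's hardware algorithm.

```cpp
// Challenge-based exchange of cache ways between two cores.
#include <cstdio>
#include <vector>

struct Core {
    int ways; // current allocation in the shared cache
    // Marginal utility of gaining (or cost of losing) one way, assumed to
    // shrink as the allocation grows (diminishing returns).
    double marginal_utility() const { return 100.0 / (ways + 1); }
};

// One asynchronous challenge: the challenger asks the victim for one way.
static bool challenge(Core& challenger, Core& victim) {
    if (victim.ways == 0) return false;
    // Transfer only if the challenger gains more than the victim loses.
    if (challenger.marginal_utility() > victim.marginal_utility()) {
        --victim.ways;
        ++challenger.ways;
        return true;
    }
    return false;
}

int main() {
    std::vector<Core> cores = {{2}, {10}}; // unbalanced initial partition
    // Repeated local challenges converge toward a balanced partition
    // without any central allocator.
    while (challenge(cores[0], cores[1])) {}
    std::printf("core0=%d ways, core1=%d ways\n", cores[0].ways, cores[1].ways);
    // Prints: core0=6 ways, core1=6 ways
}
```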
ISBN:
(Print) 9781665422352
Many graph processing systems have recently been developed for many-core processors. However, for iterative graph processing, due to the dependencies between vertices' states, the propagations of vertices' new states are inherently conducted along graph paths sequentially and are also dependent on each other. Despite years of research effort, existing solutions still severely underutilize many-core processors in quickly propagating the new states of vertices, and suffer from slow convergence. In this paper, we propose a dependency-driven programmable accelerator, DepGraph, which couples with the core architecture of the many-core processor and can fundamentally alleviate the challenge of dependencies, enabling faster state propagation. Specifically, we propose an effective dependency-driven asynchronous execution approach realized in novel microarchitecture designs. DepGraph prefetches vertices for the core on the fly along the dependency chains between their states and the active vertices' new states, aiming to effectively accelerate the propagation of the active vertices' new states and also ensure better data locality. By transforming the dependency chains along frequently-used paths into direct ones at runtime and maintaining these calculated direct dependencies as a set of fast shortcuts, called the hub index, DepGraph further accelerates most state propagations. Also, many propagations do not need to wait for the completion of other propagations, which enables more propagations to be conducted along the paths with a higher degree of parallelism. The experimental results show that for iterative graph processing on a simulated 64-core processor, a cutting-edge software graph processing system achieves a 5.0-22.7 times speedup after integrating with our DepGraph, while incurring only 0.6% area cost. In comparison with three state-of-the-art hardware solutions, i.e., HATS, Minnow, and PHI, DepGraph improves the performan...
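The hub-index idea can be illustrated in software: a frequently traversed dependency chain is collapsed into one direct shortcut with a precomputed combined effect, so a new state reaches the chain's tail in a single step instead of hopping through every intermediate vertex. The additive state update and all names in the C++ sketch below are illustrative assumptions, not DepGraph's microarchitecture.

```cpp
// Hub-index shortcut vs. hop-by-hop state propagation along a chain.
#include <cstdio>
#include <map>
#include <utility>
#include <vector>

int main() {
    // Dependency chain 0 -> 1 -> 2 -> 3 with a per-edge increment on the
    // propagated state: next[v] = (destination, increment).
    std::vector<std::pair<int, int>> next = {{1, 5}, {2, 7}, {3, 2}};

    // Hub index: a shortcut from the chain's head straight to its tail,
    // with the precomputed combined effect of the whole chain (5+7+2 = 14).
    std::map<int, std::pair<int, int>> hub = {{0, {3, 14}}};

    int src_state = 10;

    // Slow path: hop vertex by vertex along the dependency chain.
    int v = 0, state = src_state;
    while (v < (int)next.size()) {
        state += next[v].second;
        v = next[v].first;
    }
    std::printf("chain walk:   vertex %d gets state %d\n", v, state);

    // Fast path: one hub-index lookup yields the same result in one step.
    auto [tail, delta] = hub.at(0);
    std::printf("hub shortcut: vertex %d gets state %d\n", tail, src_state + delta);
}
```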