In this paper, we propose a new scheduling method that can simultaneously achieve two main goals of task scheduling in distributedparallel systems;i.e., to minimize the execution time of a parallel job without distur...
详细信息
ISBN:
(纸本)0769511538
In this paper, we propose a new scheduling method that can simultaneously achieve two main goals of task scheduling in distributedparallel systems;i.e., to minimize the execution time of a parallel job without disturbing the execution of other jobs. We challenge to achieve those goals by introducing a new scheduler, called active scheduler, that controls the priority of parallel programs dynamically and balances the workload of computers, depending on the current status of runtime environment. Priority of parallel programs is controlled by controlling the concurrency of the programs, We implemented a prototype system to evaluate the effectiveness of active scheduler. the results of experiments imply that the overhead of introducing active scheduler is bounded by 15 % of the original execution time, and it is in fact effective to adjust the execution of parallel programs to an actual distributedparallelprocessing environment in which many users execute their jobs at the same time.
Connected component identification is a fundamental problem in graph analytics, serving as a basis for subsequent computations in a wide range of applications. To determine connectivity, several parallel algorithms, w...
详细信息
ISBN:
(纸本)9781538643686
Connected component identification is a fundamental problem in graph analytics, serving as a basis for subsequent computations in a wide range of applications. To determine connectivity, several parallel algorithms, whose complexity is proportional to the number of edges or graph diameter, have been proposed. However, an optimal algorithm may extract graph components by working proportionally to the number of vertices, which can be orders of magnitude lower than the number of edges. We propose Afforest: an extension of the Shiloach-Vishkin connected components algorithm that approaches optimal work efficiency by processing subgraphs in each iteration. We prove the convergence of the algorithm, analyze its work efficiency characteristics, and provide further techniques to speed up processing graphs containing a huge component. Designed with modern parallel architectures in mind, we show that the algorithm exhibits higher memory locality than existing methods. Using both synthetic and real-world graphs, we demonstrate that Afforest achieves speedups of up to 67x over the state-of-the-art on multi-core CPUs (Broadwell, POWER8) and up to 23x on GPUs (Pascal).
We propose a new coding scheme for parallel binary erasure channels that may have time-varying characteristics. the proposed scheme consists of several spatially coupled codes, each assigned to a different channel, th...
详细信息
this paper presents an algorithmic approach for improving the performance of many types of stochastic dynamical simulations. the approach is to redesign existing algorithms that use sparse matrix-vector products (SPMV...
详细信息
ISBN:
(纸本)9780769546759
this paper presents an algorithmic approach for improving the performance of many types of stochastic dynamical simulations. the approach is to redesign existing algorithms that use sparse matrix-vector products (SPMV) with single vectors to instead use a more efficient kernel, the generalized SPMV (GSPMV), which computes with multiple vectors simultaneously. In this paper, we show how to redesign a dynamical simulation to exploit GSPMV in way that is not initially obvious because only one vector is available at a time. We study the performance of GSPMV as a function of the number of vectors, and demonstrate the use of GSPMV in the Stokesian dynamics method for the simulation of the motion of macromolecules in the cell. Specifically, for our application, we find that with modern multicore Intel microprocessors in clusters of up to 64 nodes, we can typically multiply by 8 to 16 vectors in only twice the time required to multiply by a single vector. After redesigning the Stokesian dynamics algorithm to exploit GSPMV, we measure a 30 percent speedup in performance in single-node, data parallel simulations.
the proceedings contain 54 papers. the topics discussed include: an experimental study of diversity with off-the-shelf antivirus engines;simulating fixed virtual nodes for adapting wireline protocols to MANET;seed sch...
ISBN:
(纸本)9780769536989
the proceedings contain 54 papers. the topics discussed include: an experimental study of diversity with off-the-shelf antivirus engines;simulating fixed virtual nodes for adapting wireline protocols to MANET;seed scheduling for peer-to-peer networks;proximity-aware distributed mutual exclusion for effective peer-to-peer replica management;towards improved overlay simulation using realistic topologies;analysis of round-robin implementations of processor sharing, including overhead;comparison of price-based static and dynamic job allocation schemes for grid computing systems;a rule based co-operative approach for cell selection in high speed cellular networks;sharing private information across distributed databases;energy-aware prefetching for parallel disk systems;attribute-based prevention of phishing attacks;TTM based security enhancement for inter-domain routing protocol;and introducing probability in RFID reader-to-reader anti-collision.
Reconfiguring a faulty hypercube into a maximal incomplete cube tends to lower potential performance degradation, because a hypercube so reconfigured often results in a much larger system that what is attained by any ...
详细信息
ISBN:
(纸本)0818656026
Reconfiguring a faulty hypercube into a maximal incomplete cube tends to lower potential performance degradation, because a hypercube so reconfigured often results in a much larger system that what is attained by any conventional reconfiguration scheme which identifies only complete subcubes. this paper proposes an efficient strategy for identifying all the maximal incomplete subcubes present in a faulty hypercube. the proposed strategy is distributed in that every healthy node executes the same identification algorithm independently at the same time.
this paper presents and evaluates a parallel Java implementation of the Finite-Difference Time-Domain (FDTD) method, which is a widely used numerical technique in computational electrodynamics. the Java version is par...
详细信息
ISBN:
(纸本)9781424416936
this paper presents and evaluates a parallel Java implementation of the Finite-Difference Time-Domain (FDTD) method, which is a widely used numerical technique in computational electrodynamics. the Java version is parallelized using MPJ Express-a thread-safe messaging library. MPJ Express provides a full implementation of the mpiJava 1.2 API specification. this specification defines a MPI-like binding for the Java language. this paper describes our experiences of implementing the Java version of the FDTD method. Towards the end of this paper, we evaluate and compare the performance of the Java version against its C counterpart on a 32 processing core Linux cluster of eight compute nodes.
this paper presents an improved and updated taxonomy for Time Warp based distributed synchronization protocols. this taxonomy aims to allow the grouping of several optmistic distributed simulation synchronization prot...
详细信息
ISBN:
(纸本)0769522327
this paper presents an improved and updated taxonomy for Time Warp based distributed synchronization protocols. this taxonomy aims to allow the grouping of several optmistic distributed simulation synchronization protocols, withthe objective to facilitate the task to decide which protocol is better for a specific simulation.
this paper provides necessary and sufficient conditions for deadlock-free unicast and multicast routing withthe path-based routing model in interconnection networks which use the wormhole switching technique. the the...
详细信息
this paper provides necessary and sufficient conditions for deadlock-free unicast and multicast routing withthe path-based routing model in interconnection networks which use the wormhole switching technique. the theory is developed around three central concepts: channel waiting, False Resource Cycles, and valid destination sets. the first two concepts are suitable extensions to those developed for unicast routing by two authors of this paper;the third concept has been developed by Lin and Ni. the necessary and sufficient conditions can be applied in a straightforward manner to prove deadlock freedom and to find more adaptive routing algorithms for collective communication. the latter point is illustrated by developing two routing algorithms for multicast communication in 2-D mesh architectures. the first algorithm uses fewer resources (channels) than an algorithm proposed in the literature but achieves the same adaptivity. the second achieves full adaptivity for multicast routing.
We present a parallel treecode for fast kernel summation in high dimensions-a common problem in data analysis and computational statistics. Fast kernel summations can be viewed as approximation schemes for dense kerne...
详细信息
ISBN:
(纸本)9781479986484
We present a parallel treecode for fast kernel summation in high dimensions-a common problem in data analysis and computational statistics. Fast kernel summations can be viewed as approximation schemes for dense kernel matrices. Treecode algorithms (or simply treecodes) construct low-rank approximations of certain off-diagonal blocks of the kernel matrix. these blocks are identified withthe help of spatial data structures, typically trees. there is extensive work on treecodes and their parallelization for kernel summations in three dimensions, but there is little work on high-dimensional problems. Recently, we introduced a novel treecode, ASKIT, which resolves most of the shortcomings of existing methods. We introduce novel parallel algorithms for ASKIT, derive complexity estimates, and demonstrate scalability on synthetic, scientific, and image datasets. In particular, we introduce a local essential tree construction that extends to arbitrary dimensions in a scalable manner. We introduce data transformations for memory locality and use GPU acceleration. We report results on the "Maverick" and "Stampede" systems at the Texas Advanced Computing Center. Our largest computations involve two billion points in 64 dimensions on 32,768 x86 cores and 8 million points in 784 dimensions on 16,384 x86 cores.
暂无评论