ISBN: 9780897919890 (print)
This paper envisions an extension to a standard instruction set that efficiently implements PRAM-style algorithms using explicit multi-threaded instruction-level parallelism (ILP). That is, it introduces Explicit Multi-Threading (XMT), a fine-grained computational paradigm covering the spectrum from algorithms through architecture to implementation, adding new elements where needed.
This paper analyzes job scheduling for parallel computers by theoretical and experimental means. Based on existing architectures, we first present a machine and a job model. Then, we propose a simple on-line algorithm employing job preemption without migration and derive theoretical bounds for the algorithm's performance. The algorithm is experimentally evaluated with trace data from a large computing facility. These experiments show that the algorithm is highly sensitive to parameter selection and that substantial performance improvements over existing (non-preemptive) scheduling methods are possible.
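The abstract leaves the algorithm's details to the paper itself; the sketch below is a purely hypothetical illustration of what preemption without migration can look like, where a preempted job is suspended and later resumed with the same processor requirement rather than being moved or resized. The `Job` fields, the round-robin quantum, and the greedy packing are assumptions made for this example, not the paper's method.

```python
from dataclasses import dataclass
from collections import deque

@dataclass
class Job:
    name: str
    procs: int          # processors the job needs (fixed at submission, never resized)
    remaining: float    # remaining work, in time units

class PreemptiveScheduler:
    """Illustrative preemption-without-migration scheduler (round-robin).

    A preempted job is only suspended; when it runs again it still occupies
    the same number of processors, so no migration or resizing happens.
    """
    def __init__(self, num_procs: int, quantum: float = 1.0):
        self.num_procs = num_procs
        self.quantum = quantum
        self.ready: deque = deque()

    def submit(self, job: Job) -> None:
        self.ready.append(job)

    def run(self) -> None:
        time = 0.0
        while self.ready:
            # Greedily pick queued jobs for this quantum until processors run out.
            running, free = [], self.num_procs
            for _ in range(len(self.ready)):
                job = self.ready.popleft()
                if job.procs <= free:
                    running.append(job)
                    free -= job.procs
                else:
                    self.ready.append(job)   # stays queued unchanged
            # Advance one quantum; unfinished jobs are preempted and requeued.
            time += self.quantum
            for job in running:
                job.remaining -= self.quantum
                if job.remaining > 1e-9:
                    self.ready.append(job)
                else:
                    print(f"{job.name} finished at t={time}")

sched = PreemptiveScheduler(num_procs=4)
sched.submit(Job("A", procs=2, remaining=3.0))
sched.submit(Job("B", procs=3, remaining=2.0))
sched.submit(Job("C", procs=1, remaining=1.0))
sched.run()
```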
ISBN: 9781450369350 (print)
The proceedings contain 68 papers. The topics discussed include: faster deterministic all pairs shortest paths in the CONGEST model; contention resolution with message deadlines; memory tagging: minimalist synchronization for scalable concurrent data structures; work-efficient batch-incremental minimum spanning trees with applications to the sliding-window model; closing the gap between cache-oblivious and cache-adaptive analysis; optimal resource allocation for elastic and inelastic jobs; optimal parallel algorithms in the binary-forking model; randomized incremental convex hull is highly parallel; almost universal anonymous rendezvous in the plane; the online multi-commodity facility location problem; unconditional lower bounds for adaptive massively parallel computation; on the hardness of massively parallel computation; and graph sparsification for derandomizing massively parallel computation with low space.
This paper discusses a parallel implementation of the ε-relaxation algorithm for the min-cost flow problem on the CM-5. There is considerable loss in efficiency in going from a one-processor sequential implementation to...
ISBN: 9780897918909 (print)
It is well known that simple randomized load balancing schemes can balance load effectively while incurring only a small overhead, making such schemes appealing for practical systems. In this paper, we provide new analyses for several such dynamic randomized load balancing schemes. Unlike previous analyses, we do not assume that in equilibrium each server is stochastically independent from other servers. Our work extends a previous analysis of the supermarket model, a model that abstracts a simple, efficient load balancing scheme in the setting where jobs arrive at a large system of parallel processors. In this model, customers arrive at a system of n servers as a Poisson stream of rate λn, with λ < 1.
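For a concrete picture of the supermarket model described above, the following sketch (an illustration under assumed parameter values, not the paper's analysis) simulates n servers with Poisson arrivals of rate λn; each arriving customer samples d servers uniformly at random and joins the shortest of their queues, and service times are exponential with rate 1. The function name and all constants are invented for the example.

```python
import heapq
import random

def simulate_supermarket(n=100, lam=0.9, d=2, horizon=200.0, seed=0):
    """Simulate the supermarket model: Poisson(lam*n) arrivals, each customer
    joins the shortest of d randomly chosen FIFO queues; service rate 1."""
    rng = random.Random(seed)
    queues = [0] * n                    # current queue lengths
    events = []                         # (time, kind, server) min-heap
    t = 0.0
    # Pre-generate the arrival process as a Poisson stream of rate lam*n.
    while t < horizon:
        t += rng.expovariate(lam * n)
        heapq.heappush(events, (t, "arrival", -1))
    max_len = 0
    while events:
        t, kind, server = heapq.heappop(events)
        if kind == "arrival":
            # Power of d choices: sample d servers, join the least loaded.
            choices = [rng.randrange(n) for _ in range(d)]
            server = min(choices, key=lambda s: queues[s])
            queues[server] += 1
            if queues[server] == 1:     # server was idle, start service now
                heapq.heappush(events, (t + rng.expovariate(1.0), "departure", server))
        else:                           # departure
            queues[server] -= 1
            if queues[server] > 0:      # next customer in line starts service
                heapq.heappush(events, (t + rng.expovariate(1.0), "departure", server))
        max_len = max(max_len, max(queues))
    return max_len

print("max queue length, d=1:", simulate_supermarket(d=1))
print("max queue length, d=2:", simulate_supermarket(d=2))
```

Running it with d=1 versus d=2 gives a quick feel for why two choices help: the maximum queue length typically drops sharply when each customer may compare two servers.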
ISBN: 9798400704161 (print)
Parallel Reinforcement Learning (RL) frameworks are essential for mapping RL workloads to multiple computational resources, allowing for faster generation of samples, estimation of values, and policy improvement. These computational paradigms require a seamless integration of training, serving, and simulation workloads. Existing frameworks, such as Ray, do not manage this orchestration efficiently, especially in RL tasks that demand intensive input/output and synchronization between actors on a single node. In this study, we propose a solution implementing the reactor model, which enforces a set of actors to have a fixed communication pattern. This allows the scheduler to eliminate the work needed for synchronization, such as acquiring and releasing locks for each actor or sending and processing coordination-related messages. Our framework, Lingua Franca (LF), a coordination language based on the reactor model, also supports true parallelism in Python and provides a unified interface that allows users to automatically generate dataflow graphs for RL tasks. Compared to Ray on a single-node multi-core compute platform, LF achieves 1.21x and 11.62x higher simulation throughput in OpenAI Gym and Atari environments, reduces the average training time of synchronized parallel Q-learning by 31.2%, and accelerates multi-agent RL inference by 5.12x.
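Lingua Franca has its own reactor syntax and Python bindings, which are not reproduced here; the plain-Python sketch below only illustrates the underlying idea the abstract describes, namely actors whose communication pattern is fixed up front so a scheduler can execute them without per-message locks or coordination messages. The `Reactor` class, the pipeline stages, and the round-based scheduler are assumptions made for illustration, not the LF API.

```python
from __future__ import annotations
from dataclasses import dataclass, field
from typing import Callable

@dataclass
class Reactor:
    """A reactor-style actor: its reaction and its downstream wiring are
    declared once, so the communication graph is known statically."""
    name: str
    reaction: Callable[[list], list]          # consumes inputs, produces outputs
    downstream: list[Reactor] = field(default_factory=list)
    inbox: list = field(default_factory=list)

    def fire(self) -> None:
        outputs = self.reaction(self.inbox)
        self.inbox = []
        for r in self.downstream:             # fixed fan-out, no locking needed
            r.inbox.extend(outputs)

def run_rounds(schedule: list[Reactor], rounds: int) -> None:
    """Execute reactors in a fixed, topologically ordered schedule."""
    for _ in range(rounds):
        for reactor in schedule:
            reactor.fire()

# Hypothetical RL-style pipeline: simulator -> learner -> policy server.
simulator = Reactor("simulator", lambda _: [("obs", 0.0)])
learner = Reactor("learner", lambda msgs: [("weights", len(msgs))])
server = Reactor("server", lambda msgs: print("serving", msgs) or [])

simulator.downstream = [learner]
learner.downstream = [server]
run_rounds([simulator, learner, server], rounds=3)
```

Because the wiring never changes at run time, a scheduler can decide once, ahead of execution, which reactions may run in parallel; that is the property the abstract credits for avoiding per-actor locking.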
This paper concerns the problem of how to exploit parallelism during the phases of compilation involving syntax-directed analysis and translation. In particular, we address the problem of how to exploit parallelism du...
Let α(G) denote the independence number of a graph G, that is, the maximum number of pairwise independent vertices in G. We present a parallel algorithm that computes, in a planar graph G = (V, E), an independent set I ⊂...
ISBN: 9798400704161 (print)
In a breakthrough result, Spielman and Teng (2004) developed a nearly-linear time solver for Laplacian linear equations, i.e. equations where the coefficient matrix is symmetric with non-negative diagonals and zero row sums. Since the development of the Spielman-Teng solver, there has been substantial progress simplifying and improving their result, but obtaining a fast, practical, parallel Laplacian solver remains an open problem. We present a framework for obtaining extremely simple, parallel Laplacian linear equation solvers with nearly-linear work and sub-linear depth. Our framework allows us to parallelize any Laplacian solver based on repeated single-vertex approximate Gaussian elimination. We demonstrate this by parallelizing both the algorithm of Kyng and Sachdeva (2016) and the practical variant by Gao, Kyng, and Spielman (2023). Our framework is work-efficient in the sense of matching the sequential work of these algorithms. Our parallelization framework is very simple: we sample a subset of the current low-degree vertices (sparse columns), and in parallel we eliminate all vertices that are isolated in the resulting induced subgraph. This approach can be combined with any parallelizable approximate single-vertex elimination subroutine with sparse output. Given the simplicity of the approach, we believe that using it to parallelize the solver of Gao, Kyng, and Spielman (2023) is the most promising direction for obtaining practical parallel Laplacian solvers. If we additionally use a parallel spectral sparsification routine, our approach can be modified to work in polylogarithmic depth and nearly-linear work.
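The sampling step described above can be pictured structurally with the toy sketch below. It only tracks graph structure and eliminates each selected vertex by exact clique fill-in, whereas the actual solvers perform sparse approximate elimination and carry the numerical values; the function names, degree bound, and sampling probability are assumptions made for the example.

```python
import random

def pick_parallel_elimination_round(adj, degree_bound=8, sample_prob=0.25, seed=0):
    """One round of the sampling step sketched in the abstract.

    adj: dict mapping vertex -> set of neighbours (undirected graph).
    Returns a set of vertices that can all be eliminated in parallel:
    sampled low-degree vertices with no sampled neighbour, so their
    eliminations touch disjoint parts of the matrix.
    """
    rng = random.Random(seed)
    low_degree = [v for v, nbrs in adj.items() if len(nbrs) <= degree_bound]
    sampled = {v for v in low_degree if rng.random() < sample_prob}
    # A sampled vertex is "isolated in the induced subgraph" if none of its
    # neighbours was sampled; such vertices can be eliminated independently.
    return {v for v in sampled if not (adj[v] & sampled)}

def eliminate(adj, vertices):
    """Structurally eliminate vertices: connect each one's neighbours into a
    clique (the fill-in of exact Gaussian elimination), then remove it.
    Real solvers replace the clique with a sparse approximation."""
    for v in vertices:
        nbrs = adj.pop(v)
        for u in nbrs:
            adj[u].discard(v)
            adj[u] |= (nbrs - {u})
    return adj

# Toy example: a path graph 0-1-2-3-4-5.
adj = {i: set() for i in range(6)}
for i in range(5):
    adj[i].add(i + 1)
    adj[i + 1].add(i)

round1 = pick_parallel_elimination_round(adj, sample_prob=0.5)
print("eliminate in parallel:", round1)
eliminate(adj, round1)
print("remaining graph:", adj)
```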
ISBN: 9781605586069 (print)
Cache-oblivious algorithms have the advantage of achieving good sequential cache complexity across all levels of a multi-level cache hierarchy, regardless of the specifics (cache size and cache line size) of each level. In this paper, we describe cache-oblivious sorting algorithms with optimal work, optimal cache complexity and polylogarithmic depth. Using known mappings, these lead to low cache complexities on shared-memory multiprocessors with a single level of private caches or a single shared cache. Moreover, the low cache complexities extend to shared-memory multiprocessors with common configurations of multi-level caches. The key factor in the low cache complexity on multiprocessors is the low depth of the algorithms we propose.
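The paper's sorting algorithms are considerably more involved; the recursive mergesort below is only a generic illustration of the cache-oblivious principle (no cache size or line size appears anywhere in the code, so subproblems automatically fit each cache level once they are small enough), and it achieves neither the optimal cache complexity nor the polylogarithmic depth claimed above.

```python
def cache_oblivious_mergesort(a):
    """Plain recursive mergesort as a cache-oblivious illustration: the code
    never mentions cache or line sizes, yet the recursion eventually produces
    subproblems that fit entirely in every level of the cache hierarchy.
    Sequential and non-optimal; only a sketch of the principle."""
    if len(a) <= 1:
        return a
    mid = len(a) // 2
    left = cache_oblivious_mergesort(a[:mid])
    right = cache_oblivious_mergesort(a[mid:])
    # Merge the two sorted halves.
    out, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        if left[i] <= right[j]:
            out.append(left[i]); i += 1
        else:
            out.append(right[j]); j += 1
    out.extend(left[i:])
    out.extend(right[j:])
    return out

print(cache_oblivious_mergesort([5, 3, 8, 1, 9, 2]))
```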