parallel Reinforcement Learning (RL) frameworks are essential for mapping RL workloads to multiple computational resources, allowing for faster generation of samples, estimation of values, and policy improvement. Thes...
详细信息
ISBN:
(纸本)9798400704161
parallel Reinforcement Learning (RL) frameworks are essential for mapping RL workloads to multiple computational resources, allowing for faster generation of samples, estimation of values, and policy improvement. These computational paradigms require a seamless integration of training, serving, and simulation workloads. Existing frameworks, such as Ray, are not managing this orchestration efficiently, especially in RL tasks that demand intensive input/output and synchronization between actors on a single node. In this study, we have proposed a solution implementing the reactor model, which enforces a set of actors to have a fixed communication pattern. This allows the scheduler to eliminate work needed for synchronization, such as acquiring and releasing locks for each actor or sending and processing coordination-related messages. Our framework, Lingua Franca (LF), a coordination language based on the reactor model, also supports true parallelism in Python and provides a unified interface that allows users to automatically generate dataflow graphs for RL tasks. In comparison to Ray on a single-node multi-core compute platform, LF achieves 1.21x and 11.62x higher simulation throughput in OpenAI Gym and Atari environments, reduces the average training time of synchronized parallel Q-learning by 31.2%, and accelerates multi-agent RL inference by 5.12x.
Let (G) denote the independence number of a graph G, that is the maximum number of pairwise independent vertices in G. We present a parallel algorithm that computes in a planar graph G = (V, E), an independent set I ⊂...
详细信息
We show an improved parallel algorithm for decomposing an undirected unweighted graph into small diameter pieces with a small fraction of the edges in between. These decompositions form critical subroutines in a numbe...
详细信息
ISBN:
(纸本)9781450315722
We show an improved parallel algorithm for decomposing an undirected unweighted graph into small diameter pieces with a small fraction of the edges in between. These decompositions form critical subroutines in a number of graph algorithms. Our algorithm builds upon the shifted shortest path approach introduced in [Blelloch, Gupta, Koutis, Miller, Peng, Tangwongsan. SPAA 2011]. By combining various stages of the previous algorithm, we obtain a significantly simpler algorithm with the same asymptotic guarantees as the best sequential algorithm.
This paper discusses a parallel implementation of the e-relaxation algorithm for the min-cost flow problem on the CM-5. There is considerable loss in efficiency in going from one processor sequential implementation to...
详细信息
The scheduling of a set of interdependent tasks on a specific family of parallel/distributed architectures is a basic problem arising in various areas. We consider tasks whose interdependency relationship has the form...
详细信息
ISBN:
(纸本)089791483X
The scheduling of a set of interdependent tasks on a specific family of parallel/distributed architectures is a basic problem arising in various areas. We consider tasks whose interdependency relationship has the form of a tree and architectures that have constant-dimensional grid-like structures. The total size of the task tree is known beforehand but no knowledge is assumed about the shape of the tree, implying that the scheduling must be dynamic. Important problems such as divide and conquer (D&C) satisfy this condition. On the other hand, many practical parallel/distributed architectures, such as linear array and mesh, fall into the class of constant-dimensional architectures. Here we give an efficient deterministic distributed scheme for mapping tasks to processors and for routing results in the architecture. Our scheme finishes the tasks in optimal time. If the maximum degree and the height of the tree can be estimated beforehand, the optimal time processor product can also be achieved. This paper focuses on optimizing execution time on the smallest possible number of processors in a model that considers both task dependency and communication configuration, in addition to computational cost.
This paper concerns the problem of how to exploit parallelism during the phases of compilation involving syntax-directed analysis and translation. In particular, we address the problem of how to exploit parallelism du...
详细信息
Using idle times of the processors is a well-known approach to run coarse grained parallelalgorithms for extremely complex problems. We present on-line algorithms for scheduling the processes of a parallel applicatio...
详细信息
ISBN:
(纸本)9781581138405
Using idle times of the processors is a well-known approach to run coarse grained parallelalgorithms for extremely complex problems. We present on-line algorithms for scheduling the processes of a parallel application that is known off-line on a dynamic network in which the idle times of the processors are dictated by an adversary. We also take communication and synchronization costs into account. Our first contribution consists of a formal model to restrict the adversary in a reasonable way. We then show a constant factor approximation for the off-line scheduling problem. As this problem has to take communication cost into account, it can be seen as a generalization of many NP-hard parallel machine scheduling problems. Finally, we present on-line algorithms for different models with constant or with "nearly constant" competitive ratio.
We present lower bounds for time needed to solve basic problems on three general-purpose models of parallel computation: the shared-memory models QSM and s-QSW, and the distributed-memory model, the BSP. For each of t...
详细信息
ISBN:
(纸本)9780897919890
We present lower bounds for time needed to solve basic problems on three general-purpose models of parallel computation: the shared-memory models QSM and s-QSW, and the distributed-memory model, the BSP. For each of these models, we also obtain lower bounds for the number of rounds needed to solve these problems using a randomized algorithm on a p-processor machine. Our results on 'rounds' is of special interest in the context of designing work-efficient algorithms on a machine where latency and synchronization costs are high. Many of our lower bound results are complemented by upper bounds that match the lower bound or are close to it.
Brent's scheduling principle provides a general simulation scheme when fewer processors are available than specified by the fastest parallel algorithm. Such a scheme preserves the actual number of executed operati...
详细信息
ISBN:
(纸本)089791483X
Brent's scheduling principle provides a general simulation scheme when fewer processors are available than specified by the fastest parallel algorithm. Such a scheme preserves the actual number of executed operations, and when applicable, it provides a processor balancing technique that significantly reduces the work, expressed as the number of executable operators. In this paper we discuss a new technique, called supereffective slow-down, that yields quite fast an algorithm with work significantly smaller than that of the fastest algorithm for the same problem. This technique can be viewed as a work-preserving acceleration of an existing recursive sequential algorithm for the considered problem. The presented examples include the computation of path algebras in graphs and digraphs and various computations in linear albegra. Some of the new algorithms may have practical value;for instance, we substantially improve the performance of the known parallelalgorithms for triangular linear systems of equations.
暂无评论