This paper investigates various factors that affect the ability of a system to learn to play Ms. Pac-Man. For this study Ms. Pac-Man provides a game of appropriate complexity, and has the advantage that there have been many other papers published on systems that learn to play this game. The results indicate that temporal difference learning (TDL) performs most reliably with a tabular function approximator, and that the reward structure chosen can have a dramatic impact on performance. When using a multi-layer perceptron as a function approximator, evolution outperforms TDL by a significant margin. Overall, the best results were obtained by evolving multi-layer perceptrons.
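As a rough illustration of what temporal difference learning with a tabular function approximator means, the sketch below runs TD(0) on a tiny hypothetical chain environment. It is not the paper's Ms. Pac-Man setup: the environment, the reward of 1 at the end of the chain, and the names `td0_chain`, `alpha`, and `gamma` are all assumptions made for illustration.

```python
def td0_chain(num_states=5, episodes=200, alpha=0.1, gamma=1.0):
    """Tabular TD(0) value estimation on a deterministic left-to-right chain.

    Hypothetical toy environment: the agent starts in state 0, always moves
    one step right, and receives reward 1.0 on entering the terminal state
    (index num_states). Every other transition gives reward 0.0.
    """
    # One table entry per state; the terminal entry is never updated and stays 0.
    values = [0.0] * (num_states + 1)
    for _ in range(episodes):
        state = 0
        while state < num_states:
            next_state = state + 1
            reward = 1.0 if next_state == num_states else 0.0
            # TD(0): move the estimate toward reward + discounted next value.
            td_error = reward + gamma * values[next_state] - values[state]
            values[state] += alpha * td_error
            state = next_state
    return values
```

With a table, each state has its own independent estimate, so an update never disturbs the values of other states; that isolation is one plausible reason a tabular representation can behave more predictably than a multi-layer perceptron under TDL.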
Describes an algorithm for automatically mapping and load balancing unstructured, dynamic data structures on distributed memory machines. The algorithm is intended to be embedded in a compiler for a parallel language (DYNO) for programming unstructured numerical computations. The result is that the mapping and load balancing are transparent to the programmer. The algorithm iterates over two basic steps: (1) It identifies groups of nodes ('pieces') that disproportionately contribute to the number of off-processor edges of the data structure and moves them to processors to which they are better connected. (2) It balances the loads by identifying groups of nodes ('flows') that can be moved to adjacent processors without creating new pieces. The initial results are promising, giving good load balancing and a reasonably low number of inter-processor edges.
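The two quantities the abstract optimizes, off-processor edges and per-processor load, can be sketched as follows. This is a simplified illustration, not the DYNO algorithm: it looks at single nodes rather than groups ('pieces' or 'flows'), and the function names and the `owner` mapping (node to processor id) are hypothetical.

```python
from collections import Counter


def off_processor_edges(edges, owner):
    """Count edges whose two endpoints are mapped to different processors."""
    return sum(1 for u, v in edges if owner[u] != owner[v])


def best_connected_processor(node, edges, owner):
    """Return the processor that owns the most neighbours of `node`."""
    neighbour_owners = Counter()
    for u, v in edges:
        if u == node:
            neighbour_owners[owner[v]] += 1
        elif v == node:
            neighbour_owners[owner[u]] += 1
    if not neighbour_owners:
        # An isolated node has no reason to move.
        return owner[node]
    return neighbour_owners.most_common(1)[0][0]


def processor_loads(owner):
    """Number of nodes currently mapped to each processor."""
    return Counter(owner.values())
```

Moving a node to its best-connected processor lowers the edge cut but can unbalance the loads, which is why the abstract pairs the mapping step with a separate balancing step.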
In previous papers, we have described a reduction model for computing near-perfect state information (NPSI) in support of adaptive synchronization in a parallel discrete event simulation (S. Srinivasan et al., 1995; 1995). We report on an implementation of this model on a popular high performance computing platform, a network of workstations, without the use of special purpose hardware. The specific platform is a set of Pentium Pro PCs interconnected by Myrinet, a Gbps network. We describe the reduction model and its use in our Elastic Time Algorithm. We summarize our design, described in an earlier paper, and focus on the details of its implementation. We present performance results that indicate that NPSI is feasible for simulations with medium to large event granularity.
Message passing and distributed shared memory are the two major parallel programming paradigms on clusters. However, these two models have high programming complexity, produce less maintainable parallel code, and are not suitable for multi-core multiprocessor clusters. While object-oriented programming is dominant in serial programming, it has not been well exploited in parallel programming. In this paper, we propose an innovative automatic parallelization framework that employs past experience to parallelize serial programs and outputs the parallel code in the form of objects. Supported by a data-driven runtime environment, each parallel task is managed as a thread, exploiting the multiple processing cores on a cluster node. Based on this proposed framework, we have implemented a proof-of-concept parallelizer called PJava to parallelize Java code. The performance benefit of this framework is evaluated through case studies by comparing the execution time of the automatically generated PJava code to that of handcrafted JOPI (a Java dialect of MPI) code.
ISBN (digital): 9781665497473
ISBN (print): 9781665497480
Vector clocks are logical timestamps used in correctness tools to analyze the happened-before relation between events in parallel program executions. In particular, race detectors use them to find concurrent conflicting memory accesses, and replay tools use them to reproduce or find alternative execution paths. To record the happened-before relation with vector clocks, tool developers have to consider the different synchronization concepts of a programming model, e.g., barriers, locks, or message exchanges. Especially in distributed-memory programs, various concepts result in explicit and implicit synchronization between processes. Previously implemented vector clock exchanges are often specific to a single programming model, and a translation to other programming models is not trivial. Consequently, analyses relying on the vector clock exchange remain model-specific. This paper proposes an abstraction layer for on-the-fly vector clock exchanges for distributed-memory programs. Based on the programming models MPI, OpenSHMEM, and GASPI, we define common synchronization primitives and explain how model-specific procedures map to our model-agnostic abstraction layer. The exchange model is general enough also to support synchronization concepts of other parallel programming models. We present our implementation of the vector clock abstraction layer based on the Generic Tool Infrastructure with translators for MPI and OpenSHMEM. In an overhead study using the SPEC MPI 2007 benchmarks, the slowdown of the implemented vector clock exchange ranges from 1.1x to 12.6x for runs with up to 768 processes.
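For readers unfamiliar with vector clocks, a minimal sketch of the textbook mechanism follows. It is not the paper's abstraction layer and uses no MPI, OpenSHMEM, or GASPI calls; the function names are hypothetical, and clocks are plain lists with one entry per process.

```python
def tick(clock, pid):
    """Return a copy of `clock` with process `pid`'s entry advanced by one."""
    new_clock = list(clock)
    new_clock[pid] += 1
    return new_clock


def merge(local, received, pid):
    """Clock after a receive: elementwise maximum, then tick the receiver."""
    merged = [max(a, b) for a, b in zip(local, received)]
    merged[pid] += 1
    return merged


def happened_before(a, b):
    """True if the event stamped `a` happened before the event stamped `b`."""
    return a != b and all(x <= y for x, y in zip(a, b))


def concurrent(a, b):
    """True if neither event happened before the other."""
    return a != b and not happened_before(a, b) and not happened_before(b, a)
```

A race detector in this style would flag two conflicting memory accesses whose timestamps are `concurrent`; any synchronization (a message, a barrier, a lock hand-over) that carries a clock from one process to another via `merge` orders the events on either side of it.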
ISBN (print): 9781450319966
As the complexity of software for Cyber-Physical Systems (CPS) rapidly increases, multi-core processors and parallel programming models such as OpenMP become appealing to CPS developers for guaranteeing timeliness. Hence, a parallel task on multi-core processors is expected to become a vital component in CPS such as a self-driving car, where tasks must be scheduled in real-time. In this paper, we extend the fork-join parallel task model to be scheduled in real-time, where the number of parallel threads can vary depending on the physical attributes of the system. To efficiently schedule the proposed task model, we develop the task stretch transform. Using this transform for global Deadline Monotonic scheduling for fork-join real-time tasks, we achieve a resource augmentation bound of 3.73. In other words, any task set that is feasible on m unit-speed processors can be scheduled by the proposed algorithm on m processors that are 3.73 times faster. The proposed scheme is implemented on Linux/RK as a proof of concept, and ported to Boss, the self-driving vehicle that won the 2007 DARPA Urban Challenge. We evaluate our scheme on Boss by showing its driving quality, i.e., curvature and velocity profiles of the vehicle.
Vector prefix and reduction are collective communication primitives in which all processors must cooperate. The authors present two parallel algorithms, the direct algorithm and the split algorithm, for vector prefix and reduction computation on coarse-grained, distributed-memory parallel machines. The algorithms are relatively architecture independent and can be used effectively in many applications such as pack/unpack, array prefix/reduction functions, and array combining scatter functions, which are defined in Fortran 90 and in High Performance Fortran. Experimental results on the CM-5 are presented.
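The result that vector prefix and reduction are expected to produce can be sketched sequentially, assuming the usual definition in which each processor holds a vector of equal length and the operation is applied elementwise across processors. This shows only the semantics with addition as the operator; it is neither the direct nor the split algorithm, and the function names are hypothetical.

```python
from itertools import accumulate


def vector_prefix(vectors):
    """Inclusive elementwise prefix sums across processors.

    vectors[p] is the vector held by processor p; the result at index p is
    the elementwise sum of vectors[0] through vectors[p].
    """
    def add(acc, vec):
        return [a + b for a, b in zip(acc, vec)]

    return [list(v) for v in accumulate(vectors, add)]


def vector_reduction(vectors):
    """Elementwise sum of the vectors held by all processors."""
    return [sum(column) for column in zip(*vectors)]
```

The last entry of `vector_prefix` always equals `vector_reduction`, which is one way to check an implementation of either primitive against the other.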
ISBN (print): 9781450327664
General-purpose graphics processing units (GPGPUs) bring an opportunity to improve the performance of many applications. However, exploiting parallelism offers low productivity in current programming frameworks such as CUDA and OpenCL. Programmers have to consider and deal with many GPGPU architecture details; it is therefore a challenge to trade off the programmability and the efficiency of performance *** Repacking (PR) is a popular performance tuning approach for GPGPU applications, which improves performance by changing the parallel granularity. Existing code transformation algorithms using PR increase productivity, but they do not cover enough code patterns and do not provide effective code error detection. In this paper, we propose a novel parallel repacking algorithm (APR) that covers a wide range of code patterns and improves efficiency. We develop an efficient code model that expresses a GPGPU program as a recursive statement sequence, and introduce the concept of a singular statement. APR, building upon this model, uses appropriate transformation rules for singular and non-singular statements to generate the repacked code. A recursive transformation is performed when APR encounters a branching or loop singular statement. Additionally, singular statements unify the transformation for barriers and data sharing, and enable APR to detect barrier errors. Experimental results based on a prototype show that our proposed APR covers more code patterns than existing solutions such as the automatic thread coarsening in Crest, and that code repacked using APR achieves speedups of up to 3.28x, in some cases even higher than manually tuned repacked code.
This paper considers an approach to testbench development for synchronous parallel-pipeline designs. The approach is based on cycle-accurate formal specifications of the design under verification. The specifications include descriptions of the control flow graphs of the design's operations and definitions of the microoperations in terms of Hoare triples. The approach makes it possible to automate testbench development for complex synchronous designs with control flow branching and operations that start in parallel.
ISBN (print): 9781605584973
Traditional debug methodologies are limited in their ability to provide debugging support for many-core parallel programming. Synchronization problems or bugs due to race conditions are particularly difficult to detect with software debugging tools. Most traditional debugging approaches rely on globally synchronized signals, but these pose problems in terms of scalability. The first contribution of this paper is to propose a novel non-uniform debug architecture (NUDA) based on a ring interconnection schema. Our approach makes debugging both feasible and scalable for many-core processing scenarios. The key idea is to distribute the debugging support structures across a set of hierarchical clusters while avoiding address overlap. This allows the address space to be monitored using non-uniform protocols. Our second contribution is a non-intrusive approach to race detection supported by the NUDA. A non-uniform page-based monitoring cache in each NUDA node is used to watch the access footprints. The union of all the caches can serve as a race detection probe. Using the proposed approach, we show that parallel race bugs can be precisely captured, and that most false-positive alerts can be efficiently eliminated at an average slowdown cost of only 1.4% to 3.6%. The net hardware cost is relatively low, so the NUDA can readily scale to increasingly complex many-core systems.