In this paper, we show how to minimize data sharing overhead required in most parallel algorithms, especially in Large-Scale Data-parallel (LSDP) algorithms, on a 2D mesh. Two specific issues are addressed in this stu...
详细信息
In this paper, we show how to minimize data sharing overhead required in most parallel algorithms, especially in Large-Scale Data-parallel (LSDP) algorithms, on a 2D mesh. Two specific issues are addressed in this study. One is what the optimal group size is, i.e., how many PEs should share a copy of shared data. The other is where the replicated data should be allocated.
In this paper, we present algorithms for solving several basic geometric problems of size n in a network of p processors each with O(n/p) local memory. Our algorithms achieve the best possible (up to a constant factor...
详细信息
ISBN:
(纸本)081864222X
In this paper, we present algorithms for solving several basic geometric problems of size n in a network of p processors each with O(n/p) local memory. Our algorithms achieve the best possible (up to a constant factor) time bound in hypercubic parallel architectures such as hypercube, shuffle exchange and cube connected cycles, provided that n≥p1+Ε for some positive constant Ε. Our algorithms use only sorting and parallel prefix that involve interprocessor communications, and can be easily implemented in commercially available parallel computers. Experimentation on nCUBE parallel computers shows that our algorithms run efficiently.
A software framework for the parallel execution of sequential programs using C++ classes is presented. The functional language Concurrent ML is used to implement the underlying harness and to design the programming in...
详细信息
We propose a novel computational model for GPU. Known parallel computational models such as the PRAM model are not appropriate for evaluating GPU algorithms. Our model, called AGPU, abstracts the essence of current GP...
详细信息
ISBN:
(纸本)9781479941162
We propose a novel computational model for GPU. Known parallel computational models such as the PRAM model are not appropriate for evaluating GPU algorithms. Our model, called AGPU, abstracts the essence of current GPU architectures such as global and shared memory, memory coalescing and bank conflicts. We can therefore evaluate asymptotic behavior of GPU algorithms more accurately than known models and we can develop algorithms that are efficient on many real architectures. As a showcase, we first analyze known comparison-based sorting algorithms using the AGPU model and show that they are not I/O optimal, that is, the number of global memory accesses is more than necessary. Then we propose a new algorithm which uses an asymptotically optimal number of global memory accesses and whose time complexity is also nearly optimal.
Local Area MultiProcessors (LAMP) is a network of personal workstations with distributed shared physical memory provided by high performance technologies such as SCI. LAMP is more tightly coupled than the traditional ...
详细信息
ISBN:
(纸本)0818675829
Local Area MultiProcessors (LAMP) is a network of personal workstations with distributed shared physical memory provided by high performance technologies such as SCI. LAMP is more tightly coupled than the traditional local area networks (LAN) but is more loosely coupled than the bus based multiprocessors. This paper presents a distributed scheduling algorithm which exploits the distributed shared memory in SCI-LAMP to schedule the idle remote processors among the requesting workstations. It considers fairness by allocating remote processing capacity to the requesting workstations based on their priorities according to the decay-usage scheduling approach. The performance of the algorithm in scheduling both sequential and parallel jobs is evaluated by simulation. It is found that the higher priority nodes achieve faster job response times and higher speedups than that of the lower priority nodes. Lower scheduling overhead allows finer granularity of remote processors sharing than in LAN.
High-level data-parallel languages such as Vienna Fortran and High Performance Fortran (HPF) have been introduced to allow the programming of massively paralleldistributed-memory machines at a relatively high level o...
详细信息
High-level data-parallel languages such as Vienna Fortran and High Performance Fortran (HPF) have been introduced to allow the programming of massively paralleldistributed-memory machines at a relatively high level of abstraction, based on the Single-Program-Multiple-Data (SPMD) paradigm. Their main features include mechanisms for expressing the distribution of data across the processors of a machine. This paper introduces additional language functionality to allow the efficient processing of sparse matrix codes. We introduce new methods for the representation and distribution of sparse matrices, which forms a powerful mechanism for storing and manipulating sparse matrices able to be efficiently implemented on massively parallel machines.
Assignment of tasks of a parallel program onto processors of a distributed-memory system is critical to obtain minimal program completion time by minimizing communication overhead. This paper presents task assignment ...
详细信息
ISBN:
(纸本)081864222X
Assignment of tasks of a parallel program onto processors of a distributed-memory system is critical to obtain minimal program completion time by minimizing communication overhead. This paper presents task assignment heuristics for wormhole-routed systems. A Temporal Communication Graph is used to model task graphs and to identify spatial and temporal link contention. The interplay between degree of routing adaptivity, topology, application characteristics, and task assignment is studied by evaluating random task graphs using an event-driven simulator. The study indicates that even for systems supporting fully-adaptive routing, efficient task assignment is necessary to reduce program completion time especially for communication-bound applications.
distributed atmospheric data are often stored and managed in diverse regional and global repositories. The scientific process requires composition of these resources to solve a specific atmospheric science problem. A ...
详细信息
ISBN:
(纸本)9780769537474
distributed atmospheric data are often stored and managed in diverse regional and global repositories. The scientific process requires composition of these resources to solve a specific atmospheric science problem. A challenge is how to harvest data and models for designing and executing experiments in a seamless manner. Moreover, data-intensive scientific experiments with large-scale scientific data processing are iterative, dynamic and human centered. In particular, the atmospheric scientific experiments usually include some fixed steps, such as data discovery, access, preprocessing, transformation, visualization. A web service-enabled workflow system is a flexible tool for accessing distributed scientific data, and executing complex analysis on it. In this paper, we describe a four-layered architecture of the workflow system, which consists of web interaction layer, workflow engine layer, workflow components layer and resource layer. We also develop an intuitive and easy-to-use web based toolkit and apply it to atmospheric data processing.
To solve the graph partitioning problem, efficient heuristics have been developed that are also capable of distributing the computational load in parallel FEM computations. However, although a few parallel implementat...
详细信息
ISBN:
(纸本)0769521320
To solve the graph partitioning problem, efficient heuristics have been developed that are also capable of distributing the computational load in parallel FEM computations. However, although a few parallel implementations do exist, the involved algorithms are hard to parallelize due to their sequential nature. This paper presents a new approach to deal with the FEM graph partitioning problem. Applying diffusion as a growing mechanism, we are able to eliminate restrictions of former implementations based on the bubble framework and construct a relatively simple algorithm with a high degree of "natural" parallelism. We demonstrate that it computes solutions comparable to those of established heuristics. Its drawback is the long execution time if the parallelism is not exploited.
Most of the existing loop partitioning schemes require the dependence distances between the iterations of the loops to be constants which means that these schemes will not be applicable when the dependence distance is...
详细信息
ISBN:
(纸本)081864222X
Most of the existing loop partitioning schemes require the dependence distances between the iterations of the loops to be constants which means that these schemes will not be applicable when the dependence distance is variable. When each iteration of a loop is treated as a sequential instruction stream, parallel execution of these loops is possible if the data dependency between each iteration can be resolved during run-time. In this paper, we proposed a parallel architecture which is able to resolve these kind of undetermined data dependencies. A notable feature of the proposed architecture is that it will initiate executing multiple instruction streams before the data dependency between the instruction streams, if any, has been resolved and will restart the execution if necessary.
暂无评论