Multithreaded execution models attempt to combine some aspects of dataflow-like execution with von Neumann model execution, withthe objective of masking the latency of inter-processor communications and remote memory...
详细信息
Multithreaded execution models attempt to combine some aspects of dataflow-like execution with von Neumann model execution, withthe objective of masking the latency of inter-processor communications and remote memory accesses in multiprocessors. An important issue in the analysis and evaluation of multithreaded execution is the design and performance of the storage hierarchy. Because of the sequential execution of threads, the locality of access within an executing thread can be exploited using registers and cache. At the inter-thread level, however, the locality of accesses to memory and its effect on the cache is not yet well understood. Two storage hierarchy models, that attempt to capture and exploit this locality, are described and evaluated in this paper.
this paper presents a novel approach for performing real-time sonar beamforming on linear sensor arrays using the MasPar SIMD architecture. the beamforming problem is defined as a three dimensional solution space by g...
详细信息
this paper presents a novel approach for performing real-time sonar beamforming on linear sensor arrays using the MasPar SIMD architecture. the beamforming problem is defined as a three dimensional solution space by generating a cube structure with sonar array elements as one dimension, the required beams in other dimension, and the time samples in the third dimension. the given approach maps the problem cube into the MasPar structure using a modified one-to-one mapping and uses two MasPar Fortran 90 intrinsic array functions to generate the solutions to the beams. Simulation results are provided for different array and beam sizes.
Designing a good task allocation algorithm faces the challenge of allowing high levels of throughput, so that tasks are executed fast and processor parallelism is exploited, while still guaranteeing a low level of mem...
详细信息
ISBN:
(纸本)0818676833
Designing a good task allocation algorithm faces the challenge of allowing high levels of throughput, so that tasks are executed fast and processor parallelism is exploited, while still guaranteeing a low level of memory contention, so that performance does not suffer because of limitations on processor-to-memory bandwidth. In this work, we present a comparative study of throughput and contention guarantees provided by load balancing networks, a new class of distributed asynchronous algorithms for real-time task allocation in shared memory multiprocessors. Load balancing networks generalize balancing networks, to accomodate tasks with varying completion times. On the theoretical side, we formulate precise and crisp definitions for capturing the quality of load balancing provided by general task allocation algorithms;we use these definitions for formally evaluating the throughput performance of specific constructions of load balancing networks that we propose. Furthermore, we introduce a formal, complexity-theoretic measure of contention Incurred by tasks with varying completion times, and use it to analyse the contention performance of these constructions. Our theoretical results display precise and subtle trade-offs between throughput and contention performances for load balancing networks. On the practical side, we propose an experimental platform for evaluating the actual performance of load balancing networks through a series of carefully designed experiments that simulate these networks on real shared memory multiprocessor machines. Our experimental approach encompasses a rigorous methodology for randomly generating tasks that are not merely ''random'', but rather belong to common classes of tasks such as periodic and sporadic. Our experimental results reveal that load balancing networks substantially outperform in performance classical, centralized methods for task allocation.
One of the fundamental goals of parallel computing is to develop a framework that will support portable and efficient application programs. the Bulk-Synchronous parallel (BSP) model was proposed to help achieve this g...
详细信息
One of the fundamental goals of parallel computing is to develop a framework that will support portable and efficient application programs. the Bulk-Synchronous parallel (BSP) model was proposed to help achieve this goal. the BSP model is intended to be a `unifying model' - it addresses both software and hardware issues by allowing theoretical analysis to coexist with practical physical implementations. For several years the BSP model has been supported mainly by theoretical results. Recent experiments, however, have begun to demonstrate the practicality of the model for real architectures running real applications. the goal of this paper is to describe the methodology used to construct an efficient BSP library on the BBN Butterfly GP1000. Our results are relevant for BSP library implementations on shared-memory systems in general and for NUMA (nonuniform-memory-access) machines in particular.
the problem of scheduling tasks onto distributed memory machines for obtaining an optimal schedule is an NP-complete problem. In this paper, we present a scalable scheduling algorithm which can schedule the tasks of a...
详细信息
the problem of scheduling tasks onto distributed memory machines for obtaining an optimal schedule is an NP-complete problem. In this paper, we present a scalable scheduling algorithm which can schedule the tasks of a directed acyclic graphs (DAGs) with a complexity of O(V2) in the worst case, where V is the number of nodes of the DAG. this algorithm generates an optimal schedule for a class of DAGs which satisfy certain condition and if the required number of processors are available. the algorithm initially generates a schedule for a small number of processors. In case, the available number of processors are higher than the number of processors required by the initial schedule, the algorithm scales the schedule appropriately in an effort to obtain a lower parallel time by utilizing the extra or idle processors. the algorithm has been applied to some practical DAGs and the results are very promising.
this paper presents a measurement and simulation based study of parallel I/O in a high-performance cluster system: the Pittsburgh Supercomputing Center (PSC) DEC Alpha Supercluster. the measurements were used to chara...
详细信息
this paper presents a measurement and simulation based study of parallel I/O in a high-performance cluster system: the Pittsburgh Supercomputing Center (PSC) DEC Alpha Supercluster. the measurements were used to characterize the performance bottlenecks and the throughput limits at the compute and I/O nodes, and to provide realistic input parameters to PioSim, a simulation environment we have developed to investigate parallel I/O performance issues in cluster systems. PioSim was used to obtain a detailed characterization of parallel I/O performance, in the high-performance cluster system, for different regular access patterns and different system configurations. this paper also explores the use of local disks at the compute nodes for parallel I/O, and finds that the local disk architecture outperforms the traditional parallel I/O over remote I/O node disks architecture, even when as much as 68-75% of the requests from each compute node goes to remote disks.
In this paper, we propose a new class of interconnection networks called recursive hierarchical swapped networks (RHSN) for general-purpose parallelprocessing. the node degrees of RHSNs can vary from a small number t...
详细信息
In this paper, we propose a new class of interconnection networks called recursive hierarchical swapped networks (RHSN) for general-purpose parallelprocessing. the node degrees of RHSNs can vary from a small number to as large as required, depending on recursive and hierarchical composition parameters and the nucleus graph chosen. the diameter of an RHSN can be asymptotically optimal within a small constant factor. We present efficient routing, semigroup computation, ascend/descend, matrix-matrix multiplication, and emulation algorithms, thus proving the versatility of RHSNs. In particular, on suitably constructed RHSNs, matrix multiplication can be performed faster than the DNS algorithm on a hypercube. Furthermore, ascend/descend algorithms, semigroup computation, and parallel prefix computation can be done using algorithms with asymptotically fewer communication steps than on a hypercube.
the dataflow model of computation, in general, and its recent direction to combine dataflow processing with control-flow processing, in particular, provide attractive alternatives to satisfy the computational demands ...
详细信息
the dataflow model of computation, in general, and its recent direction to combine dataflow processing with control-flow processing, in particular, provide attractive alternatives to satisfy the computational demands of new applications, without experiencing the shortcomings of the traditional concurrent systems. this should motivate researchers to analyze the applicability of familiar concepts, such as scheduling and load balancing, within this new architectural framework. Effective execution of loop iterations as a means to improve performance and hardware utilization has received a great deal of attention in the past. In this paper we address the problem of scheduling/allocation of DOACROSS loops in a multithreaded dataflow environment. An extension to the staggered scheme - Cyclic staggered scheme - which produces a more balanced distribution of iterations among processors is introduced and its performance improvement in a dataflow and control-flow environment is simulated and analyzed.
the computational requirements for an adaptive solution of unsteady problems change as the simulation progresses. this causes workload imbalance among processors on a parallel machine which, in turn, requires signific...
详细信息
the computational requirements for an adaptive solution of unsteady problems change as the simulation progresses. this causes workload imbalance among processors on a parallel machine which, in turn, requires significant data movement at runtime. We present a new dynamic load-balancing framework, called JOVE, that balances the workload across all processors with a global view. Whenever the computational mesh is adapted, JOVE is activated to eliminate the load imbalance. JOVE has been implemented on an IBM SP2 distributed-memory machine in MPI for portability. Experimental results for two model meshes demonstrate that mesh adaption with load balancing gives more than a sixfold improvement over one without load balancing. We also show that JOVE gives a 24-fold speedup on 64 processors compared to sequential execution.
Available copy protocols provide the highest data availability and data reliability of all replication protocols that do not regenerate failed replicas. Unfortunately, all existing implementations of available copy pr...
详细信息
Available copy protocols provide the highest data availability and data reliability of all replication protocols that do not regenerate failed replicas. Unfortunately, all existing implementations of available copy protocols either rely on complex procedures for ascertaining which replicas are up to date after a total failure or have to wait for the recovery of all failed sites. We present a simple technique for efficiently implementing the available copy protocol. Our protocol does not require version numbers and maintains only n+log(n) bits of state per replica. We also show under standard Markovian assumptions that our new protocol provides the same data availability as the best feasible implementations of the available copy protocol.
暂无评论