ISBN (print): 0769500048
This paper proposes a novel queue-based programming abstraction, Parallel Dispatch Queue (PDQ), that enables efficient parallel execution of fine-grain software communication protocols. Parallel systems often use fine-grain software handlers to integrate a network message into computation. Executing such handlers in parallel requires access synchronization around resources. Much as a monitor construct in a concurrent language protects accesses to a set of data structures, PDQ allows messages to include a synchronization key protecting handler accesses to a group of protocol resources. By simply synchronizing messages in a queue prior to dispatch, PDQ not only eliminates the overhead of acquiring/releasing synchronization primitives but also prevents busy-waiting within handlers. In this paper, we study PDQ's impact on software protocol performance in the context of fine-grain distributed shared memory (DSM) on an SMP cluster. Simulation results running shared-memory applications indicate that: (i) parallel software protocol execution using PDQ significantly improves performance in fine-grain DSM, (ii) tight integration of PDQ and embedded processors into a single custom device can offer performance competitive with or better than an all-hardware DSM, and (iii) PDQ best benefits cost-effective systems that use idle SMP processors (rather than custom embedded processors) to execute protocols. On a cluster of 4 16-way SMPs, a PDQ-based parallel protocol running on idle SMP processors improves application performance by a factor of 2.6 over a system running a serial protocol on a single dedicated processor.
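To make the synchronization-key idea concrete, here is a minimal sketch (not the paper's embedded-processor implementation): a dispatcher that serializes handlers sharing a key before they ever reach a worker thread, so the handlers themselves acquire no locks and never busy-wait. The class and method names are illustrative assumptions.

```java
import java.util.*;
import java.util.concurrent.*;

/** Minimal sketch of a parallel dispatch queue: each message carries a
 *  synchronization key, and the dispatcher guarantees that at most one
 *  handler per key runs at a time, so handlers need no internal locks. */
public class ParallelDispatchQueue {
    /** A protocol message paired with the key of the resources it touches. */
    public record Message(int syncKey, Runnable handler) {}

    private final ExecutorService workers;
    // One serial lane per key: handlers with the same key run back to back.
    private final Map<Integer, ArrayDeque<Message>> lanes = new HashMap<>();
    private final Set<Integer> activeKeys = new HashSet<>();

    public ParallelDispatchQueue(int nWorkers) {
        workers = Executors.newFixedThreadPool(nWorkers);
    }

    /** Enqueue a message; dispatch immediately if its key is idle. */
    public synchronized void post(Message m) {
        if (activeKeys.add(m.syncKey())) {
            dispatch(m);                        // key was idle: run now
        } else {
            lanes.computeIfAbsent(m.syncKey(), k -> new ArrayDeque<>()).add(m);
        }
    }

    private void dispatch(Message m) {
        workers.submit(() -> {
            m.handler().run();                  // handler runs lock-free
            onHandlerDone(m.syncKey());
        });
    }

    /** When a handler finishes, dispatch the next message waiting on its key. */
    private synchronized void onHandlerDone(int key) {
        ArrayDeque<Message> lane = lanes.get(key);
        if (lane == null || lane.isEmpty()) {
            activeKeys.remove(key);             // key becomes idle
        } else {
            dispatch(lane.poll());
        }
    }

    public void shutdown() { workers.shutdown(); }
}
```

Messages with different keys still run fully in parallel; the queue only pays a short critical section at enqueue and completion time rather than per-resource locking inside each handler.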
ISBN (print): 0769500048
Much research has been done in fast communication on clusters and in protocols for supporting software shared memory across them. However, the end performance of applications that were written for the more proven hardware-coherent shared memory is still not very good on these systems. Three major layers of software (and hardware) stand between the end user and parallel performance, each with its own functionality and performance characteristics: the communication layer, the software protocol layer that supports the programming model, and the application layer. These layers provide a useful framework to identify the key remaining limitations and bottlenecks in software shared memory systems, as well as the areas where optimization efforts might yield the greatest performance improvements. This paper performs such an integrated study, using this layered framework, for two types of software distributed shared memory systems: page-based shared virtual memory (SVM) and fine-grained software systems (FG). For the two system layers (communication and protocol), we focus on the performance costs of basic operations rather than on their functionalities; this is possible because their functionalities are now fairly mature. The less mature application layer is treated through application restructuring. We examine the layers individually and in combination, understanding their implications for the two types of protocols and exposing the synergies among layers.
The Navy needs to use Multi-Level Security (MLS) techniques in an environment with an increasing amount of real-time computation brought about by increased automation requirements and new, more complex operations. NSWC-DD...
In this paper we present Dynamic Bisectioning (DBS), a simple but powerful, comprehensive scheduling policy for user-level threads, which unifies the exploitation of (multidimensional) loop and nested functional (or t...
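The abstract is truncated here, but the core idea of unifying loop parallelism and nested task parallelism over a single pool of user-level threads can be illustrated with a hedged sketch; the recursive bisection of the iteration range below mirrors the "bisectioning" name, while the grain size, class names, and use of a fork/join pool are illustrative assumptions rather than the paper's mechanism.

```java
import java.util.concurrent.RecursiveAction;
import java.util.concurrent.ForkJoinPool;

/** Hedged sketch: one pool of lightweight tasks serves both loop
 *  parallelism (by recursively bisecting the iteration range) and nested
 *  functional parallelism (by forking independent subtasks). */
public class BisectLoop extends RecursiveAction {
    interface Body { void run(int i); }

    private final int lo, hi, grain;
    private final Body body;

    BisectLoop(int lo, int hi, int grain, Body body) {
        this.lo = lo; this.hi = hi; this.grain = grain; this.body = body;
    }

    @Override protected void compute() {
        if (hi - lo <= grain) {                 // small enough: run serially
            for (int i = lo; i < hi; i++) body.run(i);
        } else {                                // bisect and run both halves
            int mid = (lo + hi) >>> 1;
            invokeAll(new BisectLoop(lo, mid, grain, body),
                      new BisectLoop(mid, hi, grain, body));
        }
    }

    public static void main(String[] args) {
        double[] a = new double[1 << 20];
        ForkJoinPool pool = new ForkJoinPool();  // the shared user-level pool
        pool.invoke(new BisectLoop(0, a.length, 4096, i -> a[i] = Math.sqrt(i)));
        System.out.println(a[123456]);
    }
}
```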
Dynamic Load Balancing is an important system function intended to distribute workload among available processors to improve the throughput and/or execution times of parallel computer programs, whether uniform or non-uniform (jobs whose workload varies at run-time in unpredictable ways). Non-uniform computation and communication requirements may bog down a parallel computer if no efficient load distribution is effected. A novel distributed algorithm for load balancing is proposed, based on local Rate of Change observations rather than on global absolute load numbers. It is a totally distributed algorithm and requires no centralized trigger and/or decision makers. The strategy is discussed and analyzed by means of experimental simulation.
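A minimal sketch of the rate-of-change idea, assuming each node periodically samples its own load and exchanges only rate estimates with neighbors; the threshold and the neighbor-query interface are illustrative, not the paper's protocol.

```java
/** Hedged sketch of rate-of-change load balancing: each node compares its
 *  own load derivative with a neighbor's and offloads work when its load is
 *  growing markedly faster. No global absolute load numbers are exchanged. */
public class RateOfChangeBalancer {
    private double prevLoad;           // load observed at the previous tick
    private double rate;               // local dLoad/dt estimate

    /** Called once per balancing interval with the current local queue length. */
    public void observe(double currentLoad, double dtSeconds) {
        rate = (currentLoad - prevLoad) / dtSeconds;
        prevLoad = currentLoad;
    }

    /** Purely local decision: ship work only if my load is rising much
     *  faster than the neighbor's own rate estimate. */
    public boolean shouldOffloadTo(double neighborRate, double threshold) {
        return rate - neighborRate > threshold;
    }

    public static void main(String[] args) {
        RateOfChangeBalancer me = new RateOfChangeBalancer();
        me.observe(10, 1.0);           // first sample
        me.observe(25, 1.0);           // load grew by 15 tasks/s
        System.out.println(me.shouldOffloadTo(2.0, 5.0));   // true: offload
    }
}
```

Because every node reacts only to derivatives it can compute locally, there is no centralized trigger, matching the fully distributed character claimed in the abstract.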
This paper presents SKiPPER, a programming environment dedicated to the fast prototyping of parallel vision algorithms on MIMD-DM platforms. SKiPPER is based upon the concept of algorithmic skeletons, i.e. higher o...
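The abstract is truncated here, but the skeleton idea itself can be sketched: the application writer supplies only sequential worker functions, and a reusable skeleton (a simple "farm" below) takes care of distributing tasks and collecting results. The skeleton name and signature are illustrative and do not reproduce SKiPPER's API.

```java
import java.util.*;
import java.util.concurrent.*;
import java.util.function.Function;

/** Hedged sketch of an algorithmic skeleton: the caller provides sequential
 *  code, and the "farm" handles all parallel bookkeeping. */
public class FarmSkeleton {
    /** Apply a sequential worker to every task in parallel, keeping order. */
    public static <T, R> List<R> farm(List<T> tasks, Function<T, R> worker, int nWorkers)
            throws InterruptedException, ExecutionException {
        ExecutorService pool = Executors.newFixedThreadPool(nWorkers);
        try {
            List<Future<R>> futures = new ArrayList<>();
            for (T t : tasks) futures.add(pool.submit(() -> worker.apply(t)));
            List<R> results = new ArrayList<>();
            for (Future<R> f : futures) results.add(f.get());
            return results;
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        // Toy "vision" stage: per-tile count of pixels above a threshold.
        List<int[]> tiles = List.of(new int[]{1, 200, 30}, new int[]{250, 9, 180});
        List<Long> counts = farm(tiles,
                tile -> Arrays.stream(tile).filter(p -> p > 100).count(), 2);
        System.out.println(counts);   // [1, 2]
    }
}
```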
Multidimensional Analysis and On-Line Analytical Processing (OLAP) use summary information that requires aggregate operations along one or more dimensions of numerical data values. Query processing for these applications requires different views of data for decision support. The Data Cube operator provides multi-dimensional aggregates, used to calculate and store summary information on a number of dimensions. The multi-dimensionality of the underlying problem can be represented both in relational and multi-dimensional databases, the latter being a better fit when query performance is the criterion for judgment. Relational databases are scalable in size, and efforts are under way to make their performance acceptable. On the other hand, multi-dimensional databases perform well for such queries, although they are not very scalable. Parallel computing is necessary to address the scalability and performance issues for these data sets. In this paper we present a parallel and scalable infrastructure for OLAP and multidimensional analysis. We use chunking to store data either as a dense block using multidimensional arrays (md-arrays) or as a sparse set using a Bit-Encoded Sparse Structure (BESS). Chunks provide a multidimensional index structure for efficient dimension-oriented data accesses, much the same as md-arrays do. Operations within chunks and between chunks are a combination of relational and multi-dimensional operations, depending on whether the chunk is sparse or dense. We present performance results for data sets with 3, 5 and 10 dimensions for our implementation on the IBM SP-2, which show good speedup and scalability.
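A hedged sketch of the density-adaptive chunk storage described above: a chunk is kept as a dense md-array when enough of its cells are populated, and otherwise as a list of cell offsets plus values standing in for BESS; the 25 % density threshold and the class layout are illustrative assumptions, not the paper's encoding.

```java
import java.util.*;

/** Hedged sketch of density-adaptive chunk storage for a data cube. */
public class CubeChunk {
    private final int[] dims;          // chunk edge lengths, e.g. {8, 8, 8}
    private double[] dense;            // used when the chunk is dense
    private long[] sparseOffsets;      // cell offsets (simplified stand-in for BESS)
    private double[] sparseValues;

    public CubeChunk(int[] dims, Map<Integer, Double> cells) {
        this.dims = dims.clone();
        int capacity = Arrays.stream(dims).reduce(1, (a, b) -> a * b);
        if (cells.size() * 4 >= capacity) {          // >= 25 % full: go dense
            dense = new double[capacity];
            cells.forEach((off, v) -> dense[off] = v);
        } else {                                     // sparse: offsets + values
            sparseOffsets = new long[cells.size()];
            sparseValues = new double[cells.size()];
            int i = 0;
            for (Map.Entry<Integer, Double> e : cells.entrySet()) {
                sparseOffsets[i] = e.getKey();
                sparseValues[i++] = e.getValue();
            }
        }
    }

    /** Sum over the whole chunk: a dense scan or a sparse scan, chosen by layout. */
    public double sum() {
        if (dense != null) return Arrays.stream(dense).sum();
        return Arrays.stream(sparseValues).sum();
    }

    public static void main(String[] args) {
        CubeChunk c = new CubeChunk(new int[]{8, 8, 8}, Map.of(3, 2.5, 77, 4.0));
        System.out.println(c.sum());                 // 6.5, stored sparsely
    }
}
```

Aggregations then pick a dense or sparse kernel per chunk, which is the combination of relational and multi-dimensional operations the abstract refers to.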
ISBN (print): 0818691948
In order to provide Java the ability to support scientific parallel computing, we introduce a data-parallel extension to the Java language with runtime system support. We provide a distributed arrays extension to Java and discuss the related operations on, and control over, the new distributed arrays. Communication involving distributed arrays is handled through a standard collective communication library. We also adopt a Single Program Multiple Data (SPMD) programming model.
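A minimal sketch of the SPMD distributed-array idea, assuming ranks are simulated with threads and the collective reduction is a shared accumulator; a real system would sit on a collective communication library, and all names here are illustrative.

```java
import java.util.concurrent.atomic.DoubleAdder;
import java.util.ArrayList;
import java.util.List;

/** Hedged sketch of SPMD distributed arrays: each "process" owns a block of
 *  a global array and a collective reduction combines the local results. */
public class DistributedArrayDemo {
    static final int GLOBAL_SIZE = 1_000_000;
    static final int NPROCS = 4;
    static final DoubleAdder globalSum = new DoubleAdder();  // stand-in collective

    /** The per-rank SPMD body: allocate my block, fill it, contribute to the sum. */
    static void spmdBody(int rank) {
        int block = GLOBAL_SIZE / NPROCS;
        int lo = rank * block;
        int hi = (rank == NPROCS - 1) ? GLOBAL_SIZE : lo + block;
        double[] local = new double[hi - lo];          // my slice of the array
        double partial = 0;
        for (int i = 0; i < local.length; i++) {
            local[i] = lo + i;                         // global index as the value
            partial += local[i];
        }
        globalSum.add(partial);                        // "all-reduce" contribution
    }

    public static void main(String[] args) throws InterruptedException {
        List<Thread> ranks = new ArrayList<>();
        for (int r = 0; r < NPROCS; r++) {
            final int rank = r;
            Thread t = new Thread(() -> spmdBody(rank));
            ranks.add(t);
            t.start();
        }
        for (Thread t : ranks) t.join();
        System.out.println(globalSum.sum());           // sum of 0..GLOBAL_SIZE-1
    }
}
```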
ISBN (print): 0818691948
The proceedings contain 61 papers. The topics discussed include: new number representation and conversion techniques on reconfigurable mesh; precise control of instruction caches; more on arbitrary boundary packed arithmetic; PERL - a registerless architecture; design alternatives for shared memory multiprocessors; a simple optimal list ranking algorithm; a parallel skeletonization algorithm and its VLSI architecture; improving error bounds for multipole-based treecodes; computation of penetration measures for convex polygons and polyhedra for graphics applications; extrapolation in distributed adaptive integration; and Java data parallel extensions with runtime system support.
ISBN (print): 0818685794
The complexity of parallel I/O systems imposes significant challenges in managing and utilizing the available system resources to meet application performance, portability and usability goals. We believe that a parallel I/O system that automatically selects efficient I/O plans for user applications is a solution to this problem. In this paper, we present such an automatic performance optimization approach for scientific applications performing collective I/O requests on multidimensional arrays. The approach is based on a high-level description of the target workload and execution environment characteristics, and applies genetic algorithms to select high-quality I/O plans. We have validated this approach in the Panda parallel I/O library. Our performance evaluations on the IBM SP show that this approach can select high-quality I/O plans under a variety of system conditions with low overhead, and that the genetic algorithm-selected I/O plans are in general better than the default plans used in Panda.
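A hedged sketch of genetic-algorithm plan selection, assuming a plan is encoded as a single gene (a stripe-unit choice) and scored by a synthetic cost model; the encoding, cost function, and GA parameters are illustrative and are not Panda's.

```java
import java.util.*;

/** Hedged sketch of GA-based I/O plan selection: plans are evolved by
 *  tournament selection plus mutation against a toy cost model. */
public class IoPlanGA {
    static final int[] STRIPE_UNITS_KB = {64, 128, 256, 512, 1024};
    static final Random rng = new Random(42);

    /** Lower is better: a toy cost trading per-request overhead against transfer time. */
    static double cost(int plan) {
        int unit = STRIPE_UNITS_KB[plan];
        return 1e6 / unit + unit * 3.0;
    }

    static int mutate(int plan) {
        return rng.nextInt(4) == 0 ? rng.nextInt(STRIPE_UNITS_KB.length) : plan;
    }

    public static void main(String[] args) {
        int popSize = 20, generations = 30;
        int[] pop = new int[popSize];
        for (int i = 0; i < popSize; i++) pop[i] = rng.nextInt(STRIPE_UNITS_KB.length);

        for (int g = 0; g < generations; g++) {
            int[] next = new int[popSize];
            for (int i = 0; i < popSize; i++) {
                int a = pop[rng.nextInt(popSize)], b = pop[rng.nextInt(popSize)];
                int winner = cost(a) <= cost(b) ? a : b;   // tournament selection
                next[i] = mutate(winner);
            }
            pop = next;
        }
        int best = Arrays.stream(pop).boxed()
                         .min(Comparator.comparingDouble(IoPlanGA::cost)).orElseThrow();
        System.out.println("Best stripe unit: " + STRIPE_UNITS_KB[best] + " KB");
    }
}
```

In a real library the fitness function would come from the workload and environment description rather than a fixed formula, but the selection loop has the same shape.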