With the proliferation of workstation clusters connected by high-speed networks, providing efficient system support for concurrent applications engaging in nontrivial interaction has become an important problem. Two principal barriers to harnessing parallelism are: one, efficient mechanisms that achieve transparent dependency maintenance while preserving semantic correctness, and two, scheduling algorithms that match coupled processes to distributed resources while explicitly incorporating their communication costs. This paper describes a set of performance features, their properties, and their implementation in a system support environment called DUNES that achieves transparent dependency maintenance - IPC, file access, memory access, process creation/termination, process relationships - under dynamic load balancing. The two principal performance features are push/pull-based active and passive end-point caching and communication-sensitive load balancing. Collectively, they mitigate the overhead introduced by the transparent dependency maintenance mechanisms. Communication-sensitive load balancing, in addition, affects the scheduling of distributed resources to application processes, where both communication and computation costs are explicitly taken into account. DUNES' architecture endows commodity operating systems with distributed operating system functionality while achieving transparency with respect to their existing application base. DUNES also preserves semantic correctness with respect to single-processor semantics. We show performance measurements of a UNIX-based implementation on Sparc and x86 architectures over high-speed LAN environments, and show that significant performance gains in terms of system throughput and parallel application speed-up are achievable.
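The placement policy behind communication-sensitive load balancing can be sketched as a cost minimization over hosts, where each candidate host is charged for its current computation load plus a penalty for every peer the process would have to reach over the network. The function below is an illustrative sketch only; the names and the uniform per-peer penalty are assumptions, not DUNES' actual cost model.

```python
# Hypothetical sketch of communication-sensitive process placement:
# pick the host minimizing (computation load + communication penalty).
# All names and weights are illustrative, not taken from DUNES.

def place_process(process_peers, host_loads, peer_host, comm_cost=1.0):
    """Return the host with the lowest combined cost.

    process_peers: ids of processes this process communicates with
    host_loads:    {host: current CPU load}
    peer_host:     {peer id: host it currently runs on}
    comm_cost:     penalty per peer located on a different host
    """
    def cost(host):
        # Peers on other hosts incur a communication penalty.
        remote_peers = sum(1 for p in process_peers if peer_host.get(p) != host)
        return host_loads[host] + comm_cost * remote_peers
    return min(host_loads, key=cost)

# A process whose peers run on "a" is drawn toward "a", even though
# "b" is the less loaded host.
hosts = {"a": 0.2, "b": 0.1}
print(place_process(["p1", "p2"], hosts, {"p1": "a", "p2": "a"}))  # → a
```

A purely load-based balancer would choose "b" here; charging for communication keeps coupled processes co-located unless the load imbalance outweighs the traffic.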
A desired mesh architecture, based on connected-cycle modules, is constructed. To enhance the reliability, multiple bus sets and spare nodes are dynamically inserted to construct modular blocks. Two reconfiguration sc...
This paper presents the SCOOPP (SCalable Object Oriented Parallel Programming) approach to support the design and execution of scalable parallel applications. The SCOOPP programming model aims at the portability, dynamic scalability, and efficiency of parallel applications. SCOOPP is a hybrid compile- and run-time system, which can perform parallelism extraction, supports explicit parallelism, and performs dynamic granularity control at run-time. The mechanism that supports dynamic grain-size adaptation is presented and its performance evaluated on two parallel systems. The measured results show the feasibility of the proposed dynamic grain-size adaptation and a scalability improvement of parallel applications over static parallel OO environments, which suggests cost benefits in developing scalable parallel applications to run on multiple platforms.
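The core idea of dynamic grain-size adaptation is that when individual tasks are cheaper than the overhead of running them in parallel, the run-time packs consecutive fine grains into a coarser one. The sketch below illustrates that idea under assumed names and cost estimates; it is not SCOOPP's actual run-time mechanism.

```python
# Illustrative grain-packing sketch (not SCOOPP's implementation):
# consecutive fine-grained tasks are merged until each grain's total
# estimated cost exceeds the per-task parallel spawn overhead.

def pack_grains(task_costs, spawn_overhead):
    """Group consecutive task indices into grains whose accumulated
    cost exceeds `spawn_overhead`; returns a list of index lists."""
    grains, current, acc = [], [], 0.0
    for i, cost in enumerate(task_costs):
        current.append(i)
        acc += cost
        if acc > spawn_overhead:      # grain is now worth spawning
            grains.append(current)
            current, acc = [], 0.0
    if current:                       # leftover cheap tasks join the last grain
        if grains:
            grains[-1].extend(current)
        else:
            grains.append(current)
    return grains

# Eight unit-cost tasks with spawn overhead 2.5 collapse into two grains.
print(pack_grains([1.0] * 8, 2.5))   # → [[0, 1, 2], [3, 4, 5, 6, 7]]
```

With a lower overhead (or costlier tasks) the same code leaves the grains fine; the adaptation is entirely driven by the measured overhead-to-work ratio.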
As high-performance embedded computing systems become more commonplace in a variety of applications, the need for supporting standards becomes more critical. Specifications developed by consensus, such as the Message Passing Interface (MPI) and Vector, Signal, and Image Processing (VSIP), and 'de facto' standards such as MATLAB, provide a means for developers to create real-time applications across multiple platform technologies. The balance between portability and performance presents some significant challenges, including balancing the application of tools tuned to specific platforms with the use of standard, but possibly slower, code and tools.
We present the ParAL system, which compiles Matlab scripts into C programs with calls to a parallel run-time library. The novel feature of the compiler is the optimization of array alignment, which reduces or eliminates unnecessary communication overheads. We have evaluated this technique on several Matlab codes. For comparison, the same applications were hand-coded using the PBLAS library. The aligned codes were on average 43% faster than the misaligned codes, with a speedup factor of almost 4 achieved in some cases. This optimization technique enabled ordinary Matlab scripts to run at a speed similar to manually optimized PBLAS codes.
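Why alignment eliminates communication can be seen with a toy block distribution: when two distributed arrays map corresponding elements to the same process, an elementwise operation is purely local, while a shifted alignment forces data to move. The distribution and names below are assumptions for illustration, not ParAL internals.

```python
# Toy model of alignment-induced communication (illustrative only).
# Arrays are block-distributed over `nprocs` processes; we count how
# many elements of B must travel to A's owner to compute C = A + B
# when B's alignment is shifted by `shift` elements relative to A.

def owner(index, block, nprocs):
    """Block-cyclic owner of element `index`."""
    return (index // block) % nprocs

def elements_communicated(n, block, nprocs, shift):
    """Elements whose owners differ under the shifted alignment."""
    return sum(1 for i in range(n)
               if owner(i, block, nprocs) != owner(i + shift, block, nprocs))

# Perfectly aligned arrays need no communication; a one-block shift
# misplaces every element.
print(elements_communicated(64, 8, 4, shift=0))   # → 0
print(elements_communicated(64, 8, 4, shift=8))   # → 64
```

An alignment-optimizing compiler effectively chooses distributions that drive this count to zero for as many operations as possible, which is where the reported speedups come from.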
ISBN (print): 0769501433
Software distributed shared memory (DSM) systems have successfully provided the illusion of shared memory on distributed-memory machines. However, most software DSM systems use the main memory of each machine as a level in a cache hierarchy, replicating copies of shared data in local memory. Since computer memories tend to be much larger than caches, DSM systems have largely ignored memory capacity issues, assuming there is always enough space in main memory in which to replicate data. Applications that access data exceeding the capacity available in local memory will page to disk, resulting in reduced performance. We have developed a software DSM system based on Cashmere that takes advantage of system-wide memory resources in order to reduce or eliminate paging overhead. Experimental results on a 4-node, 16-processor AlphaServer system demonstrate the improvement in performance using the enhanced software DSM system for applications with large data sets.
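The idea of using system-wide memory instead of disk can be sketched as an eviction policy: when local memory fills, a victim page is parked in another node's free memory, and a later fault fetches it over the network rather than from disk. This is a minimal illustration under assumed names; it is not Cashmere's protocol.

```python
# Minimal sketch of remote-memory paging (illustrative, not Cashmere):
# evicted pages go to remote nodes' free memory, so a later access is
# a network fetch instead of a disk read.

class RemotePager:
    def __init__(self, local_capacity):
        self.local_capacity = local_capacity
        self.local = {}      # page id -> data resident in local memory
        self.remote = {}     # page id -> data parked in remote memory

    def access(self, page_id):
        """Return the page, faulting it in from remote memory if needed."""
        if page_id in self.local:
            return self.local[page_id]
        data = self.remote.pop(page_id, b"")   # network fetch, not disk I/O
        if len(self.local) >= self.local_capacity:
            victim = next(iter(self.local))    # trivial FIFO-ish eviction
            self.remote[victim] = self.local.pop(victim)
        self.local[page_id] = data
        return data

pager = RemotePager(local_capacity=2)
for p in ("a", "b", "c"):
    pager.access(p)          # "a" is evicted to remote memory, not to disk
print(sorted(pager.remote))  # → ['a']
```

The benefit rests entirely on the latency gap: a LAN round trip is orders of magnitude cheaper than a disk access, so a working set that exceeds one node's memory can still avoid paging as long as the cluster's aggregate memory holds it.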
In order for parallel logic programming systems to become popular, they should serve the broadest range of applications. To achieve this goal, designers of parallel logic programming systems would like to exploit maximum parallelism for existing and novel applications, ideally by supporting both and-parallelism and or-parallelism. Unfortunately, combining both forms of parallelism is a hard problem, and available proposals cannot match the efficiency of, say, or-parallel-only systems. We propose a novel approach to And/Or parallelism in logic programs. Our initial observation is that stack copying, the most popular technique in or-parallel systems, does not work well with And/Or systems because memory management is much more complex. Copying is also a significant problem in operating systems, where copy-on-write (COW) has been developed to address it. We demonstrate that this technique can also be applied to And/Or systems, and present both shared memory and distributed shared memory designs.
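The copy-on-write principle the abstract borrows from operating systems can be shown in a few lines: a "copy" of a stack initially shares all of its segments, and a segment is physically duplicated only on the first write to it. The class and segment granularity below are illustrative assumptions, not the paper's design.

```python
# Minimal copy-on-write sketch (illustrative only): forking a stack
# shares its segments; a segment is duplicated lazily on first write.

class COWStack:
    def __init__(self, segments):
        self.segments = segments      # list of segments (lists), possibly shared
        self.private = set()          # segment indices already copied locally

    def fork(self):
        """Cheap copy: share every segment, duplicate nothing yet."""
        return COWStack(list(self.segments))

    def write(self, seg, slot, value):
        if seg not in self.private:   # first write to a shared segment:
            self.segments[seg] = list(self.segments[seg])  # duplicate it now
            self.private.add(seg)
        self.segments[seg][slot] = value

parent = COWStack([[1, 2], [3, 4]])
child = parent.fork()                 # O(number of segments), not O(stack size)
child.write(0, 0, 99)                 # only segment 0 is physically copied
print(parent.segments[0], child.segments[0])   # → [1, 2] [99, 2]
```

The appeal for And/Or systems is the same as for `fork()` in an OS: workers that mostly read a shared stack pay copying cost only for the portions they actually modify.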
Researchers and practitioners in the area of parallel and distributed computing have been lacking a portable, flexible, and robust distributed instrumentation system. We present the Baseline Reduced Instrumentation System Kernel (BRISK), which we have developed as part of a real-time system instrumentation and performance visualization project. The design is based on a simple distributed instrumentation system model for flexibility and extensibility. The basic implementation poses minimal system requirements and achieves high performance. We show evaluations of BRISK using two distinct configurations: one emphasizes isolated simple performance metrics; the other, BRISK's operation on distributed applications, its built-in clock synchronization, and dynamic on-line sorting algorithms.
ISBN (print): 0769500048
Much research has been done in fast communication on clusters and in protocols for supporting software shared memory across them. However, the end performance of applications that were written for the more proven hardware-coherent shared memory is still not very good on these systems. Three major layers of software (and hardware) stand between the end user and parallel performance, each with its own functionality and performance characteristics: the communication layer, the software protocol layer that supports the programming model, and the application layer. These layers provide a useful framework to identify the key remaining limitations and bottlenecks in software shared memory systems, as well as the areas where optimization efforts might yield the greatest performance improvements. This paper performs such an integrated study, using this layered framework, for two types of software distributed shared memory systems: page-based shared virtual memory (SVM) and fine-grained software systems (FG). For the two system layers (communication and protocol), we focus on the performance costs of basic operations in the layers rather than on their functionalities; this is possible because their functionalities are now fairly mature. The less mature application layer is treated through application restructuring. We examine the layers individually and in combination, understanding their implications for the two types of protocols and exposing the synergies among layers.
Our study of a large set of scientific applications over the past three years indicates that the processing for multi-dimensional datasets is often highly stylized. The basic processing step usually consists of mapping the individual input items to the output grid and computing output items by aggregating, in some way, all the input items mapped to the corresponding grid point. In this paper, we discuss the design and performance of T2, an infrastructure for building parallel database systems that integrates storage, retrieval, and processing of multi-dimensional datasets. It achieves its primary advantage from the ability to integrate data retrieval and processing for a wide variety of applications and from the ability to maintain and jointly process multiple datasets with different underlying grids. We present preliminary performance results comparing the implementation of two applications using the T2 services with custom-built integrated implementations.
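The stylized processing step described above, mapping input items to grid points and then aggregating everything that lands on the same point, can be sketched directly. The function names and the sensor-reading example are illustrative assumptions, not T2's API.

```python
# Sketch of the map-then-aggregate processing step (names are
# illustrative, not T2's interface): each input item is mapped to an
# output grid point, and co-located items are folded together.

def process(items, to_grid, aggregate, initial):
    """items:     iterable of input items
    to_grid:   item -> grid point (hashable)
    aggregate: (accumulator, item) -> accumulator
    initial:   starting accumulator value for each grid point"""
    grid = {}
    for item in items:
        point = to_grid(item)
        grid[point] = aggregate(grid.get(point, initial), item)
    return grid

# Example: sum sensor readings into 10x10-unit cells of a 2-D output grid.
readings = [((3, 7), 1.5), ((12, 7), 2.0), ((5, 2), 0.5)]
cells = process(readings,
                to_grid=lambda r: (r[0][0] // 10, r[0][1] // 10),
                aggregate=lambda acc, r: acc + r[1],
                initial=0.0)
print(cells)   # → {(0, 0): 2.0, (1, 0): 2.0}
```

Because the per-point aggregation is associative here, the loop parallelizes naturally: each processor can aggregate its own partition of the input and the partial grids can be merged afterwards, which is the structure an infrastructure like T2 exploits.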