Some classes of real-time systems function in environments which cannot be modeled with static approaches. In such environments, the arrival rates of events which drive transient computations may be unknown. Also, the periodic computations may be required to process varying numbers of data elements per period, but the number of data elements to be processed in an arbitrary period cannot be known at the time of system engineering, nor can an upper bound be determined for the number of data items; thus, a worst-case execution time cannot be obtained for such periodics. This paper presents middleware services that support such dynamic real-time systems through adaptive resource management. The middleware services have been implemented and employed for components of the experimental Navy system described in [10]. Experimental characterizations show that the services provide timely responses, that they have a low degree of intrusiveness on hardware resources, and that they are scalable.
The proposed convergence algorithm quickly and accurately predicts the mean message response time of a communication channel. From the predicted value, the correct window size for the desired percentage of time-out tolerance in message transmission and response can be computed. A correct window size is a necessity for reducing excessive message retransmissions caused by time-outs. The convergence algorithm is particularly useful for enhancing the performance of time-critical distributed applications running on large networks such as the Internet. Prolonged delays in message response in these systems may lead to fatal errors. The discussion in this paper concentrates mainly on the theory and verification of the convergence algorithm. The simulation results have confirmed that the proposed algorithm is indeed a cost-effective solution for enhancing system performance. The simplicity of the convergence algorithm makes it worthwhile for practical implementation in real systems.
We study how several collective operations, such as broadcast, reduction, and scan, can be composed efficiently in complex parallel programs. Our specific contributions are: (1) a formal framework for reasoning about collective operations; (2) a set of optimization rules which save communications by fusing several collective operations into one; (3) performance estimates, which guide the application of optimization rules depending on the machine characteristics; (4) a simple case study with machine experiments.
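To make the idea of fusing collective operations concrete, here is a minimal sequential sketch in Python. It is not taken from the paper: the function names (`unfused`, `combine`, `fused`) are hypothetical, and the particular rule shown (a scan with multiplication followed by a reduction with addition, fused into a single reduction over pairs) is an illustrative assumption that relies on multiplication distributing over addition. In a message-passing setting, the point would be that one collective over pairs replaces two collectives over scalars.

```python
from functools import reduce
from itertools import accumulate
import operator


def unfused(xs):
    """Two steps: a scan (prefix products), then a reduction (sum)."""
    return reduce(operator.add, accumulate(xs, operator.mul))


def combine(a, b):
    """Associative operator on pairs (s, r) describing a segment.

    s = sum of the prefix products of the segment
    r = product of the whole segment
    Joining segment a with segment b scales b's prefix products by r_a.
    """
    s1, r1 = a
    s2, r2 = b
    return (s1 + r1 * s2, r1 * r2)


def fused(xs):
    """One reduction over pairs replaces the scan followed by the reduction.

    Assumes xs is non-empty; each element x starts as the pair (x, x).
    """
    s, _ = reduce(combine, [(x, x) for x in xs])
    return s


if __name__ == "__main__":
    data = [2, 3, 4]
    print(unfused(data), fused(data))
```

Because `combine` is associative, the pair reduction can be evaluated in any bracketing, which is what would allow it to run as a single tree-shaped reduction across processors.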
The ability to dynamically adapt an unstructured grid (or mesh) is a powerful tool for solving computational problems with evolving physical features; however, an efficient parallel implementation is rather difficult, particularly from the viewpoint of portability on various multiprocessor platforms. We address this problem by developing PLUM, an automatic and architecture-independent framework for adaptive numerical computations in a message-passing environment. Portability is demonstrated by comparing performance on an SP2, an Origin2000, and a T3E, without any code modifications. We also present a general-purpose load balancer that utilizes symmetric broadcast networks (SBN) as the underlying communication pattern, with the goal of providing a global view of system loads across processors. Experiments on an SP2 and an Origin2000 demonstrate the portability of our approach, which achieves superb load balance at the cost of minimal extra overhead.
Parallel computing is becoming increasingly central and mainstream, driven both by the widespread availability of commodity SMP and high-performance cluster platforms and by the growing use of parallelism in general-purpose applications such as image recognition, virtual reality, and media processing. In addition to performance requirements, the latter computations impose soft real-time constraints, necessitating efficient, predictable parallel resource management. In this paper, we propose a novel approach for increasing parallel system utilization while meeting application soft real-time deadlines. Our approach exploits the application tunability found in several general-purpose computations. Tunability refers to an application's ability to trade off resource requirements over time, while maintaining a desired level of output quality. We first describe language extensions to support tunability in the Calypso system, then characterize the performance benefits of tunability, using a synthetic task system to systematically identify its benefits. Our results show that application tunability is convenient to express and can significantly improve parallel system utilization for computations with predictability requirements.
A new systolic algorithm which computes image differences in run-length encoded (RLE) format is described. The binary image difference operation is commonly used in many image processing applications, including automated inspection systems, character recognition, fingerprint analysis, and motion detection. The efficiency of these operations can be improved significantly with the availability of a fast systolic system that computes the image difference as described in this paper. It is shown that for images with a high similarity measure, the time complexity of the systolic algorithm is small and in some cases constant with respect to the image size. The time for the systolic algorithm is proportional to the difference between the number of runs in the two images, while the time for the sequential algorithm is proportional to the total number of runs in the two images together. A formal proof of correctness for the algorithm is also given.
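For orientation, the sequential baseline mentioned in the abstract can be sketched as follows. This is not the paper's systolic algorithm, and the abstract does not specify a data layout: the representation (one image row as a sorted list of non-overlapping `(start, length)` runs of 1-pixels) and the function name `rle_xor` are assumptions made for illustration. The sketch computes the pixelwise difference (XOR) of two rows directly on the runs, in time proportional to the total number of runs in both rows.

```python
from heapq import merge


def rle_xor(row_a, row_b):
    """Difference (XOR) of two binary rows given as runs of 1-pixels.

    Each row is a sorted list of non-overlapping (start, length) runs with
    length > 0. The result uses the same representation.
    """
    def edges(runs):
        # Every run contributes two positions where the pixel value flips.
        for start, length in runs:
            yield start
            yield start + length

    toggles = []
    # The XOR value flips at every edge of either row; two flips at the
    # same position cancel each other out.
    for pos in merge(edges(row_a), edges(row_b)):
        if toggles and toggles[-1] == pos:
            toggles.pop()
        else:
            toggles.append(pos)

    # The remaining flips pair up as (run start, run end).
    return [(toggles[i], toggles[i + 1] - toggles[i])
            for i in range(0, len(toggles), 2)]


if __name__ == "__main__":
    a = [(0, 3), (5, 2)]   # pixels 0-2 and 5-6 are set
    b = [(1, 3), (5, 2)]   # pixels 1-3 and 5-6 are set
    print(rle_xor(a, b))
```

Every edge of both inputs is visited exactly once, so the cost grows with the total run count even when the two rows are nearly identical. That is the behavior the abstract contrasts with the systolic version.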
Classification is an important problem in the field of data mining. Construction of good classifiers is computationally intensive and offers plenty of scope for parallelization. The divide-and-conquer paradigm can be used to efficiently construct decision tree classifiers. We discuss in detail various techniques for parallel divide-and-conquer and extend these techniques to efficiently handle disk-resident data. Furthermore, a generic technique for parallel out-of-core divide-and-conquer problems is suggested. We present pCLOUDS, the parallel version of the decision tree classifier algorithm CLOUDS, capable of handling large out-of-core data sets. pCLOUDS exhibits excellent speedup, sizeup, and scaleup properties, which make it a competitive tool for data mining applications. We evaluate the performance of pCLOUDS for a range of synthetic data sets on the IBM-SP2.
In this paper, we present the first system that implements OpenMP on a network of shared-memory multiprocessors. This system enables the programmer to rely on a single, standard, shared-memory API for parallelization within a multiprocessor and between multiprocessors. It is implemented via a translator that converts OpenMP directives to appropriate calls to a modified version of the TreadMarks software distributed shared memory (SDSM) system. In contrast to previous SDSM systems for SMPs, the modified TreadMarks uses POSIX threads for parallelism within an SMP node. This approach greatly simplifies the changes required to the SDSM in order to exploit the intra-node hardware shared memory. We present performance results for six applications (SPLASH-2 Barnes-Hut and Water, NAS 3D-FFT, SOR, TSP and MGS) running on an SP2 with four four-processor SMP nodes. A comparison between the threaded implementation and the original implementation of TreadMarks shows that using the hardware shared memory within an SMP node significantly reduces the amount of data and the number of messages transmitted between nodes, and consequently achieves speedups up to 30% better than the original versions. We also compare SDSM against message passing. Overall, the speedups of multithreaded TreadMarks programs are within 7-30% of the MPI versions.