ISBN:
(Print) 9781467345651; 9780769549033
In this paper, we present an algorithm and provide design improvements needed to port the serial Lempel-Ziv-Storer-Szymanski (LZSS) lossless data compression algorithm to a parallelized version suitable for general-purpose graphics processing units (GPGPU), specifically for NVIDIA's CUDA framework. The two main stages of the algorithm, substring matching and encoding, are studied in detail to fit the GPU architecture. We conducted a detailed analysis of our performance results and compared them to serial and parallel CPU implementations of the LZSS algorithm. We also benchmarked our algorithm against well-known, widely used programs: GZIP and ZLIB. We achieved up to 34x better throughput than the serial CPU implementation of the LZSS algorithm and up to 2.21x better than the parallelized version.
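To make the two stages concrete, here is a minimal serial LZSS-style sketch in Python (an illustrative toy, not the paper's CUDA implementation; `WINDOW` and `MIN_MATCH` are assumed parameters): substring matching finds the longest match in a sliding window, and encoding emits either a literal byte or an (offset, length) back-reference.

```python
WINDOW = 4096      # search-buffer size (assumed parameter)
MIN_MATCH = 3      # matches shorter than this are emitted as literals

def lzss_compress(data: bytes):
    tokens, i = [], 0
    while i < len(data):
        best_len, best_off = 0, 0
        # Substring matching: scan the sliding window for the longest match.
        for j in range(max(0, i - WINDOW), i):
            k = 0
            while i + k < len(data) and data[j + k] == data[i + k]:
                k += 1
            if k > best_len:
                best_len, best_off = k, i - j
        # Encoding: back-reference if the match is long enough, else literal.
        if best_len >= MIN_MATCH:
            tokens.append(("M", best_off, best_len))
            i += best_len
        else:
            tokens.append(("L", data[i]))
            i += 1
    return tokens

def lzss_decompress(tokens):
    out = bytearray()
    for t in tokens:
        if t[0] == "L":
            out.append(t[1])
        else:                       # copy `length` bytes from `off` back
            _, off, length = t
            for _ in range(length):
                out.append(out[-off])
    return bytes(out)
```

The inner matching loop is the part the paper maps onto many GPU threads; in this serial form it is quadratic in the window size.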
Searching for data location in P2P systems is a message-intensive task. Basically, a query needs to reach all the nodes in order to find the locations of a specified piece of data. This search involves a broadcast, and an inappropriate broadcast strategy may not only lead to inefficient data sharing, but also directly affect the scalability of the P2P system. We propose the utilization of an ad-hoc spanning tree for distributed search in P2P systems. This spanning tree is built as an overlay network. We define algorithms to maintain the ad-hoc spanning tree, and we compare the search time obtained with the ad-hoc spanning tree with the time obtained with other structures used for broadcasting in parallel and distributed systems. This comparison is done with an MPI emulation.
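The message savings of a spanning-tree broadcast can be sketched as follows (a hypothetical Python toy, not the paper's MPI emulation): each node forwards the query only along tree edges, so reaching all n nodes costs exactly n-1 messages, instead of the redundant traffic of flooding.

```python
def broadcast_search(tree, root, have_data):
    """Broadcast a query over an overlay spanning tree.

    tree: dict mapping each node to its list of tree children.
    have_data: predicate telling whether a node holds the sought data.
    Returns (list of hit nodes, number of messages sent).
    """
    hits, messages = [], 0
    stack = [root]
    while stack:
        node = stack.pop()
        if have_data(node):
            hits.append(node)
        for child in tree.get(node, []):
            messages += 1          # exactly one message per tree edge
            stack.append(child)
    return hits, messages
```

For a tree with n nodes there are n-1 edges, hence n-1 messages, which is the lower bound for a query that must visit every node.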
ISBN:
(Print) 9781467345651; 9780769549033
Stochastic optimization is used in several high-impact contexts to provide optimal solutions in the face of uncertainties. This paper explores the parallelization of two-stage stochastic resource allocation problems that seek an optimal solution in the first stage, while accounting for sudden changes in resource requirements by evaluating multiple possible scenarios in the second stage. Unlike typical scientific computing algorithms, linear programs (which are the individual grains of computation in our parallel design) have unpredictable and long execution times. This confounds both a priori load distribution and persistence-based dynamic load balancing techniques. We present a master-worker decomposition coupled with a pull-based work assignment scheme for load balance. We discuss some of the challenges encountered in optimizing both the master and the worker portions of the computation, and techniques to address them. Of note are cut-retirement schemes for balancing memory requirements against duplicated worker computation, and scenario clustering for accelerating the evaluation of similar scenarios. We base our work in the context of a real application: the optimization of US military aircraft allocation to various cargo and personnel movement missions in the face of uncertain demands. We demonstrate scaling up to 122 cores of an Intel 64 cluster, even for very small but representative datasets. Our decision to eschew problem-specific decompositions has resulted in a parallel infrastructure that should be easily adapted to other similar problems. Similarly, we believe the techniques developed in this paper will be generally applicable to other contexts that require quick solutions to stochastic optimization problems.
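The pull-based work assignment idea can be sketched in a few lines (an illustrative Python toy using threads and a shared queue; the paper's actual cluster implementation and names differ): because each worker pulls the next scenario only when it becomes idle, grains with unpredictable runtimes balance themselves without any a priori distribution.

```python
import queue
import threading

def evaluate_scenarios(scenarios, evaluate, n_workers=4):
    """Master fills a queue; idle workers pull scenario evaluations from it."""
    work, results = queue.Queue(), []
    lock = threading.Lock()
    for s in scenarios:
        work.put(s)

    def worker():
        while True:
            try:
                s = work.get_nowait()   # pull: grab work only when idle
            except queue.Empty:
                return                  # no work left -> worker exits
            r = evaluate(s)             # stand-in for solving one scenario LP
            with lock:
                results.append(r)

    threads = [threading.Thread(target=worker) for _ in range(n_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```

A long-running scenario simply occupies one worker while the others drain the queue, which is exactly the property that a static, a priori partition of scenarios lacks.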
Update methods are an important aspect of the burgeoning Artificial Life research area. Artificial Life models, like the Predator-Prey model, operate efficiently under sequential implementation only while population numbers are low to moderate. We find that for large populations, sequential implementations are too slow to extract meaningful measurement statistics. In this paper we discuss the parallelisation of sequential update methods for use in Artificial Life systems. We also discuss the ramifications that parallel update algorithms introduce for data dependencies, and the meaning of correctness in parallel models.
We present optional locking as a method for significantly speeding up distributed locks, and we generalize it to multiple lock types obeying a conflict relation. The generalized version can simulate the message passing paradigm on top of Distributed Shared Memory (DSM) with no more messages than explicit message passing would need. Thus we argue that message passing can be viewed as a true special case of optional locking. As a consequence, the attractiveness of DSM programming models should increase significantly due to well-recognized advantages such as simplicity and reduced software engineering cost; mixtures of message passing and shared data access patterns can be treated uniformly. Measurements and simulations based on database benchmarks indicate a substantial performance improvement of optional locking over conventional locking, even in the presence of multiple lock types.
ISBN:
(Print) 9780889867048
Parallel applications are notorious for their intractability to performance debugging. Automatic performance analysis techniques, such as those used by Kojak and KappaPI, are promising in alleviating the difficulty of discovering performance inefficiencies in parallel applications. However, as we show in this paper, the results produced by these tools can be potentially misleading and sometimes outright incorrect. The reason is that the overhead due to performance inefficiencies originating at a certain point in the program can causally propagate and manifest itself at other points. Current techniques perform a flat analysis, i.e., they do not account for causal propagation. In this paper, we present a method of causal analysis with which current analysis techniques can be retrofitted to account for causal propagation of overhead, arriving at a more accurate description of performance bottlenecks. We also show various advantages rendered by this technique in improving the effectiveness of automatic performance analysis. In this paper, we only tackle overhead related to communication operations in MPI parallel applications. In general, however, our technique can be used for non-communication-related overhead in any parallel programming paradigm.
As the complexity of chip designs increases, simulation time also increases. Unit- and variable-delay simulation takes the most simulation time in the IC design process; however, parallel processing performs inefficiently due to the large amount of synchronization. In this paper, techniques to reduce the number of synchronization points in synchronous designs are proposed, along with a partitioner that partitions designs along flip-flop boundaries so that these techniques can be employed on real designs.
This paper presents an evolutionary prototyping methodology oriented to the modeling, design, and implementation of concurrent distributed systems. The methodology comprises two stages: a modeling language for concurrent distributed systems, LeMSiDiC (a graphical modeling language that provides UML-like structuring capabilities and a precise syntax and semantics for automatic source code generation for these types of systems); and a source code generator, GeCSiDiC (a code generator able to construct the objects associated with the model specified in LeMSiDiC using the object-oriented paradigm). The methodology allows interoperation either with an architecture oriented to concurrent distributed systems management or with concurrent distributed systems that lack specialized support.
ISBN:
(Print) 9781467345651
Read-Copy-Update (RCU) is a mechanism designed to increase the level of concurrency in readers-writer synchronization scenarios, vastly improving the scalability of software running on multiprocessor machines. Most existing RCU variants have been developed for and studied within the Linux kernel. Due to their strong dependency on Linux internals, they cannot be easily transferred to other operating system kernels. This paper presents a novel non-intrusive variant of the RCU mechanism (AP-RCU), which depends only on basic kernel-level concepts while maintaining the scalability benefits. We have implemented AP-RCU in the Solaris kernel (UTS) and experimentally confirmed the expected benefits over traditional forms of synchronization, comparable with previous RCU implementations.
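The core copy-then-publish pattern behind RCU can be illustrated with a toy user-level sketch (Python, hypothetical names; AP-RCU itself is a kernel mechanism, and grace-period reclamation here is delegated to the garbage collector): readers take no locks, while a writer copies the structure, mutates the copy, and publishes it with one atomic reference assignment.

```python
import threading

class RcuCell:
    """Toy RCU-style cell holding a dict snapshot."""

    def __init__(self, value):
        self._value = value
        self._writer_lock = threading.Lock()  # writers serialize among themselves

    def read(self):
        # Readers take no locks: they just dereference the current snapshot.
        # A reader holding an old snapshot keeps seeing consistent data.
        return self._value

    def update(self, mutate):
        with self._writer_lock:
            copy = dict(self._value)   # copy the current version
            mutate(copy)               # modify the private copy
            self._value = copy         # publish: single atomic reference swap
        # Old snapshots are reclaimed by GC once no reader references them,
        # standing in for RCU's grace-period machinery.
```

The point of the pattern is that concurrent readers never block and never observe a half-updated structure; they see either the old snapshot or the new one.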
ISBN:
(Print) 9780889867741
In this paper, we provide an "Improved Channel Aware QoS Scheduling Architecture for WiMAX Base Stations". In this scheme, we provide intelligence to the classifier by having feedback from the compensation block. We capitalize on the design of WiMAX, which differentiates connections with separate connection IDs (CIDs) and classifies traffic based on a class-based approach. We propose a design in which the classifier admits frames with only those CIDs that can be serviced without any significant delay to real-time voice and video applications, thus improving the overall QoS. In other words, it rejects a packet having a CID which is experiencing bad channel quality and allows unmarked CIDs to utilize the channel at its maximum efficiency.
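The classifier/compensation feedback loop described above might be sketched as follows (hypothetical Python names and structure, not the paper's actual design): the compensation block reports per-CID channel quality, and the classifier rejects frames whose CID is currently flagged bad, leaving the channel to connections that can actually use it.

```python
class ChannelAwareClassifier:
    """Toy admission filter driven by compensation-block feedback."""

    def __init__(self):
        self.bad_cids = set()   # CIDs currently on a bad channel

    def feedback(self, cid, channel_ok):
        # Compensation block reports channel state for one connection.
        if channel_ok:
            self.bad_cids.discard(cid)
        else:
            self.bad_cids.add(cid)

    def admit(self, frames):
        # Admit only frames whose connection is on a usable channel,
        # so bad-channel CIDs do not delay real-time voice/video traffic.
        return [f for f in frames if f["cid"] not in self.bad_cids]
```

When the channel for a rejected CID recovers, the next feedback report unmarks it and its frames are admitted again.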