This paper deals with general nested loops and proposes a novel dynamic scheduling technique. General loops contain complex loop bodies (consisting of arbitrary program statements, such as assignments, conditionals an...
详细信息
This paper deals with general nested loops and proposes a novel dynamic scheduling technique. General loops contain complex loop bodies (consisting of arbitrary program statements, such as assignments, conditionals and repetitions) that exhibit uniform loop-carried dependencies. Therefore it is now possible to achieve efficient parallelization for a vast class of loops, mostly found in DSP, PDEs, signal and video coding. At the core of this technique lies a simple and efficient dynamic rule (SDS - Successive Dynamic Scheduling) for determining the next ready-to-be-executed iteration at runtime. The central idea is to schedule the iterations on-the-fly using SDS, along the optimal hyperplane (determined using the QuickHull algorithm). Furthermore, a tool (CRONUS/1) that implements this theory and automatically produces the SPMD parallel code for message passing architectures is presented. As a testing case study, the FSBM motion estimation algorithm (used in video coding standards, e.g., MPEG-2, H.261) was used. The experimental results validate the presented theory and corroborate the efficiency of the generated parallel code.
In the module allocation problem a collection of software modules are to be assigned to physical processing nodes, subject to execution and communication cost. The cost of an allocation is a function of the execution ...
详细信息
In the module allocation problem a collection of software modules are to be assigned to physical processing nodes, subject to execution and communication cost. The cost of an allocation is a function of the execution costs and the communication costs for any pair of modules allocated to distinct processors. The module allocation problem has been well studied and is known to be NP-complete except for certain communication configurations. To solve the problem, several heuristics have been proposed. This paper discusses an alternative approach to solving the module allocation problem by applying a stochastic optimization method called the Cross Entropy (CE) Method. The CE Method is a state-of-the-art stochastic method for solving combinatorial and multi-extremal continuous optimization problems. The CE method uses a distribution with parameter v to generate sample allocation. The generated samples are then used to update v according to sample quality. This process is repeated until the distribution converges to a possibly optimal allocation. The results in this paper indicate that the CE method can successfully be applied to the module allocation problem and efficiently generate high quality solutions. Also, the CE method allows the use of non-standard objective functions that are used to find allocations that have multiple conflicting objectives.
In multicomputers and computer networks a proper processor allocation of incoming jobs has a big impact on efficiency of parallel and distributedcomputing. In this paper, the mesh topology and allocation with using F...
详细信息
ISBN:
(纸本)0889864012
In multicomputers and computer networks a proper processor allocation of incoming jobs has a big impact on efficiency of parallel and distributedcomputing. In this paper, the mesh topology and allocation with using First Fit (FF) and Stack-Based (SBA) schemes, are considered. In particular, the proposed algorithms, called SSBA (Stack-Based with Sorting) and BFSBA (Better Fit Stack-Based) are described. Evaluation of algorithm's properties have been done with using the proposed experimentation system. This system consists of such modules like experiment design, visualization of allocation processes, presentation of results of series of experiments (including computing the introduced measures of efficiency). The investigations, carried out in this system, show advantages of the proposed algorithms: SSBA is faster, and BFSBA causes less fragmentation than the classic SBA. Moreover, SBA-family algorithms are much more efficient than FF algorithm.
Simultaneous Multithreaded (SMT) processors improve the instruction throughput by allowing fetching and running instructions from several threads simultaneously at a single cycle. With the pipeline deepening and issue...
详细信息
Simultaneous Multithreaded (SMT) processors improve the instruction throughput by allowing fetching and running instructions from several threads simultaneously at a single cycle. With the pipeline deepening and issue widths increasing, the branch predictor plays a more important role in improving the performance of an SMT processor. Although the existing branch predictors have presented a high prediction accuracy, the complexity of their implementation and the hardware overhead of them are very large for SMT processors. In this paper, we present a simple and effective value based branch predictor for SMT processors. The predictor is easy to be implemented, and needs less hardware than that of all the other dynamic predictors we consider. Execution-driven simulation results show that our predictor outperforms many predictors in literature. The performance speedup of our predictor over the best predictor considered is 13% on average. The branch prediction miss rates and the instruction rates fetched along the wrong path are decreased.
Network swapping systems allow individual cluster nodes with over-committed memory to use the idle memory of remote nodes as their backing store, and to swap their pages over the network. As the number of nodes in a c...
详细信息
Network swapping systems allow individual cluster nodes with over-committed memory to use the idle memory of remote nodes as their backing store, and to swap their pages over the network. As the number of nodes in a cluster increases, it becomes more likely that a node will fail or become unreachable, making it important that such a system provide reliability support. Without reliability, a single node crash can affect programs running on other cluster nodes by losing remotely swapped page data that was stored on the crashed node. Our network swapping system, Nswap, has design features that complicate reliability: swapped pages can migrate from one node to another in response to changes in a node's local memory needs. As a result, reliability schemes that rely on fixed placement of page and reliability data are not applicable to our system. Our reliability solutions solve the unique challenge of providing reliability to network swapping systems that both support dynamic changes to the size of remote RAM swap space and support migration of remotely swapped page data. Results show that even though our Mirroring reliability scheme adds time and space overhead to Nswap, it still outperforms swapping to disk by a factor of up to 8.2. Our dynamic Parity scheme will provide reliability with minimal time and space overhead.
Since data dependencies greatly decrease instruction level parallelism, minimizing dependencies becomes a crucial part of the process of parallelizing sequential code. Eliminating all unnecessary hazards leads to the ...
详细信息
Since data dependencies greatly decrease instruction level parallelism, minimizing dependencies becomes a crucial part of the process of parallelizing sequential code. Eliminating all unnecessary hazards leads to the more efficient use of resources, fewer processor stalls and easily maintainable code. Previously we proposed a novel approach for eliminating redundant data dependencies from code. In this paper, we review this method and show how this elimination technique may be combined with unfolding so as to parallelize code even further.
The fields of probability and statistics are built over the abstract concepts of probability space and random variables. This has given rise to elegant and powerful mathematical theory but exact implementation of thes...
详细信息
The fields of probability and statistics are built over the abstract concepts of probability space and random variables. This has given rise to elegant and powerful mathematical theory but exact implementation of these concepts on conventional computers seems impossible. Finding high-quality and efficient algorithms for random number generation on parallel computers is even more difficult. One of the reasons good parallel random number generators are so hard to create is that any small correlations that exist in the sequential generator may be amplified by the method used to distribute the sequence among the processors, producing stronger correlations in the subsequences on each processor. Inter-processor correlations may also be introduced. Also, the method to specify the seeds for each processor is greatly important, since any correlations between the seeds on the different processors could produce strong inter-processor correlations. However, we found the random number generator by L'Ecuyer based on the combination of four linear congruential generators is a good example for both sequential and parallel implementations. It has a large domain and inherently parallel property where each processor is assigned to generate one or more sequences independently and our parallel implementation generates the same random number sequence as the sequential implementation. We mainly present the issues related to the parallel implementation of this algorithm using a shared memory workstations with OpenMP. In the implementation, multiple virtual generators work in parallel. Our solution is scalable with the number of processors and provides locality. It also maintains all properties of its sequential implementation with significant absolute and relative speedup, and significant efficiency and iso-efficiency.
The proceedings contains 69 papers from Fifteenth iasted international conference on parallel and distributed computing and systems. The topics discussed include: computing nearest neighbours in real time;distributed ...
详细信息
ISBN:
(纸本)088986392X
The proceedings contains 69 papers from Fifteenth iasted international conference on parallel and distributed computing and systems. The topics discussed include: computing nearest neighbours in real time;distributed algorithms for constructing multicast trees;properties of neural networks studied from ND geometry and discrete algebra;knowledge inference for mobile monitoring;parallel design of multichannel inverse filters for audio reproduction;design of local access networks;a set-based approach to packet classification and a hybrid scheme for transmission schedules in streaming media.
We motivate the value of an abstract distributed service component (ADSC) with an overview of JANET, a high-performance grid computing service. We then present the architecture of an ADSC, focusing on some design issues.
ISBN:
(纸本)088986392X
We motivate the value of an abstract distributed service component (ADSC) with an overview of JANET, a high-performance grid computing service. We then present the architecture of an ADSC, focusing on some design issues.
Small failures should only disrupt a small part of a network. One way to do this is by marking the surrounding area as untrustworthy - circumscribing the failure. This can be done with a distributed algorithm using hi...
详细信息
ISBN:
(纸本)088986392X
Small failures should only disrupt a small part of a network. One way to do this is by marking the surrounding area as untrustworthy - circumscribing the failure. This can be done with a distributed algorithm using hierarchical clustering and neighbor relations, and the resulting circumscription is near-optimal for convex failures.
暂无评论