ISBN: 0818678763 (print)
This paper presents a new structured parallel programming model, "SEQ OF PAR", based on the Communication Closed Layer (CCL) principle of causal composition for parallel programs and the Bird-Meertens formalism (BMF) of locality-based parallel computation. The model supports more general, architecture-independent parallel programming. It provides a structured approach to integrating task (or process) parallelism and data parallelism in one framework. The well-founded algebra of CCL and BMF also makes it possible to derive, optimize, and verify parallel programs through algebraic transformations. Experimental results show that this programming model is very promising for obtaining efficient, portable parallel code.
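As a small illustration of the kind of algebraic transformation the Bird-Meertens formalism allows, the sketch below checks two standard laws on a toy list: map fusion (two passes collapse into one) and chunk-wise reduction with an associative operator. It is not the SEQ OF PAR model itself; the helper names `compose` and `chunked_reduce` are hypothetical and chosen only for this example.

```python
from functools import reduce
from operator import add

def compose(f, g):
    """Function composition: compose(f, g)(x) == f(g(x))."""
    return lambda x: f(g(x))

def chunked_reduce(op, xs, parts):
    """Reduce each chunk independently, then combine the partial results.

    Valid only when `op` is associative and `xs` is non-empty. Each chunk
    could be handled by a different processor.
    """
    size = max(1, -(-len(xs) // parts))  # ceiling division
    chunks = [xs[i:i + size] for i in range(0, len(xs), size)]
    partials = [reduce(op, chunk) for chunk in chunks]
    return reduce(op, partials)

xs = list(range(1, 11))
double = lambda x: 2 * x
inc = lambda x: x + 1

# Map fusion: map f . map g == map (f . g)
two_pass = list(map(inc, map(double, xs)))
fused = list(map(compose(inc, double), xs))

# Reduction as a list homomorphism: chunk-wise sums combine to the full sum
total = chunked_reduce(add, xs, parts=3)
```

Here `two_pass` and `fused` hold the same list, and `total` equals the ordinary sum of `xs`; the fused form traverses the data once, which is the sort of optimization such laws justify.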
In the development of reactive systems, real-time constraints often have to be met. In such time-critical applications, heterogeneous multi-processor systems are used to fulfill these time constraints. This paper presents a hybrid partitioning method that uses a stochastic algorithm together with mixed integer linear programming. This method supports the development of time-critical systems. We assume that the algorithm to be analyzed is given in the form of a so-called task graph. The goal of the overall method is to fix, for every task, the processor that will execute it and the starting time of this execution. The main issue is a high-level-synthesis-like method for constructing a problem-specific multi-processor board. The presented methods have been fully implemented and tested.
In this paper we present the results of a parallel implementation of a heart field simulation algorithm. The application of biomagnetic fields offers a wide range of uses for parallel algorithms. Pathological changes in the human body, especially in the heart muscle, can be diagnosed and localised by means of biomagnetic field parameters. The goal of this diagnostic method is to fit an individual reference model of the heart field of a patient. Based on differences between the reference model and the measured biomagnetic field parameters, the type and position of defects in the heart can be located. The most time-consuming components of the whole algorithm are the matrix computations, especially the matrix inversion. The matrix inversion can be implemented on a parallel distributed memory system. In this paper we discuss the routing, the parallel matrix inversion, and the speedup for different network topologies as a function of the number of processors and the problem size.
In this paper we study the parallel aspects of PCGLS, a basic iterative method whose main idea is to organize the computation of the preconditioned conjugate gradient method applied to the normal equations, and of the Incomplete Modified Gram-Schmidt (IMGS) preconditioner, for solving sparse least squares problems on massively parallel distributed memory computers. The performance of these methods on this kind of architecture is always limited by the global communication required for the inner products. We describe the parallelization of PCGLS and the IMGS preconditioner with two improvements. One is to assemble the results of a number of inner products collectively, and the other is to create situations where communication can be overlapped with computation. A theoretical model of the computation and communication phases is presented, which allows us to decide the number of processors that minimizes the runtime. Several numerical experiments on the Parsytec GC/PowerPlus are presented.
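The first improvement, assembling several inner products collectively, can be illustrated with a minimal single-process sketch. It assumes each simulated processor holds a slice of every vector, computes its partial dot products locally, and packs them into one short list, so that a single global sum replaces one reduction per inner product. The function names are hypothetical, and `allreduce_sum` merely stands in for a real global reduction; this is not the authors' implementation.

```python
def local_dots(pairs):
    """Partial dot products for the vector slices held by one processor."""
    return [sum(a * b for a, b in zip(u, v)) for u, v in pairs]

def allreduce_sum(per_proc):
    """Stand-in for one global sum over all processors (element-wise)."""
    return [sum(vals) for vals in zip(*per_proc)]

def batched_inner_products(slices):
    """slices[p] is the list of (u_local, v_local) pairs on processor p.

    All partial results travel together, so only one global
    communication step is needed for any number of inner products.
    """
    per_proc = [local_dots(pairs) for pairs in slices]
    return allreduce_sum(per_proc)

# Two vectors r = [1, 2, 3, 4] and s = [5, 6, 7, 8], split over two processors.
slices = [
    [([1, 2], [1, 2]), ([5, 6], [5, 6])],  # processor 0
    [([3, 4], [3, 4]), ([7, 8], [7, 8])],  # processor 1
]
rr, ss = batched_inner_products(slices)
```

With these inputs `rr` is the inner product of `r` with itself and `ss` that of `s` with itself, both obtained from a single simulated reduction.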
With advances in wireless communication technology, using a wireless LAN as a platform for distributed network computing has become feasible. In this paper, we study the characteristics of end-to-end communication over wireless links. Given the reduced bandwidth competition in each LAN segment separated by wireless bridges, and the overlap of wireless and wired communications, an analytical comparison shows that group communication over wireless links can be more efficient than over a single-segment wired LAN. We also conducted experiments running distributed applications, and the results show that, with the support of threads, wireless network computing can achieve the same performance as wired networks. Furthermore, the statistical results from our survey show that users cannot tell the difference between wireless and wired settings in terms of data access speed.
Although highly parallel distributed memory computers have existed for several years, the operating systems used on them have not fit the requirements very well. Most of them are designed for sequential, shared memory parallel, or distributed computers. Examples are Unix on the IBM SP/2 [17] and Mach on the Intel Paragon. This results in poor scalability caused by inefficient communication primitives designed for wide area networks, or in wasted resources due to huge kernels (e.g. 8 MB per node are reported for Mach on the Paragon [16]), which is especially harmful in highly parallel systems with hundreds or thousands of nodes. With Cosy (Concurrent Operating System) we have shown that a well-structured and carefully designed system can be small (70 Kb for the kernel, 372 total memory usage per node), efficient (33 µs for communication), and scalable (applications run efficiently on up to 1024 processors).
Because it is very difficult to balance the load among processors for nonuniform loops at compile time, and dynamic methods come at the price of extra overhead, this paper proposes an adaptive hybrid scheduling scheme in which the distribution of loop iterations is divided into a few rounds and the block size in each round is determined adaptively according to the average overhead due to dynamic scheduling. Several experimental results also show the effect of the scheduling parameter, which can be selected by programmers according to the probability that a fetching processor will not perform an additional task fetch.
This paper presents a new rapid thread replacement mechanism, which is important in multithreading technology. Analysis of the memory system indicates that memory utilization decreases as the cache hit ratio increases. The parallelism between thread computation and thread replacement is found by analyzing their working processes. Based on these observations, we propose a rapid multithread replacement mechanism that overlaps thread replacement with thread computation. In particular, with finite hardware contexts, this mechanism can play the same role as infinite contexts by tolerating the replacement overhead. By modifying the general thread switching model, we build a thread replacement model and evaluate the mechanism both theoretically and experimentally. Finally, we discuss the hardware implementation and outline the problems to be resolved in the future.
Fast and efficient communication is one of the major design goals not only for parallel systems but also for clusters of workstations. The proposed model of the high-performance communication device ATOLL (1) features very low latency for the start of communication operations and reduces the software overhead of communication-specific functions. To close the gap between off-the-shelf microprocessors and the communication system, a highly sophisticated processor interface implements atomic start of communication, MMU support, and a flexible event scheduling scheme. The interconnectivity of ATOLL, provided by four independent network ports combined with cut-through routing, allows a large variety of network topologies to be configured. A software-transparent error correction mechanism significantly reduces the required protocol overhead. The presented simulation results promise high-performance, low-latency communication.
This paper proposes an efficient parallel approach to texture classification for image retrieval. The idea behind this method is to pre-extract texture features in terms of a texture energy measurement associated with a 'tuned' mask and store them in a multi-scale, multi-orientation texture class database via a two-dimensional linked list for querying. Thus each texture class sample in the database can be traced by its texture energy in a two-dimensional row-sorted matrix. Parallel searching strategies are introduced for quickly identifying the entries closest to the input texture throughout the given texture energy matrix. In contrast to traditional search methods, our approach incorporates different computation patterns for different numbers of available processors and is concerned with robust, work-optimal parallel algorithms for row-search and minimum-find based on the accelerated cascading technique and a dynamic processor allocation scheme. Applications of the proposed parallel search and multisearch algorithms to both single image classification and multiple image classification are discussed. The time complexity analysis shows that our proposal speeds up the classification tasks in a simple but dynamic manner. Examples are presented of the texture classification task applied to image retrieval of Brodatz textures, comprising various orientations and scales.
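The row-search and minimum-find steps can be pictured with a sequential sketch, given here under simplifying assumptions: the texture energy matrix is a list of non-empty rows, each sorted in ascending order, and the query is a single scalar energy. The function names are hypothetical, and the paper's accelerated cascading and processor allocation schemes are not reproduced. Each row search is independent of the others, which is what makes the step a natural candidate for parallel execution.

```python
from bisect import bisect_left

def closest_in_row(row, target):
    """Return (distance, column) of the entry in a sorted row nearest to target."""
    i = bisect_left(row, target)
    # The nearest entry is either just before or at the insertion point.
    candidates = [j for j in (i - 1, i) if 0 <= j < len(row)]
    j = min(candidates, key=lambda k: abs(row[k] - target))
    return abs(row[j] - target), j

def closest_entry(matrix, target):
    """Row-search in every row, then minimum-find over the per-row winners."""
    per_row = [closest_in_row(row, target) + (r,) for r, row in enumerate(matrix)]
    _dist, col, r = min(per_row)
    return r, col

energies = [
    [1.0, 4.0, 9.0],
    [2.0, 5.0, 11.0],
    [3.0, 6.5, 12.0],
]
row, col = closest_entry(energies, 6.0)
```

For the query 6.0 the nearest stored energy is 6.5, so `row, col` identifies the entry in the third row, second column. Binary search keeps each row search logarithmic in the row length.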