ISBN: 3540440496 (print)
MorphoSys is a reconfigurable architecture for computation-intensive applications. It combines coarse-grain and fine-grain reconfiguration techniques to optimize hardware for the application domain. M2, the current implementation, is developed as an IP core and synthesized in TSMC 0.13 micron technology. Experimental results show that for multimedia applications MorphoSys achieves performance comparable to ASICs, with the added benefit that it can be reconfigured for a different application in one clock cycle.
This paper discusses the parallel solution of symmetric eigenproblems, which include standard and generalized eigenvalue problems. For standard eigenvalue problem and tridiagonal eigenvalue problem is not the key point...
ISBN: 0769515126 (print)
Creating portable and automatically scalable parallel software has been a goal for researchers and practitioners since the advent of parallel computing. In this paper we present a programming methodology that reduces parallel programming complexity while producing portable and automatically scalable parallel software. To support this methodology, two separate tools have been developed: the PARSA Software Development Environment and an accompanying thread manager. The development environment addresses programming issues via an object-based graphical programming methodology that automatically transforms a project into portable and scalable source code. The generated source code makes calls to the user-level thread manager, which manages the run-time execution of the parallel software. Two sample applications containing various forms of parallelism have been developed and compiled on three different systems with diverse native threading mechanisms to demonstrate portability. Finally, automatic scalability is demonstrated with the run-time performance of the applications on multiprocessor systems.
ISBN: 3540440496 (print)
The model of bulk-synchronous parallel (BSP) computation is an emerging paradigm of general-purpose parallel computing. We propose the first optimal deterministic BSP algorithm for computing the convex hull of a set of points in three-dimensional Euclidean space. Our algorithm is based on known fundamental results from combinatorial geometry concerning small-sized, efficiently constructible ε-nets and ε-approximations of a given point set. The algorithm generalises the technique of regular sampling, used previously for sorting and two-dimensional convex hull computation. The cost of the simple algorithm is optimal only for extremely large inputs; we show how to reduce the required input size by applying regular sampling in a multi-level fashion.
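The abstract builds on regular sampling as previously used for sorting. As a rough illustration of that underlying idea only (not the paper's convex hull algorithm), the following sketch simulates `p` processors sequentially. The function name `regular_sample_sort` and the exact sample positions are assumptions made for this example.

```python
from bisect import bisect_right


def regular_sample_sort(data, p):
    """Sort `data` by simulating p processors with regular sampling.

    Hypothetical helper for illustration; runs sequentially.
    """
    if p < 1:
        raise ValueError("p must be at least 1")
    n = len(data)
    # Step 1: split the input into p nearly equal blocks and sort each locally.
    blocks = [sorted(data[i * n // p:(i + 1) * n // p]) for i in range(p)]
    # Step 2: every non-empty block contributes p regularly spaced samples.
    samples = []
    for block in blocks:
        if block:
            samples.extend(block[j * len(block) // p] for j in range(p))
    samples.sort()
    # Step 3: pick p - 1 splitters at regular positions in the sorted samples.
    splitters = []
    if samples:
        splitters = [samples[j * len(samples) // p] for j in range(1, p)]
    # Step 4: route each element to the bucket delimited by the splitters.
    buckets = [[] for _ in range(p)]
    for block in blocks:
        for x in block:
            buckets[bisect_right(splitters, x)].append(x)
    # Step 5: sort buckets locally; their concatenation is globally sorted.
    result = []
    for bucket in buckets:
        result.extend(sorted(bucket))
    return result
```

In a real BSP setting, steps 1 and 5 would run on separate processors and steps 2 to 4 would each correspond to a communication phase; here they are plain loops so the data flow is easy to follow.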
As a classical method of image segmentation in mathematical morphology, the watershed transform has been applied successfully in fields such as remote sensing image processing, biomedical and computer vision appli...
ISBN: 3540440496 (print)
The retrieval of images in remote sensing databases is based on world-oriented information such as the location of the scene, the scanner used, and the date of acquisition. However, these descriptions are not meaningful for many users who have limited knowledge of remote sensing but nevertheless have to work with satellite imagery. Therefore, a content-based dynamic retrieval technique is proposed that uses a cluster architecture to meet the resulting computational requirements. Initially, the satellite images are distributed evenly over the available computing nodes and the retrieval operations are performed simultaneously. The dynamic strategy creates the need for workload balancing before the sub-results are joined in a final ranking.
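The distribute, search, and join pattern described above can be sketched as follows. This is a minimal single-process illustration under stated assumptions: the names `distribute_evenly`, `local_top_k`, and `final_ranking` are hypothetical, the similarity `score` function is supplied by the caller, and the dynamic workload balancing step from the abstract is not modelled.

```python
import heapq
from itertools import islice


def distribute_evenly(items, num_nodes):
    """Round-robin assignment, so node sizes differ by at most one."""
    return [items[i::num_nodes] for i in range(num_nodes)]


def local_top_k(node_items, score, k):
    """Per-node retrieval: the k best (score, item) pairs, best first."""
    return heapq.nlargest(k, ((score(item), item) for item in node_items))


def final_ranking(sub_results, k):
    """Join the per-node sub-results into one global top-k ranking.

    Each sub-result must already be sorted from best to worst,
    which is what local_top_k returns.
    """
    merged = heapq.merge(*sub_results, reverse=True)
    return [item for _, item in islice(merged, k)]
```

If every node returns at least its own top `k`, the merged top `k` equals the global top `k`, because no item outside a node's local top `k` can rank among the global top `k`.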
In this paper some implicit domain decomposition procedures for solving parabolic problems are proposed. In these methods, the classic implicit scheme is used in each sub-domain, and Dirichlet boundary values at the (...
ISBN: 0769515126 (print)
Our new architecture, known as the Scheduled DataFlow (SDF) system, deviates from the current trend of building complex hardware to exploit Instruction Level Parallelism (ILP) by exploring a simpler, yet powerful execution paradigm that is based on dataflow, multithreading and decoupling of memory accesses from execution. A program is partitioned into non-blocking threads. In addition, all memory accesses are decoupled from the thread's execution. Data is pre-loaded into the thread's context (registers), and all results are post-stored after the completion of the thread's execution. Even though multithreading and decoupling are possible with control-flow architectures, the non-blocking and functional nature of the SDF system makes it easier to coordinate the memory accesses and execution of a thread. In this paper we show some recent improvements to the SDF implementation, whereby threads exchange data directly in register contexts, thus eliminating the need for creating thread frames. It is therefore now possible to explore the scalability of our architecture's performance when more register contexts are included on the chip.
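The three-phase structure of a decoupled, non-blocking thread (pre-load, execute, post-store) can be illustrated with a toy software model. This is only a conceptual sketch, not the SDF hardware or its instruction set; `run_decoupled_thread` and its parameters are hypothetical names, and memory is modelled as a plain dictionary.

```python
def run_decoupled_thread(memory, load_addrs, compute, store_addrs):
    """Toy model of one non-blocking thread with decoupled memory access.

    memory      -- dict mapping address to value (stand-in for main memory)
    load_addrs  -- addresses read during the pre-load phase
    compute     -- function from a list of register values to a list of results
    store_addrs -- addresses written during the post-store phase
    """
    # Pre-load phase: all memory reads happen before execution starts.
    registers = [memory[addr] for addr in load_addrs]
    # Execute phase: works on the register context only, never on memory,
    # so it cannot block on a memory access.
    results = compute(registers)
    # Post-store phase: all writes happen after execution has completed.
    for addr, value in zip(store_addrs, results):
        memory[addr] = value
    return memory
```

Because the execute phase touches only its register context, a scheduler could overlap the pre-load and post-store phases of other threads with it, which is the kind of coordination the abstract refers to.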
This paper presents a general methodology for the efficient parallelization of existing data cube construction algorithms. We describe two different partitioning strategies, one for top-down and one for bottom-up cube algorithms. Both partitioning strategies assign subcubes to individual processors in such a way that the loads assigned to the processors are balanced. Our methods reduce interprocessor communication overhead by partitioning the load in advance instead of computing each individual group-by in parallel. Our partitioning strategies create a small number of coarse tasks. This allows for sharing of prefixes and sort orders between different group-by computations. Our methods enable code reuse by permitting the use of existing sequential (external memory) data cube algorithms for the subcube computations on each processor. This supports the transfer of optimized sequential data cube code to a parallel setting. The bottom-up partitioning strategy balances the number of single-attribute external memory sorts made by each processor. The top-down strategy partitions a weighted tree in which weights reflect algorithm-specific cost measures such as estimated group-by sizes. Both partitioning approaches can be implemented on any shared-disk parallel machine composed of p processors connected via an interconnection fabric and with access to a shared parallel disk array. We have implemented our parallel top-down data cube construction method in C++ with the MPI message passing library for communication and the LEDA library for the required graph algorithms. We tested our code on an eight-processor cluster, using a variety of data sets with a range of sizes, dimensions, density, and skew. Comparison tests were performed on a SunFire 6800. The tests show that our partitioning strategies generate a close to optimal load balance between processors. The actual run times observed show an optimal speedup of p.
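To make the idea of assigning a small number of weighted coarse tasks to p processors concrete, here is a minimal sketch. It deliberately swaps in a simpler, well-known technique, the greedy longest-processing-time (LPT) heuristic on a flat task list, rather than the paper's weighted tree partitioning. The function name `balance_tasks` and the task weights are assumptions for illustration.

```python
import heapq


def balance_tasks(weights, p):
    """Assign weighted tasks to p processors using the greedy LPT heuristic.

    weights -- dict mapping task name to estimated cost (hypothetical values)
    Returns (assignment, loads): tasks per processor and total load per processor.
    """
    if p < 1:
        raise ValueError("p must be at least 1")
    # Min-heap of (current load, processor id); all processors start empty.
    heap = [(0, proc) for proc in range(p)]
    assignment = [[] for _ in range(p)]
    # Place the heaviest tasks first, each on the least-loaded processor.
    for task, weight in sorted(weights.items(), key=lambda kv: (-kv[1], kv[0])):
        load, proc = heapq.heappop(heap)
        assignment[proc].append(task)
        heapq.heappush(heap, (load + weight, proc))
    loads = [0] * p
    for load, proc in heap:
        loads[proc] = load
    return assignment, loads
```

Partitioning in advance like this means each processor can then run an unmodified sequential algorithm on its own tasks, which mirrors the code reuse argument in the abstract; the quality of the balance depends on how accurate the estimated weights are.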