We propose a new multicomputer node architecture, the DI-multicomputer, which can provide higher memory and communication performance than existing multicomputer architectures. By integrating a router onto each processor chip and eliminating the memory bus interface, each processor uses packet routing for both local memory access and internode communication. Multi-packet handling mechanisms are used to implement a high-performance memory interface based on packet routing. The DI-multicomputer network interface directs different types of messages to an appropriate level of the memory hierarchy, providing efficient communication for both short and long messages. Trace-driven simulations show that the communication mechanisms of the DI-multicomputer can achieve speedups of up to four times over existing architectures.
Superscalar microprocessors obtain high performance by exploiting parallelism at the instruction level. To use the instruction-level parallelism found in general-purpose, non-numeric code effectively, future processors will need to execute speculatively far beyond the conditional branches that limit instruction fetch. One result of this deep speculation is an increase in the number of instruction and data memory references due to the execution of mispredicted paths. Using a tool we developed to generate speculative traces from Intel architecture Unix binaries, we examine the differences in cache performance between speculative and non-speculative execution models. The results pertaining to increased memory traffic, mispredicted-path reference effects, allocation strategies, and speculative write buffers are discussed.
The authors have developed a conceptual design for a multiprocessor system to implement multidimensional digital signal processing applications. Their approach is to develop an application-specific computing system instead of a specific hardware solution. The application-specific computing system can be used as a research tool for exploring new approaches to problems. In addition, it can achieve high performance at reasonable cost because it can use commercially available processors. They present a performance evaluation of their system using the 2-D FIR filter as an example. They explain their approach to implementing the 2-D FIR filter by first reviewing their previous implementation of the 2-D IIR filter.
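For readers unfamiliar with the example workload, the following is a minimal sketch of a direct-form 2-D FIR filter. It is not the authors' implementation: the function name `fir2d`, the list-of-lists image layout, and the zero-padding at the borders are all assumptions made for illustration.

```python
def fir2d(x, h):
    """Direct-form 2-D FIR filter: y[m][n] = sum_i sum_j h[i][j] * x[m-i][n-j].

    Samples that fall outside the input are treated as zero
    (an assumption of this sketch, not something the abstract states).
    """
    rows, cols = len(x), len(x[0])
    y = [[0.0] * cols for _ in range(rows)]
    for m in range(rows):
        for n in range(cols):
            acc = 0.0
            for i, h_row in enumerate(h):
                for j, coeff in enumerate(h_row):
                    # Skip taps that would read before the first row or column.
                    if m - i >= 0 and n - j >= 0:
                        acc += coeff * x[m - i][n - j]
            y[m][n] = acc
    return y


# Filtering a unit impulse reproduces the coefficient array.
impulse = [[1, 0, 0], [0, 0, 0], [0, 0, 0]]
kernel = [[1, 2], [3, 4]]
response = fir2d(impulse, kernel)
```

Each output sample depends only on input samples and coefficients, never on other outputs, so the rows of `y` could be computed independently on separate processors. A 2-D IIR filter lacks this property because its outputs feed back into later outputs.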
Exploiting parallelism is key to building high-performance database systems. Several approaches to building database systems that support both inter- and intra-query parallelism have been proposed. These approaches can be broadly classified as either Shared Nothing (SN) or Shared Everything (SE). Although the SN approach is highly scalable, it requires complex data partitioning and tuning to achieve good performance, whereas the SE approach does not scale. We propose a scalable sharing approach that combines the advantages of both SN and SE. We propose a comprehensive database architecture that includes the underlying hardware as well as data partitioning and scheduling strategies to promote scalable sharing. We analyze the performance and scalability of our approach and compare them with those of an SN system. We find that, for a variety of workloads and degrees of data skew, our approach performs and scales at least as well as an SN system that uses the best possible data partitioning strategy.
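To make the partitioning and skew problem concrete, here is a small sketch of hash partitioning across shared-nothing nodes. It is illustrative only and is not the scheme proposed in the paper: the function names `partition` and `skew`, the use of CRC-32 as the hash, and the sample workload are assumptions.

```python
import zlib
from collections import Counter


def partition(keys, num_nodes):
    """Assign each tuple to a node by hashing its key; return per-node tuple counts."""
    loads = Counter()
    for key in keys:
        # CRC-32 is used only because it is deterministic across runs.
        node = zlib.crc32(str(key).encode("utf-8")) % num_nodes
        loads[node] += 1
    return [loads[node] for node in range(num_nodes)]


def skew(loads):
    """Ratio of the busiest node's load to the average load (1.0 means perfectly balanced)."""
    average = sum(loads) / len(loads)
    return max(loads) / average


# A workload in which one hot key accounts for 90 of 100 tuples.
hot_loads = partition(["a"] * 90 + list(range(10)), 4)
hot_skew = skew(hot_loads)
```

Because every tuple with the hot key hashes to the same node, that node holds at least 90 of the 100 tuples regardless of the hash function. In an SN system the busiest node bounds query time, which is why careful partitioning matters there.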
ISBN: 0818664274 (print)
An edge detection process in computer vision and image processing detects any type of significant feature that appears as a discontinuity in intensity. This paper presents our experience with parallelizing an edge detection algorithm that reduces noise and unnecessary detail in a gray-scale image from a coarse level to a fine level of resolution by using an edge focusing technique. Numerical methods and parallel implementations of edge focusing are presented. The edge detection algorithms are implemented on three representative message-passing architectures: a low-cost heterogeneous PVM network, an Intel iPSC/860 hypercube, and a CM-5 massively parallel multicomputer. Our objectives are to provide insight into implementation and performance issues for image processing applications on general-purpose message-passing architectures, to investigate the implications of network variations, and to evaluate the scalability of computation on the three network systems by examining the execution and communication patterns of the edge detection application.
In numerical algorithms based on adaptive mesh refinement, the computational workload changes during execution. In mapping such algorithms onto distributed memory architectures, it is necessary to balance the workload among the processors dynamically in order to obtain high performance. In this paper, we propose a dynamic processor allocation algorithm for a mesh architecture that reassigns the workload in an attempt to minimize both the computational and communication costs. Our algorithm is based on a heuristic for a 2D packing problem that gives provably close-to-optimal solutions for special cases of the problem. We also demonstrate through experiments that our algorithm provides good-quality solutions in general.
The family of reconfigurable generalized hypercube (RGH) architectures is proposed for the construction of scalable parallel computers. The objective is to reduce the high VLSI complexity of generalized hypercubes while maintaining, to a large extent, their outstanding performance. Generalized hypercubes are versatile topologies of very high cost that optimally emulate binary hypercubes and k-ary n-cubes. RGHs, which are lower-cost reconfigurable systems, efficiently emulate generalized hypercubes for application algorithms that use regular communication patterns. RGHs generally perform better than binary hypercubes and k-ary n-cubes with the same number of nodes. To illustrate the viability of RGHs, extensive cost analysis and comparisons with relevant systems are carried out. The hardware cost of RGHs is shown to be even lower than that of fat trees. Therefore, scalable RGHs are viable candidates for the construction of versatile parallel computers.
ISBN: 0818664274 (print)
In this paper, we use execution-driven simulation to compare the vector processing performance, measured as the total execution time of an application program, of cache-based vector computers with that of uncached vector computers having a large number of interleaved memory banks. The cache memory used here is a new cache organization called the prime-mapped cache. Simulation results on SPEC92 benchmarks show that the cache-based vector computers perform significantly better than the vector computers with no cache. The performance improvement grows as the speed gap between processors and memories widens. The cache-based vector computers are also cost-effective because they reduce the large number of interleaved memory banks otherwise needed by vector computers with no cache.
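The abstract does not describe how the prime-mapped cache is organized. The sketch below assumes, as the name suggests, that the number of cache lines is a prime and that a line address is mapped by taking it modulo that prime; both points are assumptions for illustration, as is the helper name `lines_touched`. It shows why such a mapping can help strided vector accesses.

```python
def lines_touched(stride, accesses, num_lines):
    """Count the distinct cache lines used by a strided access pattern.

    Address i * stride is mapped to line (i * stride) % num_lines,
    i.e. a simple direct-mapped cache indexed by line address.
    """
    return len({(i * stride) % num_lines for i in range(accesses)})


# Stride-8 vector access against 8 lines (power of two) and 7 lines (prime).
power_of_two_lines = lines_touched(8, 64, 8)
prime_lines = lines_touched(8, 64, 7)
```

With 8 lines, every stride-8 access maps to the same line, so each access evicts the previous one. With 7 lines, the stride and the line count share no common factor, so the accesses spread over all 7 lines.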
The main contribution of this work is to propose two application-specific bus architectures for computing the prefix sums of a binary sequence. Our architectures feature the following characteristics: all broadcasts occur on buses of length 15 or 63; we use a new technique that we call shift switching, which allows switches to cyclically permute an incoming signal, dramatically improving the performance of the reconfigurable bus system. As it turns out, our special-purpose architectures improve the performance of the best algorithms known to date by a significant factor. Specifically, our solutions require no adders, are faster, and use less VLSI area than state-of-the-art architectures.
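As a software illustration of the problem being solved, the sketch below computes the prefix sums of a binary sequence in two ways. The second function only mimics the idea of cyclically shifting a signal instead of adding; it is a loose analogy and not the paper's bus architecture. The function names and the one-hot signal encoding are assumptions.

```python
from itertools import accumulate


def prefix_sums(bits):
    """Prefix sums of a binary sequence: out[k] = bits[0] + ... + bits[k]."""
    return list(accumulate(bits))


def prefix_sums_mod_by_shifting(bits, width):
    """Prefix sums modulo `width`, computed without any addition.

    A one-hot signal of `width` positions is cyclically shifted by one
    position for every 1 bit; the position of the single 1 is the running
    count modulo `width`.
    """
    signal = [1] + [0] * (width - 1)  # the 1 starts at position 0
    out = []
    for bit in bits:
        if bit:
            signal = signal[-1:] + signal[:-1]  # rotate by one position
        out.append(signal.index(1))
    return out


bits = [1, 0, 1, 1, 0, 1]
exact = prefix_sums(bits)
shifted = prefix_sums_mod_by_shifting(bits, 4)
```

The shifting version recovers each prefix sum only modulo the signal width, so a complete solution would need to combine several such stages or track wraparounds.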
ISBN: 0818656204 (print)
The Algorithm To Architecture Mapping Model (ATAMM) is a scheduling strategy for predictable performance in real-time multiprocessor dataflow architectures. The architecture under consideration consists of either heterogeneous or homogeneous processors and implements dataflow models of real-time applications with a cyclo-static assignment scheme. Terminology is developed for graph partitioning, heterogeneous computing, and assignment classifications. A design methodology is described for partitioning dataflow graphs into blocks of nodes (operations) for the purpose of cyclo-static node-to-processor assignment while satisfying a design objective. A theorem that provides cyclo-static assignment of node blocks is developed, proved, and illustrated by simulation results.