ISBN: 076952429X (print)
Current integration trends embrace the prosperity of single-chip multi-core processors. Although multi-core processors deliver significantly improved system throughput, single-thread performance is not addressed. In this paper, we propose a new execution paradigm that uses multiple cores on a single chip collaboratively to achieve high performance for single-thread memory-intensive workloads while maintaining the flexibility to support multithreaded applications. The proposed execution paradigm, dual-core execution, consists of two superscalar cores (a front and a back processor) coupled with a queue. The front processor fetches and preprocesses instruction streams and retires processed instructions into the queue for the back processor to consume. The front processor executes instructions as usual except for cache-missing loads, which produce an invalid value instead of blocking the pipeline. As a result, the front processor runs far ahead to warm up the data caches and fix branch mispredictions for the back processor. In-flight instructions are distributed across the front processor, the queue, and the back processor, forming a very large instruction window for single-thread out-of-order execution. The proposed architecture incurs only minor hardware changes and does not require any large centralized structures such as large register files, issue queues, load/store queues, or reorder buffers. Experimental results show remarkable latency-hiding capabilities of the proposed architecture, which even outperforms more complex single-thread processors with much larger instruction windows than the front or back processor.
ISBN: 0769524869 (print)
The three-dimensional discrete cosine transform (3D DCT) has been widely used in many applications such as video compression. The k-ary n-cube, in turn, is one of the most popular interconnection networks used in many recent multicomputers. As direct calculation of the 3D DCT is very time consuming, many researchers have been working on developing algorithms and special-purpose architectures for fast computation of the 3D DCT. This paper proposes a parallel algorithm for efficient calculation of the 3D DCT on k-ary n-cube multicomputers. The time complexity of the proposed algorithm is O(N) for an N x N x N input data cube, while direct calculation of the 3D DCT has a complexity of O(N^6).
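The O(N^6) cost of direct calculation can be made concrete with a small sequential sketch. This is not the parallel k-ary n-cube algorithm from the abstract, which is not described here; it only contrasts the direct triple sum with the standard separable (axis-by-axis) evaluation. The unnormalized DCT-II convention and the function names `dct_1d`, `dct_3d_direct`, and `dct_3d_separable` are assumptions made for illustration.

```python
import math


def dct_1d(x):
    """Unnormalized DCT-II of a sequence (assumed convention), O(n^2)."""
    n = len(x)
    return [
        sum(x[i] * math.cos(math.pi * (2 * i + 1) * k / (2 * n)) for i in range(n))
        for k in range(n)
    ]


def dct_3d_direct(cube):
    """Direct 3D DCT: N^3 outputs, each a sum over N^3 inputs, so O(N^6)."""
    n = len(cube)
    # Precompute the cosine basis c[k][i] shared by all three axes.
    c = [[math.cos(math.pi * (2 * i + 1) * k / (2 * n)) for i in range(n)]
         for k in range(n)]
    out = [[[0.0] * n for _ in range(n)] for _ in range(n)]
    for u in range(n):
        for v in range(n):
            for w in range(n):
                s = 0.0
                for x in range(n):
                    for y in range(n):
                        for z in range(n):
                            s += cube[x][y][z] * c[u][x] * c[v][y] * c[w][z]
                out[u][v][w] = s
    return out


def dct_3d_separable(cube):
    """Same transform applied one axis at a time: 3 * N^2 one-dimensional DCTs."""
    n = len(cube)
    out = [[row[:] for row in plane] for plane in cube]
    for x in range(n):                      # transform along z
        for y in range(n):
            out[x][y] = dct_1d(out[x][y])
    for x in range(n):                      # transform along y
        for z in range(n):
            col = dct_1d([out[x][y][z] for y in range(n)])
            for y in range(n):
                out[x][y][z] = col[y]
    for y in range(n):                      # transform along x
        for z in range(n):
            col = dct_1d([out[x][y][z] for x in range(n)])
            for x in range(n):
                out[x][y][z] = col[x]
    return out


# Small made-up input cube, used only to check that both routes agree.
N = 3
cube = [[[float(x * N * N + y * N + z) for z in range(N)] for y in range(N)]
        for x in range(N)]
direct = dct_3d_direct(cube)
separable = dct_3d_separable(cube)
max_diff = max(
    abs(direct[u][v][w] - separable[u][v][w])
    for u in range(N) for v in range(N) for w in range(N)
)
total = sum(cube[x][y][z] for x in range(N) for y in range(N) for z in range(N))
```

With the naive 1D transform used here, the separable version performs on the order of N^4 multiplications, which already shows how much of the direct cost is avoidable before any parallelism is applied.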
ISBN: 9781595931672 (print)
Given the importance of parallel mesh generation in large-scale scientific applications and the proliferation of multilevel SMT-based architectures, it is imperative to obtain insight into the interaction between meshing algorithms and these systems. We focus on Parallel Constrained Delaunay Mesh (PCDM) generation. We exploit coarse-grain parallelism at the subdomain level and fine-grain parallelism at the element level. This multigrain data-parallel approach targets clusters built from low-end, commercially available SMTs. Our experimental evaluation shows that current SMTs are not capable of executing fine-grain parallelism in PCDM. However, experiments on a simulated SMT indicate that with modest hardware support it is possible to exploit fine-grain parallelism opportunities. The exploitation of fine-grain parallelism results in higher performance than a pure MPI implementation and closes the gap between the performance of PCDM and the state-of-the-art sequential mesher on a single physical processor. Our findings extend to other adaptive and irregular multigrain parallel algorithms. Copyright 2005 ACM.
ISBN: 3540292357 (print)
Measurement and modelling of distributions of data communication times are commonly done for telecommunication networks, but have not previously been done for message-passing communications on parallel computers. We have used the MPIBench program to measure distributions of point-to-point MPI communication times for two different parallel computers, one with a low-end Ethernet network and one with a high-end Quadrics network. Here we present and discuss the results of efforts to fit the measured distributions with standard probability distribution functions such as the exponential, lognormal, Erlang, gamma, Pearson 5 and Weibull distributions.
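As a rough illustration of what fitting a measured distribution involves, the sketch below fits two of the named candidates (lognormal and exponential) by maximum likelihood and compares their log-likelihoods. The samples are synthetic values drawn from a lognormal distribution with made-up parameters, not MPIBench measurements, and the fitting method and the function names are assumptions; the abstract does not say how the authors performed their fits.

```python
import math
import random


def fit_lognormal(samples):
    """Maximum-likelihood estimates (mu, sigma) of a lognormal distribution."""
    logs = [math.log(t) for t in samples]
    mu = sum(logs) / len(logs)
    sigma = math.sqrt(sum((v - mu) ** 2 for v in logs) / len(logs))
    return mu, sigma


def loglik_lognormal(samples, mu, sigma):
    """Log-likelihood of the samples under lognormal(mu, sigma)."""
    return sum(
        -math.log(t * sigma * math.sqrt(2 * math.pi))
        - (math.log(t) - mu) ** 2 / (2 * sigma ** 2)
        for t in samples
    )


def fit_exponential(samples):
    """Maximum-likelihood rate of an exponential distribution (1 / mean)."""
    return len(samples) / sum(samples)


def loglik_exponential(samples, rate):
    """Log-likelihood of the samples under exponential(rate)."""
    return sum(math.log(rate) - rate * t for t in samples)


# Synthetic stand-in for communication times in microseconds (assumed values).
rng = random.Random(0)
times_us = [rng.lognormvariate(4.0, 0.5) for _ in range(5000)]

mu, sigma = fit_lognormal(times_us)
rate = fit_exponential(times_us)
ll_lognormal = loglik_lognormal(times_us, mu, sigma)
ll_exponential = loglik_exponential(times_us, rate)
```

A higher log-likelihood indicates a better fit among the candidates tried; with real measurements, a goodness-of-fit test or a visual comparison of histograms would normally accompany such a comparison.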
ISBN: 3540292357 (print)
This paper proposes efficient techniques to reconfigure a multi-processor array that is embedded in a 6-port switch lattice in the form of a rectangular grid. It has been shown that the proposed architecture with 6-port switches eliminates gate delays and notably increases the harvest when compared with one using 4-port switches. A new rerouting algorithm combines the latest techniques to maximize harvest without increasing reconfiguration time. Experimental results show that the new reconfiguration algorithm consistently outperforms the most efficient algorithm proposed in the literature.
ISBN: 0769524079 (print)
The Radon transform (RT) is a widely studied algorithm used to perform image pattern extraction in fields such as computer graphics, medical imagery, and avionics. Real-time implementation of the discrete RT (DRT) is extremely difficult due to its use of complex trigonometric functions and its O(N^3) time complexity, which makes its use in video applications difficult. An O(N^2 lg N) approximate discrete Radon transform (ADRT) that allows highly parallel computation has been presented in the literature [1]. This paper presents a computation architecture based on the ADRT, known as the xADRT. Performance analysis indicates that it can achieve a refresh rate of 10 frames per second for use in real-time image processing applications.
ISBN: 3540292357 (print)
In this paper we propose a new parallelization scheme for Simulated Annealing: Hierarchical Parallel SA (HPSA). This new scheme features coarse granularity in parallelization and is directed at message-passing systems such as clusters. It combines heuristics such as adaptive clustering with SA to achieve more efficiency in local search. Through experiments with various optimization problems and comparison with some available schemes, we show that HPSA is a powerful general-purpose optimization method. It can also serve as a framework for meta-heuristics to gain broader application.
ISBN: 0769523129 (print)
Multi-task parallel processor arrays are a common machine architecture in which, typically, the tasks running in parallel occupy disjoint subarrays of the machine. On dynamically and partially reconfigurable processor arrays the tasks can be changed during run time. This is useful for online scenarios in which the relative importance of tasks might change and the assignment of computational resources to the tasks should therefore be changed. Examples are optimization tasks in an online scenario in which the results of some tasks are needed earlier than expected at initialization. For such tasks the size of their subarrays must be increased because they need more computational resources to speed up. In this paper we design flexible Particle Swarm Optimization (PSO) algorithms for 2-dimensional reconfigurable processor arrays, where the algorithms can change their size and still show good optimization behaviour. PSO is an iterative, individual-based optimization algorithm that relies upon interactions of neighbouring particles, which makes it suitable for fine-grained parallel architectures. We propose a dynamic 2-dimensional hierarchical ordering of the particles within a task's subarray so that the best particles are concentrated in the center. This gives the best particles the strongest influence on the swarm. A further advantage is that size reductions of the tasks can easily be done by cutting off the outer parts of the swarm, which contain mostly the weaker particles. It is experimentally shown that the proposed algorithms perform better than standard PSO algorithms under conditions with a varying supply of computing resources available for the tasks. Moreover, the proposed algorithms also perform well under conditions with a constant supply of processing resources and no need for size changes.
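The idea of a swarm that can lose particles at run time can be sketched with a plain global-best PSO. This is a simplified stand-in, not the hierarchical 2-dimensional ordering proposed in the abstract: the `shrink` method below simply keeps the particles with the best personal bests, which only mimics the effect of cutting off the outer, weaker part of the swarm. The `Swarm` class, the sphere test function, and the coefficient values are all assumptions chosen for illustration.

```python
import random


def sphere(p):
    """Toy objective (assumed): sum of squares, minimum 0 at the origin."""
    return sum(x * x for x in p)


class Swarm:
    """Minimal global-best PSO with a resize hook (hypothetical helper class)."""

    def __init__(self, size, dim, rng, lo=-5.0, hi=5.0):
        self.rng = rng
        self.pos = [[rng.uniform(lo, hi) for _ in range(dim)] for _ in range(size)]
        self.vel = [[0.0] * dim for _ in range(size)]
        self.best_pos = [p[:] for p in self.pos]
        self.best_val = [sphere(p) for p in self.pos]

    def global_best(self):
        """Return (position, value) of the best personal best in the swarm."""
        i = min(range(len(self.pos)), key=lambda j: self.best_val[j])
        return self.best_pos[i][:], self.best_val[i]

    def step(self, w=0.729, c1=1.494, c2=1.494):
        """One synchronous PSO iteration with inertia w and attraction c1, c2."""
        g, _ = self.global_best()
        for i, p in enumerate(self.pos):
            for d in range(len(p)):
                r1, r2 = self.rng.random(), self.rng.random()
                self.vel[i][d] = (
                    w * self.vel[i][d]
                    + c1 * r1 * (self.best_pos[i][d] - p[d])
                    + c2 * r2 * (g[d] - p[d])
                )
                p[d] += self.vel[i][d]
            val = sphere(p)
            if val < self.best_val[i]:
                self.best_val[i] = val
                self.best_pos[i] = p[:]

    def shrink(self, new_size):
        """Drop the weakest particles, keeping the new_size best personal bests."""
        keep = sorted(range(len(self.pos)), key=lambda j: self.best_val[j])[:new_size]
        self.pos = [self.pos[j] for j in keep]
        self.vel = [self.vel[j] for j in keep]
        self.best_pos = [self.best_pos[j] for j in keep]
        self.best_val = [self.best_val[j] for j in keep]


swarm = Swarm(size=20, dim=2, rng=random.Random(1))
initial_best = swarm.global_best()[1]
for _ in range(50):
    swarm.step()
swarm.shrink(8)          # the task loses computing resources mid-run
for _ in range(50):
    swarm.step()
final_best = swarm.global_best()[1]
```

Because the best particle always survives `shrink`, the best value found so far is never lost when the swarm is reduced, which is the property the hierarchical ordering in the abstract is designed to provide on a processor array.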