ISBN: (print) 0769523129
An architecture for a reconfigurable superscalar processor is described in which some of its execution units are implemented in reconfigurable hardware. The overall configuration of the processor is defined according to how its reconfigurable execution units are configured. An efficient micro-architectural solution to configuration management is presented that effectively steers the current processor configuration toward a configuration that is well matched with the execution unit requirements of instructions being scheduled for execution. The approach first selects the best matched among four steering configurations based on the number and type of execution units required by the instructions. One of the steering configurations is dynamically defined as the current configuration; the other three are statically predefined. Once a steering configuration is selected, portions of it begin loading on corresponding reconfigurable execution units that are not busy. The active configuration of the processor is generally the overlap of two or more steering configurations.
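The selection step described in this abstract can be illustrated with a minimal Python sketch. The abstract does not give the matching metric, so the "shortfall" count used here, the unit type names (`alu`, `fpu`, `lsu`), and the configuration names are all assumptions made for illustration, not the authors' hardware mechanism.

```python
from collections import Counter

def shortfall(required, config):
    """Count required execution units that the configuration does not provide.
    This metric is an assumption; the abstract does not specify one."""
    return sum(max(0, n - config.get(unit, 0)) for unit, n in required.items())

def select_steering_config(instr_unit_types, current, predefined):
    """Pick the best match among the current configuration and the predefined ones."""
    required = Counter(instr_unit_types)
    candidates = [("current", current)] + list(predefined.items())
    # min() returns the first minimum, so ties favor the current configuration.
    return min(candidates, key=lambda c: shortfall(required, c[1]))[0]

# Hypothetical configurations: unit type -> number of units available.
predefined = {
    "int_heavy": {"alu": 4, "fpu": 1, "lsu": 1},
    "fp_heavy": {"alu": 1, "fpu": 4, "lsu": 1},
    "mem_heavy": {"alu": 1, "fpu": 1, "lsu": 4},
}
current = {"alu": 2, "fpu": 2, "lsu": 2}

choice = select_steering_config(["fpu", "fpu", "fpu", "alu"], current, predefined)
```

With three floating-point instructions queued, only the hypothetical `fp_heavy` configuration has zero shortfall, so it is selected. A balanced instruction mix would leave the current configuration in place.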
ISBN: (print) 0769523129
The GCA (Global Cellular Automata) model is a flexible model that can be used to implement all kinds of parallel algorithms. The GCA model consists of a field of cells, similar to the Cellular Automata model. Each cell has links to a set of remote cells, and these links can change dynamically from generation to generation. A cell reads its remote neighbors' states and then changes its own state according to a local rule. The model is massively parallel because all cells can change their states independently and in parallel. We have investigated how the GCA model can be implemented efficiently in hardware using a Field Programmable Gate Array (FPGA) prototyping platform. We have implemented a fully parallel architecture in which all cells operate in parallel, as well as other architectures in which the cells are stored in memories in order to handle a large number of cells. We show that in the fully parallel architecture a speed-up of around 190 is realistic on a modern FPGA platform compared to a software implementation on a PC. In the partially parallel architecture based on memories the speed-up will be lower, but the number of cells is restricted only by the capacity of the memories.
ISBN: (print) 0769523129
Parallel TCP flows are broadly used in the high performance distributed computing community to enhance network throughput, particularly for large data transfers. Previous research has studied the mechanism by which parallel TCP improves aggregate throughput, but no practical mechanism exists to predict its throughput and its impact on the background traffic. In this work, we address how to predict parallel TCP throughput as a function of the number of flows, as well as how to predict the corresponding impact on cross traffic. To the best of our knowledge, we are the first to answer the following question on behalf of a user: what number of parallel flows will give the highest throughput with less than a p% impact on cross traffic? We term this the maximum nondisruptive throughput. We begin by studying the behavior of parallel TCP in simulation to help derive a model for predicting parallel TCP throughput and its impact on cross traffic. Combining this model with some previous findings, we derive a simple yet effective online advisor. We evaluate our advisor through extensive simulations and wide-area experimentation.
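The question the advisor answers can be written as a small search, sketched below in Python. The two predictor functions are placeholders supplied by the caller. The toy formulas in the example are invented purely to exercise the search and are not the prediction model derived in the paper.

```python
def max_nondisruptive_flows(predict_throughput, predict_cross_impact, p, max_flows):
    """Return the flow count with the highest predicted throughput among those
    whose predicted impact on cross traffic stays below p percent.
    Returns None if no flow count satisfies the constraint."""
    best_n, best_tp = None, float("-inf")
    for n in range(1, max_flows + 1):
        if predict_cross_impact(n) < p:
            tp = predict_throughput(n)
            if tp > best_tp:
                best_n, best_tp = n, tp
    return best_n

# Hypothetical predictors, for illustration only.
def toy_throughput(n):
    return 100.0 * n / (n + 3)   # aggregate throughput with diminishing returns

def toy_impact(n):
    return 100.0 * n / (n + 10)  # percent reduction in cross-traffic throughput

advised = max_nondisruptive_flows(toy_throughput, toy_impact, p=20, max_flows=16)
```

With these toy predictors, two flows is the largest count whose predicted impact stays under 20%, and since the toy throughput only grows with the flow count, two is also the advised value.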
ISBN: (print) 0769523129
Data Distribution Management (DDM) is one of the six services provided by HLA/RTI. It complements Declaration/Interest Management by providing a flexible and extensive mechanism for further throttling the data placed on the network and delivered to federates, based on the simulated entities' interest in the data. DDM is of essential importance, especially for large scale distributed simulations. In the past few years, two main types of DDM protocols have been developed: region-based methods and grid-based methods. However, all of these techniques have obvious drawbacks, which affect their deployment in most applications that require high performance and low overhead. In our previous work, we proposed a dynamic grid-based DDM scheme that shows great potential when compared to both region-based and grid-based approaches. In this paper, we propose an improvement to our previous scheme, which we refer to as optimized dynamic grid-based DDM, to further reduce the irrelevant data that might be received by simulated entities.
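To show where irrelevant data comes from in a grid-based method, here is a minimal Python sketch of plain fixed-grid matching. It is a generic illustration and not the authors' dynamic or optimized scheme. The two-dimensional routing space, the region format `((x0, y0), (x1, y1))`, and the federate names are assumptions.

```python
from collections import defaultdict
from itertools import product

def cells_for_region(region, cell_size):
    """Return the set of grid cells that an axis-aligned region touches."""
    (x0, y0), (x1, y1) = region
    xs = range(int(x0 // cell_size), int(x1 // cell_size) + 1)
    ys = range(int(y0 // cell_size), int(y1 // cell_size) + 1)
    return set(product(xs, ys))

def grid_match(update_regions, subscription_regions, cell_size):
    """Map each publishing federate to the federates that receive its updates.
    Matching is per grid cell, so regions are never compared directly."""
    subscribers_by_cell = defaultdict(set)
    for federate, region in subscription_regions.items():
        for cell in cells_for_region(region, cell_size):
            subscribers_by_cell[cell].add(federate)
    receivers = {}
    for federate, region in update_regions.items():
        hit = set()
        for cell in cells_for_region(region, cell_size):
            hit |= subscribers_by_cell.get(cell, set())
        receivers[federate] = hit - {federate}
    return receivers

updates = {"A": ((1, 1), (4, 4))}
subscriptions = {"B": ((6, 6), (9, 9)), "C": ((12, 12), (15, 15))}
receivers = grid_match(updates, subscriptions, cell_size=10)
```

Here B's subscription region does not overlap A's update region, yet both fall in the same grid cell, so B still receives A's updates. That is the kind of irrelevant data the abstract aims to reduce.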
ISBN: (print) 0769523129
The parallel multiple front method is used in mechanical engineering to solve large sparse linear systems arising from finite element modeling. It is a parallel direct method based on a nonoverlapping domain decomposition method. The decomposition is usually built with a graph partitioning approach. However, this approach is not well suited to all parallel applications. For our parallel multiple front method, it yields computing times over the subdomains that can differ by a factor of two. We show that the computing time can be decreased by load balancing the computational volume over the subdomains. In this communication we present a sequential and a parallel version of our load balancing method, which corrects the computational volume of an initial decomposition produced by graph partitioning tools.
ISBN: (print) 0769523129
In this paper we propose two price-based job allocation schemes for computational grids. A grid system tries to solve problems submitted by various grid users by allocating the jobs to the computing resources governed by different resource owners. The prices charged by these owners are obtained from a pricing model that uses a bargaining game theory framework. These prices are then used for job allocation. We present the grid system model and formulate the two schemes as a constrained minimization problem and as a non-cooperative game, respectively. The objective of these schemes is to minimize the cost for the grid users. We present algorithms to compute the optimal load (job) fractions for allocating jobs to the computers. Finally, the two schemes are compared in simulations with various system loads and configurations, and conclusions are drawn.
ISBN: (print) 0769521320
This paper gives an overview of two related tools that we have developed to provide more accurate measurement and modelling of the performance of message-passing communication and application programs on distributed memory parallel computers. MPIBench uses a very precise, globally synchronised clock to measure the performance of MPI communication routines. It can generate probability distributions of communication times, not just the average values produced by other MPI benchmarks. This gives useful insights into the MPI communication performance of parallel computers, and in particular into how performance is affected by network contention. The Performance Evaluating Virtual Parallel Machine (PEVPM) provides a simple, fast and accurate technique for modelling and predicting the performance of message-passing parallel programs. It uses a virtual parallel machine to simulate the execution of the parallel program. The effects of network contention can be accurately modelled by sampling from the probability distributions generated by MPIBench. These tools are particularly useful on clusters with commodity Ethernet networks, where relatively high latencies, network congestion and TCP problems can significantly affect communication performance, which is difficult to model accurately using other tools. Experiments with example parallel programs demonstrate that PEVPM gives accurate performance predictions on commodity clusters. We also show that modelling communication performance using average times rather than sampling from probability distributions can give misleading results, particularly for programs running on a large number of processors.
ISBN: (print) 0769523129
Power consumption is a troublesome design constraint for emergent systems such as IBM's BlueGene/L. If current trends continue, future petaflop systems will require 100 megawatts of power to maintain high performance. To address this problem, the power and energy characteristics of high-performance systems must be characterized. To date, power-performance profiles for distributed systems have been limited to interactive commercial workloads. However, scientific workloads are typically non-interactive (batched) processes riddled with interprocess dependences and communication. We present a framework for direct, automatic profiling of power consumption for non-interactive, parallel scientific applications on high-performance distributed systems. Though our approach is general, we use our framework to study the power-performance efficiency of the NAS parallel benchmarks on a 32-node Beowulf cluster. We provide profiles by component (CPU, memory, disk, and NIC), by node (for each of 32 nodes), and by system scale (2, 4, 8, 16, and 32 nodes). Our results indicate that power profiles are often regular, corresponding to application characteristics, and that, for a fixed problem size, increasing the number of nodes always increases energy consumption but does not always improve performance. This finding suggests smart schedulers could be used to optimize for energy while maintaining performance.
ISBN: (print) 0769523129
Grid applications typically deal with huge amounts of data, and often the same data have to be transferred to and processed on many resources. Nevertheless, the majority of existing middleware platforms for Grid computing do not provide suitable programming and communication models to ease software development and to improve communication performance when a large set of receivers is involved. Some middleware for wide area network computing, such as ProActive, provides the group abstraction to deal transparently with a number of similar receivers. We propose an extension of this mechanism in order to improve its features for Grid environments. In particular, ProActive native groups have been extended at both the programming and communication levels in order to support both different internal behaviors and high performance communication based on IP multicast. A case study shows the effectiveness of the new mechanism and its efficiency compared with the original one.
ISBN: (print) 0769523129
When designing a SoC, meeting the required performance in terms of both processing power and power consumption is becoming more and more challenging. Moreover, since the range of targeted applications for every single product is growing rapidly, employing reconfigurable accelerators makes more and more sense for this purpose. Coarse grain reconfigurable architectures offer an alternative that provides interesting performance/flexibility trade-offs over traditional approaches. This paper presents an original method that efficiently exploits dynamic parallelism at both loop level and task level, a combination that remains rarely used. This method, called DHM (Dynamic Hardware Multiplexing), is based on a hardwired controller dedicated to run-time task scheduling and automatic loop unrolling. This paper shows that significant performance improvements can be achieved by combining intra-task and inter-task parallelism. Principles and validations are presented through a case study on a coarse grain reconfigurable architecture.