ISBN (print): 0780373154
To obtain the benefits of aggressive, wide-issue architectures, a large window of valid instructions must be available. While researchers have been successful in obtaining high accuracies with a range of dynamic branch predictors, there still remains the need for more aggressive instruction delivery. Loop bodies possess a large amount of spatial and temporal locality. A large percentage of a program's entire execution can be attributed to code found in loop bodies. If we retain this code in a buffer or the cache, we do not have to refetch this code on subsequent loop iterations. Loops tend to iterate multiple times before exiting, thus providing us with the opportunity to speculatively issue multiple iterations. While some loops can be unrolled by a compiler, many contain conditional branches. The number of times a loop iterates may be dependent on a program variable. These issues can hinder our ability to speculatively issue multiple iterations of a loop. If we are able to profile loops during runtime, we can use this information to more accurately issue speculative paths through loop bodies. In this paper we present a characterization of loop execution across the SPECint2000 benchmark suite. We intend for this study to serve as a guide in the selection of design parameters of a loop path predictor. We characterize the patterns exhibited during multiple visits to a loop body. We present the design of a table that records path-based loop execution history and allows us to predict multiple loop iterations dynamically.
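The kind of path-based loop history table the abstract alludes to can be pictured with a small sketch. The C++ class below is not the authors' design; it is a minimal, hypothetical illustration that indexes on a loop's identifying PC, records the sequence of conditional-branch outcomes seen in each iteration as a path signature, and predicts the most frequently observed path for speculatively issuing future iterations.

#include <cstdint>
#include <iostream>
#include <map>
#include <string>

// Hypothetical sketch of a path-based loop history table (not the paper's design).
// Each loop, identified by the PC of its backward branch, keeps a count of the
// branch-outcome paths ("T"/"N" per conditional branch) seen in past iterations.
class LoopPathTable {
public:
    // Record the path signature observed during one completed iteration.
    void record(uint64_t loop_pc, const std::string& path) {
        ++table_[loop_pc][path];
    }
    // Predict the most frequently observed path for the next iteration.
    std::string predict(uint64_t loop_pc) const {
        auto it = table_.find(loop_pc);
        if (it == table_.end()) return "";          // no history yet
        std::string best;
        unsigned best_count = 0;
        for (const auto& entry : it->second) {
            if (entry.second > best_count) {
                best_count = entry.second;
                best = entry.first;
            }
        }
        return best;
    }
private:
    std::map<uint64_t, std::map<std::string, unsigned>> table_;
};

int main() {
    LoopPathTable t;
    // Iterations of the loop at PC 0x400a10 mostly take the "TN" path.
    t.record(0x400a10, "TN");
    t.record(0x400a10, "TN");
    t.record(0x400a10, "NT");
    std::cout << "predicted path: " << t.predict(0x400a10) << "\n";  // prints TN
}

A real predictor would of course bound the table size and confidence counters; the sketch only shows the record-and-predict interface the characterization study is meant to parameterize.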
Author:
Grosz, L., Massey Univ, Inst Informat & Math Sci, N Shore Mail Ctr, Auckland, New Zealand
We consider the algebraic multilevel iteration (AMLI) for the solution of systems of linear equations as they arise from a finite-difference discretization on a rectangular grid. The key operation is the matrix-vector product, which can be executed efficiently on vector and parallel-vector computer architectures if the nonzero entries of the matrix are concentrated in a few diagonals. In order to maintain this structure for all matrices on all levels, coarsening in alternating directions is used. In some cases it is necessary to introduce additional dummy grid hyperplanes. The data movements in the restriction and prolongation are crucial, as they produce massive memory conflicts on vector architectures. By using a simple performance model the best of the possible vectorization strategies is automatically selected at runtime. Examples show that on a Fujitsu VPP300 the presented implementation of AMLI reaches about 85% of the useful performance, and scalability with respect to computing time can be achieved.
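To make concrete why concentrating the nonzeros in a few diagonals helps on vector architectures, here is a hedged sketch (not taken from the paper): a matrix stored in diagonal (DIA) format, with the matrix-vector product written as long, stride-one inner loops, which is exactly the access pattern vector units exploit.

#include <algorithm>
#include <iostream>
#include <vector>

// Minimal diagonal-format (DIA) matrix-vector product, y = A*x, for an n-by-n
// matrix whose nonzeros lie on a few diagonals. Each inner loop runs over a
// contiguous index range with unit stride, which vectorizes well.
// offsets[d] is a diagonal's offset (0 = main, +1 = first super, -1 = first sub);
// diag[d][i] holds A(i, i + offsets[d]); entries outside the valid row range
// of that diagonal are never read.
std::vector<double> dia_matvec(int n,
                               const std::vector<int>& offsets,
                               const std::vector<std::vector<double>>& diag,
                               const std::vector<double>& x) {
    std::vector<double> y(n, 0.0);
    for (std::size_t d = 0; d < offsets.size(); ++d) {
        int off = offsets[d];
        int lo = std::max(0, -off);          // first row touched by this diagonal
        int hi = std::min(n, n - off);       // one past the last row touched
        for (int i = lo; i < hi; ++i)        // stride-one, easily vectorized
            y[i] += diag[d][i] * x[i + off];
    }
    return y;
}

int main() {
    // 1D Poisson-like tridiagonal matrix: 2 on the main diagonal, -1 off-diagonal.
    int n = 5;
    std::vector<int> offsets = {-1, 0, 1};
    std::vector<std::vector<double>> diag = {
        std::vector<double>(n, -1.0),
        std::vector<double>(n, 2.0),
        std::vector<double>(n, -1.0)};
    std::vector<double> x(n, 1.0);
    for (double v : dia_matvec(n, offsets, diag, x)) std::cout << v << " ";
    std::cout << "\n";   // prints 1 0 0 0 1
}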
Let G be a Delta-regular graph with n vertices and girth at least 4 such that Delta much greater than log n. We give very simple, randomized, distributed algorithms for vertex coloring G with Delta/k colors in O(k + log n) communication rounds, where k = O(log Delta). The algorithm may fail or exceed the above running time, but the probability that this happens is o(1), a quantity that goes to zero as n grows. The probabilistic analysis relies on a powerful generalization of Azuma's martingale inequality that we dub the Method of Bounded Variances. (C) 2000 Academic Press.
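As a rough illustration of the kind of synchronous, randomized, conflict-driven rounds such distributed coloring algorithms use, here is a hedged C++ simulation. It is deliberately simplified: every uncolored vertex proposes a random color from a fixed palette and keeps it only if no neighbor already holds or simultaneously proposed that color. This is a generic round of this style, not the authors' Delta/k-coloring algorithm or its martingale analysis.

#include <iostream>
#include <random>
#include <vector>

// Simplified simulation of randomized coloring rounds on an 8-cycle (Delta = 2),
// using Delta + 1 colors; a generic conflict-resolution scheme for illustration.
int main() {
    const int n = 8;
    std::vector<std::vector<int>> adj(n);
    for (int v = 0; v < n; ++v) {
        adj[v].push_back((v + 1) % n);
        adj[v].push_back((v + n - 1) % n);
    }
    const int palette = 3;                   // Delta + 1 colors
    std::vector<int> color(n, -1), proposal(n, -1);
    std::mt19937 rng(42);
    std::uniform_int_distribution<int> pick(0, palette - 1);

    int rounds = 0, uncolored = n;
    while (uncolored > 0) {
        ++rounds;
        // Each uncolored vertex proposes a random color for this round.
        for (int v = 0; v < n; ++v)
            proposal[v] = (color[v] == -1) ? pick(rng) : -1;
        // A proposal is kept only if no neighbor holds or proposed the same color.
        for (int v = 0; v < n; ++v) {
            if (color[v] != -1) continue;
            bool ok = true;
            for (int u : adj[v])
                if (color[u] == proposal[v] || proposal[u] == proposal[v]) ok = false;
            if (ok) { color[v] = proposal[v]; --uncolored; }
        }
    }
    std::cout << "colored the cycle in " << rounds << " rounds\n";
}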
ISBN (print): 9781581131963
Many data-parallel applications, including emerging media applications, have regular structures that can easily be expressed as a series of arithmetic kernels operating on data streams. Data-parallel architectures are designed to exploit this regularity by performing the same operation on many data elements concurrently. However, applications containing data-dependent control constructs perform poorly on these architectures. Conditional streams convert these constructs into data-dependent data movement. This allows data-parallel architectures to efficiently execute applications with data-dependent control flow. Essentially, conditional streams extend the range of applications that a data-parallel architecture can execute efficiently. For example, polygon rendering speeds up by a factor of 1.8 with the use of conditional streams.
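The core trick, turning data-dependent control flow into data-dependent data movement, can be illustrated with a hedged scalar sketch. The real mechanism operates on SIMD stream hardware; this toy model only shows the idea: elements are routed into two dense substreams by a predicate, each substream runs a branch-free kernel, and the results land back in their original positions. The example kernels below are made up for illustration.

#include <cstddef>
#include <iostream>
#include <vector>

// Toy model of conditional streams: instead of branching per element inside a
// kernel, the input stream is split by a predicate into two substreams, each
// substream is processed uniformly, and results return to their original slots.
int main() {
    std::vector<int> input = {3, -7, 4, -1, 0, 9, -5, 2};
    std::vector<int> output(input.size());

    // 1. Data-dependent routing: build index lists for each side of the condition.
    std::vector<std::size_t> pos_idx, neg_idx;
    for (std::size_t i = 0; i < input.size(); ++i)
        (input[i] >= 0 ? pos_idx : neg_idx).push_back(i);

    // 2. Each substream is processed uniformly (no per-element branch inside).
    for (std::size_t i : pos_idx) output[i] = input[i] * 2;   // "then" kernel
    for (std::size_t i : neg_idx) output[i] = -input[i];      // "else" kernel

    // 3. Results already landed in their original positions via the index lists.
    for (int v : output) std::cout << v << " ";
    std::cout << "\n";   // prints 6 7 8 1 0 18 5 4
}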
ISBN (print): 0898714532
High performance applications involving large data sets require the efficient and flexible use of multiple disks. In an external memory machine with D parallel, independent disks, only one block can be accessed on each disk in one I/O step. This restriction leads to a load balancing problem that is perhaps the main inhibitor for adapting single-disk external memory algorithms to multiple disks. This paper shows that this problem can be solved efficiently using a combination of randomized placement, redundancy and an optimal scheduling algorithm. A buffer of O(D) blocks suffices to support efficient writing of arbitrary blocks if blocks are distributed uniformly at random to the disks (e.g., by hashing). If two randomly allocated copies of each block exist, N arbitrary blocks can be read within ⌈N/D⌉ + 1 I/O steps with high probability. In addition, the redundancy can be reduced from 2 to 1 + 1/r for any integer r. These results can be used to emulate the simple and powerful "single-disk multi-head" model of external computing [1] on the physically more realistic independent disk model [33] with small constant overhead. This is faster than a lower bound for deterministic emulation [3].
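The randomized-placement-with-redundancy idea can be sketched as follows; note this uses a simple greedy assignment of each request to the currently less-loaded of its two copies, not the optimal scheduling algorithm the paper analyzes. With two random copies per block, a batch of N reads usually spreads almost evenly over the D disks, and the number of parallel I/O steps is the maximum per-disk load.

#include <algorithm>
#include <iostream>
#include <random>
#include <vector>

// Sketch of reading a batch of blocks that were each written to two randomly
// chosen disks. Each request is assigned greedily to whichever of its two
// copies sits on the currently less-loaded disk (a heuristic, not the paper's
// optimal scheduler). The number of parallel I/O steps equals the maximum load.
int main() {
    const int D = 8;            // number of independent disks
    const int N = 64;           // blocks requested in this batch
    std::mt19937 rng(1);
    std::uniform_int_distribution<int> disk(0, D - 1);

    std::vector<int> load(D, 0);
    for (int b = 0; b < N; ++b) {
        int c1 = disk(rng);                       // disk holding the first copy
        int c2 = disk(rng);                       // disk holding the second copy
        int chosen = (load[c1] <= load[c2]) ? c1 : c2;
        ++load[chosen];
    }
    int steps = *std::max_element(load.begin(), load.end());
    std::cout << "N/D = " << N / D << ", I/O steps needed = " << steps << "\n";
}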
ISBN (print): 076950728X
Discrete event simulations are widely used to study and analyze active and conventional networking architectures and protocols. Active networks must coexist and communicate with conventional networks to effectively utilize and extend the infrastructure of the Internet. Hence, large scale network simulations containing both conventional and active components should be conducted to study crucial scalability and performance issues. In this paper, a framework to enable parallel co-simulation of conventional and active networks is described. The framework integrates ANSE, a parallel Active Networks Simulation Environment, with NS, a popular sequential network simulator. Object-oriented techniques that completely insulate the application modules from the modifications made to parallelize NS have been employed, in order to eliminate changes to the network models. This paper presents the design and implementation of the parallel co-simulation framework along with the results obtained from our co-simulation studies. Our studies indicate that parallel simulation techniques considerably reduce simulation times even for small-sized network models.
A deterministic, oblivious permutation routing algorithm on 2D meshes of constant queue size is presented. An optimal, linear-time bound for oblivious permutation routing can be obtained for these meshes. A new algorithm for oblivious routing based on the bit-reversal permutation is proposed, running in (2.954+ε)n steps, where ε is any positive constant and the queue size is a function of ε. The running time of the algorithm is less than 1.5 times the network diameter of the mesh. The n×n mesh is considered and models on these meshes are presented. The problems in controlling the pipelining of packet flows are discussed.
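For reference, the bit-reversal permutation the algorithm builds on maps an index to the value obtained by reversing its binary representation. The hedged sketch below only generates that permutation for a power-of-two row length; it is not the routing algorithm itself.

#include <iostream>
#include <vector>

// Generate the bit-reversal permutation of 0..n-1 for n a power of two:
// index i maps to the integer whose bits are those of i in reverse order.
std::vector<unsigned> bit_reversal(unsigned n, unsigned bits) {
    std::vector<unsigned> perm(n);
    for (unsigned i = 0; i < n; ++i) {
        unsigned r = 0;
        for (unsigned b = 0; b < bits; ++b)
            if (i & (1u << b)) r |= 1u << (bits - 1 - b);
        perm[i] = r;
    }
    return perm;
}

int main() {
    // For n = 8 (3 bits) the permutation is 0 4 2 6 1 5 3 7.
    for (unsigned v : bit_reversal(8, 3)) std::cout << v << " ";
    std::cout << "\n";
}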
Suppose that a parallel algorithm can include any number of parallel threads. Each thread can proceed without ever having to busy-wait for another thread. A thread can proceed until its termination, but no new threads can be formed. What kind of problems can such restrictive algorithms solve and still be competitive in the total number of operations they perform with the fastest serial algorithm for the same problem? Intrigued by this informal question, we considered one of the most elementary parallel algorithmic paradigms, that of balanced binary trees. The main contribution of this paper is a new balanced (not necessarily binary) tree no-busy-wait paradigm for parallel algorithms; applications of the basic paradigm to two problems are presented: building heaps, and executing parallel tree contraction (assuming a preparatory stage); the latter is known to be applicable to evaluating a family of general arithmetic expressions. To put things in context, we also discuss our `PRAM-on-chip' vision (actually a small update to it), presented at SPAA98.
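As one concrete instance of the balanced-tree flavour of parallelism the abstract refers to (hedged: this is standard bottom-up heap construction, not the paper's paradigm itself), note that the sift-down operations at the nodes of one tree level act on disjoint subtrees, so all nodes of a level could be handled by independent threads that never wait on each other. The sketch below performs the same work sequentially, level by level, to make the batching visible.

#include <algorithm>
#include <iostream>
#include <utility>
#include <vector>

// Bottom-up construction of a binary min-heap stored in an array (children of
// 0-based index i are 2i+1 and 2i+2). Sift-downs at nodes of the same level
// touch disjoint subtrees, so each level forms a batch of independent tasks.
void sift_down(std::vector<int>& a, std::size_t i) {
    std::size_t n = a.size();
    while (true) {
        std::size_t l = 2 * i + 1, r = 2 * i + 2, smallest = i;
        if (l < n && a[l] < a[smallest]) smallest = l;
        if (r < n && a[r] < a[smallest]) smallest = r;
        if (smallest == i) return;
        std::swap(a[i], a[smallest]);
        i = smallest;
    }
}

void build_heap_by_levels(std::vector<int>& a) {
    if (a.size() < 2) return;
    std::size_t last_internal = a.size() / 2 - 1;
    // Level starts are 0, 1, 3, 7, ...; find the level of the last internal node.
    std::size_t level_start = 0;
    while (2 * level_start + 1 <= last_internal) level_start = 2 * level_start + 1;
    // Process levels from the bottom up; nodes within a level are independent.
    for (;; level_start /= 2) {
        std::size_t level_end = std::min(2 * level_start, last_internal);
        for (std::size_t i = level_start; i <= level_end; ++i) sift_down(a, i);
        if (level_start == 0) break;
    }
}

int main() {
    std::vector<int> a = {9, 4, 7, 1, 0, 8, 5, 2, 6, 3};
    build_heap_by_levels(a);
    std::cout << "heap root (minimum): " << a[0] << "\n";   // prints 0
}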
We consider the problem of processing a given number of tasks on a given number of processors as quickly as possible when only vague information about the processing time of a task is available before it is completed. Whenever a processor is idle, it can be assigned, at the price of a certain overhead, a portion, called a chunk, of the unassigned tasks. The goal is to minimize the makespan, that is, the time that passes until all the tasks are completed. The difficulty then is to find the optimal tradeoff between the processors' load balance, which is favoured by having small, and therefore many, chunks, and the total scheduling overhead, which is lower when there are fewer chunks. This scheduling problem has been the subject of intensive research in the past, and a large variety of heuristics have been proposed. Its mathematical analysis, however, turned out to be difficult even for simplistic models of the vague-information issue, and little theoretical work has been presented to date. In this work we present a novel theoretical model that covers a multitude of natural vague-information scenarios, and for which we can prove general upper and lower bounds on the achievable makespan. From this we derive optimal bounds and algorithms for a whole variety of specific scenarios, including the modelling of task processing times as independent, identically distributed random variables, which guided the design of most of the previously existing heuristics. Unlike traditional approaches, our model neither ignores a priori knowledge of the input (the processing times) nor does it restrict the distribution of the input, but instead works with the concepts of an a priori estimate of the processing times, which is implicit in every algorithm, and a measure for the deviation of this estimate from the actual processing times, which is not known until all the tasks are completed.
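To make the chunk-size trade-off concrete, here is a hedged simulation of one classic heuristic, guided self-scheduling, in which an idle processor grabs a chunk of roughly (remaining tasks)/(number of processors). This only illustrates the load-balance-versus-overhead tension; it is not the model or the algorithms analysed in the paper, and the overhead and task-time distribution are made-up parameters.

#include <algorithm>
#include <functional>
#include <iostream>
#include <queue>
#include <random>
#include <vector>

// Simulation of guided self-scheduling: whenever a processor becomes idle it is
// assigned a chunk of about remaining/P unassigned tasks, paying a fixed
// scheduling overhead per chunk. Large chunks mean little overhead but poor
// balance near the end; small chunks balance well but pay the overhead often.
int main() {
    const int P = 4;                 // processors
    const int T = 1000;              // tasks
    const double overhead = 0.5;     // cost charged per assigned chunk
    std::mt19937 rng(7);
    std::exponential_distribution<double> task_time(1.0);  // mean 1 per task

    // Min-heap of the times at which each processor next becomes idle.
    std::priority_queue<double, std::vector<double>, std::greater<double>> idle;
    for (int p = 0; p < P; ++p) idle.push(0.0);

    int remaining = T;
    int chunks = 0;
    double makespan = 0.0;
    while (remaining > 0) {
        double t = idle.top(); idle.pop();           // earliest idle processor
        int chunk = std::max(1, remaining / P);      // guided self-scheduling rule
        remaining -= chunk;
        ++chunks;
        double work = 0.0;
        for (int i = 0; i < chunk; ++i) work += task_time(rng);
        double done = t + overhead + work;
        makespan = std::max(makespan, done);
        idle.push(done);
    }
    std::cout << "chunks assigned: " << chunks << ", makespan: " << makespan << "\n";
}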
We describe two deterministic algorithms for constructing the arrangement determined by a set of (algebraic) curve segments in the plane. They both use a divide-and-conquer approach based on derandomized geometric sampling and achieve the optimal running time O(n log n + k), where n is the number of segments and k is the number of intersections. The first algorithm, a simplified version of one presented in [1], generates a structure of size O(n log log n + k) and its parallel implementation runs in time O(log^2 n). The second algorithm is better in that the decomposition of the arrangement constructed has optimal size O(n + k) and it has a parallel implementation in the EREW PRAM model that runs in time O(log^(3/2) n). The improvements in the second algorithm are achieved by means of an approach that adds some degree of globality to the divide-and-conquer approach based on random sampling. The approach extends previous work by Dehne et al., Deng and Zhu and Kuehn, that use small separators for planar graphs in the design of randomized geometric algorithms for coarse grained multicomputers. The approach simplifies other previous geometric algorithms, and also has the potential of providing efficient deterministic algorithms for the external memory model.
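For orientation only: the parameters n and k in the O(n log n + k) bound are the number of input segments and the number of pairwise intersections in the arrangement. The hedged brute-force sketch below merely counts k for straight line segments in O(n^2) time with orientation tests; it is emphatically not the divide-and-conquer sampling algorithm described in the paper, and it ignores degenerate (collinear or touching) cases.

#include <cstddef>
#include <iostream>
#include <vector>

// Count pairwise intersections k of n straight line segments by brute force.
// Two segments properly intersect iff the endpoints of each one lie strictly
// on opposite sides of the other segment's supporting line.
struct Pt { double x, y; };
struct Seg { Pt a, b; };

double cross(Pt o, Pt p, Pt q) {
    return (p.x - o.x) * (q.y - o.y) - (p.y - o.y) * (q.x - o.x);
}

bool intersects(const Seg& s, const Seg& t) {
    double d1 = cross(s.a, s.b, t.a), d2 = cross(s.a, s.b, t.b);
    double d3 = cross(t.a, t.b, s.a), d4 = cross(t.a, t.b, s.b);
    return d1 * d2 < 0 && d3 * d4 < 0;
}

int main() {
    std::vector<Seg> segs = {
        {{0, 0}, {4, 4}},        // crosses the next two segments
        {{0, 4}, {4, 0}},
        {{0.5, 3}, {3.5, 3}},    // crosses the two diagonals
        {{5, 5}, {6, 6}}         // disjoint from everything else
    };
    int k = 0;
    for (std::size_t i = 0; i < segs.size(); ++i)
        for (std::size_t j = i + 1; j < segs.size(); ++j)
            if (intersects(segs[i], segs[j])) ++k;
    std::cout << "n = " << segs.size() << ", k = " << k << "\n";   // n = 4, k = 3
}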