A hierarchical bus network T = (V, E) uses hierarchically, tree-like connected buses as a communication network. New communication technologies like SCI (Scalable Coherent Interface) make such networks very attractive...
详细信息
A hierarchical bus network T = (V, E) uses hierarchically, tree-like connected buses as a communication network. New communication technologies like SCI (Scalable Coherent Interface) make such networks very attractive, because they allow their easy construction and guarantee reasonable communication performance. Such networks can be modeled as tree networks: leaves correspond to processors, inner nodes to buses, edges to switches, and bandwidths of inner nodes and edges are related to bandwidths of buses and switches, respectively. In this paper we address the problem of static data management. Given a set of shared data objects X and the read and write frequencies from the processors to the shared data objects, the goal is to compute a (maybe redundant) placement of the shared data objects to the processors, such that the congestion (the maximum over the load of all edges and inner nodes, induced by the read and write frequencies, divided by the bandwidth of the edge or inner node, respectively) is minimized. It is known that this problem can be solved optimally in linear time, if inner nodes are allowed to hold copies of shared data objects. In our model, inner nodes correspond to buses and therefore cannot store copies of shared data objects. We show that this restriction increases the complexity of the placement problem drastically: It becomes NP-hard. On the other hand, the main contribution of our paper is an approximation algorithm with runtime O(|X|·|V|·height(T)·log(degree(T))) that increases the congestion by a factor of at most 7.
In recent years, the task of allocating jobs to servers has been studied with the `balls and bins' abstraction. Results in this area exploit the large decrease in maximum load that can be achieved by allowing each...
详细信息
In recent years, the task of allocating jobs to servers has been studied with the `balls and bins' abstraction. Results in this area exploit the large decrease in maximum load that can be achieved by allowing each job (ball) a little freedom in choosing its destination server (bin). In this paper we examine an infinite and parallel allocation process (see [ABS98]) which is related to the `balls and bins' abstraction. The simple process can be used to model many problems arising in applications like load balancing, data accesses for parallel data servers, hashing, and PRAM simulations. Unfortunately, the parallel allocation process behaves in a highly non-uniform manner which makes its analysis challenging. Even the typically simple question of for which arrival rates the process is stable, is highly non-trivial. In order to cope with this non-uniform behavior we introduce a new sequential process and show (via simulations) that the sequential process models the behavior of the parallel one very accurately. We develop a system of ordinary differential equations in order to describe the behavior of our sequential process and present a thorough analysis of the performance this process. For example, we show that the queue length distribution decreases double-exponentially. Finally, we present simulation results indicating that the solutions to the differential equations very well predict the queue length distribution of our sequential process and the largest injection rate for which it is stable. Summarizing, we can conclude that in all the performance characteristics we have measured experimentally, the parallel and the sequential process are closely related. This indicates that the obtained solution of the differential equations and the results presented above are applicable to the parallel process, too.
Computer simulations of realistic applications usually require solving a set of non-linear partial differential equations (PDEs) over a finite region. The process of obtaining numerical solutions to the governing PDEs...
详细信息
ISBN:
(纸本)354067442X
Computer simulations of realistic applications usually require solving a set of non-linear partial differential equations (PDEs) over a finite region. The process of obtaining numerical solutions to the governing PDEs involves solving large sparse linear or eigen systems over the unstructured meshes that model the underlying physical objects. These systems are often solved iteratively, where the sparse matrix-vector multiply (SPMV) is the most expensive operation within each iteration. In this paper, we focus on the efficiency of SPMV using various ordering/partitioning algorithms. We examine different implementations using three leading programming paradigms and architectures. Results show that ordering greatly improves performance, and that cache reuse can be more important than reducing communication. However, a multithreaded implementation indicates that ordering and partitioning are not required on the Tera MTA to obtain an efficient and scalable SPMV.
Symmetry is one of the most important aesthetic criteria which clearly reveals the structure of the graph. However, previous work on symmetric graph drawing has focused on two dimensions. In this paper, we extend symm...
详细信息
We study an optimization problem that arises in the context of data placement in multimedia storage systems. We are given a collection of M multimedia data objects that need to be assigned to a storage system consisti...
详细信息
ISBN:
(纸本)0898714532
We study an optimization problem that arises in the context of data placement in multimedia storage systems. We are given a collection of M multimedia data objects that need to be assigned to a storage system consisting of N disks d(1),d(2)...,d(N). We are also given sets U-1,U-2,...,U-M such that U-i is the set of clients requesting the ith data object. Each disk d(j) is characterized by two parameters, namely, its storage capacity C-j which indicates the maximum number of data objects that may be assigned to it, and a load capacity L-j which indicates the maximum number of clients that it can serve. The goal is to find a placement of data objects on disks and an assignment of clients to disks so as to maximize the total number of clients served, subject to the capacity constraints of the storage system. We study this data placement problem for two natural classes of storage systems, namely, homogeneous and uniform ratio. Our first main result is a tight upper and lower bound on the number of items that can always be packed for any input instance to homogeneous as well as uniform ratio storage systems. We show that an algorithm given in [11] for data placement, achieves this bound. Our second main result is a polynomial time approximation scheme for the data placement problem in homogeneous and uniform ratio storage systems, answering an open question of [11]. Finally, we also study the problem from an empirical perspective.
Many data-parallel applications, including emerging media applications, have regular structures that can easily be expressed as a series of arithmetic kernels operating on data streams. Data-parallelarchitectures are...
详细信息
Many data-parallel applications, including emerging media applications, have regular structures that can easily be expressed as a series of arithmetic kernels operating on data streams. Data-parallelarchitectures are designed to exploit this regularity by performing the same operation on many data elements concurrently. However, applications containing data-dependent control constructs perform poorly on these architectures. Conditional streams convert these constructs into data-dependent data movement. This allows data-parallelarchitectures to efficiently execute applications with data-dependent control flow. Essentially, conditional streams extend the range of applications that a data-parallel architecture can execute efficiently. For example, polygon rendering speeds up by a factor of 1.8 with the use of conditional streams.
We study the problem of scheduling a set of n independent parallel tasks on m processors, where in addition to the processing time there is a size associated with each task indicating that the task can be processed on...
详细信息
Suppose that a parallel algorithm can include any number of parallel threads. Each thread can proceed without ever having to busy wait to another thread. A thread can proceed till its termination, but no new threads c...
ISBN:
(纸本)9781581131857
Suppose that a parallel algorithm can include any number of parallel threads. Each thread can proceed without ever having to busy wait to another thread. A thread can proceed till its termination, but no new threads can be formed. What kind of problems can such restrictive algorithms solve and still be competitive in the total number of operations they perform with the fastest serial algorithm for the same problem?Intrigued by this informal question, we considered one of the most elementary parallel algorithmic paradigms, that of balanced binary trees. The main contribution of this paper is a new balanced (not necessarily binary) tree no-busy-wait paradigm for parallelalgorithms; applications of the basic paradigm to two problems are presented: building heaps, and executing parallel tree contraction (assuming a preparatory stage); the latter is known to be applicable to evaluating a family of general arithmetic *** putting things in context, we also discuss our “PRAM-on-chip” vision (actually a small update to it), presented at SPAA98.
This paper presents mathematical foundations for the design of a memory controller subcomponent that helps to bridge the processor/memory performance gap for applications with strided access patterns. The parallel Vec...
ISBN:
(纸本)9781581131857
This paper presents mathematical foundations for the design of a memory controller subcomponent that helps to bridge the processor/memory performance gap for applications with strided access patterns. The parallel Vector Access (PVA) unit exploits the regularity of vectors or streams to access them efficiently in parallel on a multi-bank SDRAM memory system. The PVA unit performs scatter/gather operations so that only the elements accessed by the application are transmitted across the system bus. Vector operations are broadcast in parallel to all memory banks, each of which implements an efficient algorithm to determine which vector elements it holds. Earlier performance evaluations have demonstrated that our PVA implementation loads elements up to 32.8 times faster than a conventional memory system and 3.3 times faster than a pipelined vector unit, without hurting the performance of normal cache-line fills. Here we present the underlying PVA algorithms for both word interleaved and cache-line inter-leaved memory systems.
暂无评论