A c-vertex-ranking of a graph G for a positive integer c is a labeling of the vertices of G with integers such that, for any label i, deletion of all vertices with labels >i leaves connected components, each having...
详细信息
Superscalar and VLIW architectures are based on instruction-level parallelism (ILP), which ideally achieve high performance to execute multiple instructions in parallel. However, the system performance is restricted b...
详细信息
ISBN:
(纸本)0818678704
Superscalar and VLIW architectures are based on instruction-level parallelism (ILP), which ideally achieve high performance to execute multiple instructions in parallel. However, the system performance is restricted because of the Von Neumann bottleneck. Therefore, the memory hierarchy design is very important in this kind of architecture. We have proposed a computer architecture named Jetpipeline, which can execute both vector and scalar instructions in parallel. To make full use of the computing ability of Jetpipeline, this paper presents the memory hierarchy design for Jetpipeline and evaluates the effect of the design on the system performance of Jetpipeline through simulations.
The extraction of functional dependencies is a fundamental activity in the database design recovery process. Existing algorithms for this task are computationally expensive and appear to be impractical if applied to l...
详细信息
We consider the problem of sorting a file of N records on the D-disk model of parallel I/O in which there are two sources of parallelism, Records are transferred to and from disk concurrently in blocks of B contiguous...
详细信息
We consider the problem of sorting a file of N records on the D-disk model of parallel I/O in which there are two sources of parallelism, Records are transferred to and from disk concurrently in blocks of B contiguous records. In each I/O operation, up to one block can be transferred to or from each of the D-disks in parallel, We propose a simple, efficient, randomized mergesort algorithm called SRM that uses a forecast-and-flush approach to overcome the inherent difficulties of simple merging on parallel disks, SRM exhibits a limited use of randomization and also has a useful deterministic version. Generalizing the technique of forecasting, our algorithm is able to read in, at any time, the 'right' block from any disk and using the technique of flushing, our algorithm evicts, without any I/O overhead, just the 'right' blocks from memory to make space for new ones to be read in. The disk layout of SRM is such that it enjoys perfect write parallelism, avoiding fundamental inefficiencies of previous mergesort algorithms, By analysis of generalized maximum occupancy problems we are able to derive an analytical upper bound on SRM's expected overhead valid for arbitrary inputs, The upper bound derived on expected I/O performance of SRM indicates that SRM is provably better than disk-striped mergesort (DSM) for realistic parameter values D, M and B. Average-case simulations show further improvement on the analytical upper bound. Unlike previously proposed optimal sorting algorithms, SRM outperforms DSM even when the number D of parallel disks is small.
In this paper, a parallel and fault-tolerant LAN (P_FTLAN) with dual communication subnetworks is presented to improve LANs' reliability. Its function modes, technical characters, hardware and software architectur...
详细信息
In this paper, a parallel and fault-tolerant LAN (P_FTLAN) with dual communication subnetworks is presented to improve LANs' reliability. Its function modes, technical characters, hardware and software architectures, and some key implementation techniques, such as logical addresses and parallel mechanisms, are described in details. Our prototype system and analyzing results suggest that the scheme presented in the paper not only provides an effective approach to high reliable LANs, but also can improve their performance greatly.
programming languages like Fortran or C define exactly the layout of array elements in memory. Programmers often use that definition to access the same memory via variables of different types. For many real programs t...
详细信息
ISBN:
(纸本)0818680903
programming languages like Fortran or C define exactly the layout of array elements in memory. Programmers often use that definition to access the same memory via variables of different types. For many real programs this practice makes changing the layout of an array impossible without violating the semantics of the program since the same memory block may be accessed via variables of different types-such accessed may now receive wrong array elements. On the other hand, changing array layout is often necessary to obtain good parallel performance or even to improve sequential performance by providing better cache locality. Our paper demonstrates that the problem of changing array layouts in the presence of multiple variables of different types accessing the same memory can be solved with our algorithms for 1) detecting overlapping arrays, 2) using procedure cloning to reduce overlapping, 3) array type coercion, and 4) code structure recovery.
parallel logic programming systems are sophisticated examples of symbolic computing systems. They address problems such as dynamic memory allocation, scheduling irregular execution patterns, and managing different typ...
详细信息
parallel logic programming systems are sophisticated examples of symbolic computing systems. They address problems such as dynamic memory allocation, scheduling irregular execution patterns, and managing different types of implicit parallelism. Most parallel logic programming systems have been developed for bus-based shared-memory architectures. The complexity of parallel logic programming systems and the large amount of data they process raises the question of whether logic programming systems can still obtain good performance on scalable architectures, such as distributed shared-memory systems. In this work we use execution-driven simulation to investigate the access patterns and caching behaviour exhibited by a parallel logic programming system, Andorra-I. We show that the system obtains reasonable performance, but that it does not scale well. By studying the behaviour of the major data structures in Andorra-I in detail, we conclude that this result is largely a consequence of the scheduling and work manipulation implementation used in the system. We also show that the Andorra-I's data structures exhibit widely-varying memory access patterns and caching behaviour, which not only depend on the number of processors, but also on the amount and type of parallelism available in the application program. Some of these data structures clearly favour invalidate-based cache coherence protocols, while others favour update-based protocols. Since most of Andorra-I's data structures are common to other parallel logic programming systems, we believe that these systems can greatly benefit from flexible coherence schemes where either the compiler can specify the protocol to be used for each data structure or the protocol can adapt to varying memory access patterns.
Much of the computation involved in parallel programs occurs within loops, either nested loops as in parallel scientific applications or collections of loops as in stream-based applications. Loop fusion is a well-know...
详细信息
ISBN:
(纸本)9780897918909
Much of the computation involved in parallel programs occurs within loops, either nested loops as in parallel scientific applications or collections of loops as in stream-based applications. Loop fusion is a well-known program transformation that has shown to be effective in improving data locality in parallel programs by reducing inter-processor communication and improving register and cache locality. Weighted loop fusion is the problem of finding a legal partition of loop nests into fusible clusters so as to minimize the total inter-cluster weights. The loop nests may contain parallel or sequential loops;care is taken to ensure that a parallel loop does not get serialized after fusion. It has been shown in past work that the weighted loop fusion problem is NP-hard. Despite the NP-hardness property, we show how optimal solutions can be found efficiently (i.e., within the compile-time constraints of a product-quality optimizing compiler) for weighted loop fusion problem sizes that occur in practice. In this paper, we present an integer programming formulation for weighted loop fusion with size (number of variables and constraints) that is linearly proportional to the size of the input weighted loop fusion problem. The linear-sized formulation is key to making the execution time small enough for use in a product-quality optimizing compiler, since the natural integer programming formulation for this problem has cubic size for which the execution time would be too large to be practical. The linear-sized integer programming formulation can be solved efficiently using any standard optimization package but we also present a custom branch-and-bound algorithm that can be used if greater efficiency is desired. A prototype implementation of this approach has been completed, and preliminary compile-time measurements are included in the paper as validation of the practicality of this approach.
Presents an algorithm for computing a sum of products, realizing a fundamental compound multiply-and-add operation of high-speed arithmetic. Two new cellular pipelined algorithms and architectures (2D and 3D) are prop...
详细信息
An important attribute in the specification of many compute-intensive applications is `time'. Simulation of interactive virtual environments is one such domain. There is a mismatch between the synchronization and ...
详细信息
ISBN:
(纸本)9780897918909
An important attribute in the specification of many compute-intensive applications is `time'. Simulation of interactive virtual environments is one such domain. There is a mismatch between the synchronization and consistency guarantees needed by such applications (which are temporal in nature) and the guarantees offered by current shared memory systems. Consequently, programming such applications using standard shared memory style synchronization and communication is cumbersome. Furthermore, such applications offer opportunities for relaxing both the synchronization and consistency requirements along the temporal dimension. In this work, we develop a temporal programming model that is more intuitive for the development of applications that need temporal correctness guarantees. This model embodies two mechanisms: `delta consistency' - a novel time-based correctness criterion to govern the shared memory access guarantees, and a companion `temporal synchronization' - a mechanism for thread synchronization along the time axis. These mechanisms are particularly appropriate for expressing the requirements in interactive application domains. In addition to the temporal programming model, we develop efficient explicit communication mechanisms that aggressively push the data out to `future' consumers to hide the read miss latency at the receiving end. We implement these mechanisms on a cluster of workstations in a software distributed shared memory architecture called `Beehive.' Using a virtual environment application as the driver, we show the efficacy of the proposed mechanisms in meeting the real time requirements of such applications.
暂无评论