ElipSys is a programming system supporting a constraint logic programming (CLP) language and OR-parallel execution. these two features complement each other: CLP programming eases the writing of efficient search progr...
详细信息
In designing application-specific bit-level architectures and in programming existing bit-level processor arrays, it is necessary to expand a word-level algorithm into its bit-level form before dependence analysis can...
详细信息
In designing application-specific bit-level architectures and in programming existing bit-level processor arrays, it is necessary to expand a word-level algorithm into its bit-level form before dependence analysis can be performed. In this paper, we consider dependence structures of bit-level algorithms as functions of three components dependence structures of word-level algorithms, dependence structures of the arithmetic algorithms implementing word-wise operations, and algorithm expansions. Based on these components, we can derive dependence structures of bit-level algorithms without using time consuming general dependence analysis methods. To illustrate our approach, we derive two dependence structures for bit-level matrix multiplication and apply a method developed earlier [5,6,10] to design two bit-level architectures. One of these architectures is O{p) times faster than the best word-level architecture, where p is the word length. the speedup we found here is true in general because a bit in a bit-level architecture goes to the next processor for processing as soon as it is available.
the paper presents the differential hashing functions. A differential computation process enables hashing function processing time to be optimized. After a formal definition of the differential property for hashing fu...
详细信息
the paper presents the differential hashing functions. A differential computation process enables hashing function processing time to be optimized. After a formal definition of the differential property for hashing functions, we show that not all hashing functions have this property, then we propose a characterization of the differential hashing function set. Next, we show the performance acceleration produced by the differential algorithms applied to five hashing functions found in common applications. the observed accelerations can be significant because they are proportional to key length. Last we study the performances of a differential hashing function for the reachability graph exploration of distributed systems specified by a Petri net. this application demonstrates the advantages and the limits of the differential technique.< >
Recent advancements in VLSI and packaging technologies demonstrate attractiveness in building scalable parallel systems using clustered configurations while exploiting communication locality. Clustered architectures u...
详细信息
Recent advancements in VLSI and packaging technologies demonstrate attractiveness in building scalable parallel systems using clustered configurations while exploiting communication locality. Clustered architectures using buses or MINs as the inter-cluster interconnection do not satisfy boththe above objectives. this paper proposes a new class of k-ary n-cube cluster-c scalable architectures by combining the scalability of k-ary n-cube wormhole-routed networks withthe cost-effectiveness of processor cluster designs. this paper focuses on direct cluster interconnection. the interplay between various system parameters and routing schemes are analyzed to determine optimal configurations under the constant bisection bandwidth constraint. Our analysis indicates that small sized clusters with a ring intra-cluster topology and a 2D/3D/4D inter-cluster network connecting these clusters offer best system performance.< >
In parallel A* graph search on distributed-memory machines, different processors may perform significant duplicated work if inter-processor duplicates are not pruned. the only known method for duplicate pruning associ...
详细信息
In parallel A* graph search on distributed-memory machines, different processors may perform significant duplicated work if inter-processor duplicates are not pruned. the only known method for duplicate pruning associates a particular processor with each distinct node of the search space using a suitable hash function. then duplicate nodes arising in different processors are transmitted to the same processor, and thereby pruned. there are two main drawbacks attributable to such an approach: (1) Load balance is determined solely by the hash function and is unsatisfactory. (2) Node transmissions for duplicate pruning are global; this can lead to hot spots in the network. We propose two different duplicate pruning techniques that outperform this hashing-only method by using: (1) separate algorithms for duplicate pruning and load balancing, and (2) a novel search space partitioning scheme that evenly spreads out the bandwidth requirement for pruning over the entire parallel architecture. Using the Traveling Salesman Problem (TSP) as a test case, we find that on a 10-dimensional nCUBE2 hypercube multicomputer, our pruning strategies yield a speedup improvement of more than 135% over previous methods that do not prune any duplicates, and more than 155% over the hashing-only pruning scheme.< >
Most of the existing loop partitioning schemes require the dependence distances between the iterations of the loops to be constants which means that these schemes will not be applicable when the dependence distance is...
详细信息
Most of the existing loop partitioning schemes require the dependence distances between the iterations of the loops to be constants which means that these schemes will not be applicable when the dependence distance is variable. When each iteration of a loop is treated as a sequential instruction stream, parallel execution of these loops is possible if the data dependency between each iteration can be resolved during run-time. We propose a parallel architecture which is able to resolve these kind of undetermined data dependencies. A notable feature of the proposed architecture is that it will initiate executing multiple instruction streams before the data dependency between the instruction streams, if any, has been resolved and will restart the execution if necessary.< >
Communications latency forms a major obstacle to effective parallelprocessing. the bottlenecks of interprocessor communication can be caused by characteristics of a particular architecture or a particular application...
详细信息
Communications latency forms a major obstacle to effective parallelprocessing. the bottlenecks of interprocessor communication can be caused by characteristics of a particular architecture or a particular application, and especially by the relationship between the two. We believe that efficient parallelprocessing requires serious attention to this intersection of architecture and application. In this paper we report: our analysis of the execution behavior of three programs from the SPLASH set, using two multiprocessor systems and a simulator; our identification of one program as especially hostile to multiprocessors; and the results of our efforts to improve the performance of that program by applying our detailed knowledge of the relationship between application and architecture.< >
In a system supporting parallel execution, speculative processing can be used to increase the amount of parallelism. Speculative computations are started before it is known whether the results are needed. the system r...
详细信息
In a system supporting parallel execution, speculative processing can be used to increase the amount of parallelism. Speculative computations are started before it is known whether the results are needed. the system requires a method to favor the execution of more promising computations and to remove unwanted ones. Speculative processing can be equally or unequally prioritized. We describe the support for speculative parallelism in the language BaLinda Lisp, show its usefulness in applications to infinite data structure, heuristic search, and AND/OR parallel computation. We also discuss the condition and necessity for the early merging of the speculative tasks' environment.< >
In this paper, aimed at understanding the effects of resource utilization in the performance of multi-way joins in shared-nothing database systems, we introduce a performance modeling for the left and right-deep linea...
详细信息
In this paper, aimed at understanding the effects of resource utilization in the performance of multi-way joins in shared-nothing database systems, we introduce a performance modeling for the left and right-deep linear join tree structures. the main purpose of the model is to get some insight into the behavior of the tree formats and their effects in the overall parallelprocessing. We present many performance results and show that the best multi-way join processing performance is achieved when the number of active operators can be adjusted in order to maximize the overlap of the disk and network transfers.< >
Some parallel applications do not require a precise imitation of the behavior of the physically shared memory programming model. Consequently, certain parallel machine architectures have elected to emphasize different...
详细信息
Some parallel applications do not require a precise imitation of the behavior of the physically shared memory programming model. Consequently, certain parallel machine architectures have elected to emphasize different required coherency properties because of possible efficiency gains. this has led to various definitions of models of store coherency. these definitions have not been amenable to detailed analysis and, consequently, inconsistencies have resulted. In this paper a unified framework is proposed in which different models of store coherency are developed systematically by progressively relaxing the constraints that they have to satisfy. A demonstration is given of how formal reasoning can be carried out to compare different models. Some real-life systems are considered and a definition of a version of weak coherency is found to be incomplete.< >
暂无评论