We study the power of shared memory in models of parallel computation. We describe a novel distributed data structure that eliminates the need for shared memory without significantly increasing the run time of the parallel computation. We also show how a complete network of processors can deterministically simulate one PRAM step in O(log n (log log n)^2) time, when both models use n processors and the size of the PRAM's shared memory is polynomial in n. (The best previously known upper bound was the trivial O(n).) We also establish that this upper bound is nearly optimal: we prove that an online simulation of T PRAM steps by a complete network of processors requires Ω(T log n / log log n) time.
Several patterns of structure and locality of communication among software components assigned to processors are considered. In each case, a mapping between components and processors is identified so that high-volume communication paths correspond to connections supported most efficiently by the interconnection network. It is shown that locality of communication can have a profound effect on the efficiency of communication in a multicomputer. Several types of multiprocessors and multicomputers being designed for applications in artificial intelligence are likely to exhibit locality of communication of a form suitable for use by such interconnection networks.
With the prevalence of chip multiprocessors (CMPs) in server and client computers, using multiple cores to speed up existing sequential programs has become an important issue. Decoupled software pipelining (DSWP) is a recently proposed technique that extracts non-speculative threads from sequential programs for higher performance. However, this technique is not effective on commodity CMP architectures, because the inter-thread communication and synchronization overhead often offsets the benefit of parallelization. To reduce the overhead without modifying the CMP architecture, this paper presents clustered DSWP (CDSWP), an extension to DSWP. By communicating a set of dependent data items instead of a single dependent data item, this technique transforms a sequential program into a clustered thread pipeline. Here "clustered" means that several dependent data items are grouped together as one communication unit. The advantage of this technique is that it can eliminate false sharing and reduce the average cache latency, which greatly reduces the overhead. In preliminary experiments on some commodity CMP architectures, we achieved loop speedups ranging from 16% to 58% on some SPEC2000 benchmark programs.
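The idea of sending a cluster of dependent items as one communication unit can be illustrated with a minimal two-stage thread pipeline. This is only a sketch under stated assumptions, not the CDSWP transformation itself: the function name `run_pipeline`, the `batch_size` parameter, and the toy workload (squaring, then summing) are hypothetical, and Python threads do not reproduce the cache effects discussed in the abstract. The sketch shows only that the stages synchronize once per batch rather than once per item.

```python
import queue
import threading


def run_pipeline(items, batch_size):
    """Two-stage pipeline (hypothetical example): stage 1 squares, stage 2 sums."""
    channel = queue.Queue()
    result = []
    done = None  # sentinel marking the end of the stream

    def stage1():
        batch = []
        for x in items:
            batch.append(x * x)
            if len(batch) == batch_size:
                channel.put(batch)  # one queue operation per batch, not per item
                batch = []
        if batch:
            channel.put(batch)  # flush the final partial batch
        channel.put(done)

    def stage2():
        total = 0
        while True:
            batch = channel.get()
            if batch is done:
                break
            total += sum(batch)
        result.append(total)

    threads = [threading.Thread(target=stage1), threading.Thread(target=stage2)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return result[0]
```

With `batch_size=1` the pipeline degenerates to per-item communication, which corresponds to the plain DSWP style of passing a single dependent item at a time.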
In this paper, we propose a novel mobile distributed file system design, which provides highly available and reliable storage for files and guarantees that file operations are executed in spite of concurrency and failures. The design is intended for mobile client devices (e.g., PDAs and cell phones) that have limited storage space and cannot store all the data they need, yet require access to this data at all times. We adopt server-side caching in order to guarantee sufficient caching space to all mobile clients and to ensure the availability of files in case of client failures. We present our algorithm, describe its implementation, and report on its performance evaluation using a cluster of workstations. Our results clearly indicate that our algorithm exhibits a significant degree of automation and yields a conflict-free mobile file system.
We consider the sharing of processors in parallel computer systems. In general, pursuing the main goal of such systems, better performance, decreases processor utilization. We model and simulate a multi-tasking parallel operating system, and estimate its performance degradation and overall system throughput in the case where threads share a processor. We formalize the system using DEVS (Discrete Event System Specification) and implement it in the DEVS simulation environment. Simulation results show that processor sharing between threads is acceptable in terms of TPP (throughput per processor).
ISBN: 0769514413 (print)
Presents a new method of performing division in hardware and explores different ways of implementing it. This method involves computing a preliminary estimate of the quotient by splitting the dividend, performing division of each of the parts in parallel, and merging the results. The estimate is refined iteratively to get the final quotient. This method is fast because it carries out parallel operations to compute the preliminary quotient and makes use of a fast multiplier to refine the result. It is possible to pipeline the execution of the unit, yielding a further increase in throughput. Speed estimates show that this method yields a much higher throughput than other fast methods, while area and latency are comparable.
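The split, divide, merge, and refine steps can be illustrated in software. The following is a minimal sketch under stated assumptions, not the hardware design from the paper: the function name `split_divide`, the `part_bits` parameter, the choice to keep each part's positional weight, and the simple subtract-and-increment refinement loop are all hypothetical. Because a sum of floors never exceeds the floor of the sum, the preliminary estimate never overshoots the true quotient, so refinement only has to move upward.

```python
def split_divide(dividend, divisor, part_bits=8):
    """Return (quotient, remainder) for dividend >= 0 and divisor > 0."""
    if dividend < 0 or divisor <= 0:
        raise ValueError("expects dividend >= 0 and divisor > 0")

    # Split the dividend into fixed-width parts, keeping each part's weight.
    mask = (1 << part_bits) - 1
    parts = []
    shift = 0
    rest = dividend
    while rest:
        parts.append((rest & mask) << shift)
        rest >>= part_bits
        shift += part_bits

    # The per-part divisions are independent of one another, so hardware
    # could perform them in parallel; merging is a plain sum.
    estimate = sum(part // divisor for part in parts)

    # Refinement: one multiplication yields the remainder left by the
    # estimate, which is then corrected upward until it is below the divisor.
    remainder = dividend - estimate * divisor
    while remainder >= divisor:
        estimate += 1
        remainder -= divisor
    return estimate, remainder
```

In this sketch the estimate falls short of the true quotient by less than the number of parts, so the correction loop runs only a few times.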
We introduce a new kernel language for modeling hardware/software systems, adopting multiple heterogeneous models of computation. The language has formal operational semantics and is well suited for model checking, code synthesis, etc. Different scheduling policies can be applied to different blocks of code, to reflect the different interpretations of, for example, parallelism in different models of computation. Users can add their own scheduling policies, to use or explore different models of computation.
The author models the internal structure of memory by a tree, where nodes represent memory modules (like cache, disks), and edges represent buses between them. The modules have smaller access time, capacity, and block size the nearer they are to the root. All buses may transmit blocks of data in parallel. The author gives a deterministic sorting algorithm based on greed-sort. Its running time is shown to be optimal up to a constant factor. The bound implies the number of parallel modules necessary at each hierarchy level to overcome the I/O bottlenecks of sorting. The proposed algorithm also applies to the less general models UMH (uniform memory hierarchies) and P-UMH.
An important problem in distributed systems is the observation of global properties of distributed computations. What makes this problem difficult is that events in the computation can be concurrent, i.e., the relation between events forms a partial order, not a total order. One of the fundamental parameters of a partial order is its width, which corresponds to the maximum number of mutually incomparable elements. For example, a process-time diagram that shows the partial order decomposed into the minimum number of chains can be very useful in monitoring or debugging such computations. In this paper, we present an incremental algorithm to compute the optimal chain partition. We compare our algorithm with existing chain reduction algorithms. From a practical point of view, performance evaluation shows that our approach achieves up to 90% run-time improvement over previously known algorithms.
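To make the notion of an optimal chain partition concrete, the sketch below computes one for a small finite partial order. It is not the incremental algorithm of the paper: it uses the standard offline reduction to bipartite matching (by Dilworth's theorem the number of chains found equals the width), and the names `min_chain_partition`, `less`, and `vc_less`, as well as the vector-clock example, are hypothetical choices for illustration. The relation passed as `less` is assumed to be a strict partial order that is already transitively closed.

```python
def min_chain_partition(elements, less):
    """Partition a finite poset into the fewest chains (offline, not incremental)."""
    n = len(elements)
    # succ[i] lists every j with elements[i] < elements[j].
    succ = [[j for j in range(n) if less(elements[i], elements[j])]
            for i in range(n)]
    match_next = [None] * n  # match_next[i] = j: j directly follows i in a chain
    match_prev = [None] * n  # inverse of match_next

    def augment(i, seen):
        # Kuhn's augmenting-path step for maximum bipartite matching.
        for j in succ[i]:
            if j in seen:
                continue
            seen.add(j)
            if match_prev[j] is None or augment(match_prev[j], seen):
                match_next[i] = j
                match_prev[j] = i
                return True
        return False

    for i in range(n):
        augment(i, set())

    # Every element without a predecessor starts a chain.
    chains = []
    for i in range(n):
        if match_prev[i] is None:
            chain = []
            k = i
            while k is not None:
                chain.append(elements[k])
                k = match_next[k]
            chains.append(chain)
    return chains


def vc_less(a, b):
    """Happened-before on vector clocks: componentwise <= and not equal."""
    return a != b and all(x <= y for x, y in zip(a, b))


# Four events of a two-process computation, identified by vector clocks.
events = [(1, 0), (2, 0), (0, 1), (2, 2)]
chains = min_chain_partition(events, vc_less)
```

For these four events, `(2, 0)` and `(0, 1)` are concurrent while no three events are mutually concurrent, so the width is 2 and the partition contains two chains.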
We describe a package using parallel and distributed processing elements. This package, called MICINE, is based on the ART1 artificial neural network proposed by Carpenter and Grossberg of MIT, USA. It is quite tedious to ascertain the exact disease from only a few symptoms. Here we have tried to design a package that will be more efficient than a well-experienced doctor. The system takes a set of symptoms from the patient and tries to find the best-matching disease for the symptom set. It then gives the proper prescription of medicines. We have implemented the package using an alternative computational paradigm that is inherently parallel and distributed. When converted to a fully automatic system, it can give treatments to patients without the help of human beings, and thus can reduce human errors.