We present a new language called ParCeL-1, dedicated to connectionist and explicitly parallel AI programming. ParCeL-1 is an agent-based language, similar to actor languages. Its agents are autonomous and follow a computational model in which communications are non-blocking and the communication scheme is explicit. ParCeL-1 has a parallel implementation and runs on several multiprocessor architectures. We give an example of connectionist programming (the Kohonen map) and show several performance results on a transputer-based multiprocessor architecture and on the Cray T3D parallel computer.
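For readers unfamiliar with the Kohonen map named in this abstract, the following is a minimal sequential sketch of the standard self-organizing map update, written in Python rather than ParCeL-1. It does not reproduce the paper's agent-based parallel program; the function names, the Gaussian neighborhood, and the linear decay schedule are illustrative assumptions.

```python
import math
import random


def best_matching_unit(weights, x):
    """Return (row, col) of the unit whose weight vector is closest to x."""
    best, best_d2 = (0, 0), float("inf")
    for r, row in enumerate(weights):
        for c, w in enumerate(row):
            d2 = sum((wk - xk) ** 2 for wk, xk in zip(w, x))
            if d2 < best_d2:
                best, best_d2 = (r, c), d2
    return best


def som_update(weights, x, lr, sigma):
    """Apply one Kohonen update in place and return the winning unit."""
    br, bc = best_matching_unit(weights, x)
    for r, row in enumerate(weights):
        for c, w in enumerate(row):
            # Gaussian neighborhood on the grid (an assumed, common choice).
            grid_d2 = (r - br) ** 2 + (c - bc) ** 2
            h = math.exp(-grid_d2 / (2.0 * sigma * sigma))
            for k in range(len(w)):
                w[k] += lr * h * (x[k] - w[k])
    return br, bc


def train_som(samples, rows, cols, epochs=20, lr0=0.5, sigma0=1.0, seed=0):
    """Train a rows x cols map; the schedule below is hypothetical."""
    rng = random.Random(seed)
    dim = len(samples[0])
    weights = [[[rng.random() for _ in range(dim)] for _ in range(cols)]
               for _ in range(rows)]
    for epoch in range(epochs):
        decay = 1.0 - epoch / epochs  # linear decay, stays in (0, 1]
        for x in samples:
            som_update(weights, x, lr0 * decay, sigma0 * decay)
    return weights
```

In an agent-based setting each map unit would plausibly be one agent that receives the input and the winner's position as messages, but that decomposition is an assumption and is not taken from the abstract.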
We describe and evaluate a novel distributed shared memory (DSM) architecture for JUMP-1, a general-purpose MPP system. To improve performance, the JUMP-1 DSM architecture relies on cooperation among the network, the construction of memory directories, and a memory protocol that unifies memory consistency, communication, and synchronization. Among the features of the JUMP-1 DSM, we detail the Reduced Hierarchical Bit-map Directory schemes (RHBDs), which exploit the hierarchy embedded in the interconnection network to reduce network traffic on shared-memory operations. Three variations of the RHBD are implemented on a network called the RDT (Recursive Diagonal Torus), which consists of a hierarchical structure of two-dimensional tori. In RHBDs, the bit-map directory is reduced so that multicasts can proceed quickly without accessing the directory at each level of the hierarchy. Most of the unnecessary packets caused by reducing the bit map are removed by a pruning cache provided in the router. Simulation results demonstrate that the latency of cache-coherence messages is much improved compared with traditional directory schemes.
ISBN: (print) 0780335295
Mapping parallel programs onto multiprocessor computers is one of the most significant research topics in parallel processing. A program can be partitioned into communicating tasks. The allocation of these communicating tasks to processors is called process-to-processor mapping (mapping). Communicating tasks should be allocated so as to balance the computation load and minimize the communication load. Link contention, which arises when multiple communicating processes share the same communication link, causes congestion that increases the communication cost, and should therefore be minimized. The paper proposes a new mapping algorithm for hypercube computers that aims to keep the link contention degree at every processor below a preset level. For mapping an arbitrary task graph onto a hypercube computer, the algorithm has a worst-case time complexity of $O(n^4 + p^6)$. Our algorithm has been implemented in C++ and its performance is evaluated through extensive testing.
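To make the notion of link contention concrete, the sketch below counts how many task-to-task communications cross each hypercube link under a given mapping. It is not the paper's mapping algorithm: dimension-order (e-cube) routing, the per-link path count used as a contention measure, and all function names are assumptions made for illustration.

```python
from collections import Counter


def ecube_route(src, dst, dim):
    """Links on the dimension-order path from node src to node dst.

    Nodes are integers in [0, 2**dim); neighbors differ in one bit.
    """
    links, cur = [], src
    for bit in range(dim):
        mask = 1 << bit
        if (cur ^ dst) & mask:          # this address bit still differs
            nxt = cur ^ mask
            links.append((min(cur, nxt), max(cur, nxt)))  # undirected link
            cur = nxt
    return links


def link_loads(task_edges, mapping, dim):
    """Count, per link, the task-graph edges whose route uses that link."""
    loads = Counter()
    for a, b in task_edges:
        for link in ecube_route(mapping[a], mapping[b], dim):
            loads[link] += 1
    return loads


def max_contention(task_edges, mapping, dim):
    """Largest number of communications sharing any single link."""
    loads = link_loads(task_edges, mapping, dim)
    return max(loads.values(), default=0)


# Example: four tasks on a 2-dimensional hypercube (4 processors).
edges = [("A", "B"), ("C", "D")]
placement = {"A": 0, "B": 3, "C": 1, "D": 3}
print(max_contention(edges, placement, dim=2))
```

In this example the route for A to B passes through node 1 and then shares the link between nodes 1 and 3 with the route for C to D, so that link carries two communications. A mapping heuristic could use such a measure to reject placements that exceed a preset bound.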
Many parallel algorithm design models have been proposed for abstracting a large class of parallel architectures. However, all of these models can make inaccurate asymptotic performance predictions that are too optimistic or too pessimistic depending on the circumstances. We propose a new, simpler parallel model called $A^3$ (Approximate Model for Analysis of Aggregate Communication Operations) that provides asymptotically accurate time estimates for a wide class of parallel programs based on aggregate communication operations. Accuracy is attained (1) by making the model sensitive to the structure of aggregate data communication operations and (2) by classifying these aggregate communication operations into those that are sensitive to cross-section bandwidth and those that are not. We note that algorithms expressed exclusively using aggregate communication operations that are insensitive to cross-section bandwidth have the same time complexity across a wide range of architectures. Other algorithms (using aggregate communication operations sensitive to cross-section bandwidth) may have different time complexities, but their implementations may still be portable and possibly optimal across a wide range of architectures as long as they use a library of aggregate communication operations customized to each architecture. We note that the simpler, asymptotically accurate algorithm analysis facilitated by $A^3$ can make algorithm design much faster and simpler.
We propose a novel concept that integrates compression and sensing in order to enhance the performance of the image sensor. By integrating the compression function on the sensor plane, the amount of image signal that has to be read out from the sensor is significantly reduced. The integration can consequently increase the pixel rate of the sensor. The compression scheme we use is conditional replenishment, which detects and encodes moving areas. In this paper, we discuss the design and implementation of two architectures for on-sensor compression: a pixel-parallel approach and a column-parallel approach. We describe and compare both approaches.
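The idea behind conditional replenishment can be sketched in software as follows: only pixels that differ sufficiently from a stored reference are read out, and the reference is refreshed at those positions. This is a per-pixel illustration under assumed names and an assumed absolute-difference threshold; it does not model the pixel-parallel or column-parallel circuits described in the paper.

```python
def conditional_replenishment(frame, reference, threshold):
    """Encode one frame against a reference image.

    Returns (updates, new_reference), where updates is a list of
    (row, col, value) tuples for pixels whose absolute difference
    from the reference exceeds the threshold.
    """
    updates = []
    new_reference = [row[:] for row in reference]
    for r, (frame_row, ref_row) in enumerate(zip(frame, reference)):
        for c, (pixel, ref) in enumerate(zip(frame_row, ref_row)):
            if abs(pixel - ref) > threshold:
                updates.append((r, c, pixel))  # only moving pixels are sent
                new_reference[r][c] = pixel    # replenish the reference
    return updates, new_reference


def decode(previous, updates):
    """Rebuild the receiver's image from its previous image and the updates."""
    image = [row[:] for row in previous]
    for r, c, value in updates:
        image[r][c] = value
    return image


reference = [[10, 10], [10, 10]]
frame = [[10, 12], [40, 10]]
updates, reference = conditional_replenishment(frame, reference, threshold=5)
print(updates)
```

Here only the pixel at row 1, column 0 changes by more than the threshold, so a single update is transmitted instead of four pixel values; the small change at row 0, column 1 is treated as static.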
ISBN: (print) 0818672552
Scalability has been used extensively as a de facto performance criterion for evaluating parallel algorithms and architectures. In this paper, the relation between scalability and execution time is carefully studied. Results show that isospeed scalability characterizes the variation of execution time well. Three algorithms from scientific computing are implemented on an Intel Paragon and an IBM SP2 parallel computer. Experimental and theoretical results show that scalability is an important, distinct metric for parallel and distributed systems, and may be as important as execution time in a scalable parallel and distributed environment.
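As a small worked example, the sketch below computes isospeed scalability using the commonly cited definition psi(p, p') = (p' * W) / (p * W'), where W' is the work needed on p' processors to keep the average speed per processor equal to that achieved with work W on p processors. The abstract does not state this formula, so the definition, the function names, and the numbers are assumptions for illustration only.

```python
def average_speed(work, time, processors):
    """Average speed per processor: work completed per unit time per processor."""
    return work / (time * processors)


def isospeed_scalability(p, work, p_scaled, work_scaled):
    """psi(p, p') = (p' * W) / (p * W').

    A value of 1.0 means the work only had to grow in proportion to the
    number of processors; smaller values mean more work was needed to
    hold the average speed constant.
    """
    return (p_scaled * work) / (p * work_scaled)


# Hypothetical measurements: both runs reach the same average speed.
speed_small = average_speed(work=1000, time=10.0, processors=4)
speed_large = average_speed(work=2500, time=12.5, processors=8)
print(speed_small, speed_large)
print(isospeed_scalability(4, 1000, 8, 2500))
```

With these made-up numbers the average speed is the same in both runs, yet the execution time grows from 10.0 to 12.5 time units and the scalability is 0.8, which illustrates the link between scalability and execution time that the abstract refers to.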
ISBN: (print) 9780818675027
This paper presents a new heuristic, concurrent, iterative loop-based scheduling and allocation algorithm for high-level synthesis of digital signal processing (DSP) architectures using heterogeneous functional units. In a heterogeneous architecture, functional units can be bit-serial, digit-serial, or bit-parallel. We assume that a library of functional units in these heterogeneous implementation styles is available. Experiments show that this new heuristic synthesis approach generates optimal and near-optimal area solutions. Although optimum synthesis of such architectures was proposed recently using an integer linear programming (ILP) model, our method can produce similar solutions in one to two orders of magnitude less time, at the expense of cost optimality. We compare the solutions generated by the proposed algorithm with the optimal solutions generated by the ILP approach and by other recent techniques. We have incorporated this new algorithm into the Minnesota ARchitecture Synthesis (MARS-II) system.
The authors investigate the efficient implementation of algorithms with two-level parallelism on distributed memory machines. They consider parallel specifications consisting of an upper level of multiprocessor tasks (M-tasks), each of which has an internal structure of uni-processor tasks. To achieve an optimal parallel execution time, the parallel execution of such a program requires an optimal scheduling of the multiprocessor tasks and an appropriate treatment of the uni-processor tasks. In particular, they consider an important class of parallel programs that are generated within a specific parallel programming model for designing group-SPMD programs for scientific computing. They show how the costs of data redistributions between M-tasks can be taken into consideration and how the special structure of the resulting program can be exploited by a simple approximation algorithm with provably good performance.
One of the fundamental goals of parallel computing is to develop a framework that will support portable and efficient application programs. The Bulk-Synchronous Parallel (BSP) model was proposed to help achieve this goal. The BSP model is intended to be a "unifying model": it addresses both software and hardware issues by allowing theoretical analysis to coexist with practical physical implementations. For several years the BSP model has been supported mainly by theoretical results. Recent experiments, however, have begun to demonstrate the practicality of the model for real architectures running real applications. The goal of this paper is to describe the methodology used to construct an efficient BSP library on the BBN Butterfly GP1000. Our results are relevant for BSP library implementations on shared-memory systems in general and for NUMA (nonuniform memory access) machines in particular.