Data-parallel ML is proposed for compilation to a distributed version (DPCAM) of Cousineau, Curien and Mauny’s Categorical Abstract Machine. The DPCAM is a static network of CAMs which dynamically restrict the MIMD e...
In multiprocessor systems, overheads caused by inter-processor communication and synchronization have been one of the largest obstacles for efficient execution of parallel programs. To reduce these overheads in shared...
A key requirement for the effective use of multiprocessor systems in real-world applications is an ability to accurately predict the performance of a specific algorithm on a specific architecture. Such performance prediction tools assist the system designer in initially selecting, and then modifying, both the algorithm and the architecture to obtain acceptable performance. In this paper, we present a modeling approach that permits separate evaluation of algorithm and architecture performance with only a small number of "cross" parameters required to link the two models. An example application of this technique to a Gaussian elimination algorithm on two dissimilar multiprocessor architectures shows good agreement with actual performance figures obtained from measurement and simulation.
The introduction of specialized hardware platforms for connectionist modeling ("connectionist supercomputers") has created a number of research topics. Some of these issues are controversial, e.g. the efficient implementation of incremental learning techniques, the need for the dynamic reconfiguration of networks and possible programming environments for these machines.
This paper describes the MM32k, a massively parallel SIMD computer which is easy to program, high in performance, low in cost and effective for implementing highly parallel neural network architectures. The MM32k has 32768 bit-serial processing elements, each of which has 512 bits of memory, and all of which are interconnected by a switching network. The entire system resides on a single PC-AT compatible card. It is programmed from the host computer using a C++ language class library which abstracts the parallel processor in terms of fast arithmetic operators for vectors of variable precision integers.
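As a rough illustration of why variable precision matters on bit-serial hardware, the sketch below adds two integer vectors one bit position at a time, the way a lockstep array of one-bit processing elements would. It is written in Python for brevity and is not the MM32k class library; the function name `bit_serial_add` and its parameters are hypothetical.

```python
def bit_serial_add(a, b, precision):
    """Elementwise unsigned addition, processed one bit position per step.

    Each list index stands for one processing element (PE). The result
    wraps modulo 2**precision, as a fixed-width adder would.
    """
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    carry = [0] * len(a)
    result = [0] * len(a)
    for bit in range(precision):          # one step per bit of precision
        for pe in range(len(a)):          # conceptually simultaneous on SIMD hardware
            x = (a[pe] >> bit) & 1
            y = (b[pe] >> bit) & 1
            s = x ^ y ^ carry[pe]         # full-adder sum bit
            carry[pe] = (x & y) | (carry[pe] & (x ^ y))  # full-adder carry out
            result[pe] |= s << bit
    return result


print(bit_serial_add([3, 5, 255], [4, 6, 1], precision=8))
```

The outer loop runs once per bit, so in this model the step count scales with the chosen word width rather than with the number of elements.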
In this paper, we present a new scalar architecture for high-speed vector processing. Without using cache memory, the proposed architecture tolerates main memory access latency by introducing slide-windowed floating-p...
A number of existing multiprocessors are based on the hypercube interconnection network. The popularity of the hypercube is due to its small communication diameter, which grows logarithmically with the cube size, its fault-tolerant properties, and its modularity, which makes it possible to build a larger cube from smaller subcubes. The star graph has been studied as a network topology for fault-tolerant parallel computing. Unfortunately, the size of the network grows too sharply with n to be affordable for values of n larger than 7 or 8. We introduce a novel interconnection network known as the incomplete star graph, which overcomes the above problem while retaining most of the advantages of the star graph. We present the architecture of the incomplete star graph and compare its performance with the full star as well as competing architectures such as the incomplete hypercube and arrangement graphs. We provide routing algorithms for both non-faulty and faulty incomplete star graphs, and study their performance.
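The growth problem mentioned in the abstract can be made concrete with the standard properties of the n-star graph: n! nodes, degree n - 1, and diameter floor(3(n - 1)/2). The short Python sketch below tabulates these values; the helper names are illustrative and do not come from the paper.

```python
from math import factorial


def star_graph_nodes(n: int) -> int:
    """Number of nodes in the n-star graph (one per permutation of n symbols)."""
    return factorial(n)


def star_graph_degree(n: int) -> int:
    """Each node has n - 1 neighbours."""
    return n - 1


def star_graph_diameter(n: int) -> int:
    """Diameter of the n-star graph: floor(3 * (n - 1) / 2)."""
    return (3 * (n - 1)) // 2


for n in range(3, 10):
    print(n, star_graph_nodes(n), star_graph_degree(n), star_graph_diameter(n))
```

Going from n = 8 to n = 9 raises the node count from 40320 to 362880, with no intermediate size available in the full star graph.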
High performance distributed computing systems require high performance communication systems. Distributed modeling and implementation of these communication systems is important. Toward this goal, the authors refine the process-to-channel_agent-to-process (PCP) model of asynchronous distributed communication. While the PCP model provides a versatile and succinct mechanism for specifying and comparing different types of channels, it is inherently centralized. The refined model presented here, the process-to-channel_agent-to-channel_agent-to-process (PCCP) communication model, is amenable to distributed modeling and implementation of channels. The usefulness of the PCCP model is demonstrated by presenting a distributed implementation of hierarchical F-channels.
We present a neural network simulation which we implemented on the massively parallel Connection Machine 2. In contrast to previous work, this simulator is based on biologically realistic neurons with nontrivial single-cell dynamics, high connectivity with a structure modelled in agreement with biological data, and preservation of the temporal dynamics of spike interactions. We simulate neural networks of 16,384 neurons coupled by about 1000 synapses per neuron, and estimate the performance for much larger systems. Communication between neurons is identified as the computationally most demanding task and we present a novel method to overcome this bottleneck. The simulator has already been used to study the primary visual system of the cat.
Recent physiological research has shown that synchronization of oscillatory responses in striate cortex may code for relationships between visual features of objects. A VLSI circuit has been designed to provide rapid phase-locking synchronization of multiple oscillators to allow for further exploration of this neural mechanism. By exploiting the intrinsic random transistor mismatch of devices operated in subthreshold, large groups of phase-locked oscillators can be readily partitioned into smaller phase-locked groups. A multiple target tracker for binary images is described utilizing this phase-locking architecture. A VLSI chip has been fabricated and tested to verify the architecture. The chip employs Pulse Amplitude Modulation (PAM) to encode the output at the periphery of the system.