This contribution describes a new class of arithmetic architectures for Galois fields GF(2^k). The main applications of the architecture are public-key systems which are based on the discrete logarithm problem for elli...
Object dataflow is a popular approach used in parallel rendering. The data representing the 3D scene is statically distributed among processors, and objects are fetched and cached only on demand. Most previous object dataflow methods were implemented on shared-memory architectures and exploited spatial coherency to reduce hardware cache misses. In this paper, we propose an efficient model for object dataflow parallel volume rendering on message-passing machines. The algorithm is introduced, and its ray storage mechanism is used to support latency hiding by postponing computation on inactive rays. Memory usage is optimized by letting objects migrate and replicate at different processors rather than relying on the common static assignments. Our cache-only-memory approach uses a distributed-directory scheme to trace the location of objects at other nodes. We also implemented a mechanism to minimize network congestion, which optimizes channel utilization. Unlike previous methods, our approach can benefit from temporal coherence and effectively minimizes communication costs during animation on limited-bandwidth multiprocessing environments. We report results from implementations of the algorithm on several platforms, including the Cray T3D, the Convex SPP, and a DEC Alpha cluster of workstations (COW), which achieved higher efficiency and scalability than existing algorithms.
ISBN: (print) 9780897919012
DataScalar architectures improve memory system performance by running computation redundantly across multiple processors, each of which is tightly coupled with an associated memory. The program data set (and/or text) is distributed across these memories. In this execution model, each processor broadcasts operands it loads from its local memory to all other units. In this paper, we describe the benefits, costs, and problems associated with the DataScalar model. We also present simulation results of one possible implementation of a DataScalar system. In our simulated implementation, six unmodified SPEC95 binaries ran from 7% slower to 50% faster on two nodes, and from 9% to 100% faster on four nodes, than on a system with a comparable, more traditional memory system. Our intuition and results show that DataScalar architectures work best with codes for which traditional parallelization techniques fail. We conclude with a discussion of how DataScalar systems may accommodate traditional parallel processing, thus improving performance over a much wider range of applications than is currently possible with either model.
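The execution model in the abstract above (every node runs the same program, and the owner of an operand broadcasts it) can be illustrated with a toy simulation. This is only a sketch under simplifying assumptions: the nodes run in lock step, the "program" is a plain sum over a sequence of loads, and the names `run_datascalar`, `memory_parts`, and `load_sequence` are hypothetical and do not come from the paper.

```python
def run_datascalar(memory_parts, load_sequence):
    """Toy lock-step model of redundant execution with operand broadcast.

    memory_parts: one dict {address: value} per node (the distributed data set).
    load_sequence: addresses loaded by the (identical) program on every node.
    Returns one (accumulated_sum, broadcasts_sent) pair per node.
    """
    nodes = [
        {"local": dict(part), "received": {}, "broadcasts": 0, "acc": 0}
        for part in memory_parts
    ]
    for addr in load_sequence:
        # The node that owns the address loads it locally and broadcasts it.
        owner = next(n for n in nodes if addr in n["local"])
        value = owner["local"][addr]
        owner["broadcasts"] += 1
        for n in nodes:
            if n is not owner:
                n["received"][addr] = value  # no request message is ever sent
        # Every node then performs the same computation on the operand.
        for n in nodes:
            operand = n["local"][addr] if n is owner else n["received"][addr]
            n["acc"] += operand
    return [(n["acc"], n["broadcasts"]) for n in nodes]


if __name__ == "__main__":
    parts = [{0: 0, 1: 10, 2: 20, 3: 30}, {4: 40, 5: 50, 6: 60, 7: 70}]
    print(run_datascalar(parts, [0, 5, 1, 6, 7]))
```

In this toy model all nodes end with the same result, and the only traffic is one-way broadcasts from owners. The sketch does not model the asynchronous behaviour or the costs that the paper discusses.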
This paper is concerned with a new parallel thinning approach for three-dimensional (3D) digital images that preserves the topology and maintains their shape. We introduce a new approach of selecting shape points and ...
We present the concept of cooperative vision and its application to a multi-agent system with special attention to the integration of vision. Cooperative vision can be described as a type of distributed vision, where ...
This paper presents a complete methodology for the automatic synthesis of VLSI architectures used in digital signal processing. Most signal processing algorithms have the form of an n-dimensional nested loop with unit uniform loop-carried dependencies. We model such algorithms with generalized UET grids. We calculate the optimal makespan for the generalized UET grids and then establish the minimum number of systolic cells required for achieving the optimal makespan. We present a complete methodology for the hardware synthesis of the resulting architecture, based on VHDL. This methodology automatically detects all necessary computation and communication elements and produces optimal layouts. The complexity of our proposed scheduling policy is completely independent of the size of the nested loop and depends only on its dimension, thus being the most efficient (in terms of complexity) known to us. All these methods were implemented and incorporated in an integrated software package which provides the designer with a powerful parallel design environment, from high-level signal processing algorithmic specifications to low-level (i.e., actual layouts) optimal implementation. The evaluation was performed using well-known algorithms from signal processing.
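The notion of an optimal makespan for a unit-time grid can be made concrete with a small sketch. It assumes the simplest case, which is not necessarily the paper's generalized model: every grid point takes one time step and depends only on its predecessor along each axis. Under that assumption the earliest start time of a point is the sum of its coordinates, and the makespan is the length of the longest dependence chain plus one. The function names are hypothetical.

```python
from itertools import product


def earliest_start_times(sizes):
    """Earliest start step of each point in an n-dimensional unit-time grid.

    Assumption: point j depends on j - e_i for every axis i where j[i] > 0.
    """
    start = {}
    # product() yields points in lexicographic order, so every predecessor
    # of a point has already been scheduled when the point is visited.
    for point in product(*(range(n) for n in sizes)):
        preds = [
            point[:i] + (point[i] - 1,) + point[i + 1:]
            for i in range(len(sizes))
            if point[i] > 0
        ]
        start[point] = max((start[p] + 1 for p in preds), default=0)
    return start


def optimal_makespan(sizes):
    """Closed form for the same grid: critical path length in time steps."""
    return sum(n - 1 for n in sizes) + 1


if __name__ == "__main__":
    start = earliest_start_times((3, 4))
    print(max(start.values()) + 1, optimal_makespan((3, 4)))
```

Note that the closed form depends only on the loop bounds, not on enumerating the iteration space; the enumeration is there purely to check it on small grids.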
Author: Lenke, M., LRR-TUM
Lehrstuhl für Rechnertechnik und Rechnerorganisation Institut für Informatik Technische Universität München 80290 München Germany
Typical applications of the so-called Grand Challenges need massively parallel computer system architectures. Tools like parallel debuggers, performance analysers and visualizers help the code designer to develop efficient parallel algorithms. Such tools merely support the development cycle. But technical and scientific engineers who make use of parallel high-performance computing applications, e.g. numerical simulation algorithms in computational fluid dynamics (CFD), must be supported in their engineering work by another kind of tool. A tool for the application cycle is required because old, conventional suggestions regarding the arrangement of the application cycle rely on strictly sequential procedures. They are due to the heritage of traditional work on former vector computers. That formative influence is still felt in today's arrangements for the application cycle; it prevents more efficient engineering work and must therefore be overcome. New tool concepts have to be introduced to enable online interaction between technical and scientific engineers and their running parallel simulation. VIPER stands for VIsualization of Parallel numerical simulation algorithms for Extended Research. It offers the physical parameters of the mathematical model and the parameters of the numerical method as objects of a graphical user tool interface for online observation and online modification. A special client-server-client process architecture enables technical and scientific engineers sitting at their graphics workstations to interact with their parallel simulation algorithms running on a remote parallel computer system. The VIPER prototype is applied to ParNsflex, a parallel Navier-Stokes solver for real-world aerodynamic problems. A Paragon XP/S was selected as the test parallel computer system. A first evaluation indicates the superiority of the VIPER concept over conventional procedures. Copyright (C) 1996 Published by Elsevier Science L
We study the scalability of 2-D discrete wavelet transform algorithms on fine-grained parallel architectures. The principal operation in the 2-D DWT is the filtering operation used to implement the filter banks of the...
This paper presents VLSI/WSI designs for a recently introduced parallel architecture known as the folded cube-connected cycles (FCCC). We first discuss two layouts for the FCCC in which there is no component redundancy. Then we incorporate redundancy and present locally and globally reconfigurable FCCCs. We also discuss the design of universal building blocks for the construction of fault-tolerant FCCCs of various dimensions.
A DC plasma oxidation system with a hollow cathode consisting of a pair of parallel Si plates was developed. Using this system, thin Si oxide films less than 40 nm thick were grown on n-type Si(100) substrates for application to tunnel devices. The film quality and the oxide stoichiometry were estimated by XPS measurements. MIS (Metal-Insulator-Semiconductor) diode-type tunnel emitters were fabricated on the oxide films. The electrical properties of the diodes, such as I-V characteristics and electron emission into the vacuum, were measured. For a typical sample, an electron emission current density of 800 pA/mm² into the vacuum was obtained.