The architectural and implementational features that work together in the APx accelerator to achieve high sustained system performance for a significant set of compute-intensive functions are presented. The features include VLSI integration, memory bandwidth, concurrency of operations, interprocessor communications, processor selection mechanisms, and I/O bandwidth. The APx is an expandable system that provides from 64 to 256 16-bit processors, yielding peak instruction rates from 800 to 3200 MIPS. The individual processors in the APx accelerator are powerful and versatile 16-bit RISC processors. In addition, pairs of 16-bit processors can be configured to operate in 32-bit mode under software control. IEEE-format single-precision floating-point operations are supported in 32-bit mode, with peak ratings from 40 to 160 MFLOPS.
ISBN:
(Print) 9780897912181
This abstract introduces a research project at Iowa State University whose goal is to produce a high-speed computing system by combining a multiprocessor configuration with a software paradigm based on the functional programming language SASL. The underlying architecture will evaluate combinatory code produced by the SASL compiler by graph reduction, in a manner similar to that of other proposed SKIM combinator machines. Fine-grain parallelism is supported by concurrently evaluating the body and parameters of functions, by "unrolling" recursive algorithms when possible, and by applying high-level user-defined data-shaping operations, such as "map." A simple dynamic load-balancing scheme is used to distribute the work of reducing the graph evenly over all processors. The current design, under construction, is a tightly coupled 16-processor system. Each processor has a private program memory that contains a complete copy of the program graph. The rules that specify and control the parallel reduction of the graph guarantee that the correct value will eventually be returned to the processor assigned to the root of the graph, and that inconsistencies among the copies at various processors due to local reductions are of no concern. The data memory is global, but it is physically organized so that a segment of the memory is located next to each processor. We expect that once the data is distributed through the graph during reduction, there will be some degree of locality of reference that will ease bus traffic. The bus bandwidth limitations that typically degrade performance in a fine-grain architecture are mitigated somewhat by employing a locally developed high-speed (up to 100 Mb/s per channel) "party line" serial bus for data references and interprocessor messages. The bus design allows for parallel data transfers and can exchange processor load information in broadcast mode in one bus cycle. We have chosen to base the design of the architecture on the functional language SASL.
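The abstract does not spell out the reduction rules. As background, here is a minimal sketch of combinator graph reduction in the S/K/I style that SKIM-like machines evaluate; the tuple representation and the `unwind`/`rebuild` helpers are illustrative, not taken from the paper.

```python
# Minimal sketch of normal-order S/K/I combinator reduction.
# An application node is a tuple (function, argument); combinators
# are plain strings. Illustrative only, not the paper's machine.

def unwind(n):
    """Walk down the leftmost spine, collecting arguments."""
    args = []
    while isinstance(n, tuple):
        args.append(n[1])
        n = n[0]
    args.reverse()            # args[0] is the first argument
    return n, args

def rebuild(head, rest):
    """Re-apply leftover arguments to the reduced head."""
    for a in rest:
        head = (head, a)
    return head

def reduce_combinators(n, fuel=1000):
    """I x -> x,  K x y -> x,  S f g x -> (f x) (g x)."""
    while fuel > 0:
        fuel -= 1
        head, args = unwind(n)
        if head == 'I' and len(args) >= 1:
            n = rebuild(args[0], args[1:])
        elif head == 'K' and len(args) >= 2:
            n = rebuild(args[0], args[2:])
        elif head == 'S' and len(args) >= 3:
            f, g, x = args[0], args[1], args[2]
            n = rebuild(((f, x), (g, x)), args[3:])
        else:
            return n          # no redex at the head: done
    return n
```

For example, `S K K` behaves as the identity: `reduce_combinators(((('S', 'K'), 'K'), 'a'))` reduces to `'a'`. A real graph reducer overwrites the redex node in place so that shared subgraphs are reduced once, which is exactly why the paper can tolerate stale copies at other processors.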
ISBN:
(Print) 0818607033
FTCX is an experimental fault-tolerant computer architecture intended to serve as a general-purpose real-time computing system for fault-sensitive supervisory and control applications. FTCX uses tightly synchronous triplex computation in its core to detect and mask all first faults. Synchronization, fault detection, and fault correction are all performed in hardware. Novel to this architecture are the means by which interrupt requests and data are exchanged between the simplex local or remote industry-standard bus (VMEbus) environments and the triplexed core environment. These exchanges are software-transparent, yet fully implement all of the necessary algorithms to maintain data consistency and synchronization in the three channels of the core, even in the face of Byzantine faults.
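The triplex core's ability to mask any first fault rests on 2-of-3 voting across the three channels. FTCX performs this in hardware; the following is only a software sketch of the bit-wise majority principle, not the paper's circuit.

```python
def vote3(a: int, b: int, c: int) -> int:
    """Bit-wise 2-of-3 majority vote over three channel words.
    At every bit position, the value held by at least two channels
    wins, so a single faulty channel is masked on each bit."""
    return (a & b) | (a & c) | (b & c)

# One channel delivers a corrupted word; the vote masks it.
good = 0xDEAD
bad = good ^ 0x0010      # single-bit upset in one channel
assert vote3(good, bad, good) == good
```

The same expression is what a hardware voter computes combinationally per bit; repeating it on every exchange is what keeps the three channels' state consistent.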
ISBN:
(Print) 081860719X
Recent work in microarchitecture has identified a new model of execution, restricted data flow, in which data-flow techniques are used to coordinate out-of-order execution of sequential instruction streams. It is believed that the restricted-data-flow model has great potential for implementing high-performance computing engines. A minimal-functionality variant of the model, called HPSm, is defined. The instruction set, data path, timing, and control of HPSm are described. A simulator of HPSm has been written, and some of the Berkeley RISC benchmarks have been executed on the simulator. Measurements obtained from these benchmarks, along with measurements obtained for the Berkeley RISC II, are reported.
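As background on the execution model, here is a minimal sketch of data-flow-coordinated out-of-order issue from a sequential stream: an instruction fires in the first cycle all of its source registers are ready, regardless of program order. The cycle model and names are illustrative assumptions; this is not HPSm's actual node tables or retirement logic.

```python
# Sketch of out-of-order issue driven purely by data readiness.
# Each instruction is (name, dest_regs, src_regs); results produced
# in one cycle become available in the next.

def schedule(program, ready):
    """Return (cycle, name) pairs in issue order."""
    pending = list(program)
    issued = []
    cycle = 0
    while pending and cycle < 100:        # fuel guard against deadlock
        fired = [ins for ins in pending
                 if all(s in ready for s in ins[2])]
        for ins in fired:
            issued.append((cycle, ins[0]))
            pending.remove(ins)
        for _, dests, _ in fired:
            ready |= set(dests)           # results visible next cycle
        cycle += 1
    return issued

# Sequential stream: "use" depends on "load", but "indep" does not,
# so it issues ahead of "use" despite appearing after it.
prog = [
    ("load",  ["r1"], []),        # e.g. a load with no register sources
    ("use",   ["r2"], ["r1"]),    # waits on the load's result
    ("indep", ["r3"], ["r0"]),    # independent of the load
]
issued = schedule(prog, {"r0"})
```

Here `issued` comes out as `[(0, "load"), (0, "indep"), (1, "use")]`: the data-flow firing rule extracts the available parallelism while the stream itself remains sequential, which is the essence of the restricted-data-flow idea.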
ISBN:
(Print) 0818607033
The design issues of a fault-tolerant controller for high-voltage dc (HVDC) power transmission are presented. This high-speed digital controller performs safety, regulation, and control algorithms under highly stringent time constraints. The controller is programmed in a functional language executed cyclically. A dual multiprocessor architecture with continuous update was chosen over the classical triple-modular-redundancy (TMR) solution to meet the availability goals at a lower cost, while keeping the latency time low enough. Error detection relies only partly on hardware checking, and uses an unconventional method, called safe area control (SAC), which monitors the process state. Experience in the field confirms the correctness of the fault-tolerance concept, with no failures during more than one year of operation.
ISBN:
(Print) 0818607033
Multicomputers connected as binary hypercubes are being considered for a variety of applications in complex dedicated systems for remote sensing and data analysis in which high availability and error-free computations are expected. The issues involved in implementing fault tolerance in a hypercube system [JPL-85] are studied. Approaches for implementing concurrent fault detection in the high-performance processing nodes are examined, as are techniques for using redundancy within the processor array. An alternative interconnection structure is proposed which is better suited for long-life unmaintained applications. It is shown that a binary hypercube structure can be decomposed hierarchically and that redundancy can be applied at several levels, so that it can be utilized in an efficient fashion.
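As background on the structure the abstract exploits, here is a minimal sketch of a binary d-cube's bit-level organization: two nodes are neighbors exactly when their addresses differ in one bit, and a d-cube decomposes hierarchically into disjoint lower-dimensional subcubes. This illustrates the standard hypercube addressing only; the paper's specific redundancy scheme is not reproduced here.

```python
def neighbors(node: int, dim: int):
    """In a binary dim-cube, a node's neighbors are found by
    flipping each of its dim address bits in turn."""
    return [node ^ (1 << k) for k in range(dim)]

def subcubes(dim: int, sub_dim: int):
    """Hierarchical decomposition: a dim-cube splits into
    2**(dim - sub_dim) disjoint sub_dim-cubes, grouped by the
    high-order address bits."""
    groups = {}
    for node in range(1 << dim):
        groups.setdefault(node >> sub_dim, []).append(node)
    return list(groups.values())
```

For example, `subcubes(3, 2)` splits a 3-cube into two 2-cubes, `[0, 1, 2, 3]` and `[4, 5, 6, 7]`; replicating or sparing at the subcube level is one natural way to place redundancy at several levels of such a decomposition.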
ISBN:
(Print) 0948507152
Graphic simulation is frequently used in the analysis and verification of robot programs. Automatic trajectory generation, collision avoidance, and cycle-time calculations can now be performed by robotic simulation systems to design and test robotic workcells. Many of these simulation systems visually display the actions of a robot manipulator or workcell. We discuss a robot simulation system developed on a Silicon Graphics IRIS workstation that displays solid-model representations of workcells at near-real-time rates. This system graphically simulates a variety of robotic tasks, such as complex workcell activity, the motion of a gripper grasping a part, or the wanderings of a rover on a factory floor. The realization of this system is due in part to recent advances in computer graphics hardware. We emphasize how this new graphics hardware can be exploited to achieve rapid display rates and improve visual realism.
ISBN:
(Print) 9780818607196
Our recent work in microarchitecture has identified a new model of execution, restricted data flow, in which data-flow techniques are used to coordinate out-of-order execution of sequential instruction streams. We believe that the restricted-data-flow model has great potential for implementing very high-performance computing engines. This paper defines a minimal-functionality variant of our model, which we are calling HPSm. The instruction set, data path, timing, and control of HPSm are all described. A simulator for HPSm has been written, and some of the Berkeley RISC benchmarks have been executed on the simulator. We report the measurements obtained from these benchmarks, along with the measurements obtained for the Berkeley RISC II. The results are encouraging.
ISBN:
(Print) 9780818607196
We proposed a computer with low-level parallelism as one of the basic computer architectures and built a large-scale experimental system called QA-2. By low-level parallelism, we mean that a long-word instruction simultaneously controls many ALUs, busses, registers, and memories in a mode of fine-grained parallelism. The QA-2 employs a 256-bit instruction by which four different ALU operations, four memory accesses to different/continuous locations, and one powerful sequence control are all specified and performed in parallel. If many simultaneously executable operations are detected and embedded in one instruction at compile time, this type of computer can provide a high degree of performance for a wide variety of applications. This paper describes the architectural benefits and limitations of low-level parallelism in performing 3-D color image generation and interpreting Prolog/Lisp programs. The hardware organization with four ALUs, as actually implemented in the QA-2, is verified to be adequate: nearly three out of four ALUs can work in parallel, and an architecture with more than four ALUs cannot achieve a significant degree of further performance enhancement. This paper also shows the degree of performance improvement achieved by techniques such as ALU chaining and highly structured sequence-control mechanisms. Compared with the IBM 370 architecture, the QA-2 can generate 3-D color images in 1/5 of the dynamic instruction steps. The compiler version of the Prolog machine on the QA-2 is as fast (45K LIPS) as ICOT's PSI. From all these results, we expect that the QA-2 is a high-performance computer which will be utilized in the future personal computing environment.
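The long-instruction-word execution style described above can be sketched minimally as follows: one instruction word bundles several ALU operations, all of which read the register state as it stood before the step, so the operations act in parallel rather than in sequence. The field layout, operation set, and snapshot model are illustrative assumptions, not the QA-2's actual 256-bit format.

```python
# Sketch of one long-instruction-word step: up to four ALU fields,
# each (op, dst, src1, src2), applied to a shared register file.

OPS = {
    "add": lambda a, b: a + b,
    "sub": lambda a, b: a - b,
    "mul": lambda a, b: a * b,
    "mov": lambda a, b: a,      # second source ignored
}

def step(regs, word):
    """Execute one instruction word. All fields read a snapshot of
    the pre-step register file, so the ALUs behave as if they ran
    simultaneously (parallel-read, parallel-write semantics)."""
    assert len(word) <= 4, "the sketch models a four-ALU machine"
    reads = dict(regs)                     # pre-step snapshot
    for op, dst, s1, s2 in word:
        regs[dst] = OPS[op](reads[s1], reads[s2])
    return regs

regs = {"r0": 1, "r1": 2, "r2": 3, "r3": 4}
# One word: swap r0/r1 and compute r2 + r3, all "at once".
step(regs, [("mov", "r0", "r1", "r1"),
            ("mov", "r1", "r0", "r0"),
            ("add", "r2", "r2", "r3")])
```

Because every field reads the snapshot, the register swap works within a single word, something a sequential interpretation of the same three operations would get wrong. Packing such independent fields at compile time is exactly the detection-and-embedding step the abstract describes.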
Authors:
Abu-Sufah, Walid; Kwok, Alex Y.
Univ of Illinois, Center for Supercomputer Research & Development, Urbana, IL, USA
ISBN:
(Print) 0818606347
The development of performance-prediction tools for high-speed machine organizations has been recognized as a key problem facing the research community in parallel computing. A survey of the tools which have been developed for performance prediction of the Cedar multiprocessor supercomputer of the University of Illinois is presented. The system is deterministic, modular, and automatic. The hierarchical organization of the system gives the user the ability to choose from a set of alternatives for predicting performance at different levels of accuracy and cost, using 22 programs. The performance degradation due to conflicts in the shared memory, delay in the Cedar interconnection network, and synchronization overhead is measured. The results confirm that the architecture of Cedar is balanced; the performance of the Cedar interconnection network is very close to that of a crossbar. Synchronization overhead and shared-memory conflicts could degrade performance considerably for some programs.