As manufacturing complexity increases, and as factory yields, equipment reliability and equipment utilization each approach 100%, one must look for alternative improvement programs to help reduce manufacturing costs and assist in managing increasing factory complexity. There are a number of possibilities, such as automating routine decision-making processes involved in manufacturing, resulting in faster or more cost-effective decisions, or capturing and applying manufacturing knowledge to reduce the time necessary to detect, analyze and solve manufacturing problems. Either possibility can result in significant savings through productivity improvement or cost reduction, or in avoidance of significant losses. Such strategic programs are in addition to the current automation programs which help manage factory performance, i.e., physical automation to replace human physical activity, and information automation to replace routine human data gathering or data analysis procedures.
In this paper, we present a comparative performance evaluation of hot spot effects on MIN-based and HR-based shared-memory architectures. Analytical models are described for understanding network differences and for evaluating hot spot performance on both architectures. The analytical comparisons indicate that HR-based architectures have the potential to handle various contentions caused by hot spots more efficiently than MIN-based architectures. Although there is no analytical or experimental evidence that the tree saturation phenomenon occurs in non-blocking MIN architectures, remote accesses to both hot and cool memory modules are considerably slowed down, and overall performance is significantly degraded. Intensive performance measurements on hot spots have been conducted on the BBN TC2000 (MIN-based) and the KSR1 (HR-based) machines. Performance experiments were also conducted to examine hot spots in practice, with respect to synchronization lock and barrier algorithms. The experimental results support the analytical models, and present practical observations and an evaluation of hot spots on the two types of architectures.
The notion of trivial computation, in which the appearance of simple operands renders potentially complex operations simple, is discussed. An example of a trivial operation is integer division where the divisor is two; the division becomes a simple shift operation. The concept of redundant computation, in which some operation repeatedly performs the same function because it repeatedly sees the same operands, is also discussed. Experiments on two separate benchmark suites, the SPEC benchmarks and the Perfect Club, find a surprising amount of trivial and redundant operation. Various architectural means of exploiting this knowledge to improve computational efficiency include detection of trivial operands and the result cache. Further experimentation shows significant speedup from these techniques, as measured on three different styles of machine architecture.
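The two ideas in this abstract can be illustrated in software. The following is a minimal sketch, not the hardware mechanism evaluated in the paper: the class name `DivideUnit`, its counters, and the unbounded dictionary standing in for the result cache are all assumptions made for illustration.

```python
from typing import Dict, Tuple


class DivideUnit:
    """Toy integer divider (hypothetical) with trivial-operand detection
    and a result cache keyed on the operand pair."""

    def __init__(self) -> None:
        # Assumption: an unbounded dict stands in for a small hardware table.
        self.result_cache: Dict[Tuple[int, int], int] = {}
        self.counts = {"trivial": 0, "cache_hit": 0, "full": 0}

    def divide(self, a: int, b: int) -> int:
        # Trivial operands: the result needs no real division.
        if b == 1:
            self.counts["trivial"] += 1
            return a
        if b == 2:
            self.counts["trivial"] += 1
            return a >> 1  # division by two becomes a shift

        # Redundant operation: same operands seen before, reuse the result.
        key = (a, b)
        if key in self.result_cache:
            self.counts["cache_hit"] += 1
            return self.result_cache[key]

        # Otherwise perform the full (expensive) operation and remember it.
        self.counts["full"] += 1
        result = a // b
        self.result_cache[key] = result
        return result


unit = DivideUnit()
for a, b in [(10, 2), (7, 1), (100, 7), (100, 7)]:
    unit.divide(a, b)
print(unit.counts)
```

Only one of the four divisions in the usage lines reaches the full divider; the others are resolved by the trivial-operand check or by the cache.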
This paper proposes a parallel structure, the mesh-of-appendixed-trees (MAT), for efficient implementation of artificial neural networks (ANNs). Algorithms to implement both the recall and the training phases of the multilayer perceptron and backpropagation ANN model are provided. A recursive procedure for embedding the MAT structure into the hypercube topology is used as the basis for an efficient mapping technique to map ANN computations onto general purpose massively parallel hypercube systems. In addition, based on the mapping scheme, a fast special purpose parallel architecture for ANNs is developed. The major advantage of our technique is high performance. Unlike the other techniques presented in the literature, which require O(N) time, where N is the size of the largest layer, our implementation requires only O(log N) time. Moreover, it allows the pipelining of more than one input pattern and thus further improves the performance.
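The abstract does not describe the MAT structure in detail, so the sketch below shows only the generic idea that tree-based structures rely on: the N products of a neuron's weighted sum can be combined by pairwise additions in about log2 N parallel steps. The function name `tree_weighted_sum` and the step counter are hypothetical, and the parallel steps are simulated sequentially.

```python
from typing import List, Tuple


def tree_weighted_sum(weights: List[float], inputs: List[float]) -> Tuple[float, int]:
    """Return (weighted sum, number of parallel addition steps).

    Each loop iteration models one step in which all pairs are added
    at the same time, as the levels of an adder tree would do.
    """
    # All N products could be formed simultaneously, one per tree leaf.
    values = [w * x for w, x in zip(weights, inputs)]
    steps = 0
    while len(values) > 1:
        # One parallel step: add every adjacent pair.
        paired = [values[i] + values[i + 1] for i in range(0, len(values) - 1, 2)]
        if len(values) % 2:
            paired.append(values[-1])  # odd element passes up unchanged
        values = paired
        steps += 1
    return (values[0] if values else 0.0), steps


total, steps = tree_weighted_sum([1, 2, 3, 4, 5, 6, 7, 8], [1] * 8)
print(total, steps)
```

For 8 inputs the loop runs 3 times, whereas a single accumulator adding one product at a time would need 7 additions in sequence.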
The implementation, optimization, and evaluation of an ion-implanted, 0.5 μm refractory self-aligned gate GaAs MESFET process for DCFL digital ICs for supercomputer applications are described. The MESFET performance has been optimized for minimal short channel effects, ultra high performance, minimal backgating, and improved manufacturability. This device process has been coupled with a three or four level metal interconnect process for producing 1 GHz clock rate LSI to VLSI digital computer ICs. The interconnect process makes use of up to four levels of CVD tungsten via fill for planarity throughout the interconnect process. This process yields typical propagation delays of 25 ps for a 2/4 μm inverter with unity fanout. Four input NOR gates with a fanout of four have a typical delay of 65 ps. Moreover, a four input NOR buffer driving a fanout of seven through 500 μm of minimum geometry metal has a delay of 63 ps. This delay increases to 93 ps when the metal length is increased to 1500 μm. This process is being used to produce 5 to 10 K gate digital circuits for the 1 GHz clock rate Cray-4 supercomputer. This work has resulted in a manufacturing process which produces devices and circuits with world class performance.
The IEEE Futurebus+ is a very fast (3 GB/s), industry standard backplane bus specification for computer systems. Futurebus+ was designed independently of any CPU architecture, so it is truly open. With this open architecture, Futurebus+ can be applied to many different computing applications. Profile B is a subset of the IEEE 896 Futurebus+ standard and targets high performance, general purpose computer I/O applications. This paper describes how and why the functional, electrical, mechanical and environmental characteristics were chosen.
Low-latency communication is the key to achieving a high-performance parallel computer. In using state-of-the-art processors, we must take cache memory into account. This paper presents an architecture for low-latency message communication, along with its implementation and performance evaluation. We developed a message controller (MSC) to support low-latency message passing communication for the AP1000 and to minimize message handling overhead. MSC sends messages directly from cache memory and automatically receives messages in the circular buffer. We designed communication functions between cells and evaluated communication performance by running benchmark programs such as the Pingpong benchmark, the LINPACK benchmark, the SLALOM benchmark, and a solver using the scaled conjugate gradient method.
We introduce and evaluate a class of prefetch schemes for on-chip data caches in high-performance RISC processors. These schemes are conservative, initiating a prefetch only when a sequential pattern of references has been observed. Performance results based on traces of five programs in the SPEC suite on an IBM RS/6000 show that these schemes result in a significant reduction in miss ratio without the large increase in memory traffic associated with earlier schemes.
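A conservative sequential prefetch rule of this kind can be sketched with a small trace-driven simulation. The abstract does not give the exact schemes, so everything here is an assumption for illustration: the function name `simulate`, the fully associative LRU cache of whole blocks, and the rule "prefetch block b+1 only after references to block b-1 and then block b".

```python
from collections import OrderedDict
from typing import Iterable, Tuple


def simulate(block_refs: Iterable[int], capacity: int, prefetch: bool) -> Tuple[int, int]:
    """Return (demand_misses, blocks_fetched) for a small LRU cache of blocks."""
    cache = OrderedDict()  # block number -> None, ordered from LRU to MRU
    misses = 0
    fetched = 0  # memory traffic: demand fetches plus prefetches
    last_block = None

    def install(block: int) -> None:
        nonlocal fetched
        if block in cache:
            cache.move_to_end(block)  # already present, just refresh it
            return
        if len(cache) >= capacity:
            cache.popitem(last=False)  # evict the least recently used block
        cache[block] = None
        fetched += 1

    for block in block_refs:
        if block in cache:
            cache.move_to_end(block)
        else:
            misses += 1
            install(block)
        # Conservative rule (assumed): prefetch only once a sequential
        # pattern has been seen, i.e. this block follows the previous one.
        if prefetch and last_block is not None and block == last_block + 1:
            install(block + 1)
        last_block = block
    return misses, fetched


print(simulate(range(8), 4, False))
print(simulate(range(8), 4, True))
print(simulate([0, 10, 20, 30], 4, True))
```

On the sequential trace the rule removes most demand misses at the cost of one extra block fetched; on the non-sequential trace it never fires, so it adds no memory traffic.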
In many applications, the main part of the computations may be encapsulated in compute-bound kernels. Achieving high performance on compute-bound primitives at a low hardware cost has become an important challenge. OPAC was designed as the basic cell of a floating-point coprocessor dedicated to the execution of compute-bound kernels. Due to efficient hardware mechanisms for controlling and sequencing a pipeline, performance close to one floating-point multiply-add per cycle per cell is reached on applications such as solving linear systems, FFTs or correlations in a microprocessor environment.