The I/O behavior of some scientific applications, a subset of the Perfect benchmarks, executing on a multiprocessor is studied. The aim of this study is to explore the various patterns of I/O access in large scientific applications and to understand the impact of the observed behavior on the I/O subsystem architecture. The I/O behavior of a program is characterized by the demands it imposes on the I/O subsystem. It is observed that implicit I/O, or paging, is not a major problem for the applications considered; the I/O problem manifests mainly in the explicit I/O done by the program. Various characteristics of I/O accesses are studied, and their impact on architecture design is discussed.
Existing methods of generating and analyzing traces suffer from a variety of limitations, including complexity, inaccuracy, short length, inflexibility, or applicability only to CISC (complex-instruction-set-computer) machines. The authors use a trace-generation mechanism based on link-time code modification that is simple to use, generates accurate long traces of multiuser programs, runs on a RISC (reduced-instruction-set-computer) machine, and can be flexibly controlled. Accurate performance data for large second-level caches can be obtained by on-the-fly analysis of the traces. A comparison is made of the performance of systems with 512 KB to 16 MB second-level caches, and it is shown that, for today's large programs, second-level caches of more than 4 MB may be unnecessary. It is also shown that set associativity in second-level caches of more than 1 MB does not significantly improve system performance. In addition, the experiments provide insights into first-level and second-level cache line size.
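For illustration, the kind of on-the-fly trace analysis the abstract refers to can be sketched as a second-level cache simulator that consumes addresses from an instrumented program as they are generated, instead of storing a full trace. The cache geometry, the trace format, and all names below are assumptions for the sketch, not the authors' tool.

    #include <stdio.h>
    #include <inttypes.h>

    #define LINE_SIZE  128u                   /* bytes per cache line (assumed) */
    #define CACHE_SIZE (4u * 1024u * 1024u)   /* 4 MB direct-mapped second-level cache */
    #define NUM_LINES  (CACHE_SIZE / LINE_SIZE)

    static uint64_t tags[NUM_LINES];
    static uint8_t  valid[NUM_LINES];
    static uint64_t accesses, misses;

    /* Simulate one memory reference against the second-level cache. */
    static void reference(uint64_t addr)
    {
        uint64_t line = addr / LINE_SIZE;
        uint64_t set  = line % NUM_LINES;
        accesses++;
        if (!valid[set] || tags[set] != line) {   /* miss: fill the line */
            valid[set] = 1;
            tags[set]  = line;
            misses++;
        }
    }

    int main(void)
    {
        uint64_t addr;
        /* Consume hex addresses, one per line, as the instrumented program
         * emits them, rather than keeping the whole trace on disk. */
        while (scanf("%" SCNx64, &addr) == 1)
            reference(addr);
        printf("accesses=%" PRIu64 " misses=%" PRIu64 " miss ratio=%.4f\n",
               accesses, misses,
               accesses ? (double)misses / (double)accesses : 0.0);
        return 0;
    }

Sweeping CACHE_SIZE from 512 KB to 16 MB in a harness of this shape is the flavor of experiment the abstract describes.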
Consideration is given to the development of strategies for predictable performance in homogeneous multicomputer data-flow architectures operating in real time. Algorithms are restricted to the class of large-grained, decision-free algorithms. The mapping of such algorithms onto the specified class of data-flow architectures is realized by a new marked graph model called ATAMM (algorithm to architecture mapping model). Algorithm performance and resource needs are determined for predictable periodic execution of algorithms, which is achieved by algorithm modification and input data injection control. Performance degrades gracefully to adapt to decreasing numbers of resources. The realization of the ATAMM model on a VHSIC four-processor testbed is described. A software design tool for predicting performance and resource requirements is described and is used to evaluate the performance of a space surveillance algorithm.
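As background only (this is the standard cycle-time result for timed marked graphs, not a formula quoted from the paper), the smallest period at which a decision-free algorithm graph can be executed repeatedly is set by its most constrained circuit:

    \[
      T_{\min} \;=\; \max_{C \in \mathcal{C}} \; \frac{\sum_{t \in C} d_t}{M_0(C)}
    \]

where \(\mathcal{C}\) is the set of directed circuits in the graph, \(d_t\) is the firing time of node \(t\), and \(M_0(C)\) is the number of tokens initially on circuit \(C\). Injecting input data no faster than once every \(T_{\min}\) is what makes periodic execution predictable, which is the role played by input data injection control in the abstract.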
The authors present an empirical evaluation of two memory-efficient directory methods for maintaining coherent caches in large shared-memory multiprocessors. Both directory methods are modifications of a scheme proposed by L.M. Censier and P. Feautrier (1978) that does not rely on a specific interconnection network and can be readily distributed across interleaved main memory. The schemes considered here overcome the large amount of memory required for tags in the original scheme in two different ways. In the first scheme, each main memory block is sectored into sub-blocks that share the large tag overhead. In the second scheme, a limited number of large tags are stored in an associative cache and shared among a much larger number of main memory blocks. Simulations show that, in terms of access time and network traffic, both directory methods provide significant performance improvements over a memory system in which shared-writable data are not cached. The large block sizes required by the sectored scheme, however, promote enough false sharing that its performance is markedly worse than when a tag cache is used.
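A rough sketch of the two directory organizations shows where the memory savings come from; the field sizes, entry counts, and names are illustrative assumptions, not taken from the paper.

    #include <stdio.h>
    #include <stdint.h>

    #define NUM_PROCS  64   /* one presence bit per processor (assumed <= 64) */
    #define SUBBLOCKS  4    /* sub-blocks sharing one tag in the sectored scheme */

    /* Scheme 1: sectored directory. One wide presence vector is kept per
     * large memory block and amortized over its sub-blocks. */
    struct sectored_entry {
        uint64_t presence;          /* which caches may hold copies */
        uint8_t  state[SUBBLOCKS];  /* per-sub-block state, e.g. SHARED/DIRTY */
    };

    /* Scheme 2: directory tag cache. Only a limited number of full entries
     * exist, kept associatively and shared among many memory blocks;
     * blocks with no entry are treated as not cached anywhere. */
    struct tag_cache_entry {
        uint64_t block_addr;        /* which memory block this entry tracks */
        uint64_t presence;
        uint8_t  state;
        uint8_t  valid;
    };

    int main(void)
    {
        printf("sectored: %zu bytes per large block (%zu per sub-block)\n",
               sizeof(struct sectored_entry),
               sizeof(struct sectored_entry) / SUBBLOCKS);
        printf("tag cache entry: %zu bytes (shared among many blocks)\n",
               sizeof(struct tag_cache_entry));
        return 0;
    }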
A communication architecture and a dynamic control protocol are presented for real-time communication in multiple token ring networks. The network can be formed by multiple channels through bandwidth subdivision of a high-speed ring. A flexible preemption and dynamic load allocation scheme is developed that reduces the percentage of critical packets lost while maintaining high channel utilization. This performance improvement is demonstrated with extensive simulation results.
A class of problems, called signal understanding (SU) systems, that involves intensive numeric as well as symbolic processing is considered. A shared-memory multiprocessor computer system designed specifically to facilitate the development and evaluation of parallel SU algorithms, which would be a basic tool for developing real-time SU systems, is described. This machine, called the MX-1, runs Common LISP, which has been augmented with extensions for user-directed parallelism. It is, in essence, a parallel LISP machine with digital signal processors (DSPs) integrated into the architecture to provide powerful numeric computational capability. The MX-1 has 16 crossbar-interconnected processing nodes, each with a Motorola 68020 CPU operating at 16 MHz, 8 Mbytes of dynamic random-access memory, and its own Weitek DSP coprocessor that can perform 10 million multiply-accumulate operations per second. High-speed I/O ports are provided for direct entry of sensor data into the DSPs. Two 4-node systems and a 16-node system have been fabricated, and several applications have been run on the systems.
ISBN: (Print) 0897913124
We describe the system architecture and the programming environment of the Pixel Machine, a parallel image computer with a distributed frame buffer. The architecture of the computer is based on an array of asynchronous MIMD nodes with parallel access to a large frame buffer. The machine consists of a pipeline of pipe nodes, which execute sequential algorithms, and an array of m x n pixel nodes, which execute parallel algorithms. A pixel node directly accesses every m-th pixel on every n-th scan line of an interleaved frame buffer. Each processing node is based on a high-speed, floating-point programmable processor. The programmability of the computer allows all algorithms to be implemented in software. We present the mappings of a number of geometry and image-computing algorithms onto the machine and analyze their performance.
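The interleaving described above admits a very small mapping from screen coordinates to the owning pixel node. The array dimensions (an 8 x 8 array) and all names below are assumptions for illustration.

    #include <stdio.h>

    #define M 8     /* pixel nodes across a scan line (assumed) */
    #define N 8     /* pixel nodes down the screen (assumed) */

    /* Which pixel node owns screen pixel (x, y), and where that pixel
     * lives in the node's local portion of the frame buffer. */
    static void map_pixel(int x, int y, int *node_i, int *node_j,
                          int *local_x, int *local_y)
    {
        *node_i  = x % M;       /* node column */
        *node_j  = y % N;       /* node row */
        *local_x = x / M;       /* position inside that node's sub-image */
        *local_y = y / N;
    }

    int main(void)
    {
        int ni, nj, lx, ly;
        map_pixel(1021, 770, &ni, &nj, &lx, &ly);
        printf("pixel (1021,770) -> node (%d,%d), local (%d,%d)\n", ni, nj, lx, ly);
        return 0;
    }

Because adjacent screen pixels map to different nodes, the pixels touched by a typical primitive are spread roughly evenly across the array.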
A description is given of the Parallel Unification Machine (PLUM), a Prolog processor that exploits fine-grain parallelism using multiple function units executing in parallel. In most cases the execution of bookkeeping instructions is almost completely overlapped by unification, and the performance of the processor is limited only by the available unification parallelism. Measurements from a register-transfer-level simulator of PLUM are presented. The results show that PLUM with three unification units achieves an average speedup of approximately 3.4 over the Berkeley VLSI-PLM, which is usually regarded as the current highest-performance special-purpose, pipelined Prolog processor. Measurements that show the effects of multiple unification units and memory access time on performance are also presented.
ISBN: (Print) 9780897913195
To provide arbitrary and fully dynamic connectivity in a network of processors, transport mechanisms must be implemented that propagate data from processor to processor on the basis of addresses contained within a packet of data. Such data transport mechanisms must satisfy a number of requirements: deadlock and livelock freedom, good hot-spot performance, high throughput, and low latency. The authors propose a solution to these problems that allows deadlock-free, adaptive, high-throughput packet routing to be implemented on networks of processors. Examples that illustrate the technique for 2-D array and toroidal networks are given. An implementation of this scheme on arrays of transputers is described. The scheme also serves as a basis for a very low latency routing strategy named the mad postman, a detailed implementation of which is described here as well.
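A minimal sketch of the adaptive routing decision on a 2-D array follows. It shows only the "forward along any profitable, free link" choice; the machinery that makes the authors' scheme deadlock-free, and the mad-postman strategy itself, are not reproduced, and all names are illustrative.

    #include <stdio.h>

    enum dir { EAST, WEST, NORTH, SOUTH, LOCAL };
    static const char *dir_name[] = { "EAST", "WEST", "NORTH", "SOUTH", "LOCAL" };

    struct router {
        int x, y;           /* this node's coordinates */
        int link_busy[4];   /* output-link occupancy, indexed by enum dir */
    };

    /* Choose an output direction for a packet destined for (dx, dy):
     * consider only profitable directions and prefer a free link. */
    static enum dir route(const struct router *r, int dx, int dy)
    {
        enum dir want[2];
        int n = 0;

        if (dx != r->x) want[n++] = (dx > r->x) ? EAST : WEST;
        if (dy != r->y) want[n++] = (dy > r->y) ? NORTH : SOUTH;
        if (n == 0)
            return LOCAL;                    /* packet has reached its node */

        for (int i = 0; i < n; i++)          /* adaptively pick a free profitable link */
            if (!r->link_busy[want[i]])
                return want[i];
        return want[0];                      /* all profitable links busy: queue here */
    }

    int main(void)
    {
        struct router r = { 2, 3, { 1, 0, 0, 0 } };   /* EAST link currently busy */
        printf("route to (5,7): %s\n", dir_name[route(&r, 5, 7)]);
        return 0;
    }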
The Epsilon dataflow architecture is designed for high-speed uniprocessor execution as well as for parallel operation in a multiprocessor system. The Epsilon architecture directly matches ready operands, thus eliminating the need for associative matching stores. Epsilon also supports low-cost data fanout and critical sections. A 10-MFLOPS (million floating-point operations per second) CMOS/TTL processor prototype is running, and its performance has been measured with several benchmarks. The prototype has demonstrated sustained performance exceeding that of comparable control-flow processors running at high clock rates (three times faster than a 20-MHz transputer and 24 times faster than a Sun on a suite of arithmetic tests, for example).
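A minimal sketch of direct operand matching, assuming each two-input instruction owns a fixed slot into which arriving operands are written; the slot layout and names are illustrative, not Epsilon's actual instruction format.

    #include <stdio.h>
    #include <stdint.h>
    #include <stdbool.h>

    struct match_slot {
        double  operand[2];     /* left (0) and right (1) operand values */
        uint8_t present;        /* bitmask of operands that have arrived */
        uint8_t opcode;
    };

    /* Write one operand directly into its target instruction's slot; return
     * true when both operands are present and the instruction can fire.
     * No associative search is needed: the token names its slot directly. */
    static bool deliver(struct match_slot *slot, int which, double value)
    {
        slot->operand[which] = value;
        slot->present |= (uint8_t)(1u << which);
        return slot->present == 0x3;
    }

    int main(void)
    {
        struct match_slot add = { { 0, 0 }, 0, /* opcode */ 1 };
        deliver(&add, 0, 2.5);                 /* left operand arrives */
        if (deliver(&add, 1, 4.0))             /* right operand: slot fires */
            printf("fire: %g + %g = %g\n", add.operand[0], add.operand[1],
                   add.operand[0] + add.operand[1]);
        return 0;
    }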