A practical implementation of high-performance instruction-level parallel architectures is constrained by the difficulty of building a large monolithic multi-ported register file (RF). A solution is to partition the RF into smaller RFs while keeping the total number of registers and ports equal. This paper applies RF partitioning to transport triggered architectures; these architectures are of the VLIW type. One may expect partitioning to increase the number of executed cycles because it constrains the number of ports per RF. It is shown that these performance losses are small; e.g., partitioning an RF with 24 registers and four read and four write ports into four RFs with 6 registers and one read and one write port gives a performance loss of only 5.8%. Partitioned RFs consume less area than monolithic RFs with the same number of ports and registers. Experiments show that, if the area saved by partitioning is spent on extra registers, partitioning does not, on average, reduce performance; it may even result in a small performance gain.
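As a minimal sketch of the arithmetic behind this partitioning scheme (the function and its names are illustrative, not from the paper), the transformation keeps the total register and port counts fixed while shrinking each individual RF:

```python
# Illustrative sketch: split one monolithic register file into n_parts
# smaller RFs, preserving the total number of registers and ports.
# This only models the bookkeeping, not the compiler's port-assignment problem.

def partition_rf(total_regs, read_ports, write_ports, n_parts):
    """Return the per-partition configuration for an even split."""
    assert total_regs % n_parts == 0, "registers must divide evenly"
    assert read_ports % n_parts == 0 and write_ports % n_parts == 0, \
        "ports must divide evenly"
    return {
        "rfs": n_parts,
        "regs_per_rf": total_regs // n_parts,
        "read_ports_per_rf": read_ports // n_parts,
        "write_ports_per_rf": write_ports // n_parts,
    }

# The paper's example: 24 registers with 4R/4W ports becomes
# four RFs of 6 registers with 1R/1W ports each.
print(partition_rf(24, 4, 4, 4))
```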
Very accurate branch prediction is an important requirement for achieving high performance on deeply pipelined, superscalar processors. To improve on the prediction accuracy of current single-scheme branch predictors, hybrid (multiple-scheme) branch predictors have been proposed [6, 7]. These predictors combine multiple single-scheme predictors into a single predictor. They use a selection mechanism to decide, for each branch, which single-scheme predictor to use. The performance of a hybrid predictor depends on its single-scheme predictor components and its selection mechanism. Using known single-scheme predictors and selection mechanisms, this paper identifies the most effective hybrid predictor implementation. In addition, it introduces a new selection mechanism, the 2-level selector, which further improves the performance of the hybrid branch predictor.
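The combining idea can be sketched as follows. This is a generic tournament-style hybrid with a per-branch 2-bit chooser, not the paper's 2-level selector; the component predictors and table sizes are assumptions for illustration:

```python
# Illustrative hybrid predictor sketch (not the paper's implementation):
# two 2-bit-counter component predictors plus a 2-bit selector table that
# learns, per branch, which component to trust.

class TwoBitTable:
    """A table of saturating 2-bit counters indexed by PC."""
    def __init__(self, size):
        self.size = size
        self.ctr = [2] * size  # initialize weakly taken

    def predict(self, pc):
        return self.ctr[pc % self.size] >= 2

    def update(self, pc, taken):
        i = pc % self.size
        self.ctr[i] = min(3, self.ctr[i] + 1) if taken else max(0, self.ctr[i] - 1)

class HybridPredictor:
    def __init__(self, size=1024):
        self.p0 = TwoBitTable(size)   # stand-in for one single-scheme predictor
        self.p1 = TwoBitTable(size)   # stand-in for a second scheme
        self.sel = [2] * size         # 2-bit chooser: >= 2 favours p1

    def predict(self, pc):
        use_p1 = self.sel[pc % len(self.sel)] >= 2
        return (self.p1 if use_p1 else self.p0).predict(pc)

    def update(self, pc, taken):
        # Train the selector toward whichever component was correct.
        i = pc % len(self.sel)
        c0 = self.p0.predict(pc) == taken
        c1 = self.p1.predict(pc) == taken
        if c1 and not c0:
            self.sel[i] = min(3, self.sel[i] + 1)
        elif c0 and not c1:
            self.sel[i] = max(0, self.sel[i] - 1)
        # Both components always train on the outcome.
        self.p0.update(pc, taken)
        self.p1.update(pc, taken)
```

A real hybrid would use heterogeneous components (e.g. a bimodal and a history-based predictor) so the selector has a meaningful choice; here both components are identical only to keep the sketch short.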
Prefetching has been shown to be one of several effective approaches for tolerating large memory latencies. In this paper, we consider a prefetch engine called Hare, which handles prefetches at run time and is built alongside the data pipeline of the on-chip data cache in high-performance processors. The key design point is that the engine is programmable, so that software prefetching techniques can also be employed to exploit the benefits of prefetching. The engine always launches prefetches ahead of the current execution point, which is tracked through the program counter. We evaluate the proposed scheme by trace-driven simulation and consider area and cycle-time factors in evaluating cost-effectiveness. Our performance results show that the prefetch engine can significantly reduce the data access penalty with little prefetching overhead.
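One common run-time mechanism such an engine might implement is stride detection, sketched below. The table structure and policy are assumptions for illustration, not the Hare design:

```python
# Hypothetical sketch of a stride-based prefetch table: each load PC
# tracks its last address and observed stride; on a confirmed stride,
# the engine launches a prefetch one stride ahead of the current access.

class StridePrefetcher:
    def __init__(self):
        self.table = {}  # pc -> (last_addr, last_stride)

    def access(self, pc, addr):
        """Observe a load; return an address to prefetch, or None."""
        prefetch = None
        if pc in self.table:
            last, stride = self.table[pc]
            new_stride = addr - last
            if new_stride == stride and stride != 0:
                # Stride confirmed twice: run ahead of execution.
                prefetch = addr + stride
            self.table[pc] = (addr, new_stride)
        else:
            self.table[pc] = (addr, 0)
        return prefetch
```

For a load at one PC sweeping an array with an 8-byte stride, the first two accesses only train the table entry; from the third access on, the engine issues a prefetch for the next element.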
Advances in network and processor technology have greatly changed the communication and computational power of local-area workstation clusters. However, operating systems still treat workstation clusters as a collecti...
LIGHTNING is a dynamically reconfigurable WDM network testbed project for supercomputer interconnection. This paper describes a hierarchical WDM-based optical network testbed that is being constructed to interconnect a large number of supercomputers and create a distributed shared memory environment. The objective of the hierarchical architecture is to achieve scalability while avoiding the need for multiple wavelength-tunable devices per node. Furthermore, single-hop all-optical communication is achieved: a packet remains in optical form from source to destination and requires no intermediate routing. The wavelength-multiplexed hierarchical structure features wavelength channel re-use at each level, allowing scalability to very large system sizes. It partitions the traffic between different levels of the hierarchy without electronic intervention, using a combination of wavelength- and space-division multiplexing. A significant advantage of this approach is its ability to dynamically vary the bandwidth provided to different levels of the hierarchy. Each node in LIGHTNING receives traffic on n channels in an n-level hierarchy, one channel for each level. Each node monitors the traffic intensities on each channel and can detect any temporal or spatial shift in traffic balance. LIGHTNING can dynamically reconfigure to balance the traffic at each level by moving wavelengths associated with each level up or down as needed. Bandwidth re-allocation is completely decentralized: any node can initiate it, yielding highly fault-tolerant system behavior. This paper describes the system architecture, the network and memory interface, and the optical devices that have been developed in this project.
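The decentralized re-allocation step can be sketched as a simple greedy rule; the policy below is my own illustration of moving wavelengths between hierarchy levels based on observed load, not the protocol used in LIGHTNING:

```python
# Illustrative sketch: re-balance wavelength channels across the levels
# of the hierarchy. alloc[i] is the number of wavelengths serving level i,
# load[i] the traffic observed there. Any node could run this rule locally.

def rebalance(alloc, load):
    """Move one wavelength from the least-loaded level (per wavelength)
    to the most-loaded one, keeping at least one channel per level."""
    per_wl = [load[i] / alloc[i] for i in range(len(alloc))]
    hot = per_wl.index(max(per_wl))
    cold = per_wl.index(min(per_wl))
    if hot != cold and alloc[cold] > 1:
        alloc = list(alloc)       # leave the input allocation untouched
        alloc[cold] -= 1
        alloc[hot] += 1
    return alloc
```

With a skewed load such as `rebalance([2, 2], [9.0, 1.0])`, one wavelength migrates from the idle level to the busy one; a balanced or minimal allocation is left unchanged.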
Accurate static branch prediction is the key to many techniques for exposing, enhancing, and exploiting Instruction Level Parallelism (ILP). The initial work on static correlated branch prediction (SCBP) demonstrated improvements in branch prediction accuracy, but did not address overall performance. In particular, SCBP expands the size of executable programs, which negatively affects the performance of the instruction memory hierarchy. Using the profile information available under SCBP, we can minimize these negative performance effects through the application of code layout and branch alignment techniques. We evaluate the performance effect of SCBP and these profile-driven optimizations on instruction cache misses, branch mispredictions, and branch misfetches for a number of recent processor implementations. We find that SCBP improves performance over (traditional) per-branch static profile prediction. We also find that SCBP improves the performance benefits gained from branch alignment. As expected, SCBP gives larger benefits on machine organizations with high mispredict/misfetch penalties and low cache miss penalties. Finally, we find that the application of profile-driven code layout and branch alignment techniques (without SCBP) can improve the performance of dynamic correlated branch prediction techniques.
One of the most sought-after software innovations of this decade is the construction of systems using off-the-shelf workstations that actually deliver, and even surpass, the power and reliability of supercomputers. Using completely novel techniques (eager scheduling, evasive memory layouts, and dispersed data management), it is possible to build an execution environment for parallel programs on workstation networks. These techniques were originally developed in a theoretical framework for an abstract machine which models a shared-memory asynchronous multiprocessor. The network-of-workstations platform presents an inherently asynchronous environment for the execution of our parallel programs. This gives rise to substantial problems of correctness of the computation and of proper automatic load balancing of the work among the processors, so that a slow processor will not hold up the total computation. A limiting case of asynchrony is when a processor becomes infinitely slow, i.e. fails. Our methodology copes with all these problems, as well as with memory failures. An interesting feature of this system is that it is neither a fault-tolerant system extended for parallel processing nor a parallel processing system extended for fault tolerance: the same novel mechanisms ensure both properties.
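The core of eager scheduling can be sketched as follows. This is a minimal thread-based illustration under my own assumptions (idempotent tasks, a shared done-table, first finisher wins), not the authors' system:

```python
# Illustrative sketch of eager scheduling: idle workers may re-execute
# tasks that are assigned but not yet reported done, so a slow or failed
# worker cannot stall overall progress. Tasks must be idempotent.

import threading

class EagerScheduler:
    def __init__(self, tasks):
        self.tasks = list(tasks)            # idempotent task functions
        self.done = [False] * len(tasks)
        self.results = [None] * len(tasks)
        self.lock = threading.Lock()

    def next_task(self):
        """Return the index of some unfinished task (possibly one already
        running on another worker), or None when everything is done."""
        with self.lock:
            for i, finished in enumerate(self.done):
                if not finished:
                    return i
            return None

    def report(self, i, result):
        with self.lock:
            if not self.done[i]:            # first finisher wins
                self.done[i] = True
                self.results[i] = result

def worker(sched):
    while (i := sched.next_task()) is not None:
        sched.report(i, sched.tasks[i]())
```

Because a task may run on several workers at once, the computation completes even if one worker stops making progress; correctness rests on tasks being idempotent and on the single-writer discipline in `report`.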
This paper presents an object-based distributed computing environment based on a reflective architecture for industrial large-scale distributed systems. This environment uses a compiler-based reflection technique to realize industrial distributed systems with standard workstations. A multiple-world model is also presented, in which a distributed system consists of hierarchical worlds that contain related objects. The distributed computing environment, based on the reflective architecture and the multiple-world model, provides object management and communication management functions, such as prioritized communications, high availability and reliability, nonstop maintenance and extension, and hierarchical transparency.
The RACE parallel computer system provides a high-performance parallel interconnection network at low cost. This paper describes the architecture and implementation of the RACE system, a parallel computer for embedded applications. The topology of the network, which is constructed with 6-port switches, can be specified by the customer and is typically a fat-tree, a Clos network, or a mesh. The network employs a preemptable circuit-switched strategy. The network and the processor-network interface work together to provide high performance: 160 megabytes per second transfer rates with about 1 microsecond of latency. Priorities can be used to guarantee tight real-time constraints of a few microseconds through a congested network. A self-regulating circuit adjusts the impedance and output delay of the pin-driver pads.
The paper presents the results of performance analyses of a seismic analysis kernel code on the KSR multiprocessors. The purpose of the analysis is to understand the performance behavior of a class of applications on shared-memory parallel machines. The g5 kernel code, commonly used in seismic analysis applications, is parallelized, and its computational and I/O performance is analyzed on a 32-node KSR-1 and a 64-node KSR-2.