With the explosive popularity of the internet and the World Wide Web (WWW), there is a rapidly growing need to provide unprecedented access to globally distributed data sources through the internet. Web accessibility will be an essential component of the services that future digital libraries should provide for clients. This need has created a strong demand for database access capability through the internet and for high-performance, scalable web servers. As most popular web sites are experiencing overload from an increasing number of users accessing the sites at the same time, scalable web servers should adapt to changing access characteristics and should be capable of handling a large number of concurrent requests with reasonable response times and minimal request drop rates.
The Power3 processor is a 64-bit implementation of the PowerPC™ architecture and is the successor to the Power2™ processor for workstations and servers that require high-performance floating-point capability. The previous processors used Newton-Raphson algorithms for their implementations of divide and square root. The Power3 processor has a longer pipeline latency, which would substantially increase the latency of these instructions. Instead, new algorithms based on power series approximations were developed, which provide significantly better performance than the Newton-Raphson algorithm on this processor. This paper describes the algorithms and then shows how both the series-based algorithms and the Newton-Raphson algorithms are affected by pipeline length. For the Power3, the power series algorithms reduce the divide latency by over 20% and the square root latency by 35%.
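As a rough illustration of the two families of algorithms named in this abstract, the sketch below computes a reciprocal 1/b from a starting estimate x0 in two ways. It is not the Power3 implementation: the function names, the starting estimate, and the use of Python floats are all assumptions made for illustration. With e = 1 - b*x0, Newton-Raphson refines the estimate through a chain of dependent steps, while the series form evaluates x0 * (1 + e + e^2 + ...), whose terms depend only on e.

```python
def reciprocal_newton(b, x0, steps):
    """Approximate 1/b by Newton-Raphson: x <- x * (2 - b * x).

    Each step needs the result of the previous one, so the steps
    form a serial dependency chain.
    """
    x = x0
    for _ in range(steps):
        x = x * (2.0 - b * x)
    return x


def reciprocal_series(b, x0, terms):
    """Approximate 1/b as x0 * (1 + e + e^2 + ...), with e = 1 - b * x0.

    Uses the identity 1/b = x0 / (1 - e), expanded as a geometric
    series that converges when |e| < 1.
    """
    e = 1.0 - b * x0
    total = 0.0
    power = 1.0  # holds e**k for the current term
    for _ in range(terms):
        total += power
        power *= e
    return x0 * total
```

In exact arithmetic, k Newton-Raphson steps give the same result as the first 2^k series terms, so the two differ in how the work is scheduled rather than in what they converge to.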
Building dependable distributed systems using ad hoc methods is a challenging task. Without proper support, an application programmer must face the daunting requirement of providing fault tolerance at the application level, in addition to dealing with the complexities of the distributed application itself. This approach requires a deep knowledge of fault tolerance on the part of the application designer and has a high implementation cost. What is needed is a systematic approach to providing dependability to distributed applications. Proteus, part of the AQuA architecture, fills this need and provides facilities to make a standard distributed CORBA application dependable, with minimal changes to the application. Furthermore, it permits applications to specify, either directly or via the Quality Objects (QuO) infrastructure, the level of dependability they expect of a remote object, and it will attempt to configure the system to achieve the requested dependability level. Our previous papers have focused on the architecture and implementation of Proteus. This paper describes how to construct dependable applications using the AQuA architecture, by describing the interface that a programmer is presented with and the graphical monitoring facilities that it provides.
New VLSI circuit architectures for addition and multiplication modulo (2^n - 1) and (2^n + 1) are proposed that allow the implementation of highly efficient combinational and pipelined circuits for modular arithmetic. It is shown that the parallel-prefix adder architecture is well suited to realize fast end-around-carry adders used for modulo addition. Existing modulo multiplier architectures are improved for higher speed and regularity. These allow the use of common multiplier speed-up techniques like Wallace-tree addition and Booth recoding, resulting in the fastest known modulo multipliers. Finally, a high-performance modulo multiplier-adder for the IDEA block cipher is presented. The resulting circuits are compared qualitatively and quantitatively, i.e., in a standard-cell technology, with existing solutions and ordinary integer adders and multipliers.
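The end-around-carry idea mentioned in this abstract can be sketched in software. The function below is a behavioral model only, not the proposed parallel-prefix circuit, and its name and the choice to normalize the all-ones word to zero are assumptions made for illustration. It relies on the identity 2^n ≡ 1 (mod 2^n - 1): the carry out of an n-bit addition is simply added back in at the least significant bit.

```python
def add_mod_2n_minus_1(a, b, n):
    """Add two n-bit operands modulo 2**n - 1 using an end-around carry.

    Assumes 0 <= a, b <= 2**n - 1. The all-ones word is a second
    representation of zero and is normalized to 0 on output.
    """
    mask = (1 << n) - 1
    s = a + b
    # Fold the carry-out back into the low n bits (end-around carry).
    s = (s & mask) + (s >> n)
    # After folding, s <= mask; all ones is congruent to zero.
    return 0 if s == mask else s
```

A quick check for n = 4 is to compare the result against `(a + b) % 15` for every operand pair.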
ISBN: 9780769501703 (print)
This paper presents a novel hardware-based approach for identifying, profiling, and monitoring hot spots in order to support runtime optimization of general-purpose programs. The proposed approach consists of a set of tightly coupled hardware tables and control logic modules that are placed in the retirement stage of a processor pipeline, removed from the critical path. The features of the proposed design include rapid detection of program hot spots after changes in execution behavior, runtime-tunable selection criteria for hot spot detection, and negligible overhead during application execution. Experiments using several SPEC95 benchmarks, as well as several large Windows NT applications, demonstrate the promise of the proposed design.
ISBN: 9780769501703 (print)
Modern compilers must expose sufficient amounts of Instruction-Level Parallelism (ILP) to achieve the promised performance increases of superscalar and VLIW processors. One of the major impediments to achieving this goal has been inefficient programmatic control flow. Historically, the compiler has translated the programmer's original control structure directly into assembly code with conditional branch instructions. Eliminating inefficiencies in handling branch instructions and exploiting ILP has been the subject of much research. However, traditional branch handling techniques cannot significantly alter the program's inherent control structure. The advent of predication as a program control representation has enabled compilers to manipulate control in a form more closely related to the underlying program logic. This work takes full advantage of the predication paradigm by abstracting the program control flow into a logical form referred to as a program decision logic network. This network is modeled as a Boolean equation and minimized using modified versions of logic synthesis techniques. After minimization, the more efficient version of the program's original control flow is re-expressed in predicated code. Furthermore, this paper proposes extensions to the HPL PlayDoh predication model in support of more effective predicate decision logic network minimization. Finally, this paper shows the ability of the mechanisms presented to overcome limits on ILP previously imposed by rigid program control structure.
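To make the idea of treating control flow as a Boolean equation concrete, the toy sketch below compares a guard written the way nested branches might produce it with a simplified form, and checks by exhaustive truth table that the two agree. The guard expressions, the function names, and the brute-force check are all hypothetical; they do not reproduce the paper's decision logic network or its minimization method.

```python
from itertools import product


def guard_original(a, b, c):
    """Guard as three separate branch paths that reach the same operation."""
    return (a and b) or (a and not b) or ((not a) and c)


def guard_minimized(a, b, c):
    """Hand-simplified guard: b drops out, and the a/not-a split collapses."""
    return a or c


def equivalent(f, g, arity):
    """Return True if f and g agree on every Boolean input combination."""
    return all(
        bool(f(*values)) == bool(g(*values))
        for values in product([False, True], repeat=arity)
    )
```

Calling `equivalent(guard_original, guard_minimized, 3)` confirms that the shorter guard can replace the longer one, so predicated code needs fewer condition evaluations to decide whether the guarded operation executes.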
ISBN: 0818691948 (print)
The proceedings contain 61 papers. The topics discussed include: new number representation and conversion techniques on reconfigurable mesh; precise control of instruction caches; more on arbitrary boundary packed arithmetic; PERL - a registerless architecture; design alternatives for shared memory multiprocessors; a simple optimal list ranking algorithm; a parallel skeletonization algorithm and its VLSI architecture; improving error bounds for multipole-based treecodes; computation of penetration measures for convex polygons and polyhedra for graphics applications; extrapolation in distributed adaptive integration; and Java data parallel extensions with runtime system support.
Nonlinear wave forces on offshore structures are investigated. The fluid motion is computed using an Euler-Lagrange time-domain approach. Nonlinear free surface boundary conditions are stepped forward in time using an accurate and stable integration technique. The field equation with mixed boundary conditions that results at each time step is solved at N nodes using a desingularized boundary integral method with multipole acceleration. Multipole-accelerated solutions require O(N) computational effort and computer storage, while conventional solvers require O(N^2) effort and storage for an iterative solution and O(N^3) effort for direct inversion of the influence matrix. These methods are applied to the three-dimensional problem of wave diffraction by a vertical cylinder.
Over the past few years there has been increased interest in building custom computing machines (CCMs) as a way of achieving very high performance on specific problems. The advent of high-density field programmable gate arrays (FPGAs), in combination with new synthesis tools, has made it relatively easy to produce programmable custom machines without building specific hardware. In many cases, the performance achieved by an FPGA-based custom computer is attributed to the exploitation of massive concurrency in the underlying application. In this paper we explore the sources of speedup for irregular problems in which it is difficult to exploit such parallelism. We highlight five main sources of speedup that we have observed, namely the provision of high memory bandwidth, the use of flexible address generation hardware, the use of gather-scatter array operations, the use of lookup tables, and the use of multiple tailored arithmetic units. By considering some representative examples of such irregular problems, the paper illustrates that good performance is possible given the current generation of FPGA devices and RISC processors. The paper then explores whether this performance gain will be possible given the next generation of RISC processors and FPGAs. It concludes that the only way to maintain the speedup is to alter the architecture of CCMs in combination with architectural changes to the FPGAs themselves.
ISBN: 0818684038 (print)
Recent FPGA architectures have shown an increased emphasis on run-time reconfiguration, or the ability to rapidly change the functionality of the FPGA to sequentially accommodate large processing tasks. In addition, partial reconfiguration allows for the reconfiguration of a portion of the FPGA while the remainder is running. These two features enable the use of reconfigurable computing in high-performance multi-threaded, multi-user environments. However, current board designs are not optimized to provide the processing support required to maintain this run-time environment, which includes management of the reconfigurable resources, the interface to the host processor, and data movement. In this paper, we describe the architecture, design, and applicability of the ACEcard, a high-performance reconfigurable co-processor. The ACEcard contains reconfigurable resources as well as an embedded processor to manage the run-time reconfiguration of those resources. We provide details of the architecture of the card as well as a description of the current and future Java-based run-time environment.