Lack of object code compatibility in VLIW architectures is a severe limit to their adoption as a general-purpose computing paradigm. Previous approaches include hardware and software techniques, both of which have drawbacks. Hardware techniques add to the complexity of the architecture, whereas software techniques require multiple executables. This paper presents a technique called Dynamic Rescheduling that applies software techniques dynamically, using intervention by the OS: at each first-time page fault, the page of code is rescheduled for the new generation, if required. Results are presented to demonstrate the viability of the technique using the Illinois IMPACT compiler and the TINKER architectural framework. For the machine models and the workloads used in this study, performance of the rescheduled code compares well with the native scheduled code for a machine. A subset of programs in the workload face a large number of first-time page faults, so their rescheduling overhead is higher relative to their total execution time. Such programs are called high-overhead programs. Caching of translated pages across multiple invocations of the program to reduce the rescheduling overhead, using a persistent rescheduled-page cache (PRC), is discussed. It was found that for the workload used in this evaluation, a PRC of between 512 and 1024 pages that uses an overhead-based page replacement policy would be effective in reducing the overhead.
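The interaction between first-time page faults and the PRC can be illustrated with a small sketch. It is not the paper's implementation: the class `RescheduledPageCache`, the function `on_first_time_page_fault`, the `reschedule` callback, and the reading of "overhead-based replacement" as "evict the page that is cheapest to reschedule again" are all assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class RescheduledPageCache:
    """Toy PRC: maps a page id to (rescheduled code, rescheduling overhead)."""
    capacity: int
    pages: dict = field(default_factory=dict)

    def lookup(self, page_id):
        entry = self.pages.get(page_id)
        return None if entry is None else entry[0]

    def insert(self, page_id, code, overhead):
        """Cache a page; return False if it was not worth caching."""
        if self.capacity <= 0:
            return False
        if page_id not in self.pages and len(self.pages) >= self.capacity:
            # Assumed policy: the victim is the page with the lowest overhead,
            # i.e. the one that is cheapest to reschedule again later.
            victim = min(self.pages, key=lambda p: self.pages[p][1])
            if self.pages[victim][1] >= overhead:
                return False  # the new page is cheaper to redo than any cached one
            del self.pages[victim]
        self.pages[page_id] = (code, overhead)
        return True


def on_first_time_page_fault(cache, page_id, native_code, reschedule):
    """Return code for the faulting page, rescheduling only on a PRC miss.

    `reschedule` is a hypothetical callback returning (new_code, overhead).
    """
    cached = cache.lookup(page_id)
    if cached is not None:
        return cached
    new_code, overhead = reschedule(native_code)
    cache.insert(page_id, new_code, overhead)
    return new_code
```

With this policy, a high-overhead page stays cached across invocations while cheap pages are the first to be dropped, which is one plausible way to target the high-overhead programs described above.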
Cilk is a parallel programming language that allows programmers to write multithreaded parallel programs that use computational resources predictably and efficiently. The Cilk language allows programmers to specify th...
The large design space of modern computer architectures calls for performance modelling tools to facilitate the evaluation of different alternatives. In this paper we give an overview of the Mermaid multicomputer simulation environment. This environment allows the evaluation of a wide range of architectural design tradeoffs while delivering reasonable simulation performance. To achieve this, simulation takes place at the level of abstract machine instructions rather than at the level of real instructions. A less detailed mode of simulation is also provided, which can yield high simulation efficiency when accuracy is not the primary objective. As a consequence, Mermaid makes both fast prototyping and accurate evaluation of multicomputer architectures feasible.
The goal of the RaPiD (Reconfigurable Pipelined Datapath) architecture is to provide high-performance configurable computing for a range of computationally intensive applications that demand special-purpose hardware. This is accomplished by mapping the computation into a deep pipeline using a configurable array of coarse-grained computational units. A key feature of RaPiD is the combination of static and dynamic control. While the underlying computational pipelines are configured statically, a limited amount of dynamic control is provided, which greatly increases the range and capability of applications that can be mapped to RaPiD. This paper illustrates this mapping and configuration for several important applications, including a FIR filter, 2-D DCT, motion estimation, and parametric curve generation; it also shows how static and dynamic control are used to perform complex computations.
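To make the idea of mapping a computation onto a pipeline concrete, here is a software model of a FIR filter as a chain of stages, each holding one tap weight and one delay register. This is a generic shift-register formulation, not RaPiD's actual mapping or configuration; the function name `fir_pipeline` and its parameters are hypothetical.

```python
def fir_pipeline(samples, taps):
    """Model a FIR filter as a chain of stages, one tap weight per stage.

    Each input sample enters stage 0 and moves one stage further on every
    step; the output is the weighted sum over all stage registers.
    """
    regs = [0] * len(taps)  # one delay register per pipeline stage
    out = []
    for x in samples:
        regs = [x] + regs[:-1]  # shift the sample stream by one stage
        out.append(sum(w * r for w, r in zip(taps, regs)))
    return out


# Feeding a unit impulse returns the tap weights, followed by zeros.
impulse_response = fir_pipeline([1, 0, 0, 0], [1, 2, 3])
```

In this model the tap weights play the role of a static configuration, since they are fixed before any data flows, while the sample stream is the only thing that changes from step to step.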
Typical reconfigurable machines exhibit shortcomings that make them less than ideal for general-purpose computing. The Garp architecture combines reconfigurable hardware with a standard MIPS processor on the same die to retain the better features of both. Novel aspects of the architecture are presented, as well as a prototype software environment and preliminary performance results. Compared to an UltraSPARC, a Garp of similar technology could achieve speedups ranging from a factor of 2 to as high as a factor of 24 for some useful applications.
Two fast algorithms for static test sequence compaction are proposed for sequential circuits. The algorithms are based on the observation that test sequences traverse a small set of states, and that some states are frequently revisited throughout the application of a test set. Subsequences that start and end on the same state may be removed if the necessary and sufficient conditions are met for them. The techniques require only two fault simulation passes and are applied to test sequences generated by various test generators, resulting in significant compaction very quickly for circuits that have many revisited states.
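The state-revisit observation can be sketched as follows. This is not the paper's algorithm: the paper needs only two fault simulation passes, whereas this sketch calls a hypothetical `keeps_coverage` callback on every trial removal as a stand-in for the removal conditions. The function names and the representation of the state trace (one state before each vector, plus the final state) are assumptions.

```python
def revisit_candidates(states):
    """Return (i, j) pairs, i < j, where the state before vector i equals
    the state before vector j, so vectors[i:j] start and end on one state."""
    last_seen = {}
    pairs = []
    for idx, state in enumerate(states):
        if state in last_seen:
            pairs.append((last_seen[state], idx))
        last_seen[state] = idx
    return pairs


def compact(vectors, states, keeps_coverage):
    """Greedily drop state-preserving subsequences accepted by keeps_coverage.

    `states` has len(vectors) + 1 entries. `keeps_coverage(trial)` is a
    hypothetical check that the shortened sequence still detects every fault.
    """
    vectors, states = list(vectors), list(states)
    changed = True
    while changed:
        changed = False
        for i, j in revisit_candidates(states):
            trial = vectors[:i] + vectors[j:]
            if keeps_coverage(trial):
                vectors = trial
                # states[i] == states[j], so the spliced trace stays consistent.
                states = states[:i] + states[j:]
                changed = True
                break
    return vectors
```

Because each accepted removal shortens the sequence by at least one vector, the loop always terminates; circuits with many revisited states offer more candidate pairs and therefore more room for compaction.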
RISC-based Massively Parallel Processors (MPPs) often show low efficiency in real-world applications because of cache miss penalty, insufficient throughput of the memory system, and poor inter-processor communication performance. Hitachi's SR2201, an MPP scalable up to 2048 processors and 600 GFLOPS peak performance, overcomes these problems by introducing three novel features. First, its processor, the 150 MHz HARP-IE, addresses the cache miss penalty through "pseudo vector processing" (PVP). In PVP, data is loaded by prefetching to a special register bank, bypassing the cache. Second, a multi-bank memory architecture that operates like a pipeline eliminates the memory system bottleneck. Third, inter-processor communication achieves high performance on the three-dimensional crossbar network, using a "remote DMA transfer" protocol and hardware-based cache coherency. As a result of these improvements, the SR2201 achieved 220.4 GFLOPS with 1024 processors in the LINPACK benchmark, which is almost 72% of the peak performance.
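The efficiency figure can be checked with a short calculation. The sketch assumes each processor completes two floating-point operations per cycle at 150 MHz (300 MFLOPS per processor); that per-cycle rate is not stated in the abstract and is an assumption chosen to be consistent with the rounded 600 GFLOPS peak quoted for 2048 processors.

```python
CLOCK_HZ = 150e6       # processor clock, from the abstract
FLOPS_PER_CYCLE = 2    # assumption: two floating-point operations per cycle


def peak_gflops(processors):
    """Aggregate peak performance in GFLOPS under the stated assumption."""
    return processors * CLOCK_HZ * FLOPS_PER_CYCLE / 1e9


def efficiency(achieved_gflops, processors):
    """Fraction of peak performance actually achieved."""
    return achieved_gflops / peak_gflops(processors)


linpack_efficiency = efficiency(220.4, 1024)  # about 0.717
```

Under this assumption the 1024-processor peak is 307.2 GFLOPS, so 220.4 GFLOPS corresponds to roughly 71.7% of peak, in line with the "almost 72%" stated above.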