ISBN: (print) 0780366859
To implement chip designs on satisfactory target architectures, architecture exploration should be done at higher levels of abstraction, in the earliest design stages. Using SpecC, an executable system-level design language, system-level architecture exploration can proceed smoothly while the system specification is being created. This paper introduces a SpecC methodology for system-level architecture exploration and illustrates it with the design of a JPEG encoder.
We present the design, implementation, and evaluation of single-assignment data structures and of a software-controlled cache in an existing multi-threaded architecture platform, the Efficient Architecture for Running Threads (EARTH). The I-Structure Software-Controlled Cache (ISSC) exploits temporal and spatial locality of EARTH split-phased memory transactions for single-assignment memory references. Our experimental evaluation indicates that the caching mechanism for single-assignment storage makes the EARTH memory system more robust to variations in the latency of memory operations. As a consequence, the system can be ported to a wider range of machine platforms and deliver speedup for both regular and irregular applications.
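As a rough illustration of why single-assignment storage is easy to cache, the sketch below models a write-once array with deferred reads and a local cache in front of it. It is not the ISSC implementation: the class names `IStructure` and `CachingReader`, the callback interface, and the `remote_reads` counter are all hypothetical. The point it tries to show is that a slot that can be written only once never needs cache invalidation.

```python
class IStructure:
    """Write-once array: each slot is written at most once; early reads wait."""

    def __init__(self, size):
        self._full = [False] * size
        self._values = [None] * size
        self._deferred = [[] for _ in range(size)]  # readers waiting per slot

    def read(self, index, callback):
        """Deliver the slot's value now, or later when the slot is written."""
        if self._full[index]:
            callback(self._values[index])
        else:
            self._deferred[index].append(callback)

    def write(self, index, value):
        """Fill an empty slot and release every deferred reader."""
        if self._full[index]:
            raise ValueError(f"slot {index} already written")
        self._values[index] = value
        self._full[index] = True
        waiting, self._deferred[index] = self._deferred[index], []
        for callback in waiting:
            callback(value)


class CachingReader:
    """Local cache in front of an IStructure (hypothetical helper).

    Cached values never go stale because a slot cannot be rewritten.
    """

    def __init__(self, istructure):
        self._remote = istructure
        self._cache = {}
        self.remote_reads = 0  # requests that had to go to the remote store

    def read(self, index, callback):
        if index in self._cache:
            callback(self._cache[index])
            return
        self.remote_reads += 1

        def fill(value):
            self._cache[index] = value
            callback(value)

        self._remote.read(index, fill)
```

A read issued before the write is simply parked and completes when the value arrives, which loosely mirrors a split-phase transaction; a second read of the same slot is served from the local cache.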
The development of fine-grain multi-threaded program execution models has created an interesting challenge: how to partition a program into threads that can exploit machine parallelism, achieve latency tolerance, and...
This paper makes two important contributions. First, the paper investigates the performance implications of data placement in OpenMP programs running on modern NUMA multiprocessors. Data locality and minimization of the rate of remote memory accesses are critical for sustaining high performance on these systems. We show that, due to the low remote-to-local memory access latency ratio of contemporary NUMA architectures, reasonably balanced page placement schemes, such as round-robin or random distribution, incur modest performance losses. Second, the paper presents a transparent, user-level page migration engine that can gain back any performance loss stemming from suboptimal placement of pages in iterative OpenMP programs. The main body of the paper describes how our OpenMP runtime environment uses page migration to implement implicit data distribution and redistribution schemes without programmer intervention. Our experimental results verify the effectiveness of the proposed framework and provide a proof of concept that data distribution directives need not be introduced in OpenMP, which preserves the simplicity and portability of the programming model.
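To make the notion of a balanced placement concrete, the following sketch assigns pages to NUMA nodes round-robin and measures what fraction of a given access trace is remote. It is only an illustration under simplifying assumptions: the function names, the `(node, page)` trace format, and the idea of scoring a placement by its remote-access fraction are hypothetical and are not taken from the paper's runtime system.

```python
def round_robin_placement(num_pages, num_nodes):
    """Return a list mapping each page index to its home node, round-robin."""
    return [page % num_nodes for page in range(num_pages)]


def remote_fraction(accesses, placement):
    """Fraction of accesses whose page lives on a node other than the accessor.

    accesses: iterable of (node, page) pairs (hypothetical trace format).
    placement: list mapping page index to home node.
    """
    total = 0
    remote = 0
    for node, page in accesses:
        total += 1
        if placement[page] != node:
            remote += 1
    return remote / total if total else 0.0
```

A page migration engine of the kind described above would, in effect, update `placement` between iterations so that this fraction falls; the sketch only covers the measurement side.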
The very long and highly variable latencies in the deep memory hierarchy of a petaflop-scale architecture design, such as the Hybrid Technology Multi-Threaded architecture (HTMT) [13], present a new challenge to its p...
The Hybrid Technology Multi-Threading project is a long-term study of the feasibility of combining several emerging technologies to reach 1 petaFLOPS within ten years. HTMT will combine high-speed superconductor processors, semiconductor memories with built-in processors, high-speed optical interconnects, and high-density holographic storage. While there are major challenges in all aspects of this project, those in processor architecture are the focus of this paper. Fundamental differences between RSFQ circuits and conventional semiconductor circuits, including a radical jump in clock speed, make today's processor design approaches inappropriate for HTMT. Sequential instruction dispatching, even within the lowest programming unit (a strand), will lead to unacceptably high latencies, hence poor performance. We propose alternative processor designs which use fine-grain synchronizations between individual instructions in order to avoid these bottlenecks.
Statistical multiplexing in packet-switched networks creates problems for packetized voice streams by introducing variable delays on delivered packets. The resulting jitter needs to be filtered so that received voice packets can be reconstructed as a continuous stream at the receiver. One common approach to reconstruction is to play back the received voice data after a delay offset from the departure time at the source of the packet stream. While the added delay helps filter jitter, one cannot introduce too much delay; otherwise, interactivity suffers. This paper presents a new technique to find the necessary delay offset (or play-back delay) to recreate the original voice data stream. This technique gives the user control over the fraction of packets that should arrive in time to be played back, so that the added play-back delay can be effectively minimized.
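The trade-off between play-back delay and the fraction of on-time packets can be illustrated with a simple empirical quantile. This is not the technique proposed in the paper, whose details the abstract does not give; it is a minimal sketch that assumes a list of already measured one-way delays, and the function name `playout_delay` and its parameters are hypothetical.

```python
import math


def playout_delay(delays_ms, on_time_fraction):
    """Smallest delay offset covering the requested fraction of observed packets.

    delays_ms: measured network delays of past packets, in milliseconds.
    on_time_fraction: target share of packets that must arrive in time, in (0, 1].
    """
    if not delays_ms:
        raise ValueError("need at least one delay sample")
    if not 0.0 < on_time_fraction <= 1.0:
        raise ValueError("on_time_fraction must be in (0, 1]")
    ordered = sorted(delays_ms)
    # Number of samples that must have a delay no larger than the chosen offset.
    needed = math.ceil(on_time_fraction * len(ordered))
    return ordered[needed - 1]
```

Lowering `on_time_fraction` lets a few late packets be dropped in exchange for a shorter offset; setting it to 1.0 forces the offset up to the largest delay observed, so a single outlier dominates.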
Energy dissipated in on-chip caches represents a substantial portion of the energy budget of today's processors. Extrapolating current trends, this portion is likely to increase in the near future, since the devices devoted to the caches occupy an increasingly large percentage of the total chip area. We extend the work proposed by J. Kin et al. (1997), in which an extra, small cache (called a filter cache) is inserted between the CPU data path and the L1 cache and serves to filter most of the references initiated by the CPU. In our scheme, the compiler is used to generate code that exploits the new memory hierarchy and reduces the possibility of a miss in the extra cache. Experimental results across a wide range of SPEC95 benchmarks show that this cache, which we call the L-Cache, has a small performance overhead with respect to the scheme without any extra caches, and provides substantial energy savings. The L-Cache is placed between the CPU and the I-Cache. The D-Cache subsystem is not modified. Since the L-Cache is much smaller than the I-Cache, and thus has a shorter access time, this scheme can also be used for performance improvements, provided that the hit rate in the L-Cache is very high. In our experimental results, we show that the L-Cache does indeed improve performance in some cases.
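The filtering effect of a small cache in front of the instruction cache can be sketched with a toy direct-mapped model. The sketch does not model the compiler support or the energy accounting described above; the function name, the default of 8 lines of 16 bytes, and the address traces are illustrative assumptions only.

```python
def filter_cache_hits(addresses, num_lines=8, line_bytes=16):
    """Count (hits, misses) of a small direct-mapped cache placed before L1.

    addresses: iterable of byte addresses of fetched instructions.
    num_lines, line_bytes: hypothetical geometry of the small cache.
    """
    tags = [None] * num_lines  # block number currently held by each line
    hits = 0
    misses = 0
    for addr in addresses:
        block = addr // line_bytes
        index = block % num_lines
        if tags[index] == block:
            hits += 1
        else:
            misses += 1
            tags[index] = block  # on a miss, fill the line from the next level
    return hits, misses
```

For a tight loop of eight 4-byte instructions executed ten times, only the first touch of each 16-byte line misses, so nearly all fetches are absorbed by the small cache. Two addresses that map to the same line and alternate, by contrast, miss every time, which is the kind of conflict that code placement would aim to avoid.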
SDL is currently gaining interest as a system-level specification language for HW/SW codesign. Automated synthesis of SDL into hardware has so far suffered from efficiency problems. The investigations of the resource usage of SDL-to-VHDL designs presented in this paper identify two key challenges: minimizing the overhead introduced by the SDL process infrastructure, and choosing the appropriate synthesis method. This paper presents a framework for SDL hardware synthesis in which VHDL code generation, high-level synthesis, and RT-level synthesis are combined. A configurable run-time environment implements services such as data handling and message passing in efficient, hand-coded library components, which take into account properties of the target architecture. For these components, RT-level synthesis was found to be suitable. The behavior of each SDL process, on the other hand, is freely specified by the system designer. Depending on the type of application (i.e., complex data-oriented or control-oriented), either high-level synthesis, RT-level synthesis, or a combination of both can prove to be optimal.