Resizable caches can trade-off capacity for access speed to dynamically match the needs of the workload. In Simultaneous Multi-Threaded (SMT) cores, the caching needs can vary greatly across the number of threads and ...
详细信息
Resizable caches can trade-off capacity for access speed to dynamically match the needs of the workload. In Simultaneous Multi-Threaded (SMT) cores, the caching needs can vary greatly across the number of threads and their characteristics, offering opportunities to dynamically adjust cache resources to the workload. In this paper we propose the use of resizable caches in order to improve the performance of SMT cores, and introduce a new control algorithm that provides good results independent of the number of running threads. In workloads with a single thread, the resizable cache control algorithm should optimize for cache miss behavior because misses typically form the critical path. In contrast, with several independent threads running, we show that optimizing for cache hit behavior has more impact, since large SMT workloads have other threads to run during a cache miss. Moreover, we demonstrate that these seemingly diametrically opposed policies can be simultaneously satisfied by using the harmonic mean of the per-thread speedups as the metric to evaluate the system performance, and to smoothly and naturally adjust to the degree of multithreading.
Real time rendering of three-dimensional scenes in high photorealistic detail is a hard task, such as in the Ray Tracing rendering algorithm. However, parallel implementations of Ray Tracing have been enabling real ti...
详细信息
Real time rendering of three-dimensional scenes in high photorealistic detail is a hard task, such as in the Ray Tracing rendering algorithm. However, parallel implementations of Ray Tracing have been enabling real time performance, as the algorithm is embarrassingly parallel. Thus, a custom parallel design in hardware is likely to achieve an acceptable performance. In this paper, we propose a hardware parallel architecture capable of dealing with the main desirable features of Ray Tracing, such as shadows and reflection effects, imposing low area cost and acceptable rendering performance.
Moore's law will grant computer architects ever more transistors for the foreseeable future, and the challenge is how to use them to deliver efficient performance and flexible programmability. We propose a many-core ...
详细信息
Moore's law will grant computer architects ever more transistors for the foreseeable future, and the challenge is how to use them to deliver efficient performance and flexible programmability. We propose a many-core architecture, Godson- T, to attack this challenge. On the one hand, Godson-T features a region-based cache coherence protocol, asynchronous data transfer agents and hardware-supported synchronization mechanisms, to provide full potential for the high efficiency of the on-chip resource utilization. On the other hand, Godson-T features a highly efficient runtime system, a Pthreadslike programming model, and versatile parallel libraries, which make this many-core design flexibly programmable. This hardware/software cooperating design methodology bridges the high-end computing with mass programmers. Experimental evaluations are conducted on a cycle-accurate simulator of Godson-T. The results show that the proposed architecture has good scalability, fast synchronization, high computational efficiency, and flexible programmability.
Current general-purpose memory allocators do not provide sufficient speed or flexibility for modern high-performance applications. To optimize metrics like performance, memory usage and energy consumption, software en...
详细信息
Current general-purpose memory allocators do not provide sufficient speed or flexibility for modern high-performance applications. To optimize metrics like performance, memory usage and energy consumption, software engineers often write custom allocators from scratch, which is a difficult and error-prone process. In this paper, we present a flexible and efficient simulator to study Dynamic Memory Managers (DMMs), a composition of one or more memory allocators. This novel approach allows programmers to simulate custom and general DMMs, which can be composed without incurring any additional runtime overhead or additional programming cost. We show that this infrastructure simplifies DMM construction, mainly because the target application does not need to be compiled every time a new DMM must be evaluated. Within a search procedure, the system designer can choose the "best" allocator by simulation for a particular target application. In our evaluation, we show that our scheme will deliver better performance, less memory usage and less energy consumption than single memory allocators.
Dynamic reconfigurable systems can evolve under various conditions due to changes imposed either by the architecture, or by the applications, or by the environment. In such systems, the design process becomes more sop...
详细信息
Dynamic reconfigurable systems can evolve under various conditions due to changes imposed either by the architecture, or by the applications, or by the environment. In such systems, the design process becomes more sophisticated as all the design decisions have to be optimized in terms of runtime behaviors and values. Runtime mapping exploration allows to explore reconfigurable systems at runtime to optimize task mappings in order to adapt to the changing behavior of the application(s), the architecture, or the environment. Performing such explorations at runtime enables a system to be more efficient in terms of various design constraints such as performance, chip area, power consumption, etc. Towards this goal, in this paper, we present a model that facilitates runtime mapping exploration of reconfigurable architectures. A case study of an MJPEG application shows that the presented model can be used to perform runtime exploration of various functional and non-functional design parameters.
The efficient support of cache coherence is extremely important to design and implement many-core processors. In this paper, we propose a synchronization-based coherence (SBC) protocol to efficiently support cache coh...
详细信息
The efficient support of cache coherence is extremely important to design and implement many-core processors. In this paper, we propose a synchronization-based coherence (SBC) protocol to efficiently support cache coherence for shared memory many-core architectures. The unique feature of our scheme is that it doesnpsilat use directory at all. Inspired by scope consistency memory model, our protocol maintains coherence at synchronization point. Within critical section, processor cores record write-sets (which lines have been written in critical section) with bloom-filter function. When the core releases the lock, the write-set is transferred to a synchronization manager. When another core acquires the same lock, it gets the write-set from the synchronization manager and invalidates stale data in its local cache. Experimental results show that the SBC outperforms by averages of 5% in execution time across a suite of scientific applications. At the mean time, the SBC is more cost-effective comparing to directory-based protocol that requires large amount of hardware resource and huge design verification effort.
The network file system (NFS) protocol, as the de facto standard for sharing files in a distributed environment, has deployed Infiniband as the underlying transport of sunRPC, namely NFS over RDMA. In the current Read...
详细信息
The network file system (NFS) protocol, as the de facto standard for sharing files in a distributed environment, has deployed Infiniband as the underlying transport of sunRPC, namely NFS over RDMA. In the current Read-Write design of NFS over RDMA, NFS write performance is limited for not fully utilizing the features of Infiniband. In this paper, we take on the challenge of enhancing the write performance of NFS. We propose and evaluate a new design of sunRPC over RDMA, namely Write-Write design. To guarantee the security of our design, we propose an HCA-based memory protection extension of Infiniband. Evaluations show that our Write-Write design increases the kernel-to-kernel RPC bandwidth by 15~27%. In real disk test, our Write-Write design gains 15%~22% in multi-client benchmarks compared with the Read-Write design.
Instruction-grain lifeguards monitor executing programs at the granularity of individual instructions to quickly detect bugs and security attacks, but their fine-grain nature incurs high monitoring overheads. This art...
详细信息
暂无评论