Single-issue processor cores are very energy efficient but suffer from the von Neumann bottleneck, in that they must explicitly fetch and issue the loads/stores necessary to feed their ALU/FPU. Each instruction spent on moving data is a cycle not spent on computation, limiting ALU/FPU utilization to 33 percent on reductions. We propose "Stream Semantic Registers" (SSR) to boost utilization and increase energy efficiency. SSR is a lightweight, non-invasive RISC-V ISA extension which implicitly encodes memory accesses as register reads/writes, eliminating a large number of loads/stores. We implement the proposed extension in the RTL of an existing multi-core cluster and synthesize the design for a modern 22 nm technology. Our extension provides a significant 2x to 5x architectural speedup across different kernels at a small 11 percent increase in core area. Sequential code runs 3x faster on a single core, and 3x fewer cores are needed in a cluster to achieve the same performance. The utilization increase to almost 100 percent leads to a 2x energy efficiency improvement in a multi-core cluster. The extension reduces instruction fetches by up to 3.5x and instruction cache power consumption by up to 5.6x. Compilers can automatically map loop nests to SSRs, making the changes transparent to the programmer.
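To make the utilization argument concrete, the following is a minimal sketch (not taken from the paper) of a dot-product reduction: in the plain loop each iteration issues two loads and one fused multiply-add, so the FPU is busy for roughly one instruction in three, matching the 33 percent figure; with SSR, the two input streams would arrive implicitly through register reads, leaving essentially only the FMA per iteration.

    /* Illustrative only: a dot-product reduction whose per-iteration
     * instruction mix motivates the ~33 percent FPU utilization figure.
     * Names and structure are hypothetical, not from the paper. */
    #include <stddef.h>

    double dot(const double *a, const double *b, size_t n) {
        double acc = 0.0;
        for (size_t i = 0; i < n; ++i) {
            /* Conventional code: each iteration issues
             *   load a[i], load b[i], fma   -> FPU busy ~1 cycle in 3.
             * With SSR, a[i] and b[i] would be read directly from
             * stream-semantic registers, so only the fma remains and
             * FPU utilization approaches 100 percent. */
            acc += a[i] * b[i];
        }
        return acc;
    }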
We propose FusedCache, a two-level set-associative Racetrack memory (RM) cache design that exploits RM's high density to provide fast uniform access at one level and non-uniform access at the next. FusedCache is well suited for private L1/L2 caches: L1 data are kept aligned with the RM access points, while the remaining non-aligned data serve as L2. It uses traditional LRU eviction for L1 misses. Promotion and demotion between L1 and L2 are performed through shifts and, when necessary, background swap operations. These swap operations do not require physical stores or loads, making accesses both faster and more energy efficient. Further, unlike a traditional inclusive cache hierarchy, fused L1 cache lines naturally exist in L2, avoiding duplicated storage and tag structures, promotions, and evictions. L1 status on each track is strictly enforced by track LRU maintenance and background swapping. Our results demonstrate that compared to an iso-area L1 SRAM cache replacement, FusedCache improves application performance by 7 percent while reducing cache energy by 33 percent. Compared to an iso-capacity two-level (L1/L2) SRAM cache replacement, FusedCache provides similar performance with a dramatic 69 percent cache energy reduction. Compared to a TapeCache L1 scheme, FusedCache gains a 7 percent performance improvement with a 6 percent cache energy saving, which translates to a 13 percent improvement in energy-delay product.
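As a rough illustration of the access policy described above (a sketch under our own assumptions, not the authors' hardware implementation), a lookup that misses the aligned L1 position but hits elsewhere on the same track would be treated as an L2 hit and promoted to the access point, demoting the displaced L1 line back onto the track:

    /* Hypothetical sketch of FusedCache-style promotion on one track.
     * Position 0 is the racetrack access point (the "L1" slot); the
     * remaining positions hold the fused "L2" lines. Real hardware uses
     * shifts and background swaps; this only models the final placement. */
    #include <stdbool.h>
    #include <stddef.h>

    #define TRACK_LEN 8

    typedef struct {
        int line[TRACK_LEN];  /* cached line tags; line[0] is at the access point */
    } track_t;

    /* Returns true on a hit anywhere on the track. An L2 hit (pos > 0) is
     * promoted by swapping with the current L1 resident, mirroring the
     * shift/background-swap promotion described in the abstract. */
    bool track_access(track_t *t, int tag) {
        for (size_t pos = 0; pos < TRACK_LEN; ++pos) {
            if (t->line[pos] == tag) {
                if (pos > 0) {               /* L2 hit: promote to the access point */
                    int demoted = t->line[0];
                    t->line[0] = t->line[pos];
                    t->line[pos] = demoted;  /* demoted L1 line stays on-track as L2 */
                }
                return true;                 /* L1 hit, or promoted L2 hit */
            }
        }
        return false;                        /* miss: fill handled by LRU eviction */
    }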
Software-defined radio (SDR) clouds combine SDR concepts with cloud computing technology for designing and managing future base stations. They provide a scalable solution for the evolution of wireless communications. The authors focus on the resource management implications and propose a hierarchical approach for dynamically managing the real-time computing constraints of wireless communications systems that run on the SDR cloud.
ISBN (print): 9781450388160
GPUs rely on large register files to unlock thread-level parallelism for high throughput. Unfortunately, large register files are power hungry, making it important to seek new approaches to improve their utilization. This paper introduces a new register file organization for efficient register-packing of narrow integer and floating-point operands, designed to leverage advances in static analysis. We show that the hardware/software co-designed register file organization yields a performance improvement of up to 79%, and 18.6% on average, with a modest output-quality degradation.
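To illustrate the register-packing idea at a software level (a hypothetical sketch; the paper's mechanism lives in the GPU register file hardware and in compiler static analysis), two narrow 16-bit operands can share a single 32-bit register slot instead of occupying two full-width entries:

    /* Hypothetical illustration of packing two narrow (16-bit) integer
     * operands into one 32-bit register slot, the software analogue of
     * the register-file packing performed in hardware. */
    #include <stdint.h>

    static inline uint32_t pack16x2(uint16_t lo, uint16_t hi) {
        return (uint32_t)lo | ((uint32_t)hi << 16);  /* both operands in one slot */
    }

    static inline uint16_t unpack_lo(uint32_t r) { return (uint16_t)(r & 0xFFFFu); }
    static inline uint16_t unpack_hi(uint32_t r) { return (uint16_t)(r >> 16); }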