ISBN (digital): 9798331506476
ISBN (print): 9798331506483
The GPU has been successfully used for diverse emerging compute-intensive applications, including imaging, computer vision, and, more recently, deep learning. To offer high performance for such applications, it is provisioned with massive Register Files (RFs) to exploit high Thread-Level Parallelism (TLP). As RFs are designed to store thousands of thread contexts and provide very high bandwidth for an uninterrupted supply of operands to hundreds of compute units, they have become one of the most power-hungry components in the GPU. Meanwhile, faced with the end of Dennard scaling, it is more important than ever to innovate the microarchitecture of (power-constrained) GPUs to continuously improve the performance of future compute-intensive applications. In this work, we propose WarpedCompaction, a GPU microarchitecture designed to utilize given RFs more efficiently, instead of relying on larger-capacity and/or higher-bandwidth RFs, to further improve GPU performance. Specifically, first, we reverse-engineer the latest GPU's RF organization through microbenchmarking. This uncovers that each sub-core within a streaming multiprocessor contains only two dual-ported RF banks, and that accesses to these banks are arbitrated solely based on register IDs. We also reveal that, despite these modest configurations, the RF banks are largely underutilized, staying inactive 33.5% of the time. Second, we observe that previously proposed RF optimization techniques, data forwarding and dead register elimination, cannot address this underutilization problem, mainly due to (R1) insufficient RF access requests from a limited number of Operand Collector Units (OCUs) and (R2) inefficient operand distribution by the conventional RF bank arbitration. Third, we present two architectural solutions to tackle the observed inefficiency: (S1) OCU sharing and (S2) a skewed arbitrator. Building on enhanced OCU early allocation for partially ready instructions and operand forwarding…
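To make the bank-arbitration finding concrete, the following is a small, illustrative Python model, not the paper's methodology or simulator: it assumes a modulo mapping of register IDs onto the two dual-ported banks and a synthetic request stream (both assumptions for illustration), and measures how often banks sit idle under such ID-only arbitration.

```python
# Toy model of the reverse-engineered RF organization described above:
# two dual-ported RF banks per sub-core, with requests steered to a bank
# purely by register ID. The modulo mapping, request mix, and idle-cycle
# metric are illustrative assumptions, not details from the paper.
import random

NUM_BANKS = 2          # per the reverse-engineered organization
PORTS_PER_BANK = 2     # dual-ported banks

def bank_of(reg_id: int) -> int:
    """Arbitration based solely on register ID (assumed modulo mapping)."""
    return reg_id % NUM_BANKS

def simulate(cycles: int = 10_000, max_requests_per_cycle: int = 3) -> float:
    """Return the fraction of bank-cycles that stay idle."""
    random.seed(0)
    idle_bank_cycles = 0
    for _ in range(cycles):
        # A few OCUs issue operand reads this cycle (random register IDs).
        n = random.randint(0, max_requests_per_cycle)
        requests = [random.randint(0, 255) for _ in range(n)]
        served = {b: 0 for b in range(NUM_BANKS)}
        for reg in requests:
            b = bank_of(reg)
            if served[b] < PORTS_PER_BANK:
                served[b] += 1  # port granted; excess requests would stall
        idle_bank_cycles += sum(1 for b in served if served[b] == 0)
    return idle_bank_cycles / (cycles * NUM_BANKS)

if __name__ == "__main__":
    print(f"idle bank fraction: {simulate():.1%}")
```

Even this crude model shows the mechanism: with few outstanding requests per cycle and ID-only steering, some banks routinely receive no request while others are oversubscribed, which is the underutilization the OCU-sharing and skewed-arbitrator solutions target.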
ISBN (digital): 9798331506476
ISBN (print): 9798331506483
Integer Linear Programming (ILP) is an important mathematical approach for solving time-sensitive real-life optimization problems, including network routing, map routing, and traffic scheduling. However, the algorithms for solving ILPs are typically sparse and branch-intensive, and not CPU/GPU friendly. In the paper "What could a million cores do to solve Integer programs", Koch et al. [40] presented data illustrating that ILP applications take tens of hours of execution time even on the largest parallel computers. Long execution time is a problem because many real-life applications need a decision in seconds or minutes. Widely used ILP solvers, like Gurobi (optimized for CPUs), perform software-based optimizations to handle the inherent sparsity in ILPs but still do not meet decision thresholds because of the limited throughput of CPUs. GPUs are suited for large dot-product computation; however, GPU-based ILP solvers also do not meet decision thresholds because (i) the GPU is not sparsity friendly and (ii) the GPU incurs thread divergence on branches, resulting in under-utilization of streaming engines and periodic host-GPU interaction. We propose SPARK, a sparsity-aware, reuse-aware, energy-efficient, reconfigurable, near-cache ILP architecture that (i) reconfigures the existing L1 cache present in CPUs to perform near-cache acceleration, with easy integration into the baseline CPU pipeline at minimal area overhead (~1.4% of a CPU), (ii) performs near-cache sparsity detection and sparsity-aware compute, reducing the number of insignificant computations and data-movement energy overheads, (iii) leverages the computational patterns present in algorithms used for solving ILPs to realize a reuse-aware architecture, and (iv) is applicable to solving sparse and dense ILPs and LPs (Linear Programs). We observe 15x/20x and 152x/740x performance/energy improvements over AMD's Zen3 CPU and NVIDIA's Tesla V100 GPU for sparse rea…
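The branch-intensive, divergence-prone control flow the abstract describes is visible in the standard branch-and-bound loop that ILP solvers build on. Below is a minimal, illustrative Python sketch of textbook branch-and-bound with an LP relaxation solved by SciPy's linprog; it is not SPARK's architecture, and the example problem data are invented for illustration.

```python
# Minimal branch-and-bound sketch (standard textbook algorithm, not SPARK)
# for a maximization ILP: max c^T x  s.t.  A x <= b, x integer within bounds.
# Each node spawns data-dependent subproblems, which is exactly the
# irregular control flow that causes GPU thread divergence.
import math
from scipy.optimize import linprog

def branch_and_bound(c, A, b, bounds):
    best_val, best_x = -math.inf, None
    stack = [bounds]                     # each node = a box of variable bounds
    while stack:
        node = stack.pop()
        # LP relaxation: linprog minimizes, so negate c to maximize.
        res = linprog([-ci for ci in c], A_ub=A, b_ub=b, bounds=node)
        if not res.success or -res.fun <= best_val:
            continue                     # infeasible, or pruned by the bound
        frac = next((i for i, v in enumerate(res.x)
                     if abs(v - round(v)) > 1e-6), None)
        if frac is None:                 # integral solution: update incumbent
            best_val, best_x = -res.fun, [round(v) for v in res.x]
            continue
        v = res.x[frac]                  # branch on the fractional variable
        lo, hi = node[frac]
        down = list(node); down[frac] = (lo, math.floor(v))
        up = list(node); up[frac] = (math.ceil(v), hi)
        stack += [down, up]
    return best_val, best_x

# Tiny invented example: max 5x + 4y  s.t.  6x + 4y <= 24,  x + 2y <= 6
val, x = branch_and_bound([5, 4], [[6, 4], [1, 2]], [24, 6],
                          [(0, 10), (0, 10)])
print(val, x)   # optimal integer solution: 20.0 at x = [4, 0]
```

Note how the search tree's shape depends entirely on which variable turns out fractional at each node; that data dependence is why such solvers map poorly onto SIMT hardware and motivate a near-cache, sparsity-aware design instead.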
This workshop focuses on understanding the implications of accelerators for the architectures and programming environments of future systems. It seeks to ground accelerator research through studies of application kernels or whole applications on such systems, as well as tools and libraries that improve the performance or productivity of applications targeting these systems. The goal of this workshop is to bring together researchers and practitioners involved in application studies for accelerators and other hybrid systems, to learn about the opportunities and challenges posed by future design trends for HPC applications and systems.
HIPS-HPGC 2005 is a full-day workshop focusing on high-performance grid computing and high-level parallel programming models. The papers deal with component models and service-based systems for grids, with an emphasis on experiences with existing systems. The papers also report on the state of the art of grid applications, for both academic and industrial problems.
ISBN (print): 9783540680673
The proceedings contain 84 papers. The special focus in this conference is on Architectures, Networks, Languages and Algorithms. The topics include: the impact of multicore on math software and exploiting single-precision computing to obtain double-precision results; trends in high performance computing for industry and research; emerging technologies and large-scale facilities in HPC; architecture and performance of a dynamic offloader for cluster networks; a dynamic anchor-based mobility management scheme for mobile IP networks; choosing a load-balancing scheme for agent-based digital libraries; optimum network-on-chip architectures for video object plane decoder design; a hardware NIC scheduler to guarantee QoS on high-performance servers; multihomed routing in multicast service overlay networks; a region-based multicast routing protocol with a dynamic address allocation scheme in mobile ad-hoc networks; limiting the effects of deafness and hidden-terminal problems in directional communications; enforcing dimension-order routing in on-chip torus networks without virtual channels; a distributed backbone formation algorithm for mobile ad hoc networks; randomized leader election protocols in noisy radio networks with a single transceiver; a hybrid intelligent preventive fault-tolerant QoS unicast routing scheme in IP over DWDM optical internet; an adaptive technique for automatic communication access pattern discovery applied to data prefetching in distributed applications using neural networks and stochastic models; process scheduling using ant colony optimization techniques; and interference-aware dynamic subchannel allocation in a multi-cellular OFDMA system based on traffic situation.