ISBN: 9781450351140 (digital)
ISBN: 9781450351140 (print)
Achieving high performance on modern systems is challenging. Even with a detailed profile from a performance tool, writing or refactoring a program to remove its performance issues is still a daunting task for application programmers: it demands considerable program optimization expertise that is often system specific. Vendors often provide detailed optimization guides to assist programmers in the process. However, these guides are frequently hundreds of pages long, making it difficult for application programmers to master and memorize all the rules and guidelines and apply them properly to a specific problem instance. In this work, we develop a framework named Egeria to alleviate the difficulty. Through Egeria, one can easily construct an advising tool for a certain high-performance computing (HPC) domain (e.g., GPU programming) by providing Egeria with an optimization guide or other related documents for the target domain. An advising tool produced by Egeria provides a concise list of essential rules automatically extracted from the documents. At the same time, the advising tool serves as a question-answer agent that can interactively offer suggestions for specific optimization questions. Egeria is made possible through a distinctive multi-layered design that leverages natural language processing techniques and extends them with knowledge of HPC domains and of how to extract information relevant to code optimization. Experiments on CUDA, OpenCL, and Xeon Phi programming guides demonstrate, both qualitatively and quantitatively, the usefulness of Egeria for HPC.
ISBN: 9781538610329 (print)
Input binarization has been shown to be an effective way to accelerate networks. However, previous binarization schemes can be regarded as simple pixel-wise thresholding operations (i.e., order-one approximation) and suffer a large accuracy loss. In this paper, we propose a high-order binarization scheme, which achieves a more accurate approximation while still retaining the advantage of binary operations. In particular, the proposed scheme recursively performs residual quantization and yields a series of binary input images with decreasing magnitude scales. Accordingly, we propose high-order binary filtering and gradient propagation operations for both forward and backward computations. Theoretical analysis shows that the proposed method provides an approximation error guarantee. Extensive experimental results demonstrate that the proposed scheme yields high recognition accuracy while being accelerated.
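The recursive residual quantization described above can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes each order binarizes the current residual by sign and scales it by the mean absolute residual, which is one common choice. The names `high_order_binarize`, `reconstruct`, and `l2_error` are hypothetical.

```python
def high_order_binarize(values, order):
    """Approximate values by sum_k scale_k * code_k with code_k in {-1, +1}.

    Assumption: scale_k is the mean absolute residual at step k.
    """
    residual = [float(v) for v in values]
    scales, codes = [], []
    for _ in range(order):
        # Order-one step: threshold the residual at zero.
        code = [1 if r >= 0 else -1 for r in residual]
        scale = sum(abs(r) for r in residual) / len(residual)
        scales.append(scale)
        codes.append(code)
        # What is still unexplained is quantized again in the next round.
        residual = [r - scale * c for r, c in zip(residual, code)]
    return scales, codes


def reconstruct(scales, codes):
    """Sum the scaled binary codes back into a real-valued approximation."""
    n = len(codes[0])
    return [sum(s * c[i] for s, c in zip(scales, codes)) for i in range(n)]


def l2_error(values, approx):
    """Euclidean distance between the input and its approximation."""
    return sum((v - a) ** 2 for v, a in zip(values, approx)) ** 0.5


x = [0.5, -1.5, 2.0, -1.0]
errors = []
for order in (1, 2, 3):
    scales, codes = high_order_binarize(x, order)
    errors.append(l2_error(x, reconstruct(scales, codes)))
```

With this choice of scale, each extra order cannot increase the squared approximation error, because the step removes the number of elements times the squared scale from the squared residual norm. On the small input above, the error shrinks with every order.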
ISBN: 9781538622544 (print)
Resistive random access memory (RRAM) is a promising candidate for high-density storage-class memory when organized in a crossbar structure. However, the wire resistance in a crossbar array causes the IR drop problem, which makes write latency nonuniform across the array. In a large crossbar array, the write latency differs greatly even within the same row. Since the write latency of a region is determined by its slowest write unit, the conventional group-by-row region partition and addressing scheme is suboptimal for improving the overall performance of RRAM. In this work, we present DAWS, a novel RRAM architecture that exploits intrinsic features of the crossbar structure. We first build a circuit model to analyze the voltage distribution and write latency distribution in a crossbar array. Then we propose a voltage bias scheme that optimizes write latency by minimizing the IR drop path. We further present block diagonal partition to narrow the variance of write latency within each region, which reduces the write latency of each region. Moreover, we provide block diagonal addressing to make the write latency increase monotonically with the physical address, which benefits address mapping and memory allocation. We also design diagonal writing and diagonal swapping to overlap SET and RESET operations by applying a particular voltage bias pattern that exploits row-level parallelism, thereby halving the number of write operations. The experimental results show that DAWS can reduce memory access latency by 24.0% and improve system performance by 29.7% over an aggressive baseline.
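The observation that a region is as slow as its slowest write unit can be illustrated with a toy model. The sketch below is not the DAWS circuit model or its block diagonal partition. It assumes a hypothetical latency that grows with the sum of row and column index (a stand-in for wire distance from the drivers) and compares grouping cells by row with grouping cells of similar latency, which in this toy model lie along anti-diagonals. All function names are illustrative.

```python
def toy_latency(i, j):
    """Hypothetical write latency: grows with row index plus column index."""
    return i + j


def row_regions(n):
    """Conventional partition: one region per row."""
    return [[(i, j) for j in range(n)] for i in range(n)]


def latency_sorted_regions(n, latency):
    """Group cells of similar latency: sort all cells, cut into n equal regions."""
    cells = sorted(
        ((i, j) for i in range(n) for j in range(n)),
        key=lambda cell: latency(*cell),
    )
    return [cells[k:k + n] for k in range(0, n * n, n)]


def region_latencies(regions, latency):
    """A region is as slow as its slowest cell."""
    return [max(latency(i, j) for i, j in region) for region in regions]


n = 4
by_row = region_latencies(row_regions(n), toy_latency)
by_latency = region_latencies(latency_sorted_regions(n, toy_latency), toy_latency)
mean_by_row = sum(by_row) / n
mean_by_latency = sum(by_latency) / n
```

Under this assumed model, the similar-latency grouping gives a lower mean region latency than the row grouping, and region latency rises monotonically with region index, which mirrors the motivation stated for block diagonal partition and addressing.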
Dynamic analysis is a powerful technique to detect correctness, performance, and security problems, in particular for programs written in dynamic languages, such as JavaScript. To catch mistakes as early as possible, developers should run such analyses regularly, e.g., by analyzing the execution of a regression test suite before each commit. Unfortunately, the high overhead of these analyses makes this approach prohibitively expensive, hindering developers from benefiting from the power of heavyweight dynamic analysis. This paper presents change-aware dynamic program analysis, an approach to make a common class of dynamic analyses change-aware. The key idea is to identify parts of the code affected by a change through a lightweight static change impact analysis, and to focus the dynamic analysis on these affected parts. We implement the idea based on the dynamic analysis framework Jalangi and evaluate it with 46 checkers from the DLint and JITProf tools. Our results show that change-aware dynamic analysis reduces the overall analysis time by 40%, on average, and by at least 80% for 31% of all commits.
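The key idea can be sketched at function granularity. This is an illustration only, not the paper's analysis or the Jalangi API: it assumes the impact set is the changed functions plus everything that transitively calls them, computed from a hypothetical reverse call graph `callers_of`, and that only functions in that set would be instrumented.

```python
from collections import deque


def affected_functions(callers_of, changed):
    """Return the changed functions plus all of their transitive callers.

    callers_of maps a function name to the names of functions that call it
    (an assumed, precomputed reverse call graph).
    """
    affected = set(changed)
    queue = deque(changed)
    while queue:
        fn = queue.popleft()
        for caller in callers_of.get(fn, ()):
            if caller not in affected:
                affected.add(caller)
                queue.append(caller)
    return affected


def select_for_instrumentation(all_functions, affected):
    """Keep source order, but analyze only functions in the impact set."""
    return [fn for fn in all_functions if fn in affected]


# Hypothetical program: main calls load and render, load calls parse,
# render calls log.
callers_of = {
    "parse": ["load"],
    "load": ["main"],
    "render": ["main"],
    "log": ["render"],
}
all_functions = ["main", "load", "parse", "render", "log"]

impact = affected_functions(callers_of, {"parse"})
to_instrument = select_for_instrumentation(all_functions, impact)
```

In this example a change to `parse` selects `main`, `load`, and `parse` for dynamic analysis, while `render` and `log` run uninstrumented. Skipping code outside the impact set is where the savings in analysis time would come from.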
ISBN: 9781538652589; 9781538652572 (print)
FPGA placement problems have become more complicated because they must account for both area and time constraints. Placement remains one of the most difficult problems as FPGA designs become larger and more complex. Because FPGAs are programmable in nature, they are an ideal fit for many different markets, such as aerospace, defense, audio, automotive, broadcast, industrial, medical, security, video and image processing, and wired and wireless communications. FPGAs are also becoming popular among large companies and researchers who apply them to high-performance computing and deep learning, as they provide better performance, flexible programmability, better cost, etc. In this paper, we present a tree-based placement algorithm for homogeneous FPGAs. By applying our algorithm to a set of benchmark circuits, we have effectively reduced the placement cost. We compared our results with VPR, which uses a simulated annealing approach, and our results are comparatively better.
ISBN: 9781509059577 (print)
In this paper, an improved changing-topology moving mesh method is developed in OpenFOAM (a widely used CFD software package) to solve the moving boundary problem. We use mesh smoothing and edge swapping to improve the mesh quality, together with edge bisection and edge contraction to control the mesh resolution. To increase the efficiency of mesh motion, a local refinement algorithm is implemented. In addition, to ensure that the quality of the regenerated mesh is higher than that of the original mesh, an improved checking algorithm is applied during the mesh motion. The effectiveness of our method is demonstrated by simulations of the NACA0012 airfoil with translation, rotation, and fish-like undulating locomotion.
ISBN: 9781509059164 (print)
The existing HPC capabilities in the MICHELLE ES PIC code are being supplanted with new distributed-memory, MPI-based domain decomposition and per-node accelerators such as GPUs and multicore processing. New interfaces are also being built between MICHELLE and existing DOD software tools such as CAPSTONE, GSB, and ParaView to form the next-generation framework for an efficient design and optimization workflow. This is an evolutionary process, and this paper reports on the latest progress and discusses applicable algorithms and implementations.
ISBN: 9781538618233 (print)
Loop pipelining is an important optimization in high-level synthesis (HLS) because it allows successive loop iterations to be overlapped during execution. While current HLS pipelining approaches achieve high performance for loops with regular and statically analyzable program patterns, it remains challenging to pipeline loops with irregular memory accesses, irregular dependence patterns, and unbalanced workloads. The lack of support for dynamic program behaviors results in conservatively synthesized pipelines that sacrifice performance to maintain presumed regularity. In this paper, we survey some of our recent work that addresses these challenges using a coordinated dynamic-static approach for enabling high-throughput pipelining of irregular loops. We propose to augment the HLS pipeline with dynamic scheduling to adapt to data-dependent behaviors, while employing static compile-time optimizations to minimize the hardware overhead associated with runtime optimization. Experimental results demonstrate that our proposed techniques can significantly improve effective pipeline throughput while conserving hardware resources.
ISBN: 9781943580347 (print)
We propose a Fast-Reconfigurable Optical Interconnect (FROI) architecture enabled by time-synchronized node coordination for high performance computing. Experimental results show that an ultra-low reconfiguration time of 45.4 μs can be achieved after traffic pattern changes.
In this paper, we propose a generalized buffer-state-based relaying protocol in the context of finite buffer-aided cooperative systems. The proposed relaying scheme relies on two concepts: the simultaneous activation ...