ISBN: 9781665423694 (print)
Emphasis on static timing analysis (STA) has shifted from graph-based analysis (GBA) to path-based analysis (PBA) for reducing unwanted slack pessimism. However, it is extremely time-consuming for a PBA engine to analyze a large set of critical paths. Recent years have seen many parallel PBA applications, but most of them are limited to CPU parallelism and do not scale beyond a few threads. To overcome this challenge, we propose in this paper a high-performance graphics processing unit (GPU)-accelerated PBA framework that efficiently analyzes the timing of a generated critical path set. We represent the path set in three dimensions (timing test, critical path, and pin) to structure the granularity of parallelism scalable to arbitrary problem sizes. In addition, we leverage task-based parallelism to decompose the PBA workload into CPU-GPU dependent tasks in which kernel computation and data processing overlap efficiently. Experimental results show that our framework, applied to an important PBA application, can speed up the state-of-the-art baseline by up to 10x on a million-gate design.
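The three-dimensional path-set representation above can be illustrated with a small sketch: each (test, path, pin) coordinate maps to a unique flat work-item index, so a single kernel launch can cover an arbitrary problem size. This is a minimal illustration of the indexing idea, not the paper's actual data layout; the dimension sizes and function names are assumptions.

```python
# Hypothetical flattening of the (timing test, critical path, pin) coordinate
# space into one flat parallel index, as a GPU kernel grid would consume it.

def flatten(test, path, pin, num_paths, num_pins):
    """Map a (test, path, pin) coordinate to a flat work-item index."""
    return (test * num_paths + path) * num_pins + pin

def unflatten(index, num_paths, num_pins):
    """Recover the (test, path, pin) coordinate from a flat index."""
    rest, pin = divmod(index, num_pins)
    test, path = divmod(rest, num_paths)
    return test, path, pin
```

Because the mapping is a bijection, every pin of every path of every timing test receives exactly one thread, regardless of how large the path set grows.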
Many graph optimization problems, such as the Maximum Weighted Independent Set problem, are NP-hard. For large-scale graphs that have billions of edges or vertices, these problems are hard to be computed directly even u...
ISBN: 9781665435550 (print)
FPGA-based accelerators have received increasing attention recently. Nevertheless, the amount of resources available on even the most powerful FPGA is still not enough to speed up very large workloads. To achieve that, FPGAs need to be interconnected in a Multi-FPGA architecture. However, programming such architecture is a challenging endeavor. This paper extends the OpenMP task-based computation offloading model to enable several FPGAs to work as a single Multi-FPGA architecture. Experimental results, for a set of OpenMP stencil applications running on a Multi-FPGA platform, have shown close to linear speedups as the number of FPGAs and IP-cores per FPGA increases.
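The core of the offloading model above is decomposing a stencil into per-device tiles with halo cells, running each tile independently, and stitching the results. The following is a minimal host-side sketch of that decomposition in plain Python (not OpenMP or FPGA code); the 3-point stencil and tile counts are illustrative choices, not taken from the paper.

```python
# Decompose a 1-D 3-point stencil into per-device tiles with one-cell halos,
# process each tile independently (standing in for offloading to one FPGA),
# and concatenate the per-tile results.

def stencil(row):
    """3-point average over the interior cells; boundary cells are fixed."""
    return [(row[i - 1] + row[i] + row[i + 1]) / 3 for i in range(1, len(row) - 1)]

def multi_device_stencil(row, num_devices):
    """Split interior cells into one tile per device, add halos, run, stitch."""
    interior = len(row) - 2
    out = []
    start = 1
    for d in range(num_devices):
        # Distribute interior cells as evenly as possible across devices.
        size = interior // num_devices + (1 if d < interior % num_devices else 0)
        # Halo: one extra cell on each side so the tile is self-contained.
        tile = row[start - 1 : start + size + 1]
        out.extend(stencil(tile))  # stands in for offloading tile d to device d
        start += size
    return out
```

Because each tile carries its own halo, the tiles need no communication within a time step, which is what lets the devices run as independent offload targets.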
ISBN: 9781450384414 (print)
Coherence induced cache misses are an important aspect limiting the scalability of shared memory parallel programs. Many coherence misses are avoidable, namely misses due to false sharing when different threads write to different memory addresses that are contained within the same cache block causing unnecessary invalidations. Past work has proposed numerous ways to mitigate false sharing from coherence protocols optimized for certain sharing patterns, to software tools for false-sharing detection and repair. Our work leverages approximate computing and store value similarity in error-tolerant multi-threaded applications. We introduce a novel cache coherence protocol which implements an approximate store instruction and coherence states to allow some limited incoherence within approximatable shared data to mitigate both coherence misses and coherence traffic for various sharing patterns. For applications from the Phoenix and AxBench suites, we see dynamic energy improvements within the NoC and memory hierarchy of up to 50.1% and speedup of up to 37.3% with low output error for approximate applications that exhibit false sharing.
ISBN: 9781665404761 (print)
The article considers an algorithm for the optimal distribution of "objects" of an arbitrary nature among "storages", whose essence is determined by the subject area. Several subject areas for which the optimal distribution problem is relevant are considered. The authors consider the problem of accelerating the Join operation. In parallel big-data processing, the Join operation requires uniform distribution of data among the cluster processors; a parallel implementation of the Join operation will be effective only when the computational complexities of its execution across all database fragments differ minimally from one another. The optimality criterion should therefore ensure uniform distribution of data. A detailed description of the heuristic optimal distribution algorithm is given, and objective functions for the problems under consideration are proposed. Experiments are described that made it possible to assess the quality of the heuristic greedy distribution algorithm. These experiments yielded the dependence of the algorithm's execution time on the number of distributed objects, and of the distribution quality (the difference between the maximum and minimum storage capacity) on the number of storages and on the interval of object weights. The algorithm is quite simple and can be easily implemented in any programming language. Its running time, even for big data, is small, which allows it to be used effectively when preparing data for the parallel solution of problems with high computational complexity. The algorithm shows good results when distributing tables (operands) across data warehouses: the largest storage capacity differs from the smallest by only a small amount.
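A greedy heuristic of the kind described above can be sketched in a few lines: place heavier objects first, each into the currently least-loaded storage, and measure quality as the max-min capacity spread. This is a minimal sketch of the general largest-first technique, not the article's exact algorithm, and the function names are assumptions.

```python
# Greedy largest-first distribution of weighted objects among storages,
# aiming to keep the storage capacities as uniform as possible.

def distribute(weights, num_storages):
    """Assign each object (by index) to a storage; return the mapping and loads."""
    loads = [0] * num_storages   # current total weight per storage
    assignment = {}              # object index -> storage index
    # Heavier objects first; each goes to the least-loaded storage so far.
    for idx in sorted(range(len(weights)), key=lambda i: -weights[i]):
        target = min(range(num_storages), key=loads.__getitem__)
        assignment[idx] = target
        loads[target] += weights[idx]
    return assignment, loads

def spread(loads):
    """The article's quality metric: largest minus smallest storage capacity."""
    return max(loads) - min(loads)
```

For example, `distribute([7, 5, 4, 3, 2, 1], 3)` yields loads of 8, 7, and 7, a spread of 1, which is the best achievable for a total weight of 22 over three storages.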
ISBN: 9781665411103 (print)
Graphics Processing Units (GPUs) have become a key technology for accelerating node performance in supercomputers, including the US Department of Energy's forthcoming exascale systems. Since the execution model for GPUs differs from that for conventional processors, applications need to be rewritten to exploit GPU parallelism. Performance tools are needed for such GPU-accelerated systems to help developers assess how well applications offload computation onto GPUs. In this paper, we describe extensions to Rice University's HPCToolkit performance tools that support measurement and analysis of Intel's DPC++ programming model for GPU-accelerated systems atop an implementation of the industry-standard OpenCL framework for heterogeneous parallelism on Intel GPUs. HPCToolkit supports three techniques for performance analysis of programs atop OpenCL on Intel GPUs. First, HPCToolkit supports profiling and tracing of OpenCL kernels. Second, HPCToolkit supports CPU-GPU blame shifting for OpenCL kernel executions, a profiling technique that can identify code that executes on one or more CPUs while GPUs are idle. Third, HPCToolkit supports fine-grained measurement, analysis, and attribution of performance metrics to OpenCL GPU kernels, including instruction counts, execution latency, and SIMD waste. The paper describes these capabilities and then illustrates their application in case studies with two applications that offload computations onto Intel GPUs.
ISBN: 9781665427036 (print)
Lattice-based cryptography can be considered a candidate alternative for post-quantum cryptosystems, offering key exchange, digital signature, and encryption functionality. The Number Theoretic Transform (NTT) can be utilized to achieve better performance for these functionalities, where polynomials need to be multiplied. The NTT reduces multiplication overhead by transforming the polynomials into the spectral domain, where point-wise multiplication suffices, and then inverse-transforming the result back to the original domain. It is important to optimize this technique, which is used in a wide range of computing systems. In this paper we study the feasibility of using OpenCL, a portable framework, to implement a parallelized version of the NTT that allows deployment on heterogeneous platforms, such as Graphics Processing Units (GPUs) and Field Programmable Gate Arrays (FPGAs). We measure the performance of our implementation on a GPU and evaluate when and where such a deployment is beneficial. Our results show that the proposed parallel implementation is a viable acceleration approach for lattice-based cryptography solutions.
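The transform-multiply-inverse pattern described above can be illustrated with a tiny sequential NTT (not the paper's parallel OpenCL implementation). The modulus p = 17, length n = 4, and root w = 4 (a primitive 4th root of unity mod 17) are illustrative choices, as is the O(n^2) direct evaluation in place of a fast butterfly network.

```python
# Minimal number theoretic transform sketch: map polynomials to the spectral
# domain, multiply point-wise, and map back, yielding a cyclic (mod x^N - 1)
# convolution of the coefficient vectors.

P, N, W = 17, 4, 4       # prime modulus, transform length, N-th root of unity mod P
W_INV = pow(W, -1, P)    # modular inverse of the root
N_INV = pow(N, -1, P)    # modular inverse of the length

def ntt(a, root):
    """Evaluate the polynomial a at the powers of `root` (forward or inverse NTT)."""
    return [sum(a[j] * pow(root, j * k, P) for j in range(N)) % P
            for k in range(N)]

def ntt_multiply(a, b):
    """Cyclic polynomial product via point-wise multiplication in the spectral domain."""
    fa, fb = ntt(a, W), ntt(b, W)
    fc = [(x * y) % P for x, y in zip(fa, fb)]
    return [(N_INV * c) % P for c in ntt(fc, W_INV)]
```

Each output coefficient of the inverse transform is independent of the others, which is exactly the structure the paper's OpenCL parallelization exploits across GPU and FPGA work-items.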
Node sizes in multicore clusters are becoming larger, so applications should exploit the shared memory inside a node, to potentially reduce communication latencies compared to network communications. The Message Passi...
Virtual screening is an early stage of the drug discovery process that selects the most promising candidates. In the urgent computing scenario it is critical to find a solution in a short time frame. In this paper, we...
ISBN: 9781665414746 (print)
This paper discusses the optimization problem of parallelizing program code in a heterogeneous system. The optimization problem and its constraints are defined. The authors present the main approaches to finding the best solution. Special aspects of the optimization problem in heterogeneous systems are discussed, and heuristics corresponding to these aspects are proposed.