ISBN: 9781665423694 (print)
Emphasis on static timing analysis (STA) has shifted from graph-based analysis (GBA) to path-based analysis (PBA) for reducing unwanted slack pessimism. However, it is extremely time-consuming for a PBA engine to analyze a large set of critical paths. Recent years have seen many parallel PBA applications, but most of them are limited to CPU parallelism and do not scale beyond a few threads. To overcome this challenge, we propose in this paper a high-performance graphics processing unit (GPU)-accelerated PBA framework that efficiently analyzes the timing of a generated critical path set. We represent the path set in three dimensions (timing test, critical path, and pin) to structure the granularity of parallelism scalable to arbitrary problem sizes. In addition, we leverage task-based parallelism to decompose the PBA workload into CPU-GPU dependent tasks in which kernel computation and data processing overlap efficiently. Experimental results show that our framework, applied to an important PBA application, can speed up the state-of-the-art baseline by up to 10x on a million-gate design.
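The three-dimensional path-set representation above can be illustrated with a small sketch: each (test, path, pin) coordinate maps to a unique flat work-item index, so a single kernel launch can cover an arbitrary problem size. This is a minimal illustration of the indexing idea, not the paper's actual data layout; the dimension sizes and function names are assumptions.

```python
# Hypothetical flattening of the (timing test, critical path, pin) coordinate
# space into one flat parallel index, as a GPU kernel grid would consume it.

def flatten(test, path, pin, num_paths, num_pins):
    """Map a (test, path, pin) coordinate to a flat work-item index."""
    return (test * num_paths + path) * num_pins + pin

def unflatten(index, num_paths, num_pins):
    """Recover the (test, path, pin) coordinate from a flat index."""
    rest, pin = divmod(index, num_pins)
    test, path = divmod(rest, num_paths)
    return test, path, pin
```

Because the mapping is a bijection, every pin of every path of every timing test receives exactly one thread, regardless of how large the path set grows.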
Many graph optimization problems, such as the Maximum Weighted Independent Set problem, are NP-hard. For large-scale graphs that have billions of edges or vertices, these problems are hard to be computed directly even u...
ISBN: 9781665435550 (print)
FPGA-based accelerators have received increasing attention recently. Nevertheless, the amount of resources available on even the most powerful FPGA is still not enough to speed up very large workloads. To achieve that, FPGAs need to be interconnected in a Multi-FPGA architecture. However, programming such architecture is a challenging endeavor. This paper extends the OpenMP task-based computation offloading model to enable several FPGAs to work as a single Multi-FPGA architecture. Experimental results, for a set of OpenMP stencil applications running on a Multi-FPGA platform, have shown close to linear speedups as the number of FPGAs and IP-cores per FPGA increases.
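The core of the offloading model above is decomposing a stencil into per-device tiles with halo cells, running each tile independently, and stitching the results. The following is a minimal host-side sketch of that decomposition in plain Python (not OpenMP or FPGA code); the 3-point stencil and tile counts are illustrative choices, not taken from the paper.

```python
# Decompose a 1-D 3-point stencil into per-device tiles with one-cell halos,
# process each tile independently (standing in for offloading to one FPGA),
# and concatenate the per-tile results.

def stencil(row):
    """3-point average over the interior cells; boundary cells are fixed."""
    return [(row[i - 1] + row[i] + row[i + 1]) / 3 for i in range(1, len(row) - 1)]

def multi_device_stencil(row, num_devices):
    """Split interior cells into one tile per device, add halos, run, stitch."""
    interior = len(row) - 2
    out = []
    start = 1
    for d in range(num_devices):
        # Distribute interior cells as evenly as possible across devices.
        size = interior // num_devices + (1 if d < interior % num_devices else 0)
        # Halo: one extra cell on each side so the tile is self-contained.
        tile = row[start - 1 : start + size + 1]
        out.extend(stencil(tile))  # stands in for offloading tile d to device d
        start += size
    return out
```

Because each tile carries its own halo, the tiles need no communication within a time step, which is what lets the devices run as independent offload targets.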
ISBN: 9781450384414 (print)
Coherence induced cache misses are an important aspect limiting the scalability of shared memory parallel programs. Many coherence misses are avoidable, namely misses due to false sharing when different threads write to different memory addresses that are contained within the same cache block causing unnecessary invalidations. Past work has proposed numerous ways to mitigate false sharing from coherence protocols optimized for certain sharing patterns, to software tools for false-sharing detection and repair. Our work leverages approximate computing and store value similarity in error-tolerant multi-threaded applications. We introduce a novel cache coherence protocol which implements an approximate store instruction and coherence states to allow some limited incoherence within approximatable shared data to mitigate both coherence misses and coherence traffic for various sharing patterns. For applications from the Phoenix and AxBench suites, we see dynamic energy improvements within the NoC and memory hierarchy of up to 50.1% and speedup of up to 37.3% with low output error for approximate applications that exhibit false sharing.
ISBN: 9781665404761 (print)
The article considers an algorithm for the optimal distribution of "objects" of an arbitrary nature among "storages", whose essence is determined by the subject area. Several subject areas for which the optimal distribution problem is relevant are considered. The authors consider the problem of accelerating the Join operation. In parallel big-data processing, the Join operation requires uniform distribution of data among the cluster processors; a parallel implementation of the Join operation will be effective only when the computational complexities of its execution across all database fragments differ minimally from one another. The optimality criterion should therefore ensure uniform distribution of data. A detailed description of the heuristic optimal distribution algorithm is given, and objective functions for the problems under consideration are proposed. Experiments are described that made it possible to assess the quality of the heuristic greedy distribution algorithm. These experiments yielded the dependence of the algorithm's execution time on the number of distributed objects, and of the distribution quality (the difference between the maximum and minimum storage capacity) on the number of storages and on the interval of object weights. The algorithm is quite simple and can be easily implemented in any programming language. Its running time, even for big data, is small, which allows it to be used effectively when preparing data for the parallel solution of problems with high computational complexity. The algorithm shows good results when distributing tables (operands) across data warehouses: the largest storage capacity differs from the smallest by only a small amount.
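A greedy heuristic of the kind described above can be sketched in a few lines: place heavier objects first, each into the currently least-loaded storage, and measure quality as the max-min capacity spread. This is a minimal sketch of the general largest-first technique, not the article's exact algorithm, and the function names are assumptions.

```python
# Greedy largest-first distribution of weighted objects among storages,
# aiming to keep the storage capacities as uniform as possible.

def distribute(weights, num_storages):
    """Assign each object (by index) to a storage; return the mapping and loads."""
    loads = [0] * num_storages   # current total weight per storage
    assignment = {}              # object index -> storage index
    # Heavier objects first; each goes to the least-loaded storage so far.
    for idx in sorted(range(len(weights)), key=lambda i: -weights[i]):
        target = min(range(num_storages), key=loads.__getitem__)
        assignment[idx] = target
        loads[target] += weights[idx]
    return assignment, loads

def spread(loads):
    """The article's quality metric: largest minus smallest storage capacity."""
    return max(loads) - min(loads)
```

For example, `distribute([7, 5, 4, 3, 2, 1], 3)` yields loads of 8, 7, and 7, a spread of 1, which is the best achievable for a total weight of 22 over three storages.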
ISBN: 9781665411103 (print)
Graphics Processing Units (GPUs) have become a key technology for accelerating node performance in supercomputers, including the US Department of Energy's forthcoming exascale systems. Since the execution model for GPUs differs from that for conventional processors, applications need to be rewritten to exploit GPU parallelism. Performance tools are needed for such GPU-accelerated systems to help developers assess how well applications offload computation onto GPUs. In this paper, we describe extensions to Rice University's HPCToolkit performance tools that support measurement and analysis of Intel's DPC++ programming model for GPU-accelerated systems atop an implementation of the industry-standard OpenCL framework for heterogeneous parallelism on Intel GPUs. HPCToolkit supports three techniques for performance analysis of programs atop OpenCL on Intel GPUs. First, HPCToolkit supports profiling and tracing of OpenCL kernels. Second, HPCToolkit supports CPU-GPU blame shifting for OpenCL kernel executions, a profiling technique that can identify code that executes on one or more CPUs while GPUs are idle. Third, HPCToolkit supports fine-grained measurement, analysis, and attribution of performance metrics to OpenCL GPU kernels, including instruction counts, execution latency, and SIMD waste. The paper describes these capabilities and then illustrates their application in case studies with two applications that offload computations onto Intel GPUs.
ISBN: 9781665427036 (print)
Lattice-based cryptography can be considered a candidate alternative for post-quantum cryptosystems, offering key exchange, digital signature, and encryption functionality. The Number Theoretic Transform (NTT) can be utilized to achieve better performance for these functionalities, where polynomials need to be multiplied. The NTT reduces multiplication overhead by transforming the polynomials into the spectral domain, where point-wise multiplication suffices, and then inverse-transforming the result back to the original domain. It is important to optimize this technique, which is used in a wide range of computing systems. In this paper we study the feasibility of using OpenCL, a portable framework, to implement a parallelized version of the NTT that allows deployment on heterogeneous platforms, such as Graphics Processing Units (GPUs) and Field Programmable Gate Arrays (FPGAs). We measure the performance of our implementation on a GPU and evaluate when and where such a deployment is beneficial. Our results show that the proposed parallel implementation is a viable acceleration approach for lattice-based cryptography solutions.
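The transform-multiply-inverse pattern described above can be illustrated with a tiny sequential NTT (not the paper's parallel OpenCL implementation). The modulus p = 17, length n = 4, and root w = 4 (a primitive 4th root of unity mod 17) are illustrative choices, as is the O(n^2) direct evaluation in place of a fast butterfly network.

```python
# Minimal number theoretic transform sketch: map polynomials to the spectral
# domain, multiply point-wise, and map back, yielding a cyclic (mod x^N - 1)
# convolution of the coefficient vectors.

P, N, W = 17, 4, 4       # prime modulus, transform length, N-th root of unity mod P
W_INV = pow(W, -1, P)    # modular inverse of the root
N_INV = pow(N, -1, P)    # modular inverse of the length

def ntt(a, root):
    """Evaluate the polynomial a at the powers of `root` (forward or inverse NTT)."""
    return [sum(a[j] * pow(root, j * k, P) for j in range(N)) % P
            for k in range(N)]

def ntt_multiply(a, b):
    """Cyclic polynomial product via point-wise multiplication in the spectral domain."""
    fa, fb = ntt(a, W), ntt(b, W)
    fc = [(x * y) % P for x, y in zip(fa, fb)]
    return [(N_INV * c) % P for c in ntt(fc, W_INV)]
```

Each output coefficient of the inverse transform is independent of the others, which is exactly the structure the paper's OpenCL parallelization exploits across GPU and FPGA work-items.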
Node sizes in multicore clusters are becoming larger, so applications should exploit the shared memory inside a node, to potentially reduce communication latencies compared to network communications. The Message Passi...
Virtual screening is an early stage of the drug discovery process that selects the most promising candidates. In the urgent computing scenario it is critical to find a solution in a short time frame. In this paper, we...
ISBN: 9781665414746 (print)
This paper discusses the optimization problem of parallelizing program code in a heterogeneous system. The optimization problem and its constraints are defined. The authors present the main approaches to finding the best solution. Special aspects of the optimization problem in heterogeneous systems are discussed, and heuristics corresponding to these aspects are proposed.