Search results - Inner Mongolia University Library

arXiv, 2024

Authors: Rosas, Miguel Romero; Sanchez, Miguel Torres; Eigenmann, Rudolf (University of Delaware Newark DE United States)

In the contemporary landscape of computer architecture, the demand for efficient parallel programming persists and requires robust optimization techniques. Traditional optimizing compilers have historically been pivotal in this endeavor, adapting to the evolving complexities of modern software systems. The emergence of Large Language Models (LLMs) raises intriguing questions about the potential for AI-driven approaches to revolutionize code optimization methodologies. This paper presents a comparative analysis between two state-of-the-art Large Language Models, GPT-4.0 and CodeLlama-70B, and traditional optimizing compilers, assessing their respective abilities and limitations in optimizing code for maximum efficiency. Additionally, we introduce a benchmark suite of challenging optimization patterns and an automatic mechanism for evaluating performance and correctness of the code generated by such tools. We used two different prompting methodologies to assess the performance of the LLMs: Chain of Thought (CoT) and Instruction Prompting (IP). We then compared these results with three traditional optimizing compilers, CETUS, PLUTO and ROSE, across a range of real-world use cases. A key finding is that while LLMs have the potential to outperform current optimizing compilers, they often generate incorrect code on large code sizes, calling for automated verification methods. Our extensive evaluation across three different benchmark suites shows CodeLlama-70B as the superior optimizer among the two LLMs, capable of achieving speedups of up to 2.1x. Additionally, CETUS is the best among the optimizing compilers, achieving a maximum speedup of 1.9x. We also found no significant difference between the two prompting methods: Chain of Thought (CoT) and Instruction Prompting (IP). © 2024, CC BY.

Keywords: parallel programming
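The abstract above mentions an automatic mechanism for evaluating the performance and correctness of generated code but does not describe it. As a minimal sketch of that general idea, and not the authors' tool, the following Python harness first checks that a candidate function agrees with a reference on sample inputs and only then measures a speedup. All names here (`best_time`, `evaluate_candidate`, and the two example functions) are hypothetical.

```python
import time
from typing import Any, Callable, Iterable, Optional, Tuple


def best_time(fn: Callable[..., Any], args: Tuple, repeats: int = 5) -> float:
    """Return the best wall-clock time of fn(*args) over several runs."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        fn(*args)
        best = min(best, time.perf_counter() - start)
    return best


def evaluate_candidate(
    reference: Callable[..., Any],
    candidate: Callable[..., Any],
    inputs: Iterable[Tuple],
    repeats: int = 5,
) -> Tuple[bool, Optional[float]]:
    """Return (is_correct, speedup); speedup is None if any output differs."""
    inputs = list(inputs)
    # Correctness first: a fast but wrong candidate is rejected outright.
    for args in inputs:
        if reference(*args) != candidate(*args):
            return False, None
    ref_t = sum(best_time(reference, a, repeats) for a in inputs)
    cand_t = sum(best_time(candidate, a, repeats) for a in inputs)
    speedup = ref_t / cand_t if cand_t > 0 else float("inf")
    return True, speedup


# Hypothetical example pair: a loop and a closed-form replacement.
def sum_squares_loop(n: int) -> int:
    total = 0
    for i in range(n):
        total += i * i
    return total


def sum_squares_formula(n: int) -> int:
    return (n - 1) * n * (2 * n - 1) // 6
```

Comparing outputs on a handful of inputs is only evidence of equivalence, not proof, which is consistent with the abstract's call for automated verification methods.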

GPA: A GPU Performance Advisor Based on Instruction Sampling

19th IEEE/ACM International Symposium on Code Generation and Optimization (CGO)

Authors: Zhou, Keren; Meng, Xiaozhu; Sai, Ryuichi; Mellor-Crummey, John (Rice Univ Dept Comp Sci Houston TX 77005 USA)

ISBN: (print) 9781728186139

Developing efficient GPU kernels can be difficult because of the complexity of GPU architectures and programming models. Existing performance tools only provide coarse-grained tuning advice at the kernel level, if any. In this paper, we describe GPA, a performance advisor for NVIDIA GPUs that suggests potential code optimizations at a hierarchy of levels, including individual lines, loops, and functions. To relieve users of the burden of interpreting performance counters and analyzing bottlenecks, GPA uses data flow analysis to approximately attribute measured instruction stalls to their root causes and uses information about a program's structure and the GPU to match inefficiency patterns with optimization strategies. To quantify the potential benefits of each optimization strategy, we developed PC sampling-based performance models to estimate its speedup. Our experiments with benchmarks and applications show that GPA provides insightful reports to guide performance optimization. Using GPA, we obtained speedups on a Volta V100 GPU ranging from 1.01x to 3.58x, with a geometric mean of 1.22x.

Keywords: High performance computing; Performance analysis; parallel programming; parallel architectures

Frame Rate Latency Reduction for Real-time Vehicle Tracking using Network Cameras

IEEE Region 10 Symposium (TENSYMP) - Good Technologies for Creating Future

Authors: Cempron, Jonathan Paul C.; Bautista, Carlo Migel; Cu, Gregory; Ilao, Joel P. (De La Salle Univ Comp Technol Dept Coll Comp Studies Manila Philippines; De La Salle Univ Software Technol Dept Coll Comp Studies Manila Philippines)

ISBN: (print) 9781665400268

Traffic monitoring and vehicle counting systems that use surveillance cameras employ several computer vision techniques, one of which is object tracking, which approximates the trajectory of the vehicle throughout the scene. However, a major challenge in processing videos from network camera feeds is the irregular and low frame rates, affecting the performance of object tracking. In this paper, we present a concurrent implementation framework intended to increase the input network video frame rate.

Keywords: real-time; network camera; parallel programming; computer vision; object tracking

Lightweight Function Monitors for Fine-Grained Management in Large Scale Python Applications

35th IEEE International Parallel and Distributed Processing Symposium (IPDPS)

Authors: Shaffer, Tim; Li, Zhuozhao; Tovar, Ben; Babuji, Yadu; Dasso, T. J.; Surma, Zoe; Chard, Kyle; Foster, Ian; Thain, Douglas (Univ Notre Dame Notre Dame IN 46556 USA; Univ Chicago Chicago IL 60637 USA; Argonne Natl Lab Argonne IL 60439 USA)

ISBN: (print) 9781665440660

Python has become a widely used programming language for research, not only for small one-off analyses, but also for complex application pipelines running at supercomputer-scale. Modern parallel programming frameworks for Python present users with a more granular unit of management than traditional Unix processes and batch submissions: the Python function. We review the challenges involved in running native Python functions at scale, and present techniques for dynamically determining a minimal set of dependencies and for assembling a lightweight function monitor (LFM) that captures the software environment and manages resources at the granularity of single functions. We evaluate these techniques in a range of environments, from campus cluster to supercomputer, and show that our advanced dependency management planning and dynamic resource management methods provide superior performance and utilization relative to coarser-grained management approaches, achieving several-fold decrease in execution time for several large Python applications.

Keywords: parallel programming; Pipelines; Tools; Software; Supercomputers; Planning; Resource management
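The abstract above refers to determining a minimal set of dependencies for a single Python function, without giving the technique. One simple way to approximate the idea, shown here as a sketch and not as the authors' method, is to parse a function's source with the standard `ast` module and collect the modules it imports. The helper names and `EXAMPLE_TASK` are hypothetical, and the sketch only sees explicit `import` statements in the given source.

```python
import ast
import sys
from typing import Set


def imported_modules(source: str) -> Set[str]:
    """Collect the top-level module names imported anywhere in the source."""
    modules: Set[str] = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            for alias in node.names:
                modules.add(alias.name.split(".")[0])
        elif isinstance(node, ast.ImportFrom):
            # Skip relative imports (level > 0): they refer to the local package.
            if node.level == 0 and node.module:
                modules.add(node.module.split(".")[0])
    return modules


def third_party_modules(source: str) -> Set[str]:
    """Drop standard-library modules, leaving what must be shipped with the task."""
    # sys.stdlib_module_names exists on Python 3.10+; older versions filter nothing.
    stdlib = getattr(sys, "stdlib_module_names", frozenset())
    return {name for name in imported_modules(source) if name not in stdlib}


# Hypothetical task source; it is parsed only, never executed.
EXAMPLE_TASK = '''
def analyze(path):
    import json
    import numpy.linalg as la
    from os import path as osp
    return osp.basename(path)
'''
```

Here `imported_modules(EXAMPLE_TASK)` yields `json`, `numpy` and `os`, and on Python 3.10 or later `third_party_modules` narrows that to `numpy`.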

Adaptive MPI collective operations based on evaluations in LogP model

14th International Symposium on Intelligent Systems

Authors: Paznikov, A. A.; Kupriyanov, M. S. (St Petersburg Electrotech Univ LETI Ul Prof Popov 5 St Petersburg 197376 Russia)

The message passing model, represented by MPI (Message Passing Interface), is the principal parallel programming tool for distributed computer systems. Most MPI programs contain collective communications, which involve all the processes of a parallel program. The effectiveness of collective communications substantially affects the total program execution time. In this work, we consider the problem of designing adaptive algorithms for collective communications, using the example of barrier synchronization, one of the most common types of collective communication. We developed an adaptive barrier synchronization algorithm that suboptimally selects the barrier synchronization scheme in parallel MPI programs from among algorithms such as Central Counter, Combining Tree and Dissemination Barrier. The adaptive algorithm chooses the barrier algorithm with the minimal estimated execution time in the LogP model. The LogP model considers the performance of computational resources and of the interconnect for point-to-point communications. The proposed algorithm has been implemented for MPI. We present the results of experiments on cluster systems and analyse the dependency of algorithm selection on LogP parameter values. In particular, for fewer than 20 processes the adaptive algorithm selects Combining Tree, while for a larger number of processes it selects Dissemination Barrier. The developed algorithm reduces the average barrier synchronization time by 4% in comparison with the most common determined barrier algorithms. (C) 2021 The Authors. Published by Elsevier B.V.

Keywords: Collectives; collective communications; barrier; barrier synchronization; distributed computer systems; LogP; MPI; parallel programming
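The selection step described in the abstract above (pick the barrier scheme with the smallest estimated time under LogP) can be sketched as an argmin over per-algorithm cost estimators. The cost formulas below are simplified placeholder assumptions, not the paper's evaluations, so they will not reproduce the reported switch between Combining Tree and Dissemination Barrier near 20 processes. All names are hypothetical.

```python
import math
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass(frozen=True)
class LogP:
    L: float  # network latency of one message
    o: float  # send/receive overhead per message
    g: float  # minimum gap between consecutive messages
    P: int    # number of processes


def _message(m: LogP) -> float:
    """Assumed cost of one point-to-point message: send o + latency + receive o."""
    return m.L + 2 * m.o


def _rounds(m: LogP) -> int:
    return math.ceil(math.log2(m.P)) if m.P > 1 else 0


def central_counter(m: LogP) -> float:
    # Assumption: P-1 arrivals are serialized at the root, then P-1 releases.
    if m.P <= 1:
        return 0.0
    return 2 * (_message(m) + (m.P - 2) * max(m.g, m.o))


def combining_tree(m: LogP) -> float:
    # Assumption: binary tree, two child messages per level, up and back down.
    return 2 * _rounds(m) * (_message(m) + max(m.g, m.o))


def dissemination(m: LogP) -> float:
    # Assumption: ceil(log2 P) rounds, one message exchange per round.
    return _rounds(m) * _message(m)


ESTIMATORS: Dict[str, Callable[[LogP], float]] = {
    "central_counter": central_counter,
    "combining_tree": combining_tree,
    "dissemination": dissemination,
}


def choose_barrier(m: LogP, estimators: Dict[str, Callable[[LogP], float]] = ESTIMATORS) -> str:
    """Return the name of the scheme with the smallest estimated time."""
    return min(estimators, key=lambda name: estimators[name](m))
```

Keeping the estimators in a dictionary means calibrated formulas for a specific cluster can be swapped in without touching `choose_barrier`.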

Parallel implementation of space charge force calculation in SUNRAY-1D using MPI

22nd International Vacuum Electronics Conference (IVEC)

Authors: Latha, A. Mercy; Gahlaut, Vishant; Srivastava, Vishnu; Ghosh, S. K. (CSIR Cent Elect Engn Chennai Ctr CSIR Madras Complex Chennai Tamil Nadu India; Banasthali Vidyapith Dept Phys Banasthali India; CSIR Cent Elect Engn Res Inst Vacuum Electron Devices Design Grp Pilani Rajasthan India)

ISBN: (print) 9781665441056

SUNRAY-1D is a one-dimensional large signal code for analyzing the beam-wave interaction in helix traveling wave tubes (TWTs). In order to improve the performance of SUNRAY-1D, parallelization of a few of its modules has been initiated. Parallel implementation of the space charge force module using MPI (message passing interface) has been successful. Improvements, in terms of increased accuracy and reduced computational time, have been the key benefits achieved.

Keywords: parallel programming; MPI; space charge force; computational time; speed-up

TRIPP: Transparent Resource Provisioning for Multi-Tenant CPU-GPU based Cloud Environments

11th Brazilian Symposium on Computing Systems Engineering (SBESC)

Authors: Vicenzi, Julio Costella; Knorst, Tiago; Jordan, Michael G.; Korol, Guilherme; Schneider Beck, Antonio Carlos; Rutzig, Mateus Beck (Univ Fed Rio Grande do Sul UFRGS Inst Informat PGMICRO Porto Alegre RS Brazil; Univ Fed Rio Grande do Sul UFRGS Inst Informat Porto Alegre RS Brazil; Univ Fed Santa Maria UFSM Elect & Comp Dept Santa Maria RS Brazil)

ISBN: (print) 9781665443111

Cloud Warehouses have been expanding their computational resources to cover the growing offloading of tenants' applications. Currently, cloud nodes integrate heterogeneous resources, such as CPU and GPU, so they can exploit different types and levels of parallelism available in the applications. However, heterogeneous cloud nodes bring challenges to the software development process, since the programmer must be aware of each device's specifications, analyze and distribute the code over the available devices. Even though OpenCL supports transparent programming on heterogeneous devices, softening the programmer's burden, the choice of target device is still the programmer's responsibility. Given that, this work proposes a framework for the execution of OpenCL applications on a multi-tenant CPU-GPU cloud environment, responsible for transparently scheduling the applications to the best available device, without any interaction from the programmer. The framework has the goal of optimizing resource provisioning, reducing makespan and energy consumption. Considering the execution of the PolyBench benchmark suite, the framework shows a makespan reduction of 3.4x and energy savings of 33% when compared to GPU standalone execution.

Keywords: OpenCL; Cloud computing; Heterogeneous Systems; parallel programming

Distributed and Heterogeneous SAR Backprojection with Halide

IEEE High Performance Extreme Computing Conference (HPEC)

Authors: Imes, Connor; Li, Tzu-Mao; Glines, Mark; Khan, Rishi; Walters, John Paul (Univ Southern Calif Informat Sci Inst Los Angeles CA 90007 USA; Univ Calif San Diego La Jolla CA 92093 USA; Extreme Scale Solut New Castle DE USA; MIT CSAIL Cambridge MA USA)

ISBN: (print) 9781665423694

Writing efficient, scalable, and portable HPC synthetic aperture radar (SAR) applications is increasingly challenging due to the growing diversity and heterogeneity in distributed systems. Considerable developer and computational resources are often spent to port applications to new HPC platforms and architectures, which is both time consuming and expensive. Domain-specific languages have been shown to be highly productive for development effort, but additionally achieving both scalable computational efficiency and platform portability remains challenging. The Halide programming language is both productive and efficient for dense data processing, supports common CPU architectures and heterogeneous resources like GPUs, and has previously been extended for distributed processing. We propose to use a distributed Halide implementation for scalable and heterogeneous HPC SAR processing. We implement a backprojection algorithm for SAR image reconstruction and demonstrate scalability on the OLCF Summit supercomputer up to 1,024 compute nodes (43,008 cores, each with 4 hardware threads) with a large 32,768x32,768 dataset, and up to 8 distributed GPUs with an 8,192x8,192 dataset. Our results show excellent scaling and portability to heterogeneous resources, and motivate additional improvements in Halide to better support distributed high-performance signal processing.

Keywords: parallel programming; high performance computing; synthetic aperture radar

Streamlining the OpenMP Programming Model on Ultra-Low-Power Multi-core MCUs

34th International Conference on Architecture of Computing Systems (ARCS)

Authors: Montagna, Fabio; Tagliavini, Giuseppe; Rossi, Davide; Garofalo, Angelo; Benini, Luca (Univ Bologna Bologna Italy; Swiss Fed Inst Technol Zurich Switzerland)

ISBN: (print) 9783030816827; 9783030816810

High-level programming models aim at exploiting hardware parallelism and reducing software development costs. However, their adoption on ultra-low-power multi-core microcontroller (MCU) platforms requires minimizing the overheads of work-sharing constructs on fine-grained parallel regions. This work tackles this challenge by proposing OMP-SPMD, a streamlined approach for parallel computing enabling the OpenMP syntax for the Single-Program Multiple-Data (SPMD) paradigm. To assess the performance improvement, we compare our solution with two alternatives: a baseline implementation of the OpenMP runtime based on the fork-join paradigm (OMP-base) and a version leveraging hardware-specific optimizations (OMP-opt). We benchmarked these libraries on a Parallel Ultra-Low Power (PULP) MCU, highlighting that hardware-specific optimizations improve OMP-base performance by up to 69%. At the same time, OMP-SPMD leads to an extra improvement of up to 178%.

Keywords: Ultra-low-power multi-core MCU; parallel programming; OpenMP; SPMD

Spell Checker Application Based on Levenshtein Automaton

22nd International Conference on Intelligent Data Engineering and Automated Learning

Authors: Buse-Dragomir, Alexandra; Popescu, Paul Stefan; Mihaescu, Marian Cristian (Univ Craiova Craiova Romania)

ISBN: (print) 9783030916077; 9783030916084

This paper presents a spell checker project based on Levenshtein distance and evaluates the system's performance on both parallel and sequential implementations. The following Levenshtein algorithm approaches are presented in this paper: Levenshtein Matrix Distance, Levenshtein Vector Distance, Levenshtein automaton (along with an optimised version), and Levenshtein trie. The performance evaluation is performed using three edit distances. Each edit distance is evaluated based on a set of misspelt words, so the results are relevant for various cases. For this scenario, the Levenshtein trie, along with the Levenshtein automaton, performed the best in both sequential and parallel versions for a large number of misspelt words.

Keywords: Spell checker; Levenshtein automaton; parallel programming
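For readers unfamiliar with the "vector" variant named in the abstract above, the sketch below shows the standard two-row formulation of Levenshtein distance, which keeps only the previous and current rows of the dynamic-programming matrix. It is a textbook version, not the paper's implementation, and the `suggest` helper with its `max_distance` parameter is a hypothetical illustration of how a spell checker might use it.

```python
from typing import Iterable, List


def levenshtein(a: str, b: str) -> int:
    """Edit distance computed with two rows instead of the full matrix."""
    if len(a) < len(b):
        a, b = b, a  # iterate over the longer string so rows stay short
    previous = list(range(len(b) + 1))  # distance from "" to each prefix of b
    for i, char_a in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ""
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(
                min(
                    previous[j] + 1,         # delete char_a
                    current[j - 1] + 1,      # insert char_b
                    previous[j - 1] + cost,  # substitute or match
                )
            )
        previous = current
    return previous[-1]


def suggest(word: str, dictionary: Iterable[str], max_distance: int = 1) -> List[str]:
    """Return dictionary words within max_distance edits, closest first."""
    scored = sorted((levenshtein(word, entry), entry) for entry in dictionary)
    return [entry for distance, entry in scored if distance <= max_distance]
```

This linear scan costs one distance computation per dictionary word; the automaton and trie approaches named in the abstract are alternative ways of organizing the same search.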
