Search Results - Inner Mongolia University Library

Modeling the Interplay between Loop Tiling and Fusion in optimizing compilers Using Affine Relations

ACM TRANSACTIONS ON COMPUTER SYSTEMS, 2023, Vol. 41, Issues 1-4, pp. 1-45

Authors: Zhao, Jie; Xu, Jinchen; Di, Peng; Nie, Wang; Hu, Jiahui; Yi, Yanzhi; Yang, Sijia; Geng, Zhen; Zhang, Renwei; Li, Bojie; Gan, Zhiliang; Jin, Xuefeng. Affiliations: Hunan Univ Coll Comp Sci & Elect Engn Changsha 410082 Peoples R China; Informat Engn Univ Zhengzhou 450001 Peoples R China; Huawei Technol Co Ltd Beijing 100085 Peoples R China

Loop tiling and fusion are two essential transformations in optimizing compilers to enhance the data locality of programs. Existing heuristics either perform loop tiling and fusion in a particular order, missing some of their profitable compositions, or execute ad-hoc implementations for domain-specific applications, calling for a generalized and systematic solution in optimizing compilers. In this article, we present a so-called basteln (an abbreviation for backward slicing of tiled loop nests) strategy in polyhedral compilation to better model the interplay between loop tiling and fusion. The basteln strategy first groups loop nests by preserving their parallelism/tilability and next performs rectangular/parallelogram tiling to the output groups that produce data consumed outside the considered program fragment. The memory footprints required by each tile are then computed, from which the upward exposed data are extracted to determine the tile shapes of the remaining fusion groups. Such a tiling mechanism can construct complex tile shapes imposed by the dependences between these groups, which are further merged by a post-tiling fusion algorithm for enhancing data locality without losing the parallelism/tilability of the output groups. The basteln strategy also takes into account the amount of redundant computations and the fusion of independent groups, exhibiting a general applicability. We integrate the basteln strategy into two optimizing compilers, with one a general-purpose optimizer and the other a domain-specific compiler for deploying deep learning models. The experiments are conducted on CPU, GPU, and a deep learning accelerator to demonstrate the effectiveness of the approach for a wide class of application domains, including deep learning, image processing, sparse matrix computation, and linear algebra. In particular, the basteln strategy achieves a mean speedup of 1.8x over cuBLAS/cuDNN and 1.1x over TVM on GPU when used to optimize deep learning models
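The interplay described in this abstract can be illustrated with a toy two-stage loop chain. The sketch below is only an illustration under stated assumptions, not the basteln algorithm or any polyhedral tool: the function names (`unfused`, `tiled_and_fused`), the 1D stencil, and the tile size are all hypothetical. It tiles the output loop first, derives the footprint of the intermediate array that each output tile needs, and computes the producer on that footprint inside the tile, which introduces a small amount of redundant computation at tile boundaries.

```python
def unfused(a):
    """Reference version: two separate loop nests with a full intermediate array."""
    # Producer loop nest: b[i] depends on a[i] and a[i + 1].
    b = [a[i] + a[i + 1] for i in range(len(a) - 1)]
    # Consumer (output) loop nest: c[i] depends on b[i] and b[i + 1].
    return [b[i] + b[i + 1] for i in range(len(b) - 1)]


def tiled_and_fused(a, tile=4):
    """Tile the output loop, then fuse the producer into each output tile.

    Returns the output and the number of producer values computed more than once.
    """
    n_out = max(len(a) - 2, 0)
    c = [0] * n_out
    recomputed = 0
    for lo in range(0, n_out, tile):
        hi = min(lo + tile, n_out)
        # Footprint of b read by output tile c[lo:hi] is b[lo:hi + 1].
        b_tile = [a[i] + a[i + 1] for i in range(lo, hi + 1)]
        if lo > 0:
            # b[lo] was already computed by the previous tile (overlap of one).
            recomputed += 1
        for i in range(lo, hi):
            c[i] = b_tile[i - lo] + b_tile[i - lo + 1]
    return c, recomputed


data = list(range(10))
fused_result, redundant = tiled_and_fused(data, tile=4)
```

In this sketch each output tile is independent of the others, so the tiles could run in parallel, at the cost of recomputing one producer value per tile boundary.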

Keywords: Tiling; fusion; data locality; parallelism; redundant computation; memory hierarchy; polyhedral model; optimizing compilers

Guided Equality Saturation

PROCEEDINGS OF THE ACM ON PROGRAMMING LANGUAGES-PACMPL, 2024, Vol. 8, Issue POPL, pp. 1727-1758

Authors: Koehler, Thomas; Goens, Andres; Bhat, Siddharth; Grosser, Tobias; Trinder, Phil; Steuwer, Michel. Affiliations: INRIA Strasbourg France; Univ Strasbourg CNRS ICube Lab Strasbourg France; Univ Amsterdam Amsterdam Netherlands; Univ Edinburgh Edinburgh Midlothian Scotland; Univ Cambridge Cambridge England; Univ Glasgow Glasgow Lanark Scotland; Tech Univ Berlin Berlin Germany

Rewriting is a principled term transformation technique with uses across theorem proving and compilation. In theorem proving, each rewrite is a proof step; in compilation, rewrites optimize a program term. While developing rewrite sequences manually is possible, this process does not scale to larger rewrite sequences. Automated rewriting techniques, like greedy simplification or equality saturation, work well without requiring human input. Yet, they do not scale to large search spaces, limiting the complexity of tasks where automated rewriting is effective, and meaning that just a small increase in term size or rewrite length may result in failure. This paper proposes a semi-automatic rewriting technique as a means to scale rewriting by allowing human insight at key decision points. Specifically, we propose guided equality saturation that embraces human guidance when fully automated equality saturation does not scale. The rewriting is split into two simpler automatic equality saturation steps: from the original term to a human-provided intermediate guide, and from the guide to the target. Complex rewriting tasks may require multiple guides, resulting in a sequence of equality saturation steps. A guide can be a complete term, or a sketch containing undefined elements that are instantiated by the equality saturation search. Such sketches may be far more concise than complete terms. We demonstrate the generality and effectiveness of guided equality saturation using two case studies. First, we integrate guided equality saturation in the Lean 4 proof assistant. Proofs are written in the style of textbook proof sketches, as a series of calculations omitting details and skipping steps. These proofs conclude in less than a second instead of minutes when compared to unguided equality saturation, and can find complex proofs that previously had to be done manually. Second, in the compiler of the RISE array language, where unguided equality saturation fails to perform optimizati

Keywords: e-graphs; equality saturation; theorem provers; optimizing compilers

Architecture-Aware Currying

32nd International Conference on Parallel Architectures and Compilation Techniques (PACT)

Authors: Kandemir, Mahmut Taylan; Akbulut, Gulsum Gudukbay; Choi, Wonil; Karakoy, Mustafa. Affiliations: Penn State Univ University Pk PA 16802 USA; Hanyang Univ Seoul South Korea; TUBITAK BILGEM Gebze Turkiye

ISBN: (print) 9798350342543

In near-data computing (NDC), computation is brought into data, as opposed to bringing data to computation. While there is prior work focusing on different NDC opportunities, there is no study, to our knowledge, that investigates the importance of "neighborhood" in NDC. This paper explores the neighborhood concept in multithreaded programs that run on on-chip network-based manycore systems. We define the concept of "neighborhood", in terms of on-chip network links, and use it to formulate the NDC problem. We propose a "generic" compiler algorithm, called "architecture-aware currying", that uses the neighborhood concept to implement NDC. So, a core can perform some portions of computation with the nearby data and postpone the remainder of the computation until the remaining data become nearby. It can also perform computations - with nearby data - on behalf of other cores. Our experimental evaluation shows that the proposed compiler algorithm outperforms state-of-the-art data locality optimization strategies.
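The idea of performing part of a computation on nearby data and postponing the rest can be pictured with ordinary currying. The sketch below is only an analogy under stated assumptions: `curried_total`, `near_block`, and `far_block` are hypothetical names, and nothing here models on-chip network links or the paper's compiler algorithm. The first call does the work that is possible with the data at hand and returns a function that finishes once the remaining data is available.

```python
def curried_total(near_block):
    """Reduce the data that is already nearby, and defer the rest."""
    # This portion runs immediately, using only the nearby data.
    partial_sum = sum(near_block)

    def finish(far_block):
        # This portion runs later, once the remaining data has become nearby.
        return partial_sum + sum(far_block)

    return finish


# Stage 1: work on the nearby data. Stage 2: complete with the remainder.
finish_later = curried_total([1, 2, 3])
total = finish_later([4, 5])
```

The returned closure carries only the partial result, not the original nearby data, which is the property that makes deferring the remainder cheap in this toy setting.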

Keywords: currying; optimizing compilers; data locality; distance-to-data; manycore systems

Automatic code optimization for computing the McCaskill partition functions

17th Conference on Computer Science and Intelligence Systems (FedCSIS)

Authors: Bielecki, Wlodzimierz; Palkowski, Marek; Poliwoda, Maciej. Affiliation: West Pomeranian Univ Technol Szczecin Ul Zolnierska 49 PL-71210 Szczecin Poland

ISBN: (digital) 9788396242396

ISBN: (print) 9788396242396

In this paper, we present the application of three automatic source-to-source compilers to code implementing McCaskill's bioinformatics algorithm, which computes probabilities of various substructures for RNA prediction. McCaskill's algorithm is compute- and data-intensive and belongs to the class of dynamic programming algorithms. The corresponding code exposes non-uniform dependences that complicate its tiling. The code is represented within the polyhedral model, and its optimization is still a challenging task for optimizing compilers employing multi-threaded loop tiling. To generate optimized code, we used the popular PLuTo compiler, which finds and applies affine transformations; the TRACO compiler, based on calculating the transitive closure of loop dependence graphs; and the newest polyhedral tool, DAPT, which implements space-time tiling. An experimental study carried out on two multi-core machines, an AMD Epyc with 64 threads and a 2x Intel Xeon Platinum 9242 with 192 threads, demonstrates considerable speedup, high locality, and scalability of the code generated by means of space-time tiling for various problem sizes and numbers of threads.

Keywords: Codes; optimizing compilers; Instruction sets; RNA; Heuristic algorithms; Platinum; Prediction algorithms

Compiler Support for Structured Data

31st ACM/SIGDA International Symposium on Field Programmable Gate Arrays (FPGA)

Author: Amarasinghe, Saman. Affiliation: MIT Dept Elect Engn & Comp Sci EECS Cambridge MA 02139 USA

ISBN: (print) 9781450394178

In 1957, the FORTRAN language and compiler introduced multidimensional dense arrays or dense tensors. Subsequent programming languages added a myriad of data structures from lists, sets, hash tables, trees, to graphs. Still, when dealing with extremely large data sets, dense tensors are the only simple and practical solution. However, modern data is anything but dense. Real world data, generated by sensors, produced by computation, or created by humans, often contain underlying structure, such as sparsity, runs of repeated values, or *** this talk I will describe how programming languages and compilers can support large data sets with structure. I will introduce TACO, a compiler for sparse data computing. TACO is the first system to automatically generate kernels for any tensor algebra operation on tensors in any of the commonly used formats. It pioneered a new technique for compiling compound tensor expressions into efficient loops in a systematic way. TACO-generated code is competitive in performance with best-in-class hand-written codes for tensor and matrix operations. With TACO, I will show how to put sparse array programming on the same compiler transformation and code generation footing as dense array codes. Structured data has immense potential for hardware acceleration. However, instead of one-off single-operation compute engines, with compiler frameworks such as TACO, I believe that it is possible to create hardware for an entire class of sparse computations. With the help of the FPGA community, I am looking forward to such a future.

Keywords: Domain Specific Languages; Sparse Computing; optimizing compilers; Lossless Compression

Optimized Code Generation for Deep Neural Networks

33rd International Workshop on Languages and Compilers for Parallel Computing (LCPC)

Authors: Lake, Janaan; Patabandi, Tharindu R.; Hall, Mary. Affiliation: Univ Utah Sch Comp Salt Lake City UT 84112 USA

ISBN: (print) 9783030959531; 9783030959524

As Deep Neural Networks (DNNs) become more widely used in a variety of applications, the need for performance and portability on many different architectures, including CPUs, becomes increasingly important. Compiler-based methods offer opportunities for performance gains over statically-tuned libraries by exploiting data reuse and parallelism, efficient memory access, and vectorization for specific backends with the use of abstraction. The Batch Normalization (BN) operator can accelerate the training and increase the robustness of DNNs, making it a widely-used operator in many DNNs. LATTE is a domain-specific language for DNNs, and SWIRL is a compiler that can be used with LATTE. We extend the applicability of LATTE/SWIRL by incorporating the BN operator into the LATTE framework and by expanding the optimizations of SWIRL to this operator. The optimized BN operator in LATTE/SWIRL is compared to existing frameworks such as TensorFlow, TensorFlow with Intel MKL-DNN, TensorFlow with XLA, PyTorch with MKL-DNN and MXNet with MKL-DNN. The results show that a compiler-based approach for the BN operator can increase performance on CPU architectures.
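For readers unfamiliar with the operator named in this abstract, the standard training-mode batch normalization computation for a single feature can be sketched as follows. This is a plain reference formulation, not LATTE/SWIRL code and not the optimized operator the paper evaluates; the function name `batch_norm` and the default `eps` value are assumptions chosen for illustration.

```python
import math


def batch_norm(xs, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize one feature over a mini-batch, then scale and shift."""
    n = len(xs)
    mean = sum(xs) / n
    # Biased (population) variance over the mini-batch.
    var = sum((x - mean) ** 2 for x in xs) / n
    inv_std = 1.0 / math.sqrt(var + eps)
    # y = gamma * (x - mean) / sqrt(var + eps) + beta
    return [gamma * (x - mean) * inv_std + beta for x in xs]


normalized = batch_norm([1.0, 2.0, 3.0, 4.0])
```

The operator makes several passes over the same data (mean, variance, normalization), which is one reason data reuse matters when it is compiled for CPUs.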

Keywords: optimizing compilers; Batch normalization; Deep neural networks; Code generation

FreeTensor: A Free-Form DSL with Holistic Optimizations for Irregular Tensor Programs

43rd ACM SIGPLAN International Conference on Programming Language Design and Implementation (PLDI)

Authors: Tang, Shizhi; Zhai, Jidong; Wang, Haojie; Jiang, Lin; Zheng, Liyan; Yuan, Zhenhao; Zhang, Chen. Affiliation: Tsinghua Univ Beijing Peoples R China

ISBN: (print) 9781450392655

Tensor programs are of critical use in many domains. Existing frameworks, such as PyTorch, TensorFlow, and JAX, adopt operator-based programming to ease programming, increase performance, and perform automatic differentiation. However, with the rapid development of tensor programs, operator-based programming shows significant limitations for irregular patterns, since a large amount of redundant computation or memory access is introduced. In this work, we propose FreeTensor, a free-form domain specific language which supports redundancy-avoiding programming by introducing fine-grained control flow. With optimizations including partial evaluation, dependence-aware transformations, and fine-grained automatic differentiation, FreeTensor is able to generate high-performance tensor programs on both CPU and GPU. Experiments show a speedup over existing tensor programming frameworks of up to 5.10x (2.08x on average) without differentiation, and up to 127.74x (36.26x on average) after differentiation, for typical irregular tensor programs.

Keywords: tensor computing; optimizing compilers; DSL

Lambda the Ultimate SSA: Optimizing Functional Programs in SSA

20th IEEE/ACM International Symposium on Code Generation and Optimization (CGO)

Authors: Bhat, Siddharth; Grosser, Tobias. Affiliations: IIIT Hyderabad CSTAR Hyderabad India; Univ Edinburgh Sch Informat Edinburgh Midlothian Scotland

ISBN: (print) 9781665405843

Static Single Assignment (SSA) is the workhorse of modern optimizing compilers for imperative programming languages. However, functional languages have been slow to adopt SSA and prefer to use intermediate representations based on minimal lambda calculi due to SSA's inability to express higher-order constructs. We exploit a new SSA construct - regions - in order to express functional optimizations via classical SSA-based reasoning. Region optimization currently relies on ad-hoc analyses and transformations on imperative programs. These ad-hoc transformations are sufficient for imperative languages as regions are used in a limited fashion. In contrast, we use regions pervasively to model sub-expressions in our functional IR. This motivates us to systematize region optimizations. We extend classical SSA reasoning to regions for functional-style analyses and transformations. We implement a new SSA+regions based backend for LEAN4, a theorem prover that implements a purely functional, dependently typed programming language. Our backend is feature-complete and handles all constructs of LEAN4's functional intermediate representation lambda rc within the SSA framework. We evaluate our proposed region optimizations by optimizing lambda rc within an SSA+regions based framework implemented in MLIR and demonstrating performance parity with the current LEAN4 backend. We believe our work will pave the way for a unified optimization framework capable of representing, analyzing, and optimizing both functional and imperative languages.

Keywords: optimizing compilers; Functional programming

Quantum Circuit Optimization and Transpilation via Parameterized Circuit Instantiation

3rd IEEE International Conference on Quantum Computing and Engineering (QCE)

Authors: Younis, Ed; Iancu, Costin. Affiliation: Lawrence Berkeley Natl Lab Berkeley CA 94720 USA

ISBN: (digital) 9781665491136

ISBN: (print) 9781665491136

Parameterized circuit instantiation is a common technique encountered in the generation of circuits for a large class of hybrid quantum-classical algorithms. Despite being supported by popular quantum compilation infrastructures such as IBM Qiskit and Google Cirq, instantiation has not been extensively considered in the context of circuit compilation and optimization pipelines. In this work, we describe algorithms to apply instantiation during two common compilation steps: circuit optimization and gate-set transpilation. When placed in a compilation workflow, our circuit optimization algorithm produces circuits with an average of 13% fewer gates than other optimizing compilers. Our gate-set transpilation algorithm can target any gate-set, even sets with multiple two-qubit gates, and produces circuits with an average of 12% fewer two-qubit gates than other compilers. Overall, we show how instantiation can be incorporated into a compiler workflow to improve circuit quality and enhance portability, all while maintaining a reasonably low compile time overhead.

Keywords: Circuit optimization; Runtime; optimizing compilers; Pipelines; Logic gates; Hybrid power systems; Hardware

One-Shot Tuner for Deep Learning Compilers

31st ACM SIGPLAN International Conference on Compiler Construction (CC)

Authors: Ryu, Jaehun; Park, Eunhyeok; Sung, Hyojin. Affiliation: POSTECH Dept Comp Sci & Engn Pohang South Korea

ISBN: (print) 9781450391832

Auto-tuning DL compilers are gaining ground as an optimizing back-end for DL frameworks. While existing work can generate deep learning models that exceed the performance of hand-tuned libraries, they still suffer from prohibitively long auto-tuning time due to repeated hardware measurements in large search spaces. In this paper, we take a neural-predictor inspired approach to reduce the auto-tuning overhead and show that a performance predictor model trained prior to compilation can produce optimized tensor operation codes without repeated search and hardware measurements. To generate a sample-efficient training dataset, we extend input representation to include task-specific information and to guide data sampling methods to focus on learning high-performing codes. We evaluated the resulting predictor model, One-Shot Tuner, against AutoTVM and other prior work, and the results show that One-Shot Tuner speeds up compilation by 2.81x to 67.7x compared to prior work while providing comparable or improved inference time for CNN and Transformer models.

Keywords: optimizing compilers; autotuning; performance models; deep neural networks
