Search Results - Inner Mongolia University Library

APINT: A Full-Stack Framework for Acceleration of Privacy-Preserving Inference of Transformers based on Garbled Circuits

arXiv, 2025

Authors: Cho, Hyunjun; Jeon, Jaeho; Heo, Jaehoon; Kim, Joo-Young (KAIST, Daejeon, Republic of Korea)

As the importance of Privacy-Preserving Inference of Transformers (PiT) increases, a hybrid protocol that integrates Garbled Circuits (GC) and Homomorphic Encryption (HE) is emerging for its implementation. While this protocol is preferred for its ability to maintain accuracy, it has a severe drawback of excessive latency. To address this, existing protocols primarily focused on reducing HE latency, thus making GC the new latency bottleneck. Furthermore, previous studies only focused on individual computing layers, such as protocol or hardware accelerator, lacking a comprehensive solution at the system level. This paper presents APINT, a full-stack framework designed to reduce PiT’s overall latency by addressing the latency problem of GC through both software and hardware solutions. APINT features a novel protocol that reallocates possible GC workloads to alternative methods (i.e., HE or standard matrix operation), substantially decreasing the GC workload. It also suggests GC-friendly circuit generation that reduces, as far as possible, the number of AND gates, which are the expensive operator in GC. Furthermore, APINT proposes an innovative netlist scheduling that combines coarse-grained operation mapping and fine-grained scheduling for maximal data reuse and minimal dependency. Finally, APINT’s hardware accelerator, combined with its compiler speculation, effectively resolves the memory stall issue. Putting it all together, APINT achieves a remarkable end-to-end reduction in latency, outperforming the existing protocol on a CPU platform by 12.2× online and 2.2× offline. Meanwhile, the APINT accelerator not only reduces its latency by 3.3× but also reduces energy consumption by 4.6× while operating PiT compared to the state-of-the-art GC accelerator. © 2025, CC BY-NC-ND.
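
To illustrate why the AND-gate count matters, the sketch below compares two Boolean formulations of a 1-bit full adder. It assumes the widely used free-XOR optimization, under which XOR gates cost nothing to garble while AND and OR gates each need a garbled table; the abstract does not name this optimization, and the function names are hypothetical. This is not APINT's circuit generator, only a small example of the kind of saving a GC-friendly circuit can give.

```python
from itertools import product


def full_adder_naive(a, b, c):
    """Textbook full adder: the carry uses two ANDs and one OR."""
    s = a ^ b ^ c
    carry = (a & b) | (c & (a ^ b))
    non_free_gates = 3  # 2 AND + 1 OR; XORs are assumed free (free-XOR)
    return s, carry, non_free_gates


def full_adder_gc_friendly(a, b, c):
    """Same truth table, but the carry needs only a single AND."""
    s = a ^ b ^ c
    carry = c ^ ((a ^ c) & (b ^ c))
    non_free_gates = 1  # 1 AND; every other gate is an XOR
    return s, carry, non_free_gates


def ripple_adder_cost(bits, adder):
    """Non-free gate count of a ripple-carry adder built from `adder`."""
    return bits * adder(0, 0, 0)[2]


# Both formulations agree on all eight input combinations.
for a, b, c in product((0, 1), repeat=3):
    assert full_adder_naive(a, b, c)[:2] == full_adder_gc_friendly(a, b, c)[:2]
```

For a 32-bit ripple-carry adder this gives 96 non-free gates for the textbook form and 32 for the single-AND form, under the stated assumption.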

Keywords: program compilers

Enhancing The Open Network: Definition and Automated Detection of Smart Contract Defects

arXiv, 2025

Authors: Song, Hao; Li, Teng; Chen, Jiachi; Chen, Ting; Li, Beibei; Lin, Zhangyan; Lu, Yi; Li, Pan; Zhou, Xihan (Sichuan University, Chengdu, China; University of Electronic Science and Technology of China, Chengdu, China; Sun Yat-Sen University, Guangzhou, China; BitsLab, Singapore; TonBit, China)

The Open Network (TON), designed to support Telegram’s extensive user base of hundreds of millions, has garnered considerable attention since its launch in 2022. FunC is the most popular programming language for writing smart contracts on TON. It is distinguished by a unique syntax compared to other smart contract languages. Despite growing interest, research on the practical defects of TON smart contracts is still in its early stages. In this paper, we summarize eight smart contract defects identified from TON’s official blogs and audit reports, each with detailed definitions and code examples. Furthermore, we propose a static analysis framework called TONScanner to facilitate the detection of these defects. Specifically, TONScanner reuses FunC compiler’s frontend code to transform the FunC source code into FunC intermediate representation (IR) in the form of a directed acyclic graph (DAG). Based on this IR, TONScanner constructs a control flow graph (CFG), then transforms it into a static single assignment (SSA) form to simplify further analysis. TONScanner also integrates Data Dependency, Call Graph, Taint Analysis, and Cell Construct, which are specifically tailored for TON blockchain’s unique data structures. These components finally facilitate the identification of the eight defects. We evaluate the effectiveness of TONScanner by applying it to 1,640 smart contracts and find a total of 14,995 defects. Through random sampling and manual labeling, we find that TONScanner achieves an overall precision of 97.49%. The results reveal that current TON contracts contain numerous defects, indicating that developers are prone to making errors. TONScanner has proven its ability to accurately identify these defects, thereby aiding in their correction. Copyright © 2025, The Authors. All rights reserved.
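
The taint-analysis step mentioned in the abstract can be sketched on a toy SSA-style program. Everything here is an assumption made for illustration: the tuple-based IR, the `source` and `sink` opcodes, and the variable names are hypothetical and are not FunC IR or TONScanner's data structures. The sketch handles straight-line code only, with no control flow graph or phi nodes.

```python
# Toy SSA-style IR: (destination, opcode, arguments). Each destination is
# assigned exactly once, and definitions precede uses.
PROGRAM = [
    ("msg", "source", []),             # hypothetical untrusted input
    ("amount", "field", ["msg"]),      # derived from the untrusted input
    ("fee", "const", []),              # trusted constant
    ("total", "add", ["amount", "fee"]),
    (None, "sink", ["total"]),         # hypothetical sensitive operation
    (None, "sink", ["fee"]),
]


def find_tainted_sinks(program):
    """Return indices of sink instructions that receive tainted values."""
    tainted = set()
    findings = []
    for index, (dest, op, args) in enumerate(program):
        uses_taint = any(arg in tainted for arg in args)
        if op == "source":
            tainted.add(dest)
        elif op == "sink":
            if uses_taint:
                findings.append(index)
        elif uses_taint:
            # Any other operation propagates taint to its result.
            tainted.add(dest)
    return findings
```

A single forward pass suffices here because SSA guarantees one definition per name; a real analysis over a CFG would iterate to a fixed point.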

Keywords: program compilers

MappedTrace: Tracing Pointer Remotely with Compiler-generated Maps

arXiv, 2025

Authors: Ma, Zhiyao; Li, Caihua; Zhong, Lin (Department of Computer Science, Yale University, United States)

Existing precise pointer tracing methods introduce substantial runtime overhead to the program being traced and are applicable only at specific program execution points. We propose MappedTrace, which leverages compiler-generated read-only maps to accurately identify all pointers in any given snapshot of a program’s execution state. The maps record the locations and types of pointers, allowing the tracer to precisely identify pointers without requiring the traced program to maintain bookkeeping data structures or poll at safe points, thereby reducing runtime overhead. By running the tracer from a different address space or machine, MappedTrace presents new opportunities to improve memory management techniques like memory leak detection and enables novel use cases such as infinite memory abstraction for resource-constrained environments. © 2025, CC BY-NC-ND.
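
The idea of tracing a snapshot with a read-only pointer map can be sketched as follows. The map format, the type names, and the dictionary-based snapshot are all hypothetical stand-ins, not MappedTrace's actual representation. The point of the example is that the tracer follows only the words the map marks as pointers, so an integer that happens to look like an address is not mistaken for a reference.

```python
# Stand-in for a compiler-generated, read-only map: for each type, the word
# offsets inside an object that hold pointers.
POINTER_MAP = {
    "Node": [1],  # layout: [value, next]
    "Int": [],    # layout: [value], never a pointer
}

# Snapshot of a heap taken from outside the traced program.
SNAPSHOT = {
    100: [7, 200],   # Node -> 200
    200: [8, None],  # Node, end of list
    300: [400],      # Int whose value merely looks like address 400
    400: [9, None],  # Node that nothing points to
}
TYPES = {100: "Node", 200: "Node", 300: "Int", 400: "Node"}


def trace(snapshot, types, roots, pointer_map=POINTER_MAP):
    """Return the addresses reachable from `roots`, following only the
    offsets that the map identifies as pointers."""
    reachable = set()
    stack = list(roots)
    while stack:
        addr = stack.pop()
        if addr is None or addr in reachable:
            continue
        reachable.add(addr)
        for offset in pointer_map[types[addr]]:
            stack.append(snapshot[addr][offset])
    return reachable


def leaked(snapshot, types, roots):
    """Objects present in the snapshot but unreachable from the roots."""
    return set(snapshot) - trace(snapshot, types, roots)
```

With roots 100 and 300, object 400 is reported as leaked even though the integer stored at 300 equals its address; a conservative tracer without the map would have kept it alive.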

Keywords: program compilers

Equality Saturation for Optimizing High-Level Julia IR

arXiv, 2025

Authors: Merckx, Jules; Besard, Tim; de Sutter, Bjorn (Computing Systems Lab, Ghent University, Ghent, Belgium; Cambridge, MA, United States)

Compilers are indispensable for transforming code written in high-level languages into performant machine code, but their general-purpose optimizations sometimes fall short. Domain experts might be aware of certain optimizations that the compiler is unable to apply or that are only valid in a particular domain. We have developed a system that allows domain experts to express rewrite rules to optimize code in the Julia programming language. Our system builds on e-graphs and equality saturation. It can apply optimizations in the presence of control flow and side effects. As Julia uses multiple dispatch, we allow users to constrain rewrite rules by argument types, and propagate type information through the e-graph representation. We propose an ILP formulation for optimal e-graph extraction taking into account dominance properties for code reuse and introduce CFG skeleton relaxation to rewrite calls to pure functions as well as those with side effects. Use cases demonstrate that our system can perform rewrites on high-level, domain-specific code, as well as on lower-level code such as Julia’s broadcasting mechanism. Finally, we analyze the required compilation time. © 2025, CC BY.
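
The core idea of equality saturation, applying every rule without discarding the original term and extracting the cheapest result afterwards, can be sketched in a few lines. This is a deliberately simplified stand-in: it enumerates an explicit set of equivalent terms rather than sharing them in an e-graph, it uses a term-size cost instead of the ILP extraction described above, and it is written in Python rather than Julia. The four rewrite rules are hypothetical examples.

```python
def root_rewrites(t):
    """Terms equal to `t` obtained by applying one rule at the root.
    Terms are variables (str), integers, or tuples (op, left, right)."""
    if not isinstance(t, tuple):
        return []
    op, a, b = t
    out = []
    if op == "*" and b == 2:
        out.append(("<<", a, 1))                 # x * 2   ->  x << 1
    if op == "/" and isinstance(a, tuple) and a[0] == "*":
        out.append(("*", a[1], ("/", a[2], b)))  # (x*y)/z ->  x * (y/z)
    if op == "/" and a == b and a != 0:
        out.append(1)                            # x / x   ->  1
    if op == "*" and b == 1:
        out.append(a)                            # x * 1   ->  x
    return out


def all_rewrites(t):
    """Apply every rule at every position of `t`, one step each."""
    results = list(root_rewrites(t))
    if isinstance(t, tuple):
        op, a, b = t
        results += [(op, a2, b) for a2 in all_rewrites(a)]
        results += [(op, a, b2) for b2 in all_rewrites(b)]
    return results


def saturate(term, max_terms=1000):
    """Collect equivalent terms until no rule adds a new one (or a cap)."""
    seen, frontier = {term}, [term]
    while frontier and len(seen) < max_terms:
        new_terms = []
        for t in frontier:
            for r in all_rewrites(t):
                if r not in seen:
                    seen.add(r)
                    new_terms.append(r)
        frontier = new_terms
    return seen


def cost(t):
    """Number of nodes in the term."""
    return 1 if not isinstance(t, tuple) else 1 + cost(t[1]) + cost(t[2])


def optimize(term):
    return min(saturate(term), key=cost)


EXPR = ("/", ("*", "a", 2), 2)  # (a * 2) / 2
```

Here `optimize(EXPR)` returns `"a"`. A greedy rewriter that first turned `a * 2` into `a << 1` and threw the original away would be stuck at `(a << 1) / 2`; keeping both forms is what lets the later rules fire.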

Keywords: program compilers

A Multi-level Compiler Backend for Accelerated Micro-kernels Targeting RISC-V ISA Extensions

arXiv, 2025

Authors: Lopoukhine, Alexandre; Ficarelli, Federico; Vasiladiotis, Christos; Lydike, Anton; Van Delm, Josse; Dutilleul, Alban; Benini, Luca; Verhelst, Marian; Grosser, Tobias (University of Cambridge, Cambridge, United Kingdom; University of Bologna, Bologna, Italy; University of Edinburgh, Edinburgh, United Kingdom; KU Leuven, Leuven, Belgium; ENS Rennes, Rennes, France; ETH Zurich, Zurich, Switzerland; Cineca, Bologna, Italy)

High-performance micro-kernels must fully exploit today’s diverse and specialized hardware to deliver peak performance to deep neural networks (DNNs). While higher-level optimizations for DNNs are offered by numerous compilers (e.g., MLIR, TVM, OpenXLA), performance-critical micro-kernels are left to specialized code generators or handwritten assembly. Even though widely-adopted compilers (e.g., LLVM, GCC) offer tuned backends, their CPU-focused input abstraction, unstructured intermediate representation (IR), and general-purpose best-effort design inhibit tailored code generation for innovative hardware. We think it is time to widen the classical hourglass backend and embrace progressive lowering across a diverse set of structured abstractions to bring domain-specific code generation to compiler backends. We demonstrate this concept by implementing a custom backend for a RISC-V-based accelerator with hardware loops and streaming registers, leveraging knowledge about the hardware at levels of abstraction that match its custom instruction set architecture (ISA). We use incremental register allocation over structured IRs, while dropping classical spilling heuristics, and show up to 90% floating-point unit (FPU) utilization across key DNN kernels. By breaking the backend hourglass model, we reopen the path from domain-specific abstractions to specialized hardware. © 2025, CC BY-NC-ND.

Keywords: program compilers

TenSet: A Large-scale Program Performance Dataset for Learned Tensor Compilers

35th Conference on Neural Information Processing Systems - Track on Datasets and Benchmarks, NeurIPS Datasets and Benchmarks 2021

Authors: Zheng, Lianmin; Liu, Ruochen; Shao, Junru; Chen, Tianqi; Gonzalez, Joseph E.; Stoica, Ion; Haj-Ali, Ameer (UC Berkeley, United States; OctoML, United States; Carnegie Mellon University, United States)

Search-based tensor compilers can greatly accelerate the execution of machine learning models by generating high-performance tensor programs, such as matrix multiplications and convolutions. These compilers take a high-level mathematical expression as an input and search for the fastest low-level implementation. At the core of the search procedure is a cost model, which estimates the performance of different implementations to reduce the frequency of time-consuming on-device measurements. There has been a growing interest in using deep learning techniques to learn a cost model to ease the effort of building an analytical model. To realize the potential of such deep learning models, a standard dataset for pre-training and benchmarking learned cost models is necessary. However, this dataset is lacking. We introduce TenSet, a large-scale tensor program performance dataset. TenSet contains 52 million program performance records collected from 6 hardware platforms. We provide comprehensive studies on how to learn and evaluate the cost models, including data collection, model architectures, loss functions, transfer learning, and evaluation metrics. We also show that a cost model pre-trained on TenSet can accelerate the search time in the state-of-the-art tensor compiler by up to 10×. © 2021 Neural information processing systems foundation. All rights reserved.
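
The role of the cost model in the search loop can be sketched as follows: the model ranks all candidates cheaply, and only the few predicted-fastest ones are measured on the device. The candidate set (tile sizes), the `true_latency` function, and the hand-written `cost_model` are hypothetical stand-ins; TenSet's cost models are learned from data, which this sketch does not attempt.

```python
def search(candidates, cost_model, measure, top_k):
    """Rank candidates with the cost model, then measure only the top_k
    predicted-fastest ones. Returns (best candidate, number of measurements)."""
    ranked = sorted(candidates, key=cost_model)
    measured = {c: measure(c) for c in ranked[:top_k]}
    best = min(measured, key=measured.get)
    return best, len(measured)


# Hypothetical search space: tile sizes for a tensor program.
TILE_SIZES = [1, 2, 4, 8, 16, 32, 64]


def true_latency(tile):
    """Stand-in for a slow on-device measurement; fastest at tile size 16."""
    return abs(tile - 16) + 10


def cost_model(tile):
    """Imperfect but cheap predictor of relative latency."""
    return abs(tile - 12)
```

With `top_k=2`, the model's two favourites (8 and 16) are measured and 16 is chosen, so two measurements replace the seven an exhaustive search would need. The result is only as good as the model's ranking, which is why a dataset for training and benchmarking such models matters.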

Keywords: program compilers

An Attempt to Catch Up with JIT Compilers: The False Lead of Optimizing Inline Caches

arXiv, 2025

Authors: Poirier, Aurore; Rohou, Erven; Serrano, Manuel (University of Rennes, Inria, CNRS, IRISA, France; Inria, University of Côte d’Azur, France)

Context: Just-in-Time (JIT) compilers are able to specialize the code they generate according to a continuous profiling of the running programs. This gives them an advantage when compared to Ahead-of-Time (AoT) compilers that must choose the code to generate once and for all. Inquiry: Is it possible to improve the performance of AoT compilers by adding Dynamic Binary Modification (DBM) to the executions? Approach: We added to the Hopc AoT JavaScript compiler a new DBM-based optimization of the inline cache (IC), a classical optimization that dynamic languages use to implement object property accesses efficiently. Knowledge: Reducing the number of memory accesses, as the new optimization does, does not shorten execution times on contemporary architectures. Grounding: The DBM optimization we have implemented is fully operational on x86_64 architectures. We have conducted several experiments to evaluate its impact on performance and to study the reasons for the lack of acceleration. Importance: The (negative) result we present in this paper sheds new light on the best strategy to be used to implement dynamic languages. It shows that the days when removing instructions or memory reads always yielded a speedup are over. Nowadays, implementing sophisticated compiler optimizations is only worth the effort if the processor is not able by itself to accelerate the code. This result applies to AoT compilers as well as JIT compilers. Copyright © 2025, The Authors. All rights reserved.
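
For readers unfamiliar with inline caches, the sketch below shows the classical mechanism the abstract refers to: each property-access site remembers the object shape it last saw and the slot where the property lives, so repeated accesses on same-shaped objects skip the dictionary lookup. The class names are hypothetical, and this models only the ordinary monomorphic cache, not Hopc's implementation or the DBM-based variant the paper evaluates.

```python
class Shape:
    """Hidden class: maps property names to slot indices."""

    def __init__(self, names):
        self.slots = {name: i for i, name in enumerate(names)}


class Obj:
    """Object whose property values are stored in slots given by its shape."""

    def __init__(self, shape, values):
        self.shape = shape
        self.values = list(values)


class InlineCache:
    """Monomorphic cache attached to one property-access site."""

    def __init__(self, name):
        self.name = name
        self.cached_shape = None
        self.cached_slot = None
        self.hits = 0
        self.misses = 0

    def load(self, obj):
        if obj.shape is self.cached_shape:
            # Fast path: one identity check, then a direct slot read.
            self.hits += 1
            return obj.values[self.cached_slot]
        # Slow path: full lookup, then remember shape and slot.
        self.misses += 1
        slot = obj.shape.slots[self.name]
        self.cached_shape, self.cached_slot = obj.shape, slot
        return obj.values[slot]
```

Loading `y` from three objects that share one shape costs one miss and two hits; an object with a different shape causes another miss and replaces the cached entry.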

Keywords: program compilers

COMPASS: A Compiler Framework for Resource-Constrained Crossbar-Array Based In-Memory Deep Learning Accelerators

arXiv, 2025

Authors: Park, Jihoon; Choe, Jeongin; Kim, Dohyun; Kim, Jae-Joon (Seoul National University, Seoul, Republic of Korea)

Recently, crossbar array based in-memory accelerators have been gaining interest due to their high throughput and energy efficiency. While software and compiler support for the in-memory accelerators has also been introduced, they are currently limited to the case where all weights are assumed to be on-chip. This limitation becomes apparent with the significantly increasing network sizes compared to the in-memory footprint. Weight replacement schemes are essential to address this issue. We propose COMPASS, a compiler framework for resource-constrained crossbar-based processing-in-memory (PIM) deep neural network (DNN) accelerators. COMPASS is specially targeted for networks that exceed the capacity of PIM crossbar arrays, necessitating access to external memories. We propose an algorithm to determine the optimal partitioning that divides the layers so that each partition can be accelerated on chip. Our scheme takes into account the data dependence between layers, core utilization, and the number of write instructions to minimize latency and memory accesses and to improve energy efficiency. Simulation results demonstrate that COMPASS can accommodate many more networks using a minimal memory footprint, while improving throughput by 1.78X and providing 1.28X savings in energy-delay product (EDP) over baseline partitioning methods. © 2025, CC BY.
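
The partitioning problem can be illustrated with a simple greedy baseline: walk the layers in order and start a new partition whenever the next layer's weights would no longer fit on chip. This is not the COMPASS algorithm, which searches for an optimal partitioning and also weighs data dependence, core utilization, and write counts; the function name and the abstract size units are assumptions.

```python
def partition_layers(weight_sizes, capacity):
    """Greedily group consecutive layers so that each group's total weight
    size fits within the on-chip crossbar capacity.

    Returns a list of partitions, each a list of layer indices."""
    partitions = []
    current, used = [], 0
    for index, size in enumerate(weight_sizes):
        if size > capacity:
            # A single layer larger than the chip would need to be split,
            # which this sketch does not handle.
            raise ValueError(f"layer {index} exceeds on-chip capacity")
        if used + size > capacity:
            partitions.append(current)
            current, used = [], 0
        current.append(index)
        used += size
    if current:
        partitions.append(current)
    return partitions
```

For layer sizes `[4, 3, 5, 2, 6]` and a capacity of 8, the result is `[[0, 1], [2, 3], [4]]`: three rounds of weight loading instead of one, with each round fitting on chip.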

Keywords: program compilers

Fully-Automated Code Generation for Efficient Computation of Sparse Matrix Permanents on GPUs

arXiv, 2025

Authors: Elbek, Deniz; Kaya, Kamer (Department of Computer Science and Engineering, Faculty of Engineering and Natural Sciences, Sabanci University, Istanbul, Turkey)

Registers are the fastest memory components within the GPU's complex memory hierarchy, accessed by names rather than addresses. They are managed entirely by the compiler through a process called register allocation, during which the compiler attempts to cache predictable data from thread-local memory into thread-private registers. Computing the permanent of a sparse matrix poses a challenge for compilers, as optimizing this process is hindered by the unpredictable distribution of nonzero elements, which only become known at runtime. In this work, we employ fully-automated code generation to address this, producing highly optimized kernels tailored to the matrix's sparsity pattern. State-of-the-art permanent computation algorithms require each thread to store a private array, denoted x, of size n. We first propose a technique that fully stores these arrays in registers, with inclusion and exclusion kernels generated for each column. To minimize control divergence and reduce the number of unique kernels within a warp, we exploit the internal structure of Gray codes, which are also used in the state-of-the-art algorithm. Our second technique reduces register pressure by utilizing both registers and global memory and introduces a matrix ordering and partitioning strategy for greater efficiency. On synthetic matrices, this approach achieves a 31× speedup over state-of-the-art CPU implementations on 112 cores, and an 8× speedup compared to our traditional GPU implementation. For real-world matrices, these speedups are 24.9× and 4.9×. Copyright © 2025, The Authors. All rights reserved.
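
The private array x and the per-column inclusion and exclusion steps can be seen in a sequential sketch of Ryser's formula driven by a Gray code. This is plain single-threaded Python, not the generated GPU kernels, and it may differ from the exact variant the authors build on. Between consecutive Gray codes exactly one column enters or leaves the subset, so only that column's nonzeros touch x.

```python
from math import prod


def permanent(A):
    """Permanent of a square matrix via Ryser's formula in Gray-code order."""
    n = len(A)
    # Nonzeros of each column as (row, value) pairs: the sparsity pattern.
    cols = [[(i, A[i][j]) for i in range(n) if A[i][j] != 0] for j in range(n)]
    x = [0] * n   # x[i] = sum of row i over the columns currently in S
    size = 0      # |S|
    total = 0
    for g in range(1, 1 << n):
        j = (g & -g).bit_length() - 1    # the single column that changes
        gray = g ^ (g >> 1)
        if (gray >> j) & 1:              # inclusion: column j enters S
            for i, v in cols[j]:
                x[i] += v
            size += 1
        else:                            # exclusion: column j leaves S
            for i, v in cols[j]:
                x[i] -= v
            size -= 1
        total += (-1) ** size * prod(x)
    return (-1) ** n * total
```

For `[[1, 2], [3, 4]]` the function returns 10, matching 1·4 + 2·3. Each step does work proportional to the number of nonzeros in one column, which is where a sparsity-specialized kernel can save effort.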

Keywords: program compilers