Search results - Inner Mongolia University Library

24th Euromicro Conference on Digital System Design (DSD)

Authors: Haleplidis, Evangelos; Tsakoulis, Thanasis; El-Kady, Alexander; Dimopoulos, Charis; Koufopavlou, Odysseas; Fournaris, Apostolos P.

Affiliations: RC ATHENA, Ind Syst Inst, Patras Sci Pk, Platani, Patras 26504, Greece; Univ Piraeus, Dept Digital Syst, Piraeus, Greece; Univ Patras, Elect & Comp Engn Dept, Rion Campus, Patras, Greece

ISBN: (print) 9781665427036

Lattice-based cryptography is a candidate for post-quantum cryptosystems, offering key exchange, digital signature, and encryption functionality. The Number Theoretic Transform (NTT) can improve the performance of these functions wherever polynomials need to be multiplied. NTT reduces the multiplication overhead by transforming the polynomials into the spectral domain, where multiplication is point-wise, and then inverting the result back to the original domain. Because this technique is used in a wide range of computing systems, it is important to optimize it. In this paper we study the feasibility of using OpenCL, a portable framework, to implement a parallelized version of NTT that can be deployed on heterogeneous platforms such as Graphics Processing Units (GPUs) and Field Programmable Gate Arrays (FPGAs). We measure the performance of our implementation on a GPU and evaluate when and where such a deployment is beneficial. Our results show that the proposed parallel implementation is a viable acceleration approach for lattice-based cryptography solutions.
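The transform, point-wise multiply, and inverse-transform pipeline described in the abstract can be illustrated with a minimal Python sketch. It is not the paper's OpenCL implementation: it uses a naive O(n^2) transform rather than the usual O(n log n) butterfly, and the parameters (modulus 17, length 8, root 2) are toy values chosen here for illustration. It also computes a cyclic product (modulo x^n - 1), whereas lattice schemes commonly work modulo x^n + 1. It needs Python 3.8 or later for modular inverses via `pow`.

```python
def ntt(coeffs, root, mod):
    """Naive O(n^2) number theoretic transform of coeffs modulo mod.

    root is assumed to be a primitive n-th root of unity modulo mod,
    where n = len(coeffs).
    """
    n = len(coeffs)
    return [
        sum(coeffs[j] * pow(root, i * j, mod) for j in range(n)) % mod
        for i in range(n)
    ]


def inverse_ntt(values, root, mod):
    """Inverse transform: run the NTT with root^-1 and scale by n^-1."""
    n = len(values)
    inv_n = pow(n, -1, mod)        # modular inverse of n
    inv_root = pow(root, -1, mod)  # modular inverse of the root
    return [(v * inv_n) % mod for v in ntt(values, inv_root, mod)]


def poly_mul_cyclic(a, b, root, mod):
    """Multiply two polynomials modulo (x^n - 1) and mod via the NTT."""
    fa = ntt(a, root, mod)
    fb = ntt(b, root, mod)
    pointwise = [(x * y) % mod for x, y in zip(fa, fb)]  # spectral domain
    return inverse_ntt(pointwise, root, mod)


# Toy parameters: 2 has multiplicative order 8 modulo 17.
MOD, ROOT = 17, 2
a = [1, 2, 0, 0, 0, 0, 0, 0]  # 1 + 2x
b = [3, 4, 0, 0, 0, 0, 0, 0]  # 3 + 4x
product = poly_mul_cyclic(a, b, ROOT, MOD)  # coefficients of 3 + 10x + 8x^2
```

Each output element of `ntt` depends only on the input vector, which is one reason the transform maps naturally onto data-parallel devices such as GPUs.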

Keywords: NTT; Inverse NTT; Cryptography; OpenCL; parallel programming

Optimizing MPI Collectives with Hierarchical Design for Efficient CPU Oversubscription

SSRN, 2023

Authors: Utrera, Gladys; Bull, J. Mark

Affiliations: Computer Architecture Department, Universitat Politècnica de Catalunya BarcelonaTECH, Barcelona 08034, Spain; EPCC, University of Edinburgh, Edinburgh EH8 9BT, United Kingdom

Node sizes in multicore clusters are becoming larger, so applications should exploit the shared memory inside a node to potentially reduce communication latencies compared to network communications. The Message Passing Interface (MPI) library serves as the de facto standard for parallel applications in distributed memory environments. The appearance of the MPI-3 shared memory extension has made the idea of optimizing MPI collective operations at the intra-node level both attractive and portable. Taking advantage of this facility, we present a hierarchical design of the algorithms for the MPI_Allreduce and MPI_Bcast collective operations, which we name Fullpar. The proposal is based on partitioning the messages and exploiting concurrency between network communication and shared-memory operations at the intra-node level in the message dissemination *** proposed hierarchical collective algorithms do not exploit all the available parallelism and/or the direct implementation of shared memory at the intra-node level. Furthermore, we explore the application of oversubscription to enhance resource utilization. Oversubscribing CPUs offers opportunities to optimize resource allocation, as well as making CPUs available to other applications. We implement our proposal on top of the Intel MPI and OpenMPI libraries using the MPI profiling (PMPI) mechanism, and carry out evaluations on platforms with different architectural characteristics, such as node size and interconnection network. Experimental results show significant performance improvements for medium to large message sizes over the native algorithms of the libraries, and over other hierarchical designs from the literature, for both collectives. Furthermore, the introduction of oversubscription is shown to have almost no overhead, especially for the broadcast of large messages. © 2023, The Authors. All rights reserved.
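The general idea of a hierarchical all-reduce can be shown with a small pure-Python simulation. This is only a sketch of the generic two-level scheme (reduce inside each node, combine across nodes, publish the result inside each node). It is not the Fullpar algorithm, which additionally partitions messages and overlaps network and shared-memory work. The function name and the nested-list data layout are assumptions made for illustration, and no MPI calls are involved.

```python
def hierarchical_allreduce(values_by_node):
    """Simulate a two-level sum all-reduce.

    values_by_node[i][j] is the contribution of rank j on node i.
    Returns a structure of the same shape in which every rank holds
    the global sum.
    """
    # Step 1: intra-node reduction (shared memory on a real system).
    node_partials = [sum(rank_values) for rank_values in values_by_node]

    # Step 2: inter-node reduction among one leader per node
    # (network traffic on a real system).
    total = sum(node_partials)

    # Step 3: each leader publishes the result to the ranks of its node.
    return [[total] * len(rank_values) for rank_values in values_by_node]


# Two nodes with 2 and 3 ranks respectively.
result = hierarchical_allreduce([[1, 2], [3, 4, 5]])
```

Only one value per node crosses the simulated network in step 2, which is the motivation for exploiting shared memory inside a node.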

Keywords: parallel programming

Improving computation efficiency using input and architecture features for a virtual screening application

arXiv, 2023

Authors: Gianmarco, Accordi; Emanuele, Vitali; Davide, Gadioli; Luigi, Crisci; Biagio, Cosenza; Mauro, Bisson; Fatica, Massimiliano; Andrea, Beccari; Palermo, Gianluca

Affiliations: Dipartimento di Elettronica, Informazione e Bioingegneria, Politecnico di Milano, Milano, Italy; CSC - IT Center for Science, Espoo, Finland; Dipartimento di Informatica, Università degli Studi di Salerno, Salerno, Italy; Nvidia Corporation, Santa Clara, CA, United States; Dompé Farmaceutici SpA, Napoli, Italy

Virtual screening is an early stage of the drug discovery process that selects the most promising candidates. In the urgent computing scenario, it is critical to find a solution in a short time frame. In this paper, we focus on a real-world virtual screening application to evaluate out-of-kernel optimizations that consider input and architecture features to improve computation efficiency on GPUs. Experimental results on a modern supercomputer node show that we can almost double the performance. Moreover, we implemented the optimization using SYCL, where it provides a benefit consistent with that of the CUDA optimization. A virtual screening campaign can use this gain in performance to increase the number of evaluated candidates, improving the probability of finding a drug. © 2023, CC BY.

Keywords: parallel programming

Heuristics for Program Code Optimization in Heterogeneous Systems

31st International Conference on Radioelektronika (RADIOELEKTRONIKA) Part of MAREW Conference

Authors: Voloshko, Anna; Ivutin, Alexey; Novikov, Alexander S.

Affiliations: Tula State Univ, Dept Comp Technol, Tula, Russia

ISBN: (print) 9781665414746

This paper discusses the optimization problem of parallelizing program code in a heterogeneous system. The optimization problem and its constraints are defined. The authors present the main approaches to finding the best solution. The special aspects of the optimization problem in heterogeneous systems are discussed, and heuristics that address these aspects are proposed.

Keywords: parallel programming; optimization; heterogeneous system; Petri nets; semantic relations; heuristic

Analyzing Reduction Abstraction Capabilities

International Workshop on Performance, Portability and Productivity in HPC (P3HPC)

Authors: Deakin, Tom; McIntosh-Smith, Simon; Pennycook, S. John; Sewall, Jason

Affiliations: Univ Bristol, Dept Comp Sci, Bristol, Avon, England; Intel Corp, Santa Clara, CA, USA

ISBN: (print) 9781665424394

Reductions are a common pattern in parallel programming, and every parallel programming language or framework has its own reduction abstraction with its own idiosyncrasies. These abstractions differ not only in their syntax, but also in their semantics and their ability to express certain types of reduction. Such differences may prevent specific combinations of abstraction and hardware platform from reaching high levels of performance, with consequences for portability and programmer productivity. In this paper, we present a set of representative reduction benchmarks to explore the capabilities of five contemporary programming languages and frameworks - OpenMP, Kokkos, RAJA, SYCL, and the oneAPI DPC++ Library (oneDPL) - across a variety of hardware platforms, including CPUs and GPUs from multiple vendors. We discuss the advantages and disadvantages of each reduction abstraction, and conclude with recommendations to improve their design and implementation.
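As a language-neutral illustration of the reduction pattern itself, the sketch below performs a pairwise (tree) reduction in plain Python. It does not use or model any of the five frameworks studied in the paper, and the function name `tree_reduce` is an assumption made here. The combinations inside each level are independent of one another, which is what a parallel runtime would exploit. Because neighbours are combined in order, the operator only needs to be associative, not commutative.

```python
import operator


def tree_reduce(values, op):
    """Reduce values with the binary operator op, level by level.

    Adjacent pairs are combined at each level, so the element order is
    preserved and op only has to be associative.
    """
    items = list(values)
    if not items:
        raise ValueError("cannot reduce an empty sequence")
    while len(items) > 1:
        # All pairs in this level could be combined concurrently.
        next_level = [
            op(items[i], items[i + 1]) for i in range(0, len(items) - 1, 2)
        ]
        if len(items) % 2:
            next_level.append(items[-1])  # carry the unpaired last element
        items = next_level
    return items[0]


total = tree_reduce([1, 2, 3, 4, 5], operator.add)
largest = tree_reduce([4, 9, 2, 7], max)
joined = tree_reduce(["a", "b", "c", "d", "e"], operator.add)
```

The string case shows an associative but non-commutative operator, one of the kinds of reduction for which abstractions can differ in what they are able to express.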

Keywords: Productivity; parallel programming; Semantics; Benchmark testing; Writing; Syntactics; Hardware

A Dense Tensor Accelerator with Data Exchange Mesh for DNN and Vision Workloads

IEEE International Symposium on Circuits and Systems (IEEE ISCAS)

Authors: Lin, Yu-Sheng; Chen, Wei-Chao; Yang, Chia-Lin; Chien, Shao-Yi

Affiliations: Inventec Corp, Taipei, Taiwan; Natl Taiwan Univ, Taipei, Taiwan

ISBN: (print) 9781728192017

We propose a dense tensor accelerator called VectorMesh, a scalable, memory-efficient architecture that can support a wide variety of DNN and computer vision workloads. Its building block is a tile execution unit (TEU), which includes dozens of processing elements (PEs) and SRAM buffers connected through a butterfly network. A mesh of FIFOs between the TEUs facilitates data exchange between tiles and promotes local data to global visibility. According to the roofline model, our design performs better than state-of-the-art architectures for CNN, GEMM, and spatial matching algorithms. It can reduce global buffer and DRAM fetches by 2-22 times and up to 5 times, respectively.
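The roofline model mentioned in the abstract bounds attainable throughput by the smaller of peak compute and memory bandwidth multiplied by arithmetic intensity (operations per byte moved). A minimal Python sketch follows. The function name and all numbers are hypothetical and are not taken from the paper.

```python
def roofline_gflops(peak_gflops, bandwidth_gb_per_s, flops_per_byte):
    """Attainable performance under the roofline model.

    Performance is capped either by the compute peak or by how fast
    memory can feed the compute units at the given arithmetic intensity.
    """
    return min(peak_gflops, bandwidth_gb_per_s * flops_per_byte)


# Hypothetical machine: 1000 GFLOP/s peak, 100 GB/s memory bandwidth.
memory_bound = roofline_gflops(1000.0, 100.0, 2.0)    # limited by bandwidth
compute_bound = roofline_gflops(1000.0, 100.0, 50.0)  # limited by peak compute
```

Fetching fewer bytes for the same work raises `flops_per_byte`, which moves a memory-bound kernel up the sloped part of the roofline. This is why reductions in buffer and DRAM fetches matter in such comparisons.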

Keywords: Neural network hardware; vector processors; parallel programming

Concurrent and Distributed Pseudocode: A Systematic Literature Review

47th Latin American Computing Conference (CLEI)

Authors: Ulate-Caballero, Bryan Alexander; Berrocal-Rojas, Allan; Hidalgo-Cespedes, Jeisson

Affiliations: Univ Costa Rica, Comp & Informat, San Jose, Costa Rica; Univ Costa Rica, ECCI, San Jose, Costa Rica; Univ Costa Rica, ECCI, CITIC, San Jose, Costa Rica

ISBN: (print) 9781665495035

Pseudocode is a valuable resource used in programming education, software development, and scientific reports for designing algorithmic solutions, as it is easy to write, understand, and modify. Since pseudocode cannot readily be tested, it is difficult to determine whether a pseudocode solution is correct. Software tools are especially needed to reach this goal, e.g., to help professors find race conditions, deadlocks, or starvation issues while grading students' concurrent pseudocode. Although there are various tools for working with sequential pseudocode, there is a lack of tools for working with concurrent pseudocode. This shortage motivated us to determine the state of the art in notations and tools for testing concurrent and distributed pseudocode. We conducted a systematic literature review and found only a few related publications, confirming that this topic is understudied. We found and report on five software tools capable of interpreting concurrent or distributed pseudocode, and two software tools capable of verifying its correctness. In addition, no other literature review was found on this topic, conferring novelty on the contributions of this work.

Keywords: Concurrent pseudocode; distributed pseudocode; parallel pseudocode; code correctness; concurrent programming; distributed programming; parallel programming; pseudocode; pseudo-code; pseudo-language; education; teaching

The parallel optimization based on the PVS algorithm and research on the evaluation function in the Game of the Amazons

33rd Chinese Control and Decision Conference (CCDC)

Authors: Wang, Haoyu; Qiu, Hongkun

Affiliations: Shenyang Aerosp Univ, Sch Comp Sci, Shenyang 110136, Peoples R China; Shenyang Aerosp Univ, Engn Training Ctr, Shenyang 110136, Peoples R China

ISBN: (print) 9781665440899

The PVS search function, a current mainstream and efficient algorithm, has been widely used in various kinds of chess programs. We applied a parallel search function based on PVS and improved the running speed of the program. We also carried out research and experiments on the evaluation function for the Game of the Amazons, which yielded a set of usable evaluation functions and parameter adjustment results for reference.
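For readers unfamiliar with Principal Variation Search (PVS), the following is a minimal sequential sketch in Python, written in negamax form. The first child is searched with the full window, later children are probed with a null window, and a child is searched again only when the probe suggests it may beat the current best. The sketch does not reproduce the paper's parallel scheme or any Amazons evaluation function. The nested-list tree encoding, the integer leaf scores (given from the first player's point of view), and the function names are assumptions made for illustration.

```python
INF = float("inf")


def pvs(node, alpha, beta, color):
    """Principal Variation Search on an explicit game tree.

    node is either an int (leaf score from the first player's view) or
    a list of child nodes. color is +1 when the first player moves and
    -1 otherwise. Integer scores are assumed, so a null window has
    width 1.
    """
    if isinstance(node, int):
        return color * node
    for index, child in enumerate(node):
        if index == 0:
            # Full-window search of the presumed best move.
            score = -pvs(child, -beta, -alpha, -color)
        else:
            # Cheap null-window probe: can this move beat alpha at all?
            score = -pvs(child, -alpha - 1, -alpha, -color)
            if alpha < score < beta:
                # The probe failed high, so search again with a full window.
                score = -pvs(child, -beta, -alpha, -color)
        alpha = max(alpha, score)
        if alpha >= beta:
            break  # beta cutoff
    return alpha


def best_value(tree):
    """Game value of tree for the player to move at the root."""
    return pvs(tree, -INF, INF, 1)


value_a = best_value([[3, 5], [2, 9], [0, 7]])  # max of the minima 3, 2, 0
value_b = best_value([[3, 5], [6, 9]])          # second move needs a re-search
```

PVS pays off when moves are ordered well, because most null-window probes then fail low and no re-search is needed.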

Keywords: Computer Game; Amazon Game; PVS; parallel programming; Evaluation Function

Accelerating Messages by Avoiding Copies in an Asynchronous Task-based Programming Model

IEEE/ACM 6th International Workshop on Extreme Scale Programming Models and Middleware (ESPM2)

Authors: Bhat, Nitin; White, Sam; Ramos, Evan; Kale, Laxmikant V.

Affiliations: Charmworks Inc, Urbana, IL 61801, USA; Univ Illinois, Dept Comp Sci, Urbana, IL, USA

ISBN: (print) 9781665411400

Task-based programming models promise improved communication performance for irregular, fine-grained, and load imbalanced applications. They do so by relaxing some of the messaging semantics of stricter models and taking advantage of those at the lower-levels of the software stack. For example, while MPI's two-sided communication model guarantees in-order delivery, requires matching sends to receives, and has the user schedule communication, task-based models generally favor the runtime system scheduling all execution based on the dependencies and message deliveries as they happen. The messaging semantics are critical to enabling high performance. In this paper, we build on previous work that added zero copy semantics to Converse/LRTS. We examine the messaging semantics of Charm++ as it relates to large message buffers, identify shortcomings, and define new communication APIs to address them. Our work enables in-place communication semantics in the context of point-to-point messaging, broadcasts, transmission of read-only variables at program startup, and for migration of chares. We showcase the performance of our new communication APIs using benchmarks for Charm++ and Adaptive MPI, which result in nearly 90% latency improvement and 2x lower peak memory usage.

Keywords: Charm++; AMPI; RDMA; parallel programming; Asynchronous Tasking; Communication Optimizations

Real-Time Scheduling Models in Diverse Multi-core OpenMP Applications

International Conference on Decision Aid Sciences and Application (DASA)

Authors: Waheed, Musfirah Nadeem; Siddique, Mohammed

Affiliations: Majan Univ Coll, Fac Informat Technol, Muscat, Oman

ISBN: (print) 9781665416344

This research presents some of the critical information required to understand the concept of parallel programming and the implementation of OpenMP in parallel programming. Parallelism is the preferred tool for expediting an algorithm, as demonstrated by the evolution of computing architectures (multi-core and many-core) towards a greater number of processing cores. The report focuses on OpenMP parallel programming models and further examines their implementation and features. The OpenMP parallel programming model is increasingly preferred for its ability to deliver real-time processing, thereby meeting system performance requirements. Furthermore, the study covers the use of OpenMP to enhance the efficiency of 3D discontinuous deformation analysis (3D-DDA) for expansive simulation using parallel block Jacobi (BJ) and pre-conditioned conjugate gradient (PCG) algorithms. The absence of data synchronization in parallel programming makes the system more prone to programming errors, since the parallel environment is much more complicated than perceived. The studies performed highlight how synchronization is managed using the OpenMP model. In the field of biometrics, the most important issue faced in DNA sequencing and pattern discovery is locating the longest common subsequence (LCS) among sequences. To identify the LCS of DNA sequences, we look into the solutions achieved using CPU-based OpenMP models, which offer major improvements in processing speed, capital, and ubiquity, and the results based on the analysis are discussed.
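The longest common subsequence (LCS) problem mentioned in the abstract has a standard dynamic-programming solution, sketched below in sequential Python. It is a textbook formulation, not the OpenMP implementations surveyed in the paper, and the function name `lcs_length` is an assumption made here.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of sequences a and b.

    table[i][j] holds the LCS length of the prefixes a[:i] and b[:j].
    """
    rows, cols = len(a), len(b)
    table = [[0] * (cols + 1) for _ in range(rows + 1)]
    for i in range(1, rows + 1):
        for j in range(1, cols + 1):
            if a[i - 1] == b[j - 1]:
                # Matching symbols extend the LCS of the shorter prefixes.
                table[i][j] = table[i - 1][j - 1] + 1
            else:
                # Otherwise drop one symbol from either sequence.
                table[i][j] = max(table[i - 1][j], table[i][j - 1])
    return table[rows][cols]


shared = lcs_length("AGGTAB", "GXTXAYB")  # one LCS is "GTAB"
```

Each cell depends only on its upper, left, and upper-left neighbours, so all cells on the same anti-diagonal are independent of one another. That is the usual starting point for a shared-memory parallel version.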

Keywords: parallel programming; OpenMP; synchronization; sporadic; dimensional discontinuous deformation; thread
