Search results - Inner Mongolia University Library

Parallel Fast Multipole Method accelerated FFT on HPC clusters

PARALLEL COMPUTING, 2021, Vol. 104, pp. 102783-102783

Authors: Mehta, Chahak; Karthi, Amarnath; Jetly, Vishrut; Chaudhury, Bhaskar. Affiliation: Dhirubhai Ambani Inst Informat & Commun Technol, Grp Computat Sci & HPC, Gandhinagar 382007, India

With increasing sizes of distributed systems, there comes an increased risk of communication bottlenecks. In the past decade there has been a growing interest in communication-avoiding algorithms. The distributed memory Fast Fourier Transform is an important algorithm which suffers from major communication bottlenecks. In this work, we take a look at an existing communication-avoiding algorithm, FMM-FFT, an alternative to FFT which utilizes the Fast Multipole Method (FMM) to reduce communications to a single all-to-all communication. We present a detailed implementation of FMM-FFT relying on modern libraries and demonstrate it on two distinct distributed memory architectures, notably a traditional Intel Xeon based HPC cluster and a Beowulf cluster. We show that while the FMM-FFT is significantly slower than FFT on the traditional HPC cluster, on the Beowulf cluster it outperforms standard FFT, consistently getting speedups of 1.5x or more against FFTW. We then proceed to show how the communication to computation cost metric is important and useful in explaining the performance results of FMM-FFT against standard FFT. The source code pertaining to this work is being made publicly available under a permissive open source licence on GitHub.

Keywords: Fast Fourier Transform; Fast Multipole Method; Beowulf cluster; Communication avoiding algorithms; Parallel programming; High performance computing

Lambda calculus with algebraic simplification for reduction parallelisation: Extended study

JOURNAL OF FUNCTIONAL PROGRAMMING, 2021, Vol. 31, No. 1, pp. e7-e7

Authors: Morihata, Akimasa. Affiliation: Univ Tokyo, Meguro Ku, 3-8-1 Komaba, Tokyo, Japan

Parallel reduction is a major component of parallel programming and widely used for summarisation and aggregation. It is not well understood, however, what sorts of non-trivial summarisations can be implemented as parallel reductions. This paper develops a calculus named lambda(AS), a simply typed lambda calculus with algebraic simplification. This calculus provides a foundation for studying a parallelisation of complex reductions by equational reasoning. Its key feature is delta abstraction. A delta abstraction is observationally equivalent to the standard lambda abstraction, but its body is simplified before the arrival of its arguments using algebraic properties such as associativity and commutativity. In addition, the type system of lambda(AS) guarantees that simplifications due to delta abstractions do not lead to serious overheads. The usefulness of lambda(AS) is demonstrated on examples of developing complex parallel reductions, including those containing more than one reduction operator, loops with conditional jumps, prefix sum patterns and even tree manipulations.

Keywords: Parallel programming
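
As a rough illustration of what the abstract above means by a non-trivial summarisation expressed as a parallel reduction, the sketch below computes a maximum prefix sum by reducing chunks independently and then combining the partial results with an associative operator. This is not the lambda(AS) calculus and is not taken from the paper; the names `lift`, `combine`, `reduce_chunk` and `max_prefix_sum` are hypothetical, and the empty prefix (sum 0) is assumed to count as a prefix.

```python
from concurrent.futures import ThreadPoolExecutor
from functools import reduce

# Each partial result is a pair (total sum, maximum prefix sum).
# The empty prefix counts, so the maximum prefix sum is never negative.
IDENTITY = (0, 0)

def lift(x):
    """Turn one element into a (sum, max prefix sum) pair."""
    return (x, max(0, x))

def combine(a, b):
    """Associative combination of two adjacent segments."""
    s1, m1 = a
    s2, m2 = b
    # A prefix of the joined segment either ends inside the left segment,
    # or covers the whole left segment plus a prefix of the right one.
    return (s1 + s2, max(m1, s1 + m2))

def reduce_chunk(chunk):
    """Sequentially reduce one chunk."""
    return reduce(combine, map(lift, chunk), IDENTITY)

def max_prefix_sum(xs, n_chunks=4):
    """Reduce chunks independently, then combine the partial results."""
    size = max(1, -(-len(xs) // n_chunks))  # ceiling division
    chunks = [xs[i:i + size] for i in range(0, len(xs), size)]
    with ThreadPoolExecutor() as pool:
        partials = list(pool.map(reduce_chunk, chunks))
    return reduce(combine, partials, IDENTITY)[1]
```

Because `combine` is associative, the chunk boundaries do not affect the result, which is what lets the chunks be reduced in any grouping. The thread pool here only demonstrates the structure; it is not meant as a performance claim.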

A scalable parallel algorithm for building web directories

CONCURRENCY AND COMPUTATION-PRACTICE & EXPERIENCE, 2021, Vol. 33, No. 9, pp. e6121-e6121

Authors: Seshadri, Karthick; Maruthappan, Aswin Sundar; Raman, Mukunthapriya. Affiliations: Natl Inst Technol, Dept Comp Sci & Engn, Tadepalligudem, Andhra Pradesh, India; Verizon Media, New York, NY, USA; Univ Southern Calif, Los Angeles, CA 90007, USA

Web directories like Wikipedia and Open Directory Mozilla facilitate efficient information retrieval (IR) of web documents from a huge web corpus. Maintenance of these web directories is understandably a difficult task that requires manual curation by human editors or semi-automated mechanisms. Research on parallel algorithms for the automated curation of these web directories will be beneficial to the IR domain. Hence, in this article, we propose a parallel algorithm for automatically creating web directories from a corpus of web-documents. We have used centrality-based techniques to split the corpus into fine-grained clusters and subsequently an agglomeration based on locality sensitive hashing to identify coarse-grained clusters in the web-directory. Experimental results show that the algorithm generates meaningful hierarchies of the input corpus as measured by cluster-validity indices, like F-measure, rand index, and cluster purity. The algorithm achieves a significant speedup and scales well both with the number of processors and the size of the input corpus.

Keywords: computational intelligence; concurrent computing; high performance computing; knowledge engineering; knowledge discovery; learning systems; supervised learning; parallel programming; text mining; web mining

Verification of concurrent code from synchronous specifications

SCIENCE OF COMPUTER PROGRAMMING, 2021, Vol. 206, pp. 102625-102625

Authors: Hu, Kai; Zhang, Teng; Ding, Yi; Zhu, Jian; Talpin, Jean-Pierre. Affiliations: Beihang Univ, State Key Lab Software Dev Environm, Beijing, Peoples R China; Univ Penn, Philadelphia, PA 19104, USA; Beijing Wuzi Univ, Sch Informat, Beijing, Peoples R China; INRIA Rennes, Campus Beaulieu, Rennes, France

The synchronous language SIGNAL is a formal specification formalism for developing safety-critical real-time systems. It is a multi-clocked data-flow modeling language suitable for specifying deterministic concurrent behaviors. Its model of computation and communication closely matches recent trends to utilize multi-core processors for executing real-time systems, by taking advantage of its concurrent semantics. The SIGNAL compiler generates code from data-flow specifications while analyzing and verifying safety properties of the system under design: deadlock-freedom, determinism. However, most recent work has focused on generating sequential code from SIGNAL. Choosing the parallel library OpenMP as the target, this paper proposes a methodology to generate and verify concurrent code automatically from SIGNAL specifications. This is done by first exploring clock relations among signals by application of a so-called clock calculus. Then, specifications are translated into EDGs (Equation-Dependency Graphs) to analyze global data-dependency relations. An EDG is then partitioned into concurrent tasks to help explore parallelism in the original specification while preserving its semantics. Combined with clock relations, parallel tasks are finally mapped onto the OpenMP structures. The proposed approach is illustrated by a realistic case study. (C) 2021 Elsevier B.V. All rights reserved.

Keywords: Synchronous specifications; SIGNAL; Parallel programming; OpenMP; Code generation

Parallel hyper-heuristics for process engineering optimization

COMPUTERS & CHEMICAL ENGINEERING, 2021, Vol. 153, pp. 107440-107440

Authors: Oteiza, Paola P.; Ardenghi, Juan I.; Brignole, Nelida B. Affiliations: Univ Nacl Sur UNS, Lab Invest & Desarrollo Comp Cient LIDECC, Dept Ciencias & Ingn Computat DCIC, Bahia Blanca, Buenos Aires, Argentina; Univ Nacl Sur UNS, Dept Ingn Quim DIQ, B8000 Bahia Blanca, Buenos Aires, Argentina; Planta Piloto Ingn Quim, Univ Nacl Sur CONICET, Camino Carrindanga Km 7, RA-8000 Bahia Blanca, Buenos Aires, Argentina

This paper presents the general framework of a parallel cooperative hyper-heuristic optimizer (PCHO) to solve systems of nonlinear algebraic equations with equality and inequality constraints. The algorithm comprises the classical metaheuristics called Genetic Algorithms, Simulated Annealing and Particle Swarm Optimization, whose parameters are adaptively chosen during the executions. A Master-Worker architecture was designed and implemented, where the Master processor ranks the solution candidates informed by the metaheuristics and immediately communicates the most promising candidate to update all Workers. Algorithmic performance was tested with general models, most of them corresponding to PSE process systems. The results confirmed the efficiency of the proposed approach since both online parameter retuning and parallel processing sped up the search. (c) 2021 Elsevier Ltd. All rights reserved.

Keywords: Optimization; Evolutionary algorithms; Metaheuristics; Hyper-heuristics; Parallel programming

Adaptive tiling for parallel N-body simulations on many core

ASTRONOMY AND COMPUTING, 2021, Vol. 36, pp. 100466-100466

Authors: Khan, M. A.; Al-Mouhamed, M. A.; Mohammad, N. Affiliations: Prince Mohammad Bin Fahd Univ, Coll Comp Engn & Sci, Khobar, Saudi Arabia; King Fahd Univ Petr & Minerals, Comp Engn, Khobar, Saudi Arabia; Prince Mohammad Bin Fahd Univ, Cybersecur Ctr, Khobar, Saudi Arabia

The N-body simulations consist of computing mutual gravitational forces exerted on each body in O(N). The Barnes-Hut approximation allows processing a group of bodies in O(1) if they are far enough from a given body, which drops the complexity of the whole simulation to O(N log N). The octree is used to ease the pruning process but at the cost of some irregularity in the access pattern. In a parallel N-body implementation the bodies are partitioned among threads that are executed on multiple cores. The depth-first traversal of the octree is used for processing each body, which causes repeated cache misses during traversal. This paper proposes different types of tiling methods to improve the performance of N-body simulations. It presents an experimental analysis of octree traversal by using these tiling methods to identify the potential of cache data reuse. It then evaluates these tiling methods for varying tile sizes with different galaxy sizes and a varying number of threads on several machine architectures. The efficiency of tiling approaches depends on the chosen tile size. It is shown that a speedup of 8 times can be achieved by choosing the appropriate tile size on a 60-core Intel accelerator. In order to determine appropriate tile size, the paper proposes an adaptive tiling approach to implicitly adapt the tile size to the distribution of threads, the cache capacity, cache latency, problem size and dynamic changes in the access pattern over the iterations. The proposed adaptive tiling approach can be used as an optimization option in parallel compilers. (C) 2021 Elsevier B.V. All rights reserved.

Keywords: N-body simulations; Tiling; Cache optimization; Many Integrated Core (MIC); Parallel programming

Informed scenario-based RRT* for aircraft trajectory planning under ensemble forecasting of thunderstorms

TRANSPORTATION RESEARCH PART C-EMERGING TECHNOLOGIES, 2021, Vol. 129, pp. 103232-103232

Authors: Andre, Eduardo; Gonzalez-Arribas, Daniel; Soler, Manuel; Kamgarpour, Maryam; Sanjurjo-Rivo, Manuel. Affiliations: Univ Carlos III, Dept Bioengn & Aerosp Engn, Madrid, Spain; Univ British Columbia, Elect & Comp Engn, Vancouver, BC, Canada

Thunderstorms represent a major hazard for flights, as they compromise the safety of both the airframe and the passengers. To address trajectory planning under thunderstorms, three variants of the scenario-based rapidly exploring random trees (SB-RRTs) are proposed. During an iterative process, the so-called SB-RRT, the SB-RRT* and the Informed SB-RRT* find safe trajectories by meeting a user-defined safety threshold. Additionally, the last two techniques converge to solutions of minimum flight length. Through parallelization on graphical processing units the required computational times are reduced substantially to become compatible with near real-time operation. The proposed methods are tested considering a kinematic model of an aircraft flying between two waypoints at constant flight level and airspeed; the test scenario is based on a realistic weather forecast and assumed to be described by an ensemble of equally likely members. Lastly, the influence of the number of scenarios, safety margin and iterations on the results is analyzed. Results show that the SB-RRTs are able to find safe and, in two of the algorithms, close-to-optimum solutions.

Keywords: Aircraft path planning; Sampling-based algorithms; Uncertain thunderstorm avoidance; Parallel programming

Comparison of massively parallel algorithms on graphics processing unit for MIMO radar

e-Prime - Advances in Electrical Engineering, Electronics and Energy, 2022, Vol. 2

Authors: Pitre, Eric; Roberge, Vincent; Bray, Joey; Hefnawi, Mostafa. Affiliation: Royal Military College of Canada, Department of Electrical and Computer Engineering, Canada

This paper proposes a method for accelerating an enhanced resolution 3D Multiple Input Multiple Output (MIMO) radar on a Graphics Processing Unit (GPU). Due to the size of the data required for range, bearing, and Doppler processing, computations for the MIMO radar are extensive and seldom permit real-time operation without performance compromises. Current methods for achieving reasonable frame rates include reducing the scope of the radar (i.e., limiting the number of dimensions, the field of view, or the ranges of interest), choosing efficient but coarse algorithms (i.e., the FFT for range, velocity, and bearing estimation), or offloading the computations to task-specific hardware, DSP, or FPGA. The proposed framework enables real-time operation of the MIMO radar by performing the signal processing on a GPU without compromising the radar coverage, while replacing the widely used 3D FFT with an enhanced resolution alternative. This paper compares the execution times of the various algorithms when performed on a Central Processing Unit (CPU), and when performed on the GPU. © 2022

Keywords: Chirp Z transform; Graphics processing unit; MIMO radar; Parallel programming

Kcollections: A Fast and Efficient Library for K-mers

34th IEEE International Parallel and Distributed Processing Symposium (IPDPS)

Authors: Fujimoto, M. Stanley; Lyman, Cole A.; Clement, Mark J. Affiliation: Brigham Young Univ, Dept Comp Sci, Provo, UT 84602, USA

ISBN: (print) 9781728174457

K-mers form the backbone of many bioinformatic algorithms. They are, however, difficult to store and use efficiently because the number of k-mers increases exponentially as k increases. Many algorithms exist for compressed storage of k-mers but suffer from slow insert times or are probabilistic, resulting in false-positive k-mers. Furthermore, k-mer libraries usually specialize in associating specific values with k-mers such as a color in colored de Bruijn Graphs or k-mer count. We present kcollections, a compressed and parallel data structure designed for k-mers generated from whole, assembled genomes. Kcollections is available for C++ and provides set- and map-like structures as well as a k-mer counting data structure all of which utilize parallel operations designed using a MapReduce paradigm. Additionally, we provide basic Python bindings for rapid prototyping. Kcollections makes developing bioinformatic algorithms simpler by abstracting away the tedious task of storing k-mers.

Keywords: data structure; genomics; k-mer; Parallel programming
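
To make the storage problem described in the abstract above concrete, here is a minimal sketch of naive k-mer counting with the Python standard library, together with the size of the k-mer space for a four-letter DNA alphabet. It does not use kcollections or its Python bindings, and it is uncompressed and sequential; the function names `count_kmers` and `kmer_space` are hypothetical.

```python
from collections import Counter

def count_kmers(seq, k):
    """Count every length-k substring (k-mer) of a DNA sequence."""
    seq = seq.upper()
    # A sequence of length n has n - k + 1 overlapping k-mers;
    # the range is empty when the sequence is shorter than k.
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))

def kmer_space(k):
    """Number of distinct k-mers over the alphabet {A, C, G, T}."""
    return 4 ** k
```

For example, `count_kmers("ACGTACGT", 3)` counts `ACG` and `CGT` twice each, while `kmer_space(31)` already exceeds 4.6 * 10^18, which is why a plain dictionary of strings stops being practical for large k.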

Parallel Stream Processing with MPI for Video Analytics and Data Visualization

19th Symposium on High-Performance Computing Systems (WSCAD)

Authors: Vogel, Adriano; Rista, Cassiano; Justo, Gabriel; Ewald, Endrius; Griebler, Dalvan; Mencagli, Gabriele; Fernandes, Luiz Gustavo. Affiliations: Pontificia Univ Catolica Rio Grande do Sul, Sch Technol, Porto Alegre, RS, Brazil; Univ Pisa, Dept Comp Sci, Pisa, Italy; Tres de Maio Fac SETREM, Lab Adv Res Cloud Comp LARCC, Tres De Maio, Brazil

ISBN: (print) 9783030410506; 9783030410490

The amount of data generated is increasing exponentially. However, processing data and producing fast results is a technological challenge. Parallel stream processing can be implemented for handling high frequency and big data flows. The MPI parallel programming model offers low-level and flexible mechanisms for dealing with distributed architectures such as clusters. This paper aims to use it to accelerate video analytics and data visualization applications so that insight can be obtained as soon as the data arrives. Experiments were conducted with a Domain-Specific Language for Geospatial Data Visualization and a Person Recognizer video application. We applied the same stream parallelism strategy and two task distribution strategies. The dynamic task distribution achieved better performance than the static distribution in the HPC cluster. The data visualization achieved lower throughput with respect to the video analytics due to the I/O intensive operations. Also, the MPI programming model shows promising performance outcomes for stream processing applications.

Keywords: Parallel programming; Stream parallelism; Distributed processing; Cluster
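
The abstract above contrasts static and dynamic task distribution. The toy simulation below sketches one common reading of those terms, which is an assumption here and not the paper's MPI implementation: static means round-robin assignment fixed in advance, dynamic means each task goes to whichever worker becomes free first. The function names and the task costs are hypothetical.

```python
import heapq

def static_makespan(costs, n_workers):
    """Round-robin assignment decided before execution starts."""
    loads = [0] * n_workers
    for i, cost in enumerate(costs):
        loads[i % n_workers] += cost
    # The run finishes when the most loaded worker finishes.
    return max(loads)

def dynamic_makespan(costs, n_workers):
    """Each task, in arrival order, goes to the earliest-free worker."""
    free_at = [0] * n_workers  # time at which each worker becomes idle
    heapq.heapify(free_at)
    for cost in costs:
        start = heapq.heappop(free_at)
        heapq.heappush(free_at, start + cost)
    return max(free_at)

# Uneven task costs, e.g. frames that take different times to process.
costs = [8, 1, 1, 1, 8, 1, 1, 1]
print(static_makespan(costs, 2))   # both expensive tasks land on worker 0
print(dynamic_makespan(costs, 2))  # expensive tasks are spread out
```

With these particular costs the static schedule finishes at time 18 and the dynamic one at time 11. The gap depends entirely on how uneven the task costs are; with uniform costs the two schedules coincide, and a real dynamic scheme also pays a coordination cost that this sketch ignores.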
