Bulk Synchronous Parallel (BSP) is a model for parallel computing with predictable scalability. BSP has a cost model: programs can be assigned a cost that describes their resource usage on any parallel machine. However, the programmer has to derive this cost manually. This paper describes an automatic method for deriving BSP program costs, based on classic cost analysis and approximation of polyhedral integer volumes. Our method requires and analyzes programs with textually aligned synchronization and textually aligned, polyhedral communication. We have implemented the analysis, and our prototype obtains cost formulas that are parametric in the input parameters of the program and the parameters of the BSP computer, and thus bound the cost of running the program with any input on any number of cores. We evaluate the cost formulas and find that they are indeed upper bounds, and tight for data-oblivious programs. Additionally, we evaluate their capacity to predict concrete run times in two parallel settings: a multi-core computer and a cluster. We find that when exact upper bounds can be found, they accurately predict run times. In networks with full bisection bandwidth, as the BSP model supposes, results are promising, with errors below 50%.
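The standard BSP cost model charges each superstep its local work w, plus g times the h-relation (the maximum number of words any core sends or receives), plus the synchronization latency l; a program's cost is the sum over its supersteps. A minimal sketch of that arithmetic (function names are illustrative, not taken from the paper's tool):

```python
def superstep_cost(w, h, g, l):
    """Cost of one BSP superstep: local work w, h-relation h,
    per-word communication cost g, synchronization latency l."""
    return w + g * h + l

def program_cost(supersteps, g, l):
    """Total BSP cost: the sum of the per-superstep costs.
    supersteps is a list of (w, h) pairs."""
    return sum(superstep_cost(w, h, g, l) for (w, h) in supersteps)

# Example: one core broadcasting n words to the other p-1 cores in a
# single superstep realises an h-relation of n*(p-1) at the sender.
n, p = 1000, 4
cost = program_cost([(n, n * (p - 1))], g=2, l=100)
```

The formulas the paper derives are exactly of this shape, but parametric: w and h become symbolic expressions in the program's input parameters, bounded via polyhedral volume approximation.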
This paper presents a number of optimisations for improving the performance of unstructured computational fluid dynamics codes on multicore and manycore architectures such as the Intel Sandy Bridge, Broadwell and Skylake CPUs and the Intel Xeon Phi Knights Corner and Knights Landing manycore processors. We discuss and demonstrate their implementation in two distinct classes of computational kernels: face-based loops, represented by the computation of fluxes, and cell-based loops, representing updates to state vectors. We present the importance of making efficient use of the underlying vector units in both classes of computational kernels, with special emphasis on the changes required for vectorising face-based loops and their intrinsic indirect and irregular access patterns. We demonstrate the advantage of different data layouts for cell-centred as well as face data structures, and architecture-specific optimisations for improving the performance of the gather and scatter operations which are prevalent in unstructured mesh applications. The implementation of a software prefetching strategy based on auto-tuning is also shown, along with an empirical evaluation of the importance of multithreading for in-order architectures such as Knights Corner. We explore the various memory modes available on the Intel Xeon Phi Knights Landing architecture and present an approach whereby both the traditional DRAM and MCDRAM interfaces are exploited for maximum performance. We obtain significant full-application speed-ups of between 2.8X and 3X across the multicore CPUs in two-socket node configurations, 8.6X on the Intel Xeon Phi Knights Corner coprocessor and 5.6X on the Intel Xeon Phi Knights Landing processor in an unstructured finite volume CFD code representative in size and complexity of an industrial application. Program summary. Program Title: some_opt_for_unstructured_cfd. Program Files doi: http://***/10.17632/zyh2zkf3jw.1. Licensing provisions: GNU General Public License 3 (GPL).
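The face-based pattern that makes vectorisation hard can be outlined as an indirect gather/compute/scatter loop. A behavioural sketch in plain Python (the toy central-difference flux and all names are ours, not from the paper's code): the scatter writes are the crux, since two faces sharing a cell create write conflicts when the loop is vectorised across faces.

```python
def face_flux_update(faces, state, residual):
    """For each face (left_cell, right_cell): gather the two cell
    states through indirect indices, compute a flux, and scatter it
    back to both adjacent cells."""
    for (lc, rc) in faces:
        flux = 0.5 * (state[lc] - state[rc])  # toy flux for illustration
        residual[lc] -= flux  # scatter: two faces sharing a cell would
        residual[rc] += flux  # race if this loop ran in SIMD lanes

# A 1-D chain of 4 cells joined by 3 faces.
faces = [(0, 1), (1, 2), (2, 3)]
state = [4.0, 3.0, 2.0, 1.0]
residual = [0.0] * 4
face_flux_update(faces, state, residual)
```

Cell-based loops, by contrast, touch each cell exactly once with unit stride, which is why the two kernel classes need different vectorisation treatment.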
Due to the growth of biological databases and biomedical instruments, high-performance active (real-time) signal processing has become a challenge for medical scientists and engineers. Medical applications require a high-performance signal processor that can run scientific and engineering biomedical workloads and is easy to program. In this article, we propose a biomedical sensor interface, a biomedical application processing system (BAPS) based on a heterogeneous multi-core processing architecture, and a biomedical application toolkit. The biomedical sensor interface supports multiple regular and complex medical signals and provides digital data to the processing system. The BAPS uses a heterogeneous multi-core architecture that processes biomedical applications at up to 10 billion operations per second with a timing accuracy of 1 μs. The biomedical application toolkit provides programmability by supporting hardware-level, scientific, and artificial intelligence programming. The BAPS provides a single embedded platform for a wide range of biomedical signal and image processing applications. To demonstrate the proposed system, we developed the BAPS hardware architecture and tested it with different biomedical applications. Compared with the baseline system, BAPS improves active (real-time) application performance by up to 12.8 times, processes passive (non-real-time) applications 7.4 times faster, and improves artificial intelligence application performance by 4.84 times, while drawing 1.56 times less dynamic power and consuming 21.85 times less energy.
Many libraries in the HPC field use sophisticated algorithms with clear theoretical scalability expectations. However, hardware constraints or programming bugs may sometimes render these expectations inaccurate or even plainly wrong. While algorithm and performance engineers have already been advocating the systematic combination of analytical performance models with practical measurements for a very long time, we go one step further and show how this comparison can become part of automated testing procedures. The most important applications of our method include initial validation, regression testing, and benchmarking to compare implementation and platform alternatives. Advancing the concept of performance assertions, we verify asymptotic scaling trends rather than precise analytical expressions, relieving the developer from the burden of having to specify and maintain very fine-grained and potentially non-portable expectations. In this way, scalability validation can be continuously applied throughout the whole development cycle with very little effort. Using MPI and parallel sorting algorithms as examples, we show how our method can help uncover non-obvious limitations of both libraries and underlying platforms.
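A scalability assertion of the kind described above can be reduced to fitting the empirical scaling exponent from a handful of measurements and failing the test when it exceeds the expected asymptotic trend. A minimal sketch under our own assumptions (a log-log least-squares fit; the real framework is more sophisticated, e.g. about noise and model selection):

```python
import math

def fitted_exponent(sizes, times):
    """Least-squares slope of log(time) vs log(size): the empirical
    exponent k in time ~ size**k."""
    xs = [math.log(s) for s in sizes]
    ys = [math.log(t) for t in times]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    den = sum((x - mx) ** 2 for x in xs)
    return num / den

def assert_scaling(sizes, times, expected, tol=0.2):
    """Scalability assertion: fail when the measured exponent exceeds
    the expected asymptotic exponent by more than tol."""
    k = fitted_exponent(sizes, times)
    assert k <= expected + tol, f"scales as n^{k:.2f}, expected n^{expected}"

# Example: run times growing like n log n pass a near-linear check.
sizes = [1000, 2000, 4000, 8000]
times = [s * math.log(s) for s in sizes]
assert_scaling(sizes, times, expected=1.2)
```

Because the assertion only pins down a trend, not a precise analytical formula, it survives platform changes that would invalidate hand-maintained closed-form expectations.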
Asymmetric multi-cores (AMCs) are a successful architectural solution for both mobile devices and supercomputers. By combining two types of cores (fast and slow), AMCs are able to provide high performance under the facility power budget. This paper performs the first extensive evaluation of how portable current HPC applications are to such supercomputing systems. Specifically, we evaluate several execution models on an ARM *** AMC using the PARSEC benchmark suite, which includes representative highly parallel applications. We compare schedulers at the user, OS, and runtime levels, using both static and dynamic options and multiple configurations, and assess the impact of these options on the well-known problem of balancing load across AMCs. Our results demonstrate that scheduling is most effective when it takes place at the runtime system level, improving the baseline by 23%, while the heterogeneity-aware OS scheduling solution improves the baseline by 10%. (C) 2019 Published by Elsevier Inc.
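Why dynamic, runtime-level scheduling helps on fast/slow cores can be seen in a toy load-balancing model (our own illustration, not the paper's schedulers): assigning each ready task to whichever core would finish it earliest naturally gives the fast core proportionally more work, whereas a static even split leaves the slow core as the bottleneck.

```python
import heapq

def dynamic_schedule(task_costs, core_speeds):
    """Greedy dynamic scheduling on an asymmetric multi-core: each
    task goes to the core that would finish it earliest. core_speeds
    are work units per time unit; returns the makespan."""
    heap = [(0.0, i) for i in range(len(core_speeds))]  # (finish_time, core)
    heapq.heapify(heap)
    for cost in task_costs:
        t, i = heapq.heappop(heap)                # earliest-free core
        heapq.heappush(heap, (t + cost / core_speeds[i], i))
    return max(t for t, _ in heap)

# Six unit tasks on one fast (2x) and one slow core: dynamic assignment
# gives the fast core four tasks and finishes in 2.0 time units, while a
# static 3/3 split would take 3.0 (the slow core's share).
makespan = dynamic_schedule([1.0] * 6, [2.0, 1.0])
```

Real runtime systems add work stealing, task dependencies, and migration costs on top of this basic idea.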
Scalability is a key feature for big data analysis and machine learning frameworks and for applications that need to analyze very large and real-time data available from data repositories, social media, sensor networks, smartphones, and the Web. Scalable big data analysis today can be achieved by parallel implementations that are able to exploit the computing and storage facilities of high performance computing (HPC) systems and clouds, whereas in the near future Exascale systems will be used to implement extreme-scale data analysis. This article discusses how clouds currently support the development of scalable data mining solutions, and outlines and examines the main challenges to be addressed and solved for implementing innovative data analysis applications on Exascale systems.
In this work, novel circuits based on memristors for implementing electronic synapses and artificial neurons are designed. First, two simple synaptic circuits for implementing weighting calculations in voltage and current modes using twin memristors are proposed. The synaptic weighting operation is defined as a difference function between the twin memristors, which can be adjusted in opposite directions by applying programming signals, realizing positive, zero, and negative synaptic weights. Second, two neuron circuits using the proposed memristor synapses, in which parallel computing and programming can be achieved, are designed. Finally, the performance of the proposed memristor synapses and neuron circuits, such as weight programming, neuron computing, and parallel operation, is analyzed through PSpice simulations. (C) 2018 Elsevier B.V. All rights reserved.
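The twin-memristor idea can be summarized behaviourally: the signed synaptic weight is the difference of the two memristor conductances, and the neuron thresholds a weighted sum of its inputs. A numeric sketch of that behaviour (a stand-in for the analogue circuits, with our own function names, not a circuit model):

```python
def synaptic_weight(g_pos, g_neg):
    """Twin-memristor synapse: the effective weight is the difference
    of the two conductances; programming the pair in opposite
    directions yields positive, zero, or negative weights."""
    return g_pos - g_neg

def neuron_output(inputs, weights, threshold=0.0):
    """Behavioural neuron: weighted sum of inputs followed by a hard
    threshold (the analogue circuit computes the sum in parallel)."""
    s = sum(v * w for v, w in zip(inputs, weights))
    return 1 if s >= threshold else 0

w = synaptic_weight(0.8, 0.3)   # programmed to a positive weight of 0.5
```

In the actual circuits the weighted sum is obtained from Kirchhoff's current law, so all synapses contribute simultaneously, which is the source of the parallel computing the abstract refers to.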
The numerical nonreproducibility of parallel molecular dynamics (MD) simulations, which stems from the non-associative accumulation of floating-point data, poses great challenges for development, debugging, and validation. The most common solutions to this problem are using a high-precision data type or sorting operations, but these solutions carry significant computational overhead. This paper analyzes the sources of nonreproducibility in parallel MD simulations in detail. Two general solutions, namely sorting by force component value and using an 80-bit long double data type, are implemented and evaluated in LAMMPS. To reduce the computational cost, a full-list-based method with the operation order sorted by particle distance is proposed, inspired by the spatial characteristics of MD simulations. An experiment on a system with constant-energy dynamics shows that the new method ensures reproducibility at any degree of parallelism with an extra 50% computational overhead. (C) 2019 Published by Elsevier B.V.
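The core of the sorting-based fixes is to impose a decomposition-independent accumulation order, since floating-point addition is not associative. A minimal sketch of that idea (our own simplification: a single serial sum keyed the way the paper keys by particle distance, not the LAMMPS implementation):

```python
def reproducible_sum(values, keys):
    """Fixed-order accumulation: sort the addends by a key that does
    not depend on the parallel decomposition (e.g. particle distance),
    so the rounded floating-point result is bitwise identical however
    the values arrive."""
    order = sorted(range(len(values)), key=lambda i: (keys[i], values[i]))
    total = 0.0
    for i in order:
        total += values[i]  # same order => same rounding => same bits
    return total

# Two arrival orders of the same (value, key) pairs: a naive sum in
# arrival order may round differently, but the keyed sum cannot.
vals_a, keys_a = [1e16, 1.0, -1e16, 3.0], [2.0, 0.0, 3.0, 1.0]
vals_b, keys_b = [3.0, -1e16, 1.0, 1e16], [1.0, 3.0, 0.0, 2.0]
```

Sorting by force component value works the same way but keys on the summand itself; the paper's distance-keyed variant is cheaper because neighbour lists already carry spatial information.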
Current high-performance computer systems used for scientific computing typically combine shared memory computational nodes in a distributed memory environment. Extracting high performance from these complex systems r...
We present an OpenACC-based parallel implementation of stochastic algorithms for simulating biochemical reaction networks on modern GPUs (graphics processing units). To investigate the effectiveness of OpenACC for leveraging the massive hardware parallelism of the GPU architecture, we carefully apply OpenACC's language constructs and mechanisms to implement a parallel version of stochastic simulation algorithms on the GPU. Comparing our OpenACC implementation to both the NVIDIA CUDA and CPU-based implementations, we report our initial experience with OpenACC's performance and programming productivity in the context of GPU-accelerated scientific computing.
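The workhorse of such simulations is Gillespie's direct-method stochastic simulation algorithm (SSA). A sequential sketch of one trajectory (our own illustration in Python, assuming this is the SSA variant meant; GPU implementations typically run many such independent trajectories in parallel, which is the dimension OpenACC or CUDA exploits):

```python
import math
import random

def ssa_direct(propensities, apply_reaction, x0, t_end, rng):
    """Gillespie direct-method SSA for one trajectory.
    propensities(x) -> list of reaction propensities for state x;
    apply_reaction(x, j) -> state after firing reaction j."""
    x, t = x0, 0.0
    while t < t_end:
        a = propensities(x)
        a0 = sum(a)
        if a0 == 0.0:
            break                                   # no reaction can fire
        t += -math.log(1.0 - rng.random()) / a0     # exponential waiting time
        r, j, acc = rng.random() * a0, 0, a[0]
        while acc < r:                              # choose reaction j with
            j += 1                                  # probability a[j] / a0
            acc += a[j]
        x = apply_reaction(x, j)
    return x

# Example: pure decay A -> 0 with rate constant 0.5 per molecule;
# starting from 100 molecules, the population eventually reaches 0.
final = ssa_direct(lambda x: [0.5 * x], lambda x, j: x - 1,
                   x0=100, t_end=1000.0, rng=random.Random(1))
```

Since trajectories share no state, an OpenACC port can mark the loop over trajectories as a parallel region and keep each trajectory's inner loop sequential on its GPU thread.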