ISBN (digital): 9783642551956
ISBN (print): 9783642551956
In this paper we describe an optimized implementation of a Lattice Boltzmann (LB) code on the BlueGene/Q system, the latest-generation massively parallel system of the BlueGene family. We consider a state-of-the-art LB code that accurately reproduces the thermo-hydrodynamics of a 2D fluid obeying the equation of state of a perfect gas. The regular structure of LB algorithms offers several levels of algorithmic parallelism that can be matched by a massively parallel computer architecture. However, the complex memory access patterns associated with our LB model make it non-trivial to efficiently exploit all the available parallelism. We describe our implementation strategies, based on previous experience with clusters of many-core processors and GPUs, present results, and analyze and compare performance.
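The regular, data-parallel structure mentioned in this abstract can be illustrated with the streaming (propagate) kernel of a generic LB code. The sketch below is a minimal example assuming a simple D2Q9 lattice and OpenMP threading; the paper's actual model uses many more populations and a different target architecture, and all names and sizes here are illustrative only.

```c
/* Minimal sketch of the LB streaming (propagate) kernel, assuming a
 * simple D2Q9 lattice on an LX x LY grid with a halo of one site.
 * The collide step and the paper's real (larger) population set are
 * omitted; all identifiers are hypothetical. */
#include <omp.h>

#define LX   256
#define LY   256
#define NPOP 9                     /* D2Q9 populations (assumption) */

static const int cx[NPOP] = { 0, 1, 0,-1, 0, 1,-1,-1, 1 };
static const int cy[NPOP] = { 0, 0, 1, 0,-1, 1, 1,-1,-1 };

/* f[p][x][y]: population p at site (x,y); two copies for ping-pong */
static double f_old[NPOP][LX + 2][LY + 2];
static double f_new[NPOP][LX + 2][LY + 2];

/* Streaming step: each site pulls populations from its neighbours.
 * The loop nest is fully data-parallel, which is the site-level
 * parallelism that a massively parallel machine can exploit. */
void propagate(void)
{
    #pragma omp parallel for collapse(2)
    for (int x = 1; x <= LX; x++)
        for (int y = 1; y <= LY; y++)
            for (int p = 0; p < NPOP; p++)
                f_new[p][x][y] = f_old[p][x - cx[p]][y - cy[p]];
}
```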
ISBN (print): 9781450365109
We consider the loopless k-shortest path (KSP) problem. Although this problem has been studied in the sequential setting for at least the last two decades, no good parallel implementations are known. In this paper, we (i) provide a first systematic empirical comparison of various KSP algorithms and heuristic optimisations, (ii) carefully engineer parallel implementations of these sequential algorithms, and (iii) perform an extensive study of these parallel implementations on a range of graph classes and multicore architectures to determine the best algorithm and parallelisation strategy for different graph classes. We find that even though the worst-case complexity of the best undirected KSP algorithm, O(k(m + n log n)), is significantly better than that of the popular and considerably simpler directed KSP algorithm, O(kn(m + n log n)), the two algorithms are fairly competitive in terms of their empirical performance on small-diameter graphs. Furthermore, we show that a few simple optimisations help to bridge the gap between these KSP algorithms even more. However, on moderate- to large-diameter graphs, the undirected KSP algorithm is considerably faster than the directed algorithms, in both sequential and parallel settings. In terms of the parallelisation strategy, simply replacing the shortest-path subroutine by the parallel Δ-stepping algorithm can provide a good speed-up for many KSP algorithms on random graphs. In contrast, for graphs with a skewed degree distribution, a more complex strategy of parallelising the different deviations and then parallelising the shortest-path computation inside the deviations with the remaining threads provides better performance.
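The "parallelising the different deviations" strategy can be pictured with a Yen-style skeleton in which the candidate deviation paths of one iteration are computed by independent shortest-path calls. The sketch below is only an outline under assumed helpers (shortest_path, deviate, push_candidate, pop_best_candidate are hypothetical) and omits the loopless bookkeeping; it is not the paper's implementation.

```c
/* Sketch of a Yen-style k-shortest-path loop with the deviations of one
 * iteration computed in parallel (OpenMP).  graph_t, path_t and all
 * helper functions are hypothetical placeholders. */
#include <omp.h>
#include <stddef.h>

typedef struct graph_t graph_t;
typedef struct path_t  path_t;

/* assumed helpers (not defined here) */
path_t *shortest_path(const graph_t *g, int src, int dst);
path_t *deviate(const graph_t *g, const path_t *prev, int spur_idx);
void    push_candidate(path_t *p);      /* thread-safe insert             */
path_t *pop_best_candidate(void);       /* shortest remaining candidate   */
size_t  path_nodes(const path_t *p);

void k_shortest_paths(const graph_t *g, int s, int t, int k, path_t **out)
{
    out[0] = shortest_path(g, s, t);    /* P1: an ordinary SSSP call      */

    for (int i = 1; i < k; i++) {
        const path_t *prev = out[i - 1];
        int nspur = (int)path_nodes(prev) - 1;

        /* Each spur node of the previous path yields an independent
         * deviation; these shortest-path computations do not interact,
         * so they can run on separate threads.  Inside deviate() the
         * remaining threads could run a parallel SSSP (e.g. Δ-stepping). */
        #pragma omp parallel for schedule(dynamic)
        for (int j = 0; j < nspur; j++) {
            path_t *cand = deviate(g, prev, j);
            if (cand) push_candidate(cand);
        }
        out[i] = pop_best_candidate();  /* best unused candidate so far   */
    }
}
```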
ISBN (print): 9783030050573; 9783030050566
Gurevich's thesis stipulates that sequential abstract state machines (ASMs) capture the essence of sequential algorithms. On the other hand, the bulk-synchronous parallel (BSP) bridging model is a well-known model for HPC algorithm design. It provides a conceptual bridge between the physical implementation of the machine and the abstraction available to a programmer of that machine. The assumptions of the BSP model thus provide portable and scalable performance predictions on most HPC systems. We follow Gurevich's thesis and extend the sequential postulates in order to intuitively and realistically capture BSP algorithms.
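For reference, the portable performance predictions mentioned here rest on the standard BSP cost model (this is the textbook formulation, not something specific to this paper): a superstep with local work w_s, maximal h-relation h_s, bandwidth parameter g and barrier latency l costs

```latex
% Standard BSP cost of one superstep and of a whole algorithm of S supersteps
% (textbook formulation; w_s, h_s, g, l are the usual BSP parameters).
T_{\text{superstep}\,s} = w_s + h_s \cdot g + l,
\qquad
T_{\text{BSP}} = \sum_{s=1}^{S} \left( w_s + h_s \cdot g + l \right).
```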
Data mining algorithms are expensive by nature, but when dealing with today's dataset sizes, they become even slower and harder to use. Previous work has focused on parallelizing data mining algorithms on d...
The parallelization of numerical simulation algorithms, i.e., their adaptation to parallel processing architectures, is a goal to reach in order to avoid exorbitant execution times. Parallelism has been imposed at the level of processor architectures, and graphics cards are now used for general-purpose computation, also known as General-Purpose computation on Graphics Processing Units (GPGPU). The clear benefit is the excellent performance-to-price ratio. Besides hiding the low-level programming, software engineering leads to faster and more secure application development. This paper presents the real interest of using GPU processors to increase the performance of larger problems that concern electrical machine simulation. Indeed, we show that our auto-generated code applied to several models achieves speedups of the order of 10x.
ISBN (print): 9781479975051
Sorting is one of the classic problems of data processing, and many practical applications require implementations of parallel sorting algorithms. Only a few such algorithms have been implemented using MPI; in this paper, a few additional parallel sorting algorithms are implemented using MPI. A unified performance analysis of all these algorithms is presented on two different architectures. On the basis of the experimental results obtained, some guidelines are suggested for the selection of the proper algorithm.
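As an illustration of the kind of MPI sorting algorithm such a comparison would cover, the sketch below implements a plain odd-even transposition sort of per-rank blocks (a simple choice, not necessarily the best performer); the block size and the int datatype are assumptions.

```c
/* Sketch of a simple MPI parallel sort: local quicksort followed by
 * odd-even transposition merge phases between neighbouring ranks.
 * N (elements per rank) and the element type are assumptions. */
#include <mpi.h>
#include <stdlib.h>
#include <string.h>

#define N 1024                        /* elements per rank (assumption) */

static int cmp_int(const void *a, const void *b)
{
    int x = *(const int *)a, y = *(const int *)b;
    return (x > y) - (x < y);
}

/* Merge our block with the partner's block and keep either the lower or
 * the upper half, so the sequence becomes globally sorted across ranks. */
static void exchange_and_keep(int *mine, int partner, int keep_low)
{
    int theirs[N], merged[2 * N];
    MPI_Sendrecv(mine, N, MPI_INT, partner, 0,
                 theirs, N, MPI_INT, partner, 0,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    memcpy(merged, mine, N * sizeof(int));
    memcpy(merged + N, theirs, N * sizeof(int));
    qsort(merged, 2 * N, sizeof(int), cmp_int);
    memcpy(mine, keep_low ? merged : merged + N, N * sizeof(int));
}

void parallel_sort(int *local)        /* local[] holds N ints per rank */
{
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    qsort(local, N, sizeof(int), cmp_int);
    for (int phase = 0; phase < size; phase++) {
        int partner = ((phase + rank) % 2 == 0) ? rank + 1 : rank - 1;
        if (partner < 0 || partner >= size) continue;
        exchange_and_keep(local, partner, rank < partner);
    }
}
```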
ISBN (print): 0818656026
Barrier algorithms are central to the performance of numerous algorithms on scalable, high-performance architectures. Numerous barrier algorithms have been suggested and studied for Non-Uniform Memory Access (NUMA) architectures, but less work has been done for Cache Only Memory Access (COMA) or attraction memory [1] architectures such as the KSR-1. In this paper, we present two new barrier algorithms that offer the best performance we have recorded on the KSR-1 distributed cache multiprocessor. We discuss the trade-offs and the performance of seven algorithms on two architectures. The new barrier algorithms adapt well to a hierarchical caching memory model and take advantage of the parallel communication offered by most multiprocessor interconnection networks. Performance results are shown for a 256-processor KSR-1 and a 20-processor Sequent Symmetry.
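A centralized sense-reversing barrier is the usual baseline in comparisons of this kind; a minimal sketch using C11 atomics is shown below. This is not one of the paper's KSR-1-tuned algorithms, which exploit the hierarchical cache instead of a single shared counter.

```c
/* Minimal centralized sense-reversing barrier using C11 atomics.
 * A common baseline in barrier studies; hierarchical/tree barriers
 * such as those in the paper reduce contention on the shared counter. */
#include <stdatomic.h>
#include <stdbool.h>

typedef struct {
    atomic_int  count;        /* threads still missing in this episode  */
    atomic_bool sense;        /* global sense, flipped once per episode */
    int         nthreads;
} barrier_t;

void barrier_init(barrier_t *b, int nthreads)
{
    atomic_init(&b->count, nthreads);
    atomic_init(&b->sense, false);
    b->nthreads = nthreads;
}

/* Each thread keeps its own local_sense (e.g. in thread-local storage)
 * and flips it on every barrier episode. */
void barrier_wait(barrier_t *b, bool *local_sense)
{
    *local_sense = !*local_sense;
    if (atomic_fetch_sub(&b->count, 1) == 1) {
        /* last thread: reset the counter and release everybody */
        atomic_store(&b->count, b->nthreads);
        atomic_store(&b->sense, *local_sense);
    } else {
        while (atomic_load(&b->sense) != *local_sense)
            ;                 /* spin until the episode's sense flips */
    }
}
```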
ISBN (print): 9783642246494
The single memory-CPU communication channel bottleneck of the von Neumann architecture is quickly stalling the growth of computer processors. A probable solution to this problem is to fuse processing and memory elements. A simple low-latency single on-chip memory and processor cannot solve the problem, as the fundamental channel bottleneck will still be there due to the logical splitting of processor and memory. This paper shows that a paradigm shift is possible by combining Arithmetic Logic Unit and Random Access Memory (ARAM) elements at the bit level. This modest bit-level ARAM is used to perform word-level ALU instructions with minor modifications. This makes the ARAM cells capable of executing instructions in parallel. It is also asynchronous and hence reduces power consumption significantly. A CMOS implementation is presented that verifies the practicality of the proposed ARAM.
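To make the bit-level idea concrete, the sketch below models in software (purely as an assumption about how such cells might be chained, not the paper's CMOS design) a row of one-bit "ALU + RAM" cells performing a word-level ripple-carry addition in place.

```c
/* Software model of a row of bit-level "ALU + RAM" cells chained into a
 * word-wide ripple-carry adder.  Conceptual illustration only; the paper
 * proposes a CMOS circuit, not C code. */
#include <stdint.h>
#include <stdio.h>

#define WORD_BITS 16

typedef struct {
    uint8_t stored;                 /* the bit held in the cell's RAM part */
} aram_cell;

/* Each cell adds an incoming operand bit to its stored bit plus the carry
 * from its neighbour, keeps the sum in place, and forwards the carry. */
static uint8_t cell_add(aram_cell *c, uint8_t in_bit, uint8_t carry_in)
{
    uint8_t sum       = c->stored ^ in_bit ^ carry_in;
    uint8_t carry_out = (c->stored & in_bit) |
                        (carry_in & (c->stored ^ in_bit));
    c->stored = sum;                /* result stays in memory (in place)   */
    return carry_out;
}

/* Word-level "add operand to memory word" built from the bit cells. */
static void aram_add_word(aram_cell row[WORD_BITS], uint16_t operand)
{
    uint8_t carry = 0;
    for (int i = 0; i < WORD_BITS; i++)          /* LSB first */
        carry = cell_add(&row[i], (operand >> i) & 1u, carry);
}

int main(void)
{
    aram_cell row[WORD_BITS] = {0};
    aram_add_word(row, 1234);                    /* row now stores 1234 */
    aram_add_word(row, 4321);                    /* row now stores 5555 */

    uint16_t value = 0;
    for (int i = 0; i < WORD_BITS; i++)
        value |= (uint16_t)(row[i].stored & 1u) << i;
    printf("%u\n", value);                       /* prints 5555 */
    return 0;
}
```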
ISBN (print): 9781424416936
Matrix decomposition applications that involve large matrix operations can take advantage of the flexibility and adaptability of reconfigurable computing systems to improve performance. The benefits come from replication, which includes vertical replication and horizontal replication. Viewed on a space-time chart, vertical replication allows multiple computations to execute in parallel, and horizontal replication renders multiple functions on the same piece of hardware. In this paper, the reconfigurable architecture that supports replication for matrix decomposition applications on reconfigurable computing systems is described, and issues including the comparison of algorithms on the system and data movement between the internal computation cores and the external memory subsystem are addressed. A prototype of such a system is implemented as a proof of concept. It is expected to improve the performance and scalability of matrix decomposition involving large matrices.
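As an example of the kind of kernel such a system would accelerate, the sketch below shows a plain Doolittle LU factorization in C; the choice of decomposition, the absence of pivoting, and the row-major layout are simplifying assumptions for illustration, not details taken from the paper.

```c
/* Plain in-place Doolittle LU decomposition (no pivoting) of an n x n
 * row-major matrix: the kind of regular kernel whose inner loops a
 * reconfigurable system can replicate vertically (parallel lanes) and
 * horizontally (time-multiplexed functions). */
#include <stddef.h>

/* After the call, a[] holds U in its upper triangle and the unit-lower-
 * triangular L (without the implicit 1s on the diagonal) below it. */
void lu_decompose(double *a, size_t n)
{
    for (size_t k = 0; k < n; k++) {
        for (size_t i = k + 1; i < n; i++) {
            double m = a[i * n + k] / a[k * n + k];   /* multiplier l_ik */
            a[i * n + k] = m;
            for (size_t j = k + 1; j < n; j++)        /* update row i    */
                a[i * n + j] -= m * a[k * n + j];
        }
    }
}
```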
ISBN (print): 9783642202810
This paper reports on methods for the parallelization of artificial neural network algorithms using multithreaded and multicore CPUs in order to speed up the training process. The developed algorithms were implemented in two common parallel programming paradigms, and their performance is assessed using four datasets with diverse amounts of patterns and with different neural network architectures. All results show a significant increase in computation speed; training time is reduced nearly linearly with the number of cores for problems with very large training datasets.
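One common shared-memory paradigm for this kind of training is data parallelism over the training patterns; a hedged sketch is shown below (a single linear layer with a mean-squared-error gradient, OpenMP 4.5 array reduction, and all names and sizes hypothetical -- not the paper's networks or datasets).

```c
/* Sketch of data-parallel batch gradient computation for a tiny
 * single-layer (linear) model using OpenMP: the training patterns are
 * split across threads and per-thread gradients are summed by an
 * OpenMP 4.5 array reduction.  Sizes and layout are illustrative only. */
#include <omp.h>

#define N_IN  64                  /* inputs per pattern (assumption)  */
#define N_PAT 100000              /* training patterns  (assumption)  */

void batch_gradient(const float x[N_PAT][N_IN], const float y[N_PAT],
                    const float w[N_IN], float grad[N_IN])
{
    for (int j = 0; j < N_IN; j++) grad[j] = 0.0f;

    /* Each thread accumulates the gradient for its share of the patterns;
     * the reduction clause sums the per-thread partial gradients. */
    #pragma omp parallel for reduction(+ : grad[:N_IN])
    for (int p = 0; p < N_PAT; p++) {
        float out = 0.0f;
        for (int j = 0; j < N_IN; j++) out += w[j] * x[p][j];
        float err = out - y[p];                   /* d(MSE)/d(out) */
        for (int j = 0; j < N_IN; j++) grad[j] += err * x[p][j];
    }
}
```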