Due to the rapid growth in multicore and GPU-based computing devices, teaching parallel computing in the CS/CE curriculum has become almost mandatory. A course on Parallel Computing Systems (PCS) has been designed to provide an understanding of the fundamental principles and engineering trade-offs involved in designing modern parallel computing systems, as well as to teach the parallel programming techniques necessary to use these machines effectively. An activity-based learning approach was adopted for teaching the course, and several parallel programming paradigms and technologies such as OpenMP, MPI, and CUDA were covered. The course was offered as a required course to graduate students. This paper describes the implementation of the course at Thiagarajar College of Engineering. Evaluation of the implementation reveals that, for students who have not been exposed to parallel and distributed computing, i) activity-based learning results in better knowledge gain than the traditional approach, ii) learning OpenMP was much easier than MPI or CUDA, iii) some Parallel and Distributed Computing (PDC) concepts, such as false sharing, were harder to grasp than basic concepts, and iv) it is essential to introduce parallel computing in the undergraduate curriculum.
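As a side note on false sharing, the concept the evaluation found hardest to grasp: below is a minimal OpenMP sketch (not from the course material; the thread count, iteration count, and the assumed 64-byte cache line are illustrative) in which per-thread counters packed into a single cache line update far more slowly than padded counters, even though no data is logically shared.

// Minimal false-sharing sketch (illustrative, not course material).
// Each thread increments only its own counter, yet the packed version
// suffers because all counters share one cache line; padding each
// counter to an assumed 64-byte line removes the effect.
#include <omp.h>
#include <cstdio>

constexpr int kThreads = 4;          // illustrative thread count
constexpr long kIters = 100000000;   // illustrative iteration count

struct Padded { long value; char pad[64 - sizeof(long)]; };

int main() {
    long packed[kThreads] = {};    // counters packed into one cache line
    Padded padded[kThreads] = {};  // one cache line per counter

    double t0 = omp_get_wtime();
    #pragma omp parallel num_threads(kThreads)
    {
        int id = omp_get_thread_num();
        for (long i = 0; i < kIters; ++i) packed[id]++;        // false sharing
    }
    double t1 = omp_get_wtime();

    #pragma omp parallel num_threads(kThreads)
    {
        int id = omp_get_thread_num();
        for (long i = 0; i < kIters; ++i) padded[id].value++;  // no false sharing
    }
    double t2 = omp_get_wtime();

    std::printf("packed: %.2f s   padded: %.2f s\n", t1 - t0, t2 - t1);
}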
ISBN (print): 9781728129877
A theoretical and experimental analysis of MPI_Bcast algorithms is presented. The optimal tree degrees and segment sizes for pipelined versions of the algorithms are obtained. The algorithms were investigated as implemented in the Open MPI library. Theoretical results are consistent with experiments on a computer cluster with Gigabit Ethernet and InfiniBand communication networks.
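For context, a minimal timing sketch of the kind such studies rely on (not the paper's code; the message size and repetition count are arbitrary): Open MPI selects the broadcast algorithm internally, and timing repeated MPI_Bcast calls exposes the effect of that choice.

// Minimal MPI_Bcast timing sketch (illustrative; not the paper's benchmark).
#include <mpi.h>
#include <vector>
#include <cstdio>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const int count = 1 << 20;                              // 1 Mi doubles = 8 MiB
    std::vector<double> buf(count, rank == 0 ? 1.0 : 0.0);  // root holds the payload
    const int reps = 20;

    MPI_Barrier(MPI_COMM_WORLD);
    double t0 = MPI_Wtime();
    for (int i = 0; i < reps; ++i)
        MPI_Bcast(buf.data(), count, MPI_DOUBLE, 0, MPI_COMM_WORLD);
    double t1 = MPI_Wtime();

    if (rank == 0)
        std::printf("average MPI_Bcast time: %.6f s\n", (t1 - t0) / reps);
    MPI_Finalize();
}

In Open MPI the collective algorithm can usually be forced through MCA parameters of the coll/tuned component (for example, coll_tuned_use_dynamic_rules together with coll_tuned_bcast_algorithm); exact parameter names and algorithm numbering may vary between releases.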
Many real-world applications feature data accesses on periodic domains. Manually implementing the synchronizations and communications associated with the data dependences in each case is cumbersome and error-prone, so it is increasingly interesting to support these applications in high-level parallel programming languages or parallelizing compilers. In this paper, we present a technique that, for distributed-memory systems, calculates the specific communications derived from data-parallel codes with or without periodic boundary conditions on affine access expressions. It makes the management of aggregated communications for the chosen data partition transparent to the programmer. Our technique moves to runtime part of the compile-time analysis typically used to generate communication code for affine expressions, introducing a completely new technique that also supports periodic boundary conditions. We present an experimental study evaluating our proposal on several case studies. The results show that our approach can automatically obtain communication codes as efficient as those found in MPI reference codes, while reducing the development effort.
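For context, the kind of hand-written communication the technique is meant to generate automatically: a sketch (not the paper's generated code; the 1-D decomposition, array size, and tags are illustrative) of the halo exchange for a data-parallel stencil with periodic boundary conditions, where neighbor ranks wrap around at the domain ends.

// Hand-written halo exchange for a 1-D stencil with periodic boundaries
// (illustrative sketch; this is the manual code the technique automates).
#include <mpi.h>
#include <vector>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    const int n_local = 1024;                  // owned elements per rank
    std::vector<double> u(n_local + 2, rank);  // u[0] and u[n_local+1] are halo cells

    int left  = (rank - 1 + size) % size;      // periodic wrap-around
    int right = (rank + 1) % size;

    // Exchange boundary values with both neighbors to fill the halos.
    MPI_Sendrecv(&u[n_local], 1, MPI_DOUBLE, right, 0,
                 &u[0],       1, MPI_DOUBLE, left,  0,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    MPI_Sendrecv(&u[1],           1, MPI_DOUBLE, left,  1,
                 &u[n_local + 1], 1, MPI_DOUBLE, right, 1,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);

    // ... stencil update over u[1..n_local] would follow here ...
    MPI_Finalize();
}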
ISBN (print): 9781450364393
The current interface provided by the C++17 parallel algorithms poses some limitations with respect to parallel data access and heterogeneous systems, such as personal computers and server nodes with GPUs, smartphones, and embedded system-on-chip devices. In this paper, we present a summary of why we believe the Ranges TS solves these problems and also improves both programmability and performance on heterogeneous platforms. The complete paper has been submitted to WG21 for consideration; here we present a summary of the proposed changes alongside new performance results. To the best of our knowledge, this is the first paper presented to WG21 that unifies the Ranges TS with the parallel algorithms introduced in C++17. Although there are various points of intersection, we focus on the composability of functions and the benefit this brings to accelerator devices via kernel fusion.
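To illustrate the composability argument in conventional terms (a hedged sketch, not the interface proposed to WG21; it uses the C++20 ranges library, the standardized successor of the Ranges TS): two chained C++17 parallel algorithm calls traverse the data twice and materialize a temporary, whereas composed range views express the same computation as a single lazy pass that a heterogeneous back end could lower to one fused kernel.

// Two C++17 parallel passes versus one composed (fused) range pass.
// Illustrative sketch only; the fused view is consumed sequentially here,
// while the proposal's point is that an accelerator-aware implementation
// could execute such a composition as a single kernel.
#include <algorithm>
#include <execution>
#include <numeric>
#include <ranges>
#include <vector>

int main() {
    std::vector<float> in(1 << 20);
    std::iota(in.begin(), in.end(), 0.0f);
    std::vector<float> tmp(in.size()), out(in.size());

    // C++17 style: two parallel traversals with an intermediate buffer.
    std::transform(std::execution::par, in.begin(), in.end(), tmp.begin(),
                   [](float x) { return x * 2.0f; });
    std::transform(std::execution::par, tmp.begin(), tmp.end(), out.begin(),
                   [](float x) { return x + 1.0f; });

    // Ranges style: the two steps compose lazily into one traversal.
    auto fused = in | std::views::transform([](float x) { return x * 2.0f; })
                    | std::views::transform([](float x) { return x + 1.0f; });
    std::ranges::copy(fused, out.begin());
}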
ISBN (print): 9781538683842
With increasing core counts, the scalability of directory-based cache coherence has become a challenging problem. To reduce the area and power needs of the directory, recent proposals reduce its size by classifying data as private or shared and disabling coherence for private data. However, existing classification methods suffer from inaccuracies and require complex hardware support with limited scalability. This paper proposes a hardware/software co-designed approach: the runtime system identifies data that is guaranteed by the programming model semantics not to require coherence and notifies the microarchitecture. The microarchitecture deactivates coherence for this private data and powers off unused directory capacity. Our proposal reduces directory accesses to just 26% of the baseline system and supports a 64x smaller directory with only 2.8% performance degradation. By dynamically calibrating the directory size, our proposal saves 86% of the dynamic energy consumption in the directory without harming performance.
ISBN (print): 9781538655559
Computing the Hierarchical Equations of Motion (HEOM) is by itself a challenging problem, and so is writing portable production code that runs efficiently on a variety of architectures while scaling from PCs to supercomputers. We combined both challenges to push the boundaries of simulating quantum systems, and to evaluate and improve methodologies for scientific software engineering. Our contributions are threefold: we present the first distributed-memory implementation of the HEOM method (DM-HEOM), we describe an interdisciplinary development workflow, and we provide guidelines and experiences for designing distributed, performance-portable HPC applications with MPI-3, OpenCL, and other state-of-the-art programming models. We evaluate the resulting code on multi- and many-core CPUs as well as GPUs, and demonstrate scalability on a Cray XC40 supercomputer for the PS I molecular light-harvesting complex.
ISBN (print): 9781728129877
This work is devoted to the problem of detecting and handling faults of computing nodes during the execution of parallel programs on distributed computing systems. The fault tolerance tools of PBS/TORQUE are considered. A functional model for fault-handling optimization is proposed.
ISBN (print): 9781728151267
High-performance embedded computing is developing rapidly, since applications in most domains require a large and increasing amount of computing power. On the hardware side, this requirement is met by the introduction of heterogeneous systems, with highly parallel accelerators designed to take care of the computation-heavy parts of an application. There is today a plethora of accelerator architectures, including GPUs, many-cores, FPGAs, and domain-specific architectures such as AI accelerators. They all have their own programming models, which are typically complex, low-level, and involve explicit parallelism. This yields error-prone software that puts functional safety at risk, which is unacceptable for safety-critical embedded applications. In this position paper we argue that high-level executable modelling languages tailored for parallel computing can help in the software design of high-performance embedded applications. In particular, we consider the data-parallel model a suitable candidate, since it allows very abstract parallel algorithm specifications that are free from race conditions. Moreover, we promote the Action Language for fUML (and thereby fUML) as a suitable host language.
An algorithm for optimizing the mapping of MPI processes is adapted for supercomputers with the Angara interconnect. The mapping algorithm is based on partitioning the communication pattern of the parallel program, so that the processes with the most intensive exchanges are placed on nodes/processors connected by the highest-bandwidth links. The algorithm finds a near-optimal assignment of processes to processor cores that minimizes the total time spent in exchanges between MPI processes. Results of the optimized process placement obtained with the proposed method on small supercomputers are analyzed, as is the dependence of MPI program execution time on supercomputer and task parameters. A theoretical model is proposed for estimating the effect of mapping optimization on execution time for several types of supercomputer topologies. The prospect of using the implemented optimization library on large-scale supercomputers with the Angara interconnect is discussed.
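For illustration, the standard MPI mechanism on which communication-aware mapping builds: the application exposes its communication pattern as a weighted graph and allows the library to reorder ranks. The sketch below is not the Angara-specific library; the ring pattern and edge weights are invented.

// Expose a ring communication pattern to MPI and allow rank reordering.
#include <mpi.h>
#include <cstdio>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    // Each rank exchanges most intensively with its two ring neighbors.
    int neighbors[2] = { (rank - 1 + size) % size, (rank + 1) % size };
    int weights[2]   = { 100, 100 };   // relative exchange intensity (invented)

    MPI_Comm graph_comm;
    MPI_Dist_graph_create_adjacent(MPI_COMM_WORLD,
                                   2, neighbors, weights,   // incoming edges
                                   2, neighbors, weights,   // outgoing edges
                                   MPI_INFO_NULL, /*reorder=*/1, &graph_comm);

    int new_rank;
    MPI_Comm_rank(graph_comm, &new_rank);
    std::printf("old rank %d -> new rank %d\n", rank, new_rank);

    MPI_Comm_free(&graph_comm);
    MPI_Finalize();
}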
ISBN (print): 9781538649756
At the LHC, particles are collided in order to understand how the universe was created. These collisions are called events and generate large quantities of data, which have to be pre-filtered before being stored to disk. This paper presents a parallel implementation of these filtering algorithms that is specifically designed for the Intel Xeon Phi Knights Landing platform, exploiting its 64 cores and the AVX-512 instruction set. It shows that a near-linear speedup up to approximately 64 threads is attainable when vectorization is used, data is aligned to cache-line boundaries, program execution is pinned to MCDRAM, mathematical expressions are transformed into more efficient equivalent formulations, and OpenMP is used for parallelization. The code was transformed from being compute bound to memory bound. Overall, a speedup of 36.47x was reached while obtaining an error smaller than the detector resolution.
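A hedged sketch of the optimization style listed above (not the LHC filtering code; the axpy kernel and array size are illustrative): OpenMP threading combined with SIMD vectorization and 64-byte aligned allocations so data starts on cache-line boundaries. Binding the working set to MCDRAM is typically done outside the program, for example with numactl, and is not shown.

// OpenMP threads + SIMD with cache-line-aligned data (illustrative sketch).
#include <omp.h>
#include <cstdlib>
#include <cstdio>

int main() {
    const std::size_t n = 1 << 24;
    // 64-byte aligned allocations (cache line / AVX-512 vector width).
    float* x = static_cast<float*>(std::aligned_alloc(64, n * sizeof(float)));
    float* y = static_cast<float*>(std::aligned_alloc(64, n * sizeof(float)));
    for (std::size_t i = 0; i < n; ++i) { x[i] = 1.0f; y[i] = 2.0f; }

    double t0 = omp_get_wtime();
    // Threads across cores, SIMD lanes within each core.
    #pragma omp parallel for simd aligned(x, y : 64) schedule(static)
    for (std::size_t i = 0; i < n; ++i)
        y[i] = 2.5f * x[i] + y[i];
    double t1 = omp_get_wtime();

    std::printf("axpy over %zu floats: %.4f s\n", n, t1 - t0);
    std::free(x);
    std::free(y);
}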