We present a case for automating the selection of MPI-IO performance optimizations, with the ultimate goal of relieving the application programmer of these details and thereby improving productivity. Programmer productivity has always been overlooked in the high performance computing community compared to performance optimization. In this paper we present RFSA, a reduced function set abstraction based on an existing parallel programming interface for I/O (MPI-IO). MPI-IO provides high performance I/O function calls to the scientists and engineers writing parallel programs, but requires them to choose the most appropriate optimization of a specific function, which limits programmer productivity. We therefore propose a set of reduced functions with an automatic selection algorithm that decides which specific MPI-IO function to use. We implement a selection algorithm for I/O functions such as read and write. RFSA replaces six different flavors of the read and write functions with one read and one write function. By running different parallel I/O benchmarks on both medium-scale clusters and NERSC supercomputers, we show that the RFSA functions impose minimal performance penalties.
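As a rough illustration of the idea (not the paper's actual API), a reduced write call could dispatch to a specific MPI-IO flavor from a few access-pattern hints. All names here (`Variant`, `select_write_variant`, the hint parameters) are hypothetical stand-ins for the RFSA selection logic:

```python
from enum import Enum, auto

class Variant(Enum):
    # Hypothetical stand-ins for MPI-IO write flavors
    INDEPENDENT = auto()        # e.g. MPI_File_write_at
    COLLECTIVE = auto()         # e.g. MPI_File_write_at_all
    NONBLOCKING = auto()        # e.g. MPI_File_iwrite_at
    SPLIT_COLLECTIVE = auto()   # e.g. MPI_File_write_at_all_begin/end

def select_write_variant(n_procs, contiguous, overlap_compute):
    """Toy selection heuristic in the spirit of RFSA: map one generic
    write() onto a specific MPI-IO flavor from simple access hints."""
    if n_procs > 1 and not contiguous:
        # Many small interleaved requests benefit from collective buffering.
        return Variant.SPLIT_COLLECTIVE if overlap_compute else Variant.COLLECTIVE
    if overlap_compute:
        return Variant.NONBLOCKING
    return Variant.INDEPENDENT
```

The point is that the caller sees a single function; the cost of choosing among six flavors moves into one documented decision procedure.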
In this paper we analyze the teaching and learning of parallel processing through performance analysis using a software tool called Prober. This tool is a functional and performance analyzer of parallel programs that we proposed and developed during an undergraduate research project. Our teaching and learning approach consists of a practical class in which students receive explanations of some concepts of parallel processing and of the use of the tool. They then perform simple, guided performance tests on parallel programs and analyze the results using Prober as their sole aid. Finally, students answer a self-assessment questionnaire about their educational background, their knowledge of parallel processing concepts, and the usability of Prober. Our main goal is to show that, with our approach, students can learn concepts of parallel processing in a clearer, faster, and more efficient way.
ISBN: (Print) 9781728129877
This work is devoted to the problem of detecting and handling faults of computing nodes during the execution of parallel programs on distributed computing systems. The fault tolerance tools of PBS/TORQUE are considered, and a functional model for optimizing fault handling is proposed.
ISBN: (Print) 9781450328210
A data-graph computation — popularized by such programming systems as Galois, Pregel, GraphLab, PowerGraph, and GraphChi — is an algorithm that performs local updates on the vertices of a graph. During each round of a data-graph computation, an update function atomically modifies the data associated with a vertex as a function of the vertex's prior data and that of adjacent vertices. A dynamic data-graph computation updates only an active subset of the vertices during a round, and those updates determine the set of active vertices for the next round. This paper introduces PRISM, a chromatic-scheduling algorithm for executing dynamic data-graph computations. PRISM uses a vertex-coloring of the graph to coordinate updates performed in a round, precluding the need for mutual-exclusion locks or other nondeterministic data synchronization. PRISM uses a multibag data structure to maintain the dynamic set of active vertices as an unordered set partitioned by color. We analyze PRISM using work-span analysis. Let G=(V,E) be a degree-Δ graph colored with χ colors, and suppose that Q⊆V is the set of active vertices in a round. Define size(Q) = |Q| + Σ_{v∈Q} deg(v), which is proportional to the space required to store the vertices of Q using a sparse-graph layout. We show that a P-processor execution of PRISM performs the updates in Q using O(χ(lg(|Q|/χ) + lg Δ) + lg P) span and Θ(size(Q) + χ + P) work. These theoretical guarantees are matched by good empirical performance. We modified GraphLab to incorporate PRISM and studied seven application benchmarks on a 12-core multicore machine. PRISM executes the benchmarks 1.2–2.1 times faster than GraphLab's nondeterministic lock-based scheduler while providing deterministic behavior. This paper also presents PRISM-R, a variation of PRISM that executes dynamic data-graph computations deterministically even when updates modify global variables with associative operations. PRISM-R satisfies the same theoretical bounds as PRISM, but its implementation is
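The chromatic-scheduling idea can be sketched in a few lines. This is a serial toy model, not the paper's implementation: vertices in one color class are pairwise non-adjacent, so their updates never conflict and could safely run in parallel. The names (`greedy_color`, `prism_round`, `relax_max`) are illustrative only:

```python
from collections import defaultdict

def greedy_color(adj):
    """Greedy vertex coloring: neighbors never share a color."""
    color = {}
    for v in adj:
        used = {color[u] for u in adj[v] if u in color}
        c = 0
        while c in used:
            c += 1
        color[v] = c
    return color

def prism_round(adj, color, data, active, update):
    """One round in the style of PRISM: bucket the active vertices by
    color (the 'multibag'), then process each color class in turn.
    Vertices of one class are non-adjacent, so within a class the
    updates are conflict-free and need no locks."""
    bags = defaultdict(list)
    for v in active:
        bags[color[v]].append(v)
    next_active = set()
    for c in sorted(bags):
        for v in bags[c]:              # safe to parallelize per color
            next_active |= set(update(v, data, adj))
    return next_active

def relax_max(v, data, adj):
    """Example update: take the max over the neighborhood and
    reactivate any neighbor that still lags behind."""
    new = max([data[v]] + [data[u] for u in adj[v]])
    if new != data[v]:
        data[v] = new
        return [u for u in adj[v] if data[u] < new]
    return []
```

Running `prism_round` until the active set empties propagates the maximum label over a connected graph, illustrating how the updates in one round determine the next round's active set.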
Parallelism is a suitable approach for speeding up the massive computations of applications, but parallel programming is still difficult. An algorithmic skeleton is a parallel programming model that provides a high level of abstraction for programmers, using pre-defined components to make parallel programming easier. Divide and conquer (DC) is an appropriate parallel pattern to implement as a skeleton: the solution of the original problem is obtained by dividing it into smaller sub-problems and solving them in parallel. Today, the graphics processing unit (GPU) is an attractive processor for performing tasks in parallel because it has a large number of processing units. In this paper, a divide and conquer skeleton on GPU, named OC_GPU, is proposed. OC_GPU is a divide and conquer skeleton implemented on GPU that uses a consistent programming interface in C++ for easier parallel programming. The performance of this skeleton has been evaluated with mergesort and Sobel edge detection. The results show that the speedup obtained by this skeleton on GPU is more than 2.
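The shape of such a skeleton is easy to show in miniature. In this sketch (sequential, and not the paper's C++/GPU interface; `dc_skeleton` and its parameter names are invented for illustration), the four user-supplied functions are the only problem-specific code, exactly as in the skeleton approach:

```python
def dc_skeleton(problem, is_base, base_solve, divide, combine):
    """Generic divide-and-conquer skeleton. A GPU implementation would
    launch the sub-problems in parallel; here they simply recurse."""
    if is_base(problem):
        return base_solve(problem)
    subs = divide(problem)
    return combine([dc_skeleton(s, is_base, base_solve, divide, combine)
                    for s in subs])

def merge(parts):
    """Combine step for mergesort: merge two sorted lists."""
    left, right = parts
    out, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        if left[i] <= right[j]:
            out.append(left[i]); i += 1
        else:
            out.append(right[j]); j += 1
    return out + left[i:] + right[j:]

def mergesort(xs):
    """Mergesort expressed as an instance of the DC skeleton."""
    return dc_skeleton(
        xs,
        is_base=lambda p: len(p) <= 1,
        base_solve=lambda p: list(p),
        divide=lambda p: (p[:len(p) // 2], p[len(p) // 2:]),
        combine=merge,
    )
```

Mergesort here is one of the two benchmarks the abstract mentions; Sobel edge detection would plug different `divide`/`combine` functions into the same skeleton.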
The computation of geodesic distances is an important research topic in Geometry Processing and 3D Shape Analysis as it is a basic component of many methods used in these areas. In this work, we present a minimalistic...
This paper discusses the implementation and evaluation of the reduction of a dense matrix to bidiagonal form on the Trident processor. The standard Golub and Kahan Householder bidiagonalization algorithm, which is rich in matrix-vector operations, and the LAPACK subroutine GEBRD, which is rich in a mixture of vector, matrix-vector, and matrix operations, are simulated on the Trident processor. We show how to use the Trident parallel execution units, ring, and communication registers to effectively perform the vector, matrix-vector, and matrix operations needed for bidiagonalizing a matrix. The number of clock cycles per FLOP is used as the metric to evaluate the performance of the Trident processor. Our results show that increasing the number of Trident lanes proportionally decreases the number of cycles needed per FLOP. On a 32K × 32K matrix and 128 Trident lanes, the speedup of using matrix-vector operations in the standard Golub and Kahan algorithm is around 1.5 times over using vector operations. Using matrix operations in the GEBRD subroutine, however, gives a speedup of around 3 times over vector operations, and 2 times over using matrix-vector operations in the standard Golub and Kahan algorithm.
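The core step of Golub-Kahan bidiagonalization is the Householder reflection, which zeroes all but the first entry of a column (or row). A minimal sketch of that one step, in plain Python rather than the paper's vector hardware (function names are ours): the reflector H = I − 2vvᵀ/(vᵀv) maps x to (±‖x‖, 0, …, 0), and applying it as a matrix-vector expression avoids ever forming H.

```python
from math import sqrt

def householder_vec(x):
    """Return (v, alpha) such that H = I - 2 v v^T / (v^T v) maps x to
    (alpha, 0, ..., 0). The sign of alpha is chosen opposite to x[0]
    to avoid cancellation when forming v."""
    alpha = -sqrt(sum(xi * xi for xi in x))
    if x[0] < 0:
        alpha = -alpha
    v = list(x)
    v[0] -= alpha
    return v, alpha

def apply_reflection(v, x):
    """Compute H x without forming H: H x = x - 2 (v^T x)/(v^T v) v."""
    vv = sum(vi * vi for vi in v)
    if vv == 0:
        return list(x)
    coef = 2 * sum(vi * xi for vi, xi in zip(v, x)) / vv
    return [xi - coef * vi for vi, xi in zip(v, x)]
```

Bidiagonalization alternates such reflections from the left (on columns) and the right (on rows); the matrix-vector-rich structure the abstract refers to comes from applying each reflector to the remaining submatrix.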
ISBN: (Print) 0780373715
The recent popularity of the Java programming language has brought automatic dynamic memory management (a.k.a. garbage collection) into the mainstream. Traditional garbage collectors suffer from long collection pauses (the stop-the-world mark-sweep algorithm) or an inability to collect cyclic garbage (the reference counting approach). Generational garbage collection, meanwhile, rests only on the weak generational hypothesis that most objects die young. In this paper, we report the performance evaluation of a new multithreaded concurrent generational garbage collector (MCGC) based on mark-sweep with the assistance of reference counting. The MCGC can take advantage of multiple CPUs in an SMP system and of the merits of lightweight processes. Furthermore, it reduces the long garbage collection pause and enhances garbage collection efficiency. Measurement results indicate that the MCGC improves the garbage collection pause time by up to 96.75% over the traditional stop-the-world mark-sweep garbage collector. Moreover, the MCGC incurs minimal time and space penalties, as shown by measurements of the total execution time, the memory footprint, and the sticky reference count rate.
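The trade-off the abstract names — mark-sweep collects cycles but pauses the world, reference counting is incremental but leaks cycles — is visible in even a toy collector. This sketch (our own illustration, not the MCGC) shows why mark-sweep is needed as a backstop: a cycle unreachable from the roots is reclaimed even though every object in it still holds a reference:

```python
class Obj:
    """A heap object in a toy object graph."""
    def __init__(self, name):
        self.name = name
        self.refs = []       # outgoing references
        self.marked = False

def mark(obj):
    """Mark phase: transitively mark everything reachable."""
    if not obj.marked:
        obj.marked = True
        for r in obj.refs:
            mark(r)

def mark_sweep(heap, roots):
    """Minimal stop-the-world mark-sweep. Unlike pure reference
    counting, it reclaims cyclic garbage; the MCGC of the paper runs a
    concurrent multithreaded variant with reference-count assistance."""
    for obj in heap:
        obj.marked = False
    for r in roots:
        mark(r)
    return [o for o in heap if o.marked]   # survivors; the rest is swept
```

With a root pointing at `c` and an `a ↔ b` cycle off to the side, reference counts for `a` and `b` never drop to zero, yet mark-sweep correctly discards them.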
The increasing system complexity of SOC applications leads to an increasing requirement for powerful embedded DSP processors. To increase the computational power of DSP processors, the number of pipeline stages has been increased for higher frequencies, along with the number of instructions executed in parallel, to increase computational bandwidth. To program the parallel units, the VLIW (very long instruction word) was introduced. Programming all parallel units at the same time, however, leads either to an expanded program memory port or to the limitation that only a few units can be used in parallel. To overcome this limitation, this paper proposes a scalable long instruction word (xLIW). The xLIW concept allows full usage of the available units in parallel with optimal code density. An included instruction buffer reduces power dissipation at the program memory ports during loop handling. The xLIW concept is part of a development project for a configurable DSP.
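The density gain of a scalable instruction word can be illustrated with a toy encoding (our own sketch, not the actual xLIW format): instead of reserving one fixed field per functional unit as classic VLIW does, a small header bitmask records which units are active and only those fields are emitted.

```python
def pack_xliw(slots, n_units=8):
    """Toy xLIW-style packing: `slots` has one entry per functional
    unit (None = idle). Classic VLIW would emit all len(slots) fields;
    here a bitmask header plus only the active fields are emitted,
    which is where the code-density win comes from."""
    mask = 0
    body = []
    for i, op in enumerate(slots):
        if op is not None:
            mask |= 1 << i
            body.append(op)
    return mask, body

def unpack_xliw(mask, body, n_units=8):
    """Inverse: expand the compact word back to one slot per unit."""
    slots, it = [], iter(body)
    for i in range(n_units):
        slots.append(next(it) if mask & (1 << i) else None)
    return slots
```

A word using two of eight units costs a header plus two fields rather than eight fields, while a fully parallel word can still address every unit.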
In order to improve the performance of applications on OpenMP/JIAJIA, we present a new abstraction, the Array Relation Vector (ARV), to describe the relation between the data elements of two consistent shared arrays accessed in one computation phase. Based on ARV, we use array grouping to eliminate the pseudo distribution of small shared data and improve page locality. Experimental results show that ARV-based array grouping can greatly improve the performance of applications with non-contiguous data access and strict access affinity on an OpenMP/JIAJIA cluster. For applications with small shared arrays, array grouping improves performance noticeably when the number of processors is small.
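The grouping transformation itself is simple to picture. As a hedged sketch (function names are ours; the real system works on shared pages in a software DSM, not Python lists): two arrays whose elements `a[i]` and `b[i]` are always touched together in a phase are interleaved into one allocation, so each access pair lands on the same page instead of two distant ones.

```python
def group_arrays(a, b):
    """Array grouping in the spirit of ARV: interleave two arrays that
    are accessed element-wise together, so a[i] and b[i] become
    neighbors in memory (one page fault instead of two per pair)."""
    assert len(a) == len(b)
    return [(x, y) for x, y in zip(a, b)]

def ungroup(grouped):
    """Recover the two logical arrays from the grouped layout."""
    a = [x for x, _ in grouped]
    b = [y for _, y in grouped]
    return a, b
```

This is the classic structure-of-arrays to array-of-structures trade, applied here to reduce pseudo distribution of small shared arrays across pages.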