ISBN (digital): 9781665451550
ISBN (print): 9781665451550
Version control systems (VCS) are tools used to track and manage the changes made to a set of files over time. Among the VCS tools available today, Git has become the most popular for software development. Because it is used in everything from small personal projects of a few megabytes to massive corporate repositories with more than 300 GB and 3.5 million files, speed and scalability are among the tool's top priorities. However, its performance sometimes falls short of what is desired on networked file systems (e.g. NFS), where input and output (I/O) operations tend to be more costly. In particular, that is the case for the checkout command, which is responsible for restoring files from specific versions of a project. Despite the optimizations implemented over the years, the sequential processing of files still carried a large time penalty on NFS, and was suboptimal even for local file systems on SSDs. In this project, we worked to parallelize the Git checkout machinery, resulting in speedups of up to 4.5x on NFS and 3.6x on SSDs. We also studied how parallelism affects the I/O requests performed by checkout on different storage systems. The optimization was submitted upstream and made available to all Git users starting with version 2.32.0, released in June 2021.
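The actual parallel checkout lives in Git's C code base; the core idea of the abstract, fanning file-restoration work out across workers so I/O requests overlap, can nevertheless be sketched. The following Python snippet is purely illustrative (the function names and the thread-pool strategy are assumptions, not Git's implementation):

```python
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor

def write_entry(root, relpath, contents):
    """Restore one file: create parent directories, then write the contents."""
    path = os.path.join(root, relpath)
    os.makedirs(os.path.dirname(path), exist_ok=True)
    with open(path, "w") as f:
        f.write(contents)
    return relpath

def parallel_checkout(root, entries, workers=4):
    """Write all (path, contents) entries concurrently, checkout-style."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda e: write_entry(root, *e), entries))

root = tempfile.mkdtemp()
entries = [(f"src/file{i}.txt", f"blob {i}\n") for i in range(16)]
done = parallel_checkout(root, entries)
sample = open(os.path.join(root, "src/file3.txt")).read()
```

Overlapping many small writes is precisely what pays off on NFS, where each operation carries network latency.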
ISBN (digital): 9781665466189
ISBN (print): 9781665466189
In this paper, we present the use of a parallel-in-space simulation approach to accelerate the dynamic simulation of power systems with high penetration of distributed generation, i.e., a high number of power electronic devices. The approach is implemented using the OpenCL framework and executed on a graphics processing unit (GPU). We benchmark our prototype implementation using a distribution network with an increasing number of distributed generators, each modeled as a voltage source inverter. Results show that the computation time for the distributed-generator solution stays nearly constant as the number of distributed generators in the network increases.
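The reason the per-step cost can stay nearly constant is that, under parallel-in-space decomposition, each device model depends only on its own state and its local node voltage, so all devices advance in one batched operation. The sketch below uses a trivial first-order stand-in model (not the paper's inverter model, and numpy vectorization rather than OpenCL) just to show that shape of computation:

```python
import numpy as np

def step_devices(v_nodes, device_state, dt):
    """Advance all distributed generators one time step in a single batch.

    Parallel-in-space: no device depends on another device's state, so the
    update is one vectorized (GPU-friendly) operation, not a per-device loop.
    """
    tau = 0.05  # assumed time constant of the stand-in first-order model
    return device_state + (v_nodes - device_state) * (dt / tau)

n_devices = 10_000
v_nodes = np.ones(n_devices)   # local node voltages seen by each device
state = np.zeros(n_devices)
for _ in range(100):           # time-stepping loop; work per step is batched
    state = step_devices(v_nodes, state, dt=1e-3)
```

On a GPU, the batch maps to one kernel launch per step, which is why device count barely affects wall-clock time until the hardware saturates.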
ISBN (print): 9781665485005
Contemporary microprocessors are among the more complex digital systems designed by humans. Today's microprocessor chips contain several different building blocks, including processor cores, memory, and interface logic. Thanks to Gordon Moore's prediction [4] that transistor density would double every 18 months, and to Dennard scaling [8], which observed that transistor power density remains constant as the physical dimensions of MOSFETs shrink, microprocessor clock speed increased steadily until 2003. However, CPU clock frequency cannot be raised beyond certain limits, due to the significant increase in power dissipation and, consequently, in heat. With this fact in mind, microprocessor manufacturers have designed new types of microprocessors, referred to as multi- and many-core processors. Multicore technology has become ubiquitous today, present in most personal and embedded computers. The hardware structure of computers based on the multi-/many-core concept requires parallel programming. This approach brings two evident benefits: the first is scalable computer performance, and the second is the ability to process data at large scale. However, creating efficient parallel programs requires of programmers a solid understanding of new computational principles, algorithms, and programming tools. This article may help a reader aiming to understand both the architecture of modern microprocessor systems and their programming, from single-core to multicore.
ISBN (digital): 9789811982347
ISBN (print): 9789811982330; 9789811982347
The implementation of parallel applications is always a challenge. It embraces many distinctive design decisions that must be taken. The paper presents issues of parallel processing with the use of .NET applications and popular database management systems. In the paper, three design dilemmas are addressed: how efficient is the auto-parallelism implemented in the .NET TPL library, how do popular DBMSes differ in serving parallel requests, and what is the optimal size of data chunks in the data-parallelism scheme. All of them are analyzed in the context of a typical, practical business case originating from IT solutions dedicated to energy-market participants. The paper presents the results of experiments conducted in a controlled, on-premises environment. The experiments allowed us to compare the performance of TPL auto-parallelism with a wide range of manually set numbers of worker threads. They also helped to evaluate four DBMSes (Oracle, MySQL, PostgreSQL, and MSSQL) in the scenario of serving parallel queries. Finally, they showed the impact of data chunk sizes on the overall performance.
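The chunk-size dilemma the paper studies is easy to state concretely: the result is independent of how the data is partitioned, but throughput is not, because each chunk carries fixed scheduling and round-trip overhead. A minimal Python sketch of the data-parallelism scheme (the workload here is a stand-in for a batched database request, not the paper's actual .NET/TPL code):

```python
from concurrent.futures import ThreadPoolExecutor

def process_chunk(chunk):
    # Stand-in for one batched request (e.g. a DB upsert of `chunk` rows).
    return sum(x * x for x in chunk)

def chunked(data, size):
    """Split data into chunks of at most `size` items: the unit of parallelism."""
    return [data[i:i + size] for i in range(0, len(data), size)]

def parallel_sum_squares(data, chunk_size, workers=4):
    chunks = chunked(data, chunk_size)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return sum(pool.map(process_chunk, chunks))

data = list(range(1000))
# Any chunk size gives the same answer; only per-chunk overhead differs.
small = parallel_sum_squares(data, chunk_size=10)    # 100 chunks
large = parallel_sum_squares(data, chunk_size=250)   # 4 chunks
```

Too-small chunks amplify per-request overhead; too-large chunks starve some workers. The paper's experiments locate the sweet spot for its DBMS workload empirically.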
ISBN (digital): 9781665453462
ISBN (print): 9781665453462
Real-time gang task scheduling has received much recent attention due to the emerging trend of applying highly parallel accelerators (e.g., GPUs) and parallel programming models (e.g., OpenMP) in many real-time computing domains. However, existing work on gang task scheduling mainly focuses on the preemptive case, which is somewhat at odds with the non-preemptive manner in which gang scheduling techniques are applied in practice. In this paper, we present a set of non-trivial techniques for analyzing the schedulability of a hard real-time sporadic gang task system under non-preemptive GEDF on multiprocessors. A first-of-its-kind utilization-based schedulability test is derived and shown to be rather effective via experiments. Interestingly, for the special case where each gang task degenerates into an ordinary sporadic task, experiments show that our test improves schedulability by 75% on average over a state-of-the-art utilization-based test designed for non-preemptive scheduling of ordinary sporadic tasks on multiprocessors.
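The abstract does not reproduce the paper's test itself, but the setting can be made concrete. A gang task occupies m_i processors simultaneously for up to C time units every period T, so its processor demand is m_i * C / T, and the total demand must fit on the m processors. The check below is only the standard *necessary* feasibility condition, not the paper's utilization-based test:

```python
def gang_utilization(tasks):
    """Total processor demand of a gang task set.

    Each task is (C, T, m_i): worst-case execution time C, period T,
    and gang size m_i (processors occupied simultaneously while running).
    """
    return sum(m_i * c / t for (c, t, m_i) in tasks)

def necessary_feasibility(tasks, m):
    """Necessary (not sufficient) condition: total demand fits on m processors."""
    return gang_utilization(tasks) <= m

tasks = [(2.0, 10.0, 4), (3.0, 15.0, 2), (5.0, 20.0, 1)]
ok_on_8 = necessary_feasibility(tasks, m=8)  # demand 0.8 + 0.4 + 0.25 = 1.45
ok_on_1 = necessary_feasibility(tasks, m=1)  # 1.45 > 1: infeasible
```

A sufficient test like the paper's must additionally bound the blocking that non-preemptivity introduces, which is where the non-trivial analysis lies.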
ISBN (print): 9781665454445
The rise of machine learning (ML) applications and their use of mixed precision to perform interesting science are driving forces behind AI for science on HPC. The convergence of ML and HPC with mixed precision offers the possibility of transformational changes in computational science. The HPL-AI benchmark is designed to measure the performance of mixed-precision arithmetic, as opposed to the HPL benchmark, which measures double-precision performance. Pushing the limits of systems at extreme scale is nontrivial; little public literature explores optimization of mixed-precision computations at this scale. In this work, we demonstrate how to scale up the HPL-AI benchmark on the pre-exascale Summit and exascale Frontier systems at the Oak Ridge Leadership Computing Facility (OLCF) with a cross-platform design. We present the implementation, performance results, and a guideline of optimization strategies employed to deliver portable performance on both AMD and NVIDIA GPUs at extreme scale.
ISBN (print): 9781450394451
Despite the various research initiatives and proposed programming models, efficient solutions for parallel programming in HPC clusters still rely on a complex combination of different programming models (e.g., OpenMP and MPI), languages (e.g., C++ and CUDA), and specialized runtimes (e.g., Charm++ and Legion). On the other hand, task parallelism has been shown to be an efficient and seamless programming model for clusters. This paper introduces OpenMP Cluster (OMPC), a task-parallel model that extends OpenMP for cluster programming. OMPC leverages OpenMP's offloading standard to distribute annotated regions of code across the nodes of a distributed system. To achieve this, it hides MPI-based data distribution and load-balancing mechanisms behind OpenMP task dependencies. Given its compliance with OpenMP, OMPC allows applications to use the same programming model to exploit intra- and inter-node parallelism, thus simplifying development and maintenance. We evaluated OMPC using Task Bench, a synthetic benchmark focused on task parallelism, comparing its performance against other distributed runtimes. Experimental results show that OMPC can deliver up to 1.53x and 2.43x better performance than Charm++ on CCR and scalability experiments, respectively. Experiments also show that OMPC performance weakly scales for both Task Bench and a real-world seismic imaging application.
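The load-bearing idea in OMPC is that declared task dependencies, not explicit messages, drive when and where work runs. OMPC itself is C/C++ OpenMP with `task depend` clauses; as a language-neutral analogue, the Python sketch below builds a small dependency-driven task graph in which each task starts only after the tasks it depends on have completed (all names here are illustrative):

```python
from concurrent.futures import ThreadPoolExecutor

def run_graph():
    """Execute a 4-node task DAG; dependencies gate each task's start."""
    results = {}
    with ThreadPoolExecutor(max_workers=4) as pool:
        def task(name, deps, fn):
            def body():
                # Block until every dependency has produced its value,
                # analogous to OpenMP's depend(in: ...) clauses.
                inputs = [results[d].result() for d in deps]
                return fn(*inputs)
            results[name] = pool.submit(body)

        task("a", [], lambda: 2)
        task("b", [], lambda: 3)
        task("c", ["a", "b"], lambda x, y: x * y)  # joins a and b
        task("d", ["c"], lambda z: z + 1)
        return results["d"].result()

answer = run_graph()
```

In OMPC the same dependency information additionally tells the runtime which data must move between nodes, which is how the MPI traffic stays hidden from the programmer.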
ISBN (digital): 9781665497862
ISBN (print): 9781665497862
We propose an efficient, distributed, out-of-memory implementation of the truncated singular value decomposition (t-SVD) for heterogeneous high performance computing (HPC) systems. Various implementations of SVD have been proposed, most of which estimate only the singular values, since estimating the singular vectors can significantly increase the time and memory complexity of the algorithm. In this work, we propose an implementation of SVD based on the power method, which estimates truncated singular values and singular vectors. Memory-utilization bottlenecks in the power method used to decompose a matrix A are typically associated with the computation of the Gram matrix A^T A, which can be significant when A is large and dense, or when A is extremely large and sparse. The proposed implementation is optimized for out-of-memory problems, where the memory required to factorize a given matrix is greater than the available GPU memory. We reduce the memory complexity of A^T A by using a batching strategy in which the intermediate factors are computed block by block, and we hide the I/O latency associated with both host-to-device (H2D) and device-to-host (D2H) batch copies by overlapping each batch copy with compute using CUDA streams. Furthermore, we use optimized NCCL-based communicators to reduce the latency associated with collective communications (both intra-node and inter-node). In addition, sparse and dense matrix multiplications are significantly accelerated with GPU cores (or tensor cores, when available), resulting in an implementation with good scaling. We demonstrate the scalability of our distributed out-of-memory SVD algorithm by successfully decomposing a dense matrix of size 1 TB and a sparse matrix of 1e-6 sparsity whose dense representation would occupy 128 PB.
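The batching strategy rests on a simple identity: if A is split into row blocks A_1, ..., A_k, then A^T A = Σ_i A_i^T A_i, so only one block need be resident at a time. A minimal numpy sketch of that accumulation (the CUDA-stream overlap and NCCL collectives of the actual implementation are only indicated in comments):

```python
import numpy as np

def gram_batched(A, batch_rows):
    """Compute A^T A block by block: A^T A = sum_i A_i^T A_i over row blocks.

    Only one batch of rows is needed in (GPU) memory at a time, which is
    what makes the out-of-memory variant possible for matrices larger
    than device memory.
    """
    n = A.shape[1]
    G = np.zeros((n, n), dtype=A.dtype)
    for start in range(0, A.shape[0], batch_rows):
        Ai = A[start:start + batch_rows]  # real code: one H2D batch copy
        G += Ai.T @ Ai                    # accumulate the partial Gram matrix
    return G

rng = np.random.default_rng(1)
A = rng.standard_normal((1000, 32))
G = gram_batched(A, batch_rows=128)
```

Because each partial product is independent, the H2D copy of batch i+1 can overlap the multiply of batch i, which is exactly what the CUDA-stream pipelining in the paper exploits.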
ISBN (print): 9781665451857
Fortran DO CONCURRENT has emerged as a new way to achieve parallel execution of loops on CPUs and GPUs. This paper studies the performance portability of this construct on a range of processors and compares it with the incumbent models: OpenMP, OpenACC and CUDA. To do this study fairly, we implemented the BabelStream memory bandwidth benchmark from scratch, entirely in modern Fortran, for all of the models considered: Fortran DO CONCURRENT, two variants of OpenACC, four variants of OpenMP (two CPU and two GPU), CUDA Fortran, and both loop- and array-based references. BabelStream Fortran matches the C++ implementation as closely as possible and can be used to make language-based comparisons. This paper represents one of the first detailed studies of the performance of Fortran support on heterogeneous architectures; we include results for AArch64 and x86_64 CPUs as well as AMD, Intel and NVIDIA GPU platforms.
Usage of multiprocessor and multicore computers implies parallel programming. Tools for preparing parallel programs include parallel languages and libraries, as well as parallelizing compilers and converters that can perform automatic parallelization. The basic approach to parallelism detection is analysis of data dependencies and of the properties of program components, including data use and predicates. In this article, a suite of used-data and predicate sets for program components is proposed, and an algorithm for computing these sets is suggested. The algorithm is based on wave propagation on graphs with cycles and labelling. This method allows analysing complex program components, improving data localization, and thus providing enhanced data-parallelism detection.
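The abstract does not spell out the wave-propagation algorithm, but its general shape is the classic fixed-point propagation of sets over a graph with cycles: each node's set is seeded with local information and repeatedly merged with its neighbors' sets until nothing changes. The Python sketch below is a hypothetical illustration of that scheme (propagating used-variable sets backward over a control-flow graph), not the paper's algorithm:

```python
def propagate_used_sets(nodes, edges, local_use):
    """Propagate used-variable sets over a CFG until a fixed point.

    nodes: node ids; edges: (src, dst) pairs; local_use: node -> set of
    variables used directly in that node. A node's set is its local uses
    plus everything used by its successors; the information 'waves'
    through the graph, cycles included, and terminates because sets
    only grow and are bounded.
    """
    succ = {n: [] for n in nodes}
    for s, d in edges:
        succ[s].append(d)
    used = {n: set(local_use.get(n, ())) for n in nodes}
    changed = True
    while changed:
        changed = False
        for n in nodes:
            if succ[n]:
                new = used[n].union(*(used[m] for m in succ[n]))
                if new != used[n]:
                    used[n] = new
                    changed = True
    return used

# A small CFG with a loop: 1 -> 2 -> 3 -> 2, loop exit 2 -> 4.
nodes = [1, 2, 3, 4]
edges = [(1, 2), (2, 3), (3, 2), (2, 4)]
used = propagate_used_sets(nodes, edges, {1: {"a"}, 2: {"i"}, 3: {"x"}, 4: {"y"}})
```

Knowing, for each component, the full set of data it touches is what lets a parallelizer prove two components independent and schedule them concurrently.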