ISBN (print): 9781665420389
Current societal needs demand global high-speed networks. To this end, 3GPP has included Non-Terrestrial Networks (NTN) in its Release 17. To meet the strict requirements of 6G networks, Very Low Earth Orbit (VLEO) and Low Earth Orbit (LEO) satellites will play a key role. Optical fibers can also transmit data at high speeds. Unfortunately, the refractive index of the fiber penalizes fiber links, while altitude penalizes satellite links. This paper therefore determines the transmission distances for which optical fiber or satellite links are preferable. For a fair comparison in terms of bandwidth, the LEO/VLEO satellites are assumed to be optical as well. An algorithm has then been developed to determine the most suitable regions for locating the ground station of an optical satellite; specifically, it computes the degree of cloudiness of a given geographic region. Since determining such regions demands a large computational burden, the algorithm has been parallelized using OpenMP libraries for Python. The Iberian Peninsula, whose images were taken by the METEOSAT satellite from EUMETSAT, has been considered as a paradigmatic case study.
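The fiber-versus-satellite trade-off described above can be sketched with a toy latency model. All values here are assumptions for illustration, not the paper's parameters: fiber refractive index n ≈ 1.47, satellite altitude h ≈ 500 km, and a single hop whose extra path is approximated as 2h.

```python
# Toy break-even model: optical fiber vs. a single-hop LEO optical link.
# Assumed values (not from the paper): fiber refractive index n ~ 1.47,
# satellite altitude h ~ 500 km, extra satellite path approximated as 2*h.
C = 299_792.458  # speed of light in vacuum, km/s

def one_way_latency_fiber(d_km, n=1.47):
    """One-way latency over d_km of fiber: light travels at c/n."""
    return d_km * n / C

def one_way_latency_leo(d_km, h_km=500.0):
    """One-way latency over one LEO hop: ~d_km along-track plus 2*h_km."""
    return (d_km + 2.0 * h_km) / C

def break_even_distance(n=1.47, h_km=500.0):
    """Distance where both latencies match: d*n = d + 2h => d = 2h/(n-1)."""
    return 2.0 * h_km / (n - 1.0)

# Below ~2128 km fiber wins on latency; above it the satellite link wins.
print(f"break-even distance: {break_even_distance():.0f} km")
```

Under these assumptions the break-even point falls near 2100 km, which illustrates why long intercontinental routes are the regime where optical satellite links become attractive.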
ISBN (print): 9781450383356
Word2Vec remains one of the most impactful innovations in the field of Natural Language Processing (NLP): it represents latent grammatical and syntactical information in human text with dense vectors in a low-dimensional space. Word2Vec has a high computational cost due to the algorithm's inherent sequentiality, intensive memory accesses, and the large vocabularies it represents. While prior studies have investigated techniques to exploit parallelism and improve memory system performance, they struggle to gain throughput on powerful GPUs. We identify memory access volume and latency as the primary bottleneck in prior GPU works, preventing highly optimized kernels from attaining the architecture's peak performance. We present a novel algorithm, FULL-W2V, which maximally exploits the opportunities for data reuse in the W2V algorithm and leverages GPU architecture and resources to reduce accesses to low memory levels and improve temporal locality. FULL-W2V reduces accesses to GPU global memory significantly, e.g., by more than 89%, compared to prior state-of-the-art GPU implementations, resulting in significant performance improvements that scale across successive hardware generations. Our prototype implementation achieves a 2.97X speedup when ported from Nvidia Pascal P100 to Volta V100 cards, and outperforms the state of the art by 5.72X on V100 cards with the same embedding quality. In-depth analysis indicates that reducing memory accesses through register and shared-memory caching, together with high-throughput shared-memory reductions, leads to significantly improved arithmetic intensity. FULL-W2V can potentially benefit many applications in NLP and other domains.
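The reuse opportunity FULL-W2V targets is visible in the core skip-gram negative-sampling (SGNS) update, sketched here in plain Python. This is illustrative only; the function name `sgns_step` and the update structure follow the original word2vec kernel, not the paper's CUDA code. The key observation: the center vector is re-read for every positive and negative sample, which is exactly the data a GPU kernel wants to keep in registers or shared memory.

```python
import math

def sgns_step(center, ctx, negs, lr=0.025):
    """One skip-gram negative-sampling update, following the structure of
    the original word2vec kernel (pure-Python sketch, not FULL-W2V's CUDA).
    `center` is re-read for every positive and negative sample: this is
    the reuse FULL-W2V keeps in registers/shared memory."""
    sigmoid = lambda z: 1.0 / (1.0 + math.exp(-z))
    neu1e = [0.0] * len(center)  # accumulated gradient for the center word
    for vec, label in [(ctx, 1.0)] + [(v, 0.0) for v in negs]:
        g = lr * (label - sigmoid(sum(a * b for a, b in zip(center, vec))))
        for i in range(len(center)):
            neu1e[i] += g * vec[i]    # read old vec before updating it
            vec[i] += g * center[i]   # center re-read here for each sample
    for i in range(len(center)):      # apply center update once at the end
        center[i] += neu1e[i]
```

After one step on a positive pair, the dot product of the center and context vectors increases, which is the training signal SGNS optimizes.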
ISBN (print): 9781665440660
The Euler tour technique is a classical tool for designing parallel graph algorithms, originally proposed for the PRAM model. We ask whether it can be adapted to run efficiently on GPUs. We focus on two established applications of the technique: (1) finding lowest common ancestors (LCA) of pairs of nodes in trees, and (2) finding bridges in undirected graphs. In our experiments, we compare theoretically optimal algorithms using the Euler tour technique against simpler heuristics expected to perform particularly well on typical instances. We show that the Euler tour-based algorithms not only fulfill their theoretical promises and outperform practical heuristics on hard instances, but also perform on par with them on easy instances.
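For the LCA application, the classical reduction behind the technique can be sketched as follows: the LCA of u and v is the shallowest node between their first occurrences in an Euler tour of the tree. This is a minimal sequential Python sketch with a linear-scan range-minimum query (RMQ); the paper's parallel algorithms would replace the scan with a precomputed RMQ structure.

```python
def euler_tour(tree, root):
    """Iterative DFS Euler tour of a rooted tree given as {node: children}.
    Returns the tour, the depth of each tour entry, and each node's first
    occurrence index in the tour."""
    tour, depth, first = [root], [0], {root: 0}
    stack = [(root, 0, iter(tree.get(root, [])))]
    while stack:
        node, d, children = stack[-1]
        child = next(children, None)
        if child is None:
            stack.pop()
            if stack:  # returning to the parent re-enters it in the tour
                tour.append(stack[-1][0])
                depth.append(stack[-1][1])
        else:
            first.setdefault(child, len(tour))
            tour.append(child)
            depth.append(d + 1)
            stack.append((child, d + 1, iter(tree.get(child, []))))
    return tour, depth, first

def lca(tree, root, u, v):
    """LCA(u, v) = shallowest tour entry between first[u] and first[v].
    Linear-scan RMQ for clarity; real implementations precompute RMQ."""
    tour, depth, first = euler_tour(tree, root)
    i, j = sorted((first[u], first[v]))
    k = min(range(i, j + 1), key=depth.__getitem__)
    return tour[k]

tree = {0: [1, 2], 1: [3, 4]}
print(lca(tree, 0, 3, 4))  # 1
print(lca(tree, 0, 4, 2))  # 0
```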
ISBN (print): 9781450390682
In this paper, we propose a parallel algorithm for computing all-pairs shortest paths (APSP) for sparse graphs on a distributed-memory system with p processors. To exploit graph sparsity, we first preprocess the graph using several known algorithmic techniques from linear algebra, such as fill-in reducing ordering and elimination tree parallelism. We then map the preprocessed graph onto the distributed-memory system for both load balancing and communication reduction. Finally, we design a new scheduling strategy to minimize the communication cost. The bandwidth cost (communication volume) and the latency cost (number of messages) of our algorithm are O(n² log² p / p + |S|² log² p) and O(log² p), respectively, where S is a minimal vertex separator that partitions the graph into two components of roughly equal size. Compared with the state-of-the-art result for dense graphs, where the bandwidth and latency costs are O(n²/√p) and O(√p log² p), respectively, our algorithm reduces the latency cost by a factor of O(√p), and reduces the bandwidth cost by a factor of O(√p / log² p) for sparse graphs with |S| = O(n/√p). We also present lower bounds on the bandwidth and latency costs of computing APSP on sparse graphs, which are Ω(n²/p + |S|²) and Ω(log² p), respectively. This implies that the bandwidth cost of our algorithm is nearly optimal and the latency cost is optimal.
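For context, the dense serial baseline that distributed APSP algorithms improve upon is Floyd-Warshall; a minimal sketch (ours, for orientation, not the paper's algorithm):

```python
def floyd_warshall(adj):
    """Textbook O(n^3) APSP on an adjacency matrix (float('inf') marks
    missing edges). Serial reference point only; the paper's contribution
    is distributing this computation while exploiting sparsity."""
    n = len(adj)
    dist = [row[:] for row in adj]
    for k in range(n):
        for i in range(n):
            for j in range(n):
                if dist[i][k] + dist[k][j] < dist[i][j]:
                    dist[i][j] = dist[i][k] + dist[k][j]
    return dist

# 0 --1-- 1 --2-- 2, plus a direct 0--2 edge of weight 4
adj = [[0, 1, 4], [1, 0, 2], [4, 2, 0]]
print(floyd_warshall(adj)[0][2])  # 3, via vertex 1
```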
ISBN (print): 9781665435772
Randomized algorithms often outperform their deterministic counterparts in terms of simplicity and efficiency. In this paper, we consider Randomized Incremental Constructions (RICs), which are very popular, in particular in combinatorial optimization and computational geometry. Our contribution is Collaborative Parallel RIC (CPRIC), a novel approach to parallelizing RIC for modern parallel architectures such as vector processors and GPUs. We show that our approach, based on a work-stealing mechanism, avoids the control-flow divergence of parallel threads, thus improving the performance of the parallel implementation. Our extensive experiments on CPU and GPU demonstrate the advantages of our CPRIC approach, which achieves an average speedup between 4x and 5x compared to a naively parallelized RIC.
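The expected-cost argument behind RIC is backward analysis over a random insertion order. A minimal warm-up sketch (ours, not CPRIC): inserting n items in random order changes the current optimum only O(log n) times in expectation, and the same argument bounds the expected structural change per insertion in geometric RICs.

```python
import random

def running_max_changes(values, seed=0):
    """Insert items in random order and count how often the running optimum
    changes. By backward analysis the expected count is H_n = O(log n),
    the same argument that bounds expected work per insertion in
    randomized incremental constructions. Illustrative only."""
    order = list(values)
    random.Random(seed).shuffle(order)
    best, changes = float('-inf'), 0
    for v in order:
        if v > best:
            best, changes = v, changes + 1
    return changes

# For n = 1000 the count is typically around ln(1000) ~ 7, not ~1000.
print(running_max_changes(range(1000)))
```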
ISBN (print): 9783030715939; 9783030715922
We show that a recently developed divide-and-conquer parallel algorithm for solving tridiagonal Toeplitz systems of linear equations can be easily and efficiently implemented for a variety of modern multicore and GPU architectures, as well as hybrid systems. Our new portable implementation, which uses OpenACC, can be executed on both CPU-based and GPU-accelerated systems. More sophisticated variants of the implementation are suitable for systems with multiple GPUs and can use CPU and GPU cores together. We consider the use of both column-wise and row-wise storage formats for two-dimensional double-precision arrays and show how to efficiently convert between these two formats using cache memory. Numerical experiments performed on Intel CPUs and Nvidia GPUs show that our new implementation achieves relatively good performance.
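The serial starting point for such solvers is the Thomas algorithm specialized to constant diagonals; a minimal sketch (an assumed baseline for orientation, not the paper's divide-and-conquer method):

```python
def solve_tridiag_toeplitz(a, b, c, d):
    """Thomas algorithm for T x = d, where T is tridiagonal Toeplitz with
    constant subdiagonal a, diagonal b, and superdiagonal c. O(n) but
    inherently sequential: the serial baseline that divide-and-conquer
    methods split into independent subsystems."""
    n = len(d)
    cp, dp = [0.0] * n, [0.0] * n
    cp[0], dp[0] = c / b, d[0] / b
    for i in range(1, n):              # forward elimination
        m = b - a * cp[i - 1]
        cp[i] = c / m
        dp[i] = (d[i] - a * dp[i - 1]) / m
    x = [0.0] * n
    x[-1] = dp[-1]
    for i in range(n - 2, -1, -1):     # back substitution
        x[i] = dp[i] - cp[i] * x[i + 1]
    return x

# 1-D Poisson-style system with diagonals (-1, 2, -1) and rhs [1, 0, 1];
# the exact solution is x = (1, 1, 1).
x = solve_tridiag_toeplitz(-1.0, 2.0, -1.0, [1.0, 0.0, 1.0])
```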
ISBN (print): 9781665440660
Fourier transforms whose sizes are powers of two or have only small prime factors have been extensively studied, and optimized implementations are typically memory-bound. However, handling arbitrary transform sizes, which may be prime or have large prime factors, is difficult. Direct discrete Fourier transform (DFT) implementations involve extra computation, while fast Fourier transform (FFT)-style factorized decompositions introduce additional overheads in register use, multiprocessor occupancy, and memory traffic. Tensor Cores are hardware units included in modern GPUs that perform matrix multiply-adds at a much higher throughput than normal GPU floating-point instructions. Because of their higher throughput and better uniformity across sizes, DFT/FFT implementations using Tensor Cores can surpass the performance of existing DFT/FFT implementations for difficult sizes. We present key insights of this approach, including complex-number representation, efficient mapping of odd sizes onto Tensor Cores (whose dimensions are all powers of 2), and adding a size-2 or size-4 epilogue transform at very low cost. Furthermore, we describe a method for emulating FP32 precision while using lower-precision Tensor Cores to accelerate the computation. For large batch sizes, our fastest Tensor Core implementation per size is at least 10% faster than the state-of-the-art cuFFT library in 49% of supported sizes for FP64 (double) precision and 42% of supported sizes for FP32 precision. The numerical accuracy of the results matches that of cuFFT for FP64 and is degraded by only about 0.3 bits on average for emulated FP32. To our knowledge, this is the first application of Tensor Cores to FFT computation that meets the accuracy and exceeds the speed of the state of the art.
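The core idea, that a DFT of any size is simply a matrix-vector product and therefore maps onto matrix-multiply hardware, can be sketched in a few lines (pure Python for clarity; a Tensor Core version would batch these products in low precision with an FP32-emulation scheme):

```python
import cmath

def dft_matmul(x):
    """DFT as a plain matrix-vector product, y[j] = sum_k w^(j*k) * x[k]
    with w = exp(-2*pi*i/n). O(n^2) work, but it is a matrix multiply for
    *any* n, including primes -- the shape that maps onto
    matrix-multiply hardware such as Tensor Cores."""
    n = len(x)
    w = cmath.exp(-2j * cmath.pi / n)
    return [sum(w ** (j * k) * x[k] for k in range(n)) for j in range(n)]

# Size 7 (prime, hard for factorized FFTs): the DFT of a unit impulse
# is the all-ones vector.
y = dft_matmul([1, 0, 0, 0, 0, 0, 0])
```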
The multi-scale character of skeletal muscle models requires simulations with high spatial resolution to capture all relevant effects. This naturally involves high computational load that can only be tackled by parallel computations. We simulate electrophysiology and muscle contraction using a state-of-the-art, biophysical chemo-electro-mechanical model that requires meshes of the 3D domain with embedded, aligned 1D meshes for muscle fibers. We present novel algorithms to construct highly-resolved meshes with robust properties for real muscle geometries from surface triangulations. We demonstrate their use and suitability in a simulation of the biceps brachii muscle and tendons. In addition, the respective simulations showcase several functional enhancements of our simulation framework OpenDiHu.
For distributions over discrete product spaces ∏_{i=1}^n Ω_i, Glauber dynamics is a Markov chain that at each step resamples a random coordinate conditioned on the other coordinates. We show that k-Glauber dynamics, whi...
ISBN (print): 9783030816919; 9783030816902
Algorithms with space-time tiling increase the performance of numerical simulations by increasing data reuse and arithmetic intensity; they also improve parallel scaling by making process synchronization less frequent. The theory of Locally Recursive non-Locally Asynchronous (LRnLA) algorithms provides a performance model that accounts for data localization at all levels of the memory hierarchy. However, an effective implementation is difficult, since modern optimizing compilers do not support the required traversal methods and data structures by default. Data exchange is typically implemented by writing the updated values to the main data array. Here, we suggest a new data structure that contains the partially updated state of the simulation domain. Data is arranged within this structure for coalesced access and seamless exchange between subtasks. We demonstrate preliminary results of its superiority over previously used methods by localizing the processed data in the L2 GPU cache for a Lattice Boltzmann Method (LBM) simulation, so that performance is not limited by GDDR throughput but is determined by the L2 cache access rate. If we estimate the ideal stepwise code performance to be memory-bound with a read/write ratio equal to 1, assume it is localized in GPU memory, and assume it performs at 100% of the theoretical memory bandwidth, then the results of our benchmarks exceed that peak by a factor of about 1.2.
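The memory-bound peak used as the reference point above can be estimated with a simple roofline-style model. The parameters here are assumptions for illustration (a D3Q19 lattice with q = 19 populations in double precision), not the paper's exact figures:

```python
def ideal_lbm_mlups(bandwidth_gb_s, q=19, bytes_per_value=8):
    """Ideal memory-bound LBM rate in million lattice updates per second:
    a stepwise code reads and writes each of q populations once per update
    (read/write ratio 1), so bytes/update = 2 * q * bytes_per_value.
    Assumed D3Q19 double-precision model, not the paper's exact figures."""
    return bandwidth_gb_s * 1e9 / (2 * q * bytes_per_value) / 1e6

# e.g. a GPU with 900 GB/s of DRAM bandwidth:
print(f"{ideal_lbm_mlups(900.0):.0f} MLUPS at 100% of bandwidth")
```

Exceeding such a peak by a factor of about 1.2, as reported above, is only possible because the working set is served from the L2 cache rather than DRAM.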