ISBN (print): 9781665420389
Current societal needs demand global high-speed networks. To this end, 3GPP has included Non-Terrestrial Networks (NTN) in its Release 17. To meet the strict requirements of 6G networks, Very Low Earth Orbit (VLEO) and Low Earth Orbit (LEO) satellites will play a key role. Optical fibers can also transmit data at high speeds. Unfortunately, the refractive index of the fiber penalizes fiber links, while altitude penalizes satellite links. This paper therefore determines the transmission distances for which optical fiber or satellite links are preferable. For a fair comparison in terms of bandwidth, the LEO/VLEO satellites are assumed to be optical as well. An algorithm has then been developed to determine the most suitable regions for locating the ground station of an optical satellite; specifically, it computes the degree of cloudiness of a given geographic region. Since determining such regions demands a large computational burden, the algorithm has been parallelized using OpenMP libraries for Python. The Iberian Peninsula, whose images were taken by the METEOSAT satellite from EUMETSAT, has been considered as a paradigmatic case study.
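The fiber-versus-satellite trade-off described above can be sketched with a toy latency model. All values here are assumptions for illustration, not the paper's parameters: fiber refractive index n ≈ 1.47, satellite altitude h ≈ 500 km, and a single hop whose extra path is approximated as 2h.

```python
# Toy break-even model: optical fiber vs. a single-hop LEO optical link.
# Assumed values (not from the paper): fiber refractive index n ~ 1.47,
# satellite altitude h ~ 500 km, extra satellite path approximated as 2*h.
C = 299_792.458  # speed of light in vacuum, km/s

def one_way_latency_fiber(d_km, n=1.47):
    """One-way latency over d_km of fiber: light travels at c/n."""
    return d_km * n / C

def one_way_latency_leo(d_km, h_km=500.0):
    """One-way latency over one LEO hop: ~d_km along-track plus 2*h_km."""
    return (d_km + 2.0 * h_km) / C

def break_even_distance(n=1.47, h_km=500.0):
    """Distance where both latencies match: d*n = d + 2h => d = 2h/(n-1)."""
    return 2.0 * h_km / (n - 1.0)

# Below ~2128 km fiber wins on latency; above it the satellite link wins.
print(f"break-even distance: {break_even_distance():.0f} km")
```

Under these assumptions the break-even point falls near 2100 km, which illustrates why long intercontinental routes are the regime where optical satellite links become attractive.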
ISBN (print): 9781450383356
Word2Vec remains one of the most impactful innovations in the field of Natural Language Processing (NLP): it represents latent grammatical and syntactical information in human text with dense vectors in a low-dimensional space. Word2Vec has a high computational cost due to the algorithm's inherent sequentiality, intensive memory accesses, and the large vocabularies it represents. While prior studies have investigated techniques to exploit parallelism and improve memory system performance, they struggle to gain throughput on powerful GPUs. We identify memory access volume and latency as the primary bottleneck in prior GPU works, preventing highly optimized kernels from attaining the architecture's peak performance. We present a novel algorithm, FULL-W2V, which maximally exploits the opportunities for data reuse in the W2V algorithm and leverages GPU architecture and resources to reduce accesses to low memory levels and improve temporal locality. FULL-W2V reduces accesses to GPU global memory significantly, e.g., by more than 89%, compared to prior state-of-the-art GPU implementations, resulting in significant performance improvements that scale across successive hardware generations. Our prototype implementation achieves a 2.97X speedup when ported from Nvidia Pascal P100 to Volta V100 cards, and outperforms the state of the art by 5.72X on V100 cards with the same embedding quality. In-depth analysis indicates that reducing memory accesses through register and shared-memory caching, together with high-throughput shared-memory reductions, leads to significantly improved arithmetic intensity. FULL-W2V can potentially benefit many applications in NLP and other domains.
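The reuse opportunity FULL-W2V targets is visible in the core skip-gram negative-sampling (SGNS) update, sketched here in plain Python. This is illustrative only; the function name `sgns_step` and the update structure follow the original word2vec kernel, not the paper's CUDA code. The key observation: the center vector is re-read for every positive and negative sample, which is exactly the data a GPU kernel wants to keep in registers or shared memory.

```python
import math

def sgns_step(center, ctx, negs, lr=0.025):
    """One skip-gram negative-sampling update, following the structure of
    the original word2vec kernel (pure-Python sketch, not FULL-W2V's CUDA).
    `center` is re-read for every positive and negative sample: this is
    the reuse FULL-W2V keeps in registers/shared memory."""
    sigmoid = lambda z: 1.0 / (1.0 + math.exp(-z))
    neu1e = [0.0] * len(center)  # accumulated gradient for the center word
    for vec, label in [(ctx, 1.0)] + [(v, 0.0) for v in negs]:
        g = lr * (label - sigmoid(sum(a * b for a, b in zip(center, vec))))
        for i in range(len(center)):
            neu1e[i] += g * vec[i]    # read old vec before updating it
            vec[i] += g * center[i]   # center re-read here for each sample
    for i in range(len(center)):      # apply center update once at the end
        center[i] += neu1e[i]
```

After one step on a positive pair, the dot product of the center and context vectors increases, which is the training signal SGNS optimizes.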
ISBN (print): 9781665440660
The Euler tour technique is a classical tool for designing parallel graph algorithms, originally proposed for the PRAM model. We ask whether it can be adapted to run efficiently on GPUs. We focus on two established applications of the technique: (1) finding lowest common ancestors (LCA) of pairs of nodes in trees, and (2) finding bridges in undirected graphs. In our experiments, we compare theoretically optimal algorithms using the Euler tour technique against simpler heuristics expected to perform particularly well on typical instances. We show that the Euler tour-based algorithms not only fulfill their theoretical promises and outperform practical heuristics on hard instances, but also perform on par with them on easy instances.
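For the LCA application, the classical reduction behind the technique can be sketched as follows: the LCA of u and v is the shallowest node between their first occurrences in an Euler tour of the tree. This is a minimal sequential Python sketch with a linear-scan range-minimum query (RMQ); the paper's parallel algorithms would replace the scan with a precomputed RMQ structure.

```python
def euler_tour(tree, root):
    """Iterative DFS Euler tour of a rooted tree given as {node: children}.
    Returns the tour, the depth of each tour entry, and each node's first
    occurrence index in the tour."""
    tour, depth, first = [root], [0], {root: 0}
    stack = [(root, 0, iter(tree.get(root, [])))]
    while stack:
        node, d, children = stack[-1]
        child = next(children, None)
        if child is None:
            stack.pop()
            if stack:  # returning to the parent re-enters it in the tour
                tour.append(stack[-1][0])
                depth.append(stack[-1][1])
        else:
            first.setdefault(child, len(tour))
            tour.append(child)
            depth.append(d + 1)
            stack.append((child, d + 1, iter(tree.get(child, []))))
    return tour, depth, first

def lca(tree, root, u, v):
    """LCA(u, v) = shallowest tour entry between first[u] and first[v].
    Linear-scan RMQ for clarity; real implementations precompute RMQ."""
    tour, depth, first = euler_tour(tree, root)
    i, j = sorted((first[u], first[v]))
    k = min(range(i, j + 1), key=depth.__getitem__)
    return tour[k]

tree = {0: [1, 2], 1: [3, 4]}
print(lca(tree, 0, 3, 4))  # 1
print(lca(tree, 0, 4, 2))  # 0
```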
ISBN (print): 9781450390682
In this paper, we propose a parallel algorithm for computing all-pairs shortest paths (APSP) for sparse graphs on a distributed-memory system with p processors. To exploit graph sparsity, we first preprocess the graph using several known algorithmic techniques from linear algebra, such as fill-in reducing ordering and elimination tree parallelism. We then map the preprocessed graph onto the distributed-memory system for both load balancing and communication reduction. Finally, we design a new scheduling strategy to minimize the communication cost. The bandwidth cost (communication volume) and the latency cost (number of messages) of our algorithm are O(n² log² p / p + |S|² log² p) and O(log² p), respectively, where S is a minimal vertex separator that partitions the graph into two components of roughly equal size. Compared with the state-of-the-art result for dense graphs, where the bandwidth and latency costs are O(n²/√p) and O(√p log² p), respectively, our algorithm reduces the latency cost by a factor of O(√p), and reduces the bandwidth cost by a factor of O(√p / log² p) for sparse graphs with |S| = O(n/√p). We also present lower bounds on the bandwidth and latency costs of computing APSP on sparse graphs, which are Ω(n²/p + |S|²) and Ω(log² p), respectively. This implies that the bandwidth cost of our algorithm is nearly optimal and the latency cost is optimal.
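For context, the dense serial baseline that distributed APSP algorithms improve upon is Floyd-Warshall; a minimal sketch (ours, for orientation, not the paper's algorithm):

```python
def floyd_warshall(adj):
    """Textbook O(n^3) APSP on an adjacency matrix (float('inf') marks
    missing edges). Serial reference point only; the paper's contribution
    is distributing this computation while exploiting sparsity."""
    n = len(adj)
    dist = [row[:] for row in adj]
    for k in range(n):
        for i in range(n):
            for j in range(n):
                if dist[i][k] + dist[k][j] < dist[i][j]:
                    dist[i][j] = dist[i][k] + dist[k][j]
    return dist

# 0 --1-- 1 --2-- 2, plus a direct 0--2 edge of weight 4
adj = [[0, 1, 4], [1, 0, 2], [4, 2, 0]]
print(floyd_warshall(adj)[0][2])  # 3, via vertex 1
```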
ISBN (print): 9781665435772
Randomized algorithms often outperform their deterministic counterparts in terms of simplicity and efficiency. In this paper, we consider Randomized Incremental Constructions (RICs), which are very popular, in particular in combinatorial optimization and computational geometry. Our contribution is Collaborative Parallel RIC (CPRIC), a novel approach to parallelizing RIC for modern parallel architectures such as vector processors and GPUs. We show that our approach, based on a work-stealing mechanism, avoids the control-flow divergence of parallel threads, thus improving the performance of the parallel implementation. Our extensive experiments on CPU and GPU demonstrate the advantages of our CPRIC approach, which achieves an average speedup between 4x and 5x compared to a naively parallelized RIC.
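The expected-cost argument behind RIC is backward analysis over a random insertion order. A minimal warm-up sketch (ours, not CPRIC): inserting n items in random order changes the current optimum only O(log n) times in expectation, and the same argument bounds the expected structural change per insertion in geometric RICs.

```python
import random

def running_max_changes(values, seed=0):
    """Insert items in random order and count how often the running optimum
    changes. By backward analysis the expected count is H_n = O(log n),
    the same argument that bounds expected work per insertion in
    randomized incremental constructions. Illustrative only."""
    order = list(values)
    random.Random(seed).shuffle(order)
    best, changes = float('-inf'), 0
    for v in order:
        if v > best:
            best, changes = v, changes + 1
    return changes

# For n = 1000 the count is typically around ln(1000) ~ 7, not ~1000.
print(running_max_changes(range(1000)))
```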
ISBN (print): 9783030715939; 9783030715922
We show that a recently developed divide-and-conquer parallel algorithm for solving tridiagonal Toeplitz systems of linear equations can be easily and efficiently implemented for a variety of modern multicore and GPU architectures, as well as hybrid systems. Our new portable implementation, which uses OpenACC, can be executed on both CPU-based and GPU-accelerated systems. More sophisticated variants of the implementation are suitable for systems with multiple GPUs and can use CPU and GPU cores together. We consider the use of both column-wise and row-wise storage formats for two-dimensional double-precision arrays and show how to efficiently convert between these two formats using cache memory. Numerical experiments performed on Intel CPUs and Nvidia GPUs show that our new implementation achieves relatively good performance.
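The serial starting point for such solvers is the Thomas algorithm specialized to constant diagonals; a minimal sketch (an assumed baseline for orientation, not the paper's divide-and-conquer method):

```python
def solve_tridiag_toeplitz(a, b, c, d):
    """Thomas algorithm for T x = d, where T is tridiagonal Toeplitz with
    constant subdiagonal a, diagonal b, and superdiagonal c. O(n) but
    inherently sequential: the serial baseline that divide-and-conquer
    methods split into independent subsystems."""
    n = len(d)
    cp, dp = [0.0] * n, [0.0] * n
    cp[0], dp[0] = c / b, d[0] / b
    for i in range(1, n):              # forward elimination
        m = b - a * cp[i - 1]
        cp[i] = c / m
        dp[i] = (d[i] - a * dp[i - 1]) / m
    x = [0.0] * n
    x[-1] = dp[-1]
    for i in range(n - 2, -1, -1):     # back substitution
        x[i] = dp[i] - cp[i] * x[i + 1]
    return x

# 1-D Poisson-style system with diagonals (-1, 2, -1) and rhs [1, 0, 1];
# the exact solution is x = (1, 1, 1).
x = solve_tridiag_toeplitz(-1.0, 2.0, -1.0, [1.0, 0.0, 1.0])
```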
ISBN (print): 9781665440660
Fourier transforms whose sizes are powers of two or have only small prime factors have been extensively studied, and optimized implementations are typically memory-bound. However, handling arbitrary transform sizes, which may be prime or have large prime factors, is difficult. Direct discrete Fourier transform (DFT) implementations involve extra computation, while fast Fourier transform (FFT)-style factorized decompositions introduce additional overheads in register use, multiprocessor occupancy, and memory traffic. Tensor Cores are hardware units included in modern GPUs that perform matrix multiply-adds at a much higher throughput than normal GPU floating-point instructions. Because of their higher throughput and better uniformity across sizes, DFT/FFT implementations using Tensor Cores can surpass the performance of existing DFT/FFT implementations for difficult sizes. We present key insights of this approach, including complex-number representation, efficient mapping of odd sizes onto Tensor Cores (whose dimensions are all powers of 2), and adding a size-2 or size-4 epilogue transform at very low cost. Furthermore, we describe a method for emulating FP32 precision while using lower-precision Tensor Cores to accelerate the computation. For large batch sizes, our fastest Tensor Core implementation per size is at least 10% faster than the state-of-the-art cuFFT library in 49% of supported sizes for FP64 (double) precision and 42% of supported sizes for FP32 precision. The numerical accuracy of the results matches that of cuFFT for FP64 and is degraded by only about 0.3 bits on average for emulated FP32. To our knowledge, this is the first application of Tensor Cores to FFT computation that meets the accuracy and exceeds the speed of the state of the art.
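The core idea, that a DFT of any size is simply a matrix-vector product and therefore maps onto matrix-multiply hardware, can be sketched in a few lines (pure Python for clarity; a Tensor Core version would batch these products in low precision with an FP32-emulation scheme):

```python
import cmath

def dft_matmul(x):
    """DFT as a plain matrix-vector product, y[j] = sum_k w^(j*k) * x[k]
    with w = exp(-2*pi*i/n). O(n^2) work, but it is a matrix multiply for
    *any* n, including primes -- the shape that maps onto
    matrix-multiply hardware such as Tensor Cores."""
    n = len(x)
    w = cmath.exp(-2j * cmath.pi / n)
    return [sum(w ** (j * k) * x[k] for k in range(n)) for j in range(n)]

# Size 7 (prime, hard for factorized FFTs): the DFT of a unit impulse
# is the all-ones vector.
y = dft_matmul([1, 0, 0, 0, 0, 0, 0])
```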
The multi-scale character of skeletal muscle models requires simulations with high spatial resolution to capture all relevant effects. This naturally involves high computational load that can only be tackled by parallel computations. We simulate electrophysiology and muscle contraction using a state-of-the-art, biophysical chemo-electro-mechanical model that requires meshes of the 3D domain with embedded, aligned 1D meshes for muscle fibers. We present novel algorithms to construct highly-resolved meshes with robust properties for real muscle geometries from surface triangulations. We demonstrate their use and suitability in a simulation of the biceps brachii muscle and tendons. In addition, the respective simulations showcase several functional enhancements of our simulation framework OpenDiHu.
For distributions over discrete product spaces ∏_{i=1}^n Ω_i, Glauber dynamics is a Markov chain that at each step resamples a random coordinate conditioned on the other coordinates. We show that k-Glauber dynamics, whi...
ISBN (print): 9783030816919; 9783030816902
Algorithms with space-time tiling increase the performance of numerical simulations by increasing data reuse and arithmetic intensity; they also improve parallel scaling by making process synchronization less frequent. The theory of Locally Recursive non-Locally Asynchronous (LRnLA) algorithms provides a performance model that accounts for data localization at all levels of the memory hierarchy. However, an effective implementation is difficult, since modern optimizing compilers do not support the required traversal methods and data structures by default. Data exchange is typically implemented by writing the updated values to the main data array. Here, we suggest a new data structure that contains the partially updated state of the simulation domain. Data is arranged within this structure for coalesced access and seamless exchange between subtasks. We demonstrate preliminary results of its superiority over previously used methods by localizing the processed data in the L2 GPU cache for a Lattice Boltzmann Method (LBM) simulation, so that performance is not limited by GDDR throughput but is determined by the L2 cache access rate. If we estimate the ideal stepwise code performance to be memory-bound with a read/write ratio equal to 1, assume it is localized in GPU memory, and assume it performs at 100% of the theoretical memory bandwidth, then the results of our benchmarks exceed that peak by a factor of about 1.2.
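The memory-bound peak used as the reference point above can be estimated with a simple roofline-style model. The parameters here are assumptions for illustration (a D3Q19 lattice with q = 19 populations in double precision), not the paper's exact figures:

```python
def ideal_lbm_mlups(bandwidth_gb_s, q=19, bytes_per_value=8):
    """Ideal memory-bound LBM rate in million lattice updates per second:
    a stepwise code reads and writes each of q populations once per update
    (read/write ratio 1), so bytes/update = 2 * q * bytes_per_value.
    Assumed D3Q19 double-precision model, not the paper's exact figures."""
    return bandwidth_gb_s * 1e9 / (2 * q * bytes_per_value) / 1e6

# e.g. a GPU with 900 GB/s of DRAM bandwidth:
print(f"{ideal_lbm_mlups(900.0):.0f} MLUPS at 100% of bandwidth")
```

Exceeding such a peak by a factor of about 1.2, as reported above, is only possible because the working set is served from the L2 cache rather than DRAM.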