HIL (Hardware-in-the-Loop) is an efficient and convenient tool for the testing and verification of electric drive systems, which require high reliability and safety. With the application of high-frequency SiC inverters, the ...
This paper proposes a novel hybrid parallel algorithm with multiple improvement strategies. The whole population is divided into three subpopulations, and each subpopulation executes the butterfly optimization algorithm, gre...
Improving parallel computing efficiency is of great significance for the development of and research on particle simulation. This article first compares the communication efficiency of MSMPI and MPICH2. Th...
The development of computationally efficient algorithms and the improvement of their software implementation are urgent issues that require continuous attention due to the ongoing development of computer system archit...
Parallelism patterns (e.g., map or reduce) have proven to be effective tools for parallelizing high-performance applications. In this article, we study the recursive registration of a series of electron microscopy images - a time-consuming and imbalanced computation necessary for nanoscale microscopy analysis. We show that by translating the image registration into a specific instance of the prefix scan, we can convert this seemingly sequential problem into a parallel computation that scales to over a thousand cores. We analyze a variety of scan algorithms that behave similarly for common low-compute operators and propose a novel work-stealing procedure for a hierarchical prefix scan. Our evaluation shows that by identifying a suitable and well-optimized prefix scan algorithm, we reduce the time-to-solution for a series of 4,096 images spanning ten seconds of microscopy acquisition from over 10 hours to less than 3 minutes (using 1024 Intel Haswell cores), enabling the derivation of material properties at the nanoscale for long microscopy image series.
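The key idea above - that composing pairwise registrations is associative, so cumulative transforms form a prefix scan - can be illustrated with a minimal sketch. The operator and data here are hypothetical stand-ins (2D translations composed by addition), not the paper's registration operator or scan implementation:

```python
# Inclusive prefix scan under an arbitrary associative operator.
# Registering image i back to image 0 composes the pairwise transforms
# T_{0<-1}, T_{1<-2}, ..., T_{i-1<-i}; associativity is what lets the
# scan be parallelized (e.g., via a hierarchical / work-stealing scheme).

def inclusive_scan(items, op):
    """Sequential reference scan: out[i] = items[0] op items[1] op ... op items[i]."""
    out = []
    acc = None
    for x in items:
        acc = x if acc is None else op(acc, x)
        out.append(acc)
    return out

# Pairwise transforms modeled as 2D translations (dx, dy); composition is addition.
compose = lambda a, b: (a[0] + b[0], a[1] + b[1])

pairwise = [(1, 0), (0, 2), (3, 1)]            # hypothetical drift between frames
cumulative = inclusive_scan(pairwise, compose)
print(cumulative)                               # [(1, 0), (1, 2), (4, 3)]
```

A parallel scan (Blelloch-style or hierarchical, as in the article) computes the same `cumulative` list in O(log n) depth by exploiting the associativity of `op`.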
Unit commitment (UC) is an important problem solved on a daily basis within a strict time limit. While hourly UC is currently used, it may not be flexible enough to accommodate the growing variability of demand and the increasing penetration of intermittent renewables. Sub-hourly UC is therefore recommended. This, however, will significantly increase problem complexity even under the deterministic setting because of the considerable increase in the number of intervals, leading to a drastic increase in the numbers of system coupling constraints and binary variables as compared to hourly UC. Consequently, existing methods may not be able to obtain good solutions within the time limit for large problems. In this paper, deterministic sub-hourly UC is considered with an innovative exploitation of "soft constraints": constraints that do not need to be strictly satisfied but whose violations are penalized by predetermined coefficients. This, in conjunction with our recent "Surrogate Absolute Value Lagrangian Relaxation" approach where the "relaxed problem" is not required to be fully optimized, facilitates the formation and resolution of a new type of subproblem where soft system coupling constraints (e.g., reserve and transmission capacity constraints) are not relaxed. This then leads to a drastic reduction in the number of multipliers, decreased computational requirements, and improved solution quality. To further enhance the speed, a parallel version is developed. Testing results based on the Polish system demonstrate the effectiveness and robustness of both the sequential and parallel versions at finding high-quality solutions within the time limit.
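The "soft constraint" device described above can be made concrete with a small sketch. The function name, penalty coefficient, and numbers below are hypothetical illustrations of the general idea (a penalized objective), not the paper's formulation:

```python
# Soft-constraint sketch: a coupling constraint g(x) <= 0 (e.g., a reserve
# requirement) is not enforced exactly; any positive violation is penalized
# in the objective with a predetermined coefficient.

def penalized_cost(base_cost, violations, penalty=1000.0):
    """base_cost plus penalty times the sum of positive constraint violations."""
    return base_cost + penalty * sum(max(0.0, v) for v in violations)

# Feasible point: no violation, so the cost is unchanged.
print(penalized_cost(500.0, [-2.0, 0.0]))   # 500.0
# Infeasible point: a 1.5 MW reserve shortfall is penalized.
print(penalized_cost(480.0, [1.5]))         # 1980.0
```

Because the soft constraints move into the objective rather than being relaxed with multipliers, the number of Lagrange multipliers to update shrinks, which is the source of the computational savings the abstract describes.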
A topical task in petroleum geophysics is solving the problem of multicomponent multiphase flow in a porous medium. At the same time, the development of effective parallel algorithms is an urgent task for modeling p...
Tensor computations are important mathematical operations for applications that rely on multidimensional data. The tensor-vector multiplication (TVM) is the most memory-bound tensor contraction in this class of operations. This article proposes an open-source TVM algorithm that is much simpler and more efficient than previous approaches, making it suitable for integration into the most popular BLAS libraries available today. Our algorithm has been written from scratch and features unit-stride memory accesses, cache awareness, mode obliviousness, full vectorization and multi-threading, as well as NUMA awareness for non-hierarchically stored dense tensors. Numerical experiments are carried out on tensors up to order 10 and various compilers and hardware architectures equipped with traditional DDR and high-bandwidth memory (HBM). For large tensors the average performance of the TVM ranges between 62% and 76% of the theoretical bandwidth for NUMA systems with DDR memory and remains independent of the contraction mode. On NUMA systems with HBM the TVM exhibits some mode dependency but manages to reach performance figures close to peak values. Finally, the higher-order power method is benchmarked with the proposed TVM kernel and delivers on average between 58% and 69% of the theoretical bandwidth for large tensors.
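For readers unfamiliar with the operation, a mode-k TVM contracts one axis of a tensor with a vector. The NumPy reference below only defines the contraction itself; the article's contribution is a hand-tuned, cache- and NUMA-aware kernel, which this sketch does not attempt to reproduce:

```python
# Reference mode-k tensor-vector multiplication (TVM):
# result[i_0, ..., i_{k-1}, i_{k+1}, ...] = sum_j tensor[..., j, ...] * vec[j],
# where j runs over axis `mode`. This is the memory-bound contraction the
# article optimizes; np.tensordot here is just a correctness reference.
import numpy as np

def tvm(tensor, vec, mode):
    """Contract `tensor` with `vec` along axis `mode`."""
    return np.tensordot(tensor, vec, axes=([mode], [0]))

T = np.arange(24, dtype=float).reshape(2, 3, 4)   # order-3 tensor
v = np.ones(3)                                     # contraction vector
out = tvm(T, v, mode=1)                            # shape (2, 4): axis 1 contracted
print(out.shape)  # (2, 4)
```

"Mode obliviousness" in the abstract means the optimized kernel sustains the same bandwidth regardless of which axis `mode` selects, even though the memory stride of that axis differs.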
ISBN (print): 9798350308600
Counting and finding triangles in graphs is often used in real-world analytics to characterize cohesiveness and identify communities in graphs. In this paper, we propose the novel concept of a cover-edge set that can be used to find triangles more efficiently. We use a breadth-first search (BFS) to quickly generate a compact cover-edge set. Novel sequential and parallel triangle counting algorithms are presented that employ cover-edge sets. The sequential algorithm avoids unnecessary triangle-checking operations, and the parallel algorithm is communication-efficient. The parallel algorithm can asymptotically reduce communication on massive graphs such as real social networks and synthetic graphs from the Graph500 benchmark. In our estimate from massive-scale Graph500 graphs, our new parallel algorithm can reduce the communication on a scale-36 graph by 1156x and on a scale-42 graph by 2368x.
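As context for what the cover-edge idea improves on, here is the standard baseline that counts triangles by intersecting the neighbor sets of each edge's endpoints. This is the naive scheme whose redundant checks the paper's cover-edge set prunes; it is not the paper's algorithm:

```python
# Baseline sequential triangle counting via neighbor-set intersection.
# Every triangle {u, v, w} is discovered once from each of its three edges,
# hence the final division by 3.

def count_triangles(edges):
    """Count triangles in an undirected simple graph given as an edge list."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    count = 0
    for u, v in edges:
        count += len(adj[u] & adj[v])   # common neighbors w close a triangle
    return count // 3

# The complete graph K4 contains 4 triangles.
k4 = [(0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (2, 3)]
print(count_triangles(k4))  # 4
```

In a distributed setting, each intersection may require exchanging neighbor lists between machines, which is why reducing the set of edges that must be checked translates directly into the communication savings the abstract reports.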
Purpose String indexes such as the suffix array (sa) and the closely related longest common prefix (lcp) array are fundamental objects in bioinformatics and have a wide variety of applications. Despite their importance in practice, few scalable parallel algorithms for constructing these are known, and the existing algorithms can be highly non-trivial to implement and *** In this paper we present caps-sa, a simple and scalable parallel algorithm for constructing these string indexes inspired by samplesort and utilizing an LCP-informed mergesort. Due to its design, caps-sa has excellent memory-locality and thus incurs fewer cache misses and achieves strong performance on modern multicore systems with deep cache *** We show that despite its simple design, caps-sa outperforms existing state-of-the-art parallel sa and lcp-array construction algorithms on modern hardware. Finally, motivated by applications in modern aligners where the query strings have bounded lengths, we introduce the notion of a bounded-context sa and show that caps-sa can easily be extended to exploit this structure to obtain further speedups. We make our code publicly available at https://***/jamshed/CaPS-SA.
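To make the two objects concrete, the sketch below builds the suffix array and lcp array naively (quadratic time). It only illustrates what caps-sa computes; the paper's samplesort-based, LCP-informed parallel construction is far more scalable than this reference:

```python
# Naive reference construction of the suffix array (sa) and lcp array.

def suffix_array(s):
    """Indices of the suffixes of s in lexicographic order."""
    return sorted(range(len(s)), key=lambda i: s[i:])

def lcp_array(s, sa):
    """lcp[i] = length of the longest common prefix of the suffixes at
    sa[i-1] and sa[i]; lcp[0] is 0 by convention."""
    def lcp(a, b):
        n = 0
        while a + n < len(s) and b + n < len(s) and s[a + n] == s[b + n]:
            n += 1
        return n
    return [0] + [lcp(sa[i - 1], sa[i]) for i in range(1, len(sa))]

sa = suffix_array("banana")
print(sa)                       # [5, 3, 1, 0, 4, 2]
print(lcp_array("banana", sa))  # [0, 1, 3, 0, 0, 2]
```

The lcp values are exactly what an LCP-informed mergesort exploits: when merging two sorted runs of suffixes, known common-prefix lengths let comparisons skip characters already known to match.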