High-performance computing (HPC) has become an essential tool for improving the efficiency and scalability of transaction processing systems, especially as data volumes continue to grow in fields like finance, e-comme...
ISBN: 9781665497473 (print)
In this paper we present a performance study of multidimensional Fast Fourier Transforms (FFTs) with GPU accelerators on modern hybrid architectures, such as those expected for upcoming exascale systems. We assess and leverage features from traditional implementations of parallel FFTs and provide an algorithm that encompasses a wide range of their parameters and adds novel developments such as FFT grid shrinking and batched transforms. Next, we create a bandwidth model to quantify the computational costs and analyze the well-known communication bottleneck for All-to-All and Point-to-Point MPI exchanges. Then, using a tuning methodology, we accelerate the FFT computation and reduce the communication cost, achieving linear scalability on a large-scale system with GPU accelerators. Finally, we extend our performance analysis to show that carefully tuning the algorithm can further accelerate applications that rely heavily on FFTs, such as molecular dynamics software. Our experiments were performed on the Summit and Spock supercomputers with IBM Power9 cores, over 3000 NVIDIA V100 GPUs, and AMD MI100 GPUs.
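As background for the communication bottleneck mentioned in this abstract, the sketch below shows the textbook fact that a 2-D transform can be computed as 1-D transforms along one axis, a transpose, and 1-D transforms along the other axis. In a distributed setting that transpose is typically where data must be exchanged between processes. This is an illustration only, not the paper's algorithm: it uses a naive O(n^2) DFT in pure Python, and the function names `dft`, `dft2_by_axes`, and `dft2_direct` are hypothetical.

```python
import cmath

def dft(values):
    """Naive O(n^2) 1-D discrete Fourier transform."""
    n = len(values)
    return [sum(x * cmath.exp(-2j * cmath.pi * k * t / n) for t, x in enumerate(values))
            for k in range(n)]

def dft2_by_axes(grid):
    """2-D DFT built from 1-D transforms: rows, transpose, rows again, transpose back."""
    rows_done = [dft(row) for row in grid]
    # On a distributed machine this transpose is the data exchange between processes.
    transposed = [list(col) for col in zip(*rows_done)]
    cols_done = [dft(col) for col in transposed]
    return [list(row) for row in zip(*cols_done)]

def dft2_direct(grid):
    """Reference 2-D DFT evaluated straight from the definition."""
    rows, cols = len(grid), len(grid[0])
    return [[sum(grid[r][c] * cmath.exp(-2j * cmath.pi * (u * r / rows + v * c / cols))
                 for r in range(rows) for c in range(cols))
             for v in range(cols)]
            for u in range(rows)]

grid = [[1, 2, 3], [4, 5, 6]]
by_axes = dft2_by_axes(grid)
direct = dft2_direct(grid)
# Both routes agree up to floating-point rounding.
agree = all(abs(by_axes[u][v] - direct[u][v]) < 1e-9 for u in range(2) for v in range(3))
print(agree)
```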
Distributed storage systems typically use erasure codes for fault tolerance to reduce storage overhead. However, the data repair process in erasure-coded systems can generate heavy I/O overhead. Existing methods typic...
ISBN: 9798350305487 (print)
Incremental graphs that change over time capture the changing relationships of different entities. Given that many real-world networks are extremely large, it is often necessary to partition the network over many distributed systems and solve a complex graph problem over the partitioned network. This paper presents a distributed algorithm for identifying strongly connected components (SCC) on incremental graphs. We propose a two-phase asynchronous algorithm that stores the intermediate results between each iteration of dynamic updates in a novel meta-graph storage format, enabling efficient recomputation of the SCC in successive iterations. To the best of our knowledge, this is the first attempt at identifying SCC for incremental graphs across distributed compute nodes. Our experimental analysis on real and synthesized graphs shows up to a 2.8x performance improvement over the state of the art, achieved by reducing the overall memory utilized and improving the communication bandwidth.
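To make the problem in this abstract concrete, here is a minimal single-machine sketch of SCC detection using Kosaraju's two-pass algorithm, rerun from scratch after an edge is added. It is not the paper's distributed two-phase algorithm or its meta-graph format, which the abstract does not detail. The function name `strongly_connected_components` and the sample edges are hypothetical.

```python
from collections import defaultdict

def strongly_connected_components(edges):
    """Return the SCCs of a directed graph given as (u, v) edge pairs (Kosaraju)."""
    graph, reverse = defaultdict(list), defaultdict(list)
    nodes = set()
    for u, v in edges:
        graph[u].append(v)
        reverse[v].append(u)
        nodes.update((u, v))

    # Pass 1: record vertices in order of DFS finish time on the original graph.
    visited, order = set(), []
    for start in sorted(nodes):
        if start in visited:
            continue
        visited.add(start)
        stack = [(start, iter(graph[start]))]
        while stack:
            node, neighbours = stack[-1]
            for nxt in neighbours:
                if nxt not in visited:
                    visited.add(nxt)
                    stack.append((nxt, iter(graph[nxt])))
                    break
            else:
                # All neighbours explored: this vertex is finished.
                order.append(node)
                stack.pop()

    # Pass 2: sweep the reversed graph in decreasing finish time; each sweep is one SCC.
    assigned, components = set(), []
    for start in reversed(order):
        if start in assigned:
            continue
        assigned.add(start)
        component, stack = set(), [start]
        while stack:
            node = stack.pop()
            component.add(node)
            for nxt in reverse[node]:
                if nxt not in assigned:
                    assigned.add(nxt)
                    stack.append(nxt)
        components.append(component)
    return components

base = [(1, 2), (2, 3), (3, 1), (3, 4)]
print(strongly_connected_components(base))             # vertex 4 is its own component
print(strongly_connected_components(base + [(4, 1)]))  # the new edge merges everything
```

Recomputing from scratch after every update, as done here, is the cost that an incremental method aims to avoid.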
ISBN: 9781665497992 (digital)
ISBN: 9781665497992 (print)
In recent years, high computational power has been required for computer platforms to support complex systems such as self-driving systems. Clustered many-core processors and directed acyclic graphs (DAGs), which can represent dependencies and parallelism of task processing, have attracted much attention as solutions to this problem. Previous studies on scheduling DAGs on multi-core processors have attempted to reduce the makespan (i.e., the time it takes for a task to complete) by increasing the number of processes that can be executed in parallel. However, in self-driving systems, such as those utilizing clustered many-core processors, it is impossible to sufficiently increase the utilization of processor cores due to high-load processing. In this paper, a scheduling method is proposed that improves the utilization of processor cores by executing high-load processes in parallel across multiple cores. The proposed method can reduce the makespan of DAGs performing high-load processing on clustered many-core processors.
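The effect described in this abstract can be illustrated with a small sketch that computes the critical-path length of a DAG, which is a lower bound on its makespan when enough cores are available. It then shows how that bound shrinks when one heavy node is spread over several cores. This is not the proposed scheduling method. The node names, the costs, and the ideal four-way speedup of the heavy node are all assumptions made for illustration.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

def critical_path_length(cost, preds):
    """Longest path through the DAG: a makespan lower bound given enough cores."""
    finish = {}
    for node in TopologicalSorter(preds).static_order():
        # A node can start only when all of its predecessors have finished.
        start = max((finish[p] for p in preds.get(node, ())), default=0.0)
        finish[node] = start + cost[node]
    return max(finish.values())

# Hypothetical task graph: each node maps to the nodes it depends on.
preds = {"sense": [], "detect": ["sense"], "plan": ["sense"], "act": ["detect", "plan"]}
cost = {"sense": 2, "detect": 12, "plan": 3, "act": 1}

serial_heavy = critical_path_length(cost, preds)

# Assumption: the heavy "detect" node splits perfectly across 4 cores.
split_cost = dict(cost, detect=cost["detect"] / 4)
parallel_heavy = critical_path_length(split_cost, preds)

print(serial_heavy, parallel_heavy)
```

Real speedups are below the ideal factor assumed here, since splitting a node adds synchronization and communication costs.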
ISBN: 9798350364613; 9798350364606 (print)
The Edmonds Blossom algorithm is implemented here using depth-first search, which is intrinsically serial. By streamlining the code, our serial implementation is consistently three to five times faster than the previously fastest general graph matching code. By extracting parallelism across iterations of the algorithm, with coarse-grain locking, we are able to further reduce the run time on random regular graphs fourfold and obtain a twofold reduction of run time on real-world graphs with similar topology. Solving very sparse graphs (average degree less than four) exhibiting community structure with eight threads led to a threefold slowdown, but this slowdown is replaced by a marginal speedup once the average degree is greater than four. We conclude that our parallel coarse-grain locking implementation performs well when extracting parallelism from this augmenting-path-based algorithm and may work well for similar algorithms.
Conventional power sharing strategies for parallel inverters are mainly divided into two categories. One is based on interconnect lines (ILs) for power information exchange, where the whole system would be subjected t...
ISBN: 9781665497534 (print)
Distributed OLTP systems execute the high-overhead two-phase commit (2PC) protocol at the end of every distributed transaction. Epoch-based commit proposes that 2PC be executed only once for all transactions processed within a time interval called an epoch. Increasing the epoch duration allows more transactions to be processed before the common 2PC. It thus reduces the 2PC overhead per transaction and increases throughput, but it also increases average transaction latency. What is required, therefore, is the ability to choose an epoch size that offers the desired trade-off between throughput and latency. To this end, we develop two analytical models that estimate throughput and average latency in terms of epoch size, taking into account load and failure conditions. Simulations affirm their accuracy and effectiveness. We then present epoch-based multi-commit which, unlike epoch-based commit, seeks to avoid aborting all transactions when failures occur, while performing identically when failures do not occur. Our performance study identifies workload factors that make it more effective in preventing transaction aborts and concludes that the analytical models are equally useful in predicting its performance.
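The throughput and latency trade-off described in this abstract can be seen in a deliberately simple toy model. It is not either of the paper's analytical models, which the abstract does not spell out, and it ignores load and failures. It assumes that processing pauses for a fixed commit time while one 2PC closes each epoch, and that transactions complete uniformly throughout the epoch. The function `epoch_tradeoff` and its parameters are hypothetical.

```python
def epoch_tradeoff(epoch_ms, commit_ms, rate_per_ms):
    """Toy model (assumed): one 2PC of commit_ms closes every epoch of epoch_ms."""
    cycle_ms = epoch_ms + commit_ms
    # Work is done only during the epoch, then everything waits for the shared 2PC.
    throughput = rate_per_ms * epoch_ms / cycle_ms
    # A transaction waits half an epoch on average, then the full commit.
    avg_latency_ms = epoch_ms / 2 + commit_ms
    return throughput, avg_latency_ms

for epoch_ms in (10, 30, 90):
    throughput, latency = epoch_tradeoff(epoch_ms, commit_ms=10, rate_per_ms=1.0)
    print(f"epoch={epoch_ms} ms  throughput={throughput:.2f}/ms  latency={latency:.1f} ms")
```

Under these assumptions, lengthening the epoch pushes throughput toward the raw processing rate while average latency grows roughly linearly.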
ISBN: 9798350364613; 9798350364606 (print)
The aerospace industry is one of the largest users of numerical simulation, which is an essential tool in the field of aerodynamic engineering, where many fluid dynamics simulations are involved. In order to obtain the most accurate solutions, some of these simulations use unstructured finite volume solvers that cope with irregular meshes by using explicit time-adaptive integration methods. Modern parallel implementations of these solvers rely on task-based runtime systems to perform fine-grained load balancing and to avoid unnecessary synchronizations. Although such implementations greatly improve performance compared to classical fork-join MPI+OpenMP variants, it remains a challenge to keep all cores busy throughout the simulation loop. In this article, we first investigate the origins of this lack of parallelism. We emphasize that the irregular structure of the task graph plays a major role in the inefficiency of the computation distribution. Our main contribution is to improve the shape of the task graph by using a new mesh partitioning strategy. The originality of our approach is to take the temporal level of mesh cells into account during the mesh partitioning phase. We evaluate our approach by integrating our solution in an ArianeGroup production code used by Airbus. We show that our partitioning method leads to a more balanced task graph. The resulting task scheduling is up to two times faster for meshes ranging from 200,000 to 12,000,000 components.
ISBN: 9798350349658 (digital)
ISBN: 9798350349665 (print)
This paper describes an edge computing system in a network application service scenario and analyzes the security problems of authenticating user identities through a remote authentication service. A user identity authentication method based on edge computing technology is studied and described in detail, including the identity authentication process and the calculation of the identity authentication value and the verification value. The method effectively ensures the security of edge computing and makes full use of the advantages of edge computing to improve the efficiency and level of user identity authentication.