ISBN (Print): 9781665455336
State estimation is the foundation for a variety of online power system applications in energy management systems, and the stability of power systems is directly impacted by the speed with which current system states can be obtained through state estimation. This paper proposes a fast Gauss-Newton state estimation method for power systems based on parallel belief propagation, which implements Gaussian belief propagation via multi-core, multi-threaded parallel computation to achieve efficient state estimation. Simulation results on several IEEE standard test systems show that the proposed method outperforms the traditional algorithm.
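As a rough illustration of the idea in this abstract (and not the authors' implementation), the sketch below solves the Gauss-Newton normal equations of weighted-least-squares state estimation with synchronous Gaussian belief propagation, farming the per-edge message updates out to a thread pool. The function name gabp_solve, the thread-pool parallelization, and the toy diagonally dominant system are all assumptions made for illustration.

# Sketch: Gaussian belief propagation (GaBP) used to solve the normal
# equations of one Gauss-Newton iteration, with per-edge message updates
# spread over a thread pool. Illustrative only, not the paper's code.
from concurrent.futures import ThreadPoolExecutor
import numpy as np

def gabp_solve(A, b, iters=100, workers=4):
    """Solve A x = b with synchronous GaBP (A symmetric, diagonally dominant)."""
    n = len(b)
    nbrs = [[j for j in range(n) if j != i and A[i, j] != 0.0] for i in range(n)]
    edges = [(i, j) for i in range(n) for j in nbrs[i]]
    P = {e: 0.0 for e in edges}    # precision message i -> j
    mu = {e: 0.0 for e in edges}   # mean message i -> j

    def message(edge):
        i, j = edge
        # Combine node i's local term with all incoming messages except j's.
        prec = A[i, i] + sum(P[(k, i)] for k in nbrs[i] if k != j)
        mean_num = b[i] + sum(P[(k, i)] * mu[(k, i)] for k in nbrs[i] if k != j)
        return edge, -A[i, j] ** 2 / prec, mean_num / A[i, j]

    with ThreadPoolExecutor(max_workers=workers) as pool:
        for _ in range(iters):
            # Synchronous schedule: every message is computed from the previous
            # round's values, so the per-edge updates are independent and can
            # be farmed out to worker threads (or cores in a real system).
            updates = list(pool.map(message, edges))
            for e, p_new, mu_new in updates:
                P[e], mu[e] = p_new, mu_new

    # The converged marginal means are the solution of the linear system.
    x = np.empty(n)
    for i in range(n):
        prec = A[i, i] + sum(P[(k, i)] for k in nbrs[i])
        x[i] = (b[i] + sum(P[(k, i)] * mu[(k, i)] for k in nbrs[i])) / prec
    return x

if __name__ == "__main__":
    # Toy, diagonally dominant system standing in for the Gauss-Newton
    # normal equations G * dx = H^T W r (made up for this sketch).
    G = np.array([[4.0, 1.0, 0.0, 1.0],
                  [1.0, 5.0, 2.0, 0.0],
                  [0.0, 2.0, 6.0, 1.0],
                  [1.0, 0.0, 1.0, 4.0]])
    g = np.array([1.0, 2.0, 3.0, 4.0])
    print(gabp_solve(G, g))          # should match the direct solve below
    print(np.linalg.solve(G, g))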
ISBN (Print): 9798350364613; 9798350364606
We introduce a distributed-memory parallel algorithm for force-directed node embedding that places the vertices of a graph into a low-dimensional vector space based on the interplay of attraction among neighboring vertices and repulsion among distant vertices. We develop our algorithms using two sparse matrix operations, SDDMM and SpMM. We propose a configurable pull-push-based communication strategy that optimizes memory usage and data transfers based on the available computing resources, and asynchronous MPI communication to overlap communication and computation. Our algorithm scales up to 256 nodes on distributed supercomputers while surpassing the performance of state-of-the-art algorithms.
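As a single-process sketch of the computational pattern named in this abstract (not the authors' distributed MPI implementation or their pull-push communication), the code below expresses one attraction step of force-directed embedding as an SDDMM followed by an SpMM. The sigmoid edge weighting, the helper names, and the toy graph are assumptions.

# Sketch: attraction step of force-directed node embedding written as
# SDDMM (edge-wise coefficients) followed by SpMM (neighbour aggregation).
import numpy as np
import scipy.sparse as sp

def sddmm(A, X):
    """For every nonzero (i, j) of A, compute sigmoid(x_i . x_j); keep A's sparsity."""
    rows, cols = A.nonzero()
    dots = np.einsum("ij,ij->i", X[rows], X[cols])
    vals = 1.0 / (1.0 + np.exp(-dots))
    return sp.csr_matrix((vals, (rows, cols)), shape=A.shape)

def attraction_step(A, X, lr=0.05):
    """Pull neighbours together: per-edge weight (1 - sigmoid(x_i . x_j))."""
    S = sddmm(A, X)          # SDDMM: coefficient for each edge
    C = A - S                # weight 1 - sigmoid on each edge
    grad = C @ X             # SpMM: aggregate neighbour vectors
    return X + lr * grad     # gradient ascent on the attraction term

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    A = sp.random(100, 100, density=0.05, random_state=1, format="csr")
    A = ((A + A.T) > 0).astype(float)            # symmetric unweighted toy graph
    X = rng.normal(scale=0.1, size=(100, 16))    # 16-dimensional embeddings
    X = attraction_step(A, X)
    print(X.shape)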
Non-Uniform Memory Access (NUMA) systems are prevalent in HPC, where optimal thread and page placement are crucial for enhancing performance and minimizing energy usage [1]-[3]. Moreover, considering that NUMA syste...
Streaming graph computation has been widely applied in many fields, e.g., social network analysis and online product recommendation. However, existing streaming graph computation approaches still present limitations o...
The application of GPUs to accelerate large-scale smoke simulation is a hot research topic in computational fluid dynamics. However, current parallel computing methods for smoke flow fields at different scales, th...
Distributed machine learning (DML) has recently experienced widespread application. A major performance bottleneck is the costly communication for gradient synchronization. Recently, researchers have explored the use...
ISBN (Print): 9783031506833; 9783031506840
Heterogeneous systems, consisting of CPUs and GPUs, offer the capability to address the demands of compute- and data-intensive applications. However, programming such systems is challenging, requiring knowledge of various parallel programming frameworks. This paper introduces COMPAR, a component-based parallel programming framework that enables the exposure and selection of multiple implementation variants of components at runtime. The framework leverages compiler directive-based language extensions to annotate the source code and generate the necessary glue code for the StarPU runtime system. COMPAR provides a unified view of implementation variants and allows for intelligent selection based on runtime context. Our evaluation demonstrates the effectiveness of COMPAR through benchmark applications. The proposed approach simplifies heterogeneous parallel programming and promotes code reuse while achieving optimal performance.
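The abstract describes directive-annotated implementation variants selected at runtime; COMPAR's actual directives and its generated StarPU glue code are not shown here. As a loose illustration of the underlying idea only, the sketch below registers CPU and GPU variants of a component in a plain Python registry and picks one at runtime, falling back when a target is unavailable. All names are hypothetical assumptions.

# Sketch: component-based variant registration and runtime selection
# (a generic stand-in, not COMPAR's directive-based mechanism).
import numpy as np

VARIANTS = {}

def variant(component, target):
    """Register a function as one implementation variant of a component."""
    def wrap(fn):
        VARIANTS.setdefault(component, {})[target] = fn
        return fn
    return wrap

@variant("vector_add", target="cpu")
def vector_add_cpu(a, b):
    return a + b                      # NumPy on the host

@variant("vector_add", target="gpu")
def vector_add_gpu(a, b):
    import cupy as cp                 # hypothetical GPU variant; needs CuPy
    return cp.asnumpy(cp.asarray(a) + cp.asarray(b))

def call(component, *args, prefer=("gpu", "cpu")):
    """Runtime selection: try preferred targets in order, fall back on failure."""
    for target in prefer:
        fn = VARIANTS.get(component, {}).get(target)
        if fn is None:
            continue
        try:
            return fn(*args)
        except Exception:
            continue                  # e.g. no GPU present; try the next variant
    raise RuntimeError(f"no usable variant for {component}")

if __name__ == "__main__":
    a, b = np.arange(4.0), np.ones(4)
    print(call("vector_add", a, b))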
ISBN (Print): 9798350307924
In distributed training, deep neural networks (DNNs) are launched over multiple workers concurrently and aggregate their local updates on each step in bulk-synchronous parallel (BSP) training. However, BSP does not scale out linearly due to the high communication cost of aggregation. To mitigate this overhead, alternatives like Federated Averaging (FedAvg) and Stale-Synchronous Parallel (SSP) either reduce synchronization frequency or eliminate it altogether, usually at the cost of lower final accuracy. In this paper, we present SelSync, a practical, low-overhead method for DNN training that dynamically chooses to incur or avoid communication at each step, either by calling the aggregation op or by applying local updates, based on their significance. We propose various optimizations as part of SelSync to improve convergence in the context of semi-synchronous training. Our system converges to the same or better accuracy than BSP while reducing training time by up to 14x.
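As a minimal sketch of the general idea (the significance criterion below is an assumption, not SelSync's published rule), each step either averages the gradient across workers or applies it locally, depending on how much the local update would move the weights.

# Sketch: per-step choice between aggregation and a purely local update,
# driven by a simple relative-norm significance test (an assumption).
import numpy as np

def significant(update, weights, threshold=0.01):
    """An update is 'significant' if it moves the weights by more than
    `threshold` in relative L2 norm."""
    return np.linalg.norm(update) > threshold * (np.linalg.norm(weights) + 1e-12)

def train_step(weights, local_grad, lr, allreduce):
    update = -lr * local_grad
    if significant(update, weights):
        # Synchronize: average this step's gradient across all workers.
        update = -lr * allreduce(local_grad)
    # Otherwise skip communication and apply the local update only.
    return weights + update

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    w = rng.normal(size=1000)
    fake_allreduce = lambda g: g          # stand-in for an MPI/NCCL all-reduce
    for step in range(5):
        g = rng.normal(scale=0.001, size=1000)
        w = train_step(w, g, lr=0.1, allreduce=fake_allreduce)
    print(w[:3])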
ISBN (Print): 9789819755684; 9789819755691
Synchronous distributed data parallel (SDDP) training is widely employed in distributed deep learning systems to train DNN models on large datasets. The performance of SDDP training essentially depends on the communication overhead and the statistical efficiency. However, existing approaches optimize only one of the two to accelerate SDDP training. In this paper, we combine the advantages of both kinds of approaches and design a new approach, namely SkipSMA, that benefits from both low communication overhead and high statistical efficiency. In particular, we exploit a skipping strategy with an adaptive interval to decrease the communication frequency, which guarantees low communication overhead. Moreover, we employ a correction technique to mitigate the divergence while keeping small batch sizes, which ensures high statistical efficiency. To demonstrate the performance of SkipSMA, we integrate it into TensorFlow. Our experiments show that SkipSMA outperforms state-of-the-art solutions for SDDP training, e.g., achieving a 6.88x speedup over SSGD.
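A minimal sketch of the two ingredients named in the abstract, under stated assumptions: synchronization is skipped for an adaptive number of steps, and at each synchronization a correction replaces the locally accumulated drift with its cross-worker average. This is an interpretation for illustration, not SkipSMA's exact algorithm; the class and parameter names are hypothetical.

# Sketch: adaptive-interval synchronization skipping with a drift correction.
import numpy as np

class SkipSync:
    def __init__(self, allreduce, base_interval=4):
        self.allreduce = allreduce          # averages a vector over workers
        self.interval = base_interval       # steps between synchronizations
        self.local_accum = None             # locally applied updates since last sync
        self.step = 0

    def apply(self, weights, grad, lr):
        update = -lr * grad
        self.local_accum = update if self.local_accum is None else self.local_accum + update
        weights = weights + update
        self.step += 1
        if self.step % self.interval == 0:
            # Correction: replace the local drift with its cross-worker average.
            avg_accum = self.allreduce(self.local_accum)
            weights = weights - self.local_accum + avg_accum
            # Adapt the interval: sync more often when local drift is large.
            drift = np.linalg.norm(self.local_accum)
            self.interval = max(1, self.interval - 1) if drift > 1.0 else self.interval + 1
            self.local_accum = None
        return weights

if __name__ == "__main__":
    sync = SkipSync(allreduce=lambda v: v)   # single-worker stand-in
    w = np.zeros(10)
    for _ in range(10):
        w = sync.apply(w, np.ones(10), lr=0.01)
    print(w[:3])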
ISBN (Print): 9798350339864
Data partitioning is the most fundamental procedure before parallelizing complex analysis on very large graphs. As a classical NP-complete problem, graph partitioning usually employs offline or online/streaming heuristics to find approximately optimal solutions. However, these are either heavyweight in space and time overheads or suboptimal in quality, measured by workload balance and the number of cut edges across partitions, and neither scales well with the ever-growing demand for quickly analyzing big graphs. This paper thereby proposes a new vertex partitioner for better scalability. It preserves the lightweight advantage of existing streaming heuristics and, more importantly, fully utilizes the knowledge embedded in the local view when streaming a vertex, which significantly improves quality. We present a sliding-window technique to compensate for the additional memory costs caused by knowledge utilization. Also, a parallel technique with a dependency-detection optimization is designed to further enhance efficiency. Experiments on a range of real-world datasets validate that our proposals achieve overall success in terms of partitioning quality, memory consumption, and runtime efficiency.
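As a generic illustration of streaming vertex partitioning with a sliding window (not the paper's partitioner), the sketch below greedily places each streamed vertex with an LDG-style score and forgets the oldest placements to bound the memory used for scoring. All names and the toy graph are assumptions.

# Sketch: greedy streaming vertex partitioning with a sliding window over
# the placements kept in memory for scoring.
from collections import deque

def stream_partition(stream, k, capacity, window=10000):
    """stream yields (vertex, neighbour_list); returns {vertex: partition}."""
    result = {}                     # final placement (a real system would spill this)
    window_assign = {}              # placements used for scoring, bounded in size
    recent = deque()
    loads = [0] * k
    for v, nbrs in stream:
        # LDG-style score: neighbours already in partition p, damped by p's load.
        def score(p):
            common = sum(1 for u in nbrs if window_assign.get(u) == p)
            return common * (1.0 - loads[p] / capacity)
        best = max(range(k), key=score)
        result[v] = best
        window_assign[v] = best
        loads[best] += 1
        recent.append(v)
        if len(recent) > window:    # sliding window bounds in-memory knowledge
            window_assign.pop(recent.popleft(), None)
    return result

if __name__ == "__main__":
    edges = {0: [1, 2], 1: [0, 2], 2: [0, 1], 3: [4], 4: [3, 5], 5: [4]}
    print(stream_partition(edges.items(), k=2, capacity=3))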