All-to-all communication has a wide range of applications in parallel programs such as the FFT. On most supercomputers, each node contains multiple cores. Message aggregation is an efficient method for smaller messages, and using multiple leaders to aggregate messages significantly reduces intra-node overhead. However, compared to one-leader aggregation, existing multi-leader designs incur a higher message count and a smaller aggregated message size. This paper proposes an Overlapped Multi-worker Multi-port all-to-all (OVALL) algorithm to scale the message size and parallelism of the aggregation algorithm. The algorithm exploits all-to-all multi-core parallelism, concurrent communication, and overlapping capabilities. Experimental results show that OVALL's implementation achieves up to 5.9x or 18x speedup over the systems' built-in MPI on different HPC systems. For the Fast Fourier Transform (FFT) application, OVALL is up to 2.7x (8192 cores, system A) or 5.6x (4800 cores, system B) faster than built-in MPI at peak performance.
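The leader-based aggregation trade-off described above can be sketched as follows. This is a minimal illustrative model, not OVALL's actual implementation; the function name and data layout are assumptions made for the example.

```python
# Sketch: multi-leader message aggregation for an all-to-all exchange.
# messages[src][dst] is the payload core `src` sends to core `dst`.
# Each leader concatenates its workers' payloads per destination node,
# so the network sees fewer, larger messages than a flat all-to-all.

def aggregate_all_to_all(messages, cores_per_node, num_leaders):
    """Return (leader_core, dst_node, combined_payload) wire messages."""
    num_cores = len(messages)
    num_nodes = num_cores // cores_per_node
    workers_per_leader = cores_per_node // num_leaders
    wire_messages = []
    for node in range(num_nodes):
        base = node * cores_per_node
        for leader in range(num_leaders):
            # Cores this leader aggregates for (its "workers").
            workers = range(base + leader * workers_per_leader,
                            base + (leader + 1) * workers_per_leader)
            for dst_node in range(num_nodes):
                dst_cores = range(dst_node * cores_per_node,
                                  (dst_node + 1) * cores_per_node)
                combined = [messages[w][d] for w in workers for d in dst_cores]
                wire_messages.append((base + leader, dst_node, combined))
    return wire_messages
```

With one leader per node the exchange produces `num_nodes**2` large messages; with k leaders it produces `k * num_nodes**2` smaller ones carrying the same total payload, which is exactly the message-count versus message-size trade-off the abstract discusses.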
ISBN (digital): 9781728199122
ISBN (print): 9781728199122
The current power grid system in the Philippines allows consumers to use any amount of electrical energy as long as they can pay for it. In response, utilities must meet demand at all times by purchasing energy from the generating plants, whatever the cost may be. Today, these distribution utilities have started to explore the use of distributed energy resources to cater to these varying demands, especially peak demand. Moreover, the number of consumers implementing transactive energy schemes continues to increase. If this interaction between the utility and the consumers is well coordinated, reductions in electricity price and energy consumption become possible. This paper describes a computing tool that (1) solves for the electricity tariff, (2) produces a schedule of energy dispatch from distributed generators and storage systems, and (3) performs demand-side management by scheduling appliance operations, in order to reduce the cost of customer energy consumption while maximizing the allowable utility profit.
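The appliance-scheduling part of such demand-side management can be sketched with a deliberately simple heuristic: run each appliance in its cheapest contiguous tariff window. This is a hypothetical illustration of the idea, not the paper's actual optimization.

```python
# Sketch: schedule each appliance into the cheapest contiguous window
# of an hourly tariff. Greedy and per-appliance; a real tool would
# co-optimize tariff, dispatch, and appliance schedules together.

def schedule_appliances(tariff, appliances):
    """tariff: price per hour (e.g. 24 floats).
    appliances: list of (name, power_kw, duration_h) tuples.
    Returns {name: start_hour} minimizing each appliance's energy cost."""
    schedule = {}
    for name, power_kw, duration in appliances:
        costs = [
            power_kw * sum(tariff[s:s + duration])
            for s in range(len(tariff) - duration + 1)
        ]
        schedule[name] = min(range(len(costs)), key=costs.__getitem__)
    return schedule
```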
Large Language Models (LLMs) have achieved significant performance in various natural language processing tasks but also pose safety and ethical threats, thus requiring red teaming and alignment processes to bolster t...
The resilience and dependability of the power distribution system have been increased by interconnecting several microgrids to create interconnected microgrids. When interconnecting microgrids, there is ambiguity regarding balanced power sharing across them. This study takes into account the power mismatch between the generating capacity of distributed energy sources and the load demands of all the microgrids, and proposes a smart interconnection method (SIM) for tying together two microgrids that are interconnected to the utility grid. The advantage of the suggested technique is its easy, real-time construction of circuitry that is adjustable and simple. To evaluate the effectiveness of the interconnection approach, real-time simulations are carried out using a real-time digital simulator (RTDS). The simulation results confirm the effectiveness of the proposed smart interconnection method for connecting two microgrids in grid-connected mode.
ISBN (print): 9780791885031
We present an efficient Monte Carlo based probabilistic fracture mechanics simulation implementation for heterogeneous high-performance computing (HPC) architectures including CPUs and GPUs. The specific application focuses on large heavy-duty gas turbine rotor components for the energy sector. Reliable probabilistic risk quantification requires the simulation of millions to billions of Monte Carlo (MC) samples. We apply a modified Runge-Kutta algorithm to numerically solve the fatigue crack growth for this large number of cracks across varying initial crack sizes, locations, material and service conditions. This compute-intensive simulation has already been demonstrated to perform efficiently and scale on parallel and distributed HPC architectures comprising hundreds of CPUs, utilizing the Message Passing Interface (MPI) paradigm. In this work, we go a step further and include GPUs in our parallelization strategy. We develop a load distribution scheme to share one or more GPUs on compute nodes distributed over a network. We detail the technical challenges and solution strategies in performing the simulations efficiently on GPUs. We show that the key computation of the modified Runge-Kutta integration step speeds up by over two orders of magnitude on a typical GPU compared to a single-threaded CPU. This is supported by our use of GPU textures for efficient interpolation of the multi-dimensional tables utilized in the implementation. We demonstrate weak and strong scaling of our GPU implementation, i.e., that we can efficiently utilize a large number of GPUs/CPUs to solve for more MC samples or to reduce the computational turn-around time, respectively. On seven different GPUs spanning four generations, the presented probabilistic fracture mechanics simulation tool ProbFM achieves a speed-up ranging from 16.4x to 47.4x compared to a single-threaded CPU implementation.
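The per-crack integration step can be sketched with classical fourth-order Runge-Kutta applied to a Paris-law growth rate. This is a stand-in under assumed units (MPa, metres), not the paper's modified Runge-Kutta scheme or its material model.

```python
import math

def paris_growth_rate(a, C, m, delta_sigma):
    """Paris-law crack growth rate da/dN for crack length a (metres),
    with stress range delta_sigma in MPa, so delta_K is in MPa*sqrt(m)."""
    delta_K = delta_sigma * math.sqrt(math.pi * a)
    return C * delta_K ** m

def rk4_crack_growth(a0, cycles, steps, C, m, delta_sigma):
    """Integrate crack length over `cycles` load cycles with classical
    RK4. A simplified stand-in for the paper's modified Runge-Kutta."""
    a, h = a0, cycles / steps
    for _ in range(steps):
        k1 = paris_growth_rate(a, C, m, delta_sigma)
        k2 = paris_growth_rate(a + 0.5 * h * k1, C, m, delta_sigma)
        k3 = paris_growth_rate(a + 0.5 * h * k2, C, m, delta_sigma)
        k4 = paris_growth_rate(a + h * k3, C, m, delta_sigma)
        a += h * (k1 + 2 * k2 + 2 * k3 + k4) / 6.0
    return a
```

In a Monte Carlo setting each sample draws its own `a0`, material constants `C` and `m`, and loading `delta_sigma`, and this integration is what the GPU kernels evaluate millions of times independently.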
The orderly utilization of electric vehicles means that the charging time of electric vehicles is regulated to improve the overall load characteristics and reduce the adverse effects on the power grid. Due to its low ...
ISBN (print): 9783030816919; 9783030816902
We consider 2D and 3D models of the transport of suspended particles. The approximation of these models is considered using the example of the two-dimensional diffusion-convection equation. We use discrete analogs of the convective and diffusion transfer operators under the assumption of partial filling of cells; the geometry of the computational domain is described by a filling function. We solve the suspended-particle transport problem using a difference scheme that is a linear combination of the Upwind and the Standard Leapfrog difference schemes, with weight coefficients obtained from the condition of minimizing the approximation error. The scheme is designed to solve the impurity transfer problem for large grid Peclet numbers. We have developed parallel algorithms for solving this problem on multiprocessor systems with distributed memory. The results of numerical experiments support conclusions about the advantages of 3D models of suspended-particle transport over 2D ones.
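The weighted Upwind/Leapfrog combination can be sketched in one dimension as follows. This is a minimal illustration with periodic boundaries; the weight w is left as a free parameter here, whereas the paper derives the weight coefficients by minimizing the approximation error.

```python
# Sketch: one time step of 1D linear advection u_t + a*u_x = 0 using a
# weighted combination of the Upwind and Standard Leapfrog schemes.
# c = a*dt/dx is the Courant number; w in [0, 1] weights the Upwind part.

def advect_step(u_prev, u_curr, c, w):
    """Return u at the next time level on a periodic grid."""
    n = len(u_curr)
    u_next = [0.0] * n
    for i in range(n):
        # First-order Upwind (dissipative, monotone for 0 < c <= 1).
        upwind = u_curr[i] - c * (u_curr[i] - u_curr[i - 1])
        # Standard Leapfrog (second-order, three time levels).
        leapfrog = u_prev[i] - c * (u_curr[(i + 1) % n] - u_curr[i - 1])
        u_next[i] = w * upwind + (1.0 - w) * leapfrog
    return u_next
```

With w = 1 and c = 1 the step reduces to an exact shift of the profile, which makes the scheme easy to sanity-check before tuning the weight.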
Recurrent neural networks (RNNs) have become common models in artificial intelligence for processing temporal sequence tasks such as speech recognition, text analysis, and natural language processing. To speed up RNN inference, previous research proposed model sparse pruning techniques. However, the pruning rate of existing sparse pruning algorithms is limited by pruning granularity and hardware friendliness. To approximate unstructured pruning, this paper proposes the Large Region Balanced Sparse (LRBS) pruning method, which does not limit the sub-matrix shape and effectively improves the pruning rate. Furthermore, we propose a Sparse Matrix-Vector Multiplication Accelerator for RNNs (SMVAR), which adopts a non-blocking data distribution structure to execute large-region irregular matrix multiplication efficiently. To further improve accelerator performance, SMVAR performs fine-grained adjustment of the pipeline between macro-operations to reduce idle time in the compute components. In addition, exploiting the coarse-grained block characteristics of the LRBS algorithm, we develop coarse-grained parallelism in the accelerator with a multiple compute units (CUs) structure. Experiments show that the pruning rate of our proposed LRBS is 1.25x-2.5x higher than that of existing pruning algorithms. Compared with existing work, execution efficiency is improved by 2.02x-35.9x in the same application scenario.
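The general idea of balanced, region-wise magnitude pruning can be sketched as follows. This is an illustrative simplification, not the paper's exact LRBS algorithm: it keeps a fixed number of largest-magnitude weights in each column region of every row, which yields the balanced load that hardware accelerators exploit.

```python
# Sketch: balanced region pruning of a weight matrix. Each row region of
# width `region_cols` keeps only its `keep_per_region` largest-magnitude
# entries, so every region carries the same number of non-zeros.

def balanced_region_prune(matrix, region_cols, keep_per_region):
    """Return a pruned copy of `matrix` (list of rows of floats)."""
    pruned = []
    for row in matrix:
        new_row = [0.0] * len(row)
        for start in range(0, len(row), region_cols):
            region = range(start, min(start + region_cols, len(row)))
            # Indices in this region, sorted by descending magnitude.
            top = sorted(region, key=lambda j: abs(row[j]), reverse=True)
            for j in top[:keep_per_region]:
                new_row[j] = row[j]
        pruned.append(new_row)
    return pruned
```

Because every region ends up with the same non-zero count, the compute units of an accelerator can be statically load-balanced across regions, which is the hardware-friendliness property balanced sparsity trades against the flexibility of fully unstructured pruning.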
ISBN (print): 9781728173863
It has become obvious that traditional platforms and processing paradigms cannot store and process huge amounts of data. The only solution is to use specially designed ad-hoc platforms/architectures based on parallelization that distribute data across large clusters of physical machines. Data-intensive computing is a subclass of the general parallel computing concept, based on dividing large amounts of data into independent parts and processing them in parallel. This paper reviews the alternative parallelization architectures and examines the MapReduce programming model associated with distributed, massively parallel processing of large amounts of data. The main objective of this study is to investigate the conceptual foundation behind the very popular data-driven computation model MapReduce.
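The MapReduce model's map/shuffle/reduce phases can be sketched with the canonical word-count example; here the phases run sequentially in one process, whereas a real framework executes map and reduce tasks in parallel across a cluster.

```python
from collections import defaultdict
from itertools import chain

def map_phase(document):
    """Map: emit (word, 1) pairs for one input split."""
    return [(word, 1) for word in document.split()]

def shuffle(pairs):
    """Shuffle: group intermediate values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Reduce: fold each key's values into a single result."""
    return key, sum(values)

def mapreduce_wordcount(documents):
    # Each document stands in for an independent input split that a real
    # framework would hand to a separate worker.
    intermediate = chain.from_iterable(map_phase(d) for d in documents)
    return dict(reduce_phase(k, v) for k, v in shuffle(intermediate).items())
```

The key property the model relies on is that map tasks are independent of each other, and reduce tasks are independent per key, so both phases scale out by simply adding machines.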
ISBN (print): 9781665405225
Distributed Denial of Service (DDoS) attacks disrupt global network services, mainly by overwhelming the victim host with requests originating from multiple traffic sources. DDoS attacks are currently on the rise due to the ease of executing and renting distributed architectures such as the Internet of Things (IoT) and cloud infrastructures, which can result in substantial revenue losses. Therefore, the detection and prevention of DDoS attacks are currently topics of high interest. In this study, we use traffic flow information to determine whether a specific flow is associated with a DDoS attack. We used traditional Machine Learning (ML) methods to develop our DDoS detector and applied an exhaustive hyperparameter search to optimize their detection capability. Such lightweight approaches are suitable for resource-constrained environments such as IoT because they reduce computing overhead. Our evaluation shows that most algorithms provide satisfactory results, with Random Forests achieving up to 99% detection accuracy, similar to the performance of current deep learning solutions for DDoS detection.
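The exhaustive hyperparameter search over a flow-based detector can be sketched as below. The feature names, thresholds, and toy classifier are hypothetical stand-ins; the study itself tunes traditional ML models such as Random Forests rather than this threshold rule.

```python
from itertools import product

def detect(flow, pkt_rate_thresh, uniq_src_thresh):
    """Toy flow classifier: flag a flow as DDoS if its packet rate and
    source-address fan-in both exceed the given thresholds."""
    return (flow["pkt_rate"] > pkt_rate_thresh
            and flow["uniq_srcs"] > uniq_src_thresh)

def grid_search(flows, labels, pkt_grid, src_grid):
    """Exhaustively try every threshold combination and return the
    (accuracy, pkt_thresh, src_thresh) triple maximizing accuracy."""
    best = None
    for p, s in product(pkt_grid, src_grid):
        acc = sum(detect(f, p, s) == y
                  for f, y in zip(flows, labels)) / len(flows)
        if best is None or acc > best[0]:
            best = (acc, p, s)
    return best
```

The same loop structure applies when the "hyperparameters" are, for example, a Random Forest's tree count and depth: enumerate the grid, score each configuration on labeled flows, and keep the best.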