Recently, a series of parallel loop self-scheduling schemes have been proposed, especially for heterogeneous cluster systems. However, they employed the MPI programming model to construct the applications without considering whether the computing nodes have a multicore architecture. As a result, every processor core has to communicate directly with the master node to request new tasks, even though the processor cores on the same node can communicate with each other through the underlying shared memory. To address this higher communication overhead, in this paper we propose adopting a hybrid MPI and OpenMP programming model to design two-level parallel loop self-scheduling schemes. In the first level, each computing node runs an MPI process for inter-node communications. In the second level, each processor core runs an OpenMP thread to execute the iterations assigned to its resident node. Experimental results show that our method outperforms previous work.
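The two-level idea can be illustrated with a small scheduling simulation. This is a minimal sketch, not the authors' implementation: it assumes guided self-scheduling at the first level (each request receives the remaining iterations divided by the node count) and a static block split among cores at the second level. The function names `guided_chunks` and `split_among_cores` and the cluster sizes are hypothetical, and no MPI or OpenMP calls are made.

```python
from math import ceil

def guided_chunks(total_iters, num_nodes, min_chunk=1):
    """First level (assumed guided self-scheduling): each request to the
    master returns ceil(remaining / num_nodes) iterations."""
    chunks, start = [], 0
    while start < total_iters:
        remaining = total_iters - start
        size = min(remaining, max(min_chunk, ceil(remaining / num_nodes)))
        chunks.append((start, start + size))
        start += size
    return chunks

def split_among_cores(chunk, num_cores):
    """Second level (assumed static block split): a node divides its chunk
    among its cores locally, without contacting the master again."""
    start, end = chunk
    base, extra = divmod(end - start, num_cores)
    parts, lo = [], start
    for core in range(num_cores):
        hi = lo + base + (1 if core < extra else 0)
        parts.append((lo, hi))  # may be empty when the chunk is small
        lo = hi
    return parts

# Hypothetical cluster: 4 nodes with 8 cores each, 1000 loop iterations.
chunks = guided_chunks(total_iters=1000, num_nodes=4)
per_core = [split_among_cores(c, num_cores=8) for c in chunks]

# Only one master request is needed per chunk, however many cores a node has.
master_requests = len(chunks)
```

In this sketch the number of master requests depends only on the number of chunks, whereas a flat scheme in which every core requests work itself would multiply the number of requesters by the core count.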
In this study, a GPU-accelerated improved mixed Lagrangian-Eulerian (IMLE) method is proposed to solve the three-dimensional incompressible Navier-Stokes equations. To improve the prediction accuracy, the proposed IMLE method approximates the total derivative term in the Lagrangian sense, while the spatial derivative terms are approximated on Eulerian coordinates. Data are transferred accurately from Lagrangian particles to Eulerian grids by the moving least squares (MLS) interpolation method. The velocity-pressure decoupling issue is overcome by adopting a pressure-free projection method in which the pressure field is calculated by solving a pressure Poisson equation (PPE). The MLS interpolation is time consuming because it is a pointwise scheme in which a local matrix equation must be solved at each grid point. In addition, the discretized PPE forms a large sparse matrix that is computationally intensive to solve with the conjugate gradient (CG) method. Therefore, we resort to CUDA and OpenMP programming to accelerate the computation. In this study, the multi-GPU code runs up to 27 times faster than the multithreaded CPU code. (C) 2019 Elsevier Ltd. All rights reserved.
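To make the cost of the CG solve concrete, the following is a minimal sketch of the unpreconditioned conjugate gradient iteration. It is not the paper's solver: as a stand-in for the discretized PPE it assumes a one-dimensional Laplacian (2 on the diagonal, -1 on the off-diagonals), applied matrix-free, and all function names are hypothetical. The matrix-vector product and the dot products are the parts a GPU or multithreaded version would typically parallelize.

```python
def apply_laplacian_1d(x):
    """Matrix-free product A @ x for the symmetric positive definite
    tridiagonal matrix with 2 on the diagonal and -1 beside it."""
    n = len(x)
    y = [0.0] * n
    for i in range(n):
        y[i] = 2.0 * x[i]
        if i > 0:
            y[i] -= x[i - 1]
        if i < n - 1:
            y[i] -= x[i + 1]
    return y

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def conjugate_gradient(apply_a, b, tol=1e-10, max_iter=1000):
    """Solve A x = b for SPD A, starting from x = 0."""
    x = [0.0] * len(b)
    r = list(b)          # residual b - A x for the zero initial guess
    p = list(r)          # first search direction
    rs_old = dot(r, r)
    for _ in range(max_iter):
        if rs_old ** 0.5 < tol:
            break
        ap = apply_a(p)
        alpha = rs_old / dot(p, ap)
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, ap)]
        rs_new = dot(r, r)
        p = [ri + (rs_new / rs_old) * pi for ri, pi in zip(r, p)]
        rs_old = rs_new
    return x

b = [1.0] * 16
x = conjugate_gradient(apply_laplacian_1d, b)
residual = max(abs(bi - yi) for bi, yi in zip(b, apply_laplacian_1d(x)))
```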
Purpose: To present the implementation of a new option for parallel processing of the EGSnrc Monte Carlo system using the OpenMP API, as an alternative to the provided method based on the use of a batch queuing system (BQS). Methods: The parallel solution presented, called OMP_EGS, makes use of OpenMP features to control the workload distribution between the compute units. These features were inserted into the original EGSnrc source code through properly defined macros. To validate the platform, the possibility of producing results in exact agreement with the serial implementation was assessed. The performance of OMP_EGS was evaluated against the BQS method in terms of parallel speedup and efficiency. Results: Because the OpenMP features can be activated or deactivated through compilation options, the implementation of the platform allowed the direct recovery of the original serial implementation. The validation tests showed that OMP_EGS was able to reproduce exactly the same results as the serial implementation. The performance and scalability tests showed that OMP_EGS is a better alternative than the EGSnrc BQS parallel implementation, in terms of both runtime and parallel efficiency. Conclusions: The presented solution has several advantages over the BQS-based parallel implementation available for the EGSnrc system. One of the main advantages is that, in contrast to the BQS alternative, it can be implemented using different compilers and operating systems, which makes it a compact and portable solution that can be used in a wide range of working environments. It does not introduce artifacts in the simulated distributions, as it only handles the distribution of work among the available computing resources, and it proved to perform better. (C) 2017 American Association of Physicists in Medicine.
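One common way to obtain parallel Monte Carlo results that agree exactly with a serial run is to give each batch of histories its own seeded random stream, so that the outcome does not depend on which worker processes which batch. The sketch below illustrates that general idea only. It is not how OMP_EGS or EGSnrc is implemented, the toy "history" (a point-in-quarter-circle test) and the names `run_batch` and `simulate` are hypothetical, and Python threads stand in for OpenMP threads.

```python
import random
from concurrent.futures import ThreadPoolExecutor

def run_batch(batch_id, histories_per_batch, base_seed=12345):
    """Simulate one batch with a private random stream seeded by batch id."""
    rng = random.Random(base_seed + batch_id)
    hits = 0
    for _ in range(histories_per_batch):
        x, y = rng.random(), rng.random()
        if x * x + y * y <= 1.0:
            hits += 1
    return hits

def simulate(num_batches, histories_per_batch, workers):
    """Distribute batches over a pool of workers and sum the integer tallies.
    Integer sums are exact, so the total is independent of scheduling."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        counts = pool.map(lambda b: run_batch(b, histories_per_batch),
                          range(num_batches))
        return sum(counts)

serial_hits = simulate(num_batches=8, histories_per_batch=2000, workers=1)
parallel_hits = simulate(num_batches=8, histories_per_batch=2000, workers=4)
```

Because only the distribution of batches changes between the two calls, `serial_hits` and `parallel_hits` are identical in this sketch.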
This paper presents a new parallel methodology for calculating the determinant of matrices of the order n, with computational complexity O(n), using the Gauss-Jordan Elimination Method and Chio's Rule as references. We intend to present our step-by-step methodology using clear mathematical language, where we will demonstrate how to calculate the determinant of a matrix of the order n in an analytical format. We will also present a computational model with one sequential algorithm and one parallel algorithm using a pseudo-code.
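For reference, Chio's rule (Chio condensation) reduces an n x n determinant to an (n-1) x (n-1) determinant whose entries are 2 x 2 determinants built around the pivot a11, with det(A) = det(B) / a11^(n-2). The following is a minimal sequential sketch of that classical rule, not the paper's methodology or its parallel algorithm. The function name `chio_determinant` is hypothetical, exact rational arithmetic is an illustrative choice, and the row swap for a zero pivot is an added assumption. The entries of each condensed matrix are independent of one another, which is the kind of structure a parallel version could exploit.

```python
from fractions import Fraction

def chio_determinant(matrix):
    """Determinant of a square matrix (order >= 1) by repeated Chio condensation."""
    a = [[Fraction(v) for v in row] for row in matrix]
    n = len(a)
    sign = 1
    scale = Fraction(1)  # accumulated divisor: product of pivot**(k - 2)
    while n > 2:
        if a[0][0] == 0:
            # Assumed handling: swap in a row with a nonzero leading entry.
            swap = next((i for i in range(1, n) if a[i][0] != 0), None)
            if swap is None:
                return Fraction(0)  # first column is all zeros
            a[0], a[swap] = a[swap], a[0]
            sign = -sign
        pivot = a[0][0]
        # Each new entry is the 2x2 determinant formed with the pivot row and column.
        a = [[pivot * a[i][j] - a[i][0] * a[0][j] for j in range(1, n)]
             for i in range(1, n)]
        scale *= pivot ** (n - 2)
        n -= 1
    if n == 1:
        det = a[0][0]
    else:
        det = a[0][0] * a[1][1] - a[0][1] * a[1][0]
    return sign * det / scale

example = [[2, 1, 3],
           [0, 4, 1],
           [5, 2, 6]]
det_example = chio_determinant(example)
```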
Medical applications increasingly require complex calculations under tight processing-time constraints. These applications are therefore oriented towards the integration of high-performance embedded architectures. In this context, the detection of cardiac abnormalities remains a high priority in emergency medicine. ECG analysis is a complex task that requires significant computing time, since a large amount of information must be analyzed in parallel at high frequencies. Real-time processing is the biggest challenge for researchers working on applications with time constraints, such as cardiac activity monitoring. This work evaluates the Adaptive Dual Threshold Filter (ADTF) algorithm, dedicated to ECG signal filtering, on two embedded architectures: a Raspberry 3B+ and an Odroid XU4. The implementation is based on C/C++ and OpenMP to exploit the parallelism of these architectures. The evaluation was validated using several ECG signals from the MIT-BIH Arrhythmia database with a sampling frequency of 360 Hz. Based on an algorithmic complexity study and a parallelization of the functional blocks with significant workloads, the evaluation results show a mean execution time of 7.5 ms on the Raspberry 3B+ and 0.34 ms on the Odroid XU4. With an efficient parallelization on the Odroid XU4 architecture, real-time performance can be achieved.
This paper focuses on parallel implementations of three two-dimensional explicit numerical methods on the Intel(R) Xeon(R) Scalable Processor and the Knights Landing coprocessor. In this study, the performance of a hybrid parallel programming approach with message passing interface (MPI) and Open Multi-Processing (OpenMP) and of a pure MPI implementation used with two thread binding policies is compared with that of an improved OpenMP-based implementation in three explicit finite-difference methods for solving partial differential equations on shared-memory multicore and manycore systems. Specifically, the improved OpenMP-based version synchronizes adjacent threads and eliminates the implicit barriers of a naive OpenMP-based implementation. The experiments show that the most suitable approach depends on several characteristics related to the nonuniform memory access (NUMA) effect and load balancing, such as the size of the MPI domain and the number of synchronization points used in the parallel implementation. In algorithms that use four and five synchronization points, the hybrid MPI/OpenMP approaches yielded better speedups than the other versions in runs performed on both systems. The pure MPI-based strategy, however, achieved better results than the other proposed approaches in the method that employs only one synchronization point.
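The role of a synchronization point in an explicit method can be seen in a small example. The sketch below is not one of the three methods studied in the paper: it assumes the forward-time, centered-space (FTCS) scheme for the two-dimensional heat equation, with hypothetical names `ftcs_step` and `run` and an arbitrary 9 x 9 grid. Within one time step every interior point depends only on values from the previous step, so rows can be updated independently; the whole grid must be finished before the next step starts, which gives one synchronization point per step.

```python
def ftcs_step(u, r):
    """One explicit step of u_t = alpha * (u_xx + u_yy) on a square grid.
    r = alpha * dt / h**2 and should satisfy r <= 0.25 for stability.
    Boundary values are held fixed."""
    ny, nx = len(u), len(u[0])
    new = [row[:] for row in u]
    for i in range(1, ny - 1):        # rows are independent within a step
        for j in range(1, nx - 1):
            new[i][j] = u[i][j] + r * (
                u[i - 1][j] + u[i + 1][j] + u[i][j - 1] + u[i][j + 1]
                - 4.0 * u[i][j]
            )
    return new

def run(u, r, steps):
    for _ in range(steps):
        # Synchronization point: the full grid is updated before moving on.
        u = ftcs_step(u, r)
    return u

n = 9
grid = [[0.0] * n for _ in range(n)]
grid[4][4] = 1.0                      # unit heat source at the center
result = run(grid, r=0.2, steps=3)
total = sum(sum(row) for row in result)
```

After three steps the heat has not yet reached the fixed boundary, so `total` stays at 1 up to rounding in this sketch.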