This article presents the design and optimization of GPU kernels for numerical integration, as applied in its standard form in finite-element codes. The optimization process employs autotuning, with the main emphasis on the placement of variables in shared memory or registers. OpenCL and a first-order finite-element method (FEM) approximation are selected for the code design, but the techniques are also applicable to the CUDA programming model and to other types of finite-element discretizations (including discontinuous Galerkin and isogeometric). The autotuning optimization is performed for four example graphics processors, and the results are discussed.
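The abstract does not include code, but the autotuning loop it describes can be sketched at the host level. The sketch below is a generic variant selector in C++, not the authors' OpenCL tuner; `time_variant` and `autotune` are illustrative names, and in a real FEM autotuner each variant would be the same integration kernel compiled with a different shared-memory/register placement of its variables.

```cpp
#include <chrono>
#include <functional>
#include <string>
#include <utility>
#include <vector>

// Wall-clock time of one variant over `reps` runs.
static double time_variant(const std::function<void()>& run, int reps) {
    auto t0 = std::chrono::steady_clock::now();
    for (int i = 0; i < reps; ++i) run();
    return std::chrono::duration<double>(std::chrono::steady_clock::now() - t0).count();
}

// Benchmark every named variant and return the index of the fastest one.
int autotune(const std::vector<std::pair<std::string, std::function<void()>>>& variants,
             int reps = 20) {
    int best = 0;
    double best_t = -1.0;
    for (size_t i = 0; i < variants.size(); ++i) {
        double t = time_variant(variants[i].second, reps);
        if (best_t < 0.0 || t < best_t) { best_t = t; best = static_cast<int>(i); }
    }
    return best;
}
```

On a GPU, the candidate set would additionally be pruned by hardware limits (shared-memory size, register file per thread), which is why the tuning result differs between the four tested processors.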
Increasing attention has been given to providing service-level objectives (SLOs) in stream-processing applications, due to performance and energy requirements and the need to limit resource usage while improving system utilization. Since current and next-generation computing systems intrinsically offer parallel architectures, software must naturally exploit this parallelism. Implementing and meeting SLOs in existing applications is not a trivial task for application programmers, since the software development process, besides parallelism exploitation, requires implementing autonomic algorithms or strategies. This is a system-oriented programming approach and requires managing multiple knobs and sensors (e.g., the number of threads to use, the clock frequency of the cores) so that the system can self-adapt at runtime. In this work, we introduce a new and simpler way to define SLOs in the application's source code, abstracting from the programmer all details of the self-adaptive system implementation. The application programmer specifies which parts of the code to parallelize and the related SLOs to be enforced. To reach this goal, source-to-source code transformation rules are implemented in our compiler, which automatically generates self-adaptive strategies that enforce, at runtime, the user-expressed objectives. The experiments show promising results, with simpler, effective, and efficient SLO implementations for real-world applications.
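A minimal sketch of the kind of self-adaptive strategy such a compiler might generate, assuming a throughput SLO and a toy service whose throughput scales linearly with thread count; `adapt_threads` and `settle` are hypothetical names, not the paper's API.

```cpp
// One control step: add a thread while the measured throughput misses the
// SLO, remove one when we overshoot it by more than 20% (hysteresis keeps
// the controller from oscillating around the target).
int adapt_threads(int current, double measured, double slo, int max_threads) {
    if (measured < slo && current < max_threads) return current + 1;
    if (measured > 1.2 * slo && current > 1) return current - 1;
    return current;
}

// Run the feedback loop against a toy service whose throughput is
// n_threads * per_thread_rate, and report where the knob settles.
int settle(double per_thread_rate, double slo, int max_threads, int steps) {
    int n = 1;
    for (int i = 0; i < steps; ++i)
        n = adapt_threads(n, n * per_thread_rate, slo, max_threads);
    return n;
}
```

The generated strategies in the paper manage more knobs than thread count (e.g., core clock frequency), but they follow the same sense-decide-act loop.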
The categorisation of network packets according to multiple parameters, such as sender and receiver addresses, is called packet classification. Packet classification lies at the core of Software-Defined Networking (SDN)-based network applications. Due to the increasing speed of network traffic, there is an urgent need for packet classification at higher speeds. Although it is possible to accelerate packet classification algorithms through hardware implementation, this solution imposes high costs and offers limited development capacity. On the other hand, current software methods for this problem are relatively slow. A practical solution is to parallelise packet classification on multi-core processors. In this study, the Thread, Parallel Patterns Library (PPL), Open Multi-Processing (OpenMP), and Threading Building Blocks (TBB) libraries are examined and used to parallelise three packet classification algorithms: tuple space search, tuple pruning search, and hierarchical tree. According to the results, the type of algorithm and the rulesets may influence the performance of the parallelisation libraries. In general, the TBB-based method shows the best performance, owing to its work-stealing mechanism, and can accelerate the classification process by up to 8.3 times on a system with a quad-core processor.
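To make the tuple space search concrete, here is a simplified C++ sketch: rules are grouped by their (source, destination) prefix-length tuple into one exact-match table each, and a packet batch is split across plain `std::thread` workers. This is an illustration, not the paper's code; it uses standard-library threads rather than TBB/PPL/OpenMP, and the rule format is reduced to two address fields.

```cpp
#include <cstdint>
#include <map>
#include <thread>
#include <utility>
#include <vector>

struct Rule { uint32_t src, dst; int srcLen, dstLen; int id; };

static uint32_t maskBits(uint32_t v, int len) {
    return len == 0 ? 0 : v & (~0u << (32 - len));
}

// Tuple space: one exact-match table per (srcLen, dstLen) combination.
struct Classifier {
    std::map<std::pair<int,int>, std::map<std::pair<uint32_t,uint32_t>, int>> tuples;
    void add(const Rule& r) {
        tuples[{r.srcLen, r.dstLen}]
              [{maskBits(r.src, r.srcLen), maskBits(r.dst, r.dstLen)}] = r.id;
    }
    // Probe every tuple with the packet's masked header; lowest rule id
    // (highest priority) wins.
    int classify(uint32_t src, uint32_t dst) const {
        int best = -1;
        for (const auto& tp : tuples) {
            const auto& lens = tp.first;
            auto it = tp.second.find({maskBits(src, lens.first),
                                      maskBits(dst, lens.second)});
            if (it != tp.second.end() && (best < 0 || it->second < best))
                best = it->second;
        }
        return best;
    }
};

// Classify a batch of packets, striding the batch across worker threads.
std::vector<int> classify_parallel(const Classifier& c,
                                   const std::vector<std::pair<uint32_t,uint32_t>>& pkts,
                                   int nthreads) {
    std::vector<int> out(pkts.size());
    std::vector<std::thread> workers;
    for (int t = 0; t < nthreads; ++t)
        workers.emplace_back([&, t] {
            for (size_t i = t; i < pkts.size(); i += nthreads)
                out[i] = c.classify(pkts[i].first, pkts[i].second);
        });
    for (auto& w : workers) w.join();
    return out;
}
```

A work-stealing runtime such as TBB replaces the fixed stride with dynamic load balancing, which is what gives it the edge on skewed rulesets.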
Building on a significant amount of current research that examines the idea of platform-portable parallel code across different types of processor families, this work focuses on two sets of related questions. First, u...
In this article, we address an efficient solver for the Maxwell eigenvalue problem in lossy cavity resonators. The curl-curl equation for the electric field is discretized using curved tetrahedral incomplete quadratic finite elements, resulting in a nonlinear eigenvalue formulation. The eigenvalue problem is solved efficiently using a contour integral method (CIM). This method enables accurate computation of all eigenvalues within a predefined region and is implemented in a highly parallelized framework to enhance the performance of the algorithm. Numerical results are presented to demonstrate the accuracy and efficiency of the proposed method.
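The abstract does not reproduce the CIM formulation, but one standard variant (Beyn's contour integral algorithm) for a nonlinear eigenvalue problem T(λ)v = 0 can be sketched as follows; the contour Γ, the probing matrix, and the quadrature rule are assumptions of this sketch, not details taken from the paper.

```latex
% Moments of the resolvent over a closed contour \Gamma enclosing the
% sought eigenvalues, probed with a (random) rectangular matrix \hat{V}:
A_0 = \frac{1}{2\pi i} \oint_{\Gamma} T(z)^{-1} \hat{V}\, dz, \qquad
A_1 = \frac{1}{2\pi i} \oint_{\Gamma} z\, T(z)^{-1} \hat{V}\, dz .

% A rank-revealing SVD of A_0 reduces the problem to a small linear one:
A_0 = V \Sigma W^{H}, \qquad B = V^{H} A_1 W \Sigma^{-1},

% whose eigenvalues are the eigenvalues of T(\cdot) inside \Gamma.
% The contour integrals are evaluated by quadrature, e.g. the trapezoidal
% rule on a circle z_k = c + r\, e^{2\pi i k / N}.
```

The N linear solves T(z_k)^{-1}V̂ at the quadrature nodes are mutually independent, which is what makes the method well suited to the highly parallelized framework the abstract mentions.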
Monte Carlo (MC) is known to be the most accurate dose calculation method. However, MC suffers from high computational cost, as a large number of particles must be simulated to achieve the desired statistical uncertainty. Parallelizing the simulation across many GPU threads reduces the time required to reach the desired uncertainty in an MC simulation. In this article, we present DOSXYZgpu, a GPU implementation of the EGSnrc code written in CUDA Fortran. This work builds on EGSnrc/DOSXYZnrc, a well-validated code that is popular among medical physicists. To transport particles between two consecutive interactions, we developed an algorithm that handles several thousand histories per warp. The DOSXYZgpu implementation is evaluated against the original sequential EGSnrc/DOSXYZnrc. A maximum speedup of 205 times is achieved while the statistical uncertainty of the simulation is preserved. A t-test indicates that, for more than 95% of the voxels, there is no significant difference between the results obtained from the GPU and the CPU.
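The warp-level history batching of DOSXYZgpu is CUDA-specific, but the core idea — each parallel worker simulating an independent slice of the histories with its own RNG stream — can be sketched on the CPU. The slab-transmission model below is an illustrative stand-in for real photon transport, not EGSnrc physics; its analytic answer, exp(−μ·depth), makes the estimate easy to check.

```cpp
#include <cmath>
#include <random>
#include <thread>
#include <vector>

// Each worker simulates its own batch of photon histories with an
// independent, deterministically seeded RNG, mirroring how GPU threads
// each own a slice of the total history count.
double transmission_fraction(double mu, double depth,
                             long histories_per_thread, int nthreads) {
    std::vector<long> transmitted(nthreads, 0);
    std::vector<std::thread> workers;
    for (int t = 0; t < nthreads; ++t)
        workers.emplace_back([&, t] {
            std::mt19937 rng(1234u + t);  // per-worker seed -> independent streams
            std::uniform_real_distribution<double> u(0.0, 1.0);
            long hits = 0;
            for (long i = 0; i < histories_per_thread; ++i) {
                double s = -std::log(1.0 - u(rng)) / mu;  // sampled free path
                if (s > depth) ++hits;                    // photon escapes the slab
            }
            transmitted[t] = hits;
        });
    for (auto& w : workers) w.join();
    long total = 0;
    for (long h : transmitted) total += h;
    return double(total) / (double(histories_per_thread) * nthreads);
}
```

Because the per-worker tallies are summed only at the end, adding workers multiplies the history count (and divides the time to a target uncertainty) without changing the estimator.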
In the modern filmmaking industry, image matting is one of the common tasks in video special effects and a necessary intermediate step in computer vision. It pulls the foreground object out of the background of an image by estimating the alpha values. However, matting high-resolution images can be significantly slow, because the computation is complex and proportional to the size of the unknown region. To improve performance, we parallelized an existing sequential alpha-matting code with OpenMP for execution on multicore servers. We present and discuss the algorithm and the experimental results from the perspective of the parallel application developer. The development takes little effort, and the results show a significant performance improvement for the entire program.
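As an illustration of why matting parallelizes well, the compositing equation I = αF + (1−α)B gives a per-pixel alpha estimate that is independent across pixels. The sketch below is a generic OpenMP loop under a simplified single-channel model with known F and B, not the paper's matting algorithm (real matting must also estimate F and B in the unknown region).

```cpp
#include <algorithm>
#include <vector>

// Per-pixel alpha from the grayscale compositing equation
// I = alpha*F + (1-alpha)*B  =>  alpha = (I-B)/(F-B), clamped to [0,1].
std::vector<double> estimate_alpha(const std::vector<double>& I,
                                   const std::vector<double>& F,
                                   const std::vector<double>& B) {
    std::vector<double> alpha(I.size(), 0.0);
    // Pixels are independent, so the loop parallelizes trivially;
    // without -fopenmp the pragma is ignored and the code runs serially.
    #pragma omp parallel for
    for (long i = 0; i < (long)I.size(); ++i) {
        double d = F[i] - B[i];
        double a = (d != 0.0) ? (I[i] - B[i]) / d : 0.0;
        alpha[i] = std::max(0.0, std::min(1.0, a));
    }
    return alpha;
}
```

This embarrassingly parallel structure is why an OpenMP port of an existing sequential matting code needs little development effort.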
We propose a teaching resource that uses HardKernel boards to build an MPI server with 256 cores. Although this system has relatively low performance, the aim is to provide access to hundreds of cores for carrying out scalability analyses, while achieving a good trade-off between performance, price, and energy consumption. Here, we give details of the implementation of this system at both the hardware and software levels. We also explain how it was used to teach parallel programming in a university degree course, and discuss the teachers' and students' feedback on using this new system.
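A typical exercise on such a machine is a scalability analysis, comparing measured speedups against Amdahl's law. A small helper for the analytic side of that comparison (illustrative names, not part of the described course material):

```cpp
// Amdahl's-law speedup for a program with parallel fraction f on p cores:
// S(p) = 1 / ((1 - f) + f / p), and parallel efficiency E(p) = S(p) / p.
double amdahl_speedup(double f, int p) { return 1.0 / ((1.0 - f) + f / p); }
double efficiency(double f, int p) { return amdahl_speedup(f, p) / p; }
```

For example, with f = 0.9 even 256 cores yield a speedup below 10, which is exactly the kind of prediction students can confront with measurements on the 256-core cluster.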
ISBN:
(print) 9781665490207
The development of directive-based parallel programming models such as OpenACC has significantly reduced the cost of using accelerators such as GPUs. In this study, the sparse matrix-vector product (SpMV), often the most computationally expensive part of physics-based simulations, was accelerated by GPU porting using OpenACC. Further speed-up was achieved by introducing the element-by-element (EBE) method in SpMV, an algorithm suitable for GPU architectures because it requires a large amount of computation but little memory access. In a comparison on one compute node of the supercomputer ABCI, using GPUs resulted in a 22-fold speedup over the CPU-only case even with the typical SpMV algorithm, and an additional 3.4-fold speedup when using the EBE method. This analysis was then applied to a seismic response analysis considering soil liquefaction, where using GPUs resulted in a 42-fold speedup compared to using only CPUs.
OpenACC is a high-level directive-based parallel programming model that can manage the sophistication of heterogeneity in architectures and abstract it from the users. The portability of the model across CPUs and acce...