Graphics Processing Units (GPUs) have evolved from very specialized designs geared towards computer graphics to accommodate general-purpose highly-parallel workloads. Harnessing the performance that these accelerators...
详细信息
Graphics Processing Units (GPUs) have evolved from very specialized designs geared towards computer graphics to accommodate general-purpose highly-parallel workloads. Harnessing the performance that these accelerators provide requires the use of specialized native programming interfaces, such as CUDA or OpenCL, or higher-level programming models like OpenMP or OpenACC. However, on managed programming languages, offloading execution into GPUs is much harder and error-prone, mainly due to the need to call through a native API (Application programming Interface), and because of mismatches between value and reference semantics. The Fancier framework provides a unified interface to Java, C/C++, and OpenCL C compute kernels, together with facilities to smooth the transitions between these programming languages. This combination of features makes GPU acceleration on Java much more approachable. In addition, Fancier Java code can be directly translated into equivalent C/C++ or OpenCL C code easily, which simplifies the implementation of higher-level abstractions targeting GPU or parallel execution on Java. Furthermore, it reduces the programming effort without adding significant overhead on top of the necessary OpenCL and Java Native Interface (JNI) API calls. We validate our approach on several image processing workloads running on different Android devices.
In high-performance computing, picking the right number of threads to gain a good speedup is important, as many OS-level parameters are influenced by even slight adjustments in thread count. These parameters are requi...
详细信息
One obstacle to application development on multi-FPGA systems with high-level synthesis (HLS) is a lack of support for a programming interface. Implementing and debugging an application on multiple FPGA boards is diff...
详细信息
In recent years, the research community has made great strides in alias annotations that support parallel programming [1]. Using these techniques, programmers no longer have to guess where aliased mutable state may ca...
详细信息
In this article, we propose a parallel computing method of 3-D finite-element analysis coupled with circuit equations for characteristic calculation of rotating machines. In the proposed method, the preconditioning pa...
详细信息
In this article, we propose a parallel computing method of 3-D finite-element analysis coupled with circuit equations for characteristic calculation of rotating machines. In the proposed method, the preconditioning part in the matrix solver is parallelized as well as the other part, in order to obtain the stable solution within short computational time. The proposed method is applied to the loss calculation of an interior permanent magnet synchronous motor fed by an inverter to clarify the advantages.
We present updates to the Cray Graph Engine, a high performance in-memory semantic graph database, which enable performant execution across multiple architectures as well as deployment in a container to support cloud ...
详细信息
We present updates to the Cray Graph Engine, a high performance in-memory semantic graph database, which enable performant execution across multiple architectures as well as deployment in a container to support cloud and as-a-service graph analytics. This paper discusses the changes required to port and optimize CGE to target multiple architectures, including Cray Shasta systems, large shared-memory machines such as SuperDome Flex (SDF), and cluster environments such as Apollo systems. The porting effort focused primarily on removing dependences on XPMEM and Cray PGAS and replacing these with a simplified PGAS library based upon POSIX shared memory and one-sided MPI, while preserving the existing Coarray-C++ CGE code base. We also discuss the containerization of CGE using Singularity and the techniques required to enable container performance matching native execution. We present early benchmarking results for running CGE on the SDF, Infiniband clusters and Slingshot interconnect-based Shasta systems.
With the growing amount of data, computational power has became highly required in all fields. To satisfy these requirements, the use of GPUs seems to be the appropriate solution. But one of their major setbacks is th...
详细信息
ISBN:
(纸本)9789897585883
With the growing amount of data, computational power has became highly required in all fields. To satisfy these requirements, the use of GPUs seems to be the appropriate solution. But one of their major setbacks is their varying architectures making writing efficient parallel code very challenging, due to the necessity to master the GPU's low-level design. CUDA offers more flexibility for the programmer to exploit the GPU's power with ease. However, tuning the launch parameters of its kernels such as block size remains a daunting task. This parameter requires a deep understanding of the architecture and the execution model to be well-tuned. Particularly, in the Viola-Jones algorithm, the block size is an important factor that improves the execution time, but this optimization aspect is not well explored. This paper aims to offer the first steps toward automatically tuning the block size for any input without having a deep knowledge of the hardware architecture, which ensures the automatic portability of the performance over different GPUs architectures. The main idea is to define techniques on how to get the optimum block size to achieve the best performance. We pointed out the impact of using static block size for all input sizes on the overall performance. In light of the findings, we presented two dynamic approaches to select the best block size suitable to the input size. The first one is based on an empirical search;this approach provides the optimal performance;however, it is tough for the programmer, and its deployment is time-consuming. In order to overcome this issue, we proposed a second approach, which is a model that automatically selects a block size. Experimental results show that this model can improve the execution time by up to 2.5x over the static approach.
Fully distributed intelligent building systems can be used to effectively reduce the complexity of building automation systems and improve the efficiency of the operation and maintenance management because of its self...
详细信息
Fully distributed intelligent building systems can be used to effectively reduce the complexity of building automation systems and improve the efficiency of the operation and maintenance management because of its self-organization, flexibility, and robustness. However, the parallel computing mode, dynamic network topology, and complex node interaction logic make application development complex, time-consuming, and challenging. To address the development difficulties of fully distributed intelligent building system applications, this paper proposes a user-friendly programming language called SwarmL. Concretely, SwarmL (1) establishes a language model, an overall framework, and an abstract syntax that intuitively describes the static physical objects and dynamic execution mechanisms of a fully distributed intelligent building system, (2) proposes a physical field-oriented variable that adapts the programming model to the distributed architectures by employing a serial programming style in accordance with human thinking to program parallel applications of fully distributed intelligent building systems for reducing programming difficulty, (3) designs a computational scope-based communication mechanism that separates the computational logic from the node interaction logic, thus adapting to dynamically changing network topologies and supporting the generalized development of the fully distributed intelligent building system applications, and (4) implements an integrated development tool that supports program editing and object code generation. To validate SwarmL, an example application of a real scenario and a subject-based experiment are explored. The results demonstrate that SwarmL can effectively reduce the programming difficulty and improve the development efficiency of fully distributed intelligent building system applications. SwarmL enables building users to quickly understand and master the development methods of application tasks in fully distributed intelligent
Effective and safe parallel programming is among the biggest challenges of today's software technology. The C++ 17 standard introduced parallel STL: a set of overloaded functions taking an additional 'executio...
详细信息
Heterogeneous architectures proved successful in achieving unprecedented performance and energy-efficiency. However, taking advantage of these diverse processing elements is still hard. Programmers need to code throug...
详细信息
暂无评论