This paper presents the parallelization of the multi-frequency hybrid backward/forward sweeping (BFS) technique on a graphics processing unit (GPU). Primarily, the intrinsic layer structure of a radial network, the typical topology of distribution systems, and its multi-frequency behavior are exploited to parallelize the hybrid BFS method on the GPU. Less computationally demanding tasks, e.g., error computation and simple vectorized operations, are assigned to the CPU. The network solution is performed in the MATLAB environment using the Compute Unified Device Architecture (CUDA). The computational time required by the GPU/CPU BFS implementation is compared with a CPU-only program by solving four networks of different sizes. The multi-frequency BFS results are validated against a CPU implementation of a Newton-type solution scheme. The significant reduction in computational time of the parallelized GPU implementation of the hybrid BFS method, combined with its ability to include a wide range of frequencies and to handle nonlinear components, makes it suitable for real-time online applications. (C) 2014 Elsevier B.V. All rights reserved.
In current systems, while it is necessary to exploit the availability of multiple cores, it is also mandatory to consume less energy. To speed up the development process and make it as transparent as possible to the programmer, parallelism is exploited through Application Programming Interfaces (APIs). However, each of these APIs implements a different way of exchanging data through shared memory regions and, consequently, has a different level of energy consumption. In this paper, considering general-purpose and embedded systems, we show how each API influences performance, energy consumption, and the Energy-Delay Product (EDP). For example, Pthreads consumes 12% less energy on average than OpenMP and MPI across all benchmarks. We also demonstrate that the difference in EDP among the APIs can be up to 81%, while the level of efficiency (e.g., performance or energy consumption per core) changes as the number of threads increases, depending on whether the system is embedded or general purpose.
ISBN:
(print) 9781509053827
Support vector machine (SVM) is a popular algorithm for learning to rank, but its training speed becomes the bottleneck on large datasets. Recently, heterogeneous computing platforms, such as the graphics processing unit (GPU) and Many Integrated Core (MIC), have exhibited huge superiority in the High Performance Computing domain. Open Computing Language (OpenCL) and Open Multi-Processing (OpenMP) are two popular parallel programming interfaces for different heterogeneous platforms. To resolve the speed problem of Rank SVM (RSVM), it is important to compare the performance of different parallel programming models on different heterogeneous platforms. We designed an OpenMP-based parallel learning-to-rank SVM (PLRSVM) for multi-core CPU and MIC, and an OpenCL-based PLRSVM for multi-core CPU, GPU, and MIC. The experimental results show the performance differences between the OpenMP-based and OpenCL-based programs. The OpenCL-based program significantly speeds up SVM training and shows good portability across heterogeneous devices. The experiments also suggest that selecting a programming model suited to the hardware platform and the structure of the serial algorithm is an important step toward high parallel performance.
In this paper, we present our Concurrent Systems class, in which parallel programming and parallel and distributed computing (PDC) concepts have been taught for more than 20 years. Despite several rounds of hardware changes, the class maintains its goals: allowing students to learn parallel computer organizations, study parallel algorithms, and write code that runs on parallel and distributed platforms. We discuss the benefits of such a class and reveal the key elements in developing it and in securing funding to replace outdated hardware. We also share our activities for attracting more students to PDC and related topics.
ISBN:
(print) 9781509036837
For application programmers, reducing the effort of optimizing programs is an important issue. Our solution to this issue is an auto-tuning (AT) technique. We are developing an AT language named ppOpen-AT, and we have shown that this language is useful for multi- and many-core parallel programming. Today, OpenACC attracts attention as an easy and useful graphics processing unit (GPU) programming environment. While OpenACC is one possible parallel programming environment, users still have to spend time and energy optimizing OpenACC programs. In this study, we investigate the usability of ppOpen-AT for OpenACC programs and propose extensions to ppOpen-AT for further optimization of OpenACC.
Ontology matching is among the core techniques used for heterogeneity resolution by information and knowledge-based systems. However, due to the excess and ever-evolving nature of data, ontologies are becoming large-scale and complex; consequently, performance bottlenecks arise during ontology matching. In this paper, we present our performance-based ontology matching system. Today's desktop and cloud platforms are equipped with parallelism-enabled multicore processors. Our system benefits from this opportunity and provides effectiveness-independent, data-parallel ontology matching over parallelism-enabled platforms. It decomposes complex ontologies into smaller, simpler, and scalable subsets depending upon the needs of the matching algorithms. The matching process over these subsets is divided, from coarse to fine granularity, into independent matching requests, matching jobs, and matching tasks, running in parallel over parallelism-enabled platforms. The execution of matching algorithms is aligned to minimize the matching space during the matching process. We comprehensively evaluated our system over OAEI's dataset of fourteen real-world ontologies from diverse domains, having different sizes and complexities. We executed twenty different matching tasks over a parallelism-enabled desktop and the Microsoft Azure public cloud platform. In a single-node desktop environment, our system provides a performance speedup of 4.1, 5.0, and 4.9 times for medium, large, and very large-scale ontologies, respectively. In a single-node cloud environment, it provides a speedup of 5.9, 7.4, and 7.0 times. In a multi-node (3 nodes) environment, it provides a speedup of 15.16 and 21.51 times over the desktop and cloud platforms, respectively.
In this paper, an adjoint state-space dynamic neural network method for modeling nonlinear circuits and components is presented. The method is used to model the transient behavior of nonlinear electronic and photonic components. The proposed technique is an extension of the existing state-space dynamic neural network (SSDNN) technique. The new method adds derivative information to the training patterns of nonlinear components, allowing training with less data without sacrificing model accuracy and, consequently, making training faster and more efficient. In addition, the method has been formulated to be suitable for parallel computation. The use of derivative information and parallelization makes training with the proposed technique much faster than with SSDNN. Furthermore, models created using the proposed method are much faster to evaluate than the conventional models present in traditional circuit simulation tools. The validity of the proposed technique is demonstrated through transient modeling of a physics-based CMOS driver, NXP's commercial 74LVC04A inverting buffer, and nonlinear photonic components.
The Partitioned Global Address Space (PGAS) model is a parallel programming model that aims to improve programmer productivity while at the same time aiming for high performance. The main premise of PGAS is that a globally shared address space improves productivity, but that a distinction between local and remote data accesses is required to allow performance optimizations and to support scalability on large-scale parallel architectures. To this end, PGAS preserves the global address space while embracing awareness of nonuniform communication costs. Today, about a dozen languages exist that adhere to the PGAS model. This survey proposes a definition and a taxonomy along four axes: how parallelism is introduced, how the address space is partitioned, how data is distributed among the partitions, and finally, how data is accessed across partitions. Our taxonomy reveals that today's PGAS languages focus on distributing regular data and distinguish only between local and remote data access cost, whereas the distribution of irregular data and the adoption of richer data access cost models remain open challenges.
ISBN:
(print) 9781479953424
Fast Fourier Transform (FFT) is an important part of many applications, such as wireless communication based on OFDM (Orthogonal Frequency Division Multiplexing). With Cloud Radio Access Networks, implementing FFTs on multiprocessor clusters is a challenging task. For instance, supporting the Long Term Evolution (LTE) protocol requires processing 100 independent FFTs (with sizes ranging from 128 to 2048 points) in 66.7 μs. In this work, seven native FFT candidate implementations are compared. The considered implementation environments are: OpenMP (Open Multi-Processing) on 1 core; MPI (Message Passing Interface) on 1, 2, and 3 cores; hybrid OpenMP+MPI on 1 core and on 3 cores; and MPI on a heterogeneous platform composed of a Xeon Phi and 3 cores. The reported experimental results show that the latter method meets the latency requirements of LTE. It is shown that the OpenMP and MPI paradigms running only on MICs (Many Integrated Cores) cannot fully benefit from the computing capability of many-core architectures; the heterogeneous combination of Xeon+MIC provides better performance.
ISBN:
(print) 9781509008070
In general, highly parallelized programs executed on heterogeneous multiprocessor platforms may achieve better performance than on homogeneous ones. OpenCL is one of the standards for parallel programming of heterogeneous multiprocessor platforms, and SPIR (Standard Portable Intermediate Representation) is a portable binary format for representing OpenCL kernel code. However, writing such programs is usually complex and error-prone for most programmers. Therefore, standards have been proposed to simplify programming on heterogeneous multiprocessor platforms, for example OpenACC (a directive-based parallel programming model). In this paper, we implement a framework on Clang, the C front-end of LLVM, to automatically translate OpenACC to LLVM IR with SPIR kernels. Afterwards, the IR code can optionally be optimized by the LLVM optimizer, and the host LLVM IR can be executed by the LLVM JIT compiler. According to the experimental results, our translated programs show significant performance enhancements for some programs compared with their corresponding sequential versions, and comparable performance compared with their manual OpenCL versions. Therefore, our design may reduce the difficulty of writing programs for heterogeneous multiprocessor platforms, while the translated OpenCL programs remain portable and perform as well as manual OpenCL programs written by experienced programmers.