ISBN: 9781467365765 (print)
The Fast Fourier Transform (FFT) is a key element of wireless applications based on OFDM (Orthogonal Frequency Division Multiplexing) and is challenging to implement on multi-core and many-core processors. As an example, the Long Term Evolution (LTE) protocol imposes a processing requirement whereby many independent FFTs must be calculated within a limited time slot. Using the Intel Math Kernel Library (MKL) in our approach on the Xeon Phi, we managed to reduce the maximum execution time of many independent FFTs. We propose an implementation for multi-core and many-core processors using OpenMP (Open Multi-Processing) that reduces the mean latency to 124 µs in native mode, compared with 1300 µs with offload. This is a challenge for shared memory projects. This paper describes how this level of performance can be obtained with multi-core Intel i7 and Xeon processors and a many-core Xeon Phi. The best results were obtained with the Xeon Phi, which outperformed the Sandy Bridge Xeon.
ISBN: 9781467397971 (print)
In this paper, we present an efficient parallel algorithm for calculating cumulative integration based on Simpson's rule. The proposed parallel algorithm exploits two Blelloch prefix sums (scans). The first scan calculates the cumulative integral at even indices, while the second calculates it at odd indices. We implement the parallel algorithm on NVIDIA CUDA-based GPUs. Performance is measured as speedup, and we also report the accuracy of the proposed algorithm. Based on the performance measurements, we conclude that the proposed parallel algorithm is faster than optimized CPU code, achieving a 3x speedup.
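The two-scan idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' CUDA implementation: the function names are hypothetical, the scan is a sequential emulation of the Blelloch up-sweep/down-sweep pattern, and the value at index 1 is seeded with a three-point quadratic formula, which the abstract does not specify.

```python
def blelloch_exclusive_scan(values):
    """Exclusive prefix sum using the up-sweep/down-sweep pattern.

    The iterations of each inner loop are independent of one another,
    which is what a GPU would execute in parallel.
    """
    n = 1
    while n < len(values):
        n *= 2
    a = list(values) + [0.0] * (n - len(values))  # pad to a power of two
    d = 1
    while d < n:  # up-sweep: build partial sums in place
        for i in range(2 * d - 1, n, 2 * d):
            a[i] += a[i - d]
        d *= 2
    a[n - 1] = 0.0
    d = n // 2
    while d >= 1:  # down-sweep: push prefixes back down the tree
        for i in range(2 * d - 1, n, 2 * d):
            left = a[i - d]
            a[i - d] = a[i]
            a[i] += left
        d //= 2
    return a[:len(values)]


def cumulative_simpson(f, h):
    """Cumulative integral of uniformly spaced samples f with spacing h."""
    n = len(f)
    if n < 3:
        raise ValueError("need at least three samples")
    w = h / 3.0
    # Simpson panels over [x_i, x_{i+2}], grouped by the parity of i.
    even_panels = [w * (f[i] + 4 * f[i + 1] + f[i + 2]) for i in range(0, n - 2, 2)]
    odd_panels = [w * (f[i] + 4 * f[i + 1] + f[i + 2]) for i in range(1, n - 2, 2)]
    # Appending a zero makes the exclusive scan also return the full total.
    even_sums = blelloch_exclusive_scan(even_panels + [0.0])
    odd_sums = blelloch_exclusive_scan(odd_panels + [0.0])
    # Assumption: integral over [x_0, x_1] from a quadratic through f[0..2].
    first = h / 12.0 * (5 * f[0] + 8 * f[1] - f[2])
    result = [0.0] * n
    result[0::2] = even_sums                       # first scan: even indices
    result[1::2] = [first + s for s in odd_sums]   # second scan: odd indices
    return result
```

For samples of x² the result matches x³/3 up to rounding, since both Simpson's rule and the seeding formula are exact for quadratics.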
ISBN: 9781450328357 (print)
There have been several recent efforts to improve the performance of fences. The most aggressive designs allow post-fence accesses to retire and complete before the fence completes. Unfortunately, such designs present implementation difficulties due to their reliance on global state and structures. This paper's goal is to optimize both the performance and the implementability of fences. We start off with a design like the most aggressive ones but without the global state. We call it Weak Fence or wF. Since the concurrent execution of multiple wFs can deadlock, we combine wFs with a conventional fence (i.e., Strong Fence or sF) for the less performance-critical thread(s). We call the result an Asymmetric fence group. We also propose a taxonomy of Asymmetric fence groups under TSO. Compared to past aggressive fences, Asymmetric fence groups are both substantially easier to implement and have higher average performance. The two main designs presented (WS+ and W+) speed up workloads under TSO by an average of 13% and 21%, respectively, over conventional fences.
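As background on why fences are needed under TSO at all, the classic store-buffering test can be explored with a toy model. The sketch below is an assumption-laden illustration and not the paper's design: it models only a conventional fence that waits for the store buffer to drain (the sF role), it does not model Weak Fences, and all function names are hypothetical.

```python
def tso_outcomes(programs):
    """Enumerate final register values over all interleavings of a toy TSO model.

    programs: one list of ops per thread. Ops are ("store", var, value),
    ("load", var, reg) and ("fence",). All variables start at 0.
    """
    results = set()
    n = len(programs)

    def explore(mem, bufs, pcs, regs):
        if all(pcs[t] == len(programs[t]) for t in range(n)):
            results.add(tuple(sorted(regs.items())))
            return
        for t in range(n):
            # Choice 1: the oldest buffered store of thread t drains to memory.
            if bufs[t]:
                var, val = bufs[t][0]
                drained = bufs[:t] + (bufs[t][1:],) + bufs[t + 1:]
                explore({**mem, var: val}, drained, pcs, regs)
            # Choice 2: thread t executes its next instruction.
            if pcs[t] == len(programs[t]):
                continue
            op = programs[t][pcs[t]]
            nxt = pcs[:t] + (pcs[t] + 1,) + pcs[t + 1:]
            if op[0] == "store":
                grown = bufs[:t] + (bufs[t] + ((op[1], op[2]),),) + bufs[t + 1:]
                explore(mem, grown, nxt, regs)
            elif op[0] == "load":
                # Forward from the thread's own buffer (newest entry), else read memory.
                pending = [v for (x, v) in bufs[t] if x == op[1]]
                value = pending[-1] if pending else mem.get(op[1], 0)
                explore(mem, bufs, nxt, {**regs, op[2]: value})
            elif op[0] == "fence" and not bufs[t]:
                # A conventional fence retires only once the store buffer is empty.
                explore(mem, bufs, nxt, regs)

    explore({}, tuple(() for _ in range(n)), tuple(0 for _ in range(n)), {})
    return results


def store_buffering(fenced):
    """Outcomes (r0, r1) of the store-buffering test, with or without fences."""
    fence = [("fence",)] if fenced else []
    t0 = [("store", "x", 1)] + fence + [("load", "y", "r0")]
    t1 = [("store", "y", 1)] + fence + [("load", "x", "r1")]
    return {(dict(o)["r0"], dict(o)["r1"]) for o in tso_outcomes([t0, t1])}
```

Without fences the model admits the outcome (0, 0), where both loads bypass the other thread's buffered store; with a fence between the store and the load in each thread, that outcome disappears. The cost of waiting for the buffer to drain is the overhead that more aggressive fence designs try to avoid.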
ISBN: 9781479986484 (print)
Extracting data dependences from programs serves as the foundation of many program analysis and transformation methods, including automatic parallelization, runtime scheduling, and performance tuning. To obtain data dependences, more and more related tools are adopting profiling approaches because they can track dynamically allocated memory, pointers, and array indices. However, dependence profiling suffers from high runtime and space overhead. To lower the overhead, earlier dependence profiling techniques exploit features of the specific program analyses they are designed for. As a result, every program analysis tool in need of data-dependence information requires its own customized profiler. In this paper, we present an efficient and at the same time generic data-dependence profiler that can be used as a uniform basis for different dependence-based program analyses. Its lock-free parallel design reduces the runtime overhead to around 86x on average. Moreover, signature-based memory management adjusts space requirements to practical needs. Finally, to support analyses and tuning approaches for parallel programs such as communication pattern detection, our profiler produces detailed dependence records not only for sequential but also for multi-threaded code.
ISBN: 9783952426937 (print)
This paper presents a parallel collocation algorithm for the solution of a two-point boundary value problem (BVP) that involves index-1 differential-algebraic equations (DAEs) and inequality constraints due to complementarity conditions. BVP-DAEs of this type arise from the indirect approach to the solution of optimal control problems that involve control variable inequality constraints. In the collocation method presented here, the differential and algebraic variables of the BVP-DAEs are approximated using piecewise polynomials on a mesh that may be nonuniform. A Newton interior point method is used to solve the collocation equations and maintain feasibility of the inequality constraints. The implementation of the algorithm involves parallel evaluation of the collocation equations, parallel evaluation of the system Jacobian, and parallel solution of a bordered almost block diagonal (BABD) system to obtain the Newton search direction. A numerical example shows that the parallel implementation provides significant speedup when compared to a sequential version of the algorithm, and when compared to a direct method.
ISBN: 9786067370393 (print)
It is becoming clear that software of all kinds is growing in complexity. Production-quality code plagued with bugs and security issues that are impossible to test for is becoming commonplace, and HPC is no exception. It is therefore necessary to use every means of ruling out faulty code and helping programmers express their intent. C++ is still the dominant language in HPC, and with its recent rapid development, a turning point is imminent at which the gains of reformulating existing code will outweigh the costs. The current study is a roundtrip of accumulated changes in C++11, C++14 and the coming C++17 standard, together with new best practices, patterns and idioms that should make their way into the foundations of HPC software. Such drastic changes will result in faster and safer programs with decreased development time.
ISBN: 9783319264288; 9783319264271 (print)
The development of correct high performance computing applications is challenged by software defects that result from parallel programming. We present an automatic tool that provides novel correctness capabilities for application developers of OpenSHMEM applications. These applications follow a Single Program Multiple Data (SPMD) model of parallel programming. A strict form of SPMD programming requires that certain types of operations are textually aligned, i.e., they need to be called from the same source code line in every process. This paper proposes and demonstrates run-time checks that assert such behavior for OpenSHMEM collective communication calls. The resulting tool helps to check program consistency in an automatic and scalable fashion. We introduce the types of checks that we cover and include strict checks that help application developers to detect deviations from expected program behavior. Further, we discuss how we can utilize a parallel tool infrastructure to achieve a scalable and maintainable implementation for these checks. Finally, we discuss an extension of our checks towards further types of OpenSHMEM operations.
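The notion of textual alignment can be illustrated with a toy check. The sketch below is a hypothetical single-process simulation and not the tool described in the abstract: ranks are plain loop iterations, the wrapper `checked_collective` is an invented name that stands in for an instrumented collective call, and it assumes a Python implementation where `inspect.currentframe()` is available (as in CPython).

```python
import inspect


def checked_collective(rank, name, call_log):
    """Record which source line this rank called the collective from."""
    caller = inspect.currentframe().f_back
    call_log.setdefault(rank, []).append((name, caller.f_lineno))


def aligned_program(rank, log):
    # Every rank reaches the collective from the same source line.
    checked_collective(rank, "barrier", log)


def misaligned_program(rank, log):
    # Same operation, but rank 0 calls it from a different source line.
    if rank == 0:
        checked_collective(rank, "barrier", log)
    else:
        checked_collective(rank, "barrier", log)


def find_misalignments(call_log):
    """Return indices of collective calls whose (name, line) differs across ranks."""
    sequences = list(call_log.values())
    longest = max(len(s) for s in sequences)
    bad = []
    for i in range(longest):
        sites = {s[i] if i < len(s) else None for s in sequences}
        if len(sites) > 1:
            bad.append(i)
    return bad


def run(program, num_ranks=4):
    """Simulate num_ranks processes running the same program, then check alignment."""
    log = {}
    for rank in range(num_ranks):
        program(rank, log)
    return find_misalignments(log)
```

Here `run(aligned_program)` reports no misaligned calls, while `run(misaligned_program)` flags the first collective. A real run-time check would compare call sites across processes through communication rather than a shared dictionary.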
ISBN: 9781479989379 (print)
In recent years, hardware technology has developed ahead of software technology. Companies lack software techniques that can fully utilize modern multi-core computing resources, mainly due to the difficulty of identifying the inherent parallelism within software. This problem exists in products ranging from energy-sensitive smartphones to performance-eager data centers. In this paper, we present a case study on the parallelization of the complex industry-standard H.264 HDTV decoder application on multi-core systems. An optimal schedule of the tasks is obtained and implemented by a carefully defined software parallelization framework (SPF). The framework is proposed together with a set of rules to direct parallel software programming (PSPR). A pre-processing phase based on the rules is applied to the source code to make the SPF applicable. The task-level parallel version of the H.264 decoder is implemented and tested extensively on a workstation running Linux. Significant performance improvement is observed for a set of benchmarks composed of 720p videos. The SPF and the PSPR will together serve as a reference for future parallel software implementations and direct the development of automated tools.
ISBN: 9781467370820 (print)
Developing complex, computationally intensive and data-intensive scientific applications requires effective utilization of the computational power of the available computing platforms, including grids, clouds, clusters, multi-core and many-core processors, and graphics processing units (GPUs). However, scientists who need to leverage such platforms are usually not parallel or distributed programming experts. Thus, they face numerous challenges when implementing and porting their software-based experimental tools to such platforms. In this paper, we introduce a sequential-to-parallel engineering approach to help scientists in engineering their scientific applications. Our approach is based on capturing sequential program details, planned parallelization aspects, and program deployment details using a set of domain-specific visual languages (DSVLs). Then, using code generation, we generate the corresponding parallel program using the necessary parallel and distributed programming models (MPI, OpenCL, or OpenMP). We summarize three case studies (matrix multiplication, N-body simulation, and signal processing) to evaluate our approach.
ISBN: 9781450335591 (print)
Modern HPC systems are growing in complexity, as they move towards deeper memory hierarchies and increasing use of computational heterogeneity via GPUs or other accelerators. When developing applications for these platforms, programmers are faced with two bad choices. On one hand, they can explicitly manage all machine resources, writing programs decorated with low level primitives from multiple APIs (e.g. Hybrid MPI / OpenMP applications). Though seemingly necessary for efficient execution, it is an inherently non-scalable way to write software. Without a separation of concerns, only small programs written by expert developers actually achieve this efficiency. Furthermore, the implementations are rigid, difficult to extend, and not portable. Alternatively, users can adopt higher level programming environments to abstract away these concerns. Extensibility and portability, however, often come at the cost of lost performance. The mapping of a user's application onto the system now occurs without the contextual information that was immediately available in the more coupled approach. In this paper, we describe a framework for the transfer of high level, application semantic knowledge into lower levels of the software stack at an appropriate level of abstraction. Using the stapl library, we demonstrate how this information guides important decisions in the runtime system (stapl-rts), such as multi-protocol communication coordination and request aggregation. Through examples, we show how generic programming idioms already known to C++ programmers are used to annotate calls and increase performance.