The complexity of hardware systems is currently growing faster than the productivity of system designers and programmers. This phenomenon, called the Design Productivity Gap, results in inflating design costs. In this paper, the notion of Design Productivity is precisely defined, together with a metric to assess the Design Productivity of a High-Level Synthesis (HLS) method versus a manual hardware description. The proposed Design Productivity metric evaluates the trade-off between design efficiency and implementation quality. The method is generic enough to compare several HLS methods of different natures, opening opportunities for further progress in Design Productivity. To demonstrate the Design Productivity evaluation method, an HLS compiler based on the CAPH language is compared to manual VHDL writing. The causes that make VHDL lower level than CAPH are discussed. Versions of the sub-pixel interpolation filter from the MPEG HEVC standard are implemented, and a design productivity gain of 2.3× on average is measured for the CAPH HLS method. It results from an average gain in design time of 4.4× and an average loss in quality of 1.9×.
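Taken at face value, the reported figures are consistent with a ratio-style reading of the metric; a minimal worked sketch, assuming the productivity gain is simply the design-time gain divided by the quality loss (the paper gives the precise definition):

```latex
% Hedged sketch: assuming productivity gain = design-time gain / quality loss,
% which reproduces the reported average figures.
\[
  G_{\mathrm{productivity}}
  \;=\; \frac{G_{\mathrm{design\ time}}}{L_{\mathrm{quality}}}
  \;=\; \frac{4.4}{1.9}
  \;\approx\; 2.3
\]
```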
ISBN (print): 9783319269283; 9783319269276
This paper presents the parallel performance achieved by a regional model of numerical weather prediction (NWP), running on thousands of computing cores in a petascale supercomputing system. Good scalability was obtained when running with up to 13,440 cores distributed across 670 nodes. These results enable this application to tackle large computational challenges, such as performing weather forecasts at very high spatial resolution.
Molecular docking is a widely used computational technique for studying structure-based interaction complexes between biological objects at the molecular scale. The purpose of the current work is to develop a set of tools for performing inverse docking, i.e., testing a chemical ligand at large scale against a large dataset of proteins, which has several applications in the field of drug research. We developed different strategies to parallelize and distribute the docking procedure, so as to efficiently exploit the computational performance of multi-core and multi-machine (cluster) environments. The experiments conducted to compare these strategies encourage the search for decomposition strategies, since they improve the execution of inverse docking. (C) 2014 Elsevier B.V. All rights reserved.
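As a rough illustration of the kind of decomposition such strategies rely on, the sketch below distributes the docking of one ligand against many protein targets over worker threads; it is not the paper's tool chain, and dock_score is a hypothetical stand-in for a real docking engine call.

```cpp
// Minimal sketch (not the paper's tools): inverse docking of one ligand against
// many protein targets, decomposed into contiguous slices handled by threads.
#include <algorithm>
#include <cstddef>
#include <future>
#include <string>
#include <thread>
#include <vector>

double dock_score(const std::string& ligand, const std::string& protein) {
    // Placeholder: a real implementation would invoke a docking engine here.
    return static_cast<double>(ligand.size() + protein.size());
}

std::vector<double> inverse_docking(const std::string& ligand,
                                    const std::vector<std::string>& proteins) {
    const unsigned workers = std::max(1u, std::thread::hardware_concurrency());
    std::vector<double> scores(proteins.size());
    std::vector<std::future<void>> tasks;

    // Static decomposition: each worker docks a contiguous slice of the protein set.
    const std::size_t chunk = (proteins.size() + workers - 1) / workers;
    for (unsigned w = 0; w < workers; ++w) {
        const std::size_t begin = w * chunk;
        const std::size_t end = std::min(proteins.size(), begin + chunk);
        if (begin >= end) break;
        tasks.push_back(std::async(std::launch::async, [&, begin, end] {
            for (std::size_t i = begin; i < end; ++i)
                scores[i] = dock_score(ligand, proteins[i]);
        }));
    }
    for (auto& t : tasks) t.get();   // wait for all slices to finish
    return scores;
}
```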
This paper introduces SPar, an internal C++ Domain-Specific Language (DSL) that supports the development of classic stream parallel applications. The DSL uses standard C++ attributes to introduce annotations tagging the notable components of stream parallel applications: stream sources and stream processing stages. A set of tools processes SPar code (C++ code annotated with SPar attributes) to generate FastFlow C++ code that exploits the stream parallelism denoted by the SPar annotations while targeting shared-memory multi-core architectures. We outline the main SPar features along with the main implementation techniques and tools. We also show the results of experiments assessing the feasibility of the entire approach as well as SPar's performance and expressiveness.
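A hedged illustration of what such attribute annotations look like is given below; the attribute names (spar::ToStream, spar::Stage, spar::Input, spar::Output, spar::Replicate) follow published SPar examples but should be checked against the SPar documentation, and without the SPar tool chain a standard compiler simply ignores the unknown attributes and runs the loop sequentially.

```cpp
// Illustrative sketch of attribute-annotated stream code in the spirit of SPar.
// Attribute names are taken from published SPar examples and should be treated
// as illustrative; a plain compiler ignores them and runs the code sequentially.
#include <iostream>
#include <string>

std::string read_item(int i)      { return "item " + std::to_string(i); }
std::string filter(std::string s) { return s + " [filtered]"; }
void write_item(const std::string& s) { std::cout << s << '\n'; }

int main() {
    [[spar::ToStream]]                     // marks the stream region (source loop)
    for (int i = 0; i < 10; ++i) {
        std::string item = read_item(i);

        [[spar::Stage, spar::Input(item), spar::Output(item), spar::Replicate(4)]]
        {   // a stateless stage that may be replicated across workers
            item = filter(item);
        }

        [[spar::Stage, spar::Input(item)]]
        {   // final output stage
            write_item(item);
        }
    }
    return 0;
}
```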
ISBN (print): 9781450340137
Performance-analysis tools are indispensable for understanding and optimizing the behavior of parallel programs running on increasingly powerful supercomputers. However, with the size and complexity of hardware and software on the rise, performance data sets are becoming so voluminous that their analysis poses serious challenges. In particular, the search space that must be traversed and the number of individual performance views that must be explored to identify phenomena of interest become too large. To mitigate this problem, we use visual analytics. Specifically, we accelerate the analysis of performance profiles by automatically identifying (1) relevant and (2) similar data subsets and their performance views. We focus on views of the virtual-process topology, showing that their relevance can be well captured with visual-quality metrics and that they can be further assigned to topical groups according to their visual features. A case study demonstrates that our approach helps reduce the search space by up to 80%. Copyright 2015 ACM.
Extreme-scale computing is set to provide the infrastructure for the advances and breakthroughs that will solve some of the hardest problems in science and engineering. However, resilience and energy concerns loom as two of the major challenges for machines at that scale. The number of components that will be assembled in these supercomputers plays a fundamental role in both challenges. First, a large number of parts will substantially increase the failure rate of the system compared to the failure frequency of current machines. Second, those components have to fit within the power envelope of the installation and keep the energy consumption within operational margins. Extreme-scale machines will have to incorporate fault-tolerance mechanisms and honor the energy and power restrictions. Therefore, it is essential to understand how fault tolerance and energy consumption interplay. This paper presents a comparative evaluation and analysis of the energy consumption of three different rollback-recovery protocols: checkpoint/restart, message logging, and parallel recovery. Our experimental evaluation shows that parallel recovery has the lowest execution time and energy consumption. Additionally, we present an analytical model projecting that parallel recovery can reduce energy consumption by more than 37% compared to checkpoint/restart at extreme scale. (C) 2014 Elsevier B.V. All rights reserved.
Finite-difference methods are popular for wave simulation within the seismic exploration community, thanks to their efficiency. However, difficulties arise when encountering complex topography, due to the regular grid pattern of finite-difference schemes. Despite alternatives that can handle the free surface with little effort, such as spectral element or discontinuous Galerkin methods, incorporating a free-surface boundary condition within the finite-difference framework is still appealing, even at the cost of extra algorithmic complexity and a greater demand on computational resources. We present a free-surface boundary treatment within the finite-difference framework, belonging to the family of immersed-boundary methods. Inherently, the presented boundary treatment is separated from the rest of the wave simulation, which makes it easy to integrate into existing finite-difference codes. Specifically, we construct an extrapolation operator for each grid point above the free surface, if requested by the finite-difference stencil, to estimate its fictitious wavefield value at each time step. These operators are constructed only once and remain unchanged for all time steps and source locations. The memory requirement of these operators is significant. Fortunately, grouping together multiple simulations concerning different source locations makes it possible to dilute the memory burden to a negligible level. Additionally, applying these operators incurs numerical noise, which may lead to long-time instabilities. In such a scenario, additional numerical procedures, for instance introducing artificial diffusion, are necessary to control the instability and obtain sensible simulation results. Successful applications of the presented boundary treatment to elastic-wave equations on domains with nontrivial topographies, in 2D and 3D, are presented. Robust and efficient numerical techniques to control high-frequency numerical noise remain to be investigated.
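A minimal sketch of the mechanism described above, assuming a simple operator layout (target ghost point, supporting interior points, and weights); this is an illustration of the idea, not the paper's implementation.

```cpp
// Minimal sketch, not the paper's code: applying precomputed extrapolation
// operators to estimate fictitious wavefield values at grid points above the
// free surface before each finite-difference update. The Operator layout is an
// assumption made for illustration.
#include <cstddef>
#include <vector>

struct Operator {
    std::size_t target;               // index of the ghost point above the surface
    std::vector<std::size_t> support; // interior points used by the extrapolation
    std::vector<double> weight;       // corresponding extrapolation weights
};

// Operators are built once per geometry and reused for every time step and
// every source location, as described in the abstract.
void apply_free_surface(const std::vector<Operator>& ops, std::vector<double>& u) {
    for (const Operator& op : ops) {
        double value = 0.0;
        for (std::size_t k = 0; k < op.support.size(); ++k)
            value += op.weight[k] * u[op.support[k]];
        u[op.target] = value;         // fictitious value fed to the FD stencil
    }
}
```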
ISBN (print): 9783642552243
Selection algorithms find the kth smallest element from a set of elements. Although there are optimal parallel selection algorithms available for theoretical machines, these algorithms are not only difficult to implement but also inefficient in practice. Consequently, scalable applications can only use a few special cases, such as minimum and maximum, where efficient implementations exist. To overcome such limitations, we propose a general parallel selection algorithm that scales even on today's largest supercomputers. Our approach is based on an efficient, unbiased median approximation method, recently introduced as median-of-3 reduction, and Hoare's sequential QuickSelect idea from 1961. The resulting algorithm scales with a time complexity of O(log² n) for n distributed elements while needing only O(1) space. Furthermore, we prove it to be a practical solution by explaining implementation details and showing performance results for up to 458,752 processor cores.
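For reference, the sequential building block named in the abstract, Hoare's QuickSelect with a median-of-3 pivot, can be sketched as below; the paper's contribution is the distributed O(log² n) algorithm built on a median-of-3 reduction, which this single-array sketch does not reproduce.

```cpp
// Sequential building block only: QuickSelect with a median-of-3 pivot choice,
// illustrating the selection idea on a single in-memory array.
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Returns the k-th smallest element (k is 0-based) of v.
int quickselect(std::vector<int> v, std::size_t k) {
    assert(k < v.size());
    std::size_t lo = 0, hi = v.size();           // half-open range [lo, hi)
    while (hi - lo > 1) {
        // Median-of-3 pivot from the first, middle and last elements.
        std::size_t mid = lo + (hi - lo) / 2;
        int a = v[lo], b = v[mid], c = v[hi - 1];
        int pivot = std::max(std::min(a, b), std::min(std::max(a, b), c));

        // Three-way partition: [lo, i1) < pivot, [i1, i2) == pivot, [i2, hi) > pivot.
        auto p1 = std::partition(v.begin() + lo, v.begin() + hi,
                                 [pivot](int x) { return x < pivot; });
        auto p2 = std::partition(p1, v.begin() + hi,
                                 [pivot](int x) { return x == pivot; });
        std::size_t i1 = static_cast<std::size_t>(p1 - v.begin());
        std::size_t i2 = static_cast<std::size_t>(p2 - v.begin());

        if (k < i1)      hi = i1;      // k-th element is in the "< pivot" block
        else if (k < i2) return pivot; // k-th element equals the pivot
        else             lo = i2;      // recurse into the "> pivot" block
    }
    return v[lo];
}
```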
We present a simple, work-optimal and synchronization-free solution to the problem of stably merging in parallel two given, ordered arrays of m and n elements into an ordered array of m+n elements. The main contribution is a new, simple, fast and direct algorithm that determines, for any prefix of the stably merged output array, the exact prefixes of each of the two input arrays needed to produce this output prefix. More precisely, for any given index in the resulting, but not yet constructed, output array, representing the desired output prefix, the algorithm computes the indices (called co-ranks) in each of the two input arrays representing the required input prefixes without having to merge the input arrays. The co-ranking algorithm takes O(log min(m,n)) time steps and uses O(1) space. Co-ranking is used in parallel to partition the input arrays into a collection of as many pairs as desired, each pair with exactly the same number of elements. Any stable, sequential merge algorithm can then be used to merge the pairs independently. The result is a perfectly load-balanced, stable, parallel merge algorithm. Co-ranking and sequential merging of pairs can be done without synchronization. Compared to other linear-speedup approaches to the parallel merge problem, the algorithm is considerably simpler and can be up to a factor of two faster. Compared to previous algorithms for solving the co-ranking problem, the new algorithm works for arbitrary output array indices and maintains stability in the presence of repeated elements at no extra space or time cost. When the number of processing elements p does not exceed (m+n)/log min(m,n), the parallel merge algorithm achieves perfect, linear speedup p. Furthermore, it is easy to implement on both shared and distributed memory systems.
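To make the co-ranking idea concrete, here is a sketch of the widely used binary-search formulation of co-rank computation (ties resolved in favour of the first array so the merge stays stable); it illustrates the idea described in the abstract but is not taken from the paper itself.

```cpp
// Given an output position k, compute the co-ranks (i, j) with i + j = k such
// that merging the first i elements of a and the first j elements of b yields
// exactly the first k elements of the stable merge (a-elements first on ties).
#include <algorithm>
#include <cstddef>
#include <utility>
#include <vector>

std::pair<std::size_t, std::size_t>
corank(std::size_t k, const std::vector<int>& a, const std::vector<int>& b) {
    std::size_t i = std::min(k, a.size());   // current guess for the prefix of a
    std::size_t j = k - i;                   // forced by i + j = k
    std::size_t i_low = (k > b.size()) ? k - b.size() : 0;
    std::size_t j_low = (k > a.size()) ? k - a.size() : 0;

    // Binary search over the guess i (and hence j): O(log min(m, n)) iterations.
    while (true) {
        if (i > 0 && j < b.size() && a[i - 1] > b[j]) {
            std::size_t delta = (i - i_low + 1) / 2;   // i too large: move i down, j up
            j_low = j;
            i -= delta;
            j += delta;
        } else if (j > 0 && i < a.size() && b[j - 1] >= a[i]) {
            std::size_t delta = (j - j_low + 1) / 2;   // i too small: move i up, j down
            i_low = i;
            i += delta;
            j -= delta;
        } else {
            return {i, j};   // a[i-1] <= b[j] and b[j-1] < a[i]: valid co-ranks
        }
    }
}
```

Each of p workers can call corank for its own output offset and then run any stable sequential merge on the resulting equal-sized pair, with no synchronization between workers.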
This special issue of Concurrency and Computation: Practice and Experience contains revised and extended versions of selected papers presented at the conference Euro-Par 2013. Euro-Par, the European Conference on Parallel Computing, is an annual series of international conferences dedicated to the promotion and advancement of all aspects of parallel and distributed computing. Euro-Par covers a wide spectrum of topics, from algorithms and theory to software technology and hardware-related issues, with application areas ranging from scientific to mobile and cloud computing. The major part of the Euro-Par audience consists of researchers in academic institutions, government laboratories and industrial organisations.