In this paper, we present single- and multi-node optimizations of SU2, a widely used, open-source Computational Fluid Dynamics application, aimed at improving performance and scalability for implicit Reynolds-averaged Navier-Stokes calculations on unstructured grids. Typical industry-standard implementations are currently limited by unstructured accesses, variable degrees of parallelism, and the global synchronizations inherent in traditionally used Krylov linear solvers. Therefore, we rely on aggressive single-node optimizations, such as hierarchical parallelism, dynamic threading, compacted memory layout, and vectorization, along with a communication-friendly agglomeration (geometric) linear multigrid solver. Based on results with the well-known ONERA M6 geometry, our single-core and shared-memory optimizations result in a speedup of 2.6X on the latest 14-core Intel(R) Xeon(TM) E5-2697v3 processor when compared to the baseline SU2 implementation with 14 MPI ranks. In multi-node settings, the hybrid OpenMP+MPI multigrid implementation achieves 2X higher parallel efficiency on 256 nodes over conventional Krylov-based (GMRES) methods. (C) 2016 Elsevier Ltd. All rights reserved.
ISBN: (print) 9783319265209; 9783319265193
In this paper we introduce a new method for speeding up the parallel run times of discrete optimization problems; the method is applicable to a range of different problems. We propose that a variant of the Monte Carlo method, the Las Vegas method, can be used to overcome some special barriers that can occur when dividing such problems. In particular, the maximum clique and k-clique problems are examined, and the new algorithm is presented together with the relevant measurements.
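As a rough illustration of the Las Vegas idea in the k-clique setting, the sketch below randomizes only the order in which vertices are explored, so the answer is always correct while the running time depends on the random choices. This is a generic textbook-style backtracking search, not the algorithm from the paper; the function name `find_k_clique`, the adjacency-dictionary input format, and the pruning rule are all assumptions made for the example.

```python
import random
from typing import Dict, List, Optional, Set


def find_k_clique(adj: Dict[int, Set[int]], k: int,
                  rng: random.Random) -> Optional[List[int]]:
    """Return some clique of size k, or None if the graph has none.

    Las Vegas flavour: the random shuffle changes how long the search
    takes, never whether the returned answer is correct.
    """
    vertices = list(adj)
    rng.shuffle(vertices)  # the only source of randomness

    def extend(clique: List[int], candidates: List[int]) -> Optional[List[int]]:
        if len(clique) == k:
            return clique
        for i, v in enumerate(candidates):
            # Keep only later candidates that are adjacent to v.
            rest = [u for u in candidates[i + 1:] if u in adj[v]]
            # Prune branches that cannot reach size k any more.
            if len(clique) + 1 + len(rest) >= k:
                found = extend(clique + [v], rest)
                if found is not None:
                    return found
        return None

    return extend([], vertices)


# Hypothetical toy graph: a triangle 0-1-2 with a pendant vertex 3.
graph = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2}}
triangle = find_k_clique(graph, 3, random.Random(7))
```

Running several such searches with different seeds in parallel and taking whichever finishes first is one common way to exploit this kind of randomization; whether the paper divides the work this way is not stated in the abstract.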
ISBN: (print) 9781509035946
In this paper, we present QR factorization algorithms that can tolerate process crashes and soft errors (bit flips). Our algorithms take advantage of structural properties of a QR factorization algorithm referred to as "communication-avoiding". We show that, by exploiting these properties, our resilient, robust algorithms modify the communication pattern of the computation but do not add any significant computation to the critical path.
We present a distributed-memory library for computations with dense structured matrices. A matrix is considered structured if its off-diagonal blocks can be approximated by a rank-deficient matrix with low numerical rank. Here, we use Hierarchically Semi-Separable (HSS) representations. Such matrices appear in many applications, for example, finite-element methods, boundary element methods, and so on. Exploiting this structure allows for fast solution of linear systems and/or fast computation of matrix-vector products, which are the two main building blocks of matrix computations. The compression algorithm that we use, which computes the HSS form of an input dense matrix, relies on randomized sampling with a novel adaptive sampling mechanism. We discuss the parallelization of this algorithm and also present the parallelization of structured matrix-vector product, structured factorization, and solution routines. The efficiency of the approach is demonstrated on large problems from different academic and industrial applications, on up to 8,000 cores. This work is part of a more global effort, the STRUctured Matrices PACKage (STRUMPACK) software package for computations with sparse and dense structured matrices. Hence, although useful in their own right, the routines also represent a step in the direction of a distributed-memory sparse solver.
The skyline operator determines points in a multidimensional dataset that offer some optimal trade-off. State-of-the-art CPU skyline algorithms exploit quad-tree partitioning with complex branching to minimise the number of point-to-point comparisons. Branch-phobic GPU skyline algorithms rely on compute throughput rather than partitioning, but fail to match the performance of sequential algorithms. In this paper, we introduce a new skyline algorithm, SkyAlign, that is designed for the GPU, and a GPU-friendly, grid-based tree structure upon which the algorithm relies. The search tree allows us to dramatically reduce the amount of work done by the GPU algorithm by avoiding most point-to-point comparisons at the cost of some compute throughput. This trade-off allows SkyAlign to achieve orders of magnitude faster performance than its predecessors. Moreover, a NUMA-oblivious port of SkyAlign outperforms native multicore state of the art on challenging workloads by an increasing margin as more cores and sockets are utilised.
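To make the skyline operator concrete, here is a minimal brute-force sketch of its definition. It is not SkyAlign and uses no partitioning or GPU code; it performs exactly the all-pairs point-to-point comparisons that the algorithms above try to avoid. The convention that smaller values are better in every dimension, and the names `dominates` and `skyline`, are assumptions made for the example.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, ...]


def dominates(p: Sequence[float], q: Sequence[float]) -> bool:
    """p dominates q if p is no worse in every dimension and strictly
    better in at least one (assumption: smaller is better)."""
    no_worse = all(a <= b for a, b in zip(p, q))
    strictly_better = any(a < b for a, b in zip(p, q))
    return no_worse and strictly_better


def skyline(points: List[Point]) -> List[Point]:
    """Keep every point that no other point dominates: O(n^2) comparisons."""
    return [p for p in points if not any(dominates(q, p) for q in points)]


# Hypothetical data, e.g. (price, distance) pairs.
data = [(1, 4), (2, 2), (4, 1), (3, 3), (4, 4)]
result = skyline(data)  # (3, 3) and (4, 4) are both dominated by (2, 2)
```

The quadratic comparison count in `skyline` is the cost that tree-based partitioning is meant to reduce.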
One of the most critical challenges for high-performance computing (HPC) scientific visualization is execution on massively threaded processors. Of the many fundamental changes we are seeing in HPC systems, one of the most profound is a reliance on new processor types optimized for execution bandwidth over latency hiding. Our current production scientific visualization software is not designed for these new types of architectures. To address this issue, the VTK-m framework serves as a container for algorithms, provides flexible data representation, and simplifies the design of visualization algorithms on new and future computer architectures.
In this paper we analyze and extend mesh-free algorithms for three-dimensional data transfer problems in partitioned multiphysics simulations. We first provide a direct comparison between a mesh-based weighted residual method using the common-refinement scheme and two mesh-free algorithms leveraging compactly supported radial basis functions: one using a spline interpolation and one using a moving least square reconstruction. Through the comparison we assess both the conservation and accuracy of the data transfer obtained from each of the methods. We do so for a varying set of geometries with and without curvature and sharp features and for functions with and without smoothness and with varying gradients. Our results show that the mesh-based and mesh-free algorithms are complementary with cases where each was demonstrated to perform better than the other. We then focus on the mesh-free methods by developing a set of algorithms to parallelize them based on sparse linear algebra techniques. This includes a discussion of fast parallel radius searching in point clouds and restructuring the interpolation algorithms to leverage data structures and linear algebra services designed for large distributed computing environments. The scalability of our new algorithms is demonstrated on a leadership class computing facility using a set of basic scaling studies. These scaling studies show that for problems with reasonable load balance, our new algorithms for both spline interpolation and moving least square reconstruction demonstrate both strong and weak scalability using more than 100,000 MPI processes with billions of degrees of freedom in the data transfer operation. (C) 2015 Elsevier Inc. All rights reserved.
Due to the increasing complexity of software systems, there is a growing need for automated and scalable software synthesis and analysis. In the last decade, active research in the formal methods community has brought interesting results and valuable tools. However, there are still challenges to face and hard problems that need to be solved. We briefly outline some recent trends and review some of the latest achievements, introducing six papers selected from the 20th International Conference on Tools and Algorithms for the Construction and Analysis of Systems (TACAS 2014).
One approach to achieving correct finite element assembly is to ensure that the local orientation of facets relative to each cell in the mesh is consistent with the global orientation of that facet. Rognes et al. have...
ISBN: (print) 9781509023264
In this paper we present a collision attack on all n-bit iterated hash functions built with the Merkle-Damgard construction. The attack uses a parallel algorithm, allowing a collision to be found for a 2^n block message with k computers in about 2^(n/2)/k work. Taking the Davies-Meyer scheme with the SIMECK-32 algorithm as an example, our attack can find a collision for a 2^32 bit total output with 8 computers at a cost of 2^13 work per computer. The result of this research is a plaintext that has the fixed-point property: it does not affect the hash value, because the resulting output is the IV itself. This plaintext is used to construct the collision. The application of the Davies-Meyer scheme turns out not to be resistant to the collision attack, because there are three fixed points in the two IV samples used.
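The fixed-point property mentioned above can be illustrated with a short sketch. In the Davies-Meyer scheme the compression function is H' = E_m(H) XOR H, so choosing H = D_m(0) gives E_m(H) = 0 and therefore H' = H. The code below uses a toy 32-bit Feistel cipher purely as a stand-in: it is not SIMECK-32, has no security value, and every name in it (`toy_encrypt`, `toy_decrypt`, `davies_meyer`, `fixed_point`) is an assumption made for the example.

```python
MASK16 = 0xFFFF


def _subkey(key: int, r: int) -> int:
    # Toy key schedule (assumption): slide a 16-bit window over the key.
    return (key >> (8 * r)) & MASK16


def _round(half: int, subkey: int) -> int:
    # Toy round function; any function works in a Feistel network.
    return ((half * 0x9E37 + subkey) ^ (half >> 3)) & MASK16


def toy_encrypt(key: int, block: int) -> int:
    """4-round Feistel network on a 32-bit block (NOT SIMECK-32)."""
    left, right = block >> 16, block & MASK16
    for r in range(4):
        left, right = right, left ^ _round(right, _subkey(key, r))
    return (left << 16) | right


def toy_decrypt(key: int, block: int) -> int:
    """Inverse of toy_encrypt: undo the rounds in reverse order."""
    left, right = block >> 16, block & MASK16
    for r in reversed(range(4)):
        left, right = right ^ _round(left, _subkey(key, r)), left
    return (left << 16) | right


def davies_meyer(h: int, m: int) -> int:
    """Davies-Meyer compression: the message block m is the cipher key."""
    return toy_encrypt(m, h) ^ h


def fixed_point(m: int) -> int:
    """Chaining value h with davies_meyer(h, m) == h, namely D_m(0)."""
    return toy_decrypt(m, 0)


m = 0xDEADBEEF            # hypothetical message block
h = fixed_point(m)
unchanged = davies_meyer(h, m) == h
```

If the IV of a hash happens to equal such a fixed point for some block m, then prepending m any number of times leaves the chaining value, and hence the final hash, unchanged, which is the kind of collision the abstract describes.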