This paper describes the GR1 algorithm, which provides feasible execution times for the subgraph isomorphism problem. It is a parallel algorithm that uses a variant of the producer–consumer pattern. It was desig...
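As a hedged illustration of the producer–consumer pattern the abstract mentions, the sketch below shows a generic work queue with one producer and several consumer threads in standard C++. The payload (an integer standing in for, say, a partial candidate mapping) and all names are hypothetical and are not taken from GR1.

```cpp
// Minimal producer-consumer work queue sketch (illustrative only).
#include <atomic>
#include <condition_variable>
#include <cstdio>
#include <mutex>
#include <optional>
#include <queue>
#include <thread>
#include <vector>

struct WorkQueue {
  std::queue<int> items;            // placeholder work items
  std::mutex m;
  std::condition_variable cv;
  bool done = false;

  void push(int x) {
    { std::lock_guard<std::mutex> lk(m); items.push(x); }
    cv.notify_one();
  }
  void close() {                    // producer signals "no more work"
    { std::lock_guard<std::mutex> lk(m); done = true; }
    cv.notify_all();
  }
  std::optional<int> pop() {        // blocks until work or shutdown
    std::unique_lock<std::mutex> lk(m);
    cv.wait(lk, [&] { return !items.empty() || done; });
    if (items.empty()) return std::nullopt;   // closed and drained
    int x = items.front(); items.pop();
    return x;
  }
};

int main() {
  WorkQueue q;
  std::atomic<int> processed{0};

  // Consumers: pull items and "process" them.
  std::vector<std::thread> consumers;
  for (int c = 0; c < 4; ++c)
    consumers.emplace_back([&] {
      while (auto item = q.pop()) ++processed;   // placeholder work
    });

  // Producer: enumerates work items.
  for (int i = 0; i < 1000; ++i) q.push(i);
  q.close();

  for (auto& t : consumers) t.join();
  std::printf("processed %d items\n", processed.load());
}
```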
Since its first release in 2015, OpenTimer v1 has been used in many industrial and academic projects for analyzing the timing of custom designs. After four years of research and development, we announce OpenTimer v2, a major release that efficiently supports: 1) a new task-based parallel incremental timing analysis engine to break through the performance bottleneck of existing loop-based methods; 2) a new application programming interface (API) concept to exploit high degrees of parallelism; and 3) enhanced support for industry-standard design formats to improve user experience. Compared with OpenTimer v1, we rearchitected v2 with modern C++ and advanced parallel computing techniques to largely improve the tool's performance and usability. As a particular example, OpenTimer v2 achieved up to a 5.33x speedup over v1 in incremental timing and scaled higher with increasing core counts. Our contributions include both technical innovations and engineering knowledge that are open and accessible to promote timing research in the community.
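To make the task-based idea concrete without reproducing OpenTimer's actual engine or API, here is a small sketch of the general pattern: each gate's arrival-time computation becomes a task that waits only on its true fan-in dependencies, so independent parts of the graph propagate in parallel. The `Gate` struct, the DAG, and all names are hypothetical, and plain `std::async` stands in for a real task scheduler.

```cpp
// Illustrative task-dependency sketch, not OpenTimer v2 code.
#include <algorithm>
#include <cstdio>
#include <future>
#include <vector>

struct Gate {
  double delay;              // gate delay
  std::vector<int> fanin;    // indices of predecessor gates
};

int main() {
  // A 4-gate DAG: gates 0 and 1 feed gate 2, which feeds gate 3.
  std::vector<Gate> g = { {1.0, {}}, {2.0, {}}, {1.5, {0, 1}}, {0.5, {2}} };

  // One shared future per gate holding its arrival time.
  std::vector<std::shared_future<double>> arrival(g.size());

  // Gates are assumed topologically ordered, so every fan-in future
  // already exists when a gate's task is created.
  for (std::size_t i = 0; i < g.size(); ++i) {
    arrival[i] = std::async(std::launch::async, [i, &g, &arrival] {
      double at = 0.0;
      for (int p : g[i].fanin)              // wait only on dependencies
        at = std::max(at, arrival[p].get());
      return at + g[i].delay;               // propagate arrival time
    }).share();
  }

  std::printf("arrival at output gate: %g\n", arrival.back().get());
}
```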
In high-performance computing, picking the right number of threads to gain a good speedup is important, as many OS-level parameters are influenced by even slight adjustments in thread count. These parameters are requi...
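As a hedged baseline for the thread-count question raised above, the snippet below clamps a requested thread count to the hardware concurrency reported by the standard library; the function name is made up, and real tuning (as the abstract notes) would also have to account for OS-level parameters and the workload itself.

```cpp
// Simple, illustrative thread-count selection baseline.
#include <algorithm>
#include <cstdio>
#include <thread>

unsigned choose_threads(unsigned requested) {
  unsigned hw = std::thread::hardware_concurrency();  // may return 0
  if (hw == 0) hw = 1;                                // fall back safely
  return std::clamp(requested, 1u, hw);
}

int main() {
  std::printf("using %u threads\n", choose_threads(64));
}
```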
Sparse matrix computations are at the heart of many scientific applications and data analytics codes. The performance and memory usage of these codes depend heavily on their use of specialized sparse matrix data structures that only store the nonzero entries. However, such compaction is done using index arrays that result in indirect array accesses such as A[B[i]], where A and B are both arrays. Numerical libraries can provide high-performance code for an individual sparse kernel; however, they must be manually tuned and optimized for different inputs and architectures. Alternatively, compilers can be used to optimize code while providing architecture portability. Because of these indirect array accesses, memory access information is unknown at compile time, and thus it is challenging to vectorize a sparse matrix method or run it on parallel cores. To automate the generation of code for efficient execution of sparse code, several compile-time and runtime techniques are required. Existing techniques are either not efficient or need manual effort to extend to different sparse matrix computations. Consequently, in this dissertation, I address the problem of automating the optimization of sparse matrix code on parallel processors, with a specific focus on sparse linear solvers and numerical optimization. This dissertation presents a set of code transformations and algorithms, all implemented in a novel code generator called Sympiler, that automates the optimization of sparse matrix codes on parallel processors. Sympiler takes a sparse method, arising from a sparse linear system or sparse numerical optimization, decouples information related to the computation pattern of the method, i.e., symbolic information, and uses this information to transform the code into vectorizable and parallel code. Sympiler also enables the reuse of symbolic information when the computation pattern remains static for a period of time in the simulations or when it changes modestly. Evaluation result
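The indirect accesses the abstract refers to are easiest to see in a textbook compressed sparse row (CSR) matrix-vector product, sketched below; this is generic illustrative code, not output of Sympiler.

```cpp
// Generic CSR SpMV showing the A[B[i]]-style indirect access.
#include <cstdio>
#include <vector>

// y = A * x with A stored in compressed sparse row (CSR) form.
void spmv_csr(const std::vector<int>& rowptr,
              const std::vector<int>& col,
              const std::vector<double>& val,
              const std::vector<double>& x,
              std::vector<double>& y) {
  for (std::size_t i = 0; i + 1 < rowptr.size(); ++i) {
    double sum = 0.0;
    for (int j = rowptr[i]; j < rowptr[i + 1]; ++j)
      sum += val[j] * x[col[j]];   // indirect access: x[col[j]] ~ A[B[i]]
    y[i] = sum;
  }
}

int main() {
  // 2x2 matrix [[2, 0], [3, 4]] in CSR form.
  std::vector<int> rowptr = {0, 1, 3};
  std::vector<int> col    = {0, 0, 1};
  std::vector<double> val = {2, 3, 4};
  std::vector<double> x   = {1, 1}, y(2, 0.0);
  spmv_csr(rowptr, col, val, x, y);
  std::printf("y = [%g, %g]\n", y[0], y[1]);   // expected [2, 7]
}
```

Because `col[j]` is only known at run time, the compiler cannot prove which elements of `x` each iteration touches, which is exactly what makes vectorization and parallelization of such kernels hard without the symbolic information Sympiler extracts.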
One obstacle to application development on multi-FPGA systems with high-level synthesis (HLS) is a lack of support for a programming interface. Implementing and debugging an application on multiple FPGA boards is diff...
In this article, we propose a parallel computing method of 3-D finite-element analysis coupled with circuit equations for the characteristic calculation of rotating machines. In the proposed method, the preconditioning part of the matrix solver is parallelized along with the other parts in order to obtain a stable solution within a short computational time. The proposed method is applied to the loss calculation of an interior permanent magnet synchronous motor fed by an inverter to clarify its advantages.
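The article does not spell out its preconditioner here, so as a hedged, simplified illustration of why the preconditioning step is worth parallelizing at all, the sketch below applies a Jacobi (diagonal) preconditioner with the C++17 parallel algorithms: every entry of z = M^-1 r is independent, so the application parallelizes trivially. All names are illustrative.

```cpp
// Hedged sketch: parallel application of a Jacobi (diagonal) preconditioner.
#include <algorithm>
#include <cstdio>
#include <execution>
#include <vector>

// z = M^{-1} r with M = diag(d); each entry is independent.
void apply_jacobi(const std::vector<double>& d,
                  const std::vector<double>& r,
                  std::vector<double>& z) {
  std::transform(std::execution::par,
                 d.begin(), d.end(), r.begin(), z.begin(),
                 [](double di, double ri) { return ri / di; });
}

int main() {
  std::vector<double> d = {2.0, 4.0, 5.0};
  std::vector<double> r = {2.0, 8.0, 10.0};
  std::vector<double> z(3);
  apply_jacobi(d, r, z);
  std::printf("z = [%g, %g, %g]\n", z[0], z[1], z[2]);   // [1, 2, 2]
}
```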
We present updates to the Cray Graph Engine, a high-performance in-memory semantic graph database, which enable performant execution across multiple architectures as well as deployment in a container to support cloud and as-a-service graph analytics. This paper discusses the changes required to port and optimize CGE to target multiple architectures, including Cray Shasta systems, large shared-memory machines such as SuperDome Flex (SDF), and cluster environments such as Apollo systems. The porting effort focused primarily on removing dependencies on XPMEM and Cray PGAS and replacing them with a simplified PGAS library based upon POSIX shared memory and one-sided MPI, while preserving the existing Coarray-C++ CGE code base. We also discuss the containerization of CGE using Singularity and the techniques required to make container performance match native execution. We present early benchmarking results for running CGE on the SDF, InfiniBand clusters, and Slingshot interconnect-based Shasta systems.
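For readers unfamiliar with the one-sided MPI style such a PGAS layer builds on, here is a generic MPI-3 RMA example: every rank exposes a window of memory and remote ranks write into it with MPI_Put. This is standard MPI, not CGE's actual PGAS library.

```cpp
// Generic one-sided MPI (RMA) example, illustrative of the PGAS style.
#include <cstdio>
#include <mpi.h>

int main(int argc, char** argv) {
  MPI_Init(&argc, &argv);
  int rank, size;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Comm_size(MPI_COMM_WORLD, &size);

  // Each rank exposes one integer through an RMA window.
  int* base = nullptr;
  MPI_Win win;
  MPI_Win_allocate(sizeof(int), sizeof(int), MPI_INFO_NULL,
                   MPI_COMM_WORLD, &base, &win);
  *base = -1;

  // Every rank writes its own id into the window of the next rank.
  MPI_Win_fence(0, win);
  int target = (rank + 1) % size;
  MPI_Put(&rank, 1, MPI_INT, target, 0, 1, MPI_INT, win);
  MPI_Win_fence(0, win);

  std::printf("rank %d received %d\n", rank, *base);

  MPI_Win_free(&win);
  MPI_Finalize();
}
```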
ISBN (print): 9789897585883
With the growing amount of data, computational power has become highly required in all fields. To satisfy these requirements, the use of GPUs seems to be the appropriate solution. However, one of their major setbacks is their varying architectures, which makes writing efficient parallel code very challenging due to the necessity of mastering the GPU's low-level design. CUDA offers more flexibility for the programmer to exploit the GPU's power with ease. However, tuning the launch parameters of its kernels, such as the block size, remains a daunting task. This parameter requires a deep understanding of the architecture and the execution model to be well tuned. In particular, in the Viola-Jones algorithm, the block size is an important factor that improves the execution time, but this optimization aspect is not well explored. This paper aims to offer the first steps toward automatically tuning the block size for any input without requiring deep knowledge of the hardware architecture, which ensures automatic portability of the performance across different GPU architectures. The main idea is to define techniques for obtaining the optimum block size to achieve the best performance. We point out the impact of using a static block size for all input sizes on the overall performance. In light of the findings, we present two dynamic approaches to select the block size best suited to the input size. The first one is based on an empirical search; this approach provides the optimal performance; however, it is tough for the programmer, and its deployment is time-consuming. To overcome this issue, we propose a second approach, a model that automatically selects a block size. Experimental results show that this model can improve the execution time by up to 2.5x over the static approach.
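The selection model itself is not reproduced here; as a hedged starting point, the host-side sketch below picks a block size from the device properties and the input size instead of hard-coding one value for every input. The heuristic (warp-multiple blocks, capped by maxThreadsPerBlock, shrunk for small inputs so several SMs still get work) is a generic rule of thumb, not the paper's model, and `pick_block_size` is a made-up name.

```cpp
// Host-side heuristic for choosing a CUDA block size (illustrative only).
#include <algorithm>
#include <cstdio>
#include <cuda_runtime.h>

int pick_block_size(int n_work_items) {
  cudaDeviceProp prop{};
  if (cudaGetDeviceProperties(&prop, 0) != cudaSuccess)
    return 256;                                   // safe generic fallback

  int warp = prop.warpSize;                       // typically 32
  int cap  = prop.maxThreadsPerBlock;             // typically 1024

  // Spread small inputs over the SMs; round up to a warp multiple.
  int want = std::max(warp, n_work_items / (2 * prop.multiProcessorCount));
  want = ((want + warp - 1) / warp) * warp;       // warp-aligned
  return std::min(want, cap);
}

int main() {
  for (int n : {1024, 100000, 10000000})
    std::printf("n = %8d -> block size %d\n", n, pick_block_size(n));
}
```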
Fully distributed intelligent building systems can be used to effectively reduce the complexity of building automation systems and improve the efficiency of operation and maintenance management because of their self-organization, flexibility, and robustness. However, the parallel computing mode, dynamic network topology, and complex node interaction logic make application development complex, time-consuming, and challenging. To address the development difficulties of fully distributed intelligent building system applications, this paper proposes a user-friendly programming language called SwarmL. Concretely, SwarmL (1) establishes a language model, an overall framework, and an abstract syntax that intuitively describe the static physical objects and dynamic execution mechanisms of a fully distributed intelligent building system, (2) proposes a physical field-oriented variable that adapts the programming model to distributed architectures by employing a serial programming style in accordance with human thinking to program parallel applications of fully distributed intelligent building systems, thereby reducing programming difficulty, (3) designs a computational scope-based communication mechanism that separates the computational logic from the node interaction logic, thus adapting to dynamically changing network topologies and supporting the generalized development of fully distributed intelligent building system applications, and (4) implements an integrated development tool that supports program editing and object code generation. To validate SwarmL, an example application from a real scenario and a subject-based experiment are explored. The results demonstrate that SwarmL can effectively reduce the programming difficulty and improve the development efficiency of fully distributed intelligent building system applications. SwarmL enables building users to quickly understand and master the development methods of application tasks in fully distributed intelligent
Effective and safe parallel programming is among the biggest challenges of today's software technology. The C++17 standard introduced the parallel STL: a set of overloaded functions taking an additional 'executio...
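For reference, here is a minimal example of the parallel algorithms the abstract refers to: the familiar std::sort and std::reduce calls with an extra std::execution::par policy argument. The data and sizes are arbitrary, and with GCC/libstdc++ the build typically needs to link against TBB.

```cpp
// Minimal C++17 parallel STL example: std::sort / std::reduce with
// an execution policy as the first argument.
#include <algorithm>
#include <cstdio>
#include <execution>
#include <numeric>
#include <random>
#include <vector>

int main() {
  std::vector<double> v(1 << 20);
  std::mt19937 gen(42);
  std::uniform_real_distribution<double> dist(0.0, 1.0);
  for (auto& x : v) x = dist(gen);

  // Parallel sort: same call as the sequential overload, plus the policy.
  std::sort(std::execution::par, v.begin(), v.end());

  // Parallel reduction over the sorted data.
  double sum = std::reduce(std::execution::par, v.begin(), v.end(), 0.0);
  std::printf("sorted %zu values, sum = %f\n", v.size(), sum);
}
```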