A crashworthiness simulation system is one of the key computer-aided engineering (CAE) tools for the automobile industry and carries two potentially conflicting requirements: accuracy and efficiency. A parallel crashworthiness simulation system based on the graphics processing unit (GPU) architecture and the explicit finite element (FE) method is developed in this work. Implementation details with the compute unified device architecture (CUDA) are considered. The entire parallel simulation system involves a parallel hierarchy-territory contact-searching algorithm (HITA) and a parallel penalty contact force calculation algorithm. Three basic GPU-based parallel strategies are suggested to exploit the natural parallelism of the explicit FE algorithm. Two free GPU-based numerical calculation libraries, cuBLAS and Thrust, are introduced to reduce the difficulty of programming. Furthermore, a mixed array and a thread-map-to-element strategy are proposed to improve the performance of test-pair searching. The outer loop of the nested loop through the mixed array is unrolled to realize parallel searching. An efficient storage strategy based on data sorting is presented to realize data transfer between different hierarchies with coalesced access during contact-pair searching. A thread-map-to-element pattern is implemented to calculate the penetrations and the penetration forces; a double-precision floating-point atomic operation is used to scatter contact forces. The simulation results of three different models based on the Intel Core i7-930 and the NVIDIA GeForce GTX 580 demonstrate the precision and efficiency of the developed parallel crashworthiness simulation system. (C) 2015 Elsevier Ltd. All rights reserved.
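As a rough illustration of the contact-force scatter step, the sketch below mimics the conflict-safe accumulation of penalty forces onto shared nodes in NumPy; the array names and the penalty-force formula are assumptions for illustration, and the paper's system performs this step with double-precision atomic adds inside a CUDA kernel.

```python
# Minimal NumPy sketch of scatter-accumulating penalty contact forces.
# node_ids, penetrations, normals, k_penalty are illustrative, not from the paper.
import numpy as np

def scatter_contact_forces(n_nodes, node_ids, penetrations, normals, k_penalty):
    """Accumulate per-pair penalty forces f = k * d * n onto shared nodes.

    np.add.at is an unbuffered scatter-add: the serial analogue of the
    double-precision atomic adds a GPU kernel needs when many contact
    pairs write forces to the same node concurrently.
    """
    forces = np.zeros((n_nodes, 3))
    pair_forces = k_penalty * penetrations[:, None] * normals  # one force per pair
    np.add.at(forces, node_ids, pair_forces)                   # conflict-safe scatter
    return forces
```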
This article provides an overview of AMD's vision for exascale computing. The authors envision exascale computing nodes that combine integrated CPUs and GPUs, along with the hardware and software support to enable scientists to effectively run their scientific experiments on an exascale system. The authors discuss the challenges in building a heterogeneous exascale system and describe ongoing research efforts to realize AMD's exascale vision.
ISBN (Print): 9781509035403
Peterson's solution is a classical algorithm for the mutual exclusion problem, but rigorous work on analyzing its safety and liveness properties remains rare. Using the theorem prover Isabelle/HOL, we formally modelled Peterson's solution for two processes and proved that it satisfies the mutual exclusion property. Following Paulson's inductive approach, the algorithm is inductively defined as the set of all possible event lists of two concurrent processes, where an event is defined as an atomic action of a concurrent process. All of the reasoning has been checked by Isabelle/HOL. Compared with work based on model checking, ours can be easily generalized to the analysis of Peterson's solution for n (n > 2) processes, and the model we defined could be extended to analyze the liveness property of Peterson's solution. The proof process also yields useful advice on how to implement Peterson's solution.
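For readers who want to see the algorithm itself, a minimal Python sketch of two-process Peterson's solution follows. The paper's artifact is an Isabelle/HOL model, not executable code, and this sketch leans on CPython's interpreter-level memory ordering; on real hardware the algorithm additionally needs memory fences.

```python
# Two-process Peterson's solution, sketched with Python threads.
import threading

flag = [False, False]  # flag[i]: process i wants to enter its critical section
turn = 0               # index of the process that yields on contention
counter = 0            # shared state the algorithm protects

def process(i):
    global turn, counter
    other = 1 - i
    for _ in range(10000):
        flag[i] = True
        turn = other                          # entry protocol: defer to the other
        while flag[other] and turn == other:
            pass                              # busy-wait until it is safe to enter
        counter += 1                          # critical section
        flag[i] = False                       # exit protocol

threads = [threading.Thread(target=process, args=(i,)) for i in (0, 1)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(counter)  # 20000 exactly if mutual exclusion held
```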
There is a growing need for accurate and efficient real-time state estimation as complexity, interconnection, and the insertion of new devices in power systems increase. In this paper, a massively parallel dynamic state estimator is developed on a graphics processing unit (GPU), which is especially well suited to processing large data sets. Within the massively parallel framework, a lateral two-level dynamic state estimator is proposed based on the extended Kalman filter method, utilizing both supervisory control and data acquisition (SCADA) and phasor measurement unit (PMU) measurements. The measurements at buses without PMU installations are predicted using previous data. The results of the GPU-based dynamic state estimator are compared with a multithreaded CPU-based code. Moreover, the effects of direct and iterative linear solvers on the state estimation algorithm are investigated. The simulation results show a total speed-up of up to 15 times for a 4992-bus system.
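As a sketch of the estimator's core building block, the function below runs one extended-Kalman-filter predict/update cycle in NumPy; the names and the dense direct solve are assumptions for illustration (the paper compares direct and iterative solvers on the GPU), not the authors' code.

```python
# One EKF predict/update cycle; f/h are the process and measurement models,
# F/H their Jacobians, Q/R the noise covariances. All names are illustrative.
import numpy as np

def ekf_step(x, P, z, f, F, h, H, Q, R):
    # Predict
    x_pred = f(x)
    P_pred = F @ P @ F.T + Q
    # Update: a dense direct solve stands in for the direct/iterative
    # linear solvers whose effects the paper investigates
    S = H @ P_pred @ H.T + R                  # innovation covariance
    K = np.linalg.solve(S.T, H @ P_pred.T).T  # Kalman gain, K = P H^T S^-1
    x_new = x_pred + K @ (z - h(x_pred))
    P_new = (np.eye(len(x)) - K @ H) @ P_pred
    return x_new, P_new
```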
In this paper, the performance of parallel computing is discussed in the domain of image matching. Image matching, which compares two images for similarity, is widely used in security, medical, and computer vision applications. Depending on the size of the images, however, the computation may not be tractable on a single processor running a sequential algorithm. To overcome this limitation, parallel computing is introduced through the Message Passing Interface (MPI) library. In this project, the two images are first converted to grayscale and then compared using the Sum of Squared Differences (SSD) algorithm. A parallel network of 12 processors was implemented for image matching and for measuring the performance of the SSD algorithm. The performance gain with 12, 8, 4, and 2 processors was compared against that of a single processor. The results show a roughly linear relationship between performance gain and the number of processors used, demonstrating significant benefits of parallelism for SSD applications.
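A minimal mpi4py sketch of this scheme follows: the root rank splits both grayscale images into row blocks, each rank computes its local sum of squared differences, and a reduction yields the global SSD. The file names are assumptions; the project's actual code is not given in the abstract.

```python
# Parallel SSD between two grayscale images with MPI; run with e.g.
#   mpiexec -n 4 python ssd_mpi.py
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
size = comm.Get_size()

if rank == 0:
    a = np.load("img_a_gray.npy").astype(np.float64)  # hypothetical inputs,
    b = np.load("img_b_gray.npy").astype(np.float64)  # same shape, grayscale
    a_parts = np.array_split(a, size)                 # split by rows
    b_parts = np.array_split(b, size)
else:
    a_parts = b_parts = None

a_loc = comm.scatter(a_parts, root=0)
b_loc = comm.scatter(b_parts, root=0)

local_ssd = float(np.sum((a_loc - b_loc) ** 2))       # per-rank partial sum
total_ssd = comm.reduce(local_ssd, op=MPI.SUM, root=0)

if rank == 0:
    print("SSD:", total_ssd)  # 0.0 means the images are identical
```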
Erasure-code-based object storage systems are becoming popular choices for archive storage due to cost-effective space-saving schemes and high fault resilience. Both erasure code encoding and decoding involve compute-intensive array, matrix, and table-lookup operations. With today's advanced CPU technologies such as multi-core, many-core, and streaming SIMD instruction sets, erasure code technology can be adapted effectively and efficiently in cloud storage systems and applied to very large data sets. Current erasure coding solutions are based on a single-process approach, which cannot process very large data sets efficiently and effectively. To remove the bottleneck of a single-process erasure encoding pipeline, we exploit the task parallelism of a multicore computing system and give the erasure coding process parallel processing capability. We have leveraged open-source erasure coding software and implemented concurrent and parallel erasure coding software, called parEC. The parEC process is realized through the MPI runtime parallel I/O environment, after which a data placement process distributes encoded data blocks to their destination storage devices. In this paper, we present the software architecture of parEC, conduct various performance tests on its software components, report our early experience of using parEC, and discuss its current status and future development work.
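As a rough sketch of the task-parallel idea behind parEC (not its actual MPI-based implementation), the fragment below encodes stripes concurrently, substituting a single XOR parity block for a real erasure code:

```python
# Task-parallel "encoding" of stripes; XOR parity stands in for a real
# erasure code, and a process pool stands in for parEC's MPI runtime.
import numpy as np
from multiprocessing import Pool

K = 4  # data blocks per stripe

def encode_stripe(stripe):
    """Split one stripe into K data blocks plus one XOR parity block."""
    blocks = np.array_split(np.frombuffer(stripe, dtype=np.uint8), K)
    width = blocks[0].size
    parity = np.bitwise_xor.reduce([np.resize(b, width) for b in blocks])
    return blocks, parity

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    stripes = [rng.integers(0, 256, 1 << 20, dtype=np.uint8).tobytes()
               for _ in range(8)]                 # eight 1 MiB stripes
    with Pool() as pool:                          # encode stripes in parallel
        encoded = pool.map(encode_stripe, stripes)
    print(len(encoded), "stripes encoded")
```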
Recent multi-core designs have migrated from Symmetric Multi-Processing to cache-coherent Non-Uniform Memory Access (ccNUMA) architectures. In this paper we discuss performance issues that arise when designing parallel Finite Element programs for a 64-core ccNUMA computer and explore solutions to these issues. We first present an overview of the computer architecture and show that highly parallel code that does not take the system's memory organization into account scales poorly, achieving only a 2.8x speedup when running with 64 threads. Then, we discuss how we identified the sources of overhead and evaluate three possible solutions to the problem. The first solution does not require the application's code to be modified, but the speedup achieved is only 10.6x. The second solution enables the performance to scale up to 30.9x; however, it requires the programmer to manually schedule threads, allocate related data on local CPUs and memory banks, and rely on ccNUMA-aware libraries that are not portable across operating systems. We also propose and evaluate "copy-on-thread", an alternative solution that enables the performance to scale up to 25.5x without relying on specialized libraries or requiring specific data allocation and thread scheduling. Finally, we argue that the reported issues only arise for large data sets, and we conclude the paper with recommendations that help programmers design algorithms and programs that perform well on this kind of machine. (C) 2014 Civil-Comp Ltd. and Elsevier Ltd. All rights reserved.
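To make the "copy-on-thread" idea concrete, here is a toy Python sketch with processes standing in for native threads: each worker copies the read-shared data when it starts, so first-touch allocation places that copy in the worker's local memory. This is a conceptual sketch under those substitutions, not the paper's FE code.

```python
# "Copy-on-thread" in miniature: each worker privatizes read-shared data so
# later accesses are NUMA-local. A fork-based start method is assumed so the
# global array is visible in the workers.
import numpy as np
from multiprocessing import Pool

stiffness = np.random.default_rng(0).random((2048, 2048))  # read-mostly data

def worker(rows):
    local = stiffness.copy()          # the copy: allocated by this worker,
    return float(local[rows].sum())   # so subsequent reads stay local

if __name__ == "__main__":
    chunks = np.array_split(np.arange(2048), 8)
    with Pool(8) as pool:
        print(sum(pool.map(worker, chunks)))
```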
To make use of ever-improving microprocessor performance, applications must be modified to take advantage of the parallelism of today's microprocessors. One such application that needs to be modernized is the Weather Research and Forecasting (WRF) model, which is designed for numerical weather prediction and atmospheric research. The WRF software infrastructure consists of several components, such as dynamic solvers and physics schemes. Numerical models are used to resolve the large-scale flow, whereas subgrid-scale parameterizations estimate small-scale properties (e.g., boundary layer turbulence and convection, clouds, radiation), which significantly influence the resolved scale due to the complex nonlinear nature of the atmosphere. For the cloudy planetary boundary layer (PBL), it is fundamental to parameterize vertical turbulent fluxes and subgrid-scale condensation in a realistic manner. A parameterization based on the total energy-mass flux (TEMF) that unifies the turbulence and moist convection components produces better results than other PBL schemes. Thus, we present our optimization results for the TEMF PBL scheme. These optimizations include vectorization of the code to utilize the multiple vector units inside each processor core. The optimizations improved the performance of the original TEMF code on the Xeon Phi 7120P by a factor of 25.9x. Furthermore, the same optimizations improved the performance of TEMF on a dual-socket configuration of eight-core Intel Xeon E5-2670 CPUs by a factor of 8.3x compared to the original TEMF code.
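To illustrate the kind of vectorization involved, the sketch below contrasts a level-by-level scalar loop with an equivalent whole-array expression that the vector units can execute; the flux formula is an invented placeholder, not the TEMF scheme itself.

```python
# Scalar per-level loop versus a SIMD-friendly whole-array expression.
import numpy as np

def fluxes_scalar(w, q):
    f = np.empty_like(w)
    for k in range(w.size):        # one vertical column, level by level
        f[k] = 0.5 * w[k] * q[k]
    return f

def fluxes_vectorized(w, q):
    return 0.5 * w * q             # same arithmetic in one vectorizable expression

w = np.random.rand(128)
q = np.random.rand(128)
assert np.allclose(fluxes_scalar(w, q), fluxes_vectorized(w, q))
```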
ISBN (Print): 9781509037117
While numerous applications, such as social networks, protein-protein interaction networks, and bibliographic networks, mainly consist of graph-structured data, massive graphs ranging from millions to billions of nodes are commonplace, and searching within them needs to be efficient. Unfortunately, since the subgraph isomorphism problem is NP-complete, querying large graphs remains challenging. Most existing approaches employ various pruning rules to facilitate the matching process on a single machine; when a data graph is large and dense, the auxiliary information (the index) and the intermediate results can easily exhaust computational resources. Recently, inspired by the popularity of parallel programming models such as MapReduce and Pregel, there has been a trend toward solving the subgraph matching problem on top of them. However, due to the incompleteness of the graph data on each cluster machine, parallel solutions for subgraph matching are often proposed in a brute-force way. In this paper, we propose a parallel subgraph matching framework that uses a k-hop-replication-based partitioning approach to distribute the graph data across cluster machines. The proposed framework ensures the completeness of local searching; hence, previous studies on indexing graph data become usable and valuable for the parallel graph querying problem. For efficiency, taking a lightweight neighborhood-based index as an example, we also propose two potential optimizations for reducing intermediate results. We implement the proposed framework on Hadoop/MapReduce. Our experimental results on real-world data sets demonstrate its effectiveness on very large graphs.
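A minimal sketch of the k-hop replication idea follows, assuming a partition is a set of vertex IDs and the graph is an adjacency dictionary (both illustrative): a partition additionally stores every vertex within k hops of it, so matching a query of radius at most k never needs a remote lookup.

```python
# Compute a partition's k-hop closure: its own vertices plus the replicas.
from collections import deque

def k_hop_closure(adj, part, k):
    seen = set(part)
    frontier = deque((v, 0) for v in part)
    while frontier:
        v, d = frontier.popleft()
        if d == k:
            continue                      # do not expand beyond k hops
        for u in adj.get(v, ()):
            if u not in seen:
                seen.add(u)
                frontier.append((u, d + 1))
    return seen

adj = {1: [2], 2: [1, 3], 3: [2, 4], 4: [3]}
print(k_hop_closure(adj, {1}, 2))  # {1, 2, 3}: vertex 3 is replicated
```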