We show that a careful parallelization of statistical multiresolution estimation (SMRE) improves the phase reconstruction in X-ray near-field holography. The central step in, and the computationally most expensive part of, SMRE methods is Dykstra's algorithm. It projects a given vector onto the intersection of convex sets. We discuss its implementation on NVIDIA's compute unified device architecture (CUDA). Compared to a CPU implementation parallelized with OpenMP, our CUDA implementation is up to one order of magnitude faster. Our results show that a careful parallelization of Dykstra's algorithm enables its use in large-scale statistical multiresolution analyses.
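Dykstra's algorithm is named directly above, so a small sketch may help make the cost structure concrete. The following is a minimal NumPy version, not the paper's CUDA implementation; the two example sets and all function names are illustrative assumptions:

```python
import numpy as np

def dykstra(x0, projections, n_iter=100):
    """Project x0 onto the intersection of convex sets.

    `projections` is a list of functions, each projecting a vector onto
    one convex set. The correction terms (one per set) are what
    distinguish Dykstra's algorithm from plain alternating projections:
    they make the iteration converge to the projection of x0 itself,
    not merely to some point in the intersection.
    """
    x = x0.astype(float).copy()
    corrections = [np.zeros_like(x) for _ in projections]
    for _ in range(n_iter):
        for i, project in enumerate(projections):
            y = project(x + corrections[i])          # project corrected point
            corrections[i] = x + corrections[i] - y  # update correction term
            x = y
    return x

# Toy example: unit ball intersected with the nonnegative orthant.
proj_ball = lambda v: v / max(1.0, np.linalg.norm(v))
proj_orthant = lambda v: np.maximum(v, 0.0)
print(dykstra(np.array([2.0, -1.0]), [proj_ball, proj_orthant]))  # ~[1, 0]
```

The sweep over the sets is inherently sequential, but each projection acts element-wise or block-wise on very large vectors, which is the kind of work that maps naturally onto CUDA threads.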
The demand for fast solution of nonlinear optimization problems, coupled with the emergence of new concurrent computing architectures, drives the need for parallel algorithms to solve challenging nonlinear programming (NLP) problems. In this paper, we propose an augmented Lagrangian interior-point approach for general NLP problems that runs in parallel on a graphics processing unit (GPU). The algorithm is iterative at three levels. The first level replaces the original problem by a sequence of bound-constrained optimization problems using an augmented Lagrangian method. Each of these bound-constrained problems is solved using a nonlinear interior-point method. Inside the interior-point method, the barrier sub-problems are solved using a variation of Newton's method, where the linear system is solved using a preconditioned conjugate gradient (PCG) method, which is implemented efficiently on a GPU in parallel. This algorithm shows an order of magnitude speedup on several test problems from the COPS test set. (C) 2015 Elsevier Ltd. All rights reserved.
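The innermost level, the PCG solve, is where the abstract locates the GPU parallelism: its cost is dominated by matrix-vector products, which parallelize well. A generic Python sketch of PCG under the usual assumptions (A symmetric positive definite; the diagonal preconditioner here is a simple illustrative choice, not necessarily the paper's):

```python
import numpy as np

def pcg(A, b, M_inv, tol=1e-8, max_iter=500):
    """Preconditioned conjugate gradients for A x = b.

    M_inv applies the inverse preconditioner to a vector. Each
    iteration is dominated by one product A @ p, the operation a GPU
    implementation would accelerate.
    """
    x = np.zeros_like(b)
    r = b - A @ x
    z = M_inv(r)
    p = z.copy()
    rz = r @ z
    for _ in range(max_iter):
        Ap = A @ p
        alpha = rz / (p @ Ap)
        x += alpha * p
        r -= alpha * Ap
        if np.linalg.norm(r) < tol:
            break
        z = M_inv(r)
        rz_new = r @ z
        p = z + (rz_new / rz) * p
        rz = rz_new
    return x

A = np.array([[4.0, 1.0], [1.0, 3.0]])
b = np.array([1.0, 2.0])
jacobi_inv = lambda r: r / np.diag(A)   # diagonal (Jacobi) preconditioner
print(pcg(A, b, jacobi_inv))            # ~[0.0909, 0.6364]
```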
Due to severe weather events, there is a growing need for more accurate weather predictions; climate change has increased both the frequency and severity of such events. Optimizing weather model source code would result in reduced run times or more accurate weather predictions. One such weather model is the weather research and forecasting (WRF) model, which is designed for both numerical weather prediction (NWP) and atmospheric research. The WRF software infrastructure consists of several components such as dynamic solvers and physics schemes. The Purdue-Lin scheme is a relatively sophisticated microphysics scheme in the WRF model. The scheme includes six classes of hydrometeors: 1) water vapor; 2) cloud water; 3) rain; 4) cloud ice; 5) snow; and 6) graupel. The scheme is very suitable for massively parallel computation as there are no interactions among horizontal grid points. Thus, we present our optimization results for the Purdue-Lin microphysics scheme. These optimizations included improved vectorization of the code to better utilize the multiple vector units inside each processor core. The optimizations improved the performance of the original unmodified Purdue-Lin microphysics code running natively on a Xeon Phi 7120P by a factor of 4.7x. Similarly, the same optimizations improved the performance of the Purdue-Lin microphysics scheme on a dual-socket configuration of eight-core Intel Xeon E5-2670 CPUs by a factor of 1.3x compared to the original code.
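The claim that the scheme vectorizes well because horizontal grid points do not interact can be illustrated with a toy update. This is not the Purdue-Lin code; the fields and the update rule below are made up purely to show the scalar-loop versus vector-expression contrast:

```python
import numpy as np

ni, nj, dt = 128, 128, 30.0                 # grid size and timestep [s]
rng = np.random.default_rng(0)
qr = rng.random((ni, nj)) * 1e-3            # toy rain mixing ratio [kg/kg]
evap = rng.random((ni, nj)) * 1e-6          # toy evaporation rate [kg/kg/s]

# Scalar form, which the compiler must auto-vectorize:
#   for i in range(ni):
#       for j in range(nj):
#           qr[i, j] = max(qr[i, j] - evap[i, j] * dt, 0.0)
#
# Vectorized form: one SIMD-friendly expression, possible only because
# the update at (i, j) never reads a neighboring horizontal grid point.
qr = np.maximum(qr - evap * dt, 0.0)
print(qr.shape, qr.min() >= 0.0)
```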
Parallel tree skeletons are basic computational patterns that can be used to develop parallel programs for manipulating trees. In this paper, we propose an efficient implementation of parallel tree skeletons on distri...
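Only the truncated preview of this abstract is available above, but parallel tree skeletons themselves are a standard concept, so a small sketch may still help. The `Node` type and `tree_map` below are illustrative assumptions, not the paper's implementation:

```python
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class Node:
    value: int
    left: Optional["Node"] = None
    right: Optional["Node"] = None

def tree_map(f: Callable[[int], int], t: Optional[Node]) -> Optional[Node]:
    """One basic tree skeleton: apply f to every node.

    The skeleton fixes the traversal pattern once; a parallel
    implementation can run the two recursive calls below concurrently
    (or distribute subtrees across machines) without changing user code.
    """
    if t is None:
        return None
    return Node(f(t.value), tree_map(f, t.left), tree_map(f, t.right))

t = Node(1, Node(2), Node(3))
print(tree_map(lambda v: v * 10, t).value)  # 10
```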
The shift towards multicore processing has led to a much wider population of developers being faced with the challenge of exploiting parallel cores to improve software performance. Debugging and optimizing parallel programs is a complex and demanding task. Tools which support the development of parallel programs should provide salient information that allows programmers of multicore systems to diagnose and distinguish performance problems. Appropriate design of such tools requires a systematic analysis of the problems which might be identified, and the information used to diagnose them. Building on the literature, we put forward a potential taxonomy of parallel performance problems, and an observational model which links measurable performance data to these problems. We present a validation of this model carried out with parallel programming experts, identifying areas of agreement and disagreement. This is accompanied by a survey of the prevalence of these problems in software development. From this we can identify contentious areas worthy of further exploration, as well as those with high prevalence and strong agreement, which are natural candidates for initial moves towards better tool support.
With data increasing at an incredible rate, the development of cloud computing technologies is of critical importance to the advancement of research. MapReduce is a widely adopted computing framework for data-intensive applications running on clusters. Traditional parallel XML parsing and indexing approaches are inadequate for processing large-scale XML datasets on clusters; therefore, we propose an approach to exploit data parallelism in XML processing using MapReduce in Hadoop. Our solution seamlessly integrates data storage, labeling, indexing, and parallel queries to process a massive amount of XML data. Specifically, we introduce an SDN labeling algorithm and a distributed hierarchical index using DHTs. More importantly, we design an advanced two-phase MapReduce solution that is able to efficiently address the issues of labeling, indexing, and query processing on big XML data. The first MapReduce phase applies filtering, labeling, and index-building techniques, in which each DataNode performs element labeling using a map function and merges and builds indexes using a reduce function. In the second phase, local XML queries in multiple partitions are performed in parallel using index-table-enabled B-SLCA. Our experimental results show the efficiency and effectiveness of our proposed parallel XML processing approach using the MapReduce framework.
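As a very rough sketch of the first phase's map/reduce split (the paper's SDN labeling and DHT-based hierarchical index are more elaborate; the functions and data layout below are simplified assumptions):

```python
from collections import defaultdict

def map_elements(partition_id, elements):
    """Map step run on each DataNode: emit (tag, location) pairs for the
    XML elements parsed from one partition."""
    for tag, offset in elements:
        yield tag, (partition_id, offset)

def reduce_index(mapped_pairs):
    """Reduce step: merge emitted pairs into a per-tag index."""
    index = defaultdict(list)
    for tag, location in mapped_pairs:
        index[tag].append(location)
    return dict(index)

pairs = list(map_elements(0, [("book", 12), ("title", 40), ("book", 88)]))
print(reduce_index(pairs))
# {'book': [(0, 12), (0, 88)], 'title': [(0, 40)]}
```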
Near-duplicate document detection attracts much attention from researchers since document production is growing very rapidly. The main problem confronted in duplicate or near-duplicate document detection is the very high dimensionality of the data, which increases the time and space requirements for processing it. As new documents continue to be produced, a system to detect similarity among documents becomes almost impracticable. We propose a new approach for solving this problem which consists of reducing the dimensionality of the data and also using parallel programming efficiently to fully exploit the available capacity of the hardware. The intuition behind using parallel programming is that more processors/cores will perform better than a single processor if they are managed well. We have implemented our method and tested it empirically, and the experimental results demonstrate that our algorithm performs better than other methods used for All Pairs Similarity Search (APSS) which employ multi-core parallel programming to detect document similarity. The results show that our method can reduce the number of terms used in similarity computation by up to 65%, and its execution time is better than that of the Partition-based Similarity Search method, which uses parallel processing for document similarity.
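A hedged sketch of the two ideas, dimensionality reduction followed by parallel pair scoring, is below. The pruning rule, the cosine measure, and all names are generic choices for illustration, not the paper's exact method:

```python
import math
from concurrent.futures import ProcessPoolExecutor
from itertools import combinations

def prune(vec, mass=0.8):
    """Keep only the heaviest terms covering `mass` of the total weight,
    shrinking the vector before any pair is scored."""
    total, kept, acc = sum(vec.values()), {}, 0.0
    for term, w in sorted(vec.items(), key=lambda kv: -kv[1]):
        kept[term] = w
        acc += w
        if acc >= mass * total:
            break
    return kept

def cosine(pair):
    a, b = pair
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0

if __name__ == "__main__":
    docs = [{"data": 3.0, "xml": 1.0}, {"data": 2.0, "index": 2.0}]
    pruned = [prune(d) for d in docs]
    with ProcessPoolExecutor() as pool:             # score pairs across cores
        scores = list(pool.map(cosine, combinations(pruned, 2)))
    print(scores)
```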
HTS (Hash Type System) is a type system designed for component-based high performance computing (CBHPC) platforms, aimed at reconciling portability, modularity by separation of concerns, a high level of abstraction and high performance. Portability and modularity are properties of component-based systems that have been extensively validated. For improving the performance of HPC applications, HTS introduces an automated approach for dynamically discovering, loading and binding parallel components tuned for the characteristics of the parallel computing platforms where the application will execute. To do so, it is based on contextual abstraction, where the performance of components that encapsulate parallel computations, communication patterns and data structures may be tuned according to the features of parallel computing platforms and the application requirements. In turn, for providing a higher level of abstraction in parallel programming, HTS supports an expressive approach for skeleton-based programming. A study of the safety properties of HTS using a calculus of component composition has provided solid foundations for the design of configuration languages for the safe specification and deployment of parallel components. The features of HTS are validated with three case studies that exercise the programming techniques behind contextual abstraction, including skeletons and performance tuning. (C) 2016 Elsevier B.V. All rights reserved.
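Contextual abstraction, resolving an abstract component to a concrete implementation tuned for the current platform, can be caricatured in a few lines. The registry below is an illustrative stand-in, not HTS's Hash component model or its configuration languages:

```python
registry = {}

def register(name, context, impl):
    """Associate a concrete implementation with (component, context)."""
    registry[(name, context)] = impl

def resolve(name, context):
    """Pick the variant tuned for this platform context, falling back to
    a portable default when no tuned variant was registered."""
    return registry.get((name, context)) or registry[(name, "default")]

register("reduce_sum", "default", lambda xs: sum(xs))
register("reduce_sum", "gpu", lambda xs: sum(xs))  # stand-in for a tuned kernel

impl = resolve("reduce_sum", "cluster")  # no tuned variant: default is chosen
print(impl([1, 2, 3]))                   # 6
```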
Massive amounts of data generated in large-scale grids pose a formidable challenge for real-time monitoring of power systems. Dynamic state estimation, which is a prerequisite for the normal operation of power systems, involves the time-constrained solution of a large set of equations, which requires significant computational resources. In this study, an efficient and accurate relaxation-based parallel processing technique is proposed in the presence of phasor measurement units. A combination of different types of parallelism is used on both single and multiple graphics processing units to accelerate large-scale joint dynamic state estimation simulation. The estimation results for both generator and network states verify that proper massive-thread parallel programming makes the entire implementation scalable and efficient with high accuracy.
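The abstract does not spell out its relaxation scheme, but the reason relaxation methods suit massive-thread hardware can be shown with the classic Jacobi iteration, in which every unknown is updated independently from the previous iterate. A generic sketch, not the paper's estimator:

```python
import numpy as np

def jacobi(A, b, tol=1e-10, max_iter=500):
    """Jacobi relaxation for A x = b (A diagonally dominant). All
    component updates within a sweep are independent, so each one can
    be assigned to its own GPU thread."""
    D = np.diag(A)
    R = A - np.diagflat(D)
    x = np.zeros_like(b)
    for _ in range(max_iter):
        x_new = (b - R @ x) / D        # every entry updates in parallel
        if np.linalg.norm(x_new - x) < tol:
            return x_new
        x = x_new
    return x

A = np.array([[4.0, 1.0], [2.0, 5.0]])
print(jacobi(A, np.array([1.0, 2.0])))  # ~[0.1667, 0.3333]
```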
The need for accelerating power grid simulation through high performance computing (HPC) has long been recognized, and prior efforts have been devoted to developing one-off parallel computing applications for particular power grid functions. Non-transferable software codes and duplicated implementations in these prior efforts are a major barrier to more widespread HPC adoption in power grid applications. Modern HPC hardware and architecture require significant computing expertise for application development. The GridPACK (TM) software framework described in this paper provides an HPC-compatible software structure to access modern parallel solvers and HPC-ready modules for common components in power grid simulation applications. GridPACK hides the HPC details and enables power system developers to focus on applications instead of computational details. Several example applications of GridPACK are presented to demonstrate the capabilities of GridPACK and the performance of HPC simulations with large power grid networks. Examples discussed include: a dynamic simulation application capable of running a 17,156-bus Western Electricity Coordinating Council (WECC) system at a computational speed faster than real time (e.g., under 30 s for a 30-s simulation), a static contingency analysis application using a task manager, and a dynamic contingency analysis application utilizing two levels of parallelism. These example applications illustrate GridPACK's capabilities to support different types of simulations within a unified framework and to support reuse of transferable software codes across power grid applications. The computational results indicate strong performance improvements for power grid simulations with GridPACK. (C) 2016 Elsevier B.V. All rights reserved.
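The task-manager pattern used by the static contingency analysis application, independent cases handed out to workers as they become free, is easy to picture generically. This is plain Python, not GridPACK's actual C++ API, and `simulate_contingency` is a placeholder:

```python
from concurrent.futures import ProcessPoolExecutor

def simulate_contingency(case_id):
    # Placeholder for a power-flow solve with one element outaged.
    return case_id, "converged"

if __name__ == "__main__":
    cases = range(8)   # e.g., one case per candidate line outage
    with ProcessPoolExecutor(max_workers=4) as pool:
        for case_id, status in pool.map(simulate_contingency, cases):
            print(f"contingency {case_id}: {status}")
```

A second level of parallelism, as in the dynamic contingency analysis example, would run each individual case on its own group of processors rather than on a single worker.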