ISBN (Print): 9780769549798
The growing speed gap between CPU and memory makes I/O the main bottleneck of many industrial applications. Some applications need to perform I/O operations on very large volumes of data frequently, which seriously harms performance. This work is motivated by geophysical applications used for oil and gas exploration. These applications process terabyte-scale datasets in HPC facilities; the datasets represent subsurface models and field-recorded data. In general terms, these applications read huge amounts of data as input and write huge amounts as intermediate/final results, where the underlying algorithms implement seismic imaging techniques. Traditional sequential I/O, even when coupled with advanced storage systems, cannot complete all I/O operations for such large volumes of data in an acceptable time. Parallel I/O is the general strategy for solving such problems. However, because of the dynamic nature of many of these applications, each parallel process does not know the amount of data it needs to write until its computation is done, and it also cannot identify the position in the file to write to. In order to write correctly and efficiently, communication and synchronization are required among all processes to fully exploit the parallel I/O paradigm. To tackle these issues, we use a dynamic load balancing framework that is general enough for most of these applications. To reduce the expensive synchronization and communication overhead, we introduce an I/O node that only handles I/O requests, and we let compute nodes perform I/O operations in parallel. Using both POSIX I/O and memory-mapping interfaces, the experiments indicate that our approach is scalable. For instance, with 16 processes, the bandwidth of parallel reading can reach the theoretical peak performance (2.5 GB/s) of the storage infrastructure. Also, parallel writing can be up to 4.68x (speedup, POSIX I/O) and 7.23x (speedup, memory-mapping) more efficient than the serial I/O implementation. Since, mo...
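Below is a minimal sketch (not the paper's framework) of the offset coordination described above, assuming MPI for the inter-process communication: each process learns its output size only after its computation finishes, an exclusive prefix sum over those sizes yields disjoint file offsets, and every process then writes its own region concurrently through POSIX pwrite. The file name and placeholder sizes are purely illustrative.

#include <mpi.h>
#include <fcntl.h>
#include <unistd.h>
#include <stdlib.h>
#include <string.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* The output size is known only after the (dynamic) computation. */
    long my_bytes = 1024L * (rank + 1);              /* placeholder result size */
    char *buf = malloc(my_bytes);
    memset(buf, 'A' + rank, my_bytes);

    /* An exclusive prefix sum of the sizes gives every process a disjoint
       file offset, so all processes can write concurrently. */
    long my_offset = 0;
    MPI_Exscan(&my_bytes, &my_offset, 1, MPI_LONG, MPI_SUM, MPI_COMM_WORLD);
    if (rank == 0) my_offset = 0;                    /* Exscan leaves rank 0 undefined */

    int fd = open("result.bin", O_WRONLY | O_CREAT, 0644);
    (void)pwrite(fd, buf, (size_t)my_bytes, (off_t)my_offset);  /* positional write, no shared file pointer */
    close(fd);

    free(buf);
    MPI_Finalize();
    return 0;
}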
ISBN (Print): 9781450320795
Although fences are designed for low-overhead concurrency coordination, they can be expensive in current machines. If fences were largely free, faster fine-grained concurrent algorithms could be devised, and compilers could guarantee Sequential Consistency (SC) at little cost. In this paper, we present WeeFence (or WFence for short), a fence that is very cheap because it allows post-fence accesses to skip it. Such accesses can typically complete and retire before the pre-fence writes have drained from the write buffer. Only when an incorrect reordering of accesses is about to happen, does the hardware stall to prevent it. In the paper, we present the WFence design for TSO, and compare it to a conventional fence with speculation for 8-processor multicore simulations. We run parallel kernels that contain explicit fences and parallel applications that do not. For the kernels, WFence eliminates nearly all of the fence stall, reducing the kernels' execution time by an average of 11%. For the applications, a conservative compiler algorithm places fences in the code to guarantee SC. In this case, on average, WFences reduce the resulting fence overhead from 38% of the applications' execution time to 2% (in a centralized WFence design), or from 36% to 5% (in a distributed WFence design). Copyright 2013 ACM.
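For context on why such explicit fences appear in fine-grained kernels, here is a generic C11 sketch (unrelated to the paper's hardware design): on TSO, a later load may bypass an earlier store still sitting in the write buffer, so Dekker-style mutual exclusion needs the seq_cst fences below; these are exactly the ordering points WeeFence aims to make nearly free.

#include <stdatomic.h>
#include <pthread.h>
#include <stdio.h>

atomic_int flag0, flag1;      /* intent flags, zero-initialized          */
int in_critical;              /* counts threads that entered the section */

void *thread0(void *arg) {
    atomic_store_explicit(&flag0, 1, memory_order_relaxed);      /* pre-fence write     */
    atomic_thread_fence(memory_order_seq_cst);                   /* the expensive fence */
    if (atomic_load_explicit(&flag1, memory_order_relaxed) == 0) /* post-fence read     */
        in_critical++;
    return NULL;
}

void *thread1(void *arg) {
    atomic_store_explicit(&flag1, 1, memory_order_relaxed);
    atomic_thread_fence(memory_order_seq_cst);
    if (atomic_load_explicit(&flag0, memory_order_relaxed) == 0)
        in_critical++;
    return NULL;
}

int main(void) {
    pthread_t t0, t1;
    pthread_create(&t0, NULL, thread0, NULL);
    pthread_create(&t1, NULL, thread1, NULL);
    pthread_join(t0, NULL);
    pthread_join(t1, NULL);
    printf("threads in critical section: %d\n", in_critical);    /* at most 1 with the fences */
    return 0;
}

Without the two fences, both threads could read 0 and enter together; a conventional fence prevents this by stalling unconditionally, whereas WeeFence stalls only when the incorrect reordering is actually about to happen.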
Since the data rate in wireless communication systems has increased exponentially over the last decades, serious effort must be devoted to developing mobile terminals that provide enough processing capability. For ins...
ISBN (Print): 9780769549798
PDSLin is a general-purpose algebraic parallel hybrid (direct/iterative) linear solver based on the Schur complement method. The most challenging step of the solver is the computation of a preconditioner based on the global Schur complement. Efficient parallel computation of the preconditioner gives rise to partitioning problems with sophisticated constraints and objectives. In this paper, we identify two such problems and propose hypergraph partitioning methods to address them. The first problem is to balance the workloads associated with the different subdomains when computing the preconditioner. We first formulate an objective function and a set of constraints to model the preconditioner computation time. Then, to address these complex constraints, we propose a recursive hypergraph bisection method. The second problem is to improve data locality during the parallel solution of a sparse triangular system with multiple sparse right-hand sides. We carefully analyze the objective function and show that it can be well approximated by a standard hypergraph partitioning method. Moreover, an ordering compatible with a postordering of the subdomain elimination tree is shown to be very effective in preserving locality. To evaluate the two proposed methods in practice, we present experimental results using linear systems arising from applications of interest to us. First, we show that, in comparison to a commonly used nested graph dissection method, the proposed recursive hypergraph partitioning method reduces the preconditioner construction time, especially when the number of subdomains is moderate. This is the desired result, since PDSLin is based on a two-level parallelization that keeps the number of subdomains small by assigning multiple processors to each subdomain. We also show that our second proposed hypergraph method improves data locality during the sparse triangular solution and reduces the solution time. Moreover, we show that the partitioning time can be...
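For background (standard Schur complement notation, not reproduced from the paper): with the unknowns ordered so that subdomain interiors come first and the interface unknowns last, the system and the global Schur complement on which the preconditioner is built are

% A_{11} gathers the (block-diagonal) subdomain interiors, A_{22} the interface unknowns.
\begin{pmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \end{pmatrix}
\begin{pmatrix} x_1 \\ x_2 \end{pmatrix}
=
\begin{pmatrix} b_1 \\ b_2 \end{pmatrix},
\qquad
S = A_{22} - A_{21} A_{11}^{-1} A_{12}.

Balancing the per-subdomain cost of forming the A_{21} A_{11}^{-1} A_{12} contribution is essentially the load-balancing problem that the recursive hypergraph bisection targets.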
Conflict detection and resolution are among the most fundamental issues in transactional memory systems. Hardware transactional memory (HTM) systems such as AMD's Advanced Synchronization Facility (ASF) employ inh...
ISBN (Print): 9781467360760
A fast algorithm for calculating the LRCS of a complex target is presented, and the Compute Unified Device Architecture (CUDA) is used to accelerate the color-flagged Graphical Electromagnetic Computing (GRECO) method. Based on the five-parameter bidirectional reflectance distribution function (BRDF) model, one target's LRCS can be divided into millions of per-pixel computations, which are processed on a great many threads running in parallel. The comparison of CPU calculation and CUDA parallel calculation for one aircraft shows that CUDA parallel calculation is several times faster. With the help of CUDA and the GRECO method, electromagnetic simulation of LRCS can be performed much more efficiently.
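A serial sketch of the per-pixel decomposition described above (hypothetical names and a stand-in BRDF term, not the paper's code): each screen pixel of the rendered target contributes independently to the accumulated LRCS, which is what lets the CUDA version assign one thread per pixel and replace the loop below with a parallel reduction.

#include <math.h>
#include <stdio.h>
#include <stddef.h>

typedef struct {
    double cos_incident;   /* incidence term recovered from the pixel's normal */
    double params[5];      /* the five BRDF parameters attached to this pixel  */
} Pixel;

/* Placeholder for the five-parameter BRDF evaluation used by the paper;
   only a toy diffuse term is shown here. */
static double brdf_contribution(const Pixel *px) {
    return px->params[0] * fabs(px->cos_incident);
}

static double laser_rcs(const Pixel *pixels, size_t n) {
    double sum = 0.0;
    for (size_t i = 0; i < n; i++)     /* one CUDA thread per iteration in the GPU version */
        sum += brdf_contribution(&pixels[i]);
    return sum;
}

int main(void) {
    Pixel img[2] = { { 0.8, {0.3, 0, 0, 0, 0} }, { 0.5, {0.6, 0, 0, 0, 0} } };
    printf("toy LRCS accumulation: %f\n", laser_rcs(img, 2));
    return 0;
}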
ISBN (Print): 9781467355919
Searching a Peer-to-Peer (P2P) network without using a central index has been widely investigated but proved to be very difficult. Various strategies have been proposed; however, no practical solution to date also addresses privacy concerns. By clustering peers which have similar interests, a semantic overlay provides a method for achieving scalable search. Traditionally, in order to find similar peers, a peer is required to fully expose its preferences for items or content, thereby disclosing this private information. However, in a hostile environment, such as a P2P system, a peer cannot know the true identity or intentions of fellow peers. In this paper, we propose two protocols for building a semantic overlay in a privacy-preserving manner by modifying existing solutions to the Private Set Intersection (PSI) problem. Peers in our overlay compute their similarity to other peers in the encrypted domain, allowing them to find similar peers. Using homomorphic encryption, peers can carry out computations on encrypted values without needing to decrypt them first. We propose two protocols, one based on the inner product of vectors, the other on multivariate polynomial evaluation, which are able to compute a similarity value between two peers. Both protocols are implemented on top of an existing P2P platform and are designed for actual deployment. Using a supercomputer and a dataset extracted from a real-world instance of a semantic overlay, we emulate our protocols in a network consisting of a thousand peers. Finally, we show the actual computational and bandwidth usage of the protocols as recorded during those experiments.
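A toy sketch of the additively homomorphic primitive behind the inner-product protocol (textbook Paillier with tiny, insecure parameters; all names and numbers are illustrative, not the paper's implementation): peer A encrypts its preference vector, peer B combines the ciphertexts using only public operations, and A decrypts the result to obtain the inner product, i.e. the similarity value, without B ever seeing A's vector.

#include <stdio.h>
#include <stdint.h>

typedef unsigned __int128 u128;                       /* assumes a GCC/Clang __int128 type */

static uint64_t mulmod(uint64_t a, uint64_t b, uint64_t m) { return (u128)a * b % m; }

static uint64_t powmod(uint64_t b, uint64_t e, uint64_t m) {
    uint64_t r = 1;
    for (b %= m; e; e >>= 1) {
        if (e & 1) r = mulmod(r, b, m);
        b = mulmod(b, b, m);
    }
    return r;
}

/* modular inverse via the extended Euclidean algorithm */
static int64_t invmod(int64_t a, int64_t m) {
    int64_t t = 0, newt = 1, r = m, newr = a % m;
    while (newr) {
        int64_t q = r / newr, tmp;
        tmp = t - q * newt; t = newt; newt = tmp;
        tmp = r - q * newr; r = newr; newr = tmp;
    }
    return t < 0 ? t + m : t;
}

/* Toy key: p = 61, q = 53 -- far too small for privacy, enough to show the math. */
static const uint64_t n      = 61 * 53;               /* 3233         */
static const uint64_t n2     = 3233ULL * 3233;        /* n^2          */
static const uint64_t g      = 3234;                  /* n + 1        */
static const uint64_t lambda = 780;                   /* lcm(60, 52)  */

static uint64_t L(uint64_t x) { return (x - 1) / n; }

static uint64_t enc(uint64_t m, uint64_t r) {         /* r must be coprime to n */
    return mulmod(powmod(g, m, n2), powmod(r, n, n2), n2);
}

static uint64_t dec(uint64_t c) {
    uint64_t mu = (uint64_t)invmod((int64_t)L(powmod(g, lambda, n2)), (int64_t)n);
    return mulmod(L(powmod(c, lambda, n2)), mu, n);
}

int main(void) {
    uint64_t a[3] = {3, 1, 4}, b[3] = {2, 7, 1};      /* the two peers' rating vectors */
    uint64_t r[3] = {17, 23, 29};                     /* peer A's random nonces        */

    /* Peer B sees only ciphertexts: E(a_i)^{b_i} multiplied together equals
       E(sum a_i * b_i) by the additive homomorphism. */
    uint64_t acc = enc(0, 31);
    for (int i = 0; i < 3; i++)
        acc = mulmod(acc, powmod(enc(a[i], r[i]), b[i], n2), n2);

    printf("decrypted inner product: %llu (expected 17)\n",
           (unsigned long long)dec(acc));
    return 0;
}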