the logical structure of a forest of octrees can be used to create scalable algorithms for parallel adaptive mesh refinement (AMR), which has recently been demonstrated for several petascale applications. Among variou...
详细信息
ISBN:
(纸本)9780769546759
the logical structure of a forest of octrees can be used to create scalable algorithms for parallel adaptive mesh refinement (AMR), which has recently been demonstrated for several petascale applications. Among various frequently used octree-based mesh operations, including refinement, coarsening, partitioning, and enumerating nodes, ensuring a 2:1 size balance between neighboring elements has historically been the most expensive in terms of CPU time and communication volume. the 2:1 balance operation is thus a primary target to optimize. One important component of a parallel balance algorithm is the ability to determine whether any two given octants have a consistent distance/size relation. Based on new logical concepts we propose fast algorithms for making this decision for all types of 2: 1 balance conditions in 2D and 3D. Since we are able to achieve this without constructing any parent nodes in the tree that would otherwise need to be sorted and communicated, we can significantly reduce the required memory and communication volume. In addition, we propose a lightweight collective algorithm for reversing the asymmetric communication pattern induced by non-local octant interactions. We have implemented our improvements as part of the open-source "p4est" software. Benchmarking this code with both synthetic and simulation-driven adapted meshes we are able to demonstrate much reduced runtime and excellent weak and strong scalability. On our largest benchmark problem with5.13 x 10(11) octants the new 2:1 balance algorithm executes in less than 8 seconds on 112,128 CPU cores of the Jaguar Cray XT5 supercomputer.
With growing of Semantic Web data, especially RDF data, managing large RDF dataset on a single machine does not scale well. Previous work has explored how to distribute RDF triples to multiple machines. However due to...
详细信息
In a previous paper [14], we investigated the expressiveness of Time Petri Nets extended with dynamic Priorities and showed that it is able to analyze the schedulability of a partitioned real-time system over a multip...
详细信息
the proceedings contain 55 papers. the topics discussed include: an equilibrium-based approach for determining winners in combinatorial auctions;a parallel 3D delaunay triangulation method;efficient resource selection...
ISBN:
(纸本)9780769544281
the proceedings contain 55 papers. the topics discussed include: an equilibrium-based approach for determining winners in combinatorial auctions;a parallel 3D delaunay triangulation method;efficient resource selection algorithm for enterprise grid systems;performance evaluation of wireless sensor networks for mobile sensor nodes considering goodput and depletion metrics;improving performance on atmospheric models through a hybrid OpenMP/MPI implementation;on using pattern matching algorithms in MapReduce applications;dynamic memory demand estimating based on the guest operating system behaviors for virtual machines;a priority-aware NoC to reduce squashes in thread level speculation for chip multiprocessors;a novel predictive and self-adaptive dynamic thread pool management;and Vega LingCloud: a resource single leasing point system to support heterogeneous application modes on shared infrastructure.
We discuss some performance issues of the tiled Cholesky factorization on non-uniform memory access-time (NUMA) shared memory machines. We show how to optimize thread placement and data placement in order to achieve p...
详细信息
ISBN:
(纸本)9780769548982;9781467345668
We discuss some performance issues of the tiled Cholesky factorization on non-uniform memory access-time (NUMA) shared memory machines. We show how to optimize thread placement and data placement in order to achieve performance gain up to 50% compared to state-of-the-art libraries such as Plasma or MKL.
Systems intended for the execution of long-running parallel applications require fault tolerant capabilities, since the probability of failure increases withthe execution time and the number of nodes. Checkpointing a...
详细信息
Graph-based structures are being increasingly used to model data and relations among data in a number of fields. Graph-based databases are becoming more popular as a means to better represent such data. Graph traversa...
详细信息
ISBN:
(纸本)9780769546759
Graph-based structures are being increasingly used to model data and relations among data in a number of fields. Graph-based databases are becoming more popular as a means to better represent such data. Graph traversal is a key component in graph algorithms such as reachability and graph matching. Since the scale of data stored and queried in these databases is increasing, it is important to obtain high performing implementations of graph traversal that can efficiently utilize the processing power of modern processors. In this work, we present a scalable Breadth-First Search Traversal algorithm for modern multi-socket, multi-core CPUs. Our algorithm uses lock- and atomic-free operations on a cache-resident structure for arbitrary sized graphs to filter out expensive main memory accesses, and completely and efficiently utilizes all available bandwidth resources. We propose a work distribution approach for multi-socket platforms that ensures load-balancing while keeping cross-socket communication low. We provide a detailed analytical model that accurately projects the performance of our single- and multi-socket traversal algorithms to within 5-10% of obtained performance. Our analytical model serves as a useful tool to analyze performance bottlenecks on modern CPUs. When measured on various synthetic and real-world graphs with a wide range of graph sizes, vertex degrees and graph diameters, our implementation on a dual-socket Intel (R) Xeon (R) X5570 (Intel microarchitecture code name Nehalem) system achieves 1.5X-13.2X performance speedup over the best reported numbers. We achieve around 1 Billion traversed edges per second on a scale-free R-MAT graph with 64M vertices and 2 Billion edges on a dual-socket Nehalem system. Our optimized algorithm is useful as a building block for efficient multi-node implementations and future exascale systems, thereby allowing them to ride the trend of increasing per-node compute and bandwidth resources.
the modulation methods are discussed in parallel combinatory spread spectrum (PCSS) system. Several modulation methods which can be demodulated without threshold judging (that is soft decision) is mainly studied. the ...
详细信息
the Non-Uniform Fast Fourier Transform (NUFFT) is a generalization of FFT to non-equidistant samples. It has many applications which vary from medical imaging to radio astronomy to the numerical solution of partial di...
详细信息
ISBN:
(纸本)9780769546759
the Non-Uniform Fast Fourier Transform (NUFFT) is a generalization of FFT to non-equidistant samples. It has many applications which vary from medical imaging to radio astronomy to the numerical solution of partial differential equations. Despite recent advances in speeding up NUFFT on various platforms, its practical applications are still limited, due to its high computational cost, which is significantly dominated by the convolution of a signal between a non-uniform and uniform grids. the computational cost of the NUFFT is particularly detrimental in cases which require fast reconstruction times, such as iterative 3D non-Cartesian MRI reconstruction. We propose novel and highly scalable parallel algorithm for performing NUFFT on x86-based multi-core CPUs. the high performance of our algorithm relies on good SIMD utilization and high parallel efficiency. On convolution, we demonstrate on average 90% SIMD efficiency using SSE, as well up to linear scalability using a quad-socket 40-core Intel (R) Xeon (R) E7-4870 Processors based system. As a result, on dual socket Intel (R) Xeon (R) X5670 based server, our NUFFT implementation is more than 4x faster compared to the best available NUFFT3D implementation, when run on the same hardware. On Intel (R) Xeon (R) E5-2670 processor based server, our NUFFT implementation is 1.5X faster than any published NUFFT implementation today. Such speed improvement opens new usages for NUFFT. For example, iterative multichannel reconstruction of a 240x240x240 image could execute in just over 3 minutes, which is on the same order as contemporary non-iterative (and thus less-accurate) 3D NUFFT-based MRI reconstructions.
Multi-core CPUs are very efficient at executing multiple threads at the same time without significant performance penalty;this capability, however, results in increasing demand for the memory and the caches, which not...
详细信息
暂无评论