ISBN: 9780769549712 (print)
IBM Blue Gene/Q is the next-generation Blue Gene machine that can scale to tens of petaflops with 16 cores and 64 hardware threads per node. However, significant effort is required to fully exploit its capacity on various applications, spanning multiple programming models. In this paper, we focus on the asynchronous message-driven parallel programming model Charm++. Since its asynchronous behavior is substantially different from that of MPI, porting it efficiently to BG/Q presents a challenge. On the other hand, the significant synergy between BG/Q software and Charm++ creates opportunities for effective utilization of BG/Q resources. We describe various novel fine-grained threading techniques in Charm++ to exploit the hardware features of the BG/Q compute chip. These include the use of L2 atomics to implement lockless producer-consumer queues that accelerate communication between threads, fast memory allocators, and hardware communication threads that are awakened via low-overhead interrupts from the BG/Q wakeup unit. Bursts of short messages are processed using the ManytoMany interface to reduce runtime overhead. We also present techniques to optimize NAMD computation via Quad Processing Unit (QPX) vector instructions and the acceleration of message rate via communication threads to optimize the Particle Mesh Ewald (PME) computation. We demonstrate the benefits of our techniques via two benchmarks, 3D Fast Fourier Transform and the molecular dynamics application NAMD. For the 92,000-atom ApoA1 molecule, we achieved 683 µs/step with PME every 4 steps and 782 µs/step with PME every step.
Polygon overlay is one of the complex operations in Geographic Information Systems (GIS). In GIS, a typical polygon tends to be large in size often consisting of thousands of vertices. Sequential algorithms for this p...
ISBN: 9780769549712 (print)
We present techniques to process large scale-free graphs in distributed memory. Our aim is to scale to trillions of edges, and our research is targeted at leadership-class supercomputers and clusters with local non-volatile memory, e.g., NAND Flash. We apply an edge list partitioning technique designed to accommodate high-degree vertices (hubs), which create scaling challenges when processing scale-free graphs. In addition to partitioning hubs, we use ghost vertices to represent the hubs and reduce communication hotspots. We present a scaling study with three important graph algorithms: Breadth-First Search (BFS), K-Core decomposition, and Triangle Counting. We also demonstrate scalability on BG/P Intrepid by comparing to the best-known Graph500 results [1]. We show results on two clusters with local NVRAM storage that are capable of traversing trillion-edge scale-free graphs. By leveraging node-local NAND Flash, our approach can process thirty-two times larger datasets with only a 39% performance degradation in Traversed Edges Per Second (TEPS).
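To make the edge list partitioning idea concrete, here is a minimal sketch of one plausible reading of it: the edge list is sorted and cut into near-equal contiguous chunks, so a hub's edges spill across several partitions instead of overloading one. The function names (`partition_edge_list`, `vertex_owners`) and the toy graph are illustrative assumptions, not the authors' implementation, and ghost vertices are not modeled.

```python
from collections import defaultdict


def partition_edge_list(edges, num_parts):
    """Cut a sorted edge list into num_parts contiguous, near-equal chunks.

    Assumption for illustration: balancing is by edge count only.
    """
    edges = sorted(edges)
    base, extra = divmod(len(edges), num_parts)
    parts, start = [], 0
    for p in range(num_parts):
        # The first `extra` partitions take one additional edge.
        size = base + (1 if p < extra else 0)
        parts.append(edges[start:start + size])
        start += size
    return parts


def vertex_owners(parts):
    """Map each source vertex to the partitions that hold some of its edges."""
    owners = defaultdict(list)
    for p, chunk in enumerate(parts):
        for src in {s for s, _ in chunk}:
            owners[src].append(p)
    return dict(owners)


# Hypothetical toy graph: vertex 0 is a hub with six out-edges.
edges = [(0, d) for d in range(1, 7)] + [(1, 2), (2, 3)]
parts = partition_edge_list(edges, 4)
owners = vertex_owners(parts)
```

In this toy case every partition receives the same number of edges, and the hub's adjacency list is shared by several partitions. A vertex-based (1D) partitioning would instead place all six hub edges on a single partition.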
In this paper, we approach the design of energy- and security-critical distributed real-time embedded systems from the early mapping and scheduling phases. Modern distributed Embedded Systems (DESs) are common to be co...
ISBN: 9781479920815 (print)
The Weather Research and Forecasting (WRF) model has been widely employed for weather prediction and atmospheric simulation, serving both forecasting and research. Land-surface models (LSMs) are parts of the WRF model that provide information on heat and moisture fluxes over land and sea-ice points. The 5-layer thermal diffusion simulation is an LSM based on the MM5 soil temperature model, with an energy budget made up of sensible, latent, and radiative heat fluxes. Because there are no interactions among horizontal grid points, the LSMs are well suited to massively parallel processing. The study presented in this article demonstrates the parallel computing efforts on the WRF 5-layer thermal diffusion scheme using a Graphics Processing Unit (GPU). Since this scheme is only one intermediate module of the entire WRF model, no I/O transfer occurs in the intermediate process. By employing one NVIDIA GTX 680 GPU in the case without I/O transfer, our optimization efforts on the GPU-based 5-layer thermal diffusion scheme reach a speedup as high as 247.5x with respect to one CPU core, whereas the speedup for one CPU socket with respect to one CPU core is only 3.1x. We can boost the speedup further to 332x with respect to one CPU core when three GPUs are applied.
It is well known that the method of parallel downloading can be used to reduce file download times in a peer-to-peer (P2P) network. There has been little investigation on parallel download and chunk allocation for sou...
We study in this paper the influence of the restart policy on the sequential and parallel performance of combinatorial search problems. Our evaluation relies on several experiments using a constraint-based local searc...
Parallel programming patterns provide enduring principles that serve as a conceptual framework to orient students when they set out to solve problems. Learning patterns enables students to quickly gain the intellectua...
ISBN: 9781467331463; 9781467331449 (print)
The diffusion LMS algorithm has been extensively studied in recent years. This efficient approach addresses distributed optimization problems over sensor networks in the case where the nodes have to collaboratively estimate a single parameter vector. Nevertheless, real-life problems are often multitask-oriented in the sense that the optimum parameter vector may not be the same for every node. In this paper, we conduct a theoretical analysis of the stochastic behavior of diffusion LMS when it is applied, either intentionally or unintentionally, to multitask problems, that is, in a situation where the founding hypothesis of this algorithm is violated. Simulation results validate our theoretical model. Theoretical analysis and simulation show that collaboration can still be beneficial, and that its benefit depends on the antagonistic effects of the estimation bias-variance trade-off. This work provides a theoretical justification for the need to derive new cooperative algorithms specifically dedicated to multitask problems.
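As a rough illustration of the setting, the sketch below implements adapt-then-combine (ATC) diffusion LMS, one common variant; the abstract does not say which variant the authors analyze. The function name, the synthetic Gaussian data model, and all parameter values are assumptions chosen for illustration only.

```python
import random


def atc_diffusion_lms(targets, combine, mu=0.05, steps=2000, noise=0.01, seed=0):
    """Adapt-then-combine diffusion LMS on synthetic data (illustrative).

    targets[k]    : true parameter vector at node k (hypothetical test data)
    combine[l][k] : weight node k gives to node l's intermediate estimate;
                    each column k is assumed to sum to 1
    """
    rng = random.Random(seed)
    n, dim = len(targets), len(targets[0])
    w = [[0.0] * dim for _ in range(n)]
    for _ in range(steps):
        psi = []
        for k in range(n):
            # Adapt: one local LMS step on a fresh noisy measurement.
            x = [rng.gauss(0, 1) for _ in range(dim)]
            d = sum(xi * ti for xi, ti in zip(x, targets[k])) + rng.gauss(0, noise)
            err = d - sum(xi * wi for xi, wi in zip(x, w[k]))
            psi.append([wi + mu * err * xi for wi, xi in zip(w[k], x)])
        # Combine: each node averages its neighbors' intermediate estimates.
        w = [
            [sum(combine[l][k] * psi[l][j] for l in range(n)) for j in range(dim)]
            for k in range(n)
        ]
    return w


# Single-task case: every node observes the same parameter vector.
uniform3 = [[1.0 / 3] * 3 for _ in range(3)]
single_task = atc_diffusion_lms([[1.0, -2.0]] * 3, uniform3)

# Multitask case: the two nodes have different optima, so the
# single-task hypothesis behind the combination step is violated.
uniform2 = [[0.5, 0.5], [0.5, 0.5]]
multitask = atc_diffusion_lms([[1.0], [3.0]], uniform2, mu=0.01, steps=3000)
```

In the single-task run the estimates settle near the shared vector. In the multitask run with uniform combination weights, both nodes are pulled toward a common value lying between their two optima, which is the kind of estimation bias the abstract refers to.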
Cloud computing infrastructure offers computing resources as a homogeneous collection of virtual machine instances by different hardware configurations, which is transparent to end users. In fact, the computationa...