ISBN (Print): 9781424497621
Although many existing distributed computing platforms have simplified the development of parallel programs to various degrees, none of them have good notions of software reuse, which is vital to reducing development cost and decreasing the bug rate. On those platforms, code reuse is strongly tied to the experience and skills of developers, effectively leading to low productivity. This paper describes Pomelo, a distributed computing platform designed to aid code reuse in parallel programming. Pomelo provides support for reusing software at different granularities with its task-oriented architecture. Equipped with infrastructural mechanisms for tasks to communicate with each other, it especially facilitates component-based programming with several built-in types of tasks. Preliminary experiments suggest that Pomelo has promising performance and good scalability. Our experience also shows that it is helpful for writing parallel programs productively.
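Pomelo's API is not described in the abstract, so the sketch below is purely illustrative of the reuse pattern a task-oriented architecture encourages: small tasks with a uniform interface, wired together through explicit communication channels. Every class and method name here (Task, connect, Square, Printer) is hypothetical.

```python
# Illustrative sketch only: not Pomelo's actual API. The point is the reuse pattern --
# small tasks with a uniform interface, connected through explicit channels.
import queue
import threading

class Task:
    """A reusable unit of work that reads from an input channel and writes to an output channel."""
    def __init__(self):
        self.inbox = queue.Queue()
        self.outbox = None          # set when the task is connected to a successor

    def connect(self, successor):
        self.outbox = successor.inbox
        return successor

    def process(self, item):
        raise NotImplementedError   # concrete tasks override this

    def run(self):
        while True:
            item = self.inbox.get()
            if item is None:        # sentinel: propagate shutdown downstream
                if self.outbox is not None:
                    self.outbox.put(None)
                break
            result = self.process(item)
            if self.outbox is not None:
                self.outbox.put(result)

class Square(Task):
    def process(self, item):
        return item * item

class Printer(Task):
    def process(self, item):
        print(item)

if __name__ == "__main__":
    square, printer = Square(), Printer()
    square.connect(printer)
    workers = [threading.Thread(target=t.run) for t in (square, printer)]
    for w in workers:
        w.start()
    for x in range(5):
        square.inbox.put(x)
    square.inbox.put(None)          # shut the pipeline down
    for w in workers:
        w.join()
```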
ISBN (Print): 9781467390026
Sequential Consistency (SC) is the most intuitive memory model for parallel programs. However, modern architectures aggressively reorder and overlap memory accesses, causing SC violations. An SC violation is virtually always a bug. Most prior schemes either search the entire state space of a program, or use a constraint solver to find SC violations. A promising recent scheme uses active testing but fails to be effective for SC violations involving a larger number of threads and variables, and for larger codebases. We propose Orion, the first active testing technique that can detect, expose, and classify arbitrary SC violations in any program. Orion works in two phases. In the first phase, it finds potential SC violation cycles by focusing on racing accesses. In the second phase, it exposes each SC violation cycle by enforcing the exact scheduling order. We present a detailed design of Orion in the paper. We tested different concurrent algorithms, bug kernels, SPLASH2, PARSEC applications, and an open source program, Apache. We experimented with the TSO and PSO memory models. We detected and exposed 60 SC violations, of which 15 involve more than two processors and variables. Orion exposes SC violations quickly and with high probability. Compared to a state-of-the-art active testing technique, it has a much better SC violation detection ability.
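The abstract does not spell out how the first phase builds its cycles, so the sketch below is only a schematic reading of the idea: accesses from a trace become graph nodes, program order and racing conflicts become edges, and any cycle containing a program-order edge that TSO may reorder (a store followed by a load to a different location) is reported as a potential SC violation. The trace format and all names are hypothetical, not Orion's data structures.

```python
# Schematic sketch of cycle finding over racing accesses; not Orion's actual code.
from collections import namedtuple

Access = namedtuple("Access", "tid op var")    # op is 'R' or 'W'

def potential_sc_cycles(trace):
    n = len(trace)
    edges = {i: set() for i in range(n)}
    relaxable = set()                          # program-order edges TSO may reorder
    last_in_thread = {}
    for i, a in enumerate(trace):
        if a.tid in last_in_thread:
            j = last_in_thread[a.tid]
            edges[j].add(i)
            b = trace[j]
            if b.op == "W" and a.op == "R" and b.var != a.var:
                relaxable.add((j, i))          # store followed by a load elsewhere
        last_in_thread[a.tid] = i
    for i, a in enumerate(trace):              # conflict edges between racing accesses
        for j, b in enumerate(trace):
            if i != j and a.tid != b.tid and a.var == b.var and "W" in (a.op, b.op):
                edges[i].add(j)

    def has_relaxable(cycle):
        return any((cycle[k], cycle[k + 1]) in relaxable for k in range(len(cycle) - 1))

    cycles = []
    def dfs(start, node, path):
        for nxt in edges[node]:
            if nxt == start:
                cyc = path + [start]
                if has_relaxable(cyc):
                    cycles.append(cyc)
            elif nxt not in path and len(path) < 6:   # bound the search for the sketch
                dfs(start, nxt, path + [nxt])
    for i in range(n):
        dfs(i, i, [i])
    return cycles

# Store-buffering litmus test -- T0: x=1; load y    T1: y=1; load x
trace = [Access(0, "W", "x"), Access(0, "R", "y"),
         Access(1, "W", "y"), Access(1, "R", "x")]
print(potential_sc_cycles(trace))
```

On this litmus test the only cycles reported are the ones passing through a reorderable store/load pair, which is exactly the pattern that becomes visible under TSO but not under SC.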
ISBN (Print): 9783319499567; 9783319499550
OpenACC has been in development for a few years now. The OpenACC 2.5 specification was recently made public, and there are some initiatives for developing full implementations of the standard to make use of accelerator capabilities. There is much to be done yet, but currently OpenACC for GPUs is reaching a good maturity level in various implementations of the standard, using CUDA and OpenCL as backends. Nvidia is investing in this project and has released an OpenACC Toolkit, including the PGI Compiler. There are, however, more developments out there. In this work, we analyze different available OpenACC compilers that have been developed by companies or universities during the last few years. We check their performance and maturity, keeping in mind that OpenACC is designed to be used without extensive knowledge of parallel programming. Our results show that the compilers are on their way to a reasonable level of maturity, presenting different strengths and weaknesses.
ISBN (Print): 9781467388153
Increasingly data-intensive scientific and commercial applications require frequent movement of large datasets from one site to another. Despite the growing capacity of the underlying networks, these data movements rarely achieve the promised data transfer rates of the physical network due to poorly tuned data transfer protocols. Accurately and efficiently tuning data transfer protocol parameters in a dynamically changing network environment is a big challenge and still an open research problem. In this paper, we present predictive end-to-end data transfer optimization algorithms based on historical data analysis and real-time background traffic probing, dubbed HARP. Most of the existing work in this area is based solely on real-time network probing, which either causes too much sampling overhead or fails to accurately predict the correct transfer parameters. Combining historical data analysis with real-time sampling enables our algorithms to tune the application-level data transfer parameters accurately and efficiently to achieve close-to-optimal end-to-end data transfer throughput with very low overhead. Our experimental analysis over a variety of network settings shows that HARP outperforms existing solutions by up to 50% in terms of achieved throughput.
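As a rough illustration of the historical-plus-probing idea (not HARP's actual model or parameter set), the sketch below fits a simple saturating throughput curve to historical transfer logs, rescales it with a single live bandwidth probe, and picks a stream count. The parameter being tuned (number of parallel streams), the curve shape, and all numbers are assumptions.

```python
# Hedged sketch: combine historical logs with one real-time probe to pick a transfer
# parameter. Everything here (model, data, tuned parameter) is assumed for illustration.

# Historical samples: (parallel_streams, achieved_throughput_Mbps) from past transfers
# on a similar source/destination pair.
history = [(1, 120), (2, 230), (4, 410), (8, 640), (16, 780), (32, 820)]

def fit_saturating_model(samples):
    """Fit T(n) = Tmax * n / (n + k) by a coarse grid search (least squares)."""
    best = None
    for tmax in range(100, 2001, 10):
        for k in [x / 2 for x in range(1, 41)]:
            err = sum((t - tmax * n / (n + k)) ** 2 for n, t in samples)
            if best is None or err < best[0]:
                best = (err, tmax, k)
    return best[1], best[2]

def tune_streams(history, probed_bw_mbps, max_streams=64, tolerance=0.95):
    """Pick the smallest stream count predicted to reach ~95% of achievable throughput,
    after rescaling the historical model by the currently probed available bandwidth."""
    tmax, k = fit_saturating_model(history)
    hist_peak = max(t for _, t in history)
    scale = probed_bw_mbps / hist_peak          # crude correction for current load
    predict = lambda n: scale * tmax * n / (n + k)
    target = tolerance * predict(max_streams)
    for n in range(1, max_streams + 1):
        if predict(n) >= target:
            return n, predict(n)
    return max_streams, predict(max_streams)

streams, expected = tune_streams(history, probed_bw_mbps=600)
print(f"use {streams} parallel streams, expected ~{expected:.0f} Mbps")
```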
ISBN (Print): 9781467388153
Adaptive Mesh Refinement (AMR) methods reduce the computational requirements of problems by increasing resolution only in areas of interest. However, in practice, efficient AMR implementations are difficult to achieve, since the mesh hierarchy management must be optimized for the underlying hardware. The architectural complexity of GPUs can render efficient AMR particularly challenging in GPU-accelerated supercomputers. This paper presents a compiler-based high-level framework that can automatically transform serial uniform mesh code annotated by the user into parallel adaptive mesh code optimized for GPU-accelerated supercomputers. We also present a method for empirical analysis of a uniform mesh to project an upper bound on the achievable speedup of a GPU-optimized AMR code. We show experimental results on three production applications. The speedups of code generated by our framework are comparable to hand-written AMR code, while achieving good weak scaling up to 1000 GPUs.
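The abstract does not give the projection formula, so the following is only a simplified illustration of how such an upper bound might be estimated from a uniform-mesh run: measure the fraction of cells whose local gradient would trigger refinement and charge only those cells at fine-level cost. The refinement criterion, the 2:1 refinement ratio, the gradient threshold, and the cost model are all assumptions.

```python
# Simplified, hedged illustration of projecting an AMR speedup upper bound from a
# uniform-mesh field; not the paper's actual projection method.
import numpy as np

def projected_amr_speedup(field, threshold, refine_ratio=2, dim=2):
    """field: a 2-D array sampled on the uniform fine mesh.
    A cell 'needs refinement' if its local gradient magnitude exceeds `threshold`."""
    gy, gx = np.gradient(field)
    needs_fine = np.hypot(gx, gy) > threshold
    f = needs_fine.mean()                        # fraction of cells kept at fine level
    fine_cells = field.size
    coarse_cells = fine_cells / refine_ratio**dim
    uniform_work = fine_cells                    # cost model: work ~ number of cells
    amr_work = f * fine_cells + coarse_cells     # fine patches + coarse background
    return uniform_work / amr_work

# Example: a field with one sharp circular front -- only a thin ring needs refinement.
# The threshold value 0.05 is arbitrary and chosen for this synthetic field.
x, y = np.meshgrid(np.linspace(-1, 1, 512), np.linspace(-1, 1, 512))
field = np.tanh((np.hypot(x, y) - 0.5) / 0.02)
print(f"projected upper-bound speedup: {projected_amr_speedup(field, 0.05):.1f}x")
```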
ISBN (Print): 9781467390057
Key foundational components of Big Data frameworks include efficient large-scale storage and high-performance linear algebra. This paper discusses efficient implementations that utilize compression techniques inspired by columnar relational databases to improve the space and time profiles of vector and matrix operations. In addition, linear algebra operations are integrated with columnar relational algebra operations in both dense and compressed forms. For several of the operations, substantial speedups are obtained by operating directly on the compressed relations, vectors, and matrices. Advantages of mixing and matching relational and linear algebra operations are also pointed out. Both serial and parallel implementations are provided in the ScalaTion Big Data Analytics Framework.
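As a language-agnostic illustration of operating directly on compressed data (written in Python rather than Scala, and not reflecting ScalaTion's actual classes or storage format), the sketch below run-length encodes two vectors and computes their dot product on the runs, without decompressing either operand.

```python
# Hedged sketch of one columnar-compression idea: a dot product computed directly on
# run-length-encoded vectors. Illustrative only; not ScalaTion code.

def rle_encode(values):
    """Compress a vector into (value, run_length) pairs."""
    runs = []
    for v in values:
        if runs and runs[-1][0] == v:
            runs[-1][1] += 1
        else:
            runs.append([v, 1])
    return runs

def rle_dot(a_runs, b_runs):
    """Dot product of two RLE-compressed vectors without decompressing either one."""
    total, i, j = 0.0, 0, 0
    a_left, b_left = a_runs[0][1], b_runs[0][1]
    while i < len(a_runs) and j < len(b_runs):
        overlap = min(a_left, b_left)            # overlapping stretch of the two runs
        total += a_runs[i][0] * b_runs[j][0] * overlap
        a_left -= overlap
        b_left -= overlap
        if a_left == 0:
            i += 1
            a_left = a_runs[i][1] if i < len(a_runs) else 0
        if b_left == 0:
            j += 1
            b_left = b_runs[j][1] if j < len(b_runs) else 0
    return total

a = [0.0] * 600 + [2.0] * 900
b = [3.0] * 1000 + [5.0] * 500
assert rle_dot(rle_encode(a), rle_encode(b)) == sum(x * y for x, y in zip(a, b))
print(rle_dot(rle_encode(a), rle_encode(b)))
```

The work done is proportional to the number of runs rather than the vector length, which is where the speedup on long-run (low-cardinality or sparse) columns comes from.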
ISBN (Print): 9783319509952; 9783319509945
Partitioned Global Address Space (PGAS) programming models combine shared and distributed memory features, and provide a foundation for high-productivity parallel programming using lightweight one-sided communications. The OpenSHMEM programming interface has recently begun gaining popularity as a lightweight, library-based approach for developing PGAS applications, in part through its use of a symmetric heap to realize more efficient implementations of global pointers than in other PGAS systems. However, current approaches to hybrid inter-node and intra-node parallel programming in OpenSHMEM rely on the use of multithreaded programming models (e.g., pthreads, OpenMP) that harness intra-node parallelism but are opaque to the OpenSHMEM runtime. This OpenSHMEM+X approach can encounter performance challenges such as bottlenecks on shared resources, long pause times due to load imbalances, and poor data locality. Furthermore, OpenSHMEM+X requires the expertise of hero-level programmers, compared to the use of just OpenSHMEM. All of these are hard challenges to mitigate with incremental changes. This situation will worsen as computing nodes increase their use of accelerators and heterogeneous memories. In this paper, we introduce the AsyncSHMEM PGAS library, which supports a tighter integration of shared and distributed memory parallelism than past OpenSHMEM implementations. AsyncSHMEM integrates the existing OpenSHMEM reference implementation with a thread-pool-based, intra-node, work-stealing runtime. It aims to prepare OpenSHMEM for future generations of HPC systems by enabling the use of asynchronous computation to hide data transfer latencies, supporting tight interoperability of OpenSHMEM with task parallel programming, improving load balance (of both communication and computation), and enhancing locality. In this paper we present the design of AsyncSHMEM, and demonstrate the performance of our initial AsyncSHMEM implementation by performing a scalability analysis of […]
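The sketch below illustrates only the generic intra-node work-stealing idea referred to above; it is not AsyncSHMEM code, issues no OpenSHMEM calls, and all class and method names are hypothetical. Each worker pops tasks from its own deque and, when that runs dry, steals from the opposite end of a victim's deque, which is how load imbalance between threads gets smoothed out.

```python
# Generic, hedged sketch of a work-stealing thread pool; not the AsyncSHMEM runtime.
import random
import threading
from collections import deque

class WorkStealingPool:
    def __init__(self, n_workers=4):
        self.deques = [deque() for _ in range(n_workers)]
        self.n = n_workers

    def submit(self, worker_id, task):
        self.deques[worker_id].append(task)          # push to the owner's tail

    def _worker(self, wid, results):
        while True:
            try:
                task = self.deques[wid].pop()        # owner pops from its own tail
            except IndexError:
                victims = [v for v in range(self.n) if v != wid and self.deques[v]]
                if not victims:
                    return                           # nothing left anywhere: done
                try:
                    task = self.deques[random.choice(victims)].popleft()  # steal head
                except IndexError:
                    continue                         # lost the race, try again
            results.append(task())

    def run(self):
        results = []
        threads = [threading.Thread(target=self._worker, args=(w, results))
                   for w in range(self.n)]
        for t in threads: t.start()
        for t in threads: t.join()
        return results

pool = WorkStealingPool()
for i in range(40):                                  # deliberately imbalanced:
    pool.submit(0, lambda i=i: i * i)                # all work starts on worker 0
print(sorted(pool.run())[:5], "...")
```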
ISBN (Print): 9783319460796; 9783319460789
There is significant interest in the computational physics community in performing lattice quantum chromodynamics (LQCD) simulations, which can run into the trillions of operations. LQCD computations solve a sparse linear system using a Wilson Dslash kernel, which has an arithmetic intensity of 0.88-2.29. This makes Dslash memory-bandwidth-bound on most architectures, including the Intel Xeon Phi Knights Corner (KNC). Most research on optimizing the Dslash operator has focused on single right-hand side (SRHS) linear solvers. There is a class of LQCD computations which aims to solve systems with multiple right-hand sides (MRHS), presenting additional opportunities for data reuse and vectorization. We present two approaches to MRHS Dslash: a vector register blocking approach and one using the software package QPhiX with a custom code generator for low-level intrinsics. We observed significant speedups using our approaches, with sustained performance of over 700 GFLOPS (single precision) in one instance. We achieved up to 29% of theoretical peak performance, compared to a maximum of 13% obtained by the previous SRHS method using QPhiX.
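As a hedged illustration of why MRHS helps (not the QPhiX code generator or a real Dslash kernel), the sketch below applies a CSR sparse operator to a block of right-hand sides in one pass, so each matrix entry read from memory is reused across the whole block. The flop/byte estimate uses a crude generic CSR model, not the Wilson Dslash figures quoted in the abstract.

```python
# Hedged sketch of blocked multiple-right-hand-side application of a sparse operator.
import numpy as np

def csr_matmul_block(indptr, indices, data, rhs_block):
    """y = A @ X for CSR A and X of shape (n, nrhs); one pass over A's nonzeros."""
    n, nrhs = rhs_block.shape
    out = np.zeros((n, nrhs))
    for row in range(n):
        for k in range(indptr[row], indptr[row + 1]):
            # data[k] is loaded once and used for all nrhs right-hand sides
            out[row] += data[k] * rhs_block[indices[k]]
    return out

def flops_per_byte(nnz, n, nrhs, word=8):
    """Very crude CSR cost model: values + indices streamed once, X and Y once each."""
    flops = 2 * nnz * nrhs                               # one multiply-add per nonzero per RHS
    bytes_moved = word * (nnz + nnz // 2 + n * nrhs * 2)
    return flops / bytes_moved

# Tiny 1-D nearest-neighbour operator as a stand-in for a stencil, with 8 RHS vectors.
n, nrhs = 64, 8
rows, cols, vals = [], [], []
for i in range(n):
    for j in (i - 1, i, i + 1):
        if 0 <= j < n:
            rows.append(i); cols.append(j); vals.append(1.0 if i == j else -0.5)
indptr = np.searchsorted(rows, range(n + 1))
X = np.random.rand(n, nrhs)
Y = csr_matmul_block(indptr, np.array(cols), np.array(vals), X)

A = np.zeros((n, n)); A[rows, cols] = vals               # dense reference check
assert np.allclose(Y, A @ X)
print("arithmetic intensity, 1 RHS vs 8 RHS:",
      round(flops_per_byte(len(vals), n, 1), 2), round(flops_per_byte(len(vals), n, nrhs), 2))
```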
Dry eye syndrome is a public health problem and one of the most common conditions seen by eye care specialists. Among the clinical tests for its diagnosis, the evaluation of the interference patterns observed in the tear film lipid layer is often employed. In this context, tear film maps illustrate the spatial distribution of the patterns over the whole tear film and provide useful information to practitioners. However, the creation of a single map usually takes tens of minutes. Medical experts currently demand applications with lower response times in order to provide a faster diagnosis for their patients. In this work, we explore different parallel approaches to accelerate the definition of the tear film map by exploiting the power of today's ubiquitous multicore systems. They can be executed on any multicore system without special software or hardware requirements. The experimental evaluation determines the best approach (on-demand with dynamic seed distribution) and proves that it can significantly decrease the runtime. For instance, the average runtime of our experiments with 50 real-world images on a system with AMD Opteron processors is reduced from more than 20 minutes to one minute and 12 seconds.
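The paper's image-analysis code is not reproduced here; the sketch below only illustrates the on-demand (dynamic) distribution pattern the evaluation favours, using the standard library and a placeholder classify_region function with deliberately uneven per-seed cost.

```python
# Hedged sketch of dynamic ("on-demand") seed distribution on a multicore system.
# classify_region is a stand-in that just burns CPU, not the tear film analysis itself.
import multiprocessing as mp

def classify_region(seed):
    """Stand-in for classifying the interference pattern around one seed point."""
    acc = 0.0
    for k in range(200_000 + seed % 7 * 100_000):   # deliberately uneven cost per seed
        acc += (k % 97) * 1e-6
    return seed, acc

if __name__ == "__main__":
    seeds = list(range(64))                          # seed points covering the tear film
    with mp.Pool() as pool:
        # chunksize=1 hands out one seed at a time, so idle workers immediately pick up
        # the next pending seed instead of waiting on a static, pre-assigned block.
        results = dict(pool.imap_unordered(classify_region, seeds, chunksize=1))
    print(len(results), "regions classified")
```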
Parallelizing industrial simulation codes, such as the EUROPLEXUS software dedicated to the analysis of fast transient phenomena, is challenging. In this paper we focus on efficient parallelization on a multi-core shared memory node. We propose to have each thread gather the data it needs for processing a given iteration range before actually advancing the computation by one time step on this range. This lazy, cache-aware layout construction makes it possible to keep the original data structure and leads to very localised code modifications. We show that this approach can improve the execution time by up to 40% when the task size is set so that the data fit in the L2 cache.
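The following is a schematic illustration of the gather-then-compute pattern described above, not EUROPLEXUS code: each chunk of elements first gathers the nodal values it touches through indirection into a small contiguous buffer, then advances that range by one step. The unstructured-mesh kernel, the chunk size, and all array names are made up; the buffer stands in for the data that is meant to stay resident in a core's L2 cache.

```python
# Schematic, hedged sketch of per-chunk gather-then-compute over an unstructured mesh.
import numpy as np

rng = np.random.default_rng(0)
n_nodes, n_elems, chunk = 10_000, 20_000, 512
connectivity = rng.integers(0, n_nodes, size=(n_elems, 4))   # 4 nodes per element
node_values = rng.random(n_nodes)
elem_result = np.empty(n_elems)

for start in range(0, n_elems, chunk):
    elems = connectivity[start:start + chunk]
    # Lazy, per-chunk gather: copy just the nodal data this iteration range needs.
    local_nodes, local_idx = np.unique(elems, return_inverse=True)
    gathered = node_values[local_nodes]                        # contiguous buffer
    local_idx = local_idx.reshape(elems.shape)
    # Advance this range by one "time step" using only the gathered buffer.
    elem_result[start:start + chunk] = gathered[local_idx].mean(axis=1)

# Same computation without the per-chunk gather, as a correctness check.
assert np.allclose(elem_result, node_values[connectivity].mean(axis=1))
print("chunked gather-then-compute matches the direct computation")
```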