The proceedings contain 39 papers. The topics discussed include: fast greedy algorithms in MapReduce and streaming; reduced hardware transactions: a new approach to hybrid transactional memory; recursive design of hardware priority queues; drop the anchor: lightweight memory management for non-blocking data structures; scalable statistics counters; storage and search in dynamic peer-to-peer networks; expected sum and maximum of displacement of random sensors for coverage of a domain; on dynamics in selfish network creation; brief announcement: truly parallel Burrows-Wheeler compression and decompression; brief announcement: locality in wireless scheduling; brief announcement: universally truthful secondary spectrum auctions; and brief announcement: online batch scheduling for flow objectives.
ISBN (Print): 9781479941162
The increasing use of runtime-compiled applications provides an opportunity for coarse-grained reconfigurable architecture (CGRA) accelerators to be used in a user-transparent way. The challenge is to provide efficient runtime translation for CGRAs. Despite the apparent difficulties stemming from the heterogeneous nature of CGRAs, this paper demonstrates that it is possible to speed up runtime-compiled applications using CGRAs in a user-transparent way. In particular, the paper presents a runtime translation framework for CGRA accelerators, called RBTVM, based on the LLVM Just-In-Time (JIT) compiler, together with two optimizations for the framework. Experimental results show that RBTVM improves the performance of runtime-compiled applications by 1.44x on average over the baseline JIT compiler alone, which does not take advantage of the accelerator, demonstrating the efficacy of the proposed approach.
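The abstract does not detail RBTVM's internals; the sketch below is only a hedged illustration of the dispatch idea, and the helper names jit_compile, cgra_map_loop, and cgra_run are hypothetical, not the paper's API. The general pattern is to attempt a CGRA mapping for a JIT-compiled function at runtime and fall back to the plain JIT path when the mapping fails.

```python
def runtime_translate(fn, jit_compile, cgra_map_loop, cgra_run):
    """Hypothetical sketch: route a runtime-compiled function to a CGRA if possible."""
    compiled = jit_compile(fn)        # baseline path: JIT-compiled CPU code
    config = cgra_map_loop(fn)        # attempt to map the hot loop onto the CGRA
    if config is None:
        return compiled               # mapping failed: stay on the JIT-only path

    def dispatch(*args):
        return cgra_run(config, *args)  # accelerated path, transparent to the caller

    return dispatch
```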
ISBN (Print): 9781479941162
The Peripheral Component Interconnect Express (PCIe) is the predominant interconnect enabling the CPU to communicate with attached input/output and storage devices. Given its high performance and its ability to connect different address domains via so-called Non-Transparent Bridging (NTB) technology, it is becoming an alternative or an addition to traditional interconnects. PCIe enables devices to communicate in a peer-to-peer manner, allowing for new implementation possibilities in tomorrow's high-performance systems. Components attached to the same computer rack are connected by means of PCIe, and the racks themselves by traditional network technologies. This leads to a heterogeneous landscape of compute nodes and high-performance interconnects. The Socket Wheeled Intelligent Fabric Transport (SWIFT) takes up the challenge of programming these systems. The presented implementation is highly portable thanks to a hardware abstraction layer that allows the implemented concepts to be brought to new interconnects with minimal effort. It is evaluated on a test system comprising different compute nodes equipped with coprocessors that take part in a PCIe non-transparent bridging architecture. Besides low-level benchmarks investigating the principal performance characteristics of the communication layer, MPI benchmark results are presented that illustrate how scientific applications may be ported to heterogeneous environments.
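A minimal sketch of the hardware-abstraction-layer idea the portability claim rests on; the class and method names (Transport, PcieNtbTransport, TcpTransport) are illustrative, not SWIFT's actual API. One transport interface is programmed against, and interconnect back-ends are selected at runtime.

```python
from abc import ABC, abstractmethod

class Transport(ABC):
    """Interconnect-agnostic interface the upper communication layers program against."""
    @abstractmethod
    def send(self, peer: int, payload: bytes) -> None: ...
    @abstractmethod
    def recv(self, peer: int) -> bytes: ...

class PcieNtbTransport(Transport):
    """Back-end that would copy into the peer's memory window mapped through the NTB."""
    def send(self, peer, payload): raise NotImplementedError("memcpy into the mapped window")
    def recv(self, peer): raise NotImplementedError("poll the mapped window")

class TcpTransport(Transport):
    """Back-end for peers reachable only over the rack-level network."""
    def send(self, peer, payload): raise NotImplementedError("socket send")
    def recv(self, peer): raise NotImplementedError("socket recv")

def make_transport(kind: str) -> Transport:
    # Swapping the interconnect is a one-line change for the application code.
    return {"ntb": PcieNtbTransport, "tcp": TcpTransport}[kind]()
```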
ISBN (Print): 9781479941162
The realistic simulation of ultrasound wave propagation is computationally intensive. The large size of the grid and the low degree of data reuse mean that it places a great demand on memory bandwidth. Graphics Processing Units (GPUs) have attracted attention for scientific calculations due to their potential for efficiently performing large numbers of floating-point computations. However, many applications are limited by memory bandwidth, especially for data sets larger than the memory of the GPU platform. This problem is only partially mitigated by the standard technique of breaking the grid into regions and overlapping the computation of one region with the host-device memory transfer of another. In this paper, we implement a memory-bound GPU-based ultrasound simulation and evaluate a technique for improving performance by compressing the data into a fixed-point representation, reducing the time required for host-device transfers. We demonstrate a speedup of 1.5x on a simulation in which the data is broken into regions that must be copied back and forth between the CPU and GPU. We also develop a model that can be used to determine the amount of temporal blocking required to achieve near-optimal performance without extensive experimentation. This technique may also be applied to GPU-based scientific simulations in other domains such as computational fluid dynamics and electromagnetic wave simulation.
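A minimal sketch of the fixed-point compression idea, assuming the field has been normalized to [-1, 1) and that 16 bits suffice; the scheme and bit width are illustrative, not the paper's exact format. Quantizing float32 samples to int16 halves the bytes that must cross the PCIe link, at the cost of a bounded quantization error.

```python
import numpy as np

def to_fixed_point(x, frac_bits=12):
    """Quantize a normalized float32 field to int16 fixed point (illustrative scheme)."""
    scale = 1 << frac_bits
    return np.clip(np.rint(x * scale), -32768, 32767).astype(np.int16)

def from_fixed_point(q, frac_bits=12):
    """Recover an approximate float32 field after the transfer."""
    return q.astype(np.float32) / (1 << frac_bits)

# A pressure-like field normalized to [-1, 1).
field = np.random.uniform(-1.0, 1.0, size=1_000_000).astype(np.float32)
packed = to_fixed_point(field)
restored = from_fixed_point(packed)

print(packed.nbytes / field.nbytes)       # 0.5 -> half the host-device transfer volume
print(np.max(np.abs(restored - field)))   # bounded quantization error (~2**-13 here)
```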
ISBN (Print): 9781479941162
An optimized parallel algorithm is proposed to address the complicated backward substitution that arises in cyclic reduction when solving tridiagonal linear systems. Adopting a hybrid parallel model, the algorithm combines the cyclic reduction method with the partition method. Compared with cyclic reduction, the hybrid algorithm has a simpler backward substitution on parallel computers. In this paper, operation counts and execution times are obtained to evaluate and compare these methods. Based on these measurements, the multi-threaded hybrid algorithm achieves better efficiency than the other parallel methods, i.e., cyclic reduction and the partition method, of which cyclic reduction had previously been regarded as the fastest in many respects. In particular, the proposed approach has the lowest scalar operation count and the shortest execution time on a multi-core computer when the system is large enough. The hybrid parallel algorithm improves on the performance of the cyclic reduction and partition methods by 30% and 20%, respectively.
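The abstract does not spell out the hybrid algorithm; as a point of reference, the sketch below shows a minimal Thomas-algorithm sweep, the kind of serial solve each partition performs in a partition-based scheme, while the reduced interface system would be handled separately (for example by cyclic reduction). The array conventions are an assumption for illustration.

```python
import numpy as np

def thomas_solve(a, b, c, d):
    """Solve a tridiagonal system with sub-diagonal a, diagonal b, super-diagonal c
    and right-hand side d (a[0] and c[-1] are unused)."""
    n = len(d)
    cp, dp = np.empty(n), np.empty(n)
    cp[0], dp[0] = c[0] / b[0], d[0] / b[0]
    for i in range(1, n):                       # forward elimination
        m = b[i] - a[i] * cp[i - 1]
        cp[i] = c[i] / m if i < n - 1 else 0.0
        dp[i] = (d[i] - a[i] * dp[i - 1]) / m
    x = np.empty(n)
    x[-1] = dp[-1]
    for i in range(n - 2, -1, -1):              # backward substitution
        x[i] = dp[i] - cp[i] * x[i + 1]
    return x

# Example: [[2,1,0],[1,2,1],[0,1,2]] x = [3,4,3] has solution [1,1,1].
print(thomas_solve(np.array([0., 1., 1.]), np.array([2., 2., 2.]),
                   np.array([1., 1., 0.]), np.array([3., 4., 3.])))
```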
ISBN (Print): 9781450328210
Asynchronous variational integrators (AVIs) are used in computational mechanics and graphics to solve complex contact mechanics problems. Parallelizing AVI is difficult because it is not possible to build a dependence graph for AVI either at compile time or at runtime. However, we show that if the dependence graph is instead updated incrementally as the computation is performed, AVI can be parallelized in a systematic way. Using this approach, we obtain speedups of up to 20 on 24 cores for relatively small AVI problems.
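The scheduling details are not given in the abstract; the toy sketch below (elements, node sets, and times are made up) only illustrates the invariant an incrementally maintained dependence graph has to enforce: an element may advance only when it holds the earliest pending update time among the elements sharing one of its nodes, and each advance changes the dependence information seen by its neighbours.

```python
import heapq, random

random.seed(0)
# Toy mesh: each element touches three shared nodes and carries a next-update time.
elems = {e: {"nodes": frozenset(random.sample(range(10), 3)), "t": random.random()}
         for e in range(8)}

def neighbours(e):
    return [f for f in elems if f != e and elems[f]["nodes"] & elems[e]["nodes"]]

heap = [(elems[e]["t"], e) for e in elems]
heapq.heapify(heap)
t_end, steps = 2.0, 0

while heap:
    t, e = heapq.heappop(heap)
    if t >= t_end:
        break
    # Dependence condition a parallel scheduler would check per neighbourhood:
    # e is safe to run because no neighbour has an earlier pending update.
    assert all(elems[f]["t"] >= t for f in neighbours(e))
    elems[e]["t"] = t + random.uniform(0.05, 0.2)   # advance e; neighbours' view changes
    heapq.heappush(heap, (elems[e]["t"], e))
    steps += 1

print("element updates executed:", steps)
```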
ISBN (Print): 9781479938018
Support Vector Machine (SVM) has been widely used in data-mining and Big Data applications as modern commercial databases start to attach an increasing importance to the analytic capabilities. In recent years, SVM was adapted to the field of High Performance Computing for power/performance prediction, auto-tuning, and runtime scheduling. However, even at the risk of losing prediction accuracy due to insufficient runtime information, researchers can only afford to apply offline model training to avoid significant runtime training overhead. Advanced multi- and many-core architectures offer massive parallelism with complex memory hierarchies which can make runtime training possible, but form a barrier to efficient parallel SVM design. To address the challenges above, we designed and implemented MIC-SVM, a highly efficient parallel SVM for x86 based multi-core and many-core architectures, such as the Intel Ivy Bridge CPUs and Intel Xeon Phi co-processor (MIC). We propose various novel analysis methods and optimization techniques to fully utilize the multilevel parallelism provided by these architectures and serve as general optimization methods for other machine learning tools. MIC-SVM achieves 4.4-84x and 18-47x speedups against the popular LIBSVM, on MIC and Ivy Bridge CPUs respectively, for several real-world data-mining datasets. Even compared with GPUSVM, run on a top of the line NVIDIA k20x GPU, the performance of our MIC-SVM is competitive. We also conduct a cross-platform performance comparison analysis, focusing on Ivy Bridge CPUs, MIC and GPUs, and provide insights on how to select the most suitable advanced architectures for specific algorithms and input data patterns.
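The paper's specific optimizations are not reproduced here; as a hedged illustration of the data-level parallelism such architectures exploit, the hot spot in SMO-style SVM training, evaluating one kernel row against all samples, can be written in scalar or data-parallel form. The gamma value and data shapes below are illustrative.

```python
import numpy as np

gamma = 0.5
X = np.random.rand(10_000, 64).astype(np.float32)   # n_samples x n_features
xi = X[0]

# Scalar reference: one training sample at a time.
row_scalar = np.array([np.exp(-gamma * np.sum((xi - xj) ** 2)) for xj in X])

# Data-parallel form: the whole RBF kernel row in one shot, the shape of work
# that maps onto wide SIMD units and many cores.
row_vec = np.exp(-gamma * np.sum((X - xi) ** 2, axis=1))

assert np.allclose(row_scalar, row_vec, atol=1e-5)
```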
ISBN (Print): 9781450316569
The smart cities concept arises from the need to manage, automate, optimize, and explore all aspects of a city that could be improved. For this purpose it is necessary to build a robust architecture that satisfies a minimal set of requirements such as distributed sensing, integrated management, and flexibility. Several architectures have been proposed with different goals, but none of them satisfactorily meets the needs that permeate smart cities. In this work various architectures are discussed, highlighting the main requirements they aim to fulfill. Furthermore, based on different architectures with the most varied purposes, a set of requirements for the implementation of a smart city is presented and discussed. Copyright 2013 ACM.
ISBN (Print): 9781467355254; 9781467355247
Problem domains are commonly decomposed hierarchically to fully utilize the parallel resources in modern microprocessors. Such decompositions can be provided as library routines, written by experienced experts, for general algorithmic patterns. But such APIs tend to be constrained to certain architectures or data sizes, and integrating them with application code is often an unnecessarily daunting task, especially when these routines need to be closely coupled with user code to achieve better performance. This paper contributes HiDP, a high-level hierarchical data-parallel language. The purpose of HiDP is to improve the coding productivity of integrating hierarchical data parallelism without significant loss of performance. HiDP is a source-to-source compiler that converts a very concise data-parallel language into CUDA C++ source code. Internally, it performs the analysis necessary to compose user code with efficient and architecture-aware code snippets. This paper discusses the various aspects of HiDP systematically: the language, the compiler, and the run-time system with built-in tuning capabilities. Together they enable HiDP users to express algorithms in less code than low-level SDKs require for native platforms, while exposing the abundant computing resources of modern parallel architectures. Improved coding productivity tends to come with a sacrifice in performance; yet experimental results show that the generated code delivers performance very close to handcrafted native GPU code.
ISBN (Print): 9781467355254; 9781467355247
Modern throughput processors such as GPUs achieve high performance and efficiency by exploiting data parallelism in application kernels expressed as threaded code. One drawback of this approach compared to conventional vector architectures is the redundant execution of instructions that are common across multiple threads, resulting in energy inefficiency due to excess instruction dispatch, register file accesses, and memory operations. This paper proposes to alleviate these overheads, while retaining the threaded programming model, by automatically detecting scalar operations and factoring them out of the parallel code. We have developed a scalarizing compiler that employs convergence and variance analyses to statically identify values and instructions that are invariant across multiple threads. Our compiler algorithms are effective at identifying convergent execution even in programs with arbitrary control flow, identifying two-thirds of the opportunity captured by a dynamic oracle. The compile-time analysis reduces instructions dispatched by 29%, register file reads and writes by 31%, memory address counts by 47%, and data access counts by 38%.
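A toy sketch of what a variance analysis does, on a hypothetical six-instruction IR (the instruction set and names are made up, not the paper's compiler IR): values derived, transitively, from the thread index are marked varying, and everything else is uniform and a candidate for scalarization.

```python
# Each instruction is (destination, opcode, source operands).
program = [
    ("base",  "load",  ["ptr"]),          # same address for every thread
    ("scale", "const", ["4"]),
    ("off",   "mul",   ["tid", "scale"]), # depends on the thread index
    ("addr",  "add",   ["base", "off"]),
    ("val",   "load",  ["addr"]),
    ("out",   "mul",   ["val", "scale"]),
]

varying = {"tid"}                          # seed: the thread index is varying
changed = True
while changed:                             # propagate to a fixed point over def-use chains
    changed = False
    for dest, _op, srcs in program:
        if dest not in varying and any(s in varying for s in srcs):
            varying.add(dest)
            changed = True

for dest, op, srcs in program:
    kind = "varying" if dest in varying else "uniform"
    print(f"{dest:5s} = {op}({', '.join(srcs)})  -> {kind}")
# "base" and "scale" come out uniform: they could be dispatched once per
# warp/vector instead of once per thread.
```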