Ongoing many-core designs aim at core counts at which cache coherence becomes a serious challenge. Therefore, this paper discusses how one-sided communication and the required process synchronization can be realized on a non-cache-coherent many-core CPU. The Intel Single-chip Cloud Computer serves as an exemplary hardware architecture. The presented approach is based on software-managed cache coherence for MPI one-sided communication. The prototype implementation delivers PUT performance up to 5 times faster than the default message-based approach and reduces the communication costs of the NAS Parallel Benchmarks 3-D Fast Fourier Transform by a factor of 5. Further, the paper derives conclusions for future non-cache-coherent architectures.
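For reference, below is a minimal sketch of the MPI one-sided PUT pattern with active-target (fence) synchronization that such an implementation accelerates. The SCC-specific software-managed coherence layer (e.g., flushing buffers around epochs) is not shown; buffer names are illustrative, and the program assumes at least two ranks.

```cpp
#include <mpi.h>
#include <cstdio>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank = 0;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    int recv_buf = -1;  // one int exposed by every process
    MPI_Win win;
    MPI_Win_create(&recv_buf, sizeof(int), sizeof(int),
                   MPI_INFO_NULL, MPI_COMM_WORLD, &win);

    MPI_Win_fence(0, win);            // open the access/exposure epoch
    if (rank == 0) {
        int value = 42;
        // Write directly into rank 1's window; no receive is posted there.
        MPI_Put(&value, 1, MPI_INT, 1 /*target*/, 0 /*disp*/, 1, MPI_INT, win);
    }
    MPI_Win_fence(0, win);            // close the epoch; the PUT is now visible

    if (rank == 1) std::printf("rank 1 received %d\n", recv_buf);
    MPI_Win_free(&win);
    MPI_Finalize();
    return 0;
}
```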
ISBN (print): 9781450301190
Client + cloud computing is a disruptive new computing platform, combining diverse client devices (PCs, smartphones, sensors, and single-function and embedded devices) with the unlimited, on-demand computation and data storage offered by cloud computing services such as Amazon's AWS or Microsoft's Windows Azure. As with every advance in computing, programming is a fundamental challenge, as client + cloud computing combines many difficult aspects of software development. Systems built for this world are inherently parallel and distributed, run on unreliable hardware, and must be continually available, a challenging programming model for even the most skilled programmers. How, then, do ordinary programmers develop software for the Cloud? This talk presents one answer: Orleans, a software framework for building client + cloud applications. Orleans encourages the use of simple concurrency patterns that are easy to understand and implement correctly, building on an actor-like model with declarative specification of persistence, replication, and consistency, and using lightweight transactions to support the development of reliable and scalable client + cloud software.
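To make the actor-like model concrete, here is a minimal sketch of the single-threaded-turn pattern such frameworks build on: each actor owns its state and processes one message at a time from a mailbox, so the state itself needs no locking. This is a generic illustration of the pattern, not the Orleans API (Orleans is a .NET framework, and its persistence, replication, and transaction features are not modeled here).

```cpp
#include <condition_variable>
#include <functional>
#include <iostream>
#include <mutex>
#include <queue>
#include <thread>

class Actor {
    std::queue<std::function<void()>> mailbox_;
    std::mutex m_;
    std::condition_variable cv_;
    bool done_ = false;
    std::thread worker_;
public:
    Actor() : worker_([this] { run(); }) {}
    ~Actor() {  // drain the mailbox, then stop
        { std::lock_guard<std::mutex> lk(m_); done_ = true; }
        cv_.notify_one();
        worker_.join();
    }
    // Messages are closures; they run sequentially on the actor's thread.
    void send(std::function<void()> msg) {
        { std::lock_guard<std::mutex> lk(m_); mailbox_.push(std::move(msg)); }
        cv_.notify_one();
    }
private:
    void run() {
        for (;;) {
            std::function<void()> msg;
            {
                std::unique_lock<std::mutex> lk(m_);
                cv_.wait(lk, [this] { return done_ || !mailbox_.empty(); });
                if (mailbox_.empty()) return;   // done and drained
                msg = std::move(mailbox_.front());
                mailbox_.pop();
            }
            msg();  // one "turn": no other message touches the state meanwhile
        }
    }
};

int main() {
    int counter = 0;  // actor-owned state: touched by the actor thread only
    Actor a;
    for (int i = 0; i < 1000; ++i)
        a.send([&counter] { ++counter; });
    a.send([&counter] { std::cout << "counter = " << counter << "\n"; });
}
```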
ISBN (print): 9781450301190
Two trends changed the computing landscape over the past decade: (1) hardware vendors started delivering chip multiprocessors (CMPs) instead of uniprocessors, and (2) software developers increasingly chose managed languages instead of native languages. Unfortunately, the former change is disrupting the virtuous cycle between performance improvements and software innovation. Establishing a new parallel performance virtuous cycle for managed languages will require scalable applications executing on scalable virtual machine (VM) services, since the VM schedules, monitors, compiles, optimizes, garbage collects, and executes together with the application. This talk describes current progress, opportunities, and challenges for scalable VM services. The parallel computing revolution urgently needs more innovation.
ISBN (print): 9781595931894
While the high-performance computing world is dominated by distributed-memory computer systems, applications that require random access into large shared data structures continue to motivate development of ever larger shared-memory parallel computers such as Cray's MTA and SGI's Altix. To support scalable application performance on such architectures, the memory allocator must be able to satisfy requests at a rate proportional to system size. For example, a 40-processor Cray MTA-2 can experience over 5000 concurrent requests, one from each of its 128 streams per processor. Cray's Eldorado, to be built upon the same network as Sandia's 10,000-processor Red Storm system, will sport thousands of multithreaded processors, leading to hundreds of thousands of concurrent requests. In this paper, we present MAMA, a scalable shared-memory allocator designed to service any rate of concurrent requests. MAMA is distinguished from prior work on shared-memory allocators in that it employs software combining to aggregate requests serviced by a single heap structure; Hoard and MTA malloc necessitate repetition of the underlying heap data structures in proportion to processor count. Unlike Hoard, MAMA does not exploit processor-local data structures, limiting its applicability today to systems that sustain high utilization in the presence of global references, such as Cray's MTA systems. We believe MAMA's relevance to other shared-memory systems will grow as they become increasingly multithreaded and, consequently, more tolerant of references to non-local data. We show not only that MAMA scales on Cray MTA systems, but also that it delivers absolute performance competitive with allocators employing heap repetition. In addition, we demonstrate that performance of repetition-based allocators does not scale under heavy loads. We also argue more generally that methods using repetition alone to support concurrency are subject to an impractical tradeoff of scalability against space consumption.
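A minimal sketch of the software-combining idea follows, assuming a trivial bump-pointer heap and one fixed slot per thread: concurrent allocation requests are posted to shared slots, and whichever thread acquires the combiner role services the whole batch in one sweep over a single heap. The slot layout, batching policy, and missing free() path are simplifications; this is not MAMA's implementation.

```cpp
#include <atomic>
#include <cstddef>
#include <cstdio>
#include <mutex>
#include <thread>
#include <vector>

constexpr int kSlots = 64;

struct Slot {
    std::atomic<size_t> request{0};      // bytes wanted; 0 means "no request"
    std::atomic<void*>  result{nullptr}; // filled in by the combiner
};

Slot slots[kSlots];
std::mutex combiner_lock;                // at most one combiner at a time
alignas(16) unsigned char heap[1 << 20]; // the single shared heap
size_t heap_top = 0;                     // guarded by combiner_lock

void* combined_alloc(int my_slot, size_t bytes) {
    slots[my_slot].request.store(bytes, std::memory_order_release);
    for (;;) {
        void* r = slots[my_slot].result.load(std::memory_order_acquire);
        if (r != nullptr) {              // a combiner serviced our request
            slots[my_slot].result.store(nullptr, std::memory_order_relaxed);
            return r;
        }
        if (combiner_lock.try_lock()) {  // become the combiner ourselves
            // One sweep services every pending request from the single heap.
            for (int i = 0; i < kSlots; ++i) {
                size_t want = slots[i].request.load(std::memory_order_acquire);
                if (want != 0) {
                    slots[i].request.store(0, std::memory_order_relaxed);
                    void* p = &heap[heap_top];             // no overflow check here
                    heap_top += (want + 15) & ~size_t{15}; // 16-byte alignment
                    slots[i].result.store(p, std::memory_order_release);
                }
            }
            combiner_lock.unlock();
        }
    }
}

int main() {
    std::vector<std::thread> ts;
    for (int t = 0; t < 8; ++t)
        ts.emplace_back([t] {            // slot index doubles as thread id
            for (int i = 0; i < 100; ++i) combined_alloc(t, 32);
        });
    for (auto& th : ts) th.join();
    std::printf("heap bytes used: %zu\n", heap_top);  // 8 * 100 * 32 = 25600
}
```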
The rapid progress of multi/many-core architectures has left data-intensive parallel applications that are not yet fully optimized unable to deliver their best performance. With the advent of concurrent programming, frameworks offering structured patterns have alleviated the developers' burden of adapting such applications to multithreaded architectures. While some of these patterns are implemented using synchronization primitives, others avoid them by means of lock-free data mechanisms. However, lock-free programming is not straightforward, and ensuring an appropriate use of lock-free interfaces can be challenging, since different memory models plus instruction reordering at the compiler/processor level can interfere in the occurrence of data races. Race detectors are formidably useful in this sense; however, they may emit false positives if they are unaware of the underlying lock-free structure's semantics. To mitigate this issue, this paper extends ThreadSanitizer, a race detection tool, with the semantics of two lock-free data structures: the single-producer/single-consumer and the multiple-producer/multiple-consumer queues. With it, we are able to drop false positives and detect potential semantic violations. The experimental evaluation, using different queue implementations on a set of benchmarks and real applications, demonstrates that it is possible to reduce the number of data race warnings by 60% on average and to detect wrong uses of these structures.
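The sketch below shows the kind of structure such a detector must understand: a single-producer/single-consumer ring buffer synchronized only through two atomic indices (illustrative of the SPSC queue class, not one of the implementations evaluated in the paper). With acquire/release ordering, ThreadSanitizer can derive the happens-before edges itself; implementations relying on relaxed ordering or fences are where semantic awareness is needed to avoid false positives on the buffer accesses.

```cpp
#include <atomic>
#include <cstddef>
#include <iostream>
#include <thread>

template <typename T, size_t N>      // N: capacity (power of two here)
class SpscQueue {
    T buf_[N];
    std::atomic<size_t> head_{0};    // advanced by the consumer only
    std::atomic<size_t> tail_{0};    // advanced by the producer only
public:
    bool push(const T& v) {          // call from the producer thread only
        size_t t = tail_.load(std::memory_order_relaxed);
        if (t - head_.load(std::memory_order_acquire) == N) return false; // full
        buf_[t % N] = v;
        tail_.store(t + 1, std::memory_order_release);   // publish the element
        return true;
    }
    bool pop(T& out) {               // call from the consumer thread only
        size_t h = head_.load(std::memory_order_relaxed);
        if (tail_.load(std::memory_order_acquire) == h) return false;     // empty
        out = buf_[h % N];
        head_.store(h + 1, std::memory_order_release);   // release the slot
        return true;
    }
};

int main() {                          // build with: -fsanitize=thread -O1 -g
    SpscQueue<int, 1024> q;
    std::thread producer([&] {
        for (int i = 0; i < 100000; ) if (q.push(i)) ++i;
    });
    long long sum = 0;
    int v;
    for (int got = 0; got < 100000; ) if (q.pop(v)) { sum += v; ++got; }
    producer.join();
    std::cout << "sum = " << sum << "\n";   // expect 4999950000
}
```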
ISBN (print): 9781450346603
In 2015, a language based fundamentally on substructural typing, Rust, hit its 1.0 release, and less than a year later it was put into production use at a number of tech companies, including some household names. The language has started a trend, with several other mainstream languages, including C++ and Swift, in the early stages of incorporating ideas about ownership. How did this come about? Rust's core focus is safe systems programming. It does not require a runtime system or garbage collector, but guarantees memory safety. It does not stipulate any particular style of concurrent programming, but instead provides the tools needed to guarantee data race freedom even when doing low-level shared-state concurrency. It allows you to build up high-level abstractions without paying a tax; its compilation model ensures that the abstractions boil away. These benefits derive from two core aspects of Rust: its ownership system (based on substructural typing) and its trait system (a descendant of Haskell's type classes). The talk will cover these two pillars of Rust's design, with particular attention to the key innovations that make the language usable at scale. It will highlight the implications for concurrency, where Rust provides a unique perspective. It will also touch on aspects of Rust's development that tend to get less attention within the POPL community: Rust's governance and open development process, and design considerations around language and library evolution. Finally, it will mention a few of the myriad open research questions around Rust.
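As a rough analogy (kept in C++ for consistency with the other sketches here, so this is not Rust code): Rust's ownership system statically enforces the move-only discipline that C++ expresses, but does not enforce, with std::unique_ptr. Handing unique ownership to another thread is exactly the pattern that underlies Rust's data-race-freedom guarantee.

```cpp
#include <iostream>
#include <memory>
#include <thread>
#include <vector>

int main() {
    auto data = std::make_unique<std::vector<int>>(1000, 1);
    // Ownership moves into the spawned thread; exactly one owner exists
    // at any time, so no lock is needed and no race on *data is possible.
    std::thread t([owned = std::move(data)] {
        long long sum = 0;
        for (int x : *owned) sum += x;
        std::cout << "sum = " << sum << "\n";
    });
    // `data` is now null. Dereferencing it here would be a bug that C++
    // happily compiles; Rust's borrow checker rejects such use-after-move
    // at compile time.
    t.join();
}
```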
Computational systems are nowadays composed of basic computational components that combine multiprocessors with coprocessors of different types, typically several graphics processing units (GPUs) or many integrated cores (MICs), and those components are combined in heterogeneous clusters whose nodes differ in their characteristics, including coprocessor types, node counts, and speeds. Software previously developed and optimized for simpler systems needs to be redesigned and reoptimized for these new, more complex systems. The adaptation of autotuning techniques for basic linear algebra routines to hybrid multicore+multiGPU and multicore+multiMIC systems is analyzed. The matrix-matrix multiplication kernel, which is optimized for the different components of a computational system through guided experimentation, is studied. The routine is installed on each node in the cluster, and the information generated from the individual installations may be used for a hierarchical installation across the cluster. The basic matrix-matrix multiplication may, in turn, be used inside higher-level routines, which delegate their efficient execution to the optimization of the lower-level routine. Experimental results are satisfactory on different multicore+multiGPU and multicore+multiMIC systems, so the guided search for execution configurations with satisfactory execution times proves to be a useful tool for heterogeneous systems, where the complexity of the system makes correct use of highly efficient routines and libraries difficult.
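A minimal sketch of the guided-experimentation step follows, assuming the tile size of a blocked matrix multiply is the tuning parameter: at installation time the routine is timed over a small set of candidate configurations and the best one is recorded for later runs. A real installation would also search other parameters, such as the CPU/GPU work split; the candidates and sizes here are illustrative.

```cpp
#include <algorithm>
#include <chrono>
#include <cstdio>
#include <vector>

using Matrix = std::vector<double>;          // row-major n x n

void gemm_tiled(const Matrix& A, const Matrix& B, Matrix& C, int n, int tile) {
    for (int ii = 0; ii < n; ii += tile)
        for (int kk = 0; kk < n; kk += tile)
            for (int jj = 0; jj < n; jj += tile)
                for (int i = ii; i < std::min(ii + tile, n); ++i)
                    for (int k = kk; k < std::min(kk + tile, n); ++k) {
                        double a = A[i * n + k];
                        for (int j = jj; j < std::min(jj + tile, n); ++j)
                            C[i * n + j] += a * B[k * n + j];
                    }
}

int main() {
    const int n = 512;
    Matrix A(n * n, 1.0), B(n * n, 1.0), C(n * n, 0.0);
    int best_tile = 0;
    double best_ms = 1e30;
    for (int tile : {16, 32, 64, 128}) {     // candidate configurations
        std::fill(C.begin(), C.end(), 0.0);
        auto t0 = std::chrono::steady_clock::now();
        gemm_tiled(A, B, C, n, tile);
        double ms = std::chrono::duration<double, std::milli>(
                        std::chrono::steady_clock::now() - t0).count();
        std::printf("tile %3d: %8.1f ms\n", tile, ms);
        if (ms < best_ms) { best_ms = ms; best_tile = tile; }
    }
    // The selected configuration would be stored and reused at run time.
    std::printf("selected tile size: %d\n", best_tile);
}
```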
The JPEG format employs Huffman codes to compress the entropy data of an image. Huffman codewords are of variable length, which makes parallel entropy decoding a difficult problem: to determine the start position of a codeword in the bitstream, the previous codeword must be decoded first. We present JParEnt, a new approach to parallel entropy decoding for JPEG decompression on heterogeneous multicores. JParEnt conducts JPEG decompression in two steps: (1) an efficient sequential scan of the entropy data on the CPU to determine the start positions (boundaries) of coefficient blocks in the bitstream, followed by (2) a parallel entropy decoding step on the graphics processing unit (GPU). The block boundary scan constitutes a reinterpretation of the Huffman-coded entropy data to determine codeword boundaries in the bitstream. We introduce a dynamic workload partitioning scheme to account for GPUs of low compute power relative to the CPU, a configuration that has become common with the advent of SoCs with integrated graphics processors (IGPs). We leverage additional parallelism through pipelined execution across CPU and GPU. For systems providing a unified address space between CPU and GPU, we employ zero-copy to completely eliminate the data transfer overhead. Our experimental evaluation of JParEnt was conducted on six heterogeneous multicore systems: one server and two desktops with dedicated GPUs, one desktop with an IGP, and two embedded systems. For a selection of more than 1000 JPEG images, JParEnt outperforms the SIMD implementation of the libjpeg-turbo library by up to a factor of 4.3x, and the previously fastest JPEG decompression method for heterogeneous multicores by up to a factor of 2.2x. JParEnt's entropy data scan consumes 45% of the entropy decoding time of libjpeg-turbo on average. Given this new ratio for the sequential part of JPEG decompression, JParEnt achieves up to 97% of the maximum attainable speedup (95% on average). On the IGP-based desktop platform,
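A toy sketch of the sequential boundary scan (step 1), assuming a made-up 3-symbol prefix code and a fixed number of codewords per block in place of JPEG's Huffman tables and its EOB/run-length structure: the scan decodes only codeword lengths, recording the bit offset where each block starts so the blocks can later be entropy-decoded in parallel (on the GPU in JParEnt).

```cpp
#include <cstdint>
#include <cstdio>
#include <vector>

// Toy prefix code: 0 -> A (1 bit), 10 -> B (2 bits), 11 -> C (2 bits).
static int codeword_length(const std::vector<uint8_t>& bits, size_t pos) {
    return bits[pos] == 0 ? 1 : 2;
}

int main() {
    // A made-up stream of 12 codewords: A B A C A A B C A A A B
    std::vector<uint8_t> bits = {0,1,0,0,1,1,0,0,1,0,1,1,0,0,0,1,0};
    const int codewords_per_block = 4;       // toy "block" granularity

    std::vector<size_t> block_start;         // bit offset of each block
    size_t pos = 0;
    int in_block = 0;
    while (pos < bits.size()) {
        if (in_block == 0) block_start.push_back(pos);
        pos += codeword_length(bits, pos);   // skip the codeword, don't decode it
        in_block = (in_block + 1) % codewords_per_block;
    }
    for (size_t i = 0; i < block_start.size(); ++i)
        std::printf("block %zu starts at bit %zu\n", i, block_start[i]);
    // Each (start, next_start) range can now be decoded independently.
}
```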