Graphs are powerful data representations favored in many computational domains. Modern GPUs have recently shown promising results in accelerating computationally challenging graph problems but their performance suffer...
详细信息
ISBN:
(纸本)9781450301190
Graphs are powerful data representations favored in many computational domains. Modern GPUs have recently shown promising results in accelerating computationally challenging graph problems but their performance suffers heavily when the graph structure is highly irregular, as most real-world graphs tend to be. In this study, we first observe that the poor performance is caused by work imbalance and is an artifact of a discrepancy between the GPU programming model and the underlying GPU architecture. We then propose a novel virtual warp-centric programming method that exposes the traits of underlying GPU architectures to users. Our method significantly improves the performance of applications with heavily imbalanced workloads, and enables trade-offs between workload imbalance and ALU underutilization for fine-tuning the performance. Our evaluation reveals that our method exhibits up to 9x speedup over previous GPU algorithms and 12x over single thread CPU execution on irregular graphs. When properly configured, it also yields up to 30% improvement over previous GPU algorithms on regular graphs. In addition to performance gains on graph algorithms, our programming method achieves 1.3x to 15.1x speedup on a set of GPU benchmark applications. Our study also confirms that the performance gap between GPUs and other multi-threaded CPU graph implementations is primarily due to the large difference in memory bandwidth.
In this paper, we propose an OpenCL framework that treats multiple GPUs as a single compute device. Providing the single GPU image makes an OpenCL application written for a single GPU portable to the GPGPU systems wit...
详细信息
ISBN:
(纸本)9781450301190
In this paper, we propose an OpenCL framework that treats multiple GPUs as a single compute device. Providing the single GPU image makes an OpenCL application written for a single GPU portable to the GPGPU systems with multiple GPUs. It also makes the application exploit the full computing power of the multiple GPUs and the entire amount of GPU memories available in the system. Our OpenCL framework automatically distributes at run time an OpenCL kernel written for a single GPU into multiple CUDA kernels that execute on the multiple GPUs. It applies a run-time memory access range analysis to the kernel by performing a sampling run and identifies an optimal workload distribution for the kernel. To achieve a single compute device image, the runtime maintains a virtual device memory that is allocated in the main memory of the GPGPU system. the OpenCL runtime treats the memory as if it were the memory of a single GPO device and keeps it consistent to the memories of the multiple GPUs. Our OpenCL-C-to-C translator generates the sampling code from the OpenCL kernel code and our OpenCL-C-to-CUDA-C translator generates the CUDA kernel code for the distributed OpenCL kernel. We show the effectiveness of our OpenCL framework by implementing the OpenCL runtime and the two source-to-source translators. We evaluate its performance with a GPGU system that contains eight GPUs using eleven OpenCL benchmark applications.
Outside of computational science, most problems are formulated in terms of irregular data structures such as graphs, trees and sets. Unfortunately, we understand relatively little about the structure of parallelism an...
详细信息
ISBN:
(纸本)9781450301190
Outside of computational science, most problems are formulated in terms of irregular data structures such as graphs, trees and sets. Unfortunately, we understand relatively little about the structure of parallelism and locality in irregular algorithms. In this paper, we study several algorithms for four such problems: discrete-event simulation, single-source shortest path, breadth-first search, and minimal spanning trees. We show that these algorithms can be classified into two categories that we call unordered and ordered, and demonstrate experimentally that there is a trade-off between parallelism and work efficiency: unordered algorithms usually have more parallelism than their ordered counterparts for the same problem, but they may also perform more work. Nevertheless, our experimental results show that unordered algorithms typically lead to more scalable implementations, demonstrating that less work-efficient irregular algorithms may be better for parallel execution.
the Standard Template Adaptive parallel Library (STAPL) is a parallelprogramming infrastructure that extends C++ with support for parallelism. It includes a collection of distributed data structures called pContainer...
详细信息
ISBN:
(纸本)9781450301190
the Standard Template Adaptive parallel Library (STAPL) is a parallelprogramming infrastructure that extends C++ with support for parallelism. It includes a collection of distributed data structures called pContainers that are thread-safe, concurrent objects, i.e., shared objects that provide parallel methods that can be invoked concurrently. In this work, we present the STAPL parallel Container Framework (PCF), that is designed to facilitate the development of generic parallel containers. We introduce a set of concepts and a methodology for assembling a pContainer from existing sequential or parallel containers, without requiring the programmer to deal with concurrency or data distribution issues. the PCF provides a large number of basic parallel data structures (e.g., pArray, pList, pVector, pMatrix, pGraph, pMap, pSet). the PCF provides a class hierarchy and a composition mechanism that allows users to extend and customize the current container base for improved application expressivity and performance. We evaluate STAPL pContainer performance on a CRAY XT4 massively parallel system and show that pContainer methods, generic pAlgorithms, and different applications provide good scalability on more than 16,000 processors.
this paper presents the open-source COREMU, a scalable and portable parallel emulation framework that decouples the complexity of parallelizing full-system emulators from building a mature sequential one. the key obse...
详细信息
ISBN:
(纸本)9781450301190
this paper presents the open-source COREMU, a scalable and portable parallel emulation framework that decouples the complexity of parallelizing full-system emulators from building a mature sequential one. the key observation is that CPU cores and devices in current (and likely future) multiprocessors are loosely-coupled and communicate through well-defined interfaces. Based on this observation, COREMU emulates multiple cores by creating multiple instances of existing sequential emulators, and uses a thin library layer to handle the inter-core and device communication and synchronization, to maintain a consistent view of system resources. COREMU also incorporates lightweight memory transactions, feedback-directed scheduling, lazy code invalidation and adaptive signal control to provide scalable performance. To make COREMU useful in practice, we also provide some preliminary tools and APIs that can help programmers to diagnose performance problems and (concurrency) bugs. A working prototype, which reuses the widely-used QEMU as the sequential emulator, is with only 2500 lines of code (LOCs) changes to QEMU. It currently supports x64 and ARM platforms, and can emulates up to 255 (1) cores running commodity OSes with practical performance, while QEMU cannot scale above 32 cores. A set of performance evaluation against QEMU indicates that, COREMU has negligible uniprocessor emulation overhead, performs and scales significantly better than QEMU. We also show how COREMU could be used to diagnose performance problems and concurrency bugs of both OS kernel and parallel applications.
Exploiting heterogeneous parallel hardware currently requires mapping application code to multiple disparate programming models. Unfortunately, general-purpose programming models available today can yield high perform...
详细信息
Exploiting heterogeneous parallel hardware currently requires mapping application code to multiple disparate programming models. Unfortunately, general-purpose programming models available today can yield high performance but are too low-level to be accessible to the average programmer. We propose leveraging domain-specific languages (DSLs) to map high-level application code to heterogeneous devices. To demonstrate the potential of this approach we present OptiML, a DSL for machine learning. OptiML programs are implicitly parallel and can achieve high performance on heterogeneous hardware with no modification required to the source code. For such a DSL-based approach to be tractable at large scales, better tools are required for DSL authors to simplify language creation and parallelization. To address this concern, we introduce De lite, a system designed specifically for DSLs that is both a framework for creating an implicitly parallel DSL as well as a dynamic runtime providing automated targeting to heterogeneous parallel hardware. We show that OptiML running on De lite achieves single-threaded, parallel, and GPU performance superior to explicitly parallelized MATLAB code in nearly all cases.
On shared-memory systems, Cilk-style work-stealing [5] has been used to effectively parallelize irregular task-graph based applications such as Unbalanced Tree Search (UTS) [24, 28]. there are two main difficulties in...
详细信息
ISBN:
(纸本)9781450301190
On shared-memory systems, Cilk-style work-stealing [5] has been used to effectively parallelize irregular task-graph based applications such as Unbalanced Tree Search (UTS) [24, 28]. there are two main difficulties in extending this approach to distributed memory. In the shared memory approach, thieves (nodes without work) constantly attempt to asynchronously steal work from randomly chosen victims until they find work. In distributed memory, thieves cannot autonomously steal work from a victim without disrupting its execution. When work is sparse, this results in performance degradation. In essence, a direct extension of traditional work-stealing to distributed memory violates the work-first principle underlying work-stealing. Further, thieves spend useless CPU cycles attacking victims that have no work, resulting in system inefficiencies in multi-programmed contexts. Second, it is non-trivial to detect active distributed termination (detect that programs at all nodes are looking for work, hence there is no work). this problem is well-studied and requires careful design for good performance. Unfortunately, in most existing languages/frameworks, application developers are forced to implement their own distributed termination detection. In this paper, we develop a simple set of ideas that allow work-stealing to be efficiently extended to distributed memory. First, we introduce lifeline graphs: low-degree, low-diameter, fully-connected directed graphs. Such graphs can be constructed from k-dimensional hypercubes. When a node is unable to find work after w unsuccessful steals, it quiesces after informing the outgoing edges in its lifeline graph. Quiescent nodes do not disturb other nodes. A quiesced node is reactivated when work arrives from a lifeline, and itself shares this work withthose of its incoming lifelines that are activated. Termination occurs precisely when computation at all nodes has quiesced. In a language such as X10, such passive distributed terminati
In this paper, we propose the PSP practice Support System using Defect Types based on Phenomenon. this system can transmit programming to specific human among many software processes using a Multiagent technology. the...
详细信息
ISBN:
(纸本)9784990288051
In this paper, we propose the PSP practice Support System using Defect Types based on Phenomenon. this system can transmit programming to specific human among many software processes using a Multiagent technology. the system is also synthesized to do parallel and cooperative proposing internally. Applying the proposed method to a personal process-removing task, a flexible programming for quality of software. Software developments depend on information, which is possible to collection of personal process. Agent planning has get use working data on user action and other communication. therefore collection of all user data is necessary for agent learning. Agent studies the best transmission programming, planning and quality according to the makes planning in the personal process.
the Java language and runtime environment has had a profound worldwide impact on computer software since its introduction nearly two decades ago. It has enabled the creation of a rich ecosystem of libraries, framework...
详细信息
Two trends changed the computing landscape over the past decade: (1) hardware vendors started delivering chip multiprocessors (CMPs) instead of uniprocessors, and (2) software developers increasingly chose managed lan...
详细信息
ISBN:
(纸本)9781450301190
Two trends changed the computing landscape over the past decade: (1) hardware vendors started delivering chip multiprocessors (CMPs) instead of uniprocessors, and (2) software developers increasingly chose managed languages instead of native languages. Unfortunately, the former change is disrupting the virtuous-cycle between performance improvements and software innovation. Establishing a new parallel performance virtuous cycle for managed languages will require scalable applications executing on scalable Virtual Machine (VM) services, since the VM schedules, monitors, compiles, optimizes, garbage collects, and executes together withthe application. this talk describes current progress, opportunities, and challenges for scalable VM services. the parallel computing revolution urgently needs more innovations.
暂无评论