Sparse Matrix-Vector Multiplication (SpMV) kernel dominates the computing cost in numerous scientific applications. Many implementations based on different sparse formats were proposed recently for optimizing this ker...
ISBN: 9781509033157 (print)
The data volume of many scientific applications has substantially increased in the past decade and continues to increase due to the rising needs of high-resolution and fine-granularity scientific discovery. The data movement between storage and compute nodes has become a critical performance factor and has attracted intense research and development attention in recent years. In this paper, we propose a novel solution, named Active burst-buffer, to reduce unnecessary data movement and to speed up scientific workflows. Active burst-buffer enhances the existing burst-buffer concept with analysis capabilities by reconstructing the cached data into a logical file and providing a MapReduce-like computing framework for programming and executing the analysis codes. An extensive set of experiments was conducted to evaluate the performance of Active burst-buffer by comparing it against existing mainstream schemes, and improvements of more than 30% were observed. The evaluations confirm that Active burst-buffer is capable of enabling efficient in-transit data analysis on burst-buffer nodes and is a promising solution for scientific discovery with large-scale data sets.
ISBN: 9781509035946 (print)
Visualization of experimental data and results of numerical simulations is among the most basic skills that need to be mastered by students at undergraduate, graduate and postgraduate *** faced with many visualization techniques and available tools, deep understanding of the fundamental concepts, such as the visualization pipeline, is the necessary foundation for using data visualization *** article shows how the concept of pipeline processing can be explained to students and how a tool such as OpenDX, with its visual programming capabilities, can be used to teach basic visualization concepts and techniques.
ISBN: 9781509036837 (print)
Porting applications to new hardware or programming models is a tedious and error-prone process. Any help that eases these burdens saves developer time that can then be invested into the advancement of the application itself instead of preserving the status quo on a new platform. The Alpaka library defines and implements an abstract hierarchical redundant parallelism model. The model exploits parallelism and memory hierarchies on a node at all levels available in current hardware. By doing so, it achieves platform and performance portability across various types of accelerators by ignoring specific unsupported levels and utilizing only the ones supported on a specific accelerator. All hardware types (multi- and many-core CPUs, GPUs and other accelerators) are supported and can be programmed in the same way. The Alpaka C++ template interface allows for straightforward extension of the library to support other accelerators and specialization of its internals for optimization. Running Alpaka applications on a new (and supported) platform requires changing only one line of source code instead of many #ifdefs.
ISBN: 9781509036837 (print)
Graph partitioning has important applications in multiple areas of computing, including scheduling, social networks, and parallel processing. In recent years, GPUs have proven successful at accelerating several graph algorithms. However, the irregular nature of real-world graphs poses a problem for GPUs, which favor regularity. In this paper, we discuss the design and implementation of a parallel multilevel graph partitioner for a CPU-GPU system. The partitioner aims to overcome some of the challenges arising due to memory constraints on GPUs and maximizes the utilization of GPU threads through suitable load-balancing schemes. We present a lock-free shared-memory scheme since fine-grained synchronization among thousands of threads imposes too high a performance overhead. The partitioner, implemented in CUDA, outperforms serial Metis and parallel MPI-based ParMetis. It performs similarly to the shared-memory CPU-based parallel graph partitioner mt-metis.
ISBN: 9781509036837 (print)
Summary form only given. Approximate computing has recently received a great deal of attention from a range of researchers including circuit designers, hardware architects, and programming language designers. This talk discusses some of the recent trends in approximate computing and then argues that approximation is really something that application developers have been doing all along. So, perhaps the biggest insight in the current trend in approximation is that by exposing the things application developers approximate to the rest of the computer system, there is the opportunity to do even more. We then investigate one of the things that become possible when the computer system can coordinate with an approximate application. Specifically, we discuss JouleGuard: a framework that coordinates approximate applications with system resource usage to meet user-defined energy goals with control-theoretic formal guarantees. We show results of using JouleGuard on three different platforms (a mobile device, a tablet, and a server) with eight different approximate applications created from two different frameworks. We find that JouleGuard respects energy budgets, provides near-optimal accuracy, adapts to phases in application workload, and provides better outcomes than application approximation or system resource adaptation alone. JouleGuard is general with respect to the applications and systems it controls, making it a suitable runtime for a number of approximate computing frameworks.
ISBN: 9781509036837 (print)
The ability to design effective solutions using parallel processing should be a required competency for every computing student. However, teaching parallel concepts is sometimes challenging and costly, especially at early stages of a computer science degree. For these reasons, we present a set of modules to teach parallel computing paradigms. The modules use example problems that are computationally intensive but easy to understand, and that can be easily implemented using the Python parallelization libraries MPI for Python and Disco.
ISBN: 9781509054114 (print)
The proceedings contain 40 papers. The topics discussed include: fault tolerant frequent pattern mining; parallel performance-energy predictive modeling of browsers: case study of Servo; optimization of brain mobile interface applications using IoT; Mizan-RMA: accelerating Mizan graph processing framework with MPI RMA; CUDA M3: designing efficient CUDA managed memory-aware MPI by exploiting GDR and IPC; parallel implementation of lossy data compression for temporal data sets; scalable parallel algorithms for shared nearest neighbor clustering; DCRoute: speeding up inter-datacenter traffic allocation while guaranteeing deadlines; efficient data redistribution to speedup big data analytics in large systems; MEC: the memory elasticity controller; Phoenix: memory speed HPC I/O with NVM; dynamic data layout optimization for high performance parallel I/O; read consistency in distributed database based on DMVCC; data elevator: low-contention data movement in hierarchical storage system; CMT-Bone - A Proxy application for compressible multiphase turbulent flows; tensor contractions with extended BLAS kernels on CPU and GPU; memory-efficient parallel simulation of electron beam dynamics using GPUs; cache-friendly design for complex spatially-variable coefficient stencils on many-core architectures; using message logs and resource use data for cluster failure diagnosis; and PRESAGE: protecting structured address generation against soft errors.
ISBN: 9781509021413 (print)
We present a novel trace-based analysis tool that rapidly classifies an MPI application as bandwidth-bound, latency-bound, load-imbalance-bound, or computation-bound for different interconnection networks. The tool uses an extension of Lamport's logical clock to track application progress in the trace replay. It has two unique features. First, it predicts application performance for many latency and bandwidth parameters from a single replay of the trace. Second, it infers the performance characteristics of an application and classifies the application using the predicted performance trend for a range of network configurations instead of using the predicted performance for a particular network configuration. We describe the techniques used in the tool and its design and implementation, and report our performance study of the tool and our experience with classifying nine applications and mini-apps from the DOE Design Forward project as well as the NAS Parallel Benchmarks.
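The abstract above does not describe how the tool extends Lamport's logical clock, so the following is only a minimal sketch of the standard, unextended clock applied to a trace replay. The trace format, the `replay` function, and the FIFO delivery assumption are hypothetical and are not taken from the paper; no latency or bandwidth model is included.

```python
from dataclasses import dataclass


@dataclass
class LamportClock:
    """Plain Lamport logical clock for one process (no network model)."""
    time: int = 0

    def tick(self) -> int:
        # Local event: advance the clock by one.
        self.time += 1
        return self.time

    def send(self) -> int:
        # Sending is an event; the returned timestamp travels with the message.
        return self.tick()

    def receive(self, msg_time: int) -> int:
        # On receipt, move past both the local clock and the message timestamp.
        self.time = max(self.time, msg_time) + 1
        return self.time


def replay(trace, num_ranks):
    """Replay (kind, rank, peer) events and return each rank's final clock.

    Hypothetical trace format: every "recv" is assumed to appear after its
    matching "send", and messages between a pair of ranks are assumed to be
    delivered in FIFO order.
    """
    clocks = [LamportClock() for _ in range(num_ranks)]
    in_flight = {}  # (src, dst) -> timestamps of messages not yet received
    for kind, rank, peer in trace:
        if kind == "compute":
            clocks[rank].tick()
        elif kind == "send":
            in_flight.setdefault((rank, peer), []).append(clocks[rank].send())
        elif kind == "recv":
            stamp = in_flight[(peer, rank)].pop(0)
            clocks[rank].receive(stamp)
        else:
            raise ValueError(f"unknown event kind: {kind}")
    return [c.time for c in clocks]


# Two ranks exchange one message in each direction.
example_trace = [
    ("compute", 0, None),
    ("compute", 0, None),
    ("send", 0, 1),
    ("compute", 1, None),
    ("recv", 1, 0),
    ("send", 1, 0),
    ("recv", 0, 1),
]
final_clocks = replay(example_trace, num_ranks=2)
```

In this sketch a receive can never be ordered before its matching send, which is the property a replay needs in order to propagate delays along message dependencies.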