MapReduce is a very popular programming model for supporting parallel and distributed large-scale data processing. There have been many efforts to implement this model on commodity GPU-based systems. However, most of these implementations can only work on a single GPU and cannot be used to process large-scale datasets. In this paper, we present a new approach to designing the MapReduce framework on GPU clusters for handling large-scale data processing. We have used the Compute Unified Device Architecture (CUDA) and MPI parallel programming models to implement this framework. To derive an efficient mapping onto GPU clusters, we introduce a two-level parallelization approach: inter-node and intra-node parallelization. Furthermore, in order to improve the overall MapReduce efficiency, a multi-threading scheme is used to overlap communication and computation on a multi-GPU node. Compared to previous GPU-based MapReduce implementations, our implementation, called GCMR, achieves speedups of up to 2.6 on a single node and up to 9.1 on 4 nodes of a Tesla S1060 quad-GPU cluster system for processing small datasets. It also shows very good scalability for processing large-scale datasets on the cluster system.
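The abstract above does not include source code; the sketch below is only a minimal C++ illustration of the communication/computation overlap idea, using two buffers and a worker thread. The functions transfer_chunk and process_chunk are hypothetical placeholders for the MPI transfers and GPU map/reduce stages, not part of GCMR.

```cpp
// Minimal sketch of overlapping "communication" and "computation" with two
// buffers and a worker thread. transfer_chunk() and process_chunk() are
// hypothetical placeholders for the MPI transfers and GPU map/reduce stages
// described in the abstract; they are not part of GCMR itself.
#include <thread>
#include <vector>
#include <numeric>
#include <iostream>

static void transfer_chunk(std::vector<int>& buf, int chunk_id) {
    // Stand-in for an MPI receive / host-to-device copy.
    std::iota(buf.begin(), buf.end(), chunk_id * 1000);
}

static long long process_chunk(const std::vector<int>& buf) {
    // Stand-in for a GPU map/reduce kernel.
    return std::accumulate(buf.begin(), buf.end(), 0LL);
}

int main() {
    const int num_chunks = 8;
    const std::size_t chunk_size = 1 << 16;
    std::vector<int> buffers[2] = {std::vector<int>(chunk_size),
                                   std::vector<int>(chunk_size)};
    long long total = 0;

    transfer_chunk(buffers[0], 0);               // prefetch the first chunk
    for (int c = 0; c < num_chunks; ++c) {
        int cur = c % 2, nxt = (c + 1) % 2;
        std::thread comm;                        // overlap the next transfer...
        if (c + 1 < num_chunks)
            comm = std::thread(transfer_chunk, std::ref(buffers[nxt]), c + 1);
        total += process_chunk(buffers[cur]);    // ...with the current computation
        if (comm.joinable()) comm.join();
    }
    std::cout << "checksum: " << total << "\n";
}
```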
Breadth-First Search (BFS) is a basis for many graph traversal and analysis algorithms. In this paper, we present a direction-optimizing BFS implementation on CPU-GPU heterogeneous platforms that fully exploits the computing power of both the multi-core CPU and the GPU. For each level of the BFS algorithm, we dynamically choose the best implementation from: a sequential top-down execution on the CPU, a parallel top-down execution on the CPU, and a cooperative bottom-up execution on the CPU and GPU. By adapting to the runtime variability of vertex frontiers, such a hybrid approach provides the best performance for the exploration of each BFS level while avoiding poor worst-case performance. Our implementation demonstrates speedups of 1.37 to 1.44 compared to the highest published performance for shared-memory systems.
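A minimal sketch of the per-level direction choice described above, assuming an undirected adjacency-list graph and a single-threaded CPU implementation; the switch threshold is an arbitrary illustration, not the paper's tuned heuristic, and the cooperative GPU kernels are omitted.

```cpp
// Sketch of direction-optimizing BFS: each level picks top-down or bottom-up
// based on the frontier size. The switch threshold and the single-threaded
// CPU code are illustrative assumptions only.
#include <vector>
#include <iostream>

using Graph = std::vector<std::vector<int>>;   // adjacency lists (undirected)

std::vector<int> hybrid_bfs(const Graph& g, int source) {
    const int n = static_cast<int>(g.size());
    std::vector<int> depth(n, -1);
    std::vector<int> frontier{source};
    depth[source] = 0;

    for (int level = 0; !frontier.empty(); ++level) {
        std::vector<int> next;
        // Heuristic: go bottom-up once the frontier covers a sizable fraction
        // of the graph (threshold chosen arbitrarily for illustration).
        bool bottom_up = frontier.size() > g.size() / 20;

        if (!bottom_up) {                       // top-down: expand the frontier
            for (int u : frontier)
                for (int v : g[u])
                    if (depth[v] == -1) { depth[v] = level + 1; next.push_back(v); }
        } else {                                // bottom-up: scan unvisited vertices
            for (int v = 0; v < n; ++v) {
                if (depth[v] != -1) continue;
                for (int u : g[v])
                    if (depth[u] == level) { depth[v] = level + 1; next.push_back(v); break; }
            }
        }
        frontier.swap(next);
    }
    return depth;
}

int main() {
    Graph g{{1, 2}, {0, 3}, {0, 3}, {1, 2, 4}, {3}};
    for (int d : hybrid_bfs(g, 0)) std::cout << d << ' ';
    std::cout << '\n';                          // expected: 0 1 1 2 3
}
```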
A high-performance VLSI architecture for integer motion estimation (IME) in High Efficiency Video Coding (HEVC) is presented in this paper. It supports the coding tree block (CTB) structure with the asymmetric motion partition (AMP) mode. The architecture contains two parallel sub-architectures to meet 1080p@30fps real-time video coding. The CTB size L×L is set to L=32 pixels by default and can be extended to L=64 and L=16 pixels. A serial mode decision module that finds the optimal partition mode for the architecture has also been implemented.
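The paper describes a hardware design, so no software is implied; purely as background, the C++ sketch below shows what integer motion estimation computes: a full search over a small window that minimizes the sum of absolute differences (SAD) for one block. The block and search-range sizes are arbitrary and unrelated to the CTB sizes above.

```cpp
// Illustrative software model of integer motion estimation: full-search SAD
// minimization for a single block. Block/window sizes are arbitrary choices,
// not those of the VLSI architecture described above.
#include <cstdint>
#include <cstdlib>
#include <climits>
#include <vector>
#include <iostream>

struct Frame { int w, h; std::vector<uint8_t> pix;
               uint8_t at(int x, int y) const { return pix[y * w + x]; } };

struct MV { int dx, dy; long sad; };

MV full_search(const Frame& cur, const Frame& ref,
               int bx, int by, int bsize, int range) {
    MV best{0, 0, LONG_MAX};
    for (int dy = -range; dy <= range; ++dy)
        for (int dx = -range; dx <= range; ++dx) {
            int rx = bx + dx, ry = by + dy;
            if (rx < 0 || ry < 0 || rx + bsize > ref.w || ry + bsize > ref.h)
                continue;                        // candidate outside the frame
            long sad = 0;
            for (int y = 0; y < bsize; ++y)
                for (int x = 0; x < bsize; ++x)
                    sad += std::abs(int(cur.at(bx + x, by + y)) -
                                    int(ref.at(rx + x, ry + y)));
            if (sad < best.sad) best = {dx, dy, sad};
        }
    return best;
}

int main() {
    Frame ref{64, 64, std::vector<uint8_t>(64 * 64, 0)};
    Frame cur = ref;
    for (int y = 16; y < 24; ++y)               // bright 8x8 patch, shifted by
        for (int x = 16; x < 24; ++x)           // (+2,+1) in the current frame
            { ref.pix[y * 64 + x] = 200; cur.pix[(y + 1) * 64 + (x + 2)] = 200; }
    MV mv = full_search(cur, ref, 18, 17, 8, 4);
    std::cout << "mv = (" << mv.dx << ", " << mv.dy << "), sad = " << mv.sad << "\n";
}
```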
This paper addresses non-preemptive scheduling on two parallel identical machines sharing a single server in charge of loading and unloading jobs. Each job has to be loaded by the server before being processed on ...
ISBN: (Print) 9781479908981
A number of services for scientific computing based on cloud resources have recently drawn significant attention in both the research and infrastructure provider communities. Most cloud resources currently available lack true high-performance characteristics, such as high-speed interconnects or storage. Researchers studying cloud systems have pointed out that many cloud services do not provide service level agreements that meet the needs of the research community. Furthermore, the lack of location information provided to the user and the shared nature of the system's use may create risk for users, in the event that their data is moved to an unknown location with an unknown level of security. Indiana University and Penguin Computing have partnered to create a system, Rockhopper, which addresses many of these issues. This system is a true high-performance resource, with on-demand allocations, control, and tracking of jobs, situated at Indiana University's high-security datacenter facility. Rockhopper allows researchers to flexibly conduct their work under a number of use cases while also serving as an extension of cyberinfrastructure that scales from the researcher's local environment all the way up through large national resources. We describe the architecture and ideas behind the creation of the system, present a use case for campus bridging, and provide a typical example of system usage. In a comparison of Rockhopper to a cloud-based system, we run the Trinity RNA-seq software against a number of datasets on both the Rockhopper system and Amazon's EC2 service.
Graphics processing units (GPUs) are capable of achieving remarkable performance improvements for a broad range of applications. However, they have not been widely adopted as accelerators in embedded systems and mobile devices, mainly due to their relatively higher power consumption compared with embedded microprocessors. In this work, we conduct a comprehensive analysis of the feasibility and potential of accelerating applications using GPUs in low-power domains. We use two different categories of benchmarks: (1) the Level 3 BLAS subroutines, and (2) computer vision algorithms, namely mean shift image segmentation and the scale-invariant feature transform (SIFT). We carried out our experiments on the Nvidia CARMA development kit, which consists of an Nvidia Tegra 3 quad-core CPU and an Nvidia Quadro 1000M GPU. We find that the GPU can deliver a remarkable performance speedup compared with the CPU while using significantly less energy for most benchmarks. Further, we propose a hybrid approach to developing applications on platforms with GPU accelerators. This approach optimally distributes the workload between the parallel GPU and the sequential CPU to achieve the best performance while using the least energy.
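A minimal sketch of the workload-splitting idea: given assumed relative throughputs for the CPU and GPU, partition the work so both finish at roughly the same time. The throughput figures and the std::thread stand-ins for the two devices are placeholders, not the measured CARMA results.

```cpp
// Sketch of a static CPU/GPU workload split: given relative throughputs,
// assign each "device" a share so both finish at about the same time.
// The throughput figures and the std::thread stand-ins for the CPU and GPU
// workers are illustrative assumptions only.
#include <cstddef>
#include <thread>
#include <vector>
#include <numeric>
#include <iostream>

static double run_partition(const std::vector<double>& data,
                            std::size_t begin, std::size_t end) {
    // Placeholder for a real CPU loop or a GPU kernel launch.
    return std::accumulate(data.begin() + begin, data.begin() + end, 0.0);
}

int main() {
    const std::size_t n = 1'000'000;
    std::vector<double> data(n, 1.0);

    double cpu_rate = 1.0, gpu_rate = 4.0;       // assumed relative throughputs
    std::size_t gpu_share = static_cast<std::size_t>(
        n * gpu_rate / (cpu_rate + gpu_rate));   // GPU gets ~80% of the work

    double gpu_sum = 0.0, cpu_sum = 0.0;
    std::thread gpu_worker([&] { gpu_sum = run_partition(data, 0, gpu_share); });
    cpu_sum = run_partition(data, gpu_share, n); // CPU works concurrently
    gpu_worker.join();

    std::cout << "total = " << gpu_sum + cpu_sum << "\n";  // expect 1e6
}
```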
Accelerators have become critical in the effort to develop supercomputers with exascale computing capability. In this work, we examine the potential of two of the latest acceleration technologies, the Nvidia K20 Kepler GPU and the Intel Many Integrated Core (MIC) architecture, for accelerating geospatial applications. We first apply a set of benchmarks under three different configurations: MPI+CPU, MPI+GPU, and MPI+MIC. This set of benchmarks includes an embarrassingly parallel application, a loosely communicating application, and an intensely communicating application. We find that a straightforward MPI implementation on MIC cores can achieve the same performance speedup as a hybrid MPI+GPU implementation when the same number of processors is used. Further, we demonstrate the potential of hardware accelerators for advancing scientific research using an urban sprawl simulation application. The parallel implementation of the urban sprawl simulation using 16 Tesla M2090 GPUs realizes a 155× speedup compared with the single-node implementation, while achieving good strong scalability.
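The three configurations share the same host-side MPI structure; as a rough illustration only (not the paper's geospatial code), the skeleton below shows the embarrassingly parallel pattern: each rank processes its own slice of the domain and a single reduction combines the results. process_slice is a placeholder for the per-rank CPU, GPU, or MIC work.

```cpp
// Minimal MPI skeleton for the embarrassingly parallel case: every rank
// handles its own slice of the domain, and one reduction gathers the result.
// process_slice() is a placeholder for the per-rank (CPU/GPU/MIC) work.
#include <mpi.h>
#include <iostream>

static double process_slice(int rank, int nranks, long cells) {
    long begin = rank * cells / nranks, end = (rank + 1) * cells / nranks;
    double local = 0.0;
    for (long c = begin; c < end; ++c) local += 1.0;  // stand-in computation
    return local;
}

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank = 0, nranks = 1;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nranks);

    double local = process_slice(rank, nranks, 1000000L);
    double global = 0.0;
    MPI_Reduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0) std::cout << "total cells processed: " << global << "\n";
    MPI_Finalize();
}
```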
Multiprocessing modular exponentiation has a variety of uses, including cryptography, prime testing and computational number theory. It is also a very costly operation to compute. GPU parallelism can be used to accele...
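The preview above is truncated; as general background only, the sketch below shows the square-and-multiply modular exponentiation that GPU implementations typically parallelize over many independent inputs. It is plain host-side C++ (using the GCC/Clang __int128 extension), not the paper's GPU code.

```cpp
// Square-and-multiply modular exponentiation, the core operation discussed
// above. Plain C++ shown for background; a GPU version would typically run
// one such computation per thread over a batch of inputs.
#include <cstdint>
#include <iostream>

// Computes (base^exp) mod m using 128-bit intermediates to avoid overflow.
std::uint64_t mod_pow(std::uint64_t base, std::uint64_t exp, std::uint64_t m) {
    if (m == 1) return 0;
    unsigned __int128 result = 1, b = base % m;
    while (exp > 0) {
        if (exp & 1) result = (result * b) % m;   // multiply when the bit is set
        b = (b * b) % m;                          // square at every step
        exp >>= 1;
    }
    return static_cast<std::uint64_t>(result);
}

int main() {
    // Small check: 7^560 mod 561 == 1 (561 is a Carmichael number).
    std::cout << mod_pow(7, 560, 561) << "\n";
}
```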
With the widespread use of multi-core processors, specialized applications requiring parallel processing can be run on general desktops. In this study, we measure and analyze the parallel performance of LAMMPS, a well-known classical molecular dynamics code, on single multi-core systems. In order to check the parallel efficiency on single machines, various types of simulations were performed with LAMMPS using MPI and OpenMP. Although LAMMPS was run on a single machine, the MPI-based LAMMPS showed higher performance, because LAMMPS was designed around MPI and only parts of it have been parallelized with OpenMP. The preliminary experiments also showed that the memory subsystem affects the performance of LAMMPS.
Exact string matching algorithms are critical components of computational biology applications that operate on nucleotide or amino acid sequences. The current paper presents a slightly modified version of the Boyer-Moore string matching algorithm (applied in particular to genomes) that has been optimized for a dual Xeon 5860 system. We highlight several ways to improve execution time and conclude, based on our measurements, that STTNI hardware instructions alongside core-level parallelism achieve most of the performance gain.
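The abstract does not reproduce the modified algorithm; for reference, the sketch below is a plain C++ baseline of the bad-character (Horspool) variant of Boyer-Moore, without the paper's modifications, STTNI instructions, or core-level parallelism.

```cpp
// Baseline Boyer-Moore-Horspool (bad-character rule only) exact matcher,
// shown for reference; the paper's modified version, STTNI use, and
// multi-threading are not reproduced here.
#include <array>
#include <cstddef>
#include <string>
#include <vector>
#include <iostream>

std::vector<std::size_t> bmh_search(const std::string& text, const std::string& pat) {
    std::vector<std::size_t> hits;
    const std::size_t n = text.size(), m = pat.size();
    if (m == 0 || n < m) return hits;

    std::array<std::size_t, 256> shift;
    shift.fill(m);                                // default: skip a whole pattern length
    for (std::size_t i = 0; i + 1 < m; ++i)       // last occurrence (excluding final char)
        shift[static_cast<unsigned char>(pat[i])] = m - 1 - i;

    std::size_t pos = 0;
    while (pos + m <= n) {
        std::size_t j = m;
        while (j > 0 && text[pos + j - 1] == pat[j - 1]) --j;  // compare right to left
        if (j == 0) hits.push_back(pos);
        pos += shift[static_cast<unsigned char>(text[pos + m - 1])];
    }
    return hits;
}

int main() {
    std::string genome = "ACGTACGTGACGTTACGT";
    for (std::size_t p : bmh_search(genome, "ACGT"))
        std::cout << p << ' ';                    // expected: 0 4 9 14
    std::cout << '\n';
}
```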