In numerous real applications, uncertainty is inherently introduced when massive data are generated. Modern database management systems aim to incorporate and handle data with uncertainties as a first-class citizen, w...
Cloud computing has delivered unprecedented compute capacity to NASA missions at affordable rates. Missions like the Mars Exploration Rovers (MER) and Mars Science Lab (MSL) are enjoying the elasticity that enables th...
ISBN (print): 9780769546766
In the last few years, many scientific applications have been developed for powerful graphics processing units (GPUs) and have achieved remarkable speedups. This success can be partially attributed to high-performance, host-callable GPU library routines that are offloaded to GPUs at runtime. These library routines are based on C/C++-like programming toolkits such as CUDA from NVIDIA and have the same calling signatures as their CPU counterparts. Recently, with sufficient support for C++ templates in CUDA, the emergence of template libraries has enabled further advances in code reusability and rapid software development for GPUs. However, Expression Templates (ET), which have been very popular for implementing data-parallel scientific software for host CPUs because of their intuitive, mathematics-like syntax, have been underutilized by GPU development libraries. This underutilization stems from the difficulty of offloading expression templates from hosts to GPUs: instantiated expressions cannot be passed to GPU kernels, and the exact form of the expressions is not known at coding time. This paper presents a general approach that enables automatic offloading of C++ expression templates to CUDA-enabled GPUs by using C++ metaprogramming and Just-In-Time (JIT) compilation to generate and compile CUDA kernels for the corresponding expression templates, then executing those kernels with appropriate arguments. This approach allows developers to port applications to run on GPUs with virtually no code modifications. More specifically, this paper uses a large ET-based data-parallel physics library called QDP++ as an example to illustrate many aspects of the approach and to demonstrate very good speedups for typical QDP++ applications running on GPUs versus CPUs. In addition, this approach of automatic offloading …
Techniques based on metaheuristics and nature-inspired paradigms can provide efficient solutions to a wide variety of problems. Moreover, parallel and distributed metaheuristics can be used to provide more powerful problem-solving environments in a variety of fields, ranging, for example, from finance to bio- and health-informatics. This workshop seeks to provide an opportunity for researchers to explore the connection between metaheuristics and the development of solutions to problems that arise in operations research, parallel computing, telecommunications, and many others.
Computer vision and image processing have always been an active research domain that requires enormous computational effort. In this paper we present the architecture of a modular object retrieval system. It is based on a dataflow concept which allows flexible adaptation to different tasks. This concept facilitates parallel processing as well as distributed computing. We also present a dynamic load balancing service for heterogeneous environments that has been integrated to improve system performance. First experiments show that the developed balancer performs better than standard balancing techniques in this environment.
Desktop computing remains indispensable in scientific exploration, largely because it provides people with devices for human interaction and environments for interactive job execution. However, with today's rapidly growing data volume and task complexity, it is increasingly hard for individual workstations to meet the demands of interactive scientific data processing. The increasing cost of such interactive processing is hindering the productivity of end-to-end scientific computing workflows. While existing distributed computing systems allow people to aggregate desktop workstation resources for parallel computing, the burden of explicit parallel programming and parallel job execution often prevents scientists from taking advantage of such platforms. In this paper, we discuss the need for transparent desktop parallel computing in scientific data processing. As an initial step toward this goal, we present our on-going work on the automatic parallelization of the scripting language R, a popular tool for statistical computing. Our preliminary results suggest that a reasonable speedup can be achieved on real-world sequential R programs without requiring any code modification.
An MPI library, called MPICH-PM/CLUMP, has been implemented on a cluster of SMPs. MPICH-PM/CLUMP realizes zero-copy message passing between nodes while using one-copy message passing within a node to achieve high perf...
The aim of this paper is to present two new portable and high-performance implementations of routines that can be used for piecewise cubic interpolation. The first (sequential) is based on LAPACK routines, while the second, based on ScaLAPACK, is designed for distributed-memory parallel computers and clusters. The results of experiments performed on a cluster of twenty Itanium 2 processors and on a Cray X1 are also presented and briefly discussed.