Scientific distributed applications have an increasing need to process and move large amounts of data across wide area networks. Existing systems either closely couple computation and data movement, or they require substantial human involvement during the end-to-end process. We propose a framework that enables scientists to build reliable and efficient data transfer and processing pipelines. Our framework provides a universal interface to different data transfer protocols and storage systems. It has sophisticated flow control and recovers automatically from network, storage-system, software, and hardware failures. We successfully used data pipelines to replicate and process three terabytes of the DPOSS astronomy image dataset and several terabytes of the WCER educational video dataset. In both cases, the entire process was performed without any human intervention, and the data pipeline recovered automatically from various failures. Copyright (c) 2005 John Wiley & Sons, Ltd.
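The core ideas in this abstract, a uniform interface over heterogeneous transfer protocols and automatic recovery from transient failures, can be sketched in a few lines. The handler table and `transfer` function below are illustrative stand-ins, not the paper's actual framework API:

```python
import shutil
import subprocess
import time

# Hypothetical handlers keyed by URL scheme; the framework in the abstract
# plugs real protocol clients (GridFTP, HTTP, local copy, ...) behind one
# such uniform interface.
HANDLERS = {
    "file": lambda src, dst: shutil.copy(src, dst),
    "gsiftp": lambda src, dst: subprocess.run(
        ["globus-url-copy", src, dst], check=True),  # GridFTP CLI client
}

def transfer(src_url, dst_url, retries=5, backoff=2.0):
    """Move one file, retrying with exponential backoff on any failure."""
    scheme = src_url.split("://", 1)[0] if "://" in src_url else "file"
    for attempt in range(1, retries + 1):
        try:
            HANDLERS[scheme](src_url, dst_url)
            return
        except Exception:
            if attempt == retries:
                raise  # give up only after exhausting the retry budget
            time.sleep(backoff ** attempt)  # back off, then retry the step
```

Retrying a failed step from scratch mirrors the simple recovery policy the abstract implies; the real framework's flow control is presumably more sophisticated.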
The main goal of oil reservoir management is to provide more efficient, cost-effective and environmentally safer production of oil from reservoirs. Numerical simulations can aid in the design and implementation of optimal production strategies. However, traditional simulation-based approaches to optimizing reservoir management are rapidly overwhelmed by data volume when large numbers of realizations are sought using detailed geologic descriptions. In this paper, we describe a software architecture to facilitate large-scale simulation studies, involving ensembles of long-running simulations and analysis of vast volumes of output data. Copyright (c) 2005 John Wiley & Sons, Ltd.
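The ensemble pattern this abstract describes, many independent long-running runs followed by analysis of their combined output, reduces to a fan-out/collect driver. A minimal sketch, assuming a hypothetical `reservoir-sim` solver binary rather than the paper's actual software:

```python
from concurrent.futures import ProcessPoolExecutor
import subprocess

def run_realization(geology_file):
    """Run one long-running simulation for a single geologic realization.
    `reservoir-sim` is a hypothetical solver binary, not the paper's tool."""
    out_file = geology_file + ".out"
    subprocess.run(["reservoir-sim", "--input", geology_file,
                    "--output", out_file], check=True)
    return out_file

def run_ensemble(realizations, workers=8):
    """Fan independent realizations out across worker processes and
    collect the output paths for a downstream analysis stage."""
    with ProcessPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(run_realization, realizations))
```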
ISBN:
(print) 0780386949
When designing SAMGrid, a project for distributing high-energy physics computations on a grid, we discovered that it was challenging to decide where to place users' jobs. Jobs typically need to access hundreds of files, and each site has a different subset of the files. Our data system, SAM, knows what portion of a user's data may be at each site, but does not know how to submit grid jobs. Our job submission system, Condor-G, knows how to submit grid jobs, but originally it required users to choose grid sites and gave them no assistance in choosing. This paper describes how we enhanced Condor-G to interact with SAM to make good decisions about where jobs should be executed, and thereby improve the performance of grid jobs that access large amounts of data. These enhancements are general enough to be applicable to grid computing beyond data-intensive computing with SAMGrid.
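The placement decision described here is essentially a ranking of sites by data locality. A minimal sketch, assuming a hypothetical `site_catalog` mapping in place of SAM's real file-location metadata:

```python
def rank_sites(needed_files, site_catalog):
    """Score each grid site by the fraction of the job's input files
    already resident there, so the submitter can prefer data-local
    execution. `site_catalog` stands in for SAM's location tracking."""
    scores = [
        (site, len(needed_files & cached) / len(needed_files))
        for site, cached in site_catalog.items()
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)

# A job needing three files ranks siteB (2/3 resident) above siteA (1/3).
print(rank_sites({"f1", "f2", "f3"},
                 {"siteA": {"f1"}, "siteB": {"f1", "f2"}}))
```

In Condor-G terms, a score like this could feed the job's Rank expression during matchmaking, steering jobs toward sites that already hold most of their input; the actual SAMGrid integration is more involved.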
Effective high-level data management is becoming an important issue as more and more scientific applications manipulate huge amounts of secondary-storage and tertiary-storage data using parallel processors. A major problem with current solutions is that they either require a deep understanding of specific data storage architectures and file layouts to obtain the best performance (as in high-performance storage management systems and parallel file systems), or they sacrifice significant performance in exchange for ease of use and portability (as in traditional database management systems). In this paper, we discuss the design, implementation, and evaluation of a novel application development environment for scientific computations. This environment includes a number of components that make it easy for programmers to code and run their applications with little programming effort and, at the same time, to harness the available computational and storage power of parallel architectures.
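The ease-of-use versus layout-awareness trade-off this abstract targets is easiest to see as an interface question. A hypothetical high-level API (all names invented here, not the paper's actual components) might let programmers name a dataset and a per-chunk operation while a backend picks the storage-specific access path:

```python
class ListBackend:
    """Toy in-memory backend; a real one would hide parallel file-system
    or tertiary-storage access behind the same read_chunks() call."""
    def __init__(self, chunks_by_name):
        self.chunks_by_name = chunks_by_name

    def read_chunks(self, name, chunk_mb):
        yield from self.chunks_by_name[name]

class Dataset:
    """Programmers name a dataset and a per-chunk operation; the backend
    chooses the layout-specific access strategy."""
    def __init__(self, name, backend):
        self.name, self.backend = name, backend

    def map_chunks(self, fn, chunk_mb=64):
        for chunk in self.backend.read_chunks(self.name, chunk_mb):
            yield fn(chunk)

# Example: summing chunks without knowing how they are stored.
ds = Dataset("temps", ListBackend({"temps": [[1, 2], [3, 4]]}))
print(list(ds.map_chunks(sum)))  # [3, 7]
```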
With increases in the amount of data available for analysis in commercial settings, online analytical processing (OLAP) and decision support have become important applications for high-performance computing. Implementing such applications on clusters requires substantial expertise and effort, particularly because of the sizes of the input and output datasets. In this paper, we describe our experiences in developing one such application using a cluster middleware called ADR. We focus on the problem of data cube construction, which commonly arises in multi-dimensional OLAP. We show how ADR, originally developed for scientific data-intensive applications, can be used to carry out an efficient and scalable data cube construction implementation. A particular issue with the use of ADR is the tiling of output datasets. We present new algorithms that combine interprocessor communication and tiling within each processor. These algorithms preserve the important properties that are desirable from any parallel data cube construction algorithm. We have carried out a detailed evaluation of our implementation. The main results from our experiments are as follows: (1) high speedups are achieved on both dense and sparse datasets, even though we have used simple algorithms that sequentialize a part of the computation; (2) the execution time depends only upon the amount of computation, and does not increase in a super-linear fashion as the dataset size or the number of tiles increases; and (3) as the datasets become more sparse, sequential performance degrades, but the parallel speedups are still quite good. As part of our ongoing work in this area, we are also looking at handling a larger number of dimensions and multi-dimensional partitionings. We describe our preliminary theoretical and experimental work in this direction. (C) 2003 Elsevier Science B.V. All rights reserved.
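For readers unfamiliar with data cube construction, the computation being parallelized is the set of all group-by aggregates over subsets of the dimensions. A toy sequential version in Python (not ADR's implementation) makes the cost structure clear; the paper's contribution is parallelizing this and tiling the output so cubes larger than memory can be built:

```python
from itertools import combinations

def data_cube(rows, dims, measure):
    """Compute all 2^len(dims) group-by aggregates (the full data cube).
    A toy sequential version: the paper parallelizes the aggregation and
    tiles the output arrays across and within processors."""
    cube = {}
    for k in range(len(dims) + 1):
        for subset in combinations(dims, k):
            agg = {}
            for row in rows:
                key = tuple(row[d] for d in subset)
                agg[key] = agg.get(key, 0) + row[measure]
            cube[subset] = agg
    return cube

# Sales facts with dimensions (product, region) and measure qty.
rows = [{"product": "a", "region": "e", "qty": 2},
        {"product": "a", "region": "w", "qty": 3}]
cube = data_cube(rows, ("product", "region"), "qty")
print(cube[()])            # {(): 5}      -- grand total
print(cube[("product",)])  # {('a',): 5}  -- roll-up over region
```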
ISBN:
(print) 0769515827
With increases in the amount of data available for analysis in commercial settings, On-Line Analytical Processing (OLAP) and decision support have become important applications for high-performance computing. Implementing such applications on clusters requires substantial expertise and effort, particularly because of the sizes of the input and output datasets. In this paper we describe our experiences in developing one such application using a cluster middleware called ADR. We focus on the problem of data cube construction, which commonly arises in multi-dimensional OLAP. We show how ADR, originally developed for scientific data-intensive applications, can be used to carry out an efficient and scalable data cube construction implementation. A particular issue with the use of ADR is the tiling of output datasets. We present new algorithms that combine inter-processor communication and tiling within each processor. These algorithms preserve the important properties that are desirable from any parallel data cube construction algorithm. We have carried out a detailed evaluation of our implementation. The main results from our experiments are as follows: (1) high speedups are achieved on both dense and sparse datasets, even though we have used simple algorithms that sequentialize a part of the computation; (2) the execution time depends only upon the amount of computation, and does not increase in a super-linear fashion as the dataset size or the number of tiles increases; and (3) as the datasets become more sparse, sequential performance degrades, but the parallel speedups are still quite good.