ISBN:
(Print) 9781479912926; 9781479912933
Scientific I/O libraries such as PnetCDF, ADIOS, and HDF5 are commonly used to facilitate array-based scientific dataset processing. The underlying physical data layout, however, is usually hidden from the upper layer's logical access, and this mismatch can lead to poor I/O performance. In this research, we observed performance degradation in the case of concurrent sub-array accesses, where overlaps among calls accessing sub-arrays led to high contention on storage servers due to the logical-physical mismatch. In this work, we propose a locality-driven high-level I/O aggregation approach to address these issues. By designing a logical-physical mapping scheme, we exploit the scientific dataset's structured formats and the file system's data distribution to resolve the mismatch, so that I/O can be carried out in a locality-driven fashion. The proposed approach is effective and complements existing I/O strategies, such as independent I/O and collective I/O. We have also carried out experimental tests, and the results confirm the performance improvement over existing I/O strategies. The proposed locality-driven high-level I/O aggregation approach holds promise for efficiently processing scientific datasets, which is critical for the data-intensive, big-data computing era.
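The logical-physical mapping idea the abstract describes can be sketched as follows. This is an illustrative assumption of how such a scheme might work, not the paper's actual implementation: file bytes are striped round-robin across storage servers (Lustre-style), each sub-array's byte extents are split at stripe boundaries, and requests are bucketed per server so aggregators can issue server-local I/O. The names `stripe_size`, `n_servers`, and the extent format are inventions of this sketch.

```python
def server_of(offset, stripe_size, n_servers):
    """Round-robin (Lustre-like) striping: which server holds this byte?"""
    return (offset // stripe_size) % n_servers

def group_by_server(extents, stripe_size, n_servers):
    """Split byte extents at stripe boundaries and bucket them per server,
    so each storage server receives only the I/O it physically owns."""
    buckets = {s: [] for s in range(n_servers)}
    for start, length in extents:
        pos, end = start, start + length
        while pos < end:
            stripe_end = (pos // stripe_size + 1) * stripe_size
            chunk_end = min(end, stripe_end)
            buckets[server_of(pos, stripe_size, n_servers)].append(
                (pos, chunk_end - pos))
            pos = chunk_end
    return buckets

# Two overlapping sub-array requests, 1 MiB stripes, 4 servers.
groups = group_by_server([(0, 3 << 20), (2 << 20, 2 << 20)], 1 << 20, 4)
```

Grouping by server before issuing I/O is what makes the access locality-driven: contention arises exactly when overlapping requests from different callers land on the same server, and the buckets make that overlap explicit (here, both requests hit server 2).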
Machine learning has seen increasing implementation as a predictive tool in the chemical and physical sciences in recent years. It offers a route to accelerate the process of scientific discovery through a computational, data-driven approach. Whilst machine learning is well established in other fields, such as pharmaceutical research, it is still in its infancy in supercritical fluids research, but will likely accelerate dramatically in coming years. In this review, we present a basic introduction to machine learning and discuss its current uses by supercritical fluids researchers. In particular, we focus on the most common machine learning applications, including: (1) the estimation of the thermodynamic properties of supercritical fluids; (2) the estimation of solubilities, miscibilities, and extraction yields; (3) chemical reaction optimization; (4) materials synthesis optimization; (5) supercritical power systems; (6) fluid dynamics simulations of supercritical fluids; (7) molecular simulation of supercritical fluids; and (8) geosequestration of CO2 using supercritical fluids.
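Application (1), data-driven property estimation, can be illustrated with a minimal toy model. Everything here is invented for the sketch: the synthetic "density-like" response, the linear functional form, and the temperature/pressure ranges are stand-ins, not data or models from the review.

```python
import numpy as np

# Toy data-driven property estimator: fit a synthetic response as a
# linear function of temperature and pressure via ordinary least squares.
rng = np.random.default_rng(0)
T = rng.uniform(310, 400, 200)   # K (above CO2's critical temperature)
P = rng.uniform(80, 300, 200)    # bar (above the critical pressure)
y = 0.8 * P - 1.5 * (T - 304) + rng.normal(0, 2.0, 200)  # synthetic target

X = np.column_stack([np.ones_like(T), T, P])   # design matrix [1, T, P]
coef, *_ = np.linalg.lstsq(X, y, rcond=None)   # least-squares fit

pred = X @ coef
r2 = 1 - np.sum((y - pred) ** 2) / np.sum((y - np.mean(y)) ** 2)
```

In practice the studies surveyed use richer models (neural networks, random forests) and measured data, but the workflow — assemble features, fit, validate against held-out measurements — has this same shape.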
ISBN:
(Print) 9781467362184
Scientific datasets and libraries, such as HDF5, ADIOS, and NetCDF, have been widely used in many data-intensive applications. These libraries have their own special file formats and I/O functions to provide efficient access to large datasets. As data sizes keep increasing, these high-level I/O libraries face new challenges. Recent studies have started to utilize database techniques such as indexing, subsetting, and data reorganization to manage the growing datasets. In this work, we present a new approach to boost data analysis performance, namely Fast Analysis with Statistical Metadata (FASM), via data subsetting and the integration of a small amount of statistics into the original datasets. The added statistical information describes the data's shape and provides knowledge of the data distribution, so the original I/O libraries can use these statistical metadata to perform fast queries and analyses. The proposed FASM approach is currently evaluated with PnetCDF on Lustre file systems, but can also be implemented with other scientific libraries. FASM can potentially lead to a new dataset design and can have an impact on big data analysis.
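A minimal sketch of the statistical-metadata idea, under assumptions of this example rather than FASM's actual format: store per-block (min, max) statistics alongside a 1-D array so a range query can prove, from the metadata alone, that most blocks hold no matches and skip reading them.

```python
import numpy as np

def build_stats(data, block):
    """Per-block (min, max) 'statistical metadata' for a 1-D array."""
    return [(data[i:i + block].min(), data[i:i + block].max())
            for i in range(0, len(data), block)]

def query(data, stats, block, lo, hi):
    """Return values in [lo, hi], scanning only blocks the stats admit."""
    hits, scanned = [], 0
    for b, (bmin, bmax) in enumerate(stats):
        if bmax < lo or bmin > hi:   # metadata rules this block out
            continue
        scanned += 1
        chunk = data[b * block:(b + 1) * block]
        hits.extend(chunk[(chunk >= lo) & (chunk <= hi)].tolist())
    return hits, scanned

data = np.arange(1000, dtype=np.int64)   # synthetic sorted "dataset"
stats = build_stats(data, 100)           # 10 blocks of 100 values
hits, scanned = query(data, stats, 100, 250, 260)
```

On this toy dataset the query touches only 1 of 10 blocks; the win grows with dataset size because the metadata stays small relative to the data it summarizes.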
ISBN:
(Print) 9781467362184
In this paper we describe the design and implementation of the Open Science Data Cloud, or OSDC. The goal of the OSDC is to provide petabyte-scale data cloud infrastructure and related services for scientists working with large quantities of data. Currently, the OSDC consists of more than 2000 cores and 2 PB of storage distributed across four data centers connected by 10G networks. We discuss some of the lessons learned during the past three years of operation and describe the software stacks used in the OSDC. We also describe some of the research projects in biology, the earth sciences, and the social sciences enabled by the OSDC.
Scientific distributed applications have an increasing need to process and move large amounts of data across wide area networks. Existing systems either closely couple computation and data movement, or they require substantial human involvement during the end-to-end process. We propose a framework that enables scientists to build reliable and efficient data transfer and processing pipelines. Our framework provides a universal interface to different data transfer protocols and storage systems. It has sophisticated flow control and recovers automatically from network, storage system, software, and hardware failures. We successfully used data pipelines to replicate and process three terabytes of the DPOSS astronomy image dataset and several terabytes of the WCER educational video dataset. In both cases, the entire process was performed without any human intervention, and the data pipeline recovered automatically from various failures. Copyright (c) 2005 John Wiley & Sons, Ltd.
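The "recovers automatically from failures" behaviour can be sketched as a retry wrapper around each pipeline stage. This is a generic illustration, not the framework's real protocol adaptors: the stage function, failure type, and attempt count are all assumptions of the example.

```python
import time

def with_retries(step, attempts=5):
    """Run one pipeline stage, retrying on transient failure."""
    for attempt in range(attempts):
        try:
            return step()
        except IOError:
            if attempt == attempts - 1:
                raise                     # exhausted: surface the failure
            time.sleep(0)                 # real code would back off, e.g. 2 ** attempt

# Simulate a stage that fails twice with transient errors, then succeeds.
calls = {"n": 0}
def flaky_transfer():
    calls["n"] += 1
    if calls["n"] < 3:
        raise IOError("transient network error")
    return "replicated"

result = with_retries(flaky_transfer)
```

Composing stages (transfer, then process, then verify) each behind such a wrapper is what lets a multi-terabyte run complete end-to-end without human intervention.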
ISBN:
(Print) 9781467362184
Hadoop has emerged as an important platform for data-intensive computing. The shuffle and sort phases of a MapReduce computation often saturate top-of-rack switches, as well as switches that aggregate multiple racks. In addition, MapReduce computations often have "hot spots" in which the computation is lengthened due to inadequate bandwidth to some of the nodes. In principle, OpenFlow enables an application to adjust the network topology as required by the computation, providing additional network bandwidth to the resources requiring it. We describe Hadoop-OFE, an OpenFlow-enabled version of Hadoop that dynamically modifies the network topology in order to improve the performance of Hadoop.
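The hot-spot detection step can be illustrated with a toy shuffle traffic matrix. This is an assumption-laden sketch of the idea, not Hadoop-OFE's code: the matrix values and the 2x-mean threshold are invented, and in the real system an OpenFlow controller would then provision extra paths to the flagged nodes.

```python
# traffic[i][j] = bytes node i sends to node j during the shuffle phase
traffic = [
    [0, 10, 10, 10],
    [10, 0, 10, 70],
    [10, 10, 0, 80],
    [10, 10, 10, 0],
]

# Inbound shuffle volume per node: column sums of the traffic matrix.
inbound = [sum(row[j] for row in traffic) for j in range(len(traffic))]
mean = sum(inbound) / len(inbound)

# Flag nodes receiving far more than their share (threshold is arbitrary).
hot_spots = [j for j, b in enumerate(inbound) if b > 2 * mean]
```

Here node 3 receives 160 units against a mean of 62.5, so it is the hot spot whose links would receive additional bandwidth.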
ISBN:
(Print) 9781509018949
In this paper, we present cuLib, an R package that provides an easy-to-access interface for utilizing the computing power of NVIDIA GPUs. The cuLib package aims to make GPU-based parallel programming easier, more flexible, and higher-performance. It allows the use of GPU computing in R without further GPU-specific knowledge, because the syntax for defining and manipulating GPU data is similar to standard R. cuLib is compatible across devices and operating systems. The data interface is very flexible, enabling users to manipulate data freely. The cuLib package provides an R wrapper for the libraries of NVIDIA's CUDA toolkit and numerous operations. More importantly, it is not only a mathematical tool but is also practical for algorithms dealing with data-intensive computation.