ISBN: 9781509036820 (print)
Graph databases are becoming a critical tool for analyzing graph-structured data in many scientific and technical domains, including cybersecurity and computational biology. In particular, the storage, analysis, and querying of attributed graphs is an important capability. Attributed graphs contain properties attached to the vertices and edges of the graph structure. Queries over attributed graphs include not only structural pattern matching but also conditions over the values of the attributes. In this work, we present GraQL, a query language designed for high-performance attributed graph databases hosted on a high-memory-capacity cluster. GraQL is designed to be the front-end language for the attributed graph data model of the GEMS database system.
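To make the idea of a query that combines a structural pattern with attribute conditions concrete, here is a minimal Python sketch. It is not GraQL syntax and not the GEMS data model; the `AttributedGraph` class, its `match` method, and the host and traffic attributes are all hypothetical names chosen for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class AttributedGraph:
    """Toy in-memory attributed graph (hypothetical, for illustration only)."""
    vertices: dict = field(default_factory=dict)  # vertex id -> property dict
    edges: list = field(default_factory=list)     # (src, dst, property dict)

    def add_vertex(self, vid, **props):
        self.vertices[vid] = props

    def add_edge(self, src, dst, **props):
        self.edges.append((src, dst, props))

    def match(self, src_pred, edge_pred, dst_pred):
        """Return (src, dst) pairs for the pattern src -[edge]-> dst
        where each predicate holds on the corresponding property dict."""
        return [
            (s, d)
            for s, d, eprops in self.edges
            if src_pred(self.vertices[s])
            and edge_pred(eprops)
            and dst_pred(self.vertices[d])
        ]


g = AttributedGraph()
g.add_vertex("h1", kind="host", os="linux")
g.add_vertex("h2", kind="host", os="windows")
g.add_vertex("h3", kind="host", os="linux")
g.add_edge("h1", "h2", port=22, bytes=1200)
g.add_edge("h1", "h3", port=443, bytes=90000)
g.add_edge("h2", "h3", port=22, bytes=50)

# Structural pattern (one directed edge) plus attribute conditions:
# Linux hosts that sent more than 1000 bytes to another host.
hits = g.match(
    src_pred=lambda v: v["os"] == "linux",
    edge_pred=lambda e: e["bytes"] > 1000,
    dst_pred=lambda v: v["kind"] == "host",
)
print(hits)
```

The structural part of the query is the edge pattern itself; the three predicates carry the conditions over attribute values. A real attributed graph database would evaluate both with indexes and in parallel rather than by scanning an edge list.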
ISBN: 9781728112466 (print)
Because individual commodity components are unreliable, failures are common in large-scale distributed storage systems. Erasure codes are widely deployed in practical storage systems to provide fault tolerance with low storage overhead. However, the random data placement commonly used in erasure-coded storage systems induces heavy cross-rack traffic, load imbalance, and random access, all of which slow down the recovery process after a failure. In this paper, using orthogonal arrays, we define a Deterministic Data Distribution (D³) of blocks to nodes and racks, and propose an efficient failure recovery approach based on D³. D³ not only distributes data and parity blocks uniformly among storage servers, but also balances the repair traffic among racks and storage servers during failure recovery. Furthermore, D³ minimizes the cross-rack repair traffic for data layouts that tolerate a single rack failure and provides sequential access for failure recovery. We implement D³ in the Hadoop Distributed File System (HDFS) on a cluster of 28 machines. Our experiments show that D³ significantly speeds up the failure recovery process compared with random data distribution, e.g., by a factor of 2.21 for a (6, 3)-RS code in a system consisting of eight racks with three nodes in each rack.
ISBN: 9781728174457 (print)
While large-scale scientific experiments and simulations produce massive amounts of data, only a small fraction of that data contains useful information. Efficient querying over such volumes of data to extract that information increases the productivity of the scientific discovery process. Although querying has been explored extensively in relational databases, research on and adoption of querying tools for scientific data stored in parallel file systems on high performance computing (HPC) systems are still in their infancy. In this paper, we introduce a parallel query service, called PDC-Query, for object data management systems (ODMS) on HPC systems. It operates on partitioned objects in parallel and provides several optimization strategies for fast query evaluation. The ODMS paradigm for HPC systems is promising for reducing the data management burden on users and for moving data transparently across the deep memory hierarchy of modern HPC systems. We propose a 'global histogram'-based approach that accelerates query evaluation through selectivity estimation and by reducing the amount of data that must be loaded from storage and processed. We compare querying performance and demonstrate the efficiency and scalability of the different approaches PDC-Query supports, including global histograms, bitmap indexes, sorting, and full scan, when performing various queries on a plasma physics dataset with 125 billion particles and an astronomy dataset with 25 million objects.
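The histogram idea can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions, not PDC-Query's implementation: the function names, the fixed bin edges shared by all partitions, and the sample data are hypothetical. Each partition keeps a small histogram; a range query skips any partition whose histogram shows no values in the queried range, and the per-partition histograms summed together give a global histogram for selectivity estimation.

```python
from bisect import bisect_right


def build_histogram(values, edges):
    """Count values per bin; bin i covers [edges[i], edges[i + 1]).
    Assumption: every value lies inside [edges[0], edges[-1])."""
    counts = [0] * (len(edges) - 1)
    for v in values:
        i = bisect_right(edges, v) - 1
        if not 0 <= i < len(counts):
            raise ValueError(f"value {v} outside histogram range")
        counts[i] += 1
    return counts


def estimate_hits(counts, edges, lo, hi):
    """Upper bound on the number of values v with lo <= v < hi:
    the total count of all bins that overlap [lo, hi)."""
    return sum(
        c for i, c in enumerate(counts)
        if edges[i] < hi and edges[i + 1] > lo
    )


def range_query(partitions, histograms, edges, lo, hi):
    """Scan only partitions whose histogram admits a possible match."""
    result, scanned = [], 0
    for part, hist in zip(partitions, histograms):
        if estimate_hits(hist, edges, lo, hi) == 0:
            continue  # pruned: no data loaded for this partition
        scanned += 1
        result.extend(v for v in part if lo <= v < hi)
    return result, scanned


edges = [0, 10, 20, 30, 40]
partitions = [[1, 3, 8], [12, 25, 27], [31, 35, 39]]
histograms = [build_histogram(p, edges) for p in partitions]

# Global histogram: element-wise sum of the per-partition histograms.
global_hist = [sum(col) for col in zip(*histograms)]
estimated = estimate_hits(global_hist, edges, 20, 30)

result, scanned = range_query(partitions, histograms, edges, 20, 30)
print(estimated, result, scanned)
```

Because the estimate is an upper bound, pruning never drops a matching value; the cost is that a partition may occasionally be scanned and yield nothing when the query range only partly covers a bin.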
ISBN: 9781479986484 (print)
Applications in many domains search moving object trajectory databases. The distance threshold search finds all trajectories within a given distance of a query trajectory. We develop three GPU distance threshold search implementations that use indexing techniques significantly different from those used in CPU implementations. We determine experimentally under which conditions each approach performs well using one real-world astrophysics dataset and two synthetic datasets. Overall, we find that the GPU is an attractive technology for a broad range of relevant trajectory database scenarios.
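As a point of reference for what a distance threshold search computes, the following Python sketch shows a brute-force CPU version. It is a simplification and not one of the GPU implementations described above: it assumes every trajectory is a list of 2-D points sampled at the same time steps as the query, and it uses no index. The function name and sample data are hypothetical.

```python
import math


def distance_threshold_search(database, query, d):
    """Return ids of trajectories that come within distance d of the
    query at some shared time step (brute force, no index)."""
    hits = []
    for tid, traj in database.items():
        if any(math.dist(p, q) <= d for p, q in zip(traj, query)):
            hits.append(tid)
    return hits


database = {
    "a": [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)],
    "b": [(0.0, 5.0), (1.0, 5.0), (2.0, 5.0)],
    "c": [(9.0, 9.0), (5.0, 1.0), (2.0, 1.0)],
}
query = [(0.0, 1.0), (1.0, 1.0), (2.0, 1.0)]

hits = distance_threshold_search(database, query, d=1.0)
print(hits)
```

Every trajectory is compared against the query at every time step, so the cost grows with the product of database size and trajectory length. Indexing techniques exist to avoid most of these comparisons.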
ISBN: 9781479980062 (print)
In recent years, the Hadoop Distributed File System (HDFS) has been deployed as the bedrock for many parallel big data processing systems, such as graph processing systems, MPI-based parallel programs, and Scala/Java-based Spark frameworks, which can efficiently support iterative and interactive data analysis in memory. The first part of my dissertation focuses on parallel data access on distributed file systems, e.g., HDFS. Because the distributed I/O resources and the global data distribution are often not taken into consideration, data requests from parallel processes or executors are frequently served remotely or in an imbalanced fashion by the storage servers. To address these problems, we develop I/O middleware systems and matching-based algorithms that map parallel data requests to storage servers so that local and balanced data access can be achieved. The last part of my dissertation presents our plans to improve the performance of interactive data access in big data analysis. Specifically, most interactive analysis programs scan the entire data set regardless of which data is actually required. We plan to develop a content-aware method that quickly accesses the required data without this laborious scanning process.
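The goal of mapping requests to storage servers so that access is both local and balanced can be illustrated with a small Python sketch. Note that this is a simple greedy heuristic standing in for the matching-based algorithms mentioned above, not a reproduction of them; the function name, the replica map, and the server names are hypothetical.

```python
def assign_requests(block_locations, servers):
    """Greedy stand-in for a matching algorithm: assign each block request
    to the currently least-loaded server that holds a replica of the block.

    block_locations: block id -> list of servers holding a replica.
    Returns (assignment, load), where load counts requests per server.
    """
    load = {s: 0 for s in servers}
    assignment = {}
    for block, replicas in block_locations.items():
        # Only replica holders are candidates, so every read stays local.
        target = min(replicas, key=lambda s: load[s])
        assignment[block] = target
        load[target] += 1
    return assignment, load


servers = ["s1", "s2", "s3"]
block_locations = {
    "b1": ["s1", "s2"],
    "b2": ["s1", "s3"],
    "b3": ["s1", "s2"],
    "b4": ["s2", "s3"],
}

assignment, load = assign_requests(block_locations, servers)
print(assignment, load)
```

A greedy pass like this guarantees locality but only approximates balance, because early choices cannot be revised. A bipartite matching or flow formulation can revisit them, which is one reason to prefer matching-based approaches.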
Distributed database systems need commit processing so that transactions executing on them still preserve the ACID properties. With the advance of main memory database systems, which have become possible due to dropping pric...
ISBN: 9781538608623 (print)
Today's large software applications provide abstractions of the real-life systems that they support. A digital model of the system, and of the changes that occur within it, is maintained and updated as triggered by real-life events. Morphologically, such applications contain several distinct architectural entities: databases holding the state, central components describing how the system reacts to external events, and mechanisms through which the user can view the current state and issue new commands. Each of these entities may use distinct paradigms and employ different technologies. A production-ready software application ends up assembling a relatively tall technology stack and provides the final abstractions for both the problem and its solution. In this paper, we propose a short-circuit for the long chain of technologies usually employed in large, production-ready software applications. The resulting architecture is a distributed, message-based system that behaves as a hybrid between a database and a runtime environment. The system operates with persistent, live entities that encapsulate both state and operations and are therefore easily assimilated to OOP classes.
ISBN: 9781538655559 (print)
RedisGraph is a Redis module developed by Redis Labs to add graph database functionality to the Redis database. RedisGraph represents connected data as adjacency matrices. By representing the data as sparse matrices and employing the power of GraphBLAS (a highly optimized library for sparse matrix operations), RedisGraph delivers a fast and efficient way to store, manage and process graphs. Initial benchmarks indicate that RedisGraph is significantly faster than comparable graph databases.
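The link between sparse adjacency matrices and graph processing can be illustrated with breadth-first search written as repeated sparse vector-matrix products. The Python sketch below is not RedisGraph code and does not call the GraphBLAS API; it assumes a hypothetical dict-of-sets layout in which `adj[u]` holds the nonzero columns of row `u` of the adjacency matrix.

```python
def bfs_levels(adj, source):
    """Breadth-first search as repeated sparse vector-matrix products.

    adj: dict vertex -> set of out-neighbours, i.e. the nonzero columns of
    each row of the adjacency matrix. Returns vertex -> BFS depth.
    """
    levels = {source: 0}
    frontier = {source}  # sparse boolean vector: the set of nonzero entries
    depth = 0
    while frontier:
        depth += 1
        # frontier * A over the boolean (OR, AND) semiring:
        # the union of the rows selected by the frontier.
        reached = set()
        for u in frontier:
            reached |= adj.get(u, set())
        # Mask out vertices that already have a level assigned.
        frontier = {v for v in reached if v not in levels}
        for v in frontier:
            levels[v] = depth
    return levels


adj = {
    "a": {"b", "c"},
    "b": {"d"},
    "c": {"d"},
    "d": set(),
    "e": {"a"},  # not reachable from "a"
}

levels = bfs_levels(adj, "a")
print(levels)
```

Each loop iteration corresponds to one product of the frontier vector with the adjacency matrix followed by a mask. Expressing traversal this way lets a tuned sparse linear algebra library do the heavy lifting.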
ISBN: 0769526411 (print)
An important aim of a database system is to guarantee database consistency, which means that the data contained in a database is both accurate and valid. Integrity constraints represent knowledge about data with which a database must be consistent. The process of checking constraints to ensure that update operations or transactions that alter the database preserve its consistency has proved extremely difficult to implement, particularly in distributed and parallel databases. In distributed databases, the aim of constraint checking is to reduce the amount of data that needs to be accessed, the number of sites involved, and the amount of data transferred across the network. In parallel databases, the focus is on the total execution time taken to check the constraints. This paper highlights the differences between centralized, distributed, and parallel databases with respect to constraint checking.