ISBN: 9781728112466 (print)
The scalability of systems such as Hive and Spark SQL that are built on top of big data platforms has enabled query processing over very large data sets. However, the per-node performance of these systems is typically low compared to traditional relational databases. Conversely, Massively Parallel Processing (MPP) databases do not scale as well as these systems. We present HRDBMS, a fully implemented distributed shared-nothing relational database developed with the goal of improving the scalability of OLAP queries. HRDBMS achieves high scalability through a principled combination of techniques from relational and big data systems with novel communication and work-distribution techniques. While we also support serializable transactions, the system has not been optimized for this use case. HRDBMS runs on a custom distributed and asynchronous execution engine that was built from the ground up to support highly parallelized operator implementations. Our experimental comparison with Hive, Spark SQL, and Greenplum confirms that HRDBMS's scalability is on par with Hive and Spark SQL (up to 96 nodes) while its per-node performance can compete with MPP databases like Greenplum.
ISBN: 0818684038 (print)
On-Line Analytical Processing (OLAP) techniques are used for data analysis and decision support systems. The multidimensionality of the underlying data is well represented by multidimensional databases. OLAP calculations can be used effectively for data mining in knowledge discovery, but they require high performance parallel systems to provide interactive analysis. Precomputed aggregate calculations in a Data Cube can provide efficient query processing for OLAP applications. In this article, we present parallel data cube construction from a relational database on distributed-memory parallel computers. The Data Cube is used for data mining of associations using Attribute Focusing. Results are presented for these on the IBM-SP2, which show that our algorithms and techniques are scalable to a large number of processors, providing a high performance platform for such applications.
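To make the idea of precomputed aggregates concrete, here is a minimal sequential sketch of a data cube: it sums one measure for every subset of the dimensions. It is an illustration only, not the parallel construction algorithm of the paper; the function name `build_cube` and the sample `sales` rows are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def build_cube(rows, dims, measure):
    """Return {group-by dimensions: {key values: summed measure}}.

    Every subset of `dims` (including the empty one, i.e. the grand
    total) gets its own precomputed aggregate table.
    """
    cube = {}
    for k in range(len(dims) + 1):
        for group in combinations(dims, k):
            totals = defaultdict(int)
            for row in rows:
                key = tuple(row[d] for d in group)
                totals[key] += row[measure]
            cube[group] = dict(totals)
    return cube


# Hypothetical sample data, used only for illustration.
sales = [
    {"region": "east", "product": "a", "units": 3},
    {"region": "east", "product": "b", "units": 5},
    {"region": "west", "product": "a", "units": 2},
]
cube = build_cube(sales, ("region", "product"), "units")
```

With the cube in place, a query such as "units per region" becomes a dictionary lookup (`cube[("region",)]`) instead of a scan over the base rows.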
ISBN: 0818620528 (print)
This paper considers text retrieval systems which store extremely large amounts of text while providing a multi-user retrieval service for a large customer base. Due to the severe I/O demands of such a system, it is usually beneficial if not necessary to utilize a multi-processor system with multiple I/O facilities in an effort to increase the parallel I/O activity, the objective being to lower search response time. After defining the problem, we model a solution and show that the application can be handled in a very effective fashion by a multi-processor system with a simple LAN-based topology. The final discussion describes a type of functional splitting which, if done in a careful manner, helps improve search response time.
ISBN: 0769516599 (print)
This paper describes the design philosophy for the Grid system being developed by the Japan Committee on High-Performance Computing for Bioinformatics and the Initiative for Parallel Bioinformatics (IPAB). Grid is one of the attractive solutions for achieving a distributed bioinformatics environment with high performance parallel computers, large genomic databases, and computation-intensive applications such as homology search and molecular simulation. However, much remains to be done in Grid system design, especially in the wide area network environment. OBIGrid emphasizes the virtual organization aspect of the Grid system and gives higher priority to security and scalability than to performance.
ISBN: 0818620528 (print)
The paper describes a preliminary evaluation of some multi-join strategies and their performance on parallel hardware. The hardware used was a Sequent (under UNIX) with 11 usable processors, each with shared and private primary memory. A multi-join was broken down into a series of single joins which were then allocated to clusters, each cluster being a collection of parallel processors. The results of single joins, which were studied by both binary search and hash-merge techniques, were then further processed as *** evaluation was conducted varying a number of parameters, such as cluster size, tuple size and cardinality. The comparative results were plotted. The study highlights the importance of a number of factors that influence the performance of a multi-join operation.
ISBN: 0818672552 (print)
We present a novel optimization called Last Parallel Call Optimization (LPCO) for parallel systems. The last parallel call optimization can be regarded as a parallel extension of the last call optimization found in sequential systems. While the LPCO is fairly general, we use and-parallel logic programming systems to illustrate it and to report its performance on multiprocessor systems. The last parallel call optimization leads to improved time and space performance for a majority of and-parallel programs. We also present a generalization of the Last Parallel Call Optimization called Nested Parallel Call Optimization (NPCO). A major advantage of LPCO and NPCO is that parallel systems designed for exploiting control parallelism can automatically exploit data parallelism efficiently.
ISBN: 0818608935 (print)
The authors explore the notion of node autonomy in distributed computer systems. Some motivations for autonomy are presented. Different facets of autonomy as well as relationships among them are discussed. Finally, they examine how autonomy affects other aspects of distributed computing, including timeliness, correctness, load sharing, data sharing, and data replication.
ISBN: 9781450306270 (print)
The need for managing massive attributed graphs is becoming common in many areas such as recommendation systems, proteomics analysis, social network analysis or bibliographic analysis. This is making it necessary to move towards parallel systems that allow managing graph databases containing millions of vertices and edges. Previous work on distributed graph databases has focused on finding ways to partition the graph to reduce network traffic and improve execution time. However, partitioning a graph and keeping the information regarding the location of vertices might be unrealistic for massive graphs. In this paper, we propose ParallelGDB, a new system based on specializing the local caches of any node in this system, providing a better cache hit ratio. ParallelGDB uses random graph partitioning, avoiding complex partition methods based on the graph topology, which usually require managing extra data structures. The proposed system provides an efficient environment for distributed graph databases.
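As a rough illustration of why topology-independent partitioning needs no extra location data, the sketch below assigns each vertex to a node by hashing its identifier. This is an assumption made for illustration: the abstract only says the partitioning is random, and the names `owner` and `partition` are hypothetical, not part of ParallelGDB.

```python
import zlib


def owner(vertex_id: str, num_nodes: int) -> int:
    """Map a vertex to a node using a stable hash of its identifier.

    Any node can recompute this, so no vertex-location table is needed.
    """
    return zlib.crc32(vertex_id.encode("utf-8")) % num_nodes


def partition(vertices, num_nodes):
    """Group vertex identifiers by their owning node."""
    parts = [[] for _ in range(num_nodes)]
    for v in vertices:
        parts[owner(v, num_nodes)].append(v)
    return parts


# Hypothetical vertex identifiers spread over four nodes.
parts = partition([f"v{i}" for i in range(100)], 4)
```

The trade-off is that neighbouring vertices land on arbitrary nodes, which is presumably why the system pairs this scheme with specialized local caches.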
Distributed database systems need commit processing so that transactions executing on them still preserve the ACID property. With the advance of main memory database systems which become possible due to dropping pric...
ISBN: 0769526411 (print)
An important aim of a database system is to guarantee database consistency, which means that the data contained in a database is both accurate and valid. Integrity constraints represent knowledge about data with which a database must be consistent. The process of checking constraints to ensure that update operations or transactions which alter the database will preserve its consistency has proved to be extremely difficult to implement, particularly in distributed and parallel databases. In distributed databases the aim of constraint checking is to reduce the amount of data needing to be accessed, the number of sites involved and the amount of data transferred across the network. In parallel databases the focus is on the total execution time taken in checking the constraints. This paper highlights the differences between centralized, distributed and parallel databases with respect to constraint checking.