ISBN (print): 9781728177281
As distributed databases expand in popularity, there is ever-growing research into new database architectures that are designed from the start with built-in self-tuning and self-healing features. In real-world deployments, however, migration to these entirely new systems is impractical, and the challenge is to keep massive fleets of existing databases available under constant software and hardware change. Apache Cassandra is one such existing database: it helped to popularize "scale-out" distributed databases, and it runs some of the largest deployments of any open-source distributed database. In this paper, we demonstrate the techniques needed to transform the typical, highly manual Apache Cassandra deployment into a self-healing system. We start by composing specialized agents to surface the signals needed for a self-healing deployment and to execute local actions. Then we show how to combine the signals from the agents into the cluster-level control planes required to safely iterate and evolve existing deployments without compromising database availability. Finally, we show how to create simulated models of the database's behavior, allowing rapid iteration with minimal risk. With these systems in place, it is possible to create a truly self-healing database system within existing large-scale Apache Cassandra deployments.
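The agent/control-plane split described in this abstract can be illustrated with a small sketch. The names below (NodeHealthAgent-style signals, ControlPlane, the gossip_up and disk_ok fields) are hypothetical and not the paper's actual components; the sketch only shows the shape of the loop, assuming per-node agents emit health signals and a cluster-level control plane issues at most one remediation per iteration to avoid compromising availability.

```python
from dataclasses import dataclass

# Hypothetical health signal emitted by a per-node agent (field names assumed).
@dataclass
class NodeHealth:
    node: str
    gossip_up: bool   # node visible to the rest of the ring
    disk_ok: bool     # local disk below failure thresholds

class ControlPlane:
    """Cluster-level loop: collect agent signals, act on at most one node."""

    def __init__(self, replication_factor: int = 3):
        self.rf = replication_factor

    def plan(self, signals: list[NodeHealth]) -> str | None:
        down = [s.node for s in signals if not (s.gossip_up and s.disk_ok)]
        # Refuse to act if too many replicas are already unhealthy: a
        # replacement would risk availability rather than restore it.
        if not down or len(down) >= self.rf:
            return None
        return f"replace-node {down[0]}"  # one remediation per iteration

signals = [NodeHealth("10.0.0.1", True, True), NodeHealth("10.0.0.2", False, True)]
print(ControlPlane().plan(signals))   # -> "replace-node 10.0.0.2"
```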
In this paper we continue the development of the partitioning application presented in [13], using JDBC and SQLJ in the ORACLE DBMS. Both APIs are compared by showing their specific characteristics and the advantages of connecting to the database through them. The application calculates the best partitioning scheme using vertical fragmentation and then allocates the resulting fragments to effective places in the network. We use a previously implemented algorithm that calculates the local and remote data access costs and obtains the best fragmentation scheme by selecting the one with the lowest cost.
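A rough illustration of the local-versus-remote cost comparison this abstract describes is sketched below. The access matrix, site mapping, and the remote-cost weight are invented for illustration and are not the algorithm from [13]; the sketch simply enumerates candidate two-fragment vertical splits and keeps the cheapest.

```python
from itertools import combinations

# access[app][attr] = access frequency; site_of_app maps applications to sites (assumed data).
access = {"app1": {"a": 10, "b": 0, "c": 4}, "app2": {"a": 0, "b": 8, "c": 6}}
site_of_app = {"app1": 1, "app2": 2}
REMOTE_COST = 5   # assumed: a remote access is five times as expensive as a local one

def scheme_cost(fragments, site_of_fragment):
    cost = 0
    for app, freqs in access.items():
        for i, frag in enumerate(fragments):
            hits = sum(freqs[a] for a in frag)
            factor = 1 if site_of_fragment[i] == site_of_app[app] else REMOTE_COST
            cost += hits * factor
    return cost

attrs = ["a", "b", "c"]
best = None
# Try every split of the attributes into two fragments placed at sites 1 and 2.
for k in range(1, len(attrs)):
    for left in combinations(attrs, k):
        frags = [set(left), set(attrs) - set(left)]
        c = scheme_cost(frags, site_of_fragment=[1, 2])
        if best is None or c < best[0]:
            best = (c, frags)
print(best)
```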
ISBN (print): 9781538627150
In recent years, an increasing amount of data is collected in different and often non-cooperating databases. The problem of privacy-preserving, distributed calculations over separate databases and the related issue of private data release have been intensively investigated. However, despite considerable progress, computational complexity, due to the increasing size of data, remains a limiting factor in real-world deployments, especially in the case of privacy-preserving computations. In this paper, we suggest sampling as a method of improving computational performance. Sampling has been a topic of extensive research that recently received a boost of interest. We provide a sampling method targeted at separate, non-collaborating, vertically partitioned datasets. The method is exemplified and tested on approximation of the intersection set, both with and without a privacy-preserving mechanism. An analysis of the bound on the error as a function of the sample size is discussed, and a heuristic algorithm is suggested to further improve performance. The algorithms were implemented, and experimental results confirm the validity of the approach.
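A minimal sketch of the intersection-approximation idea, under assumptions of our own: each party samples its key column independently at rate p, the overlap of the two samples is counted, and the count is scaled by 1/p^2 to estimate the true intersection size. The paper's actual estimator, error bound, and privacy-preserving mechanism are not reproduced here.

```python
import random

def sample(keys, p, seed):
    rng = random.Random(seed)
    return {k for k in keys if rng.random() < p}

# Two non-collaborating parties holding overlapping key sets (synthetic data).
party_a = set(range(0, 80_000))
party_b = set(range(50_000, 130_000))
p = 0.05

sa, sb = sample(party_a, p, seed=1), sample(party_b, p, seed=2)
# A common key survives both independent samples with probability p*p,
# so the sampled overlap is scaled back up by 1 / p^2.
estimate = len(sa & sb) / (p * p)
print(estimate, len(party_a & party_b))   # estimate vs. true size of 30,000
```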
ISBN (print): 9781538672501
The efficient execution of data-intensive workflows relies on strategies to enable parallel data processing, such as partitioning and replicating data across distributed resources. The maximum degree of parallelism a workflow can reach during its execution is usually defined at design time. However, designing workflow models capable of making efficient use of distributed computing platforms is not a simple task and requires specialized expertise. Furthermore, since Workflow Management Systems see workflow activities as black boxes, they are not able to automatically exploit data parallelism in the workflow execution. To address this problem, in this work we propose a novel method to automatically improve data parallelism in workflows based on annotations that characterize how activities access and consume data. For an annotated workflow model, the method defines a model transformation and a database setup (including data sharding, replication, and indexing) to support data parallelism in a distributed environment. To evaluate this approach, we implemented and tested two workflows that process up to 20.5 million data objects from real-world datasets. We executed each model in 21 different scenarios in a cluster on a public cloud, using a centralized relational database and a distributed NoSQL database. The automatic parallelization created by the proposed method reduced the execution times of these workflows by up to 66.6%, without increasing the monetary costs of their execution.
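To make the annotation idea concrete, here is a hedged sketch in which each activity declares which collection it reads and which field its accesses are keyed by, and a simple planner turns that into a sharding decision. The annotation fields, activity names, and planner logic are all invented for illustration and do not reflect the paper's actual annotation language or transformation rules.

```python
from dataclasses import dataclass

# Hypothetical activity annotation: which collection the activity reads and
# which field it groups its accesses by (names are illustrative only).
@dataclass
class ActivityAnnotation:
    name: str
    reads: str
    partition_field: str | None   # None => the activity scans the whole collection

def derive_setup(annotations):
    """Turn annotations into a simple per-collection sharding plan."""
    plan = {}
    for a in annotations:
        if a.partition_field:
            plan[a.reads] = {"shard_key": a.partition_field, "parallel": True}
        else:
            plan.setdefault(a.reads, {"shard_key": None, "parallel": False})
    return plan

workflow = [
    ActivityAnnotation("filter_obs", reads="observations", partition_field="region_id"),
    ActivityAnnotation("global_stats", reads="observations", partition_field=None),
]
print(derive_setup(workflow))  # observations sharded on region_id; filter step parallelizable
```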
ISBN (print): 3540673822
The objective of data reduction is to obtain a compact representation of a large data set to facilitate repeated use of non-redundant information with complex and slow learning algorithms and to allow efficient data transfer and storage. For a user-controllable allowed accuracy loss, we propose an effective data reduction procedure based on guided sampling for identifying a minimal-size representative subset, followed by a model-sensitivity analysis for determining an appropriate compression level for each attribute. Experiments were performed on 3 large data sets and, depending on an allowed accuracy loss margin ranging from 1% to 5% of the ideal generalization, the achieved compression rates ranged between 95 and 12,500 times. These results indicate that transferring reduced data sets from multiple locations to a centralized site for efficient and accurate knowledge discovery might often be possible in practice.
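The sketch below illustrates the two stages the abstract describes, under assumptions of our own: a sample is grown until a model trained on it stays within an allowed accuracy-loss margin of the full-data model, and attributes are then crudely quantized only if the margin still holds. The decision-tree learner, synthetic data, doubling schedule, and rounding step are stand-ins, not the paper's guided sampling or sensitivity analysis.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

X, y = make_classification(n_samples=20_000, n_features=10, random_state=0)
X_test, y_test = X[-5_000:], y[-5_000:]
X_pool, y_pool = X[:-5_000], y[:-5_000]

def accuracy(Xtr, ytr):
    model = DecisionTreeClassifier(random_state=0).fit(Xtr, ytr)
    return accuracy_score(y_test, model.predict(X_test))

target = accuracy(X_pool, y_pool) - 0.02      # assumed allowed loss margin of 2%

# Guided sampling (simplified): keep doubling the sample until the margin is met.
n = 250
while accuracy(X_pool[:n], y_pool[:n]) < target and n < len(X_pool):
    n *= 2

# Crude per-attribute compression: round each column, accepted only if the
# rounded sample still satisfies the margin (a stand-in for sensitivity analysis).
X_reduced = np.round(X_pool[:n], decimals=1)
if accuracy(X_reduced, y_pool[:n]) >= target:
    print(f"kept {n} rows of {len(X_pool)}, rounded attributes to 1 decimal")
```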
ISBN (print): 9783642153891
One of the challenges of applications of distributed database (DDB) systems is the possibility of expanding through the use of the Internet, so widespread nowadays. One of the most difficult problems in DDB systems deployment is distribution design. Additionally, existing models for optimizing the data distribution design have only aimed at optimizing query transmission and processing costs, overlooking the delays incurred by query transmission and processing times, which is a major concern for Internet-based systems. In this paper a mathematical programming model is presented which describes the behavior of a DDB with vertical fragmentation and makes it possible to optimize its design taking into account the nonlinear nature of round-trip response time (query transmission delay, query processing delay, and response transmission delay). This model was solved using two metaheuristics: the threshold accepting algorithm (a variant of simulated annealing) and tabu search, and comparative experiments were conducted with these algorithms in order to assess their effectiveness for solving this problem.
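Threshold accepting differs from simulated annealing in that it accepts any move whose cost increase stays below a shrinking threshold, rather than accepting worse moves with a probability. The sketch below applies it to a toy attribute-to-site assignment with a made-up nonlinear cost; the paper's actual response-time model, neighborhood, and parameters are not reproduced.

```python
import random

random.seed(0)
SITES, ATTRS = 3, 8
traffic = [[random.randint(1, 9) for _ in range(SITES)] for _ in range(ATTRS)]

def cost(assign):
    # Toy nonlinear "round-trip" cost: remote traffic is penalized superlinearly.
    total = 0.0
    for attr, site in enumerate(assign):
        for s in range(SITES):
            t = traffic[attr][s]
            total += t if s == site else (t ** 1.5) * 2
    return total

def threshold_accepting(iters=5_000, threshold=50.0, decay=0.999):
    current = [random.randrange(SITES) for _ in range(ATTRS)]
    best = current[:]
    for _ in range(iters):
        neighbor = current[:]
        neighbor[random.randrange(ATTRS)] = random.randrange(SITES)
        # Accept any move whose cost increase is below the current threshold.
        if cost(neighbor) - cost(current) < threshold:
            current = neighbor
            if cost(current) < cost(best):
                best = current[:]
        threshold *= decay     # shrink the threshold over time
    return best, cost(best)

print(threshold_accepting())
```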
ISBN (print): 9781509020478
The process of knowledge discovery applied in distributed databases implies finding useful knowledge by mining data sets stored in real implementations of distributed databases. A distributed database is a software system that allows a multitude of applications to access data stored in local or remote databases. In this scenario, data distribution is achieved through the process of replication. Nowadays many solutions for storing data are available: relational distributed Database Management Systems (DBMS), NoSQL storage solutions, NewSQL storage solutions, graph-oriented databases, object-oriented databases, object-relational databases, etc. The present study analyzes the most commonly used storage solution: the relational model. The replication topology used in the related experiments was the classical publisher-subscriber topology, with data distributed from the publisher system. The present work studies the interaction between the distributed data mining architecture best suited to mining distributed data (distributed Committee Machines) and real relational distributed databases. The chosen data mining task is classification. Distributed Committee Machines are a group of neural networks working in a distributed procedure to obtain improved classification performance compared to a single neural structure. In these experiments we used the classical multilayer perceptron trained with the backpropagation algorithm. The execution performance of the distributed Committee Machine is analyzed for some of the most widely used types of replication in relational databases: snapshot replication, merge replication, transactional replication, and transactional replication with queued updating. For all these types of replication, the execution performance (distributed speedup and distributed efficiency) of the entire system is also analyzed. These results are useful to numerous research fields: adaptive e-learning applications, med
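A minimal sketch of the committee idea, assuming scikit-learn's MLPClassifier as a stand-in for the backpropagation-trained multilayer perceptrons and synthetic data in place of the replicated relational tables: each member trains on one site's replica, and the committee combines members by averaging their class probabilities. The replication types and performance measurements from the paper are not modeled here.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.neural_network import MLPClassifier

X, y = make_classification(n_samples=6_000, n_features=20, random_state=0)
X_test, y_test = X[-1_000:], y[-1_000:]
replicas = np.array_split(np.arange(5_000), 4)   # stand-in for four replicated sites

# Each committee member is an MLP trained on one site's replica of the data.
members = [
    MLPClassifier(hidden_layer_sizes=(32,), max_iter=300, random_state=i).fit(X[idx], y[idx])
    for i, idx in enumerate(replicas)
]

# The committee combines members by averaging their class probabilities.
avg_proba = np.mean([m.predict_proba(X_test) for m in members], axis=0)
committee_pred = avg_proba.argmax(axis=1)
print("committee accuracy:", (committee_pred == y_test).mean())
```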
ISBN (print): 0780382927
Query caching has been used effectively to improve query processing in distributed database environments. Most prior caching techniques are based on single-level caching of previous query results, essentially to avoid accessing the underlying databases each time a user submits the same query. In this paper, we propose a new methodology that allows caching a combination of both plans and results of prior queries in a multilevel caching architecture. The objective is to reduce the response time of distributed query processing and hence increase system throughput.
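A hedged sketch of a two-level lookup in that spirit: level 1 caches results keyed by the exact query and its parameters, while level 2 caches plans keyed by the query template, so a level-1 miss can still reuse a previously built plan. The class name, key scheme, and the stand-in planner/executor are assumptions, not the paper's architecture.

```python
class MultiLevelQueryCache:
    """Level 1 caches results per exact query; level 2 caches plans per template."""

    def __init__(self, plan_builder, executor):
        self.results, self.plans = {}, {}
        self.plan_builder, self.executor = plan_builder, executor

    def run(self, template, params):
        key = (template, tuple(sorted(params.items())))
        if key in self.results:                    # level 1 hit: no execution at all
            return self.results[key]
        if template not in self.plans:             # level 2 miss: build and cache the plan
            self.plans[template] = self.plan_builder(template)
        result = self.executor(self.plans[template], params)
        self.results[key] = result
        return result

# Illustrative stand-ins for a real distributed planner and executor.
cache = MultiLevelQueryCache(
    plan_builder=lambda tmpl: f"PLAN[{tmpl}]",
    executor=lambda plan, params: f"rows for {plan} with {params}",
)
print(cache.run("SELECT * FROM t WHERE id = :id", {"id": 7}))
print(cache.run("SELECT * FROM t WHERE id = :id", {"id": 8}))  # reuses the cached plan
```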
ISBN (print): 9786197105100
In this paper we present a horizontal fragmentation algorithm for the design phase of a distributed database. Our results have been implemented in the case of a university database application. We propose a matrix of attribute values that is used by the database administrator in the requirement analysis phase of the system development life cycle for making decisions about mapping data to different locations. We call this matrix SIDU (S - Select, I - Insert, D - Delete, and U - Update). It is a table constructed by placing predicates on attributes of a relation as the rows and the applications of the sites of a DDBMS as the columns. We use SIDU to generate an ACF array with values for each relation. We treat cost as the effort of accessing and modifying a particular attribute of a relation by an application from a particular site.
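The sketch below shows the general shape of such a matrix-driven allocation, with all counts, predicates, operation weights, and the site-assignment rule invented for illustration: per-predicate SIDU counts are collapsed into an access-cost figure per site, and each horizontal fragment is placed at the site that accesses it most. It is not the paper's actual SIDU or ACF construction.

```python
# Hypothetical SIDU-style counts: for each predicate (candidate horizontal fragment)
# and each site, how often applications Select/Insert/Delete/Update matching rows.
sidu = {
    "dept = 'CS'":   {"site1": {"S": 40, "I": 5, "D": 1, "U": 8},
                      "site2": {"S": 3,  "I": 0, "D": 0, "U": 1}},
    "dept = 'MATH'": {"site1": {"S": 2,  "I": 0, "D": 0, "U": 0},
                      "site2": {"S": 35, "I": 4, "D": 2, "U": 6}},
}
WEIGHTS = {"S": 1, "I": 2, "D": 2, "U": 2}   # assumed: a modification costs twice a read

def acf(sidu):
    """Collapse SIDU counts into an access-cost figure per (predicate, site)."""
    return {
        pred: {site: sum(WEIGHTS[op] * n for op, n in ops.items())
               for site, ops in per_site.items()}
        for pred, per_site in sidu.items()
    }

allocation = {pred: max(costs, key=costs.get) for pred, costs in acf(sidu).items()}
print(allocation)   # each fragment goes to the site that accesses it the most
```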
ISBN (print): 9781450367356
We present RADD, an innovative analytic pipeline used to measure reliability and availability for cloud-based distributed databases by leveraging the vast amount of telemetry present in the cloud. RADD can perform root cause analysis (RCA) to provide a minute-by-minute summary of the availability of a database close to real-time. On top of this data, RADD can raise alerts, analyze the stability of new versions during their deployment, and provide Key Performance Indicators (KPIs) that allow us to understand the stability of our system across all deployed databases. RADD implements an event correlation framework that puts the emphasis on data compliance and uses information entropy to measure causality and reduce noisy signals. It also uses statistical modelling to analyze new versions of the product and detect potential regressions early in our software development lifecycle. We demonstrate the application of RADD on top of Azure Synapse Analytics, where the system has helped us identify top-hitting and new issues and support on-call teams regarding every aspect of database health.
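To illustrate the kind of entropy-based correlation the abstract mentions, here is a hedged sketch that ranks candidate cause events by their mutual information with per-minute unavailability. The telemetry, event names, and the use of plain mutual information are assumptions for illustration; RADD's actual correlation framework, compliance handling, and statistical models are not reproduced.

```python
import math

def mutual_information(xs, ys):
    """Mutual information between two boolean per-minute signals, in bits."""
    n = len(xs)
    mi = 0.0
    for a in (0, 1):
        for b in (0, 1):
            pxy = sum(1 for x, y in zip(xs, ys) if x == a and y == b) / n
            px = sum(1 for x in xs if x == a) / n
            py = sum(1 for y in ys if y == b) / n
            if pxy > 0:
                mi += pxy * math.log2(pxy / (px * py))
    return mi

# Synthetic per-minute telemetry: 1 = event fired / database unavailable that minute.
unavailable = [0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 0, 1]
candidates = {
    "node_restart": [0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 0, 1],   # tracks the outage closely
    "backup_job":   [1, 0, 0, 1, 0, 1, 0, 0, 1, 0, 1, 0],   # mostly noise
}
ranked = sorted(candidates, key=lambda c: mutual_information(candidates[c], unavailable), reverse=True)
print(ranked)   # restart events correlate far more strongly with unavailability
```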