Distributed query processing systems such as Apache Hive and Spark are widely used in many organizations for large-scale data analytics. Analyzing and understanding the query execution process of these systems is a daily routine for engineers and is crucial for identifying performance problems, optimizing system configurations, and rectifying errors. However, existing visualization tools for distributed query execution are insufficient because (i) most of them (if not all) do not provide fine-grained visualization (i.e., at the atomic task level), which can be crucial for understanding query performance and reasoning about the underlying execution anomalies, and (ii) they do not properly link system status to query execution, which makes it difficult to identify the causes of execution problems. To tackle these limitations, we propose QEVIS, which visualizes the distributed query execution process with multiple views that focus on different granularities and complement each other. Specifically, we first devise a query logical plan layout algorithm to visualize the overall query execution progress compactly and clearly. We then propose two novel scoring methods to summarize the anomaly degrees of the jobs and machines during query execution, and visualize the anomaly scores intuitively, which allows users to easily identify the components that are worth paying attention to. Moreover, we devise a scatter plot-based task view to show a massive number of atomic tasks, where task distribution patterns are informative for execution problems. We also equip QEVIS with a suite of auxiliary views and interaction methods to support easy and effective cross-view exploration, which makes it convenient to track the causes of execution problems. QEVIS has been used in the production environment of our industry partner, and we present three use cases from real-world applications and user interviews to demonstrate its effectiveness. QEVIS is open-source at https://***/
The availability of a multitude of data sources has naturally increased the need for subjects to collaborate in supporting distributed computations that combine different data collections for their elaboration and analysis. Because datasets grow so quickly, the authorities collecting and owning them often resort to external third parties (e.g., cloud providers) for their storage and management. Data under the control of different authorities are autonomously encrypted (using different encryption schemes and keys) for external storage, which makes distributed computations combining these sources difficult to support. In this paper, we propose an approach enabling collaborative computations over data encrypted in storage, selectively involving also subjects that might not be authorized to access the data in plaintext when their collaboration is considered economically convenient. We also consider the possible adoption of trusted hardware components to enable the evaluation of operations over plaintext data at non-fully trusted computational providers. The experimental results confirm the economic benefits that can be enabled by our proposal. © 2022 The Authors. Published by Elsevier Ltd. This is an open access article under the CC BY-NC-ND license (http://***/licenses/by-nc-nd/4.0/)
The availability of cloud services offered by different providers brings several advantages to users and companies, facilitating the storage, sharing, and processing of data. At the same time, the adoption of cloud se...
ISBN: 9781509014453 (print)
RDF datasets have increased rapidly over the last few years. In order to process SPARQL queries on these large datasets, much effort has been spent on developing horizontally scalable techniques, which involve data partitioning and parallel query processing. While distribution may provide storage scalability, it may also incur high communication costs for processing queries. In this paper, we present a parallel and distributed query processing approach that exploits data allocation patterns, provided by a controlled data distribution, that determine how RDF triples should be grouped and stored on the same server. Fragments of the RDF datastore follow a given allocation pattern and also correspond to units of communication among servers. Based on this distribution model, we define two communication strategies for query processing: get-frag, which requests remote servers to send fragments that contain data required by a query, and send-result, which forwards intermediate results. These strategies are combined in a method, called 2ways, that chooses the appropriate communication strategy whenever queries traverse fragment boundaries. We provide a cost function used to determine this choice and present experimental results. They show that our proposed technique effectively reduces the communication cost and improves the response time for processing SPARQL queries on a distributed RDF datastore.
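The choice that 2ways makes at a fragment boundary can be illustrated with a minimal Python sketch. The abstract does not give the paper's cost function, so the rule below (compare the bytes that each strategy would ship over the network) is an assumption made purely for illustration, and the names `BoundaryCrossing` and `choose_strategy` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BoundaryCrossing:
    """Hypothetical estimates for one point where a query leaves a fragment."""
    fragment_bytes: int      # size of the remote fragments get-frag would fetch
    intermediate_bytes: int  # size of the partial results send-result would forward


def choose_strategy(crossing: BoundaryCrossing) -> str:
    """Pick the strategy that ships fewer bytes.

    Assumption: communication cost is proportional to bytes transferred.
    The paper's actual cost function may weigh other factors.
    """
    if crossing.fragment_bytes <= crossing.intermediate_bytes:
        return "get-frag"
    return "send-result"
```

Under this simplified rule, a query with many intermediate bindings but small remote fragments would pull the fragments, while a selective query touching large fragments would forward its results instead.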
Continuous query processing in data stream management systems (DSMS) has received considerable attention recently. Many applications share the same need for processing data streams in a continuous fashion. For most distributed streaming applications, the centralized processing of continuous queries over distributed data is simply not viable. This paper addresses the problem of computing approximate answers to continuous join queries over distributed data streams. We present a new method, called DHTJoin, which combines hash-based placement of tuples in a Distributed Hash Table (DHT) and dissemination of queries by exploiting the embedded trees in the underlying DHT, thereby incurring little overhead. DHTJoin also deals with join attribute value skew, which may hurt load balancing and result completeness. We provide a performance evaluation of DHTJoin, which shows that it can achieve significant performance gains in terms of network traffic.
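The idea of hash-based tuple placement can be sketched in a few lines of Python. This is a toy, single-process stand-in and not the DHTJoin algorithm itself: it omits query dissemination over the DHT's embedded trees, windowing, approximation, and skew handling, and the names `node_for` and `ToyDHTJoin` are hypothetical. It only shows why hashing both streams on the join attribute lets each node join locally.

```python
import hashlib
from collections import defaultdict


def node_for(join_value, num_nodes: int) -> int:
    """Map a join attribute value to a node id with a stable hash."""
    digest = hashlib.sha1(str(join_value).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_nodes


class ToyDHTJoin:
    """Simulates nodes that each hold the tuples hashed to them (streams R and S)."""

    def __init__(self, num_nodes: int):
        self.num_nodes = num_nodes
        self.stores = [
            {"R": defaultdict(list), "S": defaultdict(list)}
            for _ in range(num_nodes)
        ]

    def insert(self, stream: str, join_value, payload):
        """Store a tuple at its responsible node and return the new join results.

        Tuples from both streams with the same join value land on the same
        node, so matching requires no further network communication.
        """
        store = self.stores[node_for(join_value, self.num_nodes)]
        other = "S" if stream == "R" else "R"
        store[stream][join_value].append(payload)
        matches = store[other].get(join_value, [])
        if stream == "R":
            return [(payload, m) for m in matches]
        return [(m, payload) for m in matches]
```

For example, inserting an `R` tuple with join value 7 and then an `S` tuple with the same value yields one joined pair at whichever node is responsible for 7. A heavily skewed join value would overload that single node, which is the load-balancing concern the abstract mentions.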