We have proposed a Web-based sensor network constructed of Web-based sensor nodes and a remote management system. The Web-based sensor nodes consist of communication units and measurement devices with Web servers. The management system has intelligent processing and rule-based functions to manage the nodes flexibly via the Internet, and performs various image analyses easily through Web application services. By distributing the image analyses across Web application services, our proposed system provides versatile and scalable data processing. We demonstrated that it can realize the desired image analyses effectively and perform complicated management by changing its operations depending on the analysis results. (C) 2011 Elsevier B.V. All rights reserved.
ISBN (Print): 9781538621950
Over the past decade, much focus in the area of technology has shifted towards two relatively new areas: "The Internet of Things" and "Machine Learning". Although completely separate technologies, they have one major factor in common: data. The IoT paradigm relies on sensor devices to ingest data and gain valuable insight into their surrounding environment. Data is often considered the newest natural resource, and analysing it instantaneously can give companies a leading edge in their market. Machine learning algorithms are helping companies achieve this feat in the most efficient way possible. In this paper, we propose a governance architecture for dynamic distributed data mining, utilizing a model inspired by flow-based programming. We illustrate a collaborative protocol between edge devices and central controllers in which computation and distribution may be driven by factors including hardware limitations, latency, or energy consumption. Our proposed architecture is evaluated in a connected-vehicle use case. To demonstrate the feasibility of our work, we present two scenarios: local real-time prediction of driver alertness, and task/computation offloading based on the CPU usage of the edge device.
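The offloading decision described above can be sketched as a small placement policy. This is a hypothetical illustration (the function name, thresholds, and factors are my assumptions, not the paper's protocol): the edge device runs the computation locally unless its CPU usage crosses a threshold, and it only offloads when the latency budget still allows a round trip to the central controller.

```python
# Hypothetical sketch of a CPU/latency-driven placement policy.
# Names and default thresholds are illustrative assumptions.
def place_task(cpu_usage, latency_budget_ms,
               cpu_threshold=0.8, network_rtt_ms=40):
    if cpu_usage < cpu_threshold:
        return "edge"            # enough headroom: compute locally
    if network_rtt_ms <= latency_budget_ms:
        return "controller"      # overloaded: offload the computation
    return "edge"                # no latency budget to offload: stay local

# Real-time driver-alertness prediction tolerates little delay,
# so even an overloaded device may have to keep the task local.
print(place_task(0.95, 10))    # tight budget -> "edge"
print(place_task(0.95, 500))   # relaxed budget -> "controller"
```

In a real deployment the decision would also weigh energy consumption and hardware limits, as the abstract notes; the sketch shows only the shape of the protocol.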
ISBN (Print): 9781450360142
A growing number of domains (finance, seismology, internet-of-things, etc.) collect massive time series. When the number of series grows to the hundreds of millions or even billions, similarity queries become intractable on a single machine, and naive (quadratic) parallelization will not work well. We therefore need both efficient indexing and parallelization. We propose a demonstration of Spark-parSketch, a complete solution based on sketches/random projections to efficiently perform both the parallel indexing of large sets of time series and similarity search over them. Because our method is approximate, we explore the tradeoff between time and precision. A video showing the dynamics of the demonstration can be found at http://***/video/parSketchdemo_***.
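The core idea behind sketch/random-projection indexing can be shown in a few lines. This is a minimal single-machine sketch, not Spark-parSketch itself: each long series is projected onto a handful of random ±1 vectors, and distances between these short sketches approximate distances between the originals, which is what makes similarity search tractable at scale.

```python
import random
import math

def sketch(series, projections):
    """Project a time series onto each random +/-1 vector: a short sketch."""
    return [sum(x * r for x, r in zip(series, proj)) / math.sqrt(len(proj))
            for proj in projections]

def sketch_distance(a, b):
    """Distance between sketches approximates distance between series."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

rng = random.Random(42)
n, k = 256, 8                      # series length, sketch size (k << n)
projections = [[rng.choice((-1.0, 1.0)) for _ in range(n)] for _ in range(k)]

s1 = [math.sin(i / 10) for i in range(n)]
s2 = [math.sin(i / 10) + 0.01 for i in range(n)]   # near-duplicate of s1
s3 = [rng.uniform(-1, 1) for _ in range(n)]        # unrelated noise

d12 = sketch_distance(sketch(s1, projections), sketch(s2, projections))
d13 = sketch_distance(sketch(s1, projections), sketch(s3, projections))
print(d12 < d13)  # similar series stay closer in sketch space
```

Because the method is approximate, the sketch size k controls the time/precision tradeoff the abstract mentions: larger k means better distance estimates but more work per comparison.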
ISBN (Print): 9781450317436
Distributed dataflow systems such as Apache Spark and Apache Flink are used to derive new insights from large datasets. While they efficiently execute concrete data processing workflows, expressed as dataflow graphs, they lack generic support for exploratory workflows: if a user is uncertain about the correct processing pipeline, e.g. in terms of data cleaning strategy or choice of model parameters, they must repeatedly submit modified jobs to the system. This, however, misses out on optimisation opportunities for exploratory workflows, both in terms of scheduling and memory allocation. We describe meta-dataflows (MDFs), a new model to effectively express exploratory workflows and efficiently execute them on compute clusters. With MDFs, users specify a family of dataflows using two primitives: (a) an explore operator automatically considers choices in a dataflow; and (b) a choose operator assesses the result quality of explored dataflow branches and selects a subset of the results. We propose optimisations to execute MDFs: a system can (i) avoid redundant computation when exploring branches by reusing intermediate results and discarding results from underperforming branches; and (ii) consider future data access patterns in the MDF when allocating cluster memory. Our evaluation shows that MDFs improve the runtime of exploratory workflows by up to 90% compared to sequential execution.
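The explore/choose pattern can be sketched in plain Python. These are hypothetical stand-ins for the MDF primitives, not the paper's API: `explore` runs one dataflow branch per combination of uncertain choices, and `choose` scores the branch results and keeps a subset, instead of the user resubmitting modified jobs by hand.

```python
from itertools import product

def explore(pipeline, choices):
    """Run one dataflow branch per combination of choice values."""
    keys = list(choices)
    return [(dict(zip(keys, combo)), pipeline(**dict(zip(keys, combo))))
            for combo in product(*(choices[k] for k in keys))]

def choose(branches, score, keep=1):
    """Rank explored branches by result quality; keep the best subset."""
    return sorted(branches, key=lambda b: score(b[1]), reverse=True)[:keep]

# Toy pipeline where the cleaning strategy and a model parameter
# are the uncertain choices.
data = [1.0, 2.0, None, 4.0, 100.0]

def pipeline(clean, scale):
    cleaned = [x for x in data if x is not None]
    if clean == "drop_outliers":
        cleaned = [x for x in cleaned if x < 50]
    return sum(x * scale for x in cleaned) / len(cleaned)

branches = explore(pipeline, {"clean": ["keep_all", "drop_outliers"],
                              "scale": [1.0, 0.5]})
best = choose(branches, score=lambda mean: -abs(mean - 2.0))
print(best[0][0])
```

A real MDF system gains over this sequential sketch exactly where the abstract says: shared intermediate results across branches and memory allocation informed by the whole family of dataflows.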
ISBN (Print): 9781538650356
The suffix array is the key to efficient solutions for myriad string processing problems in different application domains, such as data compression, data mining, and bioinformatics. With the rapid growth of available data, suffix array construction algorithms have to be adapted to advanced computational models such as external memory and distributed computing. In this article, we present five suffix array construction algorithms utilizing the new algorithmic big data batch processing framework Thrill, which allows scalable processing on distributed systems of input sizes that are orders of magnitude larger than have been considered before.
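For readers unfamiliar with the data structure, a suffix array is just the lexicographically sorted list of a string's suffix start positions. The naive construction below is for illustration only; scalable construction, the subject of the paper, needs algorithms such as prefix doubling or difference cover that can be expressed as batch operations over a distributed framework like Thrill.

```python
def suffix_array(s):
    """Naive O(n^2 log n) suffix array: sort suffix start positions
    by comparing the suffixes themselves. Illustration only."""
    return sorted(range(len(s)), key=lambda i: s[i:])

sa = suffix_array("banana")
# Suffixes in sorted order: a, ana, anana, banana, na, nana
print(sa)  # [5, 3, 1, 0, 4, 2]
```

Once built, the array supports binary search for any pattern in O(m log n) character comparisons, which is what makes it central to compression, mining, and bioinformatics applications.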
The rapid increase in the volume of Earth satellite observation data over recent years makes it ever more necessary to develop new technologies for effective data search, selection, and processing within very large, constantly updated distributed archives. The paper describes the features of such technologies developed at the Space Research Institute, Russian Academy of Sciences (IKI RAS). These techniques support the design of various data processing tools for satellite data analysis using the distributed computing resources of remote sensing data processing and archiving centers. Advantages and capabilities of the suggested approaches are described, as well as examples of implemented tools for distributed processing of data from various satellite remote sensing systems. The examples given show the capabilities of using the tools for the analysis of various atmospheric and ocean surface phenomena.
Large-scale graph and machine learning analytics widely employ distributed iterative processing. Typically, these analytics are part of a comprehensive workflow, which includes data preparation, model building, and model evaluation. General-purpose distributed dataflow frameworks execute all steps of such workflows holistically. This holistic view enables these systems to reason about and automatically optimize the entire pipeline. Graph and machine learning analytics are known to incur long runtimes, since they require multiple passes over the data until convergence is reached. Thus, fault tolerance and fast recovery from any intermittent failure are critical for efficient analysis. In this paper, we propose novel fault-tolerant mechanisms for graph and machine learning analytics that run on distributed dataflow systems. We seek to reduce checkpointing costs and shorten failure recovery times. For graph processing, rather than writing checkpoints that block downstream operators, our mechanism writes checkpoints in an unblocking manner that does not break pipelined tasks. In contrast to the conventional approach to unblocking checkpointing (which manages checkpoints independently for immutable datasets), we inject the checkpoints of mutable datasets into the iterative dataflow itself. Hence, our mechanism is iteration-aware by design. This simplifies the system architecture and facilitates coordinating checkpoint creation during iterative graph processing. Moreover, we are able to rebound rapidly, via confined recovery, by exploiting the fact that log files exist locally on healthy nodes, avoiding a complete recomputation from scratch. In addition, we propose replica recovery for machine learning algorithms, whereby we employ a broadcast variable that enables us to quickly recover without having to introduce any checkpoints. In order to evaluate our fault tolerance strategies, we conduct both a theoretical study and experimental analyses.
ISBN (Print): 9783319646350; 9783319646343
Manufacturing faces increasing requirements from customers, which creates the need to exploit emerging technologies and trends in order to preserve competitive advantages. The widely announced fourth industrial revolution (also known as Industry 4.0) is represented mainly by the employment of Internet technologies in industry. An essential requirement is a proper understanding of the data models of a given CPS (one of the key components of Industry 4.0), together with the utilization of knowledge coming from various systems across a factory as well as from external data sources. A suitable solution to the data integration problem is the employment of Semantic Web technologies and model descriptions in ontologies. However, one of the obstacles to wider use of Semantic Web technologies, including in the industrial automation domain, is the insufficient performance of available triplestores. Thus, for the so-called Semantic Big Data Historian use case, we propose the use of state-of-the-art distributed data storage. We discuss the approach to data storing and describe our proposed hybrid data model, which is suitable for representing time series (sensor measurements) with added semantics. Our results demonstrate a possible way to enable higher-performance distributed analysis of data from the industrial domain.
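One way to picture a hybrid model of this kind is shown below. This is my own minimal illustration, not the paper's schema: bulk sensor measurements are kept as compact per-sensor rows (fast range scans, no triplestore round trips), while the semantics linking sensors to ontology terms are kept as a small set of RDF-style triples (the `sosa:`/`qudt:` prefixes stand for the common sensor and unit ontologies).

```python
# Semantics as RDF-style (subject, predicate, object) triples.
semantics = [
    ("sensor:42", "rdf:type", "sosa:Sensor"),
    ("sensor:42", "sosa:observes", "prop:temperature"),
    ("prop:temperature", "qudt:unit", "unit:DEG_C"),
]

# Bulk measurements as compact per-sensor rows: (timestamp, value).
timeseries = {"sensor:42": [(1000, 21.5), (1060, 21.7), (1120, 21.6)]}

def observed_property(sensor):
    """Resolve a sensor's observed property from the triple set."""
    return next(o for s, p, o in semantics
                if s == sensor and p == "sosa:observes")

def values(sensor, t_from, t_to):
    """Range scan over the compact row; no triple lookups needed."""
    return [v for t, v in timeseries[sensor] if t_from <= t <= t_to]

print(observed_property("sensor:42"), values("sensor:42", 1000, 1100))
```

The point of the split is that the high-volume path (time series scans) never touches the triplestore, which addresses the performance obstacle the abstract raises, while the semantic layer stays queryable for integration.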
ISBN (Digital): 9783319629117
ISBN (Print): 9783319629117; 9783319629100
As a form of random set, belief functions come with specific semantics and combination rules able to perform the representation and fusion of uncertain and imprecise information. The development of new combination rules able to manage conflict between data now offers a variety of tools for the robust combination of pieces of data from a database. The computation of multiple combinations arising from many query cases in a database makes it necessary to develop efficient approaches for concurrent belief computation. The approach should be generic in order to handle a variety of fusion rules. We present a generic implementation based on the map-reduce paradigm. An enhancement of this implementation is then proposed by means of a Markovian decomposition of the rule definition. Finally, comparative results are presented for these implementations within the frameworks Apache Spark and Apache Flink.
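Why belief combination maps naturally onto map-reduce can be seen with Dempster's rule, the classical combination rule: it is binary and associative, so fusing many sources is a fold (`reduce`) over pairwise combinations. The sketch below is a plain single-process illustration of that structure, not the paper's Spark/Flink implementation; mass functions are dicts from focal sets to masses.

```python
from functools import reduce

def dempster(m1, m2):
    """Dempster's rule of combination for two mass functions
    (dict: frozenset of hypotheses -> mass)."""
    raw, conflict = {}, 0.0
    for b, mb in m1.items():
        for c, mc in m2.items():
            inter = b & c
            if inter:
                raw[inter] = raw.get(inter, 0.0) + mb * mc
            else:
                conflict += mb * mc          # mass on empty intersections
    norm = 1.0 - conflict                    # renormalize the rest
    return {a: v / norm for a, v in raw.items()}

# Each "mapped" source emits a mass function; the reduce step fuses them.
A, B = frozenset({"a"}), frozenset({"b"})
AB = A | B
sources = [
    {A: 0.6, AB: 0.4},
    {A: 0.5, B: 0.2, AB: 0.3},
    {AB: 1.0},          # vacuous evidence leaves the combination unchanged
]
fused = reduce(dempster, sources)
print(fused)
```

Other conflict-managing rules mentioned in the abstract (e.g. conjunctive combination without renormalization) slot into the same fold, which is what makes the map-reduce implementation generic over fusion rules.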
ISBN (Print): 9781509027712
With the increasing amount of available data, distributed data processing systems like Apache Flink and Apache Spark have emerged that allow large-scale datasets to be analyzed. However, such engines introduce significant computational overhead compared to non-distributed implementations. Therefore, the question arises when using a distributed processing approach is actually beneficial. This paper helps answer this question with an evaluation of the performance of the distributed data processing framework Apache Flink. In particular, we compare Apache Flink executed on up to 50 cluster nodes to single-threaded implementations executed on a typical laptop for three different benchmarks: TPC-H Query 10, Connected Components, and Gradient Descent. The evaluation shows that the performance of Apache Flink is highly problem-dependent, varying from early outperformance in the case of TPC-H Query 10 to slower runtimes in the case of Connected Components. The reported results give hints as to the problems, input sizes, and cluster resources for which using a distributed data processing system like Apache Flink or Apache Spark is sensible.
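To make the comparison concrete, a single-threaded laptop baseline for one of the three benchmarks, Connected Components, fits in a few lines. This union-find sketch is my own illustration of the kind of non-distributed implementation such an evaluation would compare against, not the paper's code; its lack of per-iteration coordination is exactly why a laptop can beat a cluster on modest graphs.

```python
def connected_components(n, edges):
    """Count connected components with union-find (path halving)."""
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path halving compresses chains
            x = parent[x]
        return x

    for u, v in edges:
        parent[find(u)] = find(v)           # union the two components
    return len({find(x) for x in range(n)})

# 6 vertices, components {0,1,2}, {3,4}, {5}
print(connected_components(6, [(0, 1), (1, 2), (3, 4)]))  # 3
```

A distributed engine instead runs iterative label propagation with network shuffles each round; that per-iteration overhead is the plausible reason the abstract reports Connected Components as a case where Flink trails the single-threaded baseline.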