ISBN (print): 9781728140421
Searching for frequent itemsets in large, diverse databases is one of the most important data mining problems, and existing algorithms lack mechanisms that enable automatic parallelization, fault tolerance, and data distribution. To address this issue, we design an algorithm using the MapReduce programming model. The overarching aim is to enhance the performance of parallel frequent itemset mining on Hadoop. We incorporate ultra-metric trees to improve the efficiency of mining frequent itemsets, and we compare the Apriori and FP-Growth algorithms on several parameters. We implement the algorithm on a Market Basket Analytics dataset.
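As a rough illustration of the map/reduce split behind such parallel frequent itemset mining, the sketch below simulates one Apriori-style candidate counting pass in plain Python; the transaction data, function names, and single-pass structure are illustrative assumptions, not the paper's implementation.

# Minimal sketch: one MapReduce pass of Apriori-style candidate counting
# over market-basket transactions. Simulated in-process; on Hadoop the
# map/reduce functions would run in parallel over HDFS splits.
from itertools import combinations
from collections import defaultdict

def map_phase(transaction, k):
    """Emit (candidate_itemset, 1) for every k-item subset of a transaction."""
    for itemset in combinations(sorted(transaction), k):
        yield itemset, 1

def reduce_phase(pairs, min_support):
    """Sum counts per itemset and keep those meeting the support threshold."""
    counts = defaultdict(int)
    for itemset, one in pairs:
        counts[itemset] += one
    return {s: c for s, c in counts.items() if c >= min_support}

transactions = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"beer", "bread"},
    {"bread", "milk"},
]
pairs = (p for t in transactions for p in map_phase(t, k=2))
print(reduce_phase(pairs, min_support=2))   # {('bread', 'milk'): 3}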
Big data refers to a collection of massive volumes of data that cannot be processed by conventional data processing tools and technologies. In recent years, data production sources have expanded noticeably, including high-end streaming devices, wireless sensor networks, satellites, and wearable Internet of Things (IoT) devices. These sources generate massive volumes of data in a continuous manner. A large volume of climate data is collected from IoT weather sensor devices and NCEP. In this paper, a big data processing framework is proposed to integrate climate and health data and to find the correlation between climate parameters and the incidence of dengue. The framework is demonstrated with the MapReduce programming model, Hive, HBase, and ArcGIS in a Hadoop Distributed File System (HDFS) environment. The weather parameters collected for the study area of Tamil Nadu, with the help of IoT weather sensor devices and NCEP, are minimum temperature, maximum temperature, wind, precipitation, solar, and relative humidity. The proposed framework focuses only on climate data for 32 districts of Tamil Nadu, where each district contains 157,680 rows, giving 5,045,760 rows in total. Batch-view precomputation of the monthly mean of each climate parameter would require all 5,045,760 rows and would therefore create high latency in query processing. To overcome this issue, batch views can be precomputed over a smaller number of records, with more computation done at query time. An In-Mapper-based MapReduce framework is used to compute the monthly mean of each climate parameter for every latitude and longitude. The experimental results show that the response time of the In-Mapper-based combiner algorithm is lower than that of the existing MapReduce algorithm. (C) 2018 Elsevier B.V. All rights reserved.
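The In-Mapper combining idea described above can be sketched as follows: the mapper accumulates partial (sum, count) aggregates per (latitude, longitude, month) key locally and emits them once per input split, which is what cuts intermediate traffic relative to a plain MapReduce job. The field layout and key choice are assumptions for illustration, and the run is simulated in-process rather than on Hadoop.

# Sketch of an In-Mapper combiner for monthly means of a climate parameter.
# The mapper keeps a local (sum, count) per (lat, lon, month) key and emits
# partial aggregates once per split, instead of one record per input row.
from collections import defaultdict

def in_mapper_combine(rows):
    """rows: iterable of (lat, lon, month, value). Emits partial aggregates."""
    partial = defaultdict(lambda: [0.0, 0])      # key -> [sum, count]
    for lat, lon, month, value in rows:
        acc = partial[(lat, lon, month)]
        acc[0] += value
        acc[1] += 1
    for key, (total, count) in partial.items():
        yield key, (total, count)

def reduce_mean(pairs):
    """Merge partial (sum, count) pairs and compute the final mean per key."""
    merged = defaultdict(lambda: [0.0, 0])
    for key, (total, count) in pairs:
        merged[key][0] += total
        merged[key][1] += count
    return {k: s / c for k, (s, c) in merged.items()}

split = [(11.0, 78.0, 1, 24.5), (11.0, 78.0, 1, 26.1), (13.1, 80.2, 1, 29.0)]
print(reduce_mean(in_mapper_combine(split)))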
The quantification of mineral resources refers to the fractional contribution of endmembers at the pixel level, namely, fractional cover mapping of mineralogy. Over a large area, mineral deposits generally occur in limited quantity, either on a host rock or on a geologic structure. In remote sensing, the purity of a mineral's spectrum is usually perturbed by weathering effects or compositional susceptibility, which may lead to an incorrect fractional map of mineral endmembers. Given these physical complications, the present paper establishes a fractional cover mapping model that incorporates the characterization of endmember variability, an optimization model for endmember extraction (EE), and an inverse model for abundance estimation. In this regard, an EE method was deployed that comprises subproblems on the minimization of endmember variability solved by the alternating direction method. Next, the extracted endmembers were used to estimate abundances with the Hapke model by applying the fully constrained least-squares method. Experimenting on a synthetic image, both a qualitative analysis by correlation measure and a quantitative analysis by statistical error measure were carried out for the proposed fractional cover mapping model. Using airborne visible/infrared imaging spectrometer-next generation hyperspectral imagery, the fractional cover map of a validation area was verified first; then a distributed mapping of the Jahazpur mineralized belt was achieved by MapReduce programming of the proposed model in the Hadoop architecture. (C) 2020 Society of Photo-Optical Instrumentation Engineers (SPIE)
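As a hedged illustration of the abundance estimation step, the sketch below solves a fully constrained least-squares (FCLS) problem for one pixel using the common sum-to-one augmentation trick combined with non-negative least squares; the synthetic endmember matrix and the omission of the Hapke nonlinearity are simplifying assumptions, not the paper's full pipeline.

# Sketch of fully constrained least-squares (FCLS) abundance estimation for
# one pixel: solve min ||y - E a||^2 subject to a >= 0 and sum(a) = 1.
# A common approach appends a heavily weighted sum-to-one row and solves a
# non-negative least-squares problem. Endmember spectra here are synthetic.
import numpy as np
from scipy.optimize import nnls

def fcls_abundances(E, y, delta=1e3):
    """E: (bands x endmembers) spectra, y: (bands,) pixel. Returns abundances."""
    bands, p = E.shape
    E_aug = np.vstack([E, delta * np.ones((1, p))])   # enforces sum-to-one
    y_aug = np.concatenate([y, [delta]])
    a, _ = nnls(E_aug, y_aug)                         # enforces a >= 0
    return a

rng = np.random.default_rng(0)
E = rng.uniform(0.1, 0.9, size=(50, 3))               # 3 synthetic mineral endmembers
true_a = np.array([0.6, 0.3, 0.1])
y = E @ true_a + rng.normal(0, 0.001, size=50)
print(fcls_abundances(E, y))                           # close to [0.6, 0.3, 0.1]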
ISBN (print): 9789896740092
We discuss the features of functional programming related to formal methods and an emerging paradigm, Cloud Computing. Formal methods are useful in developing highly reliable mission-critical software. In lightweight formal methods, however, we do not rely on very rigorous means such as theorem proving. Instead, we use adequately less rigorous means, such as evaluating pre/post conditions and testing specifications, to increase confidence in our specifications. Millions of tests may be conducted when developing highly reliable mission-critical software with a lightweight formal approach. We consider an approach to leveraging lightweight formal methods by using the Cloud. Given a formal specification language with the features of functional programming, such as referential transparency, we can expect the advantages of parallel processing. One of the basic foundations of the VDM specification languages is set theory, and the pre/post conditions and proof obligations may be expressed in terms of set expressions. We can evaluate this kind of expression in a data-parallel style using the MapReduce framework for a huge set of test cases over cloud computing environments. Thus, we expect to greatly reduce the cost of testing specifications in lightweight formal methods.
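A minimal sketch of the data-parallel evaluation of pre/post conditions over a large batch of test cases is given below; the integer square root specification, the pre/post predicates, and the in-process map/reduce are illustrative stand-ins for VDM specifications evaluated on a cluster.

# Minimal sketch: evaluating a pre/post-condition pair over many test cases
# in a map/reduce style. The "specification" here is an illustrative integer
# square root; on a cluster each chunk of cases would be mapped in parallel.
import math

def pre(x):                 # pre-condition of the operation
    return x >= 0

def post(x, result):        # post-condition relating input and result
    return result * result <= x < (result + 1) * (result + 1)

def isqrt_impl(x):          # implementation under test
    return math.isqrt(x)

def map_check(case):
    """Map one test case to a (case, passed) pair."""
    if not pre(case):
        return case, True                 # vacuously satisfied
    return case, post(case, isqrt_impl(case))

def reduce_all(results):
    """Reduce: the specification holds on this batch iff every case passed."""
    return all(passed for _, passed in results)

cases = range(100_000)      # a batch of test cases
print(reduce_all(map(map_check, cases)))   # True if all cases satisfy the spec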
ISBN (print): 9781509036776
Effective indexing schemes are crucial for supporting efficient queries on large datasets from multidimensional Non-ordered Discrete Data Spaces (NDDS) in many applications, such as genome sequence analysis in bioinformatics. Although constructing an index structure for a large dataset in an NDDS via a bulk loading technique is quite efficient (compared to a conventional tuple-loading technique), existing bulk loading techniques cannot meet the scalability requirement posed by the fast-growing sizes of datasets in contemporary NDDS applications. To tackle this challenge, we propose a new bulk loading method for fast construction of an index structure, called the PND-tree, for large datasets in NDDSs. Specifically, utilizing the characteristics of an NDDS and a priori knowledge of the given dataset, we propose an effective multi-way top-down dataset split strategy with a MapReduce implementation for our bulk loading procedure. Experiments demonstrate that the proposed bulk loading method is quite promising in terms of index construction efficiency and resulting index quality, compared to the conventional tuple-loading method and a popular serial bulk loading method for a state-of-the-art index tree in NDDSs.
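The map-side multi-way split can be pictured roughly as below: each record from a non-ordered discrete space is routed to a partition by the symbol in one discriminating dimension, and each reducer would then bulk-load its partition into a subtree. The dimension choice, alphabet, and single-level split are illustrative assumptions; the actual PND-tree split strategy is more elaborate.

# Sketch of a map-side multi-way split for bulk loading: each record from a
# non-ordered discrete space (e.g. genome k-mers) is routed to a partition by
# the symbol in a chosen discriminating dimension; each reducer would then
# bulk-load its partition into a subtree.
from collections import defaultdict

ALPHABET = "ACGT"

def map_split(record, dim):
    """Emit (partition_key, record); the key is the symbol at dimension `dim`."""
    return record[dim], record

def shuffle(pairs):
    """Group records by partition key, as the MapReduce shuffle would."""
    buckets = defaultdict(list)
    for key, rec in pairs:
        buckets[key].append(rec)
    return buckets

records = ["ACGT", "AGGT", "CCGA", "GTCA", "TTAC"]
partitions = shuffle(map_split(r, dim=0) for r in records)
for key in ALPHABET:
    print(key, partitions.get(key, []))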
ISBN (print): 9781509038374
Retrieving documents in response to a user's query is the most common text retrieval task. In our work, we focus mainly on detecting the semantic similarity between queries and documents in large document collections. In this paper, we investigate MapReduce as a framework for managing the distributed processing of dataset patterns and document semantic similarity measures. We then review the state of the art of different approaches for computing the semantic similarity of documents. We propose an approach based on a parallel algorithm for semantic similarity measures using MapReduce and WordNet to detect the documents relevant to a query. Finally, we conduct basic experiments to assess the performance of the proposed approach and note the leverage that Hadoop and MapReduce bring to computing semantic similarity measures between documents.
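As a rough sketch of the map step, each document can be scored against the query with a WordNet-based measure; the snippet below uses NLTK's path similarity averaged over term pairs, which is an assumed stand-in for the paper's exact measure, and requires the WordNet corpus to be downloaded first (nltk.download("wordnet")).

# Sketch: a map step that scores each document against a query with a simple
# WordNet path-similarity measure (best synset similarity per term pair,
# averaged). Illustrative measure only, not the exact one used in the paper.
from nltk.corpus import wordnet as wn

def term_similarity(t1, t2):
    """Best path similarity over all synset pairs of two terms (0 if none)."""
    scores = [s1.path_similarity(s2) or 0.0
              for s1 in wn.synsets(t1) for s2 in wn.synsets(t2)]
    return max(scores, default=0.0)

def map_score(doc_id, doc_terms, query_terms):
    """Map one document to (doc_id, semantic similarity against the query)."""
    pair_scores = [term_similarity(d, q) for d in doc_terms for q in query_terms]
    score = sum(pair_scores) / len(pair_scores) if pair_scores else 0.0
    return doc_id, score

docs = {"d1": ["car", "engine"], "d2": ["banana", "fruit"]}
query = ["automobile", "motor"]
ranked = sorted((map_score(i, t, query) for i, t in docs.items()),
                key=lambda kv: kv[1], reverse=True)
print(ranked)   # "d1" should rank above "d2" for this query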
ISBN (print): 9781538655535
Edge computing has been proposed to remedy the Cloud-only processing architecture for the Internet of Things (IoT), given the massive amounts of IoT data. The challenge is how to deploy and execute data processing tasks on heterogeneous IoT edge networks. As MapReduce is a well-known model for the distributed processing of big data in Cloud computing, this paper devises a MapReduce-based protocol to achieve IoT edge computing. Our design is built upon the novel Information Centric Networking (ICN), which supports function naming and forwarding so as to facilitate task distribution among edge devices. To guarantee the correctness of task execution, a tree topology is formed in our approach to establish the logical connection between different types of edge devices, namely processing-capable nodes and forward-only ones. Moreover, the proposed protocol includes a task maintenance scheme that enables the coexistence of multiple IoT computation jobs. A testbed is implemented on ndnSIM to verify the feasibility of our design. The results show that our approach can significantly decrease network traffic compared with centralized data processing.
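The logical tree between processing-capable and forward-only nodes can be sketched as below, where forward-only devices relay raw readings unchanged and processing-capable devices reduce their children's data before forwarding; ICN naming, forwarding, and the task maintenance scheme are not modeled, and the node classes and the aggregation (a simple mean) are illustrative assumptions.

# Sketch of the logical tree aggregation: forward-only nodes pass readings up
# unchanged, while processing-capable nodes reduce what they receive into a
# compact (sum, count) partial before it travels further toward the cloud.
class Node:
    def __init__(self, can_process, children=(), readings=()):
        self.can_process = can_process
        self.children = list(children)
        self.readings = list(readings)   # raw IoT samples held at this device

    def collect(self):
        """Return raw readings (forward-only) or a reduced (sum, count) partial."""
        raw, partials = list(self.readings), []
        for child in self.children:
            result = child.collect()
            if isinstance(result, tuple):
                partials.append(result)  # already reduced by a capable child
            else:
                raw.extend(result)       # forwarded unchanged
        if not self.can_process:
            # assumption: forward-only nodes only relay raw readings upward
            return raw
        total = sum(raw) + sum(s for s, _ in partials)
        count = len(raw) + sum(c for _, c in partials)
        return total, count

# Two sensor leaves feed a forward-only relay; a processing-capable edge node
# reduces everything it receives before results leave the edge network.
leaf_a = Node(False, readings=[21.0, 22.5])
leaf_b = Node(False, readings=[19.5])
relay = Node(False, children=[leaf_a, leaf_b])
edge = Node(True, children=[relay])
total, count = edge.collect()
print(total / count)                     # mean computed at the edge, not the cloud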
With the evolution of Internet standards and advancements in various Internet and mobile technologies, especially since Web 4.0, more and more web and mobile applications have emerged, such as e-commerce, social networks, online gaming, and Internet of Things-based applications. Due to the deployment of, and concurrent access to, these applications on the Internet and mobile devices, the amount and variety of data generated increase exponentially, and the new era of Big Data has come into existence. Presently available data structures and data analysis algorithms are not capable of handling such Big Data. Hence, there is a need for scalable, flexible, parallel, and intelligent data analysis algorithms to handle and analyze complex massive data. In this article, we propose a novel distributed supervised machine learning algorithm, called MR-DWkNN, based on the MapReduce programming model and the Distance-Weighted k-Nearest Neighbor algorithm, to process and analyze Big Data in a Hadoop cluster environment. The proposed distributed algorithm is based on supervised learning and performs both regression and classification tasks on large volumes of Big Data. Three performance metrics are used to evaluate MR-DWkNN: Root Mean Squared Error (RMSE) and the coefficient of determination (R2) for regression tasks, and Accuracy for classification tasks. Extensive experimental results show an average increase of 3% to 4.5% in prediction and classification performance compared to a standard distributed k-NN algorithm, along with a considerable decrease in RMSE and good parallelism characteristics of scalability and speedup, proving the algorithm's effectiveness in Big Data prediction and classification applications.
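The per-neighborhood prediction step of a distance-weighted k-NN job might look like the sketch below, where neighbours vote or average with weight 1/distance, covering both the classification and regression cases; the function name, tiny in-memory dataset, and inverse-distance weighting form are assumptions for illustration rather than the exact MR-DWkNN formulation.

# Sketch of the distance-weighted k-NN prediction step (roughly the per-query
# logic a reducer would apply in an MR-DWkNN-style job): the k nearest
# training points vote (classification) or average (regression) with weight
# 1/distance, so closer neighbours count more.
import numpy as np

def dwknn_predict(X_train, y_train, x, k=3, classify=True, eps=1e-9):
    """Predict the label (classification) or value (regression) for one query x."""
    dists = np.linalg.norm(X_train - x, axis=1)
    nearest = np.argsort(dists)[:k]
    weights = 1.0 / (dists[nearest] + eps)
    if classify:
        votes = {}
        for idx, w in zip(nearest, weights):
            votes[y_train[idx]] = votes.get(y_train[idx], 0.0) + w
        return max(votes, key=votes.get)
    return float(np.dot(weights, y_train[nearest]) / weights.sum())

X = np.array([[1.0, 1.0], [1.2, 0.9], [8.0, 8.0], [8.2, 7.9]])
y_cls = np.array([0, 0, 1, 1])
print(dwknn_predict(X, y_cls, np.array([1.1, 1.0]), k=3))                  # -> 0
y_reg = np.array([10.0, 11.0, 80.0, 82.0])
print(dwknn_predict(X, y_reg, np.array([8.1, 8.0]), k=3, classify=False))  # ~80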