An important problem that often arises when analyzing data is identifying irregular or abnormal data points, called outliers. For example, in credit card transaction data, outliers might indicate potential fraud; in network traffic data, outliers might represent potential intrusion attempts. Outlier mining is an important data mining task whose goal is to find observations that are dissimilar from the rest of the data set. Traditional outlier detection methods assume that data is centralized at a single location, but this assumption no longer holds, since data set sizes grow day by day. Moreover, a new requirement is that these methods must be applicable to data distributed across different locations. The design of efficient parallel algorithms and frameworks is key to meeting the scalability and performance requirements. MapReduce is a framework for processing large data sets with a parallel, distributed algorithm on a cluster, and the iterative MapReduce programming model has simplified the implementation of many distributed data mining applications. In this work we design and realize a parallel outlier mining algorithm based on an iterative MapReduce framework. The algorithm uses the Twister programming model, a lightweight MapReduce runtime.
Authors:
Xiaodong Wu, Faculty of Mathematics and Computer Science, Quanzhou Normal University; Fujian Provincial Key Laboratory of Data Intensive Computing; Key Laboratory of Intelligent Computing and Information Processing, Fujian Province University
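The distributed outlier-mining idea in the abstract above can be illustrated with a minimal, single-machine sketch. Everything here is an assumption for illustration: the z-score criterion, the `map_split`/`outliers` names, and the threshold are not taken from the paper, which uses Twister and its own algorithm.

```python
from statistics import mean, pstdev

def map_split(split, mu, sigma, threshold=2.5):
    """'Map' task: emit local outlier candidates, i.e. points whose
    z-score against the global statistics exceeds the threshold."""
    return [x for x in split if sigma > 0 and abs(x - mu) / sigma > threshold]

def outliers(splits, threshold=2.5):
    """'Reduce' step: merge the candidates from every split."""
    data = [x for s in splits for x in s]
    mu, sigma = mean(data), pstdev(data)      # global statistics (one pass)
    return sorted(c for s in splits for c in map_split(s, mu, sigma, threshold))

# Three data splits, as if they lived on three worker nodes.
splits = [[9.8, 10.1, 10.0], [10.2, 9.9, 50.0], [10.05, 9.95]]
print(outliers(splits))  # the far-away point 50.0 is flagged
```

In a real Twister deployment each split would live on a different node, the global statistics would be computed in a preliminary map-reduce pass, and the job would iterate when the outlier criterion itself depends on previously cleaned data.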
ISBN:
(print) 9781467383134
The MapReduce parallel and distributed computing framework has been widely applied in both academia and industry. MapReduce applications are divided into two steps: Map and Reduce. The input data is divided into splits, which can be processed concurrently, and the number of splits determines the number of map tasks. In this paper, we present a regression-based method to compute the number of Map tasks as well as Reduce tasks such that the performance of a MapReduce application can be improved. Regression analysis is used to predict the execution time of MapReduce applications. Experimental results show that the proposed optimization method can effectively reduce the execution time of the applications.
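The regression idea in this abstract can be toy-modeled as follows. The quadratic time model, the sample runtimes, and the function names are illustrative assumptions, not the paper's actual regression: the sketch fits a runtime curve to measured (map-task-count, runtime) samples and picks the task count with the lowest predicted time.

```python
def fit_quadratic(p0, p1, p2):
    """Exact quadratic t(m) through three (task count, runtime)
    measurements, in Lagrange form."""
    (x0, y0), (x1, y1), (x2, y2) = p0, p1, p2
    def t(m):
        return (y0 * (m - x1) * (m - x2) / ((x0 - x1) * (x0 - x2))
              + y1 * (m - x0) * (m - x2) / ((x1 - x0) * (x1 - x2))
              + y2 * (m - x0) * (m - x1) / ((x2 - x0) * (x2 - x1)))
    return t

# Hypothetical measurements: too few map tasks underuse the cluster,
# too many add scheduling overhead, so runtime is roughly U-shaped.
model = fit_quadratic((8, 120.0), (32, 60.0), (128, 150.0))

# Pick the task count with the lowest predicted execution time.
best = min(range(8, 129), key=model)
print(best, round(model(best), 1))
```

A production version would use least squares over many noisy samples rather than exact interpolation, but the selection step (argmin of the predicted runtime) is the same.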
The end of Dennard scaling has made all systems energy-constrained. For data-intensive applications with limited temporal locality, the major energy bottleneck is data movement between processor chips and main memory modules. For such workloads, the best way to optimize energy is to place processing near the data in main memory. Advances in 3D integration provide an opportunity to implement near-data processing (NDP) without the technology problems that similar efforts had in the past. This paper develops the hardware and software of an NDP architecture for in-memory analytics frameworks, including MapReduce, graph processing, and deep neural networks. We develop simple but scalable hardware support for coherence, communication, and synchronization, and a runtime system that is sufficient to support analytics frameworks with complex data patterns while hiding all the details of the NDP hardware. Our NDP architecture provides up to 16x performance and energy advantage over conventional approaches, and 2.5x over recently-proposed NDP systems. We also investigate the balance between processing and memory throughput, as well as the scalability and physical and logical organization of the memory system. Finally, we show that it is critical to optimize software frameworks for spatial locality, as it leads to 2.9x efficiency improvements for NDP.
ISBN:
(digital) 9783662467428
ISBN:
(print) 9783662467411; 9783662467428
This book constitutes the refereed proceedings of the 4th International Conference on Soft Computing, Intelligent Systems, and Information Technology, ICSIIT 2015, held in Bali, Indonesia, in March 2015. The 34 revised full papers presented together with 19 short papers, one keynote and two invited talks were carefully reviewed and selected from 92 submissions. The papers cover a wide range of topics related to intelligence in the era of Big Data, such as fuzzy logic and control systems; genetic algorithms and heuristic approaches; artificial intelligence and machine learning; similarity-based models; classification and clustering techniques; intelligent data processing; feature extraction; image recognition; visualization techniques; intelligent networks; cloud and parallel computing; strategic planning; intelligent applications; and intelligent systems for enterprise, government and society.
ISBN:
(print) 9781467365994
With the help of simulation tools, users can evaluate new proposals in cluster environment efficiently. However, current cloud simulators cannot meet the needs of application-driven simulation scenarios. In this paper, we propose Pallas, a task and network simulation framework that supports various cloud applications. Task-aware network scheduling and network-perceived task placement algorithms can be easily implemented in Pallas. We present the architecture and main components of Pallas and evaluate its effectiveness by comparing algorithm improvements to the actual results.
SSDs have been widely deployed in different areas and have become competitive storage devices even for data-intensive applications. They have important performance and endurance requirements, and their internal features provide real potential to fulfil them. The multiple and independent SSD internal components allow parallel access to data at each of the four levels (package, chip, die, plane), but exploiting this parallelism relies completely on the data layout scheme. We previously proposed a data layout algorithm based only on the SSD basic operations; it distributes data down to the lowest level to exploit the fine-grained internal parallelism and improves SSD performance. In this paper, we also use advanced commands available on newer SSDs, together with request scheduling, in combination with the data layout scheme to provide parallelism down to the plane level, taking into account both performance and endurance. The result is a new data layout algorithm that exploits the fine-grained SSD internal parallelism. It respects the rules imposed by the wise use of advanced commands and the recommendation of maintaining a wide data distribution. The results show an improvement in performance and a Write Amplification (WA) factor very close to the one obtained using basic operations, which indicates preserved endurance.
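The level-by-level striping that such a data layout scheme relies on can be sketched as follows. The geometry constants and the `place` function are hypothetical illustrations, not the authors' algorithm; the point is only that striping plane-first makes consecutive logical pages land on independent planes, which is what enables parallel access.

```python
# Hypothetical SSD geometry (real devices vary).
PACKAGES, CHIPS, DIES, PLANES = 2, 2, 2, 4

def place(lpn):
    """Map a logical page number to a (package, chip, die, plane) slot,
    striping round-robin across the finest level (plane) first."""
    plane, rest = lpn % PLANES, lpn // PLANES
    die, rest = rest % DIES, rest // DIES
    chip, rest = rest % CHIPS, rest // CHIPS
    package = rest % PACKAGES
    return package, chip, die, plane

# Consecutive pages hit distinct planes, so sequential writes can proceed
# in parallel across the SSD's internal components.
for lpn in range(6):
    print(lpn, place(lpn))
```

Advanced commands (e.g. multi-plane operations) impose alignment rules on which plane pairs can be accessed together, which is why the paper combines the layout with request scheduling.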
ISBN:
(print) 9781509003648
Mining sequence patterns in the form of n-grams (sequences of words that appear consecutively) from large text data is one of the fundamental parts of several information retrieval and natural language processing applications. In this work, we present Spark-gram, a method for large-scale frequent sequence mining based on Spark, adapted from its equivalent method in MapReduce called Suffix-σ. The Spark-gram design allows the discovery of all n-grams with maximum length σ and minimum occurrence frequency τ, using an iterative algorithm with only a single shuffle phase. We show that Spark-gram can outperform Suffix-σ mainly when τ is high, but it is potentially worse when the value of σ grows higher.
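The kind of frequent n-gram mining Spark-gram performs can be illustrated with a minimal single-process sketch. The function and parameter names are assumptions, and this is not Spark-gram's algorithm: a real Spark implementation would emit n-grams in map tasks, aggregate counts in one shuffle, and filter by the frequency threshold on executors.

```python
from collections import Counter

def frequent_ngrams(docs, max_len, min_freq):
    """Count every n-gram of length 1..max_len across all documents
    ("map" + single aggregation), then keep those meeting min_freq."""
    counts = Counter()
    for doc in docs:
        words = doc.split()
        for n in range(1, max_len + 1):
            for i in range(len(words) - n + 1):
                counts[tuple(words[i:i + n])] += 1
    return {g: c for g, c in counts.items() if c >= min_freq}

docs = ["the cat sat", "the cat ran", "a cat sat"]
print(frequent_ngrams(docs, max_len=2, min_freq=2))
```

The abstract's trade-off is visible even here: a higher frequency threshold prunes more candidates early, while a larger maximum length blows up the number of emitted n-grams per document.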
ISBN:
(纸本)9781467393010
The MapReduce programming model has been widely used in Big Data and Cloud applications. Criticism of its inflexibility when applied to complicated scientific applications has recently emerged. Several techniques have been proposed to enhance its flexibility; however, some of them place special requirements on applications, while others fail to support increasingly popular coprocessors such as the Graphics Processing Unit (GPU). In this paper, we propose MR-Graph, a customizable and unified framework for GPU-based MapReduce, which aims to improve the flexibility, scalability and performance of MapReduce. MR-Graph addresses the limitations and restrictions of the traditional MapReduce execution paradigm. The three execution modes integrated in MR-Graph allow users to write their applications in a more flexible fashion by defining a Map and Reduce function call graph. MR-Graph efficiently exploits the memory hierarchy of GPUs to reduce the data transfer overhead between execution stages and to accommodate big data applications. We have implemented a prototype of MR-Graph, and experimental results show the effectiveness of using MR-Graph for flexible and scalable GPU-based MapReduce computing.
The provisioning of high-performance computing infrastructure through cloud environments makes data-intensive processing a viable solution. In this paper, we introduce a novel parallel computation model similar to the MapReduce framework. The proposed parallelized model incorporates a parallel execution strategy in worker nodes to decrease execution response times in cloud environments, and it adopts efficient local memory management techniques in the worker nodes to reduce memory transfer overheads. For evaluation, we compared the proposed framework with the state-of-the-art Hadoop MapReduce framework. Experiments on benchmark datasets show that the parallelized model reduces execution times by about 45.86%. These experimental results indicate the efficiency and scalability of the proposed framework in cloud environments.
The current development of high-performance parallel supercomputing infrastructures is pushing the boundaries of scientific applications and bringing new paradigms into engineering practice and simulation. Earthquake engineering is one of the major fields that benefits from this, looking for solutions in grid computing and cloud computing techniques. Generally, earthquake simulations involve the analysis of petabytes of data, and analyzing these large amounts of data in parallel across thousands of nodes in computer clusters yields high performance. Open-source solutions such as Hadoop MapReduce, which is highly scalable and capable of rapidly processing large amounts of data in parallel on large clusters, provide a better solution than an RDBMS. Both GPUs and MapReduce are designed to support vast data parallelism, and for performance reasons GPU computing can be adopted over low-performing CPU systems. This paper discusses MapReduce systems using Hadoop and Mars, a MapReduce framework on graphics processors. Hence, the proposition is to use GPU-based systems for earthquake simulations in which digital elevation model (DEM) 3D data sets are fully materialized, so that scientists can use these data for various analyses and simulations.