Purpose: This work can be used as a building block in other settings such as GPU, Map-Reduce, Spark or any other. DDPML can also be deployed on other distributed systems such as P2P networks, clusters, cloud computing or other technologies. Design/methodology/approach: In the age of Big Data, all companies want to benefit from large amounts of data. These data can help them understand their internal and external environment and anticipate associated phenomena, as the data turn into knowledge that can later be used for prediction. This knowledge thus becomes a great asset in companies' hands, which is precisely the objective of data mining. But with data and knowledge being produced at an ever faster pace, we now speak of Big Data mining. For this reason, the authors' proposed work mainly aims at solving the problems of volume, veracity, validity and velocity when classifying Big Data using distributed and parallel processing techniques. The problem raised in this work is how to make machine learning algorithms work in a distributed and parallel way at the same time without losing the accuracy of the classification results. To solve this problem, the authors propose a system called Dynamic Distributed and Parallel Machine Learning (DDPML). The work is divided into two parts. In the first, the authors propose a distributed architecture controlled by a Map-Reduce algorithm, which in turn depends on a random sampling technique. The distributed architecture is designed to handle big data processing in a way that is coherent and efficient with the sampling strategy proposed in this work. This architecture also helps verify the classification results obtained using the representative learning base (RLB). In the second part, the authors extract the representative learning base by sampling at
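A minimal sketch of the idea the abstract describes: draw a random sample as a representative learning base (RLB), then classify partitions of the data in a map/reduce style. The function names (extract_rlb, map_classify, reduce_results) and the k-NN stand-in learner are illustrative assumptions, not DDPML's actual components.

```python
import random
from collections import Counter

def extract_rlb(dataset, sample_ratio=0.1, seed=42):
    """Random sampling step: draw the representative learning base (RLB)."""
    random.seed(seed)
    return random.sample(dataset, max(1, int(len(dataset) * sample_ratio)))

def map_classify(partition, rlb, k=3):
    """Map step: classify each record of one partition by a simple k-NN vote
    against the RLB (a stand-in for the paper's learner)."""
    out = []
    for features, true_label in partition:
        neighbors = sorted(
            rlb, key=lambda r: sum((a - b) ** 2 for a, b in zip(r[0], features))
        )[:k]
        pred = Counter(lbl for _, lbl in neighbors).most_common(1)[0][0]
        out.append((true_label, pred))
    return out

def reduce_results(mapped_partitions):
    """Reduce step: concatenate per-partition predictions."""
    return [row for part in mapped_partitions for row in part]

# Toy usage: 2-D points labeled by which side of x = 0.5 they fall on.
pts = [(random.random(), random.random()) for _ in range(1000)]
data = [((x, y), int(x > 0.5)) for x, y in pts]
rlb = extract_rlb(data)
partitions = [data[i::4] for i in range(4)]            # 4 simulated worker nodes
predictions = reduce_results([map_classify(p, rlb) for p in partitions])
accuracy = sum(t == p for t, p in predictions) / len(predictions)
print(f"accuracy with RLB of {len(rlb)} samples: {accuracy:.2f}")
```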
Monitoring data sources for possible changes is an important consumption requirement for applications that interact with the Web of Data. In this article, MonARCh, an architecture for monitoring result changes of registered SPARQL queries in the Linked Data environment, is proposed. MonARCh can be understood as a publish/subscribe system in the general sense, but it differs in how communication with the data sources is realized: data sources in the Linked Data environment do not publish changes to their data. MonARCh provides the necessary communication infrastructure between the data sources and the consumers for the notification of changes. Users register SPARQL queries with the system, which are then converted to federated queries. MonARCh periodically checks for updates by re-executing SERVICE clauses and notifies users in case of any result change. In addition, to provide scalability, MonARCh takes advantage of the concurrent computation of the actor model, and the parallel join algorithm it utilizes speeds up query execution and result generation. The design science methodology was used during the design, implementation and evaluation of the architecture. Compared to the literature, MonARCh meets all the requirements identified from both the Linked Data monitoring and state-of-the-art perspectives while offering many outstanding features from both points of view. The evaluation results show that, even under a limited two-node cluster setting, MonARCh can reach a monitoring capacity of 300 to 25,000 queries depending on the query selectivities executed within our test bench.
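A small sketch of the polling loop the abstract describes: periodically re-execute the SERVICE subqueries of a registered federated query, join the partial results, and notify subscribers when the joined result changes. The class and function names, the fingerprinting scheme, and the naive positional join are illustrative assumptions, not MonARCh's actual API or join algorithm.

```python
import hashlib
import json
import time

def result_fingerprint(rows):
    """Hash a result set so two executions can be compared cheaply."""
    canonical = json.dumps(sorted(rows, key=json.dumps), sort_keys=True)
    return hashlib.sha256(canonical.encode()).hexdigest()

class QueryMonitor:
    def __init__(self, service_clauses, execute_service, notify):
        self.service_clauses = service_clauses   # endpoint URL -> SERVICE subquery text
        self.execute_service = execute_service   # callable: (endpoint, subquery) -> list of row dicts
        self.notify = notify                     # callable invoked when the result changes
        self.last_fingerprint = None

    def poll_once(self):
        # Re-execute every SERVICE clause and merge the partial results.
        # (A naive positional merge stands in for the parallel join algorithm.)
        partials = [self.execute_service(ep, q) for ep, q in self.service_clauses.items()]
        joined = [dict(pair for row in combo for pair in row.items())
                  for combo in zip(*partials)]
        fp = result_fingerprint(joined)
        if self.last_fingerprint is not None and fp != self.last_fingerprint:
            self.notify(joined)
        self.last_fingerprint = fp

# Usage with a stubbed endpoint executor (no network access needed).
def fake_execute(endpoint, subquery):
    return [{"city": "Ankara", "temp": int(time.time()) % 3}]

monitor = QueryMonitor({"http://example.org/sparql": "SELECT ..."},
                       fake_execute,
                       notify=lambda rows: print("result changed:", rows))
for _ in range(3):
    monitor.poll_once()
    time.sleep(1)
```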
The proliferation of current and next-generation mobile and sensing devices has increased at an alarming rate. With these state-of-the-art devices, the global positioning system (GPS) has made remote sensing and location tracking more viable, enabling a variety of location-based spatial queries. One such query is the All Nearest Neighbor (ANN) query, which extracts and returns, for every query object, the data objects in its close vicinity. An ANN query is a combination of k-nearest neighbor (kNN) and join queries. Hence, ANN is useful for applications in different domains such as transportation optimization, locating safe zones, and ride-sharing. An example application is "find the nearest gas station for each car parking lot". Because these applications generate a massive number of query requests, a large amount of computation is required to answer them, and a single machine cannot meet this demand. In this study, we therefore propose a distributed query processing framework that processes ANN queries using Apache Spark. In an empirical study, our proposed framework achieved superior query efficiency and scalability compared to other methods and design alternatives.
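To make the query concrete, here is a naive ANN baseline on Spark (a cartesian product followed by a per-query reduce), the kind of brute-force computation a partition-aware framework like the one above would improve upon. The schema, point generation, and application name are assumptions for illustration; running it requires a PySpark installation.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("ann-sketch").getOrCreate()
sc = spark.sparkContext

# (id, (x, y)) pairs: query objects (e.g. parking lots) and data objects (e.g. gas stations).
queries = sc.parallelize([(q, (float(q), float(q))) for q in range(100)])
data = sc.parallelize([(d, (d * 0.7, d * 1.3)) for d in range(500)])

def dist2(a, b):
    return (a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2

# Keep, for each query object, the closest data object (squared distance).
ann = (queries.cartesian(data)
       .map(lambda qd: (qd[0][0], (qd[1][0], dist2(qd[0][1], qd[1][1]))))
       .reduceByKey(lambda a, b: a if a[1] <= b[1] else b))

print(ann.take(5))
spark.stop()
```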
The best region search (BRS) problem is one of the major research problems in geospatial data processing applications. Its objective is to discover the ideal location of a rectangle of a specified size, with the goal of maximizing a user-defined scoring function. Existing solutions for finding the top-k best regions have focused on designing algorithms for centralized settings and are not suitable for processing massive datasets. In this paper, we enable Hadoop MapReduce-based parallel and distributed computation to obtain significant performance improvements. In addition to the parallel and distributed setting, we incorporate early pruning strategies that eliminate the need to process rectangles which cannot be part of the output, thereby minimizing the communication cost involved in computing the k-BRS. We further introduce a redistribution strategy on top of the initially proposed methodology that handles skew inherent in the dataset. Our results are obtained from extensive experimentation on both synthetic and real-world datasets.
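The early-pruning idea can be illustrated in a few lines: a candidate rectangle is only scored exactly if an optimistic upper bound on its score can still beat the current k-th best. The grid of candidates, the weight-sum scoring function, and the x-slab bound below are illustrative choices, not the paper's exact formulation.

```python
import heapq
import random

random.seed(0)
points = [(random.uniform(0, 100), random.uniform(0, 100), random.randint(1, 5))
          for _ in range(5000)]                       # (x, y, weight)
W, H, k = 10.0, 10.0, 3                               # rectangle size and top-k

def score(x0, y0):
    """Exact user-defined score: total weight covered by the rectangle at (x0, y0)."""
    return sum(w for x, y, w in points if x0 <= x <= x0 + W and y0 <= y <= y0 + H)

def upper_bound(x0):
    """Cheap optimistic bound: ignore the y-extent and count the whole x-slab."""
    return sum(w for x, _, w in points if x0 <= x <= x0 + W)

top = []                                              # min-heap of (score, (x0, y0))
for x0 in range(0, 90, 5):
    bound = upper_bound(x0)
    for y0 in range(0, 90, 5):
        if len(top) == k and bound <= top[0][0]:
            continue                                  # pruned without exact evaluation
        s = score(x0, y0)
        if len(top) < k:
            heapq.heappush(top, (s, (x0, y0)))
        elif s > top[0][0]:
            heapq.heapreplace(top, (s, (x0, y0)))

print(sorted(top, reverse=True))                      # top-k best regions
```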
With the wide penetration of smart robots in multifarious fields, the simultaneous localization and mapping (SLAM) technique in robotics has attracted growing attention in the community. Yet collaborative SLAM over multiple robots remains challenging due to the conflict between the intensive graphics computation of SLAM and the limited computing capability of robots. While traditional solutions resort to powerful cloud servers acting as external computation providers, we show by real-world measurements that the significant communication overhead of data offloading prevents their practicality in real deployments. To tackle these challenges, this article brings the emerging edge-computing paradigm into multirobot SLAM and proposes RecSLAM, a multirobot laser SLAM system that focuses on accelerating the map construction process under the robot-edge-cloud architecture. In contrast to conventional multirobot SLAM, which generates graphic maps on robots and completely merges them on the cloud, RecSLAM develops a hierarchical map fusion technique that directs robots' raw data to edge servers for real-time fusion and then sends the fused results to the cloud for global merging. To optimize the overall pipeline, an efficient multirobot SLAM collaborative processing framework is introduced that adaptively optimizes robot-to-edge offloading tailored to heterogeneous edge resource conditions while ensuring workload balancing among the edge servers. Extensive evaluations show that RecSLAM achieves up to 39.31% processing latency reduction over the state of the art. Besides, a proof-of-concept prototype is developed and deployed in real scenes to demonstrate its effectiveness.
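As a toy illustration of the workload-balanced robot-to-edge assignment mentioned above, the sketch below greedily assigns each robot's scan stream to the edge server whose estimated finish time stays lowest. The robot workloads, edge speeds, and greedy rule are made-up assumptions, not RecSLAM's actual offloading optimizer.

```python
robots = {"r1": 8.0, "r2": 5.0, "r3": 7.0, "r4": 3.0}   # per-robot scan workload (arbitrary units)
edges = {"edge-A": 2.0, "edge-B": 1.5}                  # edge processing speed (units/s)

load = {e: 0.0 for e in edges}                          # accumulated finish time per edge server
assignment = {}
for robot, work in sorted(robots.items(), key=lambda kv: -kv[1]):   # place largest jobs first
    best = min(edges, key=lambda e: load[e] + work / edges[e])
    assignment[robot] = best
    load[best] += work / edges[best]

print(assignment)   # which edge server fuses each robot's submaps
print(load)         # balanced estimated completion times per edge
```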
ISBN (print): 9798350320107
Hyperspectral target detection (HTD) aims to detect fine targets in hyperspectral images (HSIs). Traditional HTD methods operating on low-resolution hyperspectral images (LR-HSIs) are incapable of detecting small targets clearly and precisely. Accordingly, in this paper, we propose a hyperspectral and multispectral image fusion target detection method based on cloud-edge collaboration. In this method, the LR-HSI is first employed for coarse detection, which outputs a set of suspicious target areas. Afterwards, hyperspectral and multispectral image (HSI-MSI) fusion is performed on these areas for precise target detection. To ensure the efficiency of HTD, we accelerate our method in parallel based on a cloud-edge collaborative architecture. Furthermore, we establish an optimization model and design a greedy strategy to find the deployment that minimizes the runtime on the cloud-edge collaborative architecture. The experimental results demonstrate that our proposed method significantly improves computational efficiency while ensuring accuracy.
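A minimal coarse-to-fine sketch of the pipeline described above: a cheap detector on the low-resolution HSI flags suspicious windows, and only those windows would be handed to the expensive HSI-MSI fusion and precise detection stage. The spectral-angle thresholding, cube sizes, and threshold value are placeholder assumptions, not the paper's detectors.

```python
import numpy as np

def spectral_angle(cube, target):
    """Per-pixel spectral angle between cube pixels and a target signature."""
    num = (cube * target).sum(axis=-1)
    den = np.linalg.norm(cube, axis=-1) * np.linalg.norm(target) + 1e-12
    return np.arccos(np.clip(num / den, -1.0, 1.0))

lr_hsi = np.random.rand(64, 64, 30)          # low-resolution HSI (H, W, bands), synthetic
target = np.random.rand(30)                  # target spectral signature, synthetic

# Coarse stage on the LR-HSI: candidate pixels whose spectra resemble the target.
angles = spectral_angle(lr_hsi, target)
suspicious = np.argwhere(angles < 0.2)

# Fine stage (not shown): HSI-MSI fusion and precise detection run only on
# windows around the suspicious pixels, which is what makes the edge/cloud split pay off.
windows = [(max(r - 2, 0), max(c - 2, 0)) for r, c in suspicious[:10]]
print(f"{len(suspicious)} suspicious pixels; first candidate windows: {windows}")
```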
ISBN (digital): 9781665427920
ISBN (print): 9781665427920
Hyperspectral computational imaging (HCI) aims to reconstruct hyperspectral images (HSIs) from the compressed signals collected by remote sensing and imaging systems. Collaborative Tucker3 tensor decomposition is beneficial for HCI models in reconstructing high-fidelity HSIs. However, the ever-increasing amount of compressed data imposes a heavy computational burden on tensor decomposition-based HCI models, which may exceed the computing capacity of a single machine. For this reason, this paper proposes a Spark-based distributed and parallel HCI implementation via collaborative Tucker3 tensor decomposition. The proposed implementation decomposes the processing flow of the HCI algorithm into several stages, each of which can be processed in parallel on Spark. In addition, we develop parallel strategies to improve the performance of the redundant computational procedures and the data storage procedure, respectively. Experimental results demonstrate that the parallel algorithm not only achieves high accuracy but also improves computational efficiency when processing large-scale HSI datasets.
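A small numeric sketch of why this workload partitions well: the Tucker3 reconstruction Y = G ×₁ U1 ×₂ U2 ×₃ U3 can be computed independently per spatial chunk (per block of rows of a factor matrix), which is exactly the kind of stage that maps onto Spark partitions. The shapes and the two-chunk split are toy assumptions; this is plain NumPy, not the paper's Spark implementation.

```python
import numpy as np

def mode_n_product(tensor, matrix, mode):
    """Multiply a 3-way tensor by a matrix along the given mode."""
    return np.moveaxis(
        np.tensordot(matrix, np.moveaxis(tensor, mode, 0), axes=(1, 0)), 0, mode
    )

G = np.random.rand(8, 8, 6)                                   # core tensor
U1, U2, U3 = np.random.rand(32, 8), np.random.rand(32, 8), np.random.rand(30, 6)

# Full reconstruction in one shot.
full = mode_n_product(mode_n_product(mode_n_product(G, U1, 0), U2, 1), U3, 2)

# Same result computed per row-chunk of U1; each chunk could live on a separate partition.
chunks = [mode_n_product(mode_n_product(mode_n_product(G, U1[s], 0), U2, 1), U3, 2)
          for s in (slice(0, 16), slice(16, 32))]
assert np.allclose(full, np.concatenate(chunks, axis=0))
print("chunked reconstruction matches the monolithic one:", full.shape)
```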
We present the Feature Tracking Kit (FTK), a framework that simplifies, scales, and delivers various feature-tracking algorithms for scientific data. The key to FTK is our simplicial spacetime meshing scheme, which generalizes both regular and unstructured spatial meshes to spacetime while tessellating spacetime mesh elements into simplices. The benefits of using simplicial spacetime meshes include (1) reducing ambiguity cases for feature extraction and tracking, (2) simplifying the handling of degeneracies using symbolic perturbations, and (3) enabling scalable and parallel processing. The use of simplicial spacetime meshing simplifies and improves the implementation of several feature-tracking algorithms for critical points, quantum vortices, and isosurfaces. As a software framework, FTK provides end users with VTK/ParaView filters, Python bindings, a command line interface, and programming interfaces for feature-tracking applications. We demonstrate use cases as well as scalability studies through both synthetic data and scientific applications including tokamak, fluid dynamics, and superconductivity simulations. We also conduct end-to-end performance studies on the Summit supercomputer. FTK is open sourced under the MIT license: https://***/hguo/ftk.
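To make the phrase "tessellating spacetime mesh elements into simplices" concrete, here is the standard split of a 2-D spatial triangle extruded over one time step (a prism) into three tetrahedra. This is a generic textbook prism split included only as an illustration, not FTK's own meshing code.

```python
def extrude_triangle_to_tets(tri):
    """tri: three vertex ids of a spatial triangle at time t.
    Returns the 3 spacetime tetrahedra spanning t..t+1, where
    vertex (v, 0) lives at time t and (v, 1) at time t+1."""
    a, b, c = sorted(tri)   # a consistent vertex ordering keeps adjacent prisms conforming
    return [
        ((a, 0), (b, 0), (c, 0), (a, 1)),
        ((b, 0), (c, 0), (a, 1), (b, 1)),
        ((c, 0), (a, 1), (b, 1), (c, 1)),
    ]

for tet in extrude_triangle_to_tets((7, 3, 5)):
    print(tet)
```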
The large data volume and high algorithm complexity of hyperspectral image (HSI) problems have posed big challenges for the efficient classification of massive HSI data repositories. Recently, cloud computing architectures have become more relevant for addressing the big computational challenges introduced in the HSI field. This article proposes an acceleration method for HSI classification that relies on scheduling metaheuristics to automatically and optimally distribute the workload of HSI applications across multiple computing resources on a cloud platform. By analyzing the procedure of a representative classification method, we first develop its distributed and parallel implementation based on the MapReduce mechanism on Apache Spark. The subtasks of the processing flow that can be processed in a distributed way are identified as divisible tasks. The optimal execution of this application on Spark is then formulated as a divisible scheduling framework that takes into account both task execution precedences and task divisibility when allocating the divisible and indivisible subtasks onto computing nodes. The formulated scheduling framework is an optimization procedure that searches for optimized task assignments and partition counts for the divisible tasks. Two metaheuristic algorithms are developed to solve this divisible scheduling problem. The scheduling results provide an optimized solution to the automatic processing of HSI big data on clouds, improving the computational efficiency of HSI classification by exploiting the parallelism of the processing flow. Experimental results demonstrate that our scheduling-guided approach achieves remarkable speedups by facilitating the automatic processing of HSI classification on Spark, and it is scalable to increasing HSI data volumes.
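For flavor, a toy metaheuristic of the kind the abstract mentions: simulated annealing over assignments of subtasks to cluster nodes, minimizing the estimated makespan. The task costs, node speeds, and annealing schedule are illustrative assumptions, and the sketch ignores the precedence and divisibility constraints of the paper's actual formulation.

```python
import math
import random

random.seed(1)
task_costs = [random.uniform(1, 10) for _ in range(20)]   # subtask workloads (arbitrary units)
node_speed = [1.0, 1.5, 2.0]                               # relative node speeds

def makespan(assignment):
    finish = [0.0] * len(node_speed)
    for task, node in enumerate(assignment):
        finish[node] += task_costs[task] / node_speed[node]
    return max(finish)

current = [random.randrange(len(node_speed)) for _ in task_costs]
best, best_cost = current[:], makespan(current)
temperature = 5.0
for _ in range(2000):
    candidate = current[:]
    candidate[random.randrange(len(candidate))] = random.randrange(len(node_speed))
    delta = makespan(candidate) - makespan(current)
    if delta < 0 or random.random() < math.exp(-delta / temperature):
        current = candidate                      # accept improving (or occasionally worse) moves
        if makespan(current) < best_cost:
            best, best_cost = current[:], makespan(current)
    temperature *= 0.999                          # cool down

print("best makespan:", round(best_cost, 2), "assignment:", best)
```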
We present a novel distributed union-find algorithm that features asynchronous parallelism and k-d tree based load balancing for scalable visualization and analysis of scientific data. Applications of union-find include level set extraction and critical point tracking, but distributed union-find can suffer from high synchronization costs and imbalanced workloads across parallel processes. In this study, we prove that global synchronizations in existing distributed union-find can be eliminated without changing final results, allowing overlapped communications and computations for scalable processing. We also use a k-d tree decomposition to redistribute inputs, in order to improve workload balancing. We benchmark the scalability of our algorithm with up to 1,024 processes using both synthetic and application data. We demonstrate the use of our algorithm in critical point tracking and super-level set extraction with high-speed imaging experiments and fusion plasma simulations, respectively.
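For readers unfamiliar with the data structure, a compact serial union-find with path compression is shown below, applied to merging cells of the same super-level set component. It is included only to make the primitive concrete; the paper's contribution, the asynchronous distributed version with k-d tree load balancing, is not reproduced here.

```python
class UnionFind:
    def __init__(self, n):
        self.parent = list(range(n))

    def find(self, x):
        # Path compression: point every visited node directly at the root.
        root = x
        while self.parent[root] != root:
            root = self.parent[root]
        while self.parent[x] != root:
            self.parent[x], x = root, self.parent[x]
        return root

    def union(self, a, b):
        ra, rb = self.find(a), self.find(b)
        if ra != rb:
            self.parent[max(ra, rb)] = min(ra, rb)   # deterministic root choice

# Usage: connect cells that belong to the same super-level set component.
uf = UnionFind(10)
for a, b in [(0, 1), (1, 2), (5, 6)]:
    uf.union(a, b)
print([uf.find(i) for i in range(10)])   # component label per cell
```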