We live in an on-demand, on-command digital universe in which institutions, individuals and machines generate data at a very high rate. This data is categorized as "Big Data" due to its sheer volume, variety and velocity. Most of it is unstructured, quasi-structured or semi-structured, and it is heterogeneous in nature. The volume and heterogeneity of the data, together with the speed at which it is generated, make it difficult for present computing infrastructures to manage Big Data. Traditional data management, warehousing and analysis systems fall short of tools to analyze it. Owing to its specific nature, Big Data is stored in distributed file system architectures. Hadoop and HDFS by Apache are widely used for storing and managing Big Data. Analyzing Big Data is a challenging task, as it involves large distributed file systems that must be fault-tolerant, flexible and scalable. MapReduce is widely used for the efficient analysis of Big Data. Traditional DBMS techniques such as joins and indexing, as well as other techniques such as graph search, are used for the classification and clustering of Big Data, and these techniques are being adapted for use in MapReduce. In this paper we suggest various methods for addressing the problems at hand through the MapReduce framework over the Hadoop Distributed File System (HDFS). MapReduce is a minimization technique that makes use of file indexing with mapping, sorting, shuffling and finally reducing. The MapReduce techniques studied in this paper are implemented for Big Data analysis using HDFS.
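For readers unfamiliar with the map, sort/shuffle, reduce flow named above, the following is a minimal single-process sketch in Python. It is purely didactic and not the paper's Hadoop implementation; on a real cluster each phase runs distributed over HDFS blocks.

```python
# Minimal illustration of the map -> shuffle/sort -> reduce flow.
from collections import defaultdict

def map_phase(records):
    # Emit (key, value) pairs; here: word counts from lines of text.
    for line in records:
        for word in line.split():
            yield (word.lower(), 1)

def shuffle_phase(pairs):
    # Group all values by key (the framework sorts/shuffles across nodes).
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Aggregate each key's values into a final result.
    return {key: sum(values) for key, values in groups.items()}

lines = ["big data needs big tools", "map reduce handles big data"]
print(reduce_phase(shuffle_phase(map_phase(lines))))
# {'big': 3, 'data': 2, 'needs': 1, 'tools': 1, 'map': 1, 'reduce': 1, 'handles': 1}
```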
Many scientific fields routinely generate huge datasets. In many cases, these datasets are not static but grow rapidly in size. Handling these types of datasets, as well as allowing sophisticated queries, necessitates scalable distributed database systems in which scientists can efficiently search the data. In this paper we present the architecture, implementation and performance analysis of a scalable, distributed database system built on software-based virtualization environments. The system architecture makes use of a software partitioning of the database based on data clustering, an SQMD (single query multiple database) mechanism, a Web service interface, and virtualization software technologies. The system allows uniform access to concurrently distributed databases, using the SQMD mechanism based on the publish/subscribe paradigm. We highlight the scalability of our architecture by applying it to a database of 17 million chemical structures. In addition to simple identifier-based retrieval, we present performance results for shape similarity queries, which are extremely time-intensive with traditional architectures.
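As a rough illustration of the SQMD idea described above, the sketch below broadcasts one query to several database partitions and merges their answers. The Partition class and sqmd_query helper are hypothetical names for illustration, not the paper's API, and a real deployment would route the query through the publish/subscribe broker rather than local threads.

```python
# Hypothetical sketch: one query, many partitions, merged results.
from concurrent.futures import ThreadPoolExecutor

class Partition:
    def __init__(self, rows):
        self.rows = rows  # each shard holds one cluster of the data

    def execute(self, predicate):
        return [r for r in self.rows if predicate(r)]

def sqmd_query(partitions, predicate):
    # "Publish" the same query to all partitions concurrently and
    # "subscribe" to their individual result streams.
    with ThreadPoolExecutor() as pool:
        results = list(pool.map(lambda p: p.execute(predicate), partitions))
    merged = []
    for part in results:
        merged.extend(part)
    return merged

shards = [Partition([{"id": 1, "mw": 180.2}, {"id": 2, "mw": 94.1}]),
          Partition([{"id": 3, "mw": 210.7}])]
print(sqmd_query(shards, lambda r: r["mw"] > 100))
```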
High volumes of uncertain data can be generated in distributed environments in many real-life biological, medical and life science applications. As an important data mining task, frequent pattern mining helps discover frequently co-occurring items, objects, or events from these distributed databases. However, users may be interested in only small portions of all the frequent patterns that can be mined from these databases. In this paper, we propose an intelligent computing system that (i) allows users to express their interests via user-specified constraints and (ii) effectively exploits the anti-monotonic properties of user-specified constraints to efficiently discover frequent patterns satisfying these constraints from distributed databases containing uncertain data.
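To make the role of anti-monotonicity concrete, here is a small Apriori-style sketch over uncertain transactions: items carry existence probabilities, expected support is summed per transaction, and a price-budget constraint (an assumption chosen for illustration, not the paper's benchmark) prunes a candidate together with all of its supersets.

```python
import math

transactions = [  # each item maps to its existence probability
    {"a": 0.9, "b": 0.8, "c": 0.4},
    {"a": 0.7, "c": 0.9},
    {"a": 0.6, "b": 0.5},
]
price = {"a": 10, "b": 25, "c": 40}
BUDGET, MIN_ESUP = 50, 0.8

def expected_support(itemset):
    # Expected support over uncertain data: per transaction, the product
    # of the itemset's probabilities, summed over all transactions.
    total = 0.0
    for t in transactions:
        if all(i in t for i in itemset):
            total += math.prod(t[i] for i in itemset)
    return total

def satisfies(itemset):
    # Anti-monotonic constraint: once a set exceeds the budget,
    # every superset does too, so the whole branch can be pruned.
    return sum(price[i] for i in itemset) <= BUDGET

items = sorted({i for t in transactions for i in t})
level = [(i,) for i in items
         if satisfies((i,)) and expected_support((i,)) >= MIN_ESUP]
frequent = []
while level:
    frequent.extend(level)
    candidates = {tuple(sorted(set(a) | set(b)))
                  for a in level for b in level
                  if len(set(a) | set(b)) == len(a) + 1}
    # Prune by the constraint first, then by expected support.
    level = [c for c in candidates
             if satisfies(c) and expected_support(c) >= MIN_ESUP]
print(frequent)  # e.g. [('a',), ('b',), ('c',), ('a', 'b'), ('a', 'c')]
```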
With the growing interest in cell-free massive multiple-input multiple-output (MIMO) systems, the benefits of single-antenna access points (APs) versus multi-antenna APs must be analyzed in order to optimize deployment. In this paper, we compare various antenna system topologies based on achievable downlink spectral efficiency, using both measured and synthetic channel data in an indoor environment. We assume multi-user scenarios, analyzing both conjugate beamforming (maximum-ratio transmission, MRT) and zero-forcing (ZF) precoding methods. The results show that semi-distributed multi-antenna APs can reduce the number of APs while still achieving rates comparable to those of fully-distributed single-antenna APs with the same total number of antennas.
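For reference, the sketch below computes the two precoders compared in the abstract on a synthetic Rayleigh channel (an assumption standing in for the measured data): MRT matches each user's own channel, ZF inverts the multi-user channel to null inter-user interference, and a crude per-user spectral efficiency is evaluated for both.

```python
# M total antennas serving K single-antenna users on the downlink.
import numpy as np

rng = np.random.default_rng(0)
K, M = 4, 16
H = (rng.standard_normal((K, M)) + 1j * rng.standard_normal((K, M))) / np.sqrt(2)

# Conjugate beamforming / MRT: steer at each user's own channel.
W_mrt = H.conj().T
W_mrt = W_mrt / np.linalg.norm(W_mrt, axis=0)  # unit power per user stream

# Zero-forcing: invert the channel so inter-user interference is nulled.
W_zf = H.conj().T @ np.linalg.inv(H @ H.conj().T)
W_zf = W_zf / np.linalg.norm(W_zf, axis=0)

for name, W in [("MRT", W_mrt), ("ZF", W_zf)]:
    G = H @ W                                  # effective channel after precoding
    signal = np.abs(np.diag(G)) ** 2
    interference = np.sum(np.abs(G) ** 2, axis=1) - signal
    sinr = signal / (interference + 1e-2)      # 1e-2 stands in for noise power
    rate = np.log2(1 + sinr)                   # per-user spectral efficiency
    print(name, "sum SE [bit/s/Hz]:", np.round(rate.sum(), 2))
```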
The Internet of Things (IoT) is a technological paradigm that aims to connect millions of networked devices to provide more complex functionality. However, the heterogeneity of application/device communication standards precludes support for interoperability, which impacts developer collaboration. Many works propose solutions to support syntactic and semantic interoperability in the IoT context. This paper proposes a service capable of supporting pragmatic IoT interoperability with the goal of enriching developer collaboration, made possible through the use of inferences and similarity calculations over information provided by the developers.
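As one possible form of the similarity calculation mentioned above, the snippet below compares developer-provided term sets for IoT services with the Jaccard index and suggests collaboration candidates above a threshold. The service descriptions and threshold are invented for demonstration, not taken from the paper.

```python
def jaccard(a, b):
    # Jaccard index: overlap of two term sets relative to their union.
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0

services = {
    "dev1/thermostat": {"temperature", "celsius", "mqtt", "schedule"},
    "dev2/weather":    {"temperature", "humidity", "http", "celsius"},
    "dev3/lock":       {"access", "pin", "mqtt"},
}

THRESHOLD = 0.3
names = list(services)
for i, x in enumerate(names):
    for y in names[i + 1:]:
        s = jaccard(services[x], services[y])
        if s >= THRESHOLD:
            print(f"suggest collaboration: {x} <-> {y} (similarity {s:.2f})")
```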
Earth Observation (EO) is considered a key element in the European Research Roadmap and an opportunity market for the coming years. However, this field presents some critical challenges in covering the current demand for services: i) Earth Observation recordings produce massive, large-sized data; ii) on-demand storage, processing and distribution of the geo-information generated from the recorded data are required. Conventional infrastructures risk being over- or under-sized when big data is involved, they are not flexible enough to cover sudden changes in service demand, and access to the information suffers large latencies. These aspects limit the use of EO technology for real-time applications. Cloud computing technology can overcome these limitations. The GEO-Cloud experiment emerged to find viable solutions for providing highly demanding EO services using future internet technologies. It is a close-to-reality experiment, part of the FP7 Fed4FIRE project. GEO-Cloud consists of the design, implementation and testing in the cloud of a complete EO system, from the acquisition of geo-data with a constellation of satellites to its on-demand distribution to end users with remote access. This paper presents the GEO-Cloud experiment's design, architecture and foreseen research activity.
ISBN (print): 9781665445993
In this paper, we address the challenges in supporting reliability and scalability in societal-scale notification systems that aim to reach large populations with customized alerts. We explore fault tolerance (FT) techniques in the context of Big Data Publish-Subscribe systems (BDPS), a scalable hierarchical architecture that meshes big-data platforms (to store and operate on large volumes of data) with a distributed pub/sub broker network (to manage and communicate with a large number of end subscribers). The role of brokers in this architecture is critical, since they mediate interactions between subscribers and the backend big-data system. We propose the REAPS (REliable Active Publish Subscribe) framework, which can handle different classes of broker failures, including randomized failures and geographically-correlated failures (as in a natural disaster). REAPS implements a low-overhead fault tolerance service using a primary-backup approach; key features include the ability to exploit subscription similarity among brokers and techniques for quasi-active state replication to support fast recovery and delivery guarantees for notification services. We implement REAPS and conduct measurement studies on a prototype BDPS platform using real-world use cases. We further evaluate REAPS under various failure scenarios to explore the scalability and performance of our proposed FT mechanisms via simulation studies.
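The toy sketch below illustrates the primary-backup idea in its simplest form: subscription state changes are mirrored to a backup broker (state rather than traffic, in the spirit of quasi-active replication), and publishing fails over when the primary is down. Class and function names are hypothetical; REAPS's actual protocol is considerably richer.

```python
class Broker:
    def __init__(self, name):
        self.name = name
        self.subscriptions = {}   # topic -> set of subscriber ids
        self.alive = True

    def subscribe(self, topic, subscriber, backup=None):
        self.subscriptions.setdefault(topic, set()).add(subscriber)
        if backup is not None:
            # Quasi-active replication: mirror state changes, not traffic.
            backup.subscriptions.setdefault(topic, set()).add(subscriber)

    def publish(self, topic, message):
        if not self.alive:
            raise ConnectionError(f"{self.name} is down")
        for sub in self.subscriptions.get(topic, ()):
            print(f"[{self.name}] deliver {message!r} to {sub}")

def publish_ft(primary, backup, topic, message):
    # Fail over to the backup if the primary is unreachable.
    try:
        primary.publish(topic, message)
    except ConnectionError:
        backup.publish(topic, message)

primary, backup = Broker("primary"), Broker("backup")
primary.subscribe("alerts/flood", "user-42", backup=backup)
primary.alive = False                      # simulate a broker failure
publish_ft(primary, backup, "alerts/flood", "evacuate zone A")
```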
ISBN (digital): 9798350393187
ISBN (print): 9798350393194
We consider two critical aspects of security in the distributed computing (DC) model: secure data shuffling and secure coded computing. It is imperative that any external entity overhearing the transmissions gains no information about the intermediate values (IVs) exchanged during the shuffling phase of the DC model. Our approach ensures IV confidentiality during data shuffling. Moreover, each node in the system must be able to recover the IVs necessary for computing its output functions, yet remain oblivious to the IVs associated with output functions not assigned to it. We design secure DC methods and establish achievable limits on the tradeoffs between the communication and computation loads, contributing to the advancement of secure data processing in distributed systems.
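As a toy model of the confidentiality requirement stated above (and only that; the paper's coded-computing scheme is not reproduced here), the snippet masks an IV with a one-time pad shared by sender and intended receiver, so an eavesdropper, or a node not assigned the corresponding output function, learns nothing from the transmission. Key sizes and framing are illustrative assumptions.

```python
import secrets

def xor(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))

iv = b"partial-sum:1234"                 # intermediate value to shuffle
pad = secrets.token_bytes(len(iv))       # pre-shared between the two nodes

ciphertext = xor(iv, pad)                # what actually crosses the network
assert xor(ciphertext, pad) == iv        # intended receiver recovers the IV
print("eavesdropper sees:", ciphertext.hex())
```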
ISBN (print): 9781479915194
As the push towards electronic storage, publication, curation, and discoverability of research data collected in multiple research domains has grown, so too have the massive numbers of small to medium datasets that are highly distributed and not easily discoverable - a region of data sometimes referred to as the long tail of science. The rapidly increasing volume of these long-tail data presents one aspect of the Big Data problem: how does one more easily access, discover, use, and reuse long-tail data to enable new multidisciplinary collaborative research and scientific advancement? In this paper, we describe DataBridge, a new e-science collaboration environment that will realize the potential of long-tail data by implementing algorithms and tools that more easily enable data discoverability and reuse. DataBridge will define different types of semantic bridges that link diverse datasets by applying a set of sociometric network analysis (SNA) and relevance algorithms. We will measure relevancy by examining different ways datasets can be related to each other: data-to-data, user-to-data, and method-to-data connections. Through analysis of metadata and ontology, by pattern analysis and feature extraction, through usage tools and models, and via human connections, DataBridge will create an environment for long-tail data that is greater than the sum of its parts. In the project's initial phase, we will test and validate the new tools with real-world data contained in the Dataverse Network, the largest social science data repository. In this short paper, we discuss the background and vision for the DataBridge project, and present an introduction to the proposed SNA algorithms and analytical tools that are relevant for discoverability of long-tail science data.
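To ground the notion of a relevance-based semantic bridge, here is a deliberately simple data-to-data scoring sketch using cosine similarity over bag-of-words metadata. The metadata records are invented for demonstration; DataBridge's actual SNA and relevance algorithms go well beyond this.

```python
from collections import Counter
from math import sqrt

def cosine(a: Counter, b: Counter) -> float:
    # Cosine similarity between two term-frequency vectors.
    dot = sum(a[t] * b[t] for t in a)
    norm = (sqrt(sum(v * v for v in a.values()))
            * sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def terms(text: str) -> Counter:
    return Counter(text.lower().split())

metadata = {
    "survey-2012": "household income survey panel united states",
    "census-extract": "income census household demographics",
    "river-gauge": "hydrology river discharge sensor timeseries",
}

base = terms(metadata["survey-2012"])
for name, text in metadata.items():
    if name != "survey-2012":
        print(name, round(cosine(base, terms(text)), 3))
```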