A columnar data representation is known to be an efficient way to store data, specifically when analysis is often based on only a small fragment of the available data structures. A data representation like Apache Parquet goes a step beyond a purely columnar representation by also splitting data horizontally to allow for easy parallelization of data analysis. Based on the general idea of columnar data storage, and working on the FNAL LDRD Project FNAL-LDRD-2016-032, we have developed a striped data representation which, we believe, is better suited to the needs of High Energy Physics data analysis. A traditional columnar approach allows for efficient data analysis of complex structures. While keeping all the benefits of columnar data representations, the striped mechanism goes further by enabling easy parallelization of computations without requiring special hardware. We will present an implementation and some performance characteristics of such a data representation mechanism using a distributed NoSQL database or a local file system, unified under the same API and data representation model. The representation is efficient and at the same time simple, so it allows for a common data model and APIs for a wide range of underlying storage mechanisms, such as distributed NoSQL databases and local file systems. Striped storage adopts NumPy arrays as its basic data representation format, which makes it easy and efficient to use in Python applications. The Striped Data Server is a web service, which hides the server implementation details from the end user, easily exposes data to WAN users, and makes it possible to use well-known, mature data caching solutions to further increase data access efficiency. We are considering the Striped Data Server as the core of an enterprise-scale data analysis platform for High Energy Physics and similar areas of data processing. We have been testing this architecture with a 2TB dataset from a CMS dark matter search and ...
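The striping idea in the abstract above can be illustrated with a minimal sketch. It assumes a stripe is simply a contiguous row range of each column held as a NumPy array; the column names (`pt`, `eta`) and the helpers `make_stripes` and `selected_sum` are hypothetical and are not the project's actual API.

```python
import numpy as np

def make_stripes(columns, stripe_size):
    """Split every column (a 1-D array) into consecutive row ranges ("stripes")."""
    n_rows = len(next(iter(columns.values())))
    return [
        {name: col[start:start + stripe_size] for name, col in columns.items()}
        for start in range(0, n_rows, stripe_size)
    ]

def selected_sum(stripe, column, threshold):
    """Per-stripe work: touch only one column, return a partial sum and a row count."""
    values = stripe[column]
    mask = values > threshold
    return float(values[mask].sum()), int(mask.sum())

# Hypothetical event data: each key is one column of the dataset.
columns = {
    "pt": np.arange(10, dtype=np.float64),
    "eta": np.linspace(-2.0, 2.0, 10),
}

stripes = make_stripes(columns, stripe_size=4)

# Each stripe can be processed independently (for example by separate workers);
# the partial results are then combined.
partials = [selected_sum(s, "pt", 5.0) for s in stripes]
total = sum(p[0] for p in partials)
count = sum(p[1] for p in partials)
```

Because every stripe carries the same columns over a disjoint row range, the per-stripe function reads only the column it needs, and the list comprehension could be replaced by any parallel map without changing the result.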
ISBN: 9781509030385 (print)
Wireless Sensor Networks (WSNs) are collections of large numbers of sensor nodes used for essential monitoring purposes. Their applications include military and security applications, elementary monitoring, seismic monitoring, industrial automation, and wellness and robust monitoring. Because of their distributed nature and interconnected network approach, they are a common target for attackers. WSNs are vulnerable to many attacks, such as illegal access, malicious node, wormhole, jamming, insider, cloning, routing, and node capture attacks. A node capture attack leads to a node replication attack, which is considered dangerous because, after capturing a node, the attacker may extract all the encrypted information and go on to mount many insider attacks. In a node replication attack, the attacker tries to capture a sensor node and, if successful, generates a replica that appears to be genuine. This paper reviews the various types of identification schemes and protocols against replication attacks.
The maximum flow problem is a classical combinatorial problem with many applications. In this work, a hybrid parallel algorithm that uses both multi-core and many-core technologies to compute the maximum flow in a network is presented. The proposed implementation is applicable in OpenMP/CUDA-enabled computing environments. To improve performance, two strategies were implemented: an adaptive approach in which the algorithm alternates between GPU and CPU processing according to the number of active nodes, and multi-core implementations of the global relabeling and gap relabeling heuristics. Compared with the best sequential implementation, the speedups range from 2.36 to 5.38 across several kinds of graphs. Results show that the proposed algorithm is faster than previous parallel CPU/GPU implementations for all kinds of tested graphs.
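The terms "active nodes" and "gap relabeling" are usually associated with the push-relabel family of maximum flow algorithms; assuming that is the family meant here, the following is a minimal sequential sketch of FIFO push-relabel with the gap heuristic. It is not the authors' hybrid OpenMP/CUDA implementation, it omits global relabeling, and the function name and edge-list format are illustrative choices.

```python
from collections import deque

def max_flow_push_relabel(n, edges, s, t):
    """Return the max flow value from s to t; edges are (u, v, capacity) triples."""
    cap = [[0] * n for _ in range(n)]      # residual capacity matrix
    for u, v, c in edges:
        cap[u][v] += c
    height = [0] * n
    excess = [0] * n
    count = [0] * (2 * n + 1)              # count[h] = number of nodes at height h
    height[s] = n
    count[0], count[n] = n - 1, 1
    active = deque()                       # FIFO queue of nodes with positive excess

    def push(u, v):
        d = min(excess[u], cap[u][v])
        cap[u][v] -= d
        cap[v][u] += d
        excess[u] -= d
        excess[v] += d
        if excess[v] == d and v not in (s, t):
            active.append(v)               # v had no excess before, so it just became active

    def relabel(u):
        old = height[u]
        new = 1 + min(height[v] for v in range(n) if cap[u][v] > 0)
        count[old] -= 1
        height[u] = new
        count[new] += 1
        if count[old] == 0 and old < n:
            # Gap heuristic: no node is left at height `old`, so nodes strictly
            # between `old` and n can no longer reach the sink; lift them above n.
            for w in range(n):
                if old < height[w] < n:
                    count[height[w]] -= 1
                    height[w] = n + 1
                    count[n + 1] += 1

    # Saturate every edge leaving the source.
    for v in range(n):
        if v != s and cap[s][v] > 0:
            excess[s] = cap[s][v]
            push(s, v)

    # Discharge active nodes until none remain.
    while active:
        u = active.popleft()
        while excess[u] > 0:
            for v in range(n):
                if cap[u][v] > 0 and height[u] == height[v] + 1:
                    push(u, v)
                    if excess[u] == 0:
                        break
            else:
                relabel(u)                 # no admissible edge could absorb the excess
    return excess[t]

# Small example network: node 0 is the source, node 5 the sink.
example_edges = [
    (0, 1, 16), (0, 2, 13), (2, 1, 4), (1, 3, 12), (3, 2, 9),
    (2, 4, 14), (4, 3, 7), (3, 5, 20), (4, 5, 4),
]
flow = max_flow_push_relabel(6, example_edges, 0, 5)
```

The size of the `active` queue in this sketch corresponds to the "number of active nodes" that the abstract uses to decide between GPU and CPU processing; a dense matrix is used only to keep the example short.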
This project is devoted to monitoring and analyzing the labour market using publicly available data on job offers, CVs and companies gathered from open data sources and recruitment agencies. The relevance of the projec...
ISBN: 9781728116389; 9781728116372 (print)
This paper proposes and assesses a Big Data Platform for effective storage and analysis of On Board Unit (OBU) data related to the mobility of trucks in Belgium. The large volume and streaming nature of the OBU data require a big data platform for efficient collection, storage and analysis. The solution relies on (i) the Hadoop Distributed File System (HDFS) to store data, (ii) the Apache Parquet format for data compression and columnar storage, and (iii) Spark for parallel and streaming processing of data. Data replication, compression and columnar storage ensure robustness to node failure, data distribution, and faster access to data.
ISBN: 9781509030385 (print)
Query optimization is the process of finding the optimal processing method to answer a query, whereas a distributed database is a collection of sites distributed over a computer network. In a distributed database, the sites communicate with each other through the network. Processing cost and transmission cost are the important issues that arise when evaluating query cost. Several algorithms have been developed to find the best solution for a particular query; however, they all have certain limitations. Hence, finding the optimal cost for a particular query is emerging as an open challenge for many researchers. Therefore, cost-based query optimization has emerged as an important concept for dealing with query optimization. With the help of fragmentation, one can replicate each fragment to various distributed sites, since ...
Authors: Škrabal, Michal; Benko, Vladimír
Charles University, Institute of the Czech National Corpus, Faculty of Arts, Panská 7, Praha 111000, Czech Republic
Slovak Academy of Sciences, L. Štúr Institute of Linguistics; Comenius University, UNESCO Chair in Plurilingual and Multicultural Communication, Bratislava, Slovakia
As Latvian can still be considered an under-resourced language, several corpora and corpus tools that can be used for its linguistic research are presented in the paper, namely: the InterCorp and Araneum Lettonicum co...
ISBN: 9781538637906 (print)
Cloud computing is one of the most popular technologies today because of its wide utility and the benefits it brings to IT companies all over the world. However, faced with users' increasing requests for computing services, cloud providers are encouraged to deploy large data centers, which consume very large amounts of energy and contribute to high operational costs. Among the effects, the carbon dioxide emission rate is growing each day because of the huge power consumption. Energy efficiency is an important issue in cloud computing, mainly because of the electrical power required to run these systems and to cool them. Energy consumption has therefore become a major concern for the widespread deployment of Cloud data centers. The growing importance of parallel applications in the Cloud introduces significant challenges in reducing the energy consumption of hosted servers. This paper addresses the problem of placing independent applications on the physical servers (hosts) of a Cloud infrastructure. We propose a novel heuristic to allocate applications so that total energy consumption is reduced. Our proposal respects various constraints, e.g. the availability and capability of the machines and the duplication of applications. Experiments are presented to validate the potential of our approach.
In recent years, the prospects for digital transformation of economic processes have been actively discussed. It is quite a complex problem that cannot be solved with traditional methods. Opportunities of the qualitative developme...
ISBN: 9781450347228 (print)
State-of-the-art high-throughput sequencing instruments decipher in excess of a billion short genomic fragments per run. The output sequences are referred to as reads. These read datasets facilitate a wide variety of analyses with applications in areas such as genomics, metagenomics, and transcriptomics. Owing to the large size of the read datasets, such analyses are often compute and memory intensive. In this paper, we present a parallel algorithm for partitioning large-scale read datasets in order to facilitate distributed-memory parallel analyses. During the process of partitioning the read datasets, we construct and partition the associated de Bruijn graph in parallel. This allows applications that make use of a variant of the de Bruijn graph, such as de novo assembly, to directly leverage the generated de Bruijn graph partitions. In addition, we propose a mechanism for evaluating the quality of the generated partitions of reads and demonstrate that our algorithm produces high-quality partitions. Our implementation is available at ***/ParBLiSS/read_partitioning.
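To make the relationship between reads and the de Bruijn graph concrete, here is a small sequential sketch. It builds the graph from k-mers and, as an assumption made purely for illustration, partitions reads by the connected component of the graph they fall into; the abstract does not state the partitioning criterion, and the function names are hypothetical. Reverse complements are ignored for brevity.

```python
from collections import defaultdict

def de_bruijn_graph(reads, k):
    """Map each (k-1)-mer to the set of (k-1)-mers that follow it in some k-mer."""
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])   # edge: prefix -> suffix
    return graph

def partition_reads(reads, k):
    """Group read indices by connected component of the de Bruijn graph (union-find)."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]    # path halving
            x = parent[x]
        return x

    def union(a, b):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    # Every k-mer links its prefix and suffix nodes into one component.
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            union(kmer[:-1], kmer[1:])

    # All nodes of a read lie in one component, so its first node identifies it.
    groups = defaultdict(list)
    for index, read in enumerate(reads):
        if len(read) >= k:
            groups[find(read[:k - 1])].append(index)
    return sorted(groups.values())

reads = ["ACGTAC", "GTACGG", "TTTTTT", "TTTTTA"]
graph = de_bruijn_graph(reads, k=4)
partitions = partition_reads(reads, k=4)
```

Reads 0 and 1 share the k-mer `GTAC` and end up together, as do reads 2 and 3 through `TTTT`, so each group could be handed to a separate process with its own piece of the graph.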