ISBN:
(Print) 9781467379526
The amount of big data produced by high-throughput Next-Generation Sequencing (NGS) techniques presents various challenges, including the storage, analysis and transmission of massive datasets. One solution to storage and transmission is compression with specialized algorithms. Existing specialized algorithms scale poorly as dataset sizes grow, and the best available solutions can take hours to compress gigabytes of data. Compression and decompression of peta-scale datasets with these techniques is prohibitively expensive in terms of time and energy. In this paper we introduce paraDSRC, a parallel implementation of the DNA Sequence Reads Compression (DSRC) application based on a message-passing model, which reduces the compression time complexity by a factor of O(1/p), where p is the number of processing units. Our experimental results show that paraDSRC achieves compression times 43% to 99% faster than DSRC and compression throughputs of up to 8.4 GB/s on a moderate-size cluster. For many of the datasets used in our experiments, super-linear speedups were registered, making the implementation strongly scalable. We also show that paraDSRC is more than 25.6x faster than comparable parallel compression algorithms.
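The abstract does not include implementation details, but the data-parallel pattern it describes can be sketched as follows. This is a minimal illustration only, assuming mpi4py is available and using zlib as a stand-in for the actual DSRC codec; the input path, chunking logic and compression level are hypothetical and not taken from paraDSRC.

# Sketch of message-passing block compression: each rank compresses an
# independent byte range, so wall-clock time shrinks roughly by a factor of p.
# mpi4py and zlib are stand-ins; "reads.fastq" is a placeholder input path.
from mpi4py import MPI
import os
import zlib

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()

if rank == 0:
    # Root splits the input into one contiguous byte range per rank.
    total = os.path.getsize("reads.fastq")
    step = total // size
    chunks = [(i * step, total if i == size - 1 else (i + 1) * step)
              for i in range(size)]
else:
    chunks = None

start, end = comm.scatter(chunks, root=0)

# Each rank reads and compresses only its own block, independently of the others.
with open("reads.fastq", "rb") as f:
    f.seek(start)
    block = f.read(end - start)
compressed = zlib.compress(block, level=6)

# Compressed block sizes are gathered at the root (a real tool would also write
# the blocks plus a small index so they can be decompressed in parallel).
sizes = comm.gather(len(compressed), root=0)
if rank == 0:
    print("compressed block sizes:", sizes)

Run with, e.g., mpiexec -n 4 python compress_sketch.py; a real FASTQ-aware splitter would align chunk boundaries to record boundaries, which is omitted here.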
ISBN:
(Print) 9781479979646
Various big data management systems have emerged to handle different types of applications, which place very different demands on the storage, indexing and retrieval of large amounts of data on distributed file systems. This diversity of demands poses major challenges for the design of a new generation of data access services for big data. In this paper, we present PABIRS, a unified data access middleware that supports mixed workloads. PABIRS encapsulates the underlying distributed file system (DFS) and provides a unified access interface to systems such as MapReduce and key-value stores. PABIRS achieves dramatic efficiency improvements by employing a novel hybrid indexing scheme: based on the data distribution, it adaptively builds a bitmap index or a Log-Structured Merge tree (LSM) index. Moreover, PABIRS distributes the computation to multiple index nodes and utilizes a Pregel-based algorithm to facilitate parallel data search and retrieval. We empirically evaluate PABIRS against other existing distributed data processing systems on real-life phone logs and the TPC-H benchmark, and verify its substantial advantages in response time, throughput and scalability.
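The abstract does not expose PABIRS's index API, so the following is only a rough sketch, under assumed names and an assumed cardinality threshold, of how an adaptive choice between a bitmap index and an LSM-style sorted run might look.

# Hypothetical sketch of adaptive indexing: low-cardinality columns get a
# bitmap index, high-cardinality columns get an LSM-style sorted run.
# The threshold and all names are illustrative, not PABIRS's actual API.
from collections import defaultdict
import bisect

BITMAP_CARDINALITY_LIMIT = 64  # assumed tuning knob

def build_index(column_values):
    if len(set(column_values)) <= BITMAP_CARDINALITY_LIMIT:
        # Bitmap index: one posting set of row ids per distinct value.
        bitmaps = defaultdict(set)
        for row_id, v in enumerate(column_values):
            bitmaps[v].add(row_id)
        return ("bitmap", bitmaps)
    # LSM-style index: a sorted (value, row_id) run that could later be merged
    # with newer runs, as in a Log-Structured Merge tree.
    return ("lsm", sorted((v, row_id) for row_id, v in enumerate(column_values)))

def lookup(index, value):
    kind, data = index
    if kind == "bitmap":
        return sorted(data.get(value, ()))
    lo = bisect.bisect_left(data, (value, -1))
    hi = bisect.bisect_right(data, (value, float("inf")))
    return [row_id for _, row_id in data[lo:hi]]

idx = build_index(["call", "sms", "call", "data", "call"])
print(lookup(idx, "call"))  # -> [0, 2, 4]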
ISBN:
(Print) 9789812879363; 9789812879356
Latent Semantic Indexing (LSI) is a well-known searching technique that matches queries to documents in information retrieval applications. LSI has been proven to improve retrieval performance; however, as document collections grow larger, current implementations are not fast enough to compute results on a standard personal computer. In this paper, we propose a new parallel LSI algorithm for standard personal computers with multicore processors to improve the performance of retrieving relevant documents. The proposed parallel LSI automatically runs the matrix computations of the LSI algorithm as parallel threads on multi-core processors, using the Fork-Join technique to execute the parallel programs. We used the Malay Translated Hadith of Shahih Bukhari, Jilid 1 to Jilid 4, as the test collection, comprising 2,028 text documents. The processing time of the document pre-processing phase for the proposed parallel LSI is measured and compared with the sequential LSI algorithm. Our results show that the pre-processing time of the proposed parallel LSI system is shorter than that of the sequential system; thus, the proposed parallel LSI algorithm improves searching time compared with the sequential LSI algorithm.
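As a rough illustration of the fork-join style described above, the sketch below runs document pre-processing in parallel threads and then performs the LSI step as a truncated SVD. It assumes NumPy; the tokenizer, worker count, rank k and sample documents are illustrative and not taken from the paper.

# Fork-join style sketch: tokenize documents in parallel ("fork"), collect the
# results ("join"), build the term-document matrix and take a truncated SVD.
from concurrent.futures import ThreadPoolExecutor
import numpy as np

def tokenize(doc):
    # Hypothetical pre-processing task executed once per document.
    return doc.lower().split()

def lsi(documents, k=2, workers=4):
    with ThreadPoolExecutor(max_workers=workers) as pool:
        token_lists = list(pool.map(tokenize, documents))   # fork / join

    vocab = sorted({t for tokens in token_lists for t in tokens})
    row_of = {t: i for i, t in enumerate(vocab)}

    # Term-document matrix of raw term counts.
    A = np.zeros((len(vocab), len(documents)))
    for j, tokens in enumerate(token_lists):
        for t in tokens:
            A[row_of[t], j] += 1

    # Truncated SVD yields the k-dimensional latent semantic space.
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return U[:, :k], s[:k], Vt[:k, :]

docs = ["trade requires honest weights", "honest trade is encouraged"]
U_k, s_k, Vt_k = lsi(docs, k=2)
print(Vt_k.shape)  # (2, 2): each column is a document in the latent space

Note that in CPython a process pool rather than a thread pool would be needed to get real speedups for CPU-bound pre-processing; the thread pool here only illustrates the fork-join structure.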
ISBN:
(Print) 9781479981281
Since the dawn of the big data era, the search giant Google has led the way in meeting the challenges of the new era. Results from Google's big data projects over the past decade have inspired the development of many other big data technologies, such as Apache Hadoop and NoSQL databases. This study examines ten major milestone papers on big data management published by Google, covering the Google File System (GFS), MapReduce, Bigtable, Chubby, Percolator, Pregel, Dremel, Megastore, Spanner and finally Omega. Its purpose is to provide a high-level understanding of the concepts behind many popular big data solutions and to derive insights for building robust and scalable systems that handle big data.
ISBN:
(Digital) 9789401796187
ISBN:
(Print) 9789401796187; 9789401796170
Because moving objects usually move on spatial networks in location-based service applications, their locations are updated frequently, degrading retrieval performance. To manage these frequent location updates efficiently, we propose a new distributed grid scheme that uses a node-based pre-computation technique to minimize the update cost of the moving objects' locations. Because our grid scheme manages spatial network data separately from the POIs (Points of Interest) and moving objects, it minimizes the update cost for both. On top of this grid scheme, we propose a new k-nearest neighbor (k-NN) query processing algorithm that minimizes the number of accesses to adjacent cells during POI retrieval by processing them in parallel. Finally, our performance analysis shows that our k-NN query processing algorithm outperforms the existing S-GRID in retrieval performance.
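The abstract does not detail the algorithm itself; the sketch below only illustrates the general idea of grid-partitioned k-NN search that expands to adjacent cells on demand. It assumes a uniform cell width and Euclidean distance, whereas the paper works over spatial networks with node-based pre-computed distances, so this is not the proposed algorithm.

# Grid-based k-NN sketch: POIs are bucketed into cells; the search visits the
# query cell first and expands ring by ring until the k-th best distance can
# no longer be improved by any unvisited cell. Cell width is an assumption.
import heapq, math
from collections import defaultdict

CELL = 1.0  # assumed cell width

def cell_of(p):
    return (int(p[0] // CELL), int(p[1] // CELL))

def build_grid(pois):
    grid = defaultdict(list)
    for poi in pois:
        grid[cell_of(poi)].append(poi)
    return grid

def knn(grid, q, k):
    k = min(k, sum(len(v) for v in grid.values()))
    if k == 0:
        return []
    qc, ring, found = cell_of(q), 0, []   # found: max-heap of (-dist, poi)
    while True:
        # Visit only the cells on the current square ring around the query cell.
        for dx in range(-ring, ring + 1):
            for dy in range(-ring, ring + 1):
                if max(abs(dx), abs(dy)) != ring:
                    continue
                for poi in grid.get((qc[0] + dx, qc[1] + dy), ()):
                    heapq.heappush(found, (-math.dist(q, poi), poi))
                    if len(found) > k:
                        heapq.heappop(found)
        # Every unvisited POI lies at least ring * CELL away from q.
        if len(found) == k and -found[0][0] <= ring * CELL:
            return sorted((-d, p) for d, p in found)
        ring += 1

grid = build_grid([(0.2, 0.3), (0.9, 0.8), (2.5, 1.1), (5.0, 5.0)])
print(knn(grid, (1.0, 1.0), k=2))  # two nearest POIs with their distances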
Two-way seamless communication is a key aspect of realizing the vision of the smart grid. Reliable and real-time information becomes the key factor for reliable delivery of power from the generating units to the end-u...
ISBN:
(Print) 9789897581038
Today, a multitude of highly connected applications and information systems hold, consume and produce huge amounts of heterogeneous data, and the overall volume of data is expected to increase dramatically in the future. To enable data analysis, visualizations or other value-adding scenarios, it is necessary to integrate specific, relevant parts of the data into a common source. Due to frequently changing environments and dynamic requests, this integration has to support ad-hoc and flexible data processing, as well as iterative, explorative trial-and-error integration across different data sources. To cope with these requirements, several data mashup platforms have been developed in the past. However, existing solutions are mostly non-extensible, monolithic systems or applications with many limitations regarding these requirements. In this paper, we introduce an approach that addresses these issues (i) by introducing patterns to decouple mashups from implementation details, (ii) by a cloud-ready design that enables availability and scalability, and (iii) by a high degree of flexibility and extensibility that enables the integration of heterogeneous data as well as dynamic (un-)tethering of data sources. We evaluate our approach using runtime measurements of our prototypical implementation.
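As a minimal, hypothetical sketch of the decoupling and dynamic (un-)tethering ideas mentioned above, the snippet below hides concrete sources behind a common adapter interface and lets them be attached or detached at runtime; none of the class or method names come from the paper's platform.

# Hypothetical mashup registry: pipelines talk only to the abstract DataSource
# interface, so concrete adapters can be tethered/untethered at runtime.
from abc import ABC, abstractmethod

class DataSource(ABC):
    @abstractmethod
    def fetch(self):
        ...

class CsvSource(DataSource):
    def __init__(self, path):
        self.path = path
    def fetch(self):
        with open(self.path) as f:
            return [line.rstrip("\n").split(",") for line in f]

class InMemorySource(DataSource):
    def __init__(self, rows):
        self.rows = rows
    def fetch(self):
        return list(self.rows)

class MashupRegistry:
    def __init__(self):
        self._sources = {}
    def tether(self, name, source):
        self._sources[name] = source      # attach a data source at runtime
    def untether(self, name):
        self._sources.pop(name, None)     # detach it again
    def integrate(self, names):
        # Ad-hoc integration: concatenate the rows of the selected sources.
        rows = []
        for n in names:
            rows.extend(self._sources[n].fetch())
        return rows

registry = MashupRegistry()
registry.tether("demo", InMemorySource([["a", 1], ["b", 2]]))
print(registry.integrate(["demo"]))
registry.untether("demo")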
ISBN:
(Print) 9783319214108; 9783319214092
In this article we study types of architectures for cloud processing and storage of data, data consolidation and enterprise storage. Special attention is given to the use of large data sets in computational processes. We show that, based on methods of theoretical analysis and experimental study of computer system architectures (including heterogeneous ones), special data processing techniques, models of the relevant architectures for large volumes of information, and methods of software optimization for heterogeneous systems, it is possible to integrate computer systems so that they can perform computations on very large data sets.
ISBN:
(Print) 9783319273082; 9783319273075
Many curricula for undergraduate studies in computer science include a lecture on the fundamentals of parallel programming, such as multi-threaded computation on shared-memory architectures using POSIX threads or OpenMP. The complex structure of parallel programs can be challenging, especially for inexperienced students; thus, there is a latent need for software supporting the learning process. Subsequent lectures may cover more advanced parallelization techniques such as the Message Passing Interface (MPI) and the Compute Unified Device Architecture (CUDA). Unfortunately, the majority of students cannot easily access MPI clusters or modern hardware accelerators in order to effectively develop parallel programming skills. To overcome this, we present an interactive tool to aid both educators and students in the learning process. This paper describes the "System for AUtomated Code Evaluation" (SAUCE), a web-based open source application for programming assignment evaluation (available under the AGPL-3.0 license at https://***/moschlar/SAUCE), and elaborates on its features specifically designed for teaching parallel programming. The tool enables educators to provide the required programming environments with a low barrier to entry, since it is usable with just a web browser. SAUCE allows for immediate feedback and can therefore be used interactively in classroom settings.
Designing and implementing distributed systems is a hard endeavor, both at an abstract level when designing the system, and at a concrete level when implementing, debugging and evaluating it. This stems not only from ...