In the present information era, huge amounts of data must be processed daily. In contrast to conventional sequential data-processing techniques, parallel data-processing approaches can expedite these processes and deal with big data more efficiently. In the last few decades, neural computation has emerged as a popular area for parallel and distributed data processing. The data-processing applications of neural computation include, but are not limited to, data sorting, data selection, data mining, data fusion, and data reconciliation. In this talk, neurodynamic approaches to parallel data processing will be introduced, reviewed, and compared. In particular, my talk will compare several mathematical formulations of the well-known multiple-winners-take-all problem and present several recurrent neural networks of decreasing model complexity. Finally, the model with the lowest complexity and highest computational efficiency will be highlighted. Analytical and Monte Carlo simulation results will be shown to demonstrate the computing characteristics and performance of the continuous-time and discrete-time models. Applications to parallel sorting, rank-order filtering, and data retrieval will also be discussed.
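The multiple-winners-take-all (k-WTA) operation above selects the k largest of n inputs. As an illustrative sketch only (not the talk's recurrent-network models), the selection can be cast as finding a threshold that exactly k inputs exceed; here a simple bisection stands in for the network's threshold dynamics:

```python
import numpy as np

def kwta(u, k, iters=60):
    # k-winners-take-all by thresholding: bisect for a threshold t
    # such that exactly k components of u exceed it (assumes the
    # inputs are distinct). This is a stand-in for the threshold
    # dynamics of a recurrent k-WTA network.
    lo, hi = u.min() - 1.0, u.max() + 1.0
    for _ in range(iters):
        t = 0.5 * (lo + hi)
        if (u > t).sum() > k:
            lo = t          # threshold too low: too many winners
        else:
            hi = t          # k or fewer winners: threshold can drop
    return (u > hi).astype(int)
```

For example, `kwta(np.array([0.1, 0.9, 0.5, 0.7]), 2)` marks the two largest inputs as winners.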
MapReduce has become a dominant parallel computing paradigm for storing and processing massive data due to its excellent scalability, reliability, and elasticity. In this paper, we present a new architecture of Distributed Beta Wavelet Networks (DBWN) for large-scale image classification in the MapReduce model. First, to demonstrate the performance of wavelet networks, a parallelized learning algorithm based on the Beta wavelet transform is proposed. Then, the structure of the DBWN is detailed, and the new algorithm is realized in the MapReduce model. Comparisons with the Fast Beta Wavelet Network (FBWN) are presented and discussed. The results show that the DBWN model outperforms the FBWN model in both classification rate and training run time.
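For readers unfamiliar with the dataflow the paper targets, a minimal sequential model of MapReduce can be sketched as follows (illustrative only; `mapper` and `reducer` are hypothetical user-supplied callables, and a real MapReduce runtime distributes the shuffle and reduce phases across machines):

```python
from collections import defaultdict

def map_reduce(records, mapper, reducer):
    # Minimal sequential model of the MapReduce dataflow: map each
    # record to (key, value) pairs, shuffle by key, then reduce each
    # key's values independently -- the step that is trivially
    # parallel across keys.
    shuffled = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            shuffled[key].append(value)
    return {key: reducer(key, values) for key, values in shuffled.items()}
```

The classic word count fits this shape: `map_reduce(lines, lambda l: [(w, 1) for w in l.split()], lambda k, vs: sum(vs))`.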
ISBN (print): 9781467365994
Data clustering is usually time-consuming, since by default it must iteratively aggregate and process large volumes of data. Approximate aggregation based on sampling provides fast results with assured quality. In this paper, we propose applying approximation techniques to data clustering to trade off clustering efficiency against result quality, along with online accuracy estimation. The proposed method is based on bootstrap trials. We implemented this method as an Intelligent Bootstrap Library (IBL) on Spark to support efficient data clustering. Intensive evaluations show that IBL can provide a 2x speed-up over the state-of-the-art solution with the same error bound.
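The bootstrap-trial idea can be sketched generically: resample the data with replacement, recompute the aggregate on each resample, and report the spread of the trial results as an online accuracy estimate. This is a generic sketch, not IBL's Spark implementation; `stat` is a hypothetical user-supplied aggregate:

```python
import random

def bootstrap_ci(data, stat, trials=200, alpha=0.05, seed=0):
    # Bootstrap accuracy estimation: draw `trials` resamples with
    # replacement, compute the statistic on each, and return an
    # empirical (1 - alpha) confidence interval from the sorted
    # trial statistics.
    rng = random.Random(seed)
    n = len(data)
    stats = sorted(stat([data[rng.randrange(n)] for _ in range(n)])
                   for _ in range(trials))
    lo = stats[int(alpha / 2 * trials)]
    hi = stats[int((1 - alpha / 2) * trials) - 1]
    return lo, hi
```

For a sampled aggregate such as a cluster centroid, the width of this interval tells the system online whether the current sample size already meets the user's error bound.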
How to effectively distribute and share increasingly large volumes of data in large-scale network applications is a key challenge for Internet infrastructure. Although NDN, a promising future internet architecture that takes a data-oriented approach to transfer, aims to meet such needs better than IP, it still faces problems such as redundant data transmission and inefficient in-network cache utilization. This paper applies network coding techniques to NDN to improve network throughput and efficiency. The merit of our design is that it avoids duplicate and unproductive data delivery while transferring disjoint data segments along multiple paths, without excessive modification to NDN fundamentals. To quantify the performance benefits of applying network coding in NDN, we integrate network coding into an NDN streaming-media system implemented in the ndnSIM simulator. Using BRITE-generated network topologies in our simulation, the experimental results clearly demonstrate that network coding in NDN can significantly improve performance, reliability, and QoS. More importantly, our approach is well suited to delivering growing Big Data applications, including high-performance, high-density video streaming services.
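To illustrate why network coding avoids duplicate delivery, here is a toy sketch of random linear network coding over GF(2), with data segments modeled as integers: any k linearly independent coded packets suffice to recover the k source segments, regardless of which paths delivered them. This is an illustration of the general technique, not the paper's NDN integration:

```python
import random

def encode(segments, rng):
    # One coded packet: the XOR of a random nonzero subset of the k
    # source segments (random linear coding over GF(2)).
    k = len(segments)
    mask = rng.randrange(1, 1 << k)        # which segments are mixed in
    payload = 0
    for i in range(k):
        if mask >> i & 1:
            payload ^= segments[i]
    return mask, payload

def decode(packets, k):
    # Gaussian elimination over GF(2): recover the k segments once k
    # linearly independent coded packets have arrived.
    basis = {}                             # pivot bit -> (mask, payload)
    for mask, payload in packets:
        while mask:
            pivot = mask & -mask           # lowest set bit
            if pivot not in basis:
                basis[pivot] = (mask, payload)
                break
            m, p = basis[pivot]            # cancel against existing row
            mask ^= m
            payload ^= p
    if len(basis) < k:
        return None                        # not enough innovative packets
    for pivot in sorted(basis, reverse=True):   # back-substitution
        mask, payload = basis[pivot]
        extra = mask ^ pivot
        while extra:
            b = extra & -extra
            _, p = basis[b]                # higher rows already reduced
            payload ^= p
            extra ^= b
        basis[pivot] = (pivot, payload)
    return [basis[1 << i][1] for i in range(k)]
```

An intermediate node can XOR packets together without knowing the topology; a receiver simply collects packets until `decode` succeeds, so retransmitting any particular segment copy is unnecessary.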
The emerging applications in big data and social networks place rapidly increasing demands on graph processing. Graph query operations that involve a large number of vertices and edges can be tremendously slow on traditional databases. State-of-the-art graph processing systems and databases usually adopt a master/slave architecture that potentially impairs their scalability. This work describes the design and implementation of a new graph processing system based on the Bulk Synchronous Parallel model. Our system is built on top of ZHT, a scalable distributed key-value store, which benefits graph processing in terms of scalability, performance, and persistence. The experimental results demonstrate excellent scalability.
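The Bulk Synchronous Parallel model structures a computation into supersteps separated by barriers, with vertices exchanging messages between supersteps. A minimal single-process sketch (single-source shortest paths; a plain dict stands in for the distributed ZHT key-value store) might look like:

```python
def bsp_sssp(adj, source):
    # BSP single-source shortest paths: in each superstep, every
    # active vertex sends tentative distances to its neighbors; a
    # barrier separates supersteps, and the run halts when no vertex
    # receives an improving message.
    dist = {v: float("inf") for v in adj}
    dist[source] = 0
    active = {source}
    while active:                      # one loop iteration == one superstep
        inbox = {}
        for u in active:               # "compute + send" phase
            for v, w in adj[u]:
                d = dist[u] + w
                if d < inbox.get(v, float("inf")):
                    inbox[v] = d
        active = set()                 # implicit barrier here
        for v, d in inbox.items():     # apply received messages
            if d < dist[v]:
                dist[v] = d
                active.add(v)
    return dist
```

In a distributed setting, `dist` would be partitioned across workers and persisted in the key-value store, and the barrier would be a cluster-wide synchronization.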
Fast-increasing volumes of spatial data have made it imperative to develop scalable and efficient spatial data management techniques that leverage modern parallel hardware and distributed systems. By integrating a leading open-source Big Data system called Impala with our previous work on data-parallel designs for spatial indexing and query processing, we have developed ISP-MC+ and ISP-GPU for large-scale spatial data management on computer clusters equipped with multi-core CPUs and Graphics Processing Units (GPUs), respectively. Both ISP-MC+ and ISP-GPU have shown high efficiency and good scalability on a 10-node Amazon EC2 cluster equipped with multi-core CPUs and GPUs. Comparisons with a baseline implementation using traditional techniques on a single CPU core demonstrate orders-of-magnitude speedups on a real-world dataset with hundreds of millions of point locations.
Many-core systems are basically designed for applications with large data parallelism. Strassen Matrix Multiply (MM) can be formulated as a depth-first (DFS) traversal of a recursion tree where all cores work in parallel on computing each of the N×N sub-matrices, which reduces storage at the cost of large data motion to gather and aggregate the results. We propose Strassen and Winograd algorithms (S-MM and W-MM) based on three optimizations: a set of basic algebra functions to reduce overhead, invoking an efficient library (CUBLAS 5.5), and parameter tuning of a parametric kernel to improve resource occupancy. On GPUs, W-MM and S-MM with one recursion level outperform the CUBLAS 5.5 library by up to a factor of two for large arrays satisfying N>=2048 and N>=3072, respectively. Compared to the NVIDIA SDK library, S-MM and W-MM achieved speedups between 20x and 80x for the above arrays. The proposed approach can be used to enhance the performance of the CUBLAS and MKL libraries.
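For reference, one recursion level of Strassen's algorithm replaces the 8 block products of naive blocked multiplication with 7, delegating the leaf products to the vendor BLAS (here NumPy's `@`, standing in for the paper's CUBLAS leaves). A minimal sketch for even N:

```python
import numpy as np

def strassen_one_level(A, B):
    # One recursion level of Strassen: 7 sub-multiplications (M1..M7)
    # instead of 8; sub-block products go to the optimized BLAS.
    n = A.shape[0] // 2
    A11, A12, A21, A22 = A[:n, :n], A[:n, n:], A[n:, :n], A[n:, n:]
    B11, B12, B21, B22 = B[:n, :n], B[:n, n:], B[n:, :n], B[n:, n:]
    M1 = (A11 + A22) @ (B11 + B22)
    M2 = (A21 + A22) @ B11
    M3 = A11 @ (B12 - B22)
    M4 = A22 @ (B21 - B11)
    M5 = (A11 + A12) @ B22
    M6 = (A21 - A11) @ (B11 + B12)
    M7 = (A12 - A22) @ (B21 + B22)
    C = np.empty_like(A)               # assemble the result blocks
    C[:n, :n] = M1 + M4 - M5 + M7
    C[:n, n:] = M3 + M5
    C[n:, :n] = M2 + M4
    C[n:, n:] = M1 - M2 + M3 + M6
    return C
```

The temporaries M1..M7 are the storage/data-motion trade-off the abstract refers to: fewer multiplications, but extra block additions and intermediate buffers to gather.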
Power costs account for up to half of the operational expenses of data centers, making power management a critical concern. Advances in processor technology provide fine-grained control over the operating frequency of processors, and this control can be used to trade power for performance. We show that existing power models incorrectly assume a quadratic relationship between power and frequency, leading to inaccurate predictions. Moreover, existing performance models have significant error margins when predicting the performance of memory- or file-intensive tasks and HPC applications, because they neglect the combined effects of frequency and CPU-utilization variations on task execution time. In this paper, we empirically derive power and completion-time models using linear regression with CPU utilization and operating frequency as parameters. We validate our power model on several Intel and AMD processors, predicting within 2-7% of measured power. We validate our completion-time model using five kernels of the NAS Parallel Benchmark suite and five CPU-, memory-, and file-intensive benchmarks on four heterogeneous systems, predicting within 1-6% of observed performance.
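The regression approach can be sketched in a few lines: fit power as a linear function of CPU utilization and operating frequency by least squares. The exact regressor form below is illustrative; the paper derives its model terms empirically:

```python
import numpy as np

def fit_power_model(util, freq, power):
    # Linear regression P ~ b0 + b1*u + b2*f, with CPU utilization u
    # and operating frequency f as predictors (illustrative model
    # form). Returns the fitted coefficients [b0, b1, b2].
    X = np.column_stack([np.ones_like(util), util, freq])
    coef, *_ = np.linalg.lstsq(X, power, rcond=None)
    return coef
```

Given measured (utilization, frequency, power) triples from a calibration run, the fitted coefficients then predict power for unseen operating points.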
The Advanced Encryption Standard (AES) plays an important role in modern cryptographic applications, and high-performance implementations of AES are required for many application scenarios. Parallelization techniques have become popular in recent years for improving performance. In this paper we propose two parallel AES schemes: one a full software implementation, and the other a software implementation with a hardware accelerator. These schemes are implemented on two 4-core clusters with a shared-memory architecture. The experimental results show that our parallel schemes perform well compared with related work, with the two schemes achieving speedups of 4.92 and 9.78, respectively. Throughput reaches 176.48 Mbps when using the hardware accelerator.
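One common way to parallelize a block cipher, shown here as an illustration of the general idea rather than the paper's schemes, is counter (CTR) mode: block i's keystream depends only on (nonce, i), so blocks can be enciphered concurrently. The `fake_block_cipher` below is a hypothetical stand-in for a real AES primitive, used only to keep the sketch standard-library-only:

```python
from concurrent.futures import ThreadPoolExecutor
from hashlib import sha256

def fake_block_cipher(key, block16):
    # HYPOTHETICAL stand-in for one AES-128 block encryption; real
    # code would call an actual AES implementation here.
    return sha256(key + block16).digest()[:16]

def ctr_encrypt_parallel(key, nonce, plaintext, workers=4):
    # CTR mode is embarrassingly parallel: each block's keystream is
    # derived independently from (nonce, counter), so the blocks can
    # be enciphered concurrently and XORed with the plaintext.
    blocks = [plaintext[i:i + 16] for i in range(0, len(plaintext), 16)]

    def keystream(i):
        return fake_block_cipher(key, nonce + i.to_bytes(8, "big"))

    with ThreadPoolExecutor(max_workers=workers) as pool:
        ks = list(pool.map(keystream, range(len(blocks))))
    return b"".join(bytes(a ^ b for a, b in zip(blk, k))
                    for blk, k in zip(blocks, ks))
```

Because CTR encryption and decryption are the same XOR operation, applying the function twice recovers the plaintext; in CPython, a real parallel speedup would come from multiprocessing or a GIL-releasing cipher library rather than threads.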
The vast computing power of GPUs makes them an attractive platform for accelerating large-scale data-parallel computations such as popular graph-processing applications. However, the inherent irregularity and large sizes of real-world power-law graphs make effective use of GPUs a major challenge. In this paper we develop techniques that greatly enhance the performance and scalability of vertex-centric graph processing on GPUs. First, we present Warp Segmentation, a novel method that greatly enhances GPU device utilization by dynamically assigning an appropriate number of SIMD threads to process a vertex with an irregular-sized neighbor list, while employing the compact CSR representation to maximize the graph size that can be kept inside GPU global memory. Prior works can either maximize graph size (VWC uses the CSR representation) or device utilization (e.g., CuSha uses the CW representation; however, CW is roughly 2.5x the size of CSR). Second, we further scale graph processing to multiple GPUs and propose Vertex Refinement to address the challenge of judiciously using the limited bandwidth available for transferring data between GPUs over the PCIe bus. Vertex Refinement employs a parallel binary prefix sum to dynamically collect only the updated boundary vertices into the GPUs' outbox buffers, dramatically reducing inter-GPU data transfer volume, whereas existing multi-GPU techniques (Medusa, TOTEM) perform a high degree of wasteful vertex transfers. On a single GPU, our framework delivers average speedups of 1.29x to 2.80x over VWC. When scaled to multiple GPUs, our framework achieves up to 2.71x performance improvement compared to the inter-GPU vertex communication schemes used by other multi-GPU techniques (i.e., Medusa, TOTEM).
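The prefix-sum compaction behind Vertex Refinement can be illustrated on the CPU: an exclusive prefix sum over the "updated" flags gives each updated vertex its slot in a dense outbox buffer. A NumPy sketch (a sequential stand-in for the GPU-parallel scan):

```python
import numpy as np

def compact_updated(vertex_ids, updated):
    # Stream compaction via exclusive prefix sum: each updated vertex
    # reads its outbox slot from the running count of updated flags,
    # so only updated boundary vertices are transferred.
    flags = updated.astype(np.int64)
    slots = np.cumsum(flags) - flags                # exclusive prefix sum
    outbox = np.empty(flags.sum(), dtype=vertex_ids.dtype)
    outbox[slots[updated]] = vertex_ids[updated]    # scatter into outbox
    return outbox
```

On a GPU this scatter is contention-free because the prefix sum assigns every updated vertex a unique slot, which is what makes collecting the outbox fully parallel.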