This paper presents an advanced iterative MapReduce solution that employs Hadoop and MPI technologies. First, we present an overview of working implementations that make use of the same technologies. Then we define an academic example of a numerical problem, with an emphasis on its computational features. This definition is used to justify the proposed solution design.
In recent years we have seen how Cloud Computing is changing the way of doing businesses and how services are delivered over the Internet. This disruption is a major challenge for Service Providers and Independent Sof...
This paper proposes an FPGA-based layered architecture for a quasi-cyclic (QC) irregular LDPC decoder. Our approach is based on merging variable and check node processing into a single variable-check node (VCN) unit. Layer message computation is done using a parallel scheme in which the number of VCNs equals the expansion factor of the QC matrix. The proposed architecture is characterized by the serial processing of the a posteriori LLRs by an FPGA-specific, high-frequency VCN unit implementation using ROM memories. In our approach, data conversions as well as adders and comparators are replaced by look-up tables implemented using distributed RAM. In addition, other techniques, such as efficient packing of LLR messages, check-node message compression, and the configurable port width of the FPGA's BRAM, are used to reduce BRAM block utilization. Throughput is increased through pipelining, parallel processing of multiple VCNs, and a relatively high working frequency. Implementation results for the WiMAX (1152, 2304) QC irregular LDPC code indicate that the proposed architecture uses up to 3x fewer slices and up to one order of magnitude fewer BRAM blocks than other approaches, while maintaining a throughput of several hundred Mbps (800 Mbps coded bits). We achieved this without sacrificing flexibility; therefore the design can easily be adapted to accommodate different code rates.
Limited battery power has long been a challenge for mobile applications. As a result, work on power monitoring and management has attracted great interest. In this paper, we propose a model to estimate the power consumption of mobile applications at run-time, based on application-specific per-action power profiling. In addition, we have developed on-line optimization techniques that help maximize the user experience while conserving power. Our power model is lightweight and flexible, in that it can be used by any mobile application as a plugin, and it can support user-defined optimization mechanisms. This approach has been evaluated using a case study, a mobile application for field studies. The experimental results show that our model accurately captures the power consumption of the application and can be used to optimize power consumption based on users' needs.
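The idea of per-action profiling can be illustrated with a minimal sketch. It assumes a simple additive model in which each action type has a fixed, pre-measured energy cost; the action names, the cost values, and the function `estimate_energy` are hypothetical and are not taken from the paper.

```python
from collections import Counter

# Hypothetical per-action energy profile in joules.
# The action names and values are made up for illustration only.
PROFILE_J = {"gps_fix": 1.5, "photo_capture": 0.75, "upload_record": 2.0}


def estimate_energy(action_log, profile=PROFILE_J):
    """Estimate energy as the sum of (occurrences of an action) x (its profiled cost).

    Assumes an additive model; an action missing from the profile raises KeyError.
    """
    counts = Counter(action_log)
    return sum(profile[action] * n for action, n in counts.items())
```

An application using such a model would log each action as it occurs and could query the running estimate to decide, for example, whether to defer an expensive action.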
The Hamming weight (also known as population count) of a bitstring is the number of 1s in the bitstring. It has applications in fields such as cryptography, chemical informatics and information theory. Typical bitstring lengths range from the processor's word length to several thousand bits. A plethora of Hamming weight algorithms have been proposed. While some implementations expose just scalar parallelism, others expose vector parallelism. Moreover, some implementations use special machine instructions that compute the Hamming weight of a processor word. This paper presents a new hybrid scalar-vector Hamming weight implementation that exposes both scalar and vector parallelism. This implementation will be useful on platforms that can exploit both kinds of parallelism simultaneously. On a Sandy Bridge platform, our hybrid implementation outperforms the best scalar and vector implementations known to us by up to 1.23x and 1.6x, respectively.
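For readers unfamiliar with the operation, the following sketch shows one well-known scalar technique, the SWAR (SIMD-within-a-register) bit trick, applied to 64-bit words. It is not the hybrid implementation from the paper; the function names `popcount64` and `hamming_weight` are chosen here for illustration, and the bitstring is assumed to be given as a sequence of 64-bit words.

```python
MASK64 = 0xFFFFFFFFFFFFFFFF  # keep intermediate values within 64 bits


def popcount64(x: int) -> int:
    """Hamming weight of one 64-bit word using the classic SWAR bit trick."""
    x &= MASK64
    # Count bits in each 2-bit field.
    x = x - ((x >> 1) & 0x5555555555555555)
    # Sum neighbouring 2-bit counts into 4-bit fields.
    x = (x & 0x3333333333333333) + ((x >> 2) & 0x3333333333333333)
    # Sum neighbouring 4-bit counts into 8-bit fields.
    x = (x + (x >> 4)) & 0x0F0F0F0F0F0F0F0F
    # The multiplication accumulates all eight byte counts into the top byte.
    return ((x * 0x0101010101010101) & MASK64) >> 56


def hamming_weight(words) -> int:
    """Hamming weight of a bitstring stored as an iterable of 64-bit words."""
    return sum(popcount64(w) for w in words)
```

Vector implementations apply the same kind of partial-sum reduction across many words per instruction, while dedicated machine instructions replace `popcount64` with a single operation per word.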
Multi-core phones are now pervasive. Yet existing applications rely predominantly on a client-server computing paradigm, using phones only as thin clients that send sensed information via the cellular network to servers for processing. This makes the cellular network the bottleneck, limiting overall application performance. In this paper, we propose Mobi Streams, a Distributed Stream Processing System (DSPS) that runs directly on smartphones. Mobi Streams can offload computing from remote servers to local phones and thus alleviate the pressure on the cellular network. Implementing a DSPS on smartphones faces significant challenges: 1) multiple phones can readily fail simultaneously, and 2) the phones' ad-hoc WiFi network has low bandwidth. Mobi Streams tackles these challenges through two new techniques: 1) token-triggered checkpointing, and 2) broadcast-based checkpointing. Our evaluations, driven by two real-world applications deployed in the US and Singapore, show that migrating from a server platform to a smartphone platform eliminates the cellular network bottleneck, leading to a 0.78x to 42.6x throughput increase and a 10% to 94.8% latency decrease. Also, Mobi Streams' fault tolerance scheme increases throughput by 230% and reduces latency by 40% compared with prior state-of-the-art fault-tolerant DSPSs.
ISBN: 9781479955497 (print)
Two camps of file systems exist: parallel file systems designed for conventional high performance computing (HPC) and distributed file systems designed for newly emerged data-intensive applications. Addressing the big data challenge requires an approach that utilizes both high performance computing and data-intensive computing power. Thus, HPC applications may need to interact with distributed file systems, such as HDFS. The N-1 (N-to-1) parallel file write is a critical technical challenge, because it is very common in HPC applications but HDFS does not allow it. This study introduces a system solution, named SCALER, which allows MPI-based applications to directly access HDFS without extra data movement. SCALER supports N-1 file writes at both the inter-block level and the intra-block level. Experimental results confirm that SCALER achieves the design goal efficiently.
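The N-1 write pattern itself can be sketched on an ordinary local file: N writers each own a disjoint byte range of one shared file, with offsets computed from the sizes of lower-ranked chunks. This is only an illustration of the access pattern, simulated sequentially with standard file I/O; it does not use MPI or HDFS and does not reflect how SCALER is implemented. The function name `n_to_1_write` is hypothetical.

```python
import os
import tempfile


def n_to_1_write(path, chunks):
    """Simulate N writers, where writer i stores chunks[i] in its own range of one file.

    Returns the list of byte offsets, one per writer.
    """
    # Offset of writer i is the total size of all lower-ranked chunks
    # (an exclusive prefix sum over chunk lengths).
    offsets = [0]
    for chunk in chunks[:-1]:
        offsets.append(offsets[-1] + len(chunk))

    # Pre-size the shared file so every writer can seek to its range.
    with open(path, "wb") as f:
        f.truncate(sum(len(c) for c in chunks))

    # Writers can proceed in any order because their ranges are disjoint;
    # reverse order is used here to make that explicit.
    for rank in reversed(range(len(chunks))):
        with open(path, "r+b") as f:
            f.seek(offsets[rank])
            f.write(chunks[rank])
    return offsets


if __name__ == "__main__":
    shared = os.path.join(tempfile.mkdtemp(), "shared.bin")
    print(n_to_1_write(shared, [b"aaa", b"bb", b"cccc"]))
```

A file system that only permits a single writer appending to a file cannot serve this pattern directly, which is the gap the abstract describes.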
ISBN: 9781509066070 (print)
We present Realm, an event-based runtime system for heterogeneous, distributed memory machines. Realm is fully asynchronous: all runtime actions are non-blocking. Realm supports spawning computations, moving data, and reservations, a novel synchronization primitive. Asynchrony is exposed via a light-weight event system capable of operating without central management. We describe an implementation of Realm that relies on a novel generational event data structure for efficiently handling large numbers of events in a distributed address space. Microbenchmark experiments show our implementation of Realm approaches the underlying hardware performance limits. We measure the performance of three real-world applications on the Keeneland supercomputer. Our results demonstrate that Realm confers considerable latency hiding to clients, attaining significant speedups over traditional bulk-synchronous and independently optimized MPI codes.
ISBN: 9781509066070 (print)
In the last few years, GPUs have become an integral part of HPC clusters. To test these heterogeneous CPU-GPU systems, we designed a hybrid CUDA-MPI benchmark suite that consists of three communication- and compute-intensive applications: Matrix Multiplication (MM), Needleman-Wunsch (NW) and the ADFA compression algorithm [1]. The main goal of this work is to characterize these workloads on CPU-GPU clusters. Our benchmark applications are designed to allow cluster administrators to identify bottlenecks in the cluster, to decide whether scaling applications to multiple nodes would improve or reduce overall throughput, and to design effective scheduling policies. Our experiments show that inter-node communication can significantly degrade the throughput of communication-intensive applications. We conclude that the scalability of the applications depends primarily on two factors: the cluster configuration and the application characteristics.
The MapReduce paradigm is one of the best solutions for implementing distributed applications that perform intensive data processing. For this type of application, the performance of MapReduce can be improved by adding GPU capabilities. In this context, GPU clusters for large-scale computing can bring a considerable increase in the efficiency and speedup of data-intensive applications. In this article we present a framework for executing MapReduce using GPU programming. We describe several improvements to the concept of GPU MapReduce and we compare our solution with others.
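As a reminder of the programming model, here is a minimal single-process sketch of the map, shuffle and reduce phases, using word counting as the example job. It runs on the CPU only and is not the GPU framework described in the article; the names `map_reduce`, `word_count_mapper` and `sum_reducer` are hypothetical.

```python
from collections import defaultdict


def map_reduce(records, mapper, reducer):
    """Run a MapReduce job in one process: map, group by key, then reduce."""
    groups = defaultdict(list)
    for record in records:
        # Map phase: each record yields zero or more (key, value) pairs.
        for key, value in mapper(record):
            # Shuffle phase: collect values that share a key.
            groups[key].append(value)
    # Reduce phase: combine the values of each key independently.
    return {key: reducer(key, values) for key, values in groups.items()}


def word_count_mapper(line):
    """Emit (word, 1) for every whitespace-separated word in a line."""
    return [(word, 1) for word in line.split()]


def sum_reducer(key, values):
    """Add up all counts emitted for one word."""
    return sum(values)
```

Because each record is mapped independently and each key is reduced independently, both phases are natural candidates for parallel execution, which is what makes the model attractive for clusters and GPUs.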