Limited battery power has long been a challenge for mobile applications. As a result, work on power monitoring and management has attracted great interest. In this paper, we propose a model to estimate the power consumption of mobile applications at run-time, based on application-specific per-action power profiling. In addition, we have developed on-line optimization techniques which help maximize users' experience while conserving power. Our power model is lightweight and flexible, in that it can be used by any mobile application as a plugin, and it can support user-defined optimization mechanisms. This approach has been evaluated through a case study of a mobile application for field studies, and the experimental results show that our model accurately captures the power consumption of the application and can be used to optimize power consumption based on users' needs.
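The core of a per-action power model can be sketched very simply: each action type carries a measured energy cost, and run-time consumption is the cost-weighted count of actions. The profile below is purely illustrative (the action names and joule values are invented, not taken from the paper):

```python
# Hypothetical per-action energy profile, in joules per action.
# A real profile would be measured offline for each application,
# as the per-action profiling in the abstract suggests.
PROFILE = {"gps_fix": 1.4, "camera_shot": 2.0, "upload_kb": 0.007}

def estimate_energy(action_counts: dict) -> float:
    """Estimate total energy as the sum of per-action cost times count.

    Unknown actions contribute zero; a real model would flag them
    for re-profiling instead.
    """
    return sum(PROFILE.get(action, 0.0) * count
               for action, count in action_counts.items())
```

An on-line optimizer in this style could, for example, throttle the action type with the highest cumulative cost when a user-defined budget is exceeded.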
The Hamming weight (also known as population count) of a bitstring is the number of 1's in the bitstring. It has applications in fields such as cryptography, chemical informatics, and information theory. Typical bitstring lengths range from the processor's word length to several thousand bits. A plethora of Hamming weight algorithms have been proposed. While some implementations expose only scalar parallelism, others expose vector parallelism. Moreover, some implementations use special machine instructions that compute the Hamming weight of a processor's word. This paper presents a new hybrid scalar-vector Hamming weight implementation that exposes both scalar and vector parallelism. This implementation will be useful on platforms that can exploit both kinds of parallelism simultaneously. On a Sandy Bridge platform, our hybrid implementation outperforms the, to the best of our knowledge, best scalar and vector implementations by up to 1.23X and 1.6X, respectively.
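Two classic scalar formulations illustrate what such implementations compute. The first clears one set bit per iteration; the second is the branch-free SWAR ("SIMD within a register") form for a 64-bit word, in the spirit of the word-level tricks scalar popcount baselines use (this is a generic textbook version, not the paper's code):

```python
def popcount_naive(x: int) -> int:
    """Count set bits by clearing the lowest set bit each iteration."""
    n = 0
    while x:
        x &= x - 1  # clears the lowest set bit
        n += 1
    return n

def popcount_swar64(x: int) -> int:
    """Branch-free SWAR popcount for a 64-bit word.

    Pairs of bits, then nibbles, then bytes are summed in parallel;
    the final multiply accumulates all byte counts into the top byte.
    """
    x -= (x >> 1) & 0x5555555555555555
    x = (x & 0x3333333333333333) + ((x >> 2) & 0x3333333333333333)
    x = (x + (x >> 4)) & 0x0F0F0F0F0F0F0F0F
    return ((x * 0x0101010101010101) & 0xFFFFFFFFFFFFFFFF) >> 56
```

On hardware, the scalar path would typically use the `POPCNT` instruction instead, and the vector path would process several words per iteration; the hybrid approach in the abstract runs both kinds of units at once.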
Multi-core phones are now pervasive. Yet, existing applications rely predominantly on a client-server computing paradigm, using phones only as thin clients that send sensed information via the cellular network to servers for processing. This makes the cellular network the bottleneck, limiting overall application performance. In this paper, we propose MobiStreams, a distributed stream processing system (DSPS) that runs directly on smartphones. MobiStreams can offload computing from remote servers to local phones and thus alleviate the pressure on the cellular network. Implementing a DSPS on smartphones faces significant challenges: 1) multiple phones can readily fail simultaneously, and 2) the phones' ad-hoc WiFi network has low bandwidth. MobiStreams tackles these challenges through two new techniques: 1) token-triggered checkpointing, and 2) broadcast-based checkpointing. Our evaluations, driven by two real-world applications deployed in the US and Singapore, show that migrating from a server platform to a smartphone platform eliminates the cellular network bottleneck, leading to a 0.78X~42.6X throughput increase and a 10%~94.8% latency decrease. Also, MobiStreams' fault tolerance scheme increases throughput by 230% and reduces latency by 40% vs. prior state-of-the-art fault-tolerant DSPSs.
ISBN (print): 9781479955497
Two camps of file systems exist: parallel file systems designed for conventional high performance computing (HPC), and distributed file systems designed for newly emerged data-intensive applications. Addressing the big data challenge requires an approach that utilizes both high performance computing and data-intensive computing power. Thus, HPC applications may need to interact with distributed file systems such as HDFS. The N-1 (N-to-1) parallel file write is a critical technical challenge, because it is very common in HPC applications but is not allowed by HDFS. This study introduces a system solution, named SCALER, which allows MPI-based applications to directly access HDFS without extra data movement. SCALER supports N-1 file writes at both the inter-block and intra-block levels. Experimental results confirm that SCALER achieves its design goal efficiently.
ISBN (print): 9781509066070
We present Realm, an event-based runtime system for heterogeneous, distributed memory machines. Realm is fully asynchronous: all runtime actions are non-blocking. Realm supports spawning computations, moving data, and reservations, a novel synchronization primitive. Asynchrony is exposed via a lightweight event system capable of operating without central management. We describe an implementation of Realm that relies on a novel generational event data structure for efficiently handling large numbers of events in a distributed address space. Microbenchmark experiments show that our implementation of Realm approaches the underlying hardware performance limits. We measure the performance of three real-world applications on the Keeneland supercomputer. Our results demonstrate that Realm confers considerable latency hiding to clients, attaining significant speedups over traditional bulk-synchronous and independently optimized MPI codes.
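The generational-event idea can be illustrated with a toy sketch: one object stands for a whole sequence of events, each identified by a generation number, so a completed generation needs no per-event storage. This is a simplified single-process illustration of the concept, not Realm's distributed implementation (the class and method names here are invented):

```python
class GenerationalEvent:
    """Toy sketch of a generational event.

    One object represents many sequential events; generation `g` has
    triggered once the trigger count reaches g, so querying an old
    generation is a single integer comparison and needs no storage.
    """
    def __init__(self):
        self._next_gen = 0   # generations handed out so far
        self._triggered = 0  # generations already triggered
        self._waiters = {}   # generation -> callbacks not yet runnable

    def advance(self) -> int:
        """Reserve the next generation and return its number."""
        self._next_gen += 1
        return self._next_gen

    def trigger(self) -> None:
        """Trigger the oldest untriggered generation; run its waiters."""
        self._triggered += 1
        for cb in self._waiters.pop(self._triggered, []):
            cb()

    def has_triggered(self, gen: int) -> bool:
        return gen <= self._triggered

    def on_trigger(self, gen: int, cb) -> None:
        """Run `cb` when generation `gen` triggers (now, if it already has)."""
        if self.has_triggered(gen):
            cb()
        else:
            self._waiters.setdefault(gen, []).append(cb)
```

The payoff is that a runtime creating millions of events can reuse a small pool of such objects instead of allocating and reclaiming one record per event.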
ISBN (print): 9781509066070
In the last few years, GPUs have become an integral part of HPC clusters. To test these heterogeneous CPU-GPU systems, we designed a hybrid CUDA-MPI benchmark suite that consists of three communication- and compute-intensive applications: Matrix Multiplication (MM), Needleman-Wunsch (NW), and the ADFA compression algorithm [1]. The main goal of this work is to characterize these workloads on CPU-GPU clusters. Our benchmark applications are designed to allow cluster administrators to identify bottlenecks in the cluster, to decide whether scaling applications to multiple nodes would improve or decrease overall throughput, and to design effective scheduling policies. Our experiments show that inter-node communication can significantly degrade the throughput of communication-intensive applications. We conclude that the scalability of the applications depends primarily on two factors: the cluster configuration and the applications' characteristics.
The MapReduce paradigm is one of the best solutions for implementing distributed applications which perform intensive data processing. In terms of performance for this type of application, MapReduce can be improved by adding GPU capabilities. In this context, GPU clusters for large-scale computing can bring a considerable increase in the efficiency and speedup of data-intensive applications. In this article we present a framework for executing MapReduce using GPU programming. We describe several improvements to the concept of GPU MapReduce and compare our solution with others.
ISBN (print): 9781479966226
Distributed query processing is one of the research focuses in Big Data. Nowadays, many companies and institutions provide technologies and products to realize this functionality or improve its efficiency in all kinds of databases. In the electric power scenario, using these techniques, the real-time requirement (
ISBN (print): 9781479976164
Equi-join is heavily used in MapReduce-based log processing. With the rapid growth of dataset sizes, join methods on MapReduce have been studied extensively in recent years. We find that existing join methods usually cannot achieve high query performance and affordable storage consumption at the same time when faced with a huge amount of log data: they either optimize one aspect while significantly sacrificing the other, or have limited applicability. In this paper, after analyzing the characteristics of the workloads and the underlying MapReduce framework, we present a join method with specific optimizations for log processing, called RHJoin (Repartition Hash Join), and its implementation on Hadoop. In RHJoin, reference tables are partitioned in a pre-processing step, the log table is partitioned on the map side, and the hash join is executed on the reduce side. The shuffle procedure of MapReduce is also optimized by removing the sort step and overlapping the execution of mappers and reducers. Comprehensive experiments show that RHJoin achieves high query performance with only a small extra storage cost, and is widely applicable to log processing.
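The repartition-plus-hash-join structure can be sketched in a few lines: both tables are hash-partitioned on the join key so matching rows land in the same co-partition, and each co-partition is then joined by building a hash table on the small reference side and probing with the log side. This is a single-process illustration of the general technique (function names are ours, and Hadoop-specific details like the shuffle optimization are omitted):

```python
from collections import defaultdict

def partition_by_key(rows, key, n_parts):
    """Hash-partition rows on the join key (pre-processing for the
    reference table; done on the map side for the log table)."""
    parts = [[] for _ in range(n_parts)]
    for row in rows:
        parts[hash(row[key]) % n_parts].append(row)
    return parts

def hash_join(ref_part, log_part, key):
    """Reduce-side hash join of one co-partition: build on the small
    reference partition, probe with the (large) log partition."""
    build = defaultdict(list)
    for r in ref_part:
        build[r[key]].append(r)
    return [{**r, **l} for l in log_part for r in build.get(l[key], [])]

def rhjoin_style(ref_rows, log_rows, key, n_parts=4):
    """Join co-partitions pairwise; in MapReduce each pair would be
    handled by one reducer."""
    ref_parts = partition_by_key(ref_rows, key, n_parts)
    log_parts = partition_by_key(log_rows, key, n_parts)
    out = []
    for rp, lp in zip(ref_parts, log_parts):
        out.extend(hash_join(rp, lp, key))
    return out
```

Because matching keys always hash to the same partition index, no cross-partition communication is needed at join time, which is what makes sort-free shuffling possible.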
High-performance analytical data processing systems often run on servers with large amounts of memory. A common data structure used in such environments is the hash table. This paper focuses on investigating efficient parallel hash algorithms for processing large-scale data. Currently, hash tables on distributed architectures are accessed one key at a time by local or remote threads, while shared-memory approaches focus on accessing a single table with multiple threads. A relatively straightforward “bulk-operation” approach seems to have been neglected by researchers. In this work, using such a method, we propose a high-level parallel hashing framework, Structured Parallel Hashing, targeting efficient processing of massive data on distributed memory. We present a theoretical analysis of the proposed method and describe the design of our hashing implementations. The evaluation reveals a very interesting result: the proposed straightforward method can vastly outperform distributed hashing methods and can even offer performance comparable with approaches based on shared-memory supercomputers which use specialized hardware predicates. Moreover, we characterize the performance of our hash implementations through extensive experiments, thereby allowing system developers to make a more informed choice for their high-performance applications.
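The bulk-operation idea, in contrast to one-key-at-a-time access, is to bucket a whole batch of operations by destination partition first, then apply each partition's sub-batch in one pass. A minimal single-process sketch under that general idea (the partitioning scheme and function names are illustrative, not the paper's):

```python
from collections import defaultdict

def bulk_insert(tables, pairs):
    """Bulk-insert (key, value) pairs into per-partition hash tables.

    The whole batch is first bucketed by partition index, then each
    partition applies its sub-batch at once; in a distributed setting
    this means one message per partition instead of one per key.
    """
    n = len(tables)
    buckets = defaultdict(list)
    for k, v in pairs:
        buckets[hash(k) % n].append((k, v))
    for p, sub in buckets.items():
        tables[p].update(sub)

def bulk_lookup(tables, keys):
    """Look up a batch of keys; missing keys yield None."""
    n = len(tables)
    return [tables[hash(k) % n].get(k) for k in keys]
```

Amortizing communication and synchronization over a batch rather than paying it per key is the plausible source of the speedups the abstract reports over per-key distributed hashing.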