ISBN: 9781467381741 (print)
In recent years, many companies have embraced the Hadoop MapReduce system for large-scale data processing with completion-time constraints. However, existing Hadoop schedulers still suffer from the reducer load imbalance problem. In this paper, we present a novel run-time load balancing method for MapReduce. Our approach predicts the workload of each reduce task at run-time and dynamically assigns the reduce tasks to specific machines based on the estimated workload. It can therefore achieve load balance among machines. The experimental results show that our approach predicts the workload of reduce tasks with high accuracy and improves the job completion time by up to 23.15%.
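The assignment step described above can be illustrated with a minimal sketch. It assumes the per-task workload estimates are already available and uses a greedy longest-processing-time-first heuristic, which is a stand-in chosen for illustration and not necessarily the policy used in the paper. The names `assign_reduce_tasks`, `machine_loads`, and the sample estimates are hypothetical.

```python
import heapq


def assign_reduce_tasks(estimated_load, num_machines):
    """Greedy assignment: heaviest task first, onto the least loaded machine.

    Illustrative heuristic only; the paper's actual policy is not given here.
    """
    # Min-heap of (total load assigned so far, machine index).
    heap = [(0.0, m) for m in range(num_machines)]
    heapq.heapify(heap)
    assignment = {}
    for task, load in sorted(estimated_load.items(), key=lambda kv: -kv[1]):
        total, machine = heapq.heappop(heap)
        assignment[task] = machine
        heapq.heappush(heap, (total + load, machine))
    return assignment


def machine_loads(estimated_load, assignment, num_machines):
    """Sum the estimated load placed on each machine."""
    loads = [0.0] * num_machines
    for task, machine in assignment.items():
        loads[machine] += estimated_load[task]
    return loads


# Hypothetical workload estimates for five reduce tasks (arbitrary units).
estimates = {"r0": 8.0, "r1": 7.0, "r2": 6.0, "r3": 5.0, "r4": 4.0}
plan = assign_reduce_tasks(estimates, 2)
loads = machine_loads(estimates, plan, 2)
print(plan, loads)
```

The quality of such a plan depends on the accuracy of the estimates, which is why the abstract reports prediction accuracy alongside completion time.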
On June 17, 2013, the MilkyWay-2 (Tianhe-2) supercomputer was crowned the fastest supercomputer in the world on the 41st TOP500 list. This paper provides an overview of the MilkyWay-2 project and describes the design of its hardware and software systems. The key architectural features of MilkyWay-2 are highlighted, including neo-heterogeneous compute nodes integrating commodity off-the-shelf processors and accelerators that share a similar instruction set architecture, powerful networks that employ proprietary interconnection chips to support massively parallel message-passing communications, a proprietary 16-core processor designed for scientific computing, and efficient software stacks that provide a high-performance file system, an emerging programming model for heterogeneous systems, and intelligent system administration. We perform an extensive evaluation with wide-ranging applications, from the LINPACK and Graph500 benchmarks to massively parallel software deployed in the system.
The fast numerical solutions of the Riesz fractional equation have a computational cost of O(NM log M), where M and N are the numbers of grid points and time steps, respectively. In this paper, we present a GPU-based fast solution for the Riesz space fractional equation. The GPU-based fast solution, which builds on the FFT-based fast method and is implemented with the CUDA programming model, consists of parallel FFT, vector-vector addition, and vector-vector multiplication on the GPU. The experimental results show that the GPU-based fast solution agrees well with the exact solution. Compared to the known parallel fast solution on an 8-core Intel E5-2670 CPU, the overall speedup reaches 2.12 times on an NVIDIA GTX650 GPU and 10.93 times on an NVIDIA K20C GPU.
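The O(M log M) cost per time step typically comes from a Toeplitz matrix-vector product evaluated with FFTs. The sketch below assumes that the discretized operator is a Toeplitz matrix (an assumption, since the abstract does not state the matrix structure) and shows the three building blocks the abstract names: FFT, element-wise vector multiplication, and inverse FFT. It is a plain CPU illustration in Python, not the CUDA implementation, and the names `fft`, `toeplitz_matvec`, and `dense_matvec` are hypothetical.

```python
import cmath


def fft(x, inverse=False):
    """Recursive radix-2 FFT (unnormalized); len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return list(x)
    even = fft(x[0::2], inverse)
    odd = fft(x[1::2], inverse)
    sign = 1 if inverse else -1
    out = [0j] * n
    for k in range(n // 2):
        w = cmath.exp(sign * 2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + w
        out[k + n // 2] = even[k] - w
    return out


def toeplitz_matvec(first_col, first_row, v):
    """Compute T @ v for a Toeplitz matrix T via circulant embedding."""
    m = len(v)
    size = 1
    while size < 2 * m:
        size *= 2
    # First column of the embedding circulant: the Toeplitz first column,
    # zero padding, then the first row (minus its first entry) reversed.
    c = list(first_col) + [0.0] * (size - 2 * m + 1) + list(first_row[:0:-1])
    padded = list(v) + [0.0] * (size - m)
    # Element-wise product in the frequency domain, then inverse transform.
    spectrum = [a * b for a, b in zip(fft(c), fft(padded))]
    result = fft(spectrum, inverse=True)
    return [(value / size).real for value in result[:m]]


def dense_matvec(first_col, first_row, v):
    """O(M^2) reference product, used only to check the fast version."""
    m = len(v)
    return [
        sum((first_col[i - k] if i >= k else first_row[k - i]) * v[k]
            for k in range(m))
        for i in range(m)
    ]


col = [2.0, -1.0, 0.0]
row = [2.0, -1.0, 0.0]
vec = [1.0, 2.0, 3.0]
fast = toeplitz_matvec(col, row, vec)
slow = dense_matvec(col, row, vec)
print(fast, slow)
```

Repeating one such product in each of the N time steps gives the O(NM log M) total quoted in the abstract.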
In IaaS cloud environments, peak memory demand caused by hotspot applications in a Virtual Machine (VM) often results in performance degradation both inside and outside that VM. Solutions such as host swapping and ballooning have been proposed for memory consolidation and overcommitment. These solutions, however, do not address guest swapping inside the VM. Even when the host holds sufficient memory pages, the guest OS cannot use the host's free pages directly because of the semantic gap between the guest OS and the VMM. Our goal is to alleviate the performance degradation by reducing the disk I/O operations generated by guest swapping. Based on an analysis of the behavioral features of guest swapping, we design HybridSwap, a distributed, scalable framework that organizes the surplus memory of all hosts within a data center into virtual pools for swapping. The framework builds a swapping mechanism in a peer-to-peer manner, in which a VM can adaptively choose suitable pools for swapping. We implement a prototype of HybridSwap and evaluate it with different benchmarks. The results demonstrate that our solution improves guest swapping efficiency. In some cases, it performs 2-5 times better than the baseline setup.
ISBN: 9781479986712 (print)
Many-core systems are currently the main architecture trend. One of the dominant challenges for on-chip many-core systems is the memory wall. However, traditional research primarily focuses on the limited bandwidth. To solve this problem, many-core systems are equipped with large caches, and many complex memory and cache techniques are adopted to relieve bandwidth pressure and improve cache efficiency. All these methods incur a high cost in area and power. In this paper, we are motivated by the abundant bandwidth and low latency of optical interconnects. We analyze the memory access characteristics of a 64-core system under high bandwidth, which can be assumed to result from an optical interconnect, and consider the sensitivity of different benchmarks to bandwidth and cache. Finally, we discuss promising basic frameworks suitable for many-core systems with optical interconnects.
ISBN: 9781467364300 (print)
Coordination among users is inevitable in wireless communication for efficient medium access. Even though the data rate of an individual user has increased significantly, the performance of wireless networks has not grown accordingly because of the high MAC coordination overhead. In this paper, we present virtual frame aggregation (VFA), which achieves high coordination efficiency by amortizing the overhead over multiple transmissions. VFA provides a novel way to construct a winner cluster and allows the winners to transmit without interruption. Specifically, in a multicarrier network, every contending node chooses a subcarrier, and the nodes are ordered by the index of the chosen subcarrier. When a subcarrier is chosen by two or more nodes, an additional slot is used to reorder the collided nodes. Finally, all ordered nodes form a cluster, and their transmissions are issued sequentially without interruption. Simulation results show that two slots are usually enough to construct a sufficiently large winner cluster. Moreover, VFA achieves a throughput gain of up to 120% over IEEE 802.11, with better fairness under various scenarios.
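The cluster construction described above can be sketched as a small simulation. This is an illustration under stated assumptions, not the protocol as specified in the paper: nodes pick subcarriers uniformly at random, collided nodes get exactly one additional slot, and nodes that collide again in that slot are simply left out of the cluster. The function names `order_by_subcarrier` and `build_winner_cluster` are hypothetical.

```python
import random
from collections import defaultdict


def order_by_subcarrier(nodes, num_subcarriers, rng):
    """One contention slot: group nodes by the subcarrier each one picks.

    Returns the groups sorted by subcarrier index.
    """
    picks = defaultdict(list)
    for node in nodes:
        picks[rng.randrange(num_subcarriers)].append(node)
    return [picks[index] for index in sorted(picks)]


def build_winner_cluster(nodes, num_subcarriers, rng):
    """Build an ordered winner cluster using at most two slots."""
    cluster = []
    for group in order_by_subcarrier(nodes, num_subcarriers, rng):
        if len(group) == 1:
            cluster.append(group[0])
            continue
        # Collided nodes are reordered in one additional slot.
        for regroup in order_by_subcarrier(group, num_subcarriers, rng):
            if len(regroup) == 1:
                cluster.append(regroup[0])
            # Assumption: nodes colliding a second time drop out this round.
    return cluster


rng = random.Random(7)
contenders = list(range(20))
cluster = build_winner_cluster(contenders, 48, rng)
single = build_winner_cluster(["only"], 48, rng)
print(len(cluster), cluster)
```

The nodes in `cluster` would then transmit one after another in list order, so the contention overhead is paid once for the whole cluster.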
The performance of virtualized networks is critical to cloud applications. "Distributed line graphs" (DLG) are a universal technique for designing network topologies based on arbitrary regular graphs. In this paper, we implement a prototype (a C library) of a DLG-enabled network, called DLG-Kautz, as an application-layer virtualized network service. The effectiveness of our design and implementation is demonstrated through prototype evaluations.
On the 41st TOP500 list announced in June 2013, the MilkyWay-2 system produced by the National University of Defense Technology (NUDT) in China took first place with a LINPACK result of 33.86 PFLOPS. It has been one and a half years since its predecessor, MilkyWay-1 (TH-1), reached the same place for the first time. On the newest TOP500 list, published in November 2013, MilkyWay-2 retained the top position.
ISBN: 9781467370066 (print)
Sparse coding has shown great potential in learning image feature representations. Recently developed methods such as group sparse coding discover group relationships among examples and have achieved state-of-the-art results in image classification. However, they suffer from poor robustness in practice. This paper proposes a robust weighted supervised sparse coding method (RWSSC) to address this deficiency. In particular, RWSSC distinguishes the contributions of different classes to the sparse coding through a novel weighting strategy, and removes outliers by imposing l1-regularization over the noisy entries. Benefiting from these strategies, RWSSC can effectively boost the performance of sparse coding in image classification. In addition, we develop a block coordinate descent algorithm to optimize it and prove its convergence. Experimental results of image classification on two popular datasets show that RWSSC quantitatively outperforms representative sparse coding methods.
ISBN: 9781467377898 (print)
The coupling of microwaves into apertures plays an important part in many fields of electromagnetic physics and engineering. When the apertures are very narrow, Finite Difference Time Domain (FDTD) simulation of the coupling is very time-consuming. As a many-core architecture, Intel's Many Integrated Core (MIC) architecture provides 512-bit vector units and more than 200 threads. In this paper, we parallelize the FDTD simulation of microwave pulse coupling into narrow slots on the Intel MIC architecture. In the implementation, the OpenMP parallel programming model is used to exploit thread parallelism, while loop unrolling and SIMD intrinsic functions are used for vectorization. Compared with the serial version on an Intel Xeon E5-2670 CPU, the implementation on a MIC coprocessor with 57 cores obtains a speedup of 11.57 times. The experimental results also demonstrate that the parallelization scales well in performance. Additionally, we report how the binding between OpenMP threads and hardware threads on the MIC influences performance.