This paper presents the establishment of a cluster computing lab at a minority-serving institution that aims to provide computing resources to support the undergraduate computer science curriculum. We present a case study o...
ISBN: 9780769547015 (print)
The proceedings contain 133 papers. The topics discussed include: multiple spanning tree construction for deadlock-free adaptive routing in irregular networks; chemical reaction optimization for heterogeneous computing environments; on adaptive contention management strategies for software transactional memory; reducing energy consumption of dense linear algebra operations on hybrid CPU-GPU platforms; efficient GPU asynchronous implementation of a watershed algorithm based on cellular automata; portfolio management using particle swarm optimization on GPU; parallel parameter identification in industrial biotechnology; towards an infrastructure description language for modeling computing infrastructures; trace file comparison with a hierarchical sequence alignment algorithm; network load-aware user grouping for Internet media streaming systems; a simulation study of multi-criteria scheduling in grid based on genetic algorithms; and analytical framework for QoS aware publish/subscribe system deployed on MANET.
ISBN: 9780769549712 (print)
The Argonne Leadership Computing Facility (ALCF) is home to Mira, a 10 PF Blue Gene/Q (BG/Q) system. The BG/Q system is the third generation of the Blue Gene architecture from IBM and, like its predecessors, combines system-on-chip technology with a proprietary interconnect (5-D torus). Each compute node has 16 augmented PowerPC A2 processor cores with support for simultaneous multithreading, 4-wide double-precision SIMD, and different data prefetching mechanisms. Mira offers several new opportunities for tuning and scaling scientific applications. This paper discusses our early experience with a subset of micro-benchmarks, MPI benchmarks, and a variety of science and engineering applications running at ALCF. Both performance and power are studied, and results on BG/Q are compared with those of its predecessor, BG/P. Several lessons gleaned from tuning applications on the BG/Q architecture for better performance and scalability are shared.
ISBN: 9780769549712 (print)
We develop an optimized FFT-based Poisson solver on a CPU-GPU heterogeneous platform for the case when the input is too large to fit in the GPU global memory. The solver involves memory-bound computations such as the 3D FFT, in which the large 3D data may have to be transferred over the PCIe bus several times during the computation. We develop a new strategy to decompose and allocate the computation between the GPU and the CPU such that the 3D data is transferred only once to the device memory, and the executions of the GPU kernels are almost completely overlapped with the PCIe data transfer. We were able to achieve significantly better performance than what has been reported in previous related work, including over 50 GFLOPS for the three periodic boundary conditions, and over 40 GFLOPS for the two periodic, one Neumann boundary conditions. The PCIe bus bandwidth achieved is over 5 GB/s, which is close to the best possible on our platform. For all the cases tested, the single 3D PCIe transfer time, which constitutes a lower bound on what is possible on our platform, takes almost 70% of the total execution time of the Poisson solver.
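To make the spectral idea behind an FFT-based Poisson solver concrete, here is a minimal sketch under strong simplifying assumptions: one dimension instead of three, a periodic boundary on [0, 2π), and a naive O(n²) DFT standing in for a real FFT. It runs on the CPU only and does not reproduce the paper's CPU-GPU decomposition or its transfer overlap. The function names `dft` and `solve_poisson_periodic_1d` are hypothetical.

```python
import cmath
import math


def dft(values, sign):
    """Naive discrete Fourier transform; sign=-1 is forward, sign=+1 is the unnormalized inverse."""
    n = len(values)
    return [
        sum(values[m] * cmath.exp(sign * 2j * math.pi * m * k / n) for m in range(n))
        for k in range(n)
    ]


def solve_poisson_periodic_1d(f):
    """Solve u'' = f on [0, 2*pi) with periodic boundaries, returning the zero-mean solution."""
    n = len(f)
    f_hat = dft(f, -1)
    u_hat = [0j] * n  # mode 0 stays zero: the mean of u is fixed to 0
    for k in range(1, n):
        wave = k if k <= n // 2 else k - n  # signed integer wave number
        u_hat[k] = -f_hat[k] / (wave * wave)  # differentiation twice multiplies by -(wave^2)
    return [value.real / n for value in dft(u_hat, +1)]


# Usage: f = -sin(x) has the exact zero-mean solution u = sin(x).
n = 16
xs = [2 * math.pi * m / n for m in range(n)]
u = solve_poisson_periodic_1d([-math.sin(x) for x in xs])
max_error = max(abs(ui - math.sin(x)) for ui, x in zip(u, xs))
```

The solve itself is a pointwise division in frequency space, so almost all of the cost lies in the transforms and in moving data, which is consistent with the abstract's emphasis on memory-bound computation and PCIe transfer time.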
The unrank pattern to process combinatorial objects in parallel is revisited. The pattern is applied to find, in parallel, solutions to a restricted version of the community finding problem on small graphs. Performanc...
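As an illustration of what unranking means, the sketch below maps an integer rank directly to the k-combination at that position in lexicographic order, so that each parallel worker could jump to the start of its own rank range without enumerating earlier objects. This is a generic textbook construction, not the paper's code; the function name `unrank_combination` and the choice of k-combinations as the combinatorial object are assumptions.

```python
from itertools import combinations
from math import comb


def unrank_combination(n, k, rank):
    """Return the k-subset of range(n) at position `rank` in lexicographic order."""
    if not 0 <= rank < comb(n, k):
        raise ValueError("rank out of range")
    result = []
    candidate = 0
    for remaining in range(k, 0, -1):
        while True:
            # Number of combinations that place `candidate` in the current slot.
            block = comb(n - candidate - 1, remaining - 1)
            if rank < block:
                break
            rank -= block  # skip the whole block and try the next candidate
            candidate += 1
        result.append(candidate)
        candidate += 1
    return tuple(result)


# Usage: split the 10 combinations of C(5, 3) between two hypothetical workers.
total = comb(5, 3)
worker_ranges = [range(0, total // 2), range(total // 2, total)]
worker_results = [[unrank_combination(5, 3, r) for r in rng] for rng in worker_ranges]
```

Because each call depends only on `(n, k, rank)`, the rank space can be divided among workers with no communication between them.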
ISBN: 9780769549712 (print)
We analyse gather-scatter performance bottlenecks in molecular dynamics codes and the challenges that they pose for obtaining benefits from SIMD execution. This analysis informs a number of novel code-level and algorithmic improvements to Sandia's miniMD benchmark, which we demonstrate using three SIMD widths (128-, 256- and 512-bit). The applicability of these optimisations to wider SIMD is discussed, and we show that the conventional approach of exposing more parallelism through redundant computation is not necessarily best. In single precision, our optimised implementation is up to 5x faster than the original scalar code running on Intel® Xeon® processors with 256-bit SIMD, and adding a single Intel® Xeon Phi™ coprocessor provides up to an additional 2x performance increase. These results demonstrate: (i) the importance of effective SIMD utilisation for molecular dynamics codes on current and future hardware; and (ii) the considerable performance increase afforded by the use of Intel® Xeon Phi™ coprocessors for highly parallel workloads.
ISBN: 9780769549712 (print)
Graph processing has gained renewed attention. The increasingly large scale and wealth of connected data, such as those accrued by social network applications, demand the design of new techniques and platforms to efficiently derive actionable information from large-scale graphs. Hybrid systems that host processing units optimized for both fast sequential processing and bulk processing (e.g., GPU-accelerated systems) have the potential to cope with the heterogeneous structure of real graphs and enable high-performance graph processing. Reaching this point, however, poses multiple challenges. The heterogeneity of the processing elements (e.g., GPUs implement a different parallel processing model than CPUs and have much less memory) and the inherent irregularity of graph workloads require careful graph partitioning and load assignment. In particular, the workload generated by a partitioning scheme should match the strength of the processing element the partition is allocated to. This work explores the feasibility and quantifies the performance gains of such low-cost partitioning schemes. We propose to partition the workload between the two types of processing elements based on vertex connectivity. We show that such partitioning schemes offer a simple yet efficient way to boost the overall performance of the hybrid system. Our evaluation illustrates that processing a 4-billion-edge graph on a system with one CPU socket and one GPU, while offloading as little as 25% of the edges to the GPU, achieves a 2x performance improvement over state-of-the-art implementations running on a dual-socket symmetric system. Moreover, for the same graph, a hybrid system with dual sockets and dual GPUs is capable of 1.13 billion breadth-first search traversed edges per second, a performance rate that is competitive with the latest entries in the Graph500 list, yet at a much lower price point.
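A partitioning scheme driven by vertex connectivity can be sketched as follows. The abstract does not say which vertices go to which processor, so this sketch assumes, purely for illustration, that the lowest-degree vertices are offloaded to the GPU until a target share of the edges is reached and that the rest stay on the CPU. The function name `partition_by_degree` and the adjacency-list input format are hypothetical.

```python
def partition_by_degree(adjacency, gpu_edge_fraction):
    """Split vertices into (cpu, gpu) lists so the GPU holds at most the given share of edges.

    Assumption for illustration only: low-degree vertices are offloaded first.
    """
    total_edges = sum(len(neighbors) for neighbors in adjacency.values())
    budget = gpu_edge_fraction * total_edges
    cpu_vertices, gpu_vertices = [], []
    gpu_edges = 0
    # Visit vertices from lowest to highest degree; ties are broken by vertex id.
    for vertex in sorted(adjacency, key=lambda v: (len(adjacency[v]), v)):
        degree = len(adjacency[vertex])
        if gpu_edges + degree <= budget:
            gpu_vertices.append(vertex)
            gpu_edges += degree
        else:
            cpu_vertices.append(vertex)
    return cpu_vertices, gpu_vertices


# Usage: a star graph stored as directed adjacency lists (8 directed edges in total).
star = {0: [1, 2, 3, 4], 1: [0], 2: [0], 3: [0], 4: [0]}
cpu_part, gpu_part = partition_by_degree(star, 0.25)
```

Sorting by degree costs O(V log V), which is small compared with a traversal of the graph; this is one sense in which such a scheme could be called low-cost.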
ISBN: 9781467331463; 9781467331449 (print)
We consider the problem of distributed throughput maximization for multi-channel ALOHA networks. We focus on networks containing a large number of users that transmit over a small number of channels. First, we consider the problem of constrained distributed rate maximization, where user rates are subject to total transmission probability constraints. We propose a distributed best-response algorithm to solve the rate maximization problem, where each user updates its strategy using its local channel state information (CSI) and by monitoring the channel utilization. We then consider the case where users are not restricted by transmission probability constraints. Distributed optimization of the network throughput under uncertainty is mandatory since the transmission probabilities of other users are unknown. We propose a distributed scheme to solve the throughput optimization problem under uncertainty, where users adjust their transmission probability to maximize their rates, but maintain the desired load on the channels. We propose sequential and parallel algorithms for this purpose.
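To see why the load on each channel matters, consider the textbook symmetric slotted-ALOHA model, which is a much simpler setting than the paper's CSI-based algorithms: every one of N users picks one of K channels uniformly at random and transmits with probability p, and a slot on a channel succeeds only when exactly one user transmits there. The sketch below computes the expected number of successful channels per slot under those assumptions. The function name `expected_throughput` is hypothetical.

```python
def expected_throughput(n_users, n_channels, p):
    """Expected successful transmissions per slot in a symmetric multi-channel slotted ALOHA model."""
    q = p / n_channels  # probability that a given user transmits on a given channel
    per_channel_success = n_users * q * (1 - q) ** (n_users - 1)
    return n_channels * per_channel_success


# Usage: 20 users on 4 channels. The per-channel term N*q*(1-q)**(N-1) peaks at q = 1/N,
# that is, at p = K/N = 0.2 in this model.
low_load = expected_throughput(20, 4, 0.1)
balanced = expected_throughput(20, 4, 0.2)
high_load = expected_throughput(20, 4, 0.4)
```

In this simplified model, throughput falls off on both sides of p = K/N: too little load leaves slots idle, too much causes collisions.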
To adapt to the requirements of massive-scale storage environments and improve the storage space utilization of data center hosts, we designed and implemented InfoStor, a heterogeneous environment, distri...
Unlike previous work on energy-efficient algorithms, which focused on the assumption that a task can be assigned to any processor, we study the problem of task Scheduling with the objective of Energy Minimizat...