ISBN: 9781479986484 (print)
The proceedings contain 109 papers. The topics discussed include: balanced coloring for parallel computing applications; high-performance graph analytics on manycore processors; scalable community detection with the Louvain algorithm; cooperative computing for autonomous data centers; divide and conquer symmetric tridiagonal eigensolver for multicore architectures; contention-based nonminimal adaptive routing in high-radix networks; identifying the culprits behind network congestion; embedding nonblocking multicast virtual networks in fat-tree data centers; Cashmere: heterogeneous many-core computing; a scheduling and runtime framework for a cluster of heterogeneous machines with multiple accelerators; hierarchical DAG scheduling for hybrid distributed systems; pushing the performance envelope of modular exponentiation across multiple generations of GPUs; and addressing fairness in SMT multicores with a progress-aware scheduler.
ISBN: 0769555101 (print)
The proceedings contain 143 papers. The topics discussed include: bridging the gap between performance and bounds of Cholesky factorization on heterogeneous platforms; efficient message logging to support process replicas in a volunteer computing environment; early multi-node performance evaluation of a Knights Corner (KNC) based NASA supercomputer; mini-NOVA: a lightweight ARM-based virtualization microkernel supporting dynamic partial reconfiguration; real-time multiprocessor architecture for sharing stream processing accelerators; relocation-aware floorplanning for partially-reconfigurable FPGA-based systems; experiences with compiler support for processors with exposed pipelines; performance modeling of matrix multiplication on 3D memory integrated FPGA; enhancing speedups for FPGA accelerated SPICE through frequency scaling and precision reduction; and an automated high-level design framework for partially reconfigurable FPGAs.
ISBN: 9781479986484 (print)
We consider continuous maintenance of a random sample of distinct elements from a massive data stream, whose input elements are observed at multiple distributed sites that communicate via a central coordinator. At any point, when a query is received at the coordinator, it responds with a random sample from the set of all distinct elements observed at the different sites so far. We present the first algorithms for distinct random sampling from a distributed stream. We also present a lower bound on the expected number of messages that must be transmitted by any distributed algorithm, showing that our algorithm is message optimal to within a factor of four. We present extensions to sliding windows, and experimental results showing the performance of our algorithm on real-world data sets.
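The abstract does not spell out the algorithm, so the following is only a minimal sketch of one standard way to sample distinct elements through a coordinator: hash every element to [0, 1) and keep the elements with the smallest hashes. The class names `Site` and `Coordinator`, the helper `unit_hash`, and the threshold-reply protocol are illustrative assumptions, not the authors' design.

```python
import hashlib


def unit_hash(item: str) -> float:
    """Map an item to a pseudo-random point in [0, 1); duplicates map to the same point."""
    digest = hashlib.sha256(item.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") / 2**64


class Coordinator:
    """Keeps the sample_size distinct items with the smallest hash values (assumed design)."""

    def __init__(self, sample_size: int):
        self.sample_size = sample_size
        self.sample = {}      # item -> hash value
        self.messages = 0     # site-to-coordinator messages received

    def threshold(self) -> float:
        # Until the sample is full, every item is a candidate.
        if len(self.sample) < self.sample_size:
            return 1.0
        return max(self.sample.values())

    def receive(self, item: str, h: float) -> float:
        self.messages += 1
        if item not in self.sample and h < self.threshold():
            self.sample[item] = h
            if len(self.sample) > self.sample_size:
                worst = max(self.sample, key=self.sample.get)
                del self.sample[worst]
        # The reply carries the current threshold so the site can filter locally.
        return self.threshold()


class Site:
    """Forwards an element only if its hash beats the last threshold it heard."""

    def __init__(self, coordinator: Coordinator):
        self.coordinator = coordinator
        self.local_threshold = 1.0

    def observe(self, item: str) -> None:
        h = unit_hash(item)
        if h < self.local_threshold:
            self.local_threshold = self.coordinator.receive(item, h)
```

Because a repeated element always hashes to the same value, duplicates cannot bias the sample, and a site's stale threshold is never smaller than the coordinator's, so no qualifying element is dropped. This sketch makes no claim about matching the message bound stated in the abstract.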
ISBN: 9781479986484 (print)
In this paper, we present a framework that automatically decomposes programmer-written flat transactions into closed-nested transactions. The framework relies on two key mechanisms for the decomposition. The first is a static tool that analyzes application source code and produces a compact representation of transactions' business logic. The second is a runtime monitor that captures the actual contention level of shared objects and, relying on the outcome of the static tool, triggers the optimal closed-nested configuration for the workload at hand. We implemented this framework atop QR-CN, an open-source fault-tolerant DTM written in Java. Our experimental studies conducted using the TPC-C, Vacation and Bank benchmarks reveal that the framework yields better performance than flat nesting and manual closed nesting, especially when the workload changes.
ISBN: 9781479986484 (print)
Irregular computations on large workloads are a necessity in many areas of computational science. Mapping these computations to modern parallel architectures, such as GPUs, is particularly challenging because the performance often depends critically on the choice of data structure and algorithm. In this paper, we develop a parallel processing scheme, based on Merge Path partitioning, to compute segmented row-wise operations on sparse matrices that exposes parallelism at the granularity of individual nonzero entries. Our decomposition achieves competitive performance across many diverse problems while maintaining predictable behavior that depends only on the computational work, and it ameliorates the impact of irregularity. We evaluate the performance of three sparse kernels: SpMV, SpAdd and SpGEMM. We show that our processing scheme for each kernel yields comparable performance to other schemes in many cases and that our performance is highly correlated with the computational work (correlation close to 1), irrespective of the underlying structure of the matrices.
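To make the partitioning idea concrete, here is a minimal sequential sketch of a merge-path decomposition for CSR SpMV. It treats the computation as a merge of the row-end offsets with the nonzero indices, cuts that merge path into equal-length pieces, and repairs rows split across pieces with a carry. The function names and the loop that simulates partitions one after another are assumptions for illustration; the authors' GPU kernels are not described in the abstract and may differ.

```python
def merge_path_search(diagonal, row_end, nnz):
    """Return (rows_consumed, nonzeros_consumed) where the merge path crosses a diagonal."""
    lo = max(0, diagonal - nnz)
    hi = min(diagonal, len(row_end))
    while lo < hi:
        mid = (lo + hi) // 2
        # Row `mid` is finished once all of its nonzeros lie before the diagonal.
        if row_end[mid] <= diagonal - mid - 1:
            lo = mid + 1
        else:
            hi = mid
    return lo, diagonal - lo


def spmv_merge_path(row_offsets, col_idx, vals, x, num_parts):
    """Compute y = A @ x for a CSR matrix, splitting work evenly over rows + nonzeros."""
    num_rows = len(row_offsets) - 1
    nnz = len(vals)
    row_end = row_offsets[1:]
    total = num_rows + nnz                      # length of the merge path
    per_part = -(-total // num_parts)           # ceiling division
    y = [0.0] * num_rows
    carries = []                                # (row, partial sum) left by each partition

    for p in range(num_parts):                  # each iteration stands in for one thread
        start = min(p * per_part, total)
        stop = min(start + per_part, total)
        row, k = merge_path_search(start, row_end, nnz)
        row_stop, k_stop = merge_path_search(stop, row_end, nnz)
        acc = 0.0
        while row < row_stop:                   # rows that end inside this partition
            while k < row_end[row]:
                acc += vals[k] * x[col_idx[k]]
                k += 1
            y[row] = acc
            acc = 0.0
            row += 1
        while k < k_stop:                       # leading part of a row that ends later
            acc += vals[k] * x[col_idx[k]]
            k += 1
        carries.append((row, acc))

    for row, acc in carries:                    # fix-up for rows split across partitions
        if row < num_rows:
            y[row] += acc
    return y
```

Every partition handles the same number of path steps regardless of how nonzeros are distributed over rows, which is the property the abstract describes as behavior depending only on the computational work.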
ISBN: 9781479986484 (print)
Consider n nodes connected to a single coordinator. Each node receives an individual online data stream of numbers and, at any point in time, the coordinator has to know the k nodes currently observing the largest values, for a given k between 1 and n. We design and analyze an algorithm that solves this problem while bounding the number of messages exchanged between the nodes and the coordinator. Our algorithm employs filters which, intuitively speaking, result in few messages being sent if the new input is "similar" to the previous ones. In expectation, the algorithm uses a number of messages that is larger by a factor of O((log Δ + k) · log n) than that of an offline algorithm that sets filters in an optimal way, where Δ is upper bounded by the largest value observed by any node.
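The filter idea can be illustrated with a deliberately simplified sketch: the coordinator places one bound between the k-th and (k+1)-th largest values, and a node stays silent while its value remains on its side of that bound. The class `TopKMonitor`, the midpoint bound, and the full re-poll on any violation are assumptions made for illustration; the paper's algorithm and its message bound are more refined than this.

```python
class TopKMonitor:
    """Tracks the k nodes with the largest values using a single filter bound (simplified)."""

    def __init__(self, values, k):
        self.k = k
        self.messages = 0
        self._reset(list(values))

    def _reset(self, values):
        # Assumed cost model: a full poll costs one message per node.
        self.messages += len(values)
        order = sorted(range(len(values)), key=lambda i: values[i], reverse=True)
        self.top = set(order[: self.k])
        if self.k < len(values):
            kth, next_one = values[order[self.k - 1]], values[order[self.k]]
            self.bound = (kth + next_one) / 2
        else:
            self.bound = float("-inf")

    def step(self, values):
        """Feed one round of observations; nodes inside their filters send nothing."""
        violated = any(
            (v < self.bound) if i in self.top else (v > self.bound)
            for i, v in enumerate(values)
        )
        if violated:
            self._reset(list(values))
        return self.top
```

As long as no filter is violated, every top node is at or above the bound and every other node is at or below it, so the stored set remains a valid answer without any communication.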
ISBN: 9781479986484 (print)
Failure detection plays a central role in the engineering of distributed systems. Furthermore, many applications have timing constraints and require failure detectors that provide quality of service (QoS) with some quantitative timeliness guarantees. Therefore, they need failure detectors that are fast and accurate. We introduce the Two Windows Failure Detector (2W-FD), an algorithm that provides QoS and is able to react to sudden changes in network conditions, a property that currently existing algorithms do not satisfy. We ran tests on real traces and compared the 2W-FD to state-of-the-art algorithms. Our results show that our algorithm presents the best performance in terms of speed and accuracy in unstable scenarios.
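The abstract does not describe how the two windows are used. One plausible reading, sketched below purely as an assumption, is a heartbeat detector that estimates the next arrival time from both a short and a long sliding window of past heartbeats and trusts whichever estimate is later. The class `TwoWindowDetector`, its parameters, and the arrival-time estimator are hypothetical and should not be taken as the published 2W-FD.

```python
from collections import deque


class TwoWindowDetector:
    """Heartbeat failure detector using a short and a long estimation window (assumed design)."""

    def __init__(self, interval, margin, short=1, long=1000):
        self.interval = interval            # sender's heartbeat period
        self.margin = margin                # safety margin added to the estimate
        self.short = short                  # size of the short window
        self.offsets = deque(maxlen=long)   # arrival time minus seq * interval
        self.last_seq = None

    def heartbeat(self, seq, arrival):
        """Record the arrival time of heartbeat number `seq`."""
        self.offsets.append(arrival - seq * self.interval)
        self.last_seq = seq

    def _expected_arrival(self, offsets):
        # Average observed shift plus the nominal send time of the next heartbeat.
        return sum(offsets) / len(offsets) + (self.last_seq + 1) * self.interval

    def deadline(self):
        """Latest time by which the next heartbeat is expected (requires one heartbeat)."""
        recent = list(self.offsets)[-self.short:]
        estimate = max(self._expected_arrival(recent),
                       self._expected_arrival(self.offsets))
        return estimate + self.margin

    def suspects(self, now):
        return self.last_seq is not None and now > self.deadline()
```

In this sketch the short window lets the deadline move immediately after a sudden delay, while the long window keeps the estimate steady when conditions drift slowly.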
ISBN: 9781479986484 (print)
Following an exhaustive set of experiments, we identify slowdowns in I/O performance that occur when processor power and frequency are increased. Our initial analyses indicate slowdowns are more likely to occur and more acute when the number of parallel I/O threads increases and the variability between runs is high. We use a microbenchmark-driven methodology to simplify isolation of the root causes of I/O performance loss. We classify the observed performance loss into two categories: file synchronization and file write delays. We introduce LUC, a runtime system to Limit the Unintended Consequences of power scaling and dynamically improve I/O performance. We demonstrate the effectiveness of the LUC system running on two platforms for two critical parallel transaction-oriented workloads including a mail server (varMail) and online transaction processing (oltp).
ISBN: 9781479986484 (print)
High-fidelity nuclear power plant core simulations require solving the Boltzmann transport equation. In discrete ordinates methods, the most computationally demanding operation in solving this equation is the sweep operation. Considering the evolution of computer architectures, we propose in this paper, as a first step toward heterogeneous distributed architectures, a hybrid parallel implementation of the sweep operation on top of the generic task-based runtime system PARSEC. Such an implementation targets three nested levels of parallelism: message passing, multi-threading, and vectorization. A theoretical performance model was designed to validate the approach and help tune the multiple parameters it involves. The proposed parallel implementation of the sweep achieves a sustained performance of 6.1 Tflop/s, corresponding to 33.9% of the peak performance of the targeted supercomputer. This implementation compares favorably with state-of-the-art solvers such as PARTISN, and it can therefore serve as a building block for a massively parallel version of the neutron transport solver DOMINO developed at EDF.
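As a quick sanity check on the reported figures, the two numbers in the abstract imply the peak performance of the machine partition used. The helper name `implied_peak` is hypothetical, and the result is derived arithmetic rather than a figure stated by the authors.

```python
def implied_peak(sustained_tflops: float, fraction_of_peak: float) -> float:
    """Peak performance implied by a sustained rate and its share of peak."""
    return sustained_tflops / fraction_of_peak


# 6.1 Tflop/s at 33.9% of peak implies a peak of roughly 18 Tflop/s.
peak = implied_peak(6.1, 0.339)
print(f"implied peak: {peak:.1f} Tflop/s")
```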