ISBN:
(Print) 9781450307437
The proceedings contain 50 papers. The topics discussed include: graph expansion and communication costs of fast matrix multiplication; near linear-work parallel SDD solvers, low-diameter decomposition, and low-stretch subgraphs; linear-work greedy parallel approximate set cover and variants; optimizing hybrid transactional memory: the importance of nonspeculative operations; parallelism and data movement characterization of contemporary application classes; work-stealing for mixed-mode parallelism by deterministic team-building; full reversal routing as a linear dynamical system; reclaiming the energy of a schedule, models and algorithms; a tight runtime bound for synchronous gathering of autonomous robots with limited visibility; convergence of local communication chain strategies via linear transformations: or how to trade locality for speed; and convergence to equilibrium of logit dynamics for strategic games.
This special issue contains 6 selected papers whose preliminary versions appeared in the Proceedings of the 23rd annual ACM symposium on parallelism in algorithms and architectures (SPAA), held June 2011, in San Jose, California, USA. These papers were selected by the special issue co-editors from 35 papers that were presented at the conference. The authors were invited to submit full versions of their papers, which were then fully refereed according to the usual standards of Theory of Computing Systems. The selected papers are representative of the breadth and depth of the research in parallelism in algorithms and architectures that was presented at SPAA 2011.
ISBN:
(Print) 9781450307437
As the gap between the cost of communication (i.e., data movement) and computation continues to grow, the importance of pursuing algorithms which minimize communication also increases. Toward this end, we seek asymptotic communication lower bounds for general memory models and classes of algorithms. Recent work [2] has established lower bounds for a wide set of linear algebra algorithms on a sequential machine and on a parallel machine with identical processors. This work extends these previous bounds to a heterogeneous model in which processors access data and perform floating point operations at differing speeds. We also present an algorithm for dense matrix multiplication which attains the lower bound.
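The heterogeneous model above assigns work in proportion to each processor's speed so that all processors finish at roughly the same time. A minimal sketch of that idea, assuming a static partition of the rows of a dense matrix product; the function name and the speed values are illustrative, not taken from the paper:

```python
# Hypothetical sketch: partition the rows of an n x n matrix product
# among heterogeneous processors in proportion to their speeds
# (flops/s), so each finishes its share of the work at about the
# same time. Not the paper's algorithm; a load-balancing toy model.

def partition_rows(n, speeds):
    """Assign each processor a contiguous band of rows proportional
    to its relative speed. Returns a list of (start, end) pairs."""
    total = sum(speeds)
    bounds, start = [], 0
    for i, s in enumerate(speeds):
        # The last processor takes the remainder to absorb rounding.
        end = n if i == len(speeds) - 1 else start + round(n * s / total)
        bounds.append((start, end))
        start = end
    return bounds

# A processor twice as fast receives a band twice as wide.
bands = partition_rows(100, [1.0, 2.0, 1.0])
```

A communication-optimal algorithm must additionally choose block sizes per processor so that data movement, not just flop count, is balanced; the sketch captures only the flop-proportional split.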
ISBN:
(Print) 9781450349826
Massive spatial parallelism at low energy gives FPGAs the potential to be core components in large scale high performance computing (HPC) systems. In this paper we present four major design steps that harness high-level synthesis (HLS) to implement scalable spatial FPGA algorithms. To aid productivity, we introduce the open source library hlslib to complement HLS. We evaluate kernels designed with our approach on an FPGA accelerator board, demonstrating high performance and board utilization with enhanced programmer productivity. By following our guidelines, programmers can use HLS to develop efficient parallel algorithms for FPGA, scaling their implementations with increased resources on future hardware.
ISBN:
(Print) 9781467391603
The proceedings contain 11 papers. The topics discussed include: NUMA aware I/O in virtualized systems; the BXI interconnect architecture; exploiting offload enabled network interfaces; a brief introduction to the OpenFabrics interfaces - a new network API for maximizing high performance application efficiency; UCX: an open source framework for HPC network APIs and beyond; OWN: optical and wireless network-on-chip for kilo-core architectures; Amon: an advanced mesh-like optical NoC; impact of InfiniBand DC transport protocol on energy consumption of all-to-all collective algorithms; implementing ultra low latency data center services with programmable logic; and enhanced overloaded CDMA interconnect (OCI) bus architecture for on-chip communication.
ISBN:
(Print) 9781509016150
HPC simulations suffer from failures of numerical reproducibility because of floating-point arithmetic peculiarities. Different computing distributions of a parallel computation may yield different numerical results. We are interested in a finite element computation of hydrodynamic simulations within the openTelemac software, where parallelism is provided by domain decomposition. One main task in a finite element simulation consists of building one large linear system and solving it. Here the building step relies on an element-by-element storage mode and the solving step applies the conjugate gradient algorithm. The subdomain parallelism is merged within these steps. We study why reproducibility fails in this process and which operations have to be corrected. We detail how to use compensation techniques to compute a numerically reproducible resolution. We illustrate this approach with the reproducible version of one test case provided by the openTelemac software suite.
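The compensation techniques mentioned above build on error-free transformations: each floating-point addition is split into a rounded sum and its exact rounding error, and carrying the error term makes the result (almost) independent of summation order. A minimal sketch of the idea, not the paper's openTelemac implementation:

```python
# Sketch of compensated summation via Knuth's TwoSum error-free
# transformation: a + b == s + e holds exactly in IEEE 754
# arithmetic, so accumulating e recovers the bits that an ordinary
# sum loses. This order-insensitivity is the basis of reproducible
# parallel reductions; function names here are illustrative.

def twosum(a, b):
    """Return (s, e) with s = fl(a + b) and a + b = s + e exactly."""
    s = a + b
    bp = s - a                      # part of b that made it into s
    e = (a - (s - bp)) + (b - bp)   # exact rounding error of s
    return s, e

def compensated_sum(values):
    s, err = 0.0, 0.0
    for v in values:
        s, e = twosum(s, v)
        err += e                    # accumulate the low-order bits
    return s + err
```

On an ill-conditioned input such as `[1e16, 1.0, -1e16]`, the ordinary left-to-right sum loses the `1.0`, while the compensated version recovers it; Python's standard library offers `math.fsum` for the same purpose.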
This paper presents an approach for the development of microcode for parallel and pipelined machines. The approach is geared towards mapping programs with real-time constraints and/or massive time requirements onto sy...
ISBN:
(Print) 9781595937537
Grid networks are large distributed systems that share and virtualize heterogeneous resources. Quality of Service (QoS) is a key and complex issue for Grid services provisioning. Currently, most Grid networks offer best-effort (BE) services. Thus, QoS architectures initially developed for the Internet, such as DiffServ (DS), have been adapted to the Grid environment. Given the widespread adoption of the Internet, many Grid networks will be deployed over this technology in the years to come. In this paper, we compare two Flow-Aware Networking (FAN) architectures, mainly from the second generation (2GFAN). The purpose is to answer the question of which 2GFAN architecture performs better under Grid traffic. FAN is a promising alternative to DS for QoS provisioning in Internet networks. DS provides QoS differentiation through explicit packet marking and classification, whereas FAN consists of per-flow admission control and implicit flow differentiation through priority fair queuing. The main difference between the two 2GFAN architectures is the fair queuing algorithm. Thus, to the knowledge of the authors, this is the first time two priority per-flow fair queuing algorithms are compared under Grid traffic. A GridFTP session may be seen as a succession of parallel TCP flows with large volumes of data transfers. Metrics used are average delay, average goodput and the average rejection rate.
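Per-flow fair queuing of the kind FAN relies on can be illustrated with deficit round robin (DRR), a classic scheme in this family; the sketch below is a generic DRR model, not either of the two 2GFAN algorithms compared in the paper, and the flow names, packet sizes and quantum are invented for illustration:

```python
# Hypothetical sketch of deficit round robin (DRR) per-flow fair
# queuing: each flow earns a fixed quantum of credit per round and
# may transmit a packet only when its accumulated deficit covers
# the packet size, which equalizes throughput across flows with
# different packet sizes.
from collections import deque

def drr_schedule(flows, quantum, rounds):
    """flows: dict mapping flow name -> deque of packet sizes.
    Returns the names of flows in packet-transmission order."""
    deficit = {f: 0 for f in flows}
    sent = []
    for _ in range(rounds):
        for f, q in flows.items():
            deficit[f] += quantum
            while q and q[0] <= deficit[f]:
                deficit[f] -= q.popleft()
                sent.append(f)
            if not q:
                deficit[f] = 0   # idle flows keep no credit
    return sent
```

A flow sending 600-byte packets against a flow sending 300-byte packets transmits half as many packets per round, so both receive roughly equal byte throughput.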
VLIW architectures have been shown to be able to exploit large amounts of fine-grain parallelism in the execution of sequential imperative programs. In this paper, a new computing model is presented, which allows VLIW techniques to be adapted to operate a distributed-memory multiprocessor machine. The model, called VLIW-in-the-large, can be adopted in conjunction with a suitable hardware framework to obtain consistent speedups in the execution of both sequential and parallel-natured software. The authors show that the advantages of the VLIW-in-the-large computing model with respect to the classical VLIW approach are: (i) better utilization of hardware resources; (ii) extension of the applicability of the VLIW techniques to multiprocessor architectures, in such a way that they can be used for multi-style, multi-grain parallelism exploitation; (iii) compact realization of processing elements, suitable for VLSI massively parallel architectures.