ISBN:
(Print) 9780769530475
The recent design shift towards multicore processors has spawned a significant amount of research in the area of program parallelization. The future abundance of cores on a single chip requires programmer and compiler intervention to increase the amount of parallel work possible. Much of the recent work has fallen into the areas of coarse-grain parallelization: new programming models and different ways to exploit threads and data-level parallelism. This work focuses on a complementary direction, improving performance through automated fine-grain parallelization. The main difficulty in achieving a performance benefit from fine-grain parallelism is the distribution of data memory accesses across the data caches of each core. Poor choices in the placement of data accesses can lead to increased memory stalls and low resource utilization. We propose a profile-guided method for partitioning memory accesses across distributed data caches. First, a profile determines affinity relationships between memory accesses and working set characteristics of individual memory operations in the program. Next, a program-level partitioning of the memory operations is performed to divide the memory accesses across the data caches. As a result, the data accesses are proactively dispersed to reduce memory stalls and improve computation parallelization. A final detailed partitioning of the computation instructions is performed with knowledge of the cache location of their associated data. Overall, our data partitioning reduces stall cycles by up to 51% versus data-incognizant partitioning, and achieves an average speedup of 30% over a single-core processor.
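The affinity-driven partitioning the abstract describes can be sketched as a greedy placement over a profiled affinity graph. This is a hypothetical illustration only, not the paper's algorithm: the function name, the data shapes, and the balance heuristic are all assumptions.

```python
# Hypothetical sketch of profile-guided partitioning of memory operations
# across two data caches: operations with high profiled affinity are kept
# together, subject to a working-set balance constraint.

def partition_memory_ops(working_set, affinity):
    """working_set: {op: footprint in cache lines}
    affinity: {(op_a, op_b): profiled co-access count}, keys sorted
    Returns (cache0_ops, cache1_ops)."""
    caches = ([], [])
    load = [0, 0]
    # Place the largest working sets first so balance is easier to keep.
    for op in sorted(working_set, key=working_set.get, reverse=True):
        # Score each cache by total affinity to ops already placed there.
        score = [sum(affinity.get(tuple(sorted((op, o))), 0) for o in c)
                 for c in caches]
        # Prefer the higher-affinity cache unless it is clearly overloaded.
        target = 0 if score[0] >= score[1] else 1
        if load[target] - load[1 - target] > max(working_set.values()):
            target = 1 - target
        caches[target].append(op)
        load[target] += working_set[op]
    return caches

ws = {"ld_a": 4, "ld_b": 4, "st_c": 2, "ld_d": 2}
aff = {("ld_a", "ld_b"): 10, ("ld_d", "st_c"): 8}
c0, c1 = partition_memory_ops(ws, aff)
```

With this toy profile, the two high-affinity pairs end up co-located, each on its own cache, which is the behavior the paper's program-level partitioning step aims for.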
ISBN:
(Print) 1595937064
Much like multi-storey buildings in densely packed metropolises, three-dimensional (3D) chip structures are envisioned as a viable solution to skyrocketing transistor densities and burgeoning die sizes in multi-core architectures. Partitioning a larger die into smaller segments and then stacking them in a 3D fashion can significantly reduce latency and energy consumption. Such benefits emanate from the notion that inter-wafer distances are negligible compared to intra-wafer distances. This attribute substantially reduces global wiring length in 3D chips. The work in this paper integrates the increasingly popular idea of packet-based Networks-on-Chip (NoC) into a 3D setting. While NoCs have been studied extensively in the 2D realm, the microarchitectural ramifications of moving into the third dimension have yet to be fully explored. This paper presents a detailed exploration of inter-strata communication architectures in 3D NoCs. Three design options are investigated: a simple bus-based inter-wafer connection, a hop-by-hop standard 3D design, and a full 3D crossbar implementation. In this context, we propose a novel partially-connected 3D crossbar structure, called the 3D Dimensionally-Decomposed (DimDe) Router, which provides a good tradeoff between circuit complexity and performance benefits. Simulation results using (a) a stand-alone cycle-accurate 3D NoC simulator running synthetic workloads, and (b) a hybrid 3D NoC/cache simulation environment running real commercial and scientific benchmarks, indicate that the proposed DimDe design provides latency and throughput improvements of over 20% on average over the other 3D architectures, while remaining within 5% of the full 3D crossbar performance. Furthermore, based on synthesized hardware implementations in 90 nm technology, the DimDe architecture outperforms all other designs - including the full 3D crossbar - by an average of 26% in terms of the Energy-Delay Product (EDP). Copyright 2007 ACM.
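The latency advantage of stacking, claimed in the abstract, can be illustrated with a simple hop-count comparison. This is not the paper's simulator; it is a back-of-the-envelope sketch, assuming dimension-ordered routing and an "unfolded" 2D layout as the baseline.

```python
# Illustrative hop-count comparison for a k x k x L stacked mesh versus the
# equivalent 2D die obtained by laying the L layers side by side.

def hops_3d(src, dst):
    """Minimal hops under XYZ dimension-ordered routing in a 3D mesh."""
    return sum(abs(a - b) for a, b in zip(src, dst))

def hops_2d_unfolded(src, dst, k):
    """Same endpoints on a single 2D die of width k*L: the layer index z
    becomes a horizontal offset of z*k columns."""
    (x1, y1, z1), (x2, y2, z2) = src, dst
    return abs((x1 + z1 * k) - (x2 + z2 * k)) + abs(y1 - y2)

# Opposite corners of a 4 x 4 x 4 stack: 9 hops in 3D vs 18 hops unfolded.
h3 = hops_3d((0, 0, 0), (3, 3, 3))
h2 = hops_2d_unfolded((0, 0, 0), (3, 3, 3), k=4)
```

The halving of worst-case hop count is the geometric intuition behind the abstract's point that inter-wafer distances are negligible compared to intra-wafer distances.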
The most prevalent algorithm for DNA sequence matching is the BLAST algorithm. We have developed several FPGA-based architectures to speed up BLAST execution. In this work we present a new VLSI architecture as a follow-up to our reconfigurable logic-based architectures, and we compare the performance of several software and hardware implementations of BLAST.
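For readers unfamiliar with what such hardware accelerates, the core of BLAST's first stage is exact word (w-mer) matching between query and subject sequences. The sketch below shows only that seeding step; real BLAST additionally uses neighborhood words, two-hit filtering, and gapped extension, none of which are shown here.

```python
# Minimal sketch of BLAST's seeding step: index all length-w words of the
# query, then scan the subject for exact word matches (seeds).

def find_seeds(query, subject, w=11):
    """Return (query_pos, subject_pos) pairs where a length-w word matches."""
    index = {}
    for i in range(len(query) - w + 1):
        index.setdefault(query[i:i + w], []).append(i)
    hits = []
    for j in range(len(subject) - w + 1):
        for i in index.get(subject[j:j + w], []):
            hits.append((i, j))
    return hits

seeds = find_seeds("ACGTACGT", "TTACGTAA", w=4)
```

The inner word-lookup loop is the part that maps naturally onto FPGA or VLSI pipelines, since every subject position can be matched against the query index independently.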
ISBN:
(Print) 9780898716245
We present a linear-work parallel iterative algorithm for solving linear systems involving Laplacians of planar graphs. In particular, if Ax = b, where A is the Laplacian of any planar graph with n nodes, the algorithm produces a vector x̃ such that ||x̃ - x||_A ≤ ε, in O(n^{1/6+c} log(1/ε)) parallel time, doing O(n log(1/ε)) work, where c is any positive constant. One of the key ingredients of the solver is an O(nk log²k)-work, O(k log n)-time parallel algorithm for decomposing any embedded planar graph into components of size O(k) that are delimited by O(n/√k) boundary edges. The result also applies to symmetric diagonally dominant matrices of planar structure.
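To make the problem setting concrete, the sketch below builds the Laplacian of a small planar grid graph and solves Lx = b with plain Gauss-Seidel, grounding one node to remove the Laplacian's all-ones nullspace. This only illustrates the kind of system being solved; the paper's contribution is an asymptotically fast parallel solver, which this naive iteration does not approximate.

```python
# Laplacian system on a planar grid graph, solved with Gauss-Seidel after
# grounding node 0 (the Laplacian itself is singular: constants are in its
# nullspace).

def grid_laplacian_neighbors(n):
    """Adjacency lists of an n x n grid graph, nodes numbered row-major."""
    nb = {v: [] for v in range(n * n)}
    for r in range(n):
        for c in range(n):
            v = r * n + c
            if c + 1 < n:
                nb[v].append(v + 1); nb[v + 1].append(v)
            if r + 1 < n:
                nb[v].append(v + n); nb[v + n].append(v)
    return nb

def solve_grounded(nb, b, iters=2000):
    """Gauss-Seidel sweeps on L x = b with x[0] fixed at 0. For each free
    node v: deg(v) * x[v] - sum of neighbor values = b[v]."""
    x = [0.0] * len(b)
    for _ in range(iters):
        for v in range(1, len(b)):
            x[v] = (b[v] + sum(x[u] for u in nb[v])) / len(nb[v])
    return x

nb = grid_laplacian_neighbors(3)
b = [0.0] * 9
b[8] = 1.0          # inject unit current at the far corner
x = solve_grounded(nb, b)
# Residual at the free nodes should be near zero after convergence.
residual = max(abs(len(nb[v]) * x[v] - sum(x[u] for u in nb[v]) - b[v])
               for v in range(1, 9))
```

The grounded system is an irreducibly diagonally dominant M-matrix, so Gauss-Seidel converges, but only at a rate that degrades with n; achieving the abstract's O(n log(1/ε)) total work is precisely what requires the paper's planar decomposition and preconditioning machinery.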
ISBN:
(Print) 1595934529
The proceedings contain 43 papers. The topics discussed include: publish and perish: definition and analysis of an n-person publication impact game; exponential separation of quantum and classical online space complexity; minimizing the stretch when scheduling flows of biological requests; position paper and brief announcement: the FG programming environment - good and good for you; efficient parallel algorithms for dead sensor diagnosis and multiple access channels; on the communication complexity of randomized broadcasting in random-like graphs; strip packing with precedence constraints and strip packing with release times; on space-stretch trade-offs: lower bounds; a performance analysis of local synchronization; the cache complexity of multithreaded cache oblivious algorithms; and deterministic load balancing and dictionaries in the parallel disk model.