The design of NoCs for multi-core chips introduces new design constraints like power consumption, area, and ultra low latencies. Although 2D meshes are preferred, heterogeneous blocks, fabrication faults, reliability ...
详细信息
ISBN:
(纸本)9780769530987;0769530982
The design of NoCs for multi-core chips introduces new design constraints like power consumption, area, and ultra low latencies. Although 2D meshes are preferred, heterogeneous blocks, fabrication faults, reliability issues, and chip virtualization may lead to the need of irregular topologies or regions. In this situation, efficient routing becomes a challenge. Although the use of routing tables at switches is flexible, it does not scale in terms of latency and area due to its memory requirements. LBDR (logic-based distributed routing) is proposed as a new routing method that removes the need of using routing tables at all. LBDR enables the implementation of many routing algorithms on most of the practical topologies we might find in the near future in a multi-core system. From an initial topology and routing algorithm, a set of three bits per switch/output port is computed. Evaluation results show that, by using a small logic, LBDR mimics the performance of routing algorithms when implemented with routing tables, both in regular and irregular topologies.
Over the years many interesting and efficient parallelalgorithms have been developed to solve a wide variety of problems, but not much attention has been paid to studying the inherent parallelism in sequential algori...
ISBN:
(纸本)9781450345934
Over the years many interesting and efficient parallelalgorithms have been developed to solve a wide variety of problems, but not much attention has been paid to studying the inherent parallelism in sequential algorithms---i.e., understanding the depth of their dependence structure, and how shallow dependence structures might beused to develop efficient parallel *** this talk I will describe recent work on analyzing the dependence depth of iterative sequential algorithms---ones that loop over a collection of elements. Many of these algorithms have deep dependence chains in the worst case, but shallow chains (polylog w.h.p.) if the elements are randomly ordered. Examples include many fundamental algorithms: the Knuth shuffle for random permutations, sorting by insertion into a binary search tree, greedy maximal independent set (MIS), greedy maximal matching, greedy graph-coloring, counting cycles in a permutation, incremental k-dimensional linear programming, and incremental 2d Delaunay *** advantage of the approach is that it can lead to very simple and efficient parallelalgorithms. Our MIS algorithm, for example can be coded in a dozen or so lines, and is significantly faster than Luby's algorithm on modern multicore machines. Also the approach encourages snapping the view that sequential and parallelalgorithms are distinct, and instead thinking of algorithms, in general, as collections of instructions with dependences among them.
We describe several optimizations which can be employed in a dynamic binary translation (DBT) system, where low compilation/translation overhead is essential. These optimizations achieve a high degree of ILP, sometime...
详细信息
We describe several optimizations which can be employed in a dynamic binary translation (DBT) system, where low compilation/translation overhead is essential. These optimizations achieve a high degree of ILP, sometimes even surpassing a static compiler employing more sophisticated, and more time-consuming algorithms. We present results in which we employ these optimizations in a dynamic binary translation system capable of computing oracle parallelism.
Serverless computing has shown vast potential for big data analytics applications, especially involving machine learning algorithms. Nevertheless, little consideration has been given in the literature to cloud-agnosti...
详细信息
ISBN:
(数字)9798350367300
ISBN:
(纸本)9798350367317
Serverless computing has shown vast potential for big data analytics applications, especially involving machine learning algorithms. Nevertheless, little consideration has been given in the literature to cloud-agnostic serverless architectures that leverage existing parallel implementations of machine learning algorithms. This work bridges this gap by proposing a multicloud serverless architecture for distributed machine learning, that enables machine learning engineers without cloud computing expertise to effortlessly port already implemented parallel machine learning algorithms to serverless, whilst overcoming vendor lock-in. In this work, two stateful machine learning algorithms have been ported to serverless, k-means clustering and logistic regression. The serverless implementation of k-means provided superior performance and scalability compared to a serverful implementation when using a number of workers that is equal to or slightly lower than the total number of vCPUs available on the VM running the serverful implementation. Additionally, it achieved an 87-fold speedup compared to a sequential implementation. Moreover, two storage designs of the shared state will be proposed for the serverless implementations, one that requires locks for updating the shared state, and another that is lock-free. Our experimental evaluation demonstrates that the performance of the lock-free serverless implementation of k-means declines with the increase in the number of clusters.
Given n jobs with release dates, deadlines and processing times we consider the problem of scheduling them on m parallel machines so as to minimize the total energy consumed. Machines can enter a sleep state and they ...
Given n jobs with release dates, deadlines and processing times we consider the problem of scheduling them on m parallel machines so as to minimize the total energy consumed. Machines can enter a sleep state and they consume no energy in this state. Each machine requires L units of energy to awaken from the sleep state and in its active state the machine can process jobs and consumes a unit of energy per unit time. We allow for preemption and migration of jobs and provide the first constant approximation algorithm for this problem.
Platform FPGAs that integrate sequential processors with a spatial fabric have become prevalent. While these hybrid architectures ease the burden of integrating sequential and spatial code in a single application, pro...
详细信息
Platform FPGAs that integrate sequential processors with a spatial fabric have become prevalent. While these hybrid architectures ease the burden of integrating sequential and spatial code in a single application, programming them, and particularly their spatial fabrics remains challenging. The difficulty arises in part from the lack of an agreed upon computational model and family of programming languages. In addition, moving algorithms into hardware is an arcane art far removed from the experience of most programmers. To address this challenge, we present a new type architecture, an abstract model analogous to the von Neumann machine for sequential computers, that can serve as common ground for algorithm designers, language designers, and hardware architects. We show that many parallelarchitectures, including platform FPGAs, are implementations of this type architecture. Using examples from a variety of application domains, we show how algorithms can be analyzed to estimate their performance on implementations of this type architecture. This analysis is done without having to delve into the details of any architecture in particular. Finally, we describe some of the common features of languages designed for expressing micro-parallelism, highlighting connections with the type architecture
In this paper we present an O(log n) time parallel algorithm for computing the convex hull of n points in ℜ3. This algorithm uses O(n1+α) processors on a CREW PRAM, for any constant 0 < α ≤ 1. So far, all adequa...
ISBN:
(纸本)9780897915823
In this paper we present an O(log n) time parallel algorithm for computing the convex hull of n points in ℜ3. This algorithm uses O(n1+α) processors on a CREW PRAM, for any constant 0 < α ≤ 1. So far, all adequately documented parallelalgorithms proposed for this problem use time at least O (log2 n). In addition, the algorithm presented here is the first parallel algorithm for the three-dimensional convex hull problem that is not based on the serial divide-and-conquer algorithm of Preparata and Hong, whose crucial operation is the merging of the convex hulls of two linearly separated point sets. The contributions of this paper are therefore (i) an O(logn) time parallel algorithm for the three-dimensional convex hull problem, and (ii) a parallel algorithm for this problem that does not follow the traditional divide-and-conquer paradigm.
Previous work (J.S. McCaskill et al., 1996; 1997) has shown the power of massively parallel configurable hardware (NGEN) in conjunction with dataflow architectures for the simulation of evolving populations. NGEN is a...
详细信息
Previous work (J.S. McCaskill et al., 1996; 1997) has shown the power of massively parallel configurable hardware (NGEN) in conjunction with dataflow architectures for the simulation of evolving populations. NGEN is a flexible computer hardware for rapid custom circuit simulation of fine grained physical processes via a massively parallel architecture, e.g. 144 hardware configurable field programmable gate arrays (FPGAs, XC4008, Xilinx). NGEN is optimized to implement dataflow architectures and systolic algorithms for large problems and is confectioned with high speed distributed SRAM, 144*8*256 kBit, 15ns access time, on the chip to chip interconnect. Microconfigurable FPGAs allow a further step to close the gap between micro electronics and biology on the information processing area. A design for a massively parallel microconfigurable computer (POLYP) is presented. It is designed to allow online evolution in hardware with significant locally controllable memory resources. It is also designed for high throughput dataflow applications with large problem size. Additionally, an evolvable interface between high rate measurement devices is provided to allow adaptive processing coupled with real time experimental environments. The computer represents the next logical step towards evolvable hardware interacting with biology beyond the massively parallel computer NGEN.
暂无评论