ISBN (Print): 9781450307437
As the gap between the cost of communication (i.e., data movement) and computation continues to grow, the importance of pursuing algorithms which minimize communication also increases. Toward this end, we seek asymptotic communication lower bounds for general memory models and classes of algorithms. Recent work [2] has established lower bounds for a wide set of linear algebra algorithms on a sequential machine and on a parallel machine with identical processors. This work extends these previous bounds to a heterogeneous model in which processors access data and perform floating point operations at differing speeds. We also present an algorithm for dense matrix multiplication which attains the lower bound.
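The lower bound in question is the classic Omega(n^3 / sqrt(M)) bound on words moved between fast memory (size M) and slow memory for dense n-by-n matrix multiplication. As a rough illustration of why blocked algorithms attain it up to a constant factor, here is a sketch in a simple two-level memory model; the traffic-counting formula is our own back-of-the-envelope accounting, not the paper's heterogeneous analysis.

```python
import math

def blocked_matmul_traffic(n, M):
    """Estimate slow-to-fast memory traffic (in words) for a blocked
    n x n matrix multiply with fast memory of M words, and compare it
    to the Omega(n^3 / sqrt(M)) lower bound. Illustrative model only."""
    # Block size chosen so three b x b blocks (of A, B, C) fit in fast memory.
    b = max(1, math.isqrt(M // 3))
    blocks = math.ceil(n / b)
    # Each of the blocks^3 block-multiplies loads at most three b x b blocks
    # (a coarse upper bound; reusing the C block would shave a constant).
    words_moved = 3 * (b * b) * blocks ** 3
    lower_bound = n ** 3 / math.sqrt(M)
    return words_moved, lower_bound

moved, lb = blocked_matmul_traffic(1024, 4096)
# The blocked algorithm's traffic is within a small constant of the bound.
assert lb < moved < 20 * lb
```

The heterogeneous setting of the paper generalizes this by letting each processor have its own memory size and speeds; the sketch above only shows the homogeneous two-level intuition.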
ISBN (Print): 9781450300797
Energy consumption by computer systems has emerged as an important concern. However, the energy consumed in executing an algorithm cannot be inferred from its performance alone; it must be modeled explicitly. This paper analyzes energy consumption of parallel algorithms executed on shared memory multicore processors. Specifically, we develop a methodology to evaluate how energy consumption of a given parallel algorithm changes as the number of cores and their frequency is varied. We use this analysis to establish the optimal number of cores to minimize the energy consumed by the execution of a parallel algorithm for a specific problem size while satisfying a given performance requirement. We study the sensitivity of our analysis to changes in parameters such as the ratio of the power consumed by a computation step versus the power consumed in accessing memory. The results show that the relation between the problem size and the optimal number of cores is relatively unaffected for a wide range of these parameters.
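The core idea, minimizing energy over the number of cores subject to a performance constraint, can be sketched with a toy model: a compute phase that scales with p*f, a memory-bound phase that does not, and dynamic power growing as p*f^3 (the usual CMOS rule of thumb). The model, its parameters, and its normalized units are our own illustrative assumptions, not the paper's methodology.

```python
def exec_time(p, f, w_comp=64.0, w_mem=1.0):
    """Runtime (normalized units): a parallel compute phase that scales
    with p * f, plus a memory-bound phase that does not scale with p."""
    return w_comp / (p * f) + w_mem

def energy(p, f, cap=0.05, p_mem=1.0):
    """Energy = power * time; dynamic power grows as p * f^3, while the
    memory-system power p_mem is fixed (toy model, not the paper's)."""
    power = cap * p * f ** 3 + p_mem
    return power * exec_time(p, f)

def optimal_cores(f, deadline, max_p=64):
    """Lowest-energy core count meeting the deadline at frequency f,
    or None if no core count can meet it."""
    feasible = [p for p in range(1, max_p + 1) if exec_time(p, f) <= deadline]
    return min(feasible, key=lambda p: energy(p, f)) if feasible else None

best = optimal_cores(f=1.0, deadline=10.0)  # an interior optimum, not max_p
```

More cores shorten the compute phase but add dynamic power spent idling through the memory-bound phase, so the energy-optimal core count sits strictly between the minimum feasible and the maximum available, which mirrors the paper's observation that an optimal number of cores exists.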
ISBN (Print): 9781450300797
We present a scheduling algorithm of stream programs for multi-core architectures called team scheduling. Compared to previous multi-core stream scheduling algorithms, team scheduling achieves 1) similar synchronization overhead, 2) coverage of a larger class of applications, 3) better control over buffer space, 4) deadlock-free feedback loops, and 5) lower latency. We compare team scheduling to the latest stream scheduling algorithm, SGMS, by evaluating 14 applications on a multi-core architecture with 16 cores. Team scheduling successfully targets applications that cannot be validly scheduled by SGMS due to excessive buffer requirements or deadlocks in feedback loops (e.g., GSM and W-CDMA). For applications that can be validly scheduled by SGMS, team scheduling shows on average 37% higher throughput within the same buffer space constraints.
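Buffer sizing in stream scheduling starts from the steady-state firing rates of the actors. The following sketch solves the standard synchronous-dataflow balance equations for a chain of actors; it illustrates the substrate that schedulers like SGMS and team scheduling build on, not team scheduling itself, and the graph and rates are made up for the example.

```python
from fractions import Fraction
from math import gcd

def repetitions(edges):
    """Solve the SDF balance equations q[src] * prod == q[dst] * cons
    for a connected graph; edges is a list of (src, dst, prod, cons).
    Returns the smallest integer repetition vector (steady-state firings)."""
    q = {edges[0][0]: Fraction(1)}  # seed the first actor with rate 1
    changed = True
    while changed:
        changed = False
        for s, d, prod, cons in edges:
            if s in q and d not in q:
                q[d] = q[s] * prod / cons
                changed = True
            elif d in q and s not in q:
                q[s] = q[d] * cons / prod
                changed = True
    # Scale all rates by the LCM of denominators to get smallest integers.
    lcm = 1
    for v in q.values():
        lcm = lcm * v.denominator // gcd(lcm, v.denominator)
    return {actor: int(v * lcm) for actor, v in q.items()}

# A produces 2 tokens per firing, B consumes 3; B produces 1, C consumes 2.
reps = repetitions([("A", "B", 2, 3), ("B", "C", 1, 2)])
assert reps == {"A": 3, "B": 2, "C": 1}
```

One steady-state iteration fires A three times, B twice, and C once, and the per-edge buffer a scheduler must provision follows from these rates together with how the schedule interleaves the firings, which is exactly where buffer-space control differentiates scheduling algorithms.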
ISBN (Print): 9781595939739
The proceedings contain 53 papers. The topics discussed include: a first insight into object-aware hardware transactional memory; safe open-nested transactions through ownership; leveraging non-blocking collective communication in high-performance applications; fractal communication in software data dependency graphs; many random walks are faster than one; improved distributed approximate matching; graph partitioning into isolated, high conductance clusters: theory, computation and applications to preconditioning; automatic data partitioning in software transactional memories; checkpoints and continuations instead of nested transactions; adaptive transaction scheduling for transactional memory systems; operational analysis of processor speed scaling; and kicking the tires of software transactional memory: why the going gets tough.
ISBN (Print): 0818619333
A regular architecture is proposed for the evaluation of polynomials of arbitrary degree. Control of the execution is accomplished with cube-type instructions. The time complexity is derived. The basic design is easily adapted to a pipeline structure. The study also shows that the constraint of regularity may make faster algorithms unsuitable for certain implementations.
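The regular structure such polynomial evaluators typically implement is Horner's rule: one multiply-add cell per step, identical at every position, which is what makes the design easy to lay out and to pipeline. A minimal software rendering of that recurrence (our own illustration; the paper's hardware details are not reproduced here):

```python
def horner(coeffs, x):
    """Evaluate a_0 + a_1*x + ... + a_n*x^n with n fused multiply-adds.
    coeffs[i] is the coefficient of x^i. Each loop iteration corresponds
    to one identical cell in a regular (or pipelined) hardware array."""
    acc = 0
    for a in reversed(coeffs):
        acc = acc * x + a
    return acc

assert horner([1, 2, 3], 2) == 1 + 2 * 2 + 3 * 4  # == 17
```

A pipelined version streams a new x value into the first cell each cycle while partial results advance through the array; the trade-off the abstract notes is that asymptotically faster schemes (e.g., evaluation by splitting) break this cell-to-cell uniformity.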
ISBN (Print): 9781595934529
In chip multiprocessors (CMPs), limiting the number of off-chip cache misses is crucial for good performance. Many multithreaded programs provide opportunities for constructive cache sharing, in which concurrently scheduled threads share a largely overlapping working set. In this brief announcement, we highlight our ongoing study [4] comparing the performance of two schedulers designed for fine-grained multithreaded programs: Parallel Depth First (PDF) [2], which is designed for constructive sharing, and Work Stealing (WS) [3], which takes a more traditional approach.

Overview of schedulers. In PDF, processing cores are allocated ready-to-execute program tasks such that higher scheduling priority is given to those tasks the sequential program would have executed earlier. As a result, PDF tends to co-schedule threads in a way that tracks the sequential execution. Hence, the aggregate working set is (provably) not much larger than the single-thread working set [1]. In WS, each processing core maintains a local work queue of ready-to-execute threads. Whenever its local queue is empty, the core steals a thread from the bottom of the first non-empty queue it finds. WS is an attractive scheduling policy because when there is plenty of parallelism, stealing is quite rare. However, WS is not designed for constructive cache sharing, because the cores tend to have disjoint working sets.

CMP configurations studied. We evaluated the performance of PDF and WS across a range of simulated CMP configurations. We focused on designs that have fixed-size private L1 caches and a shared L2 cache on chip. For a fixed die size (240 mm²), we varied the number of cores from 1 to 32. For a given number of cores, we used a (default) configuration based on current CMPs and realistic projections of future CMPs as process technologies decrease from 90nm.

Summary of findings. We studied a variety of benchmark programs to show the following. For several application classes, PDF enables…
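The WS policy described above, where each core works LIFO on its own queue and steals FIFO from "the bottom" of another core's queue, can be sketched as a sequential toy simulation. Everything here (the round-robin stepping, the synthetic binary task tree) is our own illustration of the policy, not the schedulers or simulator from the study.

```python
from collections import deque
import random

random.seed(0)  # make victim selection reproducible

class WSWorker:
    """One core's local deque: the owner pushes/pops at the right (LIFO);
    thieves take from the left, i.e., the bottom of the queue."""
    def __init__(self):
        self.tasks = deque()

def run(workers, root, spawn):
    """Execute a task tree under a toy work-stealing policy.
    spawn(task) returns the child tasks a task creates when executed."""
    workers[0].tasks.append(root)  # seed one core, like a sequential start
    steals = done = 0
    while any(w.tasks for w in workers):
        for w in workers:  # one round-robin "time step" per pass
            if not w.tasks:
                victims = [v for v in workers if v.tasks]
                if victims:  # steal from the bottom of a victim's queue
                    w.tasks.append(random.choice(victims).tasks.popleft())
                    steals += 1
            if w.tasks:
                task = w.tasks.pop()         # own queue is LIFO
                done += 1
                w.tasks.extend(spawn(task))  # children stay local
    return done, steals

# Synthetic binary task tree: node t spawns 2t and 2t+1 while t < 8
# (15 nodes total, heap-style numbering).
done, steals = run([WSWorker() for _ in range(4)], 1,
                   lambda t: [2 * t, 2 * t + 1] if t < 8 else [])
assert done == 15
```

Because stolen tasks come from the bottom of the victim's queue, they tend to be old, large subtrees, which is why the cores end up working on disjoint regions of the computation, exactly the disjoint-working-set behavior the announcement contrasts with PDF's sequential-order co-scheduling.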