ISBN: 9780769502878 (print)
Complex, distributed applications pose new challenges for performance analysis and optimization. This paper outlines an online approach to performance analysis where developers are active participants, using integrated measurement and immersive performance visualization to tune parallel and distributed applications.
Recursive Diagonal Torus (RDT), a class of interconnection networks, is proposed for massively parallel computers with up to 2^16 nodes. By recursively adding remote links in the diagonal directions of the torus network, the RDT achieves a smaller diameter (e.g., 11 for 2^16 nodes) with fewer links per node (i.e., 8) than the hypercube. A simple routing algorithm called vector routing, which is near-optimal and easy to implement, is also proposed. The RDT includes the mesh structure and easily emulates hypercube and tree structures. FFT and the bitonic sorting algorithm are also easy to implement.
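The comparison with the hypercube can be made concrete with a short calculation. The sketch below relies only on the standard property that a binary hypercube with 2^n nodes has n links per node and diameter n; the RDT figures are copied from the abstract, not computed, and the function name is hypothetical.

```python
def hypercube_metrics(num_nodes):
    """Return (links_per_node, diameter) for a binary hypercube.

    A hypercube with 2**n nodes has n links per node and diameter n.
    """
    dim = num_nodes.bit_length() - 1
    if num_nodes < 1 or (1 << dim) != num_nodes:
        raise ValueError("num_nodes must be a power of two")
    return dim, dim


# RDT figures as quoted in the abstract for 2**16 nodes (not derived here).
RDT_LINKS_PER_NODE = 8
RDT_DIAMETER = 11

if __name__ == "__main__":
    links, diameter = hypercube_metrics(2 ** 16)
    print("hypercube:", links, "links per node, diameter", diameter)
    print("RDT:      ", RDT_LINKS_PER_NODE, "links per node, diameter", RDT_DIAMETER)
```

For 2^16 nodes the hypercube needs 16 links per node and has diameter 16, against the 8 links and diameter 11 reported for the RDT.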
ISBN: 9780769502878 (print)
Recently the GEMINI Holographic Particle Image Velocimetry (HPIV) system developed in the Laser Flow Diagnostics (LFD) lab at Kansas State University has been successfully applied in volumetric 3-D flow velocity measurement. The 3-D nature of this application imposes very large computation and communication requirements. An innovative algorithm, the Concise Cross Correlation (CCC), is employed in the system to extract the velocity field from the hologram of the test flows. With CCC we achieved a compression ratio of 104 and a processing speed 1000 times faster than with traditional 3-D FFT-based correlation. To further accelerate processing for fully time- and space-resolved measurement, parallel processing is necessary. We present our design for a distributed system supporting this previously unparallelized application, and comment on our experiences implementing a master-slave distributed version of CCC utilizing MPI. Brief experimental results on Gigabit Ethernet and multiprocessor Pentium Xeon systems are given.
ISBN: 9781424479535 (print)
GPS-based navigation systems became popular in dedicated handheld devices and are now also found in modern cell phones and other small personal devices. A key element of any navigation system is fast and effective route finding, and this depends heavily on Dijkstra's shortest path algorithm. Dijkstra's algorithm is serial in nature; prior efforts to accelerate it through parallel processing have had almost no success. In this paper, we present a practical approach to extract small-scale parallelism by shifting priority queue operations to a secondary tightly-coupled processor. We obtain a substantial speedup on real-world graphs (in particular, road maps), allowing the development of navigation systems that are more responsive and also lower in total power consumption.
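To show where the priority queue operations sit in the algorithm, here is a minimal sketch of textbook Dijkstra using Python's `heapq`. It is a plain serial version, not the paper's offloading scheme; the comments only mark the pop and push steps that such a scheme would presumably hand to the secondary processor. The `ROADS` graph and all names are hypothetical.

```python
import heapq


def dijkstra(graph, source):
    """Shortest distances from source in a graph with non-negative weights.

    graph maps each node to a list of (neighbor, weight) pairs.
    """
    dist = {source: 0}
    heap = [(0, source)]
    while heap:
        # Priority queue operation: extract the closest unsettled node.
        d, u = heapq.heappop(heap)
        if d > dist.get(u, float("inf")):
            continue  # stale entry left behind by an earlier improvement
        for v, w in graph.get(u, []):
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                # Priority queue operation: insert the improved distance.
                heapq.heappush(heap, (nd, v))
    return dist


# Hypothetical four-junction road map, used only for illustration.
ROADS = {
    "A": [("B", 4), ("C", 1)],
    "B": [("D", 3)],
    "C": [("B", 2), ("D", 7)],
    "D": [],
}

if __name__ == "__main__":
    print(dijkstra(ROADS, "A"))
```

Every iteration of the main loop depends on the result of the previous pop, which is why the algorithm resists coarse-grained parallelization and why the queue operations are a natural target.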
Programmable embedded systems are ubiquitous nowadays, and their number will increase even further with the emergence of Ambient Intelligence. One of the first challenges for embedded systems is mastering the increasing complexity of future Systems on Chip (SoC). The complexity will inevitably increase because applications are becoming more and more demanding and algorithmic complexity grows exponentially over time.
Distributed signal processing algorithms suitable for implementation over wireless sensor networks (WSNs) and ad hoc networks with communications and computing capabilities have become a hot topic in recent years. One class of algorithms that has received special attention is particle filters. However, most distributed versions of these methods involve various heuristic or simplifying approximations and, as a consequence, classical convergence theorems for standard particle filters do not hold for their distributed counterparts. In this paper, we look into a distributed particle filter scheme that has been proposed for implementation in both parallel computing systems and WSNs, and prove that, under certain stability assumptions regarding the physical system of interest, its asymptotic convergence is guaranteed. Moreover, we show that convergence is attained uniformly over time. This means that approximation errors can be kept bounded for an arbitrarily long period of time without having to progressively increase the computational effort.
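For readers unfamiliar with the baseline, the following is a minimal sketch of a standard (centralized) bootstrap particle filter, the kind of method to which the classical convergence theorems apply. It is not the distributed scheme analyzed in the paper. The scalar linear-Gaussian model, its parameters `a`, `q`, `r`, and the function name are all assumptions chosen for illustration.

```python
import math
import random


def bootstrap_filter(observations, n_particles=500, a=0.9, q=1.0, r=1.0, seed=0):
    """Posterior-mean estimates for the assumed model
    x_t = a * x_{t-1} + N(0, q^2),  y_t = x_t + N(0, r^2).
    """
    rng = random.Random(seed)
    particles = [rng.gauss(0.0, 1.0) for _ in range(n_particles)]
    estimates = []
    for y in observations:
        # 1. Propagate each particle through the state equation.
        particles = [a * x + rng.gauss(0.0, q) for x in particles]
        # 2. Weight by the Gaussian likelihood, computed in log space and
        #    shifted by the maximum so that at least one weight equals 1.
        log_w = [-0.5 * ((y - x) / r) ** 2 for x in particles]
        top = max(log_w)
        weights = [math.exp(lw - top) for lw in log_w]
        total = sum(weights)
        # 3. Estimate the state as the weighted mean of the particles.
        estimates.append(sum(w * x for w, x in zip(weights, particles)) / total)
        # 4. Resample (multinomial) to counter weight degeneracy.
        particles = rng.choices(particles, weights=weights, k=n_particles)
    return estimates


if __name__ == "__main__":
    print(bootstrap_filter([0.5, 1.0, 1.5, 1.0]))
```

The particle count is fixed for the whole run. A convergence result that holds uniformly over time says, in these terms, that `n_particles` need not grow with the number of observations to keep the error bounded.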
The task-based programming paradigm offers a portable way of writing parallel applications. However, it requires tedious tuning of the application for performance. We present a novel design flow where programmers can use application knowledge to easily generate a System-on-Chip (SoC) specialized in executing the application. Our design flow uses a compiler that automatically generates task-specific cores and packs them into a custom SoC. An SoC-specific runtime system schedules tasks on cores to accelerate application execution. The generated SoC shows up to 6000 times performance improvement in comparison to the Altera NiosII/s processor and up to 7 times compared to an AMD Opteron 6172 core. Our design flow helps programmers generate high-performance systems without requiring tuning and prior hardware design knowledge.
ISBN: 9781424400546 (print)
High-performance computing clusters running long-lived tasks currently cannot have kernel software updates applied to them without causing system downtime. These clusters miss opportunities for increased performance via specialized kernel support, cannot benefit from new kernel features, and continue to operate with kernel security holes unpatched, at least until the next scheduled maintenance date. We developed a system enabling dynamic kernel updates in parallel computing clusters to address these problems. Our system, DynAMOS, is founded on execution flow hijacking through function cloning. It enables commodity operating systems popularly used in clusters to gain adaptive and mutative capabilities. To demonstrate the efficacy of our system, we illustrate our experience in dynamically updating and extending a Linux cluster. We introduce adaptive memory paging for efficient gang-scheduling, extend the kernel's process scheduler to support unobtrusive fine-grain cycle stealing, apply public security fixes, and inject performance monitoring functionality into a selection of kernel functions. Our benchmarks show that the overhead imposed by DynAMOS is mostly in the range of 1-8% for common Linux kernel functions.
Current processors exploit out-of-order execution and branch prediction to improve instruction-level parallelism. When a branch prediction is wrong, processors flush the pipeline and squash all the speculative work. However, control-flow independent instructions compute the same results when they re-enter the pipeline down the correct path. If these instructions are not squashed, the branch misprediction penalty can be significantly reduced. In this paper we present a novel mechanism that detects control-flow independent instructions, executes them before the branch is resolved, and avoids their re-execution in the case of a branch misprediction. The mechanism can detect and exploit control-flow independence even for instructions that are far away from the corresponding branch and even out of the instruction window. Performance figures show that the proposed mechanism can exploit control-flow independence for nearly 50% of the mispredicted branches, which results in a performance improvement that ranges from 14% to 17.8% for realistic configurations of forthcoming microprocessors.
ISBN: 9781467390330 (print)
This paper presents a framework for real-time reactive stream processing. The approach is to extend the proposed Java 9 Reactive Streams model and integrate it with the Real-Time Specification for Java. The approach leverages a real-time version of the Java 8 Stream processing framework. Our approach addresses the major issue when using Reactive Streams in real-time: there is no way to set the timeout. Our evaluation shows there is significant improvement in the predictability of stream processing with our framework over that of one implemented using regular Java.