ISBN: (print) 0818656026
The broad thesis presented in this paper suggests that the serial emulation of a parallel algorithm has the potential advantage of running on a serial machine faster than a standard serial algorithm for the same problem. It is too early to reach definite conclusions regarding the significance of this thesis. However, using some imagination, validity of the thesis and some arguments supporting it may lead to several far-reaching outcomes: (1) Reliance on 'predictability of reference' in the design of computer systems will increase. (2) Parallel algorithms will be taught as part of the standard computer science and engineering undergraduate curriculum irrespective of whether (or when) parallel processing will become ubiquitous in the general-purpose computing world.
ISBN: (print) 9783642246494
The memory-CPU single communication channel bottleneck of the von Neumann architecture is quickly stalling the growth of computer processors. A probable solution to this problem is to fuse processing and memory elements. A simple low-latency single on-chip memory and processor cannot solve the problem, as the fundamental channel bottleneck will still be there due to the logical splitting of processor and memory. This paper shows that a paradigm shift is possible by combining arithmetic logic unit and random access memory (ARAM) elements at the bit level. This modest bit-level ARAM is used to perform word-level ALU instructions with minor modifications. This makes the ARAM cells capable of executing instructions in parallel. It is also asynchronous and hence reduces power consumption significantly. A CMOS implementation is presented that verifies the practicality of the proposed ARAM.
ISBN: (print) 9783030050573; 9783030050566
Gurevich's thesis stipulates that sequential abstract state machines (ASMs) capture the essence of sequential algorithms. On the other hand, the bulk-synchronous parallel (BSP) bridging model is a well-known model for HPC algorithm design. It provides a conceptual bridge between the physical implementation of the machine and the abstraction available to a programmer of that machine. The assumptions of the BSP model thus provide portable and scalable performance predictions on most HPC systems. We follow Gurevich's thesis and extend the sequential postulates in order to intuitively and realistically capture BSP algorithms.
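To illustrate the kind of performance prediction the BSP model supports, the sketch below applies the standard BSP cost formula, in which one superstep costs w + h·g + l (w: largest local computation on any processor, h: largest number of words any processor sends or receives, g: communication cost per word, l: barrier synchronisation cost). It is not taken from the paper; the function names and the sample values are hypothetical.

```python
def superstep_cost(local_work, h_relation, g, l):
    """Cost of one BSP superstep: max local work + h * g + l.

    local_work -- per-processor computation costs (the slowest one dominates)
    h_relation -- largest number of words sent or received by any processor
    g          -- machine parameter: communication cost per word
    l          -- machine parameter: barrier synchronisation cost
    """
    return max(local_work) + h_relation * g + l


def bsp_cost(supersteps, g, l):
    """Total predicted cost of a BSP program: the sum over its supersteps."""
    return sum(superstep_cost(work, h, g, l) for work, h in supersteps)


# Hypothetical program: two supersteps on four processors.
program = [
    ([100, 120, 90, 110], 10),  # (per-processor work, h-relation)
    ([50, 50, 50, 50], 4),
]
predicted = bsp_cost(program, g=2, l=30)
```

Because only g and l depend on the machine, the same program description can be re-evaluated for a different system by changing those two parameters.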
ISBN: (print) 3540440496
The proceedings contain 140 papers. The special focus in this conference is on parallel processing. The topics include: orchestrating computations on the world-wide web; non-massive, non-high performance, distributed computing; facts on performance evaluation and its dependence on workloads; concepts and technologies for a worldwide grid infrastructure; a performance analysis tool for distributed and parallel programs; a hybrid strategy for automated performance problem searches; on the scalability of tracing mechanisms; component based problem solving environment; integrating temporal assertions into a parallel debugger; performance evaluation, analysis and optimization; prototyping and verifying stream-processing systems; symbolic cost estimation of parallel applications; performance modeling and interpretive simulation of PIM architectures and applications; extended overhead analysis for OpenMP; a call-graph based automatic tool for capture of hardware performance metrics for MPI and OpenMP applications; performance tuning through source code interdependence; on scheduling task-graphs to LogP-machines with disturbances; optimal scheduling algorithms for communication constrained parallel processing; an automatic scheduler for parallel machines; non-approximability results for the hierarchical communication problem with a bounded number of clusters; non-approximability of the bulk synchronous task scheduling problem; adjusting time slices to apply coscheduling techniques in a non-dedicated NOW; a semi-dynamic multiprocessor scheduling algorithm with an asymptotically optimal competitive ratio; tiling and memory reuse for sequences of nested loops; towards detection of coarse-grain loop-level parallelism in irregular computations; and parallel and distributed databases, data mining and knowledge discovery.
ISBN: (print) 9783030602451; 9783030602444
Lattice sieving is currently the leading class of algorithms for solving the shortest vector problem over lattices. The computational difficulty of this problem is the basis for constructing secure post-quantum public-key cryptosystems based on lattices. In this paper, we present a novel massively parallel approach for solving the shortest vector problem using lattice sieving and hardware acceleration. We combine previously reported algorithms with a proper caching strategy and develop a hardware architecture. The main advantage of the proposed approach is eliminating the overhead of the data transfer between a CPU and a hardware accelerator. The authors believe that this is the first such architecture reported in the literature to date and predict up to 8 times higher throughput compared to a multi-core high-performance CPU. The presented methods can be adapted for other sieving algorithms that are hard to implement in FPGAs due to the communication and memory bottleneck.
ISBN: (print) 9791092279061
This paper presents the Parallel Heterogeneous Architecture Technology (PHAT), a scalable design methodology for prototyping and evaluating heterogeneous arrays of software-programmable VLIW processors and both manually designed and automatically compiled custom hardware accelerators, using a shared memory architecture for communication. We discuss the trade-offs and break-even point for switching from bus-based to network-on-chip interconnects, the interface and protocols for connecting distributed on-chip caches and multi-bank out-of-order off-chip memories, as well as the impact of floorplanning on the quality of results for implementation on Xilinx Virtex 6 LX 760 devices. The capabilities are evaluated at the system level on the multi-FPGA Convey HC-1ex hybrid-core computer, accessing its high-performance memory system, and integrating r-VEX processor cores with IP blocks for SHA and FFT computations.
ISBN: (print) 9781450301787
Medical imaging provides physicians with the ability to generate 3D images of the human body in order to detect and diagnose a wide variety of ailments. Making medical imaging portable and more accessible presents a unique set of challenges. In order to increase portability, the power consumed in image acquisition, currently the most power-consuming activity in an imaging device, must be dramatically reduced. This can only be done, however, by using complex image reconstruction algorithms to correct artifacts introduced by low-power acquisition, resulting in image processing becoming the dominant power-consuming task. Current solutions use combinations of digital signal processors, general-purpose processors and, more recently, general-purpose graphics processing units for medical image processing. These solutions fall short for various reasons, including high power consumption and an inability to execute the next generation of image reconstruction algorithms. This paper presents the MEDICS architecture, a domain-specific multicore architecture designed specifically for medical imaging applications, but with sufficient generality to make it programmable. The goal is to achieve 100 GFLOPs of performance while consuming orders of magnitude less power than the existing solutions. MEDICS has a throughput of 128 GFLOPs while consuming as little as 1.6W of power on advanced CT reconstruction applications. This represents up to a 20X increase in computation efficiency over current designs.
ISBN: (print) 9780780397361
A wavelet-based parallel implementation is presented for image encoding on a multi-DSP system. The implementation uses the discrete wavelet transform (DWT) and is realized in a parallel processor architecture. The implementation has a very flexible architecture, which allows extra slave processors (SPs) to be added to the system whenever more computational power is needed. The performance of the implementation is measured and compared to a sequential reference implementation. Experimental results show that the parallel implementation is very efficient and considerably outperforms its sequential counterpart.
Innovations in powerful high-performance computing (HPC) architecture are enabling high-fidelity whole-core neutron transport simulations in a reasonable time. Especially, the currently fashionable heterogeneous archite...
ISBN: (print) 9781479907298
Graph algorithms play a prominent role in several fields of science and engineering. Notable among them are graph traversal, finding the connected components of a graph, and computing shortest paths. There are several efficient implementations of these problems on a variety of modern multiprocessor architectures. In recent times, the size of the graphs that correspond to real-world data sets has been increasing. Parallelism offers only limited relief in this situation, as current parallel architectures have severe shortcomings when deployed for most graph algorithms. At the same time, these graphs are also becoming very sparse. This calls for work-efficient solutions aimed at processing large, sparse graphs on modern parallel architectures. In this paper, we introduce graph pruning as a technique that aims to reduce the size of the graph. Certain elements of the graph can be pruned depending on the nature of the computation. Once a solution is obtained for the pruned graph, the solution is extended to the entire graph. We apply this technique to three fundamental graph algorithms: Breadth First Search (BFS), Connected Components (CC), and All Pairs Shortest Paths (APSP). To validate our technique, we implement our algorithms on a heterogeneous platform consisting of a multicore CPU and a GPU. On this platform, we achieve an average improvement of 35% compared to state-of-the-art solutions. Such an improvement has the potential to speed up other applications that rely on these algorithms.
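The prune, solve, extend pattern described above can be sketched for connected components as follows. This is a minimal serial illustration, not the authors' CPU/GPU implementation: the abstract does not say which elements are pruned, so the choice of repeatedly peeling degree-1 vertices is an assumption, as are all function names. The graph is assumed to be undirected, without self-loops, and given as a dict mapping each vertex to a set of neighbours.

```python
from collections import deque


def prune_leaves(adj):
    """Repeatedly peel degree-1 vertices (assumed pruning rule).

    Returns the remaining core vertices, the removal order, and for each
    removed vertex the neighbour it was attached to when it was removed.
    """
    degree = {v: len(ns) for v, ns in adj.items()}
    removed, order, parent = set(), [], {}
    queue = deque(v for v, d in degree.items() if d == 1)
    while queue:
        v = queue.popleft()
        if v in removed or degree[v] != 1:
            continue  # its only neighbour was peeled first; keep v in the core
        u = next(n for n in adj[v] if n not in removed)
        removed.add(v)
        order.append(v)
        parent[v] = u
        degree[v] = 0
        degree[u] -= 1
        if degree[u] == 1:
            queue.append(u)
    return set(adj) - removed, order, parent


def connected_components(adj):
    """Label components on the pruned core, then extend to pruned vertices."""
    core, order, parent = prune_leaves(adj)
    label = {}
    for s in core:  # plain BFS restricted to the core
        if s in label:
            continue
        label[s] = s
        frontier = deque([s])
        while frontier:
            v = frontier.popleft()
            for n in adj[v]:
                if n in core and n not in label:
                    label[n] = s
                    frontier.append(n)
    # A pruned vertex's parent was removed later or never, so walking the
    # removal order backwards always finds the parent already labelled.
    for v in reversed(order):
        label[v] = label[parent[v]]
    return label


# Triangle 0-1-2 with a tail 2-3-4, a separate edge 5-6, and an isolated vertex 7.
graph = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2, 4}, 4: {3},
         5: {6}, 6: {5}, 7: set()}
labels = connected_components(graph)
```

In this sketch the expensive traversal touches only the core, and each pruned vertex is labelled with a single lookup, which is where sparse graphs with many low-degree vertices would be expected to benefit.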