ISBN (Print): 0769520278
Our research goal is to retarget image processing programs written in sequential languages (e.g., C) to architectures with data-parallel processing capabilities. Image processing algorithms are often inherently data-parallel, but the artifacts imposed by the sequential programming language (e.g., loops, pointer variables, linear address spaces) can obscure the parallelism and prohibit generation of efficient parallel code. This paper proposes a program representation and pattern-recognition approach for generating a data-parallel program specification from sequential source code. The representation is based on an extension of the multidimensional synchronous dataflow (MDSDF) model of computation. Central to extracting this representation from code is understanding the mapping between iterations and array variables in the source code and the operations over array regions (e.g., rows, columns, tiled blocks) that they implement. Examples are presented to illustrate this mapping, and a set of patterns for recognizing these regions is proposed. The correctness of the retargeted MDSDF specifications is validated and the potential speedup from parallel execution is shown.
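To make the iteration-to-region mapping concrete, the following hypothetical sequential C kernel (not taken from the paper) applies a row-wise difference over a linearly addressed image buffer; recognizing that each outer iteration consumes exactly one row is the kind of pattern that would let a retargeting tool emit a row-parallel MDSDF specification.

```c
/* Hypothetical sequential kernel of the kind the abstract describes: the loop
 * nest walks a linear buffer with a computed index, which hides the fact that
 * each outer iteration operates on one image row and could run as a
 * data-parallel actor over row regions. */
#include <stdio.h>

#define W 8
#define H 4

/* Row-wise horizontal difference: out[y][x] = in[y][x+1] - in[y][x]. */
static void row_diff(const int *in, int *out)
{
    for (int y = 0; y < H; y++) {          /* each iteration touches row y only  */
        for (int x = 0; x < W - 1; x++) {  /* linear addressing obscures the row */
            out[y * W + x] = in[y * W + x + 1] - in[y * W + x];
        }
        out[y * W + (W - 1)] = 0;
    }
}

int main(void)
{
    int in[W * H], out[W * H];
    for (int i = 0; i < W * H; i++)
        in[i] = (i * i) % 13;              /* arbitrary test data */
    row_diff(in, out);
    printf("out[1][2] = %d\n", out[1 * W + 2]);
    return 0;
}
```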
ISBN (Print): 9783642131356
Network packet processing applications increasingly execute at speeds of 1-40 Gigabits per second, often running on multi-core chips that contain multithreaded network processing units (NPUs) and a general-purpose processor core. Such applications are typically programmed in a language that exposes NPU specifics needed to optimize low-level thread control and resource management. This facilitates optimization at the cost of increased software complexity and reduced portability. In contrast, our approach provides portability by combining coarse-grained SPMD parallelism with programming in the packetC language's high-level constructs. This paper focuses on searching packet contents for packet protocol headers. We require the host system to locate protocol headers for layers 2, 3 and 4, and to encode their offset data in a packet information block (PIB). packetC provides descriptors, C-style structures superimposed on the packet array at runtime-calculable, user- or PIB-supplied offsets. We deliver state-of-the-practice performance via an FPGA for locating layer offsets and via micro-coded interpretation that treats PIB layer offsets as a special addressing mode.
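The sketch below is a plain-C analogue, not actual packetC syntax, of the descriptor idea the abstract mentions: a C-style structure is laid over the packet byte array at an offset supplied by the packet information block. The struct layout, field names and offsets are illustrative assumptions.

```c
/* Plain-C analogue (not packetC) of a descriptor superimposed on the packet
 * array at a PIB-supplied offset. All names and offsets are illustrative. */
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Packet information block: layer 2/3/4 header offsets located by the host. */
struct pib {
    uint16_t l2_offset;
    uint16_t l3_offset;
    uint16_t l4_offset;
};

/* Minimal IPv4-header "descriptor" laid over raw packet bytes. */
struct ipv4_hdr {
    uint8_t  ver_ihl;
    uint8_t  tos;
    uint16_t total_len;
    /* remaining fields omitted for brevity */
};

int main(void)
{
    uint8_t packet[64] = {0};
    struct pib info = { .l2_offset = 0, .l3_offset = 14, .l4_offset = 34 };

    packet[14] = 0x45;                  /* pretend an IPv4 header starts at byte 14 */

    struct ipv4_hdr ip;                 /* copy out to respect alignment rules */
    memcpy(&ip, packet + info.l3_offset, sizeof ip);
    printf("IP version = %u\n", ip.ver_ihl >> 4);
    return 0;
}
```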
ISBN (Print): 9781450347860
The use of a reconfigurable computer vision architecture for image processing tasks is an important and challenging application in real-time systems with limited resources. It is an emerging field as new computing architectures are developed, new algorithms are proposed and users define new emerging applications in surveillance. In this paper, a computer vision architecture capable of reconfiguring the processing chain of computer vision algorithms is summarised. The processing chain consists of multiple computer vision tasks, which can be distributed over various computing units. One key characteristic of the designed architecture is graceful degradation, which prevents the system from failing. This characteristic is achieved by distributing computer vision tasks to other nodes and parametrizing each task depending on the specified quality of service. Experiments using an object detector applied to a public dataset are presented.
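A minimal sketch of the graceful-degradation idea described above, under the assumption (not stated in the abstract) that each task carries a single quality parameter: when a node fails, its tasks are reassigned to the surviving nodes and their quality parameter is lowered to stay within the quality-of-service budget. All names and the redistribution policy are hypothetical.

```c
/* Minimal sketch (all names hypothetical) of graceful degradation: on node
 * failure, reassign that node's vision tasks to the remaining nodes and lower
 * each task's quality parameter instead of failing. */
#include <stdio.h>

#define NODES 3
#define TASKS 6

struct task {
    int node;      /* computing unit the task currently runs on */
    int quality;   /* e.g., detector input scale in percent     */
};

static void fail_node(struct task *t, int n_tasks, int dead, int qos_quality)
{
    for (int i = 0; i < n_tasks; i++) {
        if (t[i].node == dead) {
            /* round-robin over the NODES-1 surviving nodes, skipping 'dead' */
            t[i].node = (dead + 1 + (i % (NODES - 1))) % NODES;
            t[i].quality = qos_quality;    /* degrade quality, keep running */
        }
    }
}

int main(void)
{
    struct task tasks[TASKS];
    for (int i = 0; i < TASKS; i++)
        tasks[i] = (struct task){ .node = i % NODES, .quality = 100 };

    fail_node(tasks, TASKS, 1, 60);  /* node 1 dies; its tasks run elsewhere at 60% */
    for (int i = 0; i < TASKS; i++)
        printf("task %d -> node %d @ %d%%\n", i, tasks[i].node, tasks[i].quality);
    return 0;
}
```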
ISBN (Print): 0769515126
Investigations of the parallel computing of non-ideal detonation wave propagation in 3-D space on a high-performance computer based on the CC-NUMA architecture are presented in this paper. Upon analyzing and testing the previous serial program, the computation of curvature and of the first-order and second-order differences was determined to be the main target of parallelization. Several processing techniques were applied to convert the serial program into a parallel program, such as a divide-and-conquer strategy and balancing of the load distribution. Numerical simulation with the parallel program yields a large increase in the computing speed for the non-ideal 3-D detonation wave propagation.
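The following generic sketch (not the authors' code) illustrates the divide-and-conquer and load-balancing step the abstract refers to: the 3-D grid is split into contiguous slabs along one axis and each slab is assigned to one worker, which would then compute its curvature and difference terms locally (halo exchange is omitted here).

```c
/* Generic domain-decomposition sketch: split NX grid planes into balanced
 * contiguous slabs, one per worker. Not tied to the paper's implementation. */
#include <stdio.h>

#define NX 64          /* grid points along the decomposed axis */
#define NWORKERS 4

struct slab { int begin, end; };   /* half-open range of grid planes */

/* Even block distribution: remainder planes go to the first few workers. */
static struct slab partition(int n, int nworkers, int rank)
{
    int base = n / nworkers, rem = n % nworkers;
    int begin = rank * base + (rank < rem ? rank : rem);
    int len   = base + (rank < rem ? 1 : 0);
    return (struct slab){ begin, begin + len };
}

int main(void)
{
    for (int r = 0; r < NWORKERS; r++) {
        struct slab s = partition(NX, NWORKERS, r);
        printf("worker %d: planes [%d, %d) -> %d planes\n",
               r, s.begin, s.end, s.end - s.begin);
    }
    return 0;
}
```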
ISBN (Print): 9781728136134
Work-efficient task-parallel algorithms enforce ordering between tasks using queuing primitives. Such algorithms offer limited parallelism due to queuing constraints that result in data movement and synchronization bottlenecks. Speculatively relaxing the order of tasks across cores using the Galois framework shows promise, as false dependencies generated by strict queuing constraints are mitigated to unlock task parallelism. However, relaxed ordering results in redundant work, for which Galois relies on static measures to improve work-efficiency. This paper proposes a dynamic multi-level parent-child task dependency checking mechanism in Galois to prune redundant work by exploiting monotonic properties of shared data values. Evaluation on a 40-core Intel Xeon multicore shows an average 2x performance improvement over state-of-the-art ordered and relax-ordered graph algorithms.
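As an illustration of pruning by monotonicity (not the Galois API or the paper's mechanism), consider shortest-path-style tasks: node distances only ever decrease, so a task whose recorded parent distance has since been improved by another core is stale and can be dropped before doing any work.

```c
/* Illustrative sketch of monotonicity-based pruning: distances only decrease,
 * so a queued task created against an older, larger distance is redundant. */
#include <stdio.h>

struct task {
    int node;           /* node this task will relax                 */
    int dist_at_spawn;  /* node's distance when the task was queued  */
};

/* Returns 1 if the task is redundant given the current shared distances. */
static int is_stale(const struct task *t, const int *dist)
{
    return dist[t->node] < t->dist_at_spawn;   /* someone already did better */
}

int main(void)
{
    int dist[4] = { 0, 7, 9, 20 };             /* current shared distances */
    struct task pending[2] = { { 1, 7 }, { 2, 12 } };

    for (int i = 0; i < 2; i++)
        printf("task on node %d: %s\n", pending[i].node,
               is_stale(&pending[i], dist) ? "pruned" : "executed");
    return 0;
}
```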
ISBN (Print): 9783642131189
Co-clustering has been extensively used in varied applications because of its potential to discover latent local patterns that are otherwise unapparent to usual unsupervised algorithms such as k-means. Recently, a unified view of co-clustering algorithms, called Bregman co-clustering (BCC), has provided a general framework that subsumes several existing co-clustering algorithms, so we expect this framework to find more applications on varied data types. However, the amount of data collected from real-life application domains easily grows too big to fit in the main memory of a single-processor machine. Accordingly, enhancing the scalability of BCC is a critical challenge in practice. To address this, and eventually to enhance its potential for rapid deployment to wider applications with larger data, we parallelize all twelve co-clustering algorithms in the BCC framework using the Message Passing Interface (MPI). In addition, we validate their scalability on eleven synthetic datasets as well as one real-life dataset, where we demonstrate their speedup performance under varied parameter settings.
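A generic MPI sketch (not the authors' implementation) of the parallelization pattern such algorithms typically rely on: rows of the data matrix are partitioned across ranks, each rank accumulates local co-cluster statistics, and an allreduce produces the global statistics each iteration needs. Cluster assignments and data values here are toy placeholders.

```c
/* Generic MPI pattern for parallel co-clustering: block-row data partition,
 * local co-cluster sums, global reduction. Compile with mpicc, run with mpirun. */
#include <mpi.h>
#include <stdio.h>

#define NROWS 8
#define NCOLS 4
#define K 2            /* row clusters    */
#define L 2            /* column clusters */

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* Toy block-row partition and fixed cluster assignments. */
    int rows_per_rank = (NROWS / size > 0) ? NROWS / size : NROWS;
    double local_sum[K][L] = {{0}};

    for (int i = 0; i < rows_per_rank; i++) {
        int gi = rank * rows_per_rank + i;      /* global row index   */
        if (gi >= NROWS) break;
        int rc = gi % K;                        /* toy row cluster    */
        for (int j = 0; j < NCOLS; j++) {
            int cc = j % L;                     /* toy column cluster */
            double x = gi + 0.1 * j;            /* stand-in data      */
            local_sum[rc][cc] += x;
        }
    }

    double global_sum[K][L];
    MPI_Allreduce(local_sum, global_sum, K * L, MPI_DOUBLE, MPI_SUM,
                  MPI_COMM_WORLD);

    if (rank == 0)
        printf("co-cluster (0,0) sum = %.1f\n", global_sum[0][0]);
    MPI_Finalize();
    return 0;
}
```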
ISBN (Print): 9783319654829; 9783319654812
Processing of large scale-free graphs on parallel architectures offers high parallelization opportunities but comes with considerable overheads. Due to the skewed degree distribution, each thread receives a different amount of computational workload. In this paper we present a method that addresses this challenge by modifying the CSR data structure and redistributing work across threads. The method was implemented in breadth-first search and single-source shortest path algorithms for the GPU architecture.
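The host-side C sketch below illustrates the baseline CSR layout and why a naive vertex-per-thread split is unbalanced on skewed graphs: a thread's work is the sum of the degrees of its vertices, not the vertex count. The paper's modified CSR and GPU kernels are not reproduced here; the toy graph is an assumption.

```c
/* Baseline CSR layout and degree-skew imbalance, on a toy hub-and-spoke graph. */
#include <stdio.h>

#define NV 10

/* CSR: row_ptr[v]..row_ptr[v+1] indexes the neighbours of vertex v.
 * Vertex 0 is a hub with degree 8; vertices 1..8 each point back to it. */
static const int row_ptr[NV + 1] = { 0, 8, 9, 10, 11, 12, 13, 14, 15, 16, 16 };
static const int col_idx[16]     = { 1, 2, 3, 4, 5, 6, 7, 8,
                                     0, 0, 0, 0, 0, 0, 0, 0 };

int main(void)
{
    int threads = 2;
    for (int t = 0; t < threads; t++) {
        int work = 0;
        /* naive split: vertices [t*NV/threads, (t+1)*NV/threads) */
        for (int v = t * NV / threads; v < (t + 1) * NV / threads; v++)
            work += row_ptr[v + 1] - row_ptr[v];   /* edges this thread scans */
        printf("thread %d scans %d edges\n", t, work);
    }
    (void)col_idx;   /* adjacency targets unused in this size-only illustration */
    return 0;
}
```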
ISBN (Print): 9782951740891
This paper presents two alternative NLP architectures to analyze massive amounts of documents using parallel processing. The two architectures focus on different processing scenarios, namely batch processing and streaming processing. The batch-processing scenario aims at optimizing the overall throughput of the system, i.e., minimizing the overall time spent on processing all documents. The streaming architecture aims to minimize the time to process real-time incoming documents and is therefore especially suitable for live feeds. The paper presents experiments with both architectures, and reports the overall gain when they are used for batch as well as for streaming processing. All the software described in the paper is publicly available under free licenses.
ISBN (Print): 9783642131356
Finite volume numerical methods have been widely studied, implemented and parallelized on multiprocessor systems or on clusters. Modern graphics processing units (GPUs) provide architectures and new programming models that make it possible to harness their large processing power and to design computational fluid dynamics simulations at both high performance and low cost. We report on solving the 2D compressible Euler equations on modern GPUs with high-resolution methods, i.e., methods able to handle complex situations involving shocks and discontinuities. We implement two different second-order numerical schemes, a Godunov-based scheme with a quasi-exact Riemann solver and a fully discrete second-order central scheme as originally proposed by Kurganov and Tadmor. Performance measurements show that these two numerical schemes can achieve 30x to 70x speed-ups on recent GPU hardware compared to a single-threaded CPU reference implementation. These first results provide very promising perspectives for designing a GPU-based software framework for applications in computational astrophysics by further integrating MHD codes and N-body simulations.
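A deliberately simplified sketch of the central-flux building block behind schemes of the Kurganov-Tadmor type: a first-order, one-dimensional scalar Burgers example on the CPU, using the interface flux F = (f(u_j)+f(u_{j+1}))/2 - a*(u_{j+1}-u_j)/2 with a the local maximum wave speed. The paper's solver is second order, two-dimensional, for the full Euler system, and runs on the GPU; none of that is reproduced here.

```c
/* First-order central (local Lax-Friedrichs) flux for 1D Burgers' equation,
 * as a scalar stand-in for the central-scheme idea mentioned in the abstract. */
#include <math.h>
#include <stdio.h>

#define N 100

static double f(double u)     { return 0.5 * u * u; }  /* Burgers flux */
static double speed(double u) { return fabs(u); }      /* |f'(u)|      */

int main(void)
{
    double u[N], un[N], dx = 1.0 / N, dt = 0.4 * dx;   /* CFL < 1 for |u| <= 1 */

    for (int j = 0; j < N; j++)                         /* step initial data   */
        u[j] = (j < N / 2) ? 1.0 : 0.0;

    for (int step = 0; step < 50; step++) {
        for (int j = 1; j < N - 1; j++) {
            double aR = fmax(speed(u[j]), speed(u[j + 1]));
            double aL = fmax(speed(u[j - 1]), speed(u[j]));
            double FR = 0.5 * (f(u[j]) + f(u[j + 1])) - 0.5 * aR * (u[j + 1] - u[j]);
            double FL = 0.5 * (f(u[j - 1]) + f(u[j])) - 0.5 * aL * (u[j] - u[j - 1]);
            un[j] = u[j] - dt / dx * (FR - FL);         /* conservative update */
        }
        un[0] = u[0];                                   /* fixed boundaries    */
        un[N - 1] = u[N - 1];
        for (int j = 0; j < N; j++) u[j] = un[j];
    }
    printf("u at mid-domain after 50 steps: %.3f\n", u[N / 2]);
    return 0;
}
```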