A number of parallel formulations of the dense matrix multiplication algorithm have been developed. For an arbitrarily large number of processors, any of these algorithms or their variants can provide near-linear speedup for...
ISBN: 9783540729044 (print)
This paper presents a parallel architecture that can simultaneously perform block-matching motion estimation (ME) and the discrete cosine transform (DCT). Because DCT and ME are both processed block by block, it is preferable to put them in one module for resource sharing. Simulink simulations show that the parallel architecture improves running time by 18.6% compared with the conventional sequential architecture.
ISBN: 9783642030949 (print)
Markov decision processes (MDPs) provide the foundation for a number of problem areas, such as artificial intelligence research, automated planning, and reinforcement learning. MDPs can be solved efficiently in theory. However, for large scenarios, further investigation is needed to find practical algorithms. Algorithms for solving MDPs have a natural concurrency. In this paper, we present parallel algorithms based on dynamic programming. The computation cost and communication complexity of this method are also analyzed. Moreover, experimental results demonstrate excellent speedups and scalability.
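The "natural concurrency" mentioned in this abstract can be illustrated with textbook value iteration, where each state's Bellman backup reads only the previous value table and so can run independently of the others. The sketch below is not the paper's algorithm, which the abstract does not specify; the function names, the `transitions[state][action]` layout (a list of `(next_state, probability)` pairs), and the thread pool are all assumptions made for illustration. In CPython, threads will not speed up this pure-Python arithmetic; the pool only marks where the work could be split.

```python
from concurrent.futures import ThreadPoolExecutor


def backup(state, values, transitions, rewards, gamma):
    """Bellman backup for one state; reads only the previous value table."""
    return max(
        rewards[state][a]
        + gamma * sum(p * values[s2] for s2, p in transitions[state][a])
        for a in range(len(transitions[state]))
    )


def value_iteration(transitions, rewards, gamma=0.9, tol=1e-8, workers=4):
    """Synchronous value iteration with per-state backups mapped over a pool.

    transitions[s][a] is a list of (next_state, probability) pairs and
    rewards[s][a] is the immediate reward (hypothetical layout).
    """
    n = len(transitions)
    values = [0.0] * n
    with ThreadPoolExecutor(max_workers=workers) as pool:
        while True:
            # All backups in one sweep are independent of each other.
            new_values = list(
                pool.map(
                    lambda s: backup(s, values, transitions, rewards, gamma),
                    range(n),
                )
            )
            delta = max(abs(a - b) for a, b in zip(new_values, values))
            values = new_values
            if delta < tol:
                return values
```

For a two-state toy problem in which state 1 earns a reward of 2 per step forever and state 0 can move there for a reward of 1, a discount of 0.5 gives values of 3 and 4 for the two states.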
ISBN: 9781467375894 (print)
Scheduling algorithms published in the scientific literature are often difficult to evaluate or compare due to differences between the experimental evaluations in any two papers on the topic. Very few researchers share the details of the scheduling problem instances they use in their evaluation section, the code that transforms the numbers they collect into the results and graphs they show, or the raw data produced in their experiments. Also, many published scheduling algorithms are not tested against a real processor architecture to evaluate their efficiency in a realistic setting. In this paper, we describe Mimer, a modular evaluation tool-chain for static schedulers that enables the sharing of the evaluation and analysis tools used to prepare scheduling papers. We propose Schedeval, which integrates into Mimer to evaluate static schedules of streaming applications under throughput constraints on actual target execution platforms. We evaluate the performance of Schedeval at running streaming applications on the Intel Single-Chip Cloud Computer (SCC), and we demonstrate the usefulness of our tool-chain for comparing existing scheduling algorithms. We conclude that Mimer and Schedeval are useful tools to study static scheduling and to observe the behavior of streaming applications when running on manycore architectures.
ISBN: 9781424422050 (print)
This paper proposes a parallel architecture for a fast three step search (FTSS) algorithm, which is used in motion estimation. The FTSS algorithm involves a reduced number of search points and is thus less computationally expensive than the standard three step search (TSS) algorithm. The degradation in performance when applying the FTSS algorithm to several standard images has been shown to be insignificant compared to the standard TSS algorithm. The proposed architecture uses only three processing elements, together with intelligent data arrangement and memory configuration. A technique for reducing external memory accesses has also been developed. The proposed architecture for FTSS provides an efficient solution for applications requiring real-time motion estimation, because it requires less area and power than would be needed to implement TSS. The proposed architecture provides a solution for low bit-rate video applications such as video telephony and teleconferencing.
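For readers unfamiliar with the baseline, the standard three step search can be sketched as follows. This is the textbook TSS, not the FTSS variant or the hardware architecture proposed in the paper, whose details the abstract does not give. The function names, the 8x8 block size, the initial step of 4, and the sum of absolute differences (SAD) cost are assumptions chosen for illustration. TSS is a heuristic and may settle on a local minimum on textured frames.

```python
def sad(cur, ref, bx, by, dx, dy, n):
    """Sum of absolute differences between the current block at (bx, by)
    and the reference block displaced by (dx, dy)."""
    total = 0
    for y in range(n):
        for x in range(n):
            total += abs(cur[by + y][bx + x] - ref[by + dy + y][bx + dx + x])
    return total


def three_step_search(cur, ref, bx, by, n=8, step=4):
    """Return ((dx, dy), cost) found by the standard three step search.

    Frames are lists of rows of integer pixel values. Each round tests the
    nine points around the current best vector, then halves the step.
    """
    h, w = len(ref), len(ref[0])
    best = (0, 0)
    best_cost = sad(cur, ref, bx, by, 0, 0, n)
    while step >= 1:
        cx, cy = best  # centre stays fixed during one round
        for dy in (-step, 0, step):
            for dx in (-step, 0, step):
                mx, my = cx + dx, cy + dy
                # Skip candidates whose block would leave the reference frame.
                if bx + mx < 0 or by + my < 0:
                    continue
                if bx + mx + n > w or by + my + n > h:
                    continue
                cost = sad(cur, ref, bx, by, mx, my, n)
                if cost < best_cost:
                    best, best_cost = (mx, my), cost
        step //= 2
    return best, best_cost
```

With steps of 4, 2, and 1, the search covers displacements of up to 7 pixels in each direction while evaluating at most 27 candidates per block (some of them repeated), compared with 225 for an exhaustive search over the same window.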
ISBN: 9789897581939 (print)
Concurrently with the rise of Big Data systems, relational database management systems (RDBMS) are still widely used in servers, client devices, and even embedded inside end-user applications. This paper proposes to improve the performance of SQLite, the most widely deployed embedded RDBMS. The proposed solution, named CuDB, is an "In-Memory" Database System (IMDB) that attempts to exploit the specific features of modern CPU/GPU architectures. In this study, massively parallel processing was combined with strategic data placement, closer to the computing units. Depending on the content and selectivity of the queries, the measurements show speedups ranging from 5 to 120 times.
ISBN: 3540219463 (print)
The proceedings contain 149 papers. The special focus in this conference is on parallel and distributed architectures, scheduling and load balancing. The topics include: session guarantees to achieve PRAM consistency of replicated shared objects; an extended atomic consistency protocol for recoverable DSM systems; hyper-threading technology speeds clusters; configurable microprocessor array for DSP applications; on generalized Moore digraphs; RDMA communication based on rotating buffers for efficient parallel fine-grain computations; communication on the fly in dynamic SMP clusters; accelerated diffusion algorithms on general dynamic networks; suitability of load scheduling algorithms to workload characteristics; minimizing time-dependent total completion time on parallel identical machines; diffusion based scheduling in the agent-oriented computing system; approximation algorithms for scheduling jobs with chain precedence constraints; combining vector quantization and ant-colony algorithm for mesh-partitioning; wavelet-neuronal resource load prediction for multiprocessor environment; fault-tolerant scheduling in distributed real-time systems; online scheduling of multiprocessor jobs with idle regulation; predicting the response time of a new task on a Beowulf cluster; space decomposition solvers and their performance in PC-based parallel computing environments; evaluation of execution time of mathematical library functions based on historical performance information; empirical modelling of parallel linear algebra routines; efficiency of divisible load processing; gray box based data access time estimation for tertiary storage in grid environment; performance modeling of parallel FEM computations on clusters; asymptotical behaviour of the communication complexity of one parallel algorithm; and analytical modeling of optimized sparse linear code.
Chordal rings have been proposed in the past as networks that combine the simple routing framework of rings with the lower diameter, wider bisection, and higher resilience of other architectures. Virtually all proposed chordal ring networks are node-symmetric; i.e., all nodes have the same in/out degree and interconnection pattern. Unfortunately, such regular chordal rings are not scalable. In this paper, the periodically regular chordal ring network is proposed as a compromise for combining low node degree with small diameter. The discussion centers on the basic structure, derivation of topological properties, routing algorithms, optimization of parameters, and comparison to competing architectures such as meshes and PEC networks.
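The effect of chords on diameter can be checked numerically on a small regular (node-symmetric) chordal ring. The sketch below builds an undirected ring in which every node also links to the nodes a fixed skip distance away, then measures the diameter by breadth-first search. It does not model the periodically regular chordal ring proposed in the paper, whose construction the abstract does not define; the function names and the `skips` parameter are assumptions for illustration.

```python
from collections import deque


def chordal_ring(n, skips):
    """Adjacency lists of an undirected ring of n nodes in which every node
    also has chords to the nodes `s` positions away, for each s in skips."""
    return {
        v: sorted({(v + d) % n for s in (1, *skips) for d in (s, -s)})
        for v in range(n)
    }


def diameter(adj):
    """Longest shortest-path length, found by a BFS from every node."""
    worst = 0
    for src in adj:
        dist = {src: 0}
        queue = deque([src])
        while queue:
            u = queue.popleft()
            for w in adj[u]:
                if w not in dist:
                    dist[w] = dist[u] + 1
                    queue.append(w)
        worst = max(worst, max(dist.values()))
    return worst
```

For 16 nodes, a plain ring (`chordal_ring(16, [])`) has diameter 8, while adding chords of skip 4 (`chordal_ring(16, [4])`) raises the node degree from 2 to 4 and lowers the diameter to 3.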
ISBN: 9783642246685 (print)
This paper compares three parallel point-multiplication algorithms on conic curves over the ring Zn. We propose an algorithm that parallelizes point multiplication by using the Chinese Remainder Theorem to split a point multiplication over the ring Zn into two point multiplications over finite fields, which are computed separately. The time complexity and speedup ratio of this parallel algorithm are computed on the basis of our previous research on the basic parallel algorithms in the conic curve cryptosystem. A quantitative performance analysis compares this algorithm with two other algorithms we designed previously. The comparison demonstrates that the algorithm presented in this paper reduces the time complexity of point multiplication on conic curves over the ring Zn and is more efficient than the preceding ones.
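The Chinese Remainder Theorem split described here follows a general pattern: do the work modulo each prime factor independently, then recombine. The sketch below shows that pattern with modular exponentiation standing in for conic-curve point multiplication, because the abstract does not give the curve's group law. It is therefore an analogy, not the paper's algorithm, and all function names are hypothetical. It assumes n = p * q with distinct primes p and q, and Python 3.8 or later for `pow(q, -1, p)`.

```python
from concurrent.futures import ThreadPoolExecutor


def crt_combine(rp, rq, p, q):
    """Return the unique x in [0, p*q) with x = rp (mod p) and x = rq (mod q).

    Uses Garner's formula; p and q must be coprime.
    """
    q_inv = pow(q, -1, p)  # modular inverse of q modulo p
    h = (q_inv * (rp - rq)) % p
    return rq + h * q


def pow_mod_n_via_crt(base, exponent, p, q):
    """Compute base**exponent mod (p*q) as two independent computations,
    one modulo p and one modulo q, then recombine the residues."""
    with ThreadPoolExecutor(max_workers=2) as pool:
        # The two halves share no state, so they can be scheduled separately.
        future_p = pool.submit(pow, base, exponent, p)
        future_q = pool.submit(pow, base, exponent, q)
        return crt_combine(future_p.result(), future_q.result(), p, q)
```

Each half works with operands roughly half the bit length of n, which is where the saving comes from in practice. The result can be checked against a direct computation, for example `pow_mod_n_via_crt(65, 17, 61, 53) == pow(65, 17, 61 * 53)`.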
ISBN: 9781665469586 (print)
In recent years, IoT devices have become widespread, and energy-efficient coarse-grained reconfigurable architectures (CGRAs) have attracted attention. CGRAs comprise several processing units called processing elements (PEs) arranged in a two-dimensional array. The operations of the PEs and the interconnections between them are adaptively changed depending on the target application, which contributes to higher energy efficiency compared to general-purpose processors. The application kernel executed on a CGRA is represented as a data flow graph (DFG), and CGRA compilers are responsible for mapping the DFG onto the PE array. Thus, mapping algorithms significantly influence the performance and power efficiency of CGRAs as well as the compile time. This paper proposes POCOCO, a compiler framework for CGRAs that can use pre-optimized subgraph mappings, which reduces the compiler optimization task. To leverage the subgraph mappings, we extend an existing mapping method based on a genetic algorithm. Experiments on three architectures demonstrated that the proposed method reduces the optimization time by 48%, on average, for the best case of the three architectures.