ISBN: 9781479938445 (print)
This paper proposes a novel approach to task scheduling in an automatic parallel implementation of XQuery. The approach addresses the scheduling problem in a shared-memory multithreaded environment and combines three strategies: task parallelism, data parallelism and pipeline parallelism. An automaton model is established for the pipeline parallelism and is used to reduce the idle time between pipeline stages. The experimental results show that the approach improves performance and has good memory efficiency.
ISBN: 9780769548982; 9781467345668 (print)
Sparse Matrix-Vector Multiplication (SMVM) on parallel hardware is a challenging problem because of its irregular data communication requirements. The communication volume in the parallel hardware is determined by how data is distributed among the processing elements. In this paper we introduce two methods of data mapping for SMVM on a Network-on-Chip (NoC) in order to spread the load among its components. We then examine the effect of reordering the sparse matrix on those mapping methods. Simulations are performed using an OMNeT++ based NoC simulator.
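The dependence of communication volume on data distribution can be illustrated with a minimal sketch. The abstract does not name its two mapping methods, so the block and cyclic mappings below are hypothetical stand-ins, and `remote_reads` is an assumed, simplified cost measure (one count per nonzero whose vector entry lives on another processing element), not the NoC model used in the paper.

```python
def csr_spmv(indptr, indices, data, x):
    """Compute y = A @ x for a matrix A stored in CSR form."""
    y = [0.0] * (len(indptr) - 1)
    for row in range(len(y)):
        for k in range(indptr[row], indptr[row + 1]):
            y[row] += data[k] * x[indices[k]]
    return y


def block_owner(index, n, num_pes):
    """Hypothetical mapping: contiguous blocks of indices per processing element."""
    block = -(-n // num_pes)  # ceil(n / num_pes)
    return index // block


def cyclic_owner(index, n, num_pes):
    """Hypothetical mapping: indices dealt round-robin to processing elements."""
    return index % num_pes


def remote_reads(indptr, indices, n, num_pes, owner):
    """Count nonzeros whose x entry is owned by a different PE than the row."""
    count = 0
    for row in range(len(indptr) - 1):
        for k in range(indptr[row], indptr[row + 1]):
            if owner(row, n, num_pes) != owner(indices[k], n, num_pes):
                count += 1
    return count


# 4x4 example matrix (assumed for illustration):
# [[1, 0, 2, 0],
#  [0, 3, 0, 0],
#  [0, 0, 4, 0],
#  [0, 5, 0, 6]]
indptr = [0, 2, 3, 4, 6]
indices = [0, 2, 1, 2, 1, 3]
data = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
x = [1.0, 1.0, 1.0, 1.0]

y = csr_spmv(indptr, indices, data, x)
block_cost = remote_reads(indptr, indices, 4, 2, block_owner)
cyclic_cost = remote_reads(indptr, indices, 4, 2, cyclic_owner)
```

For this particular matrix and two processing elements, the block mapping incurs two remote reads and the cyclic mapping none. A different sparsity pattern can reverse that ordering, which is why reordering the matrix interacts with the choice of mapping.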
ISBN: 9781479938445 (print)
The traditional multithreaded programming model offers no dedicated performance optimization strategy for Many Integrated Core (MIC) heterogeneous systems. To fully exploit the high computing power of the MIC processor, this paper discusses specific program porting and performance optimization strategies for the MIC heterogeneous parallel system, using a k-means application as the case study. Experimental results show that the proposed porting and optimization strategies are effective and can guide programmers in porting and optimizing applications for MIC heterogeneous parallel systems.
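For readers unfamiliar with the workload, the k-means application can be sketched as follows. This is a plain sequential version of Lloyd's iteration, not the ported MIC code; the function names and the sample data are assumptions for illustration. The per-point loop in `assign` is independent across points, which is the part such a port would typically spread across cores and vector lanes.

```python
def assign(points, centroids):
    """Return the index of the nearest centroid for each point."""
    labels = []
    for p in points:  # iterations are independent: the data-parallel loop
        dists = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centroids]
        labels.append(dists.index(min(dists)))
    return labels


def update(points, labels, centroids):
    """Recompute each centroid as the mean of its members."""
    new = []
    for j, old in enumerate(centroids):
        members = [p for p, lab in zip(points, labels) if lab == j]
        if not members:
            new.append(old)  # keep an empty cluster's centroid unchanged
            continue
        dim = len(old)
        new.append(tuple(sum(m[d] for m in members) / len(members)
                         for d in range(dim)))
    return new


def kmeans(points, centroids, max_iters=100):
    """Iterate assignment and update until the centroids stop moving."""
    for _ in range(max_iters):
        labels = assign(points, centroids)
        updated = update(points, labels, centroids)
        if updated == centroids:
            break
        centroids = updated
    return centroids, labels


# Two well-separated groups of 2-D points (illustrative data).
points = [(0, 0), (0, 2), (10, 10), (10, 12)]
centroids, labels = kmeans(points, [(0, 0), (10, 10)])
```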
ISBN: 9780769548982; 9781467345668 (print)
This paper studies adaptive diagnosis in the locally twisted cube. Let LTQ(n) denote the n-dimensional locally twisted cube. We prove that for any integer n >= 3, LTQ(n) with at most n faulty nodes can be adaptively diagnosed using at most 4 parallel testing rounds, where each node participates in at most one test in each round. Both the proof and the adaptive diagnosis algorithm for LTQ(n) are presented.
ISBN: 9780769548982; 9781467345668 (print)
Multi-core architectures are evolving towards more cores and more levels of cache. Meanwhile, typical current parallel programming languages are unable to exploit the potential of these processors efficiently. To achieve the desired performance on such hardware, we need to understand the architectural parameters and apply them in algorithm design. Computational models such as Multi-BSP capture these parameters and describe suitable methods for designing algorithms on multi-cores. One applicable category of problems is Branch-and-Bound (BaB), which must be adapted to such a model before it can be implemented on these systems. In this paper, we propose a mapping between the BaB run-time tree and the Memory Hierarchy Tree (MT) of a multi-core processor. Inspired by the Multi-BSP model, we introduce the Multi-BaB model. Analogous to the Multi-BSP analysis, we present bounds on the communication and synchronization costs of BaB algorithms. This work is a step towards making multi-core programming efficient and towards a correct analysis of BaB algorithm behavior on multi-core architectures.
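The BaB run-time tree mentioned above can be made concrete with a minimal sequential sketch. This is not the Multi-BaB model, which the abstract does not specify; it is an assumed textbook example (0/1 knapsack with a fractional upper bound, positive weights) in which each call to `branch` is one node of the run-time tree. The function and variable names are illustrative.

```python
def knapsack_bab(values, weights, capacity):
    """Depth-first branch and bound for 0/1 knapsack.

    Returns (best value found, number of tree nodes visited).
    Assumes all weights are positive.
    """
    # Sort by value density so the greedy fractional fill is a valid upper bound.
    items = sorted(zip(values, weights),
                   key=lambda vw: vw[0] / vw[1], reverse=True)
    best = 0
    visited = 0

    def bound(i, value, room):
        """Upper bound: fill the remaining room greedily, last item fractionally."""
        for v, w in items[i:]:
            if w <= room:
                value += v
                room -= w
            else:
                return value + v * room / w
        return value

    def branch(i, value, room):
        nonlocal best, visited
        visited += 1                      # one node of the run-time tree
        best = max(best, value)
        if i == len(items) or bound(i, value, room) <= best:
            return                        # leaf reached or subtree pruned
        v, w = items[i]
        if w <= room:
            branch(i + 1, value + v, room - w)   # child: take item i
        branch(i + 1, value, room)               # child: skip item i

    branch(0, 0, capacity)
    return best, visited


best, visited = knapsack_bab([60, 100, 120], [10, 20, 30], 50)
```

The two recursive calls are independent subtrees, so they are the natural units to distribute over cores; the shared incumbent `best` is the value whose propagation gives rise to communication and synchronization cost in a parallel setting.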
ISBN: 9780769548982; 9781467345668 (print)
In this paper we present an approach to the parallel implementation of the state minimization problem for nondeterministic finite automata. This approach is based on the truncated branch and bound method and on the use of the basis and COM automata for the given language. Minimum-state automata are sought among the sub-automata of the COM automaton. Some sufficient conditions for their equivalence to the given nondeterministic automaton are proved in terms of the loops of the basis automaton. We suggest exact and heuristic state minimization algorithms, discuss their implementation details and provide some experimental results.
ISBN: 9781479938445 (print)
Based on GPU parallel technology, this paper proposes a parallel SRM feature extraction algorithm to accelerate the extraction of SRM features for steganalysis of HUGO images. Using the OpenCL parallel programming framework for GPUs, we parallelize a serial algorithm and apply several optimization techniques to accelerate the extraction process. The techniques include convolution unrolling, combined memory access and avoidance of bank conflicts. The experimental results show that, for images of different sizes, the proposed parallel extraction algorithm is 25 to 55 times faster than the original serial algorithm, and 2 to 4.2 times faster than running the parallel method on a quad-core CPU.
ISBN: 9781479938445 (print)
In recent years, multi-core digital signal processors (DSPs) have been widely used to improve execution efficiency in a variety of applications. To fully exploit the parallel processing capacity of DSPs, a well-designed parallel programming model is essential for programmers. In this paper, a parallel programming model (PPMA) for a self-designed multi-core audio DSP (MAD) is proposed, based on both shared-memory and message-passing communication mechanisms. PPMA provides a set of application program interfaces (APIs) for efficient inter-core data transmission and synchronization control. To evaluate the performance improvement of audio applications using PPMA, a low bit-rate speech codec application is ported to the MAD. With the help of PPMA, task scheduling of the speech codec can be implemented conveniently. Experimental results also show that the overhead of inter-core communication in MAD is negligible compared to the parallel speedup achieved by PPMA.
ISBN: 9780769548982; 9781467345668 (print)
A novel, polymorphic array architecture is proposed in this paper. This architecture is capable of supporting a dynamic mixture of data parallel computation (DLP), thread level parallel computation (TLP), and operation level parallel computation (OLP). We aim at designing a programmable architecture that can approach ASIC performance. This is accomplished through new architectural features and implementation level innovations. The architecture and its implementation are presented in the paper to demonstrate its feasibility and capabilities.