To meet the requirements of real-time detection and tracking of moving targets, a hardware platform based on a Field Programmable Gate Array (FPGA) is built. After analyzing the Sum of Absolute Differences (SAD) algorithm and the cross search algorithm, this paper focuses on the design of the parallel search structure of the tracking module, the parallel matching structure of the search module, and the parallel computing structure of the SAD unit. Analysis of real-time video processing results demonstrates the effectiveness of the design scheme.
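As a software point of reference for the SAD matching described above, the following minimal sketch scores candidate positions one after another. The function names `sad` and `best_match` and the explicit candidate list are illustrative assumptions; the paper's FPGA design evaluates candidates in parallel and its cross search pattern is not reproduced here.

```python
def sad(block_a, block_b):
    """Sum of absolute differences between two equally sized 2-D blocks."""
    return sum(
        abs(a - b)
        for row_a, row_b in zip(block_a, block_b)
        for a, b in zip(row_a, row_b)
    )


def best_match(frame, template, candidates):
    """Return (score, (row, col)) for the candidate with the lowest SAD.

    `candidates` is a list of top-left positions (hypothetical input);
    every position must keep the template fully inside the frame.
    """
    height, width = len(template), len(template[0])
    best = None
    for row, col in candidates:
        window = [line[col:col + width] for line in frame[row:row + height]]
        score = sad(window, template)
        if best is None or score < best[0]:
            best = (score, (row, col))
    return best


frame = [
    [0, 0, 0, 0],
    [0, 5, 6, 0],
    [0, 7, 8, 0],
    [0, 0, 0, 0],
]
template = [[5, 6], [7, 8]]
positions = [(r, c) for r in range(3) for c in range(3)]
print(best_match(frame, template, positions))
```

Each candidate score is independent of the others, which is what makes the search easy to spread across parallel hardware units.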
ISBN: 9781450362955 (print)
Demand for High Performance Computing (HPC) is on the rise, particularly for large distributed computing. HPC systems have, by design, very heterogeneous architectures, both in computation and in communication bandwidth, resulting in wide variations in the cost of communication between compute units. If large distributed applications are to take full advantage of HPC, the physical communication capabilities must be taken into consideration when allocating workload. Hypergraphs are good at modelling the total volume of communication in parallel and distributed applications. To the best of our knowledge, there are no hypergraph partitioning algorithms to date that are architecture-aware. We propose a novel restreaming hypergraph partitioning algorithm (HyperPRAW) that takes advantage of peer-to-peer physical bandwidth profiling data to improve the performance of distributed applications in HPC systems. Our results show not only that the quality of the partitions achieved by our algorithm is comparable with state-of-the-art multilevel partitioning, but also that the runtime in a synthetic benchmark is significantly reduced in the 10 hypergraph models tested, with speedup factors of up to 14x.
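To illustrate the idea of architecture-aware restreaming in general terms, the sketch below repeatedly revisits every vertex and moves it to the part where talking to its hyperedge neighbours is cheapest according to a link-cost matrix. This is not HyperPRAW's actual objective or update rule: the pairwise cost model, the fixed `capacity` balance limit, and the names `comm_cost` and `restream` are all assumptions made for this example.

```python
def comm_cost(hyperedges, assignment, cost):
    """Sum cost[p][q] over every pair of distinct parts spanned by each hyperedge."""
    total = 0
    for edge in hyperedges:
        parts = sorted({assignment[v] for v in edge})
        for i, p in enumerate(parts):
            for q in parts[i + 1:]:
                total += cost[p][q]
    return total


def restream(hyperedges, num_vertices, num_parts, cost, capacity, passes=3):
    """Greedy restreaming: several passes over the vertices, one move at a time."""
    if capacity * num_parts < num_vertices:
        raise ValueError("capacity too small to hold all vertices")
    incident = [[] for _ in range(num_vertices)]
    for edge in hyperedges:
        for v in edge:
            incident[v].append(edge)

    assignment = [v % num_parts for v in range(num_vertices)]  # round-robin start
    loads = [0] * num_parts
    for p in assignment:
        loads[p] += 1

    def penalty(v, p):
        # Cost of placing v in part p given where its neighbours currently sit.
        return sum(cost[p][assignment[u]]
                   for edge in incident[v] for u in edge if u != v)

    for _ in range(passes):
        for v in range(num_vertices):
            loads[assignment[v]] -= 1
            feasible = [p for p in range(num_parts) if loads[p] < capacity]
            assignment[v] = min(feasible, key=lambda p: (penalty(v, p), p))
            loads[assignment[v]] += 1
    return assignment


edges = [[0, 1], [2, 3]]
link_cost = [[0, 1], [1, 0]]  # hypothetical profiled cost between two compute units
parts = restream(edges, 4, 2, link_cost, capacity=3)
print(parts, comm_cost(edges, parts, link_cost))
```

In a real system the cost matrix would come from bandwidth profiling, so slow links are penalised more heavily than fast ones.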
With the rapid development of the mobile Internet, the network has become an important medium for people to exchange information. Research on text classification has practical significance. Using the Hadoop platform t...
The paper starts with a summary presentation of demand side management in the context of electric energy consumption. After that, the concept of Internet of Things integration in demand side management is presente...
This paper addresses the problem of approximate nearest neighbor queries on high-dimensional large data sets. The two most representative approaches to this problem are based on the inverted multi-index and on the inverted index. The query algorithm based on the inverted multi-index requires frequent memory access operations, which wastes more resources than a query operation based on distance order. Traditional query algorithms based on inverted indexes need to divide the entire feature space into a large number of regions to ensure the accuracy of query results on high-dimensional large data sets, so a large amount of space is required. This paper proposes a query algorithm based on the inverted index. By subdividing the Voronoi cells initially built from part of the original data set, the algorithm aims to significantly improve query performance on high-dimensional data sets without dividing the space into a large number of regions, and to avoid frequent access operations to random access memory. Theoretical analysis and experimental results show that the proposed method not only effectively improves query processing efficiency, but also ensures the accuracy of the query.
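For readers unfamiliar with inverted-index search, the following minimal sketch shows the baseline the abstract starts from: points are bucketed by their nearest centroid (one Voronoi cell per centroid) and a query scans only the closest cells. The cell subdivision proposed in the paper is not implemented, and the names `build_index`, `query`, and `nprobe` are assumptions for this example.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def build_index(points, centroids):
    """Map each centroid index to the ids of the points in its Voronoi cell."""
    lists = {i: [] for i in range(len(centroids))}
    for pid, point in enumerate(points):
        cell = min(range(len(centroids)),
                   key=lambda i: sq_dist(point, centroids[i]))
        lists[cell].append(pid)
    return lists


def query(q, points, centroids, lists, nprobe=1):
    """Return the id of the closest point among the nprobe nearest cells."""
    order = sorted(range(len(centroids)),
                   key=lambda i: sq_dist(q, centroids[i]))
    candidates = [pid for cell in order[:nprobe] for pid in lists[cell]]
    if not candidates:
        return None
    return min(candidates, key=lambda pid: sq_dist(q, points[pid]))


points = [(0, 0), (1, 0), (10, 10), (11, 10)]
centroids = [(0.5, 0), (10.5, 10)]
index = build_index(points, centroids)
print(query((9, 9), points, centroids, index))
```

The trade-off is visible in `nprobe`: scanning more cells raises accuracy but also the number of candidates to compare, which is why a coarse partition alone needs many regions to stay accurate.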
ISBN: 9781538680520 (print)
Uniformization is one of the best methods for computing the transient probabilities of continuous-time Markov chains. In this paper, we propose a method for parallelizing uniformization by performing its computation on graphics processing units residing on several computers that communicate with each other via the Message Passing Interface (MPI). Since Markov chain models are usually sparse, hypergraph partitioning is used to reduce communication among the computers when performing repeated sparse matrix-vector multiplication operations. In principle, this method of parallelization allows for unlimited scalability while still maintaining computation speed. Indeed, our results show that the proposed method can solve large models faster. However, our results also show that up to 90% of the computation time is still spent on communication between computers.
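The core of uniformization can be stated compactly: with a rate $\Lambda \ge \max_i |q_{ii}|$ and $P = I + Q/\Lambda$, the transient distribution is $\pi(t) = \sum_k e^{-\Lambda t} \frac{(\Lambda t)^k}{k!} \pi(0) P^k$. The sequential sketch below evaluates this series with dense lists; the truncation rule, the `max_terms` cap, and the function name are assumptions for this example, and none of the paper's GPU, MPI, or partitioning machinery is shown.

```python
import math


def uniformization(Q, pi0, t, tol=1e-12, max_terms=100000):
    """Transient distribution pi(t) of a CTMC with generator matrix Q.

    Naive version: exp(-lam * t) underflows for very large lam * t,
    so this is only suitable for small illustrative models.
    """
    n = len(Q)
    lam = max(-Q[i][i] for i in range(n))  # uniformization rate
    if lam == 0.0 or t == 0.0:
        return list(pi0)
    # Transition matrix of the uniformized discrete-time chain.
    P = [[(1.0 if i == j else 0.0) + Q[i][j] / lam for j in range(n)]
         for i in range(n)]

    weight = math.exp(-lam * t)  # Poisson probability of zero jumps
    v = list(pi0)                # holds pi0 * P^k
    result = [weight * x for x in v]
    covered = weight             # Poisson mass accumulated so far
    for k in range(1, max_terms + 1):
        if 1.0 - covered <= tol:
            break
        # The repeated vector-matrix product is the step the paper distributes.
        v = [sum(v[i] * P[i][j] for i in range(n)) for j in range(n)]
        weight *= lam * t / k
        result = [r + weight * x for r, x in zip(result, v)]
        covered += weight
    return result


Q = [[-2.0, 2.0], [1.0, -1.0]]
pi = uniformization(Q, [1.0, 0.0], 0.5)
print(pi)
```

For this two-state chain the closed form $\pi_0(t) = \tfrac{1}{3} + \tfrac{2}{3} e^{-3t}$ gives a convenient check on the result.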
ISBN: 9781665408400 (print)
The conventional A* algorithm consumes a lot of time due to its large number of iterations. In every iteration, memory is accessed for multiple data structures, and functions are evaluated and then sorted into queues, which sometimes makes the algorithm unsuitable for real-time applications. This paper proposes a fast implementation of the A* algorithm to meet the requirements of real-time applications. The proposed implementation uses parallelism and caching to achieve better performance. We used Register Transfer Level (RTL) simulation and formal verification for functional verification of the implemented design. The design is implemented on a Xilinx Virtex-7 for evaluation. Experiments show that this implementation achieves a 100-fold improvement on maps with few obstacles and a 50-fold improvement on maps with many obstacles, relative to a software implementation. The design is suitable for real-time applications.
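The per-iteration work mentioned above (queue operations, cost lookups, heuristic evaluation) is easy to see in a conventional software A*. The sketch below is such a baseline on a 4-connected grid with unit step cost and a Manhattan heuristic; the grid encoding and the function name are assumptions for this example, and it is not the paper's hardware design.

```python
import heapq


def astar(grid, start, goal):
    """Length of the shortest 4-connected path on a grid, or None if unreachable.

    grid[r][c] == 0 marks a free cell and 1 marks an obstacle.
    """
    def heuristic(cell):
        return abs(cell[0] - goal[0]) + abs(cell[1] - goal[1])

    open_heap = [(heuristic(start), 0, start)]  # entries are (f, g, cell)
    best_g = {start: 0}
    while open_heap:
        _, g, cell = heapq.heappop(open_heap)
        if cell == goal:
            return g
        if g > best_g.get(cell, float("inf")):
            continue  # stale queue entry, a cheaper route was found later
        r, c = cell
        for nr, nc in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            inside = 0 <= nr < len(grid) and 0 <= nc < len(grid[0])
            if inside and grid[nr][nc] == 0:
                new_g = g + 1
                if new_g < best_g.get((nr, nc), float("inf")):
                    best_g[(nr, nc)] = new_g
                    heapq.heappush(
                        open_heap,
                        (new_g + heuristic((nr, nc)), new_g, (nr, nc)),
                    )
    return None


grid = [
    [0, 0, 0],
    [1, 1, 0],
    [0, 0, 0],
]
print(astar(grid, (0, 0), (2, 0)))
```

Every loop iteration touches the priority queue and the cost table, which are the memory accesses a hardware implementation would try to parallelize or cache.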
ISBN: 9781728146171 (print)
The striped variant of the Smith-Waterman algorithm is known to be extremely efficient and easily adaptable to SIMD architectures. However, the potential for improvement has not yet been exhausted. The popular Lazy-F loop heuristic requires additional memory access operations, and the worst-case performance of the loop can be as bad as that of the non-vectorized version. We demonstrate a progression of Lazy-F loop transformations that improve the loop performance and ultimately eliminate the loop completely. Our algorithm achieves the best asymptotic performance of all scan-based SW algorithms, O(n/p + log(p)), and is very efficient in practice.
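To show where the F dependency comes from, here is a plain scalar Smith-Waterman with affine gaps. It is neither striped nor vectorized, and the scoring parameters and names are assumptions for this example. The value `f` is carried from one query position to the next inside a column, and that chain is what a SIMD version cannot compute in a single step.

```python
def smith_waterman(database, query, match=2, mismatch=-1,
                   gap_open=3, gap_extend=1):
    """Best local alignment score; a gap of length k costs
    gap_open + (k - 1) * gap_extend."""
    neg = float("-inf")
    size = len(query)
    h_prev = [0] * (size + 1)    # H values of the previous column
    e_prev = [neg] * (size + 1)  # E values (gap across columns)
    best = 0
    for i in range(1, len(database) + 1):
        h_cur = [0] * (size + 1)
        e_cur = [neg] * (size + 1)
        f = neg                  # F value, carried down the current column
        for j in range(1, size + 1):
            e_cur[j] = max(h_prev[j] - gap_open, e_prev[j] - gap_extend)
            f = max(h_cur[j - 1] - gap_open, f - gap_extend)
            score = match if database[i - 1] == query[j - 1] else mismatch
            h_cur[j] = max(0, h_prev[j - 1] + score, e_cur[j], f)
            best = max(best, h_cur[j])
        h_prev, e_prev = h_cur, e_cur
    return best


print(smith_waterman("ACGT", "ACGT"))
print(smith_waterman("ACGTT", "ACTT"))
```

Because `e_cur[j]` depends only on the previous column, it vectorizes directly, while `f` depends on `h_cur[j - 1]` from the same column.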
ISBN: 9783030174620; 9783030174613 (print)
The growing scale of applications encoded as Boolean Satisfiability (SAT) problems imposes the need for accelerating SAT simplification, or preprocessing. Parallel SAT preprocessing has been an open challenge for many years. Therefore, we propose novel parallel algorithms for variable and subsumption elimination targeting Graphics Processing Units (GPUs). Benchmarks show that the algorithms achieve an acceleration of 66x over a state-of-the-art SAT simplifier (SatELite). Regarding SAT solving, we have conducted a thorough evaluation, combining both our GPU algorithms and SatELite with MiniSat to solve the simplified problems. In addition, we have studied the impact of the algorithms on the solvability of problems with Lingeling. We conclude that our algorithms have a considerable impact on the solvability of SAT problems.
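Subsumption elimination rests on a simple rule: a clause C subsumes a clause D when every literal of C also occurs in D, so D can be dropped without changing satisfiability. The sequential sketch below applies that rule to clauses given as lists of signed integers (DIMACS-style literals); the function name and data layout are assumptions for this example, and the paper's GPU algorithms are not reproduced.

```python
def eliminate_subsumed(clauses):
    """Remove duplicate clauses and clauses subsumed by a shorter clause."""
    # Shorter clauses first: a clause can only be subsumed by one that is
    # no longer than itself, and exact duplicates are merged by the set.
    unique = sorted({frozenset(c) for c in clauses},
                    key=lambda c: (len(c), sorted(c)))
    kept = []
    for clause in unique:
        if not any(other <= clause for other in kept):
            kept.append(clause)
    return kept


cnf = [[1, 2], [1, 2, 3], [-1, 3], [2]]
print(eliminate_subsumed(cnf))
```

The subset tests for different clauses are independent of one another, which is the property a parallel implementation can exploit.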
ISBN: 9781728154756 (digital)
ISBN: 9781728154763 (print)
Heterogeneous Computing Systems (HCS) comprising accelerators such as GPUs, FPGAs and DSPs are extensively used in the parallel computing domain. The diversity in their micro-architectures makes them suitable for various parallel scientific applications. Most existing systems that address data distribution in HCS depend heavily on the target architecture, which limits design space exploration to a known device micro-architecture. In contrast, this work uses static code analysis to develop a target-independent performance model that suggests the suitability of a data-parallel regular application for the CPU or GPU in a heterogeneous node. This model uses information available at compile time to estimate performance, and the objective is to statically obtain relative performance. With the performance estimates for both the CPU and GPU code for varied problem sizes, an application is classified as CPU-GPU or GPU-only. Furthermore, the approach also gives an optimal data distribution ratio for the application. This approach is evaluated using data-parallel applications that have varied speedups on the GPU with respect to a multi-core CPU. Using the proposed technique, an average performance improvement of 38.44% is seen across CPU-GPU benchmarks with CPU+GPU co-execution as compared to the CPU alone.
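One simple way to turn two performance estimates into a data distribution ratio is to split the data so that both devices finish at the same time. The sketch below does this under the assumption that execution time scales linearly with the amount of data and that transfer overheads are ignored; this is an illustrative model, not the paper's static analysis, and the function names are hypothetical.

```python
def gpu_share(t_cpu, t_gpu):
    """Fraction of the data to give the GPU so both devices finish together.

    t_cpu and t_gpu are the estimated times for each device to process
    the whole problem alone (assumed to scale linearly with data size).
    """
    return t_cpu / (t_cpu + t_gpu)


def co_execution_time(t_cpu, t_gpu):
    """Estimated time when CPU and GPU each process their share concurrently."""
    share = gpu_share(t_cpu, t_gpu)
    return max((1.0 - share) * t_cpu, share * t_gpu)


print(gpu_share(3.0, 1.0), co_execution_time(3.0, 1.0))
```

With these example estimates the GPU receives three quarters of the data, and the combined run is shorter than either device working alone.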