The computational model on which the algorithms are developed is the array with reconfigurable optical buses (AROB). It integrates the advantages of both optical transmission and electronic computation. The main contributions of this paper are in designing several optimal and/or optimal speed-up template matching algorithms with varying degrees of parallelism on the AROB model. For an N × N digitized image and an M × M template, when the domains of the image and the template are O(log N)-bit integers, we first design several basic operations for window broadcasting and rotation. Then, based on these basic operations, three efficient and scalable algorithms for template matching are derived using various numbers of processors on a two-dimensional (2-D) or 3-D AROB. For 1 ≤ r ≤ N, 1 ≤ p ≤ M ≤ q ≤ N, one runs in time using r × r processors, another runs in , (resp. ) time using pN × pN / log M (resp. pN × pN × log N) processors, and the other runs in (resp. ) time using pq × pq / log M (or pq × pqN × log N) processors, respectively. The latter two algorithms can be tuned to run in O(1) time on a 2-D AROB. To the best of our knowledge, there are no algorithms which can reach this time complexity for this problem on a 2-D array architecture.
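For orientation, the problem itself can be stated as a short sequential reference. This is only a sketch of template matching in general, not the AROB algorithms of the paper: the abstract does not say which match measure is used, so the sum of squared differences below is an assumption, and the names `match_template` and `best_match` are hypothetical.

```python
def match_template(image, template):
    """Score every placement of an M x M template on an N x N image.

    Assumption: the match measure is the sum of squared differences
    (lower is better); the abstract does not name the measure.
    """
    n, m = len(image), len(template)
    scores = []
    for i in range(n - m + 1):
        row = []
        for j in range(n - m + 1):
            s = 0
            for a in range(m):
                for b in range(m):
                    d = image[i + a][j + b] - template[a][b]
                    s += d * d
            row.append(s)
        scores.append(row)
    return scores


def best_match(scores):
    """Return the (row, column) of the placement with the lowest score."""
    return min(
        (value, (i, j))
        for i, row in enumerate(scores)
        for j, value in enumerate(row)
    )[1]


image = [
    [0, 0, 0, 0],
    [0, 0, 1, 2],
    [0, 0, 3, 4],
    [0, 0, 0, 0],
]
template = [[1, 2], [3, 4]]
position = best_match(match_template(image, template))
```

The sequential loop performs on the order of N² · M² basic operations, which is the work that the parallel algorithms distribute over the processors.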
This paper is devoted to the design and evaluation of a parallel version of the algorithm MIII, proposed first by Chu in [7], for the solution of the Inverse Additive Singular Value Problem (IASVP). This new algorithm has shown good experimental performance, confirming the theoretical performance predicted and showing an acceptable scalability. It has been compared with the MI parallel algorithm, described in [10]. Both parallel algorithms decrease the sequential execution time for solving the IASVP and have similar parallel execution times, but, in most cases, MIII is more accurate than MI.
Recent advances in shared memory multiprocessor system-on-chip (MP-SOC) architectures include using special step caches to efficiently implement concurrent read concurrent write memory access. Unfortunately, the existing step cache techniques do not support multioperations, which can be used to speed up the execution of a number of parallel algorithms by a logarithmic factor. This paper proposes an architectural technique for implementing multioperations on step cached MP-SOCs even if the associativity of caches is limited. The technique is based on simple active memory units, faster memory modules, and small processor-level memory blocks called scratchpads. The performance and area requirements of the proposed technique were evaluated on a parametrical MP-SOC framework. According to the evaluation, the technique implements multioperations efficiently and provides a speed-up of 4.8-7.2 with respect to baseline step cached systems and a speed-up of 3.7-5.0 with respect to existing non-step cached systems, with only a minor silicon area overhead.
Kohonen feature maps are commonly employed to process large input data, but they become effective only after a time-consuming learning process. Tests have shown that a sequential program solving a typical problem spends more than 95 percent of its time localizing the winners. The aim of the paper is to present and compare different ways of parallelizing the algorithm. We compare two classes of parallel implementations: network parallelization and learning set parallelization. Two methods of experimental evaluation are used: standard evaluation based on metrics such as speedup and efficiency, and an approximation method based on the granularity concept.
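The winner search that dominates the run time can be sketched as follows. This is a minimal illustration, assuming squared Euclidean distance and assuming that network parallelization means splitting the neurons into chunks, each searched independently before a final reduction; the chunks are processed one after another here to keep the example self-contained, and all function names are hypothetical.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def find_winner(weights, x):
    """Index of the neuron whose weight vector is closest to input x."""
    return min(range(len(weights)), key=lambda k: sq_dist(weights[k], x))


def find_winner_partitioned(weights, x, parts):
    """Same result, computed as local searches followed by a reduction.

    Each chunk stands in for the neurons owned by one worker (an
    assumption about how the network would be split).
    """
    n = len(weights)
    size = -(-n // parts)  # ceiling division: neurons per chunk
    local_winners = []
    for start in range(0, n, size):
        chunk = range(start, min(start + size, n))
        k = min(chunk, key=lambda idx: sq_dist(weights[idx], x))
        local_winners.append((sq_dist(weights[k], x), k))
    # Reduction step: smallest distance wins, ties go to the lower index.
    return min(local_winners)[1]


weights = [[0, 0], [4, 4], [10, 10], [3, 5], [9, 1]]
x = [3, 4]
winner = find_winner(weights, x)
```

Learning set parallelization would instead give every worker the whole network and a share of the training vectors, so the reduction combines weight updates rather than local winners.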
Over time, neural networks have proven to be extremely powerful tools for data exploration, with the capability to discover previously unknown dependencies and relationships in data sets. However, the sheer volume of available data and its dimensionality make data exploration a challenge. Employing neural network training paradigms in such domains can prove to be prohibitively expensive. An algorithm originally proposed for supervised on-line learning has been modified to make it suitable for deployment in large-volume, high-dimensional domains. The basic strategy is to divide the data into manageable subsets, or blocks, and maintain multiple copies of a neural network, with each copy training on a different block. A method to combine the results has been defined in such a way that convergence towards stationary points of the global error function can be guaranteed. A parallel algorithm has been implemented on a Linux-based cluster. Experimental results on popular benchmarks are included to support the efficacy of our implementation.
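The block strategy can be illustrated with a deliberately small model. The sketch below assumes a one-parameter linear model y = w·x and assumes that the copies are combined by plain averaging; the abstract does not give the paper's combining method, so the averaging step and all names here are illustrative only and carry no convergence guarantee beyond this toy case.

```python
def train_block(w, block, lr, steps):
    """Run a few gradient steps of squared-error training on one block."""
    for _ in range(steps):
        grad = sum(2 * (w * x - y) * x for x, y in block) / len(block)
        w -= lr * grad
    return w


def train_blocked(data, n_blocks, rounds=50, lr=0.05, steps=5):
    """Train one model copy per block, then combine the copies.

    Assumption: copies are combined by averaging their weights; this is
    a stand-in for the combining method defined in the paper.
    """
    blocks = [data[i::n_blocks] for i in range(n_blocks)]
    w = 0.0
    for _ in range(rounds):
        # Each copy starts from the shared weight and sees only its block.
        copies = [train_block(w, block, lr, steps) for block in blocks]
        w = sum(copies) / len(copies)
    return w


# Noise-free toy data generated from y = 2 * x.
data = [(k / 4, 2 * k / 4) for k in range(1, 9)]
w_final = train_blocked(data, n_blocks=2)
```

On a cluster, the list comprehension over `blocks` is the part that would run in parallel, with one process per copy.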
Optimal weight extraction for beamforming algorithms based on systolic structures has been the subject of much research since the well-known article by Gentleman and Kung (1981) on recursive least squares systolic arrays. Systolic algorithms are parallel and fully pipelined structures; this feature improves the performance of the beamforming algorithms and of the system. SystemC is a system design language that was recently accepted by the IEEE as a standard. SystemC allows hardware and software components to be designed together, so that the design and simulation of large systems become easier. This work simulates the minimum variance distortionless response (MVDR) beamformer proposed by Tang, Liu, and Tretter (1994) in the SystemC environment and evaluates its performance.
The quest for high performance drives parallel scientific computing software design. Well over 60% of the high-performance computing (HPC) community writes programs using the MPI library; to gain performance, they are known to perform many manual optimizations. Even tools that accept high level descriptions often generate MPI code, due to its eminent portability. However, since the overall performance of a program does not usually port (due to variations in the target architecture, cluster size, etc.), manual changes to the code are inevitable in today's approaches to MPI programming and optimization. This, together with the vastness and evolving nature of the MPI standard and the innate complexity of concurrent programming, introduces costly bugs. Our research addresses these challenges through specific efforts in the following broad areas: (i) high level expression of the parallel algorithm and compilation thereof into optimized MPI programs, (ii) optimizations of user-written detailed MPI programs through localized transformations such as barrier removal, (iii) formal modeling of complex communication standards, such as the MPI-2 standard, and a facility for answering putative queries (this need arises when standard documents are impossibly difficult to manually study in order to answer questions that are not explicitly addressed in the standard), (iv) formal modeling of new (and hence relatively less well understood) features of communication libraries, such as the one-sided communication facility of MPI-2, and (v) formal modeling of intricate control algorithms in these libraries, such as the progress engine for TCP and/or shared memory in MPICH2 (a formal model can explicate commonalities, help formally verify, as well as help create better future implementations). Our research gains focus through numerous collaborations.
A parallel simulated annealing algorithm for solving the vehicle routing problem with time windows (VRPTW) is considered. The VRPTW is an NP-hard bicriterion optimization problem in which both the number of vehicles and the total distance traveled by the vehicles are minimized. The objective is to establish to what extent the computation time required to solve the VRPTW can be decreased by a number of co-operating parallel processes with no loss of solution quality. The quality of a solution is understood as its proximity to the optimum (or best known) solution. Furthermore, some factors are proposed that allow the VRPTW benchmark tests to be ranked according to their difficulty.
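The annealing step that each co-operating process would run can be sketched in a few lines. This is a generic, single-process simulated annealing loop with the standard Metropolis acceptance rule, shown on a toy ordering problem with one scalar cost; it is not the paper's parallel algorithm and does not model the bicriterion VRPTW objective or time windows. The cooling schedule, parameter values, and function names are assumptions.

```python
import math
import random


def accept(delta, temperature, rng):
    """Metropolis rule: always take improvements, sometimes take losses."""
    if delta <= 0:
        return True
    return rng.random() < math.exp(-delta / temperature)


def anneal(cost, neighbor, start, t0=10.0, cooling=0.95, steps=2000, seed=0):
    """Generic annealing loop with geometric cooling (an assumed schedule)."""
    rng = random.Random(seed)
    current = best = start
    t = t0
    for _ in range(steps):
        candidate = neighbor(current, rng)
        if accept(cost(candidate) - cost(current), t, rng):
            current = candidate
            if cost(current) < cost(best):
                best = current
        t = max(t * cooling, 1e-9)  # keep the temperature strictly positive
    return best


# Toy problem: order five points on a line to shorten the path through them.
points = [0, 7, 2, 9, 4]


def tour_cost(order):
    return sum(abs(points[a] - points[b]) for a, b in zip(order, order[1:]))


def swap_two(order, rng):
    i, j = rng.sample(range(len(order)), 2)
    swapped = list(order)
    swapped[i], swapped[j] = swapped[j], swapped[i]
    return tuple(swapped)


start = (0, 1, 2, 3, 4)
best = anneal(tour_cost, swap_two, start)
```

In a parallel setting, several such loops would run with different seeds and periodically exchange their best solutions; how often and in what form they co-operate is exactly what the paper investigates.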
ISBN: 9781424400546 (print)
Parallel programming is facilitated by constructs which, unlike the widely used SPMD paradigm, provide programmers with a global view of the code and data structures. These constructs could be compiler directives containing information about data and task distribution, language extensions specifically designed for parallel computation, or classes that encapsulate parallelism. In this paper, we describe a class developed at Illinois and its Matlab implementation. This class can be used to conveniently express both parallelism and locality. A C++ implementation is now underway; its characteristics will be reported in a future paper. We have implemented most of the NAS benchmarks using our HTA Matlab extensions and found that HTAs enable the fast prototyping of parallel algorithms and produce programs that are easy to understand and maintain.
This paper presents an approach to developing a fine-grain parallel algorithm for artificial neural network training that parallelizes the computational operations of each elementary neuron. The back error propagation training algorithm is described, and the parallel section of the algorithm is developed. Experimental results for the parallel algorithm are given as an analysis of parallelization speedup and efficiency on the Origin 300 parallel computer.
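The two metrics named here follow the usual definitions: speedup is the sequential time divided by the parallel time, and efficiency is speedup divided by the number of processors. A minimal sketch, with hypothetical timings that are not taken from the paper:

```python
def speedup(t_sequential, t_parallel):
    """Speedup S = T_1 / T_p."""
    return t_sequential / t_parallel


def efficiency(t_sequential, t_parallel, processors):
    """Efficiency E = S / p, where p is the number of processors."""
    return speedup(t_sequential, t_parallel) / processors


# Hypothetical measurements: 100 s sequentially, 25 s on 8 processors.
s = speedup(100.0, 25.0)
e = efficiency(100.0, 25.0, 8)
```

An efficiency well below 1 for a fine-grain scheme usually points to communication or synchronization cost that outweighs the small amount of work done per neuron.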