MrBayes, a popular program for Bayesian inference of phylogeny, has not been fast enough for Biologists when dealing with large real-world data sets. this paper presents a new parallel algorithm that combines the chai...
详细信息
ISBN:
(纸本)9783642131189
MrBayes, a popular program for Bayesian inference of phylogeny, has not been fast enough for Biologists when dealing with large real-world data sets. this paper presents a new parallel algorithm that combines the chain-partitioned parallel algorithm withthe chain-parallel algorithm to obtain higher concurrency. We test the proposed hybrid algorithm withthe two old algorithms on a heterogeneous cluster. the results show that, the hybrid algorithm actually converts more CPU cores into higher speedup compared withthe two control algorithms for all of four real-world DNA data sets, therefore is more practical.
We consider the loop less k-shortest path (KSP) problem. Although this problem has been studied in the sequential setting for at least the last two decades, no good parallel implementations are known. In this paper, w...
详细信息
ISBN:
(纸本)9781450365109
We consider the loop less k-shortest path (KSP) problem. Although this problem has been studied in the sequential setting for at least the last two decades, no good parallel implementations are known. In this paper, we provide (i) a first systematic empirical comparison of various KSP algorithms and heuristic optimisations, (ii) carefully engineer various parallel implementations of these sequential algorithms and (iii) perform an extensive study of these parallel implementations on a range of graph classes and multicore architectures to determine the best algorithm and parallelization strategy for different graph classes. We find that even though the worst-case complexity of the best undirected KSP algorithm O (k (m + n log n)) is significantly better than that of the popular and considerably simpler directed KSP algorithm O(kn(m + n log n)), the two algorithms are fairly competitive in terms of their empirical performance on small diameter graphs. Furthermore, we show that a few simple optimisations help to bridge the gap between these KSP algorithms even more. However, on moderate to large diameter graphs, the undirected KSP algorithm is considerably faster than the directed algorithms, both in sequential and parallel settings. In terms of the parallelisation strategy, simply replacing the shortest path subroutine by parallel A-stepping algorithm can provide a good speed-up for many KSP algorithms on random graphs. In contrast, for graphs with skewed degree distribution, a more complex strategy of parallelizing the different deviations and then parallelizing the shortest path computation inside the deviations withthe remaining threads, provides a better performance.
the Gurevich's thesis stipulates that sequential abstract state machines (ASMS) capture the essence of sequential algorithms. On another hand, the bulk-synchronous parallel (BSP) bridging model is a well known mod...
详细信息
ISBN:
(纸本)9783030050573;9783030050566
the Gurevich's thesis stipulates that sequential abstract state machines (ASMS) capture the essence of sequential algorithms. On another hand, the bulk-synchronous parallel (BSP) bridging model is a well known model for HPC algorithm design. It provides a conceptual bridge between the physical implementation of the machine and the abstraction available to a programmer of that machine. the assumptions of the BSP model are thus provide portable and scalable performance predictions on most HPC systems. We follow Gurevich's thesis and extend the sequential postulates in order to intuitively and realistically capture BSP algorithms.
this paper discusses fast parallelalgorithms for evaluating several centrality indices frequently used in complex network analysis. these algorithms have been optimized to exploit properties typically observed in rea...
详细信息
ISBN:
(纸本)0769526365
this paper discusses fast parallelalgorithms for evaluating several centrality indices frequently used in complex network analysis. these algorithms have been optimized to exploit properties typically observed in real-world large scale networks, such as the low average distance, high local density, and heavy-tailed power law degree distributions. We test our implementations on real datasets such as the web graph, protein-interaction networks, movie-actor and citation networks, and report impressive parallel performance for evaluation of the computationally intensive centrality metrics (betweenness and closeness centrality) on high-end shared memory symmetric multiprocessor and multithreaded architectures. To our knowledge, these are the first parallel implementations of these widely-used social network analysis metrics. We demonstrate that it is possible to rigorously analyze networks three orders of magnitude larger than instances that can be handled by existing network analysis (SNA) software packages. For instance, we compute the exact betweenness centrality value for each vertex in a large US patent citation network (3 million patents, 16 million citations) in 42 minutes on 16 processors, utilizing 20GB RAM of the IBM p5570. Current SNA packages on the other hand cannot handle graphs with more than hundred thousand edges.
Sorting is one of the classic problems of data processing and many practical applications require implementation of parallel sorting algorithms. Only a few algorithms have been implemented using MPI, in this paper a f...
详细信息
ISBN:
(纸本)9781479975051
Sorting is one of the classic problems of data processing and many practical applications require implementation of parallel sorting algorithms. Only a few algorithms have been implemented using MPI, in this paper a few additional parallel sorting algorithms have been implemented using MPI. A unified performance analysis of all these algorithms has been presented using two different architectures. On basis of experimental results obtained some guidelines has been suggested for the selection of proper algorithms.
Barrier algorithms are central to the performance of numerous algorithms on scalable, high-performance architectures. Numerous barrier algorithms have been suggested and studied for Non-Uniform Memory Access (NUMA) ar...
详细信息
ISBN:
(纸本)0818656026
Barrier algorithms are central to the performance of numerous algorithms on scalable, high-performance architectures. Numerous barrier algorithms have been suggested and studied for Non-Uniform Memory Access (NUMA) architectures, but less work has been done for Cache Only Memory Access (COMA) or attraction memory [1] architectures such as the KSR-1. In this paper, we presented two new barrier algorithmsthat offer the best performance we have recorded on the KSR-1 distributed cache multiprocessor. We discuss the trade-offs and the performance of seven algorithms on two architectures. the new barrier algorithms adapt well to a hierarchical caching memory model and take advantage of parallel communication offered by most multiprocessor interconnection networks,. Performance results are shown for a 256-processor KSR-1 and a 20-processor Sequent Symmetry.
Memory-CPU single communication channel bottleneck of the von Neumann architecture is quickly stalling the growth of computer processors. A probable solution to this problem is to fuse processing and memory elements. ...
详细信息
ISBN:
(纸本)9783642246494
Memory-CPU single communication channel bottleneck of the von Neumann architecture is quickly stalling the growth of computer processors. A probable solution to this problem is to fuse processing and memory elements. A simple low latency single on-chip memory and processor cannot solve the problem as the fundamental channel bottleneck will still be there due to the logical splitting of processor and memory. this paper presents that a paradigm shift is possible by combining Arithmetic logic unit and Random Access Memory (ARAM) elements at bit level. this bit level modest ARAM is used to perform word level ALU instructions with minor modifications. this makes the ARAM cells capable of executing instructions in parallel. It is also asynchronous and hence reduces power consumption significantly. A CMOS implementation is presented that verifies the practicality of the proposed ARAM.
Image processing is an unceasingly growing area with a range of applications including cryptography, medicine, video surveillance, remote sensing, and many more. Implementing sophisticated algorithms to process the la...
详细信息
ISBN:
(纸本)9781728188676
Image processing is an unceasingly growing area with a range of applications including cryptography, medicine, video surveillance, remote sensing, and many more. Implementing sophisticated algorithms to process the large amount of data using software solutions makes the response slower, and that's where hardware implementation comes into the picture. Field Programmable Gate Arrays (FPGAs) are getting popular due to low latency, connectivity, parallel computing, and flexibility. the unique architecture of the FPGA has made it possible to use the technology for applying in many applications to have better and faster results. this paper is aiming at providing a comprehensive survey on the hardware implementations of image processingalgorithms to facilitate the improvement in efficiency using FPGAs. the widely used Xilinx PYNQ is also presented in this paper as they play a major role in reducing the development time.
We evaluate the applicability of many-core architectures for the simulation of networks on chips (NoC). Compared to the well established shared memory multi-core architectures, many-core architectures significantly di...
详细信息
ISBN:
(纸本)9781467387767
We evaluate the applicability of many-core architectures for the simulation of networks on chips (NoC). Compared to the well established shared memory multi-core architectures, many-core architectures significantly differ not only in the number of processing elements but also in the on-chip communication architecture, the memory subsystem, and the computational performance of an individual core. Proven multi-core simulation approaches do not consider such architectural aspects and thus suffer limited performance when being applied to many-core architectures. To enable high performance simulation, we identify conceptual drawbacks of state of the art parallel simulation approaches and consequently propose a novel globally asynchronous locally synchronous (GALS) simulation concept suited for many core architectures. Our results show that our GALS simulation approach yields a speedup of up to 2.3 over parallel discrete event simulation.
In this paper, two distinct image processingarchitectures have been outlined, the first withparallel scaleable features, and the second incorporating boundary scan test. Boththese architectures utilize high perform...
详细信息
In this paper, two distinct image processingarchitectures have been outlined, the first withparallel scaleable features, and the second incorporating boundary scan test. Boththese architectures utilize high performance digital signal processing (DSP)'s and have allowed fast image processing systems to be developed at a reasonable cost. the parallel architecture shows significant speed-up when processing image, with a 2-D FFT illustrating certain aspect of the system.
暂无评论