Segmentation of an image into superpixel clusters is a necessary part of many imaging pathways. In this article, we describe a new routine for superpixel image segmentation (F-DBSCAN) based on the DBSCAN algorithm that is six times faster than existing methods while remaining competitive in segmentation quality and resistance to noise. The speed gains are achieved through efficient parallelization of the cluster search process: limiting the size of each cluster enables the processes to operate in parallel without duplicating search areas. Calculations are performed in large consolidated memory buffers, which eliminate fragmentation and maximize memory cache hits, thus improving performance. When tested on the Berkeley Segmentation Dataset, the average processing speed is 175 frames/s with a Boundary Recall of 0.797 and an Achievable Segmentation Accuracy of 0.944.
We present Parallel Rapidly Exploring Random Tree (PRRT) and Parallel RRT* (PRRT*), which are sampling-based methods for feasible and optimal motion planning designed for modern multicore CPUs. We parallelize RRT and RRT* such that all threads concurrently build a single motion planning tree. Parallelization in this manner requires data structures, such as the nearest neighbor search tree and the motion planning tree, to be safely shared across multiple threads. Rather than relying on traditional locks, which can result in slowdowns due to lock contention, we introduce algorithms that are based on lock-free concurrency using atomic operations. We further improve scalability by using partition-based sampling (which shrinks each core's working dataset to improve cache efficiency) and parallel work-saving (which reduces the number of rewiring steps performed in PRRT*). Because PRRT and PRRT* are CPU-based, they can be directly integrated with existing libraries. In scenarios such as the Alpha Puzzle, Cubicles, and the Aldebaran Nao performing a two-handed task, we demonstrate that PRRT and PRRT* scale well as core counts increase, and in some cases they exhibit superlinear speedup.
Many concurrent data-structure implementations - both blocking and non-blocking - use the well-known compare-and-swap (CAS) operation, supported in hardware by most modern multiprocessor architectures, for inter-thread synchronization. A key weakness of the CAS operation is its performance in the presence of memory contention. When multiple threads concurrently attempt to apply CAS operations to the same shared variable, at most a single thread will succeed in changing the shared variable's value and the CAS operations of all other threads will fail. Moreover, significant degradation in performance occurs when variables manipulated by CAS become contention 'hot spots', because failed CAS operations congest the interconnect and memory devices and slow down successful CAS operations. In this work, we study the following question: can software-based contention management improve the efficiency of hardware-provided CAS operations? In other words, can a software contention management layer, encapsulating invocations of hardware CAS instructions, improve the performance of CAS-based concurrent data structures? To address this question, we conduct what is, to the best of our knowledge, the first study on the impact of contention management algorithms on the efficiency of the CAS operation. We implemented several Java classes that extend Java's AtomicReference class and encapsulate calls to the native CAS instruction with simple contention management mechanisms tuned for different hardware platforms. A key property of our algorithms is the support for an almost-transparent interchange with Java's AtomicReference objects, used in implementations of concurrent data structures. We evaluate the impact of these algorithms on both a synthetic micro-benchmark and on CAS-based concurrent implementations of widely used data structures such as stacks and queues. Our performance evaluation establishes that lightweight software-based contention management support can greatly improve
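The idea of wrapping CAS in a contention-management layer can be illustrated with a minimal Python sketch. This is not the authors' Java code: Python has no hardware CAS, so the hypothetical `SimulatedAtomicRef` emulates atomicity with a lock, and the hypothetical `BackoffAtomicRef` adds randomized exponential backoff after a failed CAS, which is only one possible contention-management policy. All class names and delay parameters here are assumptions made for illustration.

```python
import random
import threading
import time


class SimulatedAtomicRef:
    """Stand-in for a CAS cell. A lock emulates atomicity (assumption:
    real implementations use a hardware CAS instruction instead)."""

    def __init__(self, value):
        self._value = value
        self._lock = threading.Lock()

    def get(self):
        with self._lock:
            return self._value

    def compare_and_set(self, expected, new):
        # Succeeds only if the cell still holds the expected value.
        with self._lock:
            if self._value == expected:
                self._value = new
                return True
            return False


class BackoffAtomicRef(SimulatedAtomicRef):
    """Same interface, but a failed CAS waits before returning so that
    retries from competing threads are spread out in time."""

    def __init__(self, value, min_delay=1e-6, max_delay=1e-3):
        super().__init__(value)
        self._min_delay = min_delay
        self._max_delay = max_delay
        self._local = threading.local()  # per-thread backoff state

    def compare_and_set(self, expected, new):
        if super().compare_and_set(expected, new):
            self._local.delay = self._min_delay  # reset after success
            return True
        delay = getattr(self._local, "delay", self._min_delay)
        time.sleep(random.uniform(0, delay))
        self._local.delay = min(2 * delay, self._max_delay)
        return False


def increment(ref, times):
    """Classic CAS retry loop: read, compute, attempt to publish."""
    for _ in range(times):
        while True:
            current = ref.get()
            if ref.compare_and_set(current, current + 1):
                break


def run_counter(num_threads=4, increments=200):
    ref = BackoffAtomicRef(0)
    threads = [
        threading.Thread(target=increment, args=(ref, increments))
        for _ in range(num_threads)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return ref.get()
```

Because the wrapper keeps the same `get` / `compare_and_set` interface as the plain cell, calling code such as `increment` does not change when one is swapped for the other, which mirrors the near-transparent interchange the abstract describes.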
Hashing has long been recognized as a fast method for accessing records by key in large relatively static databases. However, when the amount of data is likely to grow significantly, traditional hashing suffers from performance degradation and may eventually require rehashing all the records into a larger space. Recently, a number of techniques for dynamic hashing have appeared. In this paper, we present a solution to allow for concurrency in one of these dynamic hashing data structures, namely extendible hashfiles. The solution is based on locking protocols and minor modifications in the data structure.
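For readers unfamiliar with extendible hashing, the following minimal Python sketch shows the underlying structure: a directory indexed by the low bits of the hash, buckets with a local depth, bucket splitting, and directory doubling. The abstract does not spell out the paper's locking protocols, so this sketch simply guards every operation with one coarse lock as a placeholder. The class names and the `bucket_capacity` parameter are hypothetical.

```python
import threading


class Bucket:
    def __init__(self, local_depth):
        self.local_depth = local_depth  # number of hash bits this bucket uses
        self.items = {}


class ExtendibleHash:
    """In-memory sketch. Assumes at most `bucket_capacity` keys share the
    same full hash value; otherwise splitting could never separate them."""

    def __init__(self, bucket_capacity=2):
        self.capacity = bucket_capacity
        self.global_depth = 1
        self.directory = [Bucket(1), Bucket(1)]
        self._lock = threading.Lock()  # coarse lock, not the paper's protocol

    def _index(self, key):
        # Use the low `global_depth` bits of the hash as the directory slot.
        return hash(key) & ((1 << self.global_depth) - 1)

    def get(self, key, default=None):
        with self._lock:
            return self.directory[self._index(key)].items.get(key, default)

    def put(self, key, value):
        with self._lock:
            while True:
                bucket = self.directory[self._index(key)]
                if key in bucket.items or len(bucket.items) < self.capacity:
                    bucket.items[key] = value
                    return
                self._split(bucket)  # make room, then retry

    def _split(self, bucket):
        if bucket.local_depth == self.global_depth:
            # Double the directory; slots i and i + old_size share a bucket.
            self.directory = self.directory + self.directory
            self.global_depth += 1
        bucket.local_depth += 1
        sibling = Bucket(bucket.local_depth)
        high_bit = 1 << (bucket.local_depth - 1)
        # Slots whose newly significant bit is set now point to the sibling.
        for i, b in enumerate(self.directory):
            if b is bucket and i & high_bit:
                self.directory[i] = sibling
        # Redistribute the old contents between the two buckets.
        old_items, bucket.items = bucket.items, {}
        for k, v in old_items.items():
            self.directory[self._index(k)].items[k] = v
```

Only the overflowing bucket is rehashed on a split, which is why the structure avoids the full rehash that the abstract attributes to traditional hashing.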
The widespread availability of local-area networks has made the combined processing power of workstations a viable approach for compute-intensive analyses. In this paper, we describe several distributed algorithms for structural analysis using finite element methods, and we assess their performance on a conventional Ethernet-connected workstation network. Direct, iterative and hybrid equation solvers are evaluated for their performance on plane-elasticity problems, and are contrasted with respect to overall solution time and efficiency in distributing computations over a network. Equations modeling the costs of network communication and structural analysis computations are derived, and are subsequently used to predict the performance of several variations on the implemented algorithms. Our results show that each of the methods performs well on network architectures, and in particular that, while direct methods usually minimize network communication, certain iterative and hybrid methods can often be used to minimize overall solution time. Copyright (C) 1996 Civil-Comp Limited and Elsevier Science Limited.
Sequential versions of those optimisation algorithms which are based on random search heuristics are often too slow to be of value to the interactive user of a CAD workstation. A significant gain in speed can be achieved by using concurrent algorithms to drive an optimising accelerator attached to the workstation. The paper discusses the design and performance of a hardware accelerator which incorporates INMOS transputers. Concurrent versions of two algorithms are described, one relevant to combinatorial optimisation and the other to global optimisation. The mapping of these algorithms onto the transputer hardware is discussed. The application and performance of each algorithm are illustrated by means of a representative problem from the field of electronic engineering.
This paper describes two new versions of the controlled random search procedure for global optimization (CRS). Designed primarily to suit the user of a CAD workstation, these algorithms can also be used effectively in other contexts. The first, known as CRS3, speeds the final convergence of the optimization by combining a local optimization algorithm with the global search procedure. The second, called CCRS, is a concurrent version of CRS3. This algorithm is intended to drive an optimizing accelerator, based on a concurrent processing architecture, which can be attached to a workstation to achieve a significant increase in speed. The results are given of comparative trials which involve both unconstrained and constrained optimization.
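As background, the basic controlled random search loop that CRS3 and CCRS build on can be sketched as follows. This is a minimal Python illustration of the original CRS idea only (reflect one stored point through the centroid of several others and replace the current worst point if the trial is better). The local optimization phase of CRS3 and the concurrency of CCRS are not shown, and the function name, parameters, and default values are assumptions.

```python
import random


def controlled_random_search(f, bounds, population=50, iterations=5000, seed=0):
    """Minimise f over a box. `bounds` is a list of (low, high) pairs."""
    rng = random.Random(seed)
    dim = len(bounds)
    # Initial store of points sampled uniformly inside the box.
    points = [[rng.uniform(lo, hi) for lo, hi in bounds] for _ in range(population)]
    values = [f(p) for p in points]

    for _ in range(iterations):
        # Pick dim + 1 distinct stored points: dim for the centroid, one pole.
        chosen = rng.sample(range(population), dim + 1)
        centroid = [
            sum(points[i][d] for i in chosen[:-1]) / dim for d in range(dim)
        ]
        pole = points[chosen[-1]]
        # Trial point is the reflection of the pole through the centroid.
        trial = [2 * c - x for c, x in zip(centroid, pole)]
        if any(not lo <= t <= hi for t, (lo, hi) in zip(trial, bounds)):
            continue  # discard trials that leave the search box
        worst = max(range(population), key=values.__getitem__)
        trial_value = f(trial)
        if trial_value < values[worst]:
            points[worst], values[worst] = trial, trial_value

    best = min(range(population), key=values.__getitem__)
    return points[best], values[best]


def sphere(x):
    return sum(v * v for v in x)
```

A call such as `controlled_random_search(sphere, [(-5, 5), (-5, 5)])` returns the best stored point and its function value. Trial evaluations are independent of one another, which is one plausible reason the procedure suits a concurrent accelerator.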
Synchrony continues to be an important concern in concurrent programming. Existing languages and models have introduced a great diversity of constructs for expressing and managing synchronization among sequential processes or atomic actions. This paper puts forth a model in which synchrony is viewed as a relation among atomic actions, a relation which may evolve with time. The model is shown to be convenient for expressing formally the semantics of synchrony as it appears in many of the languages and models proposed to date. Among such models Swarm is singled out for its use of dynamic synchrony. The Swarm notation is briefly reviewed. A new concurrent algorithm for the Leader Election problem provides a vehicle for illustrating the use of dynamic synchrony in Swarm.
Sequential versions of combinatorial optimisation algorithms which are based on random search heuristics are generally too slow to be of value to the interactive user of a CAD workstation. This paper describes a concurrent version of the simulated annealing algorithm, and also a variant of this algorithm called CCO. The results are given of comparative trials of these algorithms. A significant gain in speed can be achieved by using concurrent algorithms to drive an optimising accelerator attached to the workstation. Also discussed is a divide-and-conquer procedure for decomposing complex combinatorial problems into minimally interdependent subproblems of manageable size. This decomposition procedure makes use of the CCO algorithm.
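For orientation, the sequential simulated annealing baseline that such concurrent versions start from can be sketched in Python as below. The abstract does not describe how the concurrent version or CCO is structured, so neither is shown here. The geometric cooling schedule, the parameter defaults, and the number-partitioning example are all assumptions chosen for illustration.

```python
import math
import random


def simulated_annealing(cost, neighbour, initial,
                        t_start=10.0, t_end=1e-3, cooling=0.995, seed=0):
    """Minimise `cost`; `neighbour(state, rng)` proposes a nearby state."""
    rng = random.Random(seed)
    current, current_cost = initial, cost(initial)
    best, best_cost = current, current_cost
    t = t_start
    while t > t_end:
        candidate = neighbour(current, rng)
        candidate_cost = cost(candidate)
        delta = candidate_cost - current_cost
        # Always accept improvements; accept worse moves with
        # probability exp(-delta / t) (Metropolis criterion).
        if delta <= 0 or rng.random() < math.exp(-delta / t):
            current, current_cost = candidate, candidate_cost
            if current_cost < best_cost:
                best, best_cost = current, current_cost
        t *= cooling  # geometric cooling (an assumed schedule)
    return best, best_cost


# Example problem: split numbers into two groups with nearly equal sums.
NUMBERS = [31, 17, 8, 42, 23, 5, 19, 11, 27, 14]


def partition_cost(assignment):
    left = sum(n for n, side in zip(NUMBERS, assignment) if side == 0)
    right = sum(n for n, side in zip(NUMBERS, assignment) if side == 1)
    return abs(left - right)


def flip_one(assignment, rng):
    i = rng.randrange(len(assignment))
    flipped = list(assignment)
    flipped[i] = 1 - flipped[i]
    return tuple(flipped)
```

Calling `simulated_annealing(partition_cost, flip_one, (0,) * len(NUMBERS))` returns the best assignment seen and its cost. Tracking the best state separately means the returned cost is never worse than that of the starting state.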
Concurrent data structures provide fundamental building blocks for concurrent programming. Standard concurrent data structures may be extended by allowing a sequence of operations to be submitted as a batch for later execution. A sequence of such operations can then be executed more efficiently than the standard execution of one operation at a time. In this article, we develop a novel algorithmic extension to the prevalent FIFO queue data structure that exploits such batching scenarios. An implementation in C++ on a multicore demonstrates significant performance improvement of more than an order of magnitude (depending on the batch lengths and the number of threads) compared to previous queue implementations.
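The batching interface described above can be illustrated with a small Python sketch. It is not the article's algorithm (the abstract does not describe the C++ implementation's synchronization): here a thread records operations in a private `Batch` without touching shared state, and `execute` applies the whole sequence under a single lock acquisition. The class and method names are hypothetical.

```python
import threading
from collections import deque


class BatchingQueue:
    """FIFO queue with ordinary operations plus deferred batches."""

    def __init__(self):
        self._items = deque()
        self._lock = threading.Lock()

    def enqueue(self, item):
        with self._lock:
            self._items.append(item)

    def dequeue(self):
        # Returns None when the queue is empty.
        with self._lock:
            return self._items.popleft() if self._items else None

    def new_batch(self):
        return Batch(self)


class Batch:
    """Thread-private list of pending operations on one queue."""

    def __init__(self, queue):
        self._queue = queue
        self._ops = []

    def enqueue(self, item):
        self._ops.append(("enq", item))  # recorded locally, no lock taken

    def dequeue(self):
        self._ops.append(("deq", None))

    def execute(self):
        """Apply all recorded operations in order, atomically with respect
        to other queue users. Returns the results of the batched dequeues."""
        results = []
        queue = self._queue
        with queue._lock:  # one synchronisation step for the whole batch
            for op, item in self._ops:
                if op == "enq":
                    queue._items.append(item)
                else:
                    results.append(
                        queue._items.popleft() if queue._items else None
                    )
        self._ops = []
        return results
```

In this sketch the saving comes from paying the synchronisation cost once per batch rather than once per operation. The size of that saving in a real implementation depends on batch length and thread count, as the abstract notes.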