The analysis of a crystal nanostructure relies on information obtained from electron microscopy. Mathematically, a crystal structure is described by unit cells, the minimal building blocks that form the entire crystal lattice by parallel translation. Parametric identification is an important problem in three-dimensional crystal lattice research. Applying the constant step size gradient descent method to this problem yielded a sufficient increase in identification accuracy. However, the computational complexity of that algorithm significantly exceeds the complexity of existing parametric identification algorithms, which has caused a substantial increase in execution time. To eliminate this disadvantage, this work proposes a vector algorithm for crystal lattice parametric identification implemented with CUDA technology.
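As a rough illustration of the constant step size gradient descent mentioned above, the following sketch minimizes a toy quadratic. The function names and the objective are hypothetical and chosen only for illustration; they are not the identification criterion or the CUDA implementation of the paper.

```python
def gradient_descent(grad, x0, step, iterations):
    """Constant step size gradient descent on a list of coordinates."""
    x = list(x0)
    for _ in range(iterations):
        g = grad(x)
        # Move against the gradient by the same fixed step in every iteration.
        x = [xi - step * gi for xi, gi in zip(x, g)]
    return x


def toy_gradient(x):
    """Gradient of the toy objective (x0 - 3)^2 + (x1 + 1)^2 (an assumption)."""
    return [2.0 * (x[0] - 3.0), 2.0 * (x[1] + 1.0)]


if __name__ == "__main__":
    print(gradient_descent(toy_gradient, [0.0, 0.0], step=0.1, iterations=200))
```

Each coordinate update is independent of the others, which is the kind of structure a vectorized or GPU version can exploit.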
LDA is a widely used machine learning technique for big data analysis. The application includes an inference algorithm that iteratively updates a model until it converges. A major challenge is the scaling issue in parallelization owing to the fact that the model size is huge and parallel workers need to communicate the model continually. We identify three important features of the model in parallel LDA computation: 1. The volume of model parameters required for local computation is high; 2. The time complexity of local computation is proportional to the required model size; 3. The model size shrinks as it converges. By investigating collective and asynchronous methods for model communication in different tools, we discover that optimized collective communication can improve the model update speed, thus allowing the model to converge faster. The performance improvement derives not only from accelerated communication but also from reduced iteration computation time as the model size shrinks during the model convergence. To foster faster model convergence, we design new collective communication abstractions and implement two Harp-LDA applications, “lgs” and “rtt”. We compare our new approach with Yahoo! LDA and Petuum LDA, two leading implementations favoring asynchronous communication methods in the field, on a 100-node, 4000-thread Intel Haswell cluster. The experiments show that “lgs” can reach higher model likelihood with shorter or similar execution time compared with Yahoo! LDA, while “rtt” can run up to 3.9 times faster compared with Petuum LDA when achieving similar model likelihood.
In this work we address the parallel complexity of two combinatorial problems, specifically the problems of the existence and of the construction of a parity base of preassigned weight (exact parity base for short) in a 0-1 weighted, represented matroid, subject to parity conditions. We prove that these problems lie in the parallel complexity class RNC², i.e. they are solvable with one-sided error by a logspace uniform family of bounded fan-in circuits of polynomial size and quadratic logarithmic depth which receive, in addition to the problem input, a polynomial number of random input bits. We also show that the more general cases of these problems, defined over matroids weighted with integral instead of 0-1 weights, also belong to RNC², as long as the weights are given in unary notation. As a consequence some special cases of these problems, which are of independent interest, belong to the same parallel complexity class: examples of these are the problem of the construction of a perfect matching of preassigned weight in a 0-1 weighted graph, recently addressed in [1], or that of the construction of a base of preassigned weight in the intersection of two 0-1 weighted represented matroids.
This paper shows that the prefix-sums of n binary values can be computed in time on an n × m reconfigurable mesh of the word model. It also shows that prefix-sums of n binary values can be computed in time on an n × m reconfigurable mesh of the word model if the reconfigurable mesh has communication capability that allows simultaneous sending to the same bus.
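For reference, the quantity being computed can be stated as a short sequential sketch. This is only a definition of prefix-sums over binary values, not the reconfigurable mesh algorithm; the function name is hypothetical.

```python
from itertools import accumulate


def prefix_sums(bits):
    """Return the running sums of a sequence of 0/1 values."""
    return list(accumulate(bits))


if __name__ == "__main__":
    print(prefix_sums([1, 0, 1, 1, 0]))
```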
This paper presents an algorithm which sums up n binary values on an n × m reconfigurable mesh in O(log n / √(m log m)) time. This algorithm also yields a corollary which states that n binary values can be summed up on an n × (log² n / log log n) reconfigurable mesh in constant time.
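The corollary can be checked numerically, reading the bound as log n / sqrt(m log m) and the mesh width as m = log² n / log log n (an interpretation of the notation above, with base-2 logarithms assumed). The sketch below, with hypothetical function names, evaluates the bound at that width and shows that it stays within a constant range as n grows.

```python
import math


def time_bound(n, m):
    """Evaluate log n / sqrt(m log m), ignoring the constant hidden by O()."""
    return math.log2(n) / math.sqrt(m * math.log2(m))


def corollary_width(n):
    """Mesh width m = log^2 n / log log n used in the corollary."""
    log_n = math.log2(n)
    return log_n * log_n / math.log2(log_n)


if __name__ == "__main__":
    for k in (4, 8, 16, 32, 64):
        n = 2 ** k
        print(k, round(time_bound(n, corollary_width(n)), 3))
```

With this width, m log m is proportional to log² n, so the square root cancels the log n in the numerator.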
We study efficient parallel solutions to the problem of selecting r elements at specified ranks from a set of n arbitrary elements, known as multiselection, in a hypercube with p < n processors. We propose two parallel algorithms based on different approaches, where one requires processors to operate in the SIMD mode, and the other in the MIMD mode. Our SIMD algorithm runs in O(n^ϵ min{r, log p}) time when p = n^(1−ϵ) for any 0 < ϵ < 1, which is cost-optimal when r ≥ p. With the same number of processors, our MIMD algorithm runs in O(n^ϵ log r) time and is cost-optimal for any value of r. Both algorithms are more efficient than straightforward solutions and than direct simulation of the optimal EREW algorithm.
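To make the multiselection problem concrete, here is a minimal sequential sketch that partitions around a random pivot and recurses only into the parts that still contain requested ranks. It is an illustrative baseline with hypothetical names, not the SIMD or MIMD hypercube algorithm of the paper; ranks are assumed to be 1-based.

```python
import random


def multiselect(items, ranks):
    """Return the elements of `items` at the given 1-based ranks."""
    values = list(items)
    if any(k < 1 or k > len(values) for k in ranks):
        raise ValueError("ranks must lie in 1..len(items)")
    found = {}

    def solve(vals, wanted, offset):
        # `vals` holds the elements whose global ranks are offset+1 .. offset+len(vals).
        if not wanted:
            return
        pivot = random.choice(vals)
        less = [v for v in vals if v < pivot]
        greater = [v for v in vals if v > pivot]
        lo = offset + len(less)            # ranks up to `lo` fall in `less`
        hi = offset + len(vals) - len(greater)  # ranks lo+1..hi equal the pivot
        for k in wanted:
            if lo < k <= hi:
                found[k] = pivot
        solve(less, [k for k in wanted if k <= lo], offset)
        solve(greater, [k for k in wanted if k > hi], hi)

    solve(values, sorted(set(ranks)), 0)
    return [found[k] for k in ranks]


if __name__ == "__main__":
    print(multiselect([7, 2, 9, 4, 1, 8], [1, 3, 6]))
```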
Cellular automata (CA) are a well-studied non-linear model of complex systems in which an infinite one-dimensional array of finite state machines (cells) updates itself synchronously according to a uniform local rule. The sequence generation problem on CAs has been studied, and several real-time sequence generation algorithms have been proposed for a variety of non-regular sequences such as the prime, Fibonacci, and {2^n | n = 1, 2, 3, ...} sequences. The paper describes the sequence generation powers of CAs having a small number of states, focusing on CAs with one, two, and three internal states, respectively. The authors enumerate all of the sequences generated by two-state CAs and present several non-regular sequences that can be generated in real time by three-state CAs, but not by any two-state CA. It is shown that there exists a sequence generation gap among the powers of those small CAs.
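The synchronous update and the idea of reading a sequence off the leftmost cell can be sketched as follows. The finite window with a quiescent boundary, the XOR rule, and all names are assumptions made for illustration; the rule shown is not one of the sequence generators discussed in the paper.

```python
def step(cells, rule, quiescent=0):
    """Apply one synchronous update of a uniform three-neighbour local rule."""
    padded = [quiescent] + list(cells) + [quiescent]
    # Every new state is computed from the old configuration only.
    return [rule(padded[i - 1], padded[i], padded[i + 1])
            for i in range(1, len(padded) - 1)]


def marked_times(initial, rule, steps, mark=1):
    """Collect the times t in 0..steps at which the leftmost cell is in state `mark`."""
    cells, times = list(initial), []
    for t in range(steps + 1):
        if cells[0] == mark:
            times.append(t)
        cells = step(cells, rule)
    return times


def xor_rule(left, centre, right):
    """Toy two-state rule (an assumption): new state is left XOR right."""
    return left ^ right


if __name__ == "__main__":
    print(marked_times([1, 0, 0, 0], xor_rule, steps=4))
```

In the real-time generation setting, a CA generates a sequence when the leftmost cell enters the marked state exactly at the times belonging to that sequence.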
Many popular entropy definitions for signals, including approximate and sample entropy, are based on the idea of embedding the time series into an m-dimensional space, aiming to detect complex, deeper and more informative relationships among samples. However, for both approximate and sample entropy, the high computational cost is a severe limitation. Especially when large amounts of data are processed, or when parameter tuning requires a large number of executions, the need for fast computation algorithms becomes urgent. In the past, our research team proposed fast algorithms for sample, approximate and bubble entropy. In the general case, the bucket-assisted algorithm was the one presenting the lowest execution times. In this paper, we exploit the opportunities given by multithreading technology to further reduce the computation time. Without special requirements in hardware, since today even cost-effective home computers support multithreading, the computation of entropy definitions can be significantly accelerated. The aim of this paper is threefold: (a) to extend the bucket-assisted algorithm for multithreaded processors, (b) to present updated execution times for the bucket-assisted algorithm, since advances in hardware and compiler technology affect both execution times and gain, and (c) to provide a Python library which wraps fast C implementations capable of running in parallel on multithreaded processors.
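To show where the cost comes from, here is a minimal sketch of sample entropy computed straight from its definition, comparing every pair of embedded templates. It is the quadratic baseline, not the bucket-assisted or multithreaded algorithm, and it is not the API of the library the paper provides; the function name, the absolute tolerance `r`, and the use of the Chebyshev distance with a non-strict threshold are assumptions.

```python
import math


def sample_entropy(x, m=2, r=0.2):
    """Naive sample entropy: -ln(A / B) over template pairs within tolerance r."""
    n = len(x)

    def count_matches(length):
        # Count pairs i < j of templates of the given length that match
        # under the Chebyshev (maximum) distance; self-matches are excluded.
        count = 0
        for i in range(n - m):
            for j in range(i + 1, n - m):
                if max(abs(x[i + k] - x[j + k]) for k in range(length)) <= r:
                    count += 1
        return count

    b = count_matches(m)      # matches of length m
    a = count_matches(m + 1)  # matches of length m + 1
    if a == 0 or b == 0:
        return math.inf       # undefined ratio; reported here as infinity
    return -math.log(a / b)


if __name__ == "__main__":
    print(sample_entropy([0.0, 1.0] * 10, m=2, r=0.1))
```

The two nested loops over template pairs are what make large inputs and repeated parameter sweeps expensive.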
Abstract: Lot-sizing decisions strongly affect the efficiency of a production system and have therefore always played an important role in the MRP framework. Although considerable research exists on the subject, the adoption in practice of most optimal lot-sizing algorithms is hindered by the huge amount of computer resources required to solve the models, even for a modest problem. Since powerful parallel computers are becoming cost-effective, it is worth exploring parallel algorithms that can be used to solve these laborious computational problems. This paper presents two parallel algorithms for solving the dynamic lot-sizing problem using the cost path concept. For a problem of size n, the first proposed parallel algorithm is O(n^2) with n processors and the second is O(n^3/p + np^2) with p processors (p < …). … parallel algorithms and some future research directions are also provided.
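For orientation, the dynamic lot-sizing problem can be sketched as a sequential shortest-path style recurrence over periods. This is a textbook-style baseline under assumed cost terms (a fixed setup cost per order and a linear holding cost per unit per period), with hypothetical names; it is not either of the parallel algorithms the paper proposes.

```python
def lot_sizing_cost(demand, setup_cost, holding_cost):
    """Minimum total cost of meeting every period's demand without backlogging."""
    n = len(demand)
    # best[j] is the minimum cost of covering periods 1..j; best[0] is the empty plan.
    best = [0.0] + [float("inf")] * n
    for j in range(1, n + 1):
        for i in range(1, j + 1):
            # The last order is placed in period i and covers periods i..j.
            # Demand of period t is then held for (t - i) periods.
            carry = sum(holding_cost * (t - i) * demand[t - 1]
                        for t in range(i, j + 1))
            best[j] = min(best[j], best[i - 1] + setup_cost + carry)
    return best[n]


if __name__ == "__main__":
    print(lot_sizing_cost([10, 20], setup_cost=50, holding_cost=1))
```

As written, the inner sum makes this sketch cubic in n; the minimizations for different j share most of their work, which is the kind of redundancy a parallel formulation can distribute across processors.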
This letter presents the modelling of a morphological thinning algorithm suggested by Jang and Chin [1] on the four models of shared memory SIMD computers. The time and cost complexity analyses for the models have been given. The performance of this algorithm on SIMD computers has been compared with the performance of a conventional thinning algorithm [2] proposed recently.