This report presents results on a parallel implementation of the expectation-maximization (EM) algorithm for multidimensional latent variable models. The developments presented here are based on code that parallelizes both the E step and the M step of the parallel-E parallel-M algorithm. Examples presented in this report include item response theory, diagnostic classification models, multitrait–multimethod (MTMM) models, and discrete mixture distribution models. These types of models are frequently applied to the analysis of multidimensional responses of test takers to a set of items, for example, in the context of proficiency testing. The algorithm presented here is based on a direct implementation of massive parallelism using a paradigm that allows the distribution of work among a number of processor cores. Modern desktop computers as well as many laptops use processors that contain 2–4 cores and potentially twice the number of virtual cores. Many servers use 2, 4, or more multicore central processing units (CPUs), which brings the number of cores to 8, 12, 32, or even 64 or more. The algorithm presented here scales the time reduction in the most calculation-intense part of the program almost linearly for some problems, which means that a server with 32 physical cores executes the parallel-E step algorithm up to 24 times faster than a single-core computer or the equivalent nonparallel algorithm. The overall gain (including parts of the program that cannot be executed in parallel) can reach a reduction in time by a factor of 6 or more for a 12-core machine. The basic approach is to utilize the architecture of modern CPUs, which often involves the design of processors with multiple cores that can run programs simultaneously. The use of this type of architecture for algorithms that produce posterior moments has straightforward appeal: The calculations conducted for each respondent or each distinct response pattern can be split up into simultaneous calculations
Transactional Memory (TM) is an alternative way of synchronizing concurrent accesses to shared memory by adopting the abstraction of transactions in place of low-level mechanisms like locks and barriers. TMs usually apply optimistic concurrency control to provide a universal and easy-to-use method of maintaining correctness. However, this approach performs a high number of aborts in high-contention workloads, which can adversely affect performance. Optimistic TMs can cause problems when transactions contain irrevocable operations. Hence, pessimistic TMs were proposed to solve some of these problems. However, an important way of achieving efficiency in pessimistic TMs is to use early release. On the other hand, early release is seemingly at odds with opacity, the gold standard of TM safety properties, which does not allow transactions to make their state visible until they commit. In this paper we propose a proof technique that makes it possible to demonstrate that a TM with early release can be opaque as long as it prevents inconsistent views.
With problem size and complexity increasing, several parallel and distributed programming models and frameworks have been developed to handle such problems efficiently. This paper briefly reviews the parallel computing models and describes three widely recognized parallel programming frameworks: OpenMP, MPI, and MapReduce. OpenMP is the de facto standard for parallel programming on shared memory systems. MPI is the de facto industry standard for distributed memory systems. The MapReduce framework has become the de facto standard for large-scale data-intensive applications. The qualitative pros and cons of each framework are known, but quantitative performance indexes give a clearer picture of which framework to use for a given application. Two benchmark problems are chosen to compare the frameworks: the all-pairs-shortest-path problem and the data join problem. This paper presents parallel programs for both problems implemented on each of the three frameworks, and reports experimental results on a cluster of computers. It also discusses which tool is right for which job by analyzing the characteristics and performance of the paradigms.
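To make the all-pairs-shortest-path benchmark concrete, here is a minimal serial sketch in Python. It assumes the Floyd-Warshall algorithm, which the abstract does not name, and the function name `floyd_warshall` is illustrative only. For a fixed intermediate vertex k, each row i can be updated independently (assuming no negative cycles), and that row loop is the natural candidate for distribution among threads or processes.

```python
INF = float("inf")  # marks "no edge" between two vertices


def floyd_warshall(weights):
    """Return the matrix of shortest-path distances for a dense weight matrix.

    Assumption: weights[i][i] == 0 and the graph has no negative cycles.
    """
    n = len(weights)
    dist = [row[:] for row in weights]  # work on a copy
    for k in range(n):
        row_k = dist[k]
        # For a fixed k, the iterations of this loop over i do not depend on
        # each other, so a parallel version could split them among workers.
        for i in range(n):
            d_ik = dist[i][k]
            if d_ik == INF:
                continue  # no path from i to k, nothing can improve
            row_i = dist[i]
            for j in range(n):
                candidate = d_ik + row_k[j]
                if candidate < row_i[j]:
                    row_i[j] = candidate
    return dist
```

In a shared-memory setting the loop over i would be the parallel region; in a distributed-memory setting row k would have to be broadcast to all workers at each step.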
This paper studies the efficiency of the parallel preconditioned conjugate gradient (PCG) algorithm for solving large sparse linear systems that arise when interior point methods are applied to conic optimisation problems in nonlinear finite element limit analysis (FELA) for computational geomechanics. For large 3D problems, direct solvers generally become prohibitively expensive owing to exponentially growing memory requirements and computational time. The so-called saddle-point systems resulting from the optimisation framework are no exception. On the other hand, although preconditioned iterative methods have moderate storage requirements and can therefore be applied to much larger problems than direct methods, they usually need a high number of iterations to reach convergence. In the present paper, we show that this problem can be effectively tackled using efficient variants of sparse approximate inverse preconditioners along with an elaborate parallel implementation on multicore CPUs, and that significant improvements can be achieved by a parallel implementation on a graphics processing unit (GPU). Furthermore, the efficiency of our proposed implementation is verified by the presented numerical results.
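The structure of a PCG iteration can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the implementation studied in the paper: it swaps in a simple Jacobi (diagonal) preconditioner in place of the sparse approximate inverse preconditioners, uses a dense list-of-lists matrix, and assumes the matrix is symmetric positive definite. The function name `pcg` and its parameters are hypothetical.

```python
def pcg(A, b, tol=1e-10, max_iter=1000):
    """Solve A x = b with Jacobi-preconditioned conjugate gradients.

    Assumption: A is symmetric positive definite with a nonzero diagonal.
    Returns the approximate solution and the number of iterations used.
    """
    n = len(b)

    def matvec(v):
        # Each output entry is independent: this is the main parallel kernel.
        return [sum(A[i][j] * v[j] for j in range(n)) for i in range(n)]

    def dot(u, v):
        return sum(x * y for x, y in zip(u, v))

    m_inv = [1.0 / A[i][i] for i in range(n)]  # Jacobi preconditioner M^-1
    x = [0.0] * n
    r = list(b)                                # residual for x = 0
    z = [m_inv[i] * r[i] for i in range(n)]    # preconditioned residual
    p = list(z)                                # first search direction
    rz = dot(r, z)
    for iteration in range(max_iter):
        if dot(r, r) ** 0.5 < tol:
            return x, iteration
        Ap = matvec(p)
        alpha = rz / dot(p, Ap)
        x = [x[i] + alpha * p[i] for i in range(n)]
        r = [r[i] - alpha * Ap[i] for i in range(n)]
        z = [m_inv[i] * r[i] for i in range(n)]
        rz_new = dot(r, z)
        p = [z[i] + (rz_new / rz) * p[i] for i in range(n)]
        rz = rz_new
    return x, max_iter
```

Applying a sparse approximate inverse preconditioner amounts to one more sparse matrix-vector product per iteration, which is one reason that family of preconditioners maps well onto multicore CPUs and GPUs.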
ISBN: 9781424470594 (print)
The classical solution of electromagnetic problems using the finite element (FE) method needs to assemble, store and solve an Ax = b matrix system. A new technique for solving FE cases, considered much simpler than traditional methods, shows that assembling the matrix A is unnecessary [1]. The two techniques differ in computation and processing time: the new one requires more iterations to converge, although its results remain reliable. One possible way to improve its performance is the application of parallelization techniques.
We are witnessing an increase in the parallel power of computers for the foreseeable future, which requires parallel programming tools and models that can take advantage of the higher number of hardware threads. For s...
A semiring is an algebraic structure satisfying the usual axioms for a not necessarily commutative ring, but without the requirement that addition be invertible. Aside from rings, well-studied instances in cryptographic applications include the Boolean semiring and the tropical semiring. The latter, in particular, behaves to a large extent like a field and exhibits interesting properties in the cryptographic context. This short note explores a GPU-based highly parallel implementation of a protocol recently proposed by Grigoriev and Shpilrain [7], in the context of Diffie-Hellman key agreements.
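As a small illustration of why tropical arithmetic suits highly parallel hardware, the following Python sketch implements matrix multiplication over the min-plus tropical semiring, where "addition" is min and "multiplication" is ordinary addition. It is not the protocol of [7]; the function names are hypothetical, and the min-plus convention is an assumption (a max-plus convention is also in use).

```python
INF = float("inf")  # additive identity of the min-plus semiring


def trop_add(a, b):
    """Tropical addition: the minimum of the two operands."""
    return min(a, b)


def trop_mul(a, b):
    """Tropical multiplication: ordinary addition."""
    return a + b


def trop_matmul(A, B):
    """Multiply two matrices over the min-plus semiring.

    Every entry C[i][j] depends only on row i of A and column j of B,
    so all entries could be computed concurrently, one per GPU thread.
    """
    inner = len(B)
    cols = len(B[0])
    C = []
    for row in A:
        out_row = []
        for j in range(cols):
            acc = INF
            for k in range(inner):
                acc = trop_add(acc, trop_mul(row[k], B[k][j]))
            out_row.append(acc)
        C.append(out_row)
    return C
```

The tropical identity matrix has 0 on the diagonal and INF elsewhere. Note that tropical addition has no inverse, which is exactly the property that separates a semiring from a ring.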
In the past, processor speedup was achieved by increasing clock speed. Multicore processors are the new direction semiconductor companies are focusing on to boost performance. This tutorial first covers the concept of multicore, introducing the need for it and its challenges. The key aspects of multicore architecture design and the detailed architecture, with reference to the XMOS multicore microcontroller, will be presented. The tutorial then covers parallel programming concepts and introduces the language constructs that exploit the architectural features specific to XMOS processors. A few case studies on application-specific design in the domains of industrial communication and image processing will be presented. Sample programs will be demonstrated to give a clear understanding of programming on multicores. The participants will also try these demos to get hands-on experience in multicore programming.
ISBN: 9781479981656 (print)
The Unscented Kalman Filter (UKF) is widely used to solve nonlinear systems such as submarine tracking, aircraft surveillance, autonomous robotics and mobile systems. One of the typical problems solved using the UKF is Bearing-Only Target Motion Analysis (BOTMA) for manoeuvring and non-manoeuvring targets. This paper proposes a methodology for parallel execution of the UKF with the aim of enhancing its computational throughput. The parallel algorithm for UKF-based BOTMA is executed in a multi-core processor environment. The study concentrates on identifying the phases of UKF-enabled BOTMA that can be parallelized on the underlying hardware to improve the response time. The performance is observed and results are verified.