We speed up the solution of the mobile sequential recommendation (MSR) problem, which requires searching for optimal routes for vacant taxi cabs by mining massive taxi GPS data. We develop new methods that combine parallel computing and simulated annealing with novel global and local searches. While existing approaches usually involve costly offline algorithms and methodical pruning of the search space, our methods search directly, in real time, for the optimal route without offline preprocessing. On both real-world and synthetic data, they reduce the computational time for high-dimensional MSR problems from days to seconds. We efficiently solve MSR problems with thousands of pick-up points without offline training, compared with the published record of 25 pick-up points.
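The core search loop of a simulated-annealing route optimizer can be sketched as follows. This is an illustrative toy, not the paper's method: the cost model (expected cruising distance over hypothetical pick-up probabilities), the 2-swap local move, and all parameter names (`pickup_prob`, `dist`, `t0`, `alpha`) are assumptions introduced here for demonstration.

```python
import math
import random

def route_cost(route, pickup_prob, dist):
    """Expected cruising distance over hypothetical pick-up points:
    distance accumulates while the cab is still empty."""
    cost, p_empty = 0.0, 1.0
    for i in range(1, len(route)):
        cost += p_empty * dist[route[i - 1]][route[i]]
        p_empty *= 1.0 - pickup_prob[route[i]]
    return cost

def anneal(points, pickup_prob, dist, iters=5000, t0=1.0, alpha=0.999):
    """Simulated annealing with 2-swap local moves (illustrative only)."""
    rng = random.Random(0)
    cur = list(points)
    cur_cost = route_cost(cur, pickup_prob, dist)
    best, best_cost = cur[:], cur_cost
    t = t0
    for _ in range(iters):
        i, j = rng.sample(range(1, len(cur)), 2)   # keep the start point fixed
        cand = cur[:]
        cand[i], cand[j] = cand[j], cand[i]
        c = route_cost(cand, pickup_prob, dist)
        # accept improvements always, worse moves with Boltzmann probability
        if c < cur_cost or rng.random() < math.exp((cur_cost - c) / t):
            cur, cur_cost = cand, c
            if c < best_cost:
                best, best_cost = cand[:], c
        t *= alpha   # geometric cooling schedule
    return best, best_cost
```

The parallel variants described in the abstract would run many such chains concurrently and exchange the best routes found; this sketch shows only a single sequential chain.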
To segment regions of interest (ROIs) from ultrasound images, a novel dynamic-texture-based algorithm is presented that combines the surfacelet transform, a hidden Markov tree (HMT) model, and parallel computing. In the surfacelet transform, the image sequence is decomposed with a pyramid model, and the high-frequency 3D signals are further decomposed by directional filter banks. In HMT modeling, the distribution of coefficients is described by a Gaussian mixture model (GMM), and the relationship across scales by a scale-continuity model. From the HMT parameters estimated via expectation maximization, the joint probability density is computed and taken as the feature value of the image sequence. ROIs and non-ROIs from collected sample videos are then used to train a support vector machine (SVM) classifier, which identifies the divided 3D blocks of the input video. To improve computational efficiency, parallel computing is implemented on a multi-processor CPU. Our algorithm is compared with existing texture-based approaches for ultrasound images, including the gray-level co-occurrence matrix (GLCM), local binary patterns (LBP), and wavelets; the experimental results demonstrate its advantages in processing noisy ultrasound images and in segmenting ROIs more accurately.
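The GMM component of an HMT model typically describes each subband coefficient as drawn from a small set of zero-mean Gaussian states (e.g. a low-variance "smooth" state and a high-variance "edge" state). A minimal sketch of the mixture density and of the per-coefficient state posterior, the quantity computed in the E-step of expectation maximization, follows; the two-state zero-mean form is an assumption for illustration, not the paper's exact parameterization.

```python
import math

def gmm_pdf(x, weights, variances):
    """Density of a zero-mean Gaussian mixture, the form HMT models
    commonly use for subband coefficients."""
    total = 0.0
    for w, var in zip(weights, variances):
        total += w * math.exp(-x * x / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)
    return total

def state_posterior(x, weights, variances):
    """P(hidden state | coefficient): the E-step responsibility in EM."""
    likes = [w * math.exp(-x * x / (2.0 * v)) / math.sqrt(2.0 * math.pi * v)
             for w, v in zip(weights, variances)]
    s = sum(likes)
    return [l / s for l in likes]
```

A large coefficient magnitude pushes the posterior toward the high-variance state, which is what lets the model separate textured (ROI) blocks from smooth background.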
This paper presents feasibility studies on utilizing graphics processing units (GPUs) as high-performance computing hardware alongside front-end electronics in large-scale magnetic-confinement thermal fusion experiments. The objective of the research is to provide scalable, high-throughput, low-latency measurements for the runtime X-ray diagnostic of tokamak metallic impurities in the Tungsten Environment in Steady-State Tokamak (WEST) reactor. A heterogeneous system pairing a front end based on field-programmable gate arrays with a back-end server was introduced to decompose workloads efficiently; it allows a comprehensive evaluation of CPUs and accelerators. In particular, a novel GPU implementation of the back-end algorithm is presented together with its performance analysis.
A finite-element-method-based parallel computing simulator for multiphysics effects in a resistive random access memory (RRAM) array, suitable for supercomputer platforms even with thousands of cores, is developed to simulate oxygen vacancy migration, current transport, and thermal conduction. An exponentially fitted flux Galerkin method is introduced to improve convergence when solving the 3-D oxygen vacancy drift-diffusion equation. The accuracy of our algorithm is validated by comparison with commercial software, and the scalability of our parallel algorithm is also investigated. The simulation results for a high-density RRAM array indicate that the heat generated during the writing process can produce high temperatures and lead to severe reliability problems: even RRAM cells without an applied bias voltage can be switched unintentionally from the low-resistance state to the high-resistance state and lose their stored information. Increasing the feature size, or equivalently decreasing the integration density, lowers the power density and hence improves reliability. A large electrode thickness with Dirichlet boundaries applied on the electrode side surfaces drains heat away faster and enhances the reliability of the RRAM array.
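To fix ideas, a drift-diffusion equation of the kind solved here can be illustrated with a single explicit upwind finite-difference step in 1-D. This is a deliberately simplified stand-in, not the paper's exponentially fitted flux Galerkin scheme: the grid, the zero-flux boundary treatment, and the parameter names are all assumptions for demonstration.

```python
def drift_diffusion_step(n, D, v, dx, dt):
    """One explicit step of dn/dt = D d2n/dx2 - v dn/dx on a 1-D grid,
    with upwinding for the drift term and crude zero-flux ends."""
    new = n[:]
    for i in range(1, len(n) - 1):
        diff = D * (n[i + 1] - 2.0 * n[i] + n[i - 1]) / dx ** 2
        # upwind difference: look against the drift direction for stability
        adv = -v * (n[i] - n[i - 1]) / dx if v > 0 else -v * (n[i + 1] - n[i]) / dx
        new[i] = n[i] + dt * (diff + adv)
    new[0], new[-1] = new[1], new[-2]   # zero-flux (reflecting) boundaries
    return new
```

An explicit scheme like this needs small time steps for stability; the exponentially fitted Galerkin formulation mentioned in the abstract exists precisely to keep convergence robust when drift dominates diffusion.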
The support vector machine (SVM) algorithm is widely used in many fields because of its good classification performance, simplicity, and practicality. However, the SVM computes the support vectors by quadratic programming, whose solution involves an n-order matrix. When the amount of data is large, computing and storing this matrix makes optimization very slow and can even cause memory overflow and interrupt the computation. Using the big-data computing platform Spark to improve the SVM algorithm can solve these problems, but such an approach alone cannot handle multi-class problems. This paper therefore constructs multiple classifiers, combining the Spark big-data programming framework with the classification characteristics of the SVM to realize a parallel one-vs-rest SVM optimization algorithm for large data sets, and compares the variants on UCI data sets. In the experiments, the Spark-based one-vs-rest SVM clearly outperforms the one-vs-rest SVM in a single-machine environment, and the simulation results show that the proposed algorithm has better performance.
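The one-vs-rest decomposition itself is easy to sketch: one binary problem per class, all of which can be trained independently and hence in parallel. The sketch below uses a thread pool in place of Spark and a trivial nearest-centroid scorer in place of a real SVM; both substitutions, and all names here, are illustrative assumptions, not the paper's implementation.

```python
from concurrent.futures import ThreadPoolExecutor

def train_binary(X, y, positive):
    """Placeholder for one binary SVM: a centroid rule scoring a point by
    how much closer it is to the positive class than to the rest."""
    pos = [x for x, l in zip(X, y) if l == positive]
    neg = [x for x, l in zip(X, y) if l != positive]
    cp = [sum(c) / len(pos) for c in zip(*pos)]
    cn = [sum(c) / len(neg) for c in zip(*neg)]
    def score(x):
        dp = sum((a - b) ** 2 for a, b in zip(x, cp))
        dn = sum((a - b) ** 2 for a, b in zip(x, cn))
        return dn - dp          # larger = more confidently `positive`
    return score

def one_vs_rest_fit(X, y):
    """Train the K binary scorers concurrently; a Spark version would map
    each binary task across the cluster instead of across threads."""
    classes = sorted(set(y))
    with ThreadPoolExecutor() as ex:
        scorers = list(ex.map(lambda c: train_binary(X, y, c), classes))
    return classes, scorers

def one_vs_rest_predict(classes, scorers, x):
    """Pick the class whose binary scorer is most confident."""
    return max(zip(classes, scorers), key=lambda cs: cs[1](x))[0]
```

The key property the abstract exploits is visible here: the K binary fits share read-only data and nothing else, so they parallelize with no coordination.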
While modern parallel computing systems offer high performance, utilizing these powerful computing resources to the fullest extent demands advanced knowledge of various hardware architectures and parallel programming models. Furthermore, optimized software execution on parallel computing systems demands consideration of many parameters at compile time and run time. Determining the optimal set of parameters in a given execution context is a complex task, and to address it researchers have proposed different approaches based on heuristic search or machine learning. In this paper, we undertake a systematic literature review to aggregate, analyze, and classify the existing software optimization methods for parallel computing systems. We review approaches that use machine learning or meta-heuristics for software optimization at compile time and run time, and we discuss challenges and future research directions. The results of this study may help to better understand the state-of-the-art techniques that use machine learning and meta-heuristics to deal with the complexity of software optimization for parallel computing systems. Furthermore, it may aid in understanding the limitations of existing approaches and in identifying areas for improvement.
In parallel computing systems, the interconnection network forms the critical infrastructure that enables robust and scalable communication among hundreds of thousands of nodes. Traditional packet-switched networks tend to suffer long communication times when congestion occurs. In this context, we explore the use of circuit switching (CS), replacing packet switches with custom hardware that supports circuit-based switching efficiently and with low latency. In our target CS network, a certain amount of bandwidth is guaranteed for each communication pair, so the network latency is predictable when a limited number of node pairs exchange messages. Because the number of time slots allocated in every switch directly affects the end-to-end latency, we improve slot utilization and develop a network-topology generator that minimizes the number of time slots for target applications whose communication patterns are predictable. Using quantitative discrete-event simulation, we show that our design methodology reduces the minimum necessary number of slots in a generated topology to a small value while keeping the network cost 50% lower than that of standard torus topologies.
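The slot-allocation idea can be illustrated with a greedy scheme: two communication pairs conflict when they share an endpoint, and each pair receives the smallest time slot not already held by a conflicting pair. This is a simplified stand-in for switch-level allocation over a real topology; the conflict model and function names are assumptions for illustration, not the paper's algorithm.

```python
def assign_slots(pairs):
    """Greedily give each (src, dst) pair the smallest time slot that no
    other pair sharing an endpoint already holds."""
    slot_of = {}
    for src, dst in pairs:
        # slots already used by pairs touching either endpoint
        taken = {s for (a, b), s in slot_of.items()
                 if a in (src, dst) or b in (src, dst)}
        slot = 0
        while slot in taken:
            slot += 1
        slot_of[(src, dst)] = slot
    return slot_of
```

When the communication pattern is known in advance, as the abstract assumes, pairs that never conflict can share a slot, which is why a topology co-designed with the pattern needs only a small number of slots.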
The implementation of time-domain Green's functions (TDGFs) on the graphics processing unit (GPU) and the central processing unit (CPU) using a finite-difference scheme is presented. The TDGFs represent the transient electric scalar and magnetic vector potentials due to a horizontal electric dipole (HED) in open layered media. The layered media are bounded by a perfectly matched layer (PML), a symmetry axis, and a perfect electric conductor (PEC). We adopted four parallel approaches: 1) an open multiprocessing (OpenMP) CPU implementation; 2) a message passing interface (MPI) CPU implementation; 3) an open accelerators (OpenACC) GPU implementation; and 4) a compute unified device architecture (CUDA) GPU implementation. The accuracy and efficiency of these programming models are validated by comparing their results against a sequential CPU implementation. Relative to the single-threaded CPU implementation, the speed-ups obtained with the OpenMP, MPI, OpenACC, and CUDA programming models are 4.8, 6.12, 45.97, and 96.53, respectively. The results show that the GPU implementations yield considerable speed-ups while preserving the solution's accuracy.
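The structure of the finite-difference kernel being parallelized can be sketched with a 1-D leapfrog (Yee-style) update, where the two inner loops are exactly the loops an OpenMP pragma or CUDA kernel would distribute across threads. This toy update, the 0.5 Courant coefficients, and the PEC-only boundaries are illustrative assumptions, not the paper's layered-media scheme.

```python
def fdtd_step(e, h, c1=0.5, c2=0.5):
    """One leapfrog update of a 1-D staggered finite-difference grid:
    H nodes sit between E nodes (len(e) == len(h) + 1)."""
    for i in range(len(h)):                 # this loop is what OpenMP/CUDA
        h[i] += c2 * (e[i + 1] - e[i])      # would split across threads
    for i in range(1, len(e) - 1):
        e[i] += c1 * (h[i] - h[i - 1])
    # e[0] and e[-1] stay fixed at 0: perfect electric conductor ends
    return e, h
```

Because every grid point depends only on its fixed neighbors from the previous half-step, each loop iteration is independent, which is what makes this kernel map so well onto GPUs.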
Author: Asteasuain, Mariano (Univ Nacl Sur, Dept Ingn Quim, Av Alem 1253, RA-8000 Bahia Blanca, Buenos Aires, Argentina; PLAPIQUI, UNS-CONICET, Planta Piloto Ingn Quim, Camino La Carrindanga Km 7, RA-8000 Bahia Blanca, Buenos Aires, Argentina)
High-fidelity models of polymer processes should include the prediction of distributions of polymer properties, including multivariate distributions. Deterministic models with this capability usually involve a large system of equations, which compromises model performance in terms of CPU time. The probability generating function (pgf) technique is a powerful method for modeling distributions of polymer properties, including multivariate ones. It can be applied to systems described by complex kinetic mechanisms and requires no a priori assumptions about the distribution shape. The structure of this modeling method makes it particularly suitable for parallel computing. This work describes the application of the pgf technique to modeling uni- and bivariate distributions of polymer properties with parallelization of the model code. It is shown that accurate results can be achieved in very short running times, which makes the technique suitable for models employed in optimization and online control tasks. (C) 2019 Elsevier Ltd. All rights reserved.
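Two standard pgf identities convey why the technique is convenient: the mean of a count distribution is G'(1) = Σ k·p_k, and the pgf of a sum of independent counts (e.g. chain lengths of two independently grown blocks) is the product of their pgfs, i.e. a convolution of coefficient lists. The sketch below illustrates only these textbook facts on finite coefficient lists; it is not the paper's pgf-transform model of the kinetic equations.

```python
def pgf_mean(p):
    """Mean of a count distribution from its pgf coefficients:
    G'(1) = sum over k of k * p_k."""
    return sum(k * pk for k, pk in enumerate(p))

def pgf_product(p, q):
    """pgf of the sum of two independent counts = product of the pgfs,
    computed as a convolution of the coefficient lists."""
    out = [0.0] * (len(p) + len(q) - 1)
    for i, a in enumerate(p):
        for j, b in enumerate(q):
            out[i + j] += a * b
    return out
```

The independence of terms in such transform-domain balances is one reason the method parallelizes well: each pgf evaluation point can be computed on a separate core.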
The Web Ontology Language (OWL) is a widely used knowledge representation language for describing knowledge in application domains by means of classes, properties, and individuals. Ontology classification is an important and widely used service that computes a taxonomy of all classes occurring in an ontology. It can require significant amounts of runtime, yet most OWL reasoners do not support any kind of parallel processing. We present a novel thread-level parallel architecture for ontology classification, which is ideally suited for shared-memory SMP servers but does not rely on locking techniques and thus avoids possible race conditions. We evaluated our prototype implementation on a set of real-world ontologies. Our experiments demonstrate very good scalability, with a speedup that is linear in the number of available cores. (C) 2018 Elsevier B.V. All rights reserved.
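The shape of the task, building a taxonomy from many independent subsumption checks, can be illustrated with a toy in which classes are extensional sets, subsumption is strict set inclusion, and the per-class checks run across a thread pool on read-only shared data (hence no locks). This is a didactic simplification, not an OWL reasoner: real subsumption is logical entailment, and all names here are assumptions.

```python
from concurrent.futures import ThreadPoolExecutor

def direct_supers(name, ext, classes):
    """Toy subsumption: every strict superset of `ext` subsumes it;
    then keep only the most specific (direct) superclasses."""
    supers = {c for c, e in classes.items() if c != name and ext < e}
    direct = {c for c in supers
              if not any(classes[c] > classes[d] for d in supers if d != c)}
    return name, direct

def classify(classes):
    """Compute direct-superclass links for every class concurrently;
    the shared `classes` dict is read-only, so no locking is needed."""
    with ThreadPoolExecutor() as ex:
        results = ex.map(lambda kv: direct_supers(kv[0], kv[1], classes),
                         classes.items())
    return dict(results)
```

The lock-free property claimed in the abstract corresponds here to each task writing only its own result while sharing immutable input.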