ISBN (print): 9783319589435; 9783319589428
The solution of sparse linear systems of large dimension is a critical step in problems that span a diverse range of applications. For this reason, a number of iterative solvers have been developed, among which ILUPACK integrates an inverse-based multilevel ILU preconditioner with appealing numerical properties. In this paper, we enhance the computational performance of ILUPACK by off-loading several key computational kernels to a Graphics Processing Unit (GPU). In particular, we target the preconditioned GMRES and BiCG methods for sparse general systems and the preconditioned SQMR method for sparse symmetric indefinite problems in ILUPACK. The evaluation on an NVIDIA Kepler GPU shows a noticeable reduction of the execution time, while maintaining the convergence rate and numerical properties of the original ILUPACK solver.
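To illustrate which kernels dominate a preconditioned iterative solver and are therefore natural candidates for GPU off-loading, here is a minimal sketch (not ILUPACK's actual code) of a left-preconditioned Richardson iteration; the sparse format and the diagonal stand-in preconditioner are illustrative assumptions, not the paper's multilevel ILU.

```python
def spmv(rows, x):
    """Sparse matrix-vector product; `rows` maps row i to a list of (j, value).
    In the paper's setting this is one of the kernels off-loaded to the GPU."""
    return [sum(v * x[j] for j, v in row) for row in rows]

def jacobi_precond(rows, r):
    """Diagonal-scaling stand-in for the multilevel ILU preconditioner apply,
    the other off-loadable kernel."""
    diag = [next(v for j, v in row if j == i) for i, row in enumerate(rows)]
    return [ri / di for ri, di in zip(r, diag)]

def precond_richardson(rows, b, iters=50):
    """x_{k+1} = x_k + M^{-1} (b - A x_k); GMRES/BiCG/SQMR build on the
    same two kernels per iteration."""
    x = [0.0] * len(b)
    for _ in range(iters):
        r = [bi - yi for bi, yi in zip(b, spmv(rows, x))]  # residual
        z = jacobi_precond(rows, r)                        # M^{-1} r
        x = [xi + zi for xi, zi in zip(x, z)]
    return x
```

Because every iteration is just a sparse product plus a preconditioner solve, moving those two operations to the accelerator leaves the convergence behaviour of the outer method unchanged.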
ISBN (print): 9781538612309
Neural-network-based deep learning algorithms are a research hotspot in artificial intelligence. Moreover, embedded artificial intelligence and mobile computing are becoming increasingly important in industry. These applications not only require high-performance computing but also operate under strict low-power constraints. DSPs have a special hardware architecture combining high performance with low power consumption, making them an ideal computing platform for embedded artificial intelligence. This paper examines the energy efficiency of DSPs under deep learning workloads. Much of the related research is limited in application scale and optimization techniques; this work extends the application scale and presents several optimization methods in detail. Specifically, for a word-prediction application based on the Long Short-Term Memory (LSTM) model, we use TI's high-performance multi-core DSP to accelerate the inference process. We apply a variety of optimization techniques to our initial DSP program, and experimental results show that these techniques bring notable performance improvements. Furthermore, we use a MATLAB program running on a general-purpose CPU and a C program running on an ARM processor as baselines. In terms of performance-to-power ratio, the DSP is 7.79 times better than the general-purpose CPU and 2.28 times better than the ARM processor, which indicates that the DSP is a suitable platform for embedded artificial intelligence.
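The inference step being accelerated is the standard LSTM cell recurrence. The following is a minimal reference sketch of one cell step with a scalar hidden state (hidden size 1 is an illustrative simplification; the weight layout is invented for clarity and is not TI's DSP code).

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def lstm_step(x, h, c, W, U, b):
    """One LSTM cell step with scalar hidden state.
    W[k]: input weights, U[k]: recurrent weight, b[k]: bias, for the
    four gates k = 0..3 (input, forget, output, candidate)."""
    z = [sum(wj * xj for wj, xj in zip(W[k], x)) + U[k] * h + b[k]
         for k in range(4)]
    i, f, o = sigmoid(z[0]), sigmoid(z[1]), sigmoid(z[2])
    g = math.tanh(z[3])
    c_new = f * c + i * g            # cell state update
    h_new = o * math.tanh(c_new)     # hidden state / output
    return h_new, c_new
```

The dot products in the gate pre-activations are exactly the matrix-vector workloads that multi-core DSP optimizations (vectorization, multi-core partitioning) target.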
ISBN (print): 9781450365116
iCaveats is a project on the integration of components and architectures for embedded vision in transport and security applications. A compact and efficient implementation of autonomous vision systems is difficult to accomplish with the conventional image-processing chain. In this project we have targeted alternative approaches that exploit the inherent parallelism in the visual stimulus and hierarchical multilevel optimization. A set of demos showcases the advances at the sensor level, in adapted architectures for signal processing, and in power management and energy harvesting.
ISBN (print): 9781424477159
Recent advances in parallel and distributed computing have made it very challenging for programmers to reach the performance potential of current systems. In addition, recent advances in numerical algorithms and software optimizations have greatly increased the number of alternatives for solving a problem, which further complicates the software tuning process. Indeed, no single algorithm represents the universal best choice for efficiently solving a given problem on all compute substrates. In this paper, we develop a framework that addresses the design of efficient parallel algorithms in hierarchical computing environments. More specifically, given multiple choices for solving a particular problem, the framework uses a judicious combination of analytical performance models and empirical approaches to automate algorithm selection, determining the execution scheme expected to perform best in the specific setting. Preliminary experimental results obtained by implementing two different numerical kernels demonstrate the value of the hybrid performance-modeling approach integrated into the framework.
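The hybrid model-plus-measurement selection idea can be sketched as follows; the candidate kernels, their cost models, and the 2x pruning threshold are all invented for illustration and are not the framework's actual components.

```python
import time

# Hypothetical candidate implementations of one kernel (summing a list).
def sum_naive(xs):
    total = 0.0
    for x in xs:
        total += x
    return total

def sum_builtin(xs):
    return sum(xs)

def sum_sorted(xs):          # deliberately wasteful candidate
    return sum(sorted(xs))

def select_algorithm(candidates, probe):
    """candidates: name -> (analytical cost model, callable).
    Stage 1 (analytical): prune candidates whose predicted cost exceeds
    twice the best prediction. Stage 2 (empirical): time the survivors
    on a small probe input and pick the fastest."""
    n = len(probe)
    predicted = {name: model(n) for name, (model, _) in candidates.items()}
    best_pred = min(predicted.values())
    survivors = [name for name, p in predicted.items() if p <= 2 * best_pred]
    timings = {}
    for name in survivors:
        fn = candidates[name][1]
        t0 = time.perf_counter()
        fn(probe)
        timings[name] = time.perf_counter() - t0
    return min(timings, key=timings.get)

# Illustrative cost models (abstract "cost units" per input size n).
candidates = {
    "naive":   (lambda n: 5.0 * n, sum_naive),
    "builtin": (lambda n: 1.0 * n, sum_builtin),
    "sorted":  (lambda n: float(n) * n, sum_sorted),
}
```

The analytical stage keeps the cheap empirical stage short, which is the point of combining the two: models prune the search space, measurements settle the close calls.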
Fault localization is one of the most expensive activities of program debugging, which is why the recent years have witnessed the development of many different fault localization techniques. This paper proposes a grou...
ISBN (print): 9781509035199
Enormous amounts of data are being generated at a tremendous rate by multiple sources, and this data often exists in different formats, making it quite difficult to process with traditional methods. The platforms used for processing this type of data rely on distributed architectures such as cloud computing and Hadoop. The processing of big data can be carried out efficiently by exploiting the characteristics of the underlying platforms. With efficient algorithms and software metrics, and by identifying the relationships among these measures, system characteristics can be evaluated in order to improve the overall performance of the computing system. By focusing on the measures that play an important role in determining overall performance, service-level agreements can also be revised. This paper presents a survey of different performance-modeling techniques for big data applications. One of the key concepts in performance modeling is finding relevant parameters that accurately represent the performance of big data platforms. These extracted performance measures are mapped onto software quality concepts, which are then used for defining service-level agreements.
ISBN (print): 9781932415988
Pattern recognition is a resource-intensive task that includes feature extraction, feature selection, and classification. Optimizing any of these steps can significantly improve performance. Evolutionary computation methods are used to address optimization problems that explore a huge, nonlinear, multidimensional search space. In this paper a new distributed framework is introduced that greatly reduces the computation time of such systems, concentrating on feature extraction and feature selection, which operate in parallel. This software architecture incorporates Mother Nature's most powerful tools, "evolution" and "parallelism", in its design, while maintaining robustness.
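As a toy sketch of evolutionary feature selection with parallel fitness evaluation (the fitness function, feature indices, and all parameters below are invented for illustration; a real system would score feature subsets with an actual classifier):

```python
import random
from concurrent.futures import ThreadPoolExecutor

def fitness(mask, useful=(0, 3)):
    """Toy stand-in for a classifier score: reward selecting the
    (hypothetical) useful features, penalise the subset size."""
    hits = sum(1.0 for i, bit in enumerate(mask) if bit and i in useful)
    return hits - 0.1 * sum(mask)

def evolve(n_features=6, pop_size=8, generations=30, seed=42):
    rng = random.Random(seed)
    pop = [[rng.randint(0, 1) for _ in range(n_features)]
           for _ in range(pop_size)]
    pop[0] = [1] * n_features           # seed with the full feature set
    with ThreadPoolExecutor() as pool:  # fitness evaluations run in parallel
        for _ in range(generations):
            scores = list(pool.map(fitness, pop))
            ranked = [m for _, m in sorted(zip(scores, pop),
                                           key=lambda p: -p[0])]
            parents = ranked[: pop_size // 2]   # elitism: top half survives
            children = []
            while len(parents) + len(children) < pop_size:
                a, b = rng.sample(parents, 2)
                cut = rng.randrange(1, n_features)     # one-point crossover
                child = a[:cut] + b[cut:]
                child[rng.randrange(n_features)] ^= 1  # point mutation
                children.append(child)
            pop = parents + children
    return max(pop, key=fitness)
```

The independent per-individual fitness evaluations are what make this class of search embarrassingly parallel, which is what the distributed framework exploits.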
ISBN (print): 9783642038686
To execute MPI applications reliably, fault tolerance mechanisms are needed. Message logging is a well-known solution for providing fault tolerance to MPI applications. It has been proven to tolerate higher failure rates than coordinated checkpointing. However, pessimistic and causal message logging can induce high overhead on failure-free execution. In this paper, we present O2P, a new optimistic message logging protocol based on active optimistic message logging. Contrary to existing optimistic message logging protocols, which save dependency information to reliable storage periodically, O2P logs dependency information as soon as possible to reduce the amount of data piggybacked on application messages. This reduces the protocol's overhead on failure-free execution, making it more scalable and simplifying recovery. O2P is implemented as a module of the Open MPI library. Experiments show that active message logging is promising for improving the scalability and performance of optimistic message logging.
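The core idea of "active" logging can be sketched as follows; this is a highly simplified illustration, not O2P's protocol (in particular, the stable-storage write here completes instantly, whereas in reality it is asynchronous and determinants stay unstable until acknowledged).

```python
class Process:
    """Toy model of one process under active optimistic message logging."""

    def __init__(self, rank):
        self.rank = rank
        self.clock = 0
        self.unstable = []      # determinants not yet on stable storage
        self.stable_store = []  # stand-in for reliable storage

    def receive(self, sender, payload):
        self.clock += 1
        det = (sender, self.rank, self.clock)  # receive determinant
        self.unstable.append(det)
        self.log_async()  # "active": push to stable storage eagerly

    def log_async(self):
        # A real protocol does this asynchronously; modeled as immediate.
        self.stable_store.extend(self.unstable)
        self.unstable.clear()

    def send(self, payload):
        # Piggyback only the determinants that are still unstable; eager
        # logging keeps this list short, which is the protocol's benefit.
        return (self.rank, payload, list(self.unstable))
```

Under periodic logging the piggyback list grows between flushes; logging as soon as possible keeps it near-empty on the failure-free path, which is where the overhead reduction comes from.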
ISBN (print): 9781450348447
The wearable Internet of Things (IoT) is seen as one of the advanced paradigms in widespread applications such as health care and fitness. There is a huge gap between the technology available to visually disabled people today and this advanced level. A single robust solution is needed that combines information from computer vision with 360-degree protection for users, without constraining them in any way. A novel scalable solution for the visually impaired is developed, and a prototype is built using an Intel Edison, a camera, and sensors. The proposed methodology provides the visually impaired with detailed information about their surroundings. In this paper, we present a scalable solution using computer vision techniques and sensors to achieve independent navigation.
ISBN (print): 9781450389129
The many-core paradigm, represented by the Graphics Processing Unit (GPU), has become increasingly popular in scientific computing and data analysis, and the Compute Unified Device Architecture (CUDA) allows developers to carry out computation on GPUs more conveniently. Matrix operations are among the most important application scenarios for GPU computation and CUDA, appearing widely in domains such as graph applications and artificial intelligence. This paper focuses on accelerating the matrix power operation (MPO) with GPUs and CUDA; the MPO appears in many engineering problems and can be time-consuming when the matrices are large. Three techniques are used to speed up the MPO process: matrix parallel reduction, matrix parallel multiplication, and CUDA dynamic parallelism. The experiments were carried out on an NVIDIA RTX 2080 Super and an Intel(R) Core(TM) i7-10750H. The code was compiled and executed in Visual Studio 2019 Community; the GPU driver version is 461.40, the CUDA version is 11.1, and the NVCC version is 11.1.74. With all three techniques, the MPO process achieves a speedup of hundreds of times over the sequential multiplication process executed by the Eigen library, even though the algorithm's code is hardly optimized. Moreover, as the power increases, the efficiency of the algorithm grows further, because the time consumed in data transmission and GPU memory allocation is relatively fixed.
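A sequential sketch of the reduction idea behind accelerating the matrix power operation: repeated squaring needs only O(log e) multiplications instead of e - 1, and each multiplication is the kernel the GPU parallelizes. This is an illustrative reference implementation, not the paper's CUDA code.

```python
def mat_mul(A, B):
    """Dense matrix product; on the GPU this is the parallelized kernel."""
    n, m, p = len(A), len(B), len(B[0])
    return [[sum(A[i][k] * B[k][j] for k in range(m)) for j in range(p)]
            for i in range(n)]

def mat_power(A, e):
    """A**e by repeated squaring: the reduction over e multiplications
    collapses to O(log e) squarings plus conditional accumulations."""
    n = len(A)
    result = [[1 if i == j else 0 for j in range(n)] for i in range(n)]
    base = A
    while e > 0:
        if e & 1:                       # this bit of e contributes a factor
            result = mat_mul(result, base)
        base = mat_mul(base, base)      # square for the next bit
        e >>= 1
    return result
```

Because the number of kernel launches grows only logarithmically with the power while the transfer and allocation cost is paid once, higher powers amortize the fixed overhead better, matching the efficiency trend reported above.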