ISBN: 9783319672205; 9783319672199 (print)
In this paper we present parallel implementations, execution times, and speed-ups of three algorithms run in several environments, including a workstation with multi-core CPUs and a cluster. The parallel codes, which implement the master-slave model in C+MPI, differ in their computation-to-communication ratios. The problems considered are a genetic algorithm with various ratios of master processing time to communication and fitness evaluation times, matrix multiplication, and numerical integration. We show how the codes scale on these systems. For the numerical integration code, which scales very well, we also show performance in a hybrid CPU+Xeon Phi environment.
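The master-slave decomposition described for the numerical integration case can be sketched roughly as follows. This is a minimal illustration in Python, with a thread pool standing in for MPI processes; it is not the paper's C+MPI code. The names `integrate_chunk` and `master_integrate`, the midpoint rule, and the chunk and step counts are all assumptions made for the example.

```python
from concurrent.futures import ThreadPoolExecutor
import math


def integrate_chunk(f, a, b, steps):
    """Worker task (hypothetical): midpoint-rule integral of f over [a, b]."""
    h = (b - a) / steps
    return h * sum(f(a + (i + 0.5) * h) for i in range(steps))


def master_integrate(f, a, b, chunks=8, steps_per_chunk=10_000, workers=4):
    """Master (hypothetical): split [a, b], hand chunks to workers, sum results."""
    width = (b - a) / chunks
    bounds = [(a + k * width, a + (k + 1) * width) for k in range(chunks)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Each worker receives only two bounds and returns a single number.
        partials = pool.map(
            lambda ab: integrate_chunk(f, ab[0], ab[1], steps_per_chunk), bounds
        )
        return sum(partials)


if __name__ == "__main__":
    # The integral of sin over [0, pi] is 2.
    print(master_integrate(math.sin, 0.0, math.pi))
```

Each chunk is independent, so the only data exchanged are the chunk bounds and one partial sum per chunk, which is one plausible reason an integration code of this shape has a favorable computation-to-communication ratio. Because of CPython's global interpreter lock, the sketch shows the structure only and should not be expected to demonstrate a speed-up.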
ISBN: 9781728144849 (digital)
ISBN: 9781728144856 (print)
Explicit parallel programming for shared and distributed memory architectures is an efficient way to deal with data-intensive computations. However, approaches such as explicit threads or MPI remain difficult for most programmers, who must handle constraints such as explicit inter-processor communication and data distribution.
ISBN: 9781538666289 (print)
Multi-core, many-core, and hybrid processors and parallel programming languages are slowly becoming pervasive in mainstream computing. They are expected to affect a large spectrum of systems, from embedded and general-purpose to high-end computing systems. This architectural change already challenges programmers to write application code that scales over many cores and uses their computational power efficiently. Moreover, many heterogeneous architectures exist today, which has created a need for a uniform interface to them. The Khronos Group defined the Open Computing Language (OpenCL) to abstract the underlying hardware, enabling software developers to write code that is portable across different shared-memory architectures. In this paper, we introduce a new OpenCL-based parallel implementation of Simple Linear Iterative Clustering, one of the fastest image segmentation algorithms. We evaluate the effectiveness of this implementation using only a multi-core GPCPU. Our implementation is fully compatible with the sequential implementation. When executed sequentially, the algorithm uses only 25% of the total computational power of a GPCPU for any image resolution, while the modified algorithm uses close to 100% for high-resolution images. The resulting algorithm is up to 5x faster than its sequential counterpart.
ISBN: 9781538671382 (print)
Integrating intermittent renewable energy resources into the power system requires fast computational methods and tools that enable real-time monitoring, control, and decision making in the power grid. Techniques for increasing computational speed generally fall into two categories: algorithm improvement and hardware acceleration. In this paper, the serial version of the Newton-Raphson power flow algorithm is transformed into a parallel solution using the OpenMP standard. The parallel implementation is tested on several power systems, and its computational efficiency is compared across different thread counts. The experimental results show a speedup of more than three times and a significant reduction in computation time.
ISBN: 9781538693803 (print)
Frequent itemset mining is one pillar of machine learning and is very important for many data mining applications. There are many algorithms for frequent itemset mining, but to our knowledge no implementation has been proven correct using computer-aided verification. Hu et al. derived an efficient algorithm for this problem on paper, starting from an inefficient functional program and transforming it by program calculation. Based on their work, we propose a formally verified functional implementation of frequent itemset mining developed with the Coq proof assistant. All the proposed programs are evaluated on classical datasets and compared to a non-verified Java implementation of the Apriori algorithm.
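To make the problem statement concrete, the following is a minimal sketch of a deliberately inefficient specification of frequent itemset mining: enumerate every non-empty subset of the items and keep those contained in enough transactions. It is written in Python for illustration only. It is not the Coq development described above, nor the program derived by Hu et al.; the names `support` and `frequent_itemsets` and the absolute `min_support` threshold are assumptions.

```python
from itertools import combinations


def support(itemset, transactions):
    """Number of transactions that contain every item of itemset."""
    return sum(1 for t in transactions if itemset <= t)


def frequent_itemsets(transactions, min_support):
    """Naive specification: test every non-empty subset of the item universe.

    Runs in time exponential in the number of distinct items, so it is only
    usable on tiny inputs; it serves as a reference, not as an algorithm.
    """
    items = sorted(set().union(*transactions))
    result = {}
    for size in range(1, len(items) + 1):
        for combo in combinations(items, size):
            candidate = frozenset(combo)
            count = support(candidate, transactions)
            if count >= min_support:
                result[candidate] = count
    return result


if __name__ == "__main__":
    baskets = [{"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"b"}]
    for itemset, count in frequent_itemsets(baskets, 2).items():
        print(sorted(itemset), count)
```

A specification of this kind is easy to reason about, which is what makes it a plausible starting point for derivation or verification; efficient algorithms such as Apriori avoid enumerating supersets of itemsets already known to be infrequent.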
ISBN: 9781450371896 (print)
The success of Deep Learning (DL) algorithms in computer vision tasks has created an ongoing demand for dedicated hardware architectures that can keep up with their computation and memory requirements. Meeting this demand is particularly challenging on embedded smart camera platforms, where resources such as power, Processing Elements (PEs), and communication are constrained. This article describes a heterogeneous system that embeds an FPGA and a GPU for executing CNN inference in computer vision applications. The system addresses some challenges of embedded CNNs, such as task and data partitioning and workload balancing. The selected heterogeneous platform embeds an Nvidia® Jetson TX2 for the CPU-GPU side and an Intel Altera® Cyclone10GX for the FPGA side, interconnected by PCIe Gen2, with a MIPI-CSI camera for prototyping. This test environment will support future work on a methodology for optimized model partitioning.
ISBN: 9781538695623 (print)
Transactional memory (TM) is a promising paradigm for shared-memory parallel programming. In a TM system, transactions execute speculatively in parallel as long as no access conflict is detected. On typical hardware transactional memories (HTMs), conflicts degrade performance because of the overhead of retrying transactions, so avoiding conflicts is important. HTMs generally detect access conflicts at cache-line granularity, which causes accesses to different variables on the same cache line to be falsely detected as conflicting. In this paper, we analyze how frequently such false conflicts occur and what kinds of coding cause them. The analysis shows that false conflicts account for 27.4% of all detected conflicts on average and for as much as 99.9% at maximum. We also propose a lightweight fine-grained conflict detection mechanism and show that it reduces execution cycles by 17.7% on average and by 36.5% at maximum.
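The distinction between true and false conflicts at cache-line granularity can be illustrated with a small sketch. It assumes a 64-byte cache line, which the abstract does not state, and it models only the address arithmetic, not any real HTM or the detection mechanism proposed in the paper. The names `cache_line` and `classify_conflict` are hypothetical.

```python
CACHE_LINE_BYTES = 64  # assumed line size; the abstract does not state one


def cache_line(address):
    """Index of the cache line that holds a byte address."""
    return address // CACHE_LINE_BYTES


def classify_conflict(addr_a, size_a, addr_b, size_b):
    """Classify two accesses from different transactions, at least one a write.

    Returns "none" if they touch no common cache line, "true" if their byte
    ranges overlap, and "false" if they share a cache line without overlapping.
    """
    lines_a = set(range(cache_line(addr_a), cache_line(addr_a + size_a - 1) + 1))
    lines_b = set(range(cache_line(addr_b), cache_line(addr_b + size_b - 1) + 1))
    if not lines_a & lines_b:
        return "none"
    overlap = addr_a < addr_b + size_b and addr_b < addr_a + size_a
    return "true" if overlap else "false"


if __name__ == "__main__":
    print(classify_conflict(0, 8, 8, 8))   # adjacent 8-byte variables, same line
    print(classify_conflict(0, 8, 4, 8))   # overlapping byte ranges
    print(classify_conflict(0, 8, 64, 8))  # variables on different lines
```

Under this model, two adjacent 8-byte variables land on the same line and are reported as a false conflict, which suggests why padding or aligning independently updated variables to separate lines is a common way to avoid such conflicts.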
ISBN: 9781788474603 (digital)
ISBN: 9781788470049 (print)
Improve the speed of your code and optimize the performance of your apps.

Key Features: Understand the common performance pitfalls and improve your application's performance. Get to grips with multi-threaded and asynchronous programming in C#. Develop highly performant applications on .NET Core using microservice architecture.

Book Description: While writing an application, performance is paramount. Performance tuning for real-world applications often involves activities geared toward finding bottlenecks; however, this cannot solve the dreaded problem of slower code. If you want to improve the speed of your code and optimize an application's performance, then this book is for you. C# 7 and .NET Core 2.0 High Performance begins with an introduction to the new features of C# 7 and .NET Core 2.0, explaining how they help in improving an application's performance. Learn to identify the bottlenecks in writing programs and highlight common performance pitfalls, and learn strategies to detect and resolve these issues early. You will explore multithreading and asynchronous programming with .NET Core and learn the importance and efficient use of data structures. This is followed by memory management techniques and design guidelines to increase an application's performance. Gradually, the book will show you the importance of microservices architecture for building highly performant applications and implementing resiliency and security in .NET Core. After reading this book, you will know how to structure and build scalable, optimized, and robust applications in C# 7 and .NET.

What you will learn: Measure application performance using BenchmarkDotNet. Explore the techniques to write multithreaded applications. Leverage TPL and PLINQ libraries to perform asynchronous operations. Get familiar with data structures to write optimized code. Understand design techniques to increase your application's performance. Learn about memory management techniques in .NET Core. Develop a containerized application based on micr
Adaptive discretizations are important in compressible and incompressible flow problems because details often must be resolved on multiple levels; adaptivity allows large regions of space to be modeled with fewer degrees of freedom, which reduces computational time. There is a wide variety of methods for adaptively discretizing space, but Cartesian grids have often outperformed them, even at high resolutions, thanks to their simple and accurate numerical stencils and their superior parallel performance. This performance and simplicity are generally obtained by applying a finite-difference scheme to the problems involved, but that discretization approach does not offer an easy path to adaptivity. A finite-volume scheme, by contrast, can incorporate different types of grids that are better suited to adaptive refinement, gaining flexibility at the cost of more complex stencils. The Laplace operator is an essential building block of the Navier-Stokes equations, the model that governs fluid flows, and it also occurs in differential equations that describe many other physical phenomena, such as electric and gravitational potentials and quantum mechanics. It is therefore a very important differential operator, as the many studies devoted to it confirm. This work presents 2D finite-difference and finite-volume approaches for the Laplacian operator, applying patches of overlapping grids where a finer level is needed and leaving coarser meshes in the rest of the computational domain. These overlapping grids have generic quadrilateral shapes. Specifically, the topics covered are: 1) an introduction to the finite difference method, finite volume method, domain partitioning, and solution approximation; 2) an overview of different types of meshes used to represent discretely the geometry involved in a problem, with a focus on the octree data structure, presenting PABLO and PABLitO. The first one is an
Real-time simulation of a large-scale biologically representative spiking neural network is presented, through the use of a heterogeneous parallelisation scheme and SpiNNaker neuromorphic hardware. A published cortica...