ISBN (print): 9783319996738; 9783319996721
The main goal of this work is to analyze the behavior of a nighttime image processing module and to derive basic estimates of the computational time and energy consumption required for processing large data archives. As part of this work, we have refactored the most compute-intensive module in a system for detecting fishing boat lights. The algorithm is capable of detecting isolated bright spikes that are sharply visible on the sea surface at night. The refactored module has been optimized for effective use of multi- and many-core Intel Xeon architectures. In the paper, we describe the algorithmic complexity of all computational stages of the module. We have also collected detailed statistical data for two data sets, different input parameter sets, and three test beds: Intel(R) Xeon(R) E5-2697A (codename Broadwell), Intel(R) Xeon(R) Gold 6148 (Skylake), and Intel(R) Xeon Phi(R) 7250 (KNL). Key correlations between module behavior and energy consumption are also included in the paper. The results of the study were used to estimate the time and energy requirements for a whole-year archive of day/night band (DNB) images from the Visible Infrared Imaging Radiometer Suite (VIIRS). Moreover, driving factors, including price and legacy software systems, are presented for discussion.
ISBN (print): 9781538610428
In parallel computing, a valid graph coloring yields a lock-free processing of the colored tasks, data points, etc., without expensive synchronization mechanisms. However, coloring is not free and the overhead can be significant. In particular, for the bipartite-graph partial coloring (BGPC) and distance-2 graph coloring (D2GC) problems, which have various use cases within the scientific computing and numerical optimization domains, the coloring overhead can be on the order of minutes with a single thread for many real-life graphs. In this work, we propose parallel algorithms for bipartite-graph partial coloring on shared-memory architectures. Compared to the existing shared-memory BGPC algorithms, the proposed ones employ greedier and more optimistic techniques that yield a better parallel coloring performance. In particular, on 16 cores, the proposed algorithms are more than 4x faster than their counterparts in the ColPack library, which is, to the best of our knowledge, the only publicly available coloring library for multicore architectures. In addition to BGPC, the proposed techniques are employed to devise parallel distance-2 graph coloring algorithms, and similar performance improvements have been observed. Finally, we propose two costless balancing heuristics for BGPC that can reduce the skewness and imbalance in the cardinality of color sets (almost) for free. The heuristics can also be used for the D2GC problem and, in general, they will probably yield a better color-based parallelization performance, especially on many-core architectures.
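To make the distance-2 coloring problem in the abstract above concrete, here is a minimal sequential sketch of a greedy D2GC pass. It is not the parallel algorithm proposed in the paper and not ColPack's API; the function names and the adjacency-list input format are assumptions chosen for illustration.

```python
def greedy_distance2_coloring(adj):
    """Greedy sequential distance-2 coloring (illustrative sketch only).

    adj: list where adj[v] lists the neighbors of vertex v (undirected graph).
    Returns a list of colors such that any two vertices at distance 1 or 2
    receive different colors.
    """
    color = [-1] * len(adj)
    for v in range(len(adj)):
        forbidden = set()
        for u in adj[v]:
            if color[u] != -1:
                forbidden.add(color[u])      # distance-1 neighbor
            for w in adj[u]:
                if w != v and color[w] != -1:
                    forbidden.add(color[w])  # distance-2 neighbor
        c = 0
        while c in forbidden:                # first-fit: smallest free color
            c += 1
        color[v] = c
    return color


def is_valid_distance2(adj, color):
    """Check that no vertex shares a color with a distance-1 or -2 vertex."""
    for v in range(len(adj)):
        for u in adj[v]:
            if color[u] == color[v]:
                return False
            for w in adj[u]:
                if w != v and color[w] == color[v]:
                    return False
    return True


# Path graph 0 - 1 - 2 - 3
path = [[1], [0, 2], [1, 3], [2]]
colors = greedy_distance2_coloring(path)
```

Vertices that end up with the same color are at distance three or more, so, for example, updates that touch only a vertex and its direct neighbors can be processed concurrently within one color set without locks.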
ISBN (digital): 9783319654829
ISBN (print): 9783319654829; 9783319654812
Processing of big scale-free graphs on parallel architectures with high parallelization opportunities is connected with a lot of overheads. Due to the skewed degree distribution, each thread receives a different amount of computational workload. In this paper we present a method that addresses this challenge by modifying the CSR data structure and redistributing work across threads. The method was implemented in breadth-first search and single-source shortest path algorithms for the GPU architecture.
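As background for the abstract above, the following is a minimal sketch of a standard (unmodified) CSR representation and a level-synchronous breadth-first search over it. It does not reproduce the modified CSR structure or the work redistribution from the paper; `build_csr` and `bfs_levels` are hypothetical helper names.

```python
def build_csr(num_vertices, edges):
    """Build CSR arrays (row_ptr, col_idx) for an undirected edge list."""
    degree = [0] * num_vertices
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
    row_ptr = [0] * (num_vertices + 1)
    for v in range(num_vertices):
        row_ptr[v + 1] = row_ptr[v] + degree[v]
    col_idx = [0] * row_ptr[-1]
    cursor = row_ptr[:-1]            # next free slot per vertex (slice is a copy)
    for u, v in edges:
        col_idx[cursor[u]] = v
        cursor[u] += 1
        col_idx[cursor[v]] = u
        cursor[v] += 1
    return row_ptr, col_idx


def bfs_levels(row_ptr, col_idx, source):
    """Level-synchronous BFS; returns the BFS depth of every vertex (-1 if unreachable)."""
    level = [-1] * (len(row_ptr) - 1)
    level[source] = 0
    frontier = [source]
    depth = 0
    while frontier:
        depth += 1
        next_frontier = []
        for u in frontier:           # on a GPU, one thread per frontier vertex
            # The length of this inner loop equals the degree of u, which is
            # where a skewed degree distribution causes load imbalance.
            for k in range(row_ptr[u], row_ptr[u + 1]):
                v = col_idx[k]
                if level[v] == -1:
                    level[v] = depth
                    next_frontier.append(v)
        frontier = next_frontier
    return level


# Star-like graph: vertex 0 is a hub, vertex 5 hangs off vertex 4.
edges = [(0, 1), (0, 2), (0, 3), (0, 4), (4, 5)]
row_ptr, col_idx = build_csr(6, edges)
levels = bfs_levels(row_ptr, col_idx, 0)
```

In this example the hub has four neighbors while most other vertices have one, so a thread assigned to the hub does roughly four times the work of its peers. This is a small-scale version of the imbalance the abstract describes.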
ISBN (print): 9781538620748
Modern high performance computing and cloud computing infrastructures often leverage Graphics Processing Units (GPUs) to provide accelerated, massively parallel computational power. This performance gain, however, may also introduce higher energy consumption. The energy challenge becomes more and more pronounced as the system scales. To address this challenge, we propose Archon, a framework for supporting energy-efficient computing on CPU-GPU heterogeneous architectures. Specifically, Archon takes the user's programs as input, automatically distributes the workload between the CPU and GPU, and dynamically tunes the distribution ratio at runtime for an energy-efficient execution. Experiments have been carried out to evaluate the effectiveness of Archon, and the results show that it can achieve considerable energy savings at runtime without significant effort from the programmers.
ISBN (print): 9783319780245; 9783319780238
The availability of high performance computing resources enables us to perform very large numerical simulations and in this way to tackle challenging real-life problems. At the same time, in order to efficiently utilize the computational power at our disposal, the ever-growing complexity of the computer architecture poses high demands on the algorithms and their implementation. Large-scale high performance simulations can be performed by utilizing available general libraries, by writing libraries that suit particular classes of problems, or by developing software from scratch. Clearly, the possibilities to enhance the efficiency of the software tools in the three cases are very different, ranging from nearly impossible to full capacity. In this work we exemplify the efficiency of the three approaches on benchmark problems, using monitoring tools that provide a very rich spectrum of data on the performance of the applied codes as well as on the utilization of the supercomputer itself.
Resource Description Framework (RDF) graphs are widely used for representing semantically linked data in various domains. Many modern RDF specific storage, indexing, and query optimization systems internally represent...
ISBN (print): 9783319780245; 9783319780238
Solving tridiagonal systems is one of the most computationally expensive parts of many applications, so multiple studies have explored the use of NVIDIA GPUs to accelerate such computation. However, these studies have mainly focused on using parallel algorithms to compute such systems, which can efficiently exploit the shared memory and are able to saturate the GPU's capacity with a low number of systems, but scale poorly when dealing with a relatively high number of systems. We propose a new implementation (cuThomasBatch) based on the Thomas algorithm. To achieve good scalability with this approach, it is necessary to transform the way the inputs are stored in memory in order to exploit coalescence (contiguous threads access contiguous memory locations). The results given in this study show that the implementation carried out in this work is able to beat the reference code when dealing with a relatively large number of tridiagonal systems (2,000-256,000), being close to 3x (in double precision) and 4x (in single precision) faster using one Kepler NVIDIA GPU.
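The combination described in the abstract above (one Thomas solve per system, with inputs interleaved so that neighboring threads touch neighboring addresses) can be sketched on the CPU as follows. This is an illustrative plain-Python sketch, not the cuThomasBatch code or its interface; the function name, argument order, and the flat interleaved layout (entry i of system s stored at index i * batch + s) are assumptions.

```python
def thomas_batch_interleaved(lower, diag, upper, rhs, n, batch):
    """Solve `batch` independent tridiagonal systems of size n (Thomas algorithm).

    All arrays are flat and interleaved: entry i of system s is at i * batch + s.
    lower[0 * batch + s] and upper[(n - 1) * batch + s] are not part of the matrix.
    No pivoting is done, so the systems are assumed to be diagonally dominant.
    Returns the solutions in the same interleaved layout.
    """
    c = list(upper)   # modified super-diagonal
    d = list(rhs)     # modified right-hand side, overwritten with the solution
    for s in range(batch):            # on a GPU, each s would be one thread
        c[s] = upper[s] / diag[s]
        d[s] = rhs[s] / diag[s]
        # Forward elimination
        for i in range(1, n):
            k = i * batch + s         # threads s and s + 1 read adjacent entries
            p = k - batch             # same system, previous row
            denom = diag[k] - lower[k] * c[p]
            c[k] = upper[k] / denom
            d[k] = (rhs[k] - lower[k] * d[p]) / denom
        # Back substitution
        for i in range(n - 2, -1, -1):
            k = i * batch + s
            d[k] -= c[k] * d[k + batch]
    return d


# Two 3x3 systems, interleaved:
#   system 0: diag 2, off-diagonals 1, solution (1, 2, 3)
#   system 1: diag 4, off-diagonals 1, solution (1, 1, 1)
lower = [0.0, 0.0, 1.0, 1.0, 1.0, 1.0]
diag = [2.0, 4.0, 2.0, 4.0, 2.0, 4.0]
upper = [1.0, 1.0, 1.0, 1.0, 0.0, 0.0]
rhs = [4.0, 5.0, 8.0, 6.0, 8.0, 5.0]
x = thomas_batch_interleaved(lower, diag, upper, rhs, n=3, batch=2)
```

With this layout, row i of all systems is contiguous in memory, so threads with consecutive indices read consecutive locations at every step of the recurrence. Storing each system contiguously instead would make those accesses strided by n.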
Dynamic programming techniques are well established and employed by various practical algorithms, including the edit-distance algorithm and the dynamic time warping algorithm. These algorithms usually operate in an iteration-based manner where new values are computed from values of the previous iteration. The data dependencies enforce synchronization, which limits the possibilities for internal parallel processing. In this paper, we investigate parallel approaches to processing matrix-based dynamic programming algorithms on modern multicore CPUs, Intel Xeon Phi accelerators, and general purpose GPUs. We address both the problem of computing a single distance on large inputs and the problem of computing a number of distances of smaller inputs simultaneously (e.g., when a similarity query is being resolved). Our proposed solutions yielded significant improvements in performance and achieved a speedup of two orders of magnitude when compared to the serial baseline. (C) 2016 Elsevier Ltd. All rights reserved.
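One common way to expose parallelism despite the data dependencies mentioned in the abstract above is to traverse the dynamic programming matrix by anti-diagonals (a wavefront). The sketch below shows this traversal order for the edit distance in sequential Python. It is an assumption that this matches any of the approaches in the paper, and the function name is hypothetical.

```python
def edit_distance_wavefront(a, b):
    """Levenshtein distance computed anti-diagonal by anti-diagonal.

    Cell (i, j) depends only on (i-1, j), (i, j-1), and (i-1, j-1), all of which
    lie on earlier anti-diagonals, so the cells of one anti-diagonal are
    mutually independent and could be computed in parallel.
    """
    m, n = len(a), len(b)
    dist = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dist[i][0] = i                 # delete all of a[:i]
    for j in range(n + 1):
        dist[0][j] = j                 # insert all of b[:j]
    for diag in range(2, m + n + 1):   # anti-diagonal index: i + j == diag
        i_lo = max(1, diag - n)
        i_hi = min(m, diag - 1)
        for i in range(i_lo, i_hi + 1):    # independent cells of this wavefront
            j = diag - i
            cost = 0 if a[i - 1] == b[j - 1] else 1
            dist[i][j] = min(dist[i - 1][j] + 1,         # deletion
                             dist[i][j - 1] + 1,         # insertion
                             dist[i - 1][j - 1] + cost)  # match or substitution
    return dist[m][n]


result = edit_distance_wavefront("kitten", "sitting")
```

A synchronization point is still needed between consecutive anti-diagonals, which is why the available parallelism is small near the matrix corners and largest in the middle.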
In this paper we present PMORSy, a new parallel software package for symmetric sparse matrix ordering on shared memory systems. The NP-complete fill-in minimization problem is solved by means of a multilevel nested dissection algorithm with modifications for vertex separators. Parallel processing is done in a task-based fashion with granularity tuning. We employ threading techniques on shared memory using OpenMP 3.0 technology, as opposed to the Message Passing Interface-based approach widely used for parallel sparse matrix ordering. Experimental results on symmetric matrices from the University of Florida Sparse Matrix Collection and on matrices from finite-element analysis of three-dimensional strength problems show that our implementation is competitive with the ParMETIS and PT-Scotch libraries in both ordering quality and performance. The PMORSy library is publicly available from the Lobachevsky State University Supercomputing Center website.
As the Android operating system and the applications on the device play important roles, the security requirements of Android applications have increased as well. With the upgrade of the Android system, Android runtime mode (ART mode)...