This paper presents a network-based template for analyzing large-scale dynamic data. Specifically, we propose a novel shared-memory parallel algorithm for updating tree-based structures or properties, such as connected components (CC) and minimum spanning trees (MST), on dynamic networks. The underlying idea is to update the information in a rooted-tree data structure that stores the edges of the network that are most relevant to the analysis. Extensive experiments on real-world and synthetic networks demonstrate that, with the exception of the inherently sequential component for creating the rooted tree, our proposed updating algorithm is scalable and, in most cases, also requires significantly less memory, energy, and time than a recompute-from-scratch algorithm. To the best of our knowledge, this is the first parallel algorithm for updating MSTs on weighted dynamic networks. The rooted-tree-based framework that we propose in this paper can be extended to update other weighted and unweighted tree-based properties, such as single-source shortest paths and betweenness and closeness centrality.
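The paper's parallel batch-update algorithm is not reproduced here, but the rooted-tree idea can be illustrated with a minimal sequential sketch: a union-find forest whose parent pointers form the rooted trees, updated as edges of the dynamic network arrive.

#include <cstdio>
#include <numeric>
#include <vector>

// Minimal sequential sketch: a rooted-tree (union-find) structure that
// maintains connected components while edges are inserted incrementally.
// The paper's parallel update logic is not reproduced here.
struct RootedForest {
    std::vector<int> parent;
    explicit RootedForest(int n) : parent(n) {
        std::iota(parent.begin(), parent.end(), 0); // every vertex is its own root
    }
    int root(int v) {                // follow parent pointers, compressing the path
        while (parent[v] != v) {
            parent[v] = parent[parent[v]];
            v = parent[v];
        }
        return v;
    }
    // Insert edge (u, v); returns true if it merged two components.
    bool insert_edge(int u, int v) {
        int ru = root(u), rv = root(v);
        if (ru == rv) return false;  // already connected, tree unchanged
        parent[ru] = rv;             // re-root one tree under the other
        return true;
    }
};

int main() {
    RootedForest f(5);
    f.insert_edge(0, 1);
    f.insert_edge(3, 4);
    std::printf("0 and 1 connected: %d\n", f.root(0) == f.root(1)); // 1
    std::printf("0 and 3 connected: %d\n", f.root(0) == f.root(3)); // 0
}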
In the last two decades, great attention has been devoted to the design of non-blocking and linearizable data structures, which enable exploiting the scaled-up degree of parallelism in off-the-shelf shared-memory multi-core machines. In this context, priority queues are highly challenging. Indeed, concurrent attempts to extract the highest-priority item are prone to create detrimental thread conflicts that lead to abort/retry of the operations. In this article, we present the first priority queue that jointly provides: (i) lock-freedom and linearizability; (ii) conflict resiliency against concurrent extractions; (iii) adaptiveness to different contention profiles; and (iv) amortized constant-time access for both insertions and extractions. Beyond presenting our solution, we also provide a proof of its correctness based on an assertional approach. We further present an experimental study on a 64-CPU machine, showing that our proposal provides performance improvements over state-of-the-art non-blocking priority queues.
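To see why concurrent extractions are so conflict-prone, consider a minimal baseline sketch (not the article's algorithm): a sorted linked list whose extract-min is a CAS on the shared head, so every losing thread must retry on the same hot spot.

#include <atomic>
#include <cstdio>
#include <thread>
#include <vector>

// Minimal sketch (not the article's design): a sorted singly linked list
// where extract-min is a CAS on the shared head. Every concurrent
// extraction targets the same node, so losers must retry; this is the
// conflict that the article's resilient design is built to avoid.
struct Node {
    int priority;
    Node* next;
};

std::atomic<Node*> head{nullptr};

// Lock-free extract-min: linearizes at the successful CAS on head.
bool extract_min(int& out) {
    Node* h = head.load();
    while (h != nullptr) {
        if (head.compare_exchange_weak(h, h->next)) { // contended hot spot
            out = h->priority;
            return true;           // node intentionally leaked in this sketch
        }                          // (real code needs safe memory reclamation)
    }
    return false;
}

int main() {
    for (int p = 9; p >= 0; --p)   // build 0..9 sorted; single-threaded setup
        head = new Node{p, head.load()};
    std::vector<std::thread> ts;
    for (int t = 0; t < 4; ++t)
        ts.emplace_back([] {
            int p;
            while (extract_min(p)) std::printf("got %d\n", p);
        });
    for (auto& t : ts) t.join();
}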
Large-scale capturing of real-world scenes as 3D point clouds (e.g., using LIDAR scanning) generates billions of points that are challenging to visualize. High storage requirements prevent the quick and easy inspection of captured datasets on user-grade hardware. The fastest real-time rendering methods are limited by the available GPU memory and render only around 1 billion points interactively. We show that we can achieve state-of-the-art results in both rendering speed and memory efficiency while simultaneously supporting datasets that surpass the capabilities of other methods. We present an on-the-fly point cloud decompression scheme that tightly integrates with software rasterization to reduce on-chip memory requirements by more than 4×. Our method compresses geometry losslessly and provides high visual quality at real-time frame rates. We use a GPU-friendly, clipped Huffman encoding for compression. Point clouds are divided into equal-sized batches, which are Huffman-encoded independently. Batches are further subdivided to form easy-to-consume streams of data for massively parallel execution. The compressed point clouds are stored in an access-aware manner to achieve coherent GPU memory access and a high L1 cache hit rate at render time. Our approach can decompress and rasterize up to 120 million Huffman-encoded points per millisecond on the fly. We evaluate the quality and performance of our approach on various large datasets against the fastest competing methods. Our approach renders massive 3D point clouds at competitive frame rates and visual quality while consuming significantly less memory, thus unlocking unprecedented performance for the visualization of challenging datasets on commodity GPUs.
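As a rough illustration of the per-batch decoding step (the paper's clipped, GPU-parallel variant is not shown), the following sketch walks a hand-built Huffman tree over a bit string for a hypothetical three-symbol alphabet: 'A' = 0, 'B' = 10, 'C' = 11.

#include <cstdio>
#include <string>
#include <vector>

// Minimal CPU sketch of Huffman decoding one batch's bit stream.
struct HuffNode {
    int child[2] = {-1, -1}; // indices into the node pool
    char symbol = 0;         // valid only at leaves
};

int main() {
    // Hand-built prefix tree for the hypothetical codes above.
    std::vector<HuffNode> pool(5);
    pool[0].child[0] = 1; pool[0].child[1] = 2; // root
    pool[2].child[0] = 3; pool[2].child[1] = 4;
    pool[1].symbol = 'A';                       // leaf "0"
    pool[3].symbol = 'B';                       // leaf "10"
    pool[4].symbol = 'C';                       // leaf "11"

    std::string bits = "0101101011";            // encodes ABCABC
    int node = 0;
    std::string out;
    for (char b : bits) {
        node = pool[node].child[b - '0'];       // descend one bit
        if (pool[node].child[0] == -1) {        // reached a leaf: emit, restart
            out += pool[node].symbol;
            node = 0;
        }
    }
    std::printf("%s\n", out.c_str());           // prints ABCABC
}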
ISBN (print): 9798400705977
Edge Computing (EC) has emerged as a solution to reduce energy demand and greenhouse gas emissions from digital technologies. EC supports low latency, mobility, and location awareness for delay-sensitive applications by bridging the gap between cloud computing services and end-users. Machine learning (ML) methods have been applied in EC for data classification and information processing. Ensemble learners have often proven to yield high predictive performance on data stream classification problems. Mini-batching is a technique proposed for improving cache reuse in multi-core architectures running bagging ensembles for the classification of online data streams, which speeds up applications and reduces energy consumption. However, the original mini-batching provides limited cache reuse and hinders the accuracy of the ensembles (i.e., their capacity to detect behavior changes in data streams). In this paper, we improve mini-batching by fusing the continuous training and test loops for the classification of data streams. We evaluate the new strategy by comparing its performance and energy efficiency with the original mini-batching for data stream classification, using six ensemble algorithms and four benchmark datasets. We also compare the mini-batching strategies with two hardware-based strategies supported by commodity multi-core processors commonly used in EC. Results show that mini-batching strategies can significantly reduce energy consumption in 95% of the experiments. The original mini-batching improved energy efficiency by 96% on average and 169% in the best case, while our new strategy improved energy efficiency by 136% on average and 456% in the best case. These strategies also support better control of the balance between performance, energy efficiency, and accuracy.
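The fused test-then-train pattern can be sketched as follows; the toy threshold "model" and the batch size are purely illustrative and stand in for the paper's ensemble learners.

#include <algorithm>
#include <cstdio>
#include <vector>

// Simplified sketch of test-then-train mini-batching for data streams
// (prequential evaluation): within each mini-batch, the test and train
// steps run in one fused pass, so model state stays warm in cache.
struct Example { double x; int label; };

struct ToyModel {
    double mean = 0.0; long n = 0;
    int predict(double x) const { return x > mean ? 1 : 0; }
    void train(double x) { ++n; mean += (x - mean) / n; }
};

int main() {
    std::vector<Example> stream = {{0.1,0},{0.9,1},{0.2,0},{0.8,1},{0.7,1},{0.3,0}};
    const size_t kBatch = 2;                      // assumed mini-batch size
    ToyModel model;
    long correct = 0, seen = 0;
    for (size_t i = 0; i < stream.size(); i += kBatch) {
        size_t end = std::min(i + kBatch, stream.size());
        for (size_t j = i; j < end; ++j) {        // fused loop: test...
            correct += (model.predict(stream[j].x) == stream[j].label);
            model.train(stream[j].x);             // ...then train, same pass
        }
        seen = end;
        std::printf("after %ld examples: accuracy %.2f\n",
                    seen, double(correct) / seen);
    }
}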
In this paper, a system is presented that implements transaction migration on an asymmetric multiprocessor in order to decrease the probability of conflicts and improve execution performance. Parallelizing applications makes programming and testing much more difficult, so the goal is to avoid putting an additional burden on the programmer; therefore, the proposed solution should be fully implemented in hardware. In the asymmetric multiprocessor that is analyzed, all cores have the same instruction set, but they are asymmetric in terms of microarchitectural properties: N - 1 "small" cores are identical, while the N-th "big" core is different, as it provides better performance and higher capacities of its units. The idea is to migrate a transaction from a "small" core to the "big" one based on the history of its execution. The experiments were performed using a significantly upgraded Gem5 simulator and eight parallel applications from the STAMP benchmark suite. The experimental results show the speedup and the rate of successfully executed transactions for five different multiprocessor configurations, including symmetric and asymmetric multiprocessors with or without transaction migration. For suitable applications, our algorithm improves turnaround time by up to 14% (10% on average) compared to solutions that do not exploit asymmetry when scheduling transactions.
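A hypothetical software rendering of the history-based policy (the real mechanism lives in hardware, and the paper's exact thresholds are not given here) might look like this: repeated aborts on a small core trigger migration of the transaction's next attempt to the big core.

#include <cstdio>

// Hypothetical sketch of a history-based migration heuristic; the
// threshold value is assumed, not taken from the paper.
struct TxHistory {
    int aborts = 0;
    static constexpr int kMigrateThreshold = 3; // assumed value
    bool should_migrate() const { return aborts >= kMigrateThreshold; }
    void on_abort()  { ++aborts; }
    void on_commit() { aborts = 0; }            // success resets the history
};

int main() {
    TxHistory h;
    for (int attempt = 1; attempt <= 4; ++attempt) {
        const char* core = h.should_migrate() ? "big" : "small";
        std::printf("attempt %d runs on %s core\n", attempt, core);
        bool committed = (attempt == 4);        // pretend the first three abort
        committed ? h.on_commit() : h.on_abort();
    }
}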
Triangular meshes of superior quality are important for geometric processing in practical applications. Existing approximative CVT-based remeshing methods use planar polygonal facets to fit the original surface, which simplifies the computation. However, they usually do not consider surface curvature. Topological errors and outliers can also occur in closed sheet surface remeshing, resulting in incorrect meshes. In this regard, we present a novel method named PowerRTF, an extension of the restricted tangent face (RTF) in conjunction with the power diagram, to better approximate the original surface with curvature adaptation. The idea is to introduce a weight property to each sample point and compute the power diagram on the tangent face to produce area-controlled polygonal facets. Based on this, we impose a variable-capacity constraint and a centroid constraint on the PowerRTF, providing a trade-off between mesh quality and computational efficiency. Moreover, we apply a normal-verification-based inverse side point culling method to address the topological errors and outliers in closed sheet surface remeshing. Our method independently computes and optimizes the PowerRTF per sample point, which is efficiently implemented in parallel on the GPU. Experimental results demonstrate the effectiveness, flexibility, and efficiency of our method.
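The power-diagram ingredient can be made concrete with a small sketch: each sample point carries a weight, and a query point is assigned to the site minimizing the power distance |x - p|^2 - w, so raising a weight enlarges that site's cell. The example below is 2D and brute-force purely for brevity; the paper's per-point GPU optimization is not shown.

#include <cstdio>
#include <vector>

// Power-diagram assignment: a query point belongs to the weighted site
// with the smallest power distance |x - p|^2 - w.
struct Site { double x, y, w; };

double power_dist(double qx, double qy, const Site& s) {
    double dx = qx - s.x, dy = qy - s.y;
    return dx * dx + dy * dy - s.w;
}

int main() {
    std::vector<Site> sites = {{0, 0, 0.0}, {1, 0, 0.5}}; // second site weighted up
    double qx = 0.45, qy = 0.0; // closer to site 0 in Euclidean distance
    int best = 0;
    for (int i = 1; i < (int)sites.size(); ++i)
        if (power_dist(qx, qy, sites[i]) < power_dist(qx, qy, sites[best]))
            best = i;
    std::printf("query falls in cell of site %d\n", best); // weight pulls it to 1
}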
ISBN (print): 9781450383431
This paper presents new parallel algorithms for generating Euclidean minimum spanning trees (EMST) and spatial clustering hierarchies (known as HDBSCAN*). Our approach is based on generating a well-separated pair decomposition followed by using Kruskal's minimum spanning tree algorithm and bichromatic closest pair computations. We introduce a new notion of well-separation to reduce the work and space of our algorithm for HDBSCAN*. We also give a new parallel divide-and-conquer algorithm for computing the dendrogram and reachability plots, which are used in visualizing clusters at different scales that arise for both EMST and HDBSCAN*. We show that our algorithms are theoretically efficient: they have work (number of operations) matching their sequential counterparts, and polylogarithmic depth (parallel time). We implement our algorithms and propose a memory optimization that requires only a subset of well-separated pairs to be computed and materialized, leading to savings in both space (up to 10x) and time (up to 8x). Our experiments on large real-world and synthetic data sets using a 48-core machine show that our fastest algorithms outperform the best serial algorithms for the problems by 11.13-55.89x, and existing parallel algorithms by at least an order of magnitude.
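The Kruskal stage of the pipeline, taken in isolation, reduces to the classic union-find loop below; the WSPD and bichromatic-closest-pair machinery that produces the candidate edges is assumed and not shown, and the candidate list here is made up for illustration.

#include <algorithm>
#include <cstdio>
#include <numeric>
#include <vector>

// Sequential sketch of the Kruskal stage: sort candidate edges by weight
// and keep each edge that joins two components.
struct Edge { int u, v; double w; };

struct UnionFind {
    std::vector<int> p;
    explicit UnionFind(int n) : p(n) { std::iota(p.begin(), p.end(), 0); }
    int find(int v) { return p[v] == v ? v : p[v] = find(p[v]); }
    bool unite(int a, int b) {
        a = find(a); b = find(b);
        if (a == b) return false;
        p[a] = b;
        return true;
    }
};

int main() {
    int n = 4;
    std::vector<Edge> cand = {{0,1,1.0},{1,2,2.0},{0,2,2.5},{2,3,0.5},{0,3,4.0}};
    std::sort(cand.begin(), cand.end(),
              [](const Edge& a, const Edge& b) { return a.w < b.w; });
    UnionFind uf(n);
    double total = 0;
    for (const Edge& e : cand)
        if (uf.unite(e.u, e.v)) {                 // edge joins two components
            std::printf("MST edge (%d,%d) w=%.1f\n", e.u, e.v, e.w);
            total += e.w;
        }
    std::printf("total weight %.1f\n", total);    // 0.5 + 1.0 + 2.0 = 3.5
}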
ISBN (print): 9781728180144
The Morse-Smale complex is a well-studied topological structure that represents the gradient flow behavior of a scalar function. It supports multi-scale topological analysis and visualization of large scientific data. Its computation poses significant algorithmic challenges when considering large-scale data and increased feature complexity. Several parallel algorithms have been proposed for the fast computation of the 3D Morse-Smale complex, but the non-trivial structure of the saddle-saddle connections is not amenable to parallel computation. This paper describes a fine-grained parallel method for computing the Morse-Smale complex that is implemented on a GPU. The saddle-saddle reachability is first determined via a transformation into a sequence of vector operations, followed by the path traversal, which is achieved via a sequence of matrix operations. Computational experiments show that the method achieves up to 7x speedup over current shared-memory implementations.
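The reformulation of reachability as vector and matrix operations can be sketched on a tiny hypothetical graph: each step multiplies a boolean frontier vector with the adjacency matrix, and each such product is one traversal step of exactly the kind that maps well to a GPU (dense and sequential here for clarity).

#include <cstdio>
#include <vector>

// Reachability as repeated boolean matrix-vector products over a frontier.
int main() {
    const int n = 4;
    // adj[i][j] = 1 if there is an arc i -> j (hypothetical tiny graph).
    int adj[n][n] = {{0,1,0,0},{0,0,1,0},{0,0,0,1},{0,0,0,0}};
    std::vector<int> reach(n, 0), frontier(n, 0);
    frontier[0] = reach[0] = 1;               // start from node 0
    bool grew = true;
    while (grew) {
        grew = false;
        std::vector<int> next(n, 0);
        for (int j = 0; j < n; ++j)           // next = frontier * adj (boolean)
            for (int i = 0; i < n; ++i)
                if (frontier[i] && adj[i][j] && !reach[j]) next[j] = 1;
        for (int j = 0; j < n; ++j)
            if (next[j]) { reach[j] = 1; grew = true; }
        frontier.swap(next);
    }
    for (int j = 0; j < n; ++j)
        std::printf("node %d reachable: %d\n", j, reach[j]); // all 1 on this chain
}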
Matrix factorization is an efficient technique used for disclosing latent features of real-world data. It finds application in areas such as text mining, image analysis, social networks, and, more recently and most popularly, recommendation systems. Alternating Least Squares (ALS), Stochastic Gradient Descent (SGD), and Coordinate Descent (CD) are among the methods commonly used for factorizing large matrices. SGD-based factorization has proven to be the most successful among these methods after the Netflix and KDDCup competitions, where the winners' algorithms relied on SGD-based methods. Parallelization of SGD then became a hot topic and has been studied extensively in the literature in recent years. We focus on parallel SGD algorithms developed for shared-memory and distributed-memory systems. Shared-memory parallelizations include works such as HogWild, FPSGD, and MLGF-MF, and distributed-memory parallelizations include works such as DSGD, GASGD, and NOMAD. We present a survey containing an exhaustive analysis of these studies, and then particularly focus on DSGD by implementing it with the message-passing paradigm and testing its performance in terms of convergence and speedup. In contrast to existing works, our experiments use many real-world datasets that we produce from published raw data. We show that DSGD is a robust algorithm for large-scale datasets and achieves near-linear speedup with fast convergence rates.
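The SGD kernel that all of these parallelizations share can be written in a few lines; the sketch below is a plain sequential version for R ≈ P·Qᵀ with illustrative, untuned hyperparameters and a made-up rating list.

#include <cstdio>
#include <vector>

// Sequential sketch of the core SGD update for matrix factorization:
// for each observed rating, step both factor rows along the gradient
// of the regularized squared error.
struct Rating { int u, i; double r; };

int main() {
    const int kUsers = 3, kItems = 3, kRank = 2;
    const double lr = 0.05, reg = 0.02;          // illustrative hyperparameters
    std::vector<Rating> ratings = {{0,0,5},{0,1,3},{1,1,4},{2,2,1},{1,2,2}};
    // Small non-zero init so gradients flow.
    std::vector<std::vector<double>> P(kUsers, std::vector<double>(kRank, 0.1));
    std::vector<std::vector<double>> Q(kItems, std::vector<double>(kRank, 0.1));
    for (int epoch = 0; epoch < 200; ++epoch)
        for (const Rating& obs : ratings) {
            double pred = 0;
            for (int k = 0; k < kRank; ++k) pred += P[obs.u][k] * Q[obs.i][k];
            double err = obs.r - pred;
            for (int k = 0; k < kRank; ++k) {    // gradient step on both factors
                double pu = P[obs.u][k], qi = Q[obs.i][k];
                P[obs.u][k] += lr * (err * qi - reg * pu);
                Q[obs.i][k] += lr * (err * pu - reg * qi);
            }
        }
    double pred = 0;
    for (int k = 0; k < kRank; ++k) pred += P[0][k] * Q[0][k];
    std::printf("predicted r(0,0) ~ %.2f (observed 5)\n", pred);
}

DSGD's contribution, surveyed in the paper, is to partition the rating matrix into blocks so that disjoint blocks touch disjoint rows of P and Q and can run this same kernel concurrently without conflicts.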
ISBN (print): 9781509022991
Operating Systems (OS) is a course in undergraduate computer science curricula that teaches students concepts relating to the environment on which their applications run. In practice, OS software is very complicated, and the internal processes and mechanisms are often difficult for students to grasp, particularly those who still struggle with programming. Many OS courses are taught by describing high-level abstractions of structures and algorithms from a textbook, and then providing homework or project assignments that, in the interest of being tractable for the student, may be disconnected from the way an operating system actually performs its tasks. These methods present only a theoretical view of essential concepts, lacking concrete examples to anchor them. What many students need is a way to connect the low-level details of an operating system's implementation with the high-level abstractions provided in the class, all while remaining accessible to people who are still improving newly acquired programming skills. To bridge the gap between OS theory and implementation, we propose an interactive tutoring system that presents the concepts involved in process synchronization and shared-memory management. In this paper, first, we discuss the research performed to frame the requirements for the tool's development. Second, we describe the design architecture, concepts involved, and features of the tool. Third, we outline the test plan, user experiments, and future improvements planned for this system.
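A concrete example of the kind of low-level snippet such a tutoring system might step through (hypothetical, not taken from the tool described in the paper): two threads race on a shared counter, and a mutex is the difference between the broken and the correct version.

#include <cstdio>
#include <mutex>
#include <thread>

// Two threads increment a shared counter; the lock_guard makes the
// increment atomic. Remove it and the result becomes nondeterministic,
// which is exactly the synchronization lesson the tool aims to teach.
int counter = 0;
std::mutex counter_lock;

void worker() {
    for (int i = 0; i < 100000; ++i) {
        std::lock_guard<std::mutex> guard(counter_lock); // remove this: data race
        ++counter;
    }
}

int main() {
    std::thread a(worker), b(worker);
    a.join();
    b.join();
    std::printf("counter = %d (expected 200000)\n", counter);
}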