ISBN (Print): 9781450383318
Advances in deep neural networks have provided a significant improvement in accuracy and speed across a large range of Computer Vision (CV) applications. However, our ability to perform real-time CV on edge devices is severely restricted by their limited computing capabilities. In this paper we employ Vega, a parallel graph-based framework, to study the performance limitations of four heterogeneous edge-computing platforms while running 12 popular deep learning CV applications. We expand the framework's capabilities, introducing two new performance enhancements: 1) an adaptive stage instance controller (ASI-C) that can improve performance by dynamically selecting the number of instances for a given stage of the pipeline; and 2) an adaptive input resolution controller (AIR-C) to improve responsiveness and enable real-time performance. These two solutions are integrated together to provide a robust real-time solution. Our experimental results show that ASI-C improves run-time performance by 1.4x on average across all heterogeneous platforms, achieving a maximum speedup of 4.3x while running face detection on a high-end edge device. We demonstrate that our integrated optimization framework improves the performance of applications and is robust to changing execution patterns.
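The abstract does not give pseudocode for the ASI-C; a minimal sketch of the idea, with illustrative class names and thresholds that are not from the paper, might look like:

```python
# Hypothetical sketch of an adaptive stage instance controller: grow or
# shrink the worker count of one pipeline stage based on the observed
# backlog of its input queue. All names and thresholds are illustrative.

class StageInstanceController:
    def __init__(self, min_instances=1, max_instances=8,
                 high_water=32, low_water=4):
        self.instances = min_instances
        self.min_instances = min_instances
        self.max_instances = max_instances
        self.high_water = high_water   # backlog above this: add an instance
        self.low_water = low_water     # backlog below this: drop an instance

    def update(self, queue_depth):
        """Adapt the instance count to the stage's input-queue depth."""
        if queue_depth > self.high_water and self.instances < self.max_instances:
            self.instances += 1
        elif queue_depth < self.low_water and self.instances > self.min_instances:
            self.instances -= 1
        return self.instances
```

A real controller would also need hysteresis and per-platform resource limits; this sketch only shows the feedback loop.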
ISBN (Print): 9781665432832
Iterative parallel algorithms can be implemented by synchronizing after each round. This bulk-synchronous parallel (BSP) pattern is inefficient when strict synchronization is not required: global synchronization is costly at scale and prohibits amortizing load imbalance over the entire execution, and termination detection is challenging with irregular data-dependent communication. We present an asynchronous communication protocol that efficiently interleaves communication with computation. The protocol includes global termination detection without obstructing computation and communication between nodes. The user's computational primitive only needs to indicate when local work is done; our algorithm detects when all processors reach this state. We do not assume that global work decreases monotonically, allowing processors to create new work. We illustrate the utility of our solution through experiments, including two large data analysis and visualization codes: parallel particle advection and distributed union-find. Our asynchronous algorithm is several times faster than the synchronous approach, with better strong-scaling efficiency.
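The paper's protocol is not reproduced here, but the property it must handle can be shown in a toy sequential simulation (all names illustrative): executing a task may spawn new work on another worker, so global work is not monotone, and the system is quiescent only when every queue has drained.

```python
# Toy model of non-monotone distributed work: each queue holds tasks
# (target, n_children); executing a task may create new leaf tasks on
# another worker's queue. A real detector must recognise the state in
# which all queues are empty AND no messages are in flight; here, with
# sequential polling, the two conditions coincide.

from collections import deque

def run_until_quiescent(queues):
    """Execute all work, including work created during execution.
    Returns the total number of tasks executed at quiescence."""
    executed = 0
    while any(queues):                      # poll every worker in rounds
        for i, q in enumerate(queues):
            if q:
                target, children = q.popleft()
                executed += 1
                for _ in range(children):
                    queues[target].append((i, 0))  # newly created leaf work
    return executed
```

For example, one task on worker 0 that spawns two children on worker 1 yields three executed tasks in total before quiescence.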
ISBN (Print): 9781450380539
In this paper we show that the Minimum Spanning Tree problem (MST) can be solved deterministically in O(1) rounds of the Congested Clique model. In the Congested Clique model there are n players that perform computation in synchronous rounds. Each round consists of a phase of local computation and a phase of communication, in which each pair of players is allowed to exchange O(log n)-bit messages. The study of this model began with the MST problem: in the paper by Lotker, Pavlov, Patt-Shamir, and Peleg [SPAA'03, SICOMP'05] that defined the Congested Clique model, the authors gave a deterministic O(log log n) round algorithm that improved over a trivial O(log n) round adaptation of Borůvka's algorithm. There was a sequence of gradual improvements to this result: an O(log log log n) round algorithm by Hegeman, Pandurangan, Pemmaraju, Sardeshmukh, and Scquizzato [PODC'15], an O(log* n) round algorithm by Ghaffari and Parter [PODC'16], and an O(1) round algorithm by Jurdzinski and Nowicki [SODA'18], but all those algorithms were randomized. Therefore, the question of the existence of any deterministic o(log log n) round algorithm for the Minimum Spanning Tree problem had remained open since the seminal paper by Lotker, Pavlov, Patt-Shamir, and Peleg [SPAA'03, SICOMP'05]. Our result resolves this question and establishes that O(1) rounds suffice to solve the MST problem in the Congested Clique model, even if we are not allowed to use any randomness. Furthermore, the amount of communication needed by the algorithm makes it applicable to a variant of the MPC model using machines with local memory of size O(n).
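For reference, the Borůvka's algorithm mentioned above (whose O(log n) rounds the Congested Clique line of work successively compressed) works by letting every component pick its lightest outgoing edge each round; a plain sequential sketch:

```python
# Sequential Borůvka's algorithm: each round, every component selects its
# cheapest outgoing edge and components merge, halving (at least) the
# component count, hence O(log n) rounds. This is the textbook algorithm,
# not the paper's O(1)-round Congested Clique construction.

def boruvka_mst(n, edges):
    """n vertices (0..n-1), edges as (u, v, weight) with distinct weights.
    Returns the total MST weight."""
    parent = list(range(n))

    def find(x):                       # union-find with path halving
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    mst_weight, components = 0, n
    while components > 1:
        cheapest = {}                  # component root -> lightest out-edge
        for u, v, w in edges:
            ru, rv = find(u), find(v)
            if ru != rv:
                for r in (ru, rv):
                    if r not in cheapest or w < cheapest[r][2]:
                        cheapest[r] = (u, v, w)
        for u, v, w in cheapest.values():
            ru, rv = find(u), find(v)
            if ru != rv:               # skip edges already merged this round
                parent[ru] = rv
                mst_weight += w
                components -= 1
    return mst_weight
```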
ISBN (Print): 9781450390682
In this paper, we propose a parallel algorithm for computing all-pairs shortest paths (APSP) for sparse graphs on the distributed memory system with p processors. To exploit the graph sparsity, we first preprocess the graph by utilizing several known algorithmic techniques in linear algebra such as fill-in reducing ordering and elimination tree parallelism. Then we map the preprocessed graph onto the distributed memory system for both load balancing and communication reduction. Finally, we design a new scheduling strategy to minimize the communication cost. The bandwidth cost (communication volume) and the latency cost (number of messages) of our algorithm are O(n^2 log^2 p / p + |S|^2 log^2 p) and O(log^2 p), respectively, where S is a minimal vertex separator that partitions the graph into two components of roughly equal size. Compared with the state-of-the-art result for dense graphs, where the bandwidth and latency costs are O(n^2/√p) and O(√p log^2 p), respectively, our algorithm reduces the latency cost by a factor of O(√p), and reduces the bandwidth cost by a factor of O(√p / log^2 p) for sparse graphs with |S| = O(n/√p). We also present lower bounds on the bandwidth and latency costs of computing APSP on sparse graphs, which are Ω(n^2/p + |S|^2) and Ω(log^2 p), respectively. This implies that the bandwidth cost of our algorithm is nearly optimal and the latency cost is optimal.
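As a point of reference for the dense baseline the authors compare against, APSP can be computed with the Floyd–Warshall loop nest below; distributed dense variants block this computation over a √p × √p process grid, which is where the O(n^2/√p) bandwidth cost comes from. (Plain sequential sketch, not the paper's sparse algorithm.)

```python
# Sequential Floyd-Warshall APSP on an adjacency matrix; INF marks a
# missing edge. Shown only as the dense baseline that separator-based
# sparse APSP algorithms improve upon.

INF = float("inf")

def floyd_warshall(dist):
    """In-place APSP on an n x n distance matrix; returns the matrix."""
    n = len(dist)
    for k in range(n):
        row_k = dist[k]
        for i in range(n):
            dik = dist[i][k]
            if dik == INF:
                continue                      # no path through k from i
            row_i = dist[i]
            for j in range(n):
                nd = dik + row_k[j]
                if nd < row_i[j]:
                    row_i[j] = nd             # relax i -> k -> j
    return dist
```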
ISBN (Print): 9781450383356
Word2Vec remains one of the most impactful innovations in the field of Natural Language Processing (NLP), representing latent grammatical and syntactical information in human text with dense vectors in a low-dimensional space. Word2Vec has high computational cost due to the algorithm's inherent sequentiality, intensive memory accesses, and the large vocabularies it represents. While prior studies have investigated techniques to exploit parallelism and improve memory system performance, they struggle to effectively gain throughput on powerful GPUs. We identify memory data access and latency as the primary bottleneck in prior works on GPUs, which prevents highly optimized kernels from attaining the architecture's peak performance. We present a novel algorithm, FULL-W2V, which maximally exploits the opportunities for data reuse in the W2V algorithm and leverages GPU architecture and resources to reduce access to low memory levels and improve temporal locality. FULL-W2V is capable of reducing accesses to GPU global memory significantly, e.g., by more than 89%, compared to prior state-of-the-art GPU implementations, resulting in significant performance improvement that scales across successive hardware generations. Our prototype implementation achieves a 2.97X speedup when ported from Nvidia Pascal P100 to Volta V100 cards, and outperforms the state-of-the-art by 5.72X on V100 cards with the same embedding quality. In-depth analysis indicates that the reduction of memory accesses through register and shared memory caching and high-throughput shared memory reduction leads to a significantly improved arithmetic intensity. FULL-W2V can potentially benefit many applications in NLP and other domains.
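The data-reuse opportunity the abstract refers to is visible even in a plain sequential sketch of the skip-gram negative-sampling update: the center word's vector is read once and reused against the positive pair and every negative sample, which is the kind of value FULL-W2V keeps in registers and shared memory instead of refetching from global memory. (Pure-Python illustration, not the paper's GPU kernel.)

```python
import math

def sgns_step(W_in, W_out, center, context, negatives, lr=0.025):
    """One skip-gram negative-sampling update on list-of-list embedding
    matrices. The center vector h is loaded once and reused for the
    positive pair and all negative samples."""
    h = W_in[center][:]                     # read once, reused below
    grad_h = [0.0] * len(h)
    for word, label in [(context, 1.0)] + [(n, 0.0) for n in negatives]:
        v = W_out[word]
        dot = sum(a * b for a, b in zip(h, v))
        score = 1.0 / (1.0 + math.exp(-dot))        # sigmoid
        g = lr * (label - score)
        for j in range(len(h)):
            grad_h[j] += g * v[j]
            v[j] += g * h[j]                # W_out[word] updated in place
    for j in range(len(h)):
        W_in[center][j] += grad_h[j]        # apply accumulated gradient
```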
ISBN (Print): 9781665435772
Randomized algorithms often outperform their deterministic counterparts in terms of simplicity and efficiency. In this paper, we consider Randomized Incremental Constructions (RICs), which are very popular, in particular in combinatorial optimization and computational geometry. Our contribution is Collaborative Parallel RIC (CPRIC), a novel approach to parallelizing RIC for modern parallel architectures like vector processors and GPUs. We show that our approach, based on a work-stealing mechanism, avoids the control-flow divergence of parallel threads, thus improving the performance of the parallel implementation. Our extensive experiments on CPU and GPU demonstrate the advantages of our CPRIC approach, which achieves an average speedup between 4x and 5x compared to the naively parallelized RIC.
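A canonical RIC of the kind this paper parallelizes is Welzl's smallest-enclosing-circle algorithm: points are inserted in random order, and the structure is rebuilt only when a new point violates it. The sequential sketch below (textbook algorithm, not the paper's code) shows the data-dependent control flow that causes the divergence CPRIC targets.

```python
import random

def welzl(points):
    """Smallest enclosing circle by randomized incremental construction;
    expected O(n) time. Returns (cx, cy, r)."""

    def circle_two(a, b):                   # circle with a, b on boundary
        cx, cy = (a[0] + b[0]) / 2, (a[1] + b[1]) / 2
        r = ((a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2) ** 0.5 / 2
        return cx, cy, r

    def circle_three(a, b, c):              # circumcircle of a, b, c
        ax, ay, bx, by, cx, cy = *a, *b, *c
        d = 2 * (ax * (by - cy) + bx * (cy - ay) + cx * (ay - by))
        if abs(d) < 1e-12:
            return None                     # collinear: no circumcircle
        ux = ((ax**2 + ay**2) * (by - cy) + (bx**2 + by**2) * (cy - ay)
              + (cx**2 + cy**2) * (ay - by)) / d
        uy = ((ax**2 + ay**2) * (cx - bx) + (bx**2 + by**2) * (ax - cx)
              + (cx**2 + cy**2) * (bx - ax)) / d
        return ux, uy, ((ux - ax) ** 2 + (uy - ay) ** 2) ** 0.5

    def inside(circ, p, eps=1e-9):
        x, y, r = circ
        return (p[0] - x) ** 2 + (p[1] - y) ** 2 <= (r + eps) ** 2

    pts = points[:]
    random.shuffle(pts)                     # randomized insertion order
    circ = (pts[0][0], pts[0][1], 0.0)
    for i, p in enumerate(pts):
        if inside(circ, p):
            continue                        # cheap path: no rebuild
        circ = (p[0], p[1], 0.0)            # p must lie on the boundary
        for j in range(i):
            q = pts[j]
            if inside(circ, q):
                continue
            circ = circle_two(p, q)         # p, q on the boundary
            for k in range(j):
                s = pts[k]
                if not inside(circ, s):
                    circ = circle_three(p, q, s) or circ
    return circ
```

The common case (point already inside) is cheap while violations trigger nested rebuild loops; on a GPU, threads taking these different paths diverge, which is what the work-stealing scheme is designed to avoid.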
ISBN (Print): 9781728162744
Flash memory device operations such as read, program, and erase are performed by sequences called "algorithms". An algorithm is mainly composed of phases where accurate voltages are computed and applied to the array cells, and phases where data are moved through different latches in data buffers. The fast-increasing complexity of multi-bit NAND memory algorithms made it natural to realize the control logic on a microprocessor, in the form of firmware routines. The HW/FW co-design enables safer complexity management, allows development of algorithms in parallel with physical design, and accelerates development time to fit aggressive time-to-market requirements. Moreover, to keep up with increasing performance requirements, different kinds of multi-threaded microprocessor solutions have been proposed over the years to get the best performance-power-area (PPA) trade-off. This article proposes one possible approach to performance optimization through multi-threading, without an immediate downside for area and power. The most innovative point of the new architecture is the close correspondence between the intrinsically parallelizable physical processes inside a NAND Flash and the number of threads, buses and physical executors. As shown, this solution introduces tangible advantages in terms of performance.
The multi-scale character of skeletal muscle models requires simulations with high spatial resolution to capture all relevant effects. This naturally involves high computational load that can only be tackled by parallel computations. We simulate electrophysiology and muscle contraction using a state-of-the-art, biophysical chemo-electro-mechanical model that requires meshes of the 3D domain with embedded, aligned 1D meshes for muscle fibers. We present novel algorithms to construct highly-resolved meshes with robust properties for real muscle geometries from surface triangulations. We demonstrate their use and suitability in a simulation of the biceps brachii muscle and tendons. In addition, the respective simulations showcase several functional enhancements of our simulation framework OpenDiHu.
ISBN (Print): 9781665432818
In this work, we present methods for distributed domain generation within the constraints of our decentralized domain management concept. Here, all participating actors only have knowledge of their immediate neighbours, which are defined by geometric and hierarchical relations between nodes that represent subsets of the computational domain. We generate this domain following a hierarchical spacetree refinement. First, an initial tree is generated on every participating process. Second, this tree is distributed locally following a space-filling curve linearisation. Every process is assigned at least one leaf node of the initial tree, which acts as a starting point for the subsequent domain generation. From here, every process independently refines a subdomain using a decomposition method, which transforms a triangular surface-based geometry description into a volume-based one, using increasingly complex intersection tests. The resulting domain tree is distributed, yet neighbourhood references of neighbouring subtrees are not resolved. We combine the resolution of these relations with a 2:1 tree balancing, which involves the transfer of the surface of neighbouring subtrees. We provide results of a domain generation test case, using an input geometry with 84,072 triangles on up to 896 processes of the CoolMUC-2 cluster segment of LRZ's Linux Cluster System. Here, we bring down the overall time it takes to generate an adaptively refined and balanced octree with depth d = 7 from 5.5 hours on one process to two seconds on 896 processes.
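The space-filling curve linearisation step can be illustrated with a Morton (Z-order) curve, a common choice for octrees: interleaving the bits of a cell's (x, y, z) indices gives a 1D key, and sorting leaves by that key yields contiguous, spatially compact chunks per process. (Illustrative sketch; the paper's curve and implementation may differ.)

```python
# Morton (Z-order) linearisation of octree cells, a common way to order
# spacetree leaves for distribution; names and partitioning are illustrative.

def morton3d(x, y, z, depth):
    """Interleave the bits of integer cell indices at the given tree depth
    to obtain the cell's position along the Morton curve."""
    code = 0
    for bit in range(depth):
        code |= ((x >> bit) & 1) << (3 * bit)       # x -> bits 0, 3, 6, ...
        code |= ((y >> bit) & 1) << (3 * bit + 1)   # y -> bits 1, 4, 7, ...
        code |= ((z >> bit) & 1) << (3 * bit + 2)   # z -> bits 2, 5, 8, ...
    return code

def partition_leaves(leaves, depth, nprocs):
    """Sort leaves along the curve and cut the sequence into nprocs
    contiguous, balanced chunks, one per process."""
    ordered = sorted(leaves, key=lambda c: morton3d(*c, depth))
    n = len(ordered)
    return [ordered[p * n // nprocs:(p + 1) * n // nprocs]
            for p in range(nprocs)]
```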
For distributions over discrete product spaces ∏_{i=1}^{n} Ω_i, Glauber dynamics is a Markov chain that at each step resamples a random coordinate conditioned on the other coordinates. We show that k-Glauber dynamics, whi...
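The single-coordinate (k = 1) baseline that k-Glauber dynamics generalizes is easy to state concretely for the Ising model, where the conditional distribution of one spin given the rest depends only on its neighbours' spins (illustrative sketch, not from the abstract's paper):

```python
import math
import random

def glauber_step(spins, neighbors, beta, rng):
    """One step of single-site Glauber dynamics for an Ising model with
    +1/-1 spins: pick a uniformly random coordinate and resample it from
    its conditional distribution given all other spins."""
    i = rng.randrange(len(spins))
    field = sum(spins[j] for j in neighbors[i])      # local field at site i
    p_up = 1.0 / (1.0 + math.exp(-2.0 * beta * field))  # P(spin_i=+1 | rest)
    spins[i] = 1 if rng.random() < p_up else -1
    return spins
```

At beta = 0 the conditional is uniform; at large beta the resampled spin aligns with its neighbours with overwhelming probability.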