ISBN (Print): 9781665440660
Stochastic Gradient Descent (SGD) is an essential element in Machine Learning (ML) algorithms. Asynchronous shared-memory parallel SGD (AsyncSGD), including synchronization-free algorithms such as HOGWILD!, has received interest in certain contexts due to reduced overhead compared to synchronous parallelization. Although these algorithms induce staleness and inconsistency, they have shown speedup for problems with smooth, strongly convex targets and gradient sparsity. Recent works take important steps towards understanding the potential of parallel SGD for problems not conforming to these strong assumptions, in particular for deep learning (DL). There is, however, a gap in the current literature in understanding when AsyncSGD algorithms are useful in practice, and in particular how mechanisms for synchronization and consistency play a role. We contribute to answering questions in this gap by studying a spectrum of parallel algorithmic implementations of AsyncSGD, aiming to understand how shared-data synchronization influences the convergence properties in fundamental DL applications. We focus on the impact of consistency-preserving non-blocking synchronization on SGD convergence and on its sensitivity to hyperparameter tuning. We propose Leashed-SGD, an extensible algorithmic framework of consistency-preserving implementations of AsyncSGD, employing lock-free synchronization and effectively balancing throughput and latency. Leashed-SGD features a natural contention-regulating mechanism, as well as dynamic memory management, allocating space only when needed. We argue analytically about the dynamics of the algorithms, memory consumption, the threads' progress over time, and the expected contention. We provide a comprehensive empirical evaluation, validating the analytical claims, benchmarking the proposed Leashed-SGD framework, and comparing to baselines for two prominent DL applications: multilayer perceptrons (MLP) and convolutional neural networks (CNN).
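As a concrete illustration of the synchronization spectrum discussed above, the following minimal sketch (not the paper's Leashed-SGD implementation) contrasts HOGWILD!-style lock-free updates with a lock-based, consistency-preserving variant on a toy least-squares problem; the objective and all names are illustrative assumptions.

```python
# Minimal sketch: asynchronous shared-memory SGD workers on a toy
# least-squares objective. lock=None gives HOGWILD!-style unsynchronized
# updates (stale/inconsistent reads tolerated); passing a lock makes each
# read-update-write atomic at the cost of contention.
import threading
import numpy as np

def sgd_worker(w, X, y, lr, steps, lock=None):
    rng = np.random.default_rng()
    for _ in range(steps):
        i = rng.integers(len(y))
        if lock:
            with lock:
                grad = (X[i] @ w - y[i]) * X[i]   # consistent snapshot of w
                w -= lr * grad
        else:
            grad = (X[i] @ w - y[i]) * X[i]       # possibly stale read of w
            w -= lr * grad                        # racy in-place update

X = np.random.randn(1000, 20)
w_true = np.random.randn(20)
y = X @ w_true
w = np.zeros(20)                                  # shared parameter vector
threads = [threading.Thread(target=sgd_worker, args=(w, X, y, 0.01, 5000))
           for _ in range(4)]
for t in threads: t.start()
for t in threads: t.join()
print("residual:", np.linalg.norm(w - w_true))
```

Swapping in a shared `threading.Lock()` as the `lock` argument yields the consistency-preserving end of the spectrum, making the throughput/consistency trade-off directly observable.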
ISBN (Digital): 9798350364606
ISBN (Print): 9798350364613
In recent years, key-value stores (KV stores) [1]-[3] have begun to gain popularity as storage engines for large-scale data applications. KV stores are fundamentally different from traditional SQL databases: with the key-value data model, they have various advantages, such as ease of use, flexibility, and higher performance. However, some essential features of SQL databases, most notably atomicity, consistency, isolation, and durability (ACID) transaction processing [4], are considered impractical and thus are generally not included in a KV store design. With recent advancements, these features have begun to be integrated into KV stores, giving birth to a brand new class of database management systems called NewSQL databases [5], [6]. The rationale behind NewSQL databases is that with ACID transaction support, KV stores can serve as storage engines for an upper-layer SQL query processor to handle SQL queries [7]-[9]. In this way, NewSQL databases can provide both the scalability of KV stores and the ACID guarantees required for online transaction processing (OLTP), thus providing high performance for different types of workloads in a large-scale data system.
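To make the layering concrete, here is a minimal sketch (not any production engine) of how a KV store's narrow get/put interface can back an upper layer that needs atomic multi-key commits; the write-buffer design and all names are illustrative assumptions.

```python
# Minimal sketch: a dict-backed KV store plus a transaction layer that
# buffers writes and applies them all-or-nothing, the kind of atomicity
# an upper-layer SQL processor requires.
class KVStore:
    def __init__(self):
        self._data = {}
    def get(self, key):
        return self._data.get(key)
    def put(self, key, value):
        self._data[key] = value
    def delete(self, key):
        self._data.pop(key, None)

class Transaction:
    def __init__(self, store):
        self.store, self.writes = store, {}
    def put(self, key, value):
        self.writes[key] = value           # staged, not yet visible
    def commit(self):
        for k, v in self.writes.items():   # single-threaded sketch: no
            self.store.put(k, v)           # isolation/durability machinery

db = KVStore()
txn = Transaction(db)
txn.put("user:1:name", "Ada")
txn.put("user:1:balance", 100)
txn.commit()
print(db.get("user:1:name"))  # -> Ada
```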
ISBN (Print): 9781665435741
Timely and efficient air traffic flow statistics play a significant role in improving the accuracy and intelligence of air traffic flow management (ATFM). The enormous spatio-temporal data collected by location-based services (LBS) intensely aggravate the burden of the statistical tasks. Traditional approaches to such tasks show their weakness in two respects: 1) they fail to capture the features of complicated three-dimensional, time-dependent airspace, and 2) they are not optimized to deal with large volumes of spatio-temporal data with high-dimensional features. Spatio-temporal range queries have advantages in identifying the eligible flow records. Therefore, exploring efficient distributed range query processing methods helps improve the performance of air traffic flow statistics and gain insights into the rationality of the air traffic. To analyze large-scale spatio-temporal aviation data efficiently, we propose two spatio-temporal range query MapReduce algorithms: 1) a spatio-temporal polygon range query, which aims to find all records within a polygonal location in a time interval, and 2) a spatio-temporal k nearest neighbors query, which directly searches the k closest neighbors of the query point. Moreover, we design an air traffic flow statistics strategy to accurately calculate traffic flow in arbitrary airspace based on real-world aviation trajectory datasets. The experimental results demonstrate that our algorithms outperform counterpart algorithms in answering spatio-temporal range queries, reducing average response time by 81%. The evaluation also demonstrates the effectiveness of our algorithms for air traffic flow statistics.
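As a single-node illustration of the filtering predicate behind the spatio-temporal polygon range query (the paper distributes this work with MapReduce), the sketch below combines a ray-casting point-in-polygon test with a time-interval check; the record format (lon, lat, timestamp) is an assumption.

```python
# Minimal sketch: the per-record test of a spatio-temporal polygon range
# query -- is the record inside the polygon AND within the time interval?
def point_in_polygon(x, y, polygon):
    """Ray-casting test: count crossings of a horizontal ray from (x, y)."""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        if (y1 > y) != (y2 > y):                     # edge spans the ray
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside

def st_polygon_range_query(records, polygon, t_start, t_end):
    """The 'map-side' filter: keep records inside the polygon whose
    timestamp falls in the query interval."""
    return [r for r in records
            if t_start <= r[2] <= t_end and point_in_polygon(r[0], r[1], polygon)]

flights = [(116.4, 39.9, 1000), (121.5, 31.2, 2000), (116.5, 39.8, 3000)]
airspace = [(116.0, 39.5), (117.0, 39.5), (117.0, 40.5), (116.0, 40.5)]
print(st_polygon_range_query(flights, airspace, 0, 2500))
# -> [(116.4, 39.9, 1000)]
```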
ISBN (Print): 9781665435772
We present a scalable parallel I/O system for a logical-inferencing application built atop a deductive database. Deductive databases can make logical deductions (i.e., conclude additional facts) based on a set of program rules, derived from facts already in the database. Datalog is a language, or family of languages, commonly used to specify rules and queries for a deductive database. Applications built using Datalog range from graph mining (such as computing transitive closure or k-cliques) to program analysis (control- and data-flow analysis). In our previous papers, we presented the first implementation of data-parallel Datalog built using MPI. In this paper, we present a parallel I/O system used to checkpoint and restart applications built on top of our Datalog system. State-of-the-art Datalog implementations, such as Souffle, only support serial I/O, mainly because the implementation itself does not support many-node parallel execution. Computing the transitive closure of a graph is one of the simplest logical-inferencing applications built using Datalog; we use it as a micro-benchmark to demonstrate the efficacy of our parallel I/O system. Internally, we use a nested B-tree data structure to facilitate fast and efficient in-memory access to relational data. Our I/O system therefore involves two steps: converting the application data layout (a nested B-tree) to a stream of bytes, followed by the actual parallel I/O. We explore two popular I/O techniques: POSIX I/O and MPI collective I/O. To extract performance from MPI collective I/O we use adaptive striping, and for POSIX I/O we use file-per-process I/O. We demonstrate the scalability of our system at up to 4,096 processes on the Theta supercomputer at Argonne National Laboratory.
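The following minimal sketch illustrates the two-step file-per-process POSIX checkpoint/restart pattern described above, using mpi4py as an assumed stand-in (the paper's system is MPI-based around a nested B-tree; here a plain Python set and pickle stand in for the relational data and the layout-to-byte-stream step).

```python
# Minimal sketch: checkpoint/restart with file-per-process POSIX I/O.
# Step 1 flattens the in-memory structure to bytes; step 2 does the I/O.
import pickle
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

def checkpoint(relation, prefix="ckpt"):
    """Serialize this rank's shard, then write it to the rank's own file."""
    payload = pickle.dumps(relation)          # data layout -> byte stream
    with open(f"{prefix}.{rank}", "wb") as f: # one file per process
        f.write(payload)

def restart(prefix="ckpt"):
    """Each rank reads back only its own shard; no inter-rank traffic."""
    with open(f"{prefix}.{rank}", "rb") as f:
        return pickle.loads(f.read())

# Toy shard of a binary relation (edge tuples) owned by this rank.
edges = {(rank, rank + 1), (rank, rank + 2)}
checkpoint(edges)
assert restart() == edges
```

The MPI collective I/O alternative would instead aggregate all ranks' byte streams into one shared file via collective write calls, which is where striping configuration becomes the dominant tuning knob.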
ISBN (Print): 9781450391993
Scientific computing applications generate enormous datasets that are continuously increasing exponentially in both complexity and volume, making their analysis, archival, and sharing one of the grand challenges of modern big data analytics. Supported by the rise of artificial intelligence and deep learning, such enormous datasets are becoming valuable resources even beyond their original scope, opening new opportunities to learn patterns and extract new knowledge at large scale, potentially without human intervention. However, this leads to an increasing complexity of the workflows that combine traditional HPC simulations with big data analytics and AI applications. An initial wave that opened this direction was the shift from compute-intensive to data-intensive computing, which saw several ideas from big data analytics (in-situ processing, shipping computations close to data, complex and dynamic workflows) fused with the tightly coupled patterns addressed by the AI and high performance computing ecosystems. In a quest to keep up with the complexity of the workflows, the design and operation of the infrastructures capable of running them efficiently at scale have evolved accordingly. Extreme heterogeneity at all levels (combinations of CPUs and accelerators, various types of memories, local storage and network links, parallel file systems and object stores, etc.) is now the norm. Ideas pioneered by cloud and edge computing (aspects related to elasticity, multi-tenancy, geo-distributed processing, stream computing) are also beginning to be adopted in the HPC ecosystem (containerized workflows, on-demand jobs to complement batch jobs, streaming of experimental data from instruments directly to supercomputers, etc.). Thus, modern scientific applications need to be integrated into an entire Compute Continuum, from the edge all the way to supercomputers and large data centers, using flexible infrastructures and middlewares. The 12th workshop on AI and Scientific Computing at
ISBN (Print): 9781665435772
Emerging HPC platforms are becoming more difficult to program as a result of systems with different node architectures, some with a small number of "fat" heterogeneous nodes (consisting of multiple accelerators) and others with a large number of "thin" homogeneous nodes consisting of multi-core CPUs connected with high-speed interconnects. New programming models are emerging to address performance portability of applications, as well as a set of scientific libraries that applications can use to exploit these architectures efficiently. To port applications to new architectures, developers need information about their source code, including static characteristics and dynamic (e.g., performance) data, to refactor the code, understand its data and code structure and library usage, and direct their optimisation efforts and key decisions. In this paper, we describe a tool that combines compiler and profiler information to query program characteristics in a given programming environment. Static and dynamic data about applications are collected and stored together in an SQL database that can later be queried to study application characteristics and patterns. We demonstrate the capabilities of this tool with an application-driven case study that aims at understanding application code and its use of scientific libraries via a real-world example from the molecular simulation application CP2K.
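The following sketch illustrates the core idea with a hypothetical schema (not the tool's actual one): static compiler facts and dynamic profile samples live in one SQL database, so a single join answers questions such as "which time-dominant functions call a given library routine?".

```python
# Minimal sketch: static call-graph facts and dynamic profile samples in
# one SQLite database, queried together. Schema and data are illustrative.
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE static_calls(caller TEXT, callee TEXT);  -- from the compiler
CREATE TABLE profile(func TEXT, seconds REAL);        -- from the profiler
INSERT INTO static_calls VALUES ('integrate', 'dgemm'), ('setup', 'fopen');
INSERT INTO profile VALUES ('integrate', 42.0), ('setup', 0.3);
""")
# Which time-dominant functions depend on the BLAS routine dgemm?
rows = con.execute("""
    SELECT p.func, p.seconds
    FROM profile p JOIN static_calls s ON p.func = s.caller
    WHERE s.callee = 'dgemm' AND p.seconds > 1.0
    ORDER BY p.seconds DESC
""").fetchall()
print(rows)  # -> [('integrate', 42.0)]
```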
ISBN (Print): 9781665432818
Data collected by sensors has hidden value that can be used to infer valuable knowledge about the system, such as identifying faults in transmission or malfunctions in various system components. Solutions for exploring and exploiting data need to be developed to extract such knowledge. This paper shows how the identification of transmission regularities can be used to extract knowledge about the overall system state. The focus of this work is defining a methodology for detecting transmission periodicity. In our approach, we evaluated existing strategies, addressed several of their limitations, and assessed their utility on real-world data. We further expand the scope by defining strategies for the identification of transmission gaps and duplicates. Finally, we validate the algorithms on samples of real industrial data obtained from monitoring different parts of home appliances.
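A minimal sketch of one way such a methodology can work from timestamps alone (the tolerance threshold and the median-based period estimate are illustrative assumptions, not the paper's exact method): estimate the dominant period from inter-arrival times, then flag gaps and duplicates against it.

```python
# Minimal sketch: detect transmission periodicity, gaps, and duplicates
# from raw message timestamps.
import statistics

def analyze_transmissions(timestamps, tol=0.1):
    ts = sorted(timestamps)
    deltas = [b - a for a, b in zip(ts, ts[1:])]
    period = statistics.median(deltas)          # robust estimate of the cycle
    gaps = [(a, b) for a, b, d in zip(ts, ts[1:], deltas)
            if d > (1 + tol) * period]          # missing transmissions
    duplicates = [b for b, d in zip(ts[1:], deltas)
                  if d < tol * period]          # near-simultaneous repeats
    return period, gaps, duplicates

# Sensor expected every ~10s: one gap (30 -> 60) and one duplicate at 70.1.
ts = [0, 10, 20, 30, 60, 70, 70.1, 80]
period, gaps, dups = analyze_transmissions(ts)
print(period, gaps, dups)  # -> 10 [(30, 60)] [70.1]
```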
ISBN (Print): 9781665440660
Today's high-performance computing (HPC) applications are producing vast volumes of data, which are challenging to store and transfer efficiently during execution, such that data compression is becoming a critical technique to mitigate the storage burden and data movement cost. Huffman coding is arguably the most efficient entropy coding algorithm in information theory, and it is a fundamental step in many modern compression algorithms such as DEFLATE. On the other hand, today's HPC applications increasingly rely on accelerators such as GPUs on supercomputers, while Huffman encoding suffers from low throughput on GPUs, resulting in a significant bottleneck in the entire data processing pipeline. In this paper, we propose and implement an efficient Huffman encoding approach based on modern GPU architectures, which addresses two key challenges: (1) how to parallelize the entire Huffman encoding algorithm, including codebook construction, and (2) how to fully utilize the high-memory-bandwidth feature of modern GPU architectures. The detailed contribution is fourfold. (1) We develop an efficient parallel codebook construction on GPUs that scales effectively with the number of input symbols. (2) We propose a novel reduction-based encoding scheme that can efficiently merge the codewords on GPUs. (3) We optimize the overall GPU performance by leveraging state-of-the-art CUDA APIs such as Cooperative Groups. (4) We evaluate our Huffman encoder thoroughly using six real-world application datasets on two advanced GPUs and compare with our multi-threaded Huffman encoder. Experiments show that our solution improves the encoding throughput by up to 5.0x and 6.8x on NVIDIA RTX 5000 and V100, respectively, over the state-of-the-art GPU Huffman encoder, and by up to 3.3x over the multi-threaded encoder on two 28-core Xeon Platinum 8280 CPUs.
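For reference, the sketch below shows the codebook-construction step in its classical serial, heap-based form, i.e., the baseline that GPU pipelines like the one above must parallelize; the function name is illustrative, and the GPU version replaces the sequential heap with parallel primitives.

```python
# Minimal serial reference: build the Huffman tree from symbol frequencies
# and derive per-symbol codewords (the codebook).
import heapq
from collections import Counter

def huffman_codebook(data):
    freq = Counter(data)
    # Heap entries: (frequency, tie-breaker, {symbol: partial codeword}).
    heap = [(f, i, {sym: ""}) for i, (sym, f) in enumerate(freq.items())]
    heapq.heapify(heap)
    if len(heap) == 1:                       # degenerate single-symbol input
        return {next(iter(freq)): "0"}
    tie = len(heap)
    while len(heap) > 1:                     # merge two least-frequent trees
        f1, _, left = heapq.heappop(heap)
        f2, _, right = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in left.items()}
        merged.update({s: "1" + c for s, c in right.items()})
        heapq.heappush(heap, (f1 + f2, tie, merged))
        tie += 1
    return heap[0][2]

book = huffman_codebook(b"abracadabra")
encoded = "".join(book[b] for b in b"abracadabra")
print(book, len(encoded), "bits")
```

This greedy merge loop is inherently sequential, which is precisely why parallel codebook construction on GPUs is a non-trivial contribution.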
ISBN (Digital): 9798331527211
ISBN (Print): 9798331527228
In this article we present PARSIR (PARallel SImulation Runner), a package that enables the effective exploitation of shared-memory multi-processor machines for running discrete event simulation models. PARSIR is a compile/run-time environment for discrete event simulation models developed with the C programming language. The architecture of PARSIR has been designed to keep the number of CPU cycles required for running models low. This is achieved via the combination of a set of techniques: 1) causally consistent batch-processing of simulation events at an individual simulation object for caching effectiveness; 2) high likelihood of disjoint-access parallelism; 3) favoring memory accesses on local NUMA (Non-Uniform Memory Access) nodes in the architecture, while still enabling well-balanced workload distribution via work-stealing from remote nodes; and 4) the use of RMW (Read-Modify-Write) machine instructions for fast access to the simulation engine data required by the worker threads for managing the concurrent simulation objects and distributing the workload. Furthermore, any architectural solution embedded in the PARSIR engine is fully transparent to the application-level code implementing the simulation model. We also provide experimental results showing the effectiveness of PARSIR when running the reference PHOLD benchmark on a NUMA shared-memory multi-processor machine equipped with 40 CPUs.
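The following toy sequential sketch illustrates only point 1 above, batch-processing the pending events of one simulation object so its state stays cache-resident instead of alternating between objects event by event; PARSIR's worker threads, NUMA placement, work-stealing, and RMW machinery are not modeled, and all names are illustrative.

```python
# Minimal sketch: drain up to `batch` timestamp-ordered events targeting
# the same simulation object in a row before moving to the next object.
import heapq

class SimObject:
    def __init__(self, oid):
        self.oid, self.state = oid, 0
    def handle(self, event_time, payload):
        self.state += payload               # touch object-local state

def run(objects, events, batch=8):
    """events: heap of (time, obj_id, payload) tuples. Popping only
    heap-minimum events preserves global timestamp order (a stand-in for
    causal consistency in this single-threaded sketch)."""
    heapq.heapify(events)
    while events:
        t, oid, p = heapq.heappop(events)
        objects[oid].handle(t, p)
        taken = 1
        while events and events[0][1] == oid and taken < batch:
            t, _, p = heapq.heappop(events)
            objects[oid].handle(t, p)       # same object: state stays hot
            taken += 1

objs = {0: SimObject(0), 1: SimObject(1)}
evs = [(1.0, 0, 5), (1.1, 0, 3), (2.0, 1, 7), (2.5, 0, 1)]
run(objs, evs)
print(objs[0].state, objs[1].state)  # -> 9 7
```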
With the tendency of running large-scale data-intensive applications on High-Performance Computing (HPC) systems, the I/O workloads of HPC storage systems are becoming more complex, such as the increasing metadata-int...