ISBN (print): 9781665440660
Lexical analysis, which converts input text into a list of tokens, plays an important role in many applications, including compilation and data extraction from texts. To recognize token patterns, a lexer incorporates a sequential computation model, the automaton, as its basic building component. As such, it is considered difficult to parallelize due to the inherent data dependency. Much work has been done to accelerate lexical analysis through parallel techniques. Unfortunately, existing attempts mainly rely on language-specific remedies for input segmentation, which makes language extension tricky and automatic lexer generation challenging. This paper presents Plex, an automated tool for generating parallel lexers from user-defined grammars. To overcome the inherent sequentiality, Plex applies a fast prescanning phase to collect context information prior to scanning. To reduce the overhead introduced by prescanning, Plex adopts a special automaton, derived from that of the scanner, to avoid backtracking behavior, and exploits data-parallel techniques. The evaluation on several languages shows that the prescanning overhead is small; consequently, Plex is scalable and achieves 9.8-11.5x speedups using 18 threads.
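The abstract does not detail Plex's prescanning automaton, but the general idea behind data-parallel scanning can be illustrated with the classic state-transition-map approach: each input chunk is independently summarized as a map from every possible DFA start state to the resulting end state, and those maps are then composed, which removes the sequential dependency between chunks. The toy two-state "inside/outside string literal" DFA below is purely illustrative, not Plex's actual automaton.

```python
# Minimal sketch of data-parallel scanning (illustrative, not Plex's algorithm):
# each chunk is summarized as a state -> state transition map; the maps can be
# built independently (in parallel) and composed left to right as a prefix scan.

# Toy DFA: state 0 = outside a string literal, state 1 = inside one.
def step(state, ch):
    if ch == '"':
        return 1 - state  # a quote toggles between outside and inside
    return state

def chunk_map(chunk):
    # Summarize a chunk as its effect on every possible start state.
    out = {}
    for s in (0, 1):
        cur = s
        for ch in chunk:
            cur = step(cur, ch)
        out[s] = cur
    return out

def compose(m1, m2):
    # Apply m1's transition first, then m2's.
    return {s: m2[m1[s]] for s in m1}

text = 'abc"def"ghi"jk'
chunks = [text[i:i + 4] for i in range(0, len(text), 4)]
maps = [chunk_map(c) for c in chunks]  # independent work, parallelizable
total = maps[0]
for m in maps[1:]:
    total = compose(total, m)
assert total[0] == 1  # three quotes in total: scanning ends inside a string
```

The composition step is associative, which is what makes a parallel prefix scan over the chunk maps possible; the number of start states to track is what a prescanning phase can help reduce.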
ISBN (print): 9781665435772
The increasing use of real-time data-intensive applications and the growing interest in heterogeneous architectures have led to the need for increasingly complex embedded computing systems. An example of this is the research carried out by both the scientific community and companies toward embedded multi-FPGA systems for implementing the inference phase of Convolutional Neural Networks. In this paper, we focus on optimizing the management system of these embedded FPGA-based distributed systems. We extend the state-of-the-art FARD framework to data-intensive applications in an embedded scenario. Our orchestration and management infrastructure benefits from a compiled language and is accessible to end users by means of Python APIs, which provide a simple way to interact with the cluster and design apps to run on the embedded nodes. The proposed prototype system consists of a PYNQ-based cluster of multiple FPGAs and has been evaluated by running an FPGA-based You Only Look Once (YOLO) image classification algorithm.
ISBN (print): 9781665435772
This paper revisits distributed termination detection algorithms in the context of High-Performance Computing (HPC) applications. We introduce an efficient variant of the Credit Distribution Algorithm (CDA) and compare it to the original algorithm (HCDA) as well as to its two primary competitors: the Four Counters algorithm (4C) and the Efficient Delay-Optimal Distributed algorithm (EDOD). We analyze the behavior of each algorithm for some simplified task-based kernels and show the superiority of CDA in terms of the number of control messages.
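The core invariant of credit-based termination detection can be sketched compactly: a controller starts with total credit 1, every spawned task carries a share of its parent's credit, and finished tasks return their remaining credit. Termination is detected exactly when the controller has recovered credit 1. The sequential simulation below is a simplified illustration of this invariant, not the paper's HCDA or its optimized CDA variant.

```python
# Simplified sketch of credit-based termination detection. Exact rational
# arithmetic (fractions) avoids the floating-point drift that would break
# the "sum of all credits == 1" invariant.
from fractions import Fraction
import collections

def run(tasks_spawned_by):
    # tasks_spawned_by: task name -> list of child tasks it spawns.
    recovered = Fraction(0)
    queue = collections.deque([("root", Fraction(1))])
    while queue:
        task, credit = queue.popleft()
        children = tasks_spawned_by.get(task, [])
        if children:
            # Split this task's credit evenly among itself and its children.
            share = credit / (len(children) + 1)
            for c in children:
                queue.append((c, share))
            credit -= share * len(children)
        recovered += credit  # task finishes and returns its remaining credit
    return recovered

# root spawns a and b; a spawns c. All credit must come back, exactly.
assert run({"root": ["a", "b"], "a": ["c"]}) == 1
```

The attraction of this family of algorithms is that detection requires no extra probing rounds: the controller only has to compare the recovered credit against 1.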
ISBN (print): 9781665440660
This paper presents a workflow for synthesizing near-optimal FPGA implementations of structured-mesh-based stencil applications for explicit solvers. It leverages key characteristics of the application class, its computation-communication pattern, and the architectural capabilities of the FPGA to accelerate solvers for high-performance computing applications. Key new features of the workflow are (1) the unification of standard state-of-the-art techniques with a number of high-gain optimizations, such as batching and spatial blocking/tiling, motivated by increasing throughput for real-world workloads, and (2) the development and use of a predictive analytical model to explore the design space and obtain resource and performance estimates. Three representative applications are implemented using the design workflow on a Xilinx Alveo U280 FPGA, demonstrating near-optimal performance and over 85% predictive model accuracy. These are compared with equivalent highly optimized implementations of the same applications on modern HPC-grade GPUs (Nvidia V100), analyzing time to solution, bandwidth, and energy consumption. Performance results indicate runtimes comparable with the V100 GPU, with over 2x energy savings for the largest non-trivial application on the FPGA. Our investigation shows the challenges of achieving high performance on current-generation FPGAs compared to traditional architectures. We discuss determinants for a given stencil code to be amenable to FPGA implementation, providing insights into the feasibility and profitability of a design and its resulting performance.
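Spatial blocking/tiling, one of the optimizations the workflow unifies, can be shown in its simplest software form: sweeping the grid tile by tile so each tile's working set fits in fast local memory (on an FPGA, BRAM). The 1D three-point stencil below is a generic illustration of the transformation, not the paper's pipelined FPGA datapath.

```python
# Sketch of spatial blocking/tiling for a 1D three-point averaging stencil.
# The tiled version processes the interior in fixed-size tiles; since it reads
# from the input array u and writes to out, it is numerically identical to the
# untiled reference sweep.

def stencil_tiled(u, tile=4):
    n = len(u)
    out = u[:]  # boundary values are kept unchanged
    for start in range(1, n - 1, tile):
        end = min(start + tile, n - 1)
        for i in range(start, end):  # one tile's worth of work
            out[i] = (u[i - 1] + u[i] + u[i + 1]) / 3.0
    return out

def stencil_ref(u):
    # Untiled reference: one pass over the whole interior.
    return ([u[0]]
            + [(u[i - 1] + u[i] + u[i + 1]) / 3.0 for i in range(1, len(u) - 1)]
            + [u[-1]])

u = [float(i % 5) for i in range(16)]
assert stencil_tiled(u) == stencil_ref(u)
```

On real hardware the payoff comes from reuse: each tile's neighbors are loaded once into local memory instead of being re-fetched from external memory for every output point.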
ISBN (print): 9781665440660
One-way Wave Equation Migration (OWEM) is a classic seismic imaging method offering a good trade-off between quality and compute cost in most geological cases. In recent years, GPU-based heterogeneous architectures have gained popularity for seismic imaging. In this paper, we present a generic design for asynchronous processing and data management. By applying this design, we present an efficient GPU implementation of OWEM combining OpenACC and CUDA. Our approach improves upon classic designs by exploiting asynchronous compute and data transfer between CPU and GPU over high-speed NVLink, completely masking the cost of MPI communications and I/O. Using 3,018 GPUs, our fine-tuned OWEM can process 11,172 seismic shots in less than 75 minutes. By tuning CPU and GPU clock frequencies, we achieve around 30% energy savings with only a 4% loss of performance on the PANGEA III supercomputer. We believe our design combined with the energy-aware tuning will benefit many GPU applications.
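The overlap pattern behind "completely masking" transfer and I/O costs is double buffering: while the device computes on one staged input, the next input is transferred into a second buffer. The sequential sketch below only shows the scheduling structure (in a real CUDA/OpenACC implementation the `transfer` call on the next item would run concurrently with `compute` on the current one, e.g. in a separate stream); the function names are illustrative, not from the paper.

```python
# Sketch of a double-buffered pipeline: stage item i+1 while computing on
# item i, so transfer time hides behind compute time. Here the two steps run
# sequentially; the structure is what matters.

def pipeline(inputs, transfer, compute):
    results = []
    staged = transfer(inputs[0])          # fill the first buffer
    for nxt in inputs[1:]:
        next_staged = transfer(nxt)       # would overlap with compute below
        results.append(compute(staged))
        staged = next_staged              # swap buffers
    results.append(compute(staged))       # drain the last buffer
    return results

assert pipeline([1, 2, 3], lambda x: x * 10, lambda x: x + 1) == [11, 21, 31]
```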
ISBN (print): 9781665440660
Performance of applications in production environments can be sensitive to network congestion. Cray Aries supports adaptively routing each network packet independently based on the load or congestion encountered as a packet traverses the network. Software can dictate different routing policies, adjusting between minimal and non-minimal bias, for each posted message. We have extensively evaluated the sensitivity of application performance, as well as whole-system performance, to the routing bias selection in both production and controlled conditions. We show that the default routing bias used in Aries-based systems is often sub-optimal and that using a higher bias toward minimal routes not only reduces the congestion effects on the application but also decreases the overall congestion on the network. This routing scheme results not only in improved mean performance (by up to 12%) for most production applications but also in reduced run-to-run variability. Our study prompted two supercomputing facilities (ALCF and NERSC) to change the default routing mode on their Aries-based systems. We present the substantial improvement measured in overall congestion management and interconnect performance in production after making this change.
ISBN (print): 9781665440660
Conventional High-Level Synthesis (HLS) tools exploit parallelism mostly at the instruction level (ILP). They statically schedule the input specifications and build centralized Finite State Machine (FSM) controllers. However, aggressive exploitation of ILP has diminishing returns in many applications, and centralized approaches usually do not efficiently exploit coarser parallelism, because FSMs are inherently serial. In this paper we present an HLS framework able to synthesize applications that, besides ILP, also expose Task-Level Parallelism (TLP). An application can expose TLP through annotations that identify the parallel functions (i.e., tasks). To generate accelerators that efficiently execute concurrent tasks, we need to solve several issues: devise a mechanism to support concurrent execution flows, exploit memory parallelism, and manage synchronization. To support concurrent execution flows, we introduce a novel adaptive controller. The adaptive controller is composed of a set of interacting control elements that independently manage the execution of a task. These control elements check dependencies and resource constraints at runtime, enabling execution as soon as possible. To support parallel access to shared memories and synchronization, we integrate a novel Hierarchical Memory Interface (HMI). With respect to previous solutions, the proposed interface supports multi-ported memories and atomic memory operations, which commonly occur in parallel programming. Our framework can generate the hardware implementation of C functions through two different approaches, depending on the function's characteristics. If a function exposes TLP, the framework generates hardware implementations based on the adaptive controller. Otherwise, the framework implements the function through the FSM approach, which is optimized for ILP exploitation. We evaluate our framework on a set of parallel applications and show substantial performance improvements (average speedup of 4.
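The contrast between a fixed FSM schedule and the adaptive controller's runtime dependency checking can be illustrated in software: instead of firing tasks in a precomputed order, each task starts as soon as all of its predecessors have completed. The scheduler below is a loose software analogue of that idea (the hardware control elements are distributed, not a central loop), with illustrative task names.

```python
# Sketch of runtime dependency checking: a task becomes "ready" the moment
# all of its predecessors are done, rather than at a statically scheduled
# FSM state. Assumes the dependency graph is acyclic.

def schedule(deps):
    # deps: task -> set of tasks it depends on.
    done, order = set(), []
    pending = set(deps)
    while pending:
        ready = sorted(t for t in pending if deps[t] <= done)
        for t in ready:            # in hardware, these run concurrently
            order.append(t)
            done.add(t)
        pending -= set(ready)
    return order

deps = {"a": set(), "b": set(), "c": {"a", "b"}, "d": {"a"}}
order = schedule(deps)
# "c" may only start after both "a" and "b" have finished.
assert order.index("c") > order.index("a")
assert order.index("c") > order.index("b")
```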
ISBN (print): 9781665435741
Timely and efficient air traffic flow statistics play a significant role in improving the accuracy and intelligence of air traffic flow management (ATFM). The enormous spatio-temporal data collected by location-based services (LBS) greatly aggravate the burden of the statistical tasks. Traditional approaches to these tasks show weakness in two respects: 1) they fail to capture the features of complicated three-dimensional, time-dependent airspace, and 2) they are not optimized to deal with large-volume spatio-temporal data covering high-dimensional features. Spatio-temporal range queries have advantages in computing the eligible flow records. Therefore, exploring the efficiency of distributed range query processing methods helps improve the performance of air traffic flow statistics and gain insights into the rationality of the air traffic. To analyze large-scale spatio-temporal aviation data efficiently, we propose two spatio-temporal range query MapReduce algorithms: 1) a spatio-temporal polygon range query, which aims to find all records from a polygonal location in a time interval, and 2) a spatio-temporal k-nearest-neighbors query, which directly searches the k closest neighbors of the query point. Moreover, we design an air traffic flow statistics strategy to accurately calculate traffic flow in arbitrary airspace based on real-world aviation trajectory datasets. The experimental results demonstrate that our algorithms answer spatio-temporal range queries better than counterpart algorithms, reducing the average response time by 81%. The evaluation also proves the effectiveness of our algorithms for air traffic flow statistics.
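The per-record predicate at the heart of the polygon range query can be sketched directly: a record qualifies if its timestamp falls in the query interval and its position lies inside the query polygon (here via standard ray casting). In the paper's setting this filter would run as the map phase over partitioned trajectory records; the record layout below is an assumption for illustration.

```python
# Sketch of the filter step of a spatio-temporal polygon range query:
# time-interval check plus ray-casting point-in-polygon test.

def in_polygon(pt, poly):
    # Ray casting: count crossings of a horizontal ray from pt to the right.
    x, y = pt
    inside = False
    for (x1, y1), (x2, y2) in zip(poly, poly[1:] + poly[:1]):
        if (y1 > y) != (y2 > y):  # edge straddles the ray's y-level
            if x < (x2 - x1) * (y - y1) / (y2 - y1) + x1:
                inside = not inside
    return inside

def range_query(records, poly, t0, t1):
    # Records are assumed to carry position (x, y) and timestamp t.
    return [r for r in records if t0 <= r["t"] <= t1
            and in_polygon((r["x"], r["y"]), poly)]

square = [(0, 0), (10, 0), (10, 10), (0, 10)]
recs = [{"x": 5, "y": 5, "t": 3},    # inside polygon, inside interval
        {"x": 15, "y": 5, "t": 3},   # outside polygon
        {"x": 5, "y": 5, "t": 99}]   # outside time interval
assert range_query(recs, square, 0, 10) == [{"x": 5, "y": 5, "t": 3}]
```

The distributed versions mainly add spatial/temporal partitioning so that only the partitions overlapping the query region need to evaluate this predicate.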
ISBN (print): 9781665435772
Many science and industry IoT applications necessitate data processing across the edge-to-cloud continuum to meet performance, security, cost, and privacy requirements. However, this requires diverse abstractions and infrastructures for managing resources and tasks across the continuum. We propose Pilot-Edge as a common abstraction for resource management across the edge-to-cloud continuum. Pilot-Edge is based on the pilot abstraction, which decouples resource and workload management, and provides a Function-as-a-Service (FaaS) interface for application-level tasks. The abstraction allows applications to encapsulate common functions in high-level tasks that can then be configured and deployed across the continuum. We characterize Pilot-Edge on geographically distributed infrastructures using machine learning workloads (e.g., k-means and auto-encoders). Our experiments demonstrate how Pilot-Edge manages distributed resources and allows applications to evaluate task placement based on multiple factors (e.g., model complexities, throughput, and latency).
ISBN (print): 9781665440660
Empirical performance modeling is a proven instrument for analyzing the scaling behavior of HPC applications. Using a set of smaller-scale experiments, it can provide important insights into application behavior at larger scales. Extra-P is an empirical modeling tool that applies linear regression to automatically generate human-readable performance models. Similar to other regression-based modeling techniques, the accuracy of the models created by Extra-P decreases as the amount of noise in the underlying data increases. This is why the performance variability observed in many contemporary systems can become a serious challenge. In this paper, we introduce a novel adaptive modeling approach that makes Extra-P more noise-resilient, exploiting the ability of deep neural networks to discover the effects of numerical parameters, such as the number of processes or the problem size, on performance when dealing with noisy measurements. Using synthetic analysis and data from three different case studies, we demonstrate that our solution improves model accuracy at high noise levels by up to 25% while increasing predictive power by about 15%.
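The regression at the core of this style of empirical modeling can be illustrated with its simplest instance: fitting a single power-law term, runtime ≈ c · pᵃ, by least squares in log-log space. This tiny sketch captures only one term of the model families tools like Extra-P search over, and the synthetic measurements are noise-free for clarity.

```python
# Minimal sketch of regression-based empirical performance modeling:
# fit runtime = c * p^a by ordinary least squares on log-transformed data,
# since log(t) = log(c) + a * log(p) is linear in log(p).
import math

def fit_power_law(ps, times):
    xs = [math.log(p) for p in ps]
    ys = [math.log(t) for t in times]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    a = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))   # slope = exponent
    c = math.exp(my - a * mx)                # intercept = log of the constant
    return c, a

ps = [2, 4, 8, 16, 32]                       # process counts
times = [3.0 * p ** 1.5 for p in ps]         # synthetic, noise-free runtimes
c, a = fit_power_law(ps, times)
assert abs(a - 1.5) < 1e-9 and abs(c - 3.0) < 1e-6
```

With noisy measurements the recovered exponent degrades, which is exactly the failure mode the paper's noise-resilient extension targets.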