ISBN (Print): 9781509056026
Convolutional neural networks (CNNs) have gained great success in various computer vision applications. However, state-of-the-art CNN models are computation-intensive and hence are mainly processed on high-performance processors such as server CPUs and GPUs. Owing to the advantages of high performance, energy efficiency and reconfigurability, Field-Programmable Gate Arrays (FPGAs) have been widely explored as CNN accelerators. In this paper, we propose parallel structures to exploit the inherent parallelism and efficient computation units to perform the operations in convolutional and fully-connected layers. Further, an automatic generator is proposed that produces Verilog HDL source code from a high-level hardware description. Execution time, DSP consumption and performance are analytically modeled in terms of a few critical design variables. We demonstrate the automatic methodology by implementing two representative CNNs (LeNet and AlexNet) and evaluate the execution time models by comparing estimated and measured values. Our results show that the proposed automatic methodology yields hardware designs with good performance and greatly reduces development turnaround time.
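As a rough illustration of the kind of analytical modeling this abstract mentions (not the authors' actual equations), the sketch below estimates cycle count, latency, and DSP usage for a single convolutional layer from a few tiling/unrolling design variables; the parameter names and the one-MAC-per-DSP assumption are illustrative only.

```python
import math

def conv_layer_estimate(out_h, out_w, out_ch, in_ch, k,
                        unroll_out, unroll_in, freq_mhz):
    """Back-of-the-envelope model for one convolutional layer on an FPGA.

    Assumes an unroll_out x unroll_in array of multiply-accumulate units,
    one MAC per DSP slice per cycle, and ignores off-chip memory time.
    """
    cycles = (math.ceil(out_ch / unroll_out) * math.ceil(in_ch / unroll_in)
              * out_h * out_w * k * k)
    latency_ms = cycles / (freq_mhz * 1e3)      # MHz -> cycles per millisecond
    dsp_count = unroll_out * unroll_in          # one DSP per parallel MAC (assumed)
    return cycles, latency_ms, dsp_count

# Example: an AlexNet-like first layer with a 32x16 PE array at 200 MHz.
print(conv_layer_estimate(55, 55, 96, 3, 11, 32, 16, 200))
```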
ISBN (Print): 9781467389471
Data-intensive queries are common in business intelligence, data warehousing and analytics applications. Typically, processing a query involves full inspection of large in-storage data sets by CPUs. An intuitive way to speed up such queries is to reduce the volume of data transferred over the storage network to a host system. This can be achieved by filtering out extraneous data within the storage, motivating a form of near-data processing. This work presents Biscuit, a novel near-data processing framework designed for modern solid-state drives. It allows programmers to write a data-intensive application to run on the host system and the storage system in a distributed, yet seamless manner. In order to offer a high-level programming model, Biscuit builds on the concept of data flow. Data processing tasks communicate through typed and data-ordered ports. Biscuit does not distinguish tasks that run on the host system and the storage system. As a result, Biscuit has desirable traits like generality and expressiveness, while promoting code reuse and naturally exposing concurrency. We implement Biscuit on a host system that runs the Linux OS and a high-performance solid-state drive. We demonstrate the effectiveness of our approach and implementation with experimental results. When data filtering is done by hardware in the solid-state drive, the average speed-up obtained for the top five queries of TPC-H is over 15x.
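To make the dataflow-with-ports idea concrete, here is a minimal conceptual sketch of tasks communicating through typed, order-preserving ports; the class and function names are invented for illustration and do not reflect Biscuit's actual API.

```python
import threading
from queue import Queue

class Port:
    """A typed, data-ordered channel between two tasks (illustration only)."""
    def __init__(self, dtype):
        self.dtype = dtype
        self._q = Queue()               # FIFO: preserves data ordering

    def put(self, item):
        assert item is None or isinstance(item, self.dtype), "port is typed"
        self._q.put(item)

    def get(self):
        return self._q.get()

def filter_task(in_port, out_port, predicate):
    """A task that forwards only matching records; whether it runs on the
    host or inside the drive is irrelevant to the programming model."""
    while (rec := in_port.get()) is not None:
        if predicate(rec):
            out_port.put(rec)
    out_port.put(None)                  # propagate end-of-stream marker

# Wiring and a tiny run: filter records "near the data", consume on the host.
src, dst = Port(dict), Port(dict)
threading.Thread(target=filter_task,
                 args=(src, dst, lambda r: r["price"] > 100)).start()
for rec in ({"price": 50}, {"price": 150}):
    src.put(rec)
src.put(None)
while (rec := dst.get()) is not None:
    print(rec)                          # only {'price': 150} reaches the host
```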
Node architectures of extreme-scale systems are rapidly increasing in complexity. Emerging homogeneous and heterogeneous designs provide massive multi-level parallelism, but developing efficient runtime systems and middleware that allow applications to efficiently and productively exploit these architectures is extremely challenging. Moreover, current state-of-the-art approaches may become unworkable once energy consumption, resilience, and data movement constraints are added. The goal of this workshop is to attract the international research community to share new and bold ideas that will address the challenges of design, implementation, deployment, and evaluation of future runtime systems and middleware.
ISBN (Print): 9781509036837
Distributed embedded systems are increasingly prevalent in numerous applications, and with pervasive network access within these systems, security is also a critical design concern. In this paper, we present a modeling and optimization framework for distributed reconfigurable embedded systems, which maps tasks on a distributed embedded system with the goal of optimizing latency, energy, and/or security across all computing and communication levels. The proposed modeling framework for dataflow applications integrates models for computational latency, security levels for inter-task and intra-task communication, communication latency, and power consumption. We evaluate the proposed methodology using a video-based object detection and tracking application.
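The sketch below shows one plausible shape for the kind of combined cost such a mapping framework might optimize; the data layout, weights, and the linear combination itself are assumptions for illustration, not the paper's model.

```python
def mapping_cost(mapping, tasks, nodes, links,
                 w_lat=1.0, w_energy=0.5, w_sec=0.2):
    """Score one task-to-node mapping (illustrative, not the paper's model).

    tasks[t] = {"cycles": ..., "succ": {successor_task: bytes_sent, ...}}
    nodes[n] = {"freq": Hz, "power": W}
    links[(n1, n2)] = {"bw": bytes/s, "crypto_overhead": factor, "sec_level": int}
    """
    latency = energy = security = 0.0
    for t, n in mapping.items():
        comp_time = tasks[t]["cycles"] / nodes[n]["freq"]
        latency += comp_time
        energy += comp_time * nodes[n]["power"]
        for t2, nbytes in tasks[t]["succ"].items():
            n2 = mapping[t2]
            if n2 != n:                              # inter-node message
                link = links[(n, n2)]
                latency += nbytes / link["bw"] * link["crypto_overhead"]
                security += link["sec_level"]
    # Lower cost is better; higher security levels reduce the cost.
    return w_lat * latency + w_energy * energy - w_sec * security

# Example: two tasks of a detection/tracking pipeline on two nodes.
tasks = {"detect": {"cycles": 2e8, "succ": {"track": 5e5}},
         "track":  {"cycles": 1e8, "succ": {}}}
nodes = {"cam": {"freq": 6e8, "power": 2.0}, "hub": {"freq": 1.2e9, "power": 5.0}}
links = {("cam", "hub"): {"bw": 1e7, "crypto_overhead": 1.3, "sec_level": 2}}
print(mapping_cost({"detect": "cam", "track": "hub"}, tasks, nodes, links))
```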
ISBN (Print): 9781509036837
Summary form only given. Emerging real-world graph problems include: detecting community structure in large social networks; improving the resilience of the electric power grid; and detecting and preventing disease in human populations. Unlike traditional applications in computational science and engineering, solving these problems at scale often raises new challenges because of the sparsity and lack of locality in the data, the need for additional research on scalable algorithms and development of frameworks for solving these problems on high performance computers, and the need for improved models that also capture the noise and bias inherent in the torrential data streams. In this talk, I will discuss opportunities and challenges in massive data-intensive computing for applications in computational science and engineering.
ISBN (Print): 9781509021413
One of the factors that limits the scale, performance, and sophistication of distributed applications is the difficulty of concurrently executing them on multiple distributed computing resources. In part, this is due to a poor understanding of the general properties and performance of the coupling between applications and dynamic resources. This paper addresses this issue by integrating abstractions representing distributed applications, resources, and execution processes into a pilot-based middleware. The middleware provides a platform that can specify distributed applications, execute them on multiple resources and in different configurations, and is instrumented to support investigative analysis. We analyzed the execution of distributed applications using experiments that measure the benefits of using multiple resources, the late binding of scheduling decisions, and the use of backfill scheduling.
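A toy sketch of the late-binding idea measured here: tasks are bound to a concrete resource only when a pilot has a free slot, rather than at submission time. The Pilot class and the selection rule are invented for illustration and are not the middleware's API.

```python
class Pilot:
    """A placeholder job holding compute 'slots' on one resource (illustrative)."""
    def __init__(self, resource, slots):
        self.resource, self.free = resource, slots

def late_bind(tasks, pilots):
    """Assign each task to a pilot only when the task is ready to run,
    picking the pilot with the most free slots (a simple backfill-like rule)."""
    schedule = []
    for task in tasks:                      # tasks arrive in submission order
        pilot = max(pilots, key=lambda p: p.free)
        if pilot.free == 0:
            schedule.append((task, None))   # would wait in a real system
            continue
        pilot.free -= 1
        schedule.append((task, pilot.resource))
    return schedule

pilots = [Pilot("cluster_a", 2), Pilot("cluster_b", 1)]
print(late_bind(["t1", "t2", "t3", "t4"], pilots))
```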
ISBN (Print): 9781509021413
Parallel design patterns have been developed to help programmers efficiently design and implement parallel applications. However, identifying a suitable parallel pattern for a specific code region in a sequential application is a difficult task. Transforming an application according to the support structures applicable to these parallel patterns is also very challenging. In this paper, we present a novel approach to automatically find parallel patterns in the algorithm structure design space of sequential applications. In our approach, we classify code blocks in a region according to the appropriate support structure of the detected pattern. This classification eases the transformation of a sequential application into its parallel version. We evaluated our approach on 17 applications from four different benchmark suites. Our method identified suitable algorithm structure patterns in the sequential applications. We confirmed our results by comparing them with the existing parallel versions of these applications. We also implemented the patterns we detected in cases in which parallel implementations were not available and achieved speedups of up to 14x.
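As an illustration of what exploiting a detected pattern looks like (not the paper's detection algorithm), the snippet below takes a loop with independent iterations, a candidate for a task-parallelism pattern in the algorithm structure design space, and implements it with a master/worker-style support structure.

```python
from multiprocessing import Pool

def transform(pixel):
    # Loop body with no cross-iteration dependence: a candidate for a
    # task-parallelism pattern in the algorithm structure design space.
    return pixel * 2 + 1

def sequential(data):
    return [transform(p) for p in data]

def parallel(data, workers=4):
    # Master/worker support structure realizing the detected pattern.
    with Pool(workers) as pool:
        return pool.map(transform, data)

if __name__ == "__main__":
    assert sequential(range(8)) == parallel(range(8))
```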
ISBN (Print): 9781509024520
The proceedings contain 100 papers. The topics discussed include: Tyrex: size-based resource allocation in MapReduce frameworks; demand-aware power management for power-constrained HPC systems; service level and performance aware dynamic resource allocation in overbooked data centers; SHMEMPMI - shared memory based PMI for improved performance and scalability; DiBA: distributed power budget allocation for large-scale computing clusters; KOALA-F: a resource manager for scheduling frameworks in clusters; in-memory caching orchestration for Hadoop; increasing the performance of data centers by combining remote GPU virtualization with Slurm; CVSS: a cost-efficient and QoS-aware video streaming using cloud services; AMRZone: a runtime AMR data sharing framework for scientific applications; evaluation of in-situ analysis strategies at scale for power efficiency and scalability; a distributed system for storing and processing data from earth-observing satellites: system design and performance evaluation of the visualisation tool; and infrastructure cost comparison of running web applications in the cloud using AWS lambda and monolithic and microservice architectures.
ISBN (Print): 9781509052523
This paper presents an efficient strategy to implement parallel and distributed computing for image processing on a neuromorphic platform. We use SpiNNaker, a many-core neuromorphic platform inspired by neural connectivity in the brain, to achieve fast response and low power consumption. Our proposed method is based on fault-tolerant fine-grained parallelism that uses SpiNNaker resources optimally for process pipelining and decoupling. We demonstrate that our method can achieve a performance of up to 49.7 MP/J for the Sobel edge detector, and can process 1600 x 1200 pixel images at 697 fps. Using a simulated Canny edge detector, our method can achieve a performance of up to 21.4 MP/J. Moreover, the framework can be extended further by using larger SpiNNaker machines. This will be very useful for applications such as energy-aware and time-critical mission robotics as well as very high resolution computer vision systems.
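For reference, the sketch below gives a plain NumPy/SciPy version of the Sobel kernel being accelerated, together with the arithmetic implied by the reported figures (hedged: it assumes both numbers refer to the same configuration); it is not the SpiNNaker implementation.

```python
import numpy as np
from scipy.ndimage import convolve

# Standard 3x3 Sobel kernels for horizontal and vertical gradients.
KX = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], dtype=np.float32)
KY = KX.T

def sobel(gray):
    """Gradient magnitude of a grayscale image (reference version of the
    kernel that the SpiNNaker pipeline accelerates)."""
    gx = convolve(gray, KX)
    gy = convolve(gray, KY)
    return np.hypot(gx, gy)

# Arithmetic behind the reported figures: a 1600 x 1200 frame is 1.92 MP,
# so 697 fps is about 1338 MP/s; if the 49.7 MP/J figure applied to the same
# run, that would correspond to roughly 1338 / 49.7, i.e. about 27 W.
frame = np.random.rand(1200, 1600).astype(np.float32)
edges = sobel(frame)
```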