OpenCL programmability combined with OpenCL high-level synthesis (OpenCL-HLS) tools has brought tremendous improvements to the reconfigurable computing field. The inherent pipelined parallelism of FPGAs provides not only faster execution times but also power-efficient solutions when executing massively parallel applications. A major execution bottleneck affecting FPGA performance is the high number of memory stalls exposed to the pipelined data-path, which hinders the benefits of data-path [...]. This paper explores the efficiency of the “OpenCL Pipe” in hiding memory access latency on cloud FPGAs by decoupling memory access from computation. The Pipe semantic is leveraged to split OpenCL kernels into “read”, “compute” and “write back” sub-kernels, which work concurrently to overlap the computation of current threads with the memory access of future threads. For evaluation, we use a mix of seven massively parallel high-performance applications from the Rodinia suite v3.1. All our tests are conducted on the Xilinx VU9P FPGA platform of the Amazon AWS EC2 F1 cloud instance. On average, we observe a 5.2x speedup and a 2.2x increase in memory bandwidth utilization, at the cost of about a 2.5x increase in FPGA resource utilization, over the baseline synthesis (Xilinx OpenCL-HLS). This work has been funded and supported by the Xilinx University Program (XUP).
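The read / compute / write-back decoupling described above can be illustrated with a small host-side analogy. This is only a sketch in Python, not OpenCL kernel code and not the authors' implementation: bounded queues stand in for OpenCL pipes, threads stand in for the three sub-kernels, and every name (`read_stage`, `compute_stage`, `write_stage`, `run_pipeline`, the `depth` parameter) and the per-element computation are assumptions made for illustration.

```python
import queue
import threading

# Unique end-of-stream marker (a hypothetical convention for this sketch).
END = object()


def read_stage(memory, out_pipe):
    """Stream operands from 'global memory' into the first pipe."""
    for value in memory:
        out_pipe.put(value)  # blocks only when the pipe is full
    out_pipe.put(END)


def compute_stage(in_pipe, out_pipe):
    """Consume operands as they arrive; this stage never touches memory."""
    while True:
        value = in_pipe.get()
        if value is END:
            out_pipe.put(END)
            return
        out_pipe.put(value * value + 1)  # placeholder computation


def write_stage(in_pipe, result):
    """Drain computed values back to 'global memory'."""
    while True:
        value = in_pipe.get()
        if value is END:
            return
        result.append(value)


def run_pipeline(memory, depth=4):
    """Run the three stages concurrently, connected by bounded FIFOs."""
    read_to_compute = queue.Queue(maxsize=depth)
    compute_to_write = queue.Queue(maxsize=depth)
    result = []
    stages = [
        threading.Thread(target=read_stage, args=(memory, read_to_compute)),
        threading.Thread(target=compute_stage,
                         args=(read_to_compute, compute_to_write)),
        threading.Thread(target=write_stage, args=(compute_to_write, result)),
    ]
    for stage in stages:
        stage.start()
    for stage in stages:
        stage.join()
    return result
```

Because each stage blocks only on its own pipe, the reader can run ahead of the compute stage by up to `depth` elements, which is the overlap of future memory accesses with current computation that the abstract refers to.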
ISBN (digital): 9781538661000
ISBN (print): 9781538661000
Power reduction and speed-up of image processing algorithms remain of high interest as image resolutions continue to increase. Neuromorphic circuits are inspired by the nervous system and aim to reduce power consumption and increase speed. This paper presents a neuromorphic smart image sensor built on a pixel-parallel 3D hierarchical architecture with an on-chip attention module. The module dynamically detects regions with relevant information and produces a feedback path to sample those regions at high speed. By sampling non-relevant regions at low speed, the sensor reduces redundancy and enables high-performance computing while keeping power consumption low. The image sensor comprises several hierarchical planes, each containing small, independent reconfigurable computational units (XPUs). In each plane, all XPUs operate in parallel at different operating speeds, which yields a pixel-parallel architecture. As the raw image passes through the hierarchical planes, the necessary image processing algorithms are performed in parallel on different planes at variable clock rates to save power and reduce redundancy. The goal of this work is to prototype a focal-plane image sensor that emulates features of the brain. The results show that the prototype achieves remarkable power savings and speed-up at different stages.
ISBN (print): 9781538664810
We evaluate the performance of a Deep Convolutional Neural Network in grading the severity of prenatal hydronephrosis (PHN), one of the most common congenital urological anomalies, from renal ultrasound images. We present results on a variety of classification tasks based on clinically defined grades of severity, including predictions of whether or not an ultrasound image represents a case that is at high risk for further complications requiring surgical intervention with approximately 80% accuracy. The prediction rates obtained by the model are well beyond the rates of agreement among trained clinicians, suggesting that this work can lead to a useful diagnostic aid.
In this paper, we present a matrix assembly technique for arbitrary polynomial order finite element simulations on simplex meshes for graphics processing units (GPU). Compared to the current state of the art in GPU-based matrix assembly, we avoid the need for an intermediate sparse matrix and perform assembly directly into the final, GPU-optimized data structure. Thereby, we avoid the resulting 180% to 600% memory overhead, depending on polynomial order, and associated allocation time, while simplifying the assembly code and using a more compact mesh representation. We compare our method with existing algorithms and demonstrate significant speedups.
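The idea of assembling directly into the final sparse structure, without an intermediate sparse matrix, can be sketched on the CPU as follows. This is a generic illustration using the common CSR layout, not the paper's GPU-optimized data structure or algorithm. The function names and the `local_matrix` callback are hypothetical.

```python
from bisect import bisect_left


def build_csr_pattern(num_nodes, elements):
    """Derive CSR row pointers and sorted column indices from connectivity."""
    neighbours = [set() for _ in range(num_nodes)]
    for elem in elements:
        for i in elem:
            neighbours[i].update(elem)
    row_ptr = [0]
    col_idx = []
    for cols in neighbours:
        col_idx.extend(sorted(cols))
        row_ptr.append(len(col_idx))
    return row_ptr, col_idx


def assemble_direct(num_nodes, elements, local_matrix):
    """Scatter-add each element matrix straight into the CSR value array."""
    row_ptr, col_idx = build_csr_pattern(num_nodes, elements)
    values = [0.0] * len(col_idx)
    for elem in elements:
        k_local = local_matrix(elem)  # dense matrix for this element
        for a, i in enumerate(elem):
            start, end = row_ptr[i], row_ptr[i + 1]
            for b, j in enumerate(elem):
                # Column indices are sorted within a row, so binary search
                # finds the slot without any intermediate triplet list.
                pos = bisect_left(col_idx, j, start, end)
                values[pos] += k_local[a][b]
    return row_ptr, col_idx, values


def unit_stiffness(elem):
    """Hypothetical local matrix: 1D linear element of unit length."""
    return [[1.0, -1.0], [-1.0, 1.0]]


# Two 1D elements (the simplest simplex mesh) sharing node 1.
row_ptr, col_idx, values = assemble_direct(3, [(0, 1), (1, 2)], unit_stiffness)
```

The sparsity pattern is computed once from the mesh connectivity, so the only storage allocated for matrix entries is the final `values` array.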
A big data platform for financial quantitative data is designed and implemented on an HPC system. Key technologies, including the data storage mechanism and the distributed computing framework, are addressed. Based on the platform, several important features for financial quantitative strategy research are developed: indicator computing, large-scale backtesting and distributed hyperparameter tuning. Tests show that the platform achieves much higher performance than a single-PC program and can be used to design strategies based on large-scale financial data.
ISBN (print): 9781728103235
Transportation systems are becoming increasingly complex with the evolution of emerging technologies, including deeper connectivity and automation, which will require more advanced control mechanisms for efficient operation (in terms of energy, mobility, and productivity). Stakeholders, including government agencies, industry, and local populations, all have an interest in efficient outcomes, yet there are few tools for developing a holistic understanding of urban dynamics. Simulating large-scale, high-fidelity transportation systems can help, but remains a challenging task, due to the computational demand of processing massive numbers of events and the nonlinear interactions between system components and traveling agents. In this paper, we introduce Mobiliti, a proof-of-concept, scalable transportation system simulator that implements parallel discrete event simulation on high-performance computers. We instantiated millions of nodes, links, and agents to simulate the movement of the population through the San Francisco Bay Area road network and provide estimates of the associated congestion, energy usage, and productivity loss. Our preliminary results show excellent scalability on multiple compute nodes for statically-routed agents, simulating 9.5 million trip legs over a road network with 1.1 million nodes and 2.2 million links, processing 2.4 billion events in less than 30 seconds using 1,024 cores on NERSC's Cori computer.
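For readers unfamiliar with discrete event simulation, the following is a minimal sequential sketch of an event loop for statically-routed agents. It is not Mobiliti's parallel implementation. The data layout, the fixed per-link travel times, and the function name `simulate` are all assumptions made for illustration, and congestion effects are ignored.

```python
import heapq


def simulate(routes, travel_time, start_times):
    """Advance agents along fixed routes, one event per link boundary.

    routes:      agent id -> list of link ids (static routing)
    travel_time: link id  -> traversal time (assumed constant here)
    start_times: agent id -> departure time
    Returns (arrival time per agent, number of events processed).
    """
    events = []  # min-heap ordered by event timestamp
    for agent, t0 in start_times.items():
        heapq.heappush(events, (t0, agent, 0))

    arrivals = {}
    processed = 0
    while events:
        time, agent, leg = heapq.heappop(events)
        processed += 1
        route = routes[agent]
        if leg == len(route):
            arrivals[agent] = time  # the agent has finished its last link
            continue
        # Schedule the moment the agent leaves the current link.
        heapq.heappush(events, (time + travel_time[route[leg]], agent, leg + 1))
    return arrivals, processed


arrivals, processed = simulate(
    routes={"a": ["l1", "l2"], "b": ["l2"]},
    travel_time={"l1": 5.0, "l2": 3.0},
    start_times={"a": 0.0, "b": 1.0},
)
```

A parallel simulator would partition the links across processes and exchange timestamped events between partitions, and this sketch leaves that part out.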
ISBN (print): 9781538650356
We present a ParaViewWeb-based visual analytics application running on a large high-resolution display that supports standard mouse and keyboard interaction. The application relies on SAGE2 for user interaction and multi-display visualization. We also employ a scalable middleware system called "Cloudberry" that allows users to interactively query and analyze large amounts of temporal and spatial data stored in a back-end Apache AsterixDB store, enabling big data analytics and interactive visualization. Our Visual Analyzing Billion Tweets application demonstrates interactive querying and visualization of results from over a billion Twitter feeds streamed in real time to the back-end Apache AsterixDB. In our setup, we ran the visual analytics application on a large high-resolution display made up of 24 tiles in a 6 x 4 configuration. We also ran a comparative study of the application on a single 24-inch display and on the 24-tiled display; the findings support the benefit of using large high-resolution displays for visual analytics.
ISBN (print): 9781538684771
As the critical pipeline stage in on-chip routers, switch allocation assigns output ports to input ports and allows flits to traverse the switch without conflicts. Previous works strive to design efficient switch allocation strategies by maximizing the matching at each cycle, using information from the current cycle or from multiple cycles in time series. However, those works have not taken endpoint congestion into consideration. Tree saturation, caused by endpoint congestion, can degrade NoC performance because the congestion fans out from the original point to upstream routers. In this paper, a novel router design, Eca-Router, is proposed to relieve the impact of endpoint congestion through switch allocation optimization. Eca-Router detects endpoint congestion by recording the destinations of packets in switch allocation. Endpoint congestion is declared in switch allocation once multiple input ports compete for the same output port and the packets in these input ports have the same destination. During switch allocation, requests that contribute to endpoint congestion are given lower priority, and starvation control is introduced to ensure allocation fairness. Evaluation results show that Eca-Router is efficient in reducing packet latency.
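The detection rule and priority adjustment described above can be sketched for a single allocation cycle as follows. The abstract does not specify the exact priority encoding or starvation mechanism, so the wait counters, the `starvation_limit` threshold, the tie-breaking order and the function name `allocate` are assumptions for illustration only, not Eca-Router's actual design.

```python
from collections import defaultdict


def allocate(requests, waits, starvation_limit=4):
    """One switch-allocation cycle with endpoint-congestion awareness.

    requests: list of (input_port, output_port, destination), one per input
    waits:    dict input_port -> cycles spent waiting (updated in place)
    Returns a dict output_port -> granted input_port.
    """
    by_output = defaultdict(list)
    for inp, out, dest in requests:
        by_output[out].append((inp, dest))

    grants = {}
    for out, contenders in by_output.items():
        # Several inputs wanting the same output AND the same destination
        # is treated as a sign of endpoint congestion.
        dest_count = defaultdict(int)
        for _, dest in contenders:
            dest_count[dest] += 1

        def priority(item):
            inp, dest = item
            waited = waits.get(inp, 0)
            starving = waited >= starvation_limit
            congested = dest_count[dest] > 1 and not starving
            # Smaller tuples win: starving requests first, then requests
            # not flagged as congested, then longest wait, then port id.
            return (not starving, congested, -waited, inp)

        grants[out] = min(contenders, key=priority)[0]

    granted = set(grants.values())
    for inp, _, _ in requests:
        waits[inp] = 0 if inp in granted else waits.get(inp, 0) + 1
    return grants


# Inputs 0 and 1 both target destination "A" through output 2.
waits = {}
first = allocate([(0, 2, "A"), (1, 2, "A"), (3, 2, "B")], waits)

# An input that has waited long enough regains priority.
starved = allocate([(0, 2, "A"), (1, 2, "A"), (3, 2, "B")], {0: 4, 1: 0, 3: 0})
```

In the first call the request heading to "B" wins because the two "A" requests are flagged as contributing to endpoint congestion. In the second call the starvation rule lets input 0 through despite that flag.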
ISBN (print): 9781538636527
In this paper, direct-code generalized space shift keying (DC-SSK) is studied for indoor visible light communication (VLC) systems with the purpose of improving the system transmission rate. The symbol error rate (SER) of the DC-SSK scheme in a VLC system is derived based on maximum likelihood (ML) detection, and a low-complexity power allocation scheme is further presented to enhance system performance over highly correlated optical channels. Simulation results and theoretical analysis show that the proposed DC-SSK scheme with low-complexity power allocation achieves better spectral efficiency and SER performance in VLC systems. The effect of different LED semiangles at half power on system performance is also analyzed.
ISBN (print): 9781728132945
Learning subtle yet discriminative features (e.g., beak and eyes for a bird) plays a significant role in fine-grained image recognition. Existing attention-based approaches localize and amplify significant parts to learn fine-grained details, but often suffer from a limited number of parts and heavy computational cost. In this paper, we propose to learn such fine-grained features from hundreds of part proposals with a Trilinear Attention Sampling Network (TASN) in an efficient teacher-student manner. Specifically, TASN consists of 1) a trilinear attention module, which generates attention maps by modeling the inter-channel relationships, 2) an attention-based sampler, which highlights attended parts with high resolution, and 3) a feature distiller, which distills part features into an object-level feature by weight sharing and feature preserving strategies. Extensive experiments verify that TASN yields the best performance under the same settings compared with the most competitive approaches on the iNaturalist-2017, CUB-Bird, and Stanford-Cars datasets.