The state-of-the-art intelligent vehicle, autonomous guided vehicle, and mobile robotics application domains can be described as a collection of interacting, highly autonomous complex dynamical systems. Extensive formal a...
The indexing of complex data and similarity search play an important role in many application areas. Traditional centralized index structures cannot scale with the rapid proliferation of data volume. In this paper, w...
ISBN:
(Print) 9780769539393
Network security applications such as detecting malware, security breaches, and covert channels require packet inspection and processing. Performing these functions at very high network line rates and low power is critical to safeguarding enterprise networks from various cyber-security threats. Solutions based on FPGAs and single- or multi-core CPUs have several limitations with regard to power and the ability to match ever-increasing line rates. This paper describes an MPPA (Massively Parallel Processing Architecture) framework based on the Ambric parallel-processing device that can speed up network packet processing and analysis tasks. This is accomplished with a programmable processor interconnection that enables parallelizing the application and replicating data through channels. In this paper, we consider three network security applications: detecting malware, detecting covert timing channels, and a symmetric encryption engine. Experimental analyses of parallel implementations of the detection algorithms show that the MPPA can easily achieve throughput greater than 1 Gbps with low power usage.
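The fan-out pattern the abstract describes, replicating packet data through channels to parallel inspection units, can be sketched in software. This is a minimal illustration, not the Ambric toolchain: the signatures, function names, and thread-pool parallelism are all assumptions standing in for the hardware channels.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical malware byte patterns; real deployments use large signature sets.
SIGNATURES = [b"EVIL", b"\x90\x90\x90\x90"]

def inspect(packet: bytes) -> bool:
    """Return True if any known signature occurs in the packet payload."""
    return any(sig in packet for sig in SIGNATURES)

def scan_stream(packets, workers=4):
    # Fan the packet stream out to parallel inspectors, loosely mirroring
    # how an MPPA replicates data through processor-to-processor channels.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(inspect, packets))

flags = scan_stream([b"hello", b"xxEVILxx", b"clean"])
```

On the actual device, each inspector would be a dedicated processor fed by a channel, so throughput scales with the number of replicated units rather than OS threads.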
ISBN:
(Print) 0780390059
In this paper, we propose a low-power distributed arithmetic (DA)-based method specifically for the field programmable gate array (FPGA) implementation of video processing systems. By taking advantage of the correlation of the input data and using a parallel structure, the FPGA implementation of a convolution example based on the proposed method requires only 75% of the power of the existing methods of [4], [5] while maintaining the same throughput. The proposed method can be applied to many video processing applications such as the discrete cosine transform, discrete wavelet transform, discrete Fourier transform, and motion estimation.
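Distributed arithmetic replaces the multipliers of an inner product with bit-serial look-up-table accesses, which is why it maps well to FPGA fabric. The sketch below shows the textbook DA scheme for unsigned inputs; it is an illustration of the general technique, not the paper's specific low-power variant.

```python
def build_lut(coeffs):
    # DA look-up table: entry `addr` holds the sum of the coefficients
    # whose corresponding address bit is set.
    return [sum(c for i, c in enumerate(coeffs) if (addr >> i) & 1)
            for addr in range(1 << len(coeffs))]

def da_inner_product(coeffs, xs, width=8):
    """Compute sum(c*x) bit-serially: at each bit position b, the b-th
    bits of all inputs form a LUT address; the partial sum is shifted
    by b and accumulated (unsigned inputs assumed)."""
    lut = build_lut(coeffs)
    acc = 0
    for b in range(width):
        addr = sum(((x >> b) & 1) << i for i, x in enumerate(xs))
        acc += lut[addr] << b
    return acc

result = da_inner_product([3, 5, 2], [10, 20, 7])  # same as 3*10 + 5*20 + 2*7
```

In hardware the LUT, shifter, and accumulator replace all multipliers, trading latency (one cycle per input bit) for area and power.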
ISBN:
(Print) 0780376676
Analysis of tissue using image processing techniques is essential for dealing with a number of problems in cancer research. The identification of normal and cancerous colonic mucosa is one such problem. In this paper, texture analysis techniques are used to measure certain characteristics of normal and cancerous tissue images. A genetic algorithm then analyses those results to determine which operations are useful for the given problem and which combination of operations is most appropriate for maximising classification accuracy. The system developed for these tasks has been implemented on a cluster of Linux workstations using distributed computing techniques. A distributed-programming message-passing library, PVM (Parallel Virtual Machine), provides the basis for building this system.
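Selecting a combination of texture operations with a genetic algorithm amounts to evolving bit-strings where each bit switches one operation on or off. The toy below shows that loop with a stand-in fitness function; the population size, operators, and the notion that operations 0, 2, and 5 "carry the signal" are all fabricated for illustration, not taken from the paper.

```python
import random

def evolve(n_ops, fitness, pop_size=20, gens=30, seed=0):
    """Tiny GA over bit-strings; bit i enables texture operation i."""
    rng = random.Random(seed)
    pop = [[rng.randint(0, 1) for _ in range(n_ops)] for _ in range(pop_size)]
    for _ in range(gens):
        pop.sort(key=fitness, reverse=True)
        survivors = pop[:pop_size // 2]          # elitist truncation selection
        children = []
        while len(survivors) + len(children) < pop_size:
            a, b = rng.sample(survivors, 2)
            cut = rng.randrange(1, n_ops)
            child = a[:cut] + b[cut:]            # one-point crossover
            child[rng.randrange(n_ops)] ^= 1     # point mutation
            children.append(child)
        pop = survivors + children
    return max(pop, key=fitness)

# Stand-in for classification accuracy: pretend ops 0, 2 and 5 are useful
# and every enabled useless op adds a small penalty.
USEFUL = {0, 2, 5}
def accuracy(bits):
    hits = sum(1 for i in USEFUL if bits[i])
    noise = sum(1 for i, b in enumerate(bits) if b and i not in USEFUL)
    return hits - 0.1 * noise

best = evolve(8, accuracy)
```

In the paper's setting, fitness would instead be the measured classification accuracy of the selected texture measures, evaluated in parallel across the PVM cluster.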
Certain aspects of a computer-generated world have always been difficult to simulate. Cloth is one such example since, unlike a rigid object, it is flexible and subject to many internal and external forces which drive...
ISBN:
(Print) 9781728107462
The proceedings contain 34 papers. The topics discussed include: hierarchical page eviction policy for unified memory in GPUs; modeling deep learning accelerator enabled GPUs; RPPM: rapid performance prediction of multithreaded workloads on multicore processors; on the impact of instruction address translation overhead; empirical investigation of stale value tolerance on parallel RNN training; DeLTA: GPU performance model for deep learning applications with in-depth memory system traffic analysis; distributed software defined networking controller failure mode and availability analysis; HeteroMap: a runtime performance predictor for efficient processing of graph analytics on heterogeneous multi-accelerators; analyzing machine learning workloads using a detailed GPU simulator; a model driven approach towards improving the performance of apache spark applications; an improved dynamic vertical partitioning technique for semi-structured data; and timeloop: a systematic approach to DNN accelerator evaluation.
ISBN:
(Print) 0769521320
Power control is an important issue in wireless networks that still has no satisfactory solution. Due to the limited amount of power available to wireless units, there is a need for systems that operate at reduced power consumption levels. We propose a new model for the problem that exploits the relationship between the power required and the reach of a broadcast. The resulting model is called the power control problem in ad hoc networks (PCADHOC). We derive a linear integer programming model, which is used to find lower bounds on the amount of required power. The constraints of the problem guarantee that all required transmissions can be successfully performed. A distributed algorithm based on variable neighborhood search is proposed to solve the PCADHOC. Experiments with the algorithm show that the power savings are considerable.
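The core trade-off, transmit power grows with reach, but every node must still be able to get a message to every other node, can be made concrete on a tiny instance. The sketch below brute-forces range assignments and keeps the cheapest one whose reachability graph is strongly connected; the path-loss exponent and exhaustive search are assumptions for illustration, not the paper's integer program or its variable neighborhood search heuristic.

```python
import math
from itertools import product

ALPHA = 2.0  # path-loss exponent: required power ~ distance**ALPHA (assumption)

def strongly_connected(ranges, pts):
    # Directed edge i -> j exists when node i's transmit range covers node j.
    n = len(pts)
    adj = [[j for j in range(n)
            if j != i and math.dist(pts[i], pts[j]) <= ranges[i]]
           for i in range(n)]
    def reaches_all(s):
        seen, stack = {s}, [s]
        while stack:
            for j in adj[stack.pop()]:
                if j not in seen:
                    seen.add(j)
                    stack.append(j)
        return len(seen) == n
    return all(reaches_all(i) for i in range(n))

def min_total_power(pts):
    # Each node's range is the distance to some other node; enumerate all
    # assignments and keep the cheapest strongly connected one.
    n = len(pts)
    cand = [[math.dist(pts[i], pts[j]) for j in range(n) if j != i]
            for i in range(n)]
    best = None
    for ranges in product(*cand):
        if strongly_connected(ranges, pts):
            cost = sum(r ** ALPHA for r in ranges)
            if best is None or cost < best:
                best = cost
    return best

cost = min_total_power([(0, 0), (1, 0), (2, 0)])  # three collinear nodes
```

For the three collinear nodes, each only needs to reach its nearest neighbour (range 1), so the optimum total power is 3.0; the exponential enumeration is exactly what the linear integer program and the distributed heuristic avoid at realistic scales.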
ISBN:
(Print) 9781728116440
Cloud applications are increasingly playing a crucial role in big data analytics. New use cases such as autonomous cars and edge computing call for novel approaches mixing heterogeneous computing and machine learning. These applications typically process petabyte-scale datasets, therefore requiring low-power, scalable storage providing low-latency and high-throughput data access. While data centers have been migrating from legacy HDDs and SATA SSDs by deploying high-throughput, low-latency NVMe SSDs, data bottlenecks appear as capacity scales. One approach to tackling this problem is to enable processing to happen within the storage device, known as in-storage processing (ISP), eliminating the need to move the data. In this paper, we investigate the deployment of storage units with embedded low-power application processors along with FPGA-based reconfigurable hardware accelerators to address both performance and energy efficiency. To this end, we developed a high-capacity solid-state drive (SSD) named Catalina equipped with a quad-core ARM A53 processor running a Linux operating system, along with a highly efficient FPGA accelerator for running applications in place. We evaluated the proposed approach on a case study using the Faiss similarity search library.
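The workload being pushed into the drive is nearest-neighbour similarity search. A minimal sketch of the exhaustive L2 scan, the same computation Faiss's flat index performs, shows what the drive's ARM cores or FPGA would execute in place; the data, dimensionality, and function names here are illustrative assumptions.

```python
def l2(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def knn_search(database, query, k=3):
    # Exhaustive scan over the stored vectors, as a flat L2 index does.
    # In the ISP setting this loop runs inside the SSD, so only the k
    # result indices, not the whole database, cross the host interface.
    order = sorted(range(len(database)), key=lambda i: l2(database[i], query))
    return order[:k]

db = [(0.0, 0.0), (1.0, 1.0), (5.0, 5.0), (0.9, 1.1)]
hits = knn_search(db, (1.0, 1.0), k=2)  # indices of the 2 nearest vectors
```

The bandwidth argument is visible even in this toy: the host receives two integers instead of the four stored vectors.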
ISBN:
(Print) 9798350387117; 9798350387124
Graph-processing workloads have become widespread due to their relevance to a wide range of application domains such as network analysis, path-planning, bioinformatics, and machine learning. Graph-processing workloads have massive data footprints that exceed cache storage capacity and exhibit highly irregular memory access patterns due to data-dependent graph traversals. This irregular behaviour causes graph-processing workloads to exhibit poor data locality, undermining their performance. This paper makes two fundamental observations on the memory access patterns of graph-processing workloads: First, conventional cache hierarchies become mostly useless when dealing with graph-processing workloads, since 78.6% of the accesses that miss in the L1 Data Cache (L1D) also miss in the L2 Cache (L2C) and in the Last Level Cache (LLC), requiring a DRAM access. Second, it is possible to predict whether a memory access will be served by DRAM or not in the context of graph-processing workloads by observing strides between accesses triggered by instructions with the same Program Counter (PC). Our key insight is that bypassing the L2C and the LLC for highly irregular accesses significantly reduces latency cost while also reducing pressure on the lower levels of the cache hierarchy. Based on these observations, this paper proposes the Large Predictor (LP), a low-cost micro-architectural predictor capable of distinguishing between regular and irregular memory accesses. We propose to serve accesses tagged as regular by LP via the standard memory hierarchy, while irregular accesses are served via the Side Data Cache (SDC). The SDC is a private per-core set-associative cache placed alongside the L1D specifically aimed at reducing the latency cost of highly irregular accesses while avoiding polluting the rest of the cache hierarchy with data that exhibits poor locality. SDC coupled with LP yields geometric mean speed-ups of 20.3% and 20.2% on single- and multi-core scenarios, respectively.
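The PC-indexed stride observation can be sketched behaviourally: track the last address and last stride per PC, and tag an access regular only when it repeats the previous stride. This is a hypothetical simplification of the Large Predictor, the table structure, class name, and single-stride rule are assumptions, not the paper's exact micro-architecture.

```python
class StridePredictorSketch:
    """Per-PC table of (last_addr, last_stride); an access is 'regular'
    when its stride matches the previous stride seen at the same PC."""
    def __init__(self):
        self.table = {}  # pc -> (last_addr, last_stride)

    def classify(self, pc, addr):
        entry = self.table.get(pc)
        if entry is None:
            self.table[pc] = (addr, None)
            return "irregular"          # no history yet for this PC
        last_addr, last_stride = entry
        stride = addr - last_addr
        self.table[pc] = (addr, stride)
        return "regular" if stride == last_stride else "irregular"

lp = StridePredictorSketch()
# A sequential scan (stride 8) interrupted by a data-dependent jump.
tags = [lp.classify(0x400, a) for a in (0, 8, 16, 24, 1000, 1008)]
```

In the proposed design, "regular" accesses would traverse L1D/L2C/LLC as usual, while "irregular" ones would bypass L2C and LLC and be served by the SDC next to the L1D.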