ISBN (print): 9798400702341
Serverless offers a scalable and cost-effective service model for users to run applications without focusing on underlying infrastructure or physical servers. While the Serverless architecture is not designed to address the unique challenges posed by resource-intensive workloads, e.g., Machine Learning (ML) tasks, it is highly scalable. Due to the limitations of Serverless function deployment and resource provisioning, the combination of ML and Serverless is a complex undertaking. We tackle this problem through decomposition of large ML models into smaller sub-models, referred to as slices. We set up ML inference tasks using these slices as a Serverless workflow, i.e., a sequence of functions. Our experimental evaluations are performed on the Serverless offering by AWS for demonstration purposes, considering an open-source format for ML model representation, the Open Neural Network Exchange (ONNX). The results show that our decomposition method enables the execution of ML inference tasks on Serverless regardless of model size, benefiting from the high scalability of this architecture while lowering the strain on computing resources, such as required run-time memory.
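The slicing idea above can be illustrated with a minimal software sketch: a layered model is split into sub-models, and inference runs as a chain of independent function calls, as in a Serverless workflow. All names here are illustrative assumptions, not the paper's implementation (which operates on ONNX models).

```python
# Minimal sketch: decompose a layered model into "slices" and run
# inference as a sequential workflow of independent calls, mimicking
# a chain of serverless functions. Names are illustrative only.

def make_slices(layers, slice_size):
    """Split an ordered list of layer functions into smaller sub-models."""
    return [layers[i:i + slice_size] for i in range(0, len(layers), slice_size)]

def run_slice(slice_layers, tensor):
    """One 'serverless function': applies its sub-model to the input."""
    for layer in slice_layers:
        tensor = layer(tensor)
    return tensor

def workflow_inference(slices, tensor):
    """Chain the slices: each function's output is the next one's input."""
    for s in slices:
        tensor = run_slice(s, tensor)
    return tensor

# Toy model: four elementwise layers split into two slices of two.
layers = [lambda x: x + 1, lambda x: x * 2, lambda x: x - 3, lambda x: x * x]
slices = make_slices(layers, slice_size=2)
print(workflow_inference(slices, 5))  # ((5 + 1) * 2 - 3) ** 2 = 81
```

Because each slice only needs its own parameters in memory, per-function memory demand shrinks with slice size, which is the property the decomposition exploits.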
ISBN (digital): 9781665462723
ISBN (print): 9781665462723
Data movement between the main memory and the processor is a key contributor to execution time and energy consumption in memory-intensive applications. This data movement bottleneck can be alleviated using Processing-in-Memory (PiM). One category of PiM is Processing-using-Memory (PuM), in which computation takes place inside the memory array by exploiting intrinsic analog properties of the memory device. PuM yields high performance and energy efficiency, but existing PuM techniques support a limited range of operations. As a result, current PuM architectures cannot efficiently perform some complex operations (e.g., multiplication, division, exponentiation) without large increases in chip area and design complexity. To overcome these limitations of existing PuM architectures, we introduce pLUTo (processing-using-memory with lookup table (LUT) operations), a DRAM-based PuM architecture that leverages the high storage density of DRAM to enable the massively parallel storing and querying of lookup tables (LUTs). The key idea of pLUTo is to replace complex operations with low-cost, bulk memory reads (i.e., LUT queries) instead of relying on complex extra logic. We evaluate pLUTo across 11 real-world workloads that showcase the limitations of prior PuM approaches and show that our solution outperforms optimized CPU and GPU baselines by an average of 713x and 1.2x, respectively, while simultaneously reducing energy consumption by an average of 1855x and 39.5x. Across these workloads, pLUTo outperforms state-of-the-art PiM architectures by an average of 18.3x. We also show that different versions of pLUTo provide different levels of flexibility and performance at different additional DRAM area overheads (between 10.2% and 23.1%). pLUTo's source code and all scripts required to reproduce the results of this paper are openly and fully available at https://***/CMU-SAFARI/pLUTo.
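The key idea, replacing an arithmetic operation with a table query, can be sketched in a few lines of software; this is only a conceptual analogue of the in-DRAM mechanism, with illustrative names.

```python
# Minimal software analogue of the pLUTo idea: replace a "complex"
# operation (here unsigned 8-bit multiplication) with a precomputed
# lookup table, so each result is obtained by a table read instead of
# arithmetic. In pLUTo the table lives inside DRAM and queries are
# massively parallel bulk reads; here it is just a Python list.

# Precompute once: a 256x256 table covering all 8-bit operand pairs.
MUL_LUT = [[a * b for b in range(256)] for a in range(256)]

def lut_mul(a, b):
    """'Compute' a * b by querying the table rather than multiplying."""
    return MUL_LUT[a][b]

print(lut_mul(13, 21))  # 273
```

The trade-off is exactly the one the abstract describes: table storage (area) is spent to avoid complex logic, which pays off when the memory substrate is dense and reads are cheap.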
ISBN (print): 9798400702341
Video identification in encrypted network traffic has become a trending field of research for user behavior and Quality of Experience (QoE) analysis. However, the traditional methods of video identification have become ineffective with the usage of Hypertext Transfer Protocol Secure (HTTPS). This paper presents a video identification method for encrypted network traffic that uses the number of packets received at the user's end in a second. For this purpose, video streams are captured, and a feature is extracted from the video streams in the form of a Packets-per-Second (PPS) series. This feature is provided as input to a Convolutional Neural Network (CNN), which learns the pattern from the network traffic feature and successfully identifies the video even if the pattern differs from the training sample. The results show that PPS outperforms the other video identification techniques with a high accuracy of 90%. Moreover, the results show that CNN outperforms its counterpart in video identification with a 25% performance increase.
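The PPS feature described above is straightforward to derive from packet arrival timestamps; the sketch below shows one plausible binning scheme (1-second buckets), with names and details assumed for illustration rather than taken from the paper.

```python
# Illustrative sketch: derive the Packets-per-Second (PPS) series used
# as the CNN input from a list of packet arrival timestamps (seconds
# since capture start). The 1-second binning is an assumption.

def pps_series(timestamps, duration):
    """Count packets received in each 1-second bin over the capture."""
    counts = [0] * duration
    for t in timestamps:
        bucket = int(t)
        if 0 <= bucket < duration:
            counts[bucket] += 1
    return counts

# Toy capture: 6 packets over a 3-second window.
print(pps_series([0.1, 0.5, 1.2, 1.3, 1.9, 2.7], duration=3))  # [2, 3, 1]
```

Crucially, this feature needs only packet counts and timing, not payload contents, which is why it survives HTTPS encryption.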
The presence of noisy labels has always been a primary factor affecting the effectiveness of federated learning (FL). Conventional FL approaches relying on Supervised Learning (SL) tend to overfit the noisy labels, re...
ISBN (print): 9781665433334
Serverless computing has become a fact of life on modern clouds. A serverless function may process sensitive data from clients, so protecting such a function against untrusted clouds using a hardware enclave is attractive for user privacy. In this work, we run existing serverless applications in SGX enclaves and observe that the performance degradation can be as high as 5.6x to even 422.6x. Our investigation identifies that these slowdowns are related to architectural features, mainly page-wise enclave initialization. Leveraging insights from our overhead analysis, we revisit the SGX hardware design and make minimal modifications to its enclave model. We extend SGX with a new primitive, region-wise plugin enclaves, which can be mapped into existing enclaves to reuse attested common states among functions. By remapping plugin enclaves, an enclave allows in-situ processing to avoid expensive data movement in a function chain. Experiments show that our design reduces enclave function latency by 94.74-99.57% and boosts autoscaling throughput by 19-179x.
In this study, the situations in which the Perturb and Observe Method (POM) can be used to determine the points at which a wind turbine operates at maximum power were examined. Among them, the points in which there ...
ISBN (print): 9781665433334
With the end of Dennard scaling, highly parallel and specialized hardware accelerators have been proposed to improve the throughput and energy efficiency of deep neural network (DNN) models for various applications. However, collective data movement primitives such as multicast and broadcast, which are required for multiply-and-accumulate (MAC) computation in DNN models, are expensive and require excessive energy and latency when implemented with electrical networks. This consequently limits the scalability and performance of electronic hardware accelerators. Emerging technology such as silicon photonics can inherently provide efficient implementations of multicast and broadcast operations, making photonics more amenable to exploiting parallelism within DNN models. Moreover, when coupled with other unique features such as low energy consumption, high channel capacity with wavelength-division multiplexing (WDM), and high speed, silicon photonics could potentially provide a viable technology for scaling DNN acceleration. In this paper, we propose Albireo, an analog photonic architecture for scaling DNN acceleration. By characterizing photonic devices such as microring resonators (MRRs) and Mach-Zehnder modulators (MZMs) using photonic simulators, we develop realistic device models and outline their capability for system-level acceleration. Using the device models, we develop an efficient broadcast combined with multicast data distribution by leveraging parameter sharing through unique WDM dot-product processing. We evaluate the energy and throughput performance of Albireo on DNN models such as ResNet18, MobileNet, and VGG16. When compared to current state-of-the-art electronic accelerators, Albireo increases throughput by 110x and improves energy-delay product (EDP) by an average of 74x with current photonic devices. Furthermore, when considering moderate and aggressive photonic scaling, the proposed Albireo design shows that EDP can be reduced by at least 229x.
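The WDM dot-product idea can be modeled arithmetically: each wavelength channel carries one input activation, a modulator scales it by a weight, and a photodetector sums all channels, while broadcast reuses the same activations across several weight rows. The sketch below models only the arithmetic, not the optics, and all names are illustrative assumptions.

```python
# Conceptual software analogue of a WDM dot product: each wavelength
# carries one activation, per-wavelength modulators apply the weights,
# and the photodetector output is the sum over all channels. Broadcast
# reuses one activation vector across many weight rows (MAC rows).

def wdm_dot(activations, weights):
    """One 'photodetector' output: sum of per-wavelength products."""
    return sum(a * w for a, w in zip(activations, weights))

def broadcast_dot(activations, weight_rows):
    """Broadcast the same activations to several weight rows at once."""
    return [wdm_dot(activations, row) for row in weight_rows]

print(broadcast_dot([1.0, 2.0, 3.0], [[1, 0, 1], [0.5, 0.5, 0.5]]))  # [4.0, 3.0]
```

In the electrical domain this broadcast costs wires and energy per fan-out; in the photonic domain the same light can feed many rows, which is the parallelism the abstract highlights.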
ISBN (digital): 9798331522124
ISBN (print): 9798331522131
Convolutional Neural Networks (CNNs) are widely used for optical character recognition of vehicle license plates in automatic license plate recognition (ALPR) systems. However, their high computational complexity makes meeting the time and cost requirements of specific ALPR applications challenging. This work aimed to develop a CNN architecture and select a hardware acceleration technique to create a low-cost optical character recognition (OCR) system capable of real-time vehicle identification. We designed the CNN architecture with accuracy and simplicity in mind, and we chose the hardware acceleration technique based on silicon cost and performance. Our 8-bit quantized CNN achieved an accuracy of 97.11%, and the accelerator resulted in a latency of 4.21 ms and a throughput of 598 FPS. The solution offers accuracy and performance comparable to related methods, using less than 20% of the hardware resources.
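For readers unfamiliar with 8-bit quantization, the sketch below shows one common symmetric per-tensor scheme; the paper's exact quantization method is not specified here, so this scheme and all names are assumptions for illustration.

```python
# Minimal sketch of symmetric 8-bit weight quantization, a common
# choice for quantized CNN inference. The per-tensor symmetric scale
# used here is an assumption, not necessarily the paper's method.

def quantize_int8(weights):
    """Map float weights to int8 values with a shared symmetric scale."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127.0 if max_abs else 1.0
    q = [max(-128, min(127, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q, scale):
    """Recover approximate float weights from the int8 representation."""
    return [v * scale for v in q]

q, s = quantize_int8([0.5, -1.0, 0.25, 0.0])
print(q)  # [64, -127, 32, 0]
```

Storing weights as int8 cuts memory traffic to a quarter of float32 and lets the accelerator use small integer multipliers, which is where the silicon-cost savings come from.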
ISBN (print): 9781665442787
De novo assembly of genomes for which there is no reference is essential for novel species discovery and metagenomics. In this work, we accelerate two key performance bottlenecks of de Bruijn graph (DBG)-based assembly, graph construction and graph traversal, with a near-data processing (NDP) architecture based on 3D stacking. The proposed framework distributes key operations across NDP cores to exploit a high degree of parallelism and high memory bandwidth. We propose several optimizations based on domain-specific properties to improve the performance of our design. We integrate the proposed techniques into an existing DBG assembly tool, and our simulation-based evaluation shows that the proposed NDP implementation can improve the performance of graph construction by 33× and traversal by 16× compared to the state of the art.
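The graph-construction stage being accelerated can be illustrated with a serial sketch: each read is decomposed into overlapping k-mers, and edges connect the (k-1)-length prefix of each k-mer to its suffix. Names and the adjacency representation are illustrative assumptions, not the tool's implementation.

```python
# Illustrative serial sketch of de Bruijn graph (DBG) construction,
# the stage the NDP design parallelizes across cores: decompose each
# read into k-mers and link each k-mer's (k-1)-prefix to its suffix.

from collections import defaultdict

def build_dbg(reads, k):
    """Map each (k-1)-mer node to the set of its successor nodes."""
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])
    return graph

g = build_dbg(["ACGT", "CGTA"], k=3)
print(sorted((node, sorted(succ)) for node, succ in g.items()))
```

The inner loop is dominated by hash insertions over a huge key space, so it is memory-bandwidth-bound, which is why moving it next to memory pays off.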
Modern classification problems tackled using Decision Tree (DT) models often impose demanding constraints in terms of accuracy and scalability. These are often hard to achieve due to the ever-increasing volume of data used for training and testing. Bayesian approaches to DTs using Markov Chain Monte Carlo (MCMC) methods have demonstrated great accuracy in a wide range of applications. However, the inherently sequential nature of MCMC makes it unsuitable to meet both accuracy and scaling constraints. One could run multiple MCMC chains in an embarrassingly parallel fashion; despite the improved run-time, this approach sacrifices accuracy in exchange for strong scaling. Sequential Monte Carlo (SMC) samplers are another class of Bayesian inference methods that have the appealing property of being parallelizable without trading off accuracy. Nevertheless, finding an effective parallelization for the SMC sampler is difficult, due to the challenges in parallelizing its bottleneck, redistribution, in such a way that the workload is equally divided across the processing elements, especially when dealing with variable-size models such as DTs. This study presents a parallel SMC sampler for DTs on Shared Memory (SM) architectures, with an $O(\log_2 N)$ parallel redistribution for variable-size samples. On an SM machine with 32 cores, the experimental results show that our proposed method scales up to a factor of 16 compared to its serial implementation, and provides accuracy comparable to MCMC while being 51 times faster.
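The redistribution bottleneck mentioned above is the SMC resampling step: high-weight samples are replicated and low-weight ones dropped, keeping the population size fixed. The serial sketch below uses systematic resampling, a common choice and an assumption here, with a fixed offset instead of a random one so its output is deterministic; the paper's contribution is doing this step in $O(\log_2 N)$ parallel time for variable-size samples, which this sketch does not attempt.

```python
# Serial sketch of SMC redistribution (resampling): replicate samples
# in proportion to their normalized weights while keeping N fixed.
# Systematic resampling with a fixed offset u0 (normally drawn from
# U[0,1)) is used here so that the result is deterministic.

def systematic_resample(samples, weights, u0=0.5):
    """Redistribute N samples in proportion to their weights."""
    n = len(samples)
    total = sum(weights)
    positions = [(i + u0) / n for i in range(n)]   # evenly spaced points
    out, cumulative, j = [], weights[0] / total, 0
    for p in positions:
        while p > cumulative:                      # advance to the sample
            j += 1                                 # whose weight interval
            cumulative += weights[j] / total       # contains position p
        out.append(samples[j])
    return out

print(systematic_resample(["a", "b", "c", "d"], [0.7, 0.1, 0.1, 0.1]))
# ['a', 'a', 'a', 'c']
```

With equal weights every sample survives exactly once; as weights skew, copies concentrate on the heavy samples, which is what makes the memory traffic of this step irregular and hard to parallelize for variable-size DT samples.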