This study addresses the critical issue of child abuse in digital communications by developing advanced machine learning and deep learning models for detecting child-abusive texts in the Bengali language on online pla...
The popularity of movies has made predicting a movie's success before its release a crucial task for both studios and filmmakers. This research proposes a movie success prediction system using machine learning that...
ISBN: (Print) 9781665420273
Reservoir computing is a nascent sub-field of machine learning which relies on the recurrent multiplication of a very large, sparse, fixed matrix. We argue that direct spatial implementation of these fixed matrices minimizes the work performed in the computation, and allows for significant reductions in latency and power through constant propagation and logic minimization. Bit-serial arithmetic enables massive static matrices to be implemented. We present the structure of our bit-serial matrix multiplier, and evaluate using canonical signed digit representation to further reduce logic utilization. We have implemented these matrices on a large FPGA and provide a cost model that is simple and extensible. These FPGA implementations reduce latency by 50x on average, and by up to 86x, versus GPU libraries. Comparing against a recent sparse DNN accelerator, we measure a 4.1x to 47x reduction in latency depending on matrix dimension and sparsity.
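The recurrent fixed-matrix multiply the abstract refers to can be sketched as follows. This is an illustrative software model only, not the paper's bit-serial FPGA design; the reservoir size, sparsity level, and update rule (a standard echo-state update) are all assumptions for demonstration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical parameters: a small, dense-in-software stand-in for the
# paper's "very large, sparse, fixed" reservoir matrix.
N = 100            # reservoir size
SPARSITY = 0.95    # fraction of entries forced to zero

# Fixed random reservoir matrix W: sparse, generated once, never updated.
# Because W never changes, a spatial hardware implementation can bake its
# entries into logic (constant propagation), which is the paper's premise.
W = rng.standard_normal((N, N))
W[rng.random((N, N)) < SPARSITY] = 0.0
W *= 0.9 / max(abs(np.linalg.eigvals(W)))   # scale spectral radius below 1

W_in = rng.standard_normal((N, 1))          # fixed input weights

def reservoir_step(x, u):
    """One recurrent update: the fixed-matrix product W @ x dominates."""
    return np.tanh(W @ x + W_in @ u)

x = np.zeros((N, 1))
for t in range(10):                          # drive with a toy input signal
    x = reservoir_step(x, np.array([[np.sin(t)]]))
```

Since only the state vector changes between steps, the multiply is a repeated constant-matrix-times-vector operation, which is what makes constant propagation and logic minimization pay off in hardware.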
ISBN: (Print) 9781665420273
To meet the strict service level agreement requirements of recommendation systems, the entire set of embeddings in recommendation systems needs to be loaded into memory. However, as the models and datasets for production-scale recommendation systems scale up, the size of the embeddings is approaching the limit of memory capacity. Limited physical memory constrains the algorithms that can be trained and deployed, posing a severe challenge for deploying advanced recommendation systems. Recent studies offload the embedding lookups into SSDs, which targets embedding-dominated recommendation models. This paper takes it one step further and proposes to offload the entire recommendation system into an SSD with in-storage computing capability. The proposed SSD-side FPGA solution leverages a low-end FPGA to speed up both the embedding-dominated and MLP-dominated models with high resource efficiency. We evaluate the performance of the proposed solution with a prototype SSD. Results show that we can achieve 20-100x throughput improvement compared with the baseline SSD and 1.5-15x improvement compared with the state of the art.
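The embedding-dominated versus MLP-dominated split the abstract mentions can be illustrated with a minimal sketch of a recommendation forward pass; the table size, dimensions, and one-hidden-layer MLP here are toy assumptions, and production tables are orders of magnitude larger, which is why they no longer fit in DRAM.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical sizes; real production embedding tables reach hundreds of GB.
NUM_ROWS, EMB_DIM = 10_000, 16

# Stand-in for the embedding table that the paper offloads to SSD.
embedding_table = rng.standard_normal((NUM_ROWS, EMB_DIM)).astype(np.float32)

# Toy dense (MLP) part of the model.
W1 = rng.standard_normal((EMB_DIM, 8)).astype(np.float32)
W2 = rng.standard_normal((8, 1)).astype(np.float32)

def recommend_score(feature_ids):
    """Embedding lookups (the memory-bound, lookup-dominated part)
    followed by sum pooling and a small compute-bound MLP."""
    vecs = embedding_table[feature_ids]       # gather = "embedding lookup"
    pooled = vecs.sum(axis=0)                 # pool the sparse features
    hidden = np.maximum(pooled @ W1, 0.0)     # ReLU hidden layer
    return (hidden @ W2).item()               # scalar ranking score

score = recommend_score([3, 17, 4242])
```

Embedding-dominated models spend most of their time in the gather step, while MLP-dominated models spend it in the matrix products; the abstract's claim is that one low-end SSD-side FPGA can accelerate both.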
This paper presents a novel approach to enable unmanned aerial vehicle (UAV) path planning in autonomous telecommunication systems. The proposed technique utilizes a combination of heuristic algorithms, arti...
ISBN: (Print) 9798400700958
As FPGAs become ubiquitous compute platforms, existing research has focused on enabling virtualization features to facilitate fine-grained FPGA sharing. We employ an overlay architecture which enables arbitrary, independent user logic to share portions of a single FPGA by dividing the FPGA into independently reconfigurable slots. We then explore scheduling possibilities to effectively time- and space-multiplex the virtualized FPGA by introducing Nimblock. The Nimblock scheduling algorithm balances application priorities and performance degradation to improve response time and reduce deadline violations. Unlike other algorithms, Nimblock explores pre-emption as a scheduling parameter to dynamically change resource allocations, and automatically allocates resources to enable suitable parallelism for an application without additional user input. In our exploration, we evaluate five scheduling algorithms: a baseline, three existing algorithms, and our novel Nimblock algorithm. We demonstrate system feasibility by realizing the complete system on a Xilinx ZCU106 FPGA and evaluating on a set of real-world benchmarks. In our results, we achieve up to 5.7x lower average response times when compared to a no-sharing and no-virtualization scheduling algorithm and up to 2.1x average response time improvement over competitive scheduling algorithms that support sharing within our virtualization environment. We additionally demonstrate up to 49% fewer deadline violations and up to 2.6x lower tail response times when compared to other high-performance algorithms.
Serverless computing has garnered significant interest for executing high-performance computing (HPC) applications in recent years, attracting attention for its elastic scalability, reduced entry barriers, and pay-per...
ISBN: (Print) 9781665487504
The proceedings contain 48 papers. The topics discussed include: environment representations with bisimulation metrics for hierarchical reinforcement learning; crime prediction linked to geographical location with periodic features for societal security; differential privacy machine learning based on attention residual networks; an energy and memory efficient speaker verification system based on binary neural networks; enhancing the feature learning with phonetic control module for speaker verification; latency minimization for intelligent reflecting surface-assisted cloud-edge collaborative computing; a high-precision intelligent retrieval algorithm for bill of quantities; number and classes of rotations on juggling sequence rotation; a hybrid approach for the circular knapsack packing problem with rectangular items; personalized microblog recommendation system integrating following and reposting relationships; research on intelligent scheduling algorithms based on cloud computing; and a proposed model for enhancing the performance of health care services in smart cities using hybrid optimization techniques.
General matrix multiply (GEMM) is an important operation in broad applications, especially the thriving deep neural networks. To achieve low power consumption for GEMM, researchers have already leveraged unary computi...
ISBN: (Print) 9781665420273
General matrix multiply (GEMM) is an important operation in broad applications, especially the thriving deep neural networks. To achieve low power consumption for GEMM, researchers have already leveraged unary computing, which manipulates bitstreams with extremely simple logic. However, existing unary architectures are not well generalizable to the varying GEMM configurations of versatile applications and are incompatible with the binary computing stack, imposing challenges to executing unary GEMM effortlessly. In this work, we address the problem by architecting a hybrid unary-binary systolic array, uSystolic, to inherit the legacy binary data scheduling with slow (thus power-efficient) data movement, i.e., data bytes crawl out from memory to drive uSystolic. uSystolic exhibits tremendous area and power improvements as a joint effect of 1) a low-power computing kernel, 2) spatial-temporal bitstream reuse, and 3) on-chip SRAM elimination. For the evaluated edge computing scenario, compared with the binary parallel design, the rate-coded uSystolic reduces the systolic array area and total on-chip area by 59.0% and 91.3%, with the on-chip energy and power efficiency improved by up to 112.2x and 44.8x for AlexNet.
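The "extremely simple logic" of rate-coded unary computing can be sketched with the classic stochastic-computing multiplier: two values in [0, 1] encoded as random bitstreams are multiplied by a single AND gate per bit position. This is a generic illustration of the encoding, not the uSystolic architecture itself; the stream length and values are assumptions.

```python
import numpy as np

rng = np.random.default_rng(42)
L = 1 << 14   # bitstream length; longer streams trade latency for precision

def to_bitstream(p, length, rng):
    """Rate-coded unary encoding: each bit is 1 with probability p."""
    return rng.random(length) < p

def unary_multiply(p_a, p_b):
    """Multiply two values in [0, 1] with one AND gate per bit.
    For independent streams, P(a AND b) = p_a * p_b, so the fraction
    of 1s in the ANDed stream estimates the product."""
    a = to_bitstream(p_a, L, rng)
    b = to_bitstream(p_b, L, rng)
    return np.mean(a & b)

approx = unary_multiply(0.5, 0.8)   # estimates 0.5 * 0.8 = 0.4
```

The hardware appeal is that a multiplier collapses to one gate; the cost is the long serial bitstream, which is why uSystolic's bitstream reuse and hybrid unary-binary scheduling matter.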
Database Management Systems (DBMS) have become an essential tool for industry and research and are often a significant component of data centers. there have been many efforts to accelerate DBMS application performance...
ISBN: (Print) 9781665476522
Database Management Systems (DBMS) have become an essential tool for industry and research and are often a significant component of data centers. There have been many efforts to accelerate DBMS application performance. One of the most explored techniques is the use of vector processing. Unfortunately, conventional vector architectures have not been able to exploit the full potential of DBMS acceleration. In this paper, we present VAQUERO, our Scratchpad-based Vector Accelerator for QUEry pROcessing. VAQUERO improves the efficiency of vector architectures for DBMS operations such as data aggregation and hash joins featuring lookup tables. Lookup tables are significant contributors to the performance bottlenecks in DBMS processing, suffering from insufficient ISA support in the form of scatter-gather instructions. VAQUERO introduces a novel Advanced Scratchpad Memory specifically designed with two mapping modes: direct-mode and associative-mode. These mapping modes enable VAQUERO to accelerate real-world databases with workload sizes that significantly exceed the scratchpad memory capacity. Additionally, the associative-mode allows VAQUERO to be used with DBMS operators that use hashed keys, e.g., hash-join and hash-aggregate. VAQUERO has been designed considering general DBMS algorithm requirements instead of being based on a particular database organization. For this reason, VAQUERO is capable of accelerating DBMS operators for both row- and column-oriented databases. In this paper, we evaluate the efficiency of VAQUERO using two highly optimized, popular open-source DBMS, namely the row-based PostgreSQL and the column-based MonetDB. We implemented VAQUERO at the RTL level and prototyped it, by performing Place&Route, at the 7nm technology node. VAQUERO incurs a modest 0.15% area overhead compared with an Intel Ice Lake processor. Our evaluation shows that VAQUERO significantly outperforms PostgreSQL and MonetDB by 2.09x and 3.32x respectively, when processing operators and queries.
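The hash-aggregate operator the abstract names can be sketched in a few lines; the irregular, data-dependent hash-table accesses below are exactly the lookup-table pattern that strains scatter-gather support in conventional vector ISAs. The schema and column names are illustrative assumptions, not from the paper.

```python
from collections import defaultdict

def hash_aggregate(rows, key_col, agg_col):
    """GROUP BY key_col, SUM(agg_col) via a hash table -- the
    lookup-table access pattern the abstract identifies as a
    DBMS bottleneck."""
    table = defaultdict(float)           # hash table: group key -> running sum
    for row in rows:
        table[row[key_col]] += row[agg_col]   # scatter-style update per row
    return dict(table)

# Hypothetical toy relation.
rows = [
    {"region": "EU", "sales": 10.0},
    {"region": "US", "sales": 5.0},
    {"region": "EU", "sales": 7.5},
]
totals = hash_aggregate(rows, "region", "sales")   # {'EU': 17.5, 'US': 5.0}
```

Each update reads and writes a hash-table slot chosen by the row's key, so consecutive iterations touch unpredictable addresses; an associative scratchpad, as described in the abstract, keeps the hot slots close to the vector units.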