ISBN:
(Print) 9798331522452; 9798331522445
Audio denoising is crucial for delivering high-quality sound in applications ranging from communication devices to entertainment systems. On-device denoising is critical for ensuring consistent performance across various host platforms. Machine learning (ML) models exhibit strong audio processing performance in the frequency domain but require efficient hardware design. This paper focuses on enhancing audio quality using convolutional encoder-decoder ML models with low power consumption while meeting real-time processing constraints. We achieve this by developing a quantized network that optimally reduces computational costs without compromising enhancement quality. Furthermore, our hardware quantization scheme reduces memory usage by up to 75% while maintaining accuracy. Next, we design a complementary processing element activation routing scheme tailored to our algorithm, significantly reducing on-chip memory accesses by 5-9x. Fabricated in a 28 nm CMOS process, our chip demonstrates real-time audio denoising, processing each frame within 8 ms while consuming only 407 µW, or 3.24 µJ/frame, at 0.65 V and 18.5 MHz, making it ideal for battery-powered IoT devices. In terms of performance, our chip also achieves the highest evaluation score for audio quality (PESQ), outperforming previous works.
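As a rough numerical illustration of the memory saving such a quantization scheme can give (the parameter count below is an assumption, not a figure from the paper), moving weights from 32-bit floats to 8-bit integers removes 24 of every 32 bits, which is the "up to 75%" reduction the abstract cites:

```python
# Hypothetical illustration of quantization memory savings; exact savings
# depend on which tensors are quantized and how activations are handled.
num_weights = 1_200_000            # assumed parameter count, not from the paper

def weight_bytes(num_params: int, bits: int) -> int:
    """Storage needed for num_params weights at the given bit width."""
    return num_params * bits // 8

fp32 = weight_bytes(num_weights, 32)
int8 = weight_bytes(num_weights, 8)
print(f"FP32: {fp32/1e6:.2f} MB, INT8: {int8/1e6:.2f} MB, "
      f"saving {(1 - int8/fp32)*100:.0f}%")   # -> saving 75%
```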
Object detection and segmentation represent the basis for many tasks in computer and machine vision. In biometric recognition systems, the detection of the region-of-interest (ROI) is one of the most crucial steps in the processing pipeline, significantly impacting the performance of the entire recognition system. Existing approaches to ear detection are commonly susceptible to severe occlusions, ear accessories, or variable illumination conditions, and their performance often deteriorates when applied to ear images captured in unconstrained settings. To address these shortcomings, we present a novel ear detection technique based on convolutional encoder-decoder networks (CEDs). We formulate the problem of ear detection as a two-class segmentation problem and design and train a CED network architecture to distinguish between image pixels belonging to the ear and the non-ear class. Unlike competing techniques, our approach does not simply return a bounding box around the detected ear, but provides detailed, pixel-wise information about the location of the ears in the image. Experiments on a dataset gathered from the web (a.k.a. in the wild) show that the proposed technique ensures good detection results in the presence of various covariate factors and significantly outperforms competing methods from the literature.
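A minimal sketch of the two-class segmentation formulation described above, assuming a toy PyTorch encoder-decoder (layer sizes and channel counts are illustrative, not the authors' network):

```python
# Tiny convolutional encoder-decoder that labels every pixel as ear vs. non-ear.
import torch
import torch.nn as nn

class TinyCED(nn.Module):
    def __init__(self, num_classes: int = 2):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
        )
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(32, 16, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(16, num_classes, 4, stride=2, padding=1),
        )

    def forward(self, x):
        return self.decoder(self.encoder(x))       # per-pixel class logits

logits = TinyCED()(torch.randn(1, 3, 128, 128))    # -> (1, 2, 128, 128)
mask = logits.argmax(dim=1)                        # pixel-wise ear / non-ear map
```

The argmax over the two class channels yields the pixel-wise ear/non-ear map the abstract refers to, rather than a bounding box.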
ISBN:
(Print) 9781538646625
Colorectal cancer is a leading cause of cancer deaths, with an estimated 696 thousand worldwide. Recent years have seen an increase in the use of deep learning techniques and algorithms to detect colon polyps. In this work, we address colon polyp detection using Convolutional Neural Networks (CNNs) combined with autoencoders. We use three publicly available databases, namely CVC-ColonDB, CVC-ClinicDB, and ETIS-LaribPolypDB, to train the model. The accuracies obtained are 0.937, 0.951, and 0.967 for these databases, respectively. Due to the diverse shapes and characteristics of colon polyps, there is still room for improvement.
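A hedged sketch of how a CNN and an autoencoder could be combined for such a task (architecture and sizes are assumptions; the paper's exact design may differ):

```python
# Small autoencoder over colonoscopy patches with a classifier head on the
# encoder output predicting polyp vs. non-polyp.
import torch
import torch.nn as nn

class PolypAEClassifier(nn.Module):
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
        )
        self.decoder = nn.Sequential(                       # reconstruction path
            nn.ConvTranspose2d(32, 16, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(16, 3, 4, stride=2, padding=1),
        )
        self.classifier = nn.Sequential(                    # polyp / no-polyp head
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(32, 2),
        )

    def forward(self, x):
        z = self.encoder(x)
        return self.classifier(z), self.decoder(z)          # logits, reconstruction

logits, recon = PolypAEClassifier()(torch.randn(1, 3, 96, 96))
```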
Deep Neural Networks (DNNs) are powerful tools in real-time speech enhancement (SE) since they automatically learn high-level feature representations from raw audio, resulting in significant advancements. Demand for resource-efficient DNNs for speech enhancement is therefore increasing, mainly for embedded systems. Still, designing a lightweight and resource-efficient DNN with optimal speech enhancement performance remains a challenging task. Dual-path attention-driven architectures have shown notable performance in SE, primarily because of their ability to capture time and frequency dependencies. This paper proposes a resource-efficient SE approach using a codec-based dual-path time-frequency transformer (CTSE-Net) to improve noisy speech and applies it to speech recognition tasks. The proposed SE employs a codec (coder-decoder) architecture with feature calibration in skip connections to obtain fine-grained frequency components. The codec is interconnected using a dual-path time-frequency transformer incorporating time and frequency attention. The encoder encodes a time-frequency (T-F) representation derived from the distorted compressed speech spectrum, whereas the decoder estimates the compressed magnitude spectrum of enhanced speech. Further, dedicated speech activity detection (SAD) is employed to identify speech segments in the input signals. By distinguishing speech from background noise or silence, the SAD block provides important information to the decoder for target speech enhancement. The proposed resource-efficient approach ensures attention across time and frequency and distinguishes speech from background noise, leading to more effective denoising and enhancement. Experiments indicate that CTSE-Net shows robust noise reduction and contributes to accurate speech recognition. On the benchmark VCTK+DEMAND dataset, the proposed CTSE-Net demonstrates better SE performance, achieving notable improvements in ESTOI (33.69%), PESQ (1.05), and SDR (11.36 dB) over the noisy mixture.
Authors:
Saleem, Nasir (Gomal Univ, Fac Engn & Technol, Dept Elect Engn, Dera Ismail Khan, Pakistan); Bourouis, Sami (Taif Univ, Coll Comp & Informat Technol, Dept Informat Technol, Taif 21944, Saudi Arabia)
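A hedged sketch of the dual-path time-frequency attention idea behind the CTSE-Net abstract above (shapes, channel counts, and the residual wiring are assumptions, not the published architecture):

```python
# Attend along the time axis, then along the frequency axis, of a
# (batch, time, freq, channels) spectrogram feature map.
import torch
import torch.nn as nn

class DualPathTFAttention(nn.Module):
    def __init__(self, channels: int = 64, heads: int = 4):
        super().__init__()
        self.time_attn = nn.MultiheadAttention(channels, heads, batch_first=True)
        self.freq_attn = nn.MultiheadAttention(channels, heads, batch_first=True)

    def forward(self, x):                                   # x: (B, T, F, C)
        b, t, f, c = x.shape
        # Attend over time: fold frequency into the batch dimension.
        xt = x.permute(0, 2, 1, 3).reshape(b * f, t, c)
        xt, _ = self.time_attn(xt, xt, xt)
        x = xt.reshape(b, f, t, c).permute(0, 2, 1, 3) + x
        # Attend over frequency: fold time into the batch dimension.
        xf = x.reshape(b * t, f, c)
        xf, _ = self.freq_attn(xf, xf, xf)
        return xf.reshape(b, t, f, c) + x

out = DualPathTFAttention()(torch.randn(2, 100, 32, 64))    # same shape out
```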
Deep neural networks (DNNs) have been successfully applied in advancing speech enhancement (SE), particularly in overcoming the challenges posed by nonstationary noisy backgrounds. In this context, multi-scale feature fusion and recalibration (MFFR) can improve speech enhancement performance by combining multi-scale and recalibrated features. This paper proposes a speech enhancement system that capitalizes on a large-scale pre-trained model, seamlessly fused with features attentively recalibrated using varying kernel sizes in convolutional layers. This process enables the SE system to capture features across diverse scales, enhancing its overall performance. The proposed SE system uses a transferable feature extractor architecture and integrates it with multi-scale, attentively recalibrated features. Utilizing 2D convolutional layers, the convolutional encoder-decoder extracts both local and contextual features from speech signals. To capture long-term temporal dependencies, a bidirectional simple recurrent unit (BSRU) serves as a bottleneck layer positioned between the encoder and decoder. The experiments are conducted on three publicly available datasets: Texas Instruments/Massachusetts Institute of Technology (TIMIT), LibriSpeech, and Voice Cloning Toolkit + Diverse Environments Multi-channel Acoustic Noise Database (VCTK+DEMAND). The experimental results show that the proposed SE system performs better than several recent approaches on the Short-Time Objective Intelligibility (STOI) and Perceptual Evaluation of Speech Quality (PESQ) evaluation metrics. On the TIMIT dataset, the proposed system shows a considerable improvement in STOI (17.3%) and PESQ (0.74) over the noisy mixture. The evaluation on the LibriSpeech dataset yields improvements of 17.6% in STOI and 0.87 in PESQ.
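A minimal sketch of multi-scale feature extraction with channel recalibration in the spirit of the MFFR idea above (kernel sizes, channel counts, and the squeeze-and-excitation-style gate are assumptions):

```python
# Parallel convolutions with different kernel sizes capture multi-scale context;
# a channel-wise gate recalibrates the fused feature maps.
import torch
import torch.nn as nn

class MultiScaleRecalibration(nn.Module):
    def __init__(self, channels: int = 32):
        super().__init__()
        self.branches = nn.ModuleList([
            nn.Conv2d(channels, channels, k, padding=k // 2) for k in (1, 3, 5)
        ])
        self.fuse = nn.Conv2d(3 * channels, channels, 1)
        self.gate = nn.Sequential(                      # channel recalibration
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, channels // 4, 1), nn.ReLU(),
            nn.Conv2d(channels // 4, channels, 1), nn.Sigmoid(),
        )

    def forward(self, x):
        multi = torch.cat([b(x) for b in self.branches], dim=1)
        fused = self.fuse(multi)
        return fused * self.gate(fused)                 # recalibrated features

y = MultiScaleRecalibration()(torch.randn(1, 32, 64, 64))  # -> (1, 32, 64, 64)
```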
In real-time applications, the aim of speech enhancement (SE) is to achieve optimal performance while ensuring computational efficiency and near-instant outputs. Many deep neural models have achieved optimal performance in terms of speech quality and intelligibility. However, formulating efficient and compact deep neural models for real-time processing on resource-limited devices remains a challenge. This study presents a compact neural model designed in the complex frequency domain for speech enhancement, optimized for resource-limited devices. The proposed model combines convolutional encoder-decoder and recurrent architectures to effectively learn complex mappings from noisy speech for real-time speech enhancement, enabling low-latency causal processing. Recurrent architectures such as Long Short-Term Memory (LSTM), Gated Recurrent Unit (GRU), and Simple Recurrent Unit (SRU) are incorporated as bottlenecks to capture temporal dependencies and improve SE performance. By representing the speech in the complex frequency domain, the proposed model processes both magnitude and phase information. Further, this study extends the proposed models by incorporating attention-gate-based skip connections, enabling the models to focus on relevant information and dynamically weigh important features. The results show that the proposed models outperform recent benchmark models and obtain better speech quality and intelligibility while imposing a lower computational load. This study uses the WSJ0 database, where clean sentences from WSJ0 are mixed with different background noises to create noisy mixtures. STOI and PESQ are improved by 21.1% and 1.25 (41.5%) on the WSJ0 database, whereas on the VoiceBank+DEMAND database, STOI and PESQ are improved by 4.1% and 1.24 (38.6%), respectively. The extended models show further improvement in STOI and PESQ in seen and unseen noisy conditions.
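A hedged sketch of an attention-gated skip connection of the kind the abstract mentions, following the common Attention U-Net formulation (the authors' exact gating may differ):

```python
# The decoder feature g acts as a query that weighs the encoder feature x
# before it is passed across the skip connection.
import torch
import torch.nn as nn

class AttentionGate(nn.Module):
    def __init__(self, channels: int = 32):
        super().__init__()
        self.theta_x = nn.Conv2d(channels, channels, 1)
        self.phi_g = nn.Conv2d(channels, channels, 1)
        self.psi = nn.Conv2d(channels, 1, 1)

    def forward(self, x, g):            # x: encoder feature, g: decoder feature
        attn = torch.sigmoid(self.psi(torch.relu(self.theta_x(x) + self.phi_g(g))))
        return x * attn                  # suppress irrelevant encoder features

x = torch.randn(1, 32, 40, 40)           # encoder skip feature
g = torch.randn(1, 32, 40, 40)           # decoder feature at the same scale
gated = AttentionGate()(x, g)             # -> (1, 32, 40, 40)
```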
Speech enhancement (SE) aims to improve the quality and intelligibility of speech signals, particularly in the presence of noise or other distortions, to ensure reliable communication and robust speech recognition. Deep neural networks (DNNs) have shown remarkable success in SE due to their ability to learn complex patterns and representations from large amounts of data. However, they face limitations in handling long-term temporal sequences. Spiking neural networks and transformers inherently manage temporal data and capture fine-grained temporal patterns in speech signals. This paper proposes a model that integrates self-attention with spiking neural networks for speech enhancement. The proposed model employs a convolutional encoder-decoder architecture with a spiking transformer acting as a bottleneck network. The spiking self-attention mechanism in this framework represents features using spike-based queries, keys, and values. This approach enhances features by effectively capturing temporal dependencies and contextual relationships in speech signals. The spiking transformer is divided into two branches to capture comprehensive global dependencies across the temporal and spectral dimensions. The encoder-decoder incorporates a multi-scale feature extractor, which extracts features at various scales, enabling the model to build a comprehensive hierarchical representation. This representation significantly enhances the model's ability to learn and process noisy speech, leading to excellent SE performance. Experiments are conducted using two publicly available benchmark datasets: WSJ0-SI84 and VCTK+DEMAND. The proposed model demonstrates improved SE performance, with notable gains of 33.69% in ESTOI, 1.05 in PESQ, and 11.36 dB in SDR over the noisy mixtures.
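A hedged sketch of spike-based self-attention in the spirit of spiking transformers such as Spikformer (the thresholding, scaling, and absence of softmax here are assumptions, and training would additionally need surrogate gradients):

```python
# Queries, keys, and values are binary spike tensors, so attention scores are
# computed from spike products rather than a softmax over dense logits.
import torch
import torch.nn as nn

def to_spikes(x: torch.Tensor, threshold: float = 0.0) -> torch.Tensor:
    """Crude spike coding: fire (1) where the activation exceeds the threshold."""
    return (x > threshold).float()

class SpikingSelfAttention(nn.Module):
    def __init__(self, dim: int = 64):
        super().__init__()
        self.q = nn.Linear(dim, dim)
        self.k = nn.Linear(dim, dim)
        self.v = nn.Linear(dim, dim)
        self.scale = dim ** -0.5

    def forward(self, x):                                  # x: (B, T, dim)
        q, k, v = (to_spikes(p(x)) for p in (self.q, self.k, self.v))
        attn = q @ k.transpose(-2, -1) * self.scale        # spike-based similarity
        return attn @ v                                    # no softmax, as in SSA

out = SpikingSelfAttention()(torch.randn(2, 100, 64))      # -> (2, 100, 64)
```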
Speech enhancement (SE) is a critical aspect of various speech-processing applications. Recent research in this field focuses on identifying effective ways to capture the long-term contextual dependencies of speech signals to enhance performance. Deep convolutional networks (DCN) using self-attention and the Transformer model have demonstrated competitive results in SE. Transformer models with convolution layers can capture short- and long-term temporal sequences by leveraging multi-head self-attention, which allows the model to attend to the entire sequence. This study proposes a neural speech enhancement (NSE) approach using a convolutional encoder-decoder (CED) and a convolutional attention Transformer (CAT), named NSE-CATNet. To effectively process the time-frequency (T-F) distribution of spectral components in speech signals, a T-F attention module is incorporated into the convolutional Transformer model. This module enables the model to explicitly leverage position information and generate a two-dimensional attention map for the time-frequency speech distribution. The performance of the proposed SE is evaluated using objective speech quality and intelligibility metrics on two different datasets, the VoiceBank-DEMAND corpus and the LibriSpeech dataset. The experimental results indicate that the proposed SE outperforms competitive baselines at -5 dB, 0 dB, and 5 dB SNR, improving overall quality by 0.704 on VoiceBank-DEMAND and by 0.692 on LibriSpeech. Further, intelligibility on VoiceBank-DEMAND and LibriSpeech is improved by 11.325% and 11.75%, respectively, over the noisy speech signals.
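A minimal sketch of a two-dimensional time-frequency attention map of the kind described above (pooling choices and layer shapes are assumptions, not the published T-F attention module):

```python
# Pool the feature map along frequency to score each time frame, pool along
# time to score each frequency bin, and combine the two into a 2-D attention
# map applied to the spectrogram features.
import torch
import torch.nn as nn

class TFAttention(nn.Module):
    def __init__(self, channels: int = 32):
        super().__init__()
        self.time_fc = nn.Conv1d(channels, channels, 1)
        self.freq_fc = nn.Conv1d(channels, channels, 1)

    def forward(self, x):                                   # x: (B, C, T, F)
        t_score = torch.sigmoid(self.time_fc(x.mean(dim=3)))    # (B, C, T)
        f_score = torch.sigmoid(self.freq_fc(x.mean(dim=2)))    # (B, C, F)
        attn = t_score.unsqueeze(3) * f_score.unsqueeze(2)       # (B, C, T, F)
        return x * attn

y = TFAttention()(torch.randn(1, 32, 100, 64))   # -> (1, 32, 100, 64)
```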
Speech enhancement is the task of taking a noisy speech input and producing an enhanced speech output. In recent years, the need for speech enhancement has increased due to challenges that occurred in various applications such as hearing aids, Automatic Speech Recognition (ASR), and mobile speech communication systems. Most speech enhancement research has been carried out for English, Chinese, and other European languages, and only a few research works involve speech enhancement for Indian regional languages. In this paper, we propose a two-fold architecture to perform speech enhancement for Tamil speech signals based on a convolutional recurrent neural network (CRN) that addresses speech enhancement in real time for a single channel or track of sound created by the speaker. In the first stage, a mask-based long short-term memory (LSTM) is used for noise suppression along with a loss function, and in the second stage, a convolutional encoder-decoder (CED) is used for speech enhancement. The proposed model is evaluated in various speaker and noise environments such as babble noise, car noise, and white Gaussian noise. The proposed CRN model improves speech quality by 0.1 points compared with the LSTM base model, and the CRN also requires fewer parameters for training. The performance of the proposed model is outstanding even at low Signal-to-Noise Ratio (SNR).
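A hedged sketch of the two-stage pipeline described above, with an LSTM mask stage followed by a small convolutional encoder-decoder (layer sizes, the 161 frequency bins, and the refinement head are assumptions):

```python
# Stage 1: an LSTM predicts a time-frequency mask for noise suppression.
# Stage 2: a small convolutional encoder-decoder refines the masked spectrogram.
import torch
import torch.nn as nn

class TwoStageEnhancer(nn.Module):
    def __init__(self, freq_bins: int = 161):
        super().__init__()
        self.lstm = nn.LSTM(freq_bins, 256, batch_first=True)
        self.mask = nn.Sequential(nn.Linear(256, freq_bins), nn.Sigmoid())
        self.ced = nn.Sequential(                      # stage 2: refinement CED
            nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(),
            nn.Conv2d(16, 1, 3, padding=1),
        )

    def forward(self, mag):                            # mag: (B, T, F) magnitudes
        h, _ = self.lstm(mag)
        masked = mag * self.mask(h)                    # stage 1: mask-based suppression
        return self.ced(masked.unsqueeze(1)).squeeze(1)

out = TwoStageEnhancer()(torch.rand(2, 100, 161))      # -> (2, 100, 161)
```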
Despite the deployment of collaborative robots in various industrial processes, their teaching and control remain comparatively difficult tasks compared with general industrial robots. Various imitation learning methods involving the transfer of human poses to a collaborative robot have been proposed. However, most of these methods depend heavily on deep learning-based human recognition algorithms that fail to recognize complicated human poses. To address this issue, we propose an automated/semi-automated vision-based teleoperation framework using a human digital twin and a collaborative robot digital twin model. First, a human pose is recognized and reasoned to a human skeleton model using a convolutional encoder-decoder architecture. Next, the developed human digital twin model is taught using the skeletons. As humans and collaborative robots have different joint and rotation architectures, pose mapping is achieved using the proposed Bezier curve-based smooth approximation. Then, a real collaborative robot is controlled using the developed robot digital twin. Furthermore, the proposed framework continues to work using the human digital twin in the case of human pose recognition failures. To verify the effectiveness of the proposed framework, transfers of several human poses to a real collaborative robot are tested and analyzed.
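A minimal sketch of cubic Bezier smoothing of the sort the pose-mapping step could use (control-point placement and the single-joint example are assumptions, not the authors' mapping):

```python
# Joint-angle keyframes are smoothed by evaluating a cubic Bezier curve
# between them, giving the robot a continuous approximation of the human
# pose trajectory.
import numpy as np

def cubic_bezier(p0, p1, p2, p3, n: int = 50) -> np.ndarray:
    """Evaluate a cubic Bezier curve defined by four control points."""
    t = np.linspace(0.0, 1.0, n)[:, None]
    return ((1 - t) ** 3 * p0 + 3 * (1 - t) ** 2 * t * p1
            + 3 * (1 - t) * t ** 2 * p2 + t ** 3 * p3)

# Example: smooth one joint angle between two recognized human poses.
start, end = np.array([0.0]), np.array([1.2])          # radians, illustrative
ctrl1, ctrl2 = start + 0.1, end - 0.1                  # assumed control points
trajectory = cubic_bezier(start, ctrl1, ctrl2, end)    # (50, 1) smooth path
```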