ISBN:
(Print) 9798331522452; 9798331522445
Audio denoising is crucial for delivering high-quality sound in applications ranging from communication devices to entertainment systems. On-device denoising is critical for ensuring consistent performance across various host platforms. Machine learning (ML) models exhibit strong audio processing performance in the frequency domain but require efficient hardware design. This paper focuses on enhancing audio quality using convolutional encoder-decoder ML models with low power consumption while meeting real-time processing constraints. We achieve this by developing a quantized network that optimally reduces computational costs without compromising enhancement quality. Furthermore, our hardware quantization scheme reduces memory usage by up to 75% while maintaining accuracy. Next, we design a complementary processing element activation routing scheme tailored to our algorithm, significantly reducing on-chip memory accesses by 5-9x. Fabricated in a 28 nm CMOS process, our chip demonstrates real-time audio denoising, processing each frame within 8 ms while consuming only 407 µW, or 3.24 µJ/frame, at 0.65 V and 18.5 MHz, making it ideal for battery-powered IoT devices. In terms of performance, our chip also achieves the highest evaluation score for audio quality (PESQ), outperforming previous works.
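As a rough numerical illustration of the memory saving such a quantization scheme can give (the parameter count below is an assumption, not a figure from the paper), moving weights from 32-bit floats to 8-bit integers removes 24 of every 32 bits, which is the "up to 75%" reduction the abstract cites:

```python
# Hypothetical illustration of quantization memory savings; exact savings
# depend on which tensors are quantized and how activations are handled.
num_weights = 1_200_000            # assumed parameter count, not from the paper

def weight_bytes(num_params: int, bits: int) -> int:
    """Storage needed for num_params weights at the given bit width."""
    return num_params * bits // 8

fp32 = weight_bytes(num_weights, 32)
int8 = weight_bytes(num_weights, 8)
print(f"FP32: {fp32/1e6:.2f} MB, INT8: {int8/1e6:.2f} MB, "
      f"saving {(1 - int8/fp32)*100:.0f}%")   # -> saving 75%
```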
Object detection and segmentation represent the basis for many tasks in computer and machine vision. In biometric recognition systems, the detection of the region-of-interest (ROI) is one of the most crucial steps in the processing pipeline, significantly impacting the performance of the entire recognition system. Existing approaches to ear detection are commonly susceptible to severe occlusions, ear accessories, or variable illumination conditions, and their performance often deteriorates when applied to ear images captured in unconstrained settings. To address these shortcomings, we present a novel ear detection technique based on convolutional encoder-decoder networks (CEDs). We formulate the problem of ear detection as a two-class segmentation problem and design and train a CED network architecture to distinguish between image pixels belonging to the ear and the non-ear class. Unlike competing techniques, our approach does not simply return a bounding box around the detected ear, but provides detailed, pixel-wise information about the location of the ears in the image. Experiments on a dataset gathered from the web (a.k.a. in the wild) show that the proposed technique ensures good detection results in the presence of various covariate factors and significantly outperforms competing methods from the literature.
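A minimal sketch of the two-class segmentation formulation described above, assuming a toy PyTorch encoder-decoder (layer sizes and channel counts are illustrative, not the authors' network):

```python
# Tiny convolutional encoder-decoder that labels every pixel as ear vs. non-ear.
import torch
import torch.nn as nn

class TinyCED(nn.Module):
    def __init__(self, num_classes: int = 2):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
        )
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(32, 16, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(16, num_classes, 4, stride=2, padding=1),
        )

    def forward(self, x):
        return self.decoder(self.encoder(x))       # per-pixel class logits

logits = TinyCED()(torch.randn(1, 3, 128, 128))    # -> (1, 2, 128, 128)
mask = logits.argmax(dim=1)                        # pixel-wise ear / non-ear map
```

The argmax over the two class channels yields the pixel-wise ear/non-ear map the abstract refers to, rather than a bounding box.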
ISBN:
(Print) 9781538646625
Colorectal cancer is a leading cause of cancer deaths, with an estimated 696 thousand worldwide. Recent years have seen an increase in the use of deep learning techniques and algorithms to detect colon polyps. In this work, we address colon polyp detection using Convolutional Neural Networks (CNNs) combined with autoencoders. We use three publicly available databases, namely CVC-ColonDB, CVC-ClinicDB, and ETIS-LaribPolypDB, to train the model. The accuracies obtained are 0.937, 0.951, and 0.967 for these databases, respectively. Due to the diverse shapes and characteristics of colon polyps, there is still room for improvement.
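A hedged sketch of how a CNN and an autoencoder could be combined for such a task (architecture and sizes are assumptions; the paper's exact design may differ):

```python
# Small autoencoder over colonoscopy patches with a classifier head on the
# encoder output predicting polyp vs. non-polyp.
import torch
import torch.nn as nn

class PolypAEClassifier(nn.Module):
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
        )
        self.decoder = nn.Sequential(                       # reconstruction path
            nn.ConvTranspose2d(32, 16, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(16, 3, 4, stride=2, padding=1),
        )
        self.classifier = nn.Sequential(                    # polyp / no-polyp head
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(32, 2),
        )

    def forward(self, x):
        z = self.encoder(x)
        return self.classifier(z), self.decoder(z)          # logits, reconstruction

logits, recon = PolypAEClassifier()(torch.randn(1, 3, 96, 96))
```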
Deep Neural Networks (DNNs) are powerful tools in real-time speech enhancement (SE) since they automatically learn high-level feature representations from raw audio, resulting in significant advancements. Demand for resource-efficient DNNs for speech enhancement is therefore increasing, mainly for embedded systems. Still, designing a lightweight and resource-efficient DNN with optimal speech enhancement performance remains a challenging task. Dual-path attention-driven architectures have shown notable performance in SE, primarily because of their ability to capture time and frequency dependencies. This paper proposes a resource-efficient SE approach using a codec-based dual-path time-frequency transformer (CTSE-Net) to improve noisy speech and applies it to speech recognition tasks. The proposed SE employs a codec (coder-decoder) architecture with feature calibration in skip connections to obtain fine-grained frequency components. The codec is interconnected using a dual-path time-frequency transformer incorporating time and frequency attention. The encoder encodes a time-frequency (T-F) representation derived from the distorted compressed speech spectrum, whereas the decoder estimates the compressed magnitude spectrum of enhanced speech. Further, dedicated speech activity detection (SAD) is employed to identify speech segments in the input signals. By distinguishing speech from background noise or silence, the SAD block provides important information to the decoder for target speech enhancement. The proposed resource-efficient approach ensures attention across time and frequency and distinguishes speech from background noise, leading to more effective denoising and enhancement. Experiments indicate that CTSE-Net shows robust noise reduction and contributes to accurate speech recognition. On the benchmark VCTK+DEMAND dataset, the proposed CTSE-Net demonstrates better SE performance, achieving notable improvements in ESTOI (33.69%), PESQ (1.05), and SDR (11.36 dB) over the noisy mixture.
Authors:
Saleem, Nasir (Gomal Univ, Fac Engn & Technol, Dept Elect Engn, Dera Ismail Khan, Pakistan); Bourouis, Sami (Taif Univ, Coll Comp & Informat Technol, Dept Informat Technol, Taif 21944, Saudi Arabia)
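A hedged sketch of the dual-path time-frequency attention idea behind the CTSE-Net abstract above (shapes, channel counts, and the residual wiring are assumptions, not the published architecture):

```python
# Attend along the time axis, then along the frequency axis, of a
# (batch, time, freq, channels) spectrogram feature map.
import torch
import torch.nn as nn

class DualPathTFAttention(nn.Module):
    def __init__(self, channels: int = 64, heads: int = 4):
        super().__init__()
        self.time_attn = nn.MultiheadAttention(channels, heads, batch_first=True)
        self.freq_attn = nn.MultiheadAttention(channels, heads, batch_first=True)

    def forward(self, x):                                   # x: (B, T, F, C)
        b, t, f, c = x.shape
        # Attend over time: fold frequency into the batch dimension.
        xt = x.permute(0, 2, 1, 3).reshape(b * f, t, c)
        xt, _ = self.time_attn(xt, xt, xt)
        x = xt.reshape(b, f, t, c).permute(0, 2, 1, 3) + x
        # Attend over frequency: fold time into the batch dimension.
        xf = x.reshape(b * t, f, c)
        xf, _ = self.freq_attn(xf, xf, xf)
        return xf.reshape(b, t, f, c) + x

out = DualPathTFAttention()(torch.randn(2, 100, 32, 64))    # same shape out
```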
Deep neural networks (DNNs) have been successfully applied in advancing speech enhancement (SE), particularly in overcoming the challenges posed by nonstationary noisy backgrounds. In this context, multi-scale feature fusion and recalibration (MFFR) can improve speech enhancement performance by combining multi-scale and recalibrated features. This paper proposes a speech enhancement system that capitalizes on a large-scale pre-trained model, seamlessly fused with features attentively recalibrated using varying kernel sizes in convolutional layers. This process enables the SE system to capture features across diverse scales, enhancing its overall performance. The proposed SE system uses a transferable feature extractor architecture and integrates it with multi-scale, attentively recalibrated features. Utilizing 2D convolutional layers, the convolutional encoder-decoder extracts both local and contextual features from speech signals. To capture long-term temporal dependencies, a bidirectional simple recurrent unit (BSRU) serves as a bottleneck layer positioned between the encoder and decoder. The experiments are conducted on three publicly available datasets: Texas Instruments/Massachusetts Institute of Technology (TIMIT), LibriSpeech, and Voice Cloning Toolkit + Diverse Environments Multi-channel Acoustic Noise Database (VCTK+DEMAND). The experimental results show that the proposed SE system performs better than several recent approaches on the Short-Time Objective Intelligibility (STOI) and Perceptual Evaluation of Speech Quality (PESQ) evaluation metrics. On the TIMIT dataset, the proposed system shows a considerable improvement in STOI (17.3%) and PESQ (0.74) over the noisy mixture. The evaluation on the LibriSpeech dataset yields improvements of 17.6% in STOI and 0.87 in PESQ.
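A minimal sketch of multi-scale feature extraction with channel recalibration in the spirit of the MFFR idea above (kernel sizes, channel counts, and the squeeze-and-excitation-style gate are assumptions):

```python
# Parallel convolutions with different kernel sizes capture multi-scale context;
# a channel-wise gate recalibrates the fused feature maps.
import torch
import torch.nn as nn

class MultiScaleRecalibration(nn.Module):
    def __init__(self, channels: int = 32):
        super().__init__()
        self.branches = nn.ModuleList([
            nn.Conv2d(channels, channels, k, padding=k // 2) for k in (1, 3, 5)
        ])
        self.fuse = nn.Conv2d(3 * channels, channels, 1)
        self.gate = nn.Sequential(                      # channel recalibration
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, channels // 4, 1), nn.ReLU(),
            nn.Conv2d(channels // 4, channels, 1), nn.Sigmoid(),
        )

    def forward(self, x):
        multi = torch.cat([b(x) for b in self.branches], dim=1)
        fused = self.fuse(multi)
        return fused * self.gate(fused)                 # recalibrated features

y = MultiScaleRecalibration()(torch.randn(1, 32, 64, 64))  # -> (1, 32, 64, 64)
```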
In real-time applications, the aim of speech enhancement (SE) is to achieve optimal performance while ensuring computational efficiency and near-instant outputs. Many deep neural models have achieved optimal performance in terms of speech quality and intelligibility. However, formulating efficient and compact deep neural models for real-time processing on resource-limited devices remains a challenge. This study presents a compact neural model designed in the complex frequency domain for speech enhancement, optimized for resource-limited devices. The proposed model combines convolutional encoder-decoder and recurrent architectures to effectively learn complex mappings from noisy speech for real-time speech enhancement, enabling low-latency causal processing. Recurrent architectures such as Long Short-Term Memory (LSTM), Gated Recurrent Unit (GRU), and Simple Recurrent Unit (SRU) are incorporated as bottlenecks to capture temporal dependencies and improve SE performance. By representing the speech in the complex frequency domain, the proposed model processes both magnitude and phase information. Further, this study extends the proposed models by incorporating attention-gate-based skip connections, enabling the models to focus on relevant information and dynamically weigh important features. The results show that the proposed models outperform recent benchmark models and obtain better speech quality and intelligibility while imposing a lower computational load. This study uses the WSJ0 database, where clean sentences from WSJ0 are mixed with different background noises to create noisy mixtures. STOI and PESQ are improved by 21.1% and 1.25 (41.5%) on the WSJ0 database, whereas on the VoiceBank+DEMAND database, STOI and PESQ are improved by 4.1% and 1.24 (38.6%), respectively. The extended models show further improvement in STOI and PESQ in seen and unseen noisy conditions.
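A hedged sketch of an attention-gated skip connection of the kind the abstract mentions, following the common Attention U-Net formulation (the authors' exact gating may differ):

```python
# The decoder feature g acts as a query that weighs the encoder feature x
# before it is passed across the skip connection.
import torch
import torch.nn as nn

class AttentionGate(nn.Module):
    def __init__(self, channels: int = 32):
        super().__init__()
        self.theta_x = nn.Conv2d(channels, channels, 1)
        self.phi_g = nn.Conv2d(channels, channels, 1)
        self.psi = nn.Conv2d(channels, 1, 1)

    def forward(self, x, g):            # x: encoder feature, g: decoder feature
        attn = torch.sigmoid(self.psi(torch.relu(self.theta_x(x) + self.phi_g(g))))
        return x * attn                  # suppress irrelevant encoder features

x = torch.randn(1, 32, 40, 40)           # encoder skip feature
g = torch.randn(1, 32, 40, 40)           # decoder feature at the same scale
gated = AttentionGate()(x, g)             # -> (1, 32, 40, 40)
```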
Speech enhancement (SE) aims to improve the quality and intelligibility of speech signals, particularly in the presence of noise or other distortions, to ensure reliable communication and robust speech recognition. Deep neural networks (DNNs) have shown remarkable success in SE due to their ability to learn complex patterns and representations from large amounts of data. However, they face limitations in handling long-term temporal sequences. Spiking neural networks and transformers inherently manage temporal data and capture fine-grained temporal patterns in speech signals. This paper proposes a model that integrates self-attention with spiking neural networks for speech enhancement. The proposed model employs a convolutional encoder-decoder architecture with a spiking transformer acting as a bottleneck network. The spiking self-attention mechanism in this framework represents features using spike-based queries, keys, and values. This approach enhances features by effectively capturing temporal dependencies and contextual relationships in speech signals. The spiking transformer is divided into two branches to capture comprehensive global dependencies across the temporal and spectral dimensions. The encoder-decoder incorporates a multi-scale feature extractor, which extracts features at various scales, enabling the model to build a comprehensive hierarchical representation. This representation significantly enhances the model's ability to learn and process noisy speech, leading to excellent SE performance. Experiments are conducted using two publicly available benchmark datasets: WSJ0-SI84 and VCTK+DEMAND. The proposed model demonstrates improved SE performance, with notable gains of 33.69% in ESTOI, 1.05 in PESQ, and 11.36 dB in SDR over the noisy mixtures.
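A hedged sketch of spike-based self-attention in the spirit of spiking transformers such as Spikformer (the thresholding, scaling, and absence of softmax here are assumptions, and training would additionally need surrogate gradients):

```python
# Queries, keys, and values are binary spike tensors, so attention scores are
# computed from spike products rather than a softmax over dense logits.
import torch
import torch.nn as nn

def to_spikes(x: torch.Tensor, threshold: float = 0.0) -> torch.Tensor:
    """Crude spike coding: fire (1) where the activation exceeds the threshold."""
    return (x > threshold).float()

class SpikingSelfAttention(nn.Module):
    def __init__(self, dim: int = 64):
        super().__init__()
        self.q = nn.Linear(dim, dim)
        self.k = nn.Linear(dim, dim)
        self.v = nn.Linear(dim, dim)
        self.scale = dim ** -0.5

    def forward(self, x):                                  # x: (B, T, dim)
        q, k, v = (to_spikes(p(x)) for p in (self.q, self.k, self.v))
        attn = q @ k.transpose(-2, -1) * self.scale        # spike-based similarity
        return attn @ v                                    # no softmax, as in SSA

out = SpikingSelfAttention()(torch.randn(2, 100, 64))      # -> (2, 100, 64)
```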
Speech enhancement (SE) is a critical aspect of various speech-processing applications. Recent research in this field focuses on identifying effective ways to capture the long-term contextual dependencies of speech signals to enhance performance. Deep convolutional networks (DCN) using self-attention and the Transformer model have demonstrated competitive results in SE. Transformer models with convolution layers can capture short- and long-term temporal sequences by leveraging multi-head self-attention, which allows the model to attend to the entire sequence. This study proposes a neural speech enhancement (NSE) approach using a convolutional encoder-decoder (CED) and a convolutional attention Transformer (CAT), named NSE-CATNet. To effectively process the time-frequency (T-F) distribution of spectral components in speech signals, a T-F attention module is incorporated into the convolutional Transformer model. This module enables the model to explicitly leverage position information and generate a two-dimensional attention map for the time-frequency speech distribution. The performance of the proposed SE is evaluated using objective speech quality and intelligibility metrics on two different datasets, the VoiceBank-DEMAND corpus and the LibriSpeech dataset. The experimental results indicate that the proposed SE outperforms competitive baselines at -5 dB, 0 dB, and 5 dB SNR, improving overall quality by 0.704 on VoiceBank-DEMAND and by 0.692 on LibriSpeech. Further, intelligibility on VoiceBank-DEMAND and LibriSpeech is improved by 11.325% and 11.75%, respectively, over the noisy speech signals.
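A minimal sketch of a two-dimensional time-frequency attention map of the kind described above (pooling choices and layer shapes are assumptions, not the published T-F attention module):

```python
# Pool the feature map along frequency to score each time frame, pool along
# time to score each frequency bin, and combine the two into a 2-D attention
# map applied to the spectrogram features.
import torch
import torch.nn as nn

class TFAttention(nn.Module):
    def __init__(self, channels: int = 32):
        super().__init__()
        self.time_fc = nn.Conv1d(channels, channels, 1)
        self.freq_fc = nn.Conv1d(channels, channels, 1)

    def forward(self, x):                                   # x: (B, C, T, F)
        t_score = torch.sigmoid(self.time_fc(x.mean(dim=3)))    # (B, C, T)
        f_score = torch.sigmoid(self.freq_fc(x.mean(dim=2)))    # (B, C, F)
        attn = t_score.unsqueeze(3) * f_score.unsqueeze(2)       # (B, C, T, F)
        return x * attn

y = TFAttention()(torch.randn(1, 32, 100, 64))   # -> (1, 32, 100, 64)
```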
Speech enhancement is the task of taking a noisy speech input and producing an enhanced speech output. In recent years, the need for speech enhancement has increased due to challenges that occurred in various applications such as hearing aids, Automatic Speech Recognition (ASR), and mobile speech communication systems. Most speech enhancement research has been carried out for English, Chinese, and other European languages, and only a few research works involve speech enhancement for Indian regional languages. In this paper, we propose a two-fold architecture to perform speech enhancement for Tamil speech signals based on a convolutional recurrent neural network (CRN) that addresses speech enhancement in real time for a single channel or track of sound created by the speaker. In the first stage, a mask-based long short-term memory (LSTM) is used for noise suppression along with a loss function, and in the second stage, a convolutional encoder-decoder (CED) is used for speech enhancement. The proposed model is evaluated in various speaker and noise environments such as babble noise, car noise, and white Gaussian noise. The proposed CRN model improves speech quality by 0.1 points compared with the LSTM base model, and the CRN also requires fewer parameters for training. The performance of the proposed model is outstanding even at low Signal-to-Noise Ratio (SNR).
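A hedged sketch of the two-stage pipeline described above, with an LSTM mask stage followed by a small convolutional encoder-decoder (layer sizes, the 161 frequency bins, and the refinement head are assumptions):

```python
# Stage 1: an LSTM predicts a time-frequency mask for noise suppression.
# Stage 2: a small convolutional encoder-decoder refines the masked spectrogram.
import torch
import torch.nn as nn

class TwoStageEnhancer(nn.Module):
    def __init__(self, freq_bins: int = 161):
        super().__init__()
        self.lstm = nn.LSTM(freq_bins, 256, batch_first=True)
        self.mask = nn.Sequential(nn.Linear(256, freq_bins), nn.Sigmoid())
        self.ced = nn.Sequential(                      # stage 2: refinement CED
            nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(),
            nn.Conv2d(16, 1, 3, padding=1),
        )

    def forward(self, mag):                            # mag: (B, T, F) magnitudes
        h, _ = self.lstm(mag)
        masked = mag * self.mask(h)                    # stage 1: mask-based suppression
        return self.ced(masked.unsqueeze(1)).squeeze(1)

out = TwoStageEnhancer()(torch.rand(2, 100, 161))      # -> (2, 100, 161)
```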
Despite the deployment of collaborative robots in various industrial processes, their teaching and control remain comparatively difficult tasks compared with general industrial robots. Various imitation learning methods involving the transfer of human poses to a collaborative robot have been proposed. However, most of these methods depend heavily on deep learning-based human recognition algorithms that fail to recognize complicated human poses. To address this issue, we propose an automated/semi-automated vision-based teleoperation framework using a human digital twin and a collaborative robot digital twin model. First, a human pose is recognized and reasoned to a human skeleton model using a convolutional encoder-decoder architecture. Next, the developed human digital twin model is taught using the skeletons. As humans and collaborative robots have different joint and rotation architectures, pose mapping is achieved using the proposed Bezier curve-based smooth approximation. Then, a real collaborative robot is controlled using the developed robot digital twin. Furthermore, the proposed framework continues to work using the human digital twin in the case of human pose recognition failures. To verify the effectiveness of the proposed framework, transfers of several human poses to a real collaborative robot are tested and analyzed.
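A minimal sketch of cubic Bezier smoothing of the sort the pose-mapping step could use (control-point placement and the single-joint example are assumptions, not the authors' mapping):

```python
# Joint-angle keyframes are smoothed by evaluating a cubic Bezier curve
# between them, giving the robot a continuous approximation of the human
# pose trajectory.
import numpy as np

def cubic_bezier(p0, p1, p2, p3, n: int = 50) -> np.ndarray:
    """Evaluate a cubic Bezier curve defined by four control points."""
    t = np.linspace(0.0, 1.0, n)[:, None]
    return ((1 - t) ** 3 * p0 + 3 * (1 - t) ** 2 * t * p1
            + 3 * (1 - t) * t ** 2 * p2 + t ** 3 * p3)

# Example: smooth one joint angle between two recognized human poses.
start, end = np.array([0.0]), np.array([1.2])          # radians, illustrative
ctrl1, ctrl2 = start + 0.1, end - 0.1                  # assumed control points
trajectory = cubic_bezier(start, ctrl1, ctrl2, end)    # (50, 1) smooth path
```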