For monaural speech enhancement, contextual information is important for accurate speech estimation. However, commonly used convolutional neural networks (CNNs) are weak at capturing temporal context, since their building blocks process only one local neighborhood at a time. To address this problem, we draw on human auditory perception to introduce a two-stage trainable reasoning mechanism, referred to as the global-local dependency (GLD) block. GLD blocks capture long-term dependencies of time-frequency bins at both the global and local levels of the noisy spectrogram, helping to detect correlations among the speech part, the noise part, and the whole noisy input. Furthermore, we construct a monaural speech enhancement network called GLD-Net, which adopts an encoder-decoder architecture and consists of a speech object branch, an interference branch, and a global noisy branch. The speech features extracted at the global and local levels are efficiently reasoned over and aggregated in each branch. We compare the proposed GLD-Net with existing state-of-the-art methods on the WSJ0 and DEMAND datasets. The results show that GLD-Net outperforms the state-of-the-art methods in terms of PESQ and STOI.
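The abstract does not include code; the core idea of a global dependency over time-frequency bins can be sketched as plain dot-product attention, where every bin attends to every other bin. This is a minimal illustration in pure Python (function names are hypothetical, not the authors' implementation):

```python
import math

def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def global_attention(feats):
    """Non-local (global) dependency sketch: every time-frequency bin
    attends to every other bin via dot-product similarity, so distant
    speech and noise regions can inform each other."""
    out = []
    for q in feats:
        # similarity of this bin's feature vector to every bin
        scores = [sum(qi * ki for qi, ki in zip(q, k)) for k in feats]
        w = softmax(scores)
        # attention-weighted mixture of all bin features
        out.append([sum(wj * kj[d] for wj, kj in zip(w, feats))
                    for d in range(len(q))])
    return out
```

A local-level variant would simply restrict the inner loop to a neighborhood window; the GLD block as described combines both levels.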
Enhancement of low-light images is a challenging task due to the impact of low brightness, low contrast, and high noise. The inability to collect natural labeled data intensifies this problem further. Many researchers have attempted to solve this problem using learning-based approaches; however, most models ignore the impact of noise in low-lit images. In this paper, an encoder-decoder architecture, made up of separable convolution layers that address the issues encountered in low-light image enhancement, is proposed. The architecture is trained end-to-end on a custom low-light image dataset (LID), comprising both clean and noisy images. We introduce a unique multi-context feature extraction module (MC-FEM) where the input first passes through a feature pyramid of dilated separable convolutions for hierarchical-context feature extraction, followed by separable convolutions for feature compression. The model is optimized using a novel three-part loss function that focuses on high-level contextual features, structural similarity, and patch-wise local information. We conducted several ablation studies to determine the optimal model for low-light image enhancement under noisy and noiseless conditions. We have used performance metrics such as peak signal-to-noise ratio, structural similarity index measure, visual information fidelity, and average brightness to demonstrate the superiority of the proposed work against state-of-the-art algorithms. Qualitative results presented in this paper prove the strength and suitability of our model for real-time applications.
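The multi-context idea, running the same filter at several dilation rates and merging the results, can be sketched in one dimension. This is a hedged toy illustration of the dilation mechanism only (the actual MC-FEM operates on 2-D separable convolutions; all names here are hypothetical):

```python
def dilated_conv1d(signal, kernel, dilation):
    """'Same'-padded 1-D dilated convolution: taps are spaced
    `dilation` apart, enlarging the receptive field without
    adding parameters."""
    k = len(kernel)
    out = []
    for i in range(len(signal)):
        acc = 0.0
        for j, w in enumerate(kernel):
            idx = i + (j - k // 2) * dilation  # dilated tap position
            if 0 <= idx < len(signal):
                acc += w * signal[idx]
        out.append(acc)
    return out

def multi_context(signal, kernel, rates=(1, 2, 4)):
    """Feature-pyramid style fusion: run the same kernel at several
    dilation rates and sum, mixing local and wider context."""
    outs = [dilated_conv1d(signal, kernel, r) for r in rates]
    return [sum(vals) for vals in zip(*outs)]
```

Larger rates see farther at the same parameter cost, which is why a pyramid of dilations captures hierarchical context.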
Document image binarization is an important pre-processing step in document analysis and archiving. The state-of-the-art models for document image binarization are variants of encoder-decoder architectures, such as FCN (fully convolutional network) and U-Net. Despite their success, they still suffer from three limitations: (1) reduced feature map resolution due to consecutive strided pooling or convolutions, (2) multiple scales of target objects, and (3) reduced localization accuracy due to the built-in invariance of deep convolutional neural networks (DCNNs). To overcome these three challenges, we propose an improved semantic segmentation model, referred to as DP-LinkNet, which adopts the D-LinkNet architecture as its backbone, with the proposed hybrid dilated convolution (HDC) and spatial pyramid pooling (SPP) modules between the encoder and the decoder. Extensive experiments are conducted on recent document image binarization competition (DIBCO) and handwritten document image binarization competition (H-DIBCO) benchmark datasets. Results show that our proposed DP-LinkNet outperforms other state-of-the-art techniques by a large margin. Our implementation and the pre-trained models are available at https://***/beargolden/DP-LinkNet.
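The hybrid dilated convolution (HDC) design exists to avoid the "gridding" artifact of stacking equal dilation rates, where the effective receptive field has holes. A small sketch (hypothetical helper names, not the DP-LinkNet code) makes the difference checkable:

```python
def covered_offsets(rates, k=3):
    """Offsets (relative input positions) reachable by stacking
    k-tap dilated convolutions with the given dilation rates."""
    offsets = {0}
    for r in rates:
        taps = [(j - k // 2) * r for j in range(k)]
        offsets = {o + t for o in offsets for t in taps}
    return offsets

def has_gridding(rates, k=3):
    """True if the stacked receptive field has holes -- the gridding
    artifact that HDC's mixed dilation rates are designed to avoid."""
    off = covered_offsets(rates, k)
    lo, hi = min(off), max(off)
    return any(p not in off for p in range(lo, hi + 1))
```

A hybrid schedule such as dilations 1, 2, 5 covers every position in its receptive field, whereas repeating dilation 2 three times leaves every odd offset unseen.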
Droughts pose significant challenges for accurate monitoring due to their complex spatiotemporal characteristics. Data-driven machine learning (ML) models have shown promise in detecting extreme events when enough well-annotated data is available. However, droughts do not have a unique and precise definition, which leads to noise in human-annotated events and presents an imperfect learning scenario for deep learning models. This article introduces a 3-D convolutional neural network (CNN) designed to address the complex task of drought detection, considering spatiotemporal dependencies and learning with noisy and inaccurate labels. Motivated by the shortcomings of traditional drought indices, we leverage supervised learning with labeled events from multiple sources, capturing the shared conceptual space among diverse definitions of drought. In addition, we employ several strategies to mitigate the negative effect of noisy labels (NLs) during training, including a novel label correction (LC) method that relies on model outputs, enhancing the robustness and performance of the detection model. Our model significantly outperforms state-of-the-art drought indices when detecting events in Europe between 2003 and 2015, achieving an AUROC of 72.28%, an AUPRC of 7.67%, and an ECE of 16.20%. When applying the proposed LC method, these performances improve by +5%, +15%, and +59%, respectively. Both the proposed model and the robust learning methodology aim to advance drought detection by providing a comprehensive solution to label noise and conceptual variability.
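The article's label correction method relies on model outputs; one common instantiation of that family of techniques is to flip a noisy label only when the model disagrees with it at high confidence. The sketch below shows that generic pattern for binary labels and is an assumption-laden illustration, not the paper's exact LC rule:

```python
def correct_labels(probs, labels, flip_threshold=0.9):
    """Output-based label correction sketch: when the model is highly
    confident (>= flip_threshold) and disagrees with the given,
    possibly noisy, label, replace the label with the prediction."""
    corrected = []
    for p, y in zip(probs, labels):
        pred = 1 if p >= 0.5 else 0
        conf = p if pred == 1 else 1.0 - p
        corrected.append(pred if (pred != y and conf >= flip_threshold) else y)
    return corrected
```

Training then continues on the corrected labels, which is how such schemes reduce the influence of annotation noise on the detector.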
The Temporal Convolutional Network (TCN) and the TCN combined with an encoder-decoder architecture (TCN-ED) are proposed to forecast runoff in this study. Both models are trained and tested using hourly data from the Jianxi basin, China. The results indicate that the forecast horizon has a great impact on forecast ability, and the concentration time of the basin is a critical threshold for the effective forecast horizon of both models. Both models perform poorly on low flows and well on medium and high flows at most forecast horizons, while their performance on peak flows depends on the forecast horizon. TCN-ED performs better than TCN in runoff forecasting, with higher accuracy, better stability, and insensitivity to fluctuations in the rainfall process. Therefore, TCN-ED is an effective deep learning solution for runoff forecasting within an appropriate forecast horizon.
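The defining property of a TCN layer is causality: the output at time t must depend only on inputs at t and earlier, which matters for forecasting. A minimal pure-Python sketch of that building block (not the paper's full TCN-ED model):

```python
def causal_dilated_conv(series, kernel, dilation):
    """Causal dilated convolution, the TCN building block: output at
    step t sees only inputs at t, t-d, t-2d, ..., never the future."""
    k = len(kernel)
    out = []
    for t in range(len(series)):
        acc = 0.0
        for j, w in enumerate(kernel):
            idx = t - (k - 1 - j) * dilation  # strictly non-future taps
            if idx >= 0:
                acc += w * series[idx]
        out.append(acc)
    return out
```

Stacking such layers with growing dilations gives a long history window, while an encoder-decoder wrapper (as in TCN-ED) separates summarizing past rainfall-runoff from emitting the multi-step forecast.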
Recent advancements in face super-resolution (FSR) have been propelled by deep learning techniques using convolutional neural networks (CNNs). However, existing methods still struggle to effectively capture global facial structure information, leading to reduced fidelity in reconstructed images, and they often require additional manual data annotation. To overcome these challenges, we introduce a content-guided frequency domain transform network (CGFTNet) for face super-resolution tasks. The network features a channel-attention-linked encoder-decoder architecture with two key components: the Frequency Domain and Reparameterized Focus Convolution Feature Enhancement module (FDRFEM) and the Content-Guided Channel Attention Fusion (CGCAF) module. FDRFEM enhances feature representation through transform domain techniques and reparameterized focus convolution (RefConv), capturing detailed facial features and improving image quality. CGCAF dynamically adjusts feature fusion based on image content, enhancing detail restoration. Extensive evaluations across multiple datasets demonstrate that the proposed CGFTNet consistently outperforms other state-of-the-art methods.
Video saliency prediction aims to simulate human visual attention by selecting the most pertinent and important components within a video frame or sequence. When evaluating video saliency, temporal and spatial data are essential, particularly in the presence of challenging features such as fast motion, shifting backgrounds, and nonrigid deformation. Current video saliency frameworks are highly prone to failure under these conditions. Moreover, it is unsuitable to perform video saliency identification by relying solely on image saliency models, disregarding the temporal information in videos. This research proposes a novel Spatiotemporal Bidirectional Network for Video Salient Object Detection using Multiscale Transfer Learning (SBMTL-Net) to address the problem of detecting important objects in videos. SBMTL-Net produces significant outcomes for a given sequence of frames by utilizing multi-scale transfer learning with an encoder-decoder technique to learn and map spatial and temporal properties. The SBMTL-Net model consists of a bidirectional LSTM (Long Short-Term Memory) network and a CNN (Convolutional Neural Network), where VGG16 and VGG19 (Visual Geometry Group) are utilized for multi-scale feature extraction from the input video frames. The performance of the proposed model has been evaluated on five publicly available challenging datasets: DAVIS-T, SegTrack-V2, ViSal, VOS-T, and DAVSOD-T, in terms of MAE, F-measure, and S-measure. The experimental results show the effectiveness of the proposed model compared with other competitive models.
Can a machine learn Machine Learning? This work trains a machine learning model to solve machine learning problems from a university undergraduate-level course. We generate a new training set of questions and answers consisting of course exercises, homework, and quiz questions from MIT's 6.036 Introduction to Machine Learning course and train a machine learning model to answer these questions. Our system demonstrates an overall accuracy of 96% for open-response questions and 97% for multiple-choice questions, compared with MIT students' average of 93%, achieving grade A performance in the course, all in real time. Questions cover all 12 topics taught in the course, excluding coding questions and questions with images. Topics include: (i) basic machine learning principles; (ii) perceptrons; (iii) feature extraction and selection; (iv) logistic regression; (v) regression; (vi) neural networks; (vii) advanced neural networks; (viii) convolutional neural networks; (ix) recurrent neural networks; (x) state machines and MDPs; (xi) reinforcement learning; and (xii) decision trees. Our system uses Transformer models within an encoder-decoder architecture with graph and tree representations. An important aspect of our approach is a data-augmentation scheme for generating new example problems. We also train a machine learning model to generate problem hints. Thus, our system automatically generates new questions across topics, answers both open-response and multiple-choice questions, classifies problems, and generates problem hints, pushing the envelope of AI for STEM education.
ISBN (print): 9783030863319
Mathematical formula recognition aims to automatically convert formula images into structured description formats. Recently, some encoder-decoder models have been presented for this task, but they seldom explicitly consider spatial relationships among symbols. In this paper, we propose a novel encoder-decoder model with a Graph Neural Network (GNN) to translate mathematical formula images into LaTeX code. In the proposed model, the symbols segmented from the raw image are used to build graphs based on their spatial connections. The encoder consists of a Convolutional Neural Network (CNN) and a GNN: the CNN extracts visual features from the whole formula or individual symbols, and the GNN propagates the spatial information embedded in the built graphs. The decoder is a Recurrent Neural Network (RNN) that implements a language model to generate the output sentences from the encoded features with an attention mechanism. Experimental results on the IM2LATEX-100K dataset demonstrate that the proposed model outperforms state-of-the-art approaches.
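Building a graph from segmented symbols "based on their spatial connections" can be illustrated with a simple proximity rule over bounding boxes. This is a hedged sketch of one plausible construction (the paper may use a different connection criterion; all names are hypothetical):

```python
def build_symbol_graph(boxes, max_gap=10.0):
    """Build an undirected graph over segmented symbols: connect two
    symbols when their bounding boxes (x0, y0, x1, y1) lie within
    `max_gap` pixels of each other, approximating spatial adjacency."""
    def gap(a, b):
        # Euclidean gap between two axis-aligned boxes (0 if they overlap)
        dx = max(a[0] - b[2], b[0] - a[2], 0.0)
        dy = max(a[1] - b[3], b[1] - a[3], 0.0)
        return (dx ** 2 + dy ** 2) ** 0.5
    edges = []
    for i in range(len(boxes)):
        for j in range(i + 1, len(boxes)):
            if gap(boxes[i], boxes[j]) <= max_gap:
                edges.append((i, j))
    return edges
```

The resulting edge list is what a GNN message-passing layer would then operate over, letting superscript/subscript geometry inform the LaTeX decoding.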
Automatic pavement distress detection is essential to monitoring and maintaining pavement condition. Currently, many deep learning-based methods are used for pavement distress detection. However, distress segmentation remains a challenge under complex pavement conditions. In this paper, a novel deep neural network architecture, W-segnet, based on multi-scale feature fusion, is proposed for pixel-wise distress segmentation. The proposed W-segnet concatenates distress location information with distress classification features in two symmetric encoder-decoder structures. Three major distress types (crack, pothole, and patch) are segmented and the results discussed. Experimental results show that the proposed W-segnet is robust in various scenarios, achieving a mean pixel accuracy (MPA) of 87.52% and a mean intersection over union (MIoU) of 75.88%. The results demonstrate that W-segnet outperforms the other state-of-the-art semantic segmentation models U-net, SegNet, and PSPNet. A comparison of training and inference costs indicates that W-segnet has the largest number of parameters, which requires a slightly longer training time but does not increase the inference cost. Four public datasets were used to test the generalization ability of the proposed model, and the results demonstrate that W-segnet generalizes well.
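The two reported metrics, MPA and MIoU, are standard and fully determined by a per-class confusion matrix; the sketch below computes both (function name hypothetical, but the formulas are the conventional definitions):

```python
def mpa_miou(confusion):
    """Mean pixel accuracy (MPA) and mean intersection-over-union
    (MIoU) from a per-class confusion matrix, where confusion[i][j]
    counts pixels of true class i predicted as class j."""
    n = len(confusion)
    accs, ious = [], []
    for c in range(n):
        tp = confusion[c][c]
        row = sum(confusion[c])                       # pixels of true class c
        col = sum(confusion[r][c] for r in range(n))  # pixels predicted as c
        accs.append(tp / row if row else 0.0)
        denom = row + col - tp                        # union of truth and prediction
        ious.append(tp / denom if denom else 0.0)
    return sum(accs) / n, sum(ious) / n
```

MPA averages per-class recall, while MIoU penalizes false positives as well, which is why MIoU (75.88%) is lower than MPA (87.52%) in the reported results.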