River algal blooms pose a significant environmental threat, necessitating accurate forecasts and timely warnings for effective prevention. This study proposes a novel hybrid model, combining an external recursive long short-term memory neural network based on encoder-decoder (RLSTM-ED) with a backpropagation (BP) neural network, denoted as RLSTM-ED-BP. A dataset comprising 34,992 hydrological, climatic, and water quality (4-hourly) observations from the Hanjiang River Basin in China was divided for model training and testing. Comparative analysis with an RLSTM baseline demonstrated that the RLSTM-ED-BP model enhanced the Nash-Sutcliffe coefficient (NSE) by more than 5% and reduced the root mean square error by over 10% during the 24-h forecast horizon. The RLSTM-ED-BP model yielded NSE and threat score values exceeding 0.95 and efficiently provided early warnings for algal bloom events. The model's enhanced performance contributes to the generalizability of deep learning approaches in addressing the critical environmental challenge of algal blooms.
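The three scores named in the abstract above have standard definitions, and a minimal sketch of them may make the reported numbers easier to read. The function names and the event threshold are illustrative assumptions; the abstract does not say how the authors computed the scores or which bloom threshold they used.

```python
import math


def nse(observed, predicted):
    """Nash-Sutcliffe efficiency: 1 is a perfect fit, 0 matches the mean of the observations."""
    mean_obs = sum(observed) / len(observed)
    residual = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    variance = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - residual / variance


def rmse(observed, predicted):
    """Root mean square error, in the same units as the observations."""
    squared = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    return math.sqrt(squared / len(observed))


def threat_score(observed, predicted, threshold):
    """Threat score (critical success index): hits / (hits + misses + false alarms).

    `threshold` is a hypothetical bloom-event cutoff chosen by the caller.
    """
    hits = misses = false_alarms = 0
    for o, p in zip(observed, predicted):
        obs_event, pred_event = o >= threshold, p >= threshold
        if obs_event and pred_event:
            hits += 1
        elif obs_event:
            misses += 1
        elif pred_event:
            false_alarms += 1
    total = hits + misses + false_alarms
    return hits / total if total else float("nan")
```

A forecast that always predicts the observed mean scores an NSE of 0, which is why values above 0.95 indicate a close fit.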
The k-center problem (KCP) is a well-known NP-hard combinatorial optimization problem in computer science and operations research that seeks locations for k centers within a given set of nodes so as to minimize the maximum distance from any node to its nearest center. In contrast to conventional algorithms, which have inherent limitations in handling the trade-off between solution quality and computational efficiency, this study proposes a new method based on a graph attention mechanism with an encoder-decoder architecture to find high-quality solutions for KCPs by directly learning heuristics from the graph. Specifically, the encoder processes the input features of the graph and captures intricate spatial patterns and dependencies among nodes, whereas the decoder leverages the encoded information and attention weights to iteratively generate solutions for the KCP. Moreover, an adaptive embedding strategy is developed to handle the specific attributes and constraints inherent in different KCP instances. A policy gradient method with an exponential moving average baseline is used to learn the model parameters. A comprehensive set of experiments on multiple problem sizes is conducted to systematically compare the performance of the proposed method with a wide range of baseline methods across four types of KCPs: the standard KCP, capacitated KCP, non-uniform KCP, and dynamic KCP. The experimental results demonstrate the competitive performance of the graph attention-based method in addressing KCPs.
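To make the objective concrete, the sketch below evaluates the standard KCP cost (the largest distance from any node to its nearest center) and builds a solution with the classic farthest-first greedy heuristic, a well-known 2-approximation for metric instances. This is a conventional baseline of the kind the abstract contrasts against, not the graph attention method itself; Euclidean points and the function names are assumptions made for illustration.

```python
import math


def kcenter_cost(points, centers):
    """Maximum over all points of the distance to the nearest chosen center."""
    return max(min(math.dist(p, points[c]) for c in centers) for p in points)


def farthest_first(points, k, start=0):
    """Greedy heuristic: repeatedly add the point farthest from the current centers.

    Returns the indices of the k chosen centers. `start` is an arbitrary first center.
    """
    centers = [start]
    # nearest[i] holds the distance from point i to its closest center so far.
    nearest = [math.dist(p, points[start]) for p in points]
    while len(centers) < k:
        nxt = max(range(len(points)), key=nearest.__getitem__)
        centers.append(nxt)
        nearest = [min(d, math.dist(p, points[nxt])) for d, p in zip(nearest, points)]
    return centers
```

For example, with `points = [(0, 0), (1, 0), (10, 0), (11, 0)]` and `k = 2`, the heuristic starting from index 0 picks the two end points as centers, and `kcenter_cost` then reports the radius of the resulting cover.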
Accurate prediction of chemical oxygen demand (COD) and ammonia nitrogen (NH3) is crucial for maintaining stable and effective wastewater treatment processes. Traditional methods rely on costly, high-maintenance sensors, limiting their application in resource-limited wastewater treatment plants. Soft sensing methods provide an alternative by reducing dependence on costly sensors. However, existing approaches cannot perform multitarget and multistep predictions, limiting their practical applicability. This study introduces a novel triple attention-enhanced encoder-decoder temporal convolutional network (TAED-TCN) to address this problem. The model uses multimodal inputs, including easily accessible water quality parameters and wastewater surface images, for multistep and synchronous prediction of COD and NH3. When validated with real-world sequencing batch reactor wastewater data, the model demonstrated superior multistep prediction performance. Specifically, the R² values for 1-h predictions of COD and NH3 were more than 26.03% and 20.51% higher than those of the baseline model, respectively. By incorporating multiple attention mechanisms (feature, temporal, and cross-attention), TAED-TCN effectively captured essential features, modeled nonlinear relationships, and identified long-term dependencies, thus enabling consistent multitarget prediction results even under abnormal conditions. Additionally, economic analysis revealed that TAED-TCN could reduce COD and NH3 monitoring costs by 79% over the equipment life cycle. This study offers a cost-effective solution for water quality prediction, enhancing the operational efficiency of wastewater management.
3D object tracking in monocular video relies on understanding the scene content to improve the continuity of the tracking signal. Reconstructing 3D shapes of single-view objects is essential for capturing object depth, orientation, and position within the scene. While existing deep learning-based methods excel in 3D reconstruction and tracking, they primarily focus on object feature semantics in normal frames, neglecting scene transition (ST) frames. This limitation leads to object information loss and discontinuity during tracking. This paper proposes a novel method for 3D reconstruction of single-view objects in monocular video scenes, focusing on fade scene transitions. First, large video datasets are pre-processed and segmented into sequences using cut transition detection via adaptive histogram equalization (AHE) and Euclidean distance estimation (EDE). Second, fade transition sequences are detected and classified into fade-in, fade-out, and mixed-fade scene transitions using a pixel intensity-based adaptive threshold. Third, contrast enhancement is applied to fade transition frames using contrast-limited adaptive histogram equalization (CLAHE) to improve object feature extraction. Fourth, a modified DeepLabv3+ network is employed to generate multi-scale features for semantic foreground object and background segmentation. Finally, the segmented objects are processed through the proposed point-wise multilayer perceptron (MLP) network, which reconstructs 3D object point clouds from segmented 2D single-view object pixels. Experimental evaluations on the object categories "Chair," "Car," and "Airplane" from the benchmark TRECVID, Pix3D, ShapeNet, and Multimedia datasets show an accuracy improvement of 6.52% for fade transition detection and satisfactory results in 3D point cloud reconstruction.
Automatic brain tumor segmentation technology plays a crucial role in tumor diagnosis, particularly in the precise delineation of tumor subregions. It can assist doctors in accurately assessing the type and location of brain tumors, potentially saving patients' lives. However, the highly variable size and shape of brain tumors, along with their similarity to healthy tissue, pose significant challenges in the segmentation of multi-label brain tumor subregions. This paper proposes a network model, KIDBA-Net, based on an encoder-decoder architecture, aimed at reducing pixel-level classification errors in multi-label tumor subregions. The proposed Kernel Inception Depthwise Block (KIDB) employs multi-kernel depthwise convolution to extract multi-scale features in parallel, accurately capturing the feature differences between tumor types to mitigate misclassification. To ensure the network focuses more on the lesion areas and excludes the interference of irrelevant tissues, this paper adopts Bi-Cross Attention as a skip connection hub to bridge the semantic gap between layers. Additionally, the Dynamic Feature Reconstruction Block (DFRB) exploits the complementary advantages of convolution and dynamic upsampling operators, effectively aiding the model in generating high-resolution prediction maps during the decoding phase. The proposed model surpasses other state-of-the-art brain tumor segmentation methods on the BraTS2018 and BraTS2019 datasets, particularly in the segmentation accuracy of the smaller and highly overlapping tumor core (TC) and enhancing tumor (ET) regions, achieving DSC scores of 87.8%, 82.0%, and 90.2%, 88.7%, respectively, with Hausdorff distances of 2.8, 2.7 mm, and 2.7, 2.0 mm.
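For readers unfamiliar with the two metrics quoted above, the following minimal sketch shows the standard definitions of the Dice similarity coefficient (DSC) on binary masks and the symmetric Hausdorff distance between two point sets. It is illustrative only: the flat-list mask representation and function names are assumptions, and BraTS evaluations commonly report a percentile variant of the Hausdorff distance on boundary voxels rather than the plain maximum shown here.

```python
import math


def dice(pred, truth):
    """Dice similarity coefficient, 2|A ∩ B| / (|A| + |B|), for flat binary masks."""
    intersection = sum(1 for p, t in zip(pred, truth) if p and t)
    total = sum(1 for p in pred if p) + sum(1 for t in truth if t)
    # Two empty masks are treated as a perfect match (a convention, not a rule).
    return 2 * intersection / total if total else 1.0


def hausdorff(a, b):
    """Symmetric Hausdorff distance between two non-empty sets of coordinates."""

    def directed(x, y):
        # Largest distance from a point in x to its nearest neighbor in y.
        return max(min(math.dist(p, q) for q in y) for p in x)

    return max(directed(a, b), directed(b, a))
```

A higher DSC means better overlap, whereas a lower Hausdorff distance means the predicted boundary stays closer to the reference boundary.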
The feature fusion module is an essential component of real-time semantic segmentation networks, bridging the semantic gap among different feature layers. However, many networks are inefficient in multi-level feature fusion. In this paper, we propose a simple yet effective decoder that consists of a series of multi-level attention feature fusion modules (MLA-FFMs) aimed at fusing multi-level features in a top-down manner. Specifically, MLA-FFM is a lightweight attention-based module. It can therefore not only efficiently fuse features to bridge the semantic gap between levels, but also be applied to real-time segmentation tasks. In addition, to address the low accuracy of existing real-time segmentation methods at semantic boundaries, we propose a semantic boundary supervision module (BSM) that improves accuracy by supervising the prediction of semantic boundaries. Extensive experiments demonstrate that our network achieves a state-of-the-art trade-off between segmentation accuracy and inference speed on both the Cityscapes and CamVid datasets. On a single NVIDIA GeForce 1080Ti GPU, our model achieves 77.4% mIoU at 97.5 FPS on the Cityscapes test set, and 74% mIoU at 156.6 FPS on the CamVid test set, which is superior to most state-of-the-art real-time methods.
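The mIoU figures above refer to mean intersection over union. A minimal sketch of the usual definition follows, assuming flat lists of integer class labels per pixel; the function name is illustrative, and benchmark toolkits typically accumulate counts over the whole dataset and skip an ignore label, which this sketch omits.

```python
def mean_iou(pred, truth, num_classes):
    """Average of per-class IoU = |pred ∩ truth| / |pred ∪ truth| over classes that occur."""
    ious = []
    for c in range(num_classes):
        intersection = sum(1 for p, t in zip(pred, truth) if p == c and t == c)
        union = sum(1 for p, t in zip(pred, truth) if p == c or t == c)
        if union:  # skip classes absent from both prediction and ground truth
            ious.append(intersection / union)
    return sum(ious) / len(ious) if ious else float("nan")
```

Because every class contributes equally to the average, errors on small classes (often those along semantic boundaries) lower mIoU more than they lower plain pixel accuracy.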
3D encoder-decoder segmentation architectures struggle with fine-grained feature decomposition, resulting in unclear feature hierarchies when features are fused across layers. Furthermore, the blurred nature of contour boundaries in medical imaging limits the focus on high-frequency contour features. To address these challenges, we propose a Multi-oriented Hierarchical Extraction and Dual-frequency Decoupling Network (HEDN), which consists of three modules: the Encoder-Decoder Module (E-DM), the Multi-oriented Hierarchical Extraction Module (Multi-HEM), and the Dual-frequency Decoupling Module (Dual-DM). The E-DM performs the basic encoding and decoding tasks, while Multi-HEM decomposes and fuses spatial and slice-level features in 3D, enriching the feature hierarchy by weighting them through 3D fusion. Dual-DM separates high-frequency features from the reconstructed network using self-supervision. Finally, the self-supervised high-frequency features separated by Dual-DM are inserted into the process following Multi-HEM, enhancing interactions and complementarities between contour features and hierarchical features so that the two reinforce each other. On the Synapse dataset, HEDN outperforms existing methods, boosting the Dice Similarity Score (DSC) by 1.38% and decreasing the 95% Hausdorff Distance (HD95) by 1.03 mm. Likewise, on the Automatic Cardiac Diagnosis Challenge (ACDC) dataset, HEDN achieves 0.5% performance gains across all categories.
Video saliency prediction aims to simulate human visual attention by locating the most pertinent and informative areas within a video frame or sequence. Setting the audio aspect aside, temporal and spatial information is essential when measuring video saliency, especially under challenging factors such as swift motion, changing backgrounds, and nonrigid deformation. Additionally, applying image saliency models directly to video is inappropriate because doing so neglects temporal information. To address these problems, this paper proposes a novel Bidirectional Multi-scale SpatioTemporal Network (BMST-Net) for identifying prominent video objects. The BMST-Net yields notable results for any given frame sequence, employing an encoder-decoder technique to learn and map features over time and space. The BMST-Net model consists of a bidirectional LSTM (Long Short-Term Memory) and a CNN (Convolutional Neural Network), where the VGG16 (Visual Geometry Group) single layer is used for feature extraction from the input video frames. The proposed approach produced noteworthy results in both qualitative and quantitative evaluations on publicly available challenging video datasets, achieving competitive performance relative to state-of-the-art saliency models.
Fine-grained image captioning with attribute information has garnered significant attention in the realms of computer vision and natural language processing, demanding precise and contextually relevant descriptions of visual content. While previous attribute-driven image captioning models have shown improvements, challenges remain, such as the independence of attribute predictors and caption generators and the semantic gap between images and attributes. Another common issue is the inclusion of all attributes at every time step, despite most attributes being irrelevant to the word currently being generated. This can divert the model's attention toward erroneous semantic details, resulting in a performance decline. To address these issues, we propose a novel Attribute-Driven Filtering (ADF) captioning network designed to provide rich and nuanced descriptions. This model incorporates a unique Attribute Predictor Module (APM) that dynamically predicts the most pertinent attributes in accordance with the textual context, utilizing different attributes at various time steps. The novelty of this approach lies in recognizing that not all attributes hold equal relevance at each time step, and the APM filters out irrelevant attributes to generate precise and contextually relevant captions. Furthermore, this model features a fusion mechanism that integrates visual information from a conventional attention module with attribute information predicted by the APM, aiming to reduce the visual semantic gap between images and attributes. Extensive experimentation demonstrates that the ADF model outperforms advanced models, achieving impressive CIDEr-D scores of 72.0 (Flickr30K) and 123.3 (MS-COCO) through reinforcement learning optimization. It consistently surpasses baseline models across diverse evaluation metrics, highlighting its effectiveness and robustness.
Automated audio captioning (AAC), a task that mimics human perception and links audio processing with natural language processing, has seen much progress over the last few years. AAC requires recognizing contents such as the environment, sound events, and the temporal relationships between sound events, and describing these elements in a fluent sentence. Currently, an encoder-decoder-based deep learning framework is the standard approach to this problem. Many works have proposed novel network architectures and training schemes, including extra guidance, reinforcement learning, audio-text self-supervised learning, and diverse or controllable captioning. Effective data augmentation techniques, especially those based on large language models, have also been explored. Benchmark datasets and AAC-oriented evaluation metrics have likewise accelerated progress in this field. This article is a comprehensive survey covering the comparison between AAC and its related tasks, the existing deep learning techniques, the datasets, and the evaluation metrics in AAC, with insights provided to guide potential future research directions.