ISBN: (print) 9783031216473; 9783031216480
Attention-based encoder-decoder (AED) models are increasingly used in handwritten mathematical expression recognition (HMER) tasks. Given the recent success of Transformer in computer vision and a variety of attempts to combine Transformer with convolutional neural networks (CNN), in this paper we study three ways of leveraging Transformer and CNN designs to improve AED-based HMER models: 1) the Tandem way, which feeds CNN-extracted features to a Transformer encoder to capture global dependencies; 2) the Parallel way, which adds a Transformer encoder branch that takes raw image patches as input and concatenates its output with the CNN's as the final feature; 3) the Mixing way, which replaces the convolution layers of the CNN's last stage with multi-head self-attention (MHSA). We compared these three methods on the CROHME benchmark. On CROHME 2016 and 2019, the Tandem way attained ExpRates of 54.85% and 58.56%, respectively; the Parallel way attained 55.63% and 57.39%; and the Mixing way achieved 53.93% and 55.64%. These results indicate that the Parallel and Tandem ways perform better than the Mixing way and differ little from each other.
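The ExpRate figures quoted above are expression-level recognition rates. As a minimal sketch, assuming ExpRate counts an expression as correct only when the predicted token sequence matches the reference exactly, the metric could be computed as follows. The function name `exp_rate` and the sample token sequences are hypothetical and are not taken from the paper or from CROHME.

```python
def exp_rate(predictions, references):
    """Return the percentage of expressions predicted exactly right.

    Assumption: an expression counts as correct only if its whole
    token sequence equals the reference token sequence.
    """
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        raise ValueError("at least one expression is required")
    correct = sum(1 for p, r in zip(predictions, references) if p == r)
    return 100.0 * correct / len(references)


# Hypothetical LaTeX token sequences, for illustration only.
refs = [["x", "^", "2"], ["\\frac", "a", "b"], ["a", "+", "b"], ["\\sqrt", "x"]]
preds = [["x", "^", "2"], ["\\frac", "a", "c"], ["a", "+", "b"], ["\\sqrt", "x"]]

print(exp_rate(preds, refs))  # 3 of 4 expressions match exactly
```

Under this definition a single wrong token makes the whole expression count as an error, which is one reason differences of a percentage point or two between methods can be hard to interpret.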
Framing theory is a widely accepted theoretical framework in the field of news communication studies, frequently employed to analyze the content of news reports. This paper introduces framing theory into the text summarization task and proposes a news text summarization method based on framing theory, motivated by the rapidly increasing speed and scale of information dissemination worldwide. Traditional text summarization methods often overlook the implicit deep-level semantic content and situational frames in news texts, and the method proposed in this paper aims to fill this gap. Our deep learning-based news frame identification module automatically identifies frame elements in the text and predicts the dominant frame of the text. The frame-aware summarization generation model (FrameSum) incorporates the identified frame features into the text representation and attention mechanism, so that the generated summary focuses on the core content of the news report while maintaining high information coverage, readability, and objectivity. Through empirical studies on the standard CNN/Daily Mail dataset, we found that this method performs significantly better in improving summary quality and maintaining the accuracy of news facts.
Unsupervised feature selection (UFS) is a fundamental task in machine learning and data analysis, aimed at identifying a subset of non-redundant and relevant features from a high-dimensional dataset. Embedded methods seamlessly integrate feature selection into model training, resulting in more efficient and interpretable models. Current embedded UFS methods primarily rely on self-representation or pseudo-supervised feature selection approaches to address redundancy and irrelevant feature issues, respectively. Nevertheless, there is currently a lack of research showcasing the fusion of these two approaches. This paper proposes the Orthogonal Encoder-Decoder factorization for unsupervised Feature Selection (OEDFS) model, combining the strengths of self-representation and pseudo-supervised approaches. This method draws inspiration from the self-representation properties of autoencoder architectures and leverages encoder and decoder factorizations to simulate a pseudo-supervised feature selection approach. To further enhance the part-based characteristics of factorization, orthogonality constraints and local structure preservation restrictions are incorporated into the objective function. The optimization process is based on the multiplicative update rule, ensuring efficient convergence. To assess the effectiveness of the proposed method, comprehensive experiments are conducted on 14 datasets, and the results are compared with eight state-of-the-art methods. The experimental results demonstrate the superior performance of the proposed approach in terms of UFS efficiency.
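For readers unfamiliar with the multiplicative update rule mentioned above, the following shows its standard form for plain non-negative matrix factorization with a Frobenius-norm objective. This is offered only as background under that assumption: it is not the OEDFS objective, which the abstract says also includes orthogonality and local structure preservation terms. The symbols X, W and H are generic and do not come from the paper.

```latex
% Generic NMF problem (an assumption for illustration, not the OEDFS objective)
\min_{W \ge 0,\; H \ge 0} \; \lVert X - W H \rVert_F^2

% Multiplicative updates; \odot is the element-wise product
% and the fractions are taken element-wise.
H \leftarrow H \odot \frac{W^{\top} X}{W^{\top} W H},
\qquad
W \leftarrow W \odot \frac{X H^{\top}}{W H H^{\top}}
```

Because each factor is multiplied by a ratio of non-negative quantities, the entries of W and H stay non-negative without an explicit projection step, and the objective is non-increasing under these updates.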
ISBN: (print) 9781665473507
In recent years, the automatic generation of natural language descriptions of video has become a focus of deep learning and natural language processing research. Video understanding has many applications, such as video search and indexing, but video captioning remains a difficult problem given the complexity and diversity of video content. Bridging video and natural language is still an open issue, both for understanding video better and for developing methods that generate descriptions automatically. Deep learning methods are now a major focus in video processing thanks to their performance and high-speed computing capabilities. This survey discusses an end-to-end encoder-decoder network based on a deep learning approach to generate captions. In this paper we describe the model, dataset and parameters used to evaluate the model.
ISBN: (print) 9783031198144; 9783031198151
The Transformer-based encoder-decoder architecture has recently made significant advances in recognizing handwritten mathematical expressions. However, the transformer model still suffers from the lack of coverage problem, making its expression recognition rate (ExpRate) inferior to its RNN counterpart. Coverage information, which records the alignment information of the past steps, has proven effective in the RNN models. In this paper, we propose CoMER, a model that adopts the coverage information in the transformer decoder. Specifically, we propose a novel Attention Refinement Module (ARM) to refine the attention weights with past alignment information without hurting its parallelism. Furthermore, we take coverage information to the extreme by proposing self-coverage and cross-coverage, which utilize the past alignment information from the current and previous layers. Experiments show that CoMER improves the ExpRate by 0.61%/2.09%/1.59% compared to the current state-of-the-art model, and reaches 59.33%/59.81%/62.97% on the CROHME 2014/2016/2019 test sets. (Source code is available at https://***/Green-Wood/CoMER)
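To illustrate what "coverage information" means in general, here is a minimal sketch in which coverage is taken to be the running sum of the attention weights from earlier decoding steps, and is subtracted from the raw scores before the softmax. This is a generic, sequential illustration under that assumption. It is not CoMER's Attention Refinement Module, which the abstract says preserves parallelism; the function names and the `penalty` parameter are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend_with_coverage(score_steps, penalty=1.0):
    """Refine attention at each step using past alignment information.

    coverage[i] holds the total attention that position i received in
    earlier steps; it is scaled by `penalty` and subtracted from the
    raw scores so already-attended positions are discouraged.
    """
    coverage = [0.0] * len(score_steps[0])
    history = []
    for scores in score_steps:
        refined = [s - penalty * c for s, c in zip(scores, coverage)]
        weights = softmax(refined)
        history.append(weights)
        coverage = [c + w for c, w in zip(coverage, weights)]
    return history, coverage


# Two decoding steps with identical raw scores (hypothetical values).
history, coverage = attend_with_coverage([[2.0, 1.0, 0.0], [2.0, 1.0, 0.0]])
print(history[0][0] > history[1][0])  # position 0 gets less weight the second time
```

With identical raw scores at both steps, the second step shifts attention away from the position that dominated the first step, which is the behaviour coverage is meant to encourage.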
Understanding human-object interaction is a fundamental challenge in computer vision and robotics. Crucial to it is the ability to infer "object affordances" from visual data, namely the types of interaction supported by an object of interest and the object parts involved. Such inference can be approached as an "affordance reasoning" task, where object affordances are recognized and localized as image heatmaps, and as an "affordance segmentation" task, where affordance labels are obtained at a more detailed, image pixel level. To tackle the two tasks, existing methods typically: (i) treat them independently; (ii) adopt static image-based models, ignoring the temporal aspect of human-object interaction; and/or (iii) require additional strong supervision concerning object class and location. In this paper, we focus on both tasks, while addressing all three aforementioned shortcomings. For this purpose, we propose a deep-learning-based dual encoder-decoder model for joint affordance reasoning and segmentation, which learns from our recently introduced SOR3D-AFF corpus of RGB-D human-object interaction videos, without relying on object localization and classification. The basic components of the model comprise: (i) two parallel encoders that capture spatio-temporal interaction information; (ii) a reasoning decoder that predicts affordance heatmaps, assisted by an affordance classifier and an attention mechanism; and (iii) a segmentation decoder that exploits the predicted heatmap to yield pixel-level affordance segmentation. All modules are jointly trained, while the system can operate on both static images and videos. The approach is evaluated on four datasets, surpassing the current state-of-the-art in both affordance reasoning and segmentation.
Convolutional neural network (CNN)-based encoder-decoder models have profoundly inspired recent works in the field of salient object detection (SOD). Despite the rapid development of encoder-decoder models for most pixel-level dense prediction tasks, there is still no empirical study that evaluates a large body of encoder-decoder models on SOD tasks. In this paper, instead of limiting our survey to SOD methods, we present a broader view from the perspective of the fundamental architectures of key modules and structures in CNN-based encoder-decoder models for pixel-level dense prediction tasks. We then focus on performing SOD by leveraging deep encoder-decoder models, and present an extensive empirical study on baseline encoder-decoder models in terms of different encoder backbones, loss functions, training batch sizes, and attention structures. State-of-the-art encoder-decoder models adopted from semantic segmentation and deep CNN-based SOD models are also investigated. New baseline models that can outperform state-of-the-art performance were discovered. In addition, these newly discovered baseline models were further evaluated on three video-based SOD benchmark datasets. Experimental results demonstrate the effectiveness of these baseline models on both image- and video-based SOD tasks. This empirical study is concluded by a comprehensive summary which provides suggestions on future perspectives. (c) 2020 Elsevier Inc. All rights reserved.
We propose a method of automatically selecting appropriate responses in conversational spoken dialog systems by first explicitly determining the response type that is needed, based on a comparison of the user's input utterance with many other utterances. Response utterances are then generated based on this response type designation (back channel, changing the topic, expanding the topic, etc.). This allows the generation of more appropriate responses than conventional end-to-end approaches, which only use the user's input to directly generate response utterances. As a response type selector, we propose an LSTM-based encoder-decoder framework utilizing acoustic and linguistic features extracted from input utterances. In order to extract these features more accurately, we utilize not only input utterances but also response utterances in the training corpus. To do so, multi-task learning using multiple decoders is also investigated. To evaluate our proposed method, we conducted experiments using a corpus of dialogs between elderly people and an interviewer. Our proposed method outperformed conventional methods using either a pointwise classifier based on Support Vector Machines or a single-task learning LSTM. The best performance was achieved when our two response type selectors (one trained using acoustic features, and the other trained using linguistic features) were combined, and multi-task learning was also performed.
The fields of image processing and computer vision have witnessed significant growth due to the proliferation of digital images across diverse domains. Image segmentation is a fundamental task in digital image processing, finding applications in pivotal areas such as medical imaging, covert communication, autonomous driving, and satellite imaging, among others. One particularly intriguing application of image segmentation lies in Reversible Data Hiding (RDH), where the delineation of the main Region of Interest (ROI) and Non-Region of Interest (NROI) using segmentation plays a crucial role in effective data encryption in the images. Over the last two decades, various studies have focused on developing an efficient data hiding approach that can embed secret data within the ROI and NROI parts of an image while preserving its quality. This comprehensive survey examines different segmentation techniques, along with their usage in reversible data hiding. The main objective of this survey is to compare the performance metrics of reversible data hiding after applying different image segmentation techniques. The image segmentation techniques are categorized into three main classes: i) traditional segmentation techniques, encompassing approaches such as thresholding, region-based, and edge-detection-based techniques; ii) Machine Learning (ML) based approaches, consisting of clustering and Support Vector Machines (SVM); and iii) Deep Learning (DL) based techniques, propelled by Convolutional Neural Networks (CNNs), which have emerged as a transformative paradigm, revolutionizing segmentation tasks with their ability to learn from complex images. The survey finds that the PSNR value of data-embedded images is high after applying deep learning based segmentation techniques.
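PSNR (peak signal-to-noise ratio) is the quality metric the survey above reports for data-embedded images. As a minimal sketch, assuming 8-bit grayscale images stored as nested lists, it can be computed from the mean squared error between the cover image and the data-embedded image. The function name `psnr` and the tiny sample images are hypothetical.

```python
import math


def psnr(original, modified, max_value=255):
    """Peak signal-to-noise ratio in dB between two equal-sized images.

    PSNR = 10 * log10(max_value**2 / MSE), where MSE is the mean
    squared pixel difference. Identical images give infinity.
    """
    flat_o = [p for row in original for p in row]
    flat_m = [p for row in modified for p in row]
    if len(flat_o) != len(flat_m) or not flat_o:
        raise ValueError("images must be non-empty and the same size")
    mse = sum((a - b) ** 2 for a, b in zip(flat_o, flat_m)) / len(flat_o)
    if mse == 0:
        return math.inf
    return 10 * math.log10(max_value ** 2 / mse)


# Hypothetical 2x2 cover image and a copy with one pixel changed by 1,
# standing in for a single embedded bit.
cover = [[100, 120], [130, 140]]
stego = [[100, 121], [130, 140]]

print(round(psnr(cover, stego), 2))
```

A higher PSNR means the embedded image is closer to the original, so a segmentation method that steers embedding toward less sensitive regions would be expected to raise this value.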
Production prediction for gas wells is a popular topic in reservoir engineering as it plays a crucial role in the formulation of development plans. Most traditional techniques fall into two types, numerical simulation methods and decline curve analysis, neither of which can precisely capture the varying trends of gas production, which leads to poor prediction results. To tackle this issue, we propose a comprehensive approach that works in a pipeline manner to learn intrinsic features from data for production prediction. (1) We propose to group wells with a clustering algorithm that does not require a pre-specified number of clusters. To group wells even better, two parameters, the dynamic volatility and static volatility of production, are introduced and used for clustering. (2) We devise a technique based on maximum likelihood estimation for well matching. (3) We develop an encoder-decoder model for learning the varying trends of well production by considering geological, engineering and production data simultaneously. (4) We conduct intensive experiments on real-life data and find that our approach achieves superior performance and substantially outperforms its counterparts.