Machines and humans are increasingly expected to work together in social and manufacturing environments. For effective collaboration, a machine should predict a human's next move by observing the present one. Human motion modelling and prediction are fundamental and challenging problems in computer vision and graphics. To address some of these challenges, we propose a new cost function based on adaptive sampling, which serves as the objective function and is used with the Adam optimizer to train and validate a specially configured deep learning architecture. The approach produced significantly improved results for future pose prediction. The adaptiveness of the proposed cost function is based on a bell-shaped, locally weighted function. We observed that the area covered by the cost function plays a vital role during training, and that the width of the bell-shaped function helps decide the region of importance for the training samples. The proposed cost function has been used to train a gated recurrent unit (GRU) based encoder-decoder architecture. The encoder takes the observed input sequence, extracts its significant variability, and passes the result to the decoder. The decoder takes this representation as input, is trained with the adaptive sampling-based method, and predicts future poses. We experimented with this function in various sizes and shapes and compared the results with state-of-the-art methods. As elaborated in this paper, we obtained improved results in almost all cases.
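The abstract does not give the exact form of the bell-shaped weighting, so the following is only a minimal sketch that assumes a Gaussian bell over sample positions. The names `bell_weights` and `weighted_mse`, the `center` and `width` parameters, and the toy values are all hypothetical; the sketch only illustrates how the width of the bell controls which samples dominate the loss.

```python
import math


def bell_weights(n, center, width):
    """Gaussian-shaped weights over n sample positions (assumed form, not the paper's)."""
    return [math.exp(-((i - center) ** 2) / (2.0 * width ** 2)) for i in range(n)]


def weighted_mse(pred, target, weights):
    """Squared errors weighted per sample and normalised by the total weight."""
    total = sum(weights)
    return sum(w * (p - t) ** 2 for w, p, t in zip(weights, pred, target)) / total


# Toy values: five predicted joint coordinates against an all-zero target.
weights = bell_weights(n=5, center=2, width=1.0)
loss = weighted_mse([0.0, 0.1, 0.2, 0.3, 0.4], [0.0] * 5, weights)
```

A larger `width` flattens the bell, so samples far from `center` contribute more to the loss; a smaller `width` concentrates training on the samples near `center`.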
Unconstrained handwritten text recognition remains challenging for computer vision systems. Paragraph text recognition is traditionally achieved with two models: the first for line segmentation and the second for text line recognition. We propose a unified end-to-end model that uses hybrid attention to tackle this task. The model is designed to process a paragraph image iteratively, line by line, and can be split into three modules. An encoder generates feature maps from the whole paragraph image. An attention module then recurrently generates a vertical weighted mask that focuses on the features of the current text line, which amounts to a kind of implicit line segmentation. For each text line's features, a decoder module recognizes the associated character sequence, leading to the recognition of the whole paragraph. We achieve state-of-the-art character error rates at paragraph level on three popular datasets: 1.91% for RIMES, 4.45% for IAM and 3.59% for READ 2016. Our code and trained model weights are available at https://***/FactoDeepLearning/VerticalAttentionOCR.
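The vertical weighted mask can be illustrated with a small sketch: a softmax over the rows of a feature map, followed by a weighted sum that collapses the height dimension into one line-level feature vector. This is an assumption-laden toy, not the released implementation; in the paper the row scores come from a learned recurrent attention module, whereas here `row_scores` is hand-picked and the function name `vertical_attention` is hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def vertical_attention(features, row_scores):
    """Collapse an H x W feature grid into one W-length vector.

    features: list of H rows, each a list of W floats.
    row_scores: H unnormalised scores, one per row (hand-picked here).
    """
    weights = softmax(row_scores)
    width = len(features[0])
    line = [
        sum(weights[r] * features[r][c] for r in range(len(features)))
        for c in range(width)
    ]
    return weights, line


# Three rows; the middle row plays the role of the current text line.
features = [[1.0, 1.0, 1.0], [5.0, 6.0, 7.0], [2.0, 2.0, 2.0]]
weights, line = vertical_attention(features, [0.0, 10.0, 0.0])
```

Because almost all of the weight falls on the middle row, `line` is close to that row's features, which is the sense in which the mask performs an implicit line segmentation.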
The Neural Process (NP) combines the advantages of neural networks and Gaussian Processes (GP) to provide an efficient method for solving regression problems. Nonetheless, limited by the dimensionality of the latent variable, NP has difficulty fitting the observed data completely and predicting the targets accurately. To remedy these drawbacks, the authors propose a concise and effective improvement of the latent path of NP, which they term the Multi-Latent Variables Neural Process (MLNP). MLNP samples multiple latent variables and integrates the corresponding representations in the decoder with adaptive weights. MLNP inherits the linear computational scaling of NP and learns the approximate distribution over objective functions from contexts more flexibly and accurately. By applying MLNP to 1-D regression and to real-world image completion, which can be seen as a 2-D regression task, the authors demonstrate significant improvements in prediction accuracy and context-fitting capability compared with NP. Through ablation experiments, the authors also verify that the number of latent variables has a great impact on the prediction accuracy and fitting capability of MLNP. Moreover, the authors analyze the roles played by different latent variables in reconstructing images.
Creating a summarized version of a text document that still conveys its precise meaning is a complex task in natural language processing (NLP). Abstractive text summarization (ATS) is the process of taking facts from source sentences and merging them into concise representations while maintaining the content and intent of the text. Manually summarizing large amounts of text is challenging and time-consuming for humans, so text summarization has become an active research focus in NLP. This paper proposes an ATS model using a Transformer Technique with Self-Attention Mechanism (T2SAM). The self-attention mechanism is added to the transformer to solve the problem of coreference in text, which helps the system understand the text better. The proposed T2SAM model improves the performance of text summarization. It is trained on the Inshorts News dataset combined with the DUC-2004 shared tasks dataset. The model has been evaluated using the ROUGE metrics and has been shown to outperform existing state-of-the-art baseline models. Over 30 epochs, the training loss of the proposed model falls from 10.3058 at the start to a minimum of 1.8220, and the model achieves an F1-score of 48.50% on both the Inshorts and DUC-2004 news datasets.
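For readers unfamiliar with self-attention, here is a minimal sketch of standard scaled dot-product self-attention. It is not the T2SAM model: as a simplifying assumption, the queries, keys and values are the token vectors themselves (no learned projections, a single head), and the toy `tokens` are hypothetical. It only shows how each token's output becomes a weighted mix of all tokens, which is the mechanism that lets a model relate a mention to its referent.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(x):
    """Scaled dot-product self-attention with Q = K = V = x (no learned projections)."""
    d = len(x[0])
    out = []
    for q in x:
        # Similarity of this token to every token, scaled by sqrt(d).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in x]
        w = softmax(scores)
        # Output is the attention-weighted average of all token vectors.
        out.append([sum(w[j] * x[j][c] for j in range(len(x))) for c in range(d)])
    return out


tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
contextual = self_attention(tokens)
```

Each row of `contextual` is a convex combination of the input rows, so tokens that are similar to one another end up with similar, mutually informed representations.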
Talking face generation is widely used in education, entertainment, shopping, and other social practices. Existing methods focus on matching the speaker's mouth shape with the speech content. However, little research addresses the automatic extraction of potential head motion features from speech, which results in a lack of naturalness. This paper proposes SATFace, a subject-agnostic talking face generation method with natural head movement. To model the complicated and critical features of the talking face (identity, background, mouth shape, head posture, etc.), we construct SATFace with an encoder-decoder as the primary network architecture. We then design a long short-time feature learning network that makes better use of the global and local information in audio to generate reasonable head movement. In addition, a modular training process is proposed to improve the learning effect and efficiency for explicit and implicit features. The experimental comparison shows that SATFace improves by at least about 9.8% in cumulative probability of blur detection and 8.2% in synchronization confidence compared with mainstream methods. The mean opinion scores show that SATFace has advantages in lip sync quality, head movement naturalness, and video realness.
We designed an interface to support hand rehabilitation tasks to restore hand function and relieve discomfort. The interface requires accurate hand segmentation, which is impeded by background clutter, occlusion, and variations in illumination. To overcome these challenges, we propose a novel encoder-decoder that segments the hand by encoding spatial and channel correlations using two attention blocks. This approach requires much less computation than benchmark self-attention mechanisms. Moreover, a novel loss function optimizes the model to resolve class imbalance, ensure boundary smoothness, and retain the hand's shape. The quantitative and qualitative results show the model's ability to segment the hands. It performed exceptionally well for images with different hand poses and orientations, the presence of a human face, background clutter, specularity, and variations in illumination. The model attained an F1-score of 97.3% for the Ouhands and 99.3% for the HGR dataset, higher than baseline models, with faster inference times. Furthermore, the model could generalize hand segmentation to multiple hands and unseen environments. Its segmentation precision enabled the development of the hand rehabilitation interface, which guided users to perform hand exercises. For five weeks, patients steadily improved hand function while using the interface.
Recognizing abnormal behavior in surveillance video is a focus of current research, with high research value and broad application prospects, mainly in intelligent surveillance, intelligent security, and smart cities. Because of the complexity of human movement and the variability of the external environment, recognizing and detecting abnormal human behaviors in surveillance video remains challenging and needs further research and development. This paper uses, for the first time, a multi-branch convolutional neural network to extract the spatial features of video frames. Acting as an encoder, it passes the condensed features to a Gated Recurrent Unit (GRU), which extracts temporal features from multiple video frames and, acting as the decoder, outputs the result. We ran a series of comparative experiments on the UCF-Crime dataset and achieved an accuracy of 86.78% on the test set. The experimental results show that our multi-branch convolutional fusion neural network outperforms previous algorithms for recognizing abnormal behavior in surveillance video. To verify the generalization performance and efficiency of the algorithm, we also ran an experimental validation on the UCF-101 dataset. The results show that the algorithm also achieves high accuracy on UCF-101, with a speed close to that of the C3D method and a higher accuracy, which makes it possible to develop simple recognition applications based on the algorithm.
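The temporal part of the pipeline can be sketched with a scalar GRU that consumes one feature per frame. This is a toy under stated assumptions, not the paper's network: each frame is reduced to a single number (standing in for the CNN's condensed features), the weights in `params` are hypothetical rather than trained, and the update uses the convention h' = (1 - z) * h + z * candidate (some libraries swap the roles of z and 1 - z).

```python
import math


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def gru_step(x, h, p):
    """One scalar GRU update: x is the frame feature, h the previous hidden state."""
    z = sigmoid(p["wz"] * x + p["uz"] * h + p["bz"])  # update gate
    r = sigmoid(p["wr"] * x + p["ur"] * h + p["br"])  # reset gate
    cand = math.tanh(p["wh"] * x + p["uh"] * (r * h) + p["bh"])  # candidate state
    return (1.0 - z) * h + z * cand


def encode_sequence(frame_features, p, h0=0.0):
    """Run the GRU over per-frame features and return the final hidden state."""
    h = h0
    for x in frame_features:
        h = gru_step(x, h, p)
    return h


# Hypothetical weights and per-frame features, for illustration only.
params = {"wz": 0.5, "uz": 0.1, "bz": 0.0,
          "wr": 0.3, "ur": 0.2, "br": 0.0,
          "wh": 1.0, "uh": 0.5, "bh": 0.0}
summary = encode_sequence([0.2, 0.4, -0.1, 0.9], params)
```

The final hidden state `summary` is what a classification layer would consume to decide whether the clip shows abnormal behavior.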
We consider the reference-based approach to Automatic Short Answer Grading (ASAG), which scores a student's constructed textual answer by comparing it to a teacher-provided reference answer. The reference answer does not cover the variety of student answers, as it contains only specific examples of correct answers. Considering other language variants of the reference answer can handle variability in student responses and improve scoring accuracy. Alternative reference answers may be possible, but creating them manually is expensive and time-consuming. In this paper, we consider two issues. First, we need to automatically generate various reference answers that can handle the diversity of student answers. Second, we should provide an accurate grading model that improves sentence similarity computation using multiple reference answers. Our proposed approach therefore has two components. First, we provide a sequence-to-sequence deep learning model that generates plausible paraphrased reference answers conditioned on the provided reference answer. Second, we propose a supervised grading model based on sentence embedding features. The grading model enriches its features with multiple reference answers to improve accuracy. Experiments are conducted in both Arabic and English. They show that the paraphrase generator produces accurate paraphrases. Using multiple reference answers, the proposed grading model achieves a Root Mean Square Error (RMSE) of 0.6955 and a Pearson correlation of 88.92% for the Arabic dataset, and an RMSE of 0.7790 and a Pearson correlation of 73.50% for the English dataset. While fine-tuning pre-trained transformers on the English dataset provided state-of-the-art performance (RMSE: 0.7620), our approach yields comparable results. Simple to construct, load, and embed into the Learning Management System question engine with low computational complexity, the proposed approach can be easily integrated into the Learning Ma
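The two evaluation measures reported in this abstract, RMSE and Pearson correlation, can be computed as in the following sketch. The grade lists `gold` and `pred` are hypothetical values chosen for illustration, not data from the paper.

```python
import math


def rmse(pred, gold):
    """Root mean square error between predicted and reference grades."""
    return math.sqrt(sum((p - g) ** 2 for p, g in zip(pred, gold)) / len(gold))


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


# Hypothetical teacher grades and model predictions on a 1-5 scale.
gold = [1.0, 2.0, 3.0, 4.0, 5.0]
pred = [1.5, 2.0, 2.5, 4.5, 5.0]
error = rmse(pred, gold)
correlation = pearson(pred, gold)
```

RMSE is expressed in grade units and penalises large misses, while Pearson correlation is scale-free and measures how well the predicted ranking tracks the teacher's, which is why the two are usually reported together.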
Text summarization is an information compression technology that extracts important information from long text, and it has become a challenging research direction in natural language processing. Deep learning-based summarization models have shown good results, but how to model the relationships between words more effectively, extract feature information more accurately, and eliminate redundant information remains an open problem. This paper proposes GA-GNN, a graph neural network model based on gated attention, which effectively improves the accuracy and readability of text summarization. First, the words are encoded using a concatenated sentence encoder to generate a deeper vector containing local and global semantic information. Second, gated attention units eliminate locally irrelevant information, improving the ability to extract key information features. Finally, the loss function is optimized from three aspects (contrastive learning, confidence calculation for important sentences, and graph feature extraction) to improve the robustness of the model. Experimental validation on the CNN/Daily Mail dataset and the MR dataset showed that the proposed model outperformed existing methods.
The detection of ground object changes from bi-temporal images is of great significance for urban planning, land-use/land-cover monitoring and natural disaster assessment. To address the incomplete change detection (CD) entities and inaccurate edges caused by the loss of detailed information, this paper proposes a network based on dense connections and attention feature fusion, namely Siamese NestedUNet with Attention Feature Fusion (SNAFF). First, multi-level bi-temporal features are extracted through a Siamese network. The dense connections between the sub-nodes of the decoder are used to compensate for the missing location information and to weaken the semantic differences between features. Then, an attention mechanism is introduced to combine global and local information and achieve feature fusion. Finally, a deep supervision strategy is used to mitigate vanishing gradients and slow convergence. During the testing phase, a test time augmentation (TTA) strategy is adopted to further improve CD performance. To verify the effectiveness of the proposed method, two datasets with different change types are used. The experimental results indicate that, compared with the baseline methods, the proposed SNAFF achieves the best quantitative results on both datasets: F1, IoU and OA are 91.47%, 84.28% and 99.13%, respectively, on the LEVIR-CD dataset, and 96.91%, 94.01% and 99.27%, respectively, on the CDD dataset. In addition, the qualitative results show that SNAFF can effectively retain the global and edge information of the detected entity, thus achieving the best visual performance. This paper proposes a novel change detection (CD) method based on dense connections and attention feature fusion, which is capable of recovering detailed information as well as capturing global and local information. A deep supervision module is introduced to further improve the CD performance. Extensive experiment
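The three reported metrics can be computed from pixel-level confusion counts, as in the sketch below. It assumes that F1 and IoU are both measured for the "change" class on the same counts, in which case IoU = F1 / (2 - F1); the counts passed to `change_metrics` are hypothetical. Under that assumption, the reported F1 values of 91.47% and 96.91% correspond to IoU values of about 84.28% and 94.01%, which agrees with the figures in the abstract.

```python
def change_metrics(tp, fp, fn, tn):
    """F1 and IoU for the change class, plus overall accuracy (OA)."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    iou = tp / (tp + fp + fn)
    oa = (tp + tn) / (tp + fp + fn + tn)
    return f1, iou, oa


def iou_from_f1(f1):
    """IoU implied by F1 when both use the same TP/FP/FN counts."""
    return f1 / (2.0 - f1)


# Hypothetical pixel counts for a single image pair.
f1, iou, oa = change_metrics(tp=80, fp=10, fn=10, tn=900)

# Consistency check against the figures reported in the abstract.
levir_iou = round(iou_from_f1(0.9147) * 100, 2)
cdd_iou = round(iou_from_f1(0.9691) * 100, 2)
```

OA is dominated by the unchanged background, which is why it sits above 99% on both datasets even though F1 and IoU are noticeably lower.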