ISBN (print): 9789811030055; 9789811030048
This paper presents BLSTM-CTC (bidirectional LSTM-Connectionist Temporal Classification), a novel scheme to tackle the Chinese image text recognition problem. Unlike traditional methods that perform recognition at the single-character level, BLSTM-CTC takes as input an image text composed of a line of characters and outputs a recognized text sequence, so recognition is carried out at the level of the whole image text. To train a neural network for this challenging task, we collect over 2 million news titles, from which we generate over 1 million noisy image texts covering the vast majority of common Chinese characters. With these training data, an RNN training procedure is conducted to learn the recognizer. We also adapt the neural network to make it suitable for real scenarios. Experiments on text images from 13 TV channels demonstrate the effectiveness of the proposed pipeline. The results all outperform those of a baseline system.
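The abstract does not describe how the CTC output is turned into text. As an illustration only, the following minimal sketch shows standard best-path (greedy) CTC decoding: take the top label per frame, merge consecutive repeats, then drop blanks. The blank index, the toy alphabet, and the per-frame scores are all assumptions made for the example, not details from the paper.

```python
from itertools import groupby

BLANK = 0  # assumed index of the CTC blank label


def ctc_greedy_decode(frame_scores, alphabet):
    """Best-path CTC decoding: argmax per frame, merge repeats, drop blanks."""
    # Pick the highest-scoring label index at each time step.
    best_path = [
        max(range(len(scores)), key=scores.__getitem__) for scores in frame_scores
    ]
    # Merge consecutive duplicate labels, then remove the blank label.
    collapsed = [label for label, _ in groupby(best_path)]
    return "".join(alphabet[label] for label in collapsed if label != BLANK)


# Toy input: 3 labels (blank and two characters) scored over 6 frames.
alphabet = ["", "中", "文"]
frame_scores = [
    [0.10, 0.80, 0.10],  # first character
    [0.10, 0.70, 0.20],  # same character again, merged with the previous frame
    [0.90, 0.05, 0.05],  # blank
    [0.20, 0.10, 0.70],  # second character
    [0.60, 0.20, 0.20],  # blank
    [0.10, 0.20, 0.70],  # second character again, kept because a blank separates it
]
decoded = ctc_greedy_decode(frame_scores, alphabet)
```

Here the blank between the two final frames is what lets the decoder emit the same character twice in a row.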
ISBN (print): 9789811030055; 9789811030048
Speech emotion recognition is an interesting and challenging subject due to the emotion gap between speech signals and high-level speech emotion. To bridge this gap, this paper presents a method of Chinese speech emotion recognition using deep belief networks (DBN). The DBN is used to perform unsupervised feature learning on the extracted low-level acoustic features. Then, a multi-layer perceptron (MLP) is initialized with the learning results of the hidden layer of the DBN and employed for Chinese speech emotion classification. Experimental results on the Chinese Natural Audio-Visual Emotion Database (CHEAVD) show that the presented method obtains a classification accuracy of 32.80% and a macro average precision of 41.54% on the CHEAVD testing data for speech emotion recognition, significantly outperforming the baseline results provided by the organizers of the speech emotion recognition sub-challenges.
ISBN (print): 9781665448994
Super-resolution (SR) is an ill-posed problem, which means that infinitely many high-resolution (HR) images can be degraded to the same low-resolution (LR) image. To study the one-to-many stochastic SR mapping, we implicitly represent the non-local self-similarity of natural images and develop a Variational Sparse framework for Super-Resolution (VSpSR) via neural networks. Since every small patch of an HR image can be well approximated by a sparse representation of atoms in an over-complete dictionary, we design a two-branch module, VSpM, to explore the SR space. Concretely, one branch of VSpM extracts a patch-level basis from the LR input, and the other branch infers pixel-wise variational distributions with respect to the sparse coefficients. By repeatedly sampling coefficients, we can obtain infinitely many sparse representations and thus generate diverse HR images. According to the preliminary results of the NTIRE 2021 challenge on learning the SR space, our team ranks 7th in terms of released scores.
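As a rough illustration of the sampling idea, the sketch below draws coefficient vectors from per-coefficient distributions and combines them with a fixed basis to produce different patches. The Gaussian form of the distribution, the reparameterized sampling, and the toy basis are assumptions made for the example; the abstract does not specify them, and the real VSpM branches are neural networks.

```python
import math
import random


def sample_patch(atoms, means, log_vars, rng):
    """Draw one coefficient vector and return the patch it reconstructs."""
    # Assumed Gaussian posterior, sampled as mean + std * noise.
    coeffs = [
        mu + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
        for mu, lv in zip(means, log_vars)
    ]
    patch_len = len(atoms[0])
    # A patch is a weighted sum of the basis atoms.
    return [
        sum(c * atom[i] for c, atom in zip(coeffs, atoms)) for i in range(patch_len)
    ]


rng = random.Random(0)
# Hypothetical basis of three atoms for a flattened 2 x 2 patch.
atoms = [
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 1.0, 1.0, 0.0],
    [0.5, 0.5, 0.5, 0.5],
]
means = [0.8, 0.0, 0.3]
log_vars = [-4.0, -4.0, -4.0]

# Each call samples new coefficients, so each patch is a different plausible output.
samples = [sample_patch(atoms, means, log_vars, rng) for _ in range(3)]
```

Driving the variances toward zero collapses every sample onto the patch given by the mean coefficients, which is the deterministic special case.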
ISBN (print): 9781479951178
Identifying subjects with variations caused by poses is one of the most challenging tasks in face recognition, since the difference in appearance caused by pose may be even larger than the difference due to identity. Inspired by the observation that pose variations change non-linearly but smoothly, we propose to learn pose-robust features by modeling the complex non-linear transform from non-frontal face images to frontal ones through a deep network in a progressive way, termed stacked progressive auto-encoders (SPAE). Specifically, each shallow progressive auto-encoder of the stacked network is designed to map face images at large poses to a virtual view at smaller poses, while keeping images already at smaller poses unchanged. Stacking multiple such shallow auto-encoders then converts non-frontal face images to frontal ones progressively, which means the pose variations are narrowed down to zero step by step. As a result, the outputs of the topmost hidden layers of the stacked network contain very small pose variations and can be used as pose-robust features for face recognition. An additional attraction of the proposed method is that no pose estimation is needed for the test images. The proposed method is evaluated on two datasets with pose variations, MultiPIE and FERET, and the experimental results demonstrate the superiority of our method over existing works, especially 2D ones.
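One way to read the progressive scheme is as a schedule of training targets: each layer clamps the pose into a narrower range and leaves poses already inside that range untouched. The sketch below illustrates only this target schedule, not the auto-encoders themselves. The angle values and the per-layer limits are hypothetical.

```python
def layer_target_pose(pose_deg, max_pose_deg):
    """Clamp a pose angle into [-max_pose_deg, +max_pose_deg]."""
    return max(-max_pose_deg, min(max_pose_deg, pose_deg))


# Hypothetical schedule: each stacked layer tolerates a smaller maximum pose.
layer_limits = [30, 15, 0]
input_poses = [-45, -30, -15, 0, 15, 30, 45]

poses = input_poses
trace = []
for limit in layer_limits:
    # Large poses move to the layer's limit; smaller poses are kept unchanged.
    poses = [layer_target_pose(p, limit) for p in poses]
    trace.append(poses)
```

After the last layer every pose in `trace[-1]` is zero, which mirrors the claim that the variations are narrowed down step by step.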
ISBN (print): 9781538607336
Face detection is a classical problem in computer vision. It remains difficult because of the many nuisances that occur naturally in the wild, including extreme pose, exaggerated expressions, significant illumination variations, and severe occlusion. In this paper, we propose a multi-scale fully convolutional network (MS-FCN) for face detection. To reduce computation, the intermediate convolutional feature maps (conv) are shared by every scale model. We up-sample and down-sample the final conv map to approximate K levels of a feature pyramid, so that a wide range of face scales can be detected. At each feature pyramid level, an FCN is trained end-to-end to handle faces within a small range of scale change. Because of the up-sampling, our method can detect very small faces (10 x 10 pixels). We test our MS-FCN detector on four public face detection benchmarks: FDDB, WIDER FACE, AFW, and PASCAL FACE. Extensive experiments show that our detector outperforms state-of-the-art methods on all these datasets in general, and by a substantial margin on the most challenging among them (e.g., WIDER FACE Hard). In addition, MS-FCN runs at 23 FPS on a GPU for images of size 640 x 480, with no assumption on the minimum detectable face size.
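The pyramid approximation can be pictured with a small sketch that resamples one shared feature map into coarser and finer levels. Nearest-neighbor up-sampling and 2 x 2 average pooling are assumptions chosen for simplicity; the abstract does not say which resampling operators MS-FCN uses, and the 4 x 4 map is a toy stand-in for a real conv feature map.

```python
def upsample2x(fmap):
    """Nearest-neighbor up-sampling: repeat every value twice in each direction."""
    out = []
    for row in fmap:
        wide = [v for v in row for _ in range(2)]
        out.append(wide)
        out.append(list(wide))
    return out


def downsample2x(fmap):
    """2 x 2 average pooling with stride 2."""
    return [
        [
            (fmap[y][x] + fmap[y][x + 1] + fmap[y + 1][x] + fmap[y + 1][x + 1]) / 4.0
            for x in range(0, len(fmap[0]) - 1, 2)
        ]
        for y in range(0, len(fmap) - 1, 2)
    ]


# Toy stand-in for the final shared conv map.
conv = [
    [1.0, 2.0, 3.0, 4.0],
    [5.0, 6.0, 7.0, 8.0],
    [9.0, 10.0, 11.0, 12.0],
    [13.0, 14.0, 15.0, 16.0],
]

# Three pyramid levels derived from one computed feature map.
pyramid = {"up": upsample2x(conv), "base": conv, "down": downsample2x(conv)}
```

The finer level gives small faces more spatial resolution, while all levels reuse the same underlying convolutional computation.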
ISBN (print): 9783031189128; 9783031189135
In this paper, we focus on the critical task of retrieving common style in Chinese scene text: given an image of style text, the system returns all the images matching the queried text image. To that end, a novel twin-Transformer-based matching network is proposed, which integrates anchor-free detection, text recognition, and similarity matching networks. On the fly, our model retrieves the similarity of text features in the text area and evaluates it through recognition. Our experiments demonstrate that the proposed model outperforms the state of the art in terms of both processing speed and accuracy. Additional experiments show that our model generalizes well on various benchmarks, including a self-constructed Chinese query data set with complex real-world Chinese scenes.
Digital Hopfield neural networks (DHNN) are well known for their pattern recall capability in noisy circumstances. In this paper, a number of tests are conducted primarily to explore the recall competency of DHNN in ...
ISBN (print): 9780769549903
Sparse-representation-based methods have recently drawn much attention in visual tracking due to their good performance against illumination variation and occlusion. They assume that the errors caused by image variations can be modeled as pixel-wise sparse. However, in many practical scenarios these errors are not truly pixel-wise sparse but rather sparsely distributed in a structured way. In fact, pixels in error constitute contiguous regions within the object's track. This is the case when significant occlusion occurs. To accommodate non-sparse occlusion in a given frame, we assume that occlusion detected in previous frames can be propagated to the current one. This propagated information determines which pixels will contribute to the sparse representation of the current track. In other words, pixels that were detected as part of an occlusion in the previous frame will be removed from the target representation process. As such, this paper proposes a novel tracking algorithm that models and detects occlusion through structured sparse learning. We test our tracker on challenging benchmark sequences, such as sports videos, which involve heavy occlusion, drastic illumination changes, and large pose variations. Experimental results show that our tracker consistently outperforms the state of the art.
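The effect of propagating an occlusion mask can be shown with a deliberately simplified sketch. It replaces the paper's structured sparse learning with a plain masked squared error against a single template, so it only illustrates the step of excluding previously occluded pixels when scoring candidates. The template, the candidates, and the residual threshold are hypothetical.

```python
def masked_error(candidate, template, occluded):
    """Mean squared error over pixels not flagged as occluded."""
    visible = [i for i, occ in enumerate(occluded) if not occ]
    if not visible:
        return float("inf")
    return sum((candidate[i] - template[i]) ** 2 for i in visible) / len(visible)


def detect_occlusion(candidate, template, threshold=0.3):
    """Flag pixels with a large residual (the threshold is a hypothetical value)."""
    return [abs(c - t) > threshold for c, t in zip(candidate, template)]


template = [0.2, 0.4, 0.6, 0.8, 1.0, 0.9]
# Occlusion mask propagated from the previous frame: pixels 2 and 3 were covered.
prev_occluded = [False, False, True, True, False, False]
candidates = {
    "a": [0.2, 0.4, 0.0, 0.0, 1.0, 0.9],  # correct location, pixels 2 and 3 covered
    "b": [0.3, 0.5, 0.6, 0.8, 0.7, 0.6],  # wrong location, nothing covered
}

# Score candidates using only pixels believed to be visible.
best = min(
    candidates, key=lambda k: masked_error(candidates[k], template, prev_occluded)
)
# Without the mask, the occluder's pixels dominate the error.
no_mask = [False] * len(template)
best_unmasked = min(
    candidates, key=lambda k: masked_error(candidates[k], template, no_mask)
)
# Update the mask for the next frame from the winning candidate's residuals.
new_mask = detect_occlusion(candidates[best], template)
```

In this toy case the masked score selects candidate "a", while the unmasked score is misled by the occluder and selects "b".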
ISBN (print): 9789811030024; 9789811030017
In this paper, we present a flexible camera calibration method for pose normalization to accomplish pose-invariant face recognition. The accuracy of calibration can easily be influenced by landmark detection errors or by the varying shapes of different faces and expressions. By jointly using RANSAC and unique facial characteristics, we explore a flexible calibration method that achieves more accurate camera calibration and pose normalization for face images. Our proposed method is able to eliminate noisy facial landmarks and retain the ones that best match the undeformable 3D face model. The experimental results show that our method improves the accuracy of pose-invariant face recognition, especially for faces with unsatisfactory landmark detection, varied shapes, and exaggerated expressions.
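To show how RANSAC can discard noisy landmarks, the sketch below fits a toy model and keeps the largest consensus set. It swaps the paper's camera calibration against a 3D face model for a 2D translation between reference and detected landmarks, so it demonstrates only the outlier-rejection loop. The coordinates, tolerance, iteration count, and seed are hypothetical.

```python
import random


def ransac_translation(model_pts, detected_pts, tol=2.0, iters=50, seed=0):
    """Return indices of landmarks consistent with the best 2D translation."""
    rng = random.Random(seed)
    best_inliers = []
    for _ in range(iters):
        # Minimal sample: one correspondence fully determines a translation.
        i = rng.randrange(len(model_pts))
        dx = detected_pts[i][0] - model_pts[i][0]
        dy = detected_pts[i][1] - model_pts[i][1]
        # Consensus set: landmarks that the hypothesis predicts within tolerance.
        inliers = [
            j
            for j, (m, d) in enumerate(zip(model_pts, detected_pts))
            if abs(m[0] + dx - d[0]) <= tol and abs(m[1] + dy - d[1]) <= tol
        ]
        if len(inliers) > len(best_inliers):
            best_inliers = inliers
    return best_inliers


# Hypothetical reference landmarks and detections; indices 3 and 5 are badly off.
model = [(30, 40), (70, 40), (50, 60), (35, 80), (65, 80), (50, 95)]
detected = [(40, 45), (80, 46), (61, 65), (90, 20), (75, 85), (10, 150)]
inliers = ransac_translation(model, detected)
```

Only the retained indices would then be passed to the final model fit, which is the role the abstract assigns to the landmarks that best match the face model.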
ISBN (print): 9781728188089
Scene text recognition has been a hot topic in computer vision. Recent methods adopt the attention mechanism for sequence prediction and achieve convincing results. However, we argue that the existing attention mechanism faces the problem of attention diffusion, in which the model may not focus on a specific character area. In this paper, we propose the Gaussian Constrained Attention Network to deal with this problem. It is a 2D attention-based method integrated with a novel Gaussian Constrained Refinement Module, which predicts an additional Gaussian mask to refine the attention weights. Rather than simply adding supervision on the attention weights, our proposed method introduces an explicit refinement. In this way, the attention weights become more concentrated and the attention-based recognition network achieves better performance. The proposed Gaussian Constrained Refinement Module is flexible and can be applied directly to existing attention-based methods. Experiments on several benchmark datasets demonstrate the effectiveness of our proposed method. Our code is available at https://***/Pay20Y/GCAN.
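The refinement step can be illustrated with a small sketch in which a 2D Gaussian mask reweights an attention map. Combining the two by element-wise multiplication followed by renormalization is an assumption; the abstract does not state how the mask is applied, and in the paper the mask parameters are predicted by a network rather than fixed by hand as they are here.

```python
import math


def gaussian_mask(height, width, cy, cx, sigma):
    """Isotropic 2D Gaussian centered at (cy, cx), with peak value 1."""
    return [
        [
            math.exp(-((y - cy) ** 2 + (x - cx) ** 2) / (2 * sigma ** 2))
            for x in range(width)
        ]
        for y in range(height)
    ]


def refine_attention(attn, mask):
    """Multiply attention by the mask and renormalize to sum to 1 (assumed rule)."""
    weighted = [[a * m for a, m in zip(ar, mr)] for ar, mr in zip(attn, mask)]
    total = sum(map(sum, weighted))
    return [[w / total for w in row] for row in weighted]


# Diffuse toy attention map over a 3 x 4 feature grid; the weights sum to 1.
attn = [
    [0.10, 0.10, 0.10, 0.10],
    [0.10, 0.20, 0.10, 0.05],
    [0.05, 0.05, 0.03, 0.02],
]
# Hypothetical mask parameters centered on the intended character position.
mask = gaussian_mask(3, 4, cy=1, cx=1, sigma=0.8)
refined = refine_attention(attn, mask)
```

After refinement, weight shifts toward the mask center and away from distant cells, which is the concentration effect the abstract describes.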