Search Results - Inner Mongolia University Library

Fast Texture Synthesis via Pseudo Optimizer

Conference on computer vision and pattern recognition (CVPR)

Authors: Wu Shi; Yu Qiao. ShenZhen Key Lab of Computer Vision and Pattern Recognition, SIAT-SenseTime Joint Lab, Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences; SIAT Branch, Shenzhen Institute of Artificial Intelligence and Robotics for Society

ISBN: (digital) 9781728171685

ISBN: (print) 9781728171692

Texture synthesis using deep neural networks can generate high-quality, diverse textures, but it usually requires a heavy optimization process. Subsequent works accelerate the process with feed-forward networks, at the cost of scalability, diversity, or quality. We propose a new, efficient method that simulates the optimization process while retaining most of its properties. Our method takes a noise image and the gradients from a descriptor network as inputs, and synthesizes a refined image with respect to the target image. The proposed method synthesizes images with better quality and diversity than other fast synthesis methods. Moreover, when trained on a large-scale dataset, our method generalizes to unseen textures.

Keywords: Neural networks; Training; Transforms; Scalability; Optimization methods; Convolution

Exploring emotion features and fusion strategies for audio-video emotion recognition

arXiv, 2020

Authors: Zhou, Hengshun; Meng, Debin; Zhang, Yuanyuan; Peng, Xiaojiang; Du, Jun; Wang, Kai; Qiao, Yu. University of Science and Technology of China, China; ShenZhen Key Lab of Computer Vision and Pattern Recognition, SIAT-SenseTime Joint Lab, Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences, China

Audio-video emotion recognition aims to classify a given video into basic emotions. In this paper, we describe our approaches in EmotiW 2019, which mainly explore emotion features and feature fusion strategies for the audio and visual modalities. For emotion features, we explore audio features from both speech spectrograms and log Mel-spectrograms, and evaluate several facial features with different CNN models and different emotion pretraining strategies. For fusion strategies, we explore intra-modal and cross-modal fusion methods, such as designing attention mechanisms to highlight important emotion features, and using feature concatenation and factorized bilinear pooling (FBP) for cross-modal feature fusion. With careful evaluation, we obtain 65.5% on the AFEW validation set and 62.48% on the test set, and rank second in the challenge. © 2020, CC BY.

Keywords: Speech recognition
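
The factorized bilinear pooling (FBP) mentioned in the abstract above can be illustrated with a minimal sketch. It follows a commonly used formulation (project both modalities, multiply element-wise, sum-pool in windows of size k, then apply signed square root and L2 normalization). The abstract does not give these details, so the normalization steps, the random projection matrices U and V, and all dimensions are assumptions for illustration, not the authors' implementation.

```python
import math
import random


def project(W, x):
    """Compute W^T x, where W is a list of len(x) rows with k*o columns each."""
    cols = len(W[0])
    return [sum(W[i][j] * x[i] for i in range(len(x))) for j in range(cols)]


def factorized_bilinear_pooling(audio, video, U, V, k):
    """Fuse two feature vectors into one vector of length (columns of U) / k.

    U and V are hypothetical projection matrices (learned in practice,
    random here); k is the number of factors per output dimension.
    """
    pa = project(U, audio)                      # audio features in joint space
    pv = project(V, video)                      # video features in joint space
    joint = [a * b for a, b in zip(pa, pv)]     # element-wise interaction
    # Sum pooling over non-overlapping windows of size k.
    pooled = [sum(joint[i:i + k]) for i in range(0, len(joint), k)]
    # Signed square root (power normalization), assumed here.
    pooled = [math.copysign(math.sqrt(abs(z)), z) for z in pooled]
    # L2 normalization, assumed here.
    norm = math.sqrt(sum(z * z for z in pooled)) or 1.0
    return [z / norm for z in pooled]


random.seed(0)
k, o = 3, 4                                     # factors and output size (illustrative)
audio = [random.gauss(0, 1) for _ in range(6)]
video = [random.gauss(0, 1) for _ in range(8)]
U = [[random.gauss(0, 1) for _ in range(k * o)] for _ in range(len(audio))]
V = [[random.gauss(0, 1) for _ in range(k * o)] for _ in range(len(video))]
fused = factorized_bilinear_pooling(audio, video, U, V, k)
```

The fused vector has length o regardless of the two input dimensions, which is what lets features of different sizes from the audio and visual branches be combined.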

RankSRGAN: Generative adversarial networks with ranker for image super-resolution

arXiv, 2019

Authors: Zhang, Wenlong; Liu, Yihao; Dong, Chao; Qiao, Yu. ShenZhen Key Lab of Computer Vision and Pattern Recognition, SIAT-SenseTime Joint Lab, Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences, China; University of Chinese Academy of Sciences

Generative Adversarial Networks (GANs) have demonstrated the potential to recover realistic details for single image super-resolution (SISR). To further improve the visual quality of super-resolved results, the PIRM2018-SR Challenge employed perceptual metrics such as PI, NIQE, and Ma to assess perceptual quality. These metrics are shown to be highly correlated with human ratings, but they are non-differentiable, so existing methods cannot optimize them directly. To address this problem, we propose Super-Resolution Generative Adversarial Networks with Ranker (RankSRGAN), which optimizes the generator in the direction of perceptual metrics. Specifically, we first train a Ranker that learns the behavior of perceptual metrics, and then introduce a novel rank-content loss to optimize perceptual quality. The most appealing part is that the proposed method can combine the strengths of different SR methods to generate better results. Extensive experiments show that RankSRGAN achieves visually pleasing results and reaches state-of-the-art performance in perceptual metrics. Project page: https://***/Projects/RankSRGAN. Copyright © 2019, The Authors. All rights reserved.

Keywords: Generative adversarial networks
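
The two-stage idea in the RankSRGAN abstract above (train a Ranker on pairs ordered by a perceptual metric, then use its output as a differentiable loss) can be sketched as follows. The abstract does not specify the loss forms, so the hinge-style pairwise loss, the sigmoid on the Ranker score, the margin value, and the convention that a lower score means better perceptual quality are all assumptions for illustration.

```python
import math


def margin_ranking_loss(score_better, score_worse, margin=0.5):
    """Pairwise hinge loss for training a ranker (assumed form).

    Convention assumed here: a lower score means better perceptual quality.
    The loss is zero once the better image scores at least `margin` below
    the worse image.
    """
    return max(0.0, score_better - score_worse + margin)


def rank_content_loss(ranker_score):
    """Squash the ranker score of a generated image into (0, 1) (assumed form).

    Minimizing this pushes the generator toward images the ranker scores low,
    i.e. toward better perceptual quality under the convention above.
    """
    return 1.0 / (1.0 + math.exp(-ranker_score))


# Correctly ordered pair: no penalty.
ordered = margin_ranking_loss(score_better=0.1, score_worse=0.9)
# Wrongly ordered pair: penalized in proportion to the violation.
violated = margin_ranking_loss(score_better=0.9, score_worse=0.1)
```

Because the Ranker is an ordinary network, its score is differentiable with respect to the generated image, which is what a directly computed PI or NIQE value does not offer.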

Orientation Robust Scene Text Recognition in Natural Scene*

IEEE International Conference on Robotics and Biomimetics

Authors: Xiaolong Chen; Zhengfu Zhang; Yu Qiao; Jiangyu Lai; Jian Jiang; Zeyu Zhang; Bin Fu. Guangzhou Power Supply Bureau Co. Ltd., Guangzhou, China; ShenZhen Key Lab of Computer Vision and Pattern Recognition, Chinese Academy of Sciences

In recent years, scene text recognition has improved significantly, and various state-of-the-art recognition approaches have been proposed. This paper focuses on recognizing text in natural photos of equipment nameplates, which has wide applications in industrial automation. This task has received little attention in previous work. The challenge comes from multi-oriented, curved, noisy, and blurry text patches on equipment nameplates. To address this problem, we propose a deep model for text recognition on multi-oriented nameplates, namely Orientation Robust Scene Text Recognition (ORSTR). Specifically, our model employs a carefully designed rectification module to transform curved, distorted, or multi-oriented text into near-horizontal text. Once the near-horizontal text has been generated, the recognition network outputs predictions for the text patches. Our scene text recognition model achieves 90.8% recognition accuracy on an equipment nameplate dataset, outperforming a previous scene text recognition model (CRNN) by about 0.8%. Extensive experiments have been conducted to verify the effectiveness of our model.

Keywords:

Learning Attentive Pairwise Interaction for Fine-Grained Classification

arXiv, 2020

Authors: Zhuang, Peiqin; Wang, Yali; Qiao, Yu. ShenZhen Key Lab of Computer Vision and Pattern Recognition, SIAT-SenseTime Joint Lab, Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences; SIAT Branch, Shenzhen Institute of Artificial Intelligence and Robotics for Society

Fine-grained classification is a challenging problem due to subtle differences among highly confused categories. Most approaches address this difficulty by learning a discriminative representation of each individual input image. Humans, on the other hand, can effectively identify contrastive clues by comparing image pairs. Inspired by this fact, this paper proposes a simple but effective Attentive Pairwise Interaction Network (API-Net), which can progressively recognize a pair of fine-grained images by interaction. Specifically, API-Net first learns a mutual feature vector to capture semantic differences in the input pair. It then compares this mutual vector with the individual vectors to generate gates for each input image. These distinct gate vectors inherit mutual context on semantic differences, which allows API-Net to attentively capture contrastive clues through pairwise interaction between the two images. Additionally, we train API-Net in an end-to-end manner with a score ranking regularization, which further improves generalization by taking feature priorities into account. We conduct extensive experiments on five popular benchmarks in fine-grained classification. API-Net outperforms recent SOTA methods on all of them: CUB-200-2011 (90.0%), Aircraft (93.9%), Stanford Cars (95.3%), Stanford Dogs (90.3%), and NABirds (88.1%). Copyright © 2020, The Authors. All rights reserved.

Keywords: Vectors
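
The gating steps described in the API-Net abstract above (mutual vector, per-image gates, pairwise interaction) can be sketched roughly as follows. This is a minimal illustration under stated assumptions: the mutual vector is computed by a placeholder element-wise mean rather than a learned network, the gate is a sigmoid of the element-wise product, and the residual form of the attended features is assumed; none of these details is given in the abstract.

```python
import math


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def mean_mutual(x1, x2):
    """Placeholder for the learned mutual-vector function (assumption)."""
    return [(a + b) / 2.0 for a, b in zip(x1, x2)]


def pairwise_interaction(x1, x2, mutual_fn=mean_mutual):
    """Return the four attended feature vectors for an image pair."""
    xm = mutual_fn(x1, x2)
    # Gates: compare the mutual vector with each individual vector.
    g1 = [sigmoid(m * a) for m, a in zip(xm, x1)]
    g2 = [sigmoid(m * b) for m, b in zip(xm, x2)]

    def attend(x, g):
        # Residual attention: keep the original feature, add the gated part.
        return [v + v * gv for v, gv in zip(x, g)]

    return {
        "x1_self": attend(x1, g1),   # image 1 seen through its own gate
        "x1_other": attend(x1, g2),  # image 1 seen through image 2's gate
        "x2_self": attend(x2, g2),
        "x2_other": attend(x2, g1),
    }


out = pairwise_interaction([1.0, 0.0], [0.0, 2.0])
```

Each image yields two views, one gated by its own comparison with the mutual vector and one gated by the other image's comparison, so a classifier applied to all four vectors sees both what is distinctive in an image and what is distinctive in its partner.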

SmallBigNet: Integrating core and contextual views for video classification

arXiv, 2020

Authors: Li, Xianhang; Wang, Yali; Zhou, Zhipeng; Qiao, Yu. ShenZhen Key Lab of Computer Vision and Pattern Recognition, SIAT-SenseTime Joint Lab, Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences; SIAT Branch, Shenzhen Institute of Artificial Intelligence and Robotics for Society

Temporal convolution has been widely used for video classification. However, it operates on spatio-temporal contexts within a limited view, which often weakens its capacity to learn video representations. To alleviate this problem, we propose a concise and novel SmallBig network, in which small and big views cooperate. For the current time step, the small view branch learns the core semantics, while the big view branch captures the contextual semantics. Unlike traditional temporal convolution, the big view branch can provide the small view branch with the most activated video features from a broader 3D receptive field. By aggregating such big-view contexts, the small view branch can learn more robust and discriminative spatio-temporal representations for video classification. Furthermore, we propose to share convolution between the small and big view branches, which improves model compactness and alleviates overfitting. As a result, our SmallBigNet achieves a model size comparable to that of 2D CNNs, while boosting accuracy to the level of 3D CNNs. We conduct extensive experiments on large-scale video benchmarks, e.g., Kinetics400 and Something-Something V1 and V2. Our SmallBig network outperforms a number of recent state-of-the-art approaches in terms of accuracy and/or efficiency. The code and models will be available at https://***/xhl-video/SmallBigNet. Copyright © 2020, The Authors. All rights reserved.

Keywords: Convolution

Exploring Multi-Scale Feature Propagation and Communication for Image Super Resolution

arXiv, 2020

Authors: Feng, Ruicheng; Guan, Weipeng; Qiao, Yu; Dong, Chao. Shenzhen Key Lab of Computer Vision and Pattern Recognition, SIAT-SenseTime Joint Lab, Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences, China; Chinese University of Hong Kong, Hong Kong

Multi-scale techniques have achieved great success in a wide range of computer vision tasks. However, although these techniques are incorporated in existing works, there is still no comprehensive investigation of the variants of multi-scale convolution in image super-resolution. In this work, we present a unified formulation over widely used multi-scale structures. With this framework, we systematically explore two factors of multi-scale convolution: feature propagation and cross-scale communication. Based on the investigation, we propose a generic and efficient multi-scale convolution unit, Multi-Scale cross-Scale Share-weights convolution (MS3-Conv). Extensive experiments demonstrate that the proposed MS3-Conv achieves better SR performance than standard convolution with fewer parameters and lower computational cost. Beyond quantitative analysis, we comprehensively study the visual quality, which shows that MS3-Conv recovers high-frequency details better. Copyright © 2020, The Authors. All rights reserved.

Keywords: Optical resolving power

Self-supervised multi-view stereo via effective co-segmentation and data-augmentation

arXiv, 2021

Authors: Xu, Hongbin; Zhou, Zhipeng; Qiao, Yu; Kang, Wenxiong; Wu, Qiuxia. ShenZhen Key Lab of Computer Vision and Pattern Recognition, Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences, Shenzhen, China; Shanghai AI Lab, Shanghai, China; South China University of Technology, Guangzhou, China

Recent studies show that self-supervised methods based on view synthesis have made clear progress on multi-view stereo (MVS). However, existing methods rely on the assumption that corresponding points in different views share the same color, which may not always hold in practice. This can lead to an unreliable self-supervised signal and harm the final reconstruction performance. To address this issue, we propose a framework with more reliable supervision, guided by semantic co-segmentation and data augmentation. Specifically, we mine mutual semantics from multi-view images to guide semantic consistency. We also devise an effective data-augmentation mechanism that ensures robustness to transformations by treating the prediction on regular samples as pseudo ground truth to regularize the prediction on augmented samples. Experimental results on the DTU dataset show that our proposed method achieves state-of-the-art performance among unsupervised methods and is even on par with supervised methods. Furthermore, extensive experiments on the Tanks&Temples dataset demonstrate the generalization ability of the proposed method. Copyright © 2021, The Authors. All rights reserved.

Keywords: Semantics
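
The data-augmentation mechanism in the abstract above (the prediction on the regular sample acts as pseudo ground truth for the prediction on the augmented sample) can be sketched as a simple consistency loss. The abstract does not specify the loss, so the L1 distance, the flattened per-pixel depth lists, and the validity mask for pixels invalidated by the augmentation are assumptions for illustration.

```python
def augmentation_consistency_loss(pred_regular, pred_augmented, valid_mask):
    """Mean absolute difference between two depth predictions (assumed form).

    pred_regular   : depths predicted from the unaugmented views; treated as
                     fixed pseudo ground truth (no gradient would flow here).
    pred_augmented : depths predicted from the augmented views.
    valid_mask     : True where the augmentation left the pixel comparable
                     (hypothetical; e.g. False for occluded or erased pixels).
    """
    total, count = 0.0, 0
    for ref, aug, ok in zip(pred_regular, pred_augmented, valid_mask):
        if ok:
            total += abs(aug - ref)
            count += 1
    return total / count if count else 0.0


loss = augmentation_consistency_loss(
    pred_regular=[1.0, 2.0, 3.0],
    pred_augmented=[1.5, 2.0, 9.0],
    valid_mask=[True, True, False],  # third pixel is ignored
)
```

Only the augmented branch is penalized, so the network is pushed to give the same depth under perturbation without the perturbed input degrading the reference prediction.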

VFHQ: A High-Quality Dataset and Benchmark for Video Face Super-Resolution

arXiv, 2022

Authors: Xie, Liangbin; Wang, Xintao; Zhang, Honglun; Dong, Chao; Shan, Ying. Shenzhen Key Lab of Computer Vision and Pattern Recognition, Shenzhen Institute of Advanced Technology, Chinese Academy of Sciences, China; University of Chinese Academy of Sciences, China; ARC Lab, Tencent PCG, China

Most existing video face super-resolution (VFSR) methods are trained and evaluated on VoxCeleb1, which is designed specifically for speaker identification and whose frames are of low quality. As a consequence, VFSR models trained on this dataset cannot produce visually pleasing results. In this paper, we develop an automatic and scalable pipeline to collect a high-quality video face dataset (VFHQ), which contains over 16,000 high-fidelity clips of diverse interview scenarios. To verify the necessity of VFHQ, we further conduct experiments and demonstrate that VFSR models trained on our VFHQ dataset generate results with sharper edges and finer textures than those trained on VoxCeleb1. In addition, we show that temporal information plays a pivotal role in eliminating video consistency issues and in further improving visual performance. Based on VFHQ, we also analyze a benchmarking study of several state-of-the-art algorithms under bicubic and blind settings. Copyright © 2022, The Authors. All rights reserved.

Keywords: Benchmarking