Search results - Inner Mongolia University Library

31st British Machine Vision Conference, BMVC 2020

Authors: Liu, Zhen; Zhang, Baochang; Guo, Guodong. Affiliations: Beihang University, Beijing, China; Institute of Deep Learning, Baidu Research; National Engineering Laboratory for Deep Learning Technology and Application, China

Feature representation is fundamental and attracts much attention in few-shot learning. Convolutional neural networks (CNNs) are among the best feature extractors so far in this field and have been successfully combined with metric learning, leading to state-of-the-art performance. However, the subtle differences among inter-class samples challenge existing CNN-based methods, which use only real-valued CNNs that fail to extract more detailed information. In this paper, we introduce a complex metric module (CMM) into metric learning, aiming to better measure the inter- and intra-class relations based on both amplitude and phase information. Specifically, building upon the recent episodic training mechanism, our CMM can enhance the representation capacity by extracting robust complex-valued features to facilitate modeling subtle relationships among samples, which can enhance the performance of the few-shot classification task when only a few samples are available. Moreover, we introduce a new transductive method into CMM that considers not only query-support but also query-query relationships to predict the classes of unlabeled samples. Experiments on two benchmark datasets show that the proposed CMM significantly improves the performance over other approaches and achieves state-of-the-art results. © 2020. The copyright of this document resides with its authors. It may be distributed unchanged freely in print or electronic forms.

Keywords: Benchmarking

POEM: 1-bit Point-wise Operations based on Expectation-Maximization for Efficient Point Cloud Processing

32nd British Machine Vision Conference, BMVC 2021

Authors: Xu, Sheng; Li, Yanjing; Zhao, Junhe; Zhang, Baochang; Guo, Guodong. Affiliations: Beihang University, Beijing, China; National Engineering Laboratory for Deep Learning Technology and Application; Institute of Deep Learning, Baidu Research, Beijing, China

Real-time point cloud processing is fundamental for many computer vision tasks, but remains challenged by the computational cost on resource-limited edge devices. To address this issue, we implement XNOR-Net-based binary neural networks (BNNs) for efficient point cloud processing, but their performance suffers severely from two main drawbacks: Gaussian-distributed weights and a non-learnable scale factor. In this paper, we introduce point-wise operations based on Expectation-Maximization (POEM) into BNNs for efficient point cloud processing. The EM algorithm can efficiently constrain weights for a robust bi-modal distribution. We introduce a well-designed reconstruction loss to calculate learnable scale factors to enhance the representation capacity of 1-bit fully-connected (Bi-FC) layers. Extensive experiments demonstrate that our POEM surpasses existing state-of-the-art binary point cloud networks by a significant margin, up to 6.7%. © 2021. The copyright of this document resides with its authors.
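
For context, the XNOR-Net-style baseline named in the abstract can be sketched roughly as follows. This is only a minimal illustration of weight binarization with a fixed, non-learnable scale factor (the mean absolute weight), which is one of the drawbacks the abstract says POEM addresses. It is not the POEM method itself, and the function names `binarize_weights` and `binary_dot` are hypothetical, chosen here for illustration.

```python
def binarize_weights(weights):
    """Approximate real-valued weights as alpha * sign(w).

    alpha is the mean absolute weight: a fixed scale factor computed
    from the weights, not learned (the XNOR-Net-style baseline, not POEM).
    """
    if not weights:
        raise ValueError("weights must be non-empty")
    alpha = sum(abs(w) for w in weights) / len(weights)
    # Map each weight to +1 or -1; zero is treated as +1 by convention here.
    signs = [1 if w >= 0 else -1 for w in weights]
    return alpha, signs


def binary_dot(alpha, signs, inputs):
    """Dot product using the 1-bit weights, rescaled by alpha."""
    if len(signs) != len(inputs):
        raise ValueError("signs and inputs must have the same length")
    return alpha * sum(s * x for s, x in zip(signs, inputs))


# Hypothetical toy weights and inputs, for illustration only.
weights = [0.5, -1.5, 1.0, -1.0]
alpha, signs = binarize_weights(weights)
approx = binary_dot(alpha, signs, [1.0, 2.0, 3.0, 4.0])
exact = sum(w * x for w, x in zip(weights, [1.0, 2.0, 3.0, 4.0]))
print(alpha, signs, approx, exact)
```

The gap between `approx` and `exact` in this toy case hints at why a fixed scale factor can limit accuracy; how POEM learns its scale factors is described only at the level of the abstract above.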

Keywords: Computer vision

A new method of region embedding for text classification

6th International Conference on Learning Representations, ICLR 2018

Authors: Qiao, Chao; Huang, Bo; Niu, Guocheng; Li, Daren; Dong, Daxiang; He, Wei; Yu, Dianhai; Wu, Hua. Affiliations: Baidu Inc., Beijing, China; National Engineering Laboratory of Deep Learning Technology and Application, China

Representing a text as a bag of properly identified "phrases" and using that representation to process the text has proved useful. The key question here is how to identify the phrases and represent them. The traditional method of utilizing n-grams can be regarded as an approximation of this approach. Such a method can suffer from data sparsity, however, particularly when the length of the n-gram is large. In this paper, we propose a new method of learning and utilizing task-specific distributed representations of n-grams, referred to as "region embeddings". Without loss of generality we address text classification. We specifically propose two models for region embeddings. In our models, the representation of a word has two parts: the embedding of the word itself, and a weighting matrix to interact with the local context, referred to as the local context unit. The region embeddings are learned and used in the classification task, as parameters of the neural network classifier. Experimental results show that our proposed method outperforms existing methods in text classification on several benchmark datasets. The results also indicate that our method can indeed capture the salient phrasal expressions in the texts. © Learning Representations, ICLR 2018 - Conference Track. All rights reserved.

Keywords: Embeddings

GINet: Graph Interaction Network for Scene Parsing

16th European Conference on Computer Vision, ECCV 2020

Authors: Wu, Tianyi; Lu, Yu; Zhu, Yu; Zhang, Chuang; Wu, Ming; Ma, Zhanyu; Guo, Guodong. Affiliations: Institute of Deep Learning, Baidu Research, Beijing, China; National Engineering Laboratory for Deep Learning Technology and Application, Beijing, China; Beijing University of Posts and Telecommunications, Beijing, China

ISBN (digital): 9783030585204

ISBN (print): 9783030585198

Recently, context reasoning using image regions beyond local convolution has shown great potential for scene parsing. In this work, we explore how to incorporate linguistic knowledge to promote context reasoning over image regions by proposing a Graph Interaction unit (GI unit) and a Semantic Context Loss (SC-loss). The GI unit is capable of enhancing feature representations of convolution networks over high-level semantics and learning the semantic coherency adaptively to each sample. Specifically, the dataset-based linguistic knowledge is first incorporated in the GI unit to promote context reasoning over the visual graph, then the evolved representations of the visual graph are mapped to each local representation to enhance the discriminative capability for scene parsing. The GI unit is further improved by the SC-loss to enhance the semantic representations over the exemplar-based semantic graph. We perform full ablation studies to demonstrate the effectiveness of each component in our approach. In particular, the proposed GINet outperforms the state-of-the-art approaches on the popular benchmarks, including Pascal-Context and COCO Stuff. © 2020, Springer Nature Switzerland AG.

Keywords: Convolution

Adaptive cross-fusion learning for multi-modal gesture recognition

Virtual Reality & Intelligent Hardware, 2021, Vol. 3, No. 3, pp. 235-247

Authors: Benjia ZHOU; Jun WAN; Yanyan LIANG; Guodong GUO. Affiliations: Macao University of Science and Technology, Macao 999078, China; National Laboratory of Pattern Recognition, Institute of Automation, Chinese Academy of Sciences, Beijing 100190, China; Baidu Research, Beijing 100193, China; National Engineering Laboratory for Deep Learning Technology and Application, Beijing 100193, China

Background: Gesture recognition has attracted significant attention because of its wide range of potential *** multi-modal gesture recognition has made significant progress in recent years, a popular method still is simply fusing prediction scores at the end of each branch, which often ignores complementary features among different modalities in the early stage and does not fuse the complementary features into a more discriminative *** This paper proposes an Adaptive Cross-modal Weighting (ACmW) scheme to exploit complementarity features from RGB-D data in this *** scheme learns relations among different modalities by combining the features of different data *** proposed ACmW module contains two key functions: (1) fusing complementary features from multiple streams through an adaptive one-dimensional convolution; and (2) modeling the correlation of multi-stream complementary features in the time *** the effective combination of these two functional modules, the proposed ACmW can automatically analyze the relationship between the complementary features from different streams, and can fuse them in the spatial and temporal *** Extensive experiments validate the effectiveness of the proposed method, and show that our method outperforms state-of-the-art methods on IsoGD and NVGesture.

Keywords: Gesture recognition; Multi-modal fusion; RGB-D

Interactive grounded language acquisition and generalization in a 2D world

6th International Conference on Learning Representations, ICLR 2018

Authors: Yu, Haonan; Zhang, Haichao; Xu, Wei. Affiliations: Baidu Research, Sunnyvale, United States; National Engineering Laboratory for Deep Learning Technology and Applications, Beijing, China

We build a virtual agent for learning language in a 2D maze-like world. The agent sees images of the surrounding environment, listens to a virtual teacher, and takes actions to receive rewards. It interactively learns the teacher’s language from scratch based on two language use cases: sentence-directed navigation and question answering. It simultaneously learns the visual representations of the world, the language, and the action control. By disentangling language grounding from other computational routines and sharing a concept detection function between language grounding and prediction, the agent reliably interpolates and extrapolates to interpret sentences that contain new word combinations or new words missing from training sentences. The new words are transferred from the answers of language prediction. Such a language ability is trained and evaluated on a population of over 1.6 million distinct sentences consisting of 119 object words, 8 color words, 9 spatial-relation words, and 50 grammatical words. The proposed model significantly outperforms five comparison methods for interpreting zero-shot sentences. In addition, we demonstrate human-interpretable intermediate outputs of the model in the appendix. © Learning Representations, ICLR 2018 - Conference Track. All rights reserved.

Keywords: Visual languages

IAFA: Instance-Aware Feature Aggregation for 3D Object Detection from a Single Image

15th Asian Conference on Computer Vision, ACCV 2020

Authors: Zhou, Dingfu; Song, Xibin; Dai, Yuchao; Yin, Junbo; Lu, Feixiang; Liao, Miao; Fang, Jin; Zhang, Liangjun. Affiliations: Baidu Research, Beijing, China; National Engineering Laboratory of Deep Learning Technology and Application, Beijing, China; Northwestern Polytechnical University, Xi’an, China; Beijing Institute of Technology, Beijing, China

ISBN (print): 9783030695248

3D object detection from a single image is an important task in Autonomous Driving (AD), for which various approaches have been proposed. However, the task is intrinsically ambiguous and challenging, as single-image depth estimation is already an ill-posed problem. In this paper, we propose an instance-aware approach to aggregate useful information for improving the accuracy of 3D object detection, with the following contributions. First, an instance-aware feature aggregation (IAFA) module is proposed to collect local and global features for 3D bounding box regression. Second, we empirically find that the spatial attention module can be well learned by taking coarse-level instance annotations as a supervision signal. The proposed module significantly boosts the performance of the baseline method on both 3D detection and 2D bird's-eye view vehicle detection across all three categories. Third, our proposed method outperforms all single image-based approaches (even those trained with depth as auxiliary input) and achieves state-of-the-art 3D detection performance on the KITTI benchmark. © 2021, Springer Nature Switzerland AG.

Keywords: Object detection

Large scale autonomous driving scenarios clustering with self-supervised feature extraction

arXiv, 2021

Authors: Zhao, Jinxin; Fang, Jin; Ye, Zhixian; Zhang, Liangjun. Affiliations: Baidu Research and National Engineering Laboratory of Deep Learning Technology and Application, China; Baidu Research, United States

The clustering of autonomous driving scenario data can substantially benefit autonomous driving validation and simulation systems by improving the completeness and fidelity of simulation tests. This article proposes a comprehensive data clustering framework for a large set of vehicle driving data. Existing algorithms utilize handcrafted features whose quality relies on the judgments of human experts. Additionally, the related feature compression methods do not scale to a large dataset. Our approach thoroughly considers the traffic elements, including both in-traffic agent objects and map information. Meanwhile, we propose a self-supervised deep learning approach for spatial and temporal feature extraction to avoid biased data representation. With the newly designed driving data clustering evaluation metrics based on data augmentation, the accuracy assessment does not require a human-labeled dataset, which is subject to human bias. Using these unbiased evaluation metrics, we show that our approach surpasses the existing methods that rely on handcrafted feature extraction. © 2021, CC BY-NC-SA.

Keywords: Autonomous vehicles

Sparse to dense motion transfer for face image animation

arXiv, 2021

Authors: Zhao, Ruiqi; Wu, Tianyi; Guo, Guodong. Affiliations: Institute of Deep Learning, Baidu Research, Beijing, China; National Engineering Laboratory for Deep Learning Technology and Application, Beijing, China

Face image animation from a single image has achieved remarkable progress. However, it remains challenging when only sparse landmarks are available as the driving signal. Given a source face image and a sequence of sparse face landmarks, our goal is to generate a video of the face imitating the motion of landmarks. We develop an efficient and effective method for motion transfer from sparse landmarks to the face image. We then combine global and local motion estimation in a unified model to faithfully transfer the motion. The model can learn to segment the moving foreground from the background and generate not only global motion, such as rotation and translation of the face, but also subtle local motion such as the gaze change. We further improve face landmark detection on videos. With temporally better aligned landmark sequences for training, our method can generate temporally coherent videos with higher visual quality. Experiments suggest we achieve results comparable to the state-of-the-art image driven method on the same identity testing and better results on cross identity testing. © 2021, CC BY-NC-ND.

Keywords: Animation

Feature Selective Transformer for Semantic Image Segmentation

arXiv, 2022

Authors: Lin, Fangjian; Wu, Tianyi; Wu, Sitong; Tian, Shengwei; Guo, Guodong. Affiliations: Institute of Deep Learning, Baidu Research, Beijing, China; National Engineering Laboratory for Deep Learning Technology and Application, Beijing, China

Recently, fusing multi-scale features for semantic image segmentation has attracted increasing attention. Various works have proposed progressive local or global fusion, but the feature fusions are not rich enough for modeling multi-scale context features. In this work, we focus on fusing multi-scale features from Transformer-based backbones for semantic segmentation, and propose a Feature Selective Transformer (FeSeFormer), which aggregates features from all scales (or levels) for each query feature. Specifically, we first propose a Scale-level Feature Selection (SFS) module, which chooses an informative subset from the whole multi-scale feature set for each scale: features that are important for the current scale (or level) are selected and the redundant ones are discarded. Furthermore, we propose a Full-scale Feature Fusion (FFF) module, which adaptively fuses features of all scales for queries. Based on the proposed SFS and FFF modules, we develop FeSeFormer and evaluate it on four challenging semantic segmentation benchmarks, including PASCAL Context, ADE20K, COCO-Stuff 10K, and Cityscapes, outperforming the state-of-the-art. Copyright © 2022, The Authors. All rights reserved.

Keywords: Semantic Segmentation
