ISBN (print): 9781665448994
Automatically extracting visual attributes from e-commerce data has widespread applications in cataloging, catalogue qualification and enrichment, visual search, and more. Here, we address the task of visual attribute extraction for a highly challenging real-world fashion dataset from the Flipkart catalogue (Flipkart is an Indian e-commerce platform), collected from seller-uploaded product images. This data not only contains widely varying categories (e.g., shirt, sari, shoes), but also has both coarse-grained (e.g., occasion, top type, sari type) and fine-grained (e.g., neck type, print type) attributes. The training examples available for different attributes are highly imbalanced, making the task even more challenging. To this end, we propose an end-to-end framework that integrates multi-task learning with a transformer as an attention module, in addition to handling the data imbalance. The proposed architecture supports multiple attributes across various product categories in a scalable manner. Extensive experiments on the in-house dataset show the effectiveness of the proposed framework, which improves performance on the fine-grained attributes by 13% over the baseline across the attributes.
ISBN (print): 9781665445092
Neural architecture search (NAS) has shown great promise in designing state-of-the-art (SOTA) models that are both accurate and efficient. Recently, two-stage NAS, e.g., BigNAS, has decoupled the model training and searching processes and achieved remarkable search efficiency and accuracy. Two-stage NAS requires sampling from the search space during training, which directly impacts the accuracy of the final searched models. While uniform sampling has been widely used for its simplicity, it is agnostic to the model performance Pareto front, which is the main focus of the search process, and thus misses opportunities to further improve model accuracy. In this work, we propose AttentiveNAS, which focuses on improving the sampling strategy to achieve a better performance Pareto front. We also propose algorithms to efficiently and effectively identify the networks on the Pareto front during training. Without extra re-training or post-processing, we can simultaneously obtain a large number of networks across a wide range of FLOPs. Our discovered model family, AttentiveNAS models, achieves top-1 accuracy from 77.3% to 80.7% on ImageNet, and outperforms SOTA models, including BigNAS, Once-for-All networks and FBNetV3. We also achieve ImageNet accuracy of 80.1% with only 491 MFLOPs. Our training code and pretrained models are available at https://***/facebookresearch/AttentiveNAS.
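The contrast between uniform and Pareto-aware sampling can be illustrated with a minimal sketch. It assumes that Pareto-aware sampling draws several candidates and keeps the one a cheap estimator ranks best (or worst); the toy search space, the estimator `estimated_accuracy`, and the candidate count `k` are hypothetical and are not taken from the abstract.

```python
import random

def sample_architecture(rng):
    # Hypothetical toy search space: an architecture is a (depth, width) pair.
    return (rng.randint(1, 8), rng.choice([16, 32, 64, 128]))

def estimated_accuracy(arch):
    # Stand-in for a cheap performance estimate (an assumption, not the paper's method).
    depth, width = arch
    return 1.0 - 1.0 / (1.0 + 0.1 * depth * width ** 0.5)

def uniform_sample(rng):
    # Baseline: keep a single uniform draw, ignoring estimated performance.
    return sample_architecture(rng)

def attentive_sample(rng, k=5, pick=max):
    # Draw k candidates and keep the one the estimator ranks best (pick=max)
    # or worst (pick=min).
    candidates = [sample_architecture(rng) for _ in range(k)]
    return pick(candidates, key=estimated_accuracy)

rng = random.Random(0)
chosen = attentive_sample(rng, k=5, pick=max)
```

Passing `pick=min` instead focuses training on the candidates the estimator considers weakest; which choice works better is an empirical question that this sketch does not answer.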
ISBN (print): 9781665445092
We propose a novel Siamese Natural Language Tracker (SNLT), which brings the advancements in visual tracking to the task of tracking by natural language (NL) descriptions. The proposed SNLT is applicable to a wide range of Siamese trackers, providing a new class of baselines for the tracking by NL task and promising future improvements from the advancements of Siamese trackers. The carefully designed architecture of the Siamese Natural Language Region Proposal Network (SNL-RPN), together with the Dynamic Aggregation of vision and language modalities, is introduced to perform the tracking by NL task. Empirical results over tracking benchmarks with NL annotations show that the proposed SNLT improves Siamese trackers by 3 to 7 percentage points with a slight trade-off in speed. The proposed SNLT outperforms all NL trackers to date and is competitive among state-of-the-art real-time trackers on LaSOT benchmarks while running at 50 frames per second on a single GPU. Code for this work is available at https://github.com/fredfung007/snlt.
ISBN (digital): 9798350353006
ISBN (print): 9798350353013
Vision-centric 3D environment understanding is both vital and challenging for autonomous driving systems. Recently, object-free methods have attracted considerable attention. Such methods perceive the world by predicting the semantics of discrete voxel grids but fail to construct continuous and accurate obstacle surfaces. To this end, in this paper, we propose SurroundSDF to implicitly predict the signed distance field (SDF) and semantic field for the continuous perception from surround images. Specifically, we introduce a query-based approach and utilize SDF constrained by the Eikonal formulation to accurately describe the surfaces of obstacles. Furthermore, considering the absence of precise SDF ground truth, we propose a novel weakly supervised paradigm for SDF, referred to as the Sandwich Eikonal formulation, which emphasizes applying correct and dense constraints on both sides of the surface, thereby enhancing the perceptual accuracy of the surface. Experiments suggest that our method achieves SOTA for both occupancy prediction and 3D scene reconstruction tasks on the nuScenes dataset.
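For readers unfamiliar with the Eikonal constraint, the sketch below checks the standard condition that a signed distance field has unit gradient norm everywhere, using a 2D circle as a toy surface. It is not the paper's Sandwich Eikonal formulation, whose details the abstract does not give; the function names and sample points are illustrative assumptions.

```python
import math

def sdf_circle(x, y, radius=1.0):
    # Signed distance to a circle: negative inside, zero on the surface, positive outside.
    return math.hypot(x, y) - radius

def gradient_norm(f, x, y, h=1e-5):
    # Central finite differences approximate the spatial gradient of f.
    gx = (f(x + h, y) - f(x - h, y)) / (2 * h)
    gy = (f(x, y + h) - f(x, y - h)) / (2 * h)
    return math.hypot(gx, gy)

def eikonal_residual(f, points):
    # Mean of (|grad f| - 1)^2 over the sample points; near zero for a true SDF.
    return sum((gradient_norm(f, x, y) - 1.0) ** 2 for x, y in points) / len(points)

# Sample points on both sides of the surface (inside and outside the circle).
points = [(0.5, 0.2), (1.5, -0.7), (-2.0, 0.3)]
residual = eikonal_residual(sdf_circle, points)
```

A field that is not a distance function, such as `2 * sdf_circle(x, y)`, has gradient norm 2 and therefore a residual close to 1, which is what a penalty of this kind would push against during training.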
ISBN (print): 9781665445092
In this paper, we propose a Cluster-wise Hierarchical Generative Model for deep amortized clustering (CHiGac). It provides an efficient neural clustering architecture by grouping data points in a cluster-wise view rather than a point-wise view. CHiGac simultaneously learns what makes a cluster, how to group data points into clusters, and how to adaptively control the number of clusters. The dedicated cluster generative process exploits pairwise and higher-order interactions between data points both between and within clusters, which helps mine the hidden structure of the data. To efficiently minimize the generalized lower bound of CHiGac, we design an Ergodic Amortized Inference (EAI) strategy that considers the average behavior over a sequence on an inner variational parameter trajectory, which is theoretically proven to reduce the amortization gap. A series of experiments has been conducted on both synthetic and real-world data. The experimental results demonstrate that CHiGac can efficiently and accurately cluster datasets in terms of both internal and external evaluation metrics (DBI and ACC).
ISBN (print): 9781665445092
A major challenge for physically unconstrained gaze estimation is acquiring training data with 3D gaze annotations for in-the-wild and outdoor scenarios. In contrast, videos of human interactions in unconstrained environments are abundantly available and can be much more easily annotated with frame-level activity labels. In this work, we tackle the previously unexplored problem of weakly-supervised gaze estimation from videos of human interactions. We leverage the insight that strong gaze-related geometric constraints exist when people perform the activity of "looking at each other" (LAEO). To acquire viable 3D gaze supervision from LAEO labels, we propose a training algorithm along with several novel loss functions especially designed for the task. With weak supervision from two large-scale activity datasets, CMU-Panoptic and AVA-LAEO, we show significant improvements in (a) the accuracy of semi-supervised gaze estimation and (b) cross-domain generalization on the state-of-the-art physically unconstrained in-the-wild Gaze360 gaze estimation benchmark.
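The geometric constraint behind LAEO can be sketched as follows: if two people look at each other, each person's gaze points along the line joining their heads, in opposite directions. This is a minimal illustration that assumes 3D head positions are known; the function names are hypothetical, and the paper's actual loss functions are not described in the abstract.

```python
import math

def normalize(v):
    # Scale a 3D vector to unit length.
    norm = math.sqrt(sum(c * c for c in v))
    return tuple(c / norm for c in v)

def laeo_pseudo_gaze(head_a, head_b):
    # If A and B look at each other, A's gaze points from A's head to B's head
    # and B's gaze is the exact opposite direction.
    gaze_a = normalize(tuple(b - a for a, b in zip(head_a, head_b)))
    gaze_b = tuple(-c for c in gaze_a)
    return gaze_a, gaze_b

def angular_error_deg(predicted, target):
    # Angle in degrees between a predicted gaze and a pseudo-label gaze.
    cos = sum(p * t for p, t in zip(normalize(predicted), normalize(target)))
    return math.degrees(math.acos(max(-1.0, min(1.0, cos))))

gaze_a, gaze_b = laeo_pseudo_gaze((0.0, 0.0, 0.0), (0.0, 0.0, 2.0))
error = angular_error_deg((1.0, 0.0, 0.0), gaze_a)
```

An angular error of this kind between a network's prediction and the pseudo-label could serve as a weak training signal on frames labeled LAEO.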
ISBN (print): 9781665445092
We introduce a novel representation learning method to disentangle pose-dependent and view-dependent factors from 2D human poses. The method trains a network using cross-view mutual information maximization (CV-MIM), which maximizes the mutual information of the same pose performed from different viewpoints in a contrastive learning manner. We further propose two regularization terms to ensure disentanglement and smoothness of the learned representations. The resulting pose representations can be used for cross-view action recognition. To evaluate the power of the learned representations, in addition to the conventional fully-supervised action recognition settings, we introduce a novel task called single-shot cross-view action recognition. This task trains models with actions from only a single viewpoint, while models are evaluated on poses captured from all possible viewpoints. We evaluate the learned representations on standard benchmarks for action recognition, and show that (i) CV-MIM performs competitively compared with the state-of-the-art models in the fully-supervised scenarios; (ii) CV-MIM outperforms other competing methods by a large margin in the single-shot cross-view setting; and (iii) the learned representations can significantly boost the performance when reducing the amount of supervised training data. Our code is made publicly available at https://***/google-research/google-research/tree/master/poem.
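One common way to maximize mutual information contrastively is the InfoNCE objective, sketched below for paired embeddings of the same pose from two viewpoints. The abstract does not say which estimator CV-MIM uses, so treating it as InfoNCE is an assumption, as are the function names, the temperature, and the toy embeddings.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def info_nce(view_a, view_b, temperature=0.1):
    # view_a[i] and view_b[i] embed the same pose seen from two viewpoints
    # (a positive pair); view_b[j] with j != i act as negatives.
    total = 0.0
    for i, anchor in enumerate(view_a):
        logits = [dot(anchor, other) / temperature for other in view_b]
        m = max(logits)  # subtract the max for a numerically stable log-sum-exp
        log_denominator = m + math.log(sum(math.exp(z - m) for z in logits))
        total += log_denominator - logits[i]
    return total / len(view_a)

# Toy unit embeddings: two poses, each seen from two viewpoints.
view_a = [(1.0, 0.0), (0.0, 1.0)]
aligned = info_nce(view_a, [(1.0, 0.0), (0.0, 1.0)])
mismatched = info_nce(view_a, [(0.0, 1.0), (1.0, 0.0)])
```

The loss is small when embeddings of the same pose agree across views and large when they are confused with other poses, so minimizing it pulls cross-view pairs together.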
ISBN (print): 9781665448994
This paper presents a solution for mapping the location of trees in an orchard and estimating the dendrometric data of the trees. The combined solution consists of a mapping and navigation algorithm, which allows for autonomous data collection in an orchard with a regular rectangular layout, and data processing for tree detection and dendrometric data estimation. The data collection is done using an Intel RealSense D435i camera, which can obtain both RGB and depth data. The paper compares the performance of point cloud processing (PCP) and convolutional neural networks (CNNs) on RGB data for tree detection and dendrometric data estimation. The YOLOv3 CNN achieved a mAP50 of 63.53% at 65.5 FPS and a mean error of 20.6 cm in height estimation. Point cloud processing achieved a precision of 76.72% at 2.1 FPS and a mean error of 20.4 cm in height estimation. In conclusion, this work shows that point cloud processing yields results comparable to those of convolutional neural networks for height estimation, but trades processing time for better precision in detection.
ISBN (print): 9783031189067; 9783031189074
Credit rating is an analysis of the credit risks associated with a corporation. It reflects the level of risk and reliability in investing, and plays a vital role in financial risk management. Many studies have applied machine learning and deep learning techniques based on vector space to corporate credit rating. Recently, with the advent of graph neural networks, some graph-based models that consider relations among enterprises, such as loan guarantee networks, have been applied in this field. However, these existing models build networks between corporations without taking the internal feature interactions into account. In this paper, to overcome this problem, we propose a novel model, Corporate Credit Rating via Graph Neural Networks, or CCR-GNN for brevity. We first construct an individual graph for each corporation based on the self-outer product and then use a GNN to model the feature interactions explicitly, which includes both local and global information. Extensive experiments conducted on a Chinese public-listed corporate rating dataset show that CCR-GNN consistently outperforms the state-of-the-art methods.
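The graph construction step can be sketched as follows: the outer product of a corporation's feature vector with itself gives a matrix of pairwise feature interactions, which can be read as a weighted adjacency matrix with one node per feature. The threshold, the edge-list representation, and the example values are illustrative assumptions, not details from the abstract.

```python
def self_outer_product(features):
    # Entry [i][j] is features[i] * features[j]: a pairwise interaction weight
    # between feature i and feature j of a single corporation.
    return [[a * b for b in features] for a in features]

def to_edge_list(matrix, threshold=0.0):
    # Treat each feature as a node and keep off-diagonal entries whose
    # magnitude exceeds a hypothetical threshold as weighted edges.
    n = len(matrix)
    return [(i, j, matrix[i][j])
            for i in range(n) for j in range(n)
            if i != j and abs(matrix[i][j]) > threshold]

features = [0.5, 2.0, 0.0]  # made-up financial indicators for one corporation
adjacency = self_outer_product(features)
edges = to_edge_list(adjacency)
```

Because the matrix is symmetric, each interaction appears as two directed edges; a GNN layer would then propagate information along these edges.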
ISBN (print): 9781665445092
Glass surfaces appear everywhere. Their existence can, however, pose a serious problem to computer vision tasks. Recently, a method was proposed to detect glass surfaces by learning multi-scale contextual information. However, as it is based only on a general context integration operation and does not consider any specific glass surface properties, it gets confused when the images contain objects that are similar to glass surfaces, and it degrades in challenging scenes with insufficient context. We observe that humans often rely on identifying reflections to sense the existence of glass and on locating the boundary to determine the extent of the glass. Hence, we propose a model for glass surface detection, which consists of two novel modules: (1) a rich context aggregation module (RCAM) to extract multi-scale boundary features from rich context features for locating glass surface boundaries of different sizes and shapes, and (2) a reflection-based refinement module (RRM) to detect reflections and then incorporate them so as to differentiate glass regions from non-glass regions. In addition, we propose a challenging dataset consisting of 4,012 glass images with annotations for glass surface detection. Our experiments demonstrate that the proposed model outperforms state-of-the-art methods from relevant fields.