The multimedia community has recently seen a tremendous surge of interest in the fashion recommendation problem. A lot of efforts have been made to model the compatibility between fashion items. Some have also studied...
详细信息
ISBN:
(纸本)9781665445092
The multimedia community has recently seen a tremendous surge of interest in the fashion recommendation problem. A lot of efforts have been made to model the compatibility between fashion items. Some have also studied users' personal preferences for the outfits. There is, however, another difficulty in the task that hasn't been dealt with carefully by previous work. Users that are new to the system usually only have several (less than 5) outfits available for learning. With such a limited number of training examples, it is challenging to model the user's preferences reliably. In this work, we propose a new solution for personalized outfit recommendation that is capable of handling this case. We use a stacked self-attention mechanism to model the high-order interactions among the items. We then embed the items in an outfit into a single compact representation within the outfit space. To accommodate the variety of users' preferences, we characterize each user with a set of anchors, i.e. a group of learnable latent vectors in the outfit space that are the representatives of the outfits the user likes. We also learn a set of general anchors to model the general preference shared by all users. Based on this representation of the outfits and the users, we propose a simple but effective strategy for the new user profiling tasks. Extensive experiments on large scale real-world datasets demonstrate the performance of our proposed method.
Modern lane detection methods have achieved remarkable performances in complex real-world scenarios, but many have issues maintaining real-time efficiency, which is important for autonomous vehicles. In this work, we ...
详细信息
ISBN:
(纸本)9781665445092
Modern lane detection methods have achieved remarkable performances in complex real-world scenarios, but many have issues maintaining real-time efficiency, which is important for autonomous vehicles. In this work, we propose LaneATT: an anchor-based deep lane detection model, which, akin to other generic deep object detectors, uses the anchors for the feature pooling step. Since lanes follow a regular pattern and are highly correlated, we hypothesize that in some cases global information may be crucial to infer their positions, especially in conditions such as occlusion, missing lane markers, and others. Thus, this work proposes a novel anchor-based attention mechanism that aggregates global information. The model was evaluated extensively on three of the most widely used datasets in the literature. The results show that our method outperforms the current state-of-the-art methods showing both higher efficacy and efficiency. Moreover, an ablation study is performed along with a discussion on efficiency trade-off options that are useful in practice.
In the realm of computervision, image denoising remains a formidable challenge with profound implications for fields like medical imaging, remote sensing, and photography. Despite notable advancements in deep learnin...
详细信息
Most existing video text detection methods track texts with appearance features, which are easily influenced by the change of perspective and illumination. Compared with appearance features, semantic features are more...
详细信息
ISBN:
(纸本)9781665445092
Most existing video text detection methods track texts with appearance features, which are easily influenced by the change of perspective and illumination. Compared with appearance features, semantic features are more robust cues for matching text instances. In this paper, we propose an end-to-end trainable video text detector that tracks texts based on semantic features. First, we introduce a new character center segmentation branch to extract semantic features, which encode the category and position of characters. Then we propose a novel appearance-semanticgeometry descriptor to track text instances, in which semantic features can improve the robustness against appearance changes. To overcome the lack of character-level annotations, we propose a novel weakly-supervised character center detection module, which only uses word-level annotated real images to generate character-level labels. The proposed method achieves state-of-the-art performance on three video text benchmarks ICDAR 2013 Video, Minetto and RT-IK, and two Chinese scene text benchmarks CA-SIA10K and MSRA-TD500.
We present a deep neural network to predict structural similarity between 2D layouts by leveraging Graph Matching Networks (GMN). Our network, coined LayoutGMN, learns the layout metric via neural graph matching, usin...
详细信息
ISBN:
(纸本)9781665445092
We present a deep neural network to predict structural similarity between 2D layouts by leveraging Graph Matching Networks (GMN). Our network, coined LayoutGMN, learns the layout metric via neural graph matching, using an attention-based GMN designed under a triplet network setting. To train our network, we utilize weak labels obtained by pixel-wise Intersection-over-Union (IoUs) to define the triplet loss. Importantly, LayoutGMN is built with a structural bias which can effectively compensate for the lack of structure awareness in IoUs. We demonstrate this on two prominent forms of layouts, viz., floorplans and UI designs, via retrieval experiments on large-scale datasets. In particular, retrieval results by our network better match human judgement of structural layout similarity compared to both IoUs and other baselines including a state-of-theart method based on graph neural networks and image convolution. In addition, LayoutGMN is the first deep model to offer both metric learning of structural layout similarity and structural matching between layout elements.
We present a simple, effective, and general activation function we term ACON which learns to activate the neurons or not. Interestingly, we find Swish, the recent popular NAS-searched activation, can be interpreted as...
详细信息
ISBN:
(纸本)9781665445092
We present a simple, effective, and general activation function we term ACON which learns to activate the neurons or not. Interestingly, we find Swish, the recent popular NAS-searched activation, can be interpreted as a smooth approximation to ReLU. Intuitively, in the same way, we approximate the more general Maxout family to our novel ACON family, which remarkably improves the performance and makes Swish a special case of ACON. Next, we present meta-ACON, which explicitly learns to optimize the parameter switching between non-linear (activate) and linear (inactivate) and provides a new design space. By simply changing the activation function, we show its effectiveness on both small models and highly optimized large models (e.g. it. improves the ImageNet top-1 accuracy rate by 6.7% and 1.8% on MobileNet0.25 and ResNet-152, respectively). Moreover, our novel ACON can be naturally transferred to object detection and semantic segmentation, showing that ACON is an effective alternative in a variety of tasks. Code is available at https: // ***/nmaac/acon.
While radar and video data can be readily fused at the detection level, fusing them at the pixel level is potentially more beneficial. This is also more challenging in part due to the sparsity of radar, but also becau...
详细信息
ISBN:
(纸本)9781665445092
While radar and video data can be readily fused at the detection level, fusing them at the pixel level is potentially more beneficial. This is also more challenging in part due to the sparsity of radar, but also because automotive radar beams are much wider than a typical pixel combined with a large baseline between camera and radar, which results in poor association between radar pixels and color pixel. A consequence is that depth completion methods designed for LiDAR and video fare poorly for radar and video. Here we propose a radar-to-pixel association stage which learns a mapping from radar returns to pixels. This mapping also serves to densify radar returns. Using this as a first stage, followed by a more traditional depth completion method, we are able to achieve imageguided depth completion with radar and video. We demonstrate performance superior to camera and radar alone on the nuScenes dataset. Our source code is available at https://***/longyunf/rc-pda.
In most existing learning systems, images are typically viewed as 2D pixel arrays. However, in another paradigm gaining popularity, a 2D image is represented as an implicit neural representation (INR) - an MLP that pr...
详细信息
ISBN:
(纸本)9781665445092
In most existing learning systems, images are typically viewed as 2D pixel arrays. However, in another paradigm gaining popularity, a 2D image is represented as an implicit neural representation (INR) - an MLP that predicts an RGB pixel value given its (x, y) coordinate. In this paper, we propose two novel architectural techniques for building INR-based image decoders: factorized multiplicative modulation and multi-scale INRs, and use them to build a state-of-the-art continuous image GAN. Previous attempts to adapt INRs for image generation were limited to MNIST-like datasets and do not scale to complex real-world data. Our proposed INR-GAN architecture improves the performance of continuous image generators by several times, greatly reducing the gap between continuous image GANs and pixel-based ones. Apart from that, we explore several exciting properties of the INR-based decoders, like out-of-the-box superresolution, meaningful image-space interpolation, accelerated inference of low-resolution images, an ability to extrapolate outside of image boundaries, and strong geometric prior. The project page is located at https://***/inr-gan.
The single image dehazing task has made significant progress recently, aiming to recover the contrast and color of the scattered image. Many patch prior based dehazing methods have been explored, and this paper propos...
详细信息
ISBN:
(纸本)9781665448994
The single image dehazing task has made significant progress recently, aiming to recover the contrast and color of the scattered image. Many patch prior based dehazing methods have been explored, and this paper proposes another single image dehazing method by analyzing the prior information of local dehazed patches. With our observation, when the estimated transmission value varies from the ground-truth transmission value to 1, the output value of a metric function decrease correspondingly, which is defined based on the difference maps among three RGB channels of local dehazed patches normalized using global atmospheric light. Under additional bounding, the local transmission value can be estimated accurately. To reduce computation time, the whole image is divided into many small patches, and within each patch, we estimate a transmission value accurately. We further use weighted interpolation and guided filtering to refine the edges and details of the rough transmission map. Finally, we evaluate the proposed method using Fattal's synthetic haze images, SOTS dataset, and a wide variety of real-world haze images. Experiments show that our method outperforms other state-of-the-art dehazing algorithms by a large margin, especially on synthetic noisy haze images.
Existing studies in weakly-supervised semantic segmentation (WSSS) using image-level weak supervision have several limitations: sparse object coverage, inaccurate object boundaries, and co-occurring pixels from non-ta...
详细信息
ISBN:
(数字)9781665445092
ISBN:
(纸本)9781665445092
Existing studies in weakly-supervised semantic segmentation (WSSS) using image-level weak supervision have several limitations: sparse object coverage, inaccurate object boundaries, and co-occurring pixels from non-target objects. To overcome these challenges, we propose a novel framework, namely Explicit Pseudo-pixel Supervision (EPS), which learns from pixel-level feedback by combining two weak supervisions;the image-level label provides the object identity via the localization map and the saliency map from the off-the-shelf saliency detection model offers rich boundaries. We devise a joint training strategy to fully utilize the complementary relationship between both information. Our method can obtain accurate object boundaries and discard co-occurring pixels, thereby significantly improving the quality of pseudo-masks. Experimental results show that the proposed method remarkably outperforms existing methods by resolving key challenges of WSSS and achieves the new state-of-the-art performance on both PASCAL VOC 2012 and MS COCO 2014 datasets. The code is available at https://***/halbielee/EPS.
暂无评论