Vision (image and video)-Language (VL) pre-training is a recently popular paradigm that has achieved state-of-the-art results on multi-modal tasks such as image retrieval, video retrieval, and visual question answering. These models are trained in an unsupervised way and benefit greatly from the complementary supervision of the two modalities. In this paper, we explore whether language representations trained with vision supervision perform better than vanilla language representations on Natural Language Understanding and commonsense reasoning benchmarks. We experiment with a diverse set of image-text models such as ALBEF, BLIP, and METER, and video-text models such as ALPRO, Frozen in Time, and VIOLET. We compare the language representations produced by the stand-alone text encoders of these models against the language representations of text encoders learnt through vision supervision. Our experiments suggest that vanilla language representations show superior performance on most of the tasks. These results shed light on the current drawbacks of vision-language models. The code is available at https://***/avinashsai/MML
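As a concrete illustration of the comparison protocol, the sketch below extracts sentence representations from a stand-alone text encoder; under the paper's setup, the same routine would be re-run with the text tower of a vision-supervised checkpoint (e.g. ALBEF or BLIP) in place of the vanilla model. The model identifier shown and the [CLS] pooling choice are our assumptions, not details from the paper.

```python
# Minimal sketch: pool a sentence embedding from a text encoder, then feed it
# to a downstream NLU probe (probe omitted). Hypothetical setup, not the
# authors' exact pipeline.
import torch
from transformers import AutoTokenizer, AutoModel

def embed(texts, model_name="bert-base-uncased"):
    tok = AutoTokenizer.from_pretrained(model_name)
    enc = AutoModel.from_pretrained(model_name).eval()
    batch = tok(texts, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        out = enc(**batch).last_hidden_state   # (B, T, D) token features
    return out[:, 0]                           # [CLS] token as sentence vector

# Vanilla language representations ...
vanilla = embed(["a man riding a horse"])
# ... vs. the text tower of a vision-supervised model, loaded from its own
# released checkpoint (identifiers differ per model; placeholder below).
# vision_supervised = embed(["a man riding a horse"], model_name="<vl-checkpoint>")
```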
Depth prediction is at the core of several computer vision applications, such as autonomous driving and robotics. It is often formulated as a regression task in which depth values are estimated through network layers. Unfortunately, the distribution of values on depth maps is seldom explored. Therefore, this paper proposes a novel framework combining contrastive learning and depth prediction, allowing us to pay more attention to the depth distribution and consequently improving the overall estimation process. To this end, we propose a window-based contrastive learning module, which partitions the feature maps into non-overlapping windows and constructs a contrastive loss within each one. Forming and sorting positive and negative pairs, then enlarging the gap between the two in the representation space, constrains the depth distribution to fit the features of the depth map. Experiments on the KITTI and NYU datasets demonstrate the effectiveness of our framework.
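A minimal sketch of what such a window-based contrastive term could look like, assuming positives are pixels within the same window whose ground-truth depths fall into the same bin (a simplification of the paper's pair forming and sorting); the function name and binning rule are ours:

```python
import torch
import torch.nn.functional as F

def window_contrastive_loss(feat, depth, win=8, bins=4, tau=0.1):
    # feat: (B, C, H, W) feature map; depth: (B, 1, H, W) ground-truth depth.
    B, C, H, W = feat.shape
    # Partition both tensors into non-overlapping win x win windows.
    f = feat.unfold(2, win, win).unfold(3, win, win)           # (B,C,nh,nw,win,win)
    f = f.permute(0, 2, 3, 4, 5, 1).reshape(-1, win * win, C)  # (B*nW, win*win, C)
    d = depth.unfold(2, win, win).unfold(3, win, win)
    d = d.permute(0, 2, 3, 4, 5, 1).reshape(-1, win * win)
    f = F.normalize(f, dim=-1)
    sim = f @ f.transpose(1, 2) / tau                          # pairwise cosine logits
    # Quantize depth into bins: same-bin pixels in a window are positives.
    q = ((d - d.min()) / (d.max() - d.min() + 1e-6) * bins).long().clamp(max=bins - 1)
    pos = (q.unsqueeze(2) == q.unsqueeze(1)).float()
    pos = pos - torch.eye(win * win, device=f.device)          # exclude self-pairs
    # InfoNCE per window (self kept in the denominator for simplicity).
    logp = sim - torch.logsumexp(sim, dim=2, keepdim=True)
    return -(logp * pos).sum() / pos.sum().clamp(min=1)
```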
ISBN (Print): 9781665445092
Despite their impressive performance on many individual tasks, deep neural networks suffer from catastrophic forgetting when learning new tasks incrementally. Recently, various incremental learning methods have been proposed, and some approaches achieve acceptable performance by relying on stored data or complex generative models. However, storing data from previous tasks is limited by memory or privacy issues, and generative models are usually unstable and inefficient to train. In this paper, we propose a simple non-exemplar-based method named PASS to address the catastrophic forgetting problem in incremental learning. On the one hand, we propose to memorize one class-representative prototype for each old class and adopt prototype augmentation (protoAug) in the deep feature space to maintain the decision boundaries of previous tasks. On the other hand, we employ self-supervised learning (SSL) to learn features that are more generalizable and transferable to other tasks, which demonstrates the effectiveness of SSL in incremental learning. Experimental results on benchmark datasets show that our approach significantly outperforms non-exemplar-based methods and achieves performance comparable to exemplar-based approaches.
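The protoAug idea lends itself to a short sketch: store one mean feature per old class, perturb it with Gaussian noise in deep feature space, and keep classifying the perturbed prototypes correctly. The noise scale below is an assumed constant (the paper ties it to feature statistics):

```python
import torch
import torch.nn.functional as F

def proto_aug_loss(prototypes, labels, classifier, n_samples=16, r=0.1):
    """Classify Gaussian-perturbed old-class prototypes to hold old boundaries."""
    idx = torch.randint(len(prototypes), (n_samples,))
    feats = prototypes[idx] + r * torch.randn(n_samples, prototypes.size(1))
    return F.cross_entropy(classifier(feats), labels[idx])

# Usage: prototypes (K, D) are stored class-mean features, labels (K,) their
# class ids, classifier is e.g. torch.nn.Linear(D, num_classes). This term is
# added to the loss on new-task data during each incremental step.
```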
ISBN (Print): 9781665445092
Compared to 2D object bounding-box labeling, it is very difficult for humans to annotate 3D object poses, especially when depth images of scenes are unavailable. This paper investigates whether we can estimate object poses effectively when only RGB images and 2D object annotations are given. To this end, we present a two-step pose estimation framework to attain 6DoF object poses from 2D object bounding-boxes. In the first step, the framework learns to segment objects from real and synthetic data in a weakly-supervised fashion, and the segmentation masks act as a prior for pose estimation. In the second step, we design a dual-scale pose estimation network, namely DSC-PoseNet, to predict object poses by employing a differentiable renderer. To be specific, our DSC-PoseNet first predicts object poses at the original image scale by comparing the segmentation masks with the rendered visible object masks. Then, we resize object regions to a fixed scale to estimate poses once again. In this fashion, we eliminate large scale variations and focus on rotation estimation, thus facilitating pose estimation. Moreover, we exploit the initial pose estimates to generate pseudo ground-truth for training our DSC-PoseNet in a self-supervised manner. The estimation results at the two scales are ensembled as our final pose estimate. Extensive experiments on widely-used benchmarks demonstrate that our method outperforms state-of-the-art models trained on synthetic data by a large margin and is even on par with several fully-supervised methods.
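At both scales, the supervision reduces to comparing a rendered visible-object mask against the segmentation mask. A minimal sketch of that comparison as a soft-IoU loss; the differentiable rendering step (pose to mask) is abstracted away, and the loss form is our assumption rather than the authors' exact formulation:

```python
import torch

def mask_iou_loss(rendered, segmented, eps=1e-6):
    # rendered, segmented: (B, H, W) soft masks in [0, 1]; rendered comes from
    # a differentiable renderer, so this loss back-propagates to the pose.
    inter = (rendered * segmented).sum(dim=(1, 2))
    union = (rendered + segmented - rendered * segmented).sum(dim=(1, 2))
    return (1.0 - inter / (union + eps)).mean()
```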
ISBN (Digital): 9798350365474
ISBN (Print): 9798350365481
Multi-Camera People Tracking is a multifaceted problem that requires the integration of several computer vision tasks, such as Object Detection, Multiple Object Tracking, and Person Re-identification. This study presents a multi-camera people tracking method that comprises four main processes: (1) single-camera people tracking based on overlap suppression clustering, (2) representative image extraction using pose estimation for re-identification, (3) re-identification using hierarchical clustering with average linkage, and (4) low-identifiability tracklets ***. Our RIIPS team achieved the highest Higher Order Tracking Accuracy (HOTA) of 71.9446% in the 2024 AI City Challenge Track 1.
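Step (3) admits a compact sketch: agglomerate tracklet re-ID embeddings across cameras with average-linkage hierarchical clustering, cutting the dendrogram at a distance threshold. SciPy is used below; the cosine metric and the threshold value are assumptions, not the paper's tuned settings:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist

def cluster_tracklets(embeddings, threshold=0.35):
    # embeddings: (N, D), one re-ID feature per tracklet (e.g. averaged over
    # its representative images). Returns a global person id per tracklet.
    d = pdist(embeddings, metric="cosine")      # condensed distance matrix
    z = linkage(d, method="average")            # average-linkage dendrogram
    return fcluster(z, t=threshold, criterion="distance")

ids = cluster_tracklets(np.random.rand(10, 128))  # toy input
```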
ISBN (Print): 9781665445092
Reconstructing spectral signals from multi-channel observations, in particular trichromatic RGBs, has recently emerged as a promising alternative to traditional scanning-based spectral imagers. It has been proven that the reconstruction accuracy relies heavily on the spectral response of the RGB camera in use. To improve accuracy, data-driven algorithms have been proposed to retrieve the best response curves of existing RGB cameras, or even to design brand-new three-channel response curves. Instead, this paper explores the filter-array-based color imaging mechanism of existing RGB cameras and proposes to design the IR-cut filter properly for improved spectral recovery, which stands out as an in-between solution with a better trade-off between reconstruction accuracy and implementation complexity. We further propose a deep learning based spectral reconstruction method, which allows recovering the illumination spectrum as well. Experimental results on both synthetic and real images under daylight illumination show the benefits of our IR-cut filter tuning method and our illumination-aware spectral reconstruction method.
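The imaging model underlying the filter design can be written in a few lines: each RGB value is the scene spectrum integrated against the camera's per-channel response curves, all modulated by the single shared IR-cut filter transmittance. The curves below are random stand-ins, not measured data:

```python
import numpy as np

wl = np.linspace(400, 700, 31)           # wavelength samples (nm)
S = np.random.rand(3, 31)                # per-channel sensor response (R, G, B)
f = np.clip(1.0 - np.exp((wl - 650) / 20), 0, 1)  # toy IR-cut transmittance
spectrum = np.random.rand(31)            # scene radiance spectrum

# rgb_c = sum over lambda of S_c(lambda) * f(lambda) * L(lambda): one shared
# filter shapes all three effective responses, which is the design knob here.
rgb = (S * f) @ spectrum
```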
ISBN (Print): 9781665448994
Mobile and embedded platforms are increasingly required to efficiently execute computationally demanding DNNs across heterogeneous processing elements. At runtime, the hardware resources available to DNNs can vary considerably due to other concurrently running applications. The performance requirements of the applications could also change under different scenarios. To achieve the desired performance, dynamic DNNs have been proposed, in which the number of channels/layers can be scaled in real time to meet different requirements under varying resource constraints. However, the training process of such dynamic DNNs can be costly, since platform-aware models for different deployment scenarios must be retrained to become dynamic. This paper proposes Dynamic-OFA, a novel dynamic DNN approach for state-of-the-art platform-aware NAS models (i.e., the Once-for-All (OFA) network). Dynamic-OFA pre-samples a family of sub-networks from a static OFA backbone model and contains a runtime manager that chooses different sub-networks under different runtime environments. As such, Dynamic-OFA does not need the traditional dynamic DNN training pipeline. Compared to the state-of-the-art, our experimental results using ImageNet on a Jetson Xavier NX show that the approach is up to 3.5x (CPU) and 2.4x (GPU) faster for similar Top-1 accuracy, or achieves 3.8% (CPU) and 5.1% (GPU) higher accuracy at similar latency.
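The runtime-manager idea reduces to a lookup: profile the pre-sampled sub-networks once, then at runtime select the most accurate one whose measured latency fits the current budget. A minimal sketch with illustrative (not reported) numbers:

```python
from dataclasses import dataclass

@dataclass
class SubNet:
    name: str
    top1: float        # profiled accuracy of this pre-sampled sub-network
    latency_ms: float  # latency profiled on the target device

def pick_subnet(table, budget_ms):
    # Most accurate sub-network within budget; fall back to the fastest one.
    ok = [s for s in table if s.latency_ms <= budget_ms]
    return max(ok, key=lambda s: s.top1) if ok else min(table, key=lambda s: s.latency_ms)

table = [SubNet("small", 74.1, 12.0), SubNet("medium", 77.3, 21.0), SubNet("large", 79.8, 40.0)]
print(pick_subnet(table, budget_ms=25.0).name)  # -> "medium"
```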
ISBN (Print): 9781665445092
Contrastive learning methods for unsupervised visual representation learning have reached remarkable levels of transfer performance. We argue that the power of contrastive learning has yet to be fully unleashed, as current methods are trained only on instance-level pretext tasks, leading to representations that may be sub-optimal for downstream tasks requiring dense pixel predictions. In this paper, we introduce pixel-level pretext tasks for learning dense feature representations. The first task directly applies contrastive learning at the pixel level. We additionally propose a pixel-to-propagation consistency task that produces better results, even surpassing the state-of-the-art approaches by a large margin. Specifically, it achieves 60.2 AP, 41.4 / 40.5 mAP and 77.2 mIoU when transferred to Pascal VOC object detection (C4), COCO object detection (FPN / C4) and Cityscapes semantic segmentation using a ResNet-50 backbone network, which are 2.6 AP, 0.8 / 1.0 mAP and 1.0 mIoU better than the previous best methods built on instance-level contrastive learning. Moreover, the pixel-level pretext tasks are found to be effective for pretraining not only regular backbone networks but also head networks used for dense downstream tasks, and are complementary to instance-level contrastive methods. These results demonstrate the strong potential of defining pretext tasks at the pixel level, and suggest a new path forward in unsupervised visual representation learning. Code is available at https://***/zdaxie/PixPro.
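The first pretext task can be sketched compactly: dense features of two augmented views are matched pixel-to-pixel, with the same spatial location as the positive and all other pixels as negatives. For brevity, the sketch assumes the two views are spatially aligned (the actual method warps coordinates through the crop geometry):

```python
import torch
import torch.nn.functional as F

def pixel_contrast_loss(f1, f2, tau=0.3):
    # f1, f2: (B, C, H, W) dense features of two views of the same image.
    B, C, H, W = f1.shape
    a = F.normalize(f1.flatten(2).transpose(1, 2), dim=-1)  # (B, HW, C)
    b = F.normalize(f2.flatten(2).transpose(1, 2), dim=-1)
    logits = a @ b.transpose(1, 2) / tau                    # (B, HW, HW)
    # The matching pixel index is the positive; all others are negatives.
    target = torch.arange(H * W).expand(B, -1)
    return F.cross_entropy(logits.reshape(-1, H * W), target.reshape(-1))
```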
ISBN (Print): 9781665445092
Superpixels are generated by automatically clustering the pixels of an image into hundreds of compact partitions, and they are widely used to perceive object contours thanks to their excellent contour adherence. Although some works use Convolutional Neural Networks (CNNs) to generate high-quality superpixels, we challenge the design principles of these networks, specifically their dependence on manual labels and excess computational resources, which limits their flexibility compared with traditional unsupervised segmentation methods. We redefine CNN-based superpixel segmentation as a lifelong clustering task and propose an unsupervised CNN-based method called LNS-Net. LNS-Net can learn superpixels in a non-iterative and lifelong manner without any manual labels. Specifically, a lightweight feature embedder is proposed for LNS-Net to efficiently generate cluster-friendly features. With those features, seed nodes can be automatically assigned to cluster pixels in a non-iterative way. Additionally, our LNS-Net can adapt to sequential lifelong learning by rescaling the weight gradients based on both channel and spatial context to avoid overfitting. Experiments show that the proposed LNS-Net achieves significantly better performance on three benchmarks with nearly ten times lower complexity than other state-of-the-art methods.
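The gradient-rescaling component can be approximated with a parameter hook that damps updates to channels deemed important for earlier images. The importance statistic below (a running mean of squared gradients) is our stand-in, not the paper's exact channel/spatial rule:

```python
import torch

def attach_rescaler(conv, beta=0.9):
    # Track a per-output-channel importance estimate across sequential images.
    state = {"imp": torch.zeros(conv.weight.shape[0])}
    def hook(grad):
        g = grad.detach().pow(2).mean(dim=(1, 2, 3))   # channel gradient energy
        state["imp"] = beta * state["imp"] + (1 - beta) * g
        scale = 1.0 / (1.0 + state["imp"])             # shrink important channels
        return grad * scale.view(-1, 1, 1, 1)
    conv.weight.register_hook(hook)

conv = torch.nn.Conv2d(3, 8, 3)
attach_rescaler(conv)
conv(torch.randn(1, 3, 16, 16)).sum().backward()       # hook rescales the grad
```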
ISBN (Print): 9781665445092
Generative adversarial networks (GANs) have become one of the most important neural network models for classical unsupervised machine learning. A variety of discriminator loss functions have been developed to train GAN discriminators, and they all share a common structure: a sum of real and fake losses that depend only on the real and generated data, respectively. One challenge with an equally weighted sum of two losses is that training may benefit one loss while harming the other, which we show causes instability and mode collapse. In this paper, we introduce a new family of discriminator loss functions that adopt a weighted sum of the real and fake parts, which we call adaptive weighted loss functions, or aw-loss functions. Using the gradients of the real and fake parts of the loss, we can adaptively choose weights to train the discriminator in a direction that benefits the GAN's stability. Our method can potentially be applied to any discriminator model whose loss is a sum of real and fake parts. Experiments validate the effectiveness of our loss functions on unconditional and conditional image generation tasks, improving on the baseline results by a significant margin on the CIFAR-10, STL-10, and CIFAR-100 datasets in Inception Score (IS) and Fréchet Inception Distance (FID) metrics.
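A sketch of the adaptive-weighting idea: compute the discriminator gradients of the real and fake parts separately, then weight the parts so that one does not overwhelm the other. The specific rule below (equal weights when the gradients agree, favoring the smaller-gradient part when they conflict) is a simplified stand-in for the paper's scheme:

```python
import torch

def aw_loss(loss_real, loss_fake, disc):
    params = [p for p in disc.parameters() if p.requires_grad]
    g_r = torch.autograd.grad(loss_real, params, retain_graph=True)
    g_f = torch.autograd.grad(loss_fake, params, retain_graph=True)
    dot = sum((a * b).sum() for a, b in zip(g_r, g_f))      # gradient agreement
    nr = torch.sqrt(sum((a * a).sum() for a in g_r))
    nf = torch.sqrt(sum((b * b).sum() for b in g_f))
    if dot >= 0:                   # gradients agree: keep the equal weighting
        w_r = w_f = 0.5
    else:                          # gradients conflict: damp the dominant part
        w_r, w_f = nf / (nr + nf), nr / (nr + nf)
    return w_r * loss_real + w_f * loss_fake  # backward() as usual on this
```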