Recent transformer-based methods achieve notable gains in the Human-Object Interaction Detection (HOID) task by leveraging the detection of DETR and the prior knowledge of Vision-Language Models (VLMs). However, these m...
The need for more transparent face recognition (FR), along with other visual-based decision-making systems, has recently attracted more attention in research, society, and industry. The reasons why two face images are ...
The remarkable progress in 3D face reconstruction has resulted in high-detail and photorealistic facial representations. Recently, Diffusion Models have revolutionized the capabilities of generative methods by surpass...
We propose a novel teacher-student framework to distill knowledge from multiple teachers trained on distinct datasets. Each teacher is first trained from scratch on its own dataset. Then, the teachers are combined int...
As Vision-Language Models (VLMs) advance, human-centered Assistive Technologies (ATs) for helping People with Visual Impairments (PVIs) are evolving into generalists, capable of performing multiple tasks simultaneousl...
Large-scale pre-trained Vision-Language Models (VLMs) have exhibited impressive zero-shot performance and transferability, allowing them to adapt to downstream tasks in a data-efficient manner. However, when only a fe...
Vision transformers have been applied successfully to image recognition tasks. These have been either multi-headed self-attention based (ViT [12], DeiT [54]), similar to the original work in textual models, or more re...
ISBN: 9798331536626 (print)
The proceedings contain 166 papers. The topics discussed include: applying computer vision to analyze self-injurious behaviors in children with autism spectrum disorder; underwater image enhancement and object detection: are poor object detection results on enhanced images due to missing human labels?; enhancing weakly-supervised object detection on static images through (hallucinated) motion; a zero-shot learning approach for ephemeral gully detection from remote sensing using vision language models; Attrivision: advancing generalization in pedestrian attribute recognition using CLIP; human gaze improves vision transformers by token masking; SSTAR: skeleton-based spatio-temporal action recognition for intelligent video surveillance and suicide prevention in metro stations; and offline signature verification in the banking domain.
Visual grounding tasks aim to localize image regions based on natural language references. In this work, we explore whether generative VLMs predominantly trained on image-text data could be leveraged to scale up the ...
Medical Image Foundation Models have proven to be powerful tools for mask prediction across various datasets. However, accurately assessing the uncertainty of their predictions remains a significant challenge. To addr...