There is a scarcity of multilingual vision-language models that properly account for the perceptual differences that are reflected in image captions across languages and cultures. In this work, through a multimodal, m...
The paper gives a statement and considers the solution of an urgent scientific problem of formation control for a group of unmanned aerial vehicles (UAVs) operating in an unstable environment. To construct the referen...
Human evaluation remains the gold standard for assessing abstractive summarization. However, current practices often prioritize constructing evaluation guidelines for fluency, coherence, and factual accuracy, overlook...
Factors are a foundational component of legal analysis and computational models of legal reasoning. These factor-based representations enable lawyers, judges, and AI and Law researchers to reason about legal cases. In...
The problem of optimizing the load on an operator of unmanned aerial vehicles (UAVs), who performs real-time tasks of researching and monitoring territories in an unstable environment, is considered. Working load dep...
The paper gives a statement and considers the solution to an urgent problem of flying over the given targets by an unmanned aerial vehicle (UAV) in unstable conditions. A criterion is formulated for constructing effic...
ISBN (digital): 9798350365474
ISBN (print): 9798350365481
An important task for intelligent systems is affordance grounding, where the goal is to locate regions on an object where an action can be performed. Past weakly supervised approaches learn from human-object interaction (HOI) by transferring grounding knowledge from exocentric to egocentric views of an object. The use of HOI priors is inherently noisy and thus provides a limited source of supervision. To address this challenge, we identify that recent foundational models (i.e., VLMs and LLMs) can serve as auxiliary sources of knowledge for frameworks due to their vast world knowledge. In this work, we propose strategies to extract and leverage foundational model knowledge related to attributes and object parts to enhance an HOI-based affordance grounding framework. In particular, we propose to combine HOI and foundational model priors through (1) a spatial consistency loss and (2) heatmap aggregation. Our strategies result in mKLD and mNSS improvements, and insights suggest future directions for improving affordance grounding capabilities.
Vision-language alignment learned from image-caption pairs has been shown to benefit tasks like object recognition and detection. Methods are mostly evaluated in terms of how well object class names are learned, but captions also contain rich attribute context that should be considered when learning object alignment. It is unclear how methods use this context in learning, as well as whether models succeed when tasks require attribute and object understanding. To address this gap, we conduct extensive analysis of the role of attributes in vision-language models. We specifically measure model sensitivity to the presence and meaning of attribute context, gauging influence on object embeddings through unsupervised phrase grounding and classification via description methods. We further evaluate the utility of attribute context in training for open-vocabulary object detection, fine-grained text-region retrieval, and attribution tasks. Our results show that attribute context can be wasted when learning alignment for detection, attribute meaning is not adequately considered in embeddings, and describing classes by only their attributes is ineffective. A viable strategy that we find to increase benefits from attributes is contrastive training with adjective-based negative captions.
Algorithms for text-generation in dialogue can be misguided. For example, in task-oriented settings, reinforcement learning that optimizes only task-success can lead to abysmal lexical diversity. We hypothesize this is d...
Speech content is closely related to the stability of speaker embeddings in speaker verification tasks. In this paper, we propose a novel architecture based on self-constraint learning (SCL) and reconstruction task (R...