ISBN: 9781665445092 (print)
Existing explainable and explicit visual reasoning methods only perform reasoning based on visual evidence but do not take into account knowledge beyond what is in the visual scene. To address the knowledge gap between visual reasoning methods and the semantic complexity of real-world images, we present the first explicit visual reasoning method that incorporates external knowledge and models high-order relational attention for improved generalizability and explainability. Specifically, we propose a knowledge incorporation network that explicitly creates and includes new graph nodes for entities and predicates from external knowledge bases to enrich the semantics of the scene graph used in explicit reasoning. We then create a novel Graph-Relate module to perform high-order relational attention on the enriched scene graph. By explicitly introducing structured external knowledge and high-order relational attention, our method demonstrates significantly better generalizability and explainability than state-of-the-art visual reasoning approaches on the GQA and VQAv2 datasets.
Face detection and face recognition are the most important and frequently used operations in the image processing and computer vision domains. There are many applications for face recognition, but some of them require acc...
ISBN: 9781665448994 (print)
Most team sports such as hockey involve periods of active play interleaved with breaks in play. When watching a game remotely, many fans would prefer an abbreviated game showing only periods of active play. Here we address the problem of identifying these periods in order to produce a time-compressed viewing experience. Our approach is based on a hidden Markov model of play state driven by deep visual and optional auditory cues. We find that our deep visual cues generalize well across different cameras and that auditory cues can improve performance but only if unsupervised methods are used to adapt emission distributions to domain shift across games. Our system achieves temporal compression rates of 20-50% at a recall of 96%.
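The abstract does not give the model's parameters, so the following is only a minimal sketch of how a two-state hidden Markov model can smooth noisy per-frame cues into play and break segments. The transition probabilities and the per-frame play probabilities (standing in for the deep visual cues) are made-up values, and Viterbi decoding is one common inference choice, not necessarily the one the authors use.

```python
import math

def viterbi(log_emissions, log_trans, log_prior):
    """Return the most likely state sequence for a discrete-state HMM.

    log_emissions[t][s] is the log-likelihood of frame t under state s.
    """
    n_states = len(log_prior)
    # Best log-score of any path ending in each state at the current frame.
    score = [log_prior[s] + log_emissions[0][s] for s in range(n_states)]
    backptrs = []
    for frame in log_emissions[1:]:
        new_score, ptrs = [], []
        for s in range(n_states):
            # Choose the predecessor state that maximizes the path score.
            best_prev = max(range(n_states),
                            key=lambda p: score[p] + log_trans[p][s])
            ptrs.append(best_prev)
            new_score.append(score[best_prev] + log_trans[best_prev][s] + frame[s])
        score = new_score
        backptrs.append(ptrs)
    # Trace back from the best final state.
    state = max(range(n_states), key=lambda s: score[s])
    path = [state]
    for ptrs in reversed(backptrs):
        state = ptrs[state]
        path.append(state)
    path.reverse()
    return path

# Hypothetical setup: state 0 = break in play, state 1 = active play.
BREAK, PLAY = 0, 1
log_prior = [math.log(0.5), math.log(0.5)]
# Assumed "sticky" transitions: the play state tends to persist.
log_trans = [[math.log(0.8), math.log(0.2)],
             [math.log(0.2), math.log(0.8)]]
# Made-up per-frame probabilities that the frame shows active play.
p_play = [0.9, 0.8, 0.4, 0.9, 0.1, 0.2, 0.1]
log_emissions = [[math.log(1 - p), math.log(p)] for p in p_play]

states = viterbi(log_emissions, log_trans, log_prior)
kept_frames = [t for t, s in enumerate(states) if s == PLAY]
print(states)       # [1, 1, 1, 1, 0, 0, 0]
print(kept_frames)  # [0, 1, 2, 3]
```

Frame 2 has a play probability below 0.5 but is still decoded as active play, because the transition model penalizes a brief switch to the break state. Keeping only the frames in `kept_frames` yields the time-compressed video.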
Inner speech recognition is a modern advancement in brain-computer interfaces (BCI) that facilitates direct communication between the computer and the brain. It is particularly beneficial for individuals wh...
ISBN: 9781665448994 (print)
Multistage, or serial, fusion refers to algorithms that sequentially fuse an increasing number of matching results at each step and decide whether to accept the match hypothesis, reject it, or go to the next step. Such fusion methods are beneficial in situations where running the additional matching algorithms needed for later stages is time consuming or expensive. The construction of multistage fusion methods is challenging, since it requires both learning fusion functions and finding optimal decision thresholds for each stage. In this paper, we propose the use of a single neural network for learning the multistage fusion. In addition, we discuss the choices for the performance measurements of the trained algorithms and for the selection of network training optimization criteria. We perform the experiments using three face matching algorithms and the IJB-A and IJB-C databases.
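The accept, reject, or continue logic described above can be sketched as a short decision loop. This is an illustration only: the mean-score fusion, the hand-set thresholds, and the matcher names `fast`, `medium`, and `slow` are assumptions, whereas the paper learns the fusion with a single neural network.

```python
def multistage_decision(matchers, fuse, thresholds, sample):
    """Run matchers one at a time, stopping as soon as a decision is reached.

    thresholds[i] = (reject_below, accept_above) for stage i. Setting both
    values equal in the last stage guarantees a final decision.
    Returns (decision, number_of_matchers_run).
    """
    scores = []
    for stage, matcher in enumerate(matchers):
        scores.append(matcher(sample))  # pay for one more matcher
        fused = fuse(scores)            # fuse all scores seen so far
        reject_below, accept_above = thresholds[stage]
        if fused >= accept_above:
            return "accept", stage + 1
        if fused < reject_below:
            return "reject", stage + 1
    raise ValueError("final stage thresholds left the decision open")

def mean_fusion(scores):
    # Assumed fusion rule for this sketch: average of the available scores.
    return sum(scores) / len(scores)

# Hypothetical matchers, ordered from cheapest to most expensive. Each one
# just looks up a precomputed similarity score in the sample dictionary.
matchers = [lambda s: s["fast"], lambda s: s["medium"], lambda s: s["slow"]]
# Hand-picked thresholds; the uncertain band narrows at each stage.
thresholds = [(0.2, 0.9), (0.3, 0.8), (0.5, 0.5)]

easy_match = {"fast": 0.95, "medium": 0.9, "slow": 0.9}
hard_case = {"fast": 0.5, "medium": 0.6, "slow": 0.1}
print(multistage_decision(matchers, mean_fusion, thresholds, easy_match))
print(multistage_decision(matchers, mean_fusion, thresholds, hard_case))
```

The easy sample is accepted after one matcher, while the ambiguous one needs all three before it is rejected. The cost saving comes from the fact that only uncertain comparisons pay for the later, more expensive matchers.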
ISBN: 9781665445092 (print)
In this paper, we present a decomposition model for stereo matching to solve the problem of excessive growth in computational cost (time and memory cost) as the resolution increases. In order to reduce the huge cost of stereo matching at the original resolution, our model only runs dense matching at a very low resolution and uses sparse matching at different higher resolutions to recover the disparity of lost details scale-by-scale. After the decomposition of stereo matching, our model iteratively fuses the sparse and dense disparity maps from adjacent scales with an occlusion-aware mask. A refinement network is also applied to improve the fusion result. Compared with high-performance methods like PSMNet and GANet, our method achieves a 10-100x speed increase while obtaining comparable disparity estimation results.
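One fusion step between adjacent scales might look like the sketch below. It rests on several assumptions not stated in the abstract: nearest-neighbour upsampling, a scale factor of 2, and a per-pixel mask in [0, 1] that blends the two maps linearly. The paper's learned mask and refinement network are not reproduced here.

```python
def upsample_disparity(disp, factor=2):
    """Nearest-neighbour upsampling of a disparity map (list of rows).

    Disparity is measured in pixels, so its values are multiplied by the
    same factor as the image width.
    """
    out = []
    for row in disp:
        wide = [d * factor for d in row for _ in range(factor)]
        out.extend(list(wide) for _ in range(factor))
    return out

def fuse_disparity(dense_up, sparse, mask):
    """Blend per pixel: mask=1 trusts the sparse match, mask=0 the dense map."""
    return [[m * s + (1 - m) * d for d, s, m in zip(d_row, s_row, m_row)]
            for d_row, s_row, m_row in zip(dense_up, sparse, mask)]

# Made-up 1x2 dense disparity map from the low-resolution stage.
coarse = [[1.0, 2.0]]
dense_up = upsample_disparity(coarse)  # 2x4 map with doubled disparities

# Made-up sparse matches at the finer scale; pixels without a reliable
# match (for example, occluded ones) get mask 0 and fall back to dense_up.
sparse = [[2.5, 0.0, 0.0, 4.5],
          [0.0, 0.0, 0.0, 0.0]]
mask = [[1.0, 0.0, 0.0, 1.0],
        [0.0, 0.0, 0.0, 0.0]]

fused = fuse_disparity(dense_up, sparse, mask)
print(fused)  # [[2.5, 2.0, 4.0, 4.5], [2.0, 2.0, 4.0, 4.0]]
```

Repeating this step scale by scale means dense matching is only ever run at the lowest resolution, which is where the claimed cost reduction comes from.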
ISBN: 9798350374490 (digital)
ISBN: 9798350374490; 9798350374506 (print)
This work addresses the challenge of data scarcity in personality-labeled datasets by introducing personality labels to clips from two open datasets, ZeroEGGS and Bandai, which provide diverse full-body animations. To this end, we present a user study to annotate short clips from both sets with labels based on the Five-Factor Model (FFM) of personality. We chose features informed by Laban Movement Analysis (LMA) to represent each animation. These features then guided us to select samples of distinct motion styles to be included in the user study, obtaining high personality variance while keeping the study duration and cost viable. Using the labeled data, we then ran a correlation analysis to find features that correlate highly with each personality dimension. Our regression analysis results indicate that highly correlated features are promising for accurate personality estimation. We share our early findings, code, and data publicly.
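The correlation step can be illustrated with a small, self-contained sketch. The feature names (`mean_speed`, `hand_height`), the ratings, and the use of Pearson correlation are all assumptions made for illustration; the abstract does not specify the features or the correlation measure.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)

# Made-up ratings for one personality dimension across five clips.
extraversion = [1.0, 2.0, 3.0, 4.0, 5.0]
# Made-up motion features per clip; the names are hypothetical.
features = {
    "mean_speed": [0.2, 0.4, 0.5, 0.9, 1.0],
    "hand_height": [1.0, 0.9, 1.1, 1.0, 1.0],
}

correlations = {name: pearson(values, extraversion)
                for name, values in features.items()}
# Rank features by absolute correlation, strongest first.
ranked = sorted(correlations, key=lambda name: abs(correlations[name]),
                reverse=True)
for name in ranked:
    print(f"{name}: r = {correlations[name]:.3f}")
```

Features near the top of `ranked` would be the natural candidates to feed into a regression model for that personality dimension.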
ISBN: 9781665445092 (print)
A video can be represented by the composition of appearance and motion. Appearance (or content) expresses the information invariant throughout time, and motion describes the time-variant movement. Here, we propose self-supervised approaches for video Generative Adversarial Networks (GANs) to achieve appearance consistency and motion coherency in videos. Specifically, the dual discriminators for image and video individually learn to solve their own pretext tasks: appearance contrastive learning and a temporal structure puzzle. The proposed tasks enable the discriminators to learn representations of appearance and temporal context, and force the generator to synthesize videos with consistent appearance and a natural flow of motion. Extensive experiments on public facial expression and human action benchmarks show that our method outperforms state-of-the-art video GANs. Moreover, consistent improvements regardless of the architecture of the video GAN confirm that our framework is generic.
Facial recognition systems are incredibly important in today's digital world, spanning various industries. They have several benefits and play a crucial part in identifying individuals, authentication, and security....
ISBN: 9798350344868; 9798350344851 (print)
Text recognition in information loss scenarios like blurriness, occlusion, and perspective distortion is challenging in real-world applications. To enhance robustness, some studies use extra unlabeled data for encoder pretraining. Others focus on improving decoder context reasoning. However, pretraining methods require abundant unlabeled data and high computing resources, while decoder-based approaches risk over-correction. In this paper, we propose MaskSTR, a dual-branch training framework for STR models, using patch masking to simulate information loss. MaskSTR guides visual representation learning, improving robustness to information loss conditions without extra data or training stages. Furthermore, we introduce Block Masking, a novel and straightforward mask generation method, for further performance enhancement. Experiments demonstrate MaskSTR's effectiveness across CTC, attention, and Transformer decoding methods, achieving significant performance gains and setting new state-of-the-art results.
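The abstract does not describe how MaskSTR or Block Masking generates its masks, so the sketch below only illustrates the general idea of patch masking on a grid of image patches. Both functions and all of their parameters are hypothetical: one masks patches scattered at random, the other masks a single contiguous rectangle, which is one plausible reading of block-style masking and not the paper's definition.

```python
import random

def random_patch_mask(rows, cols, ratio, rng):
    """Mask a fixed fraction of patches chosen independently at random."""
    n_mask = round(rows * cols * ratio)
    chosen = set(rng.sample(range(rows * cols), n_mask))
    return [[(r * cols + c) in chosen for c in range(cols)]
            for r in range(rows)]

def contiguous_block_mask(rows, cols, block_h, block_w, rng):
    """Mask one randomly placed rectangle of block_h x block_w patches."""
    top = rng.randrange(rows - block_h + 1)
    left = rng.randrange(cols - block_w + 1)
    return [[top <= r < top + block_h and left <= c < left + block_w
             for c in range(cols)]
            for r in range(rows)]

def apply_mask(patches, mask, fill=0.0):
    """Replace masked patches with a fill value to simulate lost content."""
    return [[fill if masked else patch
             for patch, masked in zip(patch_row, mask_row)]
            for patch_row, mask_row in zip(patches, mask)]

rng = random.Random(0)  # fixed seed so the sketch is repeatable
rows, cols = 4, 8       # assumed patch grid for a wide text-line image
# Stand-in patch values; a real model would hold a pixel block or embedding.
patches = [[1.0] * cols for _ in range(rows)]

scattered = random_patch_mask(rows, cols, ratio=0.25, rng=rng)
block = contiguous_block_mask(rows, cols, block_h=2, block_w=3, rng=rng)
masked_input = apply_mask(patches, block)

for mask_row in block:
    print("".join("#" if masked else "." for masked in mask_row))
```

In a dual-branch setup, the masked input would feed one branch and the unmasked image the other, but how the two branches interact is not specified in the abstract.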