ISBN (digital): 9781665487399
ISBN (print): 9781665487399
Tracking by natural language specification in a video is a challenging task in computer vision. Unlike initializing the target state only from the bounding box in the first frame, a language specification has strong potential to help visual object trackers capture appearance variation and eliminate semantic ambiguity about the tracked object. In this paper, we design a unified local-global-search framework from the perspective of cross-modal retrieval, comprising a local tracker, an adaptive retrieval switch module, and a target-specific retrieval module. The adaptive retrieval switch module aligns the semantics of the visual signal and the linguistic description of the target using three sub-modules, i.e., object-aware attention memory, part-aware cross-attention, and vision-language contrast, which together switch automatically between local search and global search. When the global search mechanism is activated, the target-specific retrieval module relocalizes the missing target across the whole image via an efficient vision-language guided proposal selector and target-text matching. Extensive experiments on three prevailing benchmarks demonstrate the effectiveness and generalization of our framework.
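The switch between local and global search can be illustrated with a minimal sketch. It is not the paper's implementation: it assumes that the tracked region, the text description, and each candidate proposal have already been embedded into a shared vector space, and it replaces the three learned sub-modules with a plain cosine-similarity threshold. The function names, the embeddings, and the threshold value are all hypothetical.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine similarity between two 1-D vectors."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

def choose_search_mode(target_embedding, text_embedding, threshold=0.5):
    """Stay in local search while the tracked region still matches the text.

    The threshold is an assumed hyperparameter, not a value from the paper.
    """
    score = cosine_similarity(target_embedding, text_embedding)
    return "local" if score >= threshold else "global"

def relocalize(proposal_embeddings, text_embedding):
    """Global search: return the index of the proposal that best matches the text."""
    scores = [cosine_similarity(p, text_embedding) for p in proposal_embeddings]
    return int(np.argmax(scores))
```

In this reading, `choose_search_mode` runs on every frame and `relocalize` is called only when it returns "global", which keeps the image-wide search off the common path.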
Video frame interpolation involves the synthesis of new frames from existing ones. Convolutional neural networks (CNNs) have been at the forefront of the recent advances in this field. One popular CNN-based approach applies generated kernels to the input frames to obtain an interpolated frame. Despite the benefits these interpolation methods offer, many of the networks require a large number of parameters, and more parameters mean a heavier computational burden. Reducing the size of the model typically impacts performance negatively. This paper presents a method for parameter reduction for a popular flow-less kernel-based network (Adaptive Collaboration of Flows). By removing the layers that require the most parameters and replacing them with smaller encoders, we reduce the number of parameters of the network and even achieve better performance than the original method. This is achieved by applying rotation to force each individual encoder to learn different features from the input images. Ablations are conducted to justify design choices, and an evaluation of how our method performs on full-length videos is presented.
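The rotation idea can be sketched as follows. The abstract does not state which rotations are used, so this example assumes multiples of 90 degrees, one per encoder; the helper names and the number of encoders are hypothetical, and the encoders themselves are omitted.

```python
import numpy as np

def rotated_views(frame, num_encoders=4):
    """Build one input view per encoder.

    Assumption: encoder k receives the frame rotated by k * 90 degrees,
    so no two encoders see the same orientation of the input.
    """
    return [np.rot90(frame, k=k, axes=(0, 1)) for k in range(num_encoders)]

def undo_rotation(feature_map, k):
    """Rotate an encoder output back so all feature maps are spatially aligned."""
    return np.rot90(feature_map, k=-k, axes=(0, 1))
```

Rotating the outputs back before fusing them keeps the features aligned with the original pixel grid, which a kernel-based interpolation network would presumably need.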
Pass localization and team identification are the two primary tasks in generating pass-count-based possession statistics for a soccer match. While existing works perform these two tasks separately, we propose dual interacting reinforcement learning agents that perform them jointly. The proposed model has a localization agent that decides in which direction to move a temporal window to localize a pass, and an identification agent that decides whether the temporal window contains a pass for team A (or team B) or whether the localization agent needs to readjust the temporal window further. In this multi-agent setup, an agent may communicate by sharing a message that guides the other agent toward its task. To achieve this inter-agent communication, we extend the Dueling DQN architecture and share the value of a state as a message to the other agent. The two agents watch, act independently, and cooperate with each other to detect a valid pass in a soccer video. A novel reward function is proposed that helps the agents learn the optimal policy. Experiments performed on online videos show that our method is 3% better at pass localization than competing methods.
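For readers unfamiliar with Dueling DQN, the sketch below shows the standard aggregation of a state value and per-action advantages into Q-values, and how the state value could double as the message passed to the other agent. Only the aggregation formula is the standard Dueling DQN one; `agent_step` and the way the message is returned are hypothetical illustrations, and the networks that would produce the value and advantages are omitted.

```python
import numpy as np

def dueling_q_values(state_value, advantages):
    """Standard dueling aggregation: Q(s, a) = V(s) + A(s, a) - mean_a A(s, a)."""
    advantages = np.asarray(advantages, dtype=float)
    return state_value + advantages - advantages.mean()

def agent_step(state_value, advantages):
    """Pick the greedy action and expose V(s) as the message for the other agent.

    Assumption: the message is the raw scalar state value; how the receiving
    agent consumes it is not modelled here.
    """
    q_values = dueling_q_values(state_value, advantages)
    action = int(np.argmax(q_values))
    message = float(state_value)
    return action, message
```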
Facial expression recognition plays an important role in human-computer interaction. In this paper, we propose the Coarse-to-Fine Cascaded network with Smooth Predicting (CFC-SP) to improve the performance of facial expression recognition. CFC-SP contains two core components, namely Coarse-to-Fine Cascaded networks (CFC) and Smooth Predicting (SP). CFC first groups several similar emotions to form a rough category and then employs a network to conduct a coarse but accurate classification. An additional network for these grouped emotions is then used to obtain fine-grained predictions. SP improves the recognition capability of the model by capturing both universal and unique expression features. Specifically, the universal features denote the general characteristics of facial emotions within a period, and the unique features denote the specific characteristics at the current moment. Experiments on Aff-Wild2 show the effectiveness of the proposed CFC-SP. We achieved 3rd place in the Expression Classification Challenge of the 3rd Competition on Affective Behavior Analysis in-the-wild. The code will be released at https://***/BR-IDL/PaddleViT.
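One way to read the coarse-to-fine cascade is as a two-level probability model: the coarse network scores the rough categories, and the fine network distributes a group's probability over its member emotions. The sketch below follows that reading and is an assumption rather than the paper's procedure; the emotion grouping and label names in the usage example are invented for illustration.

```python
def coarse_to_fine(coarse_probs, fine_probs_by_group):
    """Combine coarse and fine predictions into per-emotion probabilities.

    coarse_probs: dict mapping each coarse label to its probability. A coarse
        label is either a single emotion or the name of a group of emotions.
    fine_probs_by_group: dict mapping a group name to a dict of probabilities
        over the emotions inside that group (conditional on the group).
    """
    final = {}
    for label, prob in coarse_probs.items():
        if label in fine_probs_by_group:
            # Split the group's probability among its member emotions.
            for fine_label, cond_prob in fine_probs_by_group[label].items():
                final[fine_label] = prob * cond_prob
        else:
            final[label] = prob
    return final
```

For example, with coarse probabilities `{"happy": 0.2, "negative": 0.8}` and fine probabilities `{"negative": {"anger": 0.25, "sadness": 0.75}}`, the highest-scoring emotion is "sadness".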
Computer-aided analyses of cells in Whole Slide Images (WSIs) have become an important topic in digital pathology. Despite the recent success of deep learning in biomedical research, these methods are still difficult to apply to multi-gigabyte WSIs. To overcome this difficulty, a variety of patch-based solutions have been introduced; however, all of them suffer from certain limitations compared to manual examination and often fail to meet the specific requirements of cytological inspection. Here we introduce an alternative scheme that incorporates clinical expertise in the selection process to automatically identify the clinically relevant areas. Using a bone marrow smear dataset containing 22-gigapixel images of 153 patients, we introduce a novel pipeline combining unsupervised and supervised methodologies to gradually select the most appropriate single-cell regions, which are subsequently used in multiple medically crucial Acute Myeloid Leukemia (AML) predictions. Our approach is capable of dealing with a variety of common WSI challenges, massively limits the manual annotation effort, reduces the data by up to 99.9%, and achieves super-human performance on the final cytological prediction tasks.
We present a method for augmenting photo-realistic 3D scene assets by automatically recognizing, matching, and swapping their materials. Our method proposes a material matching pipeline for the efficient replacement of unknown materials with perceptually similar PBR materials from a database, enabling the quick creation of many variations of a given 3D synthetic scene. At the heart of this method is a novel material similarity feature that is learnt, in conjunction with optimal lighting conditions, by fine-tuning a deep neural network on a material classification task using our proposed dataset. Our evaluation demonstrates that lighting optimization improves CNN-based texture feature extraction methods and better estimates material properties. We conduct a series of experiments showing our method's ability to augment photo-realistic indoor scenes using both standard and procedurally generated PBR materials.
To understand the genuine emotions expressed by humans during social interactions, it is necessary to recognize the subtle changes on the face (micro-expressions) demonstrated by an individual. Facial micro-expressions are brief, rapid, spontaneous gestures and involuntary facial muscle movements beneath the skin, which makes them challenging to classify. This paper presents a novel end-to-end three-stream graph attention network model that captures the subtle changes on the face and recognizes micro-expressions (MEs) by exploiting the relationship between optical flow magnitude, optical flow direction, and node location features. A facial graph representation is used to extract spatial and temporal information from three frames. Optical flow features with a dynamically varying patch size are used to extract the local texture information around each landmark point. The network utilizes only the landmark point location features and the optical flow information around these points, and it generates good results for the classification of MEs. A comprehensive evaluation on the SAMM and CASME II datasets demonstrates the high efficacy, efficiency, and generalizability of the proposed approach, which achieves better results than state-of-the-art methods.
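Two of the three streams use optical flow magnitude and direction. Given a flow field with horizontal and vertical components (u, v), these are the polar coordinates of each flow vector. The sketch below assumes the flow has already been estimated and sampled at the landmark points; the array layout and function name are hypothetical.

```python
import numpy as np

def flow_magnitude_direction(flow):
    """Convert flow vectors to polar form.

    flow: array of shape (..., 2) holding (u, v) displacements, for example
        one row per facial landmark (an assumed layout).
    Returns (magnitude, direction), with direction in radians in [-pi, pi].
    """
    flow = np.asarray(flow, dtype=float)
    u, v = flow[..., 0], flow[..., 1]
    magnitude = np.hypot(u, v)      # sqrt(u^2 + v^2)
    direction = np.arctan2(v, u)    # angle measured from the positive u-axis
    return magnitude, direction
```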
Building a system for robust classification and recognition of facial expressions has been one of the most important research issues for successful interactive computing applications. However, previous datasets and studies mainly focused on facial expression recognition in controlled lab settings and therefore generalize poorly to more practical, real-life environments. The Affective Behavior Analysis in-the-wild (ABAW) 2022 competition released a dataset consisting of various video clips of facial expressions in-the-wild. In this paper, we propose a method based on an ensemble of multi-head cross attention networks to address the facial expression classification task introduced in the ABAW 2022 competition. We built a uni-task approach for this task, achieving an average F1-score of 34.60 on the validation set and 33.77 on the test set, ranking second on the final leaderboard.
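The reported metric is an average F1-score. Assuming this means the unweighted (macro) average of per-class F1 over the expression classes, it can be computed as in the sketch below; the function name and the integer label encoding are illustrative choices, and a class with no true or predicted samples is scored as 0 here by convention.

```python
def macro_f1(y_true, y_pred, num_classes):
    """Unweighted mean of per-class F1 scores.

    Per-class F1 is 2*TP / (2*TP + FP + FN); classes that never occur in
    either y_true or y_pred contribute 0 (an assumed convention).
    """
    scores = []
    for c in range(num_classes):
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / num_classes
```

The reported values (34.60 and 33.77) appear to be this quantity expressed as a percentage.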
Training models to apply linguistic knowledge and visual concepts from 2D images to 3D world understanding is a promising direction that researchers have only recently started to explore. In this work, we design a nov...
For computers to recognize human emotions, expression classification is an important problem in the human-computer interaction area. In the 3rd Affective Behavior Analysis In-The-Wild competition, the expression classification task comprises eight classes, including the six basic expressions of human faces, in videos. In this paper, we employ a transformer mechanism to encode a robust representation from the backbone. Fusion of the robust representations plays an important role in the expression classification task. Our approach achieves F1 scores of 30.35% and 28.60% on the validation set and the test set, respectively. This result shows the effectiveness of the proposed architecture on the Aff-Wild2 dataset, and our team achieved 5th place in the expression classification task of the 3rd Affective Behavior Analysis In-The-Wild competition.