ISBN (Print): 9781665448994
Facial micro-expressions are brief, rapid, spontaneous gestures of the facial muscles that express an individual's genuine emotions. Because of their short duration and subtlety, detecting and classifying these micro-expressions is difficult for both humans and machines. In this paper, a novel approach is proposed that exploits the relationships between landmark points and the optical-flow patch around each landmark point. It consists of a two-stream graph attention convolutional network that extracts the relationships between the landmark points and the local texture captured by the optical-flow patches. A graph structure is built over a triplet of frames to extract temporal information. One stream processes node (landmark) locations, and the other processes the optical-flow patches; the two streams are fused for classification. Results are reported on the publicly available CASME II and SAMM datasets for both three-class and five-class micro-expression recognition, where the proposed approach outperforms state-of-the-art methods.
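To make the two-stream idea concrete, here is a minimal illustrative sketch (not the authors' code) of how per-landmark locations from a frame triplet and per-landmark optical-flow patches could be embedded, fused, and attended over before classification. All sizes and names (N_LANDMARKS, PATCH_DIM, TwoStreamLandmarkFlowNet) are assumptions, and dense self-attention stands in for a full graph attention layer.

```python
# Illustrative sketch (assumed sizes, not the authors' code): fuse landmark
# locations with optical-flow patch features and attend over the landmark graph.
import torch
import torch.nn as nn

N_LANDMARKS, PATCH_DIM, HIDDEN, N_CLASSES = 68, 49, 64, 3  # assumed sizes

class TwoStreamLandmarkFlowNet(nn.Module):
    def __init__(self):
        super().__init__()
        # Stream 1: (x, y) landmark locations from a triplet of frames -> 6 values per node.
        self.loc_embed = nn.Linear(2 * 3, HIDDEN)
        # Stream 2: flattened optical-flow patch around each landmark.
        self.flow_embed = nn.Linear(PATCH_DIM, HIDDEN)
        # Self-attention over landmark nodes approximates graph attention on a fully connected graph.
        self.attn = nn.MultiheadAttention(embed_dim=2 * HIDDEN, num_heads=4, batch_first=True)
        self.head = nn.Linear(2 * HIDDEN, N_CLASSES)

    def forward(self, loc, flow):
        # loc: (B, N_LANDMARKS, 6)   flow: (B, N_LANDMARKS, PATCH_DIM)
        x = torch.cat([self.loc_embed(loc), self.flow_embed(flow)], dim=-1)  # fuse the two streams
        x, _ = self.attn(x, x, x)           # nodes attend to each other
        return self.head(x.mean(dim=1))     # pool over landmarks, classify

model = TwoStreamLandmarkFlowNet()
logits = model(torch.randn(4, N_LANDMARKS, 6), torch.randn(4, N_LANDMARKS, PATCH_DIM))
print(logits.shape)  # torch.Size([4, 3])
```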
ISBN (Digital): 9798350353006
ISBN (Print): 9798350353013
Creating high-dynamic videos such as motion-rich actions and sophisticated visual effects poses a significant challenge in the field of artificial intelligence. Unfortunately, current state-of-the-art video generation methods, primarily focusing on text-to-video generation, tend to produce video clips with minimal motions despite maintaining high fidelity. We argue that relying solely on text instructions is insufficient and suboptimal for video generation. In this paper, we introduce PixelDance, a novel approach based on diffusion models that incorporates image instructions for both the first and last frames in conjunction with text instructions for video generation. Comprehensive experimental results demonstrate that PixelDance trained with public data exhibits significantly better proficiency in synthesizing videos with complex scenes and intricate motions, setting a new standard for video generation.
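As a rough illustration of the conditioning idea described above (an assumption about the mechanism, not PixelDance's released code), the sketch below places the first- and last-frame image instructions at the corresponding time steps of a conditioning tensor, adds a frame mask, and concatenates everything channel-wise with the noisy video latent; the text embedding would be injected separately by the denoiser.

```python
# Hedged sketch of the conditioning idea only; shapes and helper names are assumptions.
import torch

def assemble_inputs(noisy_latent, first_frame, last_frame):
    """noisy_latent: (B, C, T, H, W); first_frame, last_frame: (B, C, H, W)."""
    B, C, T, H, W = noisy_latent.shape
    cond = torch.zeros(B, C, T, H, W)
    cond[:, :, 0] = first_frame          # first-frame image instruction at t = 0
    cond[:, :, -1] = last_frame          # last-frame image instruction at t = T - 1
    mask = torch.zeros(B, 1, T, H, W)    # marks which frames carry an image instruction
    mask[:, :, 0], mask[:, :, -1] = 1.0, 1.0
    # The denoiser would receive this channel-wise concatenation plus a text embedding.
    return torch.cat([noisy_latent, cond, mask], dim=1)

x = assemble_inputs(torch.randn(2, 4, 16, 32, 32),
                    torch.randn(2, 4, 32, 32),
                    torch.randn(2, 4, 32, 32))
print(x.shape)  # torch.Size([2, 9, 16, 32, 32])
```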
ISBN (Digital): 9798350353006
ISBN (Print): 9798350353013
We present a method to generate full-body selfies from photographs originally taken at arm's length. Because self-captured photos are typically taken close up, they have a limited field of view and exaggerated perspective that distorts facial shapes. We instead seek to generate the photo someone else would take of you from a few feet away. Our approach takes as input four selfies of your face and body along with a background image, and generates a full-body selfie in a desired target pose. We introduce a novel diffusion-based approach to combine all of this information into high-quality, well-composed photos of you with the desired pose and background.
ISBN (Digital): 9798350365474
ISBN (Print): 9798350365481
Standardized lossy video coding is at the core of almost all real-world video processing pipelines. Rate control is used to enable standard codecs to adapt to different network bandwidth conditions or storage constraints. However, standard video codecs (e.g., H.264) and their rate control modules aim to minimize video distortion w.r.t. human quality assessment. We demonstrate empirically that standard-coded videos severely degrade the performance of deep vision models. To overcome this degradation, this paper presents the first end-to-end learnable deep video codec control that considers both bandwidth constraints and downstream deep vision performance, while adhering to existing standardization. We demonstrate that our approach better preserves downstream deep vision performance than traditional standard video coding.
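One plausible way to express the trade-off described above is a downstream task loss plus a soft penalty on bitrate beyond the bandwidth budget. The sketch below is only an assumption about the objective's shape, not the paper's implementation; the bitrate units and weighting are placeholders.

```python
# Assumed objective shape for bandwidth-constrained codec control (illustrative only).
import torch

def codec_control_loss(task_loss, predicted_bitrate, bitrate_budget, lam=1.0):
    # Penalize only the part of the bitrate that exceeds the budget (soft constraint).
    rate_penalty = torch.clamp(predicted_bitrate - bitrate_budget, min=0.0)
    return task_loss + lam * rate_penalty

loss = codec_control_loss(task_loss=torch.tensor(0.73),
                          predicted_bitrate=torch.tensor(5.2),   # Mbit/s, assumed units
                          bitrate_budget=torch.tensor(4.0))
print(loss)  # tensor(1.9300)
```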
ISBN (Print): 9781665448994
Current methods for Earth observation tasks such as semantic mapping, map alignment, and change detection rely on near-nadir images; however, the first images available in response to dynamic world events such as natural disasters are often oblique. These tasks are much more difficult for oblique images due to observed object parallax. There has been recent success in learning to regress an object's geocentric pose, defined as height above ground and orientation with respect to gravity, by training with airborne lidar registered to satellite images. We present a model for this novel task that exploits affine invariance properties to outperform the state of the art by a wide margin. We also address practical issues required to deploy this method in the wild for real-world applications. Our data and code are publicly available(1).
ISBN (Print): 9781665445092
Computer vision is increasingly effective at segmenting objects in images and videos; however, scene effects related to the objects (shadows, reflections, generated smoke, etc.) are typically overlooked. Identifying such scene effects and associating them with the objects producing them is important for improving our fundamental understanding of visual scenes, and can also assist a variety of applications such as removing, duplicating, or enhancing objects in video. In this work, we take a step towards solving this novel problem of automatically associating objects with their effects in video. Given an ordinary video and a rough segmentation mask over time of one or more subjects of interest, we estimate an omnimatte for each subject: an alpha matte and color image that includes the subject along with all its related time-varying scene elements. Our model is trained only on the input video in a self-supervised manner, without any manual labels, and is generic: it produces omnimattes automatically for arbitrary objects and a variety of effects. We show results on real-world videos containing interactions between different types of subjects (cars, animals, people) and complex effects, ranging from semitransparent elements such as smoke and reflections to fully opaque effects such as objects attached to the subject.
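The compositing step underlying an omnimatte-style reconstruction can be sketched as standard back-to-front alpha compositing of the per-subject layers over the background, with the self-supervised loss comparing the composite to the input frame. This is our own illustration with assumed shapes, not the authors' code.

```python
# Back-to-front alpha compositing of per-subject (rgb, alpha) layers over a background,
# with a simple reconstruction loss against the observed frame (assumed shapes).
import torch

def composite(background, layers):
    """background: (B, 3, H, W); layers: list of (rgb, alpha) with rgb (B, 3, H, W), alpha (B, 1, H, W)."""
    out = background
    for rgb, alpha in layers:           # over-composite each layer back to front
        out = alpha * rgb + (1.0 - alpha) * out
    return out

bg = torch.rand(1, 3, 64, 64)
layers = [(torch.rand(1, 3, 64, 64), torch.rand(1, 1, 64, 64)) for _ in range(2)]
frame = torch.rand(1, 3, 64, 64)                               # the observed input frame
recon_loss = torch.mean((composite(bg, layers) - frame) ** 2)  # self-supervised target
print(recon_loss.item())
```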
ISBN (Print): 9783031716010
The proceedings contain 27 papers. The special focus in this conference is on Artificial Neural Networks in Pattern Recognition. The topics include: Neural Decompiling of Tracr Transformers; Pitfalls in Processing Infinite-Length Sequences with Popular Approaches for Sequential Data; Robust Clustering with McDonald's Beta-Liouville Mixture Models for Proportional Data; Evaluating Support Vector Machines with Multiple Kernels by Random Search; Automatic Interpretation of 18F-Fluorocholine PET/CT Findings in Patients with Primary Hyperparathyroidism: A Novel Dataset with Benchmarks; A Hybrid Neuroevolutionary Approach to the Design of Convolutional Neural Networks for 2D and 3D Medical Image Segmentation; An Improved Pix2Pix GAN for Medical Image Generation; Vision Transformer Features-Based Leukemia Classification; Comparative Study of Deep Learning Models in Melanoma Detection; A Metaheuristic Optimization Based Deep Feature Selection for Oral Cancer Classification; Machine Learning for Clinical Score Prediction from Longitudinal Dataset: A Case Study on Parkinson's Disease; Explaining Network Decision Provides Insights on the Causal Interaction Between Brain Regions in a Motor Imagery Task; Multi-modal Decoding of Reach-to-Grasping from EEG and EMG via Neural Networks; VAeViT: Fusing Multi-views for Complete 3D Object Recognition; Leveraging Transformers for Weakly Supervised Object Localization in Unconstrained Videos; Palmprint Classification via Filter Faces and Feature Extraction; Deep Multi-label Classification of Personality with Handwriting Analysis; License Plate Detection and Character Recognition Using Deep Learning and Font Evaluation; Experiments in Modeling Disagreement; Deep Multiresolution Wavelet Transform for Speech Emotion Assessment of High-Risk Suicide Callers; Dynamic HumTrans: Humming Transcription Using CNNs and Dynamic Programming; Leveraging LSTM Embeddings for River Water Temperature Modeling; Research on the Identification of Common Economic Shellfish in Jiang
ISBN (Digital): 9798350353006
ISBN (Print): 9798350353013
The most performant spatio-temporal action localisation models use external person proposals and complex external memory banks. We propose a fully end-to-end, purely transformer-based model that directly ingests an input video and outputs tubelets: a sequence of bounding boxes and action classes at each frame. Our flexible model can be trained with either sparse bounding-box supervision on individual frames or full tubelet annotations, and in both cases it predicts coherent tubelets as output. Moreover, our end-to-end model requires no additional pre-processing in the form of proposals, or post-processing in terms of non-maximal suppression. We perform extensive ablation experiments and significantly advance the state of the art on five different spatio-temporal action localisation benchmarks with both sparse keyframes and full tubelet annotations.
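For clarity, the tubelet output format described above can be pictured as one bounding box and one class-score vector per frame per tubelet. The small sketch below is our own illustration of such a container, with assumed field names; it is not the model's actual output head.

```python
# Assumed container for the tubelet format: per-frame boxes and per-frame action scores.
from dataclasses import dataclass
import torch

@dataclass
class Tubelet:
    boxes: torch.Tensor         # (T, 4) boxes in (x1, y1, x2, y2) format, one per frame
    class_logits: torch.Tensor  # (T, num_classes) action scores, one row per frame

    def at_frame(self, t: int):
        return self.boxes[t], self.class_logits[t].argmax()

tube = Tubelet(boxes=torch.rand(32, 4), class_logits=torch.randn(32, 80))
print(tube.at_frame(0))
```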
ISBN (Print): 9781665448994
We address the problem of unsupervised classification of players in a team sport according to their team affiliation, when jersey colours and design are not known a priori. We adopt a contrastive learning approach in which an embedding network learns to maximize the distance between representations of players on different teams relative to players on the same team, in a purely unsupervised fashion, without any labelled data. We evaluate the approach using a new hockey dataset and find that it outperforms prior unsupervised approaches by a substantial margin, particularly for real-time application when only a small number of frames are available for unsupervised learning before team assignments must be made. Remarkably, we show that our contrastive method achieves 94% accuracy after unsupervised training on only a single frame, with accuracy rising to 97% within 500 frames (17 seconds of game time). We further demonstrate how accurate team classification allows accurate team-conditional heat maps of player positioning to be computed.
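One common way to realise such label-free contrastive training is an NT-Xent objective over two augmented views of each player crop, after which the learned embeddings can be grouped into two teams (e.g., with k-means). The sketch below illustrates only the loss and is an assumption about the setup, not the paper's code; the augmentation and embedding network are placeholders.

```python
# NT-Xent contrastive loss over two augmented views of the same player crops (illustrative).
import torch
import torch.nn.functional as F

def nt_xent(z1, z2, temperature=0.5):
    """z1, z2: (N, D) embeddings of two augmented views of the same N player crops."""
    z = F.normalize(torch.cat([z1, z2], dim=0), dim=1)       # (2N, D) unit-norm embeddings
    sim = z @ z.t() / temperature                             # pairwise cosine similarities
    n = z1.size(0)
    sim.masked_fill_(torch.eye(2 * n, dtype=torch.bool), float('-inf'))  # drop self-similarity
    targets = torch.cat([torch.arange(n, 2 * n), torch.arange(0, n)])    # each view's positive index
    return F.cross_entropy(sim, targets)

z1, z2 = torch.randn(16, 128), torch.randn(16, 128)
print(nt_xent(z1, z2).item())
```

After training, clustering the per-player embeddings into two groups per frame would yield the team assignments.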
ISBN (Print): 9781665445092
Despite their unmatched performance, deep neural networks remain susceptible to targeted attacks by nearly imperceptible levels of adversarial noise. While the underlying cause of this sensitivity is not well understood, theoretical analyses can be simplified by reframing each layer of a feedforward network as an approximate solution to a sparse coding problem. Iterative solutions using basis pursuit are theoretically more stable and have improved adversarial robustness. However, cascading layer-wise pursuit implementations suffer from error accumulation in deeper networks. In contrast, our new method of deep pursuit approximates the activations of all layers as a single global optimization problem, allowing us to consider deeper real-world architectures with skip connections, such as residual networks. Experimentally, our approach demonstrates improved robustness to adversarial noise.
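The sparse coding view referenced above treats a layer's activations as the solution of a basis pursuit problem. A standard way to compute such a solution is ISTA, sketched below for a single fixed dictionary; this is a generic illustration of the idea, not the paper's deep pursuit solver.

```python
# ISTA for sparse coding: minimize (1/2)*||x - D z||^2 + lam*||z||_1 over the code z.
import torch

def ista(x, D, lam=0.1, n_iter=100):
    """x: (m,), D: (m, k). Returns a sparse code z of shape (k,)."""
    L = torch.linalg.matrix_norm(D, ord=2) ** 2     # Lipschitz constant of the gradient
    z = torch.zeros(D.shape[1])
    for _ in range(n_iter):
        grad = D.t() @ (D @ z - x)                  # gradient of the quadratic data term
        z = torch.nn.functional.softshrink(z - grad / L, lambd=float(lam / L))  # proximal (soft-threshold) step
    return z

torch.manual_seed(0)
D = torch.randn(64, 256)
x = D @ (torch.randn(256) * (torch.rand(256) < 0.05))  # signal generated from a sparse true code
z = ista(x, D)
print((z.abs() > 1e-3).sum().item(), "nonzero coefficients")
```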