ISBN:
(Print) 9781665468916
This paper presents an efficient and effective matting framework for human video clips. To alleviate the inefficiency of existing models, we propose using a refiner dedicated to error-prone regions and reducing the computation at higher resolutions, so the proposed framework achieves real-time performance on 1080p 60 fps videos. With its recurrent architecture, our model is aware of temporal information and produces temporally more consistent matting results than models that process each frame individually. Moreover, it contains a module for capturing semantic information, which makes our model easy to use without troublesome setup such as annotating trimaps or other additional inputs. Experiments show that our proposed method outperforms previous matting methods and reaches the state of the art on the VideoMatte240K dataset.
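To make the refinement idea concrete, below is a minimal sketch of restricting computation to error-prone regions: a coarse low-resolution matte is upsampled, an uncertainty proxy selects patches, and only those patches are re-processed. PyTorch, the uncertainty heuristic, the patch size, and the `refiner` callable are all illustrative assumptions, not the paper's implementation.

```python
# Illustrative patch-based refinement of a coarse alpha matte; all names,
# thresholds, and the uncertainty proxy are assumptions for this sketch.
import torch
import torch.nn.functional as F

def refine_error_regions(coarse_alpha, frame, refiner, err_thresh=0.1, patch=16):
    """Upsample a coarse matte, then re-run `refiner` only on patches whose
    estimated uncertainty exceeds a threshold (batch size 1 for brevity)."""
    alpha = F.interpolate(coarse_alpha, size=frame.shape[-2:],
                          mode="bilinear", align_corners=False)
    # Cheap "error-prone" proxy: alpha far from 0 or 1 (soft boundary pixels).
    uncertainty = alpha * (1 - alpha)
    grid = F.avg_pool2d(uncertainty, patch)          # patch-level uncertainty
    ys, xs = torch.nonzero(grid[0, 0] > err_thresh, as_tuple=True)
    for y, x in zip(ys.tolist(), xs.tolist()):
        t, l = y * patch, x * patch
        crop = torch.cat([frame[..., t:t + patch, l:l + patch],
                          alpha[..., t:t + patch, l:l + patch]], dim=1)
        alpha[..., t:t + patch, l:l + patch] = refiner(crop)
    return alpha
```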
A Pango FPGA-based solution for merging and processing multiple independent video streams and reconstructing a real-time display system is presented. The system handles each video stream individually, then produces a ...
A novel platform with motion video recognition for intelligent sport monitoring applications is studied in this manuscript. The action markers of human targets in sports video images are random. Combining im...
The rapid advancement of generative artificial intelligence (GAI) has led to the creation of transformative applications such as ChatGPT, which significantly boosts text processing efficiency and diversifies audio, im...
In this study, we developed a real-time vibration visualization system that can estimate and display vibration distributions at all frequencies in real time through parallel implementation of subpixel digital image correlation (DIC) computations with short-time Fourier transforms on a GPU-based high-speed vision platform. To help operators intuitively monitor high-speed motion, we introduced a two-step framework of high-speed video processing to obtain vibration distributions at hundreds of hertz and video conversion processing for the visualization of vibration distributions at dozens of hertz. The proposed system can estimate the full-field vibration displacements of 1920×1080 images in real time at 1000 fps and display their frequency responses in the range of 0–500 Hz on a computer at dozens of frames per second by accelerating phase-only DICs for full-field displacement measurement and video conversion. The effectiveness of this system for real-time vibration monitoring and visualization was demonstrated by conducting experiments on objects vibrating at dozens or hundreds of hertz.
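The core computation (subpixel displacement via phase-only correlation, followed by a short-time Fourier transform of the displacement series) can be sketched as follows. This is a simplified CPU version with NumPy/SciPy; the sampling rate matches the 1000 fps mentioned above, but the window length and the synthetic signal are assumptions.

```python
# Simplified phase-only correlation (POC) shift estimate plus an STFT of a
# displacement time series; subpixel refinement and GPU parallelism omitted.
import numpy as np
from scipy.signal import stft

def poc_shift(a, b):
    """Integer-pixel shift of patch b relative to a via phase-only correlation."""
    cross = np.fft.fft2(a) * np.conj(np.fft.fft2(b))
    cross /= np.abs(cross) + 1e-12               # keep phase information only
    corr = np.real(np.fft.ifft2(cross))
    peak = np.unravel_index(np.argmax(corr), corr.shape)
    # Wrap indices so negative shifts are reported correctly.
    return [p - s if p > s // 2 else p for p, s in zip(peak, corr.shape)]

# Frequency response of one pixel's displacement sampled at 1000 fps.
fs = 1000.0
disp = np.sin(2 * np.pi * 120 * np.arange(2048) / fs)  # synthetic 120 Hz vibration
freqs, times, Z = stft(disp, fs=fs, nperseg=256)
dominant = freqs[np.argmax(np.abs(Z).mean(axis=1))]    # ~120 Hz, within 0-500 Hz
```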
Robust and efficient detection of young "Yuluxiang" pear fruits poses a significant challenge in natural environments. This difficulty arises from factors such as the similar color between young fruits and the background, occlusion by branches and leaves, fruit denseness, and small fruit size. To achieve precise detection, a lightweight detection method named YOLO-CiHFC was proposed in this study. The CiR module was constructed using the Inverted Residual Mobile Block (iRMB) and the C2f module. The C2f modules of the YOLOv8n backbone and neck networks were all replaced with CiR modules to maintain low computational complexity and parameter count while enhancing the feature extraction and fusion capabilities of the model. Then, the HS-FPN structure was introduced to reconstruct the neck network, and Focaler-CIoU was adopted as the model's loss function. In comparison with YOLOv8n, the F1 score and average precision (AP) of YOLO-CiHFC improved by 0.25% and 1.77%, respectively. The inference time of YOLO-CiHFC (1.5 ms) was 0.2 ms faster than that of YOLOv8n, and its model size was 52.94% of the original model. Furthermore, YOLO-CiHFC was compared with common lightweight models such as YOLOv3-Tiny, YOLOv4-Tiny, YOLOv5n, and YOLOv7-Tiny. The results showed that YOLO-CiHFC achieved the best F1 score of 85.95% and AP of 88.00%, had the smallest model size of 3.15 MB, and obtained the best detection results across different scenarios. The model was deployed on a Jetson Nano at a real-time detection speed of 25.5 fps. The proposed YOLO-CiHFC method is not only lightweight but also improves detection accuracy and speed, and can provide methodological support for intelligent detection of young "Yuluxiang" pear fruits.
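The Focaler-CIoU term mentioned above combines the CIoU loss with a piecewise-linear remapping of IoU onto a difficulty band [d, u]. The sketch below follows the published Focaler-IoU formulation, L = L_CIoU + IoU - IoU_focaler; the band limits shown are illustrative defaults, not the values tuned in this study.

```python
# Sketch of Focaler-CIoU: the IoU inside the loss is remapped to focus on a
# chosen difficulty band [d, u]; d and u here are illustrative, not tuned.
import torch

def focaler_iou(iou: torch.Tensor, d: float = 0.0, u: float = 0.95) -> torch.Tensor:
    """Piecewise-linear map: 0 below d, 1 above u, linear in between."""
    return ((iou - d) / (u - d)).clamp(min=0.0, max=1.0)

def focaler_ciou_loss(ciou_loss: torch.Tensor, iou: torch.Tensor,
                      d: float = 0.0, u: float = 0.95) -> torch.Tensor:
    # L_Focaler-CIoU = L_CIoU + IoU - IoU_focaler (per the Focaler-IoU paper).
    return ciou_loss + iou - focaler_iou(iou, d, u)
```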
ISBN:
(Print) 9798331529543; 9798331529550
The increasing demand for high-quality, real-time visual communication and growing user expectations, coupled with limited network resources, necessitate novel approaches to semantic image communication. This paper presents a method to enhance semantic image communication that combines a novel lossy semantic encoding approach with spatially adaptive semantic image synthesis models. By developing a model-agnostic training augmentation strategy, our approach substantially reduces susceptibility to the distortion introduced during encoding, effectively eliminating the need for lossless semantic encoding. Comprehensive evaluation across two spatially adaptive conditioning methods and three popular datasets indicates that this approach enhances semantic image communication in very low bit-rate regimes.
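One way to read the "training augmentation strategy" is as training-time corruption of the semantic map that mimics lossy encoding. The sketch below is such a corruption, purely as an assumed example: nearest-neighbour down/upsampling stands in for the lossy codec, plus random label flips; the paper's actual encoding and augmentation may differ.

```python
# Illustrative augmentation that simulates lossy semantic-map encoding; the
# scale factor, flip rate, and class count are assumptions for this sketch.
import torch
import torch.nn.functional as F

def corrupt_semantic_map(labels, scale=0.25, flip_rate=0.02, num_classes=35):
    """labels: (B, 1, H, W) integer class map."""
    h, w = labels.shape[-2:]
    # Lossy "encode/decode": nearest-neighbour down- and up-sampling.
    small = F.interpolate(labels.float(), scale_factor=scale, mode="nearest")
    rec = F.interpolate(small, size=(h, w), mode="nearest").long()
    # Random label flips emulate residual coding errors.
    flip = torch.rand_like(rec, dtype=torch.float) < flip_rate
    rec[flip] = torch.randint(0, num_classes, (int(flip.sum()),), device=rec.device)
    return rec
```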
ISBN:
(Print) 9798350386660; 9798350386677
Deception is a prevalent human behavior that significantly impacts our perception of essential facts. Therefore, developing accurate deception detection technology holds great significance. However, current research on purely visual deception detection algorithms does not leverage deep learning methods to extract detailed features such as facial Action Units (AUs) and gaze angles, and the global information within facial video sequences is often overlooked. To address these limitations, this paper introduces a novel deception detection model that combines global and local facial features through attention mechanisms. First, the model focuses on local facial features, computing AU strength and gaze angle for each frame to create a multivariate time series for every video. A Siamese Transformer model employing patching then extracts deep temporal and channel features from the multivariate time series, and the occurrence frequency of five specific AUs is selected as a manual feature. Second, the model performs video understanding based on global facial features: local features are extracted from each frame using shallow CNNs with multiple receptive fields, and a video Transformer with spatiotemporally separated attention globally models the sequence of face frames. Finally, the extracted local and global facial features are concatenated and fed into a classifier to determine deception. Extensive experiments on existing datasets validate the outstanding performance of the proposed method.
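The "patching" step for the per-frame AU-strength and gaze-angle series can be illustrated in a PatchTST-style sketch; the channel count (17 AUs + 2 gaze angles), patch length, stride, and embedding size are assumptions, not the paper's settings.

```python
# Minimal patching of a multivariate time series into overlapping tokens;
# all dimensions below are illustrative assumptions.
import torch
import torch.nn as nn

class PatchEmbed(nn.Module):
    def __init__(self, patch_len=16, stride=8, d_model=128):
        super().__init__()
        self.patch_len, self.stride = patch_len, stride
        self.proj = nn.Linear(patch_len, d_model)

    def forward(self, x):                    # x: (batch, channels, time)
        # Slice each channel into overlapping patches along the time axis.
        patches = x.unfold(-1, self.patch_len, self.stride)
        # patches: (batch, channels, num_patches, patch_len)
        return self.proj(patches)            # (batch, channels, num_patches, d_model)

# e.g. 17 AU strengths + 2 gaze angles over 300 frames
series = torch.randn(4, 19, 300)
tokens = PatchEmbed()(series)                # (4, 19, 36, 128)
```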
ISBN:
(Digital) 9781665470506
ISBN:
(Print) 9781665470506
This paper describes the development of a tracker for wheelchair basketball players using special flashing LEDs and an omnidirectional camera. Our previous trackers required several image-processing steps to find multiple LEDs in the video captured by the omnidirectional camera, and these steps were time-consuming. This study uses convolutional neural networks to reduce the time it takes to find multiple LEDs in that video.
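A minimal version of replacing the hand-tuned LED search with a CNN could look like the following: a small fully convolutional network emits a per-pixel LED likelihood map, and peaks above a threshold give candidate positions. The architecture and threshold are illustrative assumptions, not the authors' network.

```python
# Tiny fully convolutional LED detector; layer sizes and the detection
# threshold are assumptions for this sketch.
import torch
import torch.nn as nn

class LEDHeatmapNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
            nn.Conv2d(16, 16, 3, padding=1), nn.ReLU(),
            nn.Conv2d(16, 1, 1),            # per-pixel LED logit
        )

    def forward(self, frame):               # frame: (B, 3, H, W)
        return torch.sigmoid(self.net(frame))

def led_positions(heatmap, thresh=0.9):
    """Return (y, x) coordinates of pixels classified as LEDs (batch of 1)."""
    ys, xs = torch.nonzero(heatmap[0, 0] > thresh, as_tuple=True)
    return list(zip(ys.tolist(), xs.tolist()))
```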
Educational technology is increasingly focusing on real-time language learning. Prior studies have utilized Natural Language Processing (NLP) to assess students' classroom behavior by analyzing their reported feelings and thoughts, but they have not fully enhanced the feedback provided to instructors and peers. This research addresses that gap by combining two technologies, Federated 3D Convolutional Neural Networks (Fed 3D-CNN) and Long Short-Term Memory (LSTM) networks, and investigates classroom attitudes to enhance students' language competence. These technologies enable the modification of teaching strategies through text analysis and image recognition, providing comprehensive feedback on student interactions. For this study, the Multimodal Emotion Lines Dataset (MELD) and the eNTERFACE'05 dataset were selected: eNTERFACE contains 3D images of individuals, while MELD covers spoken dialogue patterns. To address overfitting, the SMOTE technique is used to balance the dataset through oversampling and undersampling. The study predicts human emotions using the Fed 3D-CNN, which excels at image processing by predicting personal information from various angles; federated learning with 3D-CNNs allows simultaneous training across multiple clients by leveraging both local and global weight changes. The NLP system identifies emotional language patterns in students, laying the foundation for this analysis. Although not all student feedback has been extensively studied in the literature, the Fed 3D-CNN and LSTM recommendations are valuable for extracting feedback-related information from audio and video. The proposed framework achieves a prediction accuracy of 97.72%, outperforming existing methods.
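The "local and global weight changes" of federated learning typically correspond to a FedAvg-style aggregation step, sketched below; this is the standard algorithm, not necessarily the paper's exact aggregation rule.

```python
# Standard FedAvg aggregation: client parameters are averaged into the global
# model, weighted by each client's sample count (everything cast to float
# for simplicity in this sketch).
import torch

def fed_avg(client_states, client_sizes):
    """client_states: list of state_dicts; client_sizes: samples per client."""
    total = float(sum(client_sizes))
    global_state = {}
    for key in client_states[0]:
        global_state[key] = sum(
            state[key].float() * (n / total)
            for state, n in zip(client_states, client_sizes)
        )
    return global_state

# Each round: broadcast the global state, train locally, then re-aggregate.
```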