The quality of interpolated frames is significant for both research on and application of video frame interpolation (VFI). In recent VFI studies, full-reference image quality assessment methods have generally been used to evaluate the quality of VFI frames. However, the high frame rate reference videos that full-reference methods require are difficult to obtain in most applications of VFI. To evaluate the quality of VFI frames without reference videos, a no-reference perceptual quality assessment method is proposed in this paper. This method is more compatible with VFI applications, and its evaluation scores are consistent with human subjective opinions. First, a new quality assessment dataset for VFI was constructed through subjective experiments to collect opinion scores for interpolated frames. The dataset was created from triplets of frames extracted from high-quality videos using 9 state-of-the-art VFI algorithms. The proposed method evaluates the perceptual coherence of the interpolated frame with the original pair of VFI input frames. Specifically, the method applies a triplet network architecture, comprising three parallel feature pipelines, to extract deep perceptual features of the interpolated frame as well as the original pair of frames. Coherence similarities of the two-way parallel features are jointly calculated and optimized as a perceptual metric. In the experiments, both full-reference and no-reference quality assessment methods were tested on the new dataset. The results show that the proposed method achieves the best performance among all compared quality assessment methods.
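To make the coherence idea concrete, below is a minimal PyTorch sketch of a triplet of weight-shared feature pipelines scoring an interpolated frame against its two input frames; the small backbone and the cosine-similarity combination are illustrative assumptions, not the authors' exact architecture.

```python
# Minimal sketch of a triplet-pipeline coherence metric.
# The backbone and the averaged cosine similarity are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class FeaturePipeline(nn.Module):
    """One of three parallel pipelines extracting deep perceptual features."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
    def forward(self, x):
        return F.normalize(self.net(x), dim=1)

class TripletCoherence(nn.Module):
    """Scores an interpolated frame against its two original input frames."""
    def __init__(self):
        super().__init__()
        self.pipeline = FeaturePipeline()  # weights shared across the triplet
    def forward(self, frame0, frame_interp, frame1):
        f0 = self.pipeline(frame0)
        fi = self.pipeline(frame_interp)
        f1 = self.pipeline(frame1)
        # Two-way coherence: similarity of the interpolated frame to each input.
        sim0 = F.cosine_similarity(fi, f0, dim=1)
        sim1 = F.cosine_similarity(fi, f1, dim=1)
        return 0.5 * (sim0 + sim1)  # jointly combined perceptual score

model = TripletCoherence()
x0, xi, x1 = (torch.rand(1, 3, 128, 128) for _ in range(3))
print(model(x0, xi, x1))
```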
In multimodal sentiment analysis, the same attention mechanism is usually used to capture features at the same semantic level, without considering the differences in intermodal interactions for sentiment categorization; this leads to problems such as insufficient extraction of inter-modal fusion features. To address this problem, this paper proposes a new improvement scheme for the MMMU-BA (Multi-Modal Multi-Utterance-Bi-Modal Attention) model. New modal features, formed by forward and backward splicing of the two original modal features, replace the original modal features fed into the bi-modal attention mechanism, producing stronger bimodal features. In addition, the original modal features are replaced with modal features that have passed through a self-attention mechanism, which enhances the model’s representational ability. The model is validated on the publicly available CMU-MOSI (Multimodal Opinion-level Sentiment Intensity) and CMU-MOSEI (CMU Multimodal Opinion Sentiment and Emotion Intensity) datasets, where accuracy improves by 0.57% and 0.32% and F1 by 1.6% and 0.7%, respectively.
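As a rough illustration of the splicing scheme, the sketch below feeds a forward ([text; audio]) and a backward ([audio; text]) concatenation into a generic dot-product bi-modal attention; the feature dimensions and the attention form are assumptions, not the MMMU-BA implementation.

```python
# Illustrative sketch: the bi-modal attention receives forward and backward
# concatenations of two modal features instead of the raw features.
import torch
import torch.nn.functional as F

def bimodal_attention(x, y):
    """Generic cross-modal attention: x attends over y."""
    scores = torch.matmul(x, y.transpose(-2, -1))   # (B, Tx, Ty)
    weights = F.softmax(scores, dim=-1)
    return torch.matmul(weights, y)                 # (B, Tx, dim of y)

B, T, d = 2, 10, 64
text = torch.rand(B, T, d)    # utterance-level text features (assumed shapes)
audio = torch.rand(B, T, d)   # utterance-level audio features

# Forward and backward splices replace the original modal inputs.
fwd = torch.cat([text, audio], dim=-1)   # [text; audio]
bwd = torch.cat([audio, text], dim=-1)   # [audio; text]
fused = bimodal_attention(fwd, bwd)
print(fused.shape)  # torch.Size([2, 10, 128])
```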
Intelligent applications such as the metaverse, built on IoT technology, are usually inseparable from the collection of surveillance video data. Yet how to effectively compress indoor surveillance video has always been a huge challenge for data transmission. In this paper, we propose an image-feature parallel compression framework for indoor surveillance video: the video is divided into images and features, which are compressed separately. Specifically, using the Mean-shift algorithm, the background image is generated, encoded, and shared among frames. Based on a series of detection and matching algorithms, a human feature (HF) extractor is designed to extract a frontage body image, identity features, boundary features, and structural features, which are used for the generation of foreground body images. At the same time, body masks are generated in order to segment body areas from body images. Finally, with the guidance of the body mask and boundary features, the body area covers the corresponding area of the generated background image to reconstruct the video, and its quality is improved by a quality enhancement network. Experimental results show that the proposed compression scheme achieves average coding gains of 66.91% and 51.84% compared with HEVC and VVC, respectively.
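The mask-guided reconstruction step can be sketched as a simple compositing operation; the hard binary mask and array shapes below are simplifying assumptions for illustration.

```python
# Minimal sketch of mask-guided reconstruction: the segmented body area
# overwrites the corresponding region of the shared background image.
import numpy as np

def composite_frame(background, body_image, body_mask):
    """Paste the body area onto the generated background.

    background: (H, W, 3) shared background (e.g. generated via Mean-shift)
    body_image: (H, W, 3) generated foreground body image
    body_mask:  (H, W)    binary mask segmenting the body area
    """
    mask = body_mask[..., None].astype(background.dtype)
    return mask * body_image + (1.0 - mask) * background

H, W = 240, 320
bg = np.zeros((H, W, 3), dtype=np.float32)
body = np.ones((H, W, 3), dtype=np.float32)
mask = np.zeros((H, W), dtype=np.float32)
mask[60:180, 100:220] = 1.0   # hypothetical body region
frame = composite_frame(bg, body, mask)
print(frame.shape, frame.max())
```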
Zero-shot learning is a technique capable of recognizing target categories even when labeled samples for these categories are completely absent. Traditional zero-shot learning methods based on embedding models usually make low utilization of semantic attributes and exhibit a bias towards seen classes during testing. To address this, we propose an embedding-based ZSL method grounded in semantic attributes. This method uses a spatial attention mechanism during the construction of the semantic attribute embedding space, enabling the model to focus on more distinctive attribute features within the images and to use these distinctive features for similarity classification. Furthermore, a category calibration loss function is introduced to assign a greater weight to unseen classes and a lesser weight to seen classes, aiming to reduce the bias towards seen classes during testing. Extensive experiments were carried out on three mainstream ZSL benchmark datasets. Compared with some existing classical algorithms, our method demonstrates improved results.
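A minimal sketch of one plausible form of the category calibration idea, in the spirit of calibrated stacking: a margin gamma is subtracted from seen-class logits before the cross-entropy, which effectively up-weights unseen classes. The exact weighting used in the paper may differ.

```python
# Sketch of a category calibration loss: seen-class logits are penalized so
# the classifier is less biased toward seen classes. The margin scheme is
# an illustrative assumption.
import torch
import torch.nn.functional as F

def calibrated_loss(logits, targets, seen_mask, gamma=1.0):
    """logits: (B, C); seen_mask: (C,) bool, True for seen classes."""
    # Subtract a calibration margin from seen-class logits only.
    calibrated = logits - gamma * seen_mask.float()
    return F.cross_entropy(calibrated, targets)

C = 10
seen_mask = torch.zeros(C, dtype=torch.bool)
seen_mask[:6] = True                      # hypothetical: first 6 classes seen
logits = torch.randn(4, C)
targets = torch.randint(0, C, (4,))
print(calibrated_loss(logits, targets, seen_mask))
```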
Voice activity detection (VAD) is a signal processing technique used to determine whether a given speech signal contains voiced or unvoiced segments. VAD is used in various applications such as speech coding, voice-controlled systems, and speech feature extraction. For example, in Adaptive Multi-Rate (AMR) speech coding, VAD is used as an efficient way of coding different speech frames at different bit rates. In this paper, we implement a Zero-Phase Zero Frequency Resonator (ZP-ZFR) as a VAD on hardware. The ZP-ZFR is an Infinite Impulse Response (IIR) filter that requires a lower filter order, making it suitable for hardware implementation. The proposed system is evaluated on the TIMIT database using the Nexys Video Artix-7 FPGA board. The hardware design is carried out using Vivado 2021.1, a popular tool for FPGA development, and the Hardware Description Language (HDL) used for implementation is Verilog. The proposed system achieves good performance with low complexity, and the resulting hardware implementation can be used in various applications.
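For intuition, here is a simplified software sketch of zero-phase zero-frequency-resonator filtering followed by frame-energy VAD; the resonator radius, trend-removal window, and threshold are illustrative assumptions and do not reproduce the paper's FPGA design.

```python
# Simplified software sketch of ZP-ZFR-style filtering for VAD.
# All parameters below are assumptions for illustration only.
import numpy as np
from scipy.signal import filtfilt

def zp_zfr(x, r=0.999):
    # Resonator with a near-double pole at zero frequency,
    # applied forward-backward (filtfilt) for zero phase:
    # H(z) = 1 / (1 - r z^-1)^2.
    b = [1.0]
    a = [1.0, -2.0 * r, r * r]
    return filtfilt(b, a, x)

def remove_trend(y, win):
    # Subtract a local mean to remove the slowly varying trend.
    kernel = np.ones(win) / win
    return y - np.convolve(y, kernel, mode="same")

def vad(x, fs, frame_ms=20, thresh=0.1):
    y = remove_trend(zp_zfr(x), win=int(0.01 * fs))
    frame = int(fs * frame_ms / 1000)
    n = len(y) // frame
    energy = (y[: n * frame].reshape(n, frame) ** 2).mean(axis=1)
    return energy > thresh * energy.max()   # True = voiced frame

fs = 16000
t = np.arange(fs // 2) / fs
x = np.concatenate([np.zeros(fs // 2), np.sin(2 * np.pi * 200 * t)])
print(vad(x, fs).astype(int))
```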
Early diagnosis is very important for brain tumors. Although Magnetic Resonance Imaging (MRI) is widely used for brain tumor detection, it is difficult to detect tumors manually. Therefore, computer-aided diagnosis systems have been frequently utilized in recent years. In this study, an Efficient Channel Attention-Dense Convolutional Network (ECA-DenseNet) framework is proposed to detect tumors based on brain MRI images. In addition to detecting the tumor, the framework determines which type of tumor is present. In the developed ECA-DenseNet structure, an ECA block is added to the dense blocks; the ECA block aims to discard unimportant information and thus reduce computation time. The improved DenseNet model was tested on an open-source dataset and compared with DenseNet-121, DenseNet-169, DenseNet-201, and DenseNet-264. The experimental results show that the improved model has better classification performance than the others, with an accuracy of 95.07%.
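The ECA block itself is compact; the sketch below follows the common ECA-Net formulation (global average pooling, a 1D convolution across channels, and a sigmoid gate), though its exact placement inside the dense blocks here is only assumed.

```python
# Compact sketch of an ECA block in the common ECA-Net formulation.
import torch
import torch.nn as nn

class ECABlock(nn.Module):
    """Efficient Channel Attention: cheap channel re-weighting via 1D conv."""
    def __init__(self, k_size=3):
        super().__init__()
        self.avg_pool = nn.AdaptiveAvgPool2d(1)
        self.conv = nn.Conv1d(1, 1, kernel_size=k_size,
                              padding=k_size // 2, bias=False)
        self.sigmoid = nn.Sigmoid()

    def forward(self, x):                       # x: (B, C, H, W)
        y = self.avg_pool(x)                    # (B, C, 1, 1)
        y = y.squeeze(-1).transpose(-1, -2)     # (B, 1, C)
        y = self.conv(y)                        # local cross-channel interaction
        y = self.sigmoid(y).transpose(-1, -2).unsqueeze(-1)  # (B, C, 1, 1)
        return x * y                            # suppress unimportant channels

x = torch.rand(2, 64, 28, 28)
print(ECABlock()(x).shape)  # torch.Size([2, 64, 28, 28])
```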
The field of face recognition has undergone a revolutionary transformation due to the introduction of deep learning algorithms, resulting in significant advancements in accuracy, robustness, and real-world applicability. However, the performance of face recognition systems is still influenced by challenging scenarios such as pose variation, expression, illumination, and occlusion. This paper presents a novel approach for one-shot image classification in face recognition, utilizing a modified Siamese Neural Network (SNN) to improve both speed and accuracy, even when the dataset is limited. The embedding layer of the SNN incorporates the EfficientNetV2L and Xception neural networks, which are then compared to determine the model's convergence. Experiments on our own dataset and the LFW dataset show that the Eff-Large SNN is a lightweight and promising solution for one-shot image classification, achieving an accuracy of 91.95%. It outperforms the SNN with Xception in terms of efficiency, accuracy, and versatility, making it suitable for a wide range of applications. The model is deployed in real-time CCTV surveillance for face recognition applications.
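A minimal sketch of the one-shot verification pattern: two images pass through a shared embedding network and are compared by distance. The small convolutional backbone and decision threshold below are placeholders for the EfficientNetV2L/Xception embeddings used in the paper.

```python
# Minimal sketch of Siamese one-shot verification with a shared embedding.
# The backbone and threshold are illustrative placeholders.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SiameseNet(nn.Module):
    def __init__(self, embed_dim=128):
        super().__init__()
        self.backbone = nn.Sequential(          # placeholder embedding network
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, embed_dim),
        )

    def forward(self, img_a, img_b):
        ea = F.normalize(self.backbone(img_a), dim=1)
        eb = F.normalize(self.backbone(img_b), dim=1)
        return F.pairwise_distance(ea, eb)      # small distance = same identity

net = SiameseNet()
a, b = torch.rand(1, 3, 160, 160), torch.rand(1, 3, 160, 160)
same_person = net(a, b) < 0.8                  # hypothetical decision threshold
print(same_person)
```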
The spotlight on large population-based studies is growing within the research community. Epidemiological studies, amassing extensive data through questionnaires and check-ups, often include imaging data like magnetic resonance imaging (MRI) or ultrasonography from numerous participants. This article offers an exploration of several recent epidemiological studies conducted in Germany, addressing not only the intriguing research tasks but also the multifaceted perspectives and challenges of analyzing image data. This includes potential enhancements in imaging technologies, machine learning applications for data analysis, and collaboration between different research disciplines. Key insights into the methodologies, algorithms, and techniques used for processing and interpreting complex imaging data are detailed. The potential of epidemiological image analysis to guide clinical practices and contribute to personalized medicine is also discussed.
The greater the distance between a sample's non-mated and mated comparison score distributions, the more useful that sample is for biometric recognition. Finger image quality scores turn out to be only weakly correlated with the observed utility. This is worth investigating because finger image quality assessment software is widely used to predict the biometric utility of finger images in many public-sector applications. This paper shows that a weak correlation between predicted and observed utility does not matter if the quality scores are only used to decide whether to discard or retain biometric samples for further processing. What matters is that useful samples are not mistakenly discarded and that less useful samples are not mistakenly retained. This can be measured by quality-assessment false positive and false negative rates. In cost-benefit analyses, these metrics can be used to choose suitable quality-score thresholds for the use cases at hand.
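These error rates are straightforward to compute once a utility threshold defines "useful"; the sketch below uses synthetic, weakly correlated quality scores and illustrative thresholds.

```python
# Sketch of quality-assessment error rates: useful samples wrongly discarded
# (false negatives) and low-utility samples wrongly retained (false positives).
# Data and thresholds are synthetic and illustrative.
import numpy as np

def qa_error_rates(quality, utility, q_thresh, u_thresh):
    useful = utility >= u_thresh
    retained = quality >= q_thresh
    # P(discarded | useful) and P(retained | not useful).
    fnr = np.mean(useful & ~retained) / max(np.mean(useful), 1e-12)
    fpr = np.mean(~useful & retained) / max(np.mean(~useful), 1e-12)
    return fpr, fnr

rng = np.random.default_rng(0)
utility = rng.normal(0, 1, 10000)
quality = utility + rng.normal(0, 1, 10000)   # weakly correlated predictor
print(qa_error_rates(quality, utility, q_thresh=0.0, u_thresh=0.0))
```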
Point-cloud-based 3D perception has attracted great attention in various applications including robotics, autonomous driving, and AR/VR. In particular, the 3D sparse convolution (SpConv) network has emerged as one of the most popular backbones due to its excellent performance. However, it poses severe challenges to real-time perception on general-purpose platforms, such as lengthy map search latency, high computation cost, and enormous memory footprint. In this paper, we propose SpOctA, a SpConv accelerator that enables high-speed and energy-efficient point cloud processing. SpOctA parallelizes the map search by utilizing algorithm-architecture co-optimization based on octree encoding, thereby achieving 8.8-21.2× search speedup. It also attenuates the heavy computational workload by exploiting the inherent sparsity of each voxel, which eliminates computation redundancy and saves 44.4-79.1% of processing latency. To optimize on-chip memory management, a SpConv-oriented non-uniform caching strategy is introduced to reduce external memory access energy by 57.6% on average. Implemented in a 40nm technology and extensively evaluated on representative benchmarks, SpOctA outperforms state-of-the-art SpConv accelerators with a 1.1-6.9× speedup and a 1.5-3.1× improvement in energy efficiency.
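The octree-encoding idea behind the map search can be illustrated in software: voxel coordinates are interleaved into Morton codes so that kernel-offset lookups become hash queries. Bit widths and the query pattern below are assumptions, not SpOctA's hardware scheme.

```python
# Illustrative sketch of Morton (octree) encoding for SpConv map search.
def morton3d(x, y, z, bits=10):
    """Interleave the bits of (x, y, z) into one octree (Morton) code."""
    code = 0
    for i in range(bits):
        code |= ((x >> i) & 1) << (3 * i)
        code |= ((y >> i) & 1) << (3 * i + 1)
        code |= ((z >> i) & 1) << (3 * i + 2)
    return code

# Build the input map: code -> voxel index.
voxels = [(1, 2, 3), (4, 5, 6), (2, 2, 3)]
vmap = {morton3d(*v): i for i, v in enumerate(voxels)}

# Map search for a 3x3x3 SpConv kernel around one output voxel.
cx, cy, cz = 2, 2, 3
for dx in (-1, 0, 1):
    for dy in (-1, 0, 1):
        for dz in (-1, 0, 1):
            idx = vmap.get(morton3d(cx + dx, cy + dy, cz + dz))
            if idx is not None:
                print((dx, dy, dz), "->", idx)
```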