ISBN (electronic): 9798331529543
ISBN (print): 9798331529550
Food ingredient recognition has received considerable attention for its importance in health-related applications. However, challenges remain due to the complexity of food dishes, such as detecting ingredients of varying sizes and identifying multiple ingredients within a single image. To address these challenges, we propose a novel ingredient recognition network named Dual Discovery Integration Network (DDIN), which consists of two modules: the Region Discovery (RD) module, which uses deconvolution to obtain a probability distribution map for fine-grained region discovery, and the Category Discovery (CD) module, which uses an ingredient dictionary to capture multiple ingredient categories. Finally, the outputs of the RD and CD modules are fused to obtain the final prediction. Experimental results demonstrate that our model achieves state-of-the-art results in ingredient recognition on the Chinese food dataset Vireo Food-172, and that it outperforms existing methods with fewer parameters and lower computational complexity. Visualization of the discovered ingredient regions further shows the superiority of our method.
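To make the two-branch structure concrete, here is a minimal PyTorch sketch assuming a standard CNN backbone. The layer sizes, the max-pooled region scores, the averaging fusion, and the 353-way output (the ingredient label count of Vireo Food-172) are illustrative guesses, since the abstract does not specify them.

```python
import torch
import torch.nn as nn

class DDINSketch(nn.Module):
    def __init__(self, feat_dim=2048, num_ingredients=353):
        super().__init__()
        # RD branch: deconvolution upsamples backbone features into a
        # per-ingredient spatial probability map for fine-grained regions.
        self.deconv = nn.ConvTranspose2d(feat_dim, num_ingredients,
                                         kernel_size=4, stride=2, padding=1)
        # CD branch: a learnable "ingredient dictionary", one vector per category.
        self.dictionary = nn.Parameter(torch.randn(num_ingredients, feat_dim))

    def forward(self, feats):                 # feats: (B, C, H, W) from a backbone
        region_map = self.deconv(feats)                       # (B, K, 2H, 2W)
        rd_logits = region_map.flatten(2).max(dim=2).values   # strongest region per class
        pooled = feats.mean(dim=(2, 3))                       # (B, C) global feature
        cd_logits = pooled @ self.dictionary.t()              # match against dictionary
        # Fuse the two discoveries (a simple average here) for multi-label prediction.
        return torch.sigmoid((rd_logits + cd_logits) / 2)

scores = DDINSketch()(torch.randn(2, 2048, 7, 7))   # (2, 353) ingredient probabilities
```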
ISBN (print): 9783030638221; 9783030638238
Most existing attention-based methods for image captioning focus on the current visual and textual information at each step to generate the next word, without considering the coherence between the visual information and the text itself. We propose a sufficient visual information (SVI) module to supplement the visual information contained in the network, and a sufficient text information (STI) module that predicts additional words to supplement the textual information contained in the network. The SVI module embeds the attention values from the past two steps into the current attention to mimic human visual coherence. The STI module predicts the next three words in one step and jointly uses their probabilities for inference. Finally, this paper combines the two modules into an image captioning algorithm based on the sufficient visual information and text information model (SVITI), which further integrates existing visual information and future textual information in the network, thereby improving captioning performance. Applied to a classic image captioning algorithm, these methods achieve significant performance improvements over recent methods on the MS COCO dataset.
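A minimal sketch of the SVI idea of embedding past attention into the current step, assuming a simple learned gate over the attended contexts of steps t, t-1, and t-2; the abstract does not specify the mixing mechanism, so the gating below is hypothetical. (A corresponding STI sketch would add extra word classifiers whose probabilities are combined at inference.)

```python
import torch
import torch.nn as nn

class SVIAttention(nn.Module):
    """Blend the current attention context with those of the two previous steps."""
    def __init__(self, dim=512):
        super().__init__()
        self.gate = nn.Linear(3 * dim, 3)   # one mixing score per step (t, t-1, t-2)

    def forward(self, att_t, att_prev1, att_prev2):
        # Each input is a (B, dim) attended visual context from the decoder.
        stacked = torch.stack([att_t, att_prev1, att_prev2], dim=1)   # (B, 3, dim)
        w = torch.softmax(self.gate(stacked.flatten(1)), dim=-1)      # (B, 3) weights
        # Coherent context: convex combination of the three steps' contexts.
        return (w.unsqueeze(-1) * stacked).sum(dim=1)                 # (B, dim)
```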
Recently, image compression based on Implicit Neural Representation (INR) has gained interest. However, the performance of INR-based methods is poor because MLPs cannot fit high-frequency information well. A hybrid INR method was recently proposed that transmits not only the parameters of the MLP but also some external latents to help overfitting at the decoder. Despite its low MACs (multiply-accumulate operations per pixel), decoding is still slow due to the serial autoregressive context model used for entropy coding. Hence, we propose two methods to improve the hybrid INR method's decoding speed and performance. First, the Subpixel Context Model (SAM) is designed to speed up entropy coding: the proposed SAM decodes latent by latent instead of pixel by pixel. Second, we add an iterative pruner to the hybrid INR compression pipeline to further reduce the bit rate. The performance and MACs of our approach are on par with the state of the art among hybrid INR methods, while decoding is about eight times faster on CPU and three times faster on GPU.
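The iterative pruner lends itself to a short sketch. Assuming PyTorch's built-in magnitude pruning and a user-supplied train_step that re-overfits the image, a plausible loop looks like this; the number of rounds and the pruning fraction are assumptions, not the paper's settings.

```python
import torch
import torch.nn.utils.prune as prune

def iterative_prune(mlp, train_step, rounds=5, amount=0.2, finetune_iters=1000):
    """Alternate magnitude pruning with re-overfitting of the image."""
    for _ in range(rounds):
        for module in mlp.modules():
            if isinstance(module, torch.nn.Linear):
                # Zero out the `amount` fraction of smallest-magnitude weights;
                # fewer nonzero parameters then need to be transmitted.
                prune.l1_unstructured(module, name="weight", amount=amount)
        for _ in range(finetune_iters):
            train_step(mlp)   # user-supplied step that fits the MLP to the image
    return mlp
```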
Entropy modeling plays an important role in estimating the rates of latent representations and optimizing the rate-distortion performance of learned image compression. Autoregression modules have been shown to eliminate spatial and channel-wise redundancy of latent representations in fixed-rate learned image compression. However, autoregression cannot be achieved efficiently in progressive coding due to the high computational complexity of element-wise probability prediction. In this paper, we propose a learned progressive image compression method that enables spatial autoregression for entropy modeling. Specifically, we develop a novel codeword alignment scheme that prevents coding redundancy and achieves efficient autoregression of latent representations across quality layers. Consequently, conditional probability estimation for latent prediction can be performed with spatial autoregression in a layer-wise manner. We further extend the proposed method with dead-zone quantizers to obtain improved rate-distortion performance. The proposed method is a successful attempt to enable spatial autoregression in learned progressive coding and further bridges the performance gap with fixed-rate models. Experimental results show that it outperforms traditional methods such as JPEG and BPG, as well as the recent fine-grained learned progressive coding models DPICT and PLONQ, in terms of rate-distortion performance.
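Since the method relies on dead-zone quantizers, a generic worked example may help; the step size and dead-zone parameter below are illustrative, not the paper's settings.

```python
import torch

def deadzone_quantize(x, step=1.0, zeta=0.3):
    # q = sign(x) * floor(|x|/step + zeta). With zeta = 0.5 this is plain
    # round-to-nearest; zeta < 0.5 widens the zero bin (the "dead zone"),
    # producing more cheap-to-code zeros at a small distortion cost.
    return torch.sign(x) * torch.floor(torch.abs(x) / step + zeta)

def deadzone_dequantize(q, step=1.0, zeta=0.3):
    # Reconstruct nonzero symbols at their bin midpoints: (|q| + 0.5 - zeta) * step.
    mid = (torch.abs(q) + 0.5 - zeta) * step
    return torch.where(q == 0, torch.zeros_like(q), torch.sign(q) * mid)

x = torch.tensor([0.6, 0.8, -1.5])
q = deadzone_quantize(x)            # [0., 1., -1.]  (0.6 falls in the dead zone)
x_hat = deadzone_dequantize(q)      # [0.0, 1.2, -1.2]
```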
Image generation from scene graphs has traditionally focused on first predicting a layout from the scene graph using graph convolutional networks, then converting the layout to an image. These methods might involve co...
ISBN (print): 9781728176499
In order to obtain more accurate information in the analysis of visual information, this paper combines spatial relationships and uses two image and video processing tools, PS and AE, to establish a "scale model", a "cross-ratio theorem model", a "similarity model of viewing-angle distance", and a "velocity measurement model of longitudinal-plane translation", and designs a "viewing-angle solution algorithm". The results show that, with the above models and algorithms, mathematical tools such as MATLAB and Python 3 can efficiently, accurately, and quickly calculate the height, floor area, and road length of buildings in 3D video information, and obtain the horizontal and longitudinal viewing angles of the drone lens as well as its flying height and speed.
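The "cross-ratio theorem model" rests on a single projective invariant, which a short self-contained example can demonstrate: the cross-ratio of four collinear points is preserved under perspective projection, so a fourth real-world distance can be recovered from pixel measurements and three known positions. All coordinates below are made up for illustration.

```python
def cross_ratio(a, b, c, d):
    # Cross-ratio of four collinear points, invariant under perspective projection.
    return ((c - a) * (d - b)) / ((c - b) * (d - a))

# Pixel coordinates of four collinear road markers measured in the image:
cr_img = cross_ratio(100.0, 220.0, 310.0, 400.0)      # = 1.4

# Known real-world positions (metres) of the first three markers; recover the
# fourth by matching the cross-ratio (bisection; it is monotone in d here).
A, B, C = 0.0, 10.0, 20.0
lo, hi = C + 1e-6, 1e6
for _ in range(200):
    mid = (lo + hi) / 2
    if cross_ratio(A, B, C, mid) > cr_img:
        hi = mid
    else:
        lo = mid
print(round(lo, 2))   # 33.33: the fourth marker lies ~33.33 m from the first
```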
A CNN (Convolutional Neural Network) is an artificial neural network used to evaluate visual pictures. It is used for visual image processing and is categorised as a deep neural network in deep learning. So, using rea...
In the visual question answering task, the dominant approach recently has been to pre-train a unified model and then fine-tune it. This unified model typically uses a transformer to fuse image and text information. To optimize performance on the visual question answering task, this paper proposes a transformer architecture based on a collaborative multi-head attention mechanism, addressing the key/value projection redundancy problem in the transformer's multi-head attention. In addition, this paper uses the Swin Transformer model as the image feature extractor to extract multi-scale image information. Validation experiments are conducted on the VQA v2 dataset, and the results show that applying the collaborative multi-head attention approach and the Swin Transformer backbone to the visual question answering model effectively improves accuracy on the visual question answering task.
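As an illustration of the redundancy-sharing idea, the sketch below follows the collaborative attention formulation of Cordonnier et al. ("Multi-Head Attention: Collaborate Instead of Concatenate"), where heads share one query/key projection and differ only in cheap per-head mixing vectors. The dimensions are illustrative, and this is not necessarily the exact variant used in the paper.

```python
import torch
import torch.nn as nn

class CollaborativeAttention(nn.Module):
    """Heads share one query/key projection, re-weighted by per-head mixing vectors."""
    def __init__(self, dim=512, heads=8, shared_dim=256):
        super().__init__()
        self.wq = nn.Linear(dim, shared_dim, bias=False)    # shared by all heads
        self.wk = nn.Linear(dim, shared_dim, bias=False)    # shared by all heads
        self.wv = nn.Linear(dim, dim, bias=False)
        self.mix = nn.Parameter(torch.ones(heads, shared_dim))  # per-head mixing
        self.heads, self.scale = heads, shared_dim ** -0.5

    def forward(self, x):                                   # x: (B, N, dim)
        B, N, _ = x.shape
        q, k = self.wq(x), self.wk(x)                       # (B, N, shared_dim)
        v = self.wv(x).view(B, N, self.heads, -1).transpose(1, 2)      # (B, H, N, dv)
        # Each head reuses the shared projections, rescaled by its mixing vector.
        qh = q.unsqueeze(1) * self.mix.view(1, self.heads, 1, -1)      # (B, H, N, Dk)
        attn = torch.softmax(qh @ k.unsqueeze(1).transpose(-2, -1) * self.scale, dim=-1)
        return (attn @ v).transpose(1, 2).reshape(B, N, -1)            # (B, N, dim)
```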
This paper presents an approach to applying the quantum Fourier transform (QFT) to image processing using quantum computing. The use of quantum computing for image analysis and processing is becoming increasingly relevant in...
As video conferencing becomes an indispensable part of daily life, achieving a high-fidelity calling experience under low bandwidth has become a popular and challenging problem. Deep generative models have great potential for low-bandwidth facial video compression due to their excellent generation capability from abridged information. Nevertheless, existing deep generation-based compression methods tend to handle motion information in purely 2D or pseudo-3D space, causing facial distortion when large head poses are encountered. In this paper, we propose a 3D-aware high-fidelity facial video conferencing system based on a parameterized NeRF-based face model. Through compression of the parameterized face model and transmission of the extracted facial parameters, we achieve high-fidelity talking-head synthesis for video conferencing at an ultra-low bitrate. Additionally, the 3D perception capability of the system allows viewpoint control over the head, achieving higher interactivity and practicability. Extensive experiments verify the effectiveness of the proposed 3D-aware high-fidelity free-view facial video conferencing system.
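A schematic of the parameter-transmission loop implied by the abstract, with a back-of-the-envelope bitrate. The component names, the parameter count, and the quantization scheme are hypothetical, chosen only to show why sending a per-frame parameter vector instead of pixels yields an ultra-low bitrate.

```python
import numpy as np

PARAMS_PER_FRAME = 79   # hypothetical expression + pose coefficient count

def sender(frame, extractor):
    p = extractor(frame)                        # (79,) float32 facial parameters
    return np.round(p * 512).astype(np.int16)   # coarse 16-bit quantization

def receiver(code, renderer, viewpoint):
    p = code.astype(np.float32) / 512
    # Free-view synthesis: a 3D-aware model can re-render at any viewpoint.
    return renderer(p, viewpoint)

# Illustrative budget: 79 params * 16 bits * 30 fps ~ 38 kbps before entropy
# coding, orders of magnitude below conventional video at comparable fidelity.
```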