Recently, multimodal relation extraction (MRE) and multimodal named entity recognition (MNER) have attracted widespread attention. However, prior research has encountered challenges including inadequate semantic representation of images, cross-modal information fusion, and the irrelevance of some images to their accompanying text. To enhance semantic representation, we employ CLIP's image encoder, a vision transformer (ViT), to generate visual features representing different semantic intensities. To address cross-modal semantic gaps, we introduce an image caption generation model and BERT to sequentially generate image captions and their features, transforming both modalities into text. Dynamic gates and attention mechanisms are introduced to efficiently fuse the visual features, image caption features, and text features, mitigating noise from image-text irrelevance. On this basis, we construct an efficient MRE and MNER model. The experimental results demonstrate that the proposed model achieves improvements of 0.18% to 2.2% on the MRE and MNER datasets. Our code is available at https://***/SiweiWei6/VIT-CMNet.
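As an illustration of the dynamic-gate fusion described above, here is a minimal PyTorch sketch; the module names, feature dimensions, and the exact gating formula are assumptions, since the abstract does not specify them:

import torch
import torch.nn as nn

class GatedFusion(nn.Module):
    """Fuse text features with an auxiliary modality via a learned gate."""
    def __init__(self, dim: int):
        super().__init__()
        # The gate decides, per dimension, how much auxiliary signal to admit,
        # which helps suppress noise from text-irrelevant images.
        self.gate = nn.Linear(2 * dim, dim)
        self.proj = nn.Linear(dim, dim)

    def forward(self, text_feat, aux_feat):
        # text_feat, aux_feat: (batch, seq_len, dim)
        g = torch.sigmoid(self.gate(torch.cat([text_feat, aux_feat], dim=-1)))
        return text_feat + g * self.proj(aux_feat)

# Toy usage: fuse ViT visual features, then BERT caption features, into the text features.
dim = 768
fuse_visual, fuse_caption = GatedFusion(dim), GatedFusion(dim)
text = torch.randn(2, 32, dim)      # BERT features of the input text
visual = torch.randn(2, 32, dim)    # ViT patch features projected to text space
caption = torch.randn(2, 32, dim)   # BERT features of the generated caption
fused = fuse_caption(fuse_visual(text, visual), caption)
print(fused.shape)  # torch.Size([2, 32, 768])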
Synthesizing photographic images from given text descriptions is a challenging problem. Although many previous studies have made significant progress on the visual quality of the generated images by using multi-stage and attentional networks, they ignore the interrelationships between the images generated by the generator at each stage and simply apply the attention mechanism. In this paper, the Photographic Text-to-image Generation with Pyramid Contrastive Consistency Model (PCCM-GAN) is proposed to generate photographic images. PCCM-GAN introduces two modules: a Pyramid Contrastive Consistency Model (PCCM) and a stack attention model (Stack-Attn). Based on the images generated at the different stages, PCCM computes a contrastive loss for training the generator. Stack-Attn concentrates on generating images with more detail and better semantic consistency by stacking the global-local attention mechanism. Visual inspection of the inner products of PCCM and Stack-Attn is also performed to validate their effectiveness. Extensive experiments and ablation studies on the CUB and MS-COCO datasets demonstrate the superiority of the proposed method. (c) 2021 Published by Elsevier B.V.
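The abstract does not give PCCM's exact contrastive loss; below is a minimal sketch of one plausible instantiation, an InfoNCE-style loss over features of images generated at different stages (the function name and temperature value are assumptions):

import torch
import torch.nn.functional as F

def stage_contrastive_loss(feat_lo, feat_hi, temperature: float = 0.1):
    """InfoNCE-style loss: features of the low- and high-resolution images
    generated for the same caption are positives; other samples in the
    batch are negatives. feat_lo, feat_hi: (batch, dim)."""
    z1 = F.normalize(feat_lo, dim=-1)
    z2 = F.normalize(feat_hi, dim=-1)
    logits = z1 @ z2.t() / temperature   # (batch, batch) similarity matrix
    labels = torch.arange(z1.size(0))    # positives lie on the diagonal
    return F.cross_entropy(logits, labels)

# Toy usage: pooled features of the 64x64 and 256x256 generator outputs.
loss = stage_contrastive_loss(torch.randn(8, 256), torch.randn(8, 256))
print(loss.item())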
In this article, the combined convolution-lifting scheme is explored to address the design issues of 2-D discrete wavelet transform (DWT) structures. We find that the combined convolution-lifting scheme of type-1 (convolution followed by lifting) is more suitable than pure convolution or lifting schemes for designing 2-D DWT structures with less on-chip memory. Furthermore, canonic signed digit (CSD)-based multiplier-less designs are presented for convolution-DWT and lifting-DWT using 9/7 biorthogonal filters, and they have identical resource requirements for 12-bit coefficients. The proposed multiplier-less designs of convolution-DWT and lifting-DWT are used to derive a 2-D DWT structure that takes advantage of the combined convolution-lifting scheme. The comparison results show that the proposed combined 2-D DWT structure involves 24x less area-delay product (ADP) and 17x less energy per image (EPI) compared with the best of the existing fractional wavelet transform (FrWT)-based structures, and provides reconstructed images with 14 dB higher peak signal-to-noise ratio (PSNR). Compared with the recently proposed approximate lifting (ALF) 2-D DWT structure, the proposed combined 2-D DWT structure involves 4.5x less ADP, 2.2x less EPI, and 4N words less on-chip memory, and provides reconstructed images with PSNR higher by 7 dB, where N is the image width or height. Therefore, the proposed combined 2-D DWT structure is a better alternative to the existing 2-D DWT structures for low-complexity, low-memory realization of 2-D DWT, especially for visual sensor node applications.
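For reference, a minimal NumPy sketch of the 1-D 9/7 lifting steps that such lifting-DWT hardware implements (this is a software illustration of the standard CDF 9/7 lifting factorization, not the paper's hardware structure; periodic boundary extension is assumed for brevity):

import numpy as np

# Standard CDF 9/7 lifting coefficients (Daubechies & Sweldens factorization).
ALPHA, BETA  = -1.586134342, -0.052980118
GAMMA, DELTA =  0.882911076,  0.443506852
ZETA = 1.149604398  # scaling factor

def lifting_dwt97_1d(x):
    """One level of the 1-D 9/7 DWT via lifting. len(x) must be even.
    Returns (approximation, detail) coefficients."""
    s, d = x[0::2].astype(float).copy(), x[1::2].astype(float).copy()
    d += ALPHA * (s + np.roll(s, -1))   # predict step 1 (periodic extension)
    s += BETA  * (d + np.roll(d, 1))    # update step 1
    d += GAMMA * (s + np.roll(s, -1))   # predict step 2
    s += DELTA * (d + np.roll(d, 1))    # update step 2
    return ZETA * s, d / ZETA           # scaling step

# A separable 2-D DWT applies this transform along rows, then along columns.
approx, detail = lifting_dwt97_1d(np.arange(16))
print(approx.shape, detail.shape)  # (8,) (8,)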
Purpose: This paper aims to solve the problem of the low assembly success rate of 3C assembly lines designed with classical control algorithms, caused by inevitable random disturbances and other factors; by incorporating intelligent algorithms into the assembly line, the assembly process can be extended to uncertain assembly scenarios.

Design/methodology/approach: This work proposes a reinforcement learning framework based on digital twins. First, the authors used Unity3D to build a simulation environment that matches the real scene and achieved data synchronization between the real and simulation environments through the Robot Operating System. Then, the authors trained the reinforcement learning model in the simulation environment. Finally, by creating a digital twin environment, the authors transferred the skill learned in simulation to the real environment and achieved stable algorithm deployment in real-world scenarios.

Findings: In this work, the authors completed the transfer of skill-learning algorithms from virtual to real environments by establishing a digital twin environment. On the one hand, the experiments prove the progressiveness of the algorithm and the feasibility of applying digital twins to reinforcement learning transfer. On the other hand, the experimental results also provide a reference for the application of digital twins in 3C assembly lines.

Originality/value: In this work, the authors designed a new encoder structure in the simulation environment to encode image information, which improved the model's perception of the environment. At the same time, the authors combined a fixed strategy with the reinforcement learning strategy to learn skills, which improved the rate of convergence and the stability of skill learning. Finally, the authors transferred the learned skills to the physical platform through digital twin technology and realized safe operation of the flexible printed circuit assembly task.
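The abstract mentions combining a fixed strategy with a reinforcement learning strategy; a minimal sketch of one common way to realize this, residual policy learning, where the learned policy corrects a hand-designed controller (the observation layout, gains, and function names are hypothetical):

import numpy as np

def fixed_policy(obs):
    """Hand-designed base controller: a simple proportional move toward the
    target assembly pose (hypothetical layout: first 3 entries of obs are
    the position error)."""
    return np.clip(-1.0 * obs[:3], -0.05, 0.05)

class ResidualAgent:
    """Combine the fixed strategy with a learned RL correction.
    `rl_policy` would be a trained network; here it is a stub returning zeros."""
    def __init__(self, rl_policy=lambda obs: np.zeros(3)):
        self.rl_policy = rl_policy

    def act(self, obs):
        # The RL term learns only the residual the fixed controller misses,
        # which can speed up convergence and stabilize skill learning.
        return fixed_policy(obs) + self.rl_policy(obs)

agent = ResidualAgent()
print(agent.act(np.array([0.01, -0.02, 0.0, 0.0, 0.0])))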
Recently, the transformer has been applied to image caption models, in which a convolutional neural network and the transformer encoder act as the image encoder of the model, and the transformer decoder acts as the decoder. However, the transformer may suffer from interference by the non-critical objects of a scene and have difficulty fully capturing image information, owing to the dense characteristics of its self-attention mechanism. In this Letter, to address this issue, the authors propose a novel transformer model with decreasing attention gates and an attention fusion module. Specifically, they first use attention gates to force the transformer to overcome the interference of non-critical objects and capture object information more efficiently, by truncating all attention weights smaller than the gate threshold. Second, by inheriting the attention matrix from the previous layer at each network layer, the attention fusion module enables each layer to consider other objects without losing the most critical ones. The method is evaluated on the benchmark Microsoft COCO dataset and achieves better performance than state-of-the-art methods.
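A minimal PyTorch sketch of the attention-gate idea described above: attention weights below a gate threshold are truncated to zero, and a layer's attention matrix can be fused with one inherited from the previous layer (the threshold values, the renormalization step, and the 50/50 mixing weight are assumptions):

import torch

def gated_attention(q, k, v, gate_threshold: float = 0.01):
    """Scaled dot-product attention whose weights below the gate threshold
    are truncated to zero, suppressing non-critical objects.
    q, k, v: (batch, heads, seq, dim)."""
    attn = torch.softmax(q @ k.transpose(-2, -1) / q.size(-1) ** 0.5, dim=-1)
    attn = torch.where(attn < gate_threshold, torch.zeros_like(attn), attn)
    attn = attn / attn.sum(dim=-1, keepdim=True).clamp_min(1e-8)  # renormalize
    return attn @ v, attn

# Attention fusion: a layer mixes its own attention matrix with the one
# inherited from the previous layer (here simulated with a second call).
q = k = v = torch.randn(2, 8, 49, 64)
out, attn = gated_attention(q, k, v)
_, attn_prev = gated_attention(q, k, v, gate_threshold=0.02)
fused_attn = 0.5 * attn + 0.5 * attn_prev
print(out.shape, fused_attn.shape)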
Synthesizing images from text means producing images whose content faithfully reflects a given text description, an extremely demanding task whose central problems are content consistency and visual realism. Owing to the considerable progress of GANs, it is now possible to produce images with good visual fidelity. The translation of text descriptions into images with high content reliability, on the other hand, is still a work in progress. This paper frames a novel text-to-image synthesis approach that includes two major phases: (1) text-to-image encoding and (2) a GAN. Initially, during text-to-image encoding, cross-modal feature alignment takes place between text and image features. A Bi-LSTM is deployed to transform the text embedding into a feature vector. In the second stage, the image is synthesized based on this encoding: the text feature groups are given as input to the GAN, which produces the final synthesized images. Finally, the superiority of the developed approach is examined via evaluation against existing techniques.
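A minimal PyTorch sketch of the Bi-LSTM text-encoding stage, which transforms a tokenized caption into the feature vector fed to the GAN (vocabulary size, dimensions, and the mean-pooling choice are assumptions):

import torch
import torch.nn as nn

class BiLSTMTextEncoder(nn.Module):
    """Encode a tokenized caption into a sentence feature vector that
    conditions the GAN generator."""
    def __init__(self, vocab_size=5000, embed_dim=300, hidden_dim=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True,
                            bidirectional=True)

    def forward(self, token_ids):
        # token_ids: (batch, seq_len)
        out, _ = self.lstm(self.embed(token_ids))
        # Mean-pool the bidirectional hidden states into one sentence vector.
        return out.mean(dim=1)  # (batch, 2 * hidden_dim)

encoder = BiLSTMTextEncoder()
sentence_feat = encoder(torch.randint(0, 5000, (4, 18)))
print(sentence_feat.shape)  # torch.Size([4, 256]) -> fed to the generator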
ISBN (print): 9783642181337
In this paper, a 4-codebook vector quantization (VQ) core is implemented on an FPGA (field-programmable gate array). The proposed design has certain advantages over earlier architectures in the form of design reuse of the VQ core to build a large VQ system. The proposed core aims at increased compression speed, a modular design for design flexibility, and easy reconfigurability. Modularity allows flexible design changes for VQ with different codebook sizes and hence controls the recovered image quality. In general, the new VQ core meets the specific and challenging needs of a single-function, tightly constrained real-time VQ encoder. The synthesis results show that a speed-up of 5 is achieved. Experiments and analyses indicate that the design can satisfy the performance requirement of 30 image frames per second for real-time image processing. The proposed VQ requires more memory and implements a VQ encoder with codebook sizes that are multiples of 4.
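A minimal NumPy sketch of what a VQ encoder core computes, nearest-codeword search over image blocks (the block size, codebook contents, and function names are illustrative; the paper's contribution is the FPGA architecture, not this algorithm):

import numpy as np

def vq_encode(blocks, codebook):
    """Map each image block to the index of its nearest codeword
    (squared Euclidean distance), as a VQ encoder core does in hardware.
    blocks: (n, k) flattened image blocks; codebook: (c, k)."""
    d = ((blocks[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    return d.argmin(axis=1)  # (n,) codeword indices = compressed stream

def vq_decode(indices, codebook):
    """Reconstruct blocks by codebook lookup (the decoder's only job)."""
    return codebook[indices]

# Toy usage: 4x4 pixel blocks flattened to 16 values, codebook of 4 codewords.
rng = np.random.default_rng(0)
codebook = rng.normal(size=(4, 16))
blocks = rng.normal(size=(100, 16))
idx = vq_encode(blocks, codebook)
recon = vq_decode(idx, codebook)
print(idx[:8], recon.shape)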
The single-modal information retrieval pattern is gradually becoming unable to meet growing information processing needs. Cross-modal retrieval based on deep learning, as a new information retrieval scheme, is gradually receiving more attention. To address the potential issue of imprecise text queries in cross-modal retrieval, an iterative query-based cross-modal retrieval model is proposed. The model is divided into four modules: image feature extraction, text feature extraction, matching and ranking, and query reinforcement. The model first extracts features of images and text through deep learning models, then matches and retrieves image-text features through the image-text stacked cross-attention algorithm. Finally, in the query reinforcement module, the most distinctive object category in the retrieval results is obtained through deep reinforcement learning for user confirmation, thereby increasing text richness and improving retrieval performance.
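A minimal PyTorch sketch of a stacked cross-attention matching score in the style of SCAN, which the matching module's image-text stacked cross-attention algorithm resembles (the temperature and pooling choices are assumptions):

import torch
import torch.nn.functional as F

def stacked_cross_attention_score(words, regions, temperature: float = 9.0):
    """Attend each word over the image regions, then score the image-text
    pair by the average word-to-context similarity.
    words: (n_words, dim); regions: (n_regions, dim)."""
    w = F.normalize(words, dim=-1)
    r = F.normalize(regions, dim=-1)
    attn = torch.softmax(temperature * (w @ r.t()), dim=-1)  # word -> regions
    context = attn @ r                                       # attended image context
    return F.cosine_similarity(w, context, dim=-1).mean()

# Toy usage: rank two candidate images for one text query.
words = torch.randn(12, 512)
scores = [stacked_cross_attention_score(words, torch.randn(36, 512))
          for _ in range(2)]
print(scores)  # higher score = better match; top results go on to reranking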
Nowadays, media overload is a common scenario around the world. Its prevalence grants both individuals and governmental entities the ability to shape public opinion, highlighting the need to deploy effective fake news detection methods. In this paper, we propose a novel model named GraMuFeN for detecting fake news posted by users on Twitter and Weibo. The model is designed to detect fake news using both the textual and image data accompanying each piece of news. We utilize graph convolutional neural networks (GCN) as the text encoder and convolutional neural networks (CNN) as the image encoder, together with a supervised contrastive loss, aiming to develop a model that is much lighter in terms of trainable parameters and easier to train while achieving higher performance than previous works. Our evaluations on two different benchmarks show a promising 10% improvement in micro F1 score and a 50% reduction in the model's trainable parameters.
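A minimal PyTorch sketch of a supervised contrastive loss of the kind used to train GraMuFeN's encoders (the temperature and the fusion of text and image features are assumptions; the loss follows Khosla et al., 2020):

import torch
import torch.nn.functional as F

def supervised_contrastive_loss(features, labels, temperature: float = 0.07):
    """Supervised contrastive loss: samples sharing a label (fake/real) are
    pulled together, others pushed apart.
    features: (batch, dim) fused text+image embeddings; labels: (batch,)."""
    z = F.normalize(features, dim=-1)
    sim = z @ z.t() / temperature
    eye = torch.eye(len(labels), dtype=torch.bool)
    sim.masked_fill_(eye, float('-inf'))       # exclude self-similarity
    pos = (labels[:, None] == labels[None, :]) & ~eye
    log_prob = sim - torch.logsumexp(sim, dim=1, keepdim=True)
    # Average log-probability over each anchor's positives.
    pos_log_prob = log_prob.masked_fill(~pos, 0.0)
    return -(pos_log_prob.sum(1) / pos.sum(1).clamp_min(1)).mean()

# Toy usage: fused multimodal features for 8 posts; label 1 = fake, 0 = real.
loss = supervised_contrastive_loss(torch.randn(8, 128),
                                   torch.tensor([0, 1, 0, 1, 1, 0, 0, 1]))
print(loss.item())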