ISBN: (print) 9781510657274; 9781510657267
Image captioning based on deep learning spans two major domains: computer vision and natural language processing. The Transformer architecture has achieved leading performance in natural language processing, and studies that use Transformers as the image captioning encoder and decoder have reported better performance than previous solutions. Positional encoding is an essential part of the Transformer. The Rotary Transformer (RoFormer), which introduced Rotary Position Embedding (RoPE), has achieved comparable or superior performance on various language modeling tasks. Limited work has been done to adapt the RoFormer architecture to image captioning. This study investigates the positional encoding of the Transformer architecture; the proposed model consists of a modified RoFormer as the encoder and BERT as the decoder. With extracted features as inputs, together with some training tricks, the model achieves similar or better performance on the MSCOCO dataset compared to "CNN+RNN" models and regular Transformer solutions.
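As an illustration of the rotary position embedding mentioned in this abstract, the following is a minimal sketch of the standard RoPE formulation, not the modified RoFormer encoder of the paper, whose details the abstract does not give. The function names `rope` and `dot`, the base of 10000, and the toy vectors are assumptions made for illustration. Each consecutive pair of features is rotated by an angle proportional to the token position, so the dot product between a rotated query and a rotated key depends only on their relative offset.

```python
import math


def rope(x, position, base=10000.0):
    """Rotate consecutive feature pairs of x by position-dependent angles."""
    d = len(x)
    if d % 2 != 0:
        raise ValueError("RoPE needs an even embedding dimension")
    out = []
    for i in range(d // 2):
        # Pair i uses frequency base^(-2i/d), scaled by the token position.
        theta = position * base ** (-2.0 * i / d)
        c, s = math.cos(theta), math.sin(theta)
        x0, x1 = x[2 * i], x[2 * i + 1]
        # Standard 2D rotation of the pair (x0, x1) by angle theta.
        out.extend([x0 * c - x1 * s, x0 * s + x1 * c])
    return out


def dot(a, b):
    """Plain dot product of two equal-length vectors."""
    return sum(p * q for p, q in zip(a, b))


# Hypothetical toy query and key vectors (illustrative values only).
q = [0.5, -1.0, 2.0, 0.25]
k = [1.5, 0.3, -0.7, 1.0]

# Both pairs of positions have the same relative offset of 2.
score_a = dot(rope(q, 3), rope(k, 1))
score_b = dot(rope(q, 10), rope(k, 8))
```

In this sketch `score_a` and `score_b` agree up to floating-point error, which is the relative-position property that makes RoPE attractive inside attention layers.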
Flying rocks in mine blasting operations pose a serious threat to safe production, and a reasonable delineation of the safe blasting range is of great significance for production operations. In response to the prob...
The automotive sector aims to reduce the production of scraps or burrs, which requires the adoption of a deburring procedure and the utilization of inspection techniques. The existing inspection procedures involve the...
With new techniques and computer technologies emerging, biometric recognition systems have gained immense popularity among the masses. Among all biometric recognition systems, face detection has been one of ...
Predicting brain age using Magnetic Resonance Imaging (MRI) and its difference from chronological age is useful for detecting Alzheimer's disease in the early stages. For having accurate brain age prediction with M...
The application of intelligent image recognition technology in daily life is increasingly extensive, especially in the fields of computing and multimedia, and research on machine vision systems is becoming more and more matur...
Extraction of discriminative features is an efficient step in any classification problem, such as synthetic aperture radar (SAR) image classification. Polarimetric SAR (PolSAR) images with rich spatial features in two...
Aiming at the surface defects of packaging and printing products caused in the process of packaging and printing production, a method based on machine vision to detect the scratch defects on the surface of printing pr...
ISBN: (digital) 9781837242672
Research Background: Transformers, initially developed for natural language processing (NLP), gained prominence with their effective handling of arbitrarily long sequential data through a sequence-to-sequence model. Their self-attention mechanism, a core component, demonstrated remarkable success beyond NLP and significantly impacted computer vision. This led to the innovative adaptation of transformers in visual contexts, culminating in the creation of Vision Transformers (ViTs). ViTs revolutionized image processing by treating images as sequences of patches, applying self-attention mechanisms directly to pixels. This Paper's Contributions: This paper aims to conduct a comprehensive review of the evolution of transformers in image processing, tracing the journey from initial experiments in training transformers on images to the latest advancements in hierarchical architectures and multi-scale ViTs. This survey not only highlights the superior performance and flexibility of ViTs across various computer vision tasks but also discusses their promising applications in diverse fields such as medical imaging and robotics. This review underscores the transformative impact of ViTs and outlines potential future directions in this dynamic field.
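The idea of treating an image as a sequence of patches can be sketched as follows. This is a minimal illustration under stated assumptions, not code from the surveyed works: the function name `image_to_patches`, the single-channel nested-list image, and the 4x4 toy input are all hypothetical. Each non-overlapping patch is flattened in row-major order, and the resulting list of patches is the token sequence that a ViT would then project linearly and feed to self-attention.

```python
def image_to_patches(image, patch_size):
    """Split a 2D image (list of rows) into flattened non-overlapping patches."""
    height, width = len(image), len(image[0])
    if height % patch_size or width % patch_size:
        raise ValueError("image size must be divisible by patch_size")
    patches = []
    # Walk the patch grid row by row, left to right.
    for top in range(0, height, patch_size):
        for left in range(0, width, patch_size):
            # Flatten one patch in row-major order.
            patch = [
                image[r][c]
                for r in range(top, top + patch_size)
                for c in range(left, left + patch_size)
            ]
            patches.append(patch)
    return patches


# Hypothetical 4x4 single-channel image whose pixel value encodes its index.
image = [[r * 4 + c for c in range(4)] for r in range(4)]
patches = image_to_patches(image, 2)
```

With a patch size of 2, the 4x4 toy image yields a sequence of four tokens, each holding four pixel values.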
Authors: Qiao, Lang; Hu, Wenjin
Affiliations:
National Languages Information Technology Boulder, Northwest Minzu University, Key Laboratory of China's Ethnic Languages and Information Technology of Ministry of Education, Gansu, Lanzhou 730000, China
School of Mathematics and Computer Science, Northwest Minzu University, Key Laboratory of China's Ethnic Languages and Information Technology of Ministry of Education, Gansu, Lanzhou 730000, China
Image captioning uses computer vision technology to extract the semantic content contained in an image and natural language processing technology to generate a reasonable text caption. This paper t...