Sketch-to-image synthesis aims to generate realistic images that exactly match input sketches or edge maps. Most known sketch-to-image synthesis methods use various generative adversarial networks (GANs) trained on numerous pairs of sketches and real images. Because of the locality of convolution, the low-level layers of the generators in these GANs lack global perception, so the feature maps they produce easily overlook global cues. Since a global receptive field is crucial for capturing the non-local structures and features of sketches, the absence of global context degrades the quality of the generated images. Some recent models turn to self-attention to build global dependencies; however, self-attention is impractical for large feature maps because its computational complexity is quadratic in the feature-map size. To address these problems, we propose Sketch2Photo, a new image synthesis approach that captures global contexts as well as local features to generate photo-realistic images from weak or partial sketches or edge maps. We employ fast Fourier convolution (FFC) residual blocks to create global receptive fields in the bottom layers of the network and incorporate Swin Transformer block (STB) units to efficiently obtain long-range global contexts for large feature maps. We also present an improved spatial attention pooling (ISAP) module to relax the strict alignment requirements between incomplete sketches and generated images. Quantitative and qualitative experiments on multiple public datasets demonstrate the superiority of the proposed approach over many other sketch-to-image synthesis methods. The project code is available at https://***/hengliusky/Skecth2Photo.
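To illustrate how a fast Fourier convolution block provides an image-wide receptive field in low-level layers, here is a minimal PyTorch sketch: a local 3x3 branch is combined with a global branch that applies a 1x1 convolution in the frequency domain. This is a simplified illustration under assumed channel layouts, not the authors' implementation; channel splitting, normalisation, and the Swin Transformer blocks of Sketch2Photo are omitted.

```python
# Minimal FFC-style residual block: a local conv branch plus a spectral
# (global) branch, so every output pixel can see the whole feature map.
import torch
import torch.nn as nn


class SpectralTransform(nn.Module):
    """1x1 convolution applied to the real-FFT spectrum of the feature map."""

    def __init__(self, channels: int):
        super().__init__()
        # Real and imaginary parts are stacked along the channel axis.
        self.conv = nn.Conv2d(2 * channels, 2 * channels, kernel_size=1)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, h, w = x.shape
        spec = torch.fft.rfft2(x, norm="ortho")                 # (b, c, h, w//2+1), complex
        spec = torch.cat([spec.real, spec.imag], dim=1)         # (b, 2c, h, w//2+1)
        spec = self.relu(self.conv(spec))
        real, imag = spec.chunk(2, dim=1)
        return torch.fft.irfft2(torch.complex(real, imag), s=(h, w), norm="ortho")


class FFCResBlock(nn.Module):
    """Residual block combining a local conv branch and a spectral (global) branch."""

    def __init__(self, channels: int):
        super().__init__()
        self.local_branch = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
        )
        self.global_branch = SpectralTransform(channels)
        self.fuse = nn.Conv2d(2 * channels, channels, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        out = torch.cat([self.local_branch(x), self.global_branch(x)], dim=1)
        return x + self.fuse(out)


if __name__ == "__main__":
    block = FFCResBlock(64)
    print(block(torch.randn(1, 64, 32, 32)).shape)  # torch.Size([1, 64, 32, 32])
```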
Visual scene understanding mainly depends on pixel-wise classification obtained from a deep convolutional neural network. However, existing semantic segmentation models often face difficulties in real-time applications due to their large network architectures. Although real-time semantic segmentation models are available, their shallow backbones can degrade performance considerably. This paper introduces SDBNetV2, a lightweight semantic segmentation model designed to improve real-time performance without increasing computational cost. A key contribution is a novel Short-term Dense Bottleneck (SDB) module in the encoder, which provides varied fields-of-view to capture the different geometrical objects in a complex scene. Additionally, we propose dense feature refinement and improved semantic aggregation modules at the decoder end to enhance contextualization and object localization. We evaluate the proposed model's performance on several indoor and outdoor datasets in structured and unstructured environments. The results show that SDBNetV2 achieves superior segmentation performance over other real-time models with fewer than 2 million parameters.
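A rough PyTorch sketch of a bottleneck that mixes dilated branches with short dense connections to obtain varied fields-of-view, in the spirit of the SDB module described above. The branch counts, dilation rates, and connection pattern here are assumptions for illustration, not the published SDBNetV2 design.

```python
# Illustrative dilated, densely connected bottleneck: each branch sees the
# reduced input plus all previous branch outputs and uses a larger dilation.
import torch
import torch.nn as nn


class DilatedDenseBottleneck(nn.Module):
    def __init__(self, in_ch: int, mid_ch: int = 32, dilations=(1, 2, 4)):
        super().__init__()
        self.reduce = nn.Conv2d(in_ch, mid_ch, kernel_size=1)
        self.branches = nn.ModuleList()
        for i, d in enumerate(dilations):
            self.branches.append(
                nn.Sequential(
                    nn.Conv2d(mid_ch * (i + 1), mid_ch, kernel_size=3,
                              padding=d, dilation=d),
                    nn.BatchNorm2d(mid_ch),
                    nn.ReLU(inplace=True),
                )
            )
        self.project = nn.Conv2d(mid_ch * (len(dilations) + 1), in_ch, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        feats = [self.reduce(x)]
        for branch in self.branches:
            # Dense ("short-term") connections: concatenate everything so far.
            feats.append(branch(torch.cat(feats, dim=1)))
        return x + self.project(torch.cat(feats, dim=1))


if __name__ == "__main__":
    m = DilatedDenseBottleneck(64)
    print(m(torch.randn(1, 64, 64, 64)).shape)  # torch.Size([1, 64, 64, 64])
```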
Wall segmentation is a special case of semantic segmentation in which each pixel is classified into one of two classes: wall and no-wall. The segmentation model returns a mask showing where walls are located, as well as objects such as windows and furniture. This article proposes a module structure for the semantic segmentation of walls in 2D images that effectively addresses the wall segmentation problem. The proposed model achieves higher accuracy and faster execution than other solutions. An encoder-decoder architecture is used for the segmentation module. A dilated ResNet50/101 network serves as the encoder, i.e., a ResNet50/101 network in which the last convolutional layers are replaced by dilated convolutional layers. A subset of the ADE20K dataset containing only interior images was used for model training, and a further subset of it was used for model evaluation. Three different approaches to model training were analyzed. On the validation dataset, the best approach, based on the proposed structure with the ResNet101 network, achieved an average pixel accuracy of 92.13% and an intersection over union (IoU) of 72.58%. Moreover, all proposed approaches can be applied to recognize other objects in images to solve specific tasks.
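As a concrete illustration of the "dilated ResNet101" encoder idea, the torchvision backbone can keep spatial resolution in its last stages by replacing stride with dilation. The decoder head below is a placeholder for demonstration only, not the article's module.

```python
# Dilated ResNet101 encoder (output stride 8) with a toy per-pixel head.
import torch
import torch.nn as nn
from torchvision.models import resnet101


class WallSegmenter(nn.Module):
    def __init__(self, num_classes: int = 2):
        super().__init__()
        backbone = resnet101(weights=None,
                             replace_stride_with_dilation=[False, True, True])
        # Drop the classification head; keep the convolutional trunk.
        self.encoder = nn.Sequential(*list(backbone.children())[:-2])
        self.head = nn.Conv2d(2048, num_classes, kernel_size=1)  # wall / no-wall

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        logits = self.head(self.encoder(x))
        # Upsample back to the input resolution for per-pixel prediction.
        return nn.functional.interpolate(logits, size=x.shape[-2:],
                                         mode="bilinear", align_corners=False)


if __name__ == "__main__":
    model = WallSegmenter()
    print(model(torch.randn(1, 3, 256, 256)).shape)  # torch.Size([1, 2, 256, 256])
```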
Recently, unmanned aerial vehicle (UAV) synthetic aperture radar (SAR) has become a highly sought-after topic for its wide applications in target recognition, detection, and tracking. However, SAR automatic target recognition (ATR) models based on deep neural networks (DNNs) suffer from adversarial examples. Non-cooperators rarely disclose any SAR-ATR model information, which makes adversarial attacks challenging. To tackle this issue, we propose a novel attack method called the Transferable Adversarial Network (TAN). It crafts highly transferable adversarial examples in real time and attacks SAR-ATR models without any prior knowledge, which is of great significance for real-world black-box attacks. The proposed method improves transferability via a two-player game in which two encoder-decoder models are trained simultaneously: a generator that crafts malicious samples through a one-step forward mapping from the original data, and an attenuator that weakens the effectiveness of malicious samples by capturing the most harmful deformations. In particular, compared with traditional iterative methods, the encoder-decoder model maps original samples to adversarial examples in a single step, enabling real-time attacks. Experimental results indicate that our approach achieves state-of-the-art transferability with acceptable adversarial perturbations and minimal time cost compared with existing attack methods, making real-time black-box attacks without any prior knowledge a reality.
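A hedged PyTorch sketch of the two-player game described above: a generator crafts a bounded perturbation in a single forward pass, while an attenuator tries to undo it, and both are trained against a surrogate classifier. The loss terms, weights, and perturbation budget are assumptions for illustration, not the TAN paper's exact formulation.

```python
# One training step of the generator-vs-attenuator game against a surrogate model.
import torch
import torch.nn as nn
import torch.nn.functional as F


def train_step(generator, attenuator, surrogate, x, y, opt_g, opt_a, eps=0.05):
    # --- Generator step: mislead the surrogate even after attenuation. ---
    delta = eps * torch.tanh(generator(x))          # one-step, bounded perturbation
    x_adv = torch.clamp(x + delta, 0.0, 1.0)
    x_cleaned = attenuator(x_adv)
    loss_g = (-F.cross_entropy(surrogate(x_adv), y)         # fool on the raw adversarial input
              - F.cross_entropy(surrogate(x_cleaned), y))   # ...and survive the attenuator
    opt_g.zero_grad(); loss_g.backward(); opt_g.step()

    # --- Attenuator step: restore the correct prediction and the clean image. ---
    delta = eps * torch.tanh(generator(x)).detach()
    x_cleaned = attenuator(torch.clamp(x + delta, 0.0, 1.0))
    loss_a = F.cross_entropy(surrogate(x_cleaned), y) + F.mse_loss(x_cleaned, x)
    opt_a.zero_grad(); loss_a.backward(); opt_a.step()
    return loss_g.item(), loss_a.item()


if __name__ == "__main__":
    tiny = lambda: nn.Sequential(nn.Conv2d(1, 8, 3, padding=1), nn.ReLU(),
                                 nn.Conv2d(8, 1, 3, padding=1))
    clf = nn.Sequential(nn.Flatten(), nn.Linear(28 * 28, 10))  # stand-in surrogate
    g, a = tiny(), tiny()
    og = torch.optim.Adam(g.parameters(), lr=1e-3)
    oa = torch.optim.Adam(a.parameters(), lr=1e-3)
    x, y = torch.rand(4, 1, 28, 28), torch.randint(0, 10, (4,))
    print(train_step(g, a, clf, x, y, og, oa))
```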
Image captioning, the process of generating natural language descriptions based on image content, has garnered attention in AI research for its implications for scene understanding and human-computer interaction. While much prior research has focused on caption generation for English, addressing low-resource languages such as Bengali presents challenges, particularly in producing coherent captions that link visual objects with the corresponding words. This paper proposes a context-aware attention mechanism over semantic attention to accurately identify objects for image captioning in Bengali. The proposed architecture consists of an encoder and a decoder block. We chose ResNet-50 over other pre-trained models for encoding image features because of its ability to mitigate the vanishing gradient problem and recognize complex object features. For decoding the generated captions, a bidirectional Gated Recurrent Unit (GRU) architecture combined with an attention mechanism captures contextual dependencies in both directions, resulting in more accurate captions. The paper also highlights the challenge of transferring knowledge between domains, especially with culturally specific images. Evaluation on three Bengali benchmark datasets, namely BAN-Cap, BanglaLekhaImageCaption, and Bornon, demonstrates improvements in METEOR score over existing methods of approximately 30%, 18%, and 45%, respectively. The proposed context-aware, attention-based image captioning system significantly outperforms current state-of-the-art models in Bengali caption generation despite limitations in the reference captions of certain datasets.
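A minimal PyTorch sketch of the encoder-decoder pattern described above: ResNet-50 grid features are attended to at every GRU decoding step. The vocabulary size, hidden sizes, and the exact bidirectional GRU/attention wiring of the Bengali model are assumptions made for this illustration.

```python
# ResNet-50 grid features + one attention-guided GRU decoding step.
import torch
import torch.nn as nn
from torchvision.models import resnet50


class CaptionDecoderStep(nn.Module):
    def __init__(self, vocab_size: int, feat_dim: int = 2048, hidden: int = 512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden)
        self.attn = nn.Linear(feat_dim + hidden, 1)           # additive-style scoring
        self.gru = nn.GRUCell(hidden + feat_dim, hidden)
        self.out = nn.Linear(hidden, vocab_size)

    def forward(self, prev_word, feats, h):
        # feats: (B, N, feat_dim) grid features; h: (B, hidden) decoder state.
        h_rep = h.unsqueeze(1).expand(-1, feats.size(1), -1)
        scores = self.attn(torch.cat([feats, h_rep], dim=-1))
        context = (torch.softmax(scores, dim=1) * feats).sum(dim=1)   # attended glimpse
        h = self.gru(torch.cat([self.embed(prev_word), context], dim=-1), h)
        return self.out(h), h


if __name__ == "__main__":
    cnn = resnet50(weights=None)
    encoder = nn.Sequential(*list(cnn.children())[:-2])        # (B, 2048, 7, 7)
    feats = encoder(torch.randn(2, 3, 224, 224)).flatten(2).transpose(1, 2)
    decoder = CaptionDecoderStep(vocab_size=5000)
    logits, h = decoder(torch.tensor([1, 1]), feats, torch.zeros(2, 512))
    print(logits.shape)                                         # torch.Size([2, 5000])
```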
This paper focuses on the problem of reconstructing face images from short audio segments. Built in PyTorch, the speech-to-face pipeline retains the core methodology presented in previous works, but introduces a few key modifications. Leveraging a comprehensive dataset of internet audio recordings, a deep neural network is trained to discern correlations between voice and facial features. Through self-supervised learning, physical attributes such as age, gender, and ethnicity are captured without explicit feature analysis. The evaluation process quantifies the fidelity of the reconstructions, measuring their resemblance to the actual facial images through numerical metrics computed solely from the audio-driven outputs.
ISBN (print): 9781450397810
In this paper, we improve natural scene text detection and recognition technology based on 2D attention and an encoder-decoder framework. First, related work on text detection and recognition in different natural scenes is discussed. Second, we build on the encoder-decoder framework and a two-dimensional attention module and improve them through aggregation and hybridisation. Finally, we discuss and analyze the results and identify possible shortcomings of the model.
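For context, here is a heavily simplified PyTorch sketch of 2D attention over convolutional feature maps, the general mechanism such encoder-decoder text recognizers build on. All dimensions and module choices are illustrative assumptions, not the paper's model.

```python
# 2D attention: score every spatial location against the decoder state.
import torch
import torch.nn as nn


class TwoDAttention(nn.Module):
    def __init__(self, feat_ch: int = 256, hidden: int = 256):
        super().__init__()
        self.score = nn.Conv2d(feat_ch + hidden, 1, kernel_size=1)

    def forward(self, feats, state):
        # feats: (B, C, H, W) encoder feature map; state: (B, hidden) decoder state.
        b, c, h, w = feats.shape
        s = state[:, :, None, None].expand(-1, -1, h, w)
        attn = self.score(torch.cat([feats, s], dim=1)).flatten(2).softmax(dim=-1)
        context = torch.bmm(feats.flatten(2), attn.transpose(1, 2)).squeeze(-1)
        return context, attn.view(b, 1, h, w)     # (B, C) glimpse and the 2D attention map


if __name__ == "__main__":
    att = TwoDAttention()
    ctx, amap = att(torch.randn(2, 256, 8, 32), torch.randn(2, 256))
    print(ctx.shape, amap.shape)  # torch.Size([2, 256]) torch.Size([2, 1, 8, 32])
```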
As a natural and convenient interaction modality, voice input has become indispensable to smart devices (e.g., mobile phones and smart appliances). However, voice input is strongly constrained by surroundings and may cause privacy leakage in public areas. In this paper, we present SoundLip, an end-to-end interaction system that enables users to interact with smart devices via silent voice input. The key insight is to use inaudible acoustic signals to capture users' lip movements when they issue commands. Previous works have treated lip reading as a naive classification task and thus can only recognize individual words. In contrast, our proposed system enables lip reading at both the word and sentence levels, which is more suitable for daily use. We exploit the built-in speakers and microphones of smart devices to emit acoustic signals and listen to their reflections, respectively. To better abstract representations from multi-frequency, multi-modality acoustic signals, we design a hierarchical convolutional neural network (HCNN) that serves as the front-end and recognizes individual word commands. For sentence-level recognition, we exploit a multi-task encoder-decoder network to avoid explicit temporal segmentation and output sentences in an end-to-end way. We evaluate SoundLip on 20 individual words and 70 sentences from 12 participants. Our system achieves an accuracy of 91.2% at the word level and a word error rate of 7.1% at the sentence level in both user-independent and environment-independent settings. Given its innovative solution and promising performance, we believe SoundLip makes a significant contribution to the advancement of silent voice input technology.
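A hedged PyTorch sketch of the multi-task idea described above: a shared encoder over the acoustic feature sequence feeds both a word-classification head and an autoregressive decoder that emits sentences without explicit segmentation. The real SoundLip front-end is a hierarchical CNN over multi-frequency echo profiles; the layers and dimensions below are assumptions.

```python
# Shared encoder with a word-level head and a sentence-level decoder head.
import torch
import torch.nn as nn


class MultiTaskLipReader(nn.Module):
    def __init__(self, feat_dim=64, hidden=128, num_words=20, vocab=40):
        super().__init__()
        self.encoder = nn.GRU(feat_dim, hidden, batch_first=True, bidirectional=True)
        self.word_head = nn.Linear(2 * hidden, num_words)            # word-level task
        self.decoder = nn.GRU(vocab, 2 * hidden, batch_first=True)   # sentence-level task
        self.out = nn.Linear(2 * hidden, vocab)

    def forward(self, acoustic, prev_tokens_onehot):
        enc, _ = self.encoder(acoustic)                 # (B, T, 2*hidden)
        pooled = enc.mean(dim=1)                        # acoustic summary
        word_logits = self.word_head(pooled)            # individual word command
        dec, _ = self.decoder(prev_tokens_onehot, pooled.unsqueeze(0))
        return word_logits, self.out(dec)               # per-step token logits


if __name__ == "__main__":
    model = MultiTaskLipReader()
    acoustic = torch.randn(2, 100, 64)     # 100 frames of echo features (assumed shape)
    prev = torch.zeros(2, 15, 40)          # teacher-forced previous tokens (one-hot)
    w, s = model(acoustic, prev)
    print(w.shape, s.shape)                # (2, 20) (2, 15, 40)
```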
Deep-learning-based semantic segmentation is a research focus for unmanned aerial vehicle (UAV) aerial image analysis. However, segmenting small and narrow objects and boundary regions is problematic, due to the large size differences between objects and the unbalanced class data in aerial images. A network named SEC-BRNet is proposed for the boundary refinement problem. First, semantic embedding connections and a progressive upsampling decoder are used to obtain spatial details and generate fused feature maps, which are then concatenated level by level in the decoding process to recover boundary details. Second, a multi-loss training strategy is developed for the data imbalance and boundary roughness problems, combining cross-entropy loss, Dice loss, and active boundary loss. In extensive experiments, our network achieves 84.8% mIoU and 89.04% Boundary IoU on the AeroScapes dataset, and 62.81% mIoU and 90.78% Boundary IoU on the Semantic Drone Dataset. The experimental results indicate that the proposed SEC-BRNet performs well in semantic segmentation of UAV aerial images.
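A short PyTorch sketch of the multi-loss training idea mentioned above: cross-entropy for per-pixel classification plus a soft Dice term for class imbalance. The paper additionally uses an active boundary loss, which is not reproduced here, and the equal weighting below is an assumption.

```python
# Combined cross-entropy + soft Dice segmentation loss.
import torch
import torch.nn.functional as F


def segmentation_loss(logits: torch.Tensor, target: torch.Tensor,
                      dice_weight: float = 1.0) -> torch.Tensor:
    """logits: (B, C, H, W); target: (B, H, W) integer class labels."""
    ce = F.cross_entropy(logits, target)

    # Soft Dice over predicted probabilities and one-hot targets.
    num_classes = logits.shape[1]
    probs = logits.softmax(dim=1)
    onehot = F.one_hot(target, num_classes).permute(0, 3, 1, 2).float()
    inter = (probs * onehot).sum(dim=(0, 2, 3))
    union = probs.sum(dim=(0, 2, 3)) + onehot.sum(dim=(0, 2, 3))
    dice = 1.0 - ((2 * inter + 1e-6) / (union + 1e-6)).mean()

    return ce + dice_weight * dice


if __name__ == "__main__":
    logits = torch.randn(2, 5, 64, 64, requires_grad=True)
    target = torch.randint(0, 5, (2, 64, 64))
    print(segmentation_loss(logits, target))
```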
An automatic video captioning system describes the content of a video by analysing its visual aspects with regard to space and time and producing a meaningful caption that explains the video. A decade of research in this area has resulted in steep growth in the quality and appropriateness of generated captions compared with the expected results. The research has progressed from very basic methods to the most advanced transformer-based methods. A machine-generated caption for a video must adhere to many expected standards. For humans this task may be trivial, but it is not as easy for a machine to analyse the content and generate a semantically coherent description for it. The caption, generated in a natural language, must also adhere to that language's lexical and syntactic structure. The video captioning process is a culmination of computer vision and natural language processing tasks. Commencing with conventional template-based approaches, the field has moved through statistical methods and traditional deep learning approaches and is now in the trend of using transformers. This work makes an extensive study of the literature and proposes an improved transformer-based architecture for the video captioning process. The transformer architecture uses an encoder and a decoder model with two and three sublayers, respectively. Multi-head self-attention and cross-attention are part of the model and bring about very beneficial results. The decoder is auto-regressive and uses a masked layer to prevent the model from foreseeing future words in the caption. An enhanced encoder-decoder Transformer model with a CNN for feature extraction is used in our work. This model captures long-range dependencies and temporal relationships more effectively. The model has been evaluated on benchmark datasets, compared with state-of-the-art methods, and found to perform slightly better. The performance scores vary slightly for BLEU, METEOR, ROUGE a...
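A minimal PyTorch sketch of the encoder-decoder Transformer pattern described above: CNN frame features pass through a Transformer encoder, and an autoregressive decoder with a causal (masked) self-attention generates the caption. The layer counts, dimensions, and feature extractor are assumptions for illustration, not the paper's exact configuration.

```python
# Transformer encoder-decoder over frame features with a causal decoder mask.
import torch
import torch.nn as nn


class VideoCaptioner(nn.Module):
    def __init__(self, feat_dim=2048, d_model=512, vocab=10000):
        super().__init__()
        self.proj = nn.Linear(feat_dim, d_model)           # per-frame CNN features -> d_model
        self.embed = nn.Embedding(vocab, d_model)
        self.transformer = nn.Transformer(d_model=d_model, nhead=8,
                                          num_encoder_layers=2, num_decoder_layers=3,
                                          batch_first=True)
        self.out = nn.Linear(d_model, vocab)

    def forward(self, frame_feats, caption_tokens):
        # Causal mask so position i cannot attend to future caption words.
        causal = nn.Transformer.generate_square_subsequent_mask(caption_tokens.size(1))
        dec = self.transformer(self.proj(frame_feats), self.embed(caption_tokens),
                               tgt_mask=causal)
        return self.out(dec)


if __name__ == "__main__":
    model = VideoCaptioner()
    frames = torch.randn(2, 16, 2048)            # 16 frames of pooled CNN features
    tokens = torch.randint(0, 10000, (2, 12))    # teacher-forced caption prefix
    print(model(frames, tokens).shape)           # torch.Size([2, 12, 10000])
```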