ISBN (print): 9798331314385
Video diffusion models are able to generate high-quality videos by learning strong spatial-temporal priors on large-scale datasets. In this paper, we investigate whether such priors, derived from a generative process, are suitable for video recognition and, eventually, for joint optimization of generation and recognition. Building upon Stable Video Diffusion, we introduce GenRec, the first unified framework trained with a random-frame conditioning process so as to learn generalized spatial-temporal representations. The resulting framework naturally supports generation and recognition and, more importantly, is robust even when visual inputs contain limited information. Extensive experiments demonstrate the efficacy of GenRec for both recognition and generation. In particular, GenRec achieves competitive recognition performance, offering 75.8% and 87.2% accuracy on SSV2 and K400, respectively. GenRec also performs best on class-conditioned image-to-video generation, achieving 46.5 and 49.3 FVD scores on the SSV2 and EK-100 datasets. Furthermore, GenRec demonstrates extraordinary robustness in scenarios where only limited frames can be observed. Code will be available at https://***/wengzejia1/GenRec.
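As a rough illustration of the random-frame conditioning idea mentioned in the abstract, the sketch below keeps a random subset of frames as the visible condition and masks the rest. This is a minimal PyTorch mock-up under assumptions; the function and tensor names are hypothetical and this is not the released GenRec code.

```python
# A minimal sketch (an assumption, not the released GenRec code) of random-frame
# conditioning: a random subset of frames is kept as the condition and the rest
# are masked out, so the model must both reconstruct missing frames and recognize.
import torch

def random_frame_condition(video: torch.Tensor, keep_prob: float = 0.5):
    """video: (B, T, C, H, W). Returns the masked video and the binary frame mask."""
    b, t = video.shape[:2]
    mask = (torch.rand(b, t, device=video.device) < keep_prob).float()
    # Always keep at least one frame per clip so the condition is never empty.
    mask[mask.sum(dim=1) == 0, 0] = 1.0
    conditioned = video * mask.view(b, t, 1, 1, 1)
    return conditioned, mask

clip = torch.randn(2, 16, 3, 64, 64)       # a dummy 16-frame clip batch
cond, mask = random_frame_condition(clip)
print(mask.sum(dim=1))                      # number of visible frames per clip
```

Varying `keep_prob` (down to a single visible frame) is one way such a training scheme could expose the model to the limited-observation scenarios the abstract refers to.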
Text irregularities pose significant challenges to scene text recognizers. Thin-Plate Spline (TPS)-based rectification is widely regarded as an effective means to deal with them. Currently, the calculation of TPS trans...
Images suffer from heavy spatial redundancy because pixels in neighboring regions are spatially correlated. Existing approaches strive to overcome this limitation by reducing less meaningful image regions. However, cu...
The growing interest in generating recipes from food images has drawn substantial research attention in recent years. Existing works for recipe generation primarily utilize a two-stage training method—first predictin...
In this paper, we propose a method to predict the success of primer amplification based on the relationship existing between the sequence of primer and template, which can optimize the primer design and select the pri...
Convolution neural networks (CNNs) and Transformers have their own advantages, and both have been widely used for dense prediction in multi-task learning (MTL). Most current studies on MTL rely solely on CNNs or Transformers. In this work, we present a novel MTL model that combines the merits of deformable CNNs and query-based Transformers with shared gating for multi-task dense prediction. This combination offers a simple and efficient solution owing to its powerful and flexible task-specific learning, as well as lower cost, less complexity, and fewer parameters than traditional MTL methods. We introduce the deformable mixer Transformer with gating (DeMTG), a simple and effective encoder-decoder architecture that incorporates convolution and attention mechanisms in a unified network for MTL. It is carefully designed to exploit the advantages of each block and to provide deformable and comprehensive features for all tasks from both local and global perspectives. First, the deformable mixer encoder contains two types of operators: a channel-aware mixing operator that allows communication among different channels, and a spatial-aware deformable operator with deformable convolution that efficiently samples more informative spatial locations. Second, the task-aware gating Transformer decoder performs the task-specific predictions: a task interaction block integrated with self-attention captures task interaction features, and a task query block integrated with gating attention selects the corresponding task-specific features. Further, the experimental results demonstrate that the proposed DeMTG uses fewer GFLOPs and significantly outperforms current Transformer-based and CNN-based competitive models on a variety of metrics on three dense prediction datasets (i.e., NYUD-v2, PASCAL-Context, and Cityscapes). For example, by using Swin-L as a backbone, our method achieves 57.55 mIoU segmentation
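To make the two encoder operators described above more concrete, the sketch below pairs a 1x1 channel-mixing convolution with a deformable convolution whose offsets are predicted per location. It is a hedged illustration assuming PyTorch and torchvision; the class and layer names are illustrative and this is not the authors' DeMTG implementation.

```python
# A minimal sketch (not the authors' code) of a channel-aware mixing operator
# followed by a spatial-aware deformable operator, as described in the abstract.
import torch
import torch.nn as nn
from torchvision.ops import DeformConv2d

class DeformableMixerBlock(nn.Module):
    def __init__(self, channels: int, kernel_size: int = 3):
        super().__init__()
        # Channel-aware mixing: 1x1 convolution lets different channels communicate.
        self.channel_mix = nn.Conv2d(channels, channels, kernel_size=1)
        # Spatial-aware deformable operator: offsets are predicted per location,
        # then deformable convolution samples the more informative positions.
        self.offset_pred = nn.Conv2d(channels, 2 * kernel_size * kernel_size,
                                     kernel_size=3, padding=1)
        self.deform_conv = DeformConv2d(channels, channels, kernel_size,
                                        padding=kernel_size // 2)
        self.act = nn.GELU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.act(self.channel_mix(x))
        offsets = self.offset_pred(x)
        return self.act(self.deform_conv(x, offsets))

feat = torch.randn(1, 64, 32, 32)            # a dummy feature map
print(DeformableMixerBlock(64)(feat).shape)  # torch.Size([1, 64, 32, 32])
```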
In recent years, many deep learning-based methods have been proposed to tackle the problem of optical flow estimation and achieved promising results. However, they hardly consider that most videos are compressed and t...
Authors: Ling, Yu; Tan, Weimin; Yan, Bo
School of Computer Science, Shanghai Key Laboratory of Intelligent Information Processing, Shanghai Collaborative Innovation Center of Intelligent Visual Computing, Fudan University, Shanghai, China
Survival analysis aims at modeling the relationship between covariates and event occurrence with some untracked (censored) samples. In implementation, existing methods model the survival distribution with strong assum...
Recently, unsupervised image denoising methods that learn from paired noisy samples have received increasing attention. These methods build on the idea that the mean of multiple noisy images of the same scene is the ideal clean image. However, they ignore the effect of aleatoric uncertainty in the noisy image (e.g., pixels deviating from the expected distribution). The presence of aleatoric uncertainty degrades the reconstructed target pixels, resulting in high uncertainty for these pixels (i.e., low confidence), which in turn leads to sub-optimal denoising results. To address this problem, we propose a novel uncertainty-aware unsupervised image denoising method named Uncer2Natural (U2N). It dynamically predicts the aleatoric uncertainty for each noisy sample and produces satisfactory denoising results by reducing the effect of that uncertainty. Extensive experimental results show that U2N outperforms state-of-the-art unsupervised image denoising methods in terms of both quantitative metrics and qualitative visual quality.
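The sketch below shows one generic way to down-weight pixels with high predicted aleatoric uncertainty: a heteroscedastic Gaussian negative log-likelihood trained on paired noisy views. It is an assumption-based illustration in PyTorch, not the actual U2N objective; the network and variable names are hypothetical.

```python
# A generic sketch of uncertainty-weighted denoising (heteroscedastic Gaussian NLL),
# not the exact U2N method; architecture and names are illustrative only.
import torch
import torch.nn as nn

class UncertainDenoiser(nn.Module):
    def __init__(self, channels: int = 3, width: int = 32):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, width, 3, padding=1), nn.ReLU(),
            nn.Conv2d(width, width, 3, padding=1), nn.ReLU(),
        )
        self.mean_head = nn.Conv2d(width, channels, 3, padding=1)    # denoised estimate
        self.logvar_head = nn.Conv2d(width, channels, 3, padding=1)  # per-pixel log-variance

    def forward(self, noisy):
        h = self.body(noisy)
        return self.mean_head(h), self.logvar_head(h)

def uncertainty_loss(pred, logvar, target):
    # Pixels with high predicted variance contribute less to the residual term;
    # the log-variance term keeps the model from declaring everything uncertain.
    return ((pred - target) ** 2 * torch.exp(-logvar) + logvar).mean()

model = UncertainDenoiser()
noisy_a, noisy_b = torch.randn(2, 4, 3, 64, 64)   # a dummy paired-noisy-sample batch
pred, logvar = model(noisy_a)
loss = uncertainty_loss(pred, logvar, noisy_b)     # train one noisy view to predict the other
loss.backward()
```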
The visual analysis of retinal data contributes to the understanding of a wide range of eye diseases. For the evaluation of cross-sectional studies, ophthalmologists rely on workflows and toolsets established in their...