Search results - Inner Mongolia University Library

Depth Estimation From Single Image And Semantic Prior

IEEE International Conference on Image Processing

Authors: Praful Hambarde; Akshay Dudhane; Prashant W. Patil; Subrahmanyam Murala; Abhinav Dhall. Affiliations: Computer Vision and Pattern Recognition Lab, IIT Ropar

ISBN: (digital) 9781728163956

ISBN: (print) 9781728163963

Multi-modality sensor fusion is an active research area in scene understanding. In this work, we explore RGB image and semantic-map fusion methods for depth estimation. LiDAR, Kinect, and TOF depth sensors are unable to predict the depth-map on illuminated and monotonously patterned surfaces. In this paper, we propose a semantic-to-depth generative adversarial network (S2D-GAN) for depth estimation from an RGB image and its semantic-map. In the first stage, the proposed S2D-GAN estimates the coarse-level depth-map using a semantic-to-coarse-depth generative adversarial network (S2CD-GAN), while the second stage estimates the fine-level depth-map using a cascaded multi-scale spatial pooling network. Experimental analysis on the NYU-Depth-V2 dataset shows that the proposed S2D-GAN outperforms existing single-image depth estimation methods and methods that use RGB with sparse samples. The proposed S2D-GAN also gives efficient results on real-world indoor and outdoor image depth estimation.

Keywords: Estimation; Semantics; Generators; Robot sensing systems; Laser radar; Generative adversarial networks; Training

Blueprint Separable Residual Network for Efficient Image Super-Resolution

arXiv, 2022

Authors: Li, Zheyuan; Liu, Yingqi; Chen, Xiangyu; Cai, Haoming; Gu, Jinjin; Qiao, Yu; Dong, Chao. Affiliations: ShenZhen Key Lab of Computer Vision and Pattern Recognition, SIAT-SenseTime Joint Lab, Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences, China; University of Macau, China; Shanghai AI Laboratory, Shanghai, China; The University of Sydney, Australia

Recent advances in single image super-resolution (SISR) have achieved extraordinary performance, but the computational cost is too heavy for edge devices. To alleviate this problem, many novel and effective solutions have been proposed. Convolutional neural networks (CNNs) with attention mechanisms have attracted increasing attention due to their efficiency and effectiveness. However, there is still redundancy in the convolution operation. In this paper, we propose the Blueprint Separable Residual Network (BSRN), which contains two efficient designs. One is the use of blueprint separable convolution (BSConv), which takes the place of the redundant convolution operation. The other is to enhance the model's ability by introducing more effective attention modules. The experimental results show that BSRN achieves state-of-the-art performance among existing efficient SR methods. Moreover, a smaller variant of our model, BSRN-S, won first place in the model complexity track of the NTIRE 2022 Efficient SR Challenge. The code is available at https://***/xiaom233/BSRN. Copyright © 2022, The Authors. All rights reserved.

Keywords: Blueprints
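The redundancy argument in the BSRN abstract above can be made concrete with a weight count. The sketch below assumes the unconstrained BSConv variant (a 1x1 pointwise convolution followed by a k x k depthwise convolution), which is how BSConv is usually described; the abstract does not say which variant BSRN uses, and the channel counts are illustrative only.

```python
def standard_conv_params(c_in, c_out, k):
    """Weights in a standard k x k convolution (bias ignored)."""
    return c_in * c_out * k * k


def bsconv_u_params(c_in, c_out, k):
    """Weights in an unconstrained blueprint separable convolution (assumed variant):
    a 1x1 pointwise conv (c_in -> c_out) followed by a k x k depthwise conv
    applied to each of the c_out channels (bias ignored)."""
    pointwise = c_in * c_out
    depthwise = c_out * k * k
    return pointwise + depthwise


if __name__ == "__main__":
    # Illustrative layer size, not taken from the paper.
    c_in, c_out, k = 64, 64, 3
    std = standard_conv_params(c_in, c_out, k)
    bs = bsconv_u_params(c_in, c_out, k)
    print(f"standard: {std} weights, BSConv-U: {bs} weights")
```

For a 64-to-64 channel 3x3 layer this gives 36864 weights for the standard convolution against 4672 for the separable form.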

Automatic Polyp Segmentation with Multiple Kernel Dilated Convolution Network

arXiv, 2022

Authors: Tomar, Nikhil Kumar; Srivastava, Abhishek; Bagci, Ulas; Jha, Debesh. Affiliations: School of Informatics and Computer Science, Indira Gandhi National Open University, India; Computer Vision and Pattern Recognition Unit, Indian Statistical Institute, India; Machine and Hybrid Intelligence Lab, Department of Radiology, Feinberg School of Medicine, Northwestern University, United States

The detection and removal of precancerous polyps through colonoscopy is the primary technique for the prevention of colorectal cancer worldwide. However, the miss rate of colorectal polyps varies significantly among endoscopists. It is well known that a computer-aided diagnosis (CAD) system can assist endoscopists in detecting colon polyps and minimize the variation among endoscopists. In this study, we introduce a novel deep learning architecture, named MKDCNet, for automatic polyp segmentation robust to significant changes in polyp data distribution. MKDCNet is simply an encoder-decoder neural network that uses the pre-trained ResNet50 as the encoder and a novel multiple kernel dilated convolution (MKDC) block that expands the field of view to learn more robust and heterogeneous representations. Extensive experiments on four publicly available polyp datasets and a cell nuclei dataset show that the proposed MKDCNet outperforms the state-of-the-art methods when trained and tested on the same dataset as well as when tested on unseen polyp datasets from different distributions. With rich results, we demonstrated the robustness of the proposed architecture. From an efficiency perspective, our algorithm can process approximately 45 frames per second on an RTX 3090 GPU. MKDCNet can be a strong benchmark for building real-time systems for clinical colonoscopies. The code of the proposed MKDCNet is available at https://***/nikhilroxtomar/MKDCNet. © 2022, CC BY.

Keywords: Convolution
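The MKDCNet abstract above says that dilated convolutions expand the field of view. A minimal sketch of the standard arithmetic behind that claim follows: a k x k kernel with dilation d covers k + (k - 1)(d - 1) pixels per side. The kernel sizes and dilation rates used here are illustrative assumptions, not the configuration of the MKDC block.

```python
def effective_kernel_size(kernel_size, dilation):
    """Side length of the input region covered by one dilated kernel."""
    return kernel_size + (kernel_size - 1) * (dilation - 1)


def stacked_receptive_field(layers):
    """Receptive field of a stack of stride-1 convolutions.

    layers: iterable of (kernel_size, dilation) pairs, applied in sequence.
    """
    rf = 1
    for kernel_size, dilation in layers:
        rf += effective_kernel_size(kernel_size, dilation) - 1
    return rf


if __name__ == "__main__":
    # Hypothetical dilation rates chosen only to show the effect.
    dilated = [(3, 1), (3, 3), (3, 6), (3, 9)]
    plain = [(3, 1)] * 4
    print("dilated stack:", stacked_receptive_field(dilated))
    print("plain stack:  ", stacked_receptive_field(plain))
```

With the same number of 3x3 layers and the same weight count, the dilated stack sees a 39-pixel-wide region while the plain stack sees 9.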

A Survey of Historical Document Image Datasets

arXiv, 2022

Authors: Nikolaidou, Konstantina; Seuret, Mathias; Mokayed, Hamam; Liwicki, Marcus. Affiliations: EISLAB Machine Learning Group, Luleå University of Technology, Aurorum 1, Norrbotten, Luleå 97187, Sweden; Pattern Recognition Lab, Computer Vision Group, Friedrich-Alexander-Universität, Martensstr. 3, Bavaria, Erlangen 91058, Germany

This paper presents a systematic literature review of image datasets for document image analysis, focusing on historical documents, such as handwritten manuscripts and early prints. Finding appropriate datasets for historical document analysis is a crucial prerequisite to facilitate research using different machine learning algorithms. However, because of the very large variety of the actual data (e.g., scripts, tasks, dates, support systems, and amount of deterioration), the different formats for data and label representation, and the different evaluation processes and benchmarks, finding appropriate datasets is a difficult task. This work fills this gap, presenting a meta-study on existing datasets. After a systematic selection process (according to PRISMA guidelines), we select 65 studies that are chosen based on different factors, such as the year of publication, number of methods implemented in the article, reliability of the chosen algorithms, dataset size, and journal outlet. We summarize each study by assigning it to one of three pre-defined tasks: document classification, layout structure, or content analysis. We present the statistics, document type, language, tasks, input visual aspects, and ground truth information for every dataset. In addition, we provide the benchmark tasks and results from these papers or recent competitions. We further discuss gaps and challenges in this domain. We advocate for providing conversion tools to common formats (e.g., COCO format for computer vision tasks) and always providing a set of evaluation metrics, instead of just one, to make results comparable across studies. © 2022, CC BY.

Keywords: Deterioration
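The survey abstract above advocates conversion tools to common formats such as COCO. As a rough illustration of what such a tool involves, the sketch below converts corner-style boxes into a COCO-style detection dictionary, where each box is stored as [x, y, width, height] measured from the top-left corner. The input record layout, the function name `to_coco`, and the category names are hypothetical; a real dataset would need its own reader.

```python
def to_coco(records, category_names):
    """Convert simple page records into a COCO-style detection dictionary.

    records: list of dicts with keys file_name, width, height and boxes,
             where boxes is a list of (label, x1, y1, x2, y2) tuples
             (a hypothetical input layout used only for this sketch).
    """
    cat_ids = {name: idx for idx, name in enumerate(category_names, start=1)}
    coco = {
        "images": [],
        "annotations": [],
        "categories": [{"id": idx, "name": name} for name, idx in cat_ids.items()],
    }
    ann_id = 1
    for img_id, rec in enumerate(records, start=1):
        coco["images"].append({
            "id": img_id,
            "file_name": rec["file_name"],
            "width": rec["width"],
            "height": rec["height"],
        })
        for label, x1, y1, x2, y2 in rec["boxes"]:
            w, h = x2 - x1, y2 - y1
            coco["annotations"].append({
                "id": ann_id,
                "image_id": img_id,
                "category_id": cat_ids[label],
                "bbox": [x1, y1, w, h],  # COCO stores top-left corner plus size
                "area": w * h,
                "iscrowd": 0,
            })
            ann_id += 1
    return coco


if __name__ == "__main__":
    import json

    pages = [{
        "file_name": "page_001.png", "width": 2000, "height": 3000,
        "boxes": [("text_line", 100, 200, 1900, 260), ("initial", 100, 300, 400, 700)],
    }]
    print(json.dumps(to_coco(pages, ["text_line", "initial"]), indent=2))
```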

LVAgent: Long Video Understanding by Multi-Round Dynamical Collaboration of MLLM Agents

arXiv, 2025

Authors: Chen, Boyu; Yue, Zhengrong; Chen, Siran; Wang, Zikang; Liu, Yang; Li, Peng; Wang, Yali. Affiliations: Shenzhen Key Lab of Computer Vision and Pattern Recognition, Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences, China; School of Artificial Intelligence, University of Chinese Academy of Sciences, China; Tsinghua University, Beijing, China; Dept. of Comp. Sci. & Tech., Institute for AI, Tsinghua University, Beijing, China; Shanghai Artificial Intelligence Laboratory, China; Shanghai Jiao Tong University, China

Existing Multimodal Large Language Models (MLLMs) encounter significant challenges in modeling the temporal context within long videos. Currently, mainstream agent-based methods use external tools (e.g., search engines, memory banks, OCR, retrieval models) to assist a single MLLM in answering long video questions. Despite such tool-based support, a solitary MLLM still offers only a partial understanding of long videos, resulting in limited performance. In order to better address long video tasks, we introduce LVAgent, the first framework enabling multi-round dynamic collaboration of MLLM agents in long video understanding. Our methodology consists of four key steps: 1) Selection: We pre-select appropriate agents from the model library to form optimal agent teams based on different tasks. 2) Perception: We design an effective retrieval scheme for long videos, improving the coverage of critical temporal segments while maintaining computational efficiency. 3) Action: Agents answer long video-related questions and exchange reasons. 4) Reflection: We evaluate each agent’s performance in each round of discussion and optimize the agent team for dynamic collaboration. The agents iteratively refine their answers through multi-round dynamic collaboration. LVAgent is the first agent system method that outperforms all closed-source models (including GPT-4o) and open-source models (including InternVL-2.5 and Qwen2-VL) in the long video understanding tasks. Our LVAgent achieves an accuracy of 80% on four mainstream long video understanding tasks. Notably, on the LongVideoBench dataset, LVAgent improves accuracy by up to 13.3% compared with SOTA. © 2025, CC BY-NC-SA.

Keywords: Search engines
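The multi-round loop described in the LVAgent abstract above (agents answer, exchange reasons, and the team is adjusted after each round) can be pictured with a toy control loop. Everything in this sketch is an assumption made for illustration: agents are plain callables, Selection simply takes the agents it is given, the Perception (retrieval) step is omitted, and Reflection uses a majority-vote rule rather than the paper's evaluation method, which the abstract does not specify.

```python
from collections import Counter


def run_rounds(agents, question, max_rounds=3):
    """Toy multi-round discussion among agents.

    agents: dict mapping name -> callable(question, peer_reasons) -> (answer, reason).
    Returns the majority answer and the sorted names of the remaining team.
    """
    team = dict(agents)   # Selection (toy): start with every supplied agent
    reasons = {}          # reasons shared from the previous round
    majority = None
    for _ in range(max_rounds):
        # Action: each agent answers, seeing the reasons exchanged so far.
        outputs = {name: agent(question, reasons) for name, agent in team.items()}
        votes = Counter(answer for answer, _ in outputs.values())
        majority, count = votes.most_common(1)[0]
        if count == len(team):
            break  # the team agrees, so stop discussing
        # Reflection (toy rule): keep only agents that agree with the majority.
        reasons = {name: reason for name, (_, reason) in outputs.items()}
        team = {name: agent for name, agent in team.items()
                if outputs[name][0] == majority}
    return majority, sorted(team)


if __name__ == "__main__":
    # Hypothetical agents with fixed answers, standing in for MLLM calls.
    toy_agents = {
        "a": lambda q, r: ("B", "the event appears mid-video"),
        "b": lambda q, r: ("B", "the caption mentions it"),
        "c": lambda q, r: ("A", "guess"),
    }
    print(run_rounds(toy_agents, "Which option describes the ending?"))
```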

Activating More Pixels in Image Super-Resolution Transformer

arXiv, 2022

Authors: Chen, Xiangyu; Wang, Xintao; Zhou, Jiantao; Qiao, Yu; Dong, Chao. Affiliations: State Key Laboratory of Internet of Things for Smart City, University of Macau, China; Shenzhen Key Lab of Computer Vision and Pattern Recognition, Shenzhen Institute of Advanced Technology, Chinese Academy of Sciences, China; Shanghai Artificial Intelligence Laboratory, China; ARC Lab, Tencent PCG, China

Transformer-based methods have shown impressive performance in low-level vision tasks, such as image super-resolution. However, we find that these networks can only utilize a limited spatial range of input information through attribution analysis. This implies that the potential of Transformer is still not fully exploited in existing networks. In order to activate more input pixels for better reconstruction, we propose a novel Hybrid Attention Transformer (HAT). It combines both channel attention and window-based self-attention schemes, thus making use of their complementary advantages of being able to utilize global statistics and strong local fitting capability. Moreover, to better aggregate the cross-window information, we introduce an overlapping cross-attention module to enhance the interaction between neighboring window features. In the training stage, we additionally adopt a same-task pre-training strategy to exploit the potential of the model for further improvement. Extensive experiments show the effectiveness of the proposed modules, and we further scale up the model to demonstrate that the performance of this task can be greatly improved. Our overall method significantly outperforms the state-of-the-art methods by more than 1dB. Copyright © 2022, The Authors. All rights reserved.

Keywords: Pixels

UDC-UNet: Under-Display Camera Image Restoration via U-shape Dynamic Network

arXiv, 2022

Authors: Liu, Xina; Hu, Jinfan; Chen, Xiangyu; Dong, Chao. Affiliations: Shenzhen Key Lab of Computer Vision and Pattern Recognition, SIAT-SenseTime Joint Lab, Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences, Shanghai, China; University of Chinese Academy of Sciences, Shanghai, China; University of Macau, Shanghai, China; Shanghai AI Laboratory, Shanghai, China

Under-Display Camera (UDC) has been widely exploited to help smartphones realize full-screen displays. However, as the screen could inevitably affect the light propagation process, the images captured by the UDC system usually contain flare, haze, blur, and noise. Particularly, flare and blur in UDC images could severely deteriorate the user experience in high dynamic range (HDR) scenes. In this paper, we propose a new deep model, namely UDC-UNet, to address the UDC image restoration problem with an estimated PSF in HDR scenes. Our network consists of three parts, including a U-shape base network to utilize multi-scale information, a condition branch to perform spatially variant modulation, and a kernel branch to leverage the prior knowledge of the PSF. According to the characteristics of HDR data, we additionally design a tone mapping loss to stabilize network optimization and achieve better visual quality. Experimental results show that the proposed UDC-UNet outperforms the state-of-the-art methods in quantitative and qualitative comparisons. Our approach won second place in the UDC image restoration track of the MIPI challenge. Codes and models are available at https://***/J-FHu/UDCUNet. © 2022, CC BY.

Keywords: Image reconstruction

RankSRGAN: Super resolution generative adversarial networks with learning to rank

arXiv, 2021

Authors: Zhang, Wenlong; Liu, Yihao; Dong, Chao; Qiao, Yu. Affiliations: ShenZhen Key Lab of Computer Vision and Pattern Recognition, SIAT-SenseTime Joint Lab, Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences, China; Shanghai AI Lab, Shanghai, China

Generative Adversarial Networks (GANs) have demonstrated the potential to recover realistic details for single image super-resolution (SISR). To further improve the visual quality of super-resolved results, the PIRM2018-SR Challenge employed perceptual metrics, such as PI, NIQE, and Ma, to assess perceptual quality. However, existing methods cannot directly optimize these non-differentiable perceptual metrics, which are shown to be highly correlated with human ratings. To address the problem, we propose Super-Resolution Generative Adversarial Networks with Ranker (RankSRGAN) to optimize the generator in the direction of different perceptual metrics. Specifically, we first train a Ranker which can learn the behaviour of perceptual metrics and then introduce a novel rank-content loss to optimize the perceptual quality. The most appealing part is that the proposed method can combine the strengths of different SR methods to generate better results. Furthermore, we extend our method to multiple Rankers to provide multi-dimension constraints for the generator. Extensive experiments show that RankSRGAN achieves visually pleasing results and reaches state-of-the-art performance in perceptual metrics and quality. Project page: https://***/Projects/RankSRGAN. © 2021, CC BY.

Keywords: Generative adversarial networks
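The RankSRGAN abstract above describes training a Ranker that learns the behaviour of a perceptual metric through learning to rank. One common way to train such a ranker is a pairwise margin ranking loss, sketched below. The sign convention (a lower score means better perceptual quality), the margin value, and the function name are assumptions for illustration; the abstract does not give the paper's exact formulation.

```python
def margin_ranking_loss(score_first, score_second, first_is_better, margin=0.5):
    """Pairwise margin ranking loss for two ranker scores.

    Assumed convention: a LOWER score means better perceptual quality, so the
    loss is zero once the better image scores lower than the worse image by
    at least the margin.
    """
    gamma = 1.0 if first_is_better else -1.0
    return max(0.0, gamma * (score_first - score_second) + margin)


if __name__ == "__main__":
    # Correctly ordered pair: the better image already scores much lower.
    print(margin_ranking_loss(0.25, 1.0, first_is_better=True))
    # Wrongly ordered pair: the better image scores higher, so it is penalized.
    print(margin_ranking_loss(1.0, 0.25, first_is_better=True))
```

Because the loss depends only on the ordering of image pairs, the ranker can imitate a metric that is not itself differentiable, and its output can then be used as a differentiable training signal for the generator.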

Low-Resolution Action Recognition for Tiny Actions Challenge

arXiv, 2022

Authors: Chen, Boyu; Qiao, Yu; Wang, Yali. Affiliations: ShenZhen Key Lab of Computer Vision and Pattern Recognition, Shenzhen Institute of Advanced Technology, Chinese Academy of Sciences, China; University of Chinese Academy of Sciences, China; Shanghai AI Laboratory, Shanghai, China; SIAT Branch, Shenzhen Institute of Artificial Intelligence and Robotics for Society, China

The Tiny Actions Challenge focuses on understanding human activities in real-world surveillance. There are two main difficulties for activity recognition in this scenario. First, human activities are often recorded at a distance and appear at low resolution with few discriminative cues. Second, these activities naturally follow a long-tailed distribution, and it is hard to alleviate data bias under such heavy category imbalance. To tackle these problems, we propose a comprehensive recognition solution in this paper. First, we train video backbones with data balancing, in order to alleviate overfitting on the challenge benchmark. Second, we design a dual-resolution distillation framework, which can effectively guide low-resolution action recognition with super-resolution knowledge. Finally, we apply model ensembling with post-processing, which can further boost performance on the long-tailed categories. Our solution ranks Top-1 on the leaderboard. Copyright © 2022, The Authors. All rights reserved.

Keywords: Distillation