Search Results - Inner Mongolia University Library

CLIP-Flow: Decoding images encoded in CLIP space

Computational Visual Media, 2024, vol. 10, no. 6, pp. 1157-1168

Authors: Hao Ma, Ming Li, Jingyuan Yang, Or Patashnik, Dani Lischinski, Daniel Cohen-Or, Hui Huang. Affiliations: Visual Computing Research Center, College of Computer Science and Software Engineering, Shenzhen University, Shenzhen 518060, China; Department of Computer Science, Tel Aviv University, Tel Aviv 6997801, Israel; School of Computer Science and Engineering, the Hebrew University of Jerusalem, Jerusalem 91904, Israel

This study introduces CLIP-Flow, a novel network for generating images from a given image or *** effectively utilize the rich semantics contained in both modalities, we designed a semantics-guided methodology for image- and text-to-image *** particular, we adopted Contrastive Language-Image Pretraining (CLIP) as an encoder to extract semantics and StyleGAN as a decoder to generate images from such ***, to bridge the embedding space of CLIP and latent space of StyleGAN, real NVP is employed and modified with activation normalization and invertible *** the images and text in CLIP share the same representation space, text prompts can be fed directly into CLIP-Flow to achieve text-to-image *** conducted extensive experiments on several datasets to validate the effectiveness of the proposed image-to-image synthesis *** addition, we tested on the public dataset Multi-Modal CelebA-HQ, for text-to-image *** validated that our approach can generate high-quality text-matching images, and is comparable with state-of-the-art methods, both qualitatively and quantitatively.

Keywords: image-to-image; text-to-image; contrastive language-image pretraining (CLIP); flow; StyleGAN
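The abstract above names real NVP as the bridge between the CLIP embedding space and the StyleGAN latent space. As a rough illustration of why such a flow is invertible, the sketch below implements a single affine coupling layer, the basic building block of real NVP. It is not the CLIP-Flow architecture: the `scale` and `shift` functions are hypothetical stand-ins for learned networks, and the activation normalization mentioned in the abstract is omitted.

```python
import math

def scale(x1):
    # Hypothetical stand-in for a learned scale network s(x1).
    return [math.tanh(v) for v in x1]

def shift(x1):
    # Hypothetical stand-in for a learned translation network t(x1).
    return [0.5 * v for v in x1]

def coupling_forward(x):
    """Affine coupling: pass the first half through, transform the second half."""
    if len(x) % 2 != 0:
        raise ValueError("this sketch expects an even-length vector")
    half = len(x) // 2
    x1, x2 = x[:half], x[half:]
    s, t = scale(x1), shift(x1)
    # y2 = x2 * exp(s(x1)) + t(x1), element-wise
    y2 = [b * math.exp(si) + ti for b, si, ti in zip(x2, s, t)]
    log_det = sum(s)  # log|det Jacobian| of this layer
    return x1 + y2, log_det

def coupling_inverse(y):
    """Exact inverse: s and t depend only on the untouched half."""
    if len(y) % 2 != 0:
        raise ValueError("this sketch expects an even-length vector")
    half = len(y) // 2
    y1, y2 = y[:half], y[half:]
    s, t = scale(y1), shift(y1)
    x2 = [(b - ti) * math.exp(-si) for b, si, ti in zip(y2, s, t)]
    return y1 + x2
```

Because the scale and shift are computed only from the half that passes through unchanged, the inverse never has to invert `scale` or `shift` themselves, which is what lets a flow map embeddings in one space to latents in another and back.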

GenRec: Unifying Video Generation and Recognition with Diffusion Models

38th Conference on Neural Information Processing Systems, NeurIPS 2024

Authors: Weng, Zejia; Yang, Xitong; Xing, Zhen; Wu, Zuxuan; Jiang, Yu-Gang. Affiliations: Shanghai Key Lab of Intell. Info. Processing, School of CS, Fudan University, China; Shanghai Collaborative Innovation Center of Intelligent Visual Computing, China; Department of Computer Science, University of Maryland, United States

Video diffusion models are able to generate high-quality videos by learning strong spatial-temporal priors on large-scale datasets. In this paper, we aim to investigate whether such priors derived from a generative process are suitable for video recognition, and eventually joint optimization of generation and recognition. Building upon Stable Video Diffusion, we introduce GenRec, the first unified framework trained with a random-frame conditioning process so as to learn generalized spatial-temporal representations. The resulting framework naturally supports generation and recognition, and more importantly is robust even when visual inputs contain limited information. Extensive experiments demonstrate the efficacy of GenRec for both recognition and generation. In particular, GenRec achieves competitive recognition performance, offering 75.8% and 87.2% accuracy on SSV2 and K400, respectively. GenRec also performs the best on class-conditioned image-to-video generation, achieving 46.5 and 49.3 FVD scores on SSV2 and EK-100 datasets. Furthermore, GenRec demonstrates extraordinary robustness in scenarios where only limited frames can be observed. Code will be available at https://***/wengzejia1/GenRec. © 2024 Neural Information Processing Systems Foundation. All rights reserved.

Real-Time 3D Object Detection, Recognition and Presentation Using a Mobile Device for Assistive Navigation

SN Computer Science, 2023, vol. 4, no. 5, pp. 1-18

Authors: Chen, Jin; Zhu, Zhigang. Affiliations: Visual Computing Laboratory, Computer Science Department, The City College-CUNY, New York, NY 10031, United States; Nearabl Inc., New York, NY 10023, United States; PhD Program in Computer Science, The Graduate Center-CUNY, New York, NY 10016, United States

This paper presents an integrated solution for 3D object detection, recognition, and presentation to increase accessibility for various user groups in indoor areas through a mobile application. The system has three major components: a 3D object detection module, an object tracking and update module, and a voice and AR-enhanced interface. The 3D object detection module consists of pre-trained 2D object detectors and 3D bounding box estimation methods to detect the 3D poses and sizes of the objects in each camera frame. This module can easily adapt to various 2D object detectors (e.g., YOLO, SSD, mask RCNN) based on the requested task and requirements of the run time and details for the 3D detection result. It can run on a cloud server or mobile application. The object tracking and update module minimizes the computational power for long-term environment scanning by converting 2D tracking results into 3D results. The voice and AR-enhanced interface integrates ARKit and SiriKit to provide voice interaction and AR visualization to improve information delivery for different user groups. The system can be integrated with existing applications, especially assistive navigation, to increase travel safety for people who are blind or have low vision and improve social interaction for individuals with autism spectrum disorder. In addition, it can potentially be used for 3D reconstruction of the environment for other applications. Our preliminary test results for the object detection evaluation and real-time system performance are provided to validate the proposed system. © 2023, The Author(s), under exclusive licence to Springer Nature Singapore Pte Ltd.

Keywords: 3D object detection; ARKit; Assistive technology; Blind or low vision; Voice assistance

Medical image registration and its application in retinal images: a review

Visual Computing for Industry, Biomedicine, and Art, 2024, vol. 7, no. 1, pp. 142-164

Authors: Qiushi Nie, Xiaoqing Zhang, Yan Hu, Mingdao Gong, Jiang Liu. Affiliations: Research Institute of Trustworthy Autonomous Systems and Department of Computer Science and Engineering, Southern University of Science and Technology, Shenzhen 518055, China; Center for High Performance Computing and Shenzhen Key Laboratory of Intelligent Bioinformatics, Shenzhen Institute of Advanced Technology, Chinese Academy of Sciences, Shenzhen 518055, China; Singapore Eye Research Institute, Singapore 169856, Singapore; State Key Laboratory of Ophthalmology, Optometry and Visual Science, Eye Hospital, Wenzhou Medical University, Wenzhou 325027, China

Medical image registration is vital for disease diagnosis and treatment with its ability to merge diverse information of images, which may be captured under different times, angles, or *** several surveys have reviewed the development of medical image registration, they have not systematically summarized the existing medical image registration *** this end, a comprehensive review of these methods is provided from traditional and deep-learning-based perspectives, aiming to help audiences quickly understand the development of medical image *** particular, we review recent advances in retinal image registration, which has not attracted much *** addition, current challenges in retinal image registration are discussed, and insights and prospects for future research are provided.

Keywords: Computer-aided diagnosis; Medical image registration; Deep learning; Generative model; Transformer; Retina

Mahalanobis Distance-based Multi-view Optimal Transport for Multi-view Crowd Localization

arXiv, 2024

Authors: Zhang, Qi; Zhang, Kaiyi; Chan, Antoni B.; Huang, Hui. Affiliations: Visual Computing Research Center, College of Computer Science and Software Engineering, Shenzhen University, Shenzhen, China; Department of Computer Science, City University of Hong Kong, Hong Kong

Multi-view crowd localization predicts the ground locations of all people in the scene. Typical methods usually estimate the crowd density maps on the ground plane first, and then obtain the crowd locations. However, the performance of existing methods is limited by the ambiguity of the density maps in crowded areas, where local peaks can be smoothed away. To mitigate the weakness of density map supervision, optimal transport-based point supervision methods have been proposed in the single-image crowd localization tasks, but have not been explored for multi-view crowd localization yet. Thus, in this paper, we propose a novel Mahalanobis distance-based multi-view optimal transport (M-MVOT) loss specifically designed for multi-view crowd localization. First, we replace the Euclidean-based transport cost with the Mahalanobis distance, which defines elliptical iso-contours in the cost function whose long-axis and short-axis directions are guided by the view ray direction. Second, the object-to-camera distance in each view is used to adjust the optimal transport cost of each location further, where the wrong predictions far away from the camera are more heavily penalized. Finally, we propose a strategy to consider all the input camera views in the model loss (M-MVOT) by computing the optimal transport cost for each ground-truth point based on its closest camera. Experiments demonstrate the advantage of the proposed method over density map-based or common Euclidean distance-based optimal transport loss on several multi-view crowd localization datasets. Project page: MVOT Project. © 2024, CC BY.

Keywords: Cost functions
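To make the first idea in the abstract above concrete, the following sketch compares a Euclidean transport cost with a Mahalanobis cost whose elliptical iso-contours are aligned with the view ray on the ground plane. It is only an illustration under stated assumptions, not the M-MVOT loss: the function names, the choice to elongate the ellipse along the ray, and the `sigma_along` and `sigma_across` values are all hypothetical, and the distance-based weighting and multi-view selection steps are not modeled.

```python
import math

def euclidean_cost(pred, gt):
    """Squared Euclidean distance between two ground-plane points."""
    return (pred[0] - gt[0]) ** 2 + (pred[1] - gt[1]) ** 2

def mahalanobis_cost(pred, gt, camera, sigma_along=2.0, sigma_across=1.0):
    """Squared Mahalanobis distance with axes aligned to the view ray.

    The ray runs from the camera to the ground-truth point. The sigma
    values are illustrative assumptions, not taken from the paper.
    """
    rx, ry = gt[0] - camera[0], gt[1] - camera[1]
    norm = math.hypot(rx, ry)
    if norm == 0:
        raise ValueError("camera and ground-truth point coincide")
    ux, uy = rx / norm, ry / norm          # unit vector along the view ray
    dx, dy = pred[0] - gt[0], pred[1] - gt[1]
    along = dx * ux + dy * uy              # error component along the ray
    across = -dx * uy + dy * ux            # error component across the ray
    return (along / sigma_along) ** 2 + (across / sigma_across) ** 2

camera, gt = (0.0, 0.0), (10.0, 0.0)
print(euclidean_cost((12.0, 0.0), gt), euclidean_cost((10.0, 2.0), gt))  # 4.0 4.0
print(mahalanobis_cost((12.0, 0.0), gt, camera))  # 1.0
print(mahalanobis_cost((10.0, 2.0), gt, camera))  # 4.0
```

Two predictions with the same Euclidean error receive different costs once the direction of the error relative to the view ray is taken into account, which is the effect the elliptical iso-contours are meant to capture.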

Multimodal classification of forest biodiversity potential from 2D orthophotos and 3D airborne laser scanning point clouds

arXiv, 2025

Authors: Jensen, Simon B.; Oehmcke, Stefan; Møgelmose, Andreas; Madadi, Meysam; Igel, Christian; Escalera, Sergio; Moeslund, Thomas B. Affiliations: Visual Analysis and Perception Laboratory, Aalborg University, Denmark; Pioneer Centre for Artificial Intelligence, Denmark; Department of Computer Science, Copenhagen University, Denmark; Institute for Visual & Analytic Computing, Rostock University, Germany; University of Barcelona and Computer Vision Center, Spain

Assessment of forest biodiversity is crucial for ecosystem management and conservation. While traditional field surveys provide high-quality assessments, they are labor-intensive and spatially limited. This study investigates whether deep learning-based fusion of close-range sensing data from 2D orthophotos and 3D airborne laser scanning (ALS) point clouds can reliably assess the biodiversity potential of forests. We introduce the BioVista dataset, comprising 44 378 paired samples of orthophotos and ALS point clouds from temperate forests in Denmark, designed to explore multimodal fusion approaches. Using deep neural networks (ResNet for orthophotos and PointVector for ALS point clouds), we investigate each data modality’s ability to assess forest biodiversity potential, achieving overall accuracies of 76.7% and 75.8%, respectively. We explore various 2D and 3D fusion approaches: confidence-based ensembling, feature-level concatenation, and end-to-end training, achieving overall accuracies of 80.5%, 81.4% and 80.4%, respectively. Our results demonstrate that spectral information from orthophotos and structural information from ALS point clouds effectively complement each other in forest biodiversity assessment. © 2025, CC BY.

Keywords: Laser applications
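Two of the fusion strategies named in the abstract above can be sketched in a few lines. The abstract does not define them precisely, so this is one plausible reading rather than the authors' implementation: confidence-based ensembling is taken to mean "trust whichever modality is more confident", and feature-level concatenation is taken to mean joining the two feature vectors before a shared classifier head. All function names, weights, and values are hypothetical.

```python
def confidence_ensemble(probs_2d, probs_3d):
    """Return (class index, source) from the more confident modality.

    Confidence is assumed to be the maximum class probability.
    """
    conf_2d, conf_3d = max(probs_2d), max(probs_3d)
    if conf_2d >= conf_3d:
        return probs_2d.index(conf_2d), "2d"
    return probs_3d.index(conf_3d), "3d"

def concat_features(feat_2d, feat_3d):
    """Feature-level fusion: join the per-modality feature vectors."""
    return list(feat_2d) + list(feat_3d)

def linear_head(features, weights, bias):
    """Toy classifier head producing one score per class."""
    return [
        sum(w * f for w, f in zip(row, features)) + b
        for row, b in zip(weights, bias)
    ]

# The 3D model is more confident here, so its prediction is used.
print(confidence_ensemble([0.6, 0.4], [0.1, 0.9]))  # (1, '3d')

fused = concat_features([1.0, 2.0], [3.0])
print(linear_head(fused, [[1, 0, 0], [0, 1, 1]], [0.0, 0.5]))  # [1.0, 5.5]
```

Ensembling combines decisions after each model has already committed to class probabilities, whereas concatenation lets a single head weigh spectral and structural features jointly.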

Causal-IQA: Towards the Generalization of Image Quality Assessment Based on Causal Inference

41st International Conference on Machine Learning, ICML 2024

Authors: Zhong, Yan; Wu, Xingyu; Zhang, Li; Yang, Chenxi; Jiang, Tingting. Affiliations: School of Mathematical Sciences, Peking University, Beijing, China; National Engineering Research Center of Visual Technology, National Key Laboratory for Multimedia Information Processing, School of Computer Science, Peking University, Beijing, China; Department of Computing, The Hong Kong Polytechnic University, Hong Kong; Hefei Institute of Physical Science, Chinese Academy of Sciences, University of Science and Technology of China, Hefei, China; National Biomedical Imaging Center, Peking University, Beijing, China

Due to the high cost of Image Quality Assessment (IQA) datasets, achieving robust generalization remains challenging for prevalent deep learning-based IQA *** address this, this paper proposes a novel end-to-end blind IQA method: ***, we first analyze the causal mechanisms in IQA tasks and construct a causal graph to understand the interplay and confounding effects between distortion types, image contents, and subjective human ***, through shifting the focus from correlations to causality, Causal-IQA aims to improve the estimation accuracy of image quality scores by mitigating the confounding effects using a causality-based optimization *** optimization strategy is implemented on the sample subsets constructed by a Counterfactual Division process based on the Backdoor *** experiments illustrate the superiority of Causal-IQA. Copyright 2024 by the author(s)

Keywords: Image correlation

GenRec: unifying video generation and recognition with diffusion models

Proceedings of the 38th International Conference on Neural Information Processing Systems

Authors: Zejia Weng, Xitong Yang, Zhen Xing, Zuxuan Wu, Yu-Gang Jiang. Affiliations: Shanghai Key Lab of Intell. Info. Processing, School of CS, Fudan University and Shanghai Collaborative Innovation Center of Intelligent Visual Computing; Department of Computer Science, University of Maryland

ISBN: 9798331314385 (print)

Towards a Unified User Interface for Visual Analysis of Retinal Data in Ophthalmology

arXiv, 2023

Authors: Röhlig, Martin; Nonnemann, Lars; Schulz, Hans-Jörg; Stachs, Oliver; Schumann, Heidrun. Affiliations: Institute for Visual and Analytic Computing, University of Rostock, Germany; Department of Computer Science, Aarhus University, Denmark; Department of Ophthalmology, Rostock University Medical Center, Germany

The visual analysis of retinal data contributes to the understanding of a wide range of eye diseases. For the evaluation of cross-sectional studies, ophthalmologists rely on workflows and toolsets established in their work environment. That is, they know what tools and data are needed at each step of their workflow. Yet, manually operating the various tools, including activation, data handling, or view arrangement, can be cumbersome and time-consuming. We thus introduce a new visualization-supported toolchaining approach that combines workflow, tools, and data. First, we provide access to the tools required for each step of the workflow. Second, we handle the exchange of data between these tools. Third, we organize the views of the tools on screen using suitable layouts. Fourth, we visualize the connection between workflow, tools, and data to support the data analysis. We demonstrate our approach with a use case in ophthalmic research and report on initial feedback from experts. © 2023, CC BY.

Keywords: Ophthalmology

A deep learning system for predicting time to progression of diabetic retinopathy

Nature Medicine, 2024, vol. 30, no. 2, pp. 358-359

Authors: [Anonymous]. Affiliations: Shanghai Belt and Road International Joint Laboratory for Intelligent Prevention and Treatment of Metabolic Disorders, Department of Computer Science and Engineering, School of Electronic Information and Electrical Engineering, Shanghai Jiao Tong University, Department of Endocrinology and Metabolism, Shanghai Sixth People’s Hospital Affiliated to Shanghai Jiao Tong University School of Medicine, Shanghai Diabetes Institute, Shanghai Clinical Center for Diabetes, Shanghai, China; MOE Key Laboratory of AI, School of Electronic Information and Electrical Engineering, Shanghai Jiao Tong University, Shanghai, China; Department of Ophthalmology, Huadong Sanatorium, Wuxi, China; Department of Ophthalmology, Shanghai Sixth People’s Hospital Affiliated to Shanghai Jiao Tong University School of Medicine, Shanghai, China; Department of Ophthalmology and Visual Sciences, The Chinese University of Hong Kong, Hong Kong, China; Singapore Eye Research Institute, Singapore National Eye Centre, Singapore, Singapore; Department of Computer Science and Engineering, The Hong Kong University of Science and Technology, Hong Kong, China; Department of Chemical and Biological Engineering, The Hong Kong University of Science and Technology, Hong Kong, China; State Key Laboratory of Ophthalmology, Zhongshan Ophthalmic Center, Sun Yat-sen University, Guangdong Provincial Key Laboratory of Ophthalmology and Visual Science, Guangzhou, China; Department of Ophthalmology, Peking Union Medical College Hospital, Peking Union Medical College, Chinese Academy of Medical Sciences, Beijing, China; Medical Records and Statistics Office, Shanghai Sixth People’s Hospital Affiliated to Shanghai Jiao Tong University School of Medicine, Shanghai, China; Department of Geriatrics, Tongji Hospital, Tongji Medical College, Huazhong University of Science and Technology, Wuhan, China; National Engineering Research Center for Big Data Technology and System, Services Computing Technology and System Lab, Cluster and Grid Computing Lab, School of Computer Science and Tech

We developed and validated a deep learning system (termed DeepDR Plus) in a diverse, multiethnic, multi-country dataset to predict personalized risk and time to progression of diabetic retinopathy. We show that DeepDR Plus can be integrated into the clinical workflow to promote individualized intervention strategies for the management of diabetic retinopathy.

Keywords: Diabetes complications; Machine learning; Predictive markers
