Details
ISBN: (Print) 9798400711367
CG (Computer Graphics) is a popular field of CS (Computer Science), but many students find the topic difficult because it requires a broad set of skills, including mathematics, programming, geometric reasoning, and creativity. Over the past few years, researchers have investigated ways to harness the power of GenAI (Generative Artificial Intelligence) to improve teaching. In CS, much of this research has focused on introductory computing. A recent study evaluating the performance of an LLM (Large Language Model), GPT-4 (text-only), on CG questions indicated poor performance and a reliance on detailed descriptions of image content, which often required considerable insight from the user to obtain reasonable results. So far, no studies have investigated the abilities of LMMs (Large Multimodal Models), or multimodal LLMs, to solve CG questions, or how these abilities could be used to improve teaching. In this study, we construct two datasets of CG questions requiring varying degrees of visual perception and geometric reasoning skills, and evaluate the current state-of-the-art LMM, GPT-4o, on both. We find that although GPT-4o shows great potential for solving questions with visual information independently, major limitations remain in the accuracy and quality of the generated results. We propose several novel approaches for CG educators to incorporate GenAI into CG teaching despite these limitations, and we hope that our guidelines further encourage learning and engagement in CG classrooms.
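The evaluation workflow described above, posing image-based CG questions to a multimodal model, can be reproduced with a short script. The sketch below assumes the OpenAI Python SDK (v1+) and an API key in the environment; the image file name and question wording are hypothetical placeholders, not items from the paper's datasets.

import base64
from openai import OpenAI  # assumes the openai Python SDK (v1+) is installed

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def ask_cg_question(image_path: str, question: str) -> str:
    """Send one image-plus-text computer graphics question to GPT-4o."""
    with open(image_path, "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode("utf-8")
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": question},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
            ],
        }],
    )
    return response.choices[0].message.content

# Hypothetical question; the paper's two datasets are not reproduced here.
print(ask_cg_question(
    "bezier_control_points.png",
    "Using the control points shown in the figure, describe the shape of the "
    "resulting Bezier curve and state its degree. Explain your reasoning.",
))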
Details
ISBN: (Print) 9798331529543; 9798331529550
The emergence of foundation Vision-Language Models (VLMs) has ignited a surge of research in the computer vision field due to their robust baseline performance. Inspired by this, we propose the Anchoring Vision-Language Network (AnViL-Net), which integrates a vision-language model for the challenging task of Weakly-Supervised Group Activity Recognition (WSGAR). Our network effectively incorporates VLMs into WSGAR, addressing the challenges posed by dynamic actor motions and domain-specific activity classes. AnViL-Net leverages highly generalized VLM vision features as anchors for extracting visual features. Additionally, semantically meaningful VLM language features serve as anchors for inferring the semantic relationships between actors and their activities. We demonstrate the effectiveness of AnViL-Net on multiple group activity datasets, achieving competitive, state-of-the-art results.
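The abstract does not spell out how the anchoring is implemented, but the general pattern, conditioning task-specific actor features on frozen VLM vision tokens and scoring group activities against frozen VLM text embeddings, can be sketched as below. The module names, dimensions, and the use of cross-attention are assumptions made for illustration, not the published AnViL-Net architecture.

import torch
import torch.nn as nn
import torch.nn.functional as F

class AnchoredGroupActivityHead(nn.Module):
    """Illustrative (not the published) anchoring scheme: actor features attend
    to frozen VLM vision tokens, then the pooled group feature is matched
    against frozen VLM text embeddings of the activity names."""

    def __init__(self, actor_dim: int = 256, vlm_dim: int = 512, heads: int = 4):
        super().__init__()
        self.proj = nn.Linear(actor_dim, vlm_dim)
        # Cross-attention: actors are queries, VLM vision tokens are keys/values.
        self.cross_attn = nn.MultiheadAttention(vlm_dim, heads, batch_first=True)

    def forward(self, actor_feats, vlm_vision_tokens, vlm_text_anchors):
        # actor_feats:       (B, num_actors, actor_dim)  from a person detector/tracker
        # vlm_vision_tokens: (B, num_tokens, vlm_dim)    frozen VLM image features
        # vlm_text_anchors:  (num_classes, vlm_dim)      frozen VLM activity-name embeddings
        q = self.proj(actor_feats)
        anchored, _ = self.cross_attn(q, vlm_vision_tokens, vlm_vision_tokens)
        group = anchored.mean(dim=1)                     # pool actors into a group feature
        group = F.normalize(group, dim=-1)
        anchors = F.normalize(vlm_text_anchors, dim=-1)
        return group @ anchors.t()                       # cosine logits over activity classes

# Toy shapes only; real features would come from a frozen VLM such as CLIP.
head = AnchoredGroupActivityHead()
logits = head(torch.randn(2, 12, 256), torch.randn(2, 50, 512), torch.randn(8, 512))
print(logits.shape)  # torch.Size([2, 8])

Keeping the VLM features frozen is what would make them usable as stable anchors; only the small projection and attention layers would be trained on the group activity data.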
Details
ISBN: (Print) 9798400701085
Zero-shot video recognition (ZSVR) is a task that aims to recognize video categories that have not been seen during model training. Recently, vision-language models (VLMs) pre-trained on large-scale image-text pairs have demonstrated impressive transferability for ZSVR. To make VLMs applicable to the video domain, existing methods often add a temporal learning module after the image-level encoder to learn the temporal relationships among video frames. Unfortunately, for videos from unseen categories, we observe an abnormal phenomenon: the model that uses the spatial-temporal feature performs much worse than the model that removes the temporal learning module and uses only the spatial feature. We conjecture that improper temporal modeling of the video disrupts its spatial feature. To verify this hypothesis, we propose Feature Factorization to retain the orthogonal temporal feature of the video and use interpolation to construct a refined spatial-temporal feature. The model using the appropriately refined spatial-temporal feature performs better than the one using only the spatial feature, which verifies the effectiveness of the orthogonal temporal feature for the ZSVR task. Therefore, an Orthogonal Temporal Interpolation module is designed to learn a better refined spatial-temporal video feature during training. Additionally, a Matching Loss is introduced to improve the quality of the orthogonal temporal feature. We propose a model called OTI for ZSVR that employs orthogonal temporal interpolation and the Matching Loss on top of VLMs. The ZSVR accuracies on popular video datasets (i.e., Kinetics-600, UCF101, and HMDB51) show that OTI outperforms the previous state-of-the-art method by a clear margin. Our code is publicly available at https://***/yanzhu/mm2023_oti.
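One plausible reading of the Feature Factorization step is a Gram-Schmidt-style projection: remove from the spatial-temporal feature its component along the spatial feature, keep the orthogonal residual as the temporal part, and blend it back with an interpolation weight. The sketch below implements that reading; the variable names and the weight alpha are illustrative and do not reproduce the exact OTI formulation or its Matching Loss.

import torch
import torch.nn.functional as F

def orthogonal_temporal_interpolation(spatial, spatio_temporal, alpha=0.5):
    """Illustrative factorization: keep only the part of the spatio-temporal
    feature that is orthogonal to the spatial feature, then interpolate.

    spatial, spatio_temporal: (B, D) video-level features
    alpha: interpolation weight for the orthogonal temporal component
    """
    s_hat = F.normalize(spatial, dim=-1)
    # Component of the spatio-temporal feature along the spatial direction.
    parallel = (spatio_temporal * s_hat).sum(dim=-1, keepdim=True) * s_hat
    # Orthogonal residual, treated as the "pure" temporal feature.
    temporal_orth = spatio_temporal - parallel
    refined = spatial + alpha * temporal_orth
    return F.normalize(refined, dim=-1), F.normalize(temporal_orth, dim=-1)

spatial = torch.randn(4, 512)
spatio_temporal = torch.randn(4, 512)
refined, t_orth = orthogonal_temporal_interpolation(spatial, spatio_temporal)
# The residual is orthogonal to the spatial direction by construction.
print(torch.allclose((t_orth * F.normalize(spatial, dim=-1)).sum(-1),
                     torch.zeros(4), atol=1e-5))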
Details
ISBN: (Print) 9798400712074
Preclinical drug safety assessment is a critical step in drug development that relies on time-consuming manual histopathological examination, which is prone to high inter-observer variability. Artificial intelligence methods hold promise to accelerate this assessment and enhance reproducibility. Here, we introduce VQ-TOJO, a novel self-supervised Vision Transformer model designed for general toxicologic histopathology assessment. VQ-TOJO was trained on 15 million image patches extracted from 46,734 digitized tissue sections spanning 157 preclinical studies in rats. We demonstrate VQ-TOJO's versatility in tackling various diagnostic tasks across multiple scales, including in scenarios with limited labeled data. VQ-TOJO achieves state-of-the-art performance on tasks such as weakly supervised slide classification, patch-level lesion classification, and automatic dose-time response assessment.
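A common way to use such a foundation encoder when labeled data are scarce is linear probing: freeze the backbone and fit a small classifier on its features. The sketch below uses a generic ImageNet-pretrained ViT from the timm library as a stand-in for VQ-TOJO, whose weights are not referenced here, and random tensors in place of real histopathology patches and lesion labels.

import torch
import torch.nn as nn
import timm  # assumes timm is installed; a generic ViT stands in for VQ-TOJO

# Frozen pretrained encoder; num_classes=0 makes the model return pooled features.
encoder = timm.create_model("vit_base_patch16_224", pretrained=True, num_classes=0)
encoder.eval()
for p in encoder.parameters():
    p.requires_grad_(False)

# Lightweight linear probe for patch-level lesion classification (e.g. 5 lesion types).
probe = nn.Linear(encoder.num_features, 5)
optimizer = torch.optim.AdamW(probe.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()

# Hypothetical mini-batch standing in for labeled histology patches.
images = torch.randn(8, 3, 224, 224)
labels = torch.randint(0, 5, (8,))

with torch.no_grad():
    feats = encoder(images)          # (8, encoder.num_features)
logits = probe(feats)
loss = criterion(logits, labels)
loss.backward()
optimizer.step()
print(float(loss))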
Details
Fire detection from images or videos has gained growing interest in recent years due to the criticality of the application. Both reliable real-time detectors and efficient retrieval techniques, able to process large databases acquired by sensor networks, are needed. Even though the reliability of artificial vision methods has improved in recent years, some issues remain open problems. In particular, methods in the literature often show low generalization capability when employed in scenarios that differ from the training ones in terms of framing distance, surrounding environment, or weather conditions. This can be addressed by considering contextual information and, more specifically, by using vision-language models capable of interpreting and describing the framed scene. In this work, we propose FIRE-TASTIC: FIre REcognition with Task-Aware Spatio-Temporal Image Captioning, a novel framework that uses object detectors in conjunction with vision-language models for fire detection and information retrieval. The localization capability of the former makes it able to detect even tiny fire traces but exposes the system to false alarms. These are strongly reduced by the impressive zero-shot generalization capability of the latter, which can recognize and describe fire-like objects without prior fine-tuning. We also present a variant of the FIRE-TASTIC framework based on Visual Question Answering instead of Image Captioning, which allows one to customize the retrieved information with personalized questions. To integrate the high-level information provided by both neural networks, we propose a novel method to query the vision-language models using the temporal and spatial localization information provided by the object detector. The proposal improves retrieval performance, as evidenced by experiments conducted on two recent fire detection datasets, which show the effectiveness and generalization capabilities of FIRE-TASTIC and demonstrate that it surpasses the state of the art.
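The core idea of routing detector output into a vision-language model can be sketched independently of the specific networks used in FIRE-TASTIC. Below, detections are expanded into context crops and each crop is passed to a VQA-style model with a task-aware question; ask_vlm is a deliberately hypothetical stand-in for whatever captioning or VQA model is plugged in, and the box format, file name, and question wording are illustrative.

from PIL import Image

def expand_box(box, scale, width, height):
    """Grow a detector box (x1, y1, x2, y2) so the VLM sees surrounding context."""
    x1, y1, x2, y2 = box
    cx, cy = (x1 + x2) / 2, (y1 + y2) / 2
    w, h = (x2 - x1) * scale, (y2 - y1) * scale
    return (max(0, int(cx - w / 2)), max(0, int(cy - h / 2)),
            min(width, int(cx + w / 2)), min(height, int(cy + h / 2)))

def ask_vlm(crop, question):
    """Hypothetical placeholder for a VQA model call (e.g. a BLIP-style VLM)."""
    return "yes, flames and smoke are visible"  # canned answer for the sketch

def verify_fire_detections(image_path, boxes, scale=2.0):
    """Query the VLM on each (possibly tiny) fire candidate found by the detector."""
    frame = Image.open(image_path).convert("RGB")
    answers = []
    for box in boxes:
        crop = frame.crop(expand_box(box, scale, *frame.size))
        answers.append(ask_vlm(crop, "Is there fire or smoke in this region?"))
    # Keep only detections the VLM confirms, reducing detector false alarms.
    return [a for a in answers if a.lower().startswith("yes")]

# Hypothetical detector output on a single frame.
print(verify_fire_detections("frame_0001.jpg", [(120, 80, 160, 130)]))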