Details
ISBN: (Print) 9798400711367
CG (Computer Graphics) is a popular field of CS (Computer Science), but many students find the topic difficult because it requires a broad set of skills, including mathematics, programming, geometric reasoning, and creativity. Over the past few years, researchers have investigated ways to harness the power of GenAI (Generative Artificial Intelligence) to improve teaching. In CS, much of this research has focused on introductory computing. A recent study evaluating the performance of an LLM (Large Language Model), GPT-4 (text-only), on CG questions indicated poor performance and a reliance on detailed descriptions of image content, which often required considerable insight from the user to obtain reasonable results. So far, no studies have investigated the abilities of LMMs (Large Multimodal Models), or multimodal LLMs, to solve CG questions, or how these abilities could be used to improve teaching. In this study, we construct two datasets of CG questions requiring varying degrees of visual perception and geometric reasoning skills, and evaluate the current state-of-the-art LMM, GPT-4o, on both. We find that although GPT-4o shows great potential for solving questions with visual information independently, major limitations remain in the accuracy and quality of the generated results. We propose several novel approaches for CG educators to incorporate GenAI into CG teaching despite these limitations, and we hope that our guidelines further encourage learning and engagement in CG classrooms.
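The evaluation workflow described above, posing image-based CG questions to a multimodal model, can be reproduced with a short script. The sketch below assumes the OpenAI Python SDK (v1+) and an API key in the environment; the image file name and question wording are hypothetical placeholders, not items from the paper's datasets.

import base64
from openai import OpenAI  # assumes the openai Python SDK (v1+) is installed

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def ask_cg_question(image_path: str, question: str) -> str:
    """Send one image-plus-text computer graphics question to GPT-4o."""
    with open(image_path, "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode("utf-8")
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": question},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
            ],
        }],
    )
    return response.choices[0].message.content

# Hypothetical question; the paper's two datasets are not reproduced here.
print(ask_cg_question(
    "bezier_control_points.png",
    "Using the control points shown in the figure, describe the shape of the "
    "resulting Bezier curve and state its degree. Explain your reasoning.",
))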
Details
ISBN: (Print) 9798331529543; 9798331529550
The emergence of foundation Vision-Language Models (VLMs) has ignited a surge of research in the computer vision field due to their robust baseline performance. Inspired by this, we propose the Anchoring Vision-Language Network (AnViL-Net), which integrates a vision-language model for the challenging task of Weakly-Supervised Group Activity Recognition (WSGAR). Our network effectively incorporates VLMs into WSGAR, addressing the challenges posed by dynamic actor motions and domain-specific activity classes. AnViL-Net leverages highly generalized VLM vision features as anchors for extracting visual features. Additionally, semantically meaningful VLM language features serve as anchors for inferring the semantic relationships between actors and their activities. We demonstrate the effectiveness of AnViL-Net on multiple group activity datasets, achieving competitive, state-of-the-art results.
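The abstract does not spell out how the anchoring is implemented, but the general pattern, conditioning task-specific actor features on frozen VLM vision tokens and scoring group activities against frozen VLM text embeddings, can be sketched as below. The module names, dimensions, and the use of cross-attention are assumptions made for illustration, not the published AnViL-Net architecture.

import torch
import torch.nn as nn
import torch.nn.functional as F

class AnchoredGroupActivityHead(nn.Module):
    """Illustrative (not the published) anchoring scheme: actor features attend
    to frozen VLM vision tokens, then the pooled group feature is matched
    against frozen VLM text embeddings of the activity names."""

    def __init__(self, actor_dim: int = 256, vlm_dim: int = 512, heads: int = 4):
        super().__init__()
        self.proj = nn.Linear(actor_dim, vlm_dim)
        # Cross-attention: actors are queries, VLM vision tokens are keys/values.
        self.cross_attn = nn.MultiheadAttention(vlm_dim, heads, batch_first=True)

    def forward(self, actor_feats, vlm_vision_tokens, vlm_text_anchors):
        # actor_feats:       (B, num_actors, actor_dim)  from a person detector/tracker
        # vlm_vision_tokens: (B, num_tokens, vlm_dim)    frozen VLM image features
        # vlm_text_anchors:  (num_classes, vlm_dim)      frozen VLM activity-name embeddings
        q = self.proj(actor_feats)
        anchored, _ = self.cross_attn(q, vlm_vision_tokens, vlm_vision_tokens)
        group = anchored.mean(dim=1)                     # pool actors into a group feature
        group = F.normalize(group, dim=-1)
        anchors = F.normalize(vlm_text_anchors, dim=-1)
        return group @ anchors.t()                       # cosine logits over activity classes

# Toy shapes only; real features would come from a frozen VLM such as CLIP.
head = AnchoredGroupActivityHead()
logits = head(torch.randn(2, 12, 256), torch.randn(2, 50, 512), torch.randn(8, 512))
print(logits.shape)  # torch.Size([2, 8])

Keeping the VLM features frozen is what would make them usable as stable anchors; only the small projection and attention layers would be trained on the group activity data.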
Details
ISBN: (Print) 9798400701085
Zero-shot video recognition (ZSVR) is a task that aims to recognize video categories that have not been seen during model training. Recently, vision-language models (VLMs) pre-trained on large-scale image-text pairs have demonstrated impressive transferability for ZSVR. To make VLMs applicable to the video domain, existing methods often add a temporal learning module after the image-level encoder to learn the temporal relationships among video frames. Unfortunately, for videos from unseen categories, we observe an abnormal phenomenon: the model that uses the spatial-temporal feature performs much worse than the model that removes the temporal learning module and uses only the spatial feature. We conjecture that improper temporal modeling of the video disrupts its spatial feature. To verify this hypothesis, we propose Feature Factorization to retain the orthogonal temporal feature of the video and use interpolation to construct a refined spatial-temporal feature. The model using the appropriately refined spatial-temporal feature performs better than the one using only the spatial feature, which verifies the effectiveness of the orthogonal temporal feature for the ZSVR task. Therefore, an Orthogonal Temporal Interpolation module is designed to learn a better refined spatial-temporal video feature during training. Additionally, a Matching Loss is introduced to improve the quality of the orthogonal temporal feature. We propose a model called OTI for ZSVR that employs orthogonal temporal interpolation and the Matching Loss on top of VLMs. The ZSVR accuracies on popular video datasets (i.e., Kinetics-600, UCF101, and HMDB51) show that OTI outperforms the previous state-of-the-art method by a clear margin. Our code is publicly available at https://***/yanzhu/mm2023_oti.
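One plausible reading of the Feature Factorization step is a Gram-Schmidt-style projection: remove from the spatial-temporal feature its component along the spatial feature, keep the orthogonal residual as the temporal part, and blend it back with an interpolation weight. The sketch below implements that reading; the variable names and the weight alpha are illustrative and do not reproduce the exact OTI formulation or its Matching Loss.

import torch
import torch.nn.functional as F

def orthogonal_temporal_interpolation(spatial, spatio_temporal, alpha=0.5):
    """Illustrative factorization: keep only the part of the spatio-temporal
    feature that is orthogonal to the spatial feature, then interpolate.

    spatial, spatio_temporal: (B, D) video-level features
    alpha: interpolation weight for the orthogonal temporal component
    """
    s_hat = F.normalize(spatial, dim=-1)
    # Component of the spatio-temporal feature along the spatial direction.
    parallel = (spatio_temporal * s_hat).sum(dim=-1, keepdim=True) * s_hat
    # Orthogonal residual, treated as the "pure" temporal feature.
    temporal_orth = spatio_temporal - parallel
    refined = spatial + alpha * temporal_orth
    return F.normalize(refined, dim=-1), F.normalize(temporal_orth, dim=-1)

spatial = torch.randn(4, 512)
spatio_temporal = torch.randn(4, 512)
refined, t_orth = orthogonal_temporal_interpolation(spatial, spatio_temporal)
# The residual is orthogonal to the spatial direction by construction.
print(torch.allclose((t_orth * F.normalize(spatial, dim=-1)).sum(-1),
                     torch.zeros(4), atol=1e-5))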
Details
ISBN: (Print) 9798400712074
Preclinical drug safety assessment is a critical step in drug development that relies on time-consuming manual histopathological examination, which is prone to high inter-observer variability. Artificial intelligence methods hold promise to accelerate this assessment and enhance reproducibility. Here, we introduce VQ-TOJO, a novel self-supervised Vision Transformer model designed for general toxicologic histopathology assessment. VQ-TOJO was trained on 15 million image patches extracted from 46,734 digitized tissue sections spanning 157 preclinical studies in rats. We demonstrate VQ-TOJO's versatility in tackling various diagnostic tasks across multiple scales, including in scenarios with limited labeled data. VQ-TOJO achieves state-of-the-art performance on tasks such as weakly supervised slide classification, patch-level lesion classification, and automatic dose-time response assessment.
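A common way to use such a foundation encoder when labeled data are scarce is linear probing: freeze the backbone and fit a small classifier on its features. The sketch below uses a generic ImageNet-pretrained ViT from the timm library as a stand-in for VQ-TOJO, whose weights are not referenced here, and random tensors in place of real histopathology patches and lesion labels.

import torch
import torch.nn as nn
import timm  # assumes timm is installed; a generic ViT stands in for VQ-TOJO

# Frozen pretrained encoder; num_classes=0 makes the model return pooled features.
encoder = timm.create_model("vit_base_patch16_224", pretrained=True, num_classes=0)
encoder.eval()
for p in encoder.parameters():
    p.requires_grad_(False)

# Lightweight linear probe for patch-level lesion classification (e.g. 5 lesion types).
probe = nn.Linear(encoder.num_features, 5)
optimizer = torch.optim.AdamW(probe.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()

# Hypothetical mini-batch standing in for labeled histology patches.
images = torch.randn(8, 3, 224, 224)
labels = torch.randint(0, 5, (8,))

with torch.no_grad():
    feats = encoder(images)          # (8, encoder.num_features)
logits = probe(feats)
loss = criterion(logits, labels)
loss.backward()
optimizer.step()
print(float(loss))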
Details
Fire detection from images or videos has gained growing interest in recent years due to the criticality of the application. Both reliable real-time detectors and efficient retrieval techniques, able to process large databases acquired by sensor networks, are needed. Even though the reliability of artificial vision methods has improved in recent years, some issues remain open problems. In particular, methods in the literature often show low generalization capability when employed in scenarios that differ from the training ones in terms of framing distance, surrounding environment, or weather conditions. This can be addressed by considering contextual information and, more specifically, by using vision-language models capable of interpreting and describing the framed scene. In this work, we propose FIRE-TASTIC: FIre REcognition with Task-Aware Spatio-Temporal Image Captioning, a novel framework that uses object detectors in conjunction with vision-language models for fire detection and information retrieval. The localization capability of the former makes it able to detect even tiny fire traces but exposes the system to false alarms. These are strongly reduced by the impressive zero-shot generalization capability of the latter, which can recognize and describe fire-like objects without prior fine-tuning. We also present a variant of the FIRE-TASTIC framework based on Visual Question Answering instead of Image Captioning, which allows one to customize the retrieved information with personalized questions. To integrate the high-level information provided by both neural networks, we propose a novel method to query the vision-language models using the temporal and spatial localization information provided by the object detector. The proposal improves retrieval performance, as evidenced by experiments conducted on two recent fire detection datasets, which show the effectiveness and generalization capabilities of FIRE-TASTIC and demonstrate that it surpasses the state of the art.
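The core idea of routing detector output into a vision-language model can be sketched independently of the specific networks used in FIRE-TASTIC. Below, detections are expanded into context crops and each crop is passed to a VQA-style model with a task-aware question; ask_vlm is a deliberately hypothetical stand-in for whatever captioning or VQA model is plugged in, and the box format, file name, and question wording are illustrative.

from PIL import Image

def expand_box(box, scale, width, height):
    """Grow a detector box (x1, y1, x2, y2) so the VLM sees surrounding context."""
    x1, y1, x2, y2 = box
    cx, cy = (x1 + x2) / 2, (y1 + y2) / 2
    w, h = (x2 - x1) * scale, (y2 - y1) * scale
    return (max(0, int(cx - w / 2)), max(0, int(cy - h / 2)),
            min(width, int(cx + w / 2)), min(height, int(cy + h / 2)))

def ask_vlm(crop, question):
    """Hypothetical placeholder for a VQA model call (e.g. a BLIP-style VLM)."""
    return "yes, flames and smoke are visible"  # canned answer for the sketch

def verify_fire_detections(image_path, boxes, scale=2.0):
    """Query the VLM on each (possibly tiny) fire candidate found by the detector."""
    frame = Image.open(image_path).convert("RGB")
    answers = []
    for box in boxes:
        crop = frame.crop(expand_box(box, scale, *frame.size))
        answers.append(ask_vlm(crop, "Is there fire or smoke in this region?"))
    # Keep only detections the VLM confirms, reducing detector false alarms.
    return [a for a in answers if a.lower().startswith("yes")]

# Hypothetical detector output on a single frame.
print(verify_fire_detections("frame_0001.jpg", [(120, 80, 160, 130)]))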