Search Results - Inner Mongolia University Library

ScanDTM: A Novel Dual-Temporal Modulation Scanpath Prediction Model for Omnidirectional Images

IEEE Transactions on Circuits and Systems for Video Technology, 2025

Authors: Zhu, Dandan; Zhang, Kaiwei; Min, Xiongkuo; Zhai, Guangtao; Yang, Xiaokang. Affiliations: East China Normal University, School of Computer Science and Technology, Shanghai 200333, China; Shanghai Jiao Tong University, Institute of Image Communication and Network Engineering, Shanghai 200240, China; Shanghai Jiao Tong University, MoE Key Lab of Artificial Intelligence, AI Institute, Shanghai 200240, China

Scanpath prediction for omnidirectional images aims to effectively simulate the human visual perception mechanism to generate dynamic realistic fixation trajectories. However, the majority of scanpath prediction methods for omnidirectional images are still in their infancy as they fail to accurately capture the time-dependency of viewing behavior and suffer from sub-optimal performance along with limited generalization capability. A desirable solution should achieve a better trade-off between prediction performance and generalization ability. To this end, we propose a novel dual-temporal modulation scanpath prediction (ScanDTM) model for omnidirectional images. Such a model is designed to effectively capture long-range time-dependencies between various fixation regions across both internal and external time dimensions, thereby generating more realistic scanpaths. In particular, we design a Dual Graph Convolutional Network (Dual-GCN) module comprising a semantic-level GCN and an image-level GCN. This module serves as a robust visual encoder that captures spatial relationships among various object regions within an image and fully utilizes similar images as complementary information to capture similarity relations across relevant images. Notably, the proposed Dual-GCN focuses on modeling temporal correlations from both local and global perspectives within the internal time dimension. Furthermore, drawing inspiration from the promising generalization capabilities of diffusion models across various generative tasks, we introduce a novel diffusion-guided saliency module. This module formulates the prediction issue as a conditional generative process for the saliency map, utilizing extracted semantic-level and image-level visual features as conditions. With the well-designed diffusion-guided saliency module acting as an external temporal modulator, our proposed ScanDTM model can progressively refine the generated scanpath from the noisy map. We conduct extensive expe

Keywords: Prediction models


Beyond the Status Quo: A Contemporary Survey of Advances and Challenges in Audio Captioning


arXiv, 2022

Authors: Xu, Xuenan; Xie, Zeyu; Wu, Mengyue; Yu, Kai. Affiliations: X-LANCE Lab, Department of Computer Science and Engineering, MoE Key Lab of Artificial Intelligence, AI Institute, Shanghai Jiao Tong University, China

Automated audio captioning (AAC), a task that mimics human perception as well as innovatively links audio processing and natural language processing, has seen much progress over the last few years. AAC requires recognizing contents such as the environment, sound events and the temporal relationships between sound events and describing these elements with a fluent sentence. Currently, an encoder-decoder-based deep learning framework is the standard approach to tackle this problem. Plenty of works have proposed novel network architectures and training schemes, including extra guidance, reinforcement learning, audio-text self-supervised learning and diverse or controllable captioning. Effective data augmentation techniques, especially those based on large language models, are explored. Benchmark datasets and AAC-oriented evaluation metrics also accelerate the improvement of this field. This paper situates itself as a comprehensive survey covering the comparison between AAC and its related tasks, the existing deep learning techniques, datasets, and the evaluation metrics in AAC, with insights provided to guide potential future research directions. © 2022, CC BY.

Keywords: Network architecture


A Cognitive Evaluation Benchmark of Image Reasoning and Description for Large Vision-Language Models


arXiv, 2024

Authors: Song, Xiujie; Wu, Mengyue; Zhu, Kenny Q.; Zhang, Chunhao; Chen, Yanyi. Affiliations: X-LANCE Lab, Department of Computer Science and Engineering, MoE Key Lab of Artificial Intelligence, AI Institute, Shanghai Jiao Tong University, Shanghai, China; University of Texas at Arlington, Arlington, TX, United States; University of Chicago, Chicago, IL, United States

Large Vision-Language Models (LVLMs), despite their recent success, have rarely been comprehensively tested for their cognitive abilities. Inspired by the prevalent use of the Cookie Theft task in human cognitive tests, we propose a novel evaluation benchmark to evaluate high-level cognitive abilities of LVLMs using images with rich semantics. The benchmark consists of 251 images along with comprehensive annotations. It defines eight reasoning capabilities and comprises an image description task and a visual question answering task. Our evaluation of well-known LVLMs shows that there is still a significant gap in cognitive abilities between LVLMs and humans. Copyright © 2024, The Authors. All rights reserved.

Keywords: Visual languages


Robust Cross-Domain Speaker Verification with Multi-Level Domain Adapters


International Conference on Acoustics, Speech, and Signal Processing (ICASSP)

Authors: Wen Huang; Bing Han; Shuai Wang; Zhengyang Chen; Yanmin Qian. Affiliations: AI Institute, Department of Computer Science and Engineering, Auditory Cognition and Computational Acoustics Lab, MoE Key Lab of Artificial Intelligence, Shanghai Jiao Tong University, Shanghai, China; Shenzhen Research Institute of Big Data, The Chinese University of Hong Kong, Shenzhen, China

Speaker verification encounters significant challenges when confronted with diverse domain data, often resulting in performance degradation due to domain mismatch. To enhance performance in cross-domain scenarios, we introduce the Domain Adapter, an adaptable module designed for specific domains. This module learns and integrates domain-specific information with speaker-related data, mitigating domain-related variations and promoting convergence of utterance embeddings from the same speaker across diverse domains. It offers configurability across multiple levels and is adaptable to various backbone architectures. Our proposed module substantially enhances cross-domain performance with minimal parameter increments while effectively generalizing to previously unseen domains. In our experiments, we present results on the 3D-Speaker dataset, which provides acoustically-relevant attributes crucial for domain categorization and the subsequent learning of domain information. The top-performing system integrated with domain adapters achieved 10.8%, 14.8%, and 21.1% EER improvements over the baseline across three 3D-Speaker dataset trials.


ShapeBoost: Boosting Human Shape Estimation with Part-Based Parameterization and Clothing-Preserving Augmentation


arXiv, 2024

Authors: Bian, Siyuan; Li, Jiefeng; Tang, Jiasheng; Lu, Cewu. Affiliations: Department of Computer Science and Engineering, Shanghai Jiao Tong University, Shanghai, China; MoE Key Lab of Artificial Intelligence, AI Institute, Shanghai Jiao Tong University, Shanghai, China; DAMO Academy, Alibaba Group, Hangzhou, China; Hupan Lab, Hangzhou, China

Accurate human shape recovery from a monocular RGB image is a challenging task because humans come in different shapes and sizes and wear different clothes. In this paper, we propose ShapeBoost, a new human shape recovery framework that achieves pixel-level alignment even for rare body shapes and high accuracy for people wearing different types of clothes. Unlike previous approaches that rely on the use of PCA-based shape coefficients, we adopt a new human shape parameterization that decomposes the human shape into bone lengths and the mean width of each part slice. This part-based parameterization technique achieves a balance between flexibility and validity using a semi-analytical shape reconstruction algorithm. Based on this new parameterization, a clothing-preserving data augmentation module is proposed to generate realistic images with diverse body shapes and accurate annotations. Experimental results show that our method outperforms other state-of-the-art methods in diverse body shape situations as well as in varied clothing situations. Copyright © 2024, The Authors. All rights reserved.

Keywords: Parameterization


Prototype and Instance Contrastive Learning for Unsupervised Domain Adaptation in Speaker Verification


arXiv, 2024

Authors: Huang, Wen; Han, Bing; Chen, Zhengyang; Wang, Shuai; Qian, Yanmin. Affiliations: Auditory Cognition and Computational Acoustics Lab, MoE Key Lab of Artificial Intelligence, AI Institute, Department of Computer Science and Engineering, Shanghai Jiao Tong University, Shanghai, China; Shenzhen Research Institute of Big Data, The Chinese University of Hong Kong, Shenzhen, China

A speaker verification system trained on one domain usually suffers performance degradation when applied to another domain. To address this challenge, researchers commonly use feature distribution matching-based methods in unsupervised domain adaptation scenarios where some unlabeled target domain data is available. However, these methods often yield limited performance improvement and generalize poorly across different mismatch situations. In this paper, we propose Prototype and Instance Contrastive Learning (PICL), a novel method for unsupervised domain adaptation in speaker verification through dual-level contrastive learning. For prototype contrastive learning, we generate pseudo labels via clustering to create dynamically updated prototype representations, aligning instances with their corresponding class or cluster prototypes. For instance contrastive learning, we minimize the distance between different views or augmentations of the same instance, ensuring robust and invariant representations resilient to variations like noise. This dual-level approach provides both high-level and low-level supervision, leading to improved generalization and robustness of the speaker verification model. Unlike previous studies that only evaluated mismatches in one situation, we have conducted relevant explorations on various datasets and achieved state-of-the-art performance, which also demonstrates the generalization of our method. Copyright © 2024, The Authors. All rights reserved.

Keywords: Contrastive Learning


Is Your Image a Good Storyteller?


39th Annual AAAI Conference on Artificial Intelligence, AAAI 2025

Authors: Song, Xiujie; Pang, Xiaoyi; Tang, Haifeng; Wu, Mengyue; Zhu, Kenny Q. Affiliations: X-LANCE Lab, Department of Computer Science and Engineering, MoE Key Lab of Artificial Intelligence, AI Institute, Shanghai Jiao Tong University, Shanghai, China; China Merchants Bank Credit Card Center, Shanghai, China; University of Texas at Arlington, Arlington, TX, United States

ISBN: (print) 157735897X

Quantifying image complexity at the entity level is straightforward, but the assessment of semantic complexity has been largely overlooked. In fact, there are differences in semantic complexity across images. Images with richer semantics can tell vivid and engaging stories and offer a wide range of application scenarios. For example, the Cookie Theft picture is one such image and is widely used to assess human language and cognitive abilities due to its higher semantic complexity. Additionally, semantically rich images can benefit the development of vision models, as images with limited semantics are becoming less challenging for them. However, such images are scarce, highlighting the need for a greater number of them. For instance, there is a need for more images like Cookie Theft to cater to people from different cultural backgrounds and eras. Assessing semantic complexity requires human experts and empirical evidence. Automatic evaluation of how semantically rich an image is will be the first step toward mining or generating more images with rich semantics, and will benefit human cognitive assessment, artificial intelligence, and various other applications. In response, we propose the Image Semantic Assessment (ISA) task to address this problem. We introduce the first ISA dataset and a novel method that leverages language to solve this vision problem. Experiments on our dataset demonstrate the effectiveness of our approach. © 2025, Association for the Advancement of Artificial Intelligence (***). All rights reserved.


Exploring Effective Distillation of Self-Supervised Speech Models for Automatic Speech Recognition


IEEE Workshop on Automatic Speech Recognition and Understanding

Authors: Yujin Wang; Changli Tang; Ziyang Ma; Zhisheng Zheng; Xie Chen; Wei-Qiang Zhang. Affiliations: Department of Electronic Engineering, Tsinghua University; Department of Computer Science and Engineering, MoE Key Lab of Artificial Intelligence, AI Institute, X-LANCE Lab, Shanghai Jiao Tong University, Shanghai, China; Peng Cheng Laboratory, Shenzhen, China

Self-supervised learning (SSL) has achieved great success in speech processing, but always with a large model size to increase the modeling capacity. This may limit its potential applications due to the expensive computation and memory costs introduced by the oversized model. Compression for SSL models has become an important research direction of practical value. To this end, we explore the effective distillation of HuBERT-based SSL models for automatic speech recognition. First, a comprehensive study of different student model structures is conducted. On top of this, as a supplement to the regression loss widely adopted in previous works, a discriminative loss is introduced for HuBERT to enhance the distillation performance, especially in low-resource scenarios. In addition, we design a simple and effective algorithm to distill the front-end input from waveform to Fbank feature, resulting in a 17% parameter reduction and a doubling of inference speed, at marginal performance degradation.


EndoGSLAM: Real-Time Dense Reconstruction and Tracking in Endoscopic Surgeries using Gaussian Splatting


arXiv, 2024

Authors: Wang, Kailing; Yang, Chen; Wang, Yuehao; Li, Sikuang; Wang, Yan; Dou, Qi; Yang, Xiaokang; Shen, Wei. Affiliations: MoE Key Lab of Artificial Intelligence, AI Institute, Shanghai Jiao Tong University, China; Dept. of Computer Science and Engineering, The Chinese University of Hong Kong, Hong Kong; Shanghai Key Laboratory of Multidimensional Information Processing, East China Normal University, China

Precise camera tracking, high-fidelity 3D tissue reconstruction, and real-time online visualization are critical for intrabody medical imaging devices such as endoscopes and capsule robots. However, existing SLAM (Simultaneous Localization and Mapping) methods often struggle to achieve both complete high-quality surgical field reconstruction and efficient computation, restricting their intraoperative applications among endoscopic surgeries. In this paper, we introduce EndoGSLAM, an efficient SLAM approach for endoscopic surgeries, which integrates streamlined Gaussian representation and differentiable rasterization to facilitate over 100 fps rendering speed during online camera tracking and tissue reconstructing. Extensive experiments show that EndoGSLAM achieves a better trade-off between intraoperative availability and reconstruction quality than traditional or neural SLAM approaches, showing tremendous potential for endoscopic surgeries. The project page is at https://*** Copyright © 2024, The Authors. All rights reserved.

Keywords: Medical imaging


Relation-Aware Multi-hop Reasoning for Visual Dialog


10th CCF Conference on Natural Language Processing and Chinese Computing, NLPCC 2021

Authors: Zhao, Yao; Chen, Lu; Yu, Kai. Affiliations: X-LANCE Lab, Department of Computer Science and Engineering, MoE Key Lab of Artificial Intelligence, AI Institute, Shanghai Jiao Tong University, Shanghai, China; State Key Lab of Media Convergence Production Technology and Systems, Beijing, China

ISBN: (print) 9783030884796

Visual dialog is a multi-modal task that requires a dialog agent to answer a series of progressive questions grounded in an image. In this paper, we propose the Relation-aware Multi-hop Reasoning Network (R2N for short) for visual dialog tasks, which can perform multi-hop reasoning during the visual co-reference resolution process in a recurrent way. At each hop, in order to fully understand the visual scene in the image, a Relation-aware Graph Attention Network is used, which encodes each image into graphs with multi-type inter-object relations via a graph attention mechanism. Moreover, we find that the auxiliary clustering mechanism on answer candidates is conducive to the model's performance. Experimental results on the VisDial v1.0 dataset demonstrate that the proposed model is effective and outperforms the compared models. © 2021, Springer Nature Switzerland AG.

Keywords: computers
