Search Results - Inner Mongolia University Library

17th Asian Conference on Computer Vision, ACCV 2024

Authors: Song, Jiayan; Pan, Renjie; Zhou, Jun; Yang, Hua

Affiliations: Institute of Image Communication and Network Engineering, Shanghai Jiao Tong University, Shanghai 200240, China; Shanghai Key Lab of Digital Media Processing and Transmission, Shanghai 200240, China

ISBN: (print) 9789819609079

Current encoder-decoder methods for image captioning mainly consist of an object detection module (two-stage), or rely on big models with large-scale datasets to improve the effectiveness, which leads to increasing computation costs and cannot introduce new external knowledge. In this paper, we propose a novel end-to-end method, Multi-grained Retrieval Augmentation Transformer (M-RAT), that innovatively fuses retrieved text derived from a changeable datastore with input visual features through a Multi-modal Aligned Encoder, and introduce a specialized attention mechanism, Multi-MSA, to exploit both local and global interactions for delicate fine-grained details. Additionally, we enhance the decoder generation ability by employing low-level and high-level fused embeddings. Experiments demonstrate that M-RAT achieves comparable performance to state-of-the-art baselines with remarkable accuracy and details, as well as showing excellent domain adaptability for novel objects. © The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2025.

Keywords: Decoding


Learning group interaction for sports video understanding from a perspective of athlete


Frontiers of Computer Science, 2024, Vol. 18, No. 4, pp. 175-188

Authors: Rui HE; Zehua FU; Qingjie LIU; Yunhong WANG; Xunxun CHEN

Affiliations: Intelligent Recognition and Image Processing (IRIP) Lab, School of Computer Science and Engineering, Beihang University, Beijing 100191, China; Hangzhou Innovation Institute, Beihang University, Hangzhou 310051, China; National Computer Network Emergency Response Technical Team/Coordination Center of China (CNCERT or CNCERT/CC), Beijing 100029, China

Learning activity interactions between small groups is a key step in understanding team sports *** research focusing on team sports videos can be strictly regarded from the perspective of the audience rather than the *** team sports videos such as volleyball and basketball videos, there are plenty of intra-team and inter-team *** this paper, a new task named Group Scene Graph Generation is introduced to better understand intra-team relations and inter-team relations in sports *** tackle this problem, a novel Hierarchical Relation Network is *** all players in a video are finely divided into two teams, the features of the two teams’ activities and interactions will be enhanced by Graph Convolutional Networks, which are finally recognized to generate Group Scene *** evaluation, built on the Volleyball dataset with an additional 9660 team activity labels, a Volleyball+ dataset is proposed. A baseline is set for better comparison and our experimental results demonstrate the effectiveness of our ***, the idea of our method can be directly utilized in another video-based task, Group Activity *** show the priority of our method and display the link between the two ***, from the athlete’s view, we elaborately present an interpretation that shows how to utilize Group Scene Graph to analyze teams’ activities and provide professional gaming suggestions.

Keywords: group scene graph; group activity recognition; scene graph generation; graph convolutional network; sports video understanding


Hydrodynamics-Informed Neural Network for Simulating Dense Crowd Motion Patterns


32nd ACM International Conference on Multimedia, MM 2024

Authors: Zhou, Yanshan; Lai, Pingrui; Yu, Jiaqi; Xiong, Yingjie; Yang, Hua

Affiliations: Institute of Image Communication and Network Engineering, Shanghai Jiao Tong University, Shanghai, China; Shanghai Key Lab of Digital Media Processing and Transmission, Shanghai Jiao Tong University, Shanghai, China

ISBN: (print) 9798400706868

With global occurrences of crowd crushes and stampedes, dense crowd simulation has been drawing great attention. In this research, our goal is to simulate dense crowd motions under six classic motion patterns, more specifically, to generate subsequent motions of dense crowds from the given initial states. Since dense crowds share similarities with fluids, such as continuity and fluidity, one common approach for dense crowd simulation is to construct hydrodynamics-based models, which consider dense crowds as fluids, guide crowd motions with Navier-Stokes equations, and conduct dense crowd simulation by solving governing equations. Despite the proposal of these models, dense crowd simulation faces multiple challenges, including the difficulty of directly solving Navier-Stokes equations due to their nonlinear nature, the neglect of distinctive crowd characteristics which fluids lack, and the gaps in the evaluation and validation of crowd simulation models. To address the above challenges, we build a hydrodynamic model, which captures the crowd physical properties (continuity, fluidity, etc.) with Navier-Stokes equations and reflects the crowd social properties (sociality, personality, etc.) with operators that describe crowd interactions and crowd-environment interactions. To tackle the computational problem, we propose to solve the governing equation based on Navier-Stokes equations using neural networks, and introduce the Hydrodynamics-Informed Neural Network (HINN), which preserves the structure of the governing equation in its network architecture. To facilitate the evaluation, we construct a new dense crowd motion video dataset called Dense Crowd Flow Dataset (DCFD), containing six classic motion patterns (line, curve, circle, cross, cluster and scatter) and 457 video clips, which can serve as the ground truth for various objective metrics. Numerous experiments are conducted using HINN to simulate dense crowd motions under six motion patterns with video clips fro

Keywords: Navier-Stokes equations
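For orientation, the textbook incompressible Navier-Stokes equations are written out below. This is the standard fluid form only. The abstract does not give the paper's actual governing equation or its crowd-interaction operators, so the body-force term $\mathbf{f}$ here is just a generic placeholder (an assumption) for wherever such extra terms might enter.

```latex
\begin{aligned}
\frac{\partial \mathbf{u}}{\partial t} + (\mathbf{u}\cdot\nabla)\,\mathbf{u}
  &= -\frac{1}{\rho}\,\nabla p + \nu\,\nabla^{2}\mathbf{u} + \mathbf{f}, \\
\nabla\cdot\mathbf{u} &= 0,
\end{aligned}
```

Here $\mathbf{u}$ is the velocity field, $p$ the pressure, $\rho$ the density, and $\nu$ the kinematic viscosity. The convective term $(\mathbf{u}\cdot\nabla)\mathbf{u}$ is the source of the nonlinearity that the abstract cites as the difficulty in solving the equations directly.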


L2RT-FIQA: Face Image Quality Assessment via Learning-to-Rank Transformer


9th International Forum on Digital Multimedia Communication, IFTC 2022

Authors: Chen, Zehao; Yang, Hua

Affiliations: Institute of Image Communication and Network Engineering, Shanghai Jiao Tong University, Shanghai, China; Shanghai Key Lab of Digital Media Processing and Transmission, Shanghai, China

ISBN: (print) 9789819908554

Face recognition (FR) systems are easily constrained by complex environmental situations in the wild. To ensure the accuracy of FR systems, face image quality assessment (FIQA) is applied to reject low-quality face images unsuitable for recognition. Face quality can be defined as the accuracy or confidence of face images being correctly recognized by FR systems, which is desired to be consistent with recognition results. However, current FIQA methods show more or less inconsistency with face recognition due to four biases: implicit constraint, quality labels, regression models, and backbone networks. In order to reduce such biases and enhance the consistency between FR and FIQA, this paper proposes an FIQA method based on a learning-to-rank (L2R) algorithm and a vision Transformer, named L2RT-FIQA. L2RT-FIQA consists of three parts: relative quality labels, an L2R framework, and a vision Transformer backbone. Specifically, we utilize normalized intra-class and inter-class angular distance to generate relative quality labels; we employ an L2R model to focus more on the quality order rather than the absolute quality value; we apply an unpretrained vision Transformer as our backbone to improve generalization and global information learning. Experimental results show our L2RT-FIQA effectively reduces the aforementioned four kinds of biases and outperforms other state-of-the-art FIQA methods on several challenging benchmarks. © 2023, The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.

Keywords: Face recognition
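The idea of learning a quality order rather than absolute quality values can be illustrated with a generic pairwise ranking loss. The sketch below uses a RankNet-style logistic loss. The abstract does not say which L2R loss L2RT-FIQA uses, so the loss choice and the function names `pairwise_logistic_loss` and `batch_ranking_loss` are assumptions made purely for illustration.

```python
import math


def pairwise_logistic_loss(score_hi: float, score_lo: float) -> float:
    """Loss for one pair in which the first image should rank higher.

    Computes log(1 + exp(-(score_hi - score_lo))): small when the predicted
    order is correct, large when it is reversed.
    """
    diff = score_hi - score_lo
    return math.log1p(math.exp(-diff))


def batch_ranking_loss(scores, labels):
    """Average the pairwise loss over all pairs whose quality labels differ.

    scores: predicted quality scores (hypothetical model outputs).
    labels: relative quality labels; only their order is used.
    """
    total, count = 0.0, 0
    for i in range(len(scores)):
        for j in range(len(scores)):
            if labels[i] > labels[j]:  # image i should outrank image j
                total += pairwise_logistic_loss(scores[i], scores[j])
                count += 1
    return total / count if count else 0.0
```

Because only `labels[i] > labels[j]` is consulted, rescaling the labels by any order-preserving map leaves the loss unchanged, which is the sense in which a ranking objective ignores absolute quality values.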


FC-GNN: Recovering Reliable and Accurate Correspondences from Interferences


Conference on Computer Vision and Pattern Recognition (CVPR)

Authors: Haobo Xu; Jun Zhou; Hua Yang; Renjie Pan; Cunyan Li

Affiliations: Institute of Image Communication and Network Engineering, Shanghai Jiao Tong University; Shanghai Key Lab of Digital Media Processing and Transmission

ISBN: (digital) 9798350353006

ISBN: (print) 9798350353013

Finding correspondences between images is essential for many computer vision tasks and sparse matching pipelines have been popular for decades. However, matching noise within and between images, along with inconsistent key-point detection, frequently degrades the matching performance. We review these problems and thus propose: 1) a novel and unified Filtering and Calibrating (FC) approach that jointly rejects outliers and optimizes inliers, and 2) leveraging both the matching context and the underlying image texture to remove matching uncertainties. Under the guidance of the above innovations, we construct the Filtering and Calibrating Graph Neural Network (FC-GNN), which follows the FC approach to recover reliable and accurate correspondences from various interferences. FC-GNN conducts an effectively combined inference of contextual and local information through careful embedding and multiple information aggregations, predicting confidence scores and calibration offsets for the input correspondences to jointly filter out outliers and improve pixel-level matching accuracy. Moreover, we exploit the local coherence of matches to perform inference on local graphs, thereby reducing computational complexity. Overall, FC-GNN operates at lightning speed and can greatly boost the performance of diverse matching pipelines across various tasks, showcasing the immense potential of such approaches to become standard and pivotal components of image matching. Code is available at https://***/xuy123456/fcgnn.

Keywords: Matched filters; Computer vision; Technological innovation; Accuracy; Uncertainty; Computer network reliability; Pipelines
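The filter-and-calibrate step described in the abstract (confidence scores to reject outliers, offsets to refine inliers) can be sketched as a small post-processing function. This is not the FC-GNN code. The function name `filter_and_calibrate`, the 0.5 threshold, and the choice to apply each offset to the second keypoint only are all assumptions for illustration, and the network that would predict the confidences and offsets is not shown.

```python
from typing import List, Tuple

Match = Tuple[float, float, float, float]  # (x1, y1, x2, y2) in pixels
Offset = Tuple[float, float]               # (dx, dy) correction in pixels


def filter_and_calibrate(
    matches: List[Match],
    confidences: List[float],
    offsets: List[Offset],
    threshold: float = 0.5,  # hypothetical cut-off, not from the paper
) -> List[Match]:
    """Drop low-confidence matches and shift the survivors by their offsets."""
    refined = []
    for (x1, y1, x2, y2), conf, (dx, dy) in zip(matches, confidences, offsets):
        if conf < threshold:
            continue  # filtering: treat as an outlier
        # calibrating: nudge the second keypoint by the predicted offset
        refined.append((x1, y1, x2 + dx, y2 + dy))
    return refined
```

For example, with two input matches whose confidences are 0.9 and 0.1, only the first survives, and its second keypoint is moved by its offset.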


Adaptive and Collaborative Multi-scale Alignment for Text-Based Person Search


2023 IEEE International Conference on Visual Communications and Image Processing, VCIP 2023

Authors: Yang, Xinxin; Pan, Renjie; Yang, Hua

Affiliations: Institute of Image Communication and Network Engineering, Shanghai Jiao Tong University, Shanghai Key Lab of Digital Media Processing and Transmission, Shanghai, China; Shanghai Jiao Tong University, China; MoE Key Lab of Artificial Intelligence, AI Institute, China

ISBN: (print) 9798350359855

Text-to-image person search is challenging due to the cross-scale correspondences and information inequality between modalities. Specifically, images and text are complexly linked at different scales, and images are usually more informative and complete than text. It is crucial to establish semantic correlations between modalities and focus on task-relevant information in images. In this paper, we propose a novel Adaptive and Collaborative Multi-scale Alignment Network (ACMA) for text-based person search that learns semantically consistent and information-aligned multi-modal representations. First, we introduce a novel joint embedding module that adaptively integrates features of different pixels and words, thereby extracting semantically consistent multi-modal features at different scales. Second, we design a cross-modal fusion feature-based auxiliary visual branch to guide the extraction of key visual features that are beneficial for cross-modal matching. Extensive experiments validate that ACMA outperforms state-of-the-art methods. © 2023 IEEE.

Keywords: Embeddings


AGAV-Rater: Adapting Large Multimodal Model for AI-Generated Audio-Visual Quality Assessment


arXiv, 2025

Authors: Cao, Yuqin; Min, Xiongkuo; Gao, Yixuan; Sun, Wei; Zhai, Guangtao

Affiliations: Institute of Image Communication and Network Engineering, Shanghai Key Laboratory of Digital Media Processing and Transmissions, Shanghai Jiao Tong University, Shanghai, China

Many video-to-audio (VTA) methods have been proposed for dubbing silent AI-generated videos. An efficient quality assessment method for AI-generated audio-visual content (AGAV) is crucial for ensuring audio-visual quality. Existing audio-visual quality assessment methods struggle with unique distortions in AGAVs, such as unrealistic and inconsistent elements. To address this, we introduce AGAVQA, the first large-scale AGAV quality assessment dataset, comprising 3,382 AGAVs from 16 VTA methods. AGAVQA includes two subsets: AGAVQA-MOS, which provides multi-dimensional scores for audio quality, content consistency, and overall quality, and AGAVQA-Pair, designed for optimal AGAV pair selection. We further propose AGAV-Rater, an LMM-based model that can score AGAVs, as well as audio and music generated from text, across multiple dimensions, and select the best AGAV generated by VTA methods to present to the user. AGAV-Rater achieves state-of-the-art performance on AGAVQA, Text-to-Audio, and Text-to-Music datasets. Subjective tests also confirm that AGAV-Rater enhances VTA performance and user experience. The project page is available at https://***. Copyright © 2025, The Authors. All rights reserved.

Keywords: Subjective testing


Physics-Environment Interaction Network for Dense Crowd Behavior Recognition


SSRN, 2024

Authors: Yu, Jiaqi; Zhou, Yanshan; Pan, Renjie; Lai, Pingrui; Yang, Hua

Affiliations: Institute of Image Communication and Network Engineering, Shanghai Key Lab of Digital Media Processing and Transmission, Shanghai Jiao Tong University, Shanghai 200240, China

The analysis of large-scale crowd behavior plays a crucial role in public safety. However, intelligent systems face three major challenges in analyzing dense crowd behavior: the severe occlusion between individuals, the variability in behavior patterns, and the complexity of behavioral evolution. To address these challenges, we propose the Physics-Environment Interaction Network (PEIN), which directly models the motion characteristics of a group with its physics attributes. Specifically, our method consists of two streams. The first stream is the physics-informed crowd property stream, which leverages the similarity between dense crowd motion and fluid dynamics, using the Navier-Stokes (N-S) equation from fluid mechanics as the modeling framework to describe crowd motion. Considering the inherent relationship between the terms in the N-S equation and various crowd properties (collectiveness, conflict, uniformity and stability), we model these terms with operators and neural networks guided by these crowd properties, enabling the modeling of crowd motion characteristics without relying on the extraction of individual motion information. The second stream is the environment perception stream. Considering that the physics-informed crowd property stream mainly focuses on instantaneous information and that scenes with dense crowd behavior have variability, we introduce a 3D network to enhance the model's robustness. This stream can extract global spatiotemporal information from input video frames, providing a comprehensive perception of the surrounding crowd environment. Since the two streams process different data sources, we design a dual cross-attention mechanism to enable interaction between features from different modalities, resulting in a joint learnable representation for the final crowd behavior recognition. By incorporating physical laws as constraints, we design a physics-informed loss function combined with a crowd behavior loss function to optimize the model. Consi

Keywords: Navier-Stokes equations


UNQA: Unified No-Reference Quality Assessment for Audio, Image, Video, and Audio-Visual Content


arXiv, 2024

Authors: Cao, Yuqin; Min, Xiongkuo; Gao, Yixuan; Sun, Wei; Lin, Weisi; Zhai, Guangtao

Affiliations: The Institute of Image Communication and Network Engineering, Shanghai Key Laboratory of Digital Media Processing and Transmissions, Shanghai Jiao Tong University, Shanghai 200240, China; The School of Computer Science and Engineering, Nanyang Technological University, Singapore 639798, Singapore

As multimedia data flourishes on the Internet, quality assessment (QA) of multimedia data becomes paramount for digital media applications. Since multimedia data spans multiple modalities, including audio, image, video, and audiovisual (A/V) content, researchers have developed a range of QA methods to evaluate the quality of different modality data. While these methods focus exclusively on single-modality QA issues, a unified QA model that can handle diverse media across multiple modalities is still missing, even though such a model can better resemble human perception behaviour and also has a wider range of applications. In this paper, we propose the Unified No-reference Quality Assessment model (UNQA) for audio, image, video, and A/V content, which tries to train a single QA model across different media modalities. To tackle the issue of inconsistent quality scales among different QA databases, we develop a multi-modality strategy to jointly train UNQA on multiple QA databases. Based on the input modality, UNQA selectively extracts the spatial features, motion features, and audio features, and calculates a final quality score via the four corresponding modality regression modules. Compared with existing QA methods, UNQA has two advantages: 1) the multi-modality training strategy makes the QA model learn more general and robust quality-aware feature representation, as evidenced by the superior performance of UNQA compared to state-of-the-art QA methods; 2) UNQA reduces the number of models required to assess multimedia data across different modalities and is easy to deploy in practical applications. Copyright © 2024, The Authors. All rights reserved.

Keywords: Audiovisual


Spatial-Temporal Constrained Pseudo-labeling for Unsupervised Person Re-identification via GCN Inference


18th International Forum of Digital Multimedia Communication, IFTC 2021

Authors: Ling, Sen; Yang, Hua; Liu, Chuang; Chen, Lin; Zhao, Hongtian

Affiliations: The Institute of Image Communication and Network Engineering, Department of Electronic Engineering, Shanghai Jiao Tong University, Shanghai, China; Shanghai Key Laboratory of Digital Media Processing and Transmission, Shanghai Jiao Tong University, Shanghai, China; MoE Key Lab of Artificial Intelligence, AI Institute, Shanghai Jiao Tong University, Shanghai, China

ISBN: (print) 9789811922657

Most existing unsupervised person re-identification (Re-ID) methods primarily depend on the cluster distance, and merely exploit the available labeled source data to assign pseudo labels to the unannotated data. However, the cluster distance usually fails to adapt to different datasets due to the domain gap. Besides, learning exclusively from the source data cannot generate accurate pseudo labels because target data information is lacking. To address this problem, we propose to exploit spatial-temporal constraints to facilitate the pseudo label generation process. Specifically, graphs for the labeled source data are constructed and a graph convolution network (GCN) is used to learn graph embeddings. Based on these graph embeddings, the likelihood of linkages between graph nodes is estimated and utilized to assign pseudo labels to the unlabeled data. Then, with the pseudo labels, a smoothed spatial-temporal probability distribution model is generated to amend the likelihood of linkages between graph nodes as well as correct the visual similarity scores for person Re-ID. Finally, we optimize the pseudo label assignment, feature extraction networks, and spatial-temporal model alternately and iteratively to improve the person Re-ID performance. Comprehensive experiments demonstrate that the proposed method outperforms state-of-the-art methods. © 2022, Springer Nature Singapore Pte Ltd.

Keywords: Graph theory

