Search results - Inner Mongolia University Library

STRNet: Triple-stream Spatiotemporal Relation Network for Action Recognition

International Journal of Automation and Computing, 2021, vol. 18, no. 5, pp. 718-730

Authors: Zhi-Wei Xu, Xiao-Jun Wu, Josef Kittler. Affiliations: School of Artificial Intelligence and Computer Science, Jiangnan University, Wuxi 214122, China; Jiangsu Provincial Engineering Laboratory of Pattern Recognition and Computational Intelligence, Wuxi 214122, China; Centre for Vision, Speech and Signal Processing, University of Surrey, Guildford GU2 7XH, UK

Learning comprehensive spatiotemporal features is crucial for human action recognition. Existing methods tend to model the spatiotemporal feature blocks in an integrate-separate-integrate form, such as the appearance-and-relation network (ARTNet) and the spatiotemporal and motion network (STM). However, as blocks stack up, the rear part of the network has poor interpretability. To avoid this problem, we propose a novel architecture called spatial temporal relation network (STRNet), which can learn explicit information about appearance, motion and, especially, temporal relations. Specifically, STRNet is constructed from three branches, which separate the features into 1) an appearance pathway, to obtain spatial semantics, 2) a motion pathway, to reinforce the spatiotemporal feature representation, and 3) a relation pathway, to capture temporal relation details of successive frames and to explore long-term representation dependency. In addition, STRNet does not simply merge the multi-branch information; it applies a flexible and effective strategy to fuse the complementary information from the multiple pathways. We evaluate our network on four major action recognition benchmarks: Kinetics-400, UCF-101, HMDB-51, and Something-Something v1. STRNet achieves state-of-the-art results on the UCF-101 and HMDB-51 datasets, and accuracy comparable to the state-of-the-art methods on Something-Something v1 and Kinetics-400.

Keywords: Action recognition; spatiotemporal relation; multi-branch fusion; long-term representation; video classification

Learning Analysis of Kernel Ridgeless Regression with Asymmetric Kernel Learning

arXiv, 2024

Authors: He, Fan; He, Mingzhen; Shi, Lei; Huang, Xiaolin; Suykens, Johan A.K. Affiliations: STADIUS Center for Dynamical Systems, Signal Processing and Data Analytics, KU Leuven, Leuven, Belgium; MOE Key Laboratory of System Control and Information Processing, Institute of Image Processing and Pattern Recognition, Shanghai Jiao Tong University, Shanghai, China; Shanghai Key Laboratory for Contemporary Applied Mathematics, School of Mathematical Sciences, Fudan University, Shanghai 200433, China; Shanghai Artificial Intelligence Laboratory, Shanghai 200232, China; MOE Key Laboratory of System Control and Information Processing, Institute of Image Processing and Pattern Recognition, Institute of Medical Robotics, Shanghai Jiao Tong University, Shanghai 200240, China

Ridgeless regression has garnered attention among researchers, particularly in light of the "Benign Overfitting" phenomenon, where models interpolating noisy samples demonstrate robust generalization. However, kernel ridgeless regression does not always perform well due to its lack of flexibility. This paper enhances kernel ridgeless regression with Locally-Adaptive-Bandwidths (LAB) RBF kernels, incorporating kernel learning techniques to improve performance in both experiments and theory. For the first time, we demonstrate that functions learned from LAB RBF kernels belong to an integral space of Reproducing Kernel Hilbert Spaces (RKHSs). Despite the absence of explicit regularization in the proposed model, its optimization is equivalent to solving an ℓ0-regularized problem in the integral space of RKHSs, elucidating the origin of its generalization ability. Taking an approximation analysis viewpoint, we introduce an ℓq-norm analysis technique (with 0 [...] © 2024, CC BY.

Keywords: Vector spaces
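
The "ridgeless" setting in the abstract above means fitting a kernel model with no ridge term, so that the model interpolates the training samples. The following is a minimal sketch of that idea only, using an ordinary symmetric RBF kernel on scalar inputs rather than the paper's LAB RBF kernel. The function names, the toy data and the bandwidth are assumptions made for illustration.

```python
import math

def rbf(x, z, bandwidth=1.0):
    """Ordinary symmetric RBF kernel on scalars (not the LAB RBF kernel)."""
    return math.exp(-((x - z) ** 2) / (2.0 * bandwidth ** 2))

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x

def fit_ridgeless(xs, ys, bandwidth=1.0):
    """Solve K alpha = y directly; there is no '+ lambda * I' ridge term."""
    K = [[rbf(a, b, bandwidth) for b in xs] for a in xs]
    return solve(K, ys)

def predict(x, xs, alpha, bandwidth=1.0):
    return sum(a * rbf(x, z, bandwidth) for a, z in zip(alpha, xs))

# Toy, hypothetical data: the fitted function passes through every sample.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [0.1, 0.9, 0.2, -0.7]
alpha = fit_ridgeless(xs, ys)
fitted = [predict(x, xs, alpha) for x in xs]
```

Because the kernel matrix is inverted without regularization, `fitted` reproduces `ys` up to rounding error, which is the interpolation behaviour the abstract refers to.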

A Riemannian Residual Learning Mechanism for SPD Network

International Joint Conference on Neural Networks (IJCNN)

Authors: Zhenyu Cai, Rui Wang, Tianyang Xu, Xiaojun Wu, Josef Kittler. Affiliations: School of Artificial Intelligence and Computer Science, Jiangnan University, Wuxi, China; Jiangsu Provincial Engineering Laboratory of Pattern Recognition and Computational Intelligence, Wuxi, China; Centre for Vision, Speech and Signal Processing, University of Surrey, Guildford, U.K.

ISBN (electronic): 9798350359312

ISBN (print): 9798350359329

The generalization of the Euclidean network paradigm to Riemannian manifolds has attracted much attention in recent years for offering useful geometric representations when processing manifold-valued data. However, the information degradation during data compression mapping hinders Riemannian networks from going deeper, and there are very few solutions specifically designed for this problem. Given the remarkable success of deep residual learning in Euclidean networks, a novel Riemannian residual learning mechanism (RRLM) is proposed in the context of Symmetric Positive Definite (SPD) manifolds, enabling the characterization of deep spatiotemporal features while preserving the manifold properties. Based on RRLM, a stack of SPD manifold-constrained residual-like blocks is appended to the tail of the original SPDNet (the backbone) to conduct deep Riemannian residual learning. For simplicity, we refer to this network architecture as the Riemannian residual SPD network (ResSPDNet). The experimental results achieved on three types of visual classification tasks, i.e., facial emotion recognition, drone recognition, and action recognition, demonstrate that our method can achieve improved accuracy with a deepened network structure.

Keywords: Manifolds; Learning systems; Degradation; Emotion recognition; Visualization; Accuracy; Face recognition

Zero-Shot Audio Captioning Using Soft and Hard Prompts

IEEE Transactions on Audio, Speech and Language Processing, 2025, vol. 33, pp. 2045-2058

Authors: Yiming Zhang, Xuenan Xu, Ruoyi Du, Haohe Liu, Yuan Dong, Zheng-Hua Tan, Wenwu Wang, Zhanyu Ma. Affiliations: Pattern Recognition and Intelligent System Laboratory, School of Artificial Intelligence, Beijing University of Posts and Telecommunications, Beijing, China; Department of Computer Science and Engineering, Shanghai Jiao Tong University, Shanghai, China; Centre for Vision, Speech and Signal Processing, University of Surrey, Guildford, U.K.; Department of Electronic Systems, Aalborg University, Aalborg, Denmark

In traditional audio captioning methods, a model is usually trained in a fully supervised manner using a human-annotated dataset containing audio-text pairs and then evaluated on the test set from the same dataset. Such methods have two limitations. First, these methods are often data-hungry and require time-consuming and expensive human annotations to obtain audio-text pairs. Second, these models often suffer from performance degradation in cross-domain scenarios, i.e., when the input audio comes from a different domain than the training set, and this issue has received little attention. To address these issues, we propose a new zero-shot method for audio captioning. Our method is built on the contrastive language-audio pre-training (CLAP) model. During training, the model reconstructs the ground-truth caption using the CLAP text encoder. In the inference stage, the model generates text descriptions from the CLAP audio embeddings of given audio inputs. To enhance the ability of the model in transitioning from text-to-text generation to audio-to-text generation, we propose to use the mixed-augmentations-based soft prompt to learn more robust latent representations, leveraging instance replacement and embedding augmentation. Additionally, we introduce the retrieval-based acoustic-aware hard prompt to improve the cross-domain performance of the model by employing the domain-agnostic label information of sound events. Extensive experiments on AudioCaps and Clotho benchmarks show the effectiveness of our proposed method, which outperforms other zero-shot audio captioning approaches for in-domain scenarios and outperforms the compared methods for cross-domain scenarios, underscoring the generalization ability of our method.

Keywords: Training; Decoding; Semantics; Data models; Acoustics; Electronic mail; Benchmark testing; Transformers; Robustness; Perturbation methods
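
The retrieval-based hard prompt mentioned in the abstract above can be pictured as a nearest-label lookup in a shared embedding space. The sketch below is a rough illustration under stated assumptions, not the authors' implementation: the three-dimensional vectors stand in for CLAP audio and text embeddings, and `retrieve_labels`, `hard_prompt` and the prompt template are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def retrieve_labels(audio_emb, label_embs, k=2):
    """Return the k sound-event labels whose embeddings best match the audio."""
    ranked = sorted(
        label_embs,
        key=lambda name: cosine(audio_emb, label_embs[name]),
        reverse=True,
    )
    return ranked[:k]

def hard_prompt(labels):
    """Hypothetical template that turns retrieved labels into a text prompt."""
    return "Sound events: " + ", ".join(labels) + "."

# Toy stand-ins for text embeddings of sound-event labels (assumed values).
label_embs = {
    "dog bark": [0.9, 0.1, 0.0],
    "rain": [0.0, 1.0, 0.1],
    "speech": [0.1, 0.0, 1.0],
}
audio_emb = [0.8, 0.3, 0.1]  # toy stand-in for an audio embedding

labels = retrieve_labels(audio_emb, label_embs, k=2)
prompt = hard_prompt(labels)
```

In this toy case the audio vector lies closest to "dog bark" and then "rain", so those two labels form the prompt that would accompany the caption decoder's input.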

Enhancing Kernel Flexibility via Learning Asymmetric Locally-Adaptive Kernels

arXiv, 2023

Authors: He, Fan; He, Mingzhen; Shi, Lei; Huang, Xiaolin; Suykens, Johan A.K. Affiliations: STADIUS Center for Dynamical Systems, Signal Processing and Data Analytics, KU Leuven, Belgium; Institute of Image Processing and Pattern Recognition, Shanghai Jiao Tong University, China; Shanghai Key Laboratory for Contemporary Applied Mathematics, School of Mathematical Sciences, Fudan University, China

The lack of sufficient flexibility is the key bottleneck of kernel-based learning that relies on manually designed, pre-given, and non-trainable kernels. To enhance kernel flexibility, this paper introduces Locally-Adaptive-Bandwidths (LAB) as trainable parameters of the Radial Basis Function (RBF) kernel, giving rise to the LAB RBF kernel. The parameters in LAB RBF kernels are data-dependent, and their number can increase with the dataset, allowing for better adaptation to diverse data patterns and enhancing the flexibility of the learned function. This newfound flexibility also brings challenges, particularly with regard to asymmetry and the need for an efficient learning algorithm. To address these challenges, this paper for the first time establishes an asymmetric kernel ridge regression framework and introduces an iterative kernel learning algorithm. This approach not only reduces the demand for extensive support data but also significantly improves generalization by training bandwidths on the available training data. Experimental results on real datasets underscore the performance of the proposed algorithm, showing its superior capability in handling large-scale datasets compared to Nyström approximation-based algorithms. Moreover, it demonstrates a significant improvement in regression accuracy over existing kernel-based learning methods and even surpasses residual neural networks. Copyright © 2023, The Authors. All rights reserved.

Keywords: Bandwidth
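
The asymmetry mentioned in the abstract above follows directly from attaching a separate bandwidth to each support point. The sketch below assumes one plausible form of such a kernel, K(t, x_i) = exp(-||theta_i * (t - x_i)||^2) with an element-wise product. The function name `lab_rbf`, the support points and the bandwidth values are illustrative assumptions; the paper's learning algorithm for the bandwidths is not reproduced here.

```python
import math

def lab_rbf(t, x_i, theta_i):
    """RBF-style kernel between a query t and a support point x_i.

    theta_i is the bandwidth vector attached to x_i (one entry per feature),
    so the value depends on which argument plays the role of support point.
    """
    sq = sum((th * (a - b)) ** 2 for th, a, b in zip(theta_i, t, x_i))
    return math.exp(-sq)

# Two support points with different (hypothetical) bandwidth vectors.
support = [[0.0, 0.0], [1.0, 0.0]]
thetas = [[0.5, 0.5], [2.0, 2.0]]

# Swapping the arguments swaps the bandwidth in use, so the values differ.
k01 = lab_rbf(support[0], support[1], thetas[1])  # uses the bandwidth of point 1
k10 = lab_rbf(support[1], support[0], thetas[0])  # uses the bandwidth of point 0
```

Here `k01` equals exp(-4) and `k10` equals exp(-0.25), so the resulting kernel matrix is not symmetric, which is why a standard symmetric kernel ridge regression formulation does not apply directly.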

Riemannian Self-Attention Mechanism for SPD Networks

arXiv, 2023

Authors: Wang, Rui; Wu, Xiao-Jun; Li, Hui; Kittler, Josef. Affiliations: School of Artificial Intelligence and Computer Science, Jiangnan University, Wuxi 214122, China; Jiangsu Provincial Engineering Laboratory of Pattern Recognition and Computational Intelligence, Jiangnan University, China; Centre for Vision, Speech and Signal Processing, University of Surrey, Guildford GU2 7XH, United Kingdom

The symmetric positive definite (SPD) matrix has been demonstrated to be an effective feature descriptor in many scientific areas, as it can adequately encode spatiotemporal statistics of the data on a curved Riemannian manifold, i.e., the SPD manifold. Although there are many different ways to design network architectures for SPD matrix nonlinear learning, very few solutions explicitly mine the geometrical dependencies of features at different layers. Motivated by the great success of the self-attention mechanism in capturing long-range relationships, an SPD manifold self-attention mechanism (SMSA) is proposed in this paper using manifold-valued geometric operations, mainly the Riemannian metric, Riemannian mean, and Riemannian optimization. Then, an SMSA-based geometric learning module (SMSA-GLM) is designed to improve the discrimination of the generated deep structured representations. Extensive experimental results achieved on three benchmarking datasets show that our modification of the baseline network further alleviates the information degradation problem and leads to improved accuracy. Copyright © 2023, The Authors. All rights reserved.

Keywords: Network architecture

Zero-Shot Audio Captioning Using Soft and Hard Prompts

arXiv, 2024

Authors: Zhang, Yiming; Xu, Xuenan; Du, Ruoyi; Liu, Haohe; Dong, Yuan; Tan, Zheng-Hua; Wang, Wenwu; Ma, Zhanyu. Affiliations: The Pattern Recognition and Intelligent System Laboratory, School of Artificial Intelligence, Beijing University of Posts and Telecommunications, Beijing 100876, China; The Department of Computer Science and Engineering, Shanghai Jiao Tong University, Shanghai 200240, China; The Department of Electronic Systems, Aalborg University, Aalborg 9220, Denmark; The Centre for Vision, Speech and Signal Processing, University of Surrey, Guildford GU2 7XH, United Kingdom

In traditional audio captioning methods, a model is usually trained in a fully supervised manner using a human-annotated dataset containing audio-text pairs and then evaluated on the test sets from the same dataset. Such methods have two limitations. First, these methods are often data-hungry and require time-consuming and expensive human annotations to obtain audio-text pairs. Second, these models often suffer from performance degradation in cross-domain scenarios, i.e., when the input audio comes from a different domain than the training set, an issue that has received little attention. We propose an effective audio captioning method based on the contrastive language-audio pre-training (CLAP) model to address these issues. Our proposed method requires only textual data for training, enabling the model to generate text from the textual feature in the cross-modal semantic space. In the inference stage, the model generates the descriptive text for the given audio from the audio feature by leveraging the audio-text alignment from CLAP. We devise two strategies to mitigate the discrepancy between text and audio embeddings: a mixed-augmentation-based soft prompt and a retrieval-based acoustic-aware hard prompt. These approaches are designed to enhance the generalization performance of our proposed model, helping it generate captions more robustly and accurately. Extensive experiments on AudioCaps and Clotho benchmarks show the effectiveness of our proposed method, which outperforms other zero-shot audio captioning approaches for in-domain scenarios and outperforms the compared methods for cross-domain scenarios, underscoring the generalization ability of our method. © 2024, CC BY.

Keywords: Semantics

Exploring Fusion Strategies for Accurate RGBT Visual Object Tracking

arXiv, 2022

Authors: Tang, Zhangyong; Xu, Tianyang; Li, Hui; Wu, Xiao-Jun; Zhu, XueFeng; Kittler, Josef. Affiliations: Jiangsu Provincial Engineering Laboratory of Pattern Recognition and Computational Intelligence, School of Artificial Intelligence and Computer Science, Jiangnan University, Wuxi 214122, China; The Center for Vision, Speech and Signal Processing, University of Surrey, Guildford GU2 7XH, United Kingdom

We address the problem of multi-modal object tracking in video and explore various options for fusing the complementary information conveyed by the visible (RGB) and thermal infrared (TIR) modalities, including pixel-level, feature-level and decision-level fusion. Specifically, in contrast to existing methods, fusion at the pixel level follows the paradigm of the image fusion task. Feature-level fusion is carried out by an attention mechanism that optionally excites channels. At the decision level, a novel fusion strategy is put forward, since a simple averaging configuration has already shown superiority. The effectiveness of the proposed decision-level fusion strategy is owed to a number of innovative contributions, including a dynamic weighting of the RGB and TIR contributions and a linear template update operation, a variant of which produced the winning tracker at the Visual Object Tracking Challenge 2020 (VOT-RGBT2020). The concurrent exploration of innovative pixel- and feature-level fusion strategies highlights the advantages of the proposed decision-level fusion method. Extensive experimental results on three challenging datasets, i.e., GTOT, VOT-RGBT2019, and VOT-RGBT2020, demonstrate the effectiveness and robustness of the proposed method compared to state-of-the-art approaches. Code will be shared at https://***/Zhangyong-Tang/DFAT. © 2022, CC BY.

Keywords: Pixels
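
The two decision-level ingredients named in the abstract above, a dynamic weighting of the RGB and TIR contributions and a linear template update, can be sketched as follows. This is an illustration under assumptions, not the authors' tracker: using each modality's peak response as its confidence, the flattened response maps, and the update rate are all hypothetical choices.

```python
def fuse_responses(rgb, tir):
    """Fuse two response maps with weights derived from their peak values.

    Assumes both maps are flattened to lists and contain positive peaks;
    the peak-based weight is an assumed confidence proxy.
    """
    peak_rgb, peak_tir = max(rgb), max(tir)
    w_rgb = peak_rgb / (peak_rgb + peak_tir)
    fused = [w_rgb * r + (1.0 - w_rgb) * t for r, t in zip(rgb, tir)]
    return fused, w_rgb

def update_template(old, new, rate=0.1):
    """Linear template update: blend the stored template with the new one."""
    return [(1.0 - rate) * o + rate * n for o, n in zip(old, new)]

# Toy response maps: RGB is confident about position 2, TIR is ambiguous.
rgb = [0.1, 0.2, 0.9, 0.3]
tir = [0.2, 0.3, 0.3, 0.1]

fused, w_rgb = fuse_responses(rgb, tir)
best = max(range(len(fused)), key=fused.__getitem__)  # predicted target index
template = update_template([1.0, 0.0], [0.0, 1.0], rate=0.1)
```

With these toy values the RGB map receives a weight of 0.75, the fused peak stays at index 2, and the template moves one tenth of the way toward the new observation.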

Parameter estimation of photovoltaic modules using analytical and numerical/iterative approaches: A comparative study

Materials Today: Proceedings, 2022, vol. 52, pp. 1-6

Authors: Souad Lidaighbi, Mustapha Elyaqouti, Khalid Assalaou, Dris Ben Hmamou, Driss Saadaoui, Jihad H'roura. Affiliations: Materials and Renewable Energy Laboratory, Agadir Faculty of Sciences, Ibn Zohr University, BP 8106, 80000 Agadir, Morocco; Laboratory of Electronics, Signal Processing and Physical Modeling, Faculty of Sciences of Agadir, Ibn Zohr University, BP 8106, 80000 Agadir, Morocco; Laboratory of Images and Pattern Recognition - Intelligent and Communicating Systems, Faculty of Sciences of Agadir, Ibn Zohr University, BP 8106, 80000 Agadir, Morocco

Modeling photovoltaic (PV) modules is fundamental for analyzing their efficiency and performance under different operating conditions. Generally, photovoltaic module modeling is based on a suitable equivalent circuit and a set of parameters representing the PV module's properties. Defining these parameters is not a trivial task, as they are usually not included in the technical documentation of the module. Therefore, finding suitable values for the model parameters becomes imperative. In this paper, we test and compare two families of methods for extracting single-diode model (SDM) parameters, analytical and numerical, under standard test conditions (STC). We examined the performance of six techniques: three analytical and three numerical. Except for the Saleem technique, which tends to underestimate the I-V and P-V curves in the MPP zone, the results demonstrated that both approaches (analytical and iterative/numerical) are valid and accurate for predicting the I-V and P-V curves under STC.

Keywords: Photovoltaic module; Single-diode model; Parameters extraction; Analytical methods; Numerical methods
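
The single-diode model referred to in the abstract above is implicit in the current, which is why iterative methods appear alongside analytical ones. The sketch below solves the standard single-diode equation, I = Iph - I0*(exp((V + I*Rs)/a) - 1) - (V + I*Rs)/Rsh with a = n*Ns*Vt, for I at a given V using Newton's method. It illustrates evaluating the model once its parameters are known and is not one of the six extraction techniques compared in the paper. All parameter values are hypothetical.

```python
import math

def sdm_residual(i, v, iph, i0, rs, rsh, a):
    """f(I) = Iph - I0*(exp((V + I*Rs)/a) - 1) - (V + I*Rs)/Rsh - I."""
    vd = v + i * rs
    return iph - i0 * (math.exp(vd / a) - 1.0) - vd / rsh - i

def sdm_current(v, iph, i0, rs, rsh, a, tol=1e-12, max_iter=50):
    """Solve the implicit single-diode equation for I at voltage V (Newton)."""
    i = iph  # the photocurrent is a natural starting guess
    for _ in range(max_iter):
        vd = v + i * rs
        f = sdm_residual(i, v, iph, i0, rs, rsh, a)
        # Derivative of the residual with respect to I.
        df = -i0 * (rs / a) * math.exp(vd / a) - rs / rsh - 1.0
        step = f / df
        i -= step
        if abs(step) < tol:
            break
    return i

# Hypothetical module parameters, for illustration only.
N_S = 60            # number of cells in series
N_IDEALITY = 1.3    # diode ideality factor
V_T = 0.025693      # thermal voltage kT/q at 25 degrees C, in volts
A = N_IDEALITY * N_S * V_T
PARAMS = dict(iph=8.2, i0=1e-9, rs=0.3, rsh=300.0, a=A)

i_sc = sdm_current(0.0, **PARAMS)   # current at V = 0 (short circuit)
i_30 = sdm_current(30.0, **PARAMS)  # current at V = 30 V
```

Sweeping `v` over a range of voltages with `sdm_current` traces an I-V curve, and multiplying each current by its voltage gives the corresponding P-V curve.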