Random feature (RF) has been widely used for node consistency in decentralized kernel ridge regression (KRR). Currently, the consistency is guaranteed by imposing constraints on coefficients of features, necessitating...
Learning comprehensive spatiotemporal features is crucial for human action recognition. Existing methods tend to model spatiotemporal feature blocks in an integrate-separate-integrate form, as in the appearance-and-relation network (ARTNet) and the spatiotemporal and motion network (STM). However, as blocks are stacked, the rear part of the network becomes poorly interpretable. To avoid this problem, we propose a novel architecture called the spatial temporal relation network (STRNet), which learns explicit appearance, motion, and especially temporal relation information. Specifically, STRNet is built from three branches, which separate the features into 1) an appearance pathway, to obtain spatial semantics, 2) a motion pathway, to reinforce the spatiotemporal feature representation, and 3) a relation pathway, to capture temporal relation details of successive frames and to explore long-term representation dependency. In addition, STRNet does not simply merge the multi-branch information; instead, we apply a flexible and effective strategy to fuse the complementary information from the multiple pathways. We evaluate our network on four major action recognition benchmarks: Kinetics-400, UCF-101, HMDB-51, and Something-Something v1. STRNet achieves state-of-the-art results on UCF-101 and HMDB-51, and accuracy comparable to the state-of-the-art methods on Something-Something v1 and Kinetics-400.
Ridgeless regression has garnered attention among researchers, particularly in light of the "Benign Overfitting" phenomenon, where models interpolating noisy samples demonstrate robust generalization. Howeve...
ISBN (digital): 9798350359312
ISBN (print): 9798350359329
The generalization of the Euclidean network paradigm to Riemannian manifolds has attracted much attention in recent years because it offers useful geometric representations for processing manifold-valued data. However, information degradation during data compression mapping hinders Riemannian networks from going deeper, and very few solutions have been designed specifically for this problem. Given the remarkable success of deep residual learning in Euclidean networks, a novel Riemannian residual learning mechanism (RRLM) is proposed in the context of Symmetric Positive Definite (SPD) manifolds, enabling the characterization of deep spatiotemporal features while preserving the manifold properties. Based on RRLM, a stack of SPD manifold-constrained residual-like blocks is appended to the tail of the original SPDNet (the backbone) to conduct deep Riemannian residual learning. For simplicity, we refer to this network architecture as the Riemannian residual SPD network (ResSPDNet). Experimental results on three types of visual classification tasks, i.e., facial emotion recognition, drone recognition, and action recognition, demonstrate that our method can achieve improved accuracy with a deepened network structure.
In traditional audio captioning methods, a model is usually trained in a fully supervised manner using a human-annotated dataset containing audio-text pairs and then evaluated on the test set from the same dataset. Such methods have two limitations. First, they are often data-hungry and require time-consuming and expensive human annotation to obtain audio-text pairs. Second, the resulting models often suffer from performance degradation in cross-domain scenarios, i.e., when the input audio comes from a different domain than the training set, an issue that has received little attention. To address these issues, we propose a new zero-shot method for audio captioning. Our method is built on the contrastive language-audio pre-training (CLAP) model. During training, the model reconstructs the ground-truth caption using the CLAP text encoder. At inference, the model generates text descriptions from the CLAP audio embeddings of the given audio inputs. To improve the model's ability to transition from text-to-text generation to audio-to-text generation, we propose a mixed-augmentations-based soft prompt that learns more robust latent representations by leveraging instance replacement and embedding augmentation. Additionally, we introduce a retrieval-based acoustic-aware hard prompt that improves the cross-domain performance of the model by employing the domain-agnostic label information of sound events. Extensive experiments on the AudioCaps and Clotho benchmarks show the effectiveness of our proposed method, which outperforms other zero-shot audio captioning approaches in in-domain scenarios and outperforms the compared methods in cross-domain scenarios, underscoring the generalization ability of our method.
The lack of sufficient flexibility is the key bottleneck of kernel-based learning that relies on manually designed, pre-given, and non-trainable kernels. To enhance kernel flexibility, this paper introduces the concep...
Symmetric positive definite (SPD) matrix has been demonstrated to be an effective feature descriptor in many scientific areas, as it can encode spatiotemporal statistics of the data adequately on a curved Riemannian m...
In traditional audio captioning methods, a model is usually trained in a fully supervised manner using a human-annotated dataset containing audio-text pairs and then evaluated on the test sets from the same dataset. S...
We address the problem of multi-modal object tracking in video and explore various options of fusing the complementary information conveyed by the visible (RGB) and thermal infrared (TIR) modalities including pixel-le...
Modeling photovoltaic (PV) modules is fundamental for analyzing their efficiency and performance under different operating conditions. Generally, PV module modeling is based on a suitable equivalent circuit and a set of parameters representing the module's properties. Determining these parameters is not a trivial task, as they are usually not included in the module's technical documentation, so suitable values for the model parameters must be extracted by other means. In this paper, we test and compare two approaches to extracting single-diode model (SDM) parameters, analytical and numerical, under standard test conditions (STC). We examined the performance of six techniques: three analytical and three numerical. Except for the Saleem technique, which tends to underestimate the I-V and P-V curves in the maximum power point (MPP) region, the results demonstrate that both approaches (analytical and iterative/numerical) are valid and accurate for predicting the I-V and P-V curves under STC.
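To make the single-diode model concrete, the sketch below evaluates the textbook SDM equation, I = I_ph - I_0 (exp((V + I R_s) / (n N_s V_t)) - 1) - (V + I R_s) / R_sh, and solves it for the current by bisection. It is only an illustration of how an I-V and P-V curve follows from a given parameter set; it is not one of the six extraction techniques compared in the paper. The function names and all parameter values (`i_ph`, `i_0`, `r_s`, `r_sh`, `n`, `n_cells`) are hypothetical and do not come from the abstract.

```python
import math

K_B = 1.380649e-23      # Boltzmann constant, J/K
Q_E = 1.602176634e-19   # elementary charge, C


def sdm_current(v, i_ph, i_0, r_s, r_sh, n, n_cells, temp_k=298.15):
    """Solve the implicit single-diode equation for the current at voltage v.

    The residual is strictly decreasing in the current, so bisection on a
    bracketing interval always converges. Assumes v >= 0.
    """
    v_t = n * n_cells * K_B * temp_k / Q_E  # ideality-scaled thermal voltage

    def residual(i):
        v_d = v + i * r_s  # voltage across the diode and shunt branch
        return i_ph - i_0 * math.expm1(v_d / v_t) - v_d / r_sh - i

    hi = i_ph + 1.0        # residual is negative here for v >= 0
    lo = -1.0
    while residual(lo) <= 0.0:  # widen the bracket for voltages beyond V_oc
        lo *= 2.0

    for _ in range(100):   # bisection
        mid = 0.5 * (lo + hi)
        if residual(mid) > 0.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def iv_curve(v_max, steps, **params):
    """Return (voltage, current, power) triples on a uniform voltage grid."""
    points = []
    for k in range(steps + 1):
        v = v_max * k / steps
        i = sdm_current(v, **params)
        points.append((v, i, v * i))
    return points


# Hypothetical parameters for a 60-cell module at STC (25 degrees C).
PARAMS = dict(i_ph=8.0, i_0=1e-9, r_s=0.3, r_sh=300.0, n=1.3, n_cells=60)

if __name__ == "__main__":
    curve = iv_curve(45.0, 90, **PARAMS)
    v_mpp, i_mpp, p_mpp = max(curve, key=lambda point: point[2])
    print(f"I_sc ~ {curve[0][1]:.3f} A")
    print(f"MPP  ~ {p_mpp:.1f} W at {v_mpp:.1f} V, {i_mpp:.2f} A")
```

Bisection is used here instead of Newton's method because it cannot diverge on the exponential term. A parameter-extraction method would wrap a solver like this in a fit against measured or datasheet points.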