The convolutional neural network (CNN) is a machine learning methodology that was successfully implemented in many domains, including electromagnetics and wireless communications. This paper investigates the use of CN...
With the widespread use of language models (LMs) in NLP tasks, researchers have discovered the potential of Chain-of-thought (CoT) to assist LMs in accomplishing complex reasoning tasks by generating intermediate step...
Recently, advancements in artificial intelligence technology have greatly influenced the field of education, particularly in the area of intelligent homework assistance. However, current approaches are primarily designed for procedural and logical tasks and often lack comprehension abilities. This limitation is particularly evident in multi-hop and continuous tasks. The integration of Large Language Models (LLMs) addresses this challenge by significantly enhancing the capability of AI systems to handle multi-hop and highly interconnected inputs. In this study, we focus on the learning needs of students in the Acting Department, specifically their study of movies and the significance of classic movie videos in their learning process. However, assessing deep comprehension of classic movies poses its own challenges. To overcome these challenges, we develop a quiz system utilizing Knowledge Graphs (KGs) and LLMs to facilitate a deeper understanding of classic films. Video quiz pairs are generated using Automatic Speech Recognition (ASR) technology, which leverages movie subtitles for question generation. To answer these questions, we employ KG and LLM techniques to process the questions and retrieve the corresponding answers. The proposed method achieves good performance in the Deep Video Understanding (DVU) task of NIST TRECVID, demonstrating its effectiveness.
Imitation learning has emerged as a promising approach for addressing sequential decision-making problems, with the assumption that expert demonstrations are optimal. However, in real-world scenarios, most demonstrations are imperfect, which undermines the effectiveness of imitation learning. While existing research has focused on optimizing with imperfect demonstrations, the training typically requires a certain proportion of optimal demonstrations to guarantee performance. To tackle these problems, we propose to first purify the potential noise in imperfect demonstrations, and subsequently conduct imitation learning from these purified demonstrations. Motivated by the success of diffusion models, we introduce a two-step purification via a diffusion process. In the first step, we apply a forward diffusion process to smooth potential noise in imperfect demonstrations by introducing additional noise. Subsequently, a reverse generative process is utilized to recover the optimal demonstrations from the diffused ones. We provide theoretical evidence supporting our approach, demonstrating that the distance between the purified and optimal demonstrations can be bounded. Empirical results on MuJoCo and RoboSuite demonstrate the effectiveness of our method from different aspects. Copyright 2024 by the author(s).
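The first purification step can be illustrated with a minimal sketch. It assumes a standard DDPM-style Gaussian forward process, x_t = sqrt(alpha_bar) * x_0 + sqrt(1 - alpha_bar) * eps; the abstract does not state the noise schedule or parameterization, so the function name `forward_diffuse` and the parameter `alpha_bar` are hypothetical. The learned reverse process, which would map the diffused demonstration back toward an optimal one, is not shown.

```python
import math
import random


def forward_diffuse(x0, alpha_bar, rng):
    """Sample x_t from N(sqrt(alpha_bar) * x0, (1 - alpha_bar) * I).

    x0        : list of floats (a flattened state-action demonstration)
    alpha_bar : cumulative signal-retention factor in (0, 1] (assumed schedule)
    rng       : random.Random instance, passed in for reproducibility
    """
    scale = math.sqrt(alpha_bar)        # how much of the original signal survives
    sigma = math.sqrt(1.0 - alpha_bar)  # standard deviation of the added noise
    return [scale * x + sigma * rng.gauss(0.0, 1.0) for x in x0]


# An imperfect demonstration (illustrative values only).
demo = [0.5, -1.2, 0.3]
rng = random.Random(0)

# A smaller alpha_bar adds more noise, blurring the demonstration's own noise.
diffused = forward_diffuse(demo, alpha_bar=0.6, rng=rng)
print(diffused)
```

With `alpha_bar` close to 1 the demonstration is left almost unchanged, and with `alpha_bar` close to 0 it is almost entirely replaced by Gaussian noise, so the value controls how aggressively the original imperfections are smoothed before the reverse process runs.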
Recently, college ranking systems have been receiving increasing attention due to the demand for and necessity of higher education, and college ranking is a key topic in the field of social choice. However, these rankings use different evaluation criteria, which leads to confusion for decision-makers. To address this issue, a simple and practical approach is to aggregate the ranking systems from different sources. In this paper, we conduct an experimental study on the aggregation of world university rankings. Specifically, we first classify unsupervised rank aggregation (RA) methods. Then, we compare the aggregation effects of 28 unsupervised RA methods on five public university rankings.
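As an illustration of what an unsupervised RA method does, the following minimal sketch implements the classic Borda count. The abstract does not list the 28 methods it compares, so choosing Borda here is an assumption made for illustration, and the university names are placeholders rather than data from the study.

```python
from collections import defaultdict


def borda_aggregate(rankings):
    """Aggregate several rankings (best item first) with the Borda count.

    An item at position p in a ranking of length n earns n - p points.
    Items missing from a ranking earn 0 points from it (an assumption;
    other conventions for partial rankings exist).
    """
    scores = defaultdict(int)
    for ranking in rankings:
        n = len(ranking)
        for position, item in enumerate(ranking):
            scores[item] += n - position
    # Highest total first; ties broken alphabetically for determinism.
    return sorted(scores, key=lambda item: (-scores[item], item))


# Three hypothetical source rankings of four placeholder universities.
rankings = [
    ["A", "B", "C", "D"],
    ["B", "A", "D", "C"],
    ["A", "C", "B", "D"],
]
consensus = borda_aggregate(rankings)
print(consensus)  # ['A', 'B', 'C', 'D']
```

Here A collects 11 points, B 9, C 6, and D 4, so the consensus order follows directly from the summed positional scores without any ground-truth ranking.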
ISBN: (Print) 9798400718779
Multimodal federated learning (FL) is a collaborative and privacy-preserving machine learning paradigm for multimodal data. With the impressive performance of large-scale pre-trained models, an increasing number of these models are being applied to FL. However, multimodal data in the real world is usually incomplete in modalities. Additionally, directly applying these large-scale pre-trained models in the federated learning framework leads to high computational and communication costs. To address these problems, we propose a novel Personalized Multimodal Federated Learning method via Collaborative Prompting with Missing Modalities (MFLCP). Specifically, we propose an efficient personalized multimodal federated learning framework built on large-scale pre-trained models. To address the issue of incomplete modalities in multimodal data, we propose a modal projection-aware collaborative prompting strategy for incomplete multimodal federated learning. Different categories of prompts are designed for the different categories of missing modalities, and a modality mapping component that leverages complementary semantic information from different modalities is designed to guide prompt learning, promoting better interaction between modalities. In addition, we propose a communication optimization method for efficient multimodal federated learning, which reduces the number of parameters of multimodal pre-trained models transmitted during federated communication, enhances the speed of local training, and significantly improves convergence speed by integrating large-scale pre-trained models in a lightweight manner. Meanwhile, we establish a personalized adaptive update mechanism for the federated local model, which adaptively updates the local model according to the characteristics of local data, effectively reducing the impact of data heterogeneity. Extensive experimental results on several benchmark datasets demonstrate that the proposed method outperforms the state-of-the-art baselines.
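The communication saving from exchanging prompts instead of full model weights can be sketched as follows. This is not the MFLCP algorithm, whose aggregation rule the abstract does not describe; it assumes a plain FedAvg-style weighted mean over small prompt vectors while the pre-trained backbone stays frozen on each client. The function name `aggregate_prompts` and all values are hypothetical.

```python
def aggregate_prompts(client_prompts, client_sizes):
    """Weighted average of client prompt vectors (FedAvg-style, assumed).

    client_prompts : list of equal-length lists of floats, one per client
    client_sizes   : number of local training samples per client
    Only these prompt vectors would cross the network; the frozen
    pre-trained backbone is never transmitted in this sketch.
    """
    total = sum(client_sizes)
    dim = len(client_prompts[0])
    return [
        sum(prompt[i] * size for prompt, size in zip(client_prompts, client_sizes)) / total
        for i in range(dim)
    ]


# Two hypothetical clients with 2-dimensional prompts and unequal data sizes.
client_prompts = [[1.0, 2.0], [3.0, 4.0]]
client_sizes = [1, 3]
global_prompt = aggregate_prompts(client_prompts, client_sizes)
print(global_prompt)  # [2.5, 3.5]
```

The client with three times as much data pulls the global prompt three times as strongly, and the payload per round is the prompt dimension rather than the size of the pre-trained model.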
Person re-identification (ReID) aims to retrieve a target person across non-overlapping cameras. Due to the uncontrollable environment and the privacy concerns, the diversity and scale of real-world training data are ...
This paper studies the Who-What-Where (3W) composite-semantic video instance search (INS) problem, which aims to find a specific person doing a queried action in a particular place. Mainstream approaches adopt a complete decomposition strategy, which divides a composite-semantic query into multiple single-semantic queries. However, because they do not analyze the correlations among the constituent semantics, these methods cannot always generate identity-matching and semantics-consistent 3W INS results. To address these challenges, we propose a partial decomposition scheme with action as the link. Specifically, we selectively split the 3W INS into person-action INS and action-location INS. The former ensures that the retrieved person and action share the same identity by modeling their relative spatial positions at the frame level, while the latter improves the semantic consistency between action and location with a cross-semantic attention mechanism at the shot level. We also build a large-scale 3W INS dataset, containing over 470k video shots, on the basis of the NIST TRECVID 2016-2021 INS tasks, and verify the effectiveness of the proposed method with both quantitative and qualitative experiments.
We present NNVISR - an open-source filter plugin for the VapourSynth video processing framework, which facilitates the application of neural networks for various kinds of video enhancing tasks, including denoising, s...
Weakly supervised object localization (WSOL) aims at predicting the location of objects with image-level labels. Fine-grained WSOL task has its characteristic challenge compared with generic object localization. The ...