ISBN (Print): 9798331314385
With the rapid development of multimedia technology, audio-visual learning has emerged as a promising research topic within the field of multimodal analysis. In this paper, we explore parameter-efficient transfer learning for audio-visual learning and propose the Audio-visual Mixture of Experts (AVMoE) to flexibly inject adapters into pre-trained models. Specifically, we introduce unimodal and cross-modal adapters as multiple experts that specialize in intra-modal and inter-modal information, respectively, and employ a lightweight router to dynamically allocate the weight of each expert according to the specific demands of each task. Extensive experiments demonstrate that our proposed AVMoE achieves superior performance across multiple audio-visual tasks, including AVE, AVVP, AVS, and AVQA. Furthermore, visual-only experimental results also indicate that our approach can tackle challenging scenes where modality information is missing. The source code is available at https://***/yingchengy/AVMOE.
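The adapter-as-expert design with a lightweight router can be illustrated with a minimal PyTorch sketch. All module names, dimensions, and the way cross-modal context is consumed here are illustrative assumptions, not the authors' implementation:

```python
import torch
import torch.nn as nn

class Adapter(nn.Module):
    """Bottleneck adapter: down-project, non-linearity, up-project."""
    def __init__(self, dim, bottleneck=64):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)
        self.up = nn.Linear(bottleneck, dim)

    def forward(self, x, context=None):
        # A cross-modal expert would consume the other modality; this
        # sketch simply adds its mean-pooled features when provided.
        h = x if context is None else x + context.mean(dim=1, keepdim=True)
        return self.up(torch.relu(self.down(h)))

class AVMoELayer(nn.Module):
    """Mixture of unimodal and cross-modal adapter experts, mixed by a
    lightweight router that weighs each expert per input."""
    def __init__(self, dim, num_experts=4):
        super().__init__()
        self.experts = nn.ModuleList(Adapter(dim) for _ in range(num_experts))
        self.router = nn.Linear(dim, num_experts)  # lightweight router

    def forward(self, x, context=None):
        # Router produces one weight per expert from pooled token features.
        weights = torch.softmax(self.router(x.mean(dim=1)), dim=-1)  # (B, E)
        outs = []
        for i, expert in enumerate(self.experts):
            # By convention here, the second half are cross-modal experts.
            ctx = context if i >= len(self.experts) // 2 else None
            outs.append(expert(x, ctx))
        outs = torch.stack(outs, dim=-1)                  # (B, T, D, E)
        mixed = (outs * weights[:, None, None, :]).sum(-1)
        return x + mixed                                  # residual injection
```

A call such as layer(audio_tokens, context=visual_tokens) would then let the router trade off intra-modal and inter-modal experts for each input.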
Fast adaptation to unknown peers (partners or opponents) with different strategies is a key challenge in multi-agent games. To achieve this, it is crucial for the agent to probe and identify the peer's strategy efficiently, as this is the prerequisite for carrying out the best response in adaptation. However, exploring the strategies of unknown peers is difficult, especially when the games are partially observable and have a long horizon. In this paper, we propose a peer identification reward, which rewards the learning agent based on how well it can identify the behavior pattern of the peer over the historical context, such as observations over multiple episodes. This reward motivates the agent to learn a context-aware policy for effective exploration and fast adaptation, i.e., to actively seek and collect informative feedback from peers when uncertain about their policies, and to exploit the context to perform the best response when confident. We evaluate our method on diverse testbeds that involve competitive (Kuhn Poker), cooperative (PO-Overcooked), or mixed (Predator-Prey-W) games with peer agents. We demonstrate that our method induces more active exploration behavior, achieving faster adaptation and better outcomes than existing methods.
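As a rough illustration of the idea, such an intrinsic reward can be derived from a learned identifier's confidence in recognizing the peer from the interaction history. The identifier interface and shaping coefficient below are assumptions for the sketch, not the paper's exact formulation:

```python
import torch
import torch.nn.functional as F

def peer_identification_reward(identifier, history, true_peer_id, beta=0.1):
    """Intrinsic reward: how well a learned identifier recognizes the
    peer's behavior pattern from the historical context.

    identifier:   model mapping an encoded interaction history to logits
                  over known peer strategies (assumed trained jointly).
    history:      tensor encoding observations/actions over past episodes.
    true_peer_id: index of the peer strategy actually in play.
    """
    logits = identifier(history)                        # (num_peer_types,)
    log_prob = F.log_softmax(logits, dim=-1)[true_peer_id]
    # Higher log-likelihood of the correct peer type means a higher reward,
    # encouraging the agent to probe informatively when it is uncertain.
    return beta * log_prob.item()
```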
Large Language Models (LLMs) have demonstrated considerable cross-lingual alignment and generalization ability. Current research primarily focuses on improving LLMs' cross-lingual generalization capabilities. Howev...
Tool learning is widely acknowledged as a foundational approach for deploying large language models (LLMs) in real-world scenarios. While current research primarily emphasizes leveraging tools to augment LLMs, it freq...
Multilingual text recognition (MLTR) systems typically focus on a fixed set of languages, which makes it difficult to handle newly added languages or adapt to ever-changing data distributions. In this paper, we propose the Incremental MLTR (IMLTR) task in the context of incremental learning (IL), where different languages are introduced in batches. IMLTR is particularly challenging due to rehearsal imbalance, i.e., the uneven distribution of sample characters in the rehearsal set, which retains a small amount of old data as past memory. To address this issue, we propose a Multiplexed Routing Network (MRN). MRN trains a recognizer for each language seen so far. A language domain predictor is then learned on the rehearsal set to weigh the recognizers. Since the recognizers are derived from the original data, MRN effectively reduces the reliance on older data and better combats catastrophic forgetting, the core issue in IL. We extensively evaluate MRN on the MLT17 and MLT19 datasets. It outperforms existing general-purpose IL methods by large margins, with average accuracy improvements ranging from 10.3% to 35.8% under different settings. Code is available at https://***/simplify23/MRN.
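A minimal sketch of the multiplexed routing idea: per-language recognizers are kept as they are learned, and a domain predictor trained on the rehearsal set weighs their outputs. The recognizer interface and output shapes below are assumptions:

```python
import torch
import torch.nn as nn

class MultiplexedRoutingNetwork(nn.Module):
    """One recognizer per seen language; a language-domain predictor
    (trained on the small rehearsal set) weighs recognizer outputs."""
    def __init__(self, recognizers, feat_dim):
        super().__init__()
        self.recognizers = nn.ModuleList(recognizers)  # per-language models
        self.domain_predictor = nn.Linear(feat_dim, len(recognizers))

    def forward(self, image_feat, image):
        # Predict how likely each language domain is for this image.
        weights = torch.softmax(self.domain_predictor(image_feat), dim=-1)
        # Each recognizer emits per-timestep character logits (B, T, C);
        # mix them by the predicted domain weights.
        logits = torch.stack([r(image) for r in self.recognizers], dim=-1)
        return (logits * weights[:, None, None, :]).sum(-1)
```

Because routing relies only on the small rehearsal set, adding a language means training one new recognizer and re-fitting the predictor, rather than retraining on all old data.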
Multimodal Question Answering (MMQA) is crucial as it enables comprehensive understanding and accurate responses by integrating insights from diverse data representations such as tables, charts, and text. Most existin...
Text-to-image models encounter safety issues, including concerns related to copyright and Not-Safe-For-Work (NSFW) content. Although several methods have been proposed for erasing inappropriate concepts from diffusion ...
ISBN (Print): 9798350329964
A fault in a large online service system often triggers numerous alerts due to the complex business and component dependencies among services, a phenomenon known as an "alert storm". In a short time, an online service system may generate a huge amount of alert data. This poses a challenge for on-call engineers, who must identify the alerts associated with a system failure for root cause analysis. In this paper, we propose DyAlert, a dynamic graph neural network-based approach that links alerts likely triggered by the same fault, reducing the burden on on-call engineers during fault analysis. Our insight is that alerts are often triggered by alert propagation when a system failure occurs (e.g., alert a leads to the occurrence of alert b); whether two alerts should be linked depends on whether one is triggered by the propagation of the other. Leveraging this insight, we design a dynamic graph (the Alert-Metric Dynamic Graph) that describes the propagation process of alerts. Based on this dynamic graph, we train a neural network-based model to predict alert links. We evaluate DyAlert with real-world data collected from an online service system running 85 business units and about 30,000 different services in a large enterprise. The results show that DyAlert is effective in predicting alert links, outperforming state-of-the-art approaches with an average increase of 0.259 in F1-score.
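A simplified sketch of the link-prediction step: node embeddings produced by a dynamic graph encoder are scored pairwise to decide whether two alerts are propagation-linked. The scoring head below is an assumption; the abstract does not specify the model architecture:

```python
import torch
import torch.nn as nn

class AlertLinkPredictor(nn.Module):
    """Scores whether alert a and alert b are linked by propagation,
    given their embeddings from an alert-metric dynamic graph encoder."""
    def __init__(self, embed_dim, hidden=128):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(2 * embed_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, emb_a, emb_b):
        # Concatenate the two alert embeddings and predict a link probability.
        pair = torch.cat([emb_a, emb_b], dim=-1)
        return torch.sigmoid(self.mlp(pair))
```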
Vision-language pre-trained models (e.g., CLIP), trained on large-scale datasets via self-supervised learning, are drawing increasing research attention since they achieve superior performance on multi-modal downstream tasks. Nevertheless, we find that adversarial perturbations crafted on vision-language pre-trained models can be used to attack the corresponding downstream task models. Specifically, to investigate such adversarial transferability, we introduce a task-agnostic method named Global and Local Augmentation (GLA) attack to generate highly transferable adversarial examples on CLIP for attacking black-box downstream task models. GLA adopts random crop-and-resize at both the global and local patch levels to create more diversity and make the adversarial noise robust. GLA then generates adversarial perturbations by minimizing the cosine similarity between intermediate features of augmented adversarial and benign examples. Extensive experiments on three CLIP image encoders with different backbones and three different downstream tasks demonstrate the superiority of our method over other strong baselines. The code is available at https://***/yqlvcoding/GLAattack.
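The attack objective can be sketched as a PGD-style step: apply global and local crop-and-resize views to both adversarial and benign images, then descend on the cosine similarity between their features. The crop scales, step sizes, and the use of encoder outputs (rather than specific intermediate layers) are assumptions for this sketch:

```python
import torch
import torch.nn.functional as F
import torchvision.transforms.functional as TF
from torchvision.transforms import RandomResizedCrop

def gla_step(encoder, x_adv, x_clean, x_orig, alpha=1/255, eps=8/255):
    """One step of a global-local augmentation attack: minimize cosine
    similarity between features of matched adversarial/benign views."""
    x_adv = x_adv.clone().detach().requires_grad_(True)
    loss = 0.0
    # Global view keeps most of the image; local view crops small patches.
    for scale in ((0.5, 1.0), (0.1, 0.4)):
        i, j, h, w = RandomResizedCrop.get_params(
            x_adv, scale=scale, ratio=(3 / 4, 4 / 3))
        view_adv = TF.resized_crop(x_adv, i, j, h, w, [224, 224])
        with torch.no_grad():
            view_clean = TF.resized_crop(x_clean, i, j, h, w, [224, 224])
            f_clean = encoder(view_clean)
        f_adv = encoder(view_adv)
        # Minimizing similarity drives adversarial features away from benign.
        loss = loss + F.cosine_similarity(f_adv, f_clean, dim=-1).mean()
    loss.backward()
    with torch.no_grad():
        x_adv = x_adv - alpha * x_adv.grad.sign()            # descend
        x_adv = x_orig + (x_adv - x_orig).clamp(-eps, eps)   # eps-ball
    return x_adv.detach().clamp(0, 1)
```

Sharing one set of crop parameters across the adversarial and benign views keeps the feature comparison aligned, so the loss reflects the perturbation rather than a mismatch in crops.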