Search Results - Inner Mongolia University Library

Long-Term Invariant Local Features via Implicit Cross-Domain Correspondences

arXiv, 2023

Authors: Pataki, Zador; Altillawi, Mohammad; Kanakis, Menelaos; Pautrat, Rémi; Shen, Fengyi; Liu, Ziyuan; Van Gool, Luc; Pollefeys, Marc. The Computer Vision and Geometry Lab Department of Computer Science ETH Zurich Switzerland; The Computer Vision Center CVC-Barcelona; The Intelligent Robotics Cloud Technology lab of Huawei-Munich Germany; The Computer Vision Lab Department Electrical Engineering ETH Zurich Switzerland; The Intelligent Robotics Cloud Technology lab of Huawei-Munich Germany; The Intelligent Robotics Cloud Technology lab of Huawei-Munich Germany; The Center for Processing Speech and Images KU Leuven; The Computer Vision Lab ETH Zurich Switzerland

Modern learning-based visual feature extraction networks perform well in intra-domain localization; however, their performance declines significantly when image pairs are captured across long-term visual domain variations, such as seasonal and daytime variations. In this paper, our first contribution is a benchmark to investigate the performance impact of long-term variations on visual localization. We conduct a thorough analysis of the performance of current state-of-the-art feature extraction networks under various domain changes and find a significant performance gap between intra- and cross-domain localization. We investigate different methods to close this gap by improving the supervision of modern feature extractor networks. We propose a novel data-centric method, Implicit Cross-Domain Correspondences (iCDC). iCDC represents the same environment with multiple Neural Radiance Fields, each fitting the scene under an individual visual domain. It utilizes the underlying 3D representations to generate accurate correspondences across different long-term visual conditions. Our proposed method enhances cross-domain localization performance, significantly reducing the performance gap. When evaluated on popular long-term localization benchmarks, our trained networks consistently outperform existing methods. This work serves as a substantial stride toward more robust visual localization pipelines for long-term deployments, and opens up research avenues in the development of long-term invariant descriptors. Copyright © 2023, The Authors. All rights reserved.

Keywords: Feature extraction

An Asynchronous WFST-Based Decoder for Automatic Speech Recognition

IEEE International Conference on Acoustics, Speech and Signal Processing

Authors: Hang Lv; Zhehuai Chen; Hainan Xu; Daniel Povey; Lei Xie; Sanjeev Khudanpur. Audio Speech and Language Processing Lab (ASLP@NPU) School of Computer Science Northwestern Polytechnical University Xi’an China; Center of Language and Speech Processing Johns Hopkins University Baltimore MD USA; Shanghai Jiao Tong University; Xiaomi Corporation Beijing China; Human Language Technology Center of Excellence Johns Hopkins University Baltimore MD USA

We introduce an asynchronous dynamic decoder, which adopts an efficient A* algorithm to incorporate big language models in one-pass decoding for large vocabulary continuous speech recognition. Unlike standard one-pass decoding with an on-the-fly composition decoder, which might induce a significant computation overhead, the asynchronous dynamic decoder has a novel design with two fronts, one performing "exploration" and the other "backfill". The computation of the two fronts alternates during decoding, resulting in more effective pruning than standard one-pass decoding with an on-the-fly composition decoder. Experiments show that the proposed decoder is notably faster than standard one-pass decoding with an on-the-fly composition decoder, and the speedup becomes more pronounced as data complexity increases.

Keywords: Vocabulary; Heuristic algorithms; Conferences; Computational modeling; Signal processing algorithms; Signal processing; Decoding

WINVC: One-Shot Voice Conversion with Weight Adaptive Instance Normalization

18th Pacific Rim International Conference on Artificial Intelligence, PRICAI 2021

Authors: Huang, Shengjie; Chen, Mingjie; Xu, Yanyan; Ke, Dengfeng; Hain, Thomas. School of Information Science and Technology Beijing Forestry University Beijing China; Engineering Research Center for Forestry-Oriented Intelligent Information Processing of National Forestry and Grassland Administration Beijing China; Computer Science Department University of Sheffield Sheffield United Kingdom; School of Information Science Beijing Language and Culture University Beijing China

ISBN: (print) 9783030893620

This paper proposes a one-shot voice conversion (VC) solution. In many one-shot voice conversion solutions (e.g., auto-encoder-based VC methods), performance has improved dramatically thanks to instance normalization and adaptive instance normalization. However, the fluency of one-shot voice conversion is still lacking, and the speaker similarity is not good enough. This paper introduces the weight adaptive instance normalization strategy to improve the naturalness and similarity of one-shot voice conversion. Experimental results show that, on the VCTK dataset, the MOS of our proposed model, weight adaptive instance normalization voice conversion (WINVC), reaches 3.97 on a five-point scale, and the SMOS reaches 3.31 on a four-point scale. In addition, WINVC achieves a MOS of 3.44 and an SMOS of 3.11 for one-shot voice conversion on a small dataset of 80 speakers with 5 utterances per speaker. © 2021, Springer Nature Switzerland AG.

Keywords: Generative adversarial networks

THE DATABASE AND BENCHMARK FOR THE SOURCE SPEAKER TRACING CHALLENGE 2024

arXiv, 2024

Authors: Li, Ze; Lin, Yuke; Yao, Tian; Suo, Hongbin; Zhang, Pengyuan; Ren, Yanzhen; Cai, Zexin; Nishizaki, Hiromitsu; Li, Ming. School of Computer Science Wuhan University Wuhan China; Suzhou Municipal Key Laboratory of Multimodal Intelligent Systems Duke Kunshan University Kunshan China; AI Center OPPO Beijing China; Key Laboratory of Speech Acoustics and Content Understanding Institute of Acoustics CAS China; Key Laboratory of Aerospace Information Security and Trusted Computing Ministry of Education School of Cyber Science and Engineering Wuhan University China; Center for Language and Speech Processing Johns Hopkins University United States; Integrated Graduate School of Medicine Engineering and Agricultural Sciences University of Yamanashi 4-4-37 Takeda Yamanashi Kofu 400-8510 Japan

Keywords: Database systems

An asynchronous WFST-based decoder for automatic speech recognition

arXiv, 2021

Authors: Lv, Hang; Chen, Zhehuai; Xu, Hainan; Povey, Daniel; Xie, Lei; Khudanpur, Sanjeev. School of Computer Science Northwestern Polytechnical University Xi'an China; Center of Language and Speech Processing United States; Human Language Technology Center of Excellence Johns Hopkins University Baltimore MD United States; Xiaomi Corporation Beijing China; SpeechLab Department of Computer Science and Engineering Shanghai Jiao Tong University China

We introduce an asynchronous dynamic decoder, which adopts an efficient A* algorithm to incorporate big language models in one-pass decoding for large vocabulary continuous speech recognition. Unlike standard one-pass decoding with an on-the-fly composition decoder, which might induce a significant computation overhead, the asynchronous dynamic decoder has a novel design with two fronts, one performing "exploration" and the other "backfill". The computation of the two fronts alternates during decoding, resulting in more effective pruning than standard one-pass decoding with an on-the-fly composition decoder. Experiments show that the proposed decoder is notably faster than standard one-pass decoding with an on-the-fly composition decoder, and the speedup becomes more pronounced as data complexity increases. Copyright © 2021, The Authors. All rights reserved.

Keywords: Decoding

Trainable reference-based evaluation metric for identifying quality of English-Gujarati machine translation system

AIP Conference Proceedings, 2025, Vol. 3253, No. 1

Authors: Nisheeth Joshi; Pragya Katyayan; Palak Arora. Speech and Language Processing Lab Centre for Artificial Intelligence Banasthali Vidyapith Raj. Niwai India; Department of Computer Science Banasthali Vidyapith Raj. Niwai India

Machine Translation (MT) evaluation is an integral part of the MT development life cycle. Without analyzing the outputs of MT engines, it is impossible to assess the performance of an MT system. Experiments have shown that what works for English and other European languages does not work well for Indian languages. Thus, in this paper, we introduce a reference-based MT evaluation metric for Gujarati that is based on supervised learning. We trained two versions of the metric, both of which use 25 features for training. One model is trained using 6 hidden layers for 500 epochs, while the other is trained using 10 hidden layers for 500 epochs. To test the performance of the metric, we collected 1000 MT outputs from seven MT systems. These MT engine outputs were compared with 1 human reference translation. When the developed metrics were compared with other available metrics, they were found to correlate better with human judgments.

Keywords:

LMP-GAN: Out-Of-Distribution Detection For Non-Control Data Malware Attacks

IEEE Transactions on Pattern Analysis and Machine Intelligence, 2025, Vol. PP, No. 7, p. PP

Authors: Wood, David; Kapp, David; Kebede, Temesgen; Hirakawa, Keigo. Wuhan University School of Computer Science China; Wuhan University National Engineering Research Center for Multimedia Software Hubei Key Laboratory of Multimedia and Network Communication Engineering China; Zhongguancun Academy China; Wuhan University State Key Laboratory of Information Engineering in Surveying Mapping and Remote Sensing China; Sun Yat-sen University School of Geography and Planning China; Mohamed bin Zayed University of Artificial Intelligence United Arab Emirates; Chongqing University College of Computer Science China; The University of Tokyo Japan; RIKEN Center for Advanced Intelligence Project Japan; Intelligent Science & Technology Academy Limited CASIC China; iFlytek Company Ltd. National Engineering Research Center of Speech and Language Information Processing China; Nanyang Technological University College of Computing & Data Science Singapore; Henan Academy of Sciences Aerospace Information Research Institute China

Anomaly detection is a common application of machine learning. Out-of-distribution (OOD) detection in particular is a semi-supervised anomaly detection technique where the detection method is trained only on the inlier (in-distribution) samples; unlike the fully supervised variant, the distribution of the outlier samples is never explicitly modeled in OOD detection tasks. In this work, we design a novel GAN-based OOD detection network specifically intended to protect cyber-physical signal systems from a novel Trojan malware called the non-control data (NCD) attack, which evades conventional malware detection techniques. Inspired in part by the classical locally most powerful (LMP) test in statistical inference, the proposed LMP-GAN trains the OOD detector (discriminator) by generating OOD samples that are aimed at making maximal alteration to the inlier samples while evading detection. We experimentally compare the results to state-of-the-art anomaly detection methods to demonstrate the benefits and the appropriateness of the LMP-GAN OOD detector. © 2025 IEEE.

Keywords: Malware

MTP: Advancing Remote Sensing Foundation Model via Multi-Task Pretraining

arXiv, 2024

Authors: Wang, Di; Zhang, Jing; Xu, Minqiang; Liu, Lin; Wang, Dongsheng; Gao, Erzhong; Han, Chengxi; Guo, Haonan; Du, Bo; Tao, Dacheng; Zhang, Liangpei. School of Computer Science Wuhan University Wuhan 430072 China; Institute of Artificial Intelligence Wuhan University Wuhan 430072 China; National Engineering Research Center for Multimedia Software Wuhan University Wuhan 430072 China; Hubei Key Laboratory of Multimedia and Network Communication Engineering Wuhan University Wuhan 430072 China; School of Computer Science Faculty of Engineering The University of Sydney Australia; iFlytek Co Ltd National Engineering Research Center of Speech and Language Information Processing Hefei 230088 China; State Key Laboratory of Information Engineering in Surveying Mapping and Remote Sensing Wuhan University Wuhan 430079 China; School of Computer Science and Engineering Nanyang Technological University Singapore Singapore

Foundation models have reshaped the landscape of Remote Sensing (RS) by enhancing various image interpretation tasks. Pretraining is an active research topic, encompassing supervised and self-supervised learning methods to initialize model weights effectively. However, transferring the pretrained models to downstream tasks may encounter task discrepancy due to their formulation of pretraining as image classification or object discrimination tasks. In this study, we explore the Multi-Task Pretraining (MTP) paradigm for RS foundation models to address this issue. Using a shared encoder and task-specific decoder architecture, we conduct multi-task supervised pretraining on the SAMRS dataset, encompassing semantic segmentation, instance segmentation, and rotated object detection. MTP supports both convolutional neural networks and vision transformer foundation models with over 300 million parameters. The pretrained models are finetuned on various RS downstream tasks, such as scene classification, horizontal and rotated object detection, semantic segmentation, and change detection. Extensive experiments across 14 datasets demonstrate the superiority of our models over existing ones of similar size and their competitive performance compared to larger state-of-the-art models, thus validating the effectiveness of MTP. The code and pretrained models will be released at https://***/ViTAE-Transformer/MTP. Copyright © 2024, The Authors. All rights reserved.

Keywords: Object detection

SAV-SE: Scene-aware Audio-Visual Speech Enhancement with Selective State Space Model

arXiv, 2024

Authors: Qian, Xinyuan; Gao, Jiaran; Zhang, Yaodan; Zhang, Qiquan; Liu, Hexin; Garcia, Leibny Paola; Li, Haizhou. The School of Computer and Communication Engineering University of Science and Technology Beijing Beijing 100083 China; The School of Electrical Engineering and Telecommunications The University of New South Wales Sydney 2052 Australia; The College of Computing and Data Science Nanyang Technological University Singapore; The Center for Language and Speech Processing Johns Hopkins University United States; The Guangdong Provincial Key Laboratory of Big Data Computing The Chinese University of Hong Kong Shenzhen 518172 China; Shenzhen Research Institute of Big Data Shenzhen 51872 China

Speech enhancement plays an essential role in various applications, and the integration of visual information has been demonstrated to bring substantial advantages. However, the majority of current research concentrates on facial and lip movements, which can be compromised or entirely inaccessible when occlusions occur or when the camera view is distant. Meanwhile, contextual visual cues from the surrounding environment have been overlooked: for example, when we see a dog bark, our brain has the innate ability to discern and filter out the barking noise. To this end, in this paper, we introduce a novel task, i.e., Scene-aware Audio-Visual Speech Enhancement (SAV-SE). To the best of our knowledge, this is the first proposal to use rich contextual information from synchronized video as auxiliary cues to indicate the type of noise, which eventually improves the speech enhancement performance. Specifically, we propose the VC-S2E method, which incorporates the Conformer and Mamba modules for their complementary strengths. Extensive experiments are conducted on the public MUSIC, AVSpeech and AudioSet datasets, where the results demonstrate the superiority of VC-S2E over other competitive methods. We will make the source code publicly available. Project demo page: https://***/ © 2024, CC BY.

Keywords: Speech enhancement