Search results - Inner Mongolia University Library

Study of pre-processing defenses against adversarial attacks on state-of-the-art speaker recognition systems

arXiv, 2021

Authors: Joshi, Sonal; Villalba, Jesús; Zelasko, Piotr; Moro-Velázquez, Laureano; Dehak, Najim. Affiliations: Johns Hopkins University, Baltimore, MD 21218, United States; The Human Language Technology Center of Excellence, Johns Hopkins University, Baltimore, MD 21218, United States

Adversarial examples against speaker recognition (SR) systems are generated by adding carefully crafted noise to the speech signal so that the system fails while the perturbation remains imperceptible to humans. Such attacks pose severe security risks, making it vital to understand how vulnerable state-of-the-art SR systems are to them. It is even more important to propose defenses that can protect the systems against these attacks. Addressing these concerns, this paper first investigates how state-of-the-art x-vector based SR systems are affected by white-box adversarial attacks, i.e., when the adversary has full knowledge of the system. x-Vector based SR systems are evaluated against white-box adversarial attacks common in the literature: the fast gradient sign method (FGSM), the basic iterative method (BIM, a.k.a. iterative FGSM), projected gradient descent (PGD), and the Carlini-Wagner (CW) attack. To mitigate these attacks, the paper proposes four pre-processing defenses and evaluates them against powerful adaptive white-box adversarial attacks, i.e., when the adversary has full knowledge of the system, including the defense. The four pre-processing defenses, viz. randomized smoothing, DefenseGAN, variational autoencoder (VAE), and Parallel WaveGAN vocoder (PWG), are compared against the baseline defense of adversarial training. The results indicate that SR systems were extremely vulnerable under BIM, PGD, and CW attacks. Among the proposed pre-processing defenses, PWG combined with randomized smoothing offers the most protection against the attacks, with accuracy averaging 93% compared to 52% in the undefended system and an absolute improvement > 90% for BIM attacks with L∞ > 0.001 and the CW attack. Copyright © 2021, The Authors. All rights reserved.
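
The gradient-based attacks named in this abstract can be illustrated on a toy model. The sketch below is not the paper's setup: it assumes a hypothetical linear scorer with logistic loss in place of an x-vector system, labels in {-1, +1}, and illustrative names (`score`, `fgsm`, `bim`, `w`, `eps`, `alpha`) that do not come from the source.

```python
import math


def score(w, x):
    """Toy linear scorer (an assumption standing in for a real SR system)."""
    return sum(wi * xi for wi, xi in zip(w, x))


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def loss_grad_wrt_input(w, x, y):
    """Gradient of the logistic loss -log(sigmoid(y * w.x)) with respect to x."""
    margin = y * score(w, x)
    scale = -y * sigmoid(-margin)
    return [scale * wi for wi in w]


def sign(v):
    return (v > 0) - (v < 0)


def fgsm(w, x, y, eps):
    """Single step of size eps along the sign of the loss gradient."""
    grad = loss_grad_wrt_input(w, x, y)
    return [xi + eps * sign(g) for xi, g in zip(x, grad)]


def bim(w, x, y, eps, alpha, steps):
    """Repeated small FGSM steps, clipped to an L-infinity ball of radius eps."""
    x_adv = list(x)
    for _ in range(steps):
        x_adv = fgsm(w, x_adv, y, alpha)
        x_adv = [min(max(xa, xi - eps), xi + eps) for xa, xi in zip(x_adv, x)]
    return x_adv


if __name__ == "__main__":
    w = [1.0, -2.0, 0.5]
    x = [0.2, -0.1, 0.4]
    print("clean score:", score(w, x))
    print("FGSM score: ", score(w, fgsm(w, x, 1, 0.1)))
    print("BIM score:  ", score(w, bim(w, x, 1, 0.1, 0.04, 5)))
```

In both cases the perturbation stays within `eps` per coordinate while the score for the true label drops, which is the behavior the attacks are designed to produce.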

Keywords: Gradient methods

A call for prudent choice of subword merge operations in neural machine translation

arXiv, 2019

Authors: Ding, Shuoyang; Renduchintala, Adithya; Duh, Kevin. Affiliations: Center for Language and Speech Processing; Human Language Technology Center of Excellence, Johns Hopkins University

Most neural machine translation systems are built upon subword units extracted by methods such as Byte-Pair Encoding (BPE) or wordpiece. However, the number of merge operations is generally chosen by following existing recipes. In this paper, we conduct a systematic exploration of different numbers of BPE merge operations to understand how this number interacts with the model architecture, the strategy used to build vocabularies, and the language pair. Our exploration could provide guidance for selecting proper BPE configurations in the future. Most prominently, we show that for LSTM-based architectures it is necessary to experiment with a wide range of BPE merge operations, as there is no typical optimal BPE configuration, whereas for Transformer architectures a smaller BPE size tends to be a typically optimal choice. We urge the community to make prudent choices with subword merge operations, as our experiments indicate that a sub-optimal BPE configuration alone could easily reduce system performance by 3–4 BLEU points. Copyright © 2019, The Authors. All rights reserved.
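
To make the "number of merge operations" concrete, here is a minimal sketch of BPE merge learning on a tiny word-frequency table. It follows the commonly described greedy pair-merging procedure, not any specific toolkit used in the paper; the function names, the `</w>` end-of-word marker, and the tie-breaking rule are assumptions made for illustration.

```python
from collections import Counter


def learn_bpe(word_freqs, num_merges):
    """Greedily merge the most frequent adjacent symbol pair, num_merges times."""
    # Each word starts as a sequence of characters plus an end-of-word marker.
    vocab = {tuple(word) + ("</w>",): freq for word, freq in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        # Ties are broken lexicographically so the result is deterministic.
        best = max(pairs, key=lambda p: (pairs[p], p))
        merges.append(best)
        new_vocab = {}
        for symbols, freq in vocab.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            key = tuple(merged)
            new_vocab[key] = new_vocab.get(key, 0) + freq
        vocab = new_vocab
    return merges, vocab


def token_count(vocab):
    """Total number of subword tokens needed to encode the corpus."""
    return sum(len(symbols) * freq for symbols, freq in vocab.items())


if __name__ == "__main__":
    corpus = {"low": 5, "lower": 2, "newest": 6, "widest": 3}
    for n in (0, 1, 5, 10):
        merges, vocab = learn_bpe(corpus, n)
        print(n, "merges ->", token_count(vocab), "tokens")
```

More merges yield longer subword units and shorter token sequences, which is the trade-off the abstract argues should be tuned rather than copied from existing recipes.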

Keywords: Computational linguistics

Machine Translation System Selection from Bandit Feedback

arXiv, 2020

Authors: Naradowsky, Jason; Zhang, Xuan; Duh, Kevin. Affiliations: Preferred Networks; Johns Hopkins University; Human Language Technology Center of Excellence

Adapting machine translation systems in the real world is a difficult problem. In contrast to offline training, users cannot provide the type of fine-grained feedback typically used for improving the system. Moreover, users have different translation needs, and even a single user’s needs may change over time. In this work we take a different approach, treating the problem of adapting as one of selection. Instead of adapting a single system, we train many translation systems using different architectures and data partitions. Using bandit learning techniques on simulated user feedback, we learn a policy to choose which system to use for a particular translation task. We show that our approach can (1) quickly adapt to address domain changes in translation tasks, (2) outperform the single best system in mixed-domain translation tasks, and (3) make effective instance-specific decisions when using contextual bandit strategies. Copyright © 2020, The Authors. All rights reserved.
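
The idea of selecting among systems from bandit feedback can be sketched with a standard UCB1 policy. This is a generic illustration, not the policy from the paper: the three "systems" are hypothetical, their feedback is simulated as Bernoulli rewards with made-up success probabilities, and the names `ucb1_select` and `run_bandit` are illustrative.

```python
import math
import random


def ucb1_select(counts, sums, t):
    """Pick the arm with the highest upper confidence bound at round t."""
    for arm, n in enumerate(counts):
        if n == 0:
            return arm  # try every system once before using the bound
    scores = [
        sums[a] / counts[a] + math.sqrt(2.0 * math.log(t) / counts[a])
        for a in range(len(counts))
    ]
    return max(range(len(counts)), key=scores.__getitem__)


def run_bandit(reward_probs, rounds, seed=0):
    """Simulate user feedback and return how often each system was chosen."""
    rng = random.Random(seed)
    counts = [0] * len(reward_probs)
    sums = [0.0] * len(reward_probs)
    for t in range(1, rounds + 1):
        arm = ucb1_select(counts, sums, t)
        # Simulated binary feedback: 1.0 if the user accepts the translation.
        reward = 1.0 if rng.random() < reward_probs[arm] else 0.0
        counts[arm] += 1
        sums[arm] += reward
    return counts


if __name__ == "__main__":
    # Hypothetical acceptance rates for three translation systems.
    print(run_bandit([0.3, 0.5, 0.8], rounds=2000))
```

A contextual variant, as mentioned in the abstract, would condition the choice on features of each translation request instead of keeping one global estimate per system.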

Keywords: Computational linguistics

Wake Word Detection with Alignment-Free Lattice-Free MMI

arXiv, 2020

Authors: Wang, Yiming; Lv, Hang; Povey, Daniel; Xie, Lei; Khudanpur, Sanjeev. Affiliations: Center for Language and Speech Processing; Human Language Technology Center of Excellence, Johns Hopkins University, Baltimore, MD, United States; Xiaomi Inc., Beijing, China; ASLP@NPU, School of Computer Science, Northwestern Polytechnical University, Xi’an, China

Always-on spoken language interfaces, e.g. personal digital assistants, rely on a wake word to start processing spoken input. We present novel methods to train a hybrid DNN/HMM wake word detection system from partially labeled training data, and to use it in on-line applications: (i) we remove the prerequisite of frame-level alignments in the LF-MMI training algorithm, permitting the use of un-transcribed training examples that are annotated only for the presence/absence of the wake word; (ii) we show that the classical keyword/filler model must be supplemented with an explicit non-speech (silence) model for good performance; (iii) we present an FST-based decoder to perform online detection. We evaluate our methods on two real data sets, showing 50%–90% reduction in false rejection rates at prespecified false alarm rates over the best previously published figures, and re-validate them on a third (large) data set. Copyright © 2020, The Authors. All rights reserved.

Keywords: Wakes

Software in the natural world: A computational approach to hierarchical emergence

arXiv, 2024

Authors: Rosas, Fernando E.; Geiger, Bernhard C.; Luppi, Andrea I.; Seth, Anil K.; Polani, Daniel; Gastpar, Michael; Mediano, Pedro A.M. Affiliations: Department of Informatics, University of Sussex, United Kingdom; Sussex Centre for Consciousness Science and Sussex AI, University of Sussex, United Kingdom; Center for Psychedelic Research and Centre for Complexity Science, Department of Brain Science, Imperial College London, United Kingdom; Center for Eudaimonia and Human Flourishing, University of Oxford, United Kingdom; Know-Center GmbH, Graz, Austria; Signal Processing and Speech Communication Laboratory, Graz University of Technology, Graz, Austria; Montreal Neurological Institute, McGill University, Canada; Department of Computer Science, University of Hertfordshire, Hatfield, United Kingdom; School of Computer and Communication Sciences, EPFL, Lausanne, Switzerland; Department of Computing, Imperial College London, United Kingdom; Division of Psychology and Language Sciences, University College London, United Kingdom

Understanding the functional architecture of complex systems is crucial to illuminate their inner workings and enable effective methods for their prediction and control. Recent advances have introduced tools to characterise emergent macroscopic levels; however, while these approaches are successful in identifying when emergence takes place, they are limited in the extent to which they can determine how it does. Here we address this important limitation by developing a computational approach to emergence, which characterises macroscopic processes in terms of their computational capabilities. Concretely, we articulate a view on emergence based on how software works, which is rooted in a mathematical formalisation of how macroscopic processes can express self-contained informational, interventional, and computational properties. This framework reveals a hierarchy of nested self-contained processes that determines what computations take place at what level, which in turn delineates the functional architecture of a complex system. This approach is illustrated on paradigmatic models from the statistical physics and computational neuroscience literature, which are shown to exhibit macroscopic processes that are akin to software in human-engineered systems. Overall, this framework enables a deeper understanding of the multi-level structure of complex systems, revealing specific ways in which they can be efficiently simulated, predicted, and controlled. Copyright © 2024, The Authors. All rights reserved.

Keywords: Statistical Physics

That sounds familiar: An analysis of phonetic representations transfer across languages

arXiv, 2020

Authors: Zelasko, Piotr; Moro-Velázquez, Laureano; Hasegawa-Johnson, Mark; Scharenborg, Odette; Dehak, Najim. Affiliations: Center for Language and Speech Processing; Human Language Technology Center of Excellence, Johns Hopkins University, Baltimore, MD, United States; ECE Department and Beckman Institute, University of Illinois, Urbana-Champaign, United States; Multimedia Computing Group, Delft University of Technology, Delft, Netherlands

Only a handful of the world's languages are abundant with the resources that enable practical applications of speech processing technologies. One of the methods to overcome this problem is to use the resources existing in other languages to train a multilingual automatic speech recognition (ASR) model, which, intuitively, should learn some universal phonetic representations. In this work, we focus on gaining a deeper understanding of how general these representations might be, and how individual phones are improved in a multilingual setting. To that end, we select a phonetically diverse set of languages, and perform a series of monolingual, multilingual and crosslingual (zero-shot) experiments. The ASR is trained to recognize International Phonetic Alphabet (IPA) token sequences. We observe significant improvements across all languages in the multilingual setting, and stark degradation in the crosslingual setting, where the model, among other errors, considers Javanese as a tone language. Notably, as little as 10 hours of target language training data tremendously reduces ASR error rates. Our analysis uncovered that even the phones that are unique to a single language can benefit greatly from adding training data from other languages, an encouraging result for the low-resource speech community. Copyright © 2020, The Authors. All rights reserved.

Keywords: Speech recognition

JHU-HLTCOE System for the VoxSRC Speaker Recognition Challenge

IEEE International Conference on Acoustics, Speech and Signal Processing

Authors: Daniel Garcia-Romero; Alan McCree; David Snyder; Gregory Sell. Affiliations: Human Language Technology Center of Excellence, Johns Hopkins University, Baltimore, MD, USA

ISBN: (digital) 9781509066315

ISBN: (print) 9781509066322

The VoxSRC speaker recognition challenge comprises data obtained from YouTube videos of celebrity interviews in a wide range of recording environments. The challenge provides FIXED and OPEN training conditions to allow cross-system comparisons and to characterize the effects of additional amounts of training data on system performance. This paper describes our submission to this challenge, in which we explored x-vector extractor topologies, classification head alternatives, data augmentation, and angular margin penalty. Our final entry to the FIXED condition (which achieved 2nd place) is the score average of 4 diverse systems. We find that this system outperforms a large single DNN with a similar number of parameters.

Script identification using across- and within-image distribution estimation

15th IAPR International Conference on Document Analysis and Recognition, ICDAR 2019

Authors: Sell, Gregory; Etter, David; Garcia-Romero, Daniel; McCree, Alan. Affiliations: Human Language Technology Center of Excellence, Johns Hopkins University, Baltimore, United States

ISBN: (print) 9781728128610

In this paper, we apply several modifications to script identification, several of which are inspired by techniques from the similar audio task of spoken language recognition. Specifically, we alter the architecture of a convolutional network with global average pooling to include variance pooling as well, we apply score calibration to the output scores of the network, and we use prior distribution estimation to condition the calibrated scores. We show that these methods are effective in script identification, with the use of priors showing especially promising improvements. Furthermore, in the domain of script identification, several additional extensions of distribution estimation are available which consider the distribution within each image, and we demonstrate much larger improvements when employing these extensions. Finally, we also show that an embedding-plus-classifier approach performs similarly to the full network, and so its potential for increased flexibility may be beneficial for future consideration. With all modifications, overall accuracy on the ICDAR 2017 validation dataset increases from 89.7% to 93.6%. © 2019 IEEE.
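
The pooling change described in this abstract, adding variance pooling next to global average pooling, can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' network code: the feature map is a plain list of channels, each flattened over spatial positions, the population variance is used, and the function name `mean_variance_pool` is hypothetical.

```python
def mean_variance_pool(feature_map):
    """Pool each channel to its mean and variance, then concatenate them.

    feature_map: list of channels, each a flat list of activations
    over all spatial positions (assumed non-empty).
    Returns [mean_1, ..., mean_C, var_1, ..., var_C].
    """
    means, variances = [], []
    for channel in feature_map:
        n = len(channel)
        mu = sum(channel) / n
        # Population variance (an assumption; the source does not specify).
        var = sum((v - mu) ** 2 for v in channel) / n
        means.append(mu)
        variances.append(var)
    return means + variances


if __name__ == "__main__":
    fmap = [[1.0, 2.0, 3.0, 4.0], [2.0, 2.0, 2.0, 2.0]]
    print(mean_variance_pool(fmap))
```

The pooled vector is twice as long as with average pooling alone, so the layer that consumes it needs a correspondingly wider input.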

Keywords: Image enhancement

How phonotactics affect multilingual and zero-shot ASR performance

arXiv, 2020

Authors: Feng, Siyuan; Zelasko, Piotr; Moro-Velázquez, Laureano; Abavisani, Ali; Hasegawa-Johnson, Mark; Scharenborg, Odette; Dehak, Najim. Affiliations: Multimedia Computing Group, Delft University of Technology, Delft, Netherlands; Center for Language and Speech Processing, United States; Human Language Technology Center of Excellence, Johns Hopkins University, Baltimore, MD, United States; Department of Electrical and Computer Engineering, University of Illinois at Urbana-Champaign, IL, United States

The idea of combining multiple languages’ recordings to train a single automatic speech recognition (ASR) model brings the promise of the emergence of a universal speech representation. Recently, a Transformer encoder-decoder model has been shown to leverage multilingual data well in IPA transcriptions of languages presented during training. However, the representations it learned were not successful in zero-shot transfer to unseen languages. Because that model lacks an explicit factorization of the acoustic model (AM) and language model (LM), it is unclear to what degree the performance suffered from differences in pronunciation or from the mismatch in phonotactics. To gain more insight into the factors limiting zero-shot ASR transfer, we replace the encoder-decoder with a hybrid ASR system consisting of a separate AM and LM. Then, we perform an extensive evaluation of monolingual, multilingual, and crosslingual (zero-shot) acoustic and language models on a set of 13 phonetically diverse languages. We show that the gain from modeling crosslingual phonotactics is limited, and that imposing too strong a model can hurt the zero-shot transfer. Furthermore, we find that a multilingual LM hurts a multilingual ASR system’s performance, and retaining only the target language’s phonotactic data in LM training is preferable. Copyright © 2020, The Authors. All rights reserved.

Keywords: Speech recognition