We perform text normalization, i.e. the transformation of words from the written to the spoken form, using a memory augmented neural network. With the addition of dynamic memory access and storage mechanisms, we present a neural architecture that can serve as a language-agnostic text normalization system while avoiding the kinds of unacceptable errors made by LSTM-based recurrent neural networks. By successfully reducing the frequency of such mistakes, we show that this novel architecture is indeed a better alternative. Our proposed system requires significantly less data, training time and compute resources. Additionally, we perform data up-sampling, circumventing the data sparsity problem in some semiotic classes, to show that sufficient examples in any particular class can improve the performance of our text normalization system. Although a few occurrences of these errors still remain in certain semiotic classes, we demonstrate that memory augmented networks with meta-learning capabilities can open many doors to a superior text normalization system.
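As a rough illustration of the dynamic memory access and storage idea (not the authors' architecture; the function names, shapes, and update rule below are assumptions), the following sketch shows the content-addressed read and erase-then-add write operations that memory-augmented networks build on:

```python
# Minimal sketch of external-memory access: attention-weighted read and
# erase-then-add write against a memory matrix. Illustrative only.
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def read(memory, key):
    """Attention-weighted read: similarity of the key to each memory slot."""
    scores = memory @ key / (np.linalg.norm(memory, axis=1) * np.linalg.norm(key) + 1e-8)
    weights = softmax(scores)              # one weight per memory slot
    return weights @ memory, weights       # read vector, read weights

def write(memory, weights, erase, add):
    """Erase-then-add write, gated by the same attention weights."""
    memory = memory * (1 - np.outer(weights, erase))
    return memory + np.outer(weights, add)

# toy usage: 8 slots of width 16
M = np.zeros((8, 16))
w_key = np.random.randn(16)
M = write(M, softmax(np.random.randn(8)), erase=np.ones(16) * 0.5, add=w_key)
r, _ = read(M, w_key)
```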
ISBN (print): 9791095546009
We describe the automated multi-language text normalization infrastructure that prepares textual data to train language models used in Google's keyboards and speech recognition systems, across hundreds of language varieties. Training corpora are sourced from various types of data sets, and the text is then normalized using a sequence of hand-written grammars and learned models. These systems need to scale to hundreds or thousands of language varieties in order to meet product needs. Frequent data refreshes, privacy considerations and simultaneous updates across such a high number of languages make manual inspection of the normalized training data infeasible, while there is ample opportunity for data normalization issues to arise. By tracking metrics about the data and how it was processed, we are able to catch internal data preparation issues and external data corruption issues that can be hard to notice using standard extrinsic evaluation methods. These metrics have highlighted issues in Google's real-world speech recognition system that caused significant but latent quality degradation, underscoring the importance of paying attention to data normalization behavior in large-scale pipelines.
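The kind of intrinsic tracking described here can be approximated with a few corpus-level statistics. The sketch below is a minimal, hypothetical example (not Google's infrastructure; the function names and the particular metrics are assumptions) that summarizes a normalized corpus and flags large shifts between data refreshes:

```python
# Summarize a normalized corpus and flag large metric shifts against the
# previous refresh, which often indicate pipeline or data-corruption issues
# before they show up in extrinsic evaluation. Illustrative only.
from collections import Counter

def corpus_metrics(lines, alphabet):
    tokens = [t for line in lines for t in line.split()]
    chars = Counter(c for line in lines for c in line)
    total_chars = sum(chars.values()) or 1
    return {
        "num_lines": len(lines),
        "avg_tokens_per_line": len(tokens) / max(len(lines), 1),
        "empty_line_rate": sum(1 for l in lines if not l.strip()) / max(len(lines), 1),
        "out_of_alphabet_char_rate": sum(n for c, n in chars.items() if c not in alphabet) / total_chars,
    }

def flag_regressions(prev, curr, rel_tol=0.2):
    """Return the metrics whose value moved by more than rel_tol relative to prev."""
    return [k for k in prev
            if prev[k] and abs(curr[k] - prev[k]) / abs(prev[k]) > rel_tol]
```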
ISBN (print): 9781665442886
Text-to-Speech (TTS) is a technology that is currently widely used for many purposes, both academic/non-commercial and industry/commercial. In several cases, researchers in the TTS field add a text normalization step that normalizes the text used as TTS input in order to enhance TTS performance. In this paper, we present a rule-based approach to building an Indonesian text normalization dataset that pairs raw text with its spoken form, for enhancing Indonesian Text-to-Speech (TTS) performance. We construct a set of rules for normalizing Indonesian text as input for the TTS system. Using those rules, we generated a dataset and corrected it manually, so that we have a gold standard for text normalization of Indonesian TTS input. Our results show that a rule-based approach can give good performance for normalizing text for Indonesian TTS, with a Word Error Rate (WER) of 0.0805.
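A minimal sketch of what such rule-based normalization can look like is given below; the rules, the digit-by-digit reading, and the example sentence are illustrative assumptions, not the authors' rule set, which would also cover full cardinal numbers, dates, times, currency, and abbreviations:

```python
# Regex rules that rewrite written symbols into an Indonesian spoken form.
import re

DIGITS = {"0": "nol", "1": "satu", "2": "dua", "3": "tiga", "4": "empat",
          "5": "lima", "6": "enam", "7": "tujuh", "8": "delapan", "9": "sembilan"}

def expand_digits(match):
    # read each digit out individually, e.g. "2021" -> "dua nol dua satu"
    return " ".join(DIGITS[d] for d in match.group(0))

RULES = [
    (re.compile(r"\d+"), expand_digits),
    (re.compile(r"\s*%"), lambda m: " persen"),   # "50%" -> "... persen"
]

def normalize(text):
    for pattern, repl in RULES:
        text = pattern.sub(repl, text)
    return text

print(normalize("Diskon 50% sejak 2021"))
# -> "Diskon lima nol persen sejak dua nol dua satu"
```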
ISBN (print): 9781479983490
This paper addresses the problem of text normalization, an often overlooked problem in natural language processing, in code-mixed social media text. The objective of the work presented here is to correct English spelling errors in code-mixed social media text that contains English words as well as Romanized transliterations of words from another language, in this case Bangla. The targeted research problem also entails solving another problem, that of word-level language identification in code-mixed social media text. We employ a CRF-based machine learning approach followed by post-processing heuristics for the word-level language identification task. For spelling correction, we use the noisy channel model. In addition, the spell checker model presented here tackles wordplay, contracted words and phonetic variations. Overall, the word-level language identification achieved 90.5% accuracy and the spell checker achieved 69.43% accuracy on the detected English words.
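The noisy channel step can be sketched as scoring candidate corrections by a language-model prior and a channel term; the toy implementation below approximates the channel with edit distance and is only an illustration of the general model (the counts, penalty weight, and example are assumptions), not the paper's exact system:

```python
# Noisy-channel spelling correction sketch: argmax over in-vocabulary candidates
# of a smoothed unigram prior plus a crude edit-distance channel penalty.
from math import log

def edit_distance(a, b):
    d = [[i + j if i * j == 0 else 0 for j in range(len(b) + 1)] for i in range(len(a) + 1)]
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            d[i][j] = min(d[i-1][j] + 1, d[i][j-1] + 1,
                          d[i-1][j-1] + (a[i-1] != b[j-1]))
    return d[len(a)][len(b)]

def correct(observed, unigram_counts, max_dist=2):
    total = sum(unigram_counts.values())
    best, best_score = observed, float("-inf")
    for w, n in unigram_counts.items():
        dist = edit_distance(observed, w)
        if dist <= max_dist:
            score = log(n / total) - 2.0 * dist   # prior + channel penalty
            if score > best_score:
                best, best_score = w, score
    return best

print(correct("frend", {"friend": 120, "fiend": 3, "find": 200}))  # -> "friend"
```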
ISBN (print): 9781728101644
Twitter often contains many noisy short messages. The noise arises from insertion, transformation, transliteration and onomatopoeia, and text normalization is used to resolve such noisy text. In this paper, we present an algorithm that normalizes insertion and homophonic-transformation words by converting them to the International Phonetic Alphabet (IPA) and finding, for each out-of-vocabulary IPA string, the most similar in-vocabulary IPA string using Levenshtein distance. We used a Twitter corpus containing 2,000 Twitter messages to evaluate the proposed algorithm. The experimental results show that the proposed algorithm achieved an accuracy of 79.03%, compared with 24.19% for the dictionary-based normalization of LextoPlus.
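The lookup step can be sketched as a nearest-neighbour search under Levenshtein distance; the toy lexicon below is an assumption standing in for the paper's Thai resources, and the out-of-vocabulary token's IPA is assumed to come from an upstream grapheme-to-phoneme step:

```python
# Map an out-of-vocabulary token's IPA to the in-vocabulary word whose IPA
# is closest under Levenshtein distance. Toy data, illustrative only.
def levenshtein(a, b):
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]

# hypothetical in-vocabulary lexicon: spoken-form words and their IPA strings
ipa_lexicon = {"mak": "maːk", "chai": "tɕʰaj", "ruu": "ruː"}

def normalize_oov(oov_ipa):
    return min(ipa_lexicon, key=lambda w: levenshtein(oov_ipa, ipa_lexicon[w]))

# e.g. an elongated token whose IPA came out as "maːkkk" maps back to "mak"
print(normalize_oov("maːkkk"))  # -> "mak"
```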
Background: Named entity recognition (NER) is a task of detecting named entities in documents and categorizing them into predefined classes, such as person, location, and organization. This paper focuses on tweets poste...
ISBN (print): 9781424414833
Text normalization is an important component of text-to-speech systems, and the main difficulty in text normalization is disambiguating Non-Standard Words (NSWs). This paper develops a taxonomy of NSWs on the basis of a large-scale Chinese corpus and proposes a two-stage NSW disambiguation strategy: Finite State Automata (FSA) for initial classification and Maximum Entropy (ME) classifiers for subclass disambiguation. Based on the above NSW taxonomy, the two-stage approach achieves an F-score of 98.53% in the open test, 5.23% higher than that of the FSA-based approach. Experiments show that the NSW taxonomy gives the FSA a high baseline performance, the ME classifiers provide considerable improvement, and the two-stage approach adapts well to new domains.
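The two-stage idea can be sketched with regexes standing in for the FSA and a logistic-regression classifier (equivalent to a maximum-entropy model) for subclass disambiguation; the classes, features and toy training data below are assumptions made for illustration, not the paper's Chinese taxonomy:

```python
# Stage 1: regex "FSA" assigns a coarse NSW class. Stage 2: a maximum-entropy
# style classifier disambiguates an ambiguous subclass from context features.
import re
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression

COARSE = [("DATE", re.compile(r"^\d{4}-\d{2}-\d{2}$")),
          ("PERCENT", re.compile(r"^\d+(\.\d+)?%$")),
          ("NUMBER", re.compile(r"^\d+$"))]

def coarse_class(token):
    return next((name for name, pat in COARSE if pat.match(token)), "OTHER")

# toy training data for the ambiguous NUMBER subclass: year vs. cardinal
train = [({"len": 4, "prev": "in"}, "YEAR"), ({"len": 4, "prev": "year"}, "YEAR"),
         ({"len": 3, "prev": "costs"}, "CARDINAL"), ({"len": 2, "prev": "bought"}, "CARDINAL")]
vec = DictVectorizer()
X = vec.fit_transform([f for f, _ in train])
clf = LogisticRegression(max_iter=1000).fit(X, [y for _, y in train])

token, prev_word = "1998", "in"
if coarse_class(token) == "NUMBER":
    print(clf.predict(vec.transform([{"len": len(token), "prev": prev_word}]))[0])  # likely "YEAR"
```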
ISBN (print): 0780374029
Collecting sufficient language model training data for good speech recognition performance in a new domain is often difficult. However, there may be other sources of data that are matched in terms of topic or style, if not both. This paper looks at the use of text normalization tools to make these data more suitable for language model training, in conjunction with mixture models to combine data from different sources. We specifically address the task of recognizing meeting speech, showing a small reduction in word error rate over a baseline language model trained from conversational speech data.
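The mixture-model side of this can be sketched as linear interpolation of probabilities from an in-domain and an out-of-domain model, with the weight tuned on held-out text; the unigram simplification below is an assumption made for brevity, whereas real systems interpolate full n-gram models:

```python
# Linearly interpolate two smoothed unigram models and pick the mixture weight
# that minimizes perplexity on held-out in-domain text. Illustrative only.
from math import log2

def prob(counts, w, vocab_size):
    # add-one smoothed unigram probability
    return (counts.get(w, 0) + 1) / (sum(counts.values()) + vocab_size)

def perplexity(heldout, in_counts, out_counts, lam, vocab_size):
    ll = sum(log2(lam * prob(in_counts, w, vocab_size) +
                  (1 - lam) * prob(out_counts, w, vocab_size)) for w in heldout)
    return 2 ** (-ll / len(heldout))

def best_lambda(heldout, in_counts, out_counts, vocab_size):
    grid = [i / 10 for i in range(1, 10)]
    return min(grid, key=lambda lam: perplexity(heldout, in_counts, out_counts, lam, vocab_size))
```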
ISBN (print): 9781617821233
In this paper, we describe and compare systems for text normalization based on statistical machine translation (SMT) methods which are constructed with the support of internet users. Internet users normalize text displayed in a web interface, thereby providing a parallel corpus of normalized and non-normalized text. With this corpus, SMT models are generated to translate non-normalized into normalized text. To build traditional language-specific text normalization systems, knowledge of linguistics as well as established computer skills to implement text normalization rules are required. Our systems are built without profound computer knowledge, thanks to the simple, self-explanatory user interface and the automatic generation of the SMT models. Additionally, no in-house knowledge of the language to normalize is required, due to the multilingual expertise of the internet community. All techniques are applied to French texts crawled with our Rapid Language Adaptation Toolkit [1] and compared using Levenshtein edit distance [2], BLEU score [3], and perplexity.
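As a heavily simplified stand-in for the SMT pipeline (no real translation or language model; the function names and the French SMS abbreviations in the example are illustrative assumptions), the sketch below learns a one-word substitution table from a crowd-provided parallel corpus and applies it:

```python
# Learn the most frequent normalized form for each non-normalized token
# (a one-word "phrase table") from aligned sentence pairs, then apply it.
from collections import Counter, defaultdict

def learn_table(parallel_pairs):
    table = defaultdict(Counter)
    for noisy, clean in parallel_pairs:
        noisy_toks, clean_toks = noisy.split(), clean.split()
        if len(noisy_toks) == len(clean_toks):       # keep only monotone 1:1 pairs
            for n, c in zip(noisy_toks, clean_toks):
                table[n][c] += 1
    return {n: counts.most_common(1)[0][0] for n, counts in table.items()}

def normalize(sentence, table):
    return " ".join(table.get(t, t) for t in sentence.split())

pairs = [("slt bcp de retard", "salut beaucoup de retard"),
         ("bcp de monde", "beaucoup de monde")]
table = learn_table(pairs)
print(normalize("slt bcp", table))  # -> "salut beaucoup"
```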
This paper presents the task of normalizing Vietnamese transcribed texts in Speech-to-Text (STT) systems. The main purpose is to develop a text normalizer that automatically converts proper nouns and other context-specific formatting of the transcription, such as dates, times, and numbers, into their appropriate expressions. To this end, we propose a solution that exploits deep neural networks with rich features, followed by manually designed rules, to recognize and then convert these text sequences. We also introduce a new corpus of 13K spoken sentences to facilitate the text normalization process. The experimental results on this corpus are quite promising: the proposed method yields an F1 score of 90.67% in recognizing sequences of text that need converting. We hope that this initial work will inspire follow-up research on this important but underexplored problem.
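The recognize-then-convert pipeline can be sketched as below; a regex stands in for the neural recognizer, only digit-by-digit numbers are handled, and the Vietnamese vocabulary and example are illustrative assumptions rather than the paper's rules:

```python
# Recognize spans of spoken digit words, then convert them to written digits.
import re

SPOKEN_DIGITS = {"không": "0", "một": "1", "hai": "2", "ba": "3", "bốn": "4",
                 "năm": "5", "sáu": "6", "bảy": "7", "tám": "8", "chín": "9"}

ALT = "|".join(SPOKEN_DIGITS)
# a run of two or more spoken digit words, standing in for the neural recognizer
digit_seq = re.compile(rf"\b(?:{ALT})(?:\s+(?:{ALT}))+\b")

def convert(match):
    # conversion rule: rewrite the recognized span as a digit string
    return "".join(SPOKEN_DIGITS[w] for w in match.group(0).split())

def normalize(transcript):
    return digit_seq.sub(convert, transcript)

print(normalize("gọi số không chín tám ba"))  # -> "gọi số 0983"
```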