ISBN (print): 9781509066315
This paper investigates state-of-the-art Transformer- and FastSpeech-based high-fidelity neural text-to-speech (TTS) with full-context label input for pitch accent languages. The aim is to realize faster training than conventional Tacotron-based models. Introducing phoneme durations into Tacotron-based TTS models improves both synthesis quality and stability. Therefore, a Transformer-based acoustic model with weighted forced attention obtained from phoneme durations is proposed to improve synthesis accuracy and stability, where both encoder-decoder attention and forced attention are used with a weighting factor. Furthermore, FastSpeech without a duration predictor, in which the phoneme durations are predicted by another conventional model, is also investigated. The results of experiments using a Japanese female corpus and the WaveGlow vocoder indicate that the proposed Transformer using forced attention with a weighting factor of 0.5 outperforms other models, and removing the duration predictor from FastSpeech improves synthesis quality, although the proposed weighted forced attention does not improve synthesis stability.
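The weighted forced attention described above can be pictured as a convex combination of a hard, duration-derived alignment and the learned soft attention. The following sketch is a minimal illustration of that idea in a PyTorch setting; the function and tensor shapes are our own assumptions, not the authors' code.

```python
import torch

def weighted_forced_attention(soft_attn, durations, weight=0.5):
    """Blend learned encoder-decoder attention with a hard alignment
    built from phoneme durations (illustrative sketch, not the paper's
    implementation). soft_attn: (T_dec, T_enc); durations: frame counts
    per phoneme, summing to T_dec."""
    forced = torch.zeros_like(soft_attn)
    frame = 0
    for phoneme_idx, dur in enumerate(durations):
        forced[frame:frame + dur, phoneme_idx] = 1.0  # one-hot alignment
        frame += dur
    # weight = 0.5 is the best-performing setting reported in the paper
    return weight * forced + (1.0 - weight) * soft_attn
```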
ISBN (print): 9781509066315
Sequence-to-sequence text-to-speech (TTS) is dominated by soft-attention-based methods. Recently, hard-attention-based methods have been proposed to prevent fatal alignment errors, but their methods for sampling discrete alignments are poorly investigated. This research investigates various combinations of sampling methods and probability distributions for alignment transition modeling in a hard-alignment-based sequence-to-sequence TTS method called SSNT-TTS. We formalize the common sampling methods for discrete variables, including greedy search, beam search, and random sampling from a Bernoulli distribution, in a more general way. Furthermore, we introduce the binary Concrete distribution to model discrete variables more properly. The results of a listening test show that deterministic search is preferable to stochastic search, and that the binary Concrete distribution is robust under stochastic search for natural alignment transitions.
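As background for the binary Concrete distribution mentioned here: it is a temperature-controlled relaxation of a Bernoulli variable, sampled by adding logistic noise to the logits before a sigmoid. A minimal sketch follows (our own, with an illustrative temperature; the SSNT-TTS transition model itself differs):

```python
import torch

def sample_binary_concrete(logits, temperature=0.5):
    """Relaxed Bernoulli sample from the binary Concrete distribution:
    sigmoid((logits + logistic noise) / temperature). The temperature
    value here is illustrative, not taken from the paper."""
    u = torch.rand_like(logits).clamp(1e-6, 1 - 1e-6)
    logistic_noise = torch.log(u) - torch.log1p(-u)
    return torch.sigmoid((logits + logistic_noise) / temperature)
```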
ISBN (digital): 9783030557898
ISBN (print): 9783030557881; 9783030557898
An automatic question generation (QG) system aims to produce questions from a text, such as a sentence or a paragraph. Traditional approaches are mainly based on heuristic, hand-crafted rules that transduce a declarative sentence into a related interrogative sentence. However, creating such a set of rules requires deep linguistic knowledge, and most of these rules are language-specific. Although a data-driven approach reduces the participation of linguistic experts, obtaining sufficient labeled data for QG model training is still a difficult task. In this paper, we apply a neural sequence-to-sequence pointer-generator network with various transfer learning strategies to capture the underlying patterns of question formation on a target domain with scarce training pairs. Our experiments demonstrate the viability of domain adaptation in the QG task. We also show that transfer learning can be helpful in a semi-supervised approach when the target QG dataset does not offer enough training pairs.
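For readers unfamiliar with the pointer-generator network used here: its final output distribution mixes a generation distribution over the vocabulary with a copy distribution scattered from the source attention. Below is a minimal sketch of that mixing step; the shapes and names are our own assumptions, not the paper's code.

```python
import torch

def pointer_generator_distribution(p_gen, vocab_probs, attn_weights,
                                   src_ids, vocab_size):
    """Final word distribution of a pointer-generator network.
    p_gen: (batch, 1) generation probability; vocab_probs:
    (batch, vocab_size); attn_weights: (batch, src_len);
    src_ids: (batch, src_len) int64 source token ids."""
    generation = p_gen * vocab_probs
    copy = torch.zeros(src_ids.size(0), vocab_size)
    # add copy probability mass onto the source tokens' vocabulary slots
    copy.scatter_add_(1, src_ids, (1.0 - p_gen) * attn_weights)
    return generation + copy
```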
ISBN (print): 9781450379885
In pop music, accompaniments are usually played by multiple instruments (tracks) such as drums, bass, strings, and guitar, and can make a song more expressive and catchy when arranged together with its melody. Previous works usually generate the tracks separately, so the notes from different tracks do not explicitly depend on each other, which hurts harmony modeling. To improve harmony, in this paper we propose a novel MUlti-track MIDI representation (MuMIDI), which enables simultaneous multi-track generation in a single sequence and explicitly models the dependency of notes across tracks. While this greatly improves harmony, it unfortunately enlarges the sequence length and brings the new challenge of long-term music modeling. We further introduce two techniques to address this challenge: 1) we model the multiple attributes of a musical note (e.g., pitch, duration, velocity) in one step instead of multiple steps, which shortens the MuMIDI sequence; 2) we introduce extra long-range context as memory to capture long-term dependency in music. We call our system for pop music accompaniment generation PopMAG. We evaluate PopMAG on multiple datasets (LMD, FreeMidi, and CPMD, a private dataset of Chinese pop songs) with both subjective and objective metrics. The results demonstrate the effectiveness of PopMAG for multi-track harmony modeling and long-term context modeling. Specifically, PopMAG wins 42%/38%/40% of votes when compared with ground-truth musical pieces on the LMD, FreeMidi, and CPMD datasets respectively, and largely outperforms other state-of-the-art music accompaniment generation models and multi-track MIDI representations in terms of subjective and objective metrics.
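To make the first technique concrete: packing all attributes of a note into a single sequence step can be done by summing one embedding per attribute. The sketch below is our own illustration of that idea, with assumed vocabulary sizes and summation scheme, not the released PopMAG code.

```python
import torch.nn as nn

class NoteStepEmbedding(nn.Module):
    """Embed one note (pitch, duration, velocity, track) as a single
    sequence step by summing attribute embeddings, in the spirit of
    MuMIDI's one-step multi-attribute modeling. Vocabulary sizes are
    illustrative assumptions."""
    def __init__(self, dim=256):
        super().__init__()
        self.pitch = nn.Embedding(128, dim)    # MIDI pitch 0-127
        self.duration = nn.Embedding(64, dim)  # quantized duration bins
        self.velocity = nn.Embedding(32, dim)  # quantized velocity bins
        self.track = nn.Embedding(6, dim)      # drum/bass/string/guitar/...

    def forward(self, pitch, duration, velocity, track):
        # one sequence position carries all attributes of the note
        return (self.pitch(pitch) + self.duration(duration)
                + self.velocity(velocity) + self.track(track))
```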
ISBN (print): 9781450380164
With the exponential growth of information on the internet, users rely on search engines to find the documents they need. However, user queries are often short. The inherent ambiguity of short queries poses great challenges for search engines trying to understand user intent. Query suggestion is one key technique by which search engines augment user queries so that they can better understand user intent. In the past, query suggestion has relied on either term-frequency-based methods with little semantic understanding of the query, or word-embedding-based methods with little personalization effort. Here, we present a sequence-to-sequence-model-based query suggestion framework that can naturally model both structured, personalized features and unstructured query text. This capability opens up the opportunity to better understand query semantics and user intent at the same time. As the largest professional network, LinkedIn has the advantage of a rich amount of accurate member profile information with which to personalize query suggestions. We deployed this framework in LinkedIn production traffic and showed that personalized query suggestions significantly improved the member search experience, as measured by key business metrics at LinkedIn.
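One plausible way to realize the described fusion of structured profile features with unstructured query text is to fold a projected profile vector into the encoder state handed to the decoder. The sketch below is entirely our own guess at such an architecture; the paper does not publish this code, and every name and dimension is hypothetical.

```python
import torch.nn as nn

class PersonalizedQueryEncoder(nn.Module):
    """Hypothetical encoder fusing structured member-profile features
    with unstructured query text for a seq2seq query suggester."""
    def __init__(self, vocab_size=30000, dim=128, profile_dim=64):
        super().__init__()
        self.token_emb = nn.Embedding(vocab_size, dim)
        self.text_enc = nn.LSTM(dim, dim, batch_first=True)
        self.profile_proj = nn.Linear(profile_dim, dim)

    def forward(self, query_ids, profile_feats):
        # encode the raw query text
        outputs, (h, c) = self.text_enc(self.token_emb(query_ids))
        # mix structured profile features into the initial decoder state
        h = h + self.profile_proj(profile_feats).unsqueeze(0)
        return outputs, (h, c)
```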
ISBN (print): 9781450370233
Electronic medical records (EMR) contain comprehensive patient information and are typically stored in a relational database with multiple tables. Effective and efficient patient information retrieval from EMR data is a challenging task for medical experts. Question-to-SQL generation methods tackle this problem by first predicting the SQL query for a given question about a database and then executing the query on the database. However, most existing approaches have not been adapted to the healthcare domain due to a lack of healthcare Question-to-SQL datasets for learning models specific to this domain. In addition, the wide use of abbreviated terminology and possible typos in questions introduce additional challenges for accurately generating the corresponding SQL queries. In this paper, we tackle these challenges by developing a deep-learning-based TRanslate-Edit model for Question-to-SQL (TREQS) generation, which adapts the widely used sequence-to-sequence model to directly generate the SQL query for a given question, and further performs the required edits using an attentive-copying mechanism and task-specific look-up tables. Based on a widely used, publicly available electronic medical database, we create a new large-scale Question-SQL pair dataset, named MIMICSQL, in order to perform the Question-to-SQL generation task in the healthcare domain. An extensive set of experiments is conducted to evaluate the performance of our proposed model on MIMICSQL. Both quantitative and qualitative experimental results indicate the flexibility and efficiency of our proposed method in predicting condition values and its robustness to random questions with abbreviations and typos.
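The look-up-table edit step lends itself to a simple illustration: after the model drafts a SQL query, each predicted condition value can be snapped to its closest entry in a table built from the database, which absorbs abbreviations and typos. This is a minimal sketch of that idea using fuzzy string matching with an illustrative threshold; TREQS's actual editing mechanism is more involved.

```python
import difflib

def edit_condition_value(predicted_value, lookup_values):
    """Replace a generated condition value (possibly abbreviated or
    containing typos) with its closest match in a task-specific
    look-up table. The 0.6 cutoff is an assumption."""
    matches = difflib.get_close_matches(
        predicted_value.lower(),
        [v.lower() for v in lookup_values],
        n=1, cutoff=0.6)
    if matches:
        # return the original-cased entry from the look-up table
        for v in lookup_values:
            if v.lower() == matches[0]:
                return v
    return predicted_value  # keep the prediction if nothing is close
```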
ISBN (print): 9781509066315
Labanotation is an important notation system for recording dances. Automatically generating Labanotation scores from motion capture data has attracted increasing interest in recent years. Current methods usually focus on individual movement segments and generate Labanotation symbols one by one, which requires segmenting the captured data sequence in advance. Manual segmentation consumes a lot of time and effort, while automatic segmentation may not be reliable enough. In this paper, we propose a sequence-to-sequence approach that can generate Labanotation scores from unsegmented motion data sequences. First, we extract effective features from motion capture data based on body skeleton analysis. Then, we train a neural network under the encoder-decoder architecture to transform the motion feature sequences into the corresponding Labanotation symbols; the dance score is thus generated. Experiments show that the proposed method performs favorably against state-of-the-art algorithms on the automatic Labanotation generation task.
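The overall pipeline is the classic encoder-decoder recipe applied to motion frames. As a rough sketch of the kind of model the abstract describes (our own simplification; the feature dimension, symbol vocabulary size, and choice of GRUs are all assumptions):

```python
import torch.nn as nn

class MotionToLaban(nn.Module):
    """Minimal encoder-decoder sketch mapping an unsegmented motion
    feature sequence to a sequence of Labanotation symbol ids."""
    def __init__(self, feat_dim=66, hidden=256, n_symbols=100):
        super().__init__()
        self.encoder = nn.GRU(feat_dim, hidden, batch_first=True)
        self.symbol_emb = nn.Embedding(n_symbols, hidden)
        self.decoder = nn.GRU(hidden, hidden, batch_first=True)
        self.out = nn.Linear(hidden, n_symbols)

    def forward(self, motion_feats, symbol_ids):
        # encode the whole motion sequence; no pre-segmentation needed
        _, h = self.encoder(motion_feats)
        dec_out, _ = self.decoder(self.symbol_emb(symbol_ids), h)
        return self.out(dec_out)  # per-step logits over symbols
```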
ISBN (print): 9781713820697
Auto-regressive sequence-to-sequence models with attention mechanisms have achieved state-of-the-art performance in various tasks, including speech synthesis. Training these models can be difficult. The standard approach guides a model with the reference output history during training; however, during synthesis the generated output history must be used. This mismatch can impact performance. Several approaches have been proposed to handle this, normally by selectively using the generated output history. To make training stable, these approaches often require a heuristic schedule or an auxiliary classifier. This paper introduces attention forcing, which guides the model with the generated output history and the reference attention. This approach reduces the training-evaluation mismatch without the need for a schedule or a classifier. Additionally, for standard training approaches, the frame rate is often reduced to prevent models from copying the output history. As attention forcing does not feed the reference output history to the model, it allows a higher frame rate, which improves speech quality. Finally, attention forcing allows the model to generate output sequences aligned with the references, which is important for some downstream tasks such as training neural vocoders. Experiments show that attention forcing allows doubling the frame rate and yields a significant gain in speech quality.
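The core of attention forcing is easy to state in pseudocode: at each step the decoder consumes its own previous output (as at synthesis time) but is aligned by attention taken from a reference pass. The loop below is a minimal sketch under an assumed decoder interface; nothing here is the authors' implementation.

```python
import torch

def attention_forcing_pass(decoder, reference_attention, steps):
    """Generate a sequence with attention forcing. `decoder` is a
    hypothetical callable returning (frame, state) and exposing
    frame_dim; `reference_attention[t]` is the attention from a
    teacher-forced reference pass at step t."""
    outputs, state = [], None
    prev = torch.zeros(1, decoder.frame_dim)  # initial "go" frame
    for t in range(steps):
        # align with the reference attention, not the model's own
        frame, state = decoder(prev, state,
                               attention=reference_attention[t])
        outputs.append(frame)
        prev = frame  # feed back the generated history, not the reference
    return torch.stack(outputs)
```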
ISBN (print): 9781450367684
Machine Learning models from other fields, like Computational Linguistics, have been transplanted to Software Engineering tasks, often quite successfully. Yet a transplanted model's initial success at a given task does not necessarily mean it is well-suited for the task. In this work, we examine a common example of this phenomenon: the conceit that "software patching is like language translation". We demonstrate empirically that there are subtle but critical distinctions between sequence-to-sequence models and translation models: while program repair benefits greatly from the former's general modeling architecture, it actually suffers from design decisions built into the latter, both in terms of translation accuracy and diversity. Given these findings, we demonstrate how a more principled approach to model design, based on our empirical findings and general knowledge of software development, can lead to better solutions. Our findings also lend strong support to the recent trend towards synthesizing edits of code conditional on the buggy context to repair bugs. We implement such models ourselves as "proof-of-concept" tools and empirically confirm that they behave in a fundamentally different, more effective way than the studied translation-based architectures. Overall, our results demonstrate the merit of studying the intricacies of machine-learned models in software engineering: not only can this help elucidate potential issues that may be overshadowed by increases in accuracy; it can also help innovate on these models to raise the state-of-the-art further. We will publicly release our replication data and materials at https://***/ARiSE-Lab/Patch-as-translation.