Background Human-machine dialog generation is an essential topic of research in the field of natural language processing. Generating high-quality, diverse, fluent, and emotional conversation is a challenging task. Based on continuing advancements in artificial intelligence and deep learning, new methods have come to the forefront in recent years. In particular, the end-to-end neural network model provides an extensible conversation generation framework that has the potential to enable machines to understand semantics and automatically generate responses. However, neural network models come with their own set of questions and challenges. The basic conversational model framework tends to produce universal, meaningless, and relatively "safe" responses. Methods Based on generative adversarial networks (GANs), a new emotional dialog generation framework called EMC-GAN is proposed in this study to address the task of emotional dialog generation. The proposed model comprises a generative model and three discriminative models. The generator is based on the basic sequence-to-sequence (Seq2Seq) dialog generation model, and the aggregate discriminative model for the overall framework consists of a basic discriminative model, an emotion discriminative model, and a fluency discriminative model. The basic discriminative model distinguishes generated fake sentences from real sentences in the training data. The emotion discriminative model evaluates whether the emotion conveyed via the generated dialog agrees with a pre-specified emotion, and directs the generative model to generate dialogs that correspond to the category of the pre-specified emotion. Finally, the fluency discriminative model assigns a score to the fluency of the generated dialog and guides the generator to produce more fluent sentences. Results Based on the experimental results, this study confirms the superiority of the proposed model over similar existing models with respect to emotional accuracy, fluency, and diversity. Conclusions The proposed EMC-GAN …
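As a rough illustration of how the three discriminators' outputs might be aggregated into a single reward signal for the generator (the weighting scheme and weights below are our assumption for illustration, not the paper's formula):

```python
# Illustrative sketch, NOT the paper's code: combine the basic, emotion,
# and fluency discriminator scores into one generator reward.
# The weights w_basic, w_emotion, w_fluency are hypothetical.

def combined_reward(p_real, p_emotion_match, fluency_score,
                    w_basic=0.4, w_emotion=0.3, w_fluency=0.3):
    """Weighted aggregate of the three discriminators' outputs.

    p_real          -- basic discriminator: probability the dialog is real
    p_emotion_match -- emotion discriminator: probability the dialog matches
                       the pre-specified emotion category
    fluency_score   -- fluency discriminator: fluency score in [0, 1]
    """
    return (w_basic * p_real
            + w_emotion * p_emotion_match
            + w_fluency * fluency_score)

# A dialog judged fully real, on-emotion, and fluent gets the maximum reward.
print(combined_reward(1.0, 1.0, 1.0))
```

In a GAN setup such a scalar would typically drive a policy-gradient update of the Seq2Seq generator; the exact update rule is outside this sketch.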
In last-mile delivery, drivers frequently deviate from planned delivery routes because of their tacit knowledge of the road and curbside infrastructure, customer availability, and other characteristics of the respective service areas. Hence, the actual stop sequences chosen by an experienced human driver may be preferable to the theoretical shortest-distance routing under real-life operational conditions. Thus, being able to predict the actual stop sequence that a human driver would follow can help to improve route planning in last-mile delivery. This paper proposes a pair-wise attention-based pointer neural network for this prediction task using drivers' historical delivery trajectory data. In addition to the commonly used encoder-decoder architecture for sequence-to-sequence prediction, we propose a new attention mechanism based on an alternative specific neural network to capture the local pair-wise information for each pair of stops. To further capture the global efficiency of the route, we propose a new iterative sequence generation algorithm that is used after model training to identify the first stop of a route that yields the lowest operational cost. Results from an extensive case study on real operational data from Amazon's last-mile delivery operations in the US show that our proposed method can significantly outperform traditional optimization-based approaches and other machine learning methods (such as the Long Short-Term Memory encoder-decoder and the original pointer network) in finding stop sequences that are closer to high-quality routes executed by experienced drivers in the field. Compared to benchmark models, the proposed model can increase the average prediction accuracy of the first four stops from around 0.229 to 0.312, and reduce the disparity between the predicted route and the actual route by around 15%.
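The iterative first-stop search can be sketched as follows. The greedy nearest-neighbor routine here is a hypothetical stand-in for the trained pointer network, and the distance matrix is toy data; only the outer loop (try each candidate first stop, keep the cheapest generated sequence) reflects the idea described above.

```python
# Sketch of scanning candidate first stops: generate a full stop sequence
# from each candidate start, then keep the lowest-cost sequence.

def greedy_sequence(start, stops, dist):
    """Visit remaining stops greedily by distance (stand-in for the model)."""
    route, remaining = [start], set(stops) - {start}
    while remaining:
        nxt = min(remaining, key=lambda s: dist[route[-1]][s])
        route.append(nxt)
        remaining.remove(nxt)
    return route

def route_cost(route, dist):
    """Total travel cost along consecutive stop pairs."""
    return sum(dist[a][b] for a, b in zip(route, route[1:]))

def best_first_stop(stops, dist):
    """Try every stop as the first stop; return the cheapest sequence."""
    candidates = [greedy_sequence(s, stops, dist) for s in stops]
    return min(candidates, key=lambda r: route_cost(r, dist))

# Toy asymmetric distance matrix between three stops.
dist = {"A": {"B": 1, "C": 5},
        "B": {"A": 2, "C": 3},
        "C": {"A": 6, "B": 4}}
print(best_first_stop(["A", "B", "C"], dist))  # -> ['A', 'B', 'C']
```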
Identifying travel modes from GPS tracks, as an essential technique to understand the travel behavior of a population, has received widespread interest over the past decade. While most previous Travel Mode Identification (TMI) methods separately identify the mode of each track segment of a GPS trajectory, in this paper, we propose a sequence-based TMI framework that constructs a feature sequence for each GPS trajectory and feeds it into a sequence-to-sequence (seq2seq) model to obtain the corresponding travel mode label sequence, named Trajectory-as-a-sequence (TaaS). The proposed seq2seq model consists of a Convolutional Encoder (CE) and a Recurrent Conditional Random Field (RCRF), where the CE extracts high-level features from the point-level trajectory features and the RCRF learns the context information of trajectories at both feature and label levels, thus outputting accurate and reasonable travel mode label sequences. To alleviate the lack of data, we adopt a two-stage model training strategy. Additionally, we design two novel bus-related features to assist the seq2seq model in distinguishing different high-speed travel modes (i.e., bus, car, and railway) in the sequence. Besides the classical performance metrics such as accuracy, we propose a new metric that evaluates the rationality of the travel mode label sequence at the trajectory level. Comprehensive evaluations corresponding to the real-world TMI applications show that the sequence-based TaaS outperforms the segment-based models in practice. Furthermore, the results of ablation studies demonstrate that the elements integrated into the TaaS framework help improve the efficiency and accuracy of TMI.
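A simplified, hypothetical version of such a trajectory-level rationality check (our illustration, not the paper's actual metric) could penalize label sequences that switch travel modes unrealistically often:

```python
# Hypothetical rationality-style metric: real trajectories rarely alternate
# modes point by point, so fewer contiguous mode runs is more plausible.

def num_mode_segments(labels):
    """Number of contiguous same-mode runs in a label sequence."""
    return sum(1 for i, m in enumerate(labels) if i == 0 or m != labels[i - 1])

def rationality(predicted, max_reasonable_segments=3):
    """1.0 when the sequence has few mode switches, decaying toward 0.

    max_reasonable_segments is an assumed threshold for illustration.
    """
    segs = num_mode_segments(predicted)
    return min(1.0, max_reasonable_segments / segs) if segs else 0.0

print(num_mode_segments(["walk", "walk", "bus", "bus", "walk"]))  # 3 runs
```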
Sequence-to-sequence (seq2seq) automatic speech recognition (ASR) has recently achieved state-of-the-art performance with fast decoding and a simple architecture. On the other hand, it requires a large amount of training data and cannot use text-only data for training. In our previous work, we proposed a method for applying text data to seq2seq ASR training by leveraging text-to-speech (TTS). However, we observed that the log Mel-scale filterbank (lmfb) features produced by a Tacotron 2-based model are blurry, particularly on the time dimension. This problem is mitigated by introducing the WaveNet vocoder to generate speech of better quality or spectrograms of better time resolution. This makes it possible to train waveform-input end-to-end ASR. Here we use CNN filters and apply a masking method similar to SpecAugment. We compare the waveform-input model with two kinds of lmfb-input models: (1) lmfb features are directly generated by TTS, and (2) lmfb features are converted from the waveform generated by TTS. Experimental evaluations show the combination of waveform-output TTS and the waveform-input end-to-end ASR model outperforms the lmfb-input models in two domain adaptation settings.
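The SpecAugment-style masking can be illustrated with a minimal time-masking sketch. The mask position and width are fixed here for reproducibility; real SpecAugment samples them randomly, and this toy operates on a small list-of-frames stand-in for the actual features.

```python
# Minimal sketch of SpecAugment-style time masking on a feature sequence.

def time_mask(features, start, width, mask_value=0.0):
    """Zero out `width` consecutive frames starting at `start`.

    Returns a new sequence; the input is left unchanged.
    """
    masked = list(features)
    for t in range(start, min(start + width, len(masked))):
        masked[t] = [mask_value] * len(masked[t])
    return masked

frames = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]
print(time_mask(frames, start=1, width=2))
# frames 1 and 2 are zeroed: [[1.0, 2.0], [0.0, 0.0], [0.0, 0.0], [7.0, 8.0]]
```

SpecAugment additionally masks frequency bands; the same pattern applied along the feature dimension would cover that case.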
Predicting Estimated Time of Arrival (ETA) for a Multi-Airport System (MAS) is much more challenging than for a single airport system because of complex air route structure, dense air traffic volume and vagaries of traffic conditions in an MAS. In this work, we propose a novel "Bubble" mechanism to accurately predict medium-term ETA for an MAS, in which the prediction of travel time of an origin-destination (OD) pair is decomposed into two stages, termed the out-MAS and in-MAS stages. For the out-MAS stage, Auto-Regressive Integrated Moving Average (ARIMA) is used to predict the travel time of a flight to reach the MAS boundary. For the in-MAS stage, we construct new spatio-temporal features based on clustering analysis of trajectory patterns facilitated by a novel data-driven hybrid polar sampling method. A sequence-to-sequence prediction model, Multi-variate Stacked Fully connected Bidirectional Long Short-Term Memory, is further developed to achieve multi-step-ahead predictions of in-MAS travel time for each trajectory pattern using the spatio-temporal features as input. Finally, the medium-term ETA prediction for an MAS is achieved by integrating the out-MAS and in-MAS prediction with the help of trajectory pattern prediction via random forest. A case study of predicting medium-term ETA for a typical MAS in China, the Guangdong-Hong Kong-Macao Greater Bay Area, is conducted to demonstrate the usage and promising performance of the proposed method in comparison to several commonly used end-to-end learning methods.
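At prediction time, the "Bubble" decomposition reduces to summing the two stages' travel-time estimates. A toy sketch with made-up numbers (the two inputs would come from the ARIMA model and the in-MAS sequence model, respectively):

```python
# Toy illustration of the two-stage ETA decomposition; values are made up.

def bubble_eta(out_mas_minutes, in_mas_minutes):
    """Total ETA = travel time to the MAS boundary + time inside the MAS."""
    return out_mas_minutes + in_mas_minutes

# Hypothetical flight: 95 min to reach the MAS boundary (out-MAS stage),
# then 22.5 min from boundary to runway (in-MAS stage).
eta = bubble_eta(out_mas_minutes=95.0, in_mas_minutes=22.5)
print(eta)  # -> 117.5
```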
ISBN (Print): 9781665437394
While end-to-end automatic speech recognition (ASR) has achieved high performance, it requires a huge amount of paired speech and transcription data for training. Recently, data augmentation methods have actively been investigated. One method is to use a text-to-speech (TTS) system to generate speech data from text-only data and use the generated speech for data augmentation, but it has been found that the synthesized log Mel-scale filterbank (lmfb) features could have a serious mismatch with the real speech features. In this study, we propose a data augmentation method via a discrete speech representation. The TTS model predicts discrete ID sequences instead of lmfb features, and the ASR also uses the ID sequences as training data. We expect that the use of a discrete representation based on vq-wav2vec not only makes TTS training easier but also mitigates the mismatch with real data. Experimental evaluations show that the proposed method outperforms the data augmentation method using the conventional TTS. We found that it reduces speaker dependency, and the generated features are distributed more closely to the real ones.
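The discrete-representation idea can be sketched as nearest-codebook quantization: each frame is mapped to the ID of its closest codebook vector. The codebook below is a toy stand-in, not a trained vq-wav2vec codebook.

```python
# Sketch of vq-style quantization: frames -> discrete ID sequence.

def quantize(frames, codebook):
    """Return the ID sequence of nearest codebook entries (squared L2)."""
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return [min(range(len(codebook)), key=lambda i: dist(f, codebook[i]))
            for f in frames]

codebook = [[0.0, 0.0], [1.0, 1.0], [2.0, 2.0]]   # toy learned codewords
frames = [[0.1, -0.1], [0.9, 1.2], [2.2, 1.8]]    # toy acoustic frames
print(quantize(frames, codebook))  # -> [0, 1, 2]
```

In the pipeline described above, TTS would predict such ID sequences directly, and ASR would consume them as inputs in place of lmfb features.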
ISBN (Print): 9781728158754
The number of people suffering from mental health issues like depression and anxiety has spiked enormously in recent times. Conversational agents like chatbots have emerged as an effective way for users to express their feelings and anxious thoughts and in turn obtain some empathetic reply that would relieve their anxiety. In our work, we construct two types of empathetic conversational agent models based on sequence-to-sequence modeling, with and without an attention mechanism. We implement the attention mechanism proposed by Bahdanau et al. for neural machine translation models. We train our models on the benchmark Facebook Empathetic Dialogue dataset and compute BLEU scores. Our empathetic conversational agent model incorporating the attention mechanism generates better quality empathetic responses and is better at capturing human feelings and emotions in the conversation.
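A minimal scalar sketch of Bahdanau-style additive attention, using toy one-dimensional states and hypothetical weights (real models use learned weight matrices and vector-valued states): scores come from a small nonlinear function of the decoder and encoder states and are softmax-normalized into weights over encoder positions.

```python
import math

# Toy scalar version of additive (Bahdanau) attention scoring.
def additive_score(dec, enc, w_dec=1.0, w_enc=1.0):
    """score(s, h) = tanh(w_dec*s + w_enc*h); weights are hypothetical."""
    return math.tanh(w_dec * dec + w_enc * enc)

def attention_weights(dec_state, enc_states):
    """Softmax over the scores of the decoder state against each encoder state."""
    scores = [additive_score(dec_state, h) for h in enc_states]
    exps = [math.exp(s) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

weights = attention_weights(0.5, [0.1, 0.9, -0.3])
print(round(sum(weights), 6))  # the weights form a distribution -> 1.0
```

The decoder then uses these weights to form a context vector as a weighted sum of encoder states; that step is omitted here for brevity.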
Machine translation is the process of translating one natural language into another natural language. In the experiment, machine translation tasks were performed on an English-to-German data set and an English-to-Thai data set using the sequence-to-sequence model, the sequence-to-sequence model with attention mechanism, and the transformer model. Analysis of the experimental data shows that the transformer model not only outperforms the first two models in machine translation performance, but also benefits from its structural characteristics: on the relatively scarce English-to-Thai data set, the transformer model's results degrade less than those of the first two models, which indicates that the transformer model improves the quality of machine translation.
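The transformer's core operation, scaled dot-product attention, can be shown in a toy single-query pure-Python form (real models use batched matrices, multiple heads, and learned projections; the vectors below are toy values):

```python
import math

def scaled_dot_attention(q, keys, values):
    """Single-query scaled dot-product attention over toy 2-D keys/values."""
    d = len(q)
    scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
    exps = [math.exp(s) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    # Context vector: weighted sum of the value vectors.
    return [sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))]

out = scaled_dot_attention([1.0, 0.0],
                           [[1.0, 0.0], [0.0, 1.0]],
                           [[1.0, 2.0], [3.0, 4.0]])
print([round(x, 3) for x in out])
```

The query aligns more strongly with the first key, so the output leans toward the first value vector rather than averaging the two equally.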
ISBN (Print): 9781728176055
We present an extended Parrotron model: a single, end-to-end network that enables voice conversion and recognition simultaneously. Input spectrograms are transformed to output spectrograms in the voice of a predetermined target speaker while also generating hypotheses in a target vocabulary. We study the performance of this novel architecture, which jointly predicts speech and text, on atypical (e.g. dysarthric) speech. We show that with as little as an hour of atypical speech, speaker adaptation can yield a 77% relative reduction in Word Error Rate (WER), measured by ASR performance on the converted speech. We also show that data augmentation using a customized synthesizer built on atypical speech can provide an additional 10% relative improvement over the best speaker-adapted model. Finally, we show how these methods generalize across 8 types of atypical speech for a range of speech impairment severities.
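As a quick sanity check of the arithmetic behind a "77% relative reduction in WER": relative reduction is the drop divided by the baseline. The baseline WER below is made up for illustration, not a figure from the paper.

```python
# Illustrating relative vs. absolute WER reduction with a hypothetical baseline.

def relative_reduction(baseline, improved):
    """Fraction of the baseline error eliminated by the improved system."""
    return (baseline - improved) / baseline

baseline_wer = 0.50                      # hypothetical pre-adaptation WER
adapted_wer = baseline_wer * (1 - 0.77)  # apply a 77% relative reduction
print(round(adapted_wer, 3))             # absolute WER after adaptation
print(round(relative_reduction(baseline_wer, adapted_wer), 2))
```

Note that a 77% relative reduction from a 50% baseline lands at 11.5% absolute WER, not at 50 - 77 percentage points; relative and absolute reductions are easy to conflate.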
ISBN (Print): 9781665437394
In this paper, we propose Textual Echo Cancellation (TEC) - a framework for cancelling the text-to-speech (TTS) playback echo from overlapping speech recordings. Such a system can largely improve speech recognition performance and user experience for intelligent devices such as smart speakers, as the user can talk to the device while the device is still playing the TTS signal responding to the previous query. We implement this system by using a novel sequence-to-sequence model with multi-source attention that takes both the microphone mixture signal and source text of the TTS playback as inputs, and predicts the enhanced audio. Experiments show that the textual information of the TTS playback is critical to enhancement performance. Besides, the text sequence is much smaller in size compared with the raw acoustic signal of the TTS playback, and can be immediately transmitted to the device or ASR server even before the playback is synthesized. Therefore, our proposed approach effectively reduces Internet communication and latency compared with alternative approaches such as acoustic echo cancellation (AEC).
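A minimal sketch of the multi-source idea: the decoder attends to each source (the acoustic mixture and the TTS source text) separately, then fuses the per-source context vectors. Plain concatenation is our assumption for illustration; the paper's actual fusion layer may differ.

```python
# Hypothetical fusion of two per-source attention context vectors.

def fuse_contexts(audio_context, text_context):
    """Concatenate per-source attention contexts into one decoder input."""
    return list(audio_context) + list(text_context)

audio_ctx = [0.2, 0.8]   # toy attention context over the mixture signal
text_ctx = [0.5, 0.1]    # toy attention context over the TTS source text
print(fuse_contexts(audio_ctx, text_ctx))  # -> [0.2, 0.8, 0.5, 0.1]
```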