State-of-the-art encoder-decoder models (e.g. for machine translation (MT) or automatic speech recognition (ASR)) are constructed and trained end-to-end as an atomic unit. No component of the model can be (re-)used without the others, making it impossible to share parts, e.g. a high-resourced decoder, across tasks. We describe LegoNN, a procedure for building encoder-decoder architectures such that their parts can be applied to other tasks without any fine-tuning. To achieve this reusability, the interface between the encoder and decoder modules is grounded in a sequence of marginal distributions over a pre-defined discrete vocabulary. We present two approaches for ingesting these marginals: one is differentiable, allowing the flow of gradients across the entire network, and the other is gradient-isolating. To enable the portability of decoder modules between MT tasks for different source languages and across other tasks like ASR, we introduce a modality-agnostic encoder with a length-control mechanism that dynamically adapts the encoder's output length to match the expected input length range of pre-trained decoders. We present several experiments demonstrating the effectiveness of LegoNN models: a trained language-generation LegoNN decoder module from a German-English (De-En) MT task can be reused, without any fine-tuning, for the Europarl English ASR and Romanian-English (Ro-En) MT tasks, matching or beating the performance of the baselines. After fine-tuning, LegoNN models improve the Ro-En MT task by 1.5 BLEU points and achieve a 12.5% relative WER reduction on the Europarl ASR task. To show how the approach generalizes, we compose a LegoNN ASR model from three modules, each learned within a different end-to-end trained model on a different dataset, achieving an overall WER reduction of 19.5%.
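A minimal sketch (not the authors' released code) of the marginal-distribution interface the abstract describes: the encoder emits per-position distributions over a fixed vocabulary, and the decoder ingests them as expected embeddings, either differentiably or with gradients blocked at the interface. All module names and sizes here are illustrative assumptions.

```python
import torch
import torch.nn as nn

VOCAB = 1000  # pre-defined discrete interface vocabulary (assumed size)

class MarginalEncoder(nn.Module):
    def __init__(self, d_model=256):
        super().__init__()
        self.body = nn.GRU(d_model, d_model, batch_first=True)
        self.to_vocab = nn.Linear(d_model, VOCAB)

    def forward(self, x):                                # x: (B, T, d_model)
        h, _ = self.body(x)
        return torch.softmax(self.to_vocab(h), dim=-1)   # (B, T, VOCAB) marginals

class MarginalIngestingDecoder(nn.Module):
    def __init__(self, d_model=256, gradient_isolated=False):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, d_model)
        self.body = nn.GRU(d_model, d_model, batch_first=True)
        self.gradient_isolated = gradient_isolated

    def forward(self, marginals):
        if self.gradient_isolated:
            marginals = marginals.detach()    # block gradient flow at the interface
        # expected embedding under the marginals (the differentiable variant)
        x = marginals @ self.embed.weight     # (B, T, d_model)
        out, _ = self.body(x)
        return out

enc, dec = MarginalEncoder(), MarginalIngestingDecoder(gradient_isolated=False)
y = dec(enc(torch.randn(2, 7, 256)))          # decoder is swappable across tasks
```

Because the decoder only ever sees distributions over the shared vocabulary, any encoder producing the same interface could, in principle, be paired with it.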
Image matting is a technique used to extract the foreground and background from a given image. In the past, classical algorithms based on sampling, propagation, or a combination of the two were used to perform image matting; however, most of these produce poor results when applied to images with complex backgrounds. They are also unable to extract, with high accuracy, foregrounds composed of thin objects. In this context, the use of deep learning to solve the image matting problem has gained increasing popularity. In this paper, an encoder-decoder model for alpha matting of human portraits using deep learning is proposed. The model comprises two parts: the first is an encoder-decoder network, a deep convolutional network with 11 convolutional layers and 5 max-pooling layers in the encoder stage and 11 convolutional layers and 5 unpooling layers in the decoder stage. This portion of the model takes the image and trimap as input and produces a coarse alpha matte as output. The second part is a refinement stage with four convolutional layers, responsible for further refining the coarse alpha matte produced by the encoder-decoder stage to obtain an alpha matte of high accuracy. The model was trained using 43,100 images. When tested on the dataset, our model's output was comparable to the industry standard, yielding an average MSE of 0.023 and an average SAD loss of 66.5.
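A compressed sketch of the two-stage design (far shallower than the paper's 11-convolution encoder/decoder): a coarse encoder-decoder over the concatenated image and trimap, followed by a small four-convolution refinement head. Channel widths are assumptions.

```python
import torch
import torch.nn as nn

class CoarseMatting(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv1 = nn.Sequential(nn.Conv2d(4, 32, 3, padding=1), nn.ReLU())
        self.pool = nn.MaxPool2d(2, return_indices=True)
        self.conv2 = nn.Sequential(nn.Conv2d(32, 32, 3, padding=1), nn.ReLU())
        self.unpool = nn.MaxUnpool2d(2)
        self.head = nn.Conv2d(32, 1, 3, padding=1)

    def forward(self, rgb, trimap):
        x = self.conv1(torch.cat([rgb, trimap], dim=1))  # RGB + trimap = 4 channels
        x, idx = self.pool(x)                 # max-pool, keep indices for unpooling
        x = self.conv2(x)
        x = self.unpool(x, idx)               # unpooling mirrors the pooling stage
        return torch.sigmoid(self.head(x))    # coarse alpha matte in [0, 1]

class Refinement(nn.Module):                  # four convs refining the coarse alpha
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(5, 64, 3, padding=1), nn.ReLU(),  # RGB + trimap + coarse alpha
            nn.Conv2d(64, 64, 3, padding=1), nn.ReLU(),
            nn.Conv2d(64, 64, 3, padding=1), nn.ReLU(),
            nn.Conv2d(64, 1, 3, padding=1),
        )

    def forward(self, rgb, trimap, coarse):
        return torch.sigmoid(self.net(torch.cat([rgb, trimap, coarse], dim=1)))
```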
Predictive monitoring is a subfield of process mining that aims to predict how a running case will unfold in the future. One of its main challenges is forecasting the sequence of activities that will occur from a given point in time (suffix prediction). Most approaches to the suffix prediction problem learn to predict the suffix by learning only how to predict the next activity, while disregarding structural information present in the process model. This paper proposes a novel architecture based on an encoder-decoder model with an attention mechanism that decouples the representation learning of the prefixes from the inference phase, predicting only the activities of the suffix. During the inference phase, this architecture is extended with a heuristic search algorithm that selects the most probable suffix according to both the structural information extracted from the process model and the information extracted from the log. Our approach has been tested on 12 public event logs against 6 different state-of-the-art proposals, significantly outperforming all of them.
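A minimal sketch of the kind of inference-time heuristic search described above: candidate suffixes are expanded activity by activity, and each is scored by the decoder's log-probability plus a process-model term. `decoder_step` and `model_score` are hypothetical stand-ins for the paper's components, and the combined scoring is an assumption.

```python
import heapq
import math

END = "<eos>"

def suffix_search(prefix, decoder_step, model_score, beam=5, max_len=50):
    """decoder_step(prefix, suffix) -> {activity: prob}; model_score(suffix) -> float."""
    frontier = [(0.0, [])]                        # (negative score, suffix so far)
    finished = []
    for _ in range(max_len):
        candidates = []
        for neg, suffix in frontier:
            for act, p in decoder_step(prefix, suffix).items():
                new = suffix + [act]
                # log-likelihood from the decoder + structural process-model score
                score = -neg + math.log(p) + model_score(new)
                (finished if act == END else candidates).append((-score, new))
        frontier = heapq.nsmallest(beam, candidates)  # keep the best `beam` partials
        if not frontier:
            break
    best = min(finished + frontier)               # lowest negative score wins
    return best[1]
```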
ISBN (Print): 9798350392265; 9798350392258
In this paper, we present an encoder-decoder model leveraging Flan-T5 for post-Automatic Speech Recognition (ASR) Generative Speech Error Correction (GenSEC), and we refer to it as FlanEC. We explore its application within the GenSEC framework to enhance ASR outputs by mapping n-best hypotheses into a single output sentence. By utilizing n-best lists from ASR models, we aim to improve the linguistic correctness, accuracy, and grammaticality of final ASR transcriptions. Specifically, we investigate whether scaling the training data and incorporating diverse datasets can lead to significant improvements in post-ASR error correction. We evaluate FlanEC using the HyPoradise dataset, providing a comprehensive analysis of the model's effectiveness in this domain. Furthermore, we assess the proposed approach under different settings to evaluate model scalability and efficiency, offering valuable insights into the potential of instruction-tuned encoder-decoder models for this task.
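A minimal sketch (not the released FlanEC code) of post-ASR correction with Flan-T5: the n-best hypotheses are packed into a single instruction-style prompt and the model generates one corrected transcription. The prompt wording and checkpoint choice here are assumptions, not the paper's exact setup.

```python
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("google/flan-t5-base")
model = AutoModelForSeq2SeqLM.from_pretrained("google/flan-t5-base")

def correct(nbest):
    # map the n-best list into one instruction-style input sequence
    hyps = "\n".join(f"{i + 1}. {h}" for i, h in enumerate(nbest))
    prompt = ("Given these ASR hypotheses, produce the single most likely "
              f"correct transcription:\n{hyps}")
    ids = tok(prompt, return_tensors="pt").input_ids
    out = model.generate(ids, max_new_tokens=64)
    return tok.decode(out[0], skip_special_tokens=True)

print(correct(["i scream for ice cream", "eye scream for ice cream"]))
```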
ISBN (Print): 9783031483110; 9783031483127
Text-to-speech (TTS) systems are an important component in voice-based e-commerce applications. These applications include end-to-end voice assistants and customer experience (CX) voice bots. Code-mixed TTS is also relevant in these applications since product names are commonly described in English while the surrounding text is in a regional language. In this work, we describe our approaches for production-quality code-mixed Hindi-English TTS systems built for e-commerce applications. We propose a data-oriented approach utilizing monolingual datasets in the individual languages. We leverage a transliteration model to convert the Roman text into a common Devanagari script and then combine both datasets for training. We show that such single-script bilingual training, without any code-mixing, works well for pure code-mixed test sets. We further present an exhaustive evaluation of single-speaker adaptation and multi-speaker training with a Tacotron2 + WaveGlow setup to show that the former approach works better. These approaches are also coupled with transfer learning and decoder-only fine-tuning to improve performance. We compare these approaches with Google TTS and report a positive CMOS score of 0.02 with the proposed transfer learning approach. We also perform low-resource voice adaptation experiments to show that a new voice can be onboarded with just 3 hours of data. This highlights the importance of our pre-trained models in resource-constrained settings. This subjective evaluation is performed on a large number of out-of-domain pure code-mixed sentences to demonstrate the high quality of the systems.
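A minimal sketch of the single-script data preparation step: English (Roman) text is transliterated into Devanagari so both monolingual corpora share one script before TTS training. `transliterate_to_devanagari` stands in for the paper's trained transliteration model and is hypothetical here.

```python
def transliterate_to_devanagari(roman_text: str) -> str:
    # stand-in for the trained Roman -> Devanagari transliteration model
    raise NotImplementedError

def build_single_script_corpus(hindi_pairs, english_pairs):
    """Each corpus is a list of (text, audio_path) pairs."""
    merged = list(hindi_pairs)                       # already in Devanagari
    for text, audio in english_pairs:
        merged.append((transliterate_to_devanagari(text), audio))
    return merged                                    # single-script TTS training set
```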
Image segmentation is a key task in computer vision and image processing with important applications such as scene understanding, medical image analysis, robotic perception, video surveillance, augmented reality, and image compression, among others, and numerous segmentation algorithms are found in the literature. Against this backdrop, the broad success of deep learning (DL) has prompted the development of new image segmentation approaches leveraging DL models. We provide a comprehensive review of this recent literature, covering the spectrum of pioneering efforts in semantic and instance segmentation, including convolutional pixel-labeling networks, encoder-decoder architectures, multiscale and pyramid-based approaches, recurrent networks, visual attention models, and generative models in adversarial settings. We investigate the relationships, strengths, and challenges of these DL-based segmentation models, examine the widely used datasets, compare performances, and discuss promising research directions.
ISBN (Print): 9798350322811
On a daily basis, data centers process huge volumes of data backed by the proliferation of inexpensive hard disks. Data stored in these disks serve a range of critical functional needs, from financial and healthcare to aerospace. As such, premature disk failure and consequent loss of data can be catastrophic. To mitigate the risk of failures, cloud storage providers perform condition-based monitoring and replace hard disks before they fail. By estimating the remaining useful life of hard disk drives, one can predict the time-to-failure of a particular device and replace it at the right time, ensuring maximum utilization whilst reducing operational costs. In this work, large-scale predictive analyses are performed using severely skewed health statistics data by incorporating customized feature engineering and a suite of sequence learners. Past work suggests LSTMs as an excellent approach to predicting remaining useful life. To this end, we present an encoder-decoder LSTM model where the context gained from understanding health statistics sequences aids in predicting an output sequence of the number of days remaining before a disk potentially fails. The models developed in this work are trained and tested across the full 10 years of S.M.A.R.T. health data in circulation from Backblaze and on a wide variety of disk instances. This closes the knowledge gap on what full-scale training achieves on thousands of devices and advances the state-of-the-art by providing tangible metrics for evaluation and generalization for practitioners looking to extend their workflow to all years of health data in circulation across disk manufacturers. The encoder-decoder LSTM posted an RMSE of 0.83 during training and 0.86 during testing over the full 10-year data while generalizing competitively to other drives from the Seagate family.
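A minimal sketch (hyperparameters assumed) of such an encoder-decoder LSTM: the encoder summarizes a window of daily S.M.A.R.T. features, and the decoder autoregressively emits a sequence of remaining-useful-life estimates.

```python
import torch
import torch.nn as nn

class RULSeq2Seq(nn.Module):
    def __init__(self, n_features, hidden=64, horizon=7):
        super().__init__()
        self.horizon = horizon
        self.encoder = nn.LSTM(n_features, hidden, batch_first=True)
        self.decoder = nn.LSTM(1, hidden, batch_first=True)
        self.head = nn.Linear(hidden, 1)

    def forward(self, x):                    # x: (B, T_in, n_features)
        _, state = self.encoder(x)           # context from the health-stat sequence
        step = torch.zeros(x.size(0), 1, 1)  # teacher forcing omitted for brevity
        outs = []
        for _ in range(self.horizon):        # decode one day-estimate at a time
            h, state = self.decoder(step, state)
            step = self.head(h)              # predicted days remaining
            outs.append(step)
        return torch.cat(outs, dim=1).squeeze(-1)   # (B, horizon)

model = RULSeq2Seq(n_features=20)
print(model(torch.randn(4, 30, 20)).shape)   # torch.Size([4, 7])
```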
Combinatorial optimization problems are an important class of problems often encountered in the real world, involving a combinatorially growing set of feasible solutions as the problem size increases. Since exact approaches can be computationally expensive, practitioners often use approximate approaches such as metaheuristics. However, sophisticated approximate methods that yield high-quality solutions require expert help to handcraft or fine-tune the solution process to suit a given problem distribution. In recent years, artificial intelligence (AI) approaches that involve learning from data without being explicitly programmed have shown tremendous success at various challenging tasks, like natural language processing and autonomous driving. Therefore, solving combinatorial optimization problems is an ideal use case for AI approaches. In this dissertation, we find answers to two key questions considering recent AI developments. 1) How can deep reinforcement learning (DRL) approaches be used to solve complex multi-vehicle combinatorial optimization problems? 2) Can combining machine learning, metaheuristics, and mixed-integer linear optimization solvers under a hybrid framework help quickly obtain certifiably high-quality solutions for combinatorial optimization problems? The answers to these questions broadly build on two key directions, DRL and hybrid approaches, to tackle challenging multi-vehicle combinatorial optimization problems considering the recent advancements, gaps, and drawbacks. Specifically, in Part I of this dissertation, DRL-based approximate approaches are developed to learn from complex edge features, reason over uncertain edges, and handle multi-vehicle decoding and collaboration to solve complex multi-vehicle combinatorial optimization problems. Additionally, we develop approaches to generate large-scale complex data on the fly for training. Upon experimental evaluation, we learn that DRL-based approaches can quickly generate high-quality solutions to such problems.
Real-time safety prediction models are vital in proactive road safety management strategies. This study develops models to predict traffic conflicts at signalized intersections at the signal cycle level, using advanced Bayesian deep learning techniques and efficient LiDAR points. The modeling framework contains three phases: data preprocessing, base deep learning model development, and Bayesian deep learning model development. The core of the framework is the long short-term memory (LSTM) network employed to predict the conflict frequency of a cycle using traffic features of the previous five cycles (e.g., dynamic traffic parameters, traffic conflict frequency). Four Bayesian deep learning models were developed: Bayesian-Standard LSTM, Bayesian-Hybrid-LSTM, Bayesian-Stacked-LSTM encoder-decoder, and Bayesian-Multi-head Stacked-LSTM encoder-decoder. The developed models were applied to traffic conflicts extracted from LiDAR points collected from a signalized intersection in Harbin, China, over a total duration of seven days. Traffic conflicts, measured by the modified time-to-collision conflict indicator, were identified using the peak-over-threshold approach. The models were thoroughly evaluated from the aspects of reliability, transferability, sensitivity, and robustness. The results show that the four developed models can predict traffic conflict frequency per cycle per lane together with its uncertainty. Moreover, the two Bayesian encoder-decoder models perform better than Bayesian-Standard LSTM and Bayesian-Hybrid-LSTM in the four tests. Bayesian-Multi-head Stacked-LSTM encoder-decoder is suggested as the optimal model for its high reliability under uncertainty, good transferability in three scenarios, low sensitivity to different parameters, and sound robustness against small noise. The proposed framework could benefit studies on state-of-the-art data-driven approaches for real-time safety prediction.
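A minimal sketch of one way to obtain predictive uncertainty from an LSTM via Monte Carlo dropout, a common approximation to Bayesian deep learning (the paper's exact Bayesian formulation may differ). Inputs are per-cycle traffic features for the previous five cycles; the output is next-cycle conflict frequency with an uncertainty estimate.

```python
import torch
import torch.nn as nn

class ConflictLSTM(nn.Module):
    def __init__(self, n_features, hidden=32):
        super().__init__()
        self.lstm = nn.LSTM(n_features, hidden, batch_first=True)
        self.drop = nn.Dropout(0.2)
        self.head = nn.Linear(hidden, 1)

    def forward(self, x):                      # x: (B, 5, n_features)
        h, _ = self.lstm(x)
        return self.head(self.drop(h[:, -1]))  # conflict frequency of next cycle

def predict_with_uncertainty(model, x, samples=50):
    model.train()                              # keep dropout active at test time
    preds = torch.stack([model(x) for _ in range(samples)])
    return preds.mean(0), preds.std(0)         # point estimate and its uncertainty
```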
ISBN (Print): 9783030801267; 9783030801250
The paper explores a novel methodology for source code obfuscation through the application of text-based recurrent neural network (RNN) encoder-decoder models for ciphertext generation and key generation. Sequence-to-sequence models are incorporated into the architecture to generate the obfuscated code, generate the deobfuscation key, and support live execution. Quantitative benchmark comparisons to existing obfuscation methods indicate significant improvement in stealth and execution cost for the proposed solution, and experiments on the model's properties yield positive results for its character variation, dissimilarity to the original codebase, and consistent length of obfuscated code.
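A minimal character-level seq2seq sketch in the spirit of the described obfuscator: an RNN encoder reads the source code and a decoder emits a "ciphertext" character sequence (the paired key-generation model would be analogous). Architecture sizes are assumptions; this is not the paper's implementation.

```python
import torch
import torch.nn as nn

class CodeObfuscator(nn.Module):
    def __init__(self, vocab=256, d=128):
        super().__init__()
        self.embed = nn.Embedding(vocab, d)
        self.encoder = nn.GRU(d, d, batch_first=True)
        self.decoder = nn.GRU(d, d, batch_first=True)
        self.head = nn.Linear(d, vocab)

    def forward(self, src_ids, tgt_ids):        # teacher-forced training pass
        _, h = self.encoder(self.embed(src_ids))     # encode the source code
        out, _ = self.decoder(self.embed(tgt_ids), h)
        return self.head(out)                   # logits over ciphertext characters

src = torch.tensor([[ord(c) for c in "print(1)"]])  # byte-level source tokens
tgt = torch.zeros_like(src)                          # dummy ciphertext targets
logits = CodeObfuscator()(src, tgt)                  # (1, 8, 256)
```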