Search results - Inner Mongolia University Library

How Much Does Tokenization Affect Neural Machine Translation?

20th International Conference on Computational Linguistics and Intelligent Text Processing, CICLing 2019

Authors: Domingo, Miguel; García-Martínez, Mercedes; Helle, Alexandre; Casacuberta, Francisco; Herranz, Manuel. Pattern Recognition and Human Language Technology Research Center, Universitat Politècnica de València, Camino de Vera s/n, Valencia 46022, Spain; Pangeanic/B.I Europa, PangeaMT Technologies Division, Valencia, Spain

ISBN (digital): 9783031243370

ISBN (print): 9783031243363

Tokenization, or segmentation, is a broad concept that covers simple processes such as separating punctuation from words and more sophisticated ones such as applying morphological knowledge. Neural Machine Translation (NMT) requires a limited-size vocabulary to keep computational cost down, and enough examples of each word to estimate its embedding. Separating punctuation and splitting tokens into words or subwords has proven helpful for reducing the vocabulary and increasing the number of examples of each word, which improves translation quality. Tokenization is more challenging for languages with no separator between words. To assess the impact of tokenization on the quality of the final translation in NMT, we experimented with five tokenizers over ten language pairs. We concluded that tokenization significantly affects the final translation quality and that the best tokenizer differs across language pairs. © 2023, Springer Nature Switzerland AG.

Keywords: Neural machine translation
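The abstract above says that separating punctuation reduces the vocabulary and increases the number of examples of each word. The following is a minimal sketch of that effect on a toy sentence. The two tokenizer functions and the sample text are illustrative assumptions and are not among the five tokenizers evaluated in the paper.

```python
import re
from collections import Counter


def whitespace_tokenize(text):
    """Split on whitespace only, so punctuation stays glued to words."""
    return text.split()


def punctuation_tokenize(text):
    """Split into runs of word characters and single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text)


# Toy corpus (hypothetical), chosen so words occur with and without punctuation.
corpus = "the cat sat. the cat, the dog sat, the dog."

raw_counts = Counter(whitespace_tokenize(corpus))
tok_counts = Counter(punctuation_tokenize(corpus))

# Vocabulary size is the number of distinct token types.
print("whitespace vocabulary:", len(raw_counts))
print("punctuation-split vocabulary:", len(tok_counts))
# Occurrences of "cat" available as training examples under each scheme.
print("examples of 'cat':", raw_counts["cat"], "vs", tok_counts["cat"])
```

With whitespace splitting, "cat" and "cat," are different vocabulary entries; once punctuation is separated they merge into one entry with more occurrences.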

Calibration of Deep Probabilistic Models with Decoupled Bayesian Neural Networks

arXiv, 2019

Authors: Maroñas, Juan; Paredes, Roberto; Ramos, Daniel. PRHLT - Pattern Recognition and Human Language Technology Research Center, Universitat Politècnica de València, Spain; AUDIAS - Audio Data Intelligence and Speech, Universidad Autónoma de Madrid, Spain

Deep Neural Networks (DNNs) have achieved state-of-the-art accuracy in many tasks. However, recent work has pointed out that the outputs of these models are not well calibrated, which seriously limits their use in critical decision scenarios. In this work, we propose a decoupled Bayesian stage, implemented with a Bayesian Neural Network (BNN), that maps the uncalibrated probabilities produced by a DNN to calibrated ones, consistently improving calibration. Our results show that incorporating uncertainty yields more reliable probabilistic models, a critical condition for achieving good calibration. We report an extensive collection of experimental results using high-accuracy DNNs on standardized image classification benchmarks, showing the good performance, flexibility and robust behavior of our approach with respect to several state-of-the-art calibration methods. Code for reproducibility is provided. Copyright © 2019, The Authors. All rights reserved.

Keywords: Calibration

Combining handwriting and speech recognition for transcribing historical handwritten documents

International Conference on Document Analysis and Recognition

Authors: Emilio Granell; Carlos-D. Martínez-Hinarejos. Pattern Recognition and Human Language Technology Research Center, Universitat Politècnica de València, Valencia, Spain

Transcription of historical documents is a task of interest for libraries that want to make their holdings available. In recent years, Handwritten Text Recognition has allowed paleographers to speed up the manual transcription process, since they can work by correcting a draft transcription. An alternative is to obtain the draft transcription by dictating the contents to an Automatic Speech Recognition system. When both sources (image and speech) are available, a multimodal combination is possible, and an iterative process can be used to refine the final hypothesis. In this work, a multimodal combination based on confusion networks is presented. Results on two data sets of different difficulty show that the proposed technique provides similar or better draft transcriptions than a previously proposed approach, allowing for a faster transcription process.

Keywords: Iterative decoding; Acoustics; Proposals; Laplace equations; Integrated optics; Optical imaging

ICFHR2014 Competition on Handwritten Text Recognition on Transcriptorium Datasets (HTRtS)

International Workshop on Frontiers in Handwriting Recognition

Authors: Joan Andreu Sánchez; Verónica Romero; Alejandro H. Toselli; Enrique Vidal. Pattern Recognition and Human Language Technology Research Center, Universitat Politècnica de València, València, Spain

A contest on Handwritten Text Recognition (HTR) organised in the context of the ICFHR 2014 conference is described. Two tracks with increasing freedom in the use of training data were proposed, and three research groups participated in both. The handwritten images for this contest were drawn from an English data set currently being considered in the tranScriptorium project. The goal of this project is to develop innovative, efficient and cost-effective solutions for the transcription of historical handwritten document images, focusing on four languages: English, Spanish, German and Dutch. For English, tranScriptorium is considering the so-called "Bentham collection", a large set of manuscripts written by the renowned English philosopher and reformer Jeremy Bentham (1748-1832). A small subset of this collection was chosen for the present HTR competition. The selected subset was written by several hands (Bentham himself and his secretaries) and entails significant variability and difficulty regarding the quality of the text images and the writing styles. Training and test data were provided in the form of carefully segmented line images, along with the corresponding transcripts. The three participants achieved very good results, with transcription word error rates ranging from 15.0% down to 8.6%.

Keywords: Training; Hidden Markov models; Histograms; Artificial neural networks; Text recognition; Training data; Adaptive optics
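The results in the abstract above are reported as word error rates (WER). As a reminder of how this figure is usually obtained, here is a minimal sketch that assumes the standard definition: word-level edit distance divided by the number of reference words. The competition's exact scoring tool and text normalization are not described in the abstract, and the example sentences are invented.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # reference word missing from hypothesis
                curr[j - 1] + 1,     # extra word in hypothesis
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


# Invented transcript pair: two substituted words out of seven.
reference = "the greatest happiness of the greatest number"
hypothesis = "the greatest happiness for the great number"
print(f"WER = {word_error_rate(reference, hypothesis):.1%}")
```

Because insertions count as errors, WER computed this way can exceed 100% when the hypothesis is much longer than the reference.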

Active learning for interactive neural machine translation of data streams

arXiv, 2018

Authors: Peris, Álvaro; Casacuberta, Francisco. Pattern Recognition and Human Language Technology Research Center, Universitat Politècnica de València, València, Spain

We study the application of active learning techniques to the translation of unbounded data streams via interactive neural machine translation. The main idea is to select, from an unbounded stream of source sentences, those worth supervising by a human agent. The user interactively translates those samples. Once validated, these data are useful for adapting the neural machine translation model. We propose two novel methods for selecting the samples to be validated, which exploit the information from the attention mechanism of a neural machine translation system. Our experiments show that including active learning techniques in this pipeline reduces the effort required during the process while increasing the quality of the translation system. It also makes it possible to balance the human effort required to achieve a certain translation quality. Moreover, our neural system outperforms classical approaches by a large margin. Copyright © 2018, The Authors. All rights reserved.

Keywords: Neural machine translation

A neural, interactive-predictive system for multimodal sequence to sequence tasks

arXiv, 2019

Authors: Peris, Álvaro; Casacuberta, Francisco. Pattern Recognition and Human Language Technology Research Center, Universitat Politècnica de València, València, Spain

We present a demonstration of a neural interactive-predictive system for tackling multimodal sequence-to-sequence tasks. The system generates text predictions for different sequence-to-sequence tasks: machine translation, and image and video captioning. These predictions are revised by a human agent, who introduces corrections in the form of characters. The system reacts to each correction by providing alternative hypotheses that comply with the feedback provided by the user. The final objective is to reduce the human effort required during this correction process. The system is implemented following a client-server architecture. To access the system, we developed a website that communicates with the neural model, hosted on a local server. From this website, the different tasks can be tackled following the interactive-predictive framework. We open-source all the code developed for building this system. The demonstration is hosted at http://***/ interactive-seq2seq. Copyright © 2019, The Authors. All rights reserved.

Keywords: Websites
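The correction loop described in the abstract above (the user fixes a character, the system answers with a hypothesis that respects the fix) can be sketched as follows. This is a toy illustration under strong assumptions: the real system constrains a neural decoder, whereas here the "model" is a fixed, invented list of scored candidate captions, and all function names are hypothetical.

```python
def best_with_prefix(candidates, prefix):
    """Return the highest-scoring candidate text starting with `prefix`, or None."""
    matching = [(score, text) for text, score in candidates
                if text.startswith(prefix)]
    return max(matching)[1] if matching else None


def common_prefix_len(a, b):
    """Length of the longest common character prefix of two strings."""
    n = 0
    while n < len(a) and n < len(b) and a[n] == b[n]:
        n += 1
    return n


def simulate_session(candidates, reference):
    """Count the character corrections a simulated user needs to reach `reference`."""
    hypothesis = best_with_prefix(candidates, "") or ""
    corrections = 0
    while hypothesis != reference:
        agree = common_prefix_len(hypothesis, reference)
        corrections += 1
        if agree >= len(reference):
            # Hypothesis runs past the reference: one correction truncates it.
            break
        # The user validates the correct prefix and types the next character.
        prefix = reference[:agree + 1]
        # If no candidate fits, fall back to the validated prefix itself.
        hypothesis = best_with_prefix(candidates, prefix) or prefix
    return corrections


# Invented candidate list standing in for a model's scored hypotheses.
candidates = [
    ("a dog runs on the beach", 0.5),
    ("a dog plays on the beach", 0.3),
    ("a cat plays on the beach", 0.2),
]
print(simulate_session(candidates, "a dog plays on the beach"))
```

In this toy run a single typed character is enough to steer the system to the intended caption, which is the kind of effort reduction the interactive-predictive framework aims for.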

Online Learning for Neural Machine Translation Post-editing

arXiv, 2017

Authors: Peris, Álvaro; Cebrián, Luis; Casacuberta, Francisco. Pattern Recognition and Human Language Technology Research Center, Universitat Politècnica de València, València, Spain

Neural machine translation has revolutionized the field. Nevertheless, post-editing the outputs of the system is mandatory for tasks requiring high translation quality. Post-editing offers a unique opportunity to improve neural machine translation systems, by using online learning techniques and treating the post-edited translations as fresh training data. We review classical learning methods and propose a new optimization algorithm. We thoroughly compare online learning algorithms in a post-editing scenario. Results show significant improvements in translation quality and effort reduction. Copyright © 2017, The Authors. All rights reserved.

Keywords: Learning algorithms

Interactive-predictive neural multimodal systems

arXiv, 2019

Authors: Peris, Álvaro; Casacuberta, Francisco. Pattern Recognition and Human Language Technology Research Center, Universitat Politècnica de València, València, Spain

Despite the advances achieved by neural models in sequence-to-sequence learning, exploited in a variety of tasks, they still make errors. In many use cases, these are corrected by a human expert in a subsequent revision process. The interactive-predictive framework aims to minimize the human effort spent on this process by considering partial corrections for iteratively refining the hypothesis. In this work, we generalize the interactive-predictive approach, typically applied in the machine translation field, to tackle other multimodal problems, namely image and video captioning. We study the application of this framework to multimodal neural sequence-to-sequence models. We show that, following this framework, we approximately halve the effort spent on correcting the outputs generated by the automatic systems. Moreover, we deploy our systems in a publicly accessible demonstration that helps to better understand the behavior of the interactive-predictive framework. Copyright © 2019, The Authors. All rights reserved.

Keywords: Deep learning

Bridging the Native Language and Language Variety Identification Tasks

Procedia Computer Science, 2017, vol. 112, pp. 1554-1561

Authors: Marc Franco-Salvador; Greg Kondrak; Paolo Rosso. Symanto Research, 90425 Nuremberg, Germany; Pattern Recognition and Human Language Technology (PRHLT) Research Center, Universitat Politècnica de València, 46022 Valencia, Spain; Department of Computing Science, University of Alberta, Edmonton, AB T6G 2E8, Canada

The objective of Native Language Identification is to determine the native language of an author from a text written in another language. By contrast, Language Variety Identification aims at classifying texts representing different varieties of a single language. We postulate that both tasks may be reduced to a single objective, which is to identify the language variety of the text. We design a general approach that combines string kernels and word embeddings, which capture different characteristics of texts. The results of our experiments show that the approach achieves excellent results on both tasks, without any task-specific adaptations.

Keywords: Native Language Identification; Language Variety Identification; String Kernels; Word Embeddings; Classifier Combination
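The abstract above mentions string kernels without naming the variant used. As an illustration of the general idea only, the sketch below assumes a simple character n-gram spectrum kernel with cosine-style normalization. The function names, the choice of n, and the sample strings are assumptions, not the paper's configuration.

```python
import math
from collections import Counter


def char_ngrams(text, n):
    """Count every contiguous character n-gram in `text`."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def spectrum_kernel(a, b, n=3):
    """Sum, over shared n-grams, of the product of their counts in a and b."""
    counts_a, counts_b = char_ngrams(a, n), char_ngrams(b, n)
    return sum(count * counts_b[gram] for gram, count in counts_a.items())


def normalized_kernel(a, b, n=3):
    """Scale the kernel to [0, 1] so that text length matters less."""
    denom = math.sqrt(spectrum_kernel(a, a, n) * spectrum_kernel(b, b, n))
    return spectrum_kernel(a, b, n) / denom if denom else 0.0


# Spelling variants share most character trigrams; unrelated words share none.
print(normalized_kernel("colour", "color"))
print(normalized_kernel("colour", "window"))
```

A kernel of this kind compares texts at the character level, so it can pick up spelling and morphological cues without any tokenization, which complements the word-level information carried by embeddings.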