ISBN (print): 9781622765034
Many linguistic and textual processes involve transduction of strings. We show how to learn a stochastic transducer from an unorganized collection of strings (rather than string pairs). The role of the transducer is to organize the collection. Our generative model explains similarities among the strings by supposing that some strings in the collection were not generated ab initio, but were instead derived by transduction from other, "similar" strings in the collection. Our variational EM learning algorithm alternately reestimates this phylogeny and the transducer parameters. The final learned transducer can quickly link any test name into the final phylogeny, thereby locating variants of the test name. We find that our method can effectively find name variants in a corpus of web strings used to refer to persons in Wikipedia, improving over standard untrained distances such as Jaro-Winkler and Levenshtein distance.
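To make the alternation concrete, here is a toy, self-contained Python sketch in the spirit of the model above. It runs hard EM: each string attaches to its most likely parent (or to a root, i.e. is treated as generated ab initio), and a single edit-probability parameter is reestimated from the chosen parent-child pairs. It substitutes a crude SequenceMatcher alignment for the learned stochastic transducer, ignores acyclicity constraints on the phylogeny, and every name and constant is illustrative rather than taken from the paper.

    import math
    from difflib import SequenceMatcher

    def edit_ops(parent, child):
        """Rough count of matched vs. edited characters between two strings."""
        blocks = SequenceMatcher(None, parent, child).get_matching_blocks()
        matched = sum(b.size for b in blocks)
        return matched, max(len(parent), len(child)) - matched

    def log_lik(parent, child, p_edit):
        matched, edits = edit_ops(parent, child)
        return matched * math.log(1 - p_edit) + edits * math.log(p_edit)

    def hard_em(strings, root_logp=-12.0, p_edit=0.3, n_iters=10):
        for _ in range(n_iters):
            # E-step: attach each string to its best parent, or to the root
            # (a real implementation must keep the parent graph acyclic).
            parents, m_tot, e_tot = {}, 0, 0
            for s in strings:
                best, best_lp = None, root_logp
                for t in strings:
                    if t == s:
                        continue
                    lp = log_lik(t, s, p_edit)
                    if lp > best_lp:
                        best, best_lp = t, lp
                parents[s] = best
                if best is not None:
                    m, e = edit_ops(best, s)
                    m_tot, e_tot = m_tot + m, e_tot + e
            # M-step: reestimate the edit probability from attached pairs.
            if m_tot + e_tot:
                p_edit = min(0.5, max(1e-3, e_tot / (m_tot + e_tot)))
        return parents, p_edit

    names = ["Barack Obama", "B. Obama", "Obama", "George Bush", "G. W. Bush"]
    print(hard_em(names))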
We present progress on Joshua, an open-source decoder for hierarchical and syntax-based machine translation. The main focus is describing Thrax, a flexible, open-source synchronous context-free grammar extractor. Thra...
The recursive least squares (RLS) algorithm is well known and has been widely used for many years. Most analyses of RLS have assumed statistical properties of the data or the noise process, but recent robust ℌ∞ analyses have been used to bound the ratio of the performance of the algorithm to the total noise. In this paper, we provide an additive analysis bounding the difference between performance and noise. Our analysis provides additional convergence guarantees in general, and particular benefits for structured input data. We illustrate the analysis using human speech and white noise.
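For reference, here is a minimal, self-contained Python version of the standard RLS recursion that such analyses study (no forgetting factor); the initialization constant delta and the synthetic test signal are assumptions for illustration, not values from the paper.

    import numpy as np

    def rls(X, d, delta=1e2):
        """Recursive least squares: fit w so that x_t . w tracks d_t."""
        n, dim = X.shape
        w = np.zeros(dim)
        P = delta * np.eye(dim)              # inverse-correlation estimate
        for t in range(n):
            x = X[t]
            k = P @ x / (1.0 + x @ P @ x)    # gain vector
            e = d[t] - x @ w                 # a priori prediction error
            w = w + k * e                    # parameter update
            P = P - np.outer(k, x @ P)       # rank-one update of P
        return w

    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 3))
    w_true = np.array([1.0, -2.0, 0.5])
    d = X @ w_true + 0.01 * rng.normal(size=200)
    print(rls(X, d))                         # should be close to w_true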
In many inference and learning tasks, collecting large amounts of labeled training data is time consuming and expensive, and oftentimes impractical. Thus, being able to efficiently use small amounts of labeled data wi...
ISBN (print): 9781618392480
We establish a novel task in the spirit of news summarization and topic detection and tracking (TDT): daily determination of the topics newly popular with Wikipedia readers. Central to this effort is a new public dataset consisting of the hourly page view statistics of all Wikipedia articles over the last three years. We give baseline results for three tasks: discovering individual pages of interest, clustering these pages into coherent topics, and extracting the most relevant summarizing sentence for the reader. Comparison against human judgements shows the viability of this task and opens the door to a range of exciting future work.
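One simple, purely illustrative way to operationalize the first task (discovering individual pages of interest) is to flag pages whose daily views spike far above their own trailing baseline. The z-score rule, window length, threshold, and example counts below are assumptions, not the paper's method or data.

    from statistics import mean, stdev

    def trending(pageviews, window=14, z_thresh=3.0):
        """pageviews: dict of article title -> list of daily view counts."""
        hits = []
        for title, counts in pageviews.items():
            history, today = counts[-window - 1:-1], counts[-1]
            sigma = stdev(history) or 1.0     # guard against flat histories
            z = (today - mean(history)) / sigma
            if z > z_thresh:
                hits.append((z, title))
        return [title for _, title in sorted(hits, reverse=True)]

    views = {
        "Hurricane_Sandy": [120] * 14 + [90000],
        "Pythagorean_theorem": [4900, 5100, 5000, 4950, 5050, 5000, 4900,
                                5100, 5000, 4950, 5050, 5000, 4900, 5100,
                                5100],
    }
    print(trending(views))                    # -> ['Hurricane_Sandy']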
Long-span language models, such as those involving syntactic dependencies, produce more coherent text than their n-gram counterparts. However, evaluating the large number of sentence hypotheses in a packed representation such as an ASR lattice is intractable under such long-span models, both during decoding and during discriminative training. The accepted compromise is to rescore only the N-best hypotheses in the lattice using the long-span LM. We present discriminative hill climbing, an efficient and effective discriminative training procedure for long-span LMs based on a hill climbing rescoring algorithm [1]. We empirically demonstrate significant computational savings as well as error-rate reductions over N-best training methods in a state-of-the-art ASR system for Broadcast News transcription.
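The rescoring core can be sketched as follows: starting from an initial hypothesis, greedily apply the single-word substitution that most improves the model score until no edit helps. This self-contained Python sketch illustrates hill-climbing rescoring only, not the discriminative training procedure itself; the toy scorer and per-position alternatives stand in for a real long-span LM and the local alternatives a lattice would supply.

    def hill_climb(hyp, alternatives, score):
        """hyp: list of words; alternatives: per-position candidate words."""
        improved = True
        while improved:
            improved = False
            best, best_score = hyp, score(hyp)
            for i, cands in enumerate(alternatives):
                for w in cands:
                    if w != hyp[i]:
                        cand = hyp[:i] + [w] + hyp[i + 1:]
                        s = score(cand)
                        if s > best_score:
                            best, best_score, improved = cand, s, True
            hyp = best
        return hyp

    # Toy "long-span" scorer: rewards one particular coherent sentence.
    target = "the cat sat on the mat".split()
    score = lambda h: -sum(a != b for a, b in zip(h, target))
    alternatives = [["the", "a"], ["cat", "cap"], ["sat", "sad"],
                    ["on", "in"], ["the", "a"], ["mat", "hat"]]
    start = "a cap sad in a hat".split()
    print(" ".join(hill_climb(start, alternatives, score)))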
Spoken term discovery is the task of automatically identifying words and phrases in speech data by searching for long repeated acoustic patterns. Initial solutions relied on exhaustive dynamic time warping-based searches across the entire similarity matrix, a method whose scalability is ultimately limited by the O(n²) nature of the search space. Recent strategies have attempted to improve search efficiency by using either unsupervised or mismatched-language acoustic models to reduce the complexity of the feature representation. Taking a completely different approach, this paper investigates the use of randomized algorithms that operate directly on the raw acoustic features to produce sparse approximate similarity matrices in O(n) space and O(n log n) time. We demonstrate that these techniques facilitate spoken term discovery performance capable of outperforming a model-based strategy in the zero-resource setting.
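The following sketch conveys the flavor of such randomized techniques using sign random projections: frames are hashed into buckets so that only frames sharing a hash key (likely near neighbors) get compared, yielding a sparse approximate similarity matrix rather than the dense O(n²) one. The bit width, feature dimension, and specific hashing scheme are assumptions for illustration, not necessarily the paper's choices.

    import numpy as np
    from collections import defaultdict

    def sparse_similarities(frames, n_bits=8, seed=0):
        rng = np.random.default_rng(seed)
        proj = rng.normal(size=(frames.shape[1], n_bits))
        bits = (frames @ proj) > 0                  # sign random projection
        keys = np.packbits(bits, axis=1).tobytes()  # one hash key per frame
        step = (n_bits + 7) // 8
        buckets = defaultdict(list)
        for i in range(len(frames)):
            buckets[keys[i * step:(i + 1) * step]].append(i)
        sims = {}                                   # sparse (i, j) -> cosine
        for idxs in buckets.values():               # compare within buckets only
            for a in range(len(idxs)):
                for b in range(a + 1, len(idxs)):
                    i, j = idxs[a], idxs[b]
                    u, v = frames[i], frames[j]
                    denom = np.linalg.norm(u) * np.linalg.norm(v) + 1e-9
                    sims[(i, j)] = u @ v / denom
        return sims

    frames = np.random.default_rng(1).normal(size=(1000, 39))  # MFCC-like
    sims = sparse_similarities(frames)
    print(len(sims), "of", 1000 * 999 // 2, "pairs scored")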
Accurate estimates of language model parameters are critical for building quality text generation systems, such as automatic speech recognition. However, text training data for a domain of interest is often unavailable. Instead, we use semi-supervised model adaptation; parameters are estimated using both unlabeled in-domain data (raw speech audio) and labeled out-of-domain data (text). In this work, we present a new semi-supervised language model adaptation procedure for Maximum Entropy models with n-gram features. We augment the conventional maximum likelihood training criterion on out-of-domain text data with an additional term to minimize conditional entropy on in-domain audio. Additionally, we demonstrate how to compute conditional entropy efficiently on speech lattices using first- and second-order expectation semirings. We demonstrate improvements in terms of word error rate over other adaptation techniques when adapting a maximum entropy language model from broadcast news to MIT lectures.
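The entropy computation can be illustrated on a toy lattice: under a first-order expectation semiring, each arc carries the pair (p, p log p), so a single forward pass over the DAG yields Z = Σ_paths p and r = Σ_paths p log p, from which the entropy of the path distribution is H = log Z − r/Z. The lattice below is a made-up two-path example; only the semiring identity itself follows the setup described above.

    import math
    from collections import defaultdict

    def times(x, y):   # (p1, r1) * (p2, r2) in the expectation semiring
        return (x[0] * y[0], x[0] * y[1] + y[0] * x[1])

    def plus(x, y):    # (p1, r1) + (p2, r2)
        return (x[0] + y[0], x[1] + y[1])

    def lattice_entropy(arcs, start, final):
        """arcs: list of (src, dst, prob), sorted in topological order."""
        alpha = defaultdict(lambda: (0.0, 0.0))
        alpha[start] = (1.0, 0.0)                  # the semiring's "one"
        for src, dst, p in arcs:
            w = (p, p * math.log(p))               # arc weight as a pair
            alpha[dst] = plus(alpha[dst], times(alpha[src], w))
        Z, r = alpha[final]
        return math.log(Z) - r / Z                 # entropy in nats

    # Tiny 2-path lattice: 0 ->(0.6) 1 ->(1.0) 2, and 0 ->(0.4) 2.
    arcs = [(0, 1, 0.6), (0, 2, 0.4), (1, 2, 1.0)]
    print(lattice_entropy(arcs, 0, 2))  # entropy of {0.6, 0.4} ~ 0.673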