ISBN:
(Print) 9549174336
The use of ever larger corpora for NLP research seems to reflect the folk theorem that increasing the size of the training data for supervised, and definitely for unsupervised, machine learning approaches will (always) improve the quality of the learning results for various NLP tasks. We challenge this general assumption in the light of empirical counterevidence. Following up on work in machine translation and word sense disambiguation, we wanted to estimate the necessary and sufficient, and hence fully adequate, size of the underlying training corpora. We conducted various experimental studies on the unsupervised disambiguation of ambiguous prepositional phrase attachments for English and German. Based on this evidence, we are able to estimate reasonable upper bounds on the sufficient size of a proper training corpus, for this task at least.
We present Legal Argument Reasoning (LAR), a novel task designed to evaluate the legal reasoning capabilities of Large Language Models (LLMs). The task requires selecting the correct next statement (from multiple choi...
We present experiments that analyze the necessity of using a highly interconnected word/sense graph for unsupervised all-words word sense disambiguation. We show that allowing only grammatically related words to influ...
ISBN:
(Print) 9781728150147
It is difficult for a language model (LM) to perform well with limited in-domain transcripts in low-resource speech recognition. In this paper, we summarize and extend several effective methods to make the most of out-of-domain data to improve LMs. These methods include data selection, vocabulary expansion, lexicon augmentation, and multi-model fusion, among others. The methods are integrated into a systematic procedure, which proves effective for improving both n-gram and neural network LMs. Additionally, word vectors pre-trained on out-of-domain data are utilized to improve the performance of RNN/LSTM LMs for rescoring first-pass decoding results. Experiments on five Asian languages from Babel Build Packs show that, after improving the LMs, a 5.4-7.6% relative reduction in word error rate (WER) is generally achieved compared to the baseline ASR systems. For some languages, we achieve lower WER than newly published results on the same data sets.
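The multi-model fusion idea in this abstract can be illustrated by the simplest case: linearly interpolating an in-domain LM with one estimated on out-of-domain data. The sketch below is a minimal, hypothetical unigram version, not the paper's actual procedure; the interpolation weight `lam` and the toy corpora are assumptions for illustration.

```python
from collections import Counter

def unigram_lm(tokens):
    """Maximum-likelihood unigram probabilities from a token list."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

def interpolate(p_in, p_out, lam):
    """Linear interpolation: lam * in-domain + (1 - lam) * out-of-domain."""
    vocab = set(p_in) | set(p_out)
    return {w: lam * p_in.get(w, 0.0) + (1 - lam) * p_out.get(w, 0.0)
            for w in vocab}

# Toy in-domain and out-of-domain corpora (hypothetical).
in_domain = "call mom call dad".split()
out_domain = "call the office now".split()
mixed = interpolate(unigram_lm(in_domain), unigram_lm(out_domain), lam=0.7)
```

The interpolated model remains a proper distribution (probabilities sum to one) while gaining coverage of out-of-domain words such as "office" that the in-domain LM never saw.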
Question detection plays an important role in the cQA question retrieval task. While detecting questions in a standard-language corpus is relatively easy, it becomes a great challenge for online content. Online questions are usually long and informal, and standard cues such as a question mark or 5W1H words are often absent. In this paper, we explore question characteristics in cQA services and propose an automated approach to detect question sentences based on lexical and syntactic features. Our model is capable of handling informal online language. The empirical evaluation further demonstrates that our model significantly outperforms traditional methods in detecting online question sentences, and that it considerably boosts question retrieval performance in cQA.
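The lexical cues the abstract mentions (question mark, 5W1H words) can be sketched as feature extractors feeding a simple rule. This is a minimal illustration under assumed cue lists, not the paper's learned model, which also uses syntactic features.

```python
import re

# 5W1H cue words plus a few auxiliaries that signal subject-aux inversion.
WH_WORDS = {"who", "what", "when", "where", "why", "how"}
AUX_WORDS = {"do", "does", "did", "is", "are", "can", "could", "would", "should"}

def question_features(sentence):
    """Extract simple lexical cues for question detection."""
    tokens = re.findall(r"[a-z']+", sentence.lower())
    return {
        "has_qmark": sentence.rstrip().endswith("?"),
        "starts_with_wh": bool(tokens) and tokens[0] in WH_WORDS,
        "has_wh": any(t in WH_WORDS for t in tokens),
        "starts_with_aux": bool(tokens) and tokens[0] in AUX_WORDS,
    }

def looks_like_question(sentence):
    """Rule-based stand-in for a trained classifier over these features."""
    f = question_features(sentence)
    return f["has_qmark"] or f["starts_with_wh"] or f["starts_with_aux"]
```

Note that the rule fires on "how do i reset my router" even without a question mark, which is exactly the informal-online-text case the paper targets; a learned classifier would weight such features rather than OR them together.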
ISBN:
(Print) 9789811016752; 9789811016745
In natural language processing (NLP), language identification is the problem of determining which natural language(s) are used in a written script. This paper presents a methodology for language identification from multilingual documents written in Indian language(s). The main objective of this research is to automatically, quickly, and accurately recognize the language(s) in a multilingual document written in Indian language(s) and then separate the content by language, using the Unicode Transformation Format (UTF). The proposed methodology is applicable as a preprocessing step in document classification and in a number of applications such as POS tagging, information retrieval, search engine optimization, and machine translation for Indian languages. Sixteen different Indian languages were used for the empirical study. The corpus texts were collected randomly from the web, and 822 documents were prepared, comprising 300 Portable Document Format (PDF) files and 522 text files. Each of the 822 documents contained more than 800 words written in different and multiple Indian languages at the sentence level. The proposed methodology has been implemented using UTF-8 through the free and open-source Java Server Pages (JSP) technology. Execution on the 522 text documents yielded an accuracy of 99.98%, whereas the 300 PDF documents yielded an accuracy of 99.28%. The accuracy on text files exceeds that on PDF files by 0.70%, due to corrupted text appearing in the PDF files.
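Because most Indian languages use distinct scripts with dedicated Unicode blocks, the UTF-based identification the abstract describes can be sketched by counting code points per block. The block ranges below come from the Unicode standard; treating script as a proxy for language, and the majority-count rule, are simplifying assumptions rather than the paper's exact algorithm.

```python
# Unicode block ranges for a few Indian scripts (per the Unicode standard).
SCRIPT_RANGES = {
    "Devanagari": (0x0900, 0x097F),
    "Bengali":    (0x0980, 0x09FF),
    "Tamil":      (0x0B80, 0x0BFF),
    "Telugu":     (0x0C00, 0x0C7F),
}

def identify_script(text):
    """Return the script whose Unicode block covers the most characters in text."""
    counts = {name: 0 for name in SCRIPT_RANGES}
    for ch in text:
        cp = ord(ch)
        for name, (lo, hi) in SCRIPT_RANGES.items():
            if lo <= cp <= hi:
                counts[name] += 1
    best = max(counts, key=counts.get)
    return best if counts[best] > 0 else "Unknown"
```

Running the same per-character check at sentence granularity is what allows a multilingual document to be split into per-language segments, as the paper proposes.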
ISBN:
(Digital) 9781510623118
ISBN:
(Print) 9781510623118
Automatically generating rich natural language descriptions for open-domain videos is among the most challenging tasks in computer vision, natural language processing, and machine learning. Based on the general encoder-decoder framework, we propose a bidirectional long short-term memory network with spatial-temporal attention over multiple features of objects, activities, and scenes, which can learn valuable and complementary high-level visual representations and dynamically focus on the most important context information across diverse frames within different subsets of videos. Experimental results show that our proposed methods achieve performance competitive with or better than the state of the art on the MSVD video dataset.
ISBN:
(Print) 9798891760615
Multi-modal open-domain question answering typically requires evidence retrieval from databases across diverse modalities, such as images, tables, and passages. Even Large Language Models (LLMs) like GPT-4 fall short in this task. To enable LLMs to tackle the task in a zero-shot manner, we introduce MOQAGPT, a straightforward and flexible framework. Using a divide-and-conquer strategy that bypasses intricate multi-modality ranking, our framework can accommodate new modalities and seamlessly transition to new models for the task. Built upon LLMs, MOQAGPT retrieves and extracts answers from each modality separately, then fuses this multi-modal information using LLMs to produce a final answer. Our methodology boosts performance on the MMCoQA dataset, improving F1 by +37.91 points and EM by +34.07 points over the supervised baseline. On the MultiModalQA dataset, MOQAGPT surpasses the zero-shot baseline, improving F1 by 9.5 points and EM by 10.1 points, and significantly closes the gap with supervised methods.
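The divide-and-conquer structure described above (answer per modality, then fuse) can be sketched in a few lines. The per-modality answerers and the majority-vote fusion below are toy stand-ins for MOQAGPT's actual retrievers and LLM-based fusion; all names here are hypothetical.

```python
from collections import Counter

def answer_question(question, modality_answerers, fuse):
    """Divide-and-conquer: query each modality independently, then fuse."""
    candidates = {name: answerer(question)
                  for name, answerer in modality_answerers.items()}
    return fuse(question, candidates)

def majority_fuse(question, candidates):
    """Pick the most frequent non-'unknown' candidate (a crude proxy for LLM fusion)."""
    votes = Counter(a for a in candidates.values() if a != "unknown")
    return votes.most_common(1)[0][0] if votes else "unknown"

# Toy stand-ins for real per-modality retriever/extractor pipelines.
modality_answerers = {
    "text":  lambda q: "Paris",
    "table": lambda q: "Paris",
    "image": lambda q: "unknown",
}
```

Because each modality is handled independently, adding a new modality only means adding one more entry to the dictionary, which mirrors the extensibility claim in the abstract.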
ISBN:
(Print) 1424407281
Language recognition is typically performed with methods that exploit phonotactics: a phone recognition language modeling (PRLM) system. A PRLM system converts speech to a lattice of phones and then scores a language model. A standard extension of this scheme uses multiple parallel phone recognizers (PPRLM). In this paper, we modify this approach in two distinct ways. First, we replace the phone tokenizer with a powerful speech-to-text system. Second, we use a discriminative support vector machine for language modeling. Our goals are twofold. First, we explore the ability of a single speech-to-text system to distinguish multiple languages. Second, we fuse the new system with an SVM PRLM system to see if it complements current approaches. Experiments on the 2005 NIST language recognition corpus show that the new word system accomplishes these goals and has significant potential for language recognition.
The construction of a speech understanding application requires a method for extracting language models of appropriate size and perplexity from the application grammar. We describe a method for approximating context-f...