ISBN: 9781941643730 (print)
Log-bilinear language models such as SkipGram and GloVe have been proven to capture high-quality syntactic and semantic relationships between words in a vector space. We revisit the relationship between the SkipGram and GloVe models from a machine learning viewpoint, and show that these two methods are easily merged into a unified form. Then, by using the unified form, we extract the factors of the configurations that they use differently. We also empirically investigate which factor is responsible for the performance difference often observed in widely examined word similarity and analogy tasks.
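For orientation, the two objectives being compared are commonly written as below. This is a sketch of the standard textbook forms, not the unified form derived in the paper; the notation (word vector $\mathbf{w}$, context vector $\mathbf{c}$ or $\tilde{\mathbf{w}}$, number of negative samples $k$, noise distribution $P_n$, co-occurrence count $X_{ij}$, weighting function $f$) is assumed here and does not come from the abstract.

```latex
% SkipGram with negative sampling: objective for one observed (word, context) pair
\ell_{\mathrm{SGNS}}(w, c)
  = \log \sigma\!\left(\mathbf{w}^{\top}\mathbf{c}\right)
  + k \, \mathbb{E}_{c' \sim P_n}\!\left[\log \sigma\!\left(-\mathbf{w}^{\top}\mathbf{c}'\right)\right]

% GloVe: weighted least-squares fit to log co-occurrence counts
J_{\mathrm{GloVe}}
  = \sum_{i,j} f(X_{ij})
    \left(\mathbf{w}_i^{\top}\tilde{\mathbf{w}}_j + b_i + \tilde{b}_j - \log X_{ij}\right)^{2}
```

Both score a word-context pair through the inner product of two vectors, which is the sense in which they are called log-bilinear.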
ISBN: 9781941643730 (print)
Pronouns are frequently dropped in Chinese sentences, especially in informal data such as text messages. In this work we propose a solution to recover dropped pronouns in SMS data. We manually annotate dropped pronouns in 684 SMS files and apply machine learning algorithms to recover them, leveraging lexical, contextual and syntactic information as features. We believe this is the first work on recovering dropped pronouns in Chinese text messages.
ISBN: 9781941643723 (print)
Modern statistical machine translation (SMT) systems usually use a linear combination of features to model the quality of each translation hypothesis. The linear combination assumes that all the features are in a linear relationship and constrains each feature to interact with the rest of the features in a linear manner, which might limit the expressive power of the model and lead to an underfit model on the current data. In this paper, we propose a non-linear model of the quality of translation hypotheses based on neural networks, which allows more complex interaction between features. A learning framework is presented for training the non-linear models. We also discuss possible heuristics for designing the network structure which may improve the non-linear learning performance. Experimental results show that, with the basic features of a hierarchical phrase-based machine translation system, our method produces translations that are better than those of a linear model.
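The contrast between the two scoring schemes can be sketched as follows. This is a minimal illustration, assuming a single tanh hidden layer; the feature values, weights, and function names are hypothetical and are not taken from the paper, whose actual network structure is not given in the abstract.

```python
import math


def linear_score(features, weights):
    """Linear model: the hypothesis score is a weighted sum of feature values."""
    return sum(w * f for w, f in zip(weights, features))


def nonlinear_score(features, hidden_weights, hidden_biases, output_weights):
    """One-hidden-layer network (an assumed architecture): each tanh unit
    mixes all features, so features can interact non-linearly."""
    hidden = [
        math.tanh(sum(w * f for w, f in zip(row, features)) + b)
        for row, b in zip(hidden_weights, hidden_biases)
    ]
    return sum(v * h for v, h in zip(output_weights, hidden))


# Hypothetical feature vectors for two translation hypotheses.
hypotheses = {"hyp_a": [-2.0, -1.5, 0.3], "hyp_b": [-1.0, -3.0, 0.8]}
weights = [0.5, 0.3, 0.2]
hidden_weights = [[0.4, -0.2, 0.1], [-0.3, 0.5, 0.7]]
hidden_biases = [0.0, 0.1]
output_weights = [1.0, -0.5]

for name, feats in hypotheses.items():
    print(
        name,
        linear_score(feats, weights),
        nonlinear_score(feats, hidden_weights, hidden_biases, output_weights),
    )
```

In the linear case each feature contributes independently of the others; in the network, the contribution of one feature depends on the values of the rest through the hidden units.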
ISBN: 9781538608043 (print)
Word-sense disambiguation is one of the key concepts in natural language processing. The main goal of a language is to present a specific concept to the audience. This concept is extracted from the meaning of words in that language. A system should be able to identify the role and meaning of words in order to identify the concepts in texts properly. This becomes more problematic when words take different meanings depending on their surrounding words. Given that various practical applications have been developed for the Persian language, it is now vital to find a solution for word-sense disambiguation in Persian. Lack of training data is the biggest challenge for word-sense disambiguation in Persian. To address this problem, a machine learning approach with minimal supervision is employed in this research. The applied method disambiguates word senses by considering defined features of target words and applying a collaborative learning method. A corpus extracted from news published by news agencies is used as the reference corpus. Evaluated on the available corpus for three ambiguous words, the implemented method was able to properly identify the meaning in 5368 documents with 88% recall, 95% precision and 93% accuracy.
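The three reported metrics can be computed from confusion counts as sketched below. This is a generic illustration for one sense treated as the positive class; the abstract does not say how the metrics were aggregated over senses or words, and the counts used here are hypothetical, not the paper's data.

```python
def precision_recall_accuracy(tp, fp, fn, tn):
    """Return (precision, recall, accuracy) from confusion-matrix counts.

    tp/fp/fn/tn are true positives, false positives, false negatives and
    true negatives for one sense treated as the positive class.
    """
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total if total else 0.0
    return precision, recall, accuracy


# Hypothetical counts, for illustration only.
precision, recall, accuracy = precision_recall_accuracy(tp=8, fp=2, fn=2, tn=8)
print(precision, recall, accuracy)
```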
Named Entity Recognition (NER) is a crucial technology in natural language processing, and currently, deep learning-based methods have been widely applied to Chinese entity recognition research. For relevant articles ...
ISBN: 9781941643723 (print)
A standard pipeline for statistical relational learning involves two steps: one first constructs the knowledge base (KB) from text, and then performs the learning and reasoning tasks using probabilistic first-order logics. However, a key issue is that information extraction (IE) errors from text affect the quality of the KB, and propagate to the reasoning task. In this paper, we propose a statistical relational learning model for joint information extraction and reasoning. More specifically, we incorporate context-based entity extraction with structure learning (SL) in a scalable probabilistic logic framework. We then propose a latent context invention (LCI) approach to improve the performance. In experiments, we show that our approach outperforms state-of-the-art baselines over three real-world Wikipedia datasets from multiple domains; that joint learning and inference for IE and SL significantly improve both tasks; and that latent context invention further improves the results.
ISBN: 9781941643723 (print)
Understanding open-domain text is one of the primary challenges in NLP. Machine comprehension evaluates the system's ability to understand text through a series of question-answering tasks on short pieces of text such that the correct answer can be found only in the given text. For this task, we posit that there is a hidden (latent) structure that explains the relation between the question, correct answer, and text. We call this the answer-entailing structure; given the structure, the correctness of the answer is evident. Since the structure is latent, it must be inferred. We present a unified max-margin framework that learns to find these hidden structures (given a corpus of question-answer pairs), and uses what it learns to answer machine comprehension questions on novel texts. We extend this framework to incorporate multi-task learning on the different subtasks that are required to perform machine comprehension. Evaluation on a publicly available dataset shows that our framework outperforms various IR and neural-network baselines, achieving an overall accuracy of 67.8% (vs. 59.9%, the best previously published result).
ISBN: 9781941643730 (print)
Methods for name matching, an important component to support downstream tasks such as entity linking and entity clustering, have focused on alphabetic languages, primarily English. In contrast, logogram languages such as Chinese remain untested. We evaluate methods for name matching in Chinese, including both string matching and learning approaches. Our approach, based on new representations for Chinese, improves both name matching and a downstream entity clustering task.
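As a point of reference for the string-matching side, a character-level edit distance can be sketched as below. This is a generic Levenshtein baseline, assumed here for illustration; it is not the paper's method, which the abstract says relies on new representations for Chinese. The function names are hypothetical.

```python
def edit_distance(a, b):
    """Levenshtein distance between two strings, counted in characters."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def name_similarity(a, b):
    """Normalize edit distance into a similarity score in [0, 1]."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - edit_distance(a, b) / longest


# Two common transliterations of "Obama" differ in one character.
print(edit_distance("欧巴马", "奥巴马"), name_similarity("欧巴马", "奥巴马"))
```

Because Chinese names are short, a single differing character already removes a large share of the similarity score, which is one reason plain character matching is a weak baseline for this setting.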
ISBN: 9798350351491; 9798350351484 (print)
This study used a multi-pronged approach to examine how Gulf region Twitter users felt about COVID-19 trending topics. The data collection phase of the study started with the selection of relevant hashtags and time frames for analysis. The Twitter API was then used to compile a representative dataset of Arabic tweets pertaining to the COVID-19 trending topics. Subsequently, the research proceeded to the data annotation stage, employing a hybrid annotation technique that fused a transfer learning model with a lexicon-based approach to assign a sentiment label to every tweet. Analyzing the distribution of tweets over time exposed interesting patterns and possible influencers of sentiment expression. The research obtained good accuracy scores by using a sentiment analysis model that combined three popular machine learning algorithms (Multinomial Naive Bayes, CountVectorizer, and TfidfVectorizer) with three feature representations (Ngram, TfidfVectorizer, and CountVectorizer). These scores revealed the sentiment tendencies of Arabic-speaking Twitter users toward trending topics. With the Ngram(1,2) representation, the LinearSVC algorithm achieved an accuracy score of 89.1%, making it stand out as the best performer among all feature representations.
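The Ngram(1,2) representation can be illustrated with a small standard-library sketch. It assumes that "Ngram(1,2)" means unigrams plus bigrams, in the sense of scikit-learn's `ngram_range=(1, 2)` setting; the abstract does not define the notation. The function name and the example text are hypothetical, and whitespace tokenization is a simplification.

```python
from collections import Counter


def ngram_counts(tokens, n_min=1, n_max=2):
    """Count all word n-grams with n_min <= n <= n_max in a token list."""
    counts = Counter()
    for n in range(n_min, n_max + 1):
        for i in range(len(tokens) - n + 1):
            counts[" ".join(tokens[i:i + n])] += 1
    return counts


# Hypothetical tweet text, split on whitespace for simplicity.
tokens = "stay home stay safe".split()
features = ngram_counts(tokens)
print(features)
```

Each distinct unigram or bigram becomes one feature dimension, and the resulting count vector is what a linear classifier such as LinearSVC would take as input.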
ISBN: 9781941643730 (print)
A lightweight, human-in-the-loop evaluation scheme for machine translation (MT) systems is proposed. It extrinsically evaluates MT systems using human subjects' scores on second language ability test problems that are machine-translated to the subjects' native language. A large-scale experiment involving 320 subjects revealed that the context-unawareness of the current MT systems severely damages human performance when solving the test problems, while one of the evaluated MT systems performed as well as a human translation produced in a context-unaware condition. An analysis of the experimental results showed that the extrinsic evaluation captured a different dimension of translation quality than that captured by manual and automatic intrinsic evaluation.