Author affiliations: Univ Macau, Dept Comp & Informat Sci, Lab NLP2CT, Macau, Peoples R China; Univ Lisbon, Inst Super Tecn, INESC ID, P-1000029 Lisbon, Portugal
Publication: IEEE-ACM TRANSACTIONS ON AUDIO SPEECH AND LANGUAGE PROCESSING (IEEE Trans. Audio Speech Lang. Process.)
Year, volume, issue: 2015, Vol. 23, No. 3
Pages: 441-450
Subject classification: 0808 [Engineering - Electrical Engineering]; 08 [Engineering]; 0702 [Science - Physics]
Funding: Science and Technology Development Fund of Macau and Research Committee of the University of Macau [057/2014/A, MYRG076 (Y1-L2)-FST13-WF, MYRG070 (Y1-L2)-FST12-CS]; national funds through Fundacao para a Ciencia e a Tecnologia (FCT) [UID/CEC/50021/2013]
Keywords: Graph propagation; natural language processing; neural word representation; syntax parsing
Abstract: This paper aims to learn a better probabilistic context-free grammar with latent annotations (PCFG-LA) by using a graph propagation (GP) technique. We propose leveraging GP to regularize the lexical model of the grammar. The proposed approach constructs k-nearest neighbor (k-NN) similarity graphs over words with identical pre-terminal (part-of-speech) tags, and uses them to propagate the probabilities of latent annotations given the words. The graphs capture the relationships between words at the syntactic and semantic levels, estimated with a neural word representation method based on the recursive autoencoder (RAE). We modify the conventional PCFG-LA parameter estimation algorithm, expectation maximization (EM), by adding a GP step after the M-step. GP encourages smoothness among the graph vertices, so that different words occurring in similar syntactic and semantic environments have similar posterior distributions over nonterminal subcategories. The proposed PCFG-LA learning approach was evaluated together with a hierarchical split-and-merge training strategy on parsing tasks for English, Chinese and Portuguese. The empirical results reveal two crucial findings: 1) regularizing the lexicons with GP improves parsing accuracy; and 2) learning with unlabeled data can also expand the PCFG-LA lexicons.
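The abstract does not give the propagation objective, so the following is only a minimal sketch of the general idea: build a k-NN graph over words that share a tag, then pull each word's distribution over latent subcategories toward the weighted average of its neighbors. Everything specific here is an assumption made for illustration: the function names `knn_graph` and `propagate`, the cosine similarity, the interpolation weight `mu`, the fixed iteration count, and the toy vectors, which stand in for the RAE-based word representations. It is not the paper's algorithm.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def knn_graph(vectors, k):
    """Map each word to its k most similar words, with non-negative weights."""
    graph = {}
    for word, vec in vectors.items():
        sims = [(cosine(vec, other_vec), other)
                for other, other_vec in vectors.items() if other != word]
        sims.sort(reverse=True)
        graph[word] = [(other, max(sim, 0.0)) for sim, other in sims[:k]]
    return graph


def propagate(post, graph, mu=0.5, iterations=10):
    """Smooth per-word distributions over latent subcategories.

    Each iteration interpolates a word's original distribution (weight 1 - mu)
    with the weighted average of its neighbors' current distributions
    (weight mu), then renormalizes.
    """
    current = {w: list(p) for w, p in post.items()}
    for _ in range(iterations):
        updated = {}
        for word, original in post.items():
            neighbors = graph.get(word, [])
            total = sum(weight for _, weight in neighbors)
            if total == 0.0:
                # Isolated vertex: nothing to propagate from.
                updated[word] = list(current[word])
                continue
            average = [
                sum(weight * current[other][i] for other, weight in neighbors) / total
                for i in range(len(original))
            ]
            mixed = [(1 - mu) * a + mu * b for a, b in zip(original, average)]
            norm = sum(mixed)
            updated[word] = [m / norm for m in mixed]
        current = updated
    return current


# Toy data (hypothetical): three words sharing one tag, two latent subcategories.
vectors = {"run": [1.0, 0.1], "walk": [0.9, 0.2], "eat": [0.1, 1.0]}
post = {"run": [0.9, 0.1], "walk": [0.5, 0.5], "eat": [0.1, 0.9]}

graph = knn_graph(vectors, k=1)
smoothed = propagate(post, graph)
for word, dist in smoothed.items():
    print(word, [round(p, 3) for p in dist])
```

In an EM loop of the kind the abstract describes, a step like `propagate` would run on the lexical posteriors after each M-step, once per pre-terminal tag. How the smoothed values are folded back into the grammar's lexical probabilities is not specified in the abstract and is left out of this sketch.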