Would you target your audience differently if you knew the real age and gender of the authors posting on your website's forum? This paper examines hundreds of thousands of online documents, such as chat lines and blog posts, and shows that computers can address this profiling task better than humans, without relying on content stereotypes. Because age and gender profiling are not independent problems, we approach the task as a multiclass classification problem, combining the age and gender information to define six classes. Using a wide range of stylistic and content features and a large number of readability measures, we demonstrate the high predictive power of parts of speech, punctuation, and the amount of emotion and slang used in the text, independently of the topic discussed.
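The joint labelling described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the abstract only states that age and gender are combined into six classes, so the three age-group names and the two gender names used here are hypothetical placeholders.

```python
from itertools import product

# Hypothetical group names; the abstract does not say how ages are binned.
AGE_GROUPS = ("young", "adult", "senior")
GENDERS = ("female", "male")

# One class per (age group, gender) pair: 3 x 2 = 6 joint classes.
CLASSES = [f"{age}_{gender}" for age, gender in product(AGE_GROUPS, GENDERS)]


def joint_label(age_group: str, gender: str) -> int:
    """Map an (age group, gender) pair to a single class index."""
    return CLASSES.index(f"{age_group}_{gender}")
```

A single classifier trained on these joint labels predicts both attributes at once, which is one way to reflect the observation that the two profiling problems are not independent.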
ISBN: 9781937284497 (print)
Our system combines text similarity measures with a textual entailment system. In the main task, we focused on the influence of lexicalized versus unlexicalized features, and how they affect performance on unseen questions and domains. We also participated in the pilot partial entailment task, where our system significantly outperforms a strong baseline. © 2013 Association for Computational Linguistics
Textual entailment is an asymmetric relation between two text fragments that describes whether one fragment can be inferred from the other. It thus cannot capture the notion that the target fragment is "almost en...
With the increasing amount of user generated reference texts in the web, automatic quality assessment has become a key challenge. However, only a small amount of annotated data is available for training quality assess...
In this paper, we analyze a novel set of features for the task of automatic edit category classification. Edit category classification assigns categories such as spelling error correction, paraphrase or vandalism to e...
We propose a semi-informative aware approach that applies a topic model to the query expansion problem in the biomedical domain. Demographic and disease information is used to semi-structure the topic model as “known” labels, in contrast to the latent topics of traditional topic modelling. We then select three terms from the top-ranked documents to expand the query, based on the pseudo relevance feedback assumption that the top-ranked results of the first retrieval round are relevant. We conduct experiments on the TREC medical records data sets, with extensive analysis and discussion. We achieve improvements of 7.41% in MAP, 9.29% in Bpref and 5.60% in P@10 over strong baselines.
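The pseudo relevance feedback step can be illustrated with a minimal sketch. It does not reproduce the paper's topic-model-based term selection: as a stand-in, it simply picks the most frequent non-query terms from the top-ranked documents. The function name `expand_query` and the parameters `top_k` and `n_terms` are assumptions introduced for illustration; only the choice of three expansion terms comes from the abstract.

```python
from collections import Counter


def expand_query(query_terms, ranked_docs, top_k=2, n_terms=3):
    """Append n_terms frequent terms from the top_k ranked documents.

    ranked_docs is a list of tokenized documents (lists of strings),
    ordered by the first retrieval round. The top_k documents are
    assumed relevant, as in pseudo relevance feedback.
    """
    seen = set(query_terms)
    counts = Counter(
        term
        for doc in ranked_docs[:top_k]
        for term in doc
        if term not in seen  # do not re-add terms already in the query
    )
    # Sort by descending frequency, then alphabetically to break ties.
    best = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))[:n_terms]
    return list(query_terms) + [term for term, _ in best]
```

For example, with the query `["diabetes"]` and two top-ranked documents that mention "insulin", "glucose" and "diet", the expanded query gains those three terms. A real system would also filter stop words and, in the approach described above, weight candidate terms using the topic model.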
A table-of-contents (TOC) provides a quick reference to a document's content and structure. We present the first study on identifying the hierarchical structure for automatically generating a TOC using only textual features instead of structural hints such as HTML tags. We create two new datasets to evaluate our approaches for hierarchy identification. We find that our algorithm performs at a level that is sufficient for a fully automated system. For documents without given segment titles, we extend our work by automatically generating segment titles. We make the datasets and our experimental framework publicly available in order to foster future research in TOC generation.
This paper introduces a general method to incorporate the LDA Topic Model into text segmentation algorithms. We show that semantic information added by Topic Models significantly improves the performance of two wordba...
ISBN: 9781622760428 (print)
In this paper, we propose an annotation schema for the discourse analysis of Wikipedia Talk pages aimed at the coordination efforts for article improvement. We apply the annotation schema to a corpus of 100 Talk pages from the Simple English Wikipedia and make the resulting dataset freely available for download. Furthermore, we perform automatic dialog act classification on Wikipedia discussions and achieve an average F1-score of 0.82 with our classification pipeline.