This paper describes our system for the detection of stances in tweets submitted to SemEval 2016 Task 6A. The system uses an ensemble of learning algorithms, fine-tuned using a genetic algorithm. We experiment with va...
ISBN: 9782951740891 (print)
In this paper we describe VERBCROCEAN, a broad-coverage repository of fine-grained semantic relations between Croatian verbs. Adopting the methodology that Chklovski and Pantel (2004) used to acquire the English VerbOcean, we first acquire semantically related verb pairs from the hrWaC web corpus by relying on the distributional similarity of subject-verb-object paths in dependency trees. We then classify the semantic relation between each pair of verbs as similarity, intensity, antonymy, or happens-before, using a number of manually constructed lexico-syntactic patterns. We evaluate the quality of the resulting resource on a manually annotated sample of 1000 semantic verb relations. The evaluation reveals that the predictions are most accurate for the similarity relation and least accurate for the intensity relation. We make available two variants of VERBCROCEAN: a coverage-oriented version containing about 36k verb pairs at a precision of 41%, and a precision-oriented version containing about 5k verb pairs at a precision of 56%.
ISBN: 9782951740891 (print)
We introduce Cro36WSD, a freely available medium-sized lexical sample for Croatian word sense disambiguation (WSD). Cro36WSD comprises 36 words: 12 adjectives, 12 nouns, and 12 verbs, balanced across both frequency bands and polysemy levels. We adopt a multi-label annotation scheme in the hope of lessening the drawbacks of discrete sense inventories and obtaining more realistic annotations from human experts. Sense-annotated data is collected through multiple annotation rounds to ensure high-quality annotations: with an effort of 115 person-hours we reached an inter-annotator agreement score of 0.877. We analyze the obtained data and perform a correlation analysis between several relevant variables, including word frequency, number of senses, sense distribution skewness, average annotation time, and the observed inter-annotator agreement (IAA). Using the obtained data, we compile multi- and single-labeled dataset variants using different label aggregation schemes. Finally, we evaluate three different baseline WSD models on both dataset variants and report on the insights gained. We make both dataset variants freely available.
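The abstract does not say which label aggregation schemes were used. As a minimal sketch of how multi-label annotations could be collapsed into multi- and single-labeled variants, the example below keeps every sense chosen by at least a given share of annotators (multi-label) and the most frequently chosen sense (single-label). The function name `aggregate`, the `min_share` threshold, and the sense identifiers are all hypothetical and not taken from Cro36WSD.

```python
from collections import Counter


def aggregate(annotations, min_share=0.5):
    """Aggregate one instance's annotations (hypothetical scheme).

    annotations: list of sets, one set of sense labels per annotator.
    Returns (multi_label, single_label).
    """
    # Count how many annotators selected each sense.
    counts = Counter(sense for labels in annotations for sense in set(labels))
    threshold = min_share * len(annotations)
    # Multi-label variant: senses selected by at least min_share of annotators.
    multi = sorted(s for s, c in counts.items() if c >= threshold)
    # Single-label variant: most frequent sense, ties broken alphabetically.
    single = min(counts, key=lambda s: (-counts[s], s)) if counts else None
    return multi, single


# Three annotators label one occurrence of an ambiguous word.
example = [{"s1", "s2"}, {"s1"}, {"s2", "s3"}]
multi, single = aggregate(example)
```

Other schemes (for example, weighting annotators by reliability) would fit the same interface; the threshold-based vote is only one plausible choice.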
Identifying the main claims occurring across texts is important for large-scale argumentation mining from social media. However, the claims that users make are often unclear and build on implicit knowledge, effectivel...
ISBN: 9782951740891 (print)
Word sense induction (WSI) seeks to induce senses of words from unannotated corpora. In this paper, we address the WSI task for the Croatian language. We adopt the word clustering approach based on co-occurrence graphs, in which senses are taken to correspond to strongly interconnected components of co-occurring words. We experiment with a number of graph construction techniques and clustering algorithms, and evaluate the sense inventories both as a clustering problem and extrinsically on a word sense disambiguation (WSD) task. In the cluster-based evaluation, the Chinese Whispers algorithm outperformed Markov Clustering, yielding a normalized mutual information score of 64.3. In contrast, in the WSD evaluation Markov Clustering performed better, yielding an accuracy of about 75%. We are making available two induced sense inventories of the 10,000 most frequent Croatian words: one coarse-grained and one fine-grained inventory, both obtained using the Markov Clustering algorithm.
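To illustrate the idea of senses as strongly interconnected groups of co-occurring words, here is a minimal sketch of Chinese Whispers label propagation on a weighted co-occurrence graph. It is not the authors' implementation: the toy graph for the ambiguous word "miš" (mouse), the edge weights, and the alphabetical tie-breaking rule are assumptions made for the example.

```python
import random
from collections import defaultdict


def chinese_whispers(edges, iterations=20, seed=0):
    """Cluster an undirected weighted graph given as (u, v, weight) triples.

    Returns a dict mapping each node to a cluster label.
    """
    graph = defaultdict(dict)
    for u, v, w in edges:
        graph[u][v] = w
        graph[v][u] = w
    # Every node starts in its own cluster.
    labels = {node: node for node in graph}
    rng = random.Random(seed)
    nodes = sorted(graph)
    for _ in range(iterations):
        rng.shuffle(nodes)
        changed = False
        for node in nodes:
            # Sum edge weights per label among the node's neighbors.
            scores = defaultdict(float)
            for neighbor, w in graph[node].items():
                scores[labels[neighbor]] += w
            # Adopt the strongest label; ties go to the alphabetically
            # smallest label (an assumption, to keep the sketch deterministic).
            best = min(scores, key=lambda lab: (-scores[lab], lab))
            if best != labels[node]:
                labels[node] = best
                changed = True
        if not changed:
            break
    return labels


# Hypothetical context words of "miš": an animal sense and a computer sense.
edges = [
    ("mačka", "sir", 1.0), ("sir", "rep", 1.0), ("mačka", "rep", 1.0),
    ("tipkovnica", "ekran", 1.0), ("ekran", "klik", 1.0),
    ("tipkovnica", "klik", 1.0),
]
labels = chinese_whispers(edges)
```

Each distinct label in `labels` is read as one induced sense. On real co-occurrence graphs the groups are connected by weak edges, so the result depends on the graph construction and on the processing order.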
Author: Glavaš, Goran (University of Zagreb, Faculty of Electrical Engineering and Computing, Text Analysis and Knowledge Engineering Lab, Unska 3, 10000 Zagreb, Croatia)
Medical texts are filled with mentions of diseases, disorders, and other clinical conditions, with many different surface forms relating to the same condition. We describe MINERAL, a system for extraction and normaliz...
Online debates spark argumentative discussions from which generally accepted arguments often emerge. We consider the task of unsupervised identification of prominent arguments in online debates. As a first step, in t...
A distinguishing feature of many multiword expressions (MWEs) is their semantic non-compositionality. Determining the semantic compositionality of MWEs is important for many natural language processing tasks. We address the task of modeling the semantic compositionality of Croatian MWEs. We adopt a composition-based approach within the distributional semantics framework. We build and evaluate models based on Latent Semantic Analysis and the recently proposed neural network-based Skip-gram model, and experiment with different composition functions. We show that the compositionality scores predicted by the Skip-gram additive models correlate well with human judgments (correlation of 0.50). When framed as a classification task, the model achieves an accuracy of 0.64.
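A composition-based approach with an additive composition function can be sketched as follows: the vectors of the constituent words are summed, and the cosine similarity between that sum and the vector observed for the whole expression serves as the compositionality score. This is an illustrative sketch only. The abstract does not state how the score is computed, so the use of cosine similarity, the function names, and the toy vectors are assumptions.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def additive_compositionality(mwe_vec, constituent_vecs):
    """Score an MWE by comparing its vector with the sum of its parts."""
    composed = [sum(dims) for dims in zip(*constituent_vecs)]
    return cosine(mwe_vec, composed)


# Toy 3-dimensional vectors (hypothetical, not trained embeddings).
compositional = additive_compositionality(
    [1.0, 1.0, 0.0], [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
)
idiomatic = additive_compositionality(
    [0.0, 0.0, 1.0], [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
)
```

A score close to 1 suggests that the expression means roughly what its parts mean together, and a score close to 0 suggests an idiomatic expression. Thresholding the score would turn it into the kind of classifier the abstract mentions.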
When tweeting on a topic, Twitter users often post messages that convey the same or similar meaning. We describe TweetingJay, a system for detecting paraphrases and semantic similarity of tweets, with which we partici...