This paper presents a novel crawling strategy to locate bilingual sites. It does so by focusing on the Web graph neighborhood of these sites and exploring the patterns of the links in this region to guide its visitati...
ISBN: 9781450371223 (print)
Statistical language modeling techniques have successfully been applied to large source code corpora, yielding a variety of new software development tools, such as tools for code suggestion, improving readability, and API migration. A major issue with these techniques is that code introduces new vocabulary at a far higher rate than natural language, as new identifier names proliferate. Both large vocabularies and out-of-vocabulary issues severely affect Neural Language Models (NLMs) of source code, degrading their performance and rendering them unable to scale. In this paper, we address this issue by: 1) studying how various modelling choices impact the resulting vocabulary on a large-scale corpus of 13,362 projects; 2) presenting an open-vocabulary source code NLM that can scale to such a corpus, 100 times larger than in previous work, and outperforms the state of the art. To our knowledge, this is the largest NLM for code that has been reported.
Clustering text has been an important problem in the domain of natural language processing. While there are techniques to cluster text based on using conventional clustering techniques on top of contextual or non-cont...
ISBN: 0769524257 (print)
Identifying aspects at an early stage helps to achieve separation of crosscutting concerns in the initial system analysis, instead of deferring such decisions to later stages of design and code and thus having to perform costly refactorings. This paper describes the Early-AIM approach, which utilises corpus-based natural language processing (NLP) techniques to enable the identification and modelling of early aspects in a semi-automated way.
ISBN: 9780889865921 (print)
We demonstrate how the paradigm of complex networks can be used to model some aspects of the process of second language acquisition. When learning a new language, knowledge of 3000-4000 of the most frequent words appears to be a significant threshold, necessary to transfer reading skills from L1 to L2 (1). We show that this threshold corresponds to the transition from Zipf's law to a non-Zipfian regime in the rank-frequency plot of words of the English language. Using a large dictionary, we then construct a graph representing this dictionary, and study topological properties of subgraphs generated by the k most frequent words of the language. The clustering coefficient of these subgraphs reaches a minimum in the same place as the crossover point in the rank-frequency plot. We conjecture that the coincidence of all these thresholds may indicate a change in the language structure, which occurs when the vocabulary size reaches about 3000-4000 words.
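The subgraph measurement described in this abstract can be illustrated with a minimal sketch. The abstract does not say how dictionary entries are linked, so the edges below are purely hypothetical toy data, as are the placeholder words and the function name `average_clustering`. The sketch only shows how an average local clustering coefficient could be computed on the subgraph induced by the k most frequent words.

```python
from itertools import combinations


def average_clustering(adj, nodes):
    """Mean local clustering coefficient of the subgraph induced by `nodes`.

    `adj` maps each word to the set of its neighbours (undirected graph).
    Nodes with fewer than two neighbours inside the subgraph contribute 0.
    """
    nodes = set(nodes)
    if not nodes:
        return 0.0
    total = 0.0
    for v in nodes:
        # Keep only neighbours that are also inside the induced subgraph.
        nbrs = (adj.get(v, set()) & nodes) - {v}
        k = len(nbrs)
        if k < 2:
            continue
        # Count neighbour pairs that are themselves connected.
        links = sum(1 for a, b in combinations(nbrs, 2) if b in adj.get(a, set()))
        total += 2.0 * links / (k * (k - 1))
    return total / len(nodes)


# Hypothetical toy data: placeholder words in descending frequency order,
# with assumed (not source-given) undirected links between them.
ranked = ["w1", "w2", "w3", "w4"]
edges = [("w1", "w2"), ("w2", "w3"), ("w1", "w3"), ("w3", "w4")]

adj = {w: set() for w in ranked}
for a, b in edges:
    adj[a].add(b)
    adj[b].add(a)

# Clustering coefficient of the subgraph of the k most frequent words.
curve = {k: average_clustering(adj, ranked[:k]) for k in range(1, len(ranked) + 1)}
```

With a real dictionary graph and frequency list, plotting `curve` against k would be one way to look for the minimum that the abstract reports near 3000-4000 words.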
With the rapid development of intelligent networked vehicles and driverless technology, the importance of the dialogue between human and vehicle artificial intelligence has also become prominent. In the case of the dr...
ISBN: 9781450355537 (print)
Lately, with the increasing popularity of social media technologies, applying natural language processing to mine information from tweets has proved a challenging task and has attracted significant research effort. In contrast with news text and other formal content, tweets pose a number of new challenges due to their short and noisy nature. Thus, over the past decade, different Named Entity Recognition (NER) architectures have been proposed to solve this problem. However, most of them are based on handcrafted features and restricted to a particular domain, which imposes a natural barrier to generalizing across contexts. Despite the long line of work on NER in formal domains, there are no studies of NER for tweets in Portuguese (even though there are 17.97 million monthly active users). To bridge this gap, we present a new gold-standard corpus of tweets annotated for Person, Location, and Organization (PLO). Additionally, we perform multiple NER experiments using a variety of Long Short-Term Memory (LSTM) based models without resorting to any handcrafted rules. Our approach with a centered context input window of word embeddings yields an F1 score of 52.78, 38.68% higher than a state-of-the-art baseline system.
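As a rough illustration of what a centered context input window might look like, the sketch below builds, for each token, a fixed-width window with that token in the middle. The function name `centered_windows`, the `half_width` parameter, and the `<PAD>` symbol are assumptions made for this example and are not taken from the paper; in a full model each token in a window would then be mapped to its word embedding.

```python
def centered_windows(tokens, half_width, pad="<PAD>"):
    """Return one window per token, with that token at the center.

    Each window has 2 * half_width + 1 entries; positions that fall
    outside the tweet are filled with the padding symbol.
    """
    padded = [pad] * half_width + list(tokens) + [pad] * half_width
    width = 2 * half_width + 1
    return [padded[i:i + width] for i in range(len(tokens))]


# Hypothetical tokenized tweet used only to show the window layout.
tweet = ["Maria", "visitou", "Lisboa"]
windows = centered_windows(tweet, half_width=1)
```

A wider window gives the tagger more context per token at the cost of more padding on short tweets.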
The paper presents two manually annotated Slovene language text normalisation datasets, one of historical texts and the other of tweets, and proposes several variants of character-based statistical machine translation...
Driven by deep learning, natural language processing (NLP) has achieved great success in analyzing and understanding large volumes of text. As a result, a large number of communication transmission methods based on NLP...
ISBN: 1601322178 (print)
Humans communicate with others using natural language. Because many expressions in natural language can convey the same message, humans interpret these expressions flexibly based on their knowledge of words and association skills. An Association System was constructed on a computer by applying Concept Bases and the degree of association. This paper proposes a method for generating an association word from several other words with the Association System, to show that humanlike associative abilities can be achieved on a computer. The proposed method generated a natural human association with 61.0% accuracy and 77.0% recall.