Natural language generation lies at the core of generative dialogue systems and conversational agents. We describe an ensemble neural language generator, and present several novel methods for data representation and a...
We build parallel feature decay algorithms (parfda) Moses statistical machine translation (SMT) models for language pairs in the translation task. parfda obtains results close to the top constrained phrase-based SMT w...
ISBN: 9781538660676 (print)
Text classification is a foundational task in natural language processing (NLP). Traditional methods rely heavily on human-designed features, while deep learning models based on neural networks can automatically capture contextual information. We explore and introduce various neural network architectures to extract information and key components in texts. An extensive set of experiments and comparisons on accuracy, speed, and memory consumption is conducted. Methods based on the proposed models won first place in the Zhihu Machine Learning Challenge 2017. The code has been made publicly available (1).
Public debate forums provide a common platform for exchanging opinions on a topic of interest. While recent studies in natural language processing (NLP) have provided empirical evidence that the language of the debate...
This paper investigates the use of Machine Translation (MT) to bootstrap a Natural Language Understanding (NLU) system for a new language for the use case of a large-scale voice-controlled device. The goal is to decre...
ISBN: 9781728137551 (digital)
ISBN: 9781728137568 (print)
Statistical Machine Translation (SMT) is a research hotspot in machine translation and natural language processing. Recently, source code translation based on SMT models has been applied to software engineering. Unfortunately, there is no automated metric that can effectively assess the accuracy of code translation. Considering the similarity between code similarity detection and the machine translation scoring process, this paper proposes the Code Semantic Metric (CSM), based on traditional code plagiarism detection metrics, and verifies its applicability to code translation tasks. Our empirical research shows that the results of different code plagiarism detection methods differ considerably. After specific parameter adjustment, CSM can reflect the semantic correctness of translated code to a certain extent. We confirm that CSM correlates highly with human judgment of the semantic accuracy of translated code, surpassing the scores of MOSS and JPlag, the mainstream traditional code plagiarism detection methods.
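To make the idea of scoring translated code with a similarity measure concrete, the following is a minimal sketch of a generic token n-gram overlap score. It is not CSM, MOSS, or JPlag, whose details the abstract does not give; the tokenizer, the function names, and the choice of Jaccard overlap on trigrams are all assumptions made for illustration.

```python
import re


def tokenize(code: str):
    # Assumed tokenizer: identifiers, integer literals, and single
    # punctuation characters; whitespace is ignored.
    return re.findall(r"[A-Za-z_]\w*|\d+|[^\s\w]", code)


def ngrams(tokens, n=3):
    # Set of all contiguous token n-grams.
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def token_similarity(reference: str, candidate: str, n: int = 3) -> float:
    # Jaccard overlap between the n-gram sets of the two snippets.
    # Hypothetical stand-in for a plagiarism-style similarity score.
    ref = ngrams(tokenize(reference), n)
    cand = ngrams(tokenize(candidate), n)
    if not ref and not cand:
        return 1.0
    return len(ref & cand) / len(ref | cand)
```

A purely lexical score like this rewards surface overlap only, which is one reason a metric aimed at semantic correctness would need more than token matching.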
ISBN: 9781538661338 (print)
With so much of our daily lives relying on digital devices like personal computers and cell phones, there is a growing demand for code that not only functions properly, but is secure and keeps user data safe. However, ensuring this is not an easy task, and many developers do not have the required skills or resources to ensure their code is secure. Many code analysis tools have been written to find vulnerabilities in newly developed code, but this technology tends to produce many false positives, and is still not able to identify all of the problems. Other methods of finding software vulnerabilities automatically are required. This proof-of-concept study applied natural language processing to Java byte code to locate SQL injection vulnerabilities in a Java program. Preliminary findings show that, due to the high number of terms in the dataset, single decision trees do not produce a suitable model for locating SQL injection vulnerabilities, while random forest structures proved more promising. Still, further work is needed to determine the best classification tool.
ISBN: 9791095546009 (print)
Duplicate Question Detection (DQD) is a natural language processing task under active research, with applications to fields like Community Question Answering and Information Retrieval. While DQD falls under the umbrella of Semantic Text Similarity (STS), these are often not seen as similar tasks of semantic equivalence detection, with STS being implicitly understood as concerning only declarative sentences. Nevertheless, approaches to STS have been applied to DQD and paraphrase detection, that is, to interrogatives and declaratives alike. We present a study that seeks to assess, under comparable conditions, the possibly different performance of state-of-the-art approaches to STS over different types of textual segments, most notably declaratives and interrogatives. This paper contributes to a better understanding of current mainstream methods for semantic equivalence detection, and to a better appreciation of the different results reported in the literature when these are obtained from different data sets with different types of textual segments. Importantly, it also contributes results on how data sets containing textual segments of a certain type can be used to improve the performance of resolvers for segments of other types.
ISBN: 9781450367479 (print)
Internet of Things (IoT) deployments are becoming increasingly automated and vastly more complex. Facilitated by programming abstractions such as trigger-action rules, end-users can now easily create new functionalities by interconnecting their devices and other online services. However, when multiple rules are simultaneously enabled, complex system behaviors arise that are difficult to understand or diagnose. While history tells us that such conditions are ripe for exploitation, at present the security states of trigger-action IoT deployments are largely unknown. In this work, we conduct a comprehensive analysis of the interactions between trigger-action rules in order to identify their security risks. Using IFTTT as an exemplar platform, we first enumerate the space of inter-rule vulnerabilities that exist within trigger-action platforms. To aid users in the identification of these dangers, we go on to present iRULER, a system that performs Satisfiability Modulo Theories (SMT) solving and model checking to discover inter-rule vulnerabilities within IoT deployments. iRULER operates over an abstracted information flow model that represents the attack surface of an IoT deployment, but we discover in practice that such models are difficult to obtain given the closed nature of IoT platforms. To address this, we develop methods that assist in inferring trigger-action information flows based on natural language processing. We develop a novel evaluative methodology for approximating plausible real-world IoT deployments based on the installation counts of 315,393 IFTTT applets, determining that 66% of the synthetic deployments in the IFTTT ecosystem exhibit the potential for inter-rule vulnerabilities. Combined, these efforts provide insight into the real-world dangers of IoT deployment misconfigurations.
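As a rough illustration of how simultaneously enabled trigger-action rules can interact, the sketch below flags pairs of rules where one rule's action event is another rule's trigger. This is not iRULER's method, which the abstract describes as SMT solving and model checking over an information flow model; the rule names, the event strings, and the assumption that actions and triggers share one event vocabulary are hypothetical.

```python
# Hypothetical rules: (rule name, trigger event, action event).
RULES = [
    ("r1", "motion_detected", "light_on"),
    ("r2", "light_on", "camera_off"),
    ("r3", "door_opened", "send_email"),
]


def find_rule_chains(rules):
    """Return (a, b) pairs where rule a's action is rule b's trigger."""
    # Index rule names by the event that triggers them.
    by_trigger = {}
    for name, trigger, _action in rules:
        by_trigger.setdefault(trigger, []).append(name)

    chains = []
    for name, _trigger, action in rules:
        # Any rule triggered by this rule's action forms a chain.
        for follower in by_trigger.get(action, []):
            chains.append((name, follower))
    return chains


chains = find_rule_chains(RULES)
```

With the sample rules, `r1` chains into `r2`: detected motion ends up switching the camera off, an outcome neither rule states on its own.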
The focus of data scientists is essentially divided into three areas: collecting data, analyzing data, and inferring information from data. Each of these tasks requires specialized personnel, takes time, and costs money. Yet the next, and most demanding, step is turning data into products. This field has therefore attracted the attention of many research groups in academia as well as industry. In recent decades, data-driven approaches have emerged and gained popularity because they require much less human effort. Natural language processing (NLP) is among the fields most strongly influenced by data. The growth of data is behind the performance improvement of most NLP applications, such as machine translation and automatic speech recognition. Consequently, many NLP applications are moving from rule-based systems and knowledge-based methods to data-driven approaches. However, data collected according to undefined design criteria or in technically unsuitable forms will be useless. They will also be neglected if their size is insufficient to perform the required analysis and to infer accurate information. The chief purpose of this overview is to shed some light on the vital role of data in various fields and to give a better understanding of data in light of NLP. Specifically, it describes what happens to data during its life cycle: the building, processing, analyzing, and exploring phases. (C) 2018 The Authors. Published by Elsevier B.V.