This thesis deals with the development of corpus tools for building corpora of religious and historical texts. The corpus is intended to support data ingestion, text preprocessing, statistics calculation, and qualitative and quantitative text analysis. All these features are customizable. The Big Data approach here means that the corpus tools are treated as a data platform and the corpus itself as a combination of data lake and data warehouse solutions. Ways of resolving the algorithmic, methodological and architectural problems that arise while building a corpus tool are suggested. The effectiveness of natural language processing and natural language understanding methods, libraries and tools was checked on the example of building corpora of historical and religious texts. Workflows were created that comprise data extraction from sources, data transformation, data enrichment and loading into corpus storage with proper qualitative and quantitative characteristics. Data extraction approaches common for ingestion into a data lake were used. Transformations and enrichments were realized by means of natural language processing and natural language understanding techniques. Statistical characteristics were calculated by means of machine learning techniques. Finding keywords and the relations between them became possible thanks to latent semantic analysis, term and N-gram frequencies, and term frequency-inverse document frequency. Computational complexity and the amount of information noise were reduced by singular value decomposition. The influence of singular value decomposition parameters on text processing accuracy was analyzed. The results of a corpus-based computational experiment for religious text concept analysis are shown. The architectural approaches to building a corpus-based data platform and the usage of software tools, frameworks and specific libraries have been su
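As a rough illustration of two of the statistics named in this abstract (N-gram frequencies and term frequency-inverse document frequency), the following is a minimal sketch in plain Python. It is not the thesis's implementation: the whitespace tokenizer, the toy documents, the unsmoothed `log(N / df)` weighting and the function names `tokenize`, `ngram_counts` and `tf_idf` are all assumptions made for the example, and the latent semantic analysis and singular value decomposition steps are not shown.

```python
import math
from collections import Counter


def tokenize(text):
    # Assumption: lowercase whitespace tokenization stands in for real preprocessing.
    return text.lower().split()


def ngram_counts(tokens, n):
    # Count every run of n consecutive tokens.
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def tf_idf(docs):
    # Returns one {term: score} dict per document.
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)

    # Document frequency: in how many documents each term occurs.
    df = Counter()
    for tokens in tokenized:
        df.update(set(tokens))

    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)
        total = len(tokens)
        # Relative term frequency times unsmoothed inverse document frequency.
        scores.append(
            {t: (c / total) * math.log(n_docs / df[t]) for t, c in tf.items()}
        )
    return scores


# Hypothetical toy corpus, invented for illustration only.
docs = [
    "the covenant of the prophet",
    "the chronicle of the king",
    "the prophet and the king",
]
scores = tf_idf(docs)
bigrams = ngram_counts(tokenize(docs[0]), 2)
```

With this weighting, a term that occurs in every document (here "the") scores zero, while a term confined to one document ("covenant") outranks one shared by two ("prophet"), which is the property that makes the measure usable for keyword finding.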
Semi-supervised learning is an efficient method to augment training data automatically from unlabeled data. Development of many natural language understanding (NLU) applications has a challenge where unlabeled data is...
Recent advances in reading comprehension have resulted in models that surpass human performance when the answer is contained in a single, continuous passage of text. However, complex Question Answering (QA) typically ...
Surgery for glial tumors of the brain located in the vicinity of the motor areas is associated with a high risk of increasing neurological deficits. Motor deficit affects overall survival in this group of patients. Nowadays, ...
The Textgraphs-13 Shared Task on Explanation Regeneration (Jansen and Ustalov, 2019) asked participants to develop methods to reconstruct gold explanations for elementary science questions. Red Dragon AI's entries...
Word embeddings continue to be of great use for NLP researchers and practitioners due to their training speed and ease of use and distribution. Prior work has shown that the representation of those words can be im...
ISBN: 9781450371155 (print)
Searchable Encryption can bridge the gap between privacy protection and data utilization. Because it leaks access patterns to attain practical search performance, it is vulnerable to advanced attacks. While these advanced attacks show significant privacy leakage, their assumptions are often strong and the methods that can be used to mitigate them are limited. In this paper, we investigate one of these advanced attacks, referred to as file-injection attacks, and examine whether this attack is viable in practice. In addition, we propose a defense method to mitigate file-injection attacks. By leveraging natural language processing, we formulate the generation of injected files in the attack as an automated text generation problem with restrictions on word selection, and then we tackle the problem with n-grams and Recursive Neural Networks. We formulate the proposed defense as a semantic analysis problem, in which we extract linguistic features and address the problem using machine learning. Our experimental results on real-world datasets suggest two interesting observations. First, automatically generating injected files in the attack will result in low semantics in files. Second, it is viable to automatically detect injected files based on semantics and mitigate file-injection attacks.
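The idea of text generation with restrictions on word selection can be pictured with a small bigram model. The sketch below is not the paper's method: it assumes the restriction takes the form of a set of allowed words, uses greedy selection, forbids repeated words to avoid loops, and omits the neural network component entirely. The names `train_bigrams` and `generate` and the sample sentences are hypothetical.

```python
from collections import Counter, defaultdict


def train_bigrams(sentences):
    # Map each word to a Counter of the words observed immediately after it.
    model = defaultdict(Counter)
    for sentence in sentences:
        tokens = sentence.lower().split()
        for prev, nxt in zip(tokens, tokens[1:]):
            model[prev][nxt] += 1
    return model


def generate(model, start, allowed, max_len=8):
    # Greedily extend the text, keeping only successors from the allowed set.
    out = [start]
    while len(out) < max_len:
        successors = model.get(out[-1], {})
        # Assumption: no word is reused, a simplification that prevents loops.
        candidates = {
            w: c for w, c in successors.items() if w in allowed and w not in out
        }
        if not candidates:
            break
        # Most frequent successor first; ties are broken alphabetically.
        out.append(min(candidates, key=lambda w: (-candidates[w], w)))
    return out


# Hypothetical training sentences, invented for illustration only.
corpus = [
    "please review the attached report",
    "please send the report today",
    "the report is attached",
]
model = train_bigrams(corpus)
allowed = {"please", "send", "the", "report", "today"}
words = generate(model, "please", allowed)
print(" ".join(words))
```

The tighter the allowed set, the fewer successors survive the filter at each step, which gives an intuition for why text generated under such constraints tends to carry little meaning.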
In this work we describe the system from the Natural Language Processing group at Arizona State University for the Textgraphs 2019 Shared Task. The task focuses on Explanation Regeneration, an intermediate step towards ge...
In this work, we propose a rule-based method to identify the event type and create a frame for that event. With the help of the Natural Language Toolkit (NLTK) and preloaded SpaCy models, we have tried to define certain m...
Speech translation systems usually follow a pipeline approach, using word lattices as an intermediate representation. However, previous work assumes access to the original transcriptions used to train the ASR system, w...