ISBN (print): 9781605584874
We present a novel method for key term extraction from text documents. In our method, a document is modeled as a graph of semantic relationships between the terms of that document. We exploit a remarkable feature of this graph: terms related to the main topics of the document tend to bunch up into densely interconnected subgraphs, or communities, while unimportant terms fall into weakly interconnected communities or even become isolated vertices. We apply graph community detection techniques to partition the graph into thematically cohesive groups of terms, and we introduce a criterion function to select the groups that contain key terms while discarding groups of unimportant terms. To weight terms and to determine semantic relatedness between them, we exploit information extracted from Wikipedia. This approach gives us two advantages. First, it allows effective processing of multi-theme documents. Second, it is good at filtering out noise in the document, such as navigation bars or headers in web pages. Evaluations show that the method outperforms existing approaches, producing key terms with higher precision and recall. Additional experiments on web pages show that our method is substantially more effective on noisy and multi-theme documents than existing methods. Copyright is held by the International World Wide Web Conference Committee (IW3C2).
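The pipeline above can be sketched in a few lines. This is a minimal, illustrative version only: the relatedness scores and term weights are placeholders for the Wikipedia-derived values the abstract mentions, the exact form of the criterion function is assumed (average term weight against a threshold), and full community detection is stood in for by connected components of the graph after weak edges are dropped.

```python
# Sketch of community-based key term extraction. Edge weights and term
# weights are assumed inputs (the paper derives them from Wikipedia).
from collections import defaultdict

def communities(edges, threshold=0.3):
    """Group terms into communities: connected components of the term graph
    after dropping weak edges (a simple stand-in for community detection)."""
    adj = defaultdict(set)
    nodes = set()
    for a, b, w in edges:
        nodes.update((a, b))
        if w >= threshold:
            adj[a].add(b)
            adj[b].add(a)
    seen, groups = set(), []
    for n in nodes:
        if n in seen:
            continue
        stack, comp = [n], set()
        while stack:
            v = stack.pop()
            if v in comp:
                continue
            comp.add(v)
            stack.extend(adj[v] - comp)
        seen |= comp
        groups.append(comp)
    return groups

def extract_key_terms(edges, term_weights, threshold=0.3, min_avg=0.5):
    """Keep communities passing an assumed criterion: average term weight."""
    key = []
    for comp in communities(edges, threshold):
        avg = sum(term_weights.get(t, 0.0) for t in comp) / len(comp)
        if avg >= min_avg:
            key.extend(comp)
    return sorted(key)
```

Noise such as navigation links tends to form its own low-weight community here, so it is discarded by the criterion rather than by ad-hoc filtering.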
Extracting information from tables is an important and rather complex part of information retrieval. For the task of extracting objects from HTML tables, we introduce methods for determining table orientation and for processing aggregating objects (such as Total rows) and scattered headers (super-row labels, subheaders).
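One common way to decide table orientation, sketched below under assumptions not stated in the abstract: cells belonging to one attribute tend to share a type, so if objects occupy rows, the columns of the data region are type-homogeneous, and vice versa. The heuristic only discriminates when attributes mix numeric and textual types.

```python
# Illustrative orientation heuristic; the paper's actual method may differ.
def is_numeric(cell):
    try:
        float(cell.replace(",", ""))
        return True
    except ValueError:
        return False

def orientation(table):
    """Return 'horizontal' if objects occupy rows under a header row,
    'vertical' if objects occupy columns beside a header column."""
    def homogeneity(lines):
        # Fraction of cells agreeing with the majority type, per line.
        scores = []
        for line in lines:
            n = sum(is_numeric(c) for c in line)
            scores.append(max(n, len(line) - n) / len(line))
        return sum(scores) / len(scores)
    data = [row[1:] for row in table[1:]]  # strip both candidate headers
    cols = [[r[j] for r in data] for j in range(len(data[0]))]
    return "horizontal" if homogeneity(cols) >= homogeneity(data) else "vertical"
```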
This paper presents an experimental evaluation of state-of-the-art approaches to automatic term recognition based on multiple features: a machine learning method and a voting algorithm. We show that in most cases the machine learning approach obtains the best results and needs little training data; we also identify the best subsets of the popular features.
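For concreteness, a Borda-style rank combination is one plausible instance of a multi-feature voting algorithm (the abstract does not specify the exact formulation): each feature ranks all candidate terms by its score, and a candidate's total is the sum of its per-feature ranks, lower being better.

```python
# Hypothetical voting combiner over term-hood features (e.g. TF-IDF, C-value).
def vote(feature_scores, top_k):
    """feature_scores: {feature_name: {term: score}}. Returns top_k terms
    by summed Borda rank across all features."""
    candidates = sorted(next(iter(feature_scores.values())))
    totals = {c: 0 for c in candidates}
    for scores in feature_scores.values():
        ranked = sorted(candidates, key=lambda c: -scores[c])
        for rank, c in enumerate(ranked):
            totals[c] += rank
    return sorted(candidates, key=lambda c: totals[c])[:top_k]
```

Unlike the machine learning approach, such voting needs no training data at all, which is the usual trade-off between the two families compared in the paper.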
Binary code comparison tools are widely used to analyze vulnerabilities, search for malicious code, detect copyright violations, etc. The article discusses three tools - BCC, BinDiff, Diaphora. Those are based on stat...
Distribution is a well-known solution for increasing performance and providing load balancing when optimal resource utilization is needed. Together with replication, it also improves reliability, accessibility, and fault tolerance. However, since the amount of data is large, there is a problem of maintaining meta-information about the distribution and of finding the needed data fragments during query execution. These problems are well understood, but they have not received much attention in the context of XML data management. This paper presents research in progress that examines the possibility of managing meta-information about XML data distribution by extending an auxiliary index structure called DataGuide.
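A DataGuide summarizes an XML document by storing each distinct label path exactly once, which makes it a natural carrier for distribution metadata. The sketch below shows one way such an extension could look; the node class, site names, and the idea of annotating every path node with the set of sites holding matching fragments are illustrative assumptions, not the paper's concrete design.

```python
# Hypothetical DataGuide extended with distribution meta-information.
class DataGuideNode:
    """One node per distinct label path; `sites` records which cluster
    nodes store fragments reachable by that path."""
    def __init__(self):
        self.children = {}   # label -> DataGuideNode
        self.sites = set()   # sites holding fragments for this path

def add_path(root, labels, site):
    """Register that `site` stores data under the given label path."""
    node = root
    for label in labels:
        node = node.children.setdefault(label, DataGuideNode())
        node.sites.add(site)

def sites_for(root, labels):
    """Answer the routing question during query execution: which sites
    must be contacted for this label path?"""
    node = root
    for label in labels:
        if label not in node.children:
            return set()
        node = node.children[label]
    return node.sites
```

With this structure a query over `/catalog/book` can be routed to only the sites that actually hold book fragments, instead of being broadcast to the whole cluster.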
ISBN (print): 9781457706066
There are thousands of software libraries being developed in the modern world: completely new libraries emerge, and new versions of existing ones appear regularly. Unfortunately, the developers of many libraries focus on developing the functionality of the library itself but neglect ensuring high quality and backward compatibility of the application programming interfaces (APIs) their libraries provide. The best practice for addressing these aspects is an automated regression test suite that can be run regularly (e.g., nightly) against the current development version of the library. Such a test suite ensures early detection of any regressions in the quality or compatibility of the library. But developing a good test suite can cost a significant amount of effort, which becomes an inhibiting factor for library developers when deciding on a QA policy; that is why many libraries have no test suite at all. This paper discusses an approach for low-cost automatic generation of basic tests for shared libraries, based on information automatically extracted from the library header files and on additional information about the semantics of some library data types. Such tests can call the APIs of target libraries with correct parameters and can detect typical problems like crashes "out of the box". Using this method significantly lowers the barrier to developing an initial version of library tests, which can then be gradually improved with a more powerful test development framework as resources appear. The method is based on analyzing API signatures and type definitions obtained from the library header files and on creating parameter initialization sequences by comparing target function parameter types with other functions' return values or out-parameters (usually it is necessary to call some function to get a correct parameter value for another function, and the initialization sequence of the necessary function calls can be quite long). The paper also descri
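The core of the initialization-sequence construction can be sketched as a recursive producer search: for each parameter type of the target function, find a function whose return type produces it, and build the chain of calls transitively. The signature representation and the example API names below are illustrative, not taken from any real library.

```python
# Sketch of initialization-sequence generation from API signatures.
# `functions` maps a name to (parameter_types, return_type), as would be
# extracted from header files.
def init_sequence(target, functions):
    """Return an ordered list of calls that initializes every parameter
    of `target` before calling it."""
    order = []

    def produce(type_name, stack=()):
        if type_name in stack:               # guard against type cycles
            raise ValueError(f"cannot construct {type_name}")
        for name, (params, ret) in functions.items():
            if ret == type_name:
                for p in params:             # producer may need setup too
                    produce(p, stack + (type_name,))
                if name not in order:
                    order.append(name)
                return
        raise ValueError(f"no producer for {type_name}")

    params, _ = functions[target]
    for p in params:
        produce(p)
    order.append(target)
    return order
```

This mirrors the observation in the abstract that initialization chains can be long: a cursor needs a database handle, which in turn needs an open call, and so on.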
This paper discusses a problem of ensuring backward binary compatibility when developing shared libraries. Linux (and GCC environment) is used as the main example. Breakage of the compatibility may result in crashing ...
ISBN (print): 0769523129
The paper is devoted to the analysis of a strategy of computation distribution on heterogeneous parallel systems. According to this strategy, the processes of a parallel program are distributed over the processors according to the processors' performance, and data are distributed evenly between the processes. The paper presents an algorithm that computes the optimal number of processes and their distribution over the processors, minimizing the execution time of an application. Processor performance is modeled as a function of the number of processes running on the processor and the amount of data processed by the processor.
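A brute-force version of this optimization is easy to state, which makes the strategy concrete: enumerate all per-processor process counts, give each process an equal data share, and take the split minimizing the slowest processor's time. The paper's actual algorithm is presumably far more efficient than this exhaustive sketch, and the performance model here (a per-processor function of the process count only) simplifies the two-argument model described above.

```python
# Exhaustive sketch of the even-data distribution strategy.
from itertools import product

def best_distribution(perf, total_data, max_procs):
    """perf[i](k): performance of processor i when running k processes.
    Each of the n processes gets total_data / n. Returns (n, split, time)
    for the split minimizing the maximum per-processor time."""
    best = None
    for split in product(range(max_procs + 1), repeat=len(perf)):
        n = sum(split)
        if n == 0:
            continue
        share = total_data / n
        t, ok = 0.0, True
        for i, k in enumerate(split):
            if k == 0:
                continue
            p = perf[i](k)
            if p <= 0:
                ok = False
                break
            t = max(t, k * share / p)  # time of processor i
        if ok and (best is None or t < best[2]):
            best = (n, split, t)
    return best
```

With constant performances the optimum simply assigns process counts proportional to processor speed; the interesting cases are perf functions that degrade as processes are added.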
The paper discusses problems arising in the development of avionics systems and considers how discrete-event simulation based on architecture models at the early stages of avionics design can help mitigate some of them. A tool for simulating AADL architecture models augmented with behavioural specifications is presented, and its main design decisions are discussed.
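At its core, such a tool rests on a discrete-event loop: a time-ordered queue of events whose handlers may schedule further events. The sketch below shows only that generic loop; all AADL-specific semantics (threads, ports, dispatch protocols) are omitted, and the periodic "dispatch" example is illustrative rather than taken from the tool.

```python
# Minimal discrete-event simulation loop of the kind such a tool builds on.
import heapq

def simulate(events, handlers, until):
    """events: initial (time, name) pairs. handlers: name -> list of
    (delay, next_event) scheduled whenever that event fires. Returns the
    trace of fired events up to time `until`."""
    queue = list(events)
    heapq.heapify(queue)
    trace = []
    while queue:
        time, name = heapq.heappop(queue)
        if time > until:
            break
        trace.append((time, name))
        for delay, nxt in handlers.get(name, []):
            heapq.heappush(queue, (time + delay, nxt))
    return trace
```

A periodic AADL thread, for instance, maps naturally onto an event that reschedules itself with its period as the delay.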