In this paper, we study how to perform effective XML query expansion from high-quality pseudo-relevance feedback documents. A solution for selecting good expansion information is presented, in which various features affecting term weight, such as term element frequency, term inverse element frequency, the semantic weight of tags, and level information, are analyzed, and terms with high weight values are selected as expansion terms. Experimental results show that the proposed expansion method is feasible. Compared with the original query and a traditional expansion method that ignores structural features, our method achieves better retrieval performance.
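A minimal sketch of the kind of weighting scheme this abstract describes, combining term element frequency, inverse element frequency, tag semantic weight, and nesting level. The data layout, the tag_weights map, and the level_decay factor are hypothetical illustrations, not the paper's actual formulas:

```python
from collections import defaultdict
from math import log

def expansion_terms(elements, tag_weights, level_decay=0.8, top_k=5):
    """Score candidate expansion terms from pseudo-relevant XML elements.

    Each element is assumed to be a dict with a 'tag', a nesting 'level',
    and a list of 'terms' (a simplified stand-in representation).
    """
    n_elements = len(elements)
    ef = defaultdict(int)       # element frequency: elements containing the term
    score = defaultdict(float)  # accumulated tag- and level-weighted occurrences
    for el in elements:
        for t in set(el["terms"]):
            ef[t] += 1
        tw = tag_weights.get(el["tag"], 1.0)  # semantic weight of the tag
        lw = level_decay ** el["level"]       # deeper elements count less
        for t in el["terms"]:
            score[t] += tw * lw
    for t in score:
        # inverse element frequency damps terms common to all elements
        score[t] *= log(1 + n_elements / ef[t])
    return sorted(score, key=score.get, reverse=True)[:top_k]
```

A term occurring in a high-weight tag (e.g. a title) near the root thus outscores the same term buried deep in body text.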
Chinese radicals play an important role in forming the semantic meaning of Chinese characters. The semantic properties of radicals make them a promising source of information for text mining and content extraction. However, until recently little research has concentrated on using the radical set in text-mining tasks. We investigate the roles of radicals in Chinese text classification. In this task, texts are transformed into vectors of radicals, characters, and words. Radicals are further pruned by their semantic strengths and network traits. We carry out experiments with real data from the Open Directory Project. The experimental results justify Chinese radicals as important features for semantic processing in Chinese text-mining tasks.
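A small sketch of the radical-feature idea: map characters to their radicals and count them as a bag-of-radicals vector. The lookup table here is a tiny hypothetical stand-in; a real system would use the full Kangxi radical index (e.g. from the Unicode Unihan database):

```python
from collections import Counter

# Hypothetical miniature radical table: water-radical and wood-radical characters.
RADICAL = {"河": "氵", "海": "氵", "湖": "氵", "松": "木", "林": "木", "桥": "木"}

def radical_vector(text):
    """Map a text to a bag-of-radicals feature vector."""
    return Counter(RADICAL[ch] for ch in text if ch in RADICAL)

# Characters sharing a radical collapse into the same semantic feature.
vec = radical_vector("河海大学的松林")
```

Because 河 and 海 both carry the water radical 氵, a classifier sees them as related even if neither character appeared in training.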
The n-gram approach additionally takes position information into account and thus can offer higher accuracy in query answering than keyword-based approaches; it is widely used in IR and NLP. However, in large-scale RDF graphs, URIs rather than documents are the ranking and querying units; URIs are usually much shorter than documents, and different URIs are interlinked into a massive network. One-shot n-gram querying is therefore often inadequate for RDF data. In this paper, we present a hybrid framework that combines n-gram retrieval with link-analysis-based weight propagation. The idea is to exploit the link structure of RDF data graphs and propagate the one-shot n-gram score weights along these links. Large-scale experiments using MapReduce on the Billion Triples Challenge dataset show that the hybrid framework achieves an 80.3% improvement in relevance scores over n-gram retrieval alone.
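The propagation step can be sketched as an iterative blend of each URI's one-shot n-gram score with the scores flowing in from its linked neighbors. This is a simplified, hypothetical single-machine version (the alpha damping and uniform averaging are assumptions, not the paper's exact scheme):

```python
def propagate(ngram_scores, links, alpha=0.5, iterations=2):
    """Blend one-shot n-gram scores with scores propagated along RDF links.

    ngram_scores: {uri: initial n-gram relevance score}
    links: {uri: [linked uris]}
    """
    scores = dict(ngram_scores)
    for _ in range(iterations):
        new = {}
        for uri, base in ngram_scores.items():
            neighbors = links.get(uri, [])
            inflow = sum(scores.get(n, 0.0) for n in neighbors)
            inflow = inflow / len(neighbors) if neighbors else 0.0
            # keep part of the original textual evidence, add link evidence
            new[uri] = (1 - alpha) * base + alpha * inflow
        scores = new
    return scores
```

A URI with no textual match can still gain relevance by linking to a URI that matches the query well, which is exactly what one-shot n-gram scoring misses.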
ISBN:
(Print) 9781467351645
The problem of scalable knowledge extraction from the Web has attracted much attention in the past decade. However, it remains underexplored how to extract structured knowledge from semi-structured websites in a fully automatic and scalable way. In this work, we define table-formatted structured data with a clear schema as knowledge tables and propose a scalable learning system, named Kable, that extracts knowledge from semi-structured websites automatically in a never-ending and scalable way. Kable consists of two major components: automatic wrapper induction and schema matching. In contrast to state-of-the-art automatic wrappers for semi-structured websites, our approach runs around 1,000 times faster, which makes Web-scale knowledge extraction possible. In addition, we propose a novel schema-matching solution that works effectively on the auto-extracted structured data. With three months of continuous operation on ten Web servers, we successfully extracted 427,105,009 knowledge facts. Manual labeling over sampled extracted knowledge shows up to 87% precision, supporting various Web applications.
Currently, most work on interval-valued problems focuses on attribute reduction (i.e., feature selection) using rough set technologies, whereas little research has been conducted on building classifiers for interval-valued problems. It is therefore promising to propose an approach for building such classifiers. In this paper, we propose a classification approach based on interval-valued fuzzy rough sets. First, the concept of interval-valued fuzzy granules is proposed; this is the crucial notion for building the reduction framework for interval-valued databases. Second, the idea of keeping the critical value invariant before and after reduction is adopted. Third, the structure of reduction rules is studied in full using the discernibility vector approach. After the rule inference system is described, a set of rules covering all objects can be obtained, which is used as a rule-based classifier for future classification. Finally, numerical examples illustrate the feasibility and effectiveness of the proposed method in the application of privacy protection.
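To make the interval-valued setting concrete, here is one common way to define a fuzzy similarity between interval-valued attribute values and lift it to objects by taking the weakest attribute-wise agreement. The specific measure below is an illustrative choice, not necessarily the one used in the paper:

```python
def interval_similarity(a, b):
    """Similarity in [0,1] of two interval values on a normalized scale.

    One common choice: penalize the average endpoint distance.
    """
    (a1, a2), (b1, b2) = a, b
    return 1.0 - (abs(a1 - b1) + abs(a2 - b2)) / 2.0

def object_similarity(x, y):
    """Fuzzy indiscernibility of two objects described by interval attributes:
    the minimum (weakest) attribute-wise similarity."""
    return min(interval_similarity(a, b) for a, b in zip(x, y))
```

Such a fuzzy relation is the starting point for granules and discernibility-based reduction: two objects are hard to discern exactly when their object_similarity is high.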
Although a few good schemes exist to protect the kernel hooks of operating systems, attackers are still able to circumvent existing defense mechanisms with spurious context information. To address this challenge, this paper proposes a framework, called HookIMA, that detects compromised kernel hooks by using hardware debugging features. The key contribution of this work is that context information is captured from hardware instead of from relatively vulnerable kernel data. Using commodity hardware, a proof-of-concept prototype of HookIMA has been developed. This prototype handles 3,082 dynamic control-flow transfers with related hooks in the kernel space. Experiments show that HookIMA is capable of detecting compromised kernel hooks caused by kernel rootkits. Performance evaluations with UnixBench indicate that the runtime overhead introduced by HookIMA is about 21.5%.
In this paper, the author defines the Generalized Unique Game Problem (GUGP), where the weights of the edges are allowed to be negative. Two special types of GUGP are illuminated: GUGP-NWA, where the weights of all edges are negative, and GUGP-PWT(ρ), where the total weight of all edges is positive and the negative-positive ratio is at most ρ.
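Under one natural reading of the definition (an assumption here, since the abstract does not spell it out), the negative-positive ratio compares the total magnitude of negative edge weights against the total positive weight:

```latex
\rho \;=\; \frac{\sum_{e \,:\, w_e < 0} |w_e|}{\sum_{e \,:\, w_e > 0} w_e},
\qquad \text{with GUGP-PWT}(\rho) \text{ requiring } \sum_{e} w_e > 0 .
```

GUGP-NWA is then the degenerate extreme where the denominator is empty, i.e., every edge weight is negative.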
Recently, social tagging systems have become more and more popular in many Web 2.0 applications. In such systems, users are allowed to annotate a particular resource with a freely chosen set of tags. These user-generated tags can represent users' interests more concisely and closer to human understanding. Interests, however, change over time. Thus, how to describe users' interests and their interest-transfer paths becomes a big challenge for personalized recommendation systems. In this work, we propose a variable-length time-interval division algorithm and a user interest model based on time intervals. Then, in order to trace users' interest-transfer paths over a specific time period, we introduce an interest transfer model. After that, we apply a classical community partition algorithm to separate users into communities. Finally, we present a novel method to measure users' similarities based on the interest transfer model and provide personalized tag recommendation according to similar users' interests in their next time intervals. Experimental results demonstrate higher precision and recall with our approach than with classical user-based collaborative filtering methods.
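The similarity idea can be sketched as follows: represent each user's interests per time interval as a tag-count vector, then compare two users interval by interval. The per-interval cosine and the simple averaging over aligned intervals are hypothetical choices for illustration, not the paper's exact measure:

```python
from math import sqrt

def cosine(u, v):
    """Cosine similarity of two sparse tag-count vectors (dicts)."""
    keys = set(u) | set(v)
    dot = sum(u.get(k, 0) * v.get(k, 0) for k in keys)
    nu = sqrt(sum(x * x for x in u.values()))
    nv = sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def user_similarity(path_u, path_v):
    """Similarity of two users' interest-transfer paths: average
    per-interval similarity over aligned time intervals."""
    sims = [cosine(u, v) for u, v in zip(path_u, path_v)]
    return sum(sims) / len(sims) if sims else 0.0
```

Two users who tag alike in every interval score 1.0; users who agree early but diverge later score lower, which captures interest drift that a single aggregate vector would hide.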
ISBN:
(Print) 9781467300421
Existing query languages for XML (e.g., XQuery) require professional programming skills to formulate, and such complex query languages burden query processing. In addition, when issuing an XML query, users are required to be familiar with the content (including the structural and textual information) of the hierarchical XML document, which is difficult for common users. The need to design user-friendly interfaces that reduce the burden of query formulation is fundamental to the spread of the XML community. We present a twig-based XML graphical search system, called LotusX, that provides a graphical interface to simplify query formulation without requiring knowledge of query languages, data schemas, or the content of the XML document. The basic idea is that LotusX offers "position-aware" and "auto-completion" features to help users create tree-modeled queries (twig patterns) by providing possible candidates on the fly. In addition, complex twig queries (including order-sensitive queries) are supported in LotusX. Furthermore, a new ranking strategy and a query rewriting solution are implemented to rank and rewrite queries effectively. We provide an online demo of the LotusX system: http://***:8080/LotusX.
In algorithmic trading, computer algorithms automatically decide the time, quantity, and direction of operations (buy, sell, or hold). To create a useful algorithm, its parameters should be optimized based on historical data. However, parameter optimization is a time-consuming task due to the large search space. We propose to search the parameter combination space using the MapReduce framework, with the expectation that the runtime of optimization can be cut down by leveraging the parallel processing capability of MapReduce. This paper presents the details of our method and experimental results that demonstrate its efficiency. We also show that a rule-based strategy, after being optimized, performs better in terms of stability than one whose parameters are arbitrarily preset, while making a comparable profit.
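The search itself has a natural map/reduce shape: score every parameter combination independently (map), then keep the best (reduce). The sketch below is a sequential stand-in where the map() call would be distributed across MapReduce workers; the moving-average parameters and the toy profit surface are hypothetical, not from the paper:

```python
from itertools import product

def backtest(params):
    """Hypothetical scoring function: profit of a rule-based strategy
    for one (fast, slow) moving-average parameter pair on history."""
    fast, slow = params
    if fast >= slow:
        return params, float("-inf")  # skip invalid combinations
    # Toy profit surface peaking at fast=5, slow=20, for illustration only.
    return params, -(fast - 5) ** 2 - (slow - 20) ** 2

def grid_search(fasts, slows):
    grid = product(fasts, slows)
    # "Map" phase: each combination is scored independently, so this
    # map() is exactly what gets fanned out across MapReduce workers.
    scored = map(backtest, grid)
    # "Reduce" phase: keep the best-performing parameter set.
    return max(scored, key=lambda r: r[1])
```

Because each backtest touches only its own parameters and the shared historical data, the map phase parallelizes with no coordination, which is why MapReduce cuts the wall-clock time roughly in proportion to the number of workers.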