Domain terms play a crucial role in many research areas, which has led to rising demand for automatic domain term extraction. In this paper, we present a two-level evaluation approach based on termhood and unithood to extract Chinese domain compound terms automatically, taking both character-level and word-level information into account. To achieve this, we incorporate semantic features by using word segmentation to recognize single-word terms, then leverage an improved C-value and heuristic methods such as word formation pattern and word formation power to evaluate candidates at both levels. Validation against several existing dictionaries shows a significant improvement in compound term detection. Experiments on a legal corpus show that our method outperforms the compared methods.
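As a rough illustration of the termhood side of this abstract, the sketch below computes the classic C-value score for multi-word candidates. It is the standard formulation, not the improved variant the abstract refers to (which is not described here); the function names and the toy frequency table are assumptions made for the example.

```python
import math

def _contains(longer, shorter):
    """True if `shorter` occurs as a contiguous sub-sequence of `longer`."""
    n = len(shorter)
    return any(longer[i:i + n] == shorter for i in range(len(longer) - n + 1))

def c_value(freq):
    """Classic C-value. `freq` maps a candidate (tuple of words) to its frequency.

    Nested candidates are penalised by the mean frequency of the longer
    candidates that contain them. Single-word candidates score 0 because
    log2(1) == 0, which is one reason single-word terms need separate handling.
    """
    scores = {}
    for term, f in freq.items():
        containers = [t for t in freq if len(t) > len(term) and _contains(t, term)]
        if containers:
            f = f - sum(freq[t] for t in containers) / len(containers)
        scores[term] = math.log2(len(term)) * f
    return scores

# Hypothetical counts, for illustration only.
example_freq = {("civil", "law"): 10, ("civil", "law", "court"): 4}
example_scores = c_value(example_freq)
```

In this toy table, the nested candidate ("civil", "law") is discounted by the frequency of the longer candidate that contains it.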
To generate a large number of reports within a limited time window, four techniques were proposed: ROLAP&SQL, Shared Scanning, a Hadoop-based solution, and MOLAP&Cube Sharding. An algorithm that performs in-memory aggregation was designed for the second technique. The experimental results show that all techniques except ROLAP&SQL can meet the time-window constraint, and that the Hadoop-based solution is promising owing to its high scalability. Considering the maturity of the techniques and their performance, we put MOLAP&Cube Sharding into practice while keeping an eye on Hadoop for future adoption.
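The abstract does not describe the in-memory aggregation algorithm itself. As a generic sketch of the shared-scanning idea, the code below reads the fact rows once and updates the aggregates of several reports in the same pass. The report specification format, the column names, and the sample rows are assumptions made for the example.

```python
from collections import defaultdict

def shared_scan(rows, reports):
    """Aggregate several reports in a single pass over `rows`.

    `reports` maps a report name to (group_by_columns, measure_column).
    Each row is read once and contributes to every report's running sums.
    """
    results = {name: defaultdict(float) for name in reports}
    for row in rows:
        for name, (keys, measure) in reports.items():
            group = tuple(row[k] for k in keys)
            results[name][group] += row[measure]
    return {name: dict(agg) for name, agg in results.items()}

# Hypothetical fact rows and report definitions.
sample_rows = [
    {"region": "N", "product": "A", "sales": 10.0},
    {"region": "N", "product": "B", "sales": 5.0},
    {"region": "S", "product": "A", "sales": 2.0},
]
sample_reports = {
    "by_region": (("region",), "sales"),
    "by_product": (("product",), "sales"),
}
sample_result = shared_scan(sample_rows, sample_reports)
```

The design point is that scan cost is paid once regardless of the number of reports, while memory grows with the number of distinct groups held per report.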
ISBN: 9781450305280 (print)
Trajectories representing the motion of moving objects are typically obtained via location sampling, e.g. using GPS or road-side sensors, at discrete time instants. Between consecutive samples, nothing is known about the whereabouts of a given moving object. Various models have been proposed (e.g., sheared cylinders; space-time prisms) to represent the uncertainty of moving objects both in unconstrained Euclidean space and in road networks. In this paper, we focus on representing the uncertainty of objects moving along road networks as time-dependent probability distribution functions, assuming the availability of a maximal speed on each road segment. For these settings, we introduce a novel indexing mechanism, UTH (Uncertain Trajectories Hierarchy), based on which efficient algorithms for processing spatio-temporal range queries are proposed. We also present experimental results that demonstrate the benefits of our proposed methodologies.
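To make the uncertainty between two samples concrete, the sketch below computes which road-network nodes an object could have visited between consecutive samples, given a maximal speed per segment. It illustrates the space-time prism constraint only; it is not the UTH index or the probability model of the paper. The graph encoding, the undirected-network assumption, and all names are hypothetical.

```python
import heapq

def travel_times(graph, source):
    """Dijkstra over minimal travel times.

    graph[u] is a list of (v, length, max_speed) tuples; the fastest traversal
    of a segment takes length / max_speed.
    """
    best = {source: 0.0}
    heap = [(0.0, source)]
    while heap:
        t, u = heapq.heappop(heap)
        if t > best.get(u, float("inf")):
            continue  # stale heap entry
        for v, length, vmax in graph.get(u, []):
            nt = t + length / vmax
            if nt < best.get(v, float("inf")):
                best[v] = nt
                heapq.heappush(heap, (nt, v))
    return best

def possible_nodes(graph, a, t_a, b, t_b):
    """Nodes reachable from sample (a, t_a) that still allow arrival at (b, t_b).

    Assumes an undirected network, so the time from a node to b equals the
    time from b to that node.
    """
    from_a = travel_times(graph, a)
    from_b = travel_times(graph, b)
    budget = t_b - t_a
    return {v for v in from_a if v in from_b and from_a[v] + from_b[v] <= budget}

def undirected(edges):
    """Build an adjacency list from (u, v, length, max_speed) tuples."""
    g = {}
    for u, v, length, vmax in edges:
        g.setdefault(u, []).append((v, length, vmax))
        g.setdefault(v, []).append((u, length, vmax))
    return g

# Hypothetical network: A-B and B-C take 1 time unit each, B-D takes 3.
toy = undirected([("A", "B", 10, 10), ("B", "C", 10, 10), ("B", "D", 30, 10)])
toy_nodes = possible_nodes(toy, "A", 0.0, "C", 3.0)
```

With a time budget of 3 units between the samples at A and C, the detour to D (which would need 8 units in total) is excluded.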
Clustering XML search results is an effective way to improve performance. However, the key problem is how to measure similarity between XML documents. In this paper, we propose a semantic similarity measure method com...
Discovering the relationship between protein sequence pattern and protein secondary structure is important for accurately predicting secondary structure of protein sequence. A protein secondary structure pattern dicti...
In document-centric XML datasets, an element may contain so much text that users must spend considerable time judging whether the elements returned by an XML search engine are valuable. Query-orient XML summarization system a...
Search engine users often have clear search tasks hidden behind their queries. Inspired by this, modern search engines are providing an increasing number of services to help users simplify their key tasks. However...
In many areas, large amounts of data are modeled by graphs that are subject to uncertainty, such as molecular compounds and protein interaction networks. Many real applications, for example collaborative filtering, fraud detection, and link prediction in social networks, rely on efficiently answering k-nearest neighbor (kNN) queries, that is, computing the k nodes most "similar" to a given query node. To solve this problem, this paper proposes a novel method based on the SimRank measure. However, because graphs evolve over time and are uncertain, solving the problem with existing SimRank algorithms can be very costly in practice. The paper therefore presents an optimization algorithm. By introducing a path threshold, which is applicable to both deterministic and uncertain graphs, the algorithm considers only the local neighborhood of a given query node rather than the whole graph, which prunes the search space. To further improve efficiency, the algorithm adopts sampling on uncertain graphs. Theoretical analysis and experiments verify that the optimization algorithm is efficient and effective.
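For orientation, the sketch below runs the plain iterative SimRank recurrence on a small deterministic graph and ranks the nodes most similar to a query node. It is the baseline definition only: the path-threshold pruning and the sampling over uncertain graphs mentioned in the abstract are not reproduced, and the decay factor, iteration count, and toy graph are assumptions.

```python
def simrank(in_neighbors, decay=0.8, iterations=5):
    """Naive iterative SimRank.

    in_neighbors[v] lists the nodes with an edge into v; every node must be a key.
    s(a, a) = 1, and for a != b the score is the decayed average similarity of
    their in-neighbor pairs (0 if either node has no in-neighbors).
    """
    nodes = list(in_neighbors)
    sim = {(a, b): 1.0 if a == b else 0.0 for a in nodes for b in nodes}
    for _ in range(iterations):
        new = {}
        for a in nodes:
            for b in nodes:
                if a == b:
                    new[(a, b)] = 1.0
                    continue
                ia, ib = in_neighbors[a], in_neighbors[b]
                if not ia or not ib:
                    new[(a, b)] = 0.0
                    continue
                total = sum(sim[(x, y)] for x in ia for y in ib)
                new[(a, b)] = decay * total / (len(ia) * len(ib))
        sim = new
    return sim

def knn(in_neighbors, query, k, decay=0.8, iterations=5):
    """Return the k nodes with the highest SimRank score to `query`."""
    sim = simrank(in_neighbors, decay, iterations)
    others = [n for n in in_neighbors if n != query]
    return sorted(others, key=lambda n: sim[(query, n)], reverse=True)[:k]

# Hypothetical graph: u -> a, u -> b, a -> c.
toy_graph = {"u": [], "a": ["u"], "b": ["u"], "c": ["a"]}
toy_sim = simrank(toy_graph)
toy_top = knn(toy_graph, "a", 1)
```

This baseline touches every node pair on each iteration, which is the cost that restricting computation to the query node's local neighborhood is meant to avoid.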