ISBN: (print) 9783030731960; 9783030731977
Many collaboratively built resources, such as Wikipedia, Weibo and Quora, exist in the form of semi-structured data, and semi-structured data classification plays an important role in many data analysis applications. In addition to content information, semi-structured data also contain structural information. Thus, combining structure and content features is a crucial issue in semi-structured data classification. In this paper, we propose a supervised semi-structured data classification approach that utilizes both the structural and content information. In this approach, generalized tag sequences are extracted from the structural information, and n-grams are extracted from the content information. The tag sequences and n-grams are then combined into features called TSGram according to their link relation, and each semi-structured document is represented as a vector of TSGram features. Based on the TSGram features, a classification model is devised to improve the performance of semi-structured data classification. Because TSGram features retain the association between the structural and content information, they help improve classification performance. Our experimental results on two real datasets show that the proposed approach is effective.
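The idea of pairing structural and content features can be illustrated with a minimal Python sketch. It is not the paper's implementation: the function names `word_ngrams` and `tsgram_features` are hypothetical, the plain root-to-element tag path stands in for a "generalized tag sequence" (the generalization step is not described in the abstract and is not reproduced), and word-level n-grams are assumed.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def word_ngrams(text, n):
    """Return the word n-grams (as tuples) found in a text string."""
    words = text.lower().split()
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]


def tsgram_features(xml_string, n=2):
    """Count (tag path, n-gram) pairs for every text-bearing element.

    Assumption: the tag path from the root approximates a tag sequence.
    Only each element's leading text is used; tail text is ignored.
    """
    root = ET.fromstring(xml_string)
    features = Counter()

    def walk(node, path):
        path = path + (node.tag,)
        if node.text and node.text.strip():
            for gram in word_ngrams(node.text, n):
                # Link the structural feature to the content feature.
                features[("/".join(path), " ".join(gram))] += 1
        for child in node:
            walk(child, path)

    walk(root, ())
    return features


doc = (
    "<article><title>data classification methods</title>"
    "<body><p>data classification</p></body></article>"
)
features = tsgram_features(doc, n=2)
for (tag_path, gram), count in sorted(features.items()):
    print(tag_path, "|", gram, "|", count)
```

In this sketch the bigram "data classification" yields two distinct features, one under `article/title` and one under `article/body/p`, which is the kind of structure-content association the abstract says TSGram features retain. The resulting counts could be turned into a document vector for any standard classifier.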
ISBN: (print) 9781665420990
Many collaboratively built resources, such as Wikipedia, Weibo and Quora, exist in the form of semi-structured data. Semi-structured data have been widely used in areas such as data integration, data distribution, data storage, data management, information retrieval and knowledge management. For large volumes of semi-structured data on the Web, semi-structured data classification techniques can group documents into different categories by their structure and/or content information. Supervised semi-structured data classification plays an important role in many applications. This paper provides an overview of the literature in the area of supervised semi-structured data classification. A general framework for semi-structured data classification is presented, which is mainly composed of two steps: feature extraction and model building. Several representation models of semi-structured data are discussed, mainly the rooted labeled tree model, the feature vector space model and the feature set model. A large selection of semi-structured data classification approaches is reviewed in detail from two aspects: approaches based on structure only and approaches based on both structure and content. Finally, several future research directions for semi-structured data classification are presented.