Organizing webpages into interesting topics is a key step in understanding trends in multimodal web data. Sparse, noisy, and loosely constrained user-generated content yields poor feature representations. As a result, a detected topic inevitably still contains a number of falsely detected webpages, which makes the topic less coherent, less interpretable, and less useful. In this paper, we address this problem by interpreting a topic through its prototypes, and present a two-step approach. Following the detection-by-ranking approach, a sparse Poisson deconvolution is first proposed to learn the intratopic similarities between webpages. Then, leveraging these intratopic similarities, the top-k diverse yet representative prototype webpages are identified by maximizing a submodular function. Experimental results on two public datasets show improved accuracy on the web topic detection task and better interpretability of each topic through its prototypes.
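The prototype-selection step can be illustrated with a minimal sketch. The abstract does not state which submodular function is used, so the code below substitutes the standard facility-location objective, F(S) = sum over pages i of the maximum similarity between i and any prototype in S, maximized greedily. The function name `greedy_prototypes` and the toy similarity matrix are hypothetical, and similarities are assumed to be nonnegative.

```python
def greedy_prototypes(sim, k):
    """Greedily select up to k prototype indices from a similarity matrix.

    Assumption: the facility-location objective stands in for the paper's
    (unspecified) submodular function. sim[i][j] >= 0 is the intratopic
    similarity between webpages i and j.
    """
    n = len(sim)
    selected = []
    # best_cover[i] = similarity of page i to its closest chosen prototype
    best_cover = [0.0] * n
    for _ in range(min(k, n)):
        best_gain, best_j = -1.0, None
        for j in range(n):
            if j in selected:
                continue
            # Marginal gain: how much page j improves the coverage of every page
            gain = sum(max(sim[i][j] - best_cover[i], 0.0) for i in range(n))
            if gain > best_gain:
                best_gain, best_j = gain, j
        selected.append(best_j)
        best_cover = [max(best_cover[i], sim[i][best_j]) for i in range(n)]
    return selected


# Toy topic with two tight groups of pages: {0, 1, 2} and {3, 4}
toy_sim = [
    [1.0, 0.9, 0.9, 0.1, 0.1],
    [0.9, 1.0, 0.9, 0.1, 0.1],
    [0.9, 0.9, 1.0, 0.1, 0.1],
    [0.1, 0.1, 0.1, 1.0, 0.8],
    [0.1, 0.1, 0.1, 0.8, 1.0],
]
prototypes = greedy_prototypes(toy_sim, 2)
```

Because a second prototype from an already covered group adds almost no gain, the greedy rule tends to pick one page from each group, which is the "diverse yet representative" behavior the abstract describes.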
Detecting "hot" topics from the enormous volume of user-generated content (UGC) on the web poses two main difficulties that conventional approaches can barely handle: 1) poor feature representations from noisy images or short texts, and 2) uncertain roles of the modalities, where the visual content is either highly or weakly relevant to the textual cues because UGC is loosely constrained. In this paper, following the detection-by-ranking approach, we address these challenges by learning a robust latent representation from multiple noisy features that are, with high probability, complementary. Both the textual and the visual features are encoded into a k-nearest neighbor hybrid similarity graph (HSG), on which nonnegative matrix factorization using random walk is introduced to generate topic candidates. Multiple HSGs are then fused efficiently by a latent Poisson deconvolution, which consists of a Poisson deconvolution with sparse basis similarity for each edge. Experiments on two public datasets show significantly improved accuracy of the proposed approach in comparison with state-of-the-art methods.
Organizing a few webpages from social media into hot topics is a key step in understanding trends on the web. Discovering popular, hot topics on the web means facing a sea of noise webpages that never evolve into popular topics. In this paper, we discover that the similarity values between webpages in a popular topic exhibit statistical features similar to those observed in Lévy walks. Consequently, we present a simple, novel, yet very powerful Explore-Exploit (EE) approach that groups topics by simulating the nature of Lévy walks in the similarity space. The proposed EE-based topic clustering is an effective and efficient method and a solid move towards handling a sea of noise webpages. Experiments on two public data sets demonstrate that our approach is not only comparable to the state-of-the-art (SOTA) methods in terms of effectiveness but also significantly outperforms them in terms of efficiency.
With the massive growth of social media on the Internet, organizing, understanding, and monitoring user-generated content (UGC) has become one of the most pressing problems in today's society. Discovering topics on the web from a huge volume of UGC is one of the promising approaches to achieve this goal. Compared with classical topic detection and tracking in news articles, identifying topics on the web is by no means easy because the data on the Internet are noisy, sparse, and loosely constrained. In this paper, we investigate methods from the perspective of similarity diffusion, and propose a clustering-like pattern across similarity cascades (SCs). SCs are a series of subgraphs generated by truncating a similarity graph with a set of thresholds; maximal cliques in these subgraphs are then used to capture topics. Finally, a topic-restricted similarity diffusion process is proposed to efficiently identify real topics from a large number of candidates. Experiments demonstrate that our approach outperforms the state-of-the-art methods on three public data sets.
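The similarity-cascade construction described above can be sketched as follows. This is an illustrative reading of the abstract, not the authors' implementation: each threshold truncates the similarity graph into an unweighted subgraph, and maximal cliques are enumerated with the basic Bron-Kerbosch algorithm (no pivoting). The function names, the `min_size` filter, and the toy matrix are assumptions made for the example.

```python
def threshold_graph(sim, tau):
    """Keep only edges whose similarity is at least tau (one cascade level)."""
    n = len(sim)
    adj = {i: set() for i in range(n)}
    for i in range(n):
        for j in range(i + 1, n):
            if sim[i][j] >= tau:
                adj[i].add(j)
                adj[j].add(i)
    return adj


def maximal_cliques(adj):
    """Enumerate all maximal cliques with basic Bron-Kerbosch (no pivoting)."""
    cliques = []

    def expand(r, p, x):
        # r: current clique, p: candidates to extend it, x: already explored
        if not p and not x:
            cliques.append(frozenset(r))
            return
        for v in list(p):
            expand(r | {v}, p & adj[v], x & adj[v])
            p = p - {v}
            x = x | {v}

    expand(set(), set(adj), set())
    return cliques


def cascade_candidates(sim, thresholds, min_size=2):
    """Collect the distinct maximal cliques found across all thresholds."""
    candidates = set()
    for tau in thresholds:
        for clique in maximal_cliques(threshold_graph(sim, tau)):
            if len(clique) >= min_size:  # hypothetical filter: drop singletons
                candidates.add(clique)
    return candidates


toy_sim = [
    [1.0, 0.9, 0.9, 0.1, 0.1],
    [0.9, 1.0, 0.9, 0.1, 0.1],
    [0.9, 0.9, 1.0, 0.1, 0.1],
    [0.1, 0.1, 0.1, 1.0, 0.8],
    [0.1, 0.1, 0.1, 0.8, 1.0],
]
topic_candidates = cascade_candidates(toy_sim, thresholds=[0.85, 0.5])
```

At the stricter threshold only the tight group {0, 1, 2} survives as a clique; the looser threshold additionally admits {3, 4}. Enumerating maximal cliques is exponential in the worst case, so a sketch like this is only practical on small or sparse graphs.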
ISBN: 9781479947614 (print)
In the multimedia and social media communities, web topic detection poses two main difficulties that conventional approaches can barely handle: 1) there are large inter-topic variations among web topics; 2) supervised information for identifying the real topics is rare. In this paper, we address these problems from the perspective of similarity diffusion among objects on the web, and present a clustering-like pattern across similarity cascades (SCs). SCs are a series of subgraphs generated by truncating a weighted graph with a set of thresholds; maximal cliques in these subgraphs are then used to describe the topic candidates. Poisson deconvolution is adopted to efficiently identify the real topics from these topic candidates. Experiments demonstrate that our approach outperforms the state-of-the-art methods on two datasets. In addition, we report accuracy vs. false positives per topic (FPPT) curves for performance evaluation. To our knowledge, this is the first complete evaluation of web topic detection at the topic-wise level, and it establishes a new benchmark for this problem.
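One way to picture the ranking step is a generic Poisson linear model fitted by expectation-maximization. The sketch below is an assumption-laden illustration rather than the paper's formulation: it assumes each observed similarity s_ij is Poisson-distributed with mean equal to the sum of the weights mu_k of all topic candidates that contain both pages i and j, and it applies the standard multiplicative EM update for that model. The function name `poisson_deconvolution`, the iteration count, and the toy inputs are hypothetical.

```python
from itertools import combinations


def poisson_deconvolution(sim, candidates, iters=200):
    """Estimate one nonnegative weight per topic candidate.

    Assumed model: sim[i][j] ~ Poisson(sum of mu_k over candidates k that
    contain both i and j). Each iteration applies the multiplicative EM
    update mu_k <- mu_k * mean over pairs in k of sim[i][j] / lambda_ij.
    """
    # Page pairs (i < j) covered by each candidate
    pairs = [list(combinations(sorted(c), 2)) for c in candidates]
    mu = [1.0] * len(candidates)
    for _ in range(iters):
        # Expected similarity of every covered pair under the current weights
        lam = {}
        for k, ps in enumerate(pairs):
            for p in ps:
                lam[p] = lam.get(p, 0.0) + mu[k]
        new_mu = []
        for k, ps in enumerate(pairs):
            if not ps or mu[k] == 0.0:
                new_mu.append(0.0)
                continue
            ratio = sum(sim[i][j] / lam[(i, j)] for i, j in ps)
            new_mu.append(mu[k] * ratio / len(ps))
        mu = new_mu
    return mu


toy_sim = [
    [1.0, 0.9, 0.9, 0.1, 0.1],
    [0.9, 1.0, 0.9, 0.1, 0.1],
    [0.9, 0.9, 1.0, 0.1, 0.1],
    [0.1, 0.1, 0.1, 1.0, 0.8],
    [0.1, 0.1, 0.1, 0.8, 1.0],
]
# Two plausible topics and one spurious candidate bridging them
toy_candidates = [{0, 1, 2}, {3, 4}, {2, 3}]
weights = poisson_deconvolution(toy_sim, toy_candidates)
```

Sorting candidates by their estimated weight gives a ranking in which the spurious candidate {2, 3}, supported only by a weak edge, falls below the two coherent groups. This matches the detection-by-ranking idea in spirit, although the actual method may differ in its model and optimization details.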
ISBN: 9781479947607 (print)
In the multimedia and social media communities, web topic detection poses two main difficulties that conventional approaches can barely handle: 1) there are large inter-topic variations among web topics; 2) supervised information for identifying the real topics is rare. In this paper, we address these problems from the perspective of similarity diffusion among objects on the web, and present a clustering-like pattern across similarity cascades (SCs). SCs are a series of subgraphs generated by truncating a weighted graph with a set of thresholds; maximal cliques in these subgraphs are then used to describe the topic candidates. Poisson deconvolution is adopted to efficiently identify the real topics from these topic candidates. Experiments demonstrate that our approach outperforms the state-of-the-art methods on two datasets. In addition, we report accuracy vs. false positives per topic (FPPT) curves for performance evaluation. To our knowledge, this is the first complete evaluation of web topic detection at the topic-wise level, and it establishes a new benchmark for this problem.
ISBN: 9783030057107; 9783030057091 (print)
Organizing webpages into hot topics is a key step in understanding trends in multimodal web data. To handle this pressing problem, Poisson Deconvolution (PD), a state-of-the-art method, was recently proposed to rank the interestingness of web topics on a similarity graph. However, PD optimized by expectation-maximization does not scale efficiently to large data sets. In this paper, we develop a Stochastic Poisson Deconvolution (SPD) to deal with large-scale web data sets. Experiments demonstrate the efficacy of the proposed approach in comparison with state-of-the-art methods on two public data sets and one large-scale synthetic data set.