Person name ambiguity is widespread on web pages because one name may be shared by different people, so it is important to identify uniquely the person a page refers to. In this paper, we propose Baidu-PND, an unsupervised name disambiguation method based on Baidu Encyclopedia. We extract three features from the online Baidu Encyclopedia: background knowledge, contextual features, and the Related-Set of the characters. The feature weights are learned with a logistic regression algorithm, and the features are then combined by linear fusion. The candidate with the maximum combined value is selected as the correct person for the web page. Experiments conducted to measure the performance of Baidu-PND show that it performs better than we expected, supporting its feasibility and effectiveness for person name disambiguation on web pages. Baidu-PND is also a new method for knowledge mining based on Baidu Encyclopedia.
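The fusion-and-selection step can be illustrated with a minimal sketch. The feature names, weights, and scores below are hypothetical placeholders, not values from the paper; in the method described, the weights would come from logistic regression rather than being set by hand.

```python
def fuse_scores(features, weights):
    """Linear fusion: weighted sum of the per-feature similarity scores."""
    return sum(weights[name] * value for name, value in features.items())


def pick_entity(candidates, weights):
    """Return the candidate id whose fused score is largest."""
    return max(candidates, key=lambda cid: fuse_scores(candidates[cid], weights))


# Hypothetical weights (assumed here; the paper learns them by logistic regression).
weights = {"background": 0.5, "context": 0.3, "related_set": 0.2}

# Hypothetical similarity scores between one web page and two encyclopedia entries.
candidates = {
    "entity_a": {"background": 0.2, "context": 0.9, "related_set": 0.1},
    "entity_b": {"background": 0.8, "context": 0.4, "related_set": 0.5},
}

best = pick_entity(candidates, weights)
```

Keeping the fusion linear means each feature's contribution to the decision can be read directly from its weight.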
In this paper, we propose a new L1-norm-based two-dimensional locality preserving projections method (2DLPP-L1). Traditional 2D-LPP preserves local structure and extracts features directly from matrices, which gives it great advantages. However, it is based on the L2 norm, and it is well known that L2-norm-based criteria are sensitive to outliers. We generalize 2D-LPP to its corresponding L1-norm-based version, 2DLPP-L1, which is more robust against outliers. To evaluate the performance of 2DLPP-L1, several experiments are performed on the ORL face database. Experimental results demonstrate that 2DLPP-L1 performs better than its related methods.
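The outlier sensitivity of L2-norm criteria can be seen in a one-dimensional toy case. This is not the 2DLPP-L1 algorithm, only an illustration of the general point: the mean minimizes the sum of squared deviations (an L2 criterion), while the median minimizes the sum of absolute deviations (an L1 criterion). The data values are made up.

```python
import statistics

# Four well-behaved values plus one gross outlier (hypothetical data).
data = [1.0, 1.1, 0.9, 1.0, 50.0]

# Minimizer of the sum of squared deviations (L2 criterion): the mean.
l2_estimate = statistics.mean(data)

# Minimizer of the sum of absolute deviations (L1 criterion): the median.
l1_estimate = statistics.median(data)


def l2_cost(center, values):
    """Sum of squared deviations from center."""
    return sum((v - center) ** 2 for v in values)


def l1_cost(center, values):
    """Sum of absolute deviations from center."""
    return sum(abs(v - center) for v in values)
```

The single outlier drags the L2 estimate far from the bulk of the data, while the L1 estimate stays at the typical value.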
Content-based image retrieval has become an important research area. To capture the semantic information in the user’s query concept, we propose an image retrieval method based on regional objects. The method acts as a pre-processing step on the query image: given a query image, the useful or interesting regional object is first segmented, and retrieval is then performed on the segmented fragment. We also propose a color representation based on the correlation coefficient. Experimental results demonstrate that the proposed approach performs much better than related methods. Furthermore, the presented system achieves high retrieval precision and keeps color consistent among the similar images retrieved.
We propose a framework for analyzing and comparing distributions, which we use to construct statistical tests to determine if two samples are drawn from different distributions. Our test statistic is the largest difference in expectations over functions in the unit ball of a reproducing kernel Hilbert space (RKHS), and is called the maximum mean discrepancy (MMD). We present two distribution-free tests based on large deviation bounds for the MMD, and a third test based on the asymptotic distribution of this statistic. The MMD can be computed in quadratic time, although efficient linear-time approximations are available. Our statistic is an instance of an integral probability metric, and various classical metrics on distributions are obtained when alternative function classes are used in place of an RKHS. We apply our two-sample tests to a variety of problems, including attribute matching for databases using the Hungarian marriage method, where they perform strongly. Excellent performance is also obtained when comparing distributions over graphs, for which these are the first such tests.
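A minimal sketch of the quadratic-time computation follows, assuming a Gaussian kernel with a hand-picked bandwidth `sigma` (both are assumptions for illustration; the abstract does not fix a kernel). It computes the biased empirical estimate of the squared MMD from kernel averages within and between the two samples, and does not implement the test thresholds.

```python
import math


def gaussian_kernel(x, y, sigma=1.0):
    """Gaussian (RBF) kernel between two equal-length vectors."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))


def mmd2_biased(xs, ys, sigma=1.0):
    """Biased estimate of squared MMD; cost is quadratic in the sample sizes."""
    m, n = len(xs), len(ys)
    k_xx = sum(gaussian_kernel(a, b, sigma) for a in xs for b in xs) / (m * m)
    k_yy = sum(gaussian_kernel(a, b, sigma) for a in ys for b in ys) / (n * n)
    k_xy = sum(gaussian_kernel(a, b, sigma) for a in xs for b in ys) / (m * n)
    return k_xx + k_yy - 2.0 * k_xy


# Two small, clearly separated toy samples (hypothetical data).
sample_p = [[0.0], [0.1], [0.2]]
sample_q = [[5.0], [5.1], [5.2]]

same = mmd2_biased(sample_p, sample_p)       # identical samples
different = mmd2_biased(sample_p, sample_q)  # well-separated samples
```

The estimate is close to zero when both samples coincide and grows as the samples separate; a two-sample test would compare it against a threshold such as those derived in the paper.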
The radial basis function network (RBFN) has been widely used in fields such as function regression, pattern recognition, and error detection. However, choosing the structural parameters of an RBFN, including the number of hidden units, the center vectors, and the widths (variances), is one of the most important issues in training, because these parameters greatly affect performance. The objective of this paper is therefore to provide an elementary survey of this problem. First, the fundamental knowledge and notation of RBFN are introduced. Second, we summarize most existing network structure initialization methods for RBFN and categorize them into four groups. Typical approaches in each category are then introduced and discussed, along with the advantages and disadvantages of some of the methods. Finally, the paper concludes with a discussion of current difficulties and possible future directions in RBFN architecture selection.
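To make the role of the structural parameters concrete, here is a minimal sketch of an RBFN forward pass with Gaussian hidden units. The function name and the example parameter values are illustrative assumptions; the survey itself concerns how the centers, widths, and number of units are chosen, which this sketch takes as given.

```python
import math


def rbfn_output(x, centers, widths, weights, bias=0.0):
    """Forward pass of a Gaussian RBF network with one linear output unit.

    centers: one center vector per hidden unit
    widths:  one width (standard deviation) per hidden unit
    weights: one output weight per hidden unit
    """
    activations = []
    for center, width in zip(centers, widths):
        sq_dist = sum((xi - ci) ** 2 for xi, ci in zip(x, center))
        activations.append(math.exp(-sq_dist / (2.0 * width ** 2)))
    return bias + sum(w * a for w, a in zip(weights, activations))


# Hypothetical two-unit network on 2-D inputs.
centers = [[0.0, 0.0], [3.0, 3.0]]
widths = [1.0, 0.5]
weights = [2.0, -1.0]

y = rbfn_output([0.0, 0.0], centers, widths, weights)
```

The number of hidden units is simply the length of `centers`, so all three structural parameters named above appear directly as arguments.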
Based on a knowledge base, we propose a new method to realize free-style Chinese keyword search over relational databases. Firstly, an index (also called knowledge base) is built by extracting related information of C...
Printed mathematical formulas edited with different software packages show some obvious differences. Distinguishing them before recognition benefits formula recognition. Based on the statistical analysis to the chara...
Email is a kind of semi-structured document whose structure contains some important attributes; in particular, using spam-specific features can improve email classification results. In this paper, we apply a decision tree data mining technique to uncover the potential association rules among these email attributes, and then identify the category of unknown emails based on these rules. In experiments applying our classifier to numerous Chinese emails, the efficiency of our method is not lower than that of existing methods that check the whole text content of the email. Meanwhile, our method reduces computation cost and the consumption of system resources.
Active learning is a hot topic in the machine learning field. The main task of active learning is to automatically select representative instances so as to reduce sample complexity efficiently. This paper presents a brief survey of active learning, covering selection methods, query strategies, applications, and other related work.
Support vectors play an important role in training to find the optimal hyperplane. Because SVM classification typically involves many non-support vectors and only a few support vectors, this paper proposes a method to remove samples that are unlikely to be support vectors. First, Support Vector Domain Description is adopted to find the smallest sphere containing most of the data points, and the objects outside the sphere are removed. Second, edge points are removed based on the distance of each pattern to the centers of the other classes. Compared with the standard SVM, experimental results show that the new algorithm reduces both the number of samples and the training time while maintaining high accuracy.
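The first reduction step can be approximated with a rough sketch. Note the substitution: instead of solving the Support Vector Domain Description optimization, this stand-in centers a sphere on the class centroid and keeps only the closest fraction of points. The function names and the `keep_fraction` parameter are assumptions for illustration, and the second step (edge-point removal) is not shown.

```python
import math


def centroid(points):
    """Coordinate-wise mean of a list of equal-length vectors."""
    dim = len(points[0])
    return [sum(p[d] for p in points) / len(points) for d in range(dim)]


def keep_inside_sphere(points, keep_fraction=0.9):
    """Keep the points closest to the centroid; drop those outside the sphere.

    Crude stand-in for SVDD: the sphere radius is set implicitly by the
    distance of the last point kept, not by an optimization problem.
    """
    center = centroid(points)
    ranked = sorted((math.dist(p, center), i) for i, p in enumerate(points))
    n_keep = max(1, int(len(points) * keep_fraction))
    kept_indices = sorted(i for _, i in ranked[:n_keep])
    return [points[i] for i in kept_indices]


# Hypothetical class with one far-away point.
class_points = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0], [10.0, 10.0]]
reduced = keep_inside_sphere(class_points, keep_fraction=0.8)
```

A real SVDD fit is less sensitive to the outliers it is trying to exclude than a centroid, so this sketch only conveys the shape of the pruning step.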