ISBN: (print) 9781479953752
The current availability of large datasets composed of heterogeneous objects stresses the importance of large-scale clustering of mixed complex items. Several algorithms have been developed for mixed datasets composed of numerical and categorical variables, a well-known example being the k-prototypes algorithm. This algorithm is efficient for clustering large datasets given its linear complexity. However, many fields handle more complex data, for example variable-size sets of categorical values mixed with numerical and categorical values, which cannot be processed as is by the k-prototypes algorithm. We propose a variation of the k-prototypes clustering algorithm that can handle these complex entities by using a bag-of-words representation for the multivalued categorical variables. We evaluate our approach on a real-world application, the clustering of administrative health care databases in Quebec, and the results illustrate the good performance of our method.
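To make the idea concrete, here is a minimal sketch of how a k-prototypes-style dissimilarity might be extended with a bag-of-words term for a multivalued categorical variable. It is an illustration under stated assumptions, not the method from the paper: the record layout (`num`, `cat`, `bag`), the weights `gamma` and `beta`, and the use of squared Euclidean distance on the count vectors are all hypothetical choices made for this example.

```python
from collections import Counter


def bag_of_words(values, vocabulary):
    """Map a variable-size collection of categorical values to a
    fixed-length count vector ordered by `vocabulary`."""
    counts = Counter(values)
    return [counts[v] for v in vocabulary]


def mixed_dissimilarity(record, prototype, gamma=1.0, beta=1.0):
    """Dissimilarity between a record and a cluster prototype.

    Assumed layout (hypothetical): each is a dict with
      'num': list of numeric values,
      'cat': list of single-valued categorical values,
      'bag': bag-of-words vector for the multivalued variable.
    """
    # Numeric part: squared Euclidean distance, as in k-prototypes.
    numeric = sum((a - b) ** 2 for a, b in zip(record["num"], prototype["num"]))
    # Categorical part: simple mismatch count, as in k-prototypes.
    categorical = sum(a != b for a, b in zip(record["cat"], prototype["cat"]))
    # Multivalued part: squared Euclidean distance between count vectors
    # (an assumption of this sketch).
    bag = sum((a - b) ** 2 for a, b in zip(record["bag"], prototype["bag"]))
    return numeric + gamma * categorical + beta * bag


def assign(records, prototypes, gamma=1.0, beta=1.0):
    """Return, for each record, the index of its nearest prototype."""
    return [
        min(
            range(len(prototypes)),
            key=lambda k: mixed_dissimilarity(r, prototypes[k], gamma, beta),
        )
        for r in records
    ]


# Toy usage with made-up data: one record carrying a set of two codes.
vocabulary = ["A", "B", "C"]
record = {"num": [1.0], "cat": ["F"], "bag": bag_of_words(["A", "B"], vocabulary)}
prototypes = [
    {"num": [1.0], "cat": ["F"], "bag": [1.0, 1.0, 0.0]},
    {"num": [5.0], "cat": ["M"], "bag": [0.0, 0.0, 1.0]},
]
labels = assign([record], prototypes)
```

Because every record is reduced to fixed-length vectors, each assignment pass stays linear in the number of records, which is the property that makes k-prototypes attractive for large datasets.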