Clustering is among the most popular families of data mining algorithms. Before clustering algorithms are applied to a dataset, the data usually has to be preprocessed properly. Data preprocessing is a crucial yet still neglected step in data mining. Although preprocessing techniques and algorithms are well known, the preprocessing process is very complex and usually takes a lot of time. Instead of being handled systematically, it is usually undervalued, i.e. more emphasis is put on choosing the appropriate clustering algorithm and setting its parameters. In our opinion, this is not because preprocessing is less important, but because it is difficult to choose the best sequence of preprocessing algorithms. We argue that it is important to better standardize this process so that it is performed efficiently. Therefore, this paper proposes a generic framework for data preprocessing. It is based on a survey of data mining experts, as well as a literature and software review. The framework enables the pipelining of preprocessing algorithms and methods, which facilitates further automated preprocessing design and the selection of a suitable preprocessing stream. The proposed framework is easily extensible, so it can be applied to other data mining algorithm families that have their own idiosyncrasies.
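To make the idea of pipelining preprocessing steps concrete, the following is a minimal sketch in plain Python. It is not the framework proposed in the paper: the step functions `drop_incomplete` and `standardize`, the helper `make_pipeline`, and the toy data are all hypothetical names chosen for illustration, assuming only that each preprocessing step maps a list of rows to a list of rows.

```python
from statistics import mean, pstdev

def drop_incomplete(rows):
    """Hypothetical step: remove rows that contain a missing value (None)."""
    return [r for r in rows if all(v is not None for v in r)]

def standardize(rows):
    """Hypothetical step: rescale each column to zero mean and unit variance."""
    cols = list(zip(*rows))
    mus = [mean(c) for c in cols]
    # Fall back to 1.0 for constant columns to avoid division by zero.
    sigmas = [pstdev(c) or 1.0 for c in cols]
    return [[(v - m) / s for v, m, s in zip(r, mus, sigmas)] for r in rows]

def make_pipeline(steps):
    """Chain steps so that the output of one is the input of the next."""
    def run(rows):
        for step in steps:
            rows = step(rows)
        return rows
    return run

# One candidate "preprocessing stream": the order of steps matters, since
# standardize cannot compute statistics on rows that still contain None.
preprocess = make_pipeline([drop_incomplete, standardize])

data = [[1.0, 10.0], [2.0, None], [3.0, 30.0]]
cleaned = preprocess(data)
```

Because a stream here is just an ordered list of functions, alternative sequences can be generated and compared programmatically, which is the kind of automated selection the abstract alludes to.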