We provide a randomized linear-time approximation scheme for a generic problem about clustering of binary vectors subject to additional constraints. The new constrained clustering problem generalizes a number of problems, and by solving it we obtain the first linear-time approximation schemes for a number of well-studied fundamental problems concerning clustering of binary vectors and low-rank approximation of binary matrices. Among the problems solvable by our approach are LOW GF(2)-RANK APPROXIMATION, LOW BOOLEAN-RANK APPROXIMATION, and various versions of binary CLUSTERING. For example, in the LOW GF(2)-RANK APPROXIMATION problem, given an m x n binary matrix A and an integer r > 0, we seek a binary matrix B of GF(2)-rank at most r such that the l_0-norm of the matrix A - B is minimum. For any epsilon > 0, our algorithm runs in time f(r, epsilon) · n · m, where f is some computable function, and outputs a (1 + epsilon)-approximate solution with probability at least (1 - 1/e). This is the first linear-time approximation scheme for these problems. We also give (deterministic) PTASes for these problems running in time n^(f(r) · (1/epsilon^2) · log(1/epsilon)), where f is some function depending on the problem. Our algorithm for the constrained clustering problem is based on a novel sampling lemma, which is interesting in its own right.
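To make the objective of LOW GF(2)-RANK APPROXIMATION concrete, the following minimal Python sketch computes the two quantities the problem statement refers to: the GF(2)-rank of a binary matrix and the l_0-norm of A - B (the number of entries in which A and B differ). It only evaluates a candidate solution; it is not the approximation scheme from the abstract. The function names `gf2_rank` and `l0_distance` and the example matrices are illustrative assumptions.

```python
def gf2_rank(rows):
    """Rank over GF(2) of a binary matrix given as a list of 0/1 lists."""
    # Pack each nonzero row into an integer bitmask so that adding two
    # rows over GF(2) becomes a single XOR.
    masks = [int("".join(map(str, r)), 2) for r in rows if any(r)]
    rank = 0
    while masks:
        pivot = masks.pop()
        if pivot == 0:
            continue  # row was eliminated completely, contributes nothing
        rank += 1
        high = pivot.bit_length() - 1  # position of the pivot's leading 1
        # Clear that leading bit from every remaining row.
        masks = [m ^ pivot if (m >> high) & 1 else m for m in masks]
    return rank


def l0_distance(A, B):
    """Number of entries in which the binary matrices A and B differ."""
    return sum(a != b for ra, rb in zip(A, B) for a, b in zip(ra, rb))


# Hypothetical example: A has GF(2)-rank 3, B has GF(2)-rank 2 and
# differs from A in exactly one entry.
A = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
B = [[1, 0, 0], [0, 1, 0], [0, 0, 0]]

# Over GF(2) the third row of C is the sum of the first two, so its
# GF(2)-rank is 2 even though its rank over the reals is 3.
C = [[1, 1, 0], [0, 1, 1], [1, 0, 1]]
```

Here B is a feasible solution for r = 2 with cost `l0_distance(A, B)`, and C illustrates why the GF(2)-rank can be strictly smaller than the rank over the reals.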
Matrix factorization (MF) plays a key role in many applications such as recommender systems and computer vision, but MF can take a long time to run on the large matrices commonly seen in the big data era. Many parallel techniques have been proposed to reduce the running time, but few parallel MF packages are available. Therefore, we present an open-source library, LIBMF, based on recent advances in parallel MF for shared-memory systems. LIBMF includes easy-to-use command-line tools, interfaces to the C/C++ languages, and comprehensive documentation. Our experiments demonstrate that LIBMF outperforms state-of-the-art packages. LIBMF is BSD-licensed, so users can freely use, modify, and redistribute the code.
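For readers unfamiliar with the MF problem itself, the sketch below shows a plain, serial stochastic gradient descent factorization of a small rating matrix in pure Python. It is an assumption-laden illustration of the general technique only: it does not use LIBMF's API, does not reproduce its parallel algorithm, and the function names, hyperparameters, and toy ratings are all hypothetical.

```python
import random


def train_mf(ratings, n_users, n_items, k=2, lr=0.01, reg=0.02,
             epochs=500, seed=0):
    """Fit r[u][i] ~ dot(P[u], Q[i]) by serial SGD with L2 regularization."""
    rng = random.Random(seed)
    # Small positive random initialization of the latent factors.
    P = [[rng.uniform(0.0, 0.1) for _ in range(k)] for _ in range(n_users)]
    Q = [[rng.uniform(0.0, 0.1) for _ in range(k)] for _ in range(n_items)]
    for _ in range(epochs):
        for u, i, r in ratings:
            err = r - sum(pf * qf for pf, qf in zip(P[u], Q[i]))
            for f in range(k):
                pu, qi = P[u][f], Q[i][f]  # use old values for both updates
                P[u][f] += lr * (err * qi - reg * pu)
                Q[i][f] += lr * (err * pu - reg * qi)
    return P, Q


def rmse(ratings, P, Q):
    """Root mean squared error of the factorization on the given ratings."""
    total = 0.0
    for u, i, r in ratings:
        pred = sum(pf * qf for pf, qf in zip(P[u], Q[i]))
        total += (r - pred) ** 2
    return (total / len(ratings)) ** 0.5


# Hypothetical toy data: (user index, item index, rating).
ratings = [(0, 0, 5.0), (0, 1, 3.0), (1, 0, 4.0), (1, 2, 1.0), (2, 1, 2.0)]

P0, Q0 = train_mf(ratings, n_users=3, n_items=3, epochs=0)  # untrained
P, Q = train_mf(ratings, n_users=3, n_items=3)              # trained
rmse_before = rmse(ratings, P0, Q0)
rmse_after = rmse(ratings, P, Q)
```

Each update touches only one row of `P` and one row of `Q`, which is the property that parallel MF methods for shared-memory systems exploit when they schedule independent blocks of ratings on different threads.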
Given a set of n binary data points, a widely used technique is to group its features into k clusters (e.g., [7]). In the case where n < k, the question of how much the clusters overlap becomes of interest. In this paper we approach the question through matrix decomposition, and relate the degree of overlap to the sparsity of one of the resulting matrices. We present analytical results regarding bounds on this sparsity, and a heuristic to estimate the minimum amount of overlap that an exact grouping of features into k clusters must have. We also show that adding new data does not alter this minimum amount of overlap.