In this paper, we develop some algorithmic aspects of the fractional arboricity of a graph in order to study some new approaches for graph clustering. For a given undirected graph G(V, E) with m edges and n vertices, ...
详细信息
In this paper, we develop some algorithmic aspects of the fractional arboricity of a graph in order to study some new approaches for graph clustering. For a given undirected graph G(V, E) with m edges and n vertices, the fractional arboricity gamma(G) measures the maximum edge density of the subgraphs of G(V, E). It is the fractional covering number of the corresponding graphic matroid. The fractional arboricity, for applications in networks reliability or the approximation of the chromatic number of the graphs, has been studied for many years. There are some algorithms in polynomial time to compute the fractional arboricity and its integer part. But, for large graphs such as the graph of the Web or graphs of social networks, the exact algorithms are not fast enough for practical use. That is why we describe a new FPTAS to compute an epsilon-approximation of the fractional arboricity (with epsilon > 0 as small as desired). Our algorithm uses the principle of the multiplicative weights update method, needs a memory of size O(m) and has a complexity of O(m log(2)(m) log(m/n)/epsilon(2)). We also give a 2-approximation of gamma(G) with computation time O(m), which is a quick preprocessing for the main algorithm. Finally, we present a fast algorithm to extract a subgraph which achieves the value of the approximation of the fractional arboricity. (C) 2014 Elsevier B.V. All rights reserved.
Malware databases have been unintentionally collecting garbage (incomplete malware) together with malware through the Internet. This paper focuses on finding garbage (incomplete malware) from large malware datasets us...
详细信息
ISBN:
(纸本)9781728151731
Malware databases have been unintentionally collecting garbage (incomplete malware) together with malware through the Internet. This paper focuses on finding garbage (incomplete malware) from large malware datasets using binary pattern matching and speed up the matching by using nested clustering as a preprocessing. To verify the effectiveness of our method, we conduct experiments on various malware datasets. The results show that our method works efficiently while maintaining high accuracy.
We study the mathematical concept of the strength of a graph as defined by Cunningham to partition very large graphs. The strength is the objective value of an attack problem on a graph where the striker aims at divid...
详细信息
We study the mathematical concept of the strength of a graph as defined by Cunningham to partition very large graphs. The strength is the objective value of an attack problem on a graph where the striker aims at dividing the graph into a maximum number of connected components removing as few edges as possible per component uncoupled. The strength is the optimal number of edges per component uncoupled or, in the weighted version, the sum of the weights of the edges removed per component uncoupled. Given a connected graph with n vertices and m weighted edges, we denote by W the total sum of the integer weights, and we describe an algorithm that approximates within a factor of 1 + epsilon the weighted strength in time O(m log(W) log(3)(n)/epsilon(2)), and that can be implemented with O(m) memory use if we allow a complexity of O(m log(2)(W) log(3)(n)/epsilon(2)). This result has several important applications, in particular for the detection of dense areas of a graph, in terms of weighted edges, which in turn can be used for spam detection, community identification, and partitioning.
Several applications involving counts present a large proportion of zeros (excess-of-zeros data). A popular model for such data is the hurdle model, which explicitly models the probability of a zero count, while assum...
详细信息
Several applications involving counts present a large proportion of zeros (excess-of-zeros data). A popular model for such data is the hurdle model, which explicitly models the probability of a zero count, while assuming a sampling distribution on the positive integers. We consider data from multiple count processes. In this context, it is of interest to study the patterns of counts and cluster the subjects accordingly. We introduce a novel Bayesian approach to cluster multiple, possibly related, zero-inflated processes. We propose a joint model for zero-inflated counts, specifying a hurdle model for each process with a shifted Negative Binomial sampling distribution. Conditionally on the model parameters, the different processes are assumed independent, leading to a substantial reduction in the number of parameters as compared with traditional multivariate approaches. The subject-specific probabilities of zero-inflation and the parameters of the sampling distribution are flexibly modelled via an enriched finite mixture with random number of components. This induces a two-level clustering of the subjects based on the zero/non-zero patterns (outer clustering) and on the sampling distribution (inner clustering). Posterior inference is performed through tailored Markov chain Monte Carlo schemes. We demonstrate the proposed approach on an application involving the use of the messaging service *** article is part of the theme issue 'Bayesian inference: challenges, perspectives, and prospects'.
暂无评论