Many cancer genome sequencing efforts are underway with the goal of identifying the somatic mutations that drive cancer progression. A major difficulty in these studies is that tumors are typically heterogeneous, with individual cells in a tumor having different complements of somatic mutations. However, nearly all DNA sequencing technologies sequence DNA from multiple cells, thus resulting in measurement of mutations from a mixture of genomes. Genome rearrangements are a major class of somatic mutations in many tumors, and the novel adjacencies (i.e. breakpoints) resulting from these rearrangements are readily detected from DNA sequencing reads. However, the assignment of each rearrangement, or adjacency, to an individual cancer genome in the mixture is not known. Moreover, the quantity of DNA sequence reads may be insufficient to measure all rearrangements in all genomes in the tumor. Motivated by this application, we formulate the k-minimum completion problem (k-MCP). In this problem, we aim to reconstruct k genomes derived from a single reference genome, given partial information about the adjacencies present in the mixture of these genomes. We show that the 1-MCP is solvable in linear time in the cases where: (i) the measured, incomplete genome has a single circular or linear chromosome; (ii) there are no restrictions on the chromosomal content of the measured, incomplete genome. We also show that the k-MCP problem, for k >= 3 in general, and the 2-MCP problem with the double-cut-and-join (DCJ) distance are NP-complete when there are no restrictions on the chromosomal structure of the measured, incomplete genome. These results lay the foundation for future algorithmic studies of the k-MCP and the application of these algorithms to real cancer sequencing data.
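As an illustration of the adjacency language used above, the following minimal Python sketch (hypothetical toy gene orders, not the paper's k-MCP formalism) reads the interior adjacencies off linear chromosomes derived from a reference and shows that a mixed tumor sample only exposes the union of the adjacencies of its constituent genomes.

def adjacencies(order):
    # Interior adjacencies of a linear chromosome given as a signed gene order (telomeres ignored).
    adj = set()
    for x, y in zip(order, order[1:]):
        left = (abs(x), 'h' if x > 0 else 't')    # extremity leaving gene x
        right = (abs(y), 't' if y > 0 else 'h')   # extremity entering gene y
        adj.add(frozenset((left, right)))
    return adj

reference = [1, 2, 3, 4, 5]
genome_a = [1, -3, -2, 4, 5]    # inversion of the segment 2..3
genome_b = [1, 2, 4, 3, 5]      # a different rearrangement of the same reference
mixture = adjacencies(genome_a) | adjacencies(genome_b)    # what sequencing of the mixed sample exposes
novel = mixture - adjacencies(reference)                   # breakpoints, with no genome-of-origin labels
print(sorted(tuple(sorted(a)) for a in novel))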
We present optimal linear time algorithms for computing the Shapley values and 'heightened evolutionary distinctiveness' (HED) scores for the set of taxa in a phylogenetic tree. We demonstrate the efficiency of these new algorithms by applying them to a set of 10,000 reasonable 5139-species mammal trees. This is the first time these indices have been computed on such a large set of taxa, and we contrast our findings with an ad-hoc index for mammals, fair proportion (FP), used by the Zoological Society of London's EDGE programme. Our empirical results follow expectations. In particular, the Shapley values are very strongly correlated with the FP scores, but provide a higher weight to the few monotremes that comprise the sister to all other mammals. We also find that the HED score, which measures a species' unique contribution to future subsets as a function of the probability that close relatives will go extinct, is very sensitive to the estimated probabilities. When they are low, HED scores are less than FP scores, and approach the simple measure of a species' age. Deviations (like the Solenodon genus of the West Indies) occur when sister species are both at high risk of extinction and their clade roots deep in the tree. Conversely, when endangered species have higher probabilities of being lost, HED scores can be greater than FP scores and species like the African elephant Loxodonta africana, the two solenodons and the thumbless bat Furipterus horrens can move up the rankings. We suggest that conservation attention be applied to such species that carry genetic responsibility for imperiled close relatives. We also briefly discuss extensions of Shapley values and HED scores that are possible with the algorithms presented here.
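For context on how such tree indices are computed in a single pass, here is a minimal sketch of the fair proportion (FP) index on a hypothetical toy tree; it is not the authors' Shapley/HED algorithm, but it shows the linear-time pattern of one post-order pass (leaf counts) followed by one top-down pass (accumulating each leaf's share of ancestral branch lengths).

def leaf_counts(tree, node, counts):
    # Post-order: number of leaves below each node.
    children = tree.get(node, [])
    if not children:
        counts[node] = 1
    else:
        counts[node] = sum(leaf_counts(tree, child, counts) for child, _ in children)
    return counts[node]

def fair_proportion(tree, root):
    # FP(leaf) = sum over ancestral edges of branch_length / #leaves below that edge.
    counts, fp = {}, {}
    leaf_counts(tree, root, counts)
    stack = [(root, 0.0)]
    while stack:
        node, acc = stack.pop()
        children = tree.get(node, [])
        if not children:
            fp[node] = acc
        for child, length in children:
            stack.append((child, acc + length / counts[child]))
    return fp

# tree[node] = list of (child, branch length); leaves have no entry
tree = {"root": [("A", 1.0), ("X", 0.5)], "X": [("B", 2.0), ("C", 2.0)]}
print(fair_proportion(tree, "root"))   # FP values: A -> 1.0, B -> 2.25, C -> 2.25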
ISBN: (Print) 9783642255908
Let G be a finite undirected graph with edge set E. An edge set E' ⊆ E is an induced matching in G if the pairwise distance of the edges of E' in G is at least two; E' is dominating in G if every edge e ∈ E \ E' intersects some edge in E'. The Dominating Induced Matching Problem (DIM, for short) asks for the existence of an induced matching E' which is also dominating in G; this problem is also known as the Efficient Edge Domination Problem. The DIM problem is related to parallel resource allocation problems, encoding theory and network routing. It is NP-complete even for very restricted graph classes such as planar bipartite graphs with maximum degree three. However, its complexity was open for P_k-free graphs for any k >= 5; P_k denotes a chordless path with k vertices and k - 1 edges. We show in this paper that the weighted DIM problem is solvable in linear time for P_7-free graphs in a robust way.
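For concreteness, the following brute-force verifier (a hypothetical helper on a toy edge list, unrelated to the P_7-free algorithm of the paper) checks whether an edge set M is a dominating induced matching of a graph given by its edge set.

def is_dominating_induced_matching(edges, M):
    E = {frozenset(e) for e in edges}
    M = {frozenset(e) for e in M}
    if not M <= E:
        return False
    matched = set()
    for e in M:                       # matching: no two edges of M share a vertex
        u, v = tuple(e)
        if u in matched or v in matched:
            return False
        matched.update(e)
    for e in E - M:
        u, v = tuple(e)
        if u in matched and v in matched:
            return False              # induced: no edge joins vertices covered by two M-edges
        if u not in matched and v not in matched:
            return False              # dominating: every other edge touches M
    return True

# On the path 1-2-3-4, the middle edge alone is a dominating induced matching.
print(is_dominating_induced_matching([(1, 2), (2, 3), (3, 4)], [(2, 3)]))   # True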
Background: The distance between two genomes is often computed by comparing only the common markers between them. Some approaches are also able to deal with non-common markers, allowing the insertion or the deletion of such markers. In these models, a deletion and a subsequent insertion that occur at the same position of the genome count as two sorting steps. Results: Here we propose a new model that sorts non-common markers with substitutions, which are more powerful operations that subsume insertions and deletions. A deletion and an insertion that occur at the same position of the genome can be modeled as a substitution, counting as a single sorting step. Conclusions: Comparing genomes with unequal content, but without duplicated markers, we give a linear time algorithm to compute the genomic distance considering substitutions and double-cut-and-join (DCJ) operations. This model provides a parsimonious genomic distance for genomes free of duplicated markers that is, in practice, a lower bound on the real genomic distance. The method could also be used to refine orthology assignments, since in some cases a substitution could actually correspond to an unannotated orthology.
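As a baseline for the model described here, this is a minimal sketch of the classical DCJ distance for two circular genomes over the same gene set (no substitutions, no duplicated markers); the gene orders are hypothetical, and the substitution-aware distance of the paper is not reproduced.

def adjacencies(order):
    # Adjacency set of a circular chromosome given as a signed gene order.
    adj = []
    n = len(order)
    for i in range(n):
        x, y = order[i], order[(i + 1) % n]
        left = (abs(x), 'h' if x > 0 else 't')
        right = (abs(y), 't' if y > 0 else 'h')
        adj.append(frozenset((left, right)))
    return adj

def dcj_distance(order_a, order_b):
    # d_DCJ = n - c, where c is the number of cycles in the adjacency graph.
    parent = {}
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x
    for adj in adjacencies(order_a) + adjacencies(order_b):
        u, v = tuple(adj)
        parent.setdefault(u, u)
        parent.setdefault(v, v)
        parent[find(u)] = find(v)
    cycles = len({find(x) for x in parent})
    return len(order_a) - cycles

print(dcj_distance([1, 2, 3, 4], [1, -3, -2, 4]))   # 1: a single inversion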
Block-based connected components labeling is by far the fastest algorithm for labeling the connected components in 2D binary images, especially when the image size is large. This algorithm produces a decision tree that contains 211 leaf nodes, with a tree depth of 14 levels and an average depth of 1.5923. This article provides a faster method for connected components labeling. We propose two new scan masks for connected components labeling, namely, the pixel-based scan mask and the block-based scan mask. In the final stage, the block-based scan mask is transformed into a near-optimal decision tree. We conducted comparative experiments using different sources of images to examine the performance of the proposed method against the existing methods. We also performed an average tree depth analysis and a tree balance analysis to confirm the performance improvement over the existing methods. Most significantly, the proposed method produces a decision tree containing 86 leaf nodes, with a tree depth of 12 levels and an average depth of 1.4593, resulting in faster execution time, especially when the foreground density is equal to or greater than the background density of the images.
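For reference, this is the classical pixel-based two-pass labeling (4-connectivity, union-find over provisional labels) that block-based, decision-tree methods such as the one proposed here are measured against; it is a baseline sketch, not the proposed scan masks.

import numpy as np

def label_components(img):
    h, w = img.shape
    labels = np.zeros((h, w), dtype=int)
    parent = [0]                                  # parent[i] of provisional label i
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x
    next_label = 1
    for y in range(h):                            # first pass: provisional labels
        for x in range(w):
            if not img[y, x]:
                continue
            up = labels[y - 1, x] if y else 0
            left = labels[y, x - 1] if x else 0
            if up == 0 and left == 0:
                parent.append(next_label)
                labels[y, x] = next_label
                next_label += 1
            else:
                cand = [l for l in (up, left) if l]
                labels[y, x] = min(cand)
                if len(cand) == 2:                # record the label equivalence
                    parent[find(max(cand))] = find(min(cand))
    for y in range(h):                            # second pass: resolve equivalences
        for x in range(w):
            if labels[y, x]:
                labels[y, x] = find(labels[y, x])
    return labels

img = np.array([[1, 0, 1],
                [1, 0, 1],
                [0, 0, 1]])
print(label_components(img))                      # two components, labeled 1 and 2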
ISBN: (Print) 9780769545714
We consider the minimum k-way cut problem for unweighted undirected graphs with a size bound s on the number of cut edges allowed. Thus we seek to remove as few edges as possible so as to split a graph into k components, or report that this requires cutting more than s edges. We show that this problem is fixed-parameter tractable (FPT) with the standard parameterization in terms of the solution size s. More precisely, for s = O(1), we present a quadratic time algorithm. Moreover, we present a much simpler linear time algorithm for planar graphs and bounded genus graphs. Our tractability result stands in contrast to known W[1] hardness of related problems. Without the size bound, Downey et al. [2003] proved that the minimum k-way cut problem is W[1] hard with parameter k, and this is even for simple unweighted graphs. Downey et al. asked about the status for planar graphs. We get linear time with fixed parameter k for simple planar graphs since the minimum k-way cut of a planar graph has size at most 6k. More generally, we get FPT with parameter k for any graph class with bounded average degree. A simple reduction shows that vertex cuts are at least as hard as edge cuts, so the minimum k-way vertex cut is also W[1] hard with parameter k. Marx [2004] proved that finding a minimum k-way vertex cut of size s is also W[1] hard with parameter s. Marx asked about the FPT status with edge cuts, which we prove tractable here. We are not aware of any other cut problem where the vertex version is W[1] hard but the edge version is FPT; e.g., Marx [2004] proved that the k terminal cut problem is FPT parameterized by the cut size, both for edge and vertex cuts.
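To make the parameterized question concrete, here is a brute-force sketch (exponential in s, so only an illustration of the problem on a hypothetical toy graph, not the FPT algorithm of the paper): try every edge subset of size at most s and test whether removing it leaves at least k connected components.

from itertools import combinations

def num_components(vertices, edges):
    adj = {v: [] for v in vertices}
    for u, v in edges:
        adj[u].append(v)
        adj[v].append(u)
    seen, count = set(), 0
    for v in vertices:
        if v in seen:
            continue
        count += 1
        stack = [v]
        while stack:
            u = stack.pop()
            if u not in seen:
                seen.add(u)
                stack.extend(adj[u])
    return count

def small_k_way_cut(vertices, edges, k, s):
    # Return a cut of at most s edges whose removal yields >= k components, if one exists.
    for r in range(s + 1):
        for cut in combinations(edges, r):
            cut_set = set(cut)
            rest = [e for e in edges if e not in cut_set]
            if num_components(vertices, rest) >= k:
                return cut
    return None

edges = [(1, 2), (2, 3), (3, 4), (4, 1), (1, 3)]
print(small_k_way_cut([1, 2, 3, 4], edges, k=2, s=2))   # ((1, 2), (2, 3)) isolates vertex 2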
ISBN: (Print) 9783642121999
This paper deals with the Efficient Edge Domination Problem (EED, for short), also known as the Dominating Induced Matching Problem. For an undirected graph G = (V, E), EED asks for an induced matching M ⊆ E that simultaneously dominates all edges of G. Thus, the distance between edges of M is at least two and every edge in E is adjacent to an edge of M. EED is related to parallel resource allocation problems, encoding theory and network routing. The problem is NP-complete even for restricted classes like planar bipartite graphs and bipartite graphs with maximum degree three. However, the complexity has been open for chordal bipartite graphs. This paper shows that EED can be solved in polynomial time on hole-free graphs. Moreover, it gives a linear time algorithm for chordal bipartite graphs. Finally, we strengthen the NP-completeness result to planar bipartite graphs of maximum degree three.
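Restating the definition operationally: M is an efficient edge dominating set exactly when every edge of G is dominated exactly once, where an edge dominates itself and all edges sharing an endpoint with it. The brute-force finder below (exponential, for tiny hypothetical graphs only) illustrates that definition; it is unrelated to the hole-free and chordal bipartite algorithms of the paper.

from itertools import combinations

def dominates(e, f):
    # An edge dominates itself and every edge sharing an endpoint with it.
    return bool(set(e) & set(f))

def find_eed(edges):
    for r in range(len(edges) + 1):
        for M in combinations(edges, r):
            if all(sum(1 for f in M if dominates(e, f)) == 1 for e in edges):
                return M
    return None

# On the path 1-2-3-4, the middle edge dominates every edge exactly once.
print(find_eed([(1, 2), (2, 3), (3, 4)]))   # ((2, 3),)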
Background: Two biomolecular 3-D structures are said to be similar if the RMSD (root mean square deviation) between the two molecules' sequences of 3-D coordinates is less than or equal to some given constant bound. Tools for searching for similar structures in biomolecular 3-D structure databases are becoming increasingly important in the structural biology of the post-genomic era. Results: We consider an important, fundamental problem: reporting all substructures in a 3-D structure database of chain molecules (such as proteins) which are similar to a given query 3-D structure, with consideration of indels (i.e., insertions and deletions). This problem has been believed to be very difficult, but its exact computational complexity has not been known. In this paper, we first prove that the problem in unbounded dimensions is NP-hard. We then propose a new algorithm that dramatically improves the average-case time complexity of the problem in 3-D when the number of indels k is bounded by a constant. Our algorithm solves the above problem for a query of size m and a database of size N in average-case O(N) time, whereas the time complexity of the previously best algorithm was O(Nm^(k+1)). Conclusions: Our results show that although the problem of searching for similar structures in a database based on the RMSD measure with indels is NP-hard in the case of unbounded dimensions, it can be solved in 3-D by a simple average-case linear time algorithm when the number of indels is bounded by a constant.
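As background for the RMSD bound used here, this is a minimal least-squares superposition sketch (the Kabsch rotation via an SVD) for two equal-length 3-D coordinate chains; the arrays are hypothetical, and the database search and indel handling of the paper are not reproduced.

import numpy as np

def rmsd_kabsch(P, Q):
    # Least-squares RMSD between (n, 3) coordinate arrays P and Q.
    P = P - P.mean(axis=0)                      # center both point sets
    Q = Q - Q.mean(axis=0)
    U, S, Vt = np.linalg.svd(P.T @ Q)
    d = np.sign(np.linalg.det(U @ Vt))          # guard against an improper rotation
    R = U @ np.diag([1.0, 1.0, d]) @ Vt         # optimal rotation mapping Q onto P
    diff = P - Q @ R.T
    return np.sqrt((diff ** 2).sum() / len(P))

P = np.array([[0.0, 0, 0], [1, 0, 0], [1, 1, 0], [0, 1, 1]])
theta = 0.3
R0 = np.array([[np.cos(theta), -np.sin(theta), 0],
               [np.sin(theta),  np.cos(theta), 0],
               [0, 0, 1.0]])
Q = P @ R0.T + 5.0                              # a rotated and translated copy of P
print(round(rmsd_kabsch(P, Q), 6))              # ~0.0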
An algorithm of linear time complexity is presented to label connected components of a binary image represented by a quadtree. For a given node, the search for all adjacent nodes is carried out in O(1) (i.e., constant time complexity in the worst case) using our previously presented algorithm (Aizawa et al., 3rd International Symposium on Communications, Control, and Signal Processing, 2008, 505-510), whereas the usual approach explores all possible adjacencies for each node. Then, during the process of tree formulation in the search, all equivalence relations between labels are stored as lists. The time complexity of the algorithm is O(B+W) in the worst case and its auxiliary space is no more than O(B), where B and W are the numbers of leaf nodes in a quadtree representing black and white quadrants, respectively. Empirical tests compare the algorithm with another linear time connected component labeling algorithm based on top-down quadtree traversal (Samet, IEEE Trans Pattern Anal Mach Intell PAMI-7 (1985), 94-98), as well as the traditional row-by-row scanning algorithm using linear time Union-Find (Fiorio and Gustedt, Theor Comput Sci 154 (1996), 165-181). Our algorithm shows the best performance on large images. © 2009 Wiley Periodicals, Inc. Int J Imaging Syst Technol, 19, 158-166, 2009. DOI 10.1002/ima.20179
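To show the data structure the labeling operates on, here is a minimal region-quadtree builder for a square binary image whose side is a power of two (a hypothetical toy image); it only illustrates how black and white quadrants become leaf nodes, not the constant-time neighbor finding or the labeling itself.

import numpy as np

def build_quadtree(img, y=0, x=0, size=None):
    # Return ('leaf', value) or ('node', [NW, NE, SW, SE]).
    if size is None:
        size = img.shape[0]
    block = img[y:y + size, x:x + size]
    if block.min() == block.max():              # homogeneous quadrant -> leaf
        return ('leaf', int(block[0, 0]))
    h = size // 2
    return ('node', [build_quadtree(img, y,     x,     h),
                     build_quadtree(img, y,     x + h, h),
                     build_quadtree(img, y + h, x,     h),
                     build_quadtree(img, y + h, x + h, h)])

img = np.array([[1, 1, 0, 0],
                [1, 1, 0, 0],
                [0, 0, 1, 0],
                [0, 0, 0, 1]])
print(build_quadtree(img))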
Background: In 2004, Bejerano et al. announced the startling discovery of hundreds of "ultraconserved elements", long genomic sequences perfectly conserved across human, mouse, and rat. Their announcement stimulated a flurry of subsequent research. Results: We generalize the notion of ultraconserved element in a natural way from extraordinary human-rodent conservation to extraordinary conservation over an arbitrary set of species. We call these "Extremely Conserved Elements". There is a linear time algorithm to find all such Extremely Conserved Elements in any multiple sequence alignment, provided that the conservation is required to be across all the aligned species. For the general case of conservation across an arbitrary subset of the aligned species, we show that the question of whether there exists an Extremely Conserved Element is NP-complete. We illustrate the linear time algorithm by cataloguing all 177 Extremely Conserved Elements in the currently available 44-vertebrate whole-genome alignment, and point out some of the characteristics of these elements. Conclusions: The NP-completeness in the case of conservation across an arbitrary subset of the aligned species implies that it is unlikely an efficient algorithm exists for this general case. Despite this fact, for the interesting case of conservation across all or most of the aligned species, our algorithm is efficient enough to be practical. The 177 Extremely Conserved Elements that we catalog demonstrate many of the characteristics of the original ultraconserved elements of Bejerano et al.
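For the case of conservation across all aligned species, the linear-time idea amounts to a single scan over alignment columns. The sketch below (hypothetical toy alignment; the column criterion and length threshold are simplified assumptions, not the paper's exact definition) reports maximal runs of columns in which every sequence carries the same non-gap character and the run spans at least min_len columns.

def extremely_conserved(alignment, min_len=20):
    # alignment: list of equal-length aligned strings; returns half-open (start, end) column runs.
    ncols = len(alignment[0])
    runs, start = [], None
    for j in range(ncols + 1):
        ok = (j < ncols
              and len({seq[j] for seq in alignment}) == 1
              and alignment[0][j] != '-')
        if ok and start is None:
            start = j
        elif not ok and start is not None:
            if j - start >= min_len:
                runs.append((start, j))
            start = None
    return runs

aln = ["ACGTACGTACGTACGTACGTAAA-CC",
       "ACGTACGTACGTACGTACGTAGA-CC",
       "ACGTACGTACGTACGTACGTAAATCC"]
print(extremely_conserved(aln, min_len=10))   # [(0, 21)]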