Developing high-performance computing solutions for modern biological problems presents a unique set of challenges. The field is experiencing a data revolution due to the rapid introduction of several disruptive experimental technologies. Consequently, computational methods that analyze biological data are being tested on their ability to scale to massive data sizes. Beyond this data intensiveness, the computation itself differs in flavor from that of other, perhaps more traditional, scientific computing fields. The problems are dominated by integer arithmetic, string matching, combinatorial space exploration, and graph-theoretic formulations that introduce irregularity in computation and communication patterns. In this thesis, we report on our efforts to bridge the gap between biological data processing and high-performance computing solutions. Specifically, we focus on the problem of clustering very large collections of protein sequences on distributed-memory supercomputers. Given a set of amino acid sequences, we reduce the problem to one of constructing a sequence homology graph and subsequently detecting arbitrarily sized dense subgraphs. Our approach efficiently parallelizes this task on a distributed-memory machine through a combination of divide-and-conquer and combinatorial pattern-matching heuristics. Preliminary tests on an arbitrary collection of 2 million protein sequences from the Global Ocean Sampling project database show that our new approach improves sensitivity and recruits more sequences while considerably reducing the time to solution and the memory requirement. The algorithmic techniques developed as part of this research apply more widely to other applications in computational biology where large-scale sequence analysis is the primary bottleneck.
ISBN: 9780946881680 (print)
In this work, we tackle the problem of determining the trustworthiness of the users in a social network. The novelty of our approach is that it takes the negative opinions in a social network into account when ranking users by trust according to the opinions of all the users in the network. We briefly discuss some common attacks that malicious users can perform against a system in order to gain a good reputation in the network. The experiments are performed on synthetic graphs, randomly generated to model real social networks according to some common features and to simulate the attacks mentioned above. The results show that our approach can deal with these threats, demoting malicious users and minimizing their effect on the final trust ranking.
We present an I/O-efficient algorithm for topologically sorting directed acyclic graphs, called IterTS. In the worst case, our algorithm is extremely inefficient and performs O(n · sort(m)) I/Os. However, our experiments show that IterTS achieves good performance in practice. To evaluate IterTS, we compared its running time to those of three competitors: PeelTS, an I/O-efficient implementation of the standard strategy of iteratively removing sources and sinks; ReachTS, an I/O-efficient implementation of a recent parallel divide-and-conquer algorithm based on reachability queries; and SeTS, a standard DFS-based topological sorting algorithm built on top of a semi-external DFS algorithm. In our evaluation on various types of input graphs, IterTS consistently outperformed PeelTS and ReachTS, in most cases by at least an order of magnitude. SeTS outperformed IterTS on most graphs whose vertex sets fit in memory. However, IterTS often came close to the running time of SeTS on these inputs and, more importantly, SeTS was not able to process graphs whose vertex sets exceeded the size of main memory, while IterTS processed such inputs efficiently.
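The peeling strategy that PeelTS is described as building on can be illustrated with a minimal in-memory sketch. It is not the paper's implementation: it is not I/O-efficient, it removes only sources rather than sources and sinks, and the function name `peel_topological_sort` and the edge-list input format are assumptions made for illustration.

```python
from collections import deque


def peel_topological_sort(num_vertices, edges):
    """Topologically sort a DAG by repeatedly removing sources.

    In-memory illustration only (hypothetical helper, not PeelTS itself).
    Vertices are 0..num_vertices-1; edges is a list of (u, v) pairs.
    """
    successors = [[] for _ in range(num_vertices)]
    in_degree = [0] * num_vertices
    for u, v in edges:
        successors[u].append(v)
        in_degree[v] += 1

    # Sources are the vertices with no incoming edges.
    ready = deque(v for v in range(num_vertices) if in_degree[v] == 0)
    order = []
    while ready:
        u = ready.popleft()
        order.append(u)
        # Removing u may turn some of its successors into new sources.
        for v in successors[u]:
            in_degree[v] -= 1
            if in_degree[v] == 0:
                ready.append(v)

    if len(order) != num_vertices:
        raise ValueError("graph contains a cycle")
    return order
```

For example, `peel_topological_sort(4, [(0, 1), (0, 2), (1, 3), (2, 3)])` places vertex 0 before vertices 1 and 2, and both of those before vertex 3.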
We study how parallel chip-firing on the complete graph $K_n$ changes behavior as we vary the total number of chips. Surprisingly, the activity of the system, defined as the average number of firings per time step, does not increase smoothly in the number of chips; instead it remains constant over long intervals, punctuated by sudden jumps. In the large-$n$ limit we find a "devil's staircase" dependence of activity on the number of chips. The proof proceeds by reducing the chip-firing dynamics to iteration of a self-map of the circle $S^1$, in such a way that the activity of the chip-firing state equals the Poincaré rotation number of the circle map. The stairs of the devil's staircase correspond to periodic chip-firing states of small period.
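The dynamics can be explored numerically with a small simulation. The sketch below assumes the standard parallel chip-firing rule on the complete graph: at each step, every vertex holding at least n - 1 chips (its degree) fires simultaneously, sending one chip to each other vertex. The function names, and the choice to estimate activity over a finite window after a burn-in period, are illustrative assumptions rather than anything taken from the paper.

```python
def parallel_chip_firing_step(chips):
    """Apply one parallel update on the complete graph with len(chips) vertices.

    Returns the new configuration and the number of vertices that fired.
    """
    n = len(chips)
    # A vertex fires when it holds at least deg(v) = n - 1 chips.
    fires = [1 if c >= n - 1 else 0 for c in chips]
    total_fired = sum(fires)
    # A firing vertex loses n - 1 chips; every vertex gains one chip
    # from each *other* vertex that fired.
    new_chips = [c - (n - 1) * f + (total_fired - f) for c, f in zip(chips, fires)]
    return new_chips, total_fired


def estimate_activity(chips, burn_in, window):
    """Average number of firings per time step over `window` steps.

    `burn_in` steps are discarded first so that transients do not
    bias the estimate (a finite-time approximation, by assumption).
    """
    state = list(chips)
    for _ in range(burn_in):
        state, _ = parallel_chip_firing_step(state)
    fired = 0
    for _ in range(window):
        state, count = parallel_chip_firing_step(state)
        fired += count
    return fired / window
```

Dividing the result by n gives a per-vertex rate between 0 and 1, which is convenient when comparing different graph sizes. Sweeping the total number of chips for a fixed n and plotting the estimate is one way to look for the plateaus the abstract describes.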
The random accumulation of variations in the human genome over time implicitly encodes a history of how human populations have arisen, dispersed, and intermixed since we emerged as a species. Reconstructing that history is a challenging computational and statistical problem but has important applications both to basic research and to the discovery of genotype-phenotype correlations. We present a novel approach to inferring human evolutionary history from genetic variation data. We use the idea of consensus trees, a technique generally used to reconcile species trees from divergent gene trees, adapting it to the problem of finding robust relationships within a set of intraspecies phylogenies derived from local regions of the genome. Validation on both simulated and real data shows the method to be effective in recapitulating the known true structure of the data, closely matching our best current understanding of human evolutionary history. Additional comparison with the results of leading methods for population substructure assignment verifies that our method provides comparable accuracy in identifying meaningful population subgroups, in addition to inferring the relationships among them. The consensus tree approach thus provides a promising new model for the robust inference of substructure and ancestry from large-scale genetic variation data.
In most network analysis tools, computing the shortest paths between all pairs of nodes is a fundamental step toward discovering other properties. One such property is closeness centrality, a measure of how central a vertex is in a given network. In this paper, the authors present a method to compute All Pairs Shortest Paths on graphs that have two characteristics: an abundance of nodes with degree one, and the existence of articulation points. These characteristics are present in many real-life networks, especially networks with a power-law degree distribution, as is the case for biological networks. The authors' method compacts the degree-one nodes into their source, then uses the network's articulation points to disconnect the network and computes the shortest paths within the biconnected components. In the final step, the method merges the results to obtain the shortest paths for the whole network. The method achieves a remarkable speedup over state-of-the-art shortest-path methods: as much as 7-fold on artificial graphs and 3.25-fold on real application graphs. Unlike previous research, this performance improvement does not require an elaborate setup, since the authors' algorithm can process significant instances on a popular workstation.
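The first idea, compacting degree-one nodes, can be sketched as follows. This is a simplified illustration, not the authors' algorithm: it assumes a connected, unweighted, undirected graph with at least three vertices, performs a single pass of degree-one removal, and omits the articulation-point decomposition entirely. The function names and the adjacency-dictionary format are hypothetical. The key observation is that a degree-one node's distance to any other vertex is one more than its neighbor's distance, so breadth-first search only needs to run on the remaining core.

```python
from collections import deque


def bfs_distances(adj, source, allowed):
    """Hop distances from `source`, visiting only vertices in `allowed`."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v in allowed and v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


def apsp_with_leaf_compaction(adj):
    """All-pairs hop distances; BFS runs only from non-leaf (core) vertices.

    adj maps each vertex to the set of its neighbours. Assumes a connected,
    unweighted, undirected graph with at least three vertices, so that a
    degree-one vertex is never adjacent to another degree-one vertex.
    """
    leaves = {v for v, nbrs in adj.items() if len(nbrs) == 1}
    core = set(adj) - leaves
    parent = {leaf: next(iter(adj[leaf])) for leaf in leaves}
    core_dist = {u: bfs_distances(adj, u, core) for u in core}

    def anchor(v):
        # Map a vertex to (core representative, extra hops to reach it).
        return (parent[v], 1) if v in leaves else (v, 0)

    dist = {}
    for u in adj:
        cu, ou = anchor(u)
        dist[u] = {}
        for v in adj:
            if u == v:
                dist[u][v] = 0
            else:
                cv, ov = anchor(v)
                dist[u][v] = ou + core_dist[cu][cv] + ov
    return dist


def closeness_centrality(dist, v):
    """(n - 1) divided by the sum of distances from v to all other vertices."""
    return (len(dist) - 1) / sum(dist[v].values())
```

On a star with center 0 and leaves 1, 2, 3, the core is the single vertex 0, so only one breadth-first search runs, yet every leaf-to-leaf distance is still recovered as 2.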
We show that, for every $0 \le p \le 1$, there is an $O(n^{2.575 - p/(7.4 - 2.3p)})$-time algorithm that, given a directed graph with small positive integer weights, estimates the length of the shortest path between every pair of vertices $u, v$ in the graph to within an additive error $\delta^p(u, v)$, where $\delta(u, v)$ is the exact length of the shortest path between $u$ and $v$. This algorithm runs faster than the fastest algorithm for computing exact shortest paths for any $0 < p \le 1$. Previously, the only way to "beat" the running time of the exact shortest path algorithms was by applying an algorithm of Zwick [2002] that approximates the shortest path distances within a multiplicative error of $(1 + \varepsilon)$. Our algorithm thus gives a smooth qualitative and quantitative transition between the fastest exact shortest paths algorithm and the fastest approximation algorithm with a linear additive error. In fact, the main ingredient we need in order to obtain the above result, which is also interesting in its own right, is an algorithm for computing $(1 + \varepsilon)$ multiplicative approximations for the shortest paths, whose running time is faster than the running time of Zwick's approximation algorithm when $\varepsilon \ll 1$ and the graph has small integer weights.
The lower and the upper irredundance numbers of a graph $G$, denoted $ir(G)$ and $IR(G)$, respectively, are conceptually linked to the domination and independence numbers and have numerous relations to other graph parameters. It has been an open question whether determining these numbers for a graph $G$ on $n$ vertices admits exact algorithms running in time faster than the trivial $\Theta(2^n \cdot \mathrm{poly}(n))$ enumeration, also called the $2^n$-barrier. The main contributions of this article are exact exponential-time algorithms breaking the $2^n$-barrier for irredundance. We establish algorithms with running times of $O^*(1.99914^n)$ for computing $ir(G)$ and $O^*(1.9369^n)$ for computing $IR(G)$. Both algorithms use polynomial space. The first algorithm uses a parameterized approach to obtain (faster) exact algorithms. The second one is based, in addition, on a reduction to the Maximum Induced Matching problem providing a branch-and-reduce algorithm to solve it. © 2011 Elsevier B.V. All rights reserved.
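For reference, the trivial enumeration that the article improves on can be sketched directly from the standard definitions. This is the brute-force baseline, not either of the article's algorithms. It assumes the usual definition: a vertex set S is irredundant if every vertex v in S has a private neighbor, that is, a vertex in the closed neighborhood of v that is in no other closed neighborhood of a member of S. Then ir(G) is the smallest size of a maximal irredundant set and IR(G) the largest size of any irredundant set. The function names and adjacency-dictionary input are illustrative.

```python
from itertools import combinations


def closed_neighbourhood(adj, v):
    """The vertex v together with all of its neighbours."""
    return adj[v] | {v}


def is_irredundant(adj, subset):
    """True if every vertex of `subset` has a private neighbour."""
    for v in subset:
        covered = set()
        for u in subset:
            if u != v:
                covered |= closed_neighbourhood(adj, u)
        if not (closed_neighbourhood(adj, v) - covered):
            return False
    return True


def irredundance_numbers(adj):
    """Return (ir(G), IR(G)) by enumerating all 2^n vertex subsets.

    adj maps each vertex to the set of its neighbours (undirected graph).
    Exponential time; only usable on very small graphs.
    """
    vertices = list(adj)
    lower, upper = None, 0
    for size in range(len(vertices) + 1):
        for combo in combinations(vertices, size):
            s = set(combo)
            if not is_irredundant(adj, s):
                continue
            upper = max(upper, size)
            # Irredundance is hereditary, so a set is maximal exactly
            # when no single vertex can be added to it.
            maximal = all(
                not is_irredundant(adj, s | {w}) for w in vertices if w not in s
            )
            if maximal and lower is None:
                lower = size  # sizes are visited in increasing order
    return lower, upper
```

On the star with center 0 and three leaves, the center alone is a maximal irredundant set while the three leaves together form the largest one, so the two numbers differ.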
ISBN: 9781450307963 (print)
The normal functioning of a living cell is characterized by complex interaction networks involving many different types of molecules. Associations detected between diseases and perturbations in well-defined pathways within such interaction networks have the potential to illuminate the molecular mechanisms underlying disease progression and response to treatment. In this paper, we present a computational method that compares expression profiles of genes in cancer samples to samples from normal tissues in order to detect perturbations of pre-defined pathways in the cancer. In contrast to many previous methods, our scoring function approach explicitly takes into account the interactions between the gene products in a pathway. Moreover, we compute the sub-pathway that has the highest score, as opposed to merely computing the score for the entire pathway. We use a permutation test to assess the statistical significance of the most perturbed sub-pathway. We apply our method to 20 pathways in the Netpath database and to the Global Cancer Map of gene expression in 18 cancers. We demonstrate that our method yields more sensitive results than alternatives that do not consider interactions or measure the perturbation of a pathway as a whole. We perform a sensitivity analysis to show that our approach is robust to modest changes in the input data. Our method confirms numerous well-known connections between pathways and cancers.
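The abstract does not specify the scoring function or the details of the permutation scheme, so the following is only a generic sketch of how a label-permutation test assigns a p-value to an observed score. The function `permutation_p_value`, the pluggable `score` callable, and the toy `mean_difference` statistic are all hypothetical stand-ins; the paper's sub-pathway score would take the place of `score`.

```python
import random


def permutation_p_value(case_values, control_values, score,
                        num_permutations=999, seed=0):
    """Estimate how often a random relabelling scores at least as high
    as the observed case/control split.

    `score(cases, controls)` is any statistic where larger means more
    perturbed (an assumption of this sketch).
    """
    rng = random.Random(seed)
    observed = score(case_values, control_values)
    pooled = list(case_values) + list(control_values)
    n_case = len(case_values)
    at_least_as_extreme = 0
    for _ in range(num_permutations):
        rng.shuffle(pooled)  # randomly reassign case/control labels
        if score(pooled[:n_case], pooled[n_case:]) >= observed:
            at_least_as_extreme += 1
    # Add-one correction keeps the estimate strictly positive.
    return (at_least_as_extreme + 1) / (num_permutations + 1)


def mean_difference(cases, controls):
    """Toy statistic: absolute difference of group means."""
    return abs(sum(cases) / len(cases) - sum(controls) / len(controls))
```

With clearly separated groups, such as `[10, 11, 12, 13, 14]` against `[1, 2, 3, 4, 5]`, very few relabellings match the observed difference and the estimated p-value is small; with identical groups it is 1.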