ISBN: 9781860947834 (print)
In this paper, we study the exact probability distribution of the number of cycles $c$ in the breakpoint graph of two random genomes with $n$ genes or markers and $\chi_1$ and $\chi_2$ linear chromosomes, respectively. The genomic distance $d$ between the two genomes is $d = n - c$. In the limit, we find that the expectation of $d$ is $n - \frac{2\chi_1\chi_2}{2\chi_1 + 2\chi_2 - 1} - \frac{1}{2}\ln\frac{n + \min(\chi_1, \chi_2)}{\chi_1 + \chi_2}$.
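As a rough numerical illustration, the limiting expectation can be evaluated directly. This sketch assumes the expression in the abstract groups as n - 2·chi1·chi2/(2·chi1 + 2·chi2 - 1) - (1/2)·ln((n + min(chi1, chi2))/(chi1 + chi2)); the function name `expected_distance` is illustrative and does not come from the paper.

```python
import math


def expected_distance(n, chi1, chi2):
    """Limiting expectation of the genomic distance d = n - c.

    Assumes the grouping of terms described in the lead-in; n is the
    number of genes, chi1 and chi2 the numbers of linear chromosomes.
    """
    if n <= 0 or chi1 <= 0 or chi2 <= 0:
        raise ValueError("n, chi1 and chi2 must be positive")
    # Correction term that depends only on the chromosome counts.
    chromosome_term = 2 * chi1 * chi2 / (2 * chi1 + 2 * chi2 - 1)
    # Logarithmic correction term that grows slowly with n.
    log_term = 0.5 * math.log((n + min(chi1, chi2)) / (chi1 + chi2))
    return n - chromosome_term - log_term


if __name__ == "__main__":
    for n in (100, 1000, 10000):
        print(n, expected_distance(n, 5, 8))
```

Under this reading, the expected distance stays close to n, with corrections that depend on the chromosome counts and only logarithmically on n.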
In this paper, an effective promoter detection algorithm, called PromoterExplorer, is proposed. In our approach, various features, i.e. the local distribution of pentamers, positional CpG island features and the digitized DNA sequence, are combined to build a high-dimensional input vector. A cascade AdaBoost-based learning procedure is adopted to select the most "informative" or "discriminating" features to build a sequence of weak classifiers. A number of weak classifiers are combined into a strong classifier, which achieves better performance. To reduce false positives, a cascade structure is used for detection. PromoterExplorer is tested on large-scale DNA sequences from different databases, including EPD, GenBank and human chromosome 22. The proposed method consistently outperforms PromoterInspector and Dragon Promoter Finder.
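The general idea of boosting weak classifiers and chaining the resulting strong classifiers into a rejection cascade can be sketched as follows. This is a generic AdaBoost with threshold stumps on toy numeric features, not the PromoterExplorer implementation; all function names, the stump learner and the stage thresholds are assumptions made for illustration.

```python
import math


def stump_predict(row, feature, threshold, polarity):
    """Weak classifier: a threshold test on a single feature (+1 or -1)."""
    return polarity if row[feature] >= threshold else -polarity


def train_stump(X, y, weights):
    """Exhaustively pick the stump with the lowest weighted error."""
    best = None
    for feature in range(len(X[0])):
        for threshold in sorted({row[feature] for row in X}):
            for polarity in (1, -1):
                error = sum(
                    w for row, label, w in zip(X, y, weights)
                    if stump_predict(row, feature, threshold, polarity) != label
                )
                if best is None or error < best[0]:
                    best = (error, feature, threshold, polarity)
    return best


def adaboost(X, y, rounds):
    """Return a strong classifier as a list of (alpha, feature, threshold, polarity)."""
    n = len(X)
    weights = [1.0 / n] * n
    model = []
    for _ in range(rounds):
        error, feature, threshold, polarity = train_stump(X, y, weights)
        error = min(max(error, 1e-10), 1 - 1e-10)  # keep the log finite
        alpha = 0.5 * math.log((1 - error) / error)
        model.append((alpha, feature, threshold, polarity))
        # Increase the weight of misclassified samples, then renormalize.
        weights = [
            w * math.exp(-alpha * label * stump_predict(row, feature, threshold, polarity))
            for row, label, w in zip(X, y, weights)
        ]
        total = sum(weights)
        weights = [w / total for w in weights]
    return model


def score(model, row):
    """Weighted vote of all weak classifiers in one strong classifier."""
    return sum(a * stump_predict(row, f, t, p) for a, f, t, p in model)


def cascade_predict(stages, row):
    """Reject (-1) as soon as any stage scores below its threshold."""
    for model, stage_threshold in stages:
        if score(model, row) < stage_threshold:
            return -1
    return 1


if __name__ == "__main__":
    X = [[0.1], [0.2], [0.8], [0.9]]  # toy one-dimensional features
    y = [-1, -1, 1, 1]
    stages = [(adaboost(X, y, 1), 0.0), (adaboost(X, y, 2), 0.0)]
    print([cascade_predict(stages, row) for row in X])
```

In a cascade of this kind, early stages are cheap and discard most negatives, so later stages only see candidates that have passed every earlier test, which is one common way to keep the false-positive rate low.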
In this paper, we present a new biclustering algorithm to provide a geometrical interpretation of similar microarray gene expression profiles. Unlike standard clustering analyses, biclustering can perform simultaneous classification on the row and column dimensions of a data matrix. The main objective of the strategy is to reveal a submatrix in which a subset of genes exhibits a consistent pattern over a subset of conditions. However, the search for such subsets is a computationally complex task. We propose a new algorithm, based on the Hough transform in the column-pair space, to perform pattern identification. The algorithm is especially suitable for the biclustering analysis of large-scale microarray data. Our simulation studies show that the method is robust to noise and computationally efficient. Furthermore, we have applied it to a large database of gene expression profiles of multiple human organs, and the resulting biclusters show clear biological meaning.
A variety of pattern-based methods have been exploited to extract biological relations from the literature. Many of them require significant domain-specific knowledge to build the patterns by hand, or a large amount of labeled data to learn the patterns automatically. In this paper, a semi-supervised model is presented that combines unlabeled and labeled data in the pattern learning procedure. First, a large amount of unlabeled data is used to generate a raw pattern set. This set is then refined in the evaluation phase by incorporating the domain knowledge provided by a relatively small labeled data set. Comparative results show that labeled data, when used in conjunction with inexpensive unlabeled data, can considerably improve the learning accuracy.
It is known that folding a protein chain into the cubic lattice is an NP-complete problem. We consider a seemingly easier problem: given a 3D fold of a protein chain (coordinates of its Cα atoms), we want to find the closest lattice approximation of this fold. This problem has been studied under names such as "lattice approximation of a protein chain", "the protein chain fitting problem" and "building protein lattice models". We show that this problem is NP-complete for the cubic lattice with side 3.8 Å when closeness is measured by the coordinate root-mean-square deviation.
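To make the objective concrete, the sketch below checks that a candidate is a valid self-avoiding chain on the cubic lattice and computes its coordinate root-mean-square deviation from a given fold. It assumes the deviation is taken point by point without any superposition or lattice reorientation, which may differ from the exact setting of the paper; the helper names are illustrative.

```python
import math

SIDE = 3.8  # lattice spacing in angstroms, as stated in the abstract


def is_lattice_chain(walk):
    """True if integer lattice points form a self-avoiding unit-step chain."""
    if len(set(walk)) != len(walk):
        return False  # a lattice point is visited twice
    return all(
        sum(abs(a - b) for a, b in zip(p, q)) == 1
        for p, q in zip(walk, walk[1:])
    )


def to_coordinates(walk):
    """Scale integer lattice indices to coordinates in angstroms."""
    return [tuple(SIDE * c for c in point) for point in walk]


def crms(fold, model):
    """Coordinate RMS deviation between two equally long point lists."""
    if len(fold) != len(model) or not fold:
        raise ValueError("point lists must be non-empty and of equal length")
    squared = sum(
        (a - b) ** 2
        for p, q in zip(fold, model)
        for a, b in zip(p, q)
    )
    return math.sqrt(squared / len(fold))


if __name__ == "__main__":
    fold = [(0.3, 0.1, 0.0), (3.6, 0.4, 0.2), (3.9, 3.7, 0.1)]  # made-up fold
    walk = [(0, 0, 0), (1, 0, 0), (1, 1, 0)]
    if is_lattice_chain(walk):
        print(crms(fold, to_coordinates(walk)))
```

Evaluating one candidate is cheap; the hardness result concerns the search over all self-avoiding lattice chains for the one that minimizes this deviation.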
Inferring the structure of gene regulatory networks from gene expression data has attracted growing interest in recent years. Several machine-learning methods, such as Bayesian networks, have been proposed to deal with this challenging problem. However, in many cases, network reconstructions based purely on gene expression data do not lead to satisfactory results when the obtained topology is compared against a validation network. Therefore, in this paper we propose an "inverse" approach: starting from a priori specified network topologies, we identify those parts of the network that are relevant for the gene expression data at hand. For this purpose, we employ linear ridge regression to predict the expression level of a given gene from its relevant regulators with high reliability. Calculated statistical significances of the resulting network topologies reveal that slight modifications of the pruned regulatory network enable an additional substantial improvement.
In metagenomics, the goal is to analyze the genomic content of a sample of organisms collected from a common habitat. One approach is to apply large-scale random shotgun sequencing techniques to obtain a collection of DNA reads from the sample. These data are then compared against databases of known sequences, such as NCBI-nr or NCBI-nt, in an attempt to identify the taxonomical content of the sample. We introduce new software called MEGAN (Meta Genome ANalyzer) that generates species profiles from such sequencing data by assigning reads to taxa of the NCBI taxonomy using a straightforward assignment algorithm. The approach is illustrated by application to a number of datasets obtained using both sequencing-by-synthesis and Sanger sequencing technology, including metagenomic data from a mammoth bone, a portion of the Sargasso Sea data set, and several complete microbial test genomes used for validation purposes.
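The abstract does not spell out the assignment rule. One simple rule of this kind, assumed here purely for illustration, places each read at the lowest common ancestor of all taxa whose sequences it matches. The toy taxonomy, the child-to-parent dictionary layout and the function names below are hypothetical and are not taken from the MEGAN software.

```python
def lineage(parent, taxon):
    """Path from the root down to `taxon` in a child -> parent mapping."""
    path = [taxon]
    while taxon in parent:
        taxon = parent[taxon]
        path.append(taxon)
    return path[::-1]  # root first


def lowest_common_ancestor(parent, taxa):
    """Deepest taxon shared by the lineages of all given taxa (None if empty)."""
    paths = [lineage(parent, t) for t in taxa]
    lca = None
    for level in zip(*paths):
        if len(set(level)) != 1:
            break  # lineages diverge below the previous level
        lca = level[0]
    return lca


if __name__ == "__main__":
    # Toy taxonomy for illustration only.
    parent = {
        "Bacteria": "root",
        "Proteobacteria": "Bacteria",
        "Firmicutes": "Bacteria",
        "Escherichia coli": "Proteobacteria",
        "Salmonella enterica": "Proteobacteria",
        "Bacillus subtilis": "Firmicutes",
    }
    hits = ["Escherichia coli", "Salmonella enterica"]
    print(lowest_common_ancestor(parent, hits))
```

With a rule like this, a read that matches only one species is assigned to that species, while a read that matches sequences from distant taxa is pushed up to a higher-level node.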
Gene clusters that span three or more chromosomal regions are of increasing importance, yet statistical tests to validate such clusters are in their infancy. Current approaches either conduct several pairwise comparisons, or consider only the number of genes that occur in all the regions. In this paper, we provide statistical tests for clusters spanning exactly three regions based on genome models of typical comparative genomics problems, including analysis of conserved linkage within multiple species and identification of large-scale duplications. Our tests are the first to combine evidence from genes shared among all three regions and genes shared between pairs of regions. We show that our tests of clusters spanning three regions are more sensitive than existing approaches and can thus be used to identify more diverged homologous regions.
Extending the idea of our previous algorithm [17, 18], we developed a new sequential quartet-based phylogenetic tree construction method. This new algorithm reconstructs the phylogenetic tree iteratively by examining, at each merge step, every possible super-quartet formed by four subtrees, instead of the simple quartets used in our previous algorithm. Because our new algorithm evaluates super-quartet trees, each of which may consist of more than four molecular sequences, it can effectively alleviate the long-standing problem of quartet errors encountered in quartet-based methods. Experimental results show that the proposed algorithm achieves very high accuracy and solid consistency in reconstructing phylogenetic trees on different sets of synthetic DNA data under various evolutionary conditions.
We present an algorithm for calculating the quartet distance between two evolutionary trees of bounded degree on a common set of $n$ species. The previous best algorithm has running time $O(d^2 n^2)$ for trees in which no node has degree greater than $d$. The algorithm developed herein has running time $O(d^9 n \log n)$, which makes it the first algorithm for computing the quartet distance between non-binary trees with a sub-quadratic worst-case running time.
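For reference, the quantity being computed can be defined by a brute-force baseline that inspects every set of four species. The sketch below is this naive enumeration over all quartets, not the sub-quadratic algorithm of the paper; the edge-list tree encoding and the function names are assumptions made for illustration.

```python
from collections import deque
from itertools import combinations


def build_adjacency(edges):
    """Undirected adjacency lists from a list of (u, v) tree edges."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, []).append(v)
        adj.setdefault(v, []).append(u)
    return adj


def distances_from(adj, source):
    """Number of edges from `source` to every node (breadth-first search)."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


def quartet_topology(dist, a, b, c, d):
    """Partner of `a` in the induced split, or None for a star quartet.

    In a tree, the pairing with the strictly smallest distance sum is the
    split of the quartet; if all three sums tie, the quartet is unresolved.
    """
    sums = {
        b: dist[a][b] + dist[c][d],
        c: dist[a][c] + dist[b][d],
        d: dist[a][d] + dist[b][c],
    }
    smallest = min(sums.values())
    winners = [partner for partner, s in sums.items() if s == smallest]
    return winners[0] if len(winners) == 1 else None


def quartet_distance(edges1, edges2, leaves):
    """Count the quartets of leaves whose topology differs between two trees."""
    adj1, adj2 = build_adjacency(edges1), build_adjacency(edges2)
    dist1 = {leaf: distances_from(adj1, leaf) for leaf in leaves}
    dist2 = {leaf: distances_from(adj2, leaf) for leaf in leaves}
    return sum(
        quartet_topology(dist1, *q) != quartet_topology(dist2, *q)
        for q in combinations(sorted(leaves), 4)
    )


if __name__ == "__main__":
    leaves = ["a", "b", "c", "d", "e"]
    tree1 = [("a", "x"), ("b", "x"), ("x", "y"), ("c", "y"),
             ("y", "z"), ("d", "z"), ("e", "z")]
    tree2 = [("a", "x"), ("c", "x"), ("x", "y"), ("b", "y"),
             ("y", "z"), ("d", "z"), ("e", "z")]
    print(quartet_distance(tree1, tree2, leaves))
```

This baseline examines on the order of n^4 quartets, which is why bounds such as those quoted in the abstract matter for large trees.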