ISBN: 1860946232 (print)
Microarray gene expression data often contain missing values, which arise for various reasons. However, most gene expression data analysis algorithms, such as clustering, classification and network design, require complete data, that is, data without any missing values. It is therefore very important to impute the missing values accurately before applying these algorithms. In this paper, an Iterated Local Least Squares Imputation method (ILLsimpute) is proposed to estimate the missing values. In ILLsimpute, a similarity threshold is learned from the known expression values, and at every iteration it is used to obtain a set of coherent genes for each target gene that contains missing values. The target gene is then represented as a linear combination of the coherent genes using least squares. The algorithm terminates after a certain number of iterations or when the imputation converges. Experimental results on real microarray datasets show that ILLsimpute outperforms three of the most recent methods on several commonly tested datasets.
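The least squares step described above can be illustrated with a minimal sketch of a single imputation pass for one target gene. This is not the authors' ILLsimpute implementation: the function name `lls_impute_gene`, the use of Euclidean distance as the similarity measure, and the fixed `max_distance` threshold (which the paper learns from the data) are all assumptions made for illustration, and the iteration loop is omitted.

```python
import numpy as np

def lls_impute_gene(target, candidates, max_distance):
    """Fill NaNs in `target` using complete candidate genes (hypothetical helper).

    target:       1-D array of expression values, NaN where missing.
    candidates:   2-D array (genes x experiments) without missing values.
    max_distance: assumed, fixed similarity threshold (learned in the paper).
    """
    target = np.asarray(target, dtype=float)
    candidates = np.asarray(candidates, dtype=float)
    missing = np.isnan(target)
    known = ~missing

    # Similarity is measured only on the experiments where the target is known.
    dist = np.linalg.norm(candidates[:, known] - target[known], axis=1)
    coherent = candidates[dist <= max_distance]

    filled = target.copy()
    if coherent.shape[0] == 0:
        # No coherent gene found: fall back to the mean of the known values.
        filled[missing] = target[known].mean()
        return filled

    # Least squares: express the known part of the target as a linear
    # combination of the coherent genes, then reuse the weights.
    weights, *_ = np.linalg.lstsq(coherent[:, known].T, target[known], rcond=None)
    filled[missing] = coherent[:, missing].T @ weights
    return filled
```

For example, with candidates `[[1, 2, 3, 4], [2, 1, 0, 1]]` and a target `[4, 5, 6, nan]` that equals twice the first gene plus the second, the missing entry is recovered from the fitted weights.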
ISBN: 1860946232 (print)
The gene tree and species tree problem remains a central problem in phylogenomics. To overcome this problem, gene concatenation approaches have been used to combine a certain number of genes, chosen at random from a set of widely distributed orthologous genes selected from genome data, for phylogenetic analysis. The random concatenation mechanism prevents further investigation of the inner structure of the gene data set used to infer the phylogenetic trees, and it prevents us from locating the most phylogenetically informative genes. In this work, a phylogenomic mining approach is described that gains knowledge from a gene data set by clustering its genes with a self-organizing map (SOM) in order to explore the inner structure of the data set. From this, the most phylogenetically informative gene set is created by picking the maximum entropy gene from each cluster, and this set is used to infer phylogenetic trees. On the same data set, the phylogenomic mining approach performs better than the random gene concatenation approach.
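The selection step (one maximum entropy gene per cluster) can be sketched as follows. The abstract does not define the entropy measure, so this sketch assumes the Shannon entropy of each gene's character composition; the SOM clustering is also not reproduced, and the cluster labels are taken as given. The names `shannon_entropy` and `pick_max_entropy_genes` are hypothetical.

```python
from collections import Counter
from math import log2

def shannon_entropy(seq):
    """Shannon entropy (bits) of the character composition of `seq`."""
    counts = Counter(seq)
    n = len(seq)
    return -sum(c / n * log2(c / n) for c in counts.values())

def pick_max_entropy_genes(genes, labels):
    """Return {cluster: gene name} with the highest-entropy gene per cluster.

    genes:  dict mapping gene name -> sequence string.
    labels: dict mapping gene name -> cluster id (assumed to come from a
            clustering step such as a SOM, which is not shown here).
    """
    best = {}
    for name, seq in genes.items():
        cluster = labels[name]
        h = shannon_entropy(seq)
        # Keep the gene with the largest entropy seen so far in this cluster.
        if cluster not in best or h > best[cluster][1]:
            best[cluster] = (name, h)
    return {cluster: name for cluster, (name, _) in best.items()}
```

The selected genes would then be concatenated and passed to a tree inference method in place of a random gene subset.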
ISBN: 1860946232 (print)
Genome-wide computational analysis for small nuclear RNA (snRNA) genes identified 76 and 73 putative snRNA genes in the indica and japonica rice genomes, respectively. To designate a predicted sequence as an snRNA gene, we used the following basic criteria: a minimum of 70% sequence identity to the plant snRNA gene used for the genome search; the presence of conserved promoter elements (TATA box, USE motif and monocot promoter specific elements (MSPs)); and extensive sequence alignment to rice/plant expressed sequence tags. Comparative sequence analysis with snRNA genes from other organisms, together with predicted secondary structures, showed overall conservation of snRNA sequence and structure, with plant-specific features (presence of a TATA box in both polymerase II and III transcribed genes, location of the USE motif upstream of the TATA box at a fixed but different distance in polymerase II and polymerase III transcribed snRNA genes) and the presence of multiple monocot-specific MSPs upstream of the USE motif. Detailed analysis results, including all multiple sequence alignments, sequence logos, secondary structures, sequences, etc., are available at http://***
ISBN: 1860946232 (print)
The inference of evolutionary relationships is usually aided by a reconstruction method that is expected to produce a reasonably accurate estimate of the true evolutionary history. However, various factors are known to impede the reconstruction process and result in inaccurate estimates of the true evolutionary relationships. Detecting and removing errors (wrong branches) from tree estimates is of great significance for the results of phylogenetic analyses. Methods have been devised for assessing the support of (or confidence in) phylogenetic tree branches, which is one way of quantifying inaccuracies in trees. In this paper, we study, via simulations, the performance of the most commonly used methods for assessing branch support: bootstrap of maximum likelihood and maximum parsimony trees, consensus of maximum parsimony trees, and consensus of Bayesian inference trees. Under the conditions of our experiments, our findings indicate that the actual amount of change along a branch does not have a strong impact on the support of that branch. Further, we find that bootstrap and Bayesian estimates are generally comparable to each other, and superior to a consensus of maximum parsimony trees. In our opinion, the most significant finding of all is that there is no threshold value for any of the methods that would allow for the elimination of wrong branches while maintaining all correct ones: there are always weakly supported true positive branches.
ISBN: 1860946232 (print)
We present a randomized algorithm for semi-supervised learning of Mahalanobis metrics over $\mathbb{R}^n$. The inputs to the algorithm are a set, $U$, of unlabeled points in $\mathbb{R}^n$; a set of pairs of points, $S = \{(x, y)_i\}$ with $x, y \in U$, that are known to be similar; and a set of pairs of points, $D = \{(x, y)_i\}$ with $x, y \in U$, that are known to be dissimilar. The algorithm randomly samples $S$, $D$, and $m$-dimensional subspaces of $\mathbb{R}^n$ and learns a metric for each subspace. The metric over $\mathbb{R}^n$ is a linear combination of the subspace metrics. The randomization addresses issues of efficiency and overfitting. Extensions of the algorithm to learning non-linear metrics via kernels, and to use as a pre-processing step for dimensionality reduction, are discussed. The new method is demonstrated on a regression problem (structure-based chemical shift prediction) and a classification problem (predicting clinical outcomes for immunomodulatory strategies for treating severe sepsis).
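The combination step can be sketched in a few lines, assuming each subspace is axis-aligned (a subset of coordinates) and that each subspace metric has already been learned; how the paper samples subspaces and learns the per-subspace metrics is not reproduced here. The names `combine_subspace_metrics` and `mahalanobis`, and the choice of weights, are illustrative assumptions.

```python
import numpy as np

def combine_subspace_metrics(n, subspaces, weights):
    """Build an n x n metric matrix as a weighted sum of subspace metrics.

    subspaces: list of (indices, M_k) pairs, where `indices` are the
               coordinates spanning the subspace (assumed axis-aligned and
               without repeats) and M_k is an m x m positive semidefinite matrix.
    weights:   one non-negative weight per subspace.
    """
    M = np.zeros((n, n))
    for (indices, M_k), w in zip(subspaces, weights):
        idx = np.asarray(indices)
        # Embed the m x m block at the rows/columns of the chosen coordinates.
        M[np.ix_(idx, idx)] += w * np.asarray(M_k, dtype=float)
    return M

def mahalanobis(x, y, M):
    """Distance sqrt((x - y)^T M (x - y)) under the metric matrix M."""
    d = np.asarray(x, dtype=float) - np.asarray(y, dtype=float)
    return float(np.sqrt(d @ M @ d))
```

Because a non-negative combination of positive semidefinite blocks is itself positive semidefinite, the combined matrix still defines a valid (pseudo-)metric.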
ISBN: 1860946232 (print)
Finding motifs in DNA sequences plays an important role in deciphering transcriptional regulatory mechanisms and in drug target identification. In this paper, we propose an efficient algorithm, EDAM, for finding motifs based on frequency transformation and Minimum Bounding Rectangle (MBR) techniques. It works in three phases: frequency transformation, MBR-clique searching and motif discovery. In frequency transformation, EDAM divides the sample sequences into a set of substrings using sliding windows, then transforms them into frequency vectors, which are stored in MBRs. In MBR-clique searching, EDAM uses the frequency distance theorems to search for the MBR-cliques used for motif discovery. In motif discovery, EDAM discovers larger cliques by extending smaller cliques with their neighbors. To accelerate clique discovery, we propose a range query facility that avoids unnecessary computations during clique extension. The experimental results show that EDAM effectively addresses the running time bottleneck of the motif discovery problem in large DNA databases.
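The frequency transformation phase can be illustrated with a small sketch. It assumes a window step of one position and the four-letter DNA alphabet, and it computes a single bounding rectangle over all vectors; how EDAM groups vectors into MBRs and searches for cliques is not shown. The function names are hypothetical.

```python
ALPHABET = "ACGT"

def frequency_vectors(sequence, window):
    """Slide a window over `sequence` and count each base per substring."""
    vectors = []
    for start in range(len(sequence) - window + 1):
        sub = sequence[start:start + window]
        vectors.append(tuple(sub.count(base) for base in ALPHABET))
    return vectors

def bounding_rectangle(vectors):
    """Minimum bounding rectangle: per-coordinate minimum and maximum."""
    lows = tuple(min(col) for col in zip(*vectors))
    highs = tuple(max(col) for col in zip(*vectors))
    return lows, highs
```

For the sequence `AACGT` with a window of 3, the substrings `AAC`, `ACG` and `CGT` map to the vectors (2, 1, 0, 0), (1, 1, 1, 0) and (0, 1, 1, 1) in A, C, G, T order.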
ISBN: 1860946232 (print)
Multiple sequence alignments can provide information for comparative analyses of proteins and protein populations. We present statistical trend tests that can be used when an aligned data set can be divided into two or more populations based on phenotypic traits such as preferred temperature, pH, salt concentration or pressure. The approach is based on estimating and analyzing the variation between the values of physicochemical parameters at positions of the sequence alignment. Monotonic trends are detected by applying a cumulative Mann-Kendall test. The method is found to be useful for identifying significant physicochemical mechanisms behind adaptation to extreme environments and for uncovering molecular differences between mesophile and extremophile organisms. A filtering technique is also presented to visualize the underlying structure in the data. All the comparative statistical methods are available in the toolbox DeltaProt.
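As a point of reference for the trend detection step, the following is a minimal sketch of the standard Mann-Kendall test, not the cumulative variant used in the paper or the DeltaProt implementation. It assumes no tied values (no tie correction is applied to the variance) and uses the usual normal approximation for the p-value; the function name is hypothetical.

```python
from math import erfc, sqrt

def mann_kendall(values):
    """Return (S, Z, two-sided p) for a monotonic trend in `values`."""
    n = len(values)
    s = 0
    for i in range(n - 1):
        for j in range(i + 1, n):
            diff = values[j] - values[i]
            # Add the sign of each pairwise difference: +1, 0 or -1.
            s += (diff > 0) - (diff < 0)

    # Variance of S under the null hypothesis, assuming no ties.
    var_s = n * (n - 1) * (2 * n + 5) / 18

    # Continuity-corrected normal approximation.
    if s > 0:
        z = (s - 1) / sqrt(var_s)
    elif s < 0:
        z = (s + 1) / sqrt(var_s)
    else:
        z = 0.0

    p = erfc(abs(z) / sqrt(2))  # two-sided p-value from the normal tail
    return s, z, p
```

A positive S indicates an increasing trend and a negative S a decreasing one; in this setting the values would be a physicochemical parameter ordered by the phenotypic trait of interest.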
ISBN: 1860946232 (print)
We address the issue of structured motif inference. The problem is stated as follows: given a set of n DNA sequences and a quorum q (%), find the optimal structured consensus motif, described as gaps alternating with specific regions, that is shared by at least q × n sequences. Our proposal belongs to the domain of metaheuristics: it drives solutions to convergence through cooperation between a sampling strategy over the search space and a quick detection of local similarities in small sequence samples. The contributions of this paper are: (1) the design of a stochastic method whose novelty rests on driving the search with a threshold frequency f that discriminates between specific regions and gaps; (2) an original justification of the specially designed operations; (3) the implementation of a mining tool well adapted to biologists' requirements: few input parameters are needed (quorum q, minimal threshold frequency f, maximal gap length g). Our approach proves efficient on simulated data, promoter sites in dicot plants and transcription factor binding sites in the E. coli genome. Our algorithm, Kaos, compares favorably with MEME and STARS in terms of accuracy.
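The quorum condition in the problem statement can be made concrete with a short sketch that checks a candidate structured motif against a set of sequences. This is only the verification side of the problem, not the Kaos search: it assumes exact matching of the specific regions (a consensus motif would normally tolerate mismatches), gaps of length 0 to g between consecutive regions, and a quorum given as an integer percentage. The function names are hypothetical.

```python
import re

def structured_motif_pattern(regions, max_gap):
    """Regex for specific regions separated by gaps of 0..max_gap characters."""
    gap = ".{0,%d}" % max_gap
    return re.compile(gap.join(re.escape(region) for region in regions))

def meets_quorum(sequences, regions, max_gap, quorum_percent):
    """Return (quorum reached?, number of sequences containing the motif)."""
    pattern = structured_motif_pattern(regions, max_gap)
    hits = sum(1 for seq in sequences if pattern.search(seq))
    # Integer comparison avoids floating-point rounding in q x n.
    return hits * 100 >= quorum_percent * len(sequences), hits
```

For instance, the regions `TTGA` and `TATA` with a maximal gap of 3 occur in `AATTGACCCTATA` and `TTGATATA` but not in a sequence where the two regions are nine bases apart.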
ISBN: 9781860947575 (digital)
ISBN: 9781860947001 (print)
This volume contains about 40 papers covering many of the latest developments in the fast-growing field of bioinformatics. The contributions span a wide range of topics, including computational genomics and genetics, protein function and computational proteomics, the transcriptome, structural bioinformatics, microarray data analysis, motif identification, biological pathways and systems, and biomedical applications. There are also abstracts from the keynote addresses and invited *** papers cover not only theoretical aspects of bioinformatics but also delve into the application of new methods, with input from computation, engineering and biology disciplines. This multidisciplinary approach to bioinformatics gives these proceedings a unique viewpoint of the field.
ISBN: 9781860946882 (digital)
ISBN: 9781860945632 (print)
Information processing and information flow occur in the course of an organism's development and throughout its lifespan. Organisms do not exist in isolation, but interact with each other constantly within a complex ecosystem. The relationships between organisms, such as those between prey and predator, host and parasite, and mating partners, are complex and multidimensional. In all cases, there is constant communication and information flow at many *** book focuses on information processing by life forms and the use of information technology in understanding them. Readers are first given a comprehensive overview of biocomputing before navigating the complex terrain of natural processing of biological information using physiological and analogous computing models. The remainder of the book deals with “artificial” processing of biological information as a human endeavor in order to derive new knowledge and gain insight into life forms and their functioning. Specific innovative applications and tools for biological discovery are provided as the link and complement to *** “artificial” processing of biological information is complementary to natural processing, a better understanding of the former helps us improve the latter. Consequently, readers are exposed to both domains and, when dealing with biological problems of their interest, will be better equipped to grasp relevant ideas.