Selecting a small subset of genes out of the thousands of genes in microarray data is important for accurate classification of phenotypes. Widely used methods typically rank genes according to their differential expression among phenotypes and pick the top-ranked genes. We observe that feature sets obtained this way contain a certain amount of redundancy, and we study methods to minimize it. Feature sets obtained through the minimum redundancy - maximum relevance framework represent a broader spectrum of the characteristics of phenotypes than those obtained through standard ranking methods; they are more robust, generalize well to unseen data, and lead to significantly improved classification in extensive experiments on five gene expression data sets.
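The contrast between plain ranking and redundancy-aware selection can be illustrated with a small greedy procedure. This is a minimal sketch, not the authors' implementation: it assumes discretized expression values, mutual information as both the relevance and the redundancy measure, and a difference-based score (relevance minus mean redundancy). The function names and the toy data are hypothetical.

```python
from collections import Counter
from math import log2


def mutual_information(xs, ys):
    """Mutual information (in bits) between two equally long discrete sequences."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    return sum((c / n) * log2(c * n / (px[x] * py[y])) for (x, y), c in pxy.items())


def mrmr_select(features, labels, k):
    """Greedily pick k feature names, trading relevance against redundancy.

    features: dict mapping a gene name to its list of discretized values.
    Assumed score (illustrative): I(gene; class) - mean I(gene; already selected).
    """
    relevance = {name: mutual_information(vals, labels) for name, vals in features.items()}
    selected = []
    candidates = sorted(features)  # sorted so that ties are broken deterministically
    while candidates and len(selected) < k:
        def score(name):
            if not selected:
                return relevance[name]
            redundancy = sum(
                mutual_information(features[name], features[s]) for s in selected
            ) / len(selected)
            return relevance[name] - redundancy

        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return selected


# Hypothetical toy data: g2 duplicates g1, g3 is weaker but carries different information.
labels = [0, 0, 0, 0, 1, 1, 1, 1]
features = {
    "g1": [0, 0, 0, 1, 1, 1, 1, 1],
    "g2": [0, 0, 0, 1, 1, 1, 1, 1],
    "g3": [0, 0, 1, 0, 1, 1, 0, 1],
}
print(mrmr_select(features, labels, 2))
```

On this toy input a pure relevance ranking would return the two identical genes g1 and g2, whereas the redundancy penalty makes the second pick g3.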
The goal of this genomes-to-life project is to develop models that can describe the functioning of the microbial communities involved in the in situ bioremediation of uranium-contaminated groundwater and harvesting electricity from waste organic matter. Previous studies have demonstrated that the microbial communities involved in uranium bioremediation and energy harvesting are both dominated by microorganisms in the family Geobacteraceae and that these Geobacteraceae are responsible for the uranium bioremediation and electron transfer to electrodes. The research plan is diagrammed below. Examples of how both pure culture and environmental genomic studies have dramatically changed the concepts of how Geobacteraceae-dominated subsurface communities function will be presented.
We present a stochastic model of proteolytic digestion of a proteome, assuming the distribution of parent protein lengths in the proteome, the relative abundances of the 20 amino acids in the proteome, and the digestion "rules" of the enzyme used in the digestion. We derived a closed-form expression for the fragment mass distribution for a large class of enzymes, including the widely used trypsin. The expression uses the distribution of lengths in a mixture of proteins taken from a proteome, as well as the relative abundances of the 20 amino acids in the proteome. The agreement between theory and the in silico digest is excellent.
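An in silico digest of the kind used for comparison can be sketched in a few lines. This is an illustration under stated assumptions, not the authors' code: it uses the commonly cited trypsin rule (cleave after K or R unless the next residue is P), draws random proteins from uniform amino acid frequencies, and reports fragment lengths rather than masses. All names and parameter values are hypothetical.

```python
import random
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def tryptic_digest(protein):
    """Split a protein after every K or R that is not followed by P."""
    fragments, start = [], 0
    for i, residue in enumerate(protein):
        next_residue = protein[i + 1] if i + 1 < len(protein) else ""
        if residue in "KR" and next_residue != "P":
            fragments.append(protein[start:i + 1])
            start = i + 1
    if start < len(protein):
        fragments.append(protein[start:])  # C-terminal fragment without K/R
    return fragments


def random_protein(length, rng, weights=None):
    """Draw a random sequence; uniform residue frequencies are an assumption."""
    return "".join(rng.choices(AMINO_ACIDS, weights=weights, k=length))


rng = random.Random(0)
proteome = [random_protein(rng.randint(100, 500), rng) for _ in range(200)]
length_histogram = Counter(
    len(fragment) for protein in proteome for fragment in tryptic_digest(protein)
)
for fragment_length in range(1, 6):
    print(fragment_length, length_histogram[fragment_length])
```

Replacing the uniform weights with measured amino acid abundances, and the fragment lengths with residue masses, would bring the simulation closer to the quantities the abstract describes.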
We introduce the SCP - the sorted common prefix, and study some of its properties. Based on the internal representations used by a class of new compression schemes, we show how the SCP table can be constructed using $O(u + |\Sigma| K_{\max})$ comparisons on average, and $O(u |\Sigma|)$ in the worst case, where $u$ is the size of the sequence, $|\Sigma|$ is the number of symbols, and $K_{\max}$ is the maximum SCP value. We describe one application of the SCP to the problem of anchor points in multiple sequence alignment.
This paper presents a novel algorithm for identification and functional characterization of "key" genome features responsible for a particular biochemical process of interest. The central idea is that individual genome features are identified as "key" features if the discrimination accuracy between two classes of genomes with respect to a given biochemical process is sufficiently affected by the inclusion or exclusion of these features. In this paper, genome features are defined by high-resolution gene functions. The discrimination procedure utilizes the support vector machine classification technique. The application to the oxygenic photosynthetic process resulted in 126 high-confidence candidate genome features. While many of these features are well-known components of the oxygenic photosynthetic process, others are completely unknown, including some hypothetical proteins. These results indicate that our algorithm is capable of discovering features related to a targeted biochemical process.
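The inclusion/exclusion idea can be sketched as a leave-one-feature-out loop around any classifier. The sketch below is only an illustration: it swaps in a nearest-centroid classifier with leave-one-out validation in place of the support vector machine named above, so that it runs with the standard library alone, and the presence/absence matrix, the class labels, and the `min_drop` threshold are hypothetical.

```python
def loo_accuracy(rows, labels, feature_idx):
    """Leave-one-out accuracy of a nearest-centroid classifier on chosen features.

    Stand-in for the SVM in the abstract; needs at least two genomes per class.
    """
    correct = 0
    for held_out in range(len(rows)):
        centroids = {}
        for label in set(labels):
            members = [
                rows[i] for i in range(len(rows))
                if i != held_out and labels[i] == label
            ]
            centroids[label] = [
                sum(r[j] for r in members) / len(members) for j in feature_idx
            ]
        x = [rows[held_out][j] for j in feature_idx]
        predicted = min(
            sorted(centroids),
            key=lambda lab: sum((a - b) ** 2 for a, b in zip(x, centroids[lab])),
        )
        correct += predicted == labels[held_out]
    return correct / len(rows)


def key_features(rows, labels, min_drop):
    """Return features whose removal lowers accuracy by at least min_drop."""
    all_idx = list(range(len(rows[0])))
    baseline = loo_accuracy(rows, labels, all_idx)
    keys = []
    for j in all_idx:
        reduced = [i for i in all_idx if i != j]
        if baseline - loo_accuracy(rows, labels, reduced) >= min_drop:
            keys.append(j)
    return keys


# Hypothetical presence/absence profiles: feature 0 separates the two classes.
rows = [[1, 0, 1], [1, 1, 0], [1, 0, 0], [0, 0, 1], [0, 1, 0], [0, 1, 1]]
labels = ["photo", "photo", "photo", "non", "non", "non"]
print(key_features(rows, labels, min_drop=0.25))
```

Removing one feature at a time is the simplest variant; the abstract's wording also covers scoring features by the effect of including them.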
We present a computational method to analyze the propensity of superhelically stressed DNA to undergo strand separation events, as is required for the initiation of both transcription and replication. We build in silico models to analyze the statistical mechanical equilibrium distribution of a population of identical, stressed DNA molecules among its states of strand separation. In this phenomenon, which we call stress induced duplex destabilization (SIDD), a state energy is determined by the energy cost of opening the specific separated base pairs in that state, and the energy relief from the relaxation of stress this affords. We use experimentally measured values of all energy parameters, including the nearest neighbor energetics known to govern DNA base pair stability. We perform a statistical mechanical analysis in which the approximate equilibrium distribution is calculated from all states whose free energies do not exceed a user-defined threshold. This provides the most general and efficient computational approach to the analysis of this phenomenon. The algorithm is implemented in C++.
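The thresholded equilibrium calculation can be illustrated with a generic Boltzmann computation. This is a minimal Python sketch rather than the C++ implementation mentioned above; it assumes that the threshold is measured above the minimum free energy, that states are given as sets of open base-pair positions, and that energies are in kcal/mol. The state energies below are made-up numbers, not measured parameters.

```python
from math import exp

GAS_CONSTANT = 1.987e-3  # kcal/(mol*K)


def truncated_boltzmann(state_energies, threshold, temperature=310.0):
    """Approximate equilibrium probabilities from the low-energy states only.

    Assumption: a state is kept when its free energy lies within `threshold`
    of the minimum free energy found among the supplied states.
    """
    g_min = min(state_energies.values())
    rt = GAS_CONSTANT * temperature
    weights = {
        state: exp(-(g - g_min) / rt)
        for state, g in state_energies.items()
        if g - g_min <= threshold
    }
    partition = sum(weights.values())
    return {state: w / partition for state, w in weights.items()}


def opening_probabilities(state_probs, n_sites):
    """Probability that each base pair is open, summed over the kept states."""
    p_open = [0.0] * n_sites
    for open_sites, p in state_probs.items():
        for site in open_sites:
            p_open[site] += p
    return p_open


# Hypothetical states for a 4 bp toy molecule.
states = {
    frozenset(): 0.0,               # fully closed duplex
    frozenset({1, 2}): 0.5,         # small bubble
    frozenset({0, 1, 2, 3}): 9.0,   # fully open, above the threshold
}
probs = truncated_boltzmann(states, threshold=5.0)
print(opening_probabilities(probs, n_sites=4))
```

A real analysis would enumerate states from the sequence and compute each state energy from opening costs and stress relief; only the truncation and normalization steps are shown here.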
ISBN: (print) 0769520006
A new and essentially simple method to reconstruct prokaryotic phylogenetic trees from complete genome data without using sequence alignment is proposed. It is based on the frequency of appearance of oligopeptides of a fixed length (up to K=6) in the proteomes. The method requires neither fine adjustment nor a choice of genes. It can incorporate the effect of lateral gene transfer to some extent and leads to results comparable with the bacteriologists' systematics as reflected in the latest 2001 edition of Bergey's Manual of Systematic Bacteriology. A key point in our approach is the subtraction of a random background from the composition vectors, using a Markovian model of order K-1, to highlight the shaping role of natural selection.
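The background subtraction can be sketched as follows. This is an illustration under explicit assumptions, not the authors' code: the expected frequency of a K-string is estimated from the frequencies of its two (K-1)-substrings and the shared (K-2)-substring, the vector is restricted to K-strings that actually occur, and the distance is taken as (1 - cosine correlation) / 2. Function names and the toy sequences are hypothetical.

```python
from collections import Counter
from math import sqrt


def kmer_frequencies(sequences, k):
    """Relative frequencies of all overlapping k-strings in a set of proteins."""
    counts = Counter(
        seq[i:i + k] for seq in sequences for i in range(len(seq) - k + 1)
    )
    total = sum(counts.values())
    return {kmer: c / total for kmer, c in counts.items()}


def composition_vector(sequences, k):
    """Relative deviation of observed k-string frequencies from a Markov estimate."""
    if k < 3:
        raise ValueError("this sketch needs k >= 3")
    f_k = kmer_frequencies(sequences, k)
    f_k1 = kmer_frequencies(sequences, k - 1)
    f_k2 = kmer_frequencies(sequences, k - 2)
    vector = {}
    for kmer, observed in f_k.items():
        # Assumed predictor: f(a1..a_{k-1}) * f(a2..a_k) / f(a2..a_{k-1})
        expected = f_k1[kmer[:-1]] * f_k1[kmer[1:]] / f_k2[kmer[1:-1]]
        vector[kmer] = (observed - expected) / expected
    return vector


def distance(u, v):
    """Distance (1 - C) / 2, where C is the cosine of the angle between u and v."""
    dot = sum(u[kmer] * v[kmer] for kmer in u.keys() & v.keys())
    norm = sqrt(sum(a * a for a in u.values())) * sqrt(sum(b * b for b in v.values()))
    correlation = dot / norm if norm else 0.0
    return (1.0 - correlation) / 2.0


# Hypothetical miniature "proteomes".
species_a = composition_vector(["MKVLAAGIVKMKVLA", "GAVKLMMKVAAG"], k=3)
species_b = composition_vector(["MKVLGAGIVRMKVLA", "GAVRLMMKIAAG"], k=3)
print(distance(species_a, species_b))
```

A matrix of such pairwise distances could then be passed to a standard tree-building method; that step is outside this sketch.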
The problem of clustering continuous-valued data has been well studied in the literature. Its application to microarray analysis relies on such algorithms as k-means, dimensionality reduction techniques, and graph-based approaches for building dendrograms of sample data. In contrast, similar problems for discrete-attributed data are relatively unexplored. An instance of analysis of discrete-attributed data arises in detecting co-regulated samples in microarrays. In this paper, we present an algorithm and a software framework, PROXIMUS, for error-bounded clustering of high-dimensional discrete-attributed datasets in the context of extracting co-regulated samples from microarray data. We show that PROXIMUS delivers outstanding performance in extracting accurate patterns of gene expression.
With the human and mouse genome sequences behind us, whole microbial genome sequencing has become the most active area in genomics today. As easy targets have been worked on first, the microbes under scrutiny today are frequently uncharacterized and difficult to grow and isolate. In those cases, genome sequences often constitute the first and only reliable information about the microorganism to which they belong. It is also becoming the rule that no experiments (genetics, transformation, mutagenesis) are directly possible on the microorganism. For better characterized microbes, the competition in the field pushes us to take an interest in "anonymous genes" for which no functional clues have been gained from routine sequence analysis.
We have developed a mathematical framework for representing and testing hypotheses about gene, protein, and signaling molecule interactions. It takes a hierarchical, contradiction-based approach, and can make use of multiple data sources to assess hypothesis viability and to generate a viability partial order over the space of hypotheses. We have developed an event-based formal language for the expression of such hypotheses. This language seamlessly integrates regulatory diagrams (graphical inputs) and structured English (text input) to maximize flexibility. We have developed a pre-topological formalism that allows us to make precise statements about hypothesis similarity and the convergence of iterative refinements of a base hypothesis. To this, we add mathematical machinery that allows us to make precise statements about control and regulation.