Dynamic programming has been one of the most efficient approaches to sequence analysis and structure prediction in biology. However, its performance is limited by the drastic increase in both the volume of biological data and the variety of computer architectures. In response, this paper presents algorithms that address the challenges of improving memory efficiency and network latency tolerance for nonserial polyadic dynamic programming, where the dependences are nonuniform. By relaxing the nonuniform dependences, we propose a new cache-oblivious scheme to enhance performance on memory hierarchy architectures. Moreover, we develop and extend a tiling technique to parallelize this nonserial polyadic dynamic programming, using an alternate block-cyclic mapping strategy to balance the computational and memory load. An analytical parameterized model is formulated to determine the tile volume size that minimizes the total execution time, and an algorithmic transformation is used to schedule the tiles so that communication overlaps with computation, further reducing communication overhead on parallel architectures. Numerical experiments were carried out on several high-performance computer systems. The new cache-oblivious dynamic programming algorithm achieves a 2-10x speedup, and the parallel tiling algorithm with communication-computation overlapping shows promising potential for fine-grained parallel computing on massively parallel computer systems.
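To make the term "nonserial polyadic dynamic programming" concrete, the following is a minimal Python sketch of the kind of recurrence involved, d[i][j] = min(d[i][j], d[i][k] + d[k][j]) for i < k < j. It is only an untiled, serial baseline under stated assumptions: the function name `npdp` and the input layout are hypothetical, and it does not reproduce the cache-oblivious or tiled parallel algorithms described in the abstract.

```python
import math


def npdp(init):
    """Fill the upper triangle of a nonserial polyadic DP table.

    `init` is an n x n list of lists (hypothetical layout); init[i][j]
    for i < j holds the starting cost of interval (i, j), or math.inf
    if unknown. Each cell depends on a variable number of earlier cells,
    which is what makes the dependences nonuniform.
    """
    n = len(init)
    d = [row[:] for row in init]  # work on a copy of the input
    # Process intervals in order of increasing length (diagonal by diagonal).
    for span in range(2, n):
        for i in range(n - span):
            j = i + span
            best = d[i][j]
            # Polyadic step: combine two shorter sub-intervals at every split k.
            for k in range(i + 1, j):
                cand = d[i][k] + d[k][j]
                if cand < best:
                    best = cand
            d[i][j] = best
    return d


inf = math.inf
table = npdp([
    [inf, 1, 10, inf],
    [inf, inf, 2, inf],
    [inf, inf, inf, 3],
    [inf, inf, inf, inf],
])
print(table[0][3])
```

The inner loop over k reads one row segment and one column segment of the table, which is the access pattern that motivates cache-aware or cache-oblivious reorganization.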
An important goal of functional genomics is to develop methods for determining how the individual actions of genes are integrated in the cell. One way of gaining insight into a gene's role in cellular activity is to study its expression pattern in a variety of circumstances and contexts, as it responds to its environment and to the action of other genes. Microarrays provide large-scale surveys of gene expression in which transcript levels can be determined for thousands of genes simultaneously. The coefficient of determination (CoD) has been proposed for the analysis of gene interaction via multivariate expression arrays. Parallel computing is essential to applying the CoD to a large set of genes because of the large number of expression-based functions that must be statistically designed and compared. Results of calculating the CoD for a large set of genes on multiple superscalar processors are presented, and a proposal for calculating the CoD on multiple vector processors is described. Multiple vector processor systems offer the potential to greatly reduce the time needed to calculate the CoD for a large set of genes.
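As a rough illustration of the quantity being computed, the CoD is commonly defined as (e0 - e_opt) / e0, where e0 is the error of the best predictor of the target gene without any observations and e_opt is the error of the best predictor given the predictor genes. The Python sketch below estimates both errors by simple resubstitution on discretized expression values. The function name `cod`, the data layout, and the resubstitution estimate are assumptions for illustration; the abstract does not specify the estimation procedure actually used.

```python
from collections import Counter, defaultdict


def cod(predictors, target):
    """Resubstitution estimate of the coefficient of determination.

    predictors: non-empty list of tuples of discretized expression
                values (hypothetical layout, one tuple per sample).
    target:     list of discretized target-gene values, same length.
    """
    n = len(target)
    # e0: error of always predicting the most frequent target value.
    e0 = 1 - max(Counter(target).values()) / n
    if e0 == 0:
        return 0.0  # target is constant; nothing left to explain
    # e_opt: error of predicting the most frequent target value
    # within each observed predictor configuration.
    groups = defaultdict(Counter)
    for x, y in zip(predictors, target):
        groups[tuple(x)][y] += 1
    correct = sum(max(counts.values()) for counts in groups.values())
    e_opt = 1 - correct / n
    return (e0 - e_opt) / e0


informative = cod([(0,), (0,), (1,), (1,)], [0, 0, 1, 1])
uninformative = cod([(0,), (1,), (0,), (1,)], [0, 0, 1, 1])
print(informative, uninformative)
```

Evaluating this for every target gene against every small subset of candidate predictor genes yields a very large number of independent calculations, which is why the workload suits parallel hardware.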
The Cancer Biomedical Informatics Grid (caBIG™) is a new project initiated by the National Cancer Institute to create a computational network connecting scientists and institutions to enable the sharing of data and the use of common analytical tools. The emergence of high-throughput genomics and proteomics technologies is creating a paradigm shift in biomedical research from small independent labs to large teams of researchers exploring entire genomes and proteomes and how they relate to disease. caBIG™ is developing new software and modifying existing software within Clinical Trials Management systems, Tissue Banks and Pathology Tools and Integrated Cancer Research tools to manage the huge volume of data being generated and to facilitate collaboration across the broad spectrum of cancer research.
Projection pursuit learning (PPL) refers to a well-known constructive learning algorithm characterized by a very efficient and accurate computational procedure oriented to nonparametric regression. It has been employed as a means to counteract some problems related to the design of artificial neural network (ANN) models, namely, the estimation of a (usually large) number of free parameters, the proper definition of the model's dimension, and the choice of the sources of nonlinearity (activation functions). In this work, the potential of PPL is exploited from a different perspective, namely, in designing one-hidden-layer feedforward ANNs for the adaptive control of nonlinear dynamic systems. For this purpose, the proposed methodology is divided into three stages. In the first, the model identification process is undertaken. In the second, the ANN structure is defined according to an offline control setting. In these two stages, the PPL algorithm estimates not only the optimal number of hidden neurons but also the best activation function for each node. The final stage is performed online and fine-tunes the parameters of the identification model and the controller. Simulation results indicate that it is possible to design effective neural models based on PPL for the control of nonlinear multivariate systems, with superior performance when compared to benchmarks.
This paper extends an earlier study which outlined a bioinformatic pipeline for exploratory search for RNA motifs incorporating both primary and secondary structure. The pipeline is applied to three data sets, one of which is a larger version of that used in the earlier study. Instead of a single method of estimating the distance between RNA folds, four distance measures were tested. The data sets are: a set of random control sequences, a set of synthetic sequences with simple designed folds, and the iron response element data set, for which actual biological RNA folds are available. The pipeline demonstrates the ability to produce clusters that contain known motifs in the biological data and those designed into the synthetic data. The results for the distance measures vary substantially, and one of the measures, difference in energy, is found to be too simplistic to be useful for differentiating motifs. The other three distance measures all demonstrate some degree of merit. At the heart of the pipeline is a non-linear projection algorithm that uses evolutionary computation to display the intra-RNA-fold distances so that the various distance measures can be visually compared. While the performance of this algorithm is acceptable, suggestions for improving it are made.
Motif discovery from biological sequences, a challenging task both experimentally and computationally, has been a topic of intense study in recent years. In this paper, we formulate the motif discovery problem as a multiple-instance problem and employ a multiple-instance learning method, the MILES method, to identify motifs from biological sequences. Each sequence is mapped into a feature space defined by instances in training sequences with a novel instance-bag similarity measure. We employ a 1-norm SVM to select important features and construct classifiers simultaneously. These high-ranked features correspond to discovered motifs. We apply this method to discover transcription factor binding sites in promoters, a typical motif finding problem in biology, and show that the method is at least comparable to existing methods.
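The mapping of each sequence (bag) into an instance-defined feature space can be sketched as follows. The abstract does not specify its novel instance-bag similarity measure, so this Python sketch substitutes the Gaussian max-similarity usually associated with MILES, s(x, B) = max over instances b in B of exp(-||b - x||^2 / sigma^2). The names `similarity` and `embed`, the numeric instance vectors, and the `sigma` parameter are all assumptions for illustration; in a motif setting the instances would be derived from subsequences.

```python
import math


def similarity(concept, bag, sigma=1.0):
    """Similarity between one candidate instance and a bag: the score
    of the closest instance in the bag (assumed Gaussian form)."""
    return max(
        math.exp(-sum((a - b) ** 2 for a, b in zip(inst, concept)) / sigma ** 2)
        for inst in bag
    )


def embed(bags, sigma=1.0):
    """Map every bag to a feature vector with one coordinate per
    training instance, as in the MILES-style embedding."""
    concepts = [inst for bag in bags for inst in bag]
    return [[similarity(c, bag, sigma) for c in concepts] for bag in bags]


# Two toy bags of 2-D instances (hypothetical data).
bags = [[(0.0, 0.0), (1.0, 0.0)], [(0.0, 0.0)]]
features = embed(bags)
print(len(features), len(features[0]))
```

A sparse linear classifier such as a 1-norm SVM trained on these feature vectors drives most coordinates to zero weight, and the instances whose coordinates keep nonzero weight are the candidates reported as motifs.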
Recent progress in proteomics, computational biology, and ontology development has presented an opportunity to investigate protein data sources from a unique perspective, namely, examining them through the structure and hierarchy of a protein ontology (PO). Various data mining algorithms and mathematical models provide methods for analyzing protein data sources; however, there are two issues that need to be addressed: (1) the need for standards for protein data description and exchange and (2) the elimination of errors that arise with data integration methodologies for complex queries. Protein ontology is designed to meet these needs by providing a structured specification for protein data representation. It is a standard for representing protein data in a way that helps in defining data integration and data mining models for protein structure and function. We report here our development of PO; a semantic heterogeneity framework based on relationships between PO concepts; and analysis of resultant PO data of human proteins. We also briefly discuss our ongoing work on designing a trustworthy framework around PO.