The invention of microarrays has rapidly changed the state of biological and biomedical research. Clustering algorithms play an important role in analyzing microarray data sets, where identifying groups of co-expressed genes is a very difficult task. Here we pose the clustering of microarray data as a multiobjective clustering problem and develop a new symmetry-based fuzzy clustering technique to solve it. The effectiveness of the proposed technique is demonstrated on five publicly available benchmark data sets. Results are compared with some widely used microarray clustering techniques. Statistical and biological significance tests have also been carried out. (C) 2013 Elsevier Ltd. All rights reserved.
We are interested in estimating prediction error for a classification model built on high-dimensional genomic data when the number of genes (p) greatly exceeds the number of subjects (n). We examine a distance argument supporting the conventional 0.632+ bootstrap proposed for the $n > p$ scenario, modify it for the $n < p$ situation, and develop learning curves to describe how the true prediction error varies with the number of subjects in the training set. The curves are then used to define adjusted resampling estimates of the prediction error that balance bias and variability. The adjusted resampling methods are proposed as counterparts of the 0.632+ bootstrap when $n < p$, and are found to improve on the 0.632+ bootstrap and other existing methods in the microarray study scenario when the sample size is small and there is some level of differential expression. The Canadian Journal of Statistics 41: 133-150; 2013 (c) 2012 Statistical Society of Canada
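For orientation, the conventional 0.632+ bootstrap that the abstract above takes as its starting point can be sketched as follows. This is a minimal illustration of the standard estimator (Efron and Tibshirani, 1997), not of the adjusted resampling methods the abstract proposes. The function and argument names are hypothetical, and the three error rates are assumed to have been computed elsewhere.

```python
def bootstrap_632_plus(resub_error, loo_boot_error, no_info_rate):
    """Conventional 0.632+ estimate of prediction error.

    resub_error    -- error of the model on its own training data (assumed given)
    loo_boot_error -- leave-one-out bootstrap (out-of-bag) error (assumed given)
    no_info_rate   -- error expected if features and labels were independent
    """
    # Plain 0.632 estimate: fixed-weight blend of the two error rates.
    err_632 = 0.368 * resub_error + 0.632 * loo_boot_error

    # Cap the out-of-bag error at the no-information rate.
    capped = min(loo_boot_error, no_info_rate)

    # Relative overfitting rate in [0, 1]; zero when there is no sign of overfitting.
    if capped > resub_error and no_info_rate > resub_error:
        overfit = (capped - resub_error) / (no_info_rate - resub_error)
    else:
        overfit = 0.0

    # The correction term shifts weight toward the out-of-bag error as overfitting grows.
    correction = (capped - resub_error) * (0.368 * 0.632 * overfit) / (1.0 - 0.368 * overfit)
    return err_632 + correction
```

With no overfitting the estimate reduces to the plain 0.632 blend; with complete overfitting (out-of-bag error at the no-information rate) it returns the out-of-bag error itself.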
In this study, we discuss and apply a novel and efficient algorithm for learning a local Bayesian network model in the vicinity of the ZNF217 oncogene from breast cancer microarray data without having to decide in advance which genes have to be included in the learning process. ZNF217 is a candidate oncogene located at 20q13, a chromosomal region frequently amplified in breast and ovarian cancer, and correlated with shorter patient survival in these cancers. To properly address the difficulties in managing complex gene interactions given our limited sample, statistical significance of edge strengths was evaluated using bootstrapping and the less reliable edges were pruned to increase the network robustness. We found that 13 out of the 35 genes associated with deregulated ZNF217 expression in breast tumours have been previously associated with survival and/or prognosis in cancers. Identifying genes involved in lipid metabolism opens new fields of investigation to decipher the molecular mechanisms driven by the ZNF217 oncogene. Moreover, nine of the 13 genes have already been identified as putative ZNF217 targets by independent biological studies. We therefore suggest that the algorithms for inferring local BNs are valuable data mining tools for unraveling complex mechanisms of biological pathways from expression data. The source code is available at http://***-lyon1.fr/~aaussem/***. (c) 2012 Elsevier Ltd. All rights reserved.
The positive false discovery rate (pFDR) is the average proportion of false rejections given that the overall number of rejections is greater than zero. Assuming that the proportion of true null hypotheses, proportion of false positives, and proportion of true positives all converge pointwise, the pFDR converges to a continuous limit uniformly over all significance levels. We show that the uniform convergence still holds under the weaker assumption that the proportion of true positives converges in $L^1$.
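The verbal definition of the pFDR in the abstract above can be written compactly. The symbols are assumptions introduced here for illustration (they follow common multiple-testing notation and do not appear in the abstract): $V$ is the number of false rejections and $R$ is the total number of rejections.

```latex
\[
  \mathrm{pFDR} \;=\; \mathbb{E}\!\left[\, \frac{V}{R} \;\middle|\; R > 0 \,\right]
\]
```

Conditioning on $R > 0$ is what distinguishes the pFDR from the ordinary false discovery rate, which instead sets the ratio to zero when nothing is rejected.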
Densely connected patterns in biological networks can help biologists gain meaningful insights. Detecting dense subgraphs effectively and quickly has become an urgent challenge in recent years. In this paper, we propose a local measure, the edge density coefficient, which indicates whether or not an edge lies in a dense subgraph. Simulation results show that this measure improves both the accuracy and the speed of dense subgraph detection. With this local measure, the G-N algorithm can therefore be extended to large biological networks. Finally, we apply the algorithm to microarray data sets of Saccharomyces cerevisiae and perform gene ontology analysis of the results with GOEAST.
In this paper, we propose a general spiked model called the power spiked model in high-dimensional settings. We derive relations among the data dimension, the sample size and the high-dimensional noise structure. We first consider asymptotic properties of the conventional estimator of eigenvalues. We show that the estimator is affected by the high-dimensional noise structure directly, so that it becomes inconsistent. In order to overcome such difficulties in a high-dimensional situation, we develop new principal component analysis (PCA) methods called the noise-reduction methodology and the cross-data-matrix methodology under the power spiked model. We show that the new PCA methods can enjoy consistency properties not only for eigenvalues but also for PC directions and PC scores in high-dimensional settings. (C) 2013 Elsevier Inc. All rights reserved.
This paper addresses semantic data mining, a new data mining paradigm in which ontologies are exploited in the process of data mining and knowledge discovery. This paradigm is introduced together with two new semantic subgroup discovery systems, SDM-SEGS (SEGS: search for enriched gene sets) and SDM-Aleph. These systems are made publicly available in the new SDM-Toolkit for semantic data mining. The toolkit is implemented in the Orange4WS data mining platform, which supports knowledge discovery workflow construction from local and distributed data mining services. Based on experiments with semantic subgroup discovery systems on two publicly available biomedical datasets, the paper presents a thorough quantitative and qualitative evaluation of SDM-SEGS and SDM-Aleph and compares them with SEGS, a system for enriched gene set discovery from microarray data.
For a transformation, a new set of basis functions is normally chosen for the data. The choice of basis determines the properties of the transformed data. For the wavelet transform, the wavelet basis is designed to detect localized features in microarray data. In this research, we investigate the performance of wavelet features based on third-level detail coefficients in wavelet space, which characterize the change points of microarray data using high-order information. To find the significant gene information, we reconstruct wavelet details from the detail coefficients. A genetic algorithm selects the best features from the reconstructed details in the original data space, and the corresponding gene information is identified from the selected features. Experiments on four datasets with twofold cross-validation show that good performance is achieved.
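The two wavelet steps described in the abstract above (taking third-level detail coefficients, then reconstructing the detail signal in the original data space) can be sketched as follows. The abstract does not name the wavelet family, so the Haar wavelet is an assumption made here purely because it is the simplest to write out. All function names are hypothetical, and the genetic-algorithm feature selection is not shown.

```python
import math

SQRT2 = math.sqrt(2.0)


def haar_step(signal):
    """One level of the orthonormal Haar transform: (approximation, detail)."""
    approx = [(signal[i] + signal[i + 1]) / SQRT2 for i in range(0, len(signal), 2)]
    detail = [(signal[i] - signal[i + 1]) / SQRT2 for i in range(0, len(signal), 2)]
    return approx, detail


def haar_inverse_step(approx, detail):
    """Invert one Haar level, doubling the length of the signal."""
    out = []
    for a, d in zip(approx, detail):
        out.append((a + d) / SQRT2)
        out.append((a - d) / SQRT2)
    return out


def level3_detail(signal):
    """Detail coefficients at the third decomposition level."""
    if len(signal) == 0 or len(signal) % 8 != 0:
        raise ValueError("signal length must be a positive multiple of 8")
    approx = list(signal)
    detail = []
    for _ in range(3):
        approx, detail = haar_step(approx)
    return detail


def reconstruct_level3_detail(detail3):
    """Map level-3 details back to the original space, zeroing everything else."""
    # Level 3 -> level 2: the approximation is set to zero, only details are kept.
    current = haar_inverse_step([0.0] * len(detail3), detail3)
    # Levels 2 -> 1 -> 0: finer-level details are set to zero.
    for _ in range(2):
        current = haar_inverse_step(current, [0.0] * len(current))
    return current
```

For a hypothetical expression profile `[1, 1, 1, 1, 0, 0, 0, 0]`, the single third-level coefficient captures the step between the two halves, and the reconstructed detail is the profile with its mean removed.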
Cancer is regarded as a highly heterogeneous disease specific to cell type and tissue origin. All cancers, however, share a common pathogenesis. Therefore, it is widely believed that cancers may share common mechanisms. In this study, we introduce a novel strategy based on multi-task learning methods to predict core cancer genes shared by multiple cancers, in the hope of elucidating common cancer mechanisms. Our strategy uses two multi-task learning algorithms, one for feature selection and the other for validation of the selected features. The combined use of the two methods yields more robust classifiers and more reliable selected features. The top 73 significant features, mapped to 72 genes, are selected as core cancer genes. The effectiveness of the 73 features is further demonstrated in a blind test on an independent test data set. The biological significance of these genes is evaluated using systems biology analyses. Extensive functional, pathway and network analysis confirms findings of previous studies and brings new insights into common cancer mechanisms. Our strategy can be used as a general method to find important genes in large gene expression datasets at the genomic level. The selected genes can be used to predict cancers. (C) 2012 Elsevier Ltd. All rights reserved.
MicroRNAs (miRNAs) are short non-coding RNAs that can regulate gene expression during various crucial cell processes such as differentiation, proliferation and apoptosis. Changes in miRNA expression profiles play an important role in the development of many cancers, including colorectal cancer (CRC). Therefore, the identification of cancer-related miRNAs and their target genes is important for cancer biology research. In this paper, we apply a TSK-type recurrent neural fuzzy network (TRNFN) to infer a miRNA-mRNA association network from paired miRNA and mRNA expression profiles of CRC patients. We demonstrate that the proposed method achieves good performance in recovering known, experimentally verified miRNA-mRNA associations. Moreover, our approach proved successful in identifying 17 validated cancer miRNAs that are directly involved in CRC-related pathways. Targeting such miRNAs may help not only to prevent the recurrence of disease but also to control the growth of advanced metastatic tumors. Our regulatory modules provide valuable insights into the pathogenesis of cancer. (C) 2012 Elsevier B.V. All rights reserved.