Transcript-level quantification is often measured across two groups of patients to aid the discovery of biomarkers and the detection of biological mechanisms involving these biomarkers. Statistical tests lack power, and the false discovery rate is high, when sample size is small. Yet many experiments have very few samples (<= 5). This creates the impetus for a method that discovers biomarkers and mechanisms under very small sample sizes. We present a powerful method, ESSNet, that is able to identify subnetworks consistently across independent datasets of the same disease phenotypes even under very small sample sizes. The key idea of ESSNet is to fragment large pathways into smaller subnetworks and compute a statistic that discriminates the subnetworks in two phenotypes. We do not greedily select genes to be included based on differential expression but rely on gene-expression-level ranking within a phenotype, which is shown to be stable even under extremely small sample sizes. We test our subnetworks against null distributions obtained by array rotation; this preserves the gene-gene correlation structure and is suitable for datasets with small sample size, allowing us to consistently predict relevant subnetworks even when sample size is small. For most other methods, this consistency drops to less than 10% when we test them on datasets with only two samples from each phenotype, whereas ESSNet achieves an average consistency of 58% (72% when we consider genes within the subnetworks) and continues to be superior when sample size is large. We further show that the subnetworks identified by ESSNet agree closely with many references in the biological literature. ESSNet and supplementary material are available at: http://***:8080/essnet.
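The within-phenotype ranking idea can be illustrated with a small sketch. This is not ESSNet's actual statistic, which the abstract does not spell out; the function names, the toy expression values, and the "rank shift" summary are all assumptions made for illustration only.

```python
from statistics import mean

def rank_genes(sample):
    """Map each gene to its rank (1 = highest expression) within one sample."""
    ordered = sorted(sample, key=sample.get, reverse=True)
    return {gene: position for position, gene in enumerate(ordered, start=1)}

def mean_rank_per_gene(samples):
    """Average each gene's rank across all samples of one phenotype."""
    ranked = [rank_genes(s) for s in samples]
    return {gene: mean(r[gene] for r in ranked) for gene in ranked[0]}

# Hypothetical toy data: two samples per phenotype, four genes.
disease = [
    {"g1": 9.1, "g2": 7.4, "g3": 3.2, "g4": 1.0},
    {"g1": 8.7, "g2": 7.9, "g3": 2.8, "g4": 1.5},
]
control = [
    {"g1": 2.0, "g2": 7.5, "g3": 8.8, "g4": 1.1},
    {"g1": 2.4, "g2": 7.1, "g3": 9.3, "g4": 0.9},
]

disease_ranks = mean_rank_per_gene(disease)
control_ranks = mean_rank_per_gene(control)

# Positive shift: the gene ranks higher (closer to 1) in disease than in control.
rank_shift = {g: control_ranks[g] - disease_ranks[g] for g in disease_ranks}
```

Ranks depend only on the ordering of genes inside a sample, so they are unaffected by per-sample scaling, which is one plausible reason they remain usable with only two samples per phenotype.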
Efficient mining of high-throughput data has become a popular theme in the big data era. Existing biology-related feature ranking methods mainly focus on statistical and annotation information. In this study, two efficient feature ranking methods are presented. Multi-target regression and graph embedding are incorporated in an optimization framework, and feature ranking is achieved by introducing a structured sparsity norm. Unlike existing methods, the presented methods have two advantages: (1) The feature subset simultaneously accounts for global margin information and local manifold information, so both global and local information are considered. (2) Features are selected in batches rather than individually within the algorithm framework, so the interactions between features are considered and the optimal feature subset can be guaranteed. In addition, this study presents a theoretical justification. Empirical experiments on a set of real-world gene expression data sets demonstrate the effectiveness and efficiency of the two algorithms in comparison with some state-of-the-art feature ranking methods.
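A common way to rank features with a structured sparsity norm is to score each feature by the norm of its row in a learned weight matrix. The sketch below assumes the l2,1 norm, which the abstract does not name explicitly, and uses a hypothetical, hand-written weight matrix W rather than one learned by the authors' optimization.

```python
import math

def row_norms(weights):
    """Return the l2 norm of each row; row i holds feature i's weights over all targets."""
    return [math.sqrt(sum(w * w for w in row)) for row in weights]

def l21_norm(weights):
    """l2,1 norm: the sum of the l2 norms of the rows."""
    return sum(row_norms(weights))

def rank_features(weights):
    """Return feature indices ordered from largest to smallest row norm."""
    norms = row_norms(weights)
    return sorted(range(len(norms)), key=lambda i: norms[i], reverse=True)

# Hypothetical weight matrix: 3 features (rows) by 2 regression targets (columns).
W = [
    [3.0, 4.0],  # row norm 5: strongly used by both targets
    [0.0, 0.0],  # row norm 0: feature dropped entirely
    [0.6, 0.8],  # row norm 1: weakly used
]

ranking = rank_features(W)
```

Because the penalty acts on whole rows, a feature is kept or discarded jointly for all targets, which is what allows selection in batches rather than one feature at a time.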
Pathogen-inducible plant promoters (PIPs) are able to respond to pathogens after infection and are usually activated by pathogens only at the time point after *** could have applications as molecular markers, and for engineering crops with increased disease *** study obtained 62 pathogen-inducible plant promoter sequences from 14 species through the literature and public *** potential cis-acting elements of the above PIPs were identified using PlantCARE and PLACE *** candidate rice PIPs, which contain potential pathogen-inducible AS-1, G-box, H-box and GCC-box cis-acting elements, were screened from rice promoterome *** a result, a total of 417 candidate rice PIPs were *** genes under the control of potential rice PIPs were annotated by searching the NCBI COG database with their sequences as *** the candidate genes, 55.26% are of unknown function; 13.16% may be involved in metabolism; 9.57% may function in cellular processes and signaling; 8.61% are poorly characterized; 8.37% may play roles in information storage and processing; and 5.02% may act as dual *** validate the 417 candidate rice PIPs, several microarray data sets of rice infected by pathogens were downloaded from public databases and *** results indicated that highly significant changes (p < 0.01) were observed at the transcriptional level for 128 candidate genes controlled by *** the genes, 92 genes were upregulated, and 63 genes were downregulated in the microarray *** a control, 20 rice genes controlled by non-PIPs (out of the PIPs) were randomly selected from 12 rice chromosomes (one or two genes selected from each chromosome) and their expression was analyzed based on the above microarray *** total 139 microarray treatments, 5 gene-events were observed whose expression varied on the significant *** false positive rate is 3.60% (5/139). The results of this study would underlie the elucidation of mechanisms by which rice PIPs regulate g
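The reported false positive rate and the functional-category percentages can be checked with a few lines of arithmetic. The variable names are illustrative; the numbers are taken directly from the abstract.

```python
# Control experiment: 5 significant gene-events out of 139 microarray treatments.
significant_events = 5
total_treatments = 139
false_positive_rate = significant_events / total_treatments
print(f"{false_positive_rate:.2%}")  # prints 3.60%

# Functional categories reported for the candidate genes, in percent.
category_percentages = [55.26, 13.16, 9.57, 8.61, 8.37, 5.02]
total_percentage = sum(category_percentages)  # about 99.99, i.e. 100% up to rounding
```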
ISBN: 9780769537399 (print)
Microarray data capture the expression patterns of thousands of genes at once. Grouping these gene expression patterns so that each group conveys some biologically meaningful insight requires a clustering method. Two problems exist when attempting to use conventional clustering methods for microarray data analysis. First, the presence of outliers skews the mean value computation, which in turn leads to inconsistent gene expression patterns being placed in one group. Second, the clustering algorithms themselves generally cannot determine the right size of the clusters. We present a new method that approaches the clustering problem from a different angle: the clustering of gene expression patterns is better dealt with within a software framework that helps biologists derive the right size of clusters using their understanding of the experimental context, once the baseline clusters are computed from the fold changes of gene expression levels. We discuss our experiences of using the framework in analyzing numerous microarray data experiments.
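One way to form baseline clusters from fold changes is to group genes by the direction of change at each condition. The abstract does not describe the framework's actual procedure, so the sketch below is only an assumption: the threshold, the U/D/- pattern encoding, and the toy expression values are all hypothetical.

```python
import math

def log2_fold_changes(conditions, reference):
    """log2 ratio of each condition's expression level to the reference level."""
    return [math.log2(value / reference) for value in conditions]

def direction_pattern(fold_changes, threshold=1.0):
    """Encode each condition as U (up), D (down) or - (unchanged)."""
    symbols = []
    for change in fold_changes:
        if change >= threshold:
            symbols.append("U")
        elif change <= -threshold:
            symbols.append("D")
        else:
            symbols.append("-")
    return "".join(symbols)

def baseline_clusters(expression, threshold=1.0):
    """Group genes that share the same direction pattern across conditions."""
    clusters = {}
    for gene, (reference, *conditions) in expression.items():
        changes = log2_fold_changes(conditions, reference)
        clusters.setdefault(direction_pattern(changes, threshold), []).append(gene)
    return clusters

# Hypothetical levels per gene: [reference, condition 1, condition 2].
expression = {
    "geneA": [100.0, 400.0, 50.0],
    "geneB": [200.0, 800.0, 100.0],
    "geneC": [100.0, 110.0, 95.0],
}

clusters = baseline_clusters(expression)
```

Grouping on thresholded fold changes avoids computing cluster means, so a single outlying measurement cannot drag a group's centre; a biologist could then merge or split these baseline groups by hand.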
Dimension reduction is an important topic in data mining and is widely used in genetics, medicine, and bioinformatics. We propose a new local dimension reduction algorithm, TotalPLS, that operates in a unified partial least squares (PLS) framework and implements an information fusion of PLS-based feature selection and feature extraction. This paper focuses on extracting the potential structure hidden in high-dimensional multicategory microarray data, and on interpreting and understanding the results provided by the potential structure information. First, we propose using PLS-based recursive feature elimination (PLSRFE) in multicategory problems. Then, we perform feature importance analysis based on PLSRFE for high-dimensional microarray data to determine the informative feature (biomarker) subset that relates to the tumor subtypes under study. Finally, PLS-based supervised feature extraction is conducted on the selected gene subset to extract comprehensive features that best reflect the nature of the classification and have discriminating ability. The proposed algorithm is compared with several state-of-the-art methods using multiple high-dimensional multicategory microarray data sets. Our comparison is performed in terms of recognition accuracy, relevance, and redundancy. Experimental results show that the proposed algorithm can improve the recognition rate and computational efficiency. Furthermore, mining potential structure information improves the interpretability and understandability of recognition results. The proposed algorithm can be effectively applied to microarray data analysis for the discovery of gene coexpression and coregulation.
Cloud-based scientific data management (storage, transfer, analysis, and inference extraction) is attracting interest. In this paper, we propose a next-generation cloud deployment model suitable for data-intensive applications. Our model is a flexible, self-service, container-based infrastructure that delivers network, computing, and storage resources together with the logic to dynamically manage the components in a holistic manner. We demonstrate the strength of our model with a bioinformatics application. Dynamic algorithms for resource provisioning and job allocation suitable for the chosen dataset are packaged and delivered in a privileged virtual machine as part of the container. We tested the model on our private internal experimental cloud, which is built on low-cost commodity hardware. We demonstrate the capability of our model to create the required network and computing resources and allocate submitted jobs. The results show the benefits of increased automation in terms of both a significant improvement in the time to complete a data analysis and a reduction in the cost of analysis. The proposed algorithms reduced the cost of performing analysis by 50% at 15 GB of data analysis. The total time between submitting a job and writing the results after analysis was also reduced by more than 1 hr at 15 GB of data analysis.
ISBN: 9780992862619 (print)
Bi-CoPaM ensemble clustering can mine a set of microarray datasets collectively to identify the subsets of genes consistently co-expressed in all of them. It can also consider the entire gene set without pre-filtering, as it implicitly filters out less interesting genes. While it has succeeded in revealing new insights into the biology of yeast, it has never been applied to bacteria. In this study, we apply Bi-CoPaM to five bacterial datasets, identifying two clusters of genes as the most consistently co-expressed. Strikingly, their average profiles are consistently negatively correlated in most of the datasets. Thus, we hypothesise that they are regulated by a common biological machinery, and that their genes with unknown biological processes may be participating in the same processes in which most of their genes are known to participate. Additionally, our results demonstrate the applicability of Bi-CoPaM to a wide range of species.
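The negative correlation between the two clusters' average profiles can be measured with a Pearson coefficient. The sketch below does not reproduce Bi-CoPaM itself; it only shows the correlation step, and the two clusters with their expression values are hypothetical.

```python
import math

def mean_profile(profiles):
    """Average the gene profiles of one cluster, time point by time point."""
    return [sum(column) / len(column) for column in zip(*profiles)]

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length profiles."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return covariance / (spread_x * spread_y)

# Hypothetical clusters: each row is one gene's expression over four time points.
cluster_1 = [[1.0, 2.0, 3.0, 4.0], [2.0, 3.0, 4.0, 5.0]]
cluster_2 = [[4.0, 3.0, 2.0, 1.0], [5.0, 4.0, 3.0, 2.0]]

r = pearson(mean_profile(cluster_1), mean_profile(cluster_2))
```

A value of r close to -1 in most datasets is the kind of consistently negative relationship the abstract describes; repeating the computation per dataset would show how consistent it is.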
Background: SNP genotyping microarrays have revolutionized the study of complex disease. The current range of commercially available genotyping products contains extensive catalogues of low-frequency and rare variants. Existing SNP calling algorithms have difficulty dealing with these low-frequency variants, as the underlying models rely on each genotype having a reasonable number of observations to ensure accurate clustering. Results: Here we develop KRLMM, a new method for converting raw intensities into genotype calls that aims to overcome this issue. Our method is unique in that it applies careful between-sample normalization and allows a variable number of clusters k (1, 2 or 3) for each SNP, where k is predicted using the available data. We compare our method to four genotyping algorithms (GenCall, GenoSNP, Illuminus and OptiCall) on several Illumina data sets that include samples from the HapMap project where the true genotypes are known in advance. All methods were found to have high overall accuracy (> 98%), with KRLMM consistently amongst the best. At low minor allele frequency, the KRLMM, OptiCall and GenoSNP algorithms were observed to be consistently more accurate than GenCall and Illuminus on our test data. Conclusions: Methods that tailor their approach to calling low-frequency variants by either varying the number of clusters (KRLMM) or using information from other SNPs (OptiCall and GenoSNP) offer improved accuracy over methods that do not (GenCall and Illuminus). The KRLMM algorithm is implemented in the open-source crlmm package distributed via the Bioconductor project (http://***).
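The idea of allowing one, two or three clusters per SNP can be illustrated with a deliberately simple gap heuristic. This is not the model KRLMM uses to predict k, which the abstract does not detail; the function name, the `min_gap` threshold, and the toy intensity values are assumptions.

```python
def call_clusters(values, min_gap=0.5, max_k=3):
    """Split 1-D values at their largest gaps above min_gap, giving at most max_k clusters."""
    ordered = sorted(values)
    gaps = [(ordered[i + 1] - ordered[i], i) for i in range(len(ordered) - 1)]
    # Keep only the (max_k - 1) largest gaps, and only if they exceed the threshold.
    largest = sorted(gaps, reverse=True)[: max_k - 1]
    cuts = sorted(index for gap, index in largest if gap > min_gap)
    clusters, start = [], 0
    for index in cuts:
        clusters.append(ordered[start : index + 1])
        start = index + 1
    clusters.append(ordered[start:])
    return clusters

# Hypothetical per-sample summaries (for example, log-ratios of the two allele intensities).
common_snp = [-1.0, -0.9, 0.0, 0.1, 1.0, 1.1]  # all three genotypes observed
rare_snp = [1.0, 1.05, 0.95, 1.1]               # only one genotype observed

k_common = len(call_clusters(common_snp))
k_rare = len(call_clusters(rare_snp))
```

Forcing three clusters on the rare SNP would split a single genotype group artificially, which is the failure mode that a variable k is meant to avoid.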
ISBN: 9781479946037 (print)
Bi-CoPaM ensemble clustering can mine a set of microarray datasets collectively to identify the subsets of genes consistently co-expressed in all of them. It can also consider the entire gene set without pre-filtering, as it implicitly filters out less interesting genes. While it has succeeded in revealing new insights into the biology of yeast, it has never been applied to bacteria. In this study, we apply Bi-CoPaM to five bacterial datasets, identifying two clusters of genes as the most consistently co-expressed. Strikingly, their average profiles are consistently negatively correlated in most of the datasets. Thus, we hypothesise that they are regulated by a common biological machinery, and that their genes with unknown biological processes may be participating in the same processes in which most of their genes are known to participate. Additionally, our results demonstrate the applicability of Bi-CoPaM to a wide range of species.
ISBN: 9781479956708 (print)
Cancer prognosis is an important clinical practice in cancer medicine and an important factor in developing personalized medicine. To date, research has focused on developing recurrence risk indices that indicate poor or good survival for given cancer patients. These indices, however, are insufficient and elusive in the clinic. In this paper, we propose to predict the survival time of cancer patients using a pattern recognition approach, which is more informative and more useful to clinicians and patients in clinical practice. We conduct an extensive survey of pattern recognition methods for prognosis based on real-world benchmark microarray data sets. In particular, various types of data preprocessing methods and classification models are introduced and examined for predicting the survival time of lung cancer patients based on gene expression. The experimental results show that pattern recognition methods can provide a feasible and efficient way to predict the survival time of cancer patients. The pattern classification-based strategy is expected to open a new paradigm of cancer prognosis for predicting the survival time of cancer patients in the clinic.