ISBN: (print) 9781450363594
As an active subfield of Automated Machine Learning, automated structural analysis focuses on automatically extracting structural information, such as periodicity, from data, enabling automated data cleaning and feature extraction. Little research, however, has been done on periodicity mining from numeric data that contain noise and missing points. In this paper, we present a practical and innovative framework to close this gap. To validate our approach, we carry out detailed simulation studies and real data analyses. The experimental results show that our framework is more robust to data granularity, with better accuracy and computational efficiency than the baseline methods. Moreover, the results imply that our proposed method is insensitive to data jitter, noise points, and missing signal points.
Background: The promoter region plays an important role in determining where the transcription of a particular gene is initiated. Computational prediction of eukaryotic Pol II promoter sequences is one of the most significant problems in sequence analysis, and existing promoter prediction methods are still far from satisfactory. Results: We attempt to distinguish human Pol II promoter sequences from non-promoter sequences, which are made up of exon and intron sequences. Four methods are used: two kinds of multifractal analysis performed on the numeric sequences obtained from the dinucleotide free energy, Z curve analysis, and a global descriptor of the promoter/non-promoter primary sequences. A total of 141 parameters are extracted from these methods and categorized into seven groups (methods). Each group is used to generate a parameter space, and each promoter/non-promoter sequence is represented by a point in the corresponding space. All 120 possible combinations of the seven methods are tested. Using Fisher's linear discriminant algorithm with a relatively small number of parameters (96 and 117), we obtain satisfactory discriminant accuracies. In particular, with 117 parameters, the accuracies for the training and test sets reach 90.43% and 89.79%, respectively. A comparison with five other existing methods indicates that our methods perform better. Using the global descriptor method (36 parameters), 17 of the 18 experimentally verified promoter sequences of human chromosome 22 are correctly identified. Conclusion: The high accuracies achieved suggest that the methods of this paper are useful for understanding the difficult problem of promoter prediction.
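The abstract does not say which Z curve parameters the authors extract, so the following is only a minimal sketch of the standard Z curve transform that such an analysis starts from. It maps a DNA sequence to cumulative three-dimensional coordinates. The function name `z_curve` and the table `Z_STEPS` are hypothetical names chosen for illustration, not taken from the paper.

```python
# Per-base steps (dx, dy, dz) in the standard Z curve definition:
#   x: purines (A, G) vs pyrimidines (C, T)
#   y: amino bases (A, C) vs keto bases (G, T)
#   z: weak hydrogen bonds (A, T) vs strong hydrogen bonds (G, C)
Z_STEPS = {
    "A": (1, 1, 1),
    "C": (-1, 1, -1),
    "G": (1, -1, -1),
    "T": (-1, -1, 1),
}


def z_curve(seq):
    """Return the cumulative (x_n, y_n, z_n) points for a DNA sequence.

    Assumption: the sequence contains only A, C, G, T (case-insensitive);
    any other symbol, such as N, raises KeyError.
    """
    points = []
    x = y = z = 0
    for base in seq.upper():
        dx, dy, dz = Z_STEPS[base]
        x, y, z = x + dx, y + dy, z + dz
        points.append((x, y, z))
    return points


if __name__ == "__main__":
    for point in z_curve("ACGT"):
        print(point)
```

Features for a classifier such as Fisher's linear discriminant would then be derived from these coordinates; which summaries the paper uses is not stated in the abstract.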