A new method to analyze the similarity between multiply aligned protein motifs (blocks) was developed. It identifies sets of consistently aligned blocks. These are found to be protein regions of similar function and structure that appear in different contexts. For example, the Rossmann fold ligand-binding region is found to be similar to TIM-barrel and methylase regions, various protein families are predicted to have a TIM-barrel fold, and the structural relation between the ClpP protease and crotonase folds is identified from their sequences. Besides identifying local structural features, sequence similarity across short regions (fewer than 20 amino acids) also predicts structural similarity of whole domains (folds) a few hundred amino acid residues long. Most of these relations could not be identified by other advanced sequence-to-sequence or sequence-to-multiple-alignment comparisons. We describe the method (termed CYRCA), present examples of our findings, and discuss their implications. (C) 2001 Academic Press.
Authors:
Zhang, Jun; Liu, Bin
Harbin Inst Technol, Shenzhen Grad Sch, Sch Comp Sci & Technol, HIT Campus, Shenzhen Univ Town, Shenzhen 518055, Guangdong, Peoples R China
Background: Proteins play a crucial role in life activities, such as catalyzing metabolic reactions, DNA replication, and responding to stimuli. Identification of protein structures and functions is critical for both basic research and applications. Because traditional experiments for studying the structures and functions of proteins are expensive and time consuming, computational approaches are highly desired. The key to computational methods is how to efficiently extract features from protein sequences. During the last decade, many powerful feature extraction algorithms have been proposed, significantly promoting the study of protein structures and functions. Objective: To help researchers catch up with recent developments in this important field, this study gives an updated review focusing on sequence-based feature extraction from protein sequences. Method: The sequence-based features of proteins were grouped into three categories: composition-based features, autocorrelation-based features, and profile-based features. The features in each group were introduced in detail, and their advantages and disadvantages were discussed. Some useful tools for generating these features were also introduced. Results: Generally, autocorrelation-based features outperform composition-based features, and profile-based features outperform autocorrelation-based features. The reason is that profile-based features incorporate evolutionary information, which is useful for identifying protein structures and functions. However, profile-based features are more time consuming to compute, because a multiple sequence alignment is required. Conclusion: In this study, some recently proposed sequence-based features were introduced and discussed, such as basic k-mers, PseAAC, auto-cross covariance, and top-n-gram. These features did make great contributions to the developments of protein sequence analysi
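As an illustration of the simplest composition-based feature named above (basic k-mers), the following is a minimal sketch that counts overlapping k-mers in a sequence and normalizes the counts to frequencies. The function name `kmer_composition`, the 20-letter alphabet ordering, and the choice to skip windows containing non-standard residues are assumptions made for this example, not details taken from the review.

```python
from itertools import product

# Assumed alphabet: the 20 standard amino acids in alphabetical one-letter order.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def kmer_composition(sequence, k=2):
    """Return the normalized k-mer frequency vector of a protein sequence.

    The vector has 20**k entries, ordered lexicographically over AMINO_ACIDS.
    Windows containing characters outside the alphabet are skipped.
    """
    kmers = ["".join(p) for p in product(AMINO_ACIDS, repeat=k)]
    counts = dict.fromkeys(kmers, 0)
    total = 0
    seq = sequence.upper()
    # Slide a window of length k over the sequence, one residue at a time.
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if kmer in counts:
            counts[kmer] += 1
            total += 1
    if total == 0:
        return [0.0] * len(kmers)
    return [counts[m] / total for m in kmers]


if __name__ == "__main__":
    features = kmer_composition("ACAC", k=2)
    print(len(features))  # dimension of the 2-mer feature vector
```

A vector like this has a fixed length regardless of sequence length, which is why composition-based features are convenient inputs for standard classifiers; the trade-off noted in the abstract is that they carry no evolutionary information.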
As the first step of machine-learning-based protein structure and function prediction, amino acid encoding plays a fundamental role in the final success of those methods. Unlike protein sequence encoding, amino acid encoding can be used in both residue-level and sequence-level prediction of protein properties by combining it with different algorithms. However, it has not attracted enough attention in the past decades, and there have been no comprehensive reviews and assessments of encoding methods so far. In this article, we make a systematic classification and present a comprehensive review and assessment of various amino acid encoding methods. These methods are grouped into five categories according to their information sources and information extraction methodologies: binary encoding, physicochemical properties encoding, evolution-based encoding, structure-based encoding, and machine-learning encoding. Then, 16 representative methods from the five categories are selected and compared on protein secondary structure prediction and protein fold recognition tasks using large-scale benchmark datasets. The results show that the evolution-based position-dependent encoding method PSSM achieved the best performance, and that the structure-based and machine-learning encoding methods also show some potential for further application. In particular, the neural-network-based distributed representation of amino acids may shed new light on this area. We hope that the review and assessment are useful for future studies in amino acid encoding.
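To make the first category concrete, the following is a minimal sketch of binary (one-hot) amino acid encoding, in which each residue becomes a 20-dimensional vector with a single 1. The function name `one_hot_encode`, the alphabet ordering, and the all-zero vector for non-standard residues are assumptions for this example; the article's own implementation is not described in the abstract.

```python
# Assumed alphabet: the 20 standard amino acids in alphabetical one-letter order.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}


def one_hot_encode(sequence):
    """Encode a protein sequence as a list of 20-dimensional binary vectors.

    Residues outside the assumed alphabet (e.g. 'X') map to an all-zero vector.
    """
    encoded = []
    for residue in sequence.upper():
        row = [0] * len(AMINO_ACIDS)
        idx = INDEX.get(residue)
        if idx is not None:
            row[idx] = 1
        encoded.append(row)
    return encoded


if __name__ == "__main__":
    matrix = one_hot_encode("MKV")
    print(len(matrix), len(matrix[0]))  # one row per residue, 20 columns each
```

Because the output has one row per residue, it can feed residue-level predictors directly (for instance over a sliding window), which is the property that distinguishes amino acid encoding from whole-sequence encoding in the abstract above.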