ISBN: (Print) 9781479959877
One of the central problems in bioinformatics is the de novo discovery of recurring motifs in DNA. Because these motifs are preserved throughout evolution, they probably play a significant biological role. One of the most widely used existing tools applies the Expectation Maximization (EM) algorithm to learn the parameters of a statistical model from partial data. One such method, the MEME algorithm, assumes that the motif data is generated by a Hidden Markov Model (HMM). Despite its success, this method is in essence a hill-climbing method and, as such, is known to be prone to getting caught in local optima. In this work, we tackle the problem by using a genetic algorithm (GA) instead to search for the optimal probabilities of the HMM. On certain occasions, we succeeded in achieving better results using the GA.
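To make the idea of a population-based search over motif probabilities concrete, the following is a minimal sketch and not the authors' implementation. It assumes a simplified stand-in for the HMM: a position weight matrix scored by the best window per sequence against a uniform background. The function names, the elitist selection scheme, and all parameter values are hypothetical choices made for illustration.

```python
import math
import random

ALPHABET = "ACGT"

def random_pwm(width, rng):
    """Random position weight matrix: one probability row per motif position."""
    pwm = []
    for _ in range(width):
        raw = [rng.random() + 1e-3 for _ in ALPHABET]
        total = sum(raw)
        pwm.append([x / total for x in raw])
    return pwm

def fitness(pwm, seqs):
    """Sum over sequences of the best window log-odds score vs. a uniform background."""
    width = len(pwm)
    score = 0.0
    for s in seqs:
        best = -math.inf
        for start in range(len(s) - width + 1):
            ll = sum(math.log(pwm[i][ALPHABET.index(s[start + i])] / 0.25)
                     for i in range(width))
            best = max(best, ll)
        score += best
    return score

def mutate(pwm, rng, rate=0.2):
    """Perturb some rows with Gaussian noise, then renormalise each perturbed row."""
    child = []
    for row in pwm:
        if rng.random() < rate:
            raw = [max(p + rng.gauss(0, 0.1), 1e-3) for p in row]
            total = sum(raw)
            row = [x / total for x in raw]
        child.append(list(row))
    return child

def crossover(a, b, rng):
    """Single-point crossover on motif positions (requires width >= 2)."""
    cut = rng.randrange(1, len(a))
    return [list(r) for r in a[:cut]] + [list(r) for r in b[cut:]]

def evolve(seqs, width, pop_size=30, generations=40, seed=0):
    """Elitist GA: keep the top fifth unchanged, refill with mutated offspring."""
    rng = random.Random(seed)
    pop = [random_pwm(width, rng) for _ in range(pop_size)]
    history = []  # best fitness at the start of each generation
    for _ in range(generations):
        pop.sort(key=lambda m: fitness(m, seqs), reverse=True)
        history.append(fitness(pop[0], seqs))
        elite = pop[: max(1, pop_size // 5)]
        children = [mutate(crossover(rng.choice(elite), rng.choice(elite), rng), rng)
                    for _ in range(pop_size - len(elite))]
        pop = elite + children
    pop.sort(key=lambda m: fitness(m, seqs), reverse=True)
    return pop[0], history
```

Because the elite individuals are carried over unchanged, the best fitness in this sketch never decreases between generations, while crossover and mutation let the population jump between regions that a purely local update might not leave.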
Objectives:
This paper proposes a greedy algorithm for learning a mixture of motifs model through likelihood maximization, in order to discover common substrings, known as motifs, from a given collection of related biosequences.
Methods:
The approach sequentially adds a new motif component to a mixture model, using a combined scheme of global and local search to initialize the component parameters appropriately. A hierarchical clustering scheme is applied at the outset, which identifies candidate motif models and speeds up the global search procedure.
Results:
The performance of the proposed algorithm has been studied on both artificial and real biological datasets. In comparison with the well-known MEME approach, the algorithm is advantageous since it identifies motifs with significant conservation and produces larger protein fingerprints.
Conclusion:
The proposed greedy algorithm constitutes a promising approach for discovering multiple probabilistic motifs in biological sequences. By using an effective incremental mixture modeling strategy, our technique successfully overcomes the limitation of the MEME scheme, which erases motif occurrences each time a new motif is discovered.
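The incremental scheme described under Methods can be illustrated with a rough sketch, which is not the paper's algorithm. It assumes a mixture over fixed-width windows with a uniform background component, uses every distinct window as a candidate seed in place of the hierarchical clustering step, and refines with a few EM iterations as the local search. All names and the values of `alpha`, `peak`, and `pseudo` are hypothetical.

```python
import math

ALPHABET = "ACGT"

def windows(seqs, w):
    """All overlapping width-w substrings of the input sequences."""
    return [s[i:i + w] for s in seqs for i in range(len(s) - w + 1)]

def seed_pwm(word, peak=0.7):
    """Candidate motif model centred on one observed word."""
    rest = (1.0 - peak) / 3
    return [[peak if a == c else rest for a in ALPHABET] for c in word]

def comp_prob(pwm, x):
    p = 1.0
    for row, c in zip(pwm, x):
        p *= row[ALPHABET.index(c)]
    return p

def mixture_probs(weights, pwms, x):
    """Weighted density of each component for window x; component 0 is background."""
    bg = 0.25 ** len(x)
    return [weights[0] * bg] + [wt * comp_prob(m, x) for wt, m in zip(weights[1:], pwms)]

def log_likelihood(weights, pwms, data):
    return sum(math.log(sum(mixture_probs(weights, pwms, x))) for x in data)

def em_step(weights, pwms, data, pseudo=0.01):
    """One EM update of mixing weights and motif matrices (with small pseudocounts)."""
    w, k = len(data[0]), len(pwms)
    resp_sum = [0.0] * (k + 1)
    counts = [[[pseudo] * 4 for _ in range(w)] for _ in range(k)]
    for x in data:
        probs = mixture_probs(weights, pwms, x)
        total = sum(probs)
        for j, p in enumerate(probs):
            r = p / total  # responsibility of component j for window x
            resp_sum[j] += r
            if j > 0:
                for i, c in enumerate(x):
                    counts[j - 1][i][ALPHABET.index(c)] += r
    new_weights = [r / len(data) for r in resp_sum]
    new_pwms = [[[v / sum(row) for v in row] for row in m] for m in counts]
    return new_weights, new_pwms

def add_component(weights, pwms, data, alpha=0.1, em_iters=10):
    """Global step: pick the seed that most improves likelihood. Local step: EM."""
    best = None
    for word in sorted(set(data)):
        cand_w = [wt * (1 - alpha) for wt in weights] + [alpha]
        cand_p = pwms + [seed_pwm(word)]
        ll = log_likelihood(cand_w, cand_p, data)
        if best is None or ll > best[0]:
            best = (ll, cand_w, cand_p)
    _, weights, pwms = best
    for _ in range(em_iters):
        weights, pwms = em_step(weights, pwms, data)
    return weights, pwms

def greedy_mixture(seqs, w, n_motifs):
    """Start from background only and add one motif component at a time."""
    data = windows(seqs, w)
    weights, pwms = [1.0], []
    for _ in range(n_motifs):
        weights, pwms = add_component(weights, pwms, data)
    return weights, pwms
```

In this sketch, earlier components stay in the mixture while a new one is fitted, so no occurrences are erased from the data between rounds.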