Background Accumulated evidence shows that the abnormal regulation of long non-coding RNA (lncRNA) is associated with various human diseases. Accurately identifying disease-associated lncRNAs is helpful to study the m...
详细信息
Background Accumulated evidence shows that the abnormal regulation of long non-coding RNA (lncRNA) is associated with various human diseases. Accurately identifying disease-associated lncRNAs is helpful to study the mechanism of lncRNAs in diseases and explore new therapies of diseases. Many lncRNA-disease association (LDA) prediction models have been implemented by integrating multiple kinds of data resources. However, most of the existing models ignore the interference of noisy and redundancy information among these data resources. Results To improve the ability of LDA prediction models, we implemented a random forest and feature selection based LDA prediction model (RFLDA in short). First, the RFLDA integrates the experiment-supported miRNA-disease associations (MDAs) and LDAs, the disease semantic similarity (DSS), the lncRNA functional similarity (LFS) and the lncRNA-miRNA interactions (LMI) as input features. Then, the RFLDA chooses the most useful features to train prediction model by feature selection based on the random forest variable importance score that takes into account not only the effect of individual feature on prediction results but also the joint effects of multiple features on prediction results. Finally, a random forest regression model is trained to score potential lncRNA-disease associations. In terms of the area under the receiver operating characteristic curve (AUC) of 0.976 and the area under the precision-recall curve (AUPR) of 0.779 under 5-fold cross-validation, the performance of the RFLDA is better than several state-of-the-art LDA prediction models. Moreover, case studies on three cancers demonstrate that 43 of the 45 lncRNAs predicted by the RFLDA are validated by experimental data, and the other two predicted lncRNAs are supported by other LDA prediction models. Conclusions Cross-validation and case studies indicate that the RFLDA has excellent ability to identify potential disease-associated lncRNAs.
Patterns on proteins and genomic sequences are vastly analyzed, extracted and collected in databases. Although protein patterns originate from genomic coding regions, very few works have directly or indirectly dealt w...
详细信息
Patterns on proteins and genomic sequences are vastly analyzed, extracted and collected in databases. Although protein patterns originate from genomic coding regions, very few works have directly or indirectly dealt with coding region patterns induced from protein patterns. Results: In this paper, we have defined a new genomic pattern structure suitable for representing induced patterns from proteins. The provided pattern structure, which is called "Consecutive Positions Scoring Matrix (CPSSM)", is a replacement for protein patterns and profiles in the genomic context. CPSSMs can be identified, discovered, and searched in genomes. Then, we have presented a novel pattern matching algorithm between the defined genomic pattern and genomic sequences based on dynamic programming. In addition, we have modified the provided algorithm to support intronic gaps and huge sequences. We have implemented and tested the provided algorithm on real data. The results on Saccharomyces cerevisiae's genome show 132% more true positives and no false negatives and the results on human genome show no false negatives and 10 times as many true positives as those in previous works. Conclusion: CPSSM and provided methods could be used for open reading frame detection and gene finding. The application is available with source codes to run and download at http://***. ir/cpssm/.
Background: Restriction site analysis involves determining the locations of restriction sites after the process of digestion by reconstructing their positions based on the lengths of the cut DNA. Using different react...
详细信息
Background: Restriction site analysis involves determining the locations of restriction sites after the process of digestion by reconstructing their positions based on the lengths of the cut DNA. Using different reaction times with a single enzyme to cut DNA is a technique known as a partial digestion. Determining the exact locations of restriction sites following a partial digestion is challenging due to the computational time required even with the best known practical algorithm. Results: In this paper, we introduce an efficient algorithm to find the exact solution for the partial digest problem. The algorithm is able to find all possible solutions for the input and works by traversing the solution tree with a breadth-first search in two stages and deleting all repeated subproblems. Two types of simulated data, random and Zhang, are used to measure the efficiency of the algorithm. We also apply the algorithm to real data for the Luciferase gene and the E. coli K12 genome. Conclusion: Our algorithm is a fast tool to find the exact solution for the partial digest problem. The percentage of improvement is more than 75% over the best known practical algorithm for the worst case. For large numbers of inputs, our algorithm is able to solve the problem in a suitable time, while the best known practical algorithm is unable.
暂无评论