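Ribosomal RNA (rRNA) is a macromolecule that performs key functions in the cell. Although many platforms support rRNA secondary-structure visualization and sequence alignment, they struggle to produce standardized 2D diagrams, and their output is not readily usable for direct sequence comparison. This paper therefore proposes RNA-SD (rRNA Splitting and Drawing), a template-based rRNA visualization method. RNA-SD defines a standard template, splits the full sequence according to the characteristics of rRNA secondary structures, and maps the resulting pieces onto the template. It generates publication-quality figures of rRNA secondary structure and also allows any substructure to be extracted for sequence comparison and statistical analysis. Experiments on more than 14,000 metazoan mitochondrial genomes demonstrate the feasibility and scalability of the method. Compared with existing methods, RNA-SD not only classifies and visualizes rRNA structures effectively but also supports substructure-level sequence comparison, making it a useful tool for large-scale comparative rRNA studies. The work fills a gap left by existing methods, advances research on rRNA structural diversity and evolutionary patterns, and offers new ideas and directions for understanding rRNA structure and function and for developments in medicine, biotechnology, and related fields.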
The traditional k-means algorithm has three well-known problems: the number of clusters k cannot be determined in advance, the initial cluster centers are chosen at random, and the result is easily distorted by outliers. To address these problems, the ODT-kmeans algorithm first uses the LOF (Local Outlier Factor) outlier detection algorithm to compute the outlier factor of each data object in the data set and removes the objects whose outlier factor exceeds a specified threshold. It then uses the elbow method to determine the best k for the data set. Initial cluster centers are selected by combining the ideas of maximum density and maximum distance with the outlier factor of each point, after which the cluster centers are updated iteratively. Once clustering is complete, the data objects within each cluster are further refined using the idea of three-way decisions. Experimental results show that ODT-kmeans selects a reasonable k, reduces the influence of outliers, and removes the dependence on randomly chosen initial cluster centers, thereby improving the accuracy of k-means clustering.
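The outlier-removal step can be illustrated with a minimal sketch. The abstract does not give the neighborhood size or the threshold, so the function names `lof_scores` and `drop_outliers` and the defaults `k=3` and `threshold=1.5` below are assumptions chosen for illustration, not the authors' implementation. The sketch also simplifies standard LOF by using exactly k nearest neighbors and ignoring ties at the k-distance.

```python
import math


def lof_scores(points, k=3):
    """Return a Local Outlier Factor for each point (brute force, O(n^2)).

    Simplification (assumption): each neighborhood holds exactly k points,
    so ties at the k-distance are not expanded as in the full definition.
    """
    n = len(points)
    # For each point, the k nearest neighbors as (distance, index) pairs.
    neigh = []
    for i in range(n):
        dists = sorted(
            (math.dist(points[i], points[j]), j) for j in range(n) if j != i
        )
        neigh.append(dists[:k])
    # k-distance: distance from each point to its k-th nearest neighbor.
    k_dist = [nb[-1][0] for nb in neigh]

    def local_reach_density(i):
        # Reachability distance from i to neighbor j is max(k_dist[j], d(i, j)).
        reach = [max(k_dist[j], d) for d, j in neigh[i]]
        # Guard against division by zero when points coincide.
        return len(reach) / max(sum(reach), 1e-12)

    dens = [local_reach_density(i) for i in range(n)]
    # LOF: mean neighbor density divided by the point's own density.
    return [
        sum(dens[j] for _, j in neigh[i]) / (len(neigh[i]) * dens[i])
        for i in range(n)
    ]


def drop_outliers(points, k=3, threshold=1.5):
    """Keep only points whose LOF does not exceed the (hypothetical) threshold."""
    scores = lof_scores(points, k)
    return [p for p, s in zip(points, scores) if s <= threshold]


if __name__ == "__main__":
    data = [(0, 0), (0, 1), (1, 0), (1, 1), (0.5, 0.5), (10, 10)]
    print(drop_outliers(data))
```

Points inside a dense region score close to 1, while an isolated point scores well above 1, so filtering by a threshold before clustering keeps such points from pulling cluster centers away from the bulk of the data. The later steps (elbow method, density- and distance-based center selection, three-way decision refinement) are not shown because the abstract does not specify them in enough detail.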