Author affiliations: Univ Alberta Dept Comp Sci Edmonton AB T6G 2R3 Canada; North Carolina A&T State Univ Dept Ind & Syst Engn Greensboro NC 27406 USA; Naval Postgrad Sch Dept Operat Res Monterey CA 93943 USA; Inst Stat Math 10-3 Midoricho Tachikawa Tokyo 1908562 Japan; Inst Stat Math Res Ctr Stat Machine Learning 10-3 Midoricho Tachikawa Tokyo 1908562 Japan; Grad Univ Adv Studies 10-3 Midoricho Tachikawa Tokyo 1908562 Japan
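Publication: IEEE-ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS
Year, volume, issue: 2020, Vol. 17, No. 4
Pages: 1222-1230
Core indexing:
Subject classification: 0710 [Science - Biology]; 0808 [Engineering - Electrical Engineering]; 08 [Engineering]; 0714 [Science - Statistics (degrees in science or economics may be conferred)]; 0701 [Science - Mathematics]; 0812 [Engineering - Computer Science and Technology (degrees in engineering or science may be conferred)]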
Funding: Japan Society for the Promotion of Science [JSPS KAKENHI 26280009]; National Science Foundation [ND EPSCoR NSF 1355466]; National Science Foundation [Division of Mathematical Sciences: CDSE-MSS program]; Direct For Mathematical & Physical Scien, Funding Source: National Science Foundation; Division Of Mathematical Sciences, Funding Source: National Science Foundation
Keywords: Gene trees; missing information; mixed integer non-linear programming
Abstract: Advances in modern genomics have allowed researchers to apply phylogenetic analyses on a genome-wide scale. While large volumes of genomic data can be generated cheaply and quickly, data missingness is a non-trivial and somewhat expected problem. Since the available information is often incomplete for a given set of genetic loci and individual organisms, a large proportion of trees that depict the evolutionary history of a single genetic locus, called gene trees, fail to contain all individuals. Data incompleteness causes difficulties in data collection, information extraction, and gene tree inference. Furthermore, identifying outlying gene trees, which can represent horizontal gene transfers, gene duplications, or hybridizations, is difficult when data is missing from the gene trees. The typical approach is to remove all individuals with missing data from the gene trees, and focus the analysis on individuals whose information is fully available - a huge loss of information. In this work, we propose and design an optimization-based imputation approach to infer the missing distances between leaves in a set of gene trees via a mixed integer non-linear programming model. We also present a new research pipeline, imPhy, that can (i) simulate a set of gene trees with leaves randomly missing in each tree, (ii) impute the missing pairwise distances in each gene tree, (iii) reconstruct the gene trees using the Neighbor Joining (NJ) and Unweighted Pair Group Method with Arithmetic Mean (UPGMA) methods, and (iv) analyze and report the efficiency of the reconstruction. To impute the missing leaves, we employ our newly proposed non-linear programming framework, and demonstrate its capability in reconstructing gene trees with incomplete information in both simulated and empirical datasets.
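Illustrative note (not part of the catalog record or the paper): the imputation and evaluation steps (ii) and (iv) can be pictured with a minimal Python sketch. It is a deliberately simplified stand-in, not the mixed integer non-linear programming model from the abstract: missing pairwise distances are filled with the mean of the same entry across the other gene trees. The function names `impute_by_mean` and `normalized_mse`, the use of `None` for a missing distance, and the particular normalization (squared error divided by squared true distance over the imputed entries) are all assumptions made for illustration.

```python
from typing import List, Optional

# A pairwise leaf-distance matrix; None marks a distance that is missing
# because one of the two leaves is absent from that gene tree.
Matrix = List[List[Optional[float]]]


def impute_by_mean(matrices: List[Matrix]) -> List[Matrix]:
    """Fill each missing entry with the mean of that entry over the trees
    that do observe it. Naive baseline, NOT the paper's MINLP model."""
    n = len(matrices[0])
    filled = [[row[:] for row in m] for m in matrices]
    for i in range(n):
        for j in range(n):
            known = [m[i][j] for m in matrices if m[i][j] is not None]
            if not known:
                continue  # no tree observes this pair; leave it missing
            mean = sum(known) / len(known)
            for f in filled:
                if f[i][j] is None:
                    f[i][j] = mean
    return filled


def normalized_mse(truth: Matrix, imputed: Matrix, observed: Matrix) -> float:
    """Squared error over the imputed entries, divided by the squared true
    distances over the same entries (an assumed normalization)."""
    num = den = 0.0
    for i, row in enumerate(observed):
        for j, value in enumerate(row):
            if value is None and imputed[i][j] is not None:
                num += (imputed[i][j] - truth[i][j]) ** 2
                den += truth[i][j] ** 2
    return num / den if den else 0.0


# Hypothetical toy data: three gene trees on three leaves; leaf 2 is
# missing from the second tree, so its row and column are unknown.
tree_a = [[0.0, 2.0, 4.0], [2.0, 0.0, 4.0], [4.0, 4.0, 0.0]]
tree_c = [[0.0, 2.0, 6.0], [2.0, 0.0, 6.0], [6.0, 6.0, 0.0]]
truth_b = [[0.0, 2.0, 4.0], [2.0, 0.0, 4.0], [4.0, 4.0, 0.0]]
observed_b = [[0.0, 2.0, None], [2.0, 0.0, None], [None, None, None]]

filled = impute_by_mean([tree_a, observed_b, tree_c])
print(filled[1])
print(normalized_mse(truth_b, filled[1], observed_b))
```

A completed distance matrix of this kind is what a distance-based method such as NJ or UPGMA would take as input in step (iii); that reconstruction step is omitted from the sketch.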
In the empirical datasets apicomplexa and lungfish, our imputation has very small normalized mean square errors, even in the extreme case where 50 percent of the individuals in each gene tree a