Search Results - Inner Mongolia University Library

Local alignment of generalized k-base encoded DNA sequence

BMC BIOINFORMATICS, 2010, Vol. 11, No. 1, pp. 1-10

Authors: Homer, Nils; Nelson, Stanley F.; Merriman, Barry. Univ Calif Los Angeles, Dept Comp Sci, Los Angeles, CA 90095, USA; Univ Calif Los Angeles, David Geffen Sch Med, Dept Human Genet, Los Angeles, CA 90095, USA

Background: DNA sequence comparison is a well-studied problem, in which two DNA sequences are compared using a weighted edit distance. Recent DNA sequencing technologies, however, observe an encoded form of the sequence, rather than each DNA base individually. The encoded DNA sequence may contain technical errors, and therefore encoded sequencing errors must be incorporated when comparing an encoded DNA sequence to a reference DNA sequence. Results: Although two-base encoding is currently used in practice, many other encoding schemes are possible, whereby two or more bases are encoded at a time. A generalized k-base encoding scheme is presented, whereby feasible higher-order encodings are better able to differentiate errors in the encoded sequence from true DNA sequence variants. A generalized version of the previous two-base encoding DNA sequence comparison algorithm is used to compare a k-base encoded sequence to a DNA reference sequence. Finally, simulations are performed to evaluate the power, the false positive and false negative SNP discovery rates, and the performance time of k-base encoding compared to previous methods as well as to the standard DNA sequence comparison algorithm. Conclusions: The novel generalized k-base encoding scheme and resulting local alignment algorithm permit the development of higher-fidelity ligation-based next-generation sequencing technology. This bioinformatic solution affords greater robustness to errors, as well as lower false SNP discovery rates, at the cost only of computational time.

Keywords: Single Nucleotide Polymorphism; encode scheme; Base Substitution; Color Error; Percent Error Rate
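
The two-base encoding mentioned in the abstract can be illustrated with a minimal sketch. It assumes the common color-space convention in which each base gets a 2-bit code and each color is the XOR of two adjacent codes; the abstract does not spell out this mapping, and the function names are hypothetical. The generalized k-base scheme of the paper is not reproduced here.

```python
# Assumed convention (not stated in the abstract): 2-bit base codes,
# color = XOR of the codes of two adjacent bases.
BASE_CODE = {"A": 0, "C": 1, "G": 2, "T": 3}
CODE_BASE = "ACGT"


def encode_two_base(seq):
    """Return one color (0-3) for every pair of adjacent bases."""
    return [BASE_CODE[a] ^ BASE_CODE[b] for a, b in zip(seq, seq[1:])]


def decode_two_base(first_base, colors):
    """Rebuild the base sequence from a known first base and its colors."""
    bases = [first_base]
    for color in colors:
        # XOR is its own inverse, so the next base follows from the last one.
        bases.append(CODE_BASE[BASE_CODE[bases[-1]] ^ color])
    return "".join(bases)


reference = encode_two_base("ACGT")   # colors of the reference
variant = encode_two_base("ATGT")     # one substituted base
```

Under this convention a single substituted base changes two adjacent colors, while a single miscalled color alters every base decoded after it. That difference is what lets an aligner separate encoding errors from true variants.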

Grading amino acid properties increased accuracies of single point mutation on protein stability prediction

BMC BIOINFORMATICS, 2012, Vol. 13, No. 1, pp. 1-11

Authors: Liu, Jianguo; Kang, Xianjiang. Hebei Univ, Sch Life Sci, Baoding 071002, Hebei, Peoples R China

Background: Protein stability can sometimes be affected by point mutations introduced into the protein. Current sequence-information-based encoding schemes for machine learning approaches to protein stability prediction include sparse encoding and amino acid property encoding. Property encoding schemes employ physical-chemical information about the mutated protein environment; however, they also become more complex as more properties are added to the scheme. This complexity introduces noise that affects the accuracy of machine learning algorithms. To overcome the problem, we describe a new encoding scheme that grades the twenty amino acids into groups according to their specific property values. Results: We employed three predefined values, 0.1, 0.5, and 0.9, to represent the 'weak', 'middle', and 'strong' groups for each amino acid property, and introduced two thresholds for each property to assign each of the twenty amino acids to one of the three groups according to its property value. Each amino acid can thus take only one of three predefined values, rather than twenty different values, for each property. The complexity and noise in the encoding schemes were reduced in this way. An average accuracy improvement of more than 7% was found for the graded amino acid property encoding schemes in 20-fold cross-validation. The overall accuracy of our method is more than 72% on the independent test sets, starting from sequence information with three-state prediction definitions. Conclusions: Grading the numeric values of amino acid properties can reduce the noise and complexity of the input information. It accords with biochemical concepts of amino acid properties and at the same time simplifies the input data. The idea of graded property encoding schemes may be applied to protein-related predictions with machine learning approaches.

Keywords: encode scheme; Machine Learning Approach; Neutral Mutation; Amino Acid Property; Mutation Sample
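
The grading step described in the abstract (two thresholds per property, three output values) could look roughly like the sketch below. The values 0.1, 0.5, and 0.9 come from the abstract. The property scale, the thresholds, and all names are illustrative assumptions, not the ones used in the paper.

```python
# Graded values from the abstract: 'weak', 'middle', 'strong'.
GRADES = (0.1, 0.5, 0.9)


def grade(value, low, high):
    """Map a raw property value to one of the three graded values."""
    if value < low:
        return GRADES[0]
    if value < high:
        return GRADES[1]
    return GRADES[2]


def encode_window(residues, scale, low, high):
    """Encode a sequence window as graded values of one property."""
    return [grade(scale[r], low, high) for r in residues]


# Hypothetical hydropathy-like values for four residues and made-up
# thresholds, used only to exercise the functions above.
TOY_SCALE = {"I": 4.5, "G": -0.4, "S": -0.8, "R": -4.5}
encoded = encode_window("IGSR", TOY_SCALE, low=-1.0, high=1.0)
```

With these toy numbers each residue contributes one of three values per property instead of its raw score, which is the reduction in input complexity the abstract refers to.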

Supervised multivariate analysis of sequence groups to identify specificity determining residues

BMC BIOINFORMATICS, 2007, Vol. 8, No. 1, p. 135

Authors: Wallace, Iain M.; Higgins, Desmond G. Univ Coll Dublin, Conway Inst Biomol & Biomed Res, Dublin 4, Ireland

Background: Proteins that evolve from a common ancestor can change functionality over time, and it is important to be able to identify the residues that cause this change. In this paper we show how a supervised multivariate statistical method, Between Group Analysis (BGA), can be used to identify these residues from families of proteins with different substrate specificities using multiple sequence alignments. Results: We demonstrate the usefulness of this method on three different test cases. Two of these test cases, the Lactate/Malate dehydrogenase family and Nucleotidyl Cyclases, consist of two functional groups. The other family, Serine Proteases, consists of three groups. BGA was used to analyse and visualise these three families using two different encoding schemes for the amino acids. Conclusion: The overall combination of methods in this paper is powerful and flexible while being computationally very fast and simple. BGA is especially useful because it can be used to analyse any number of functional classes. In the examples in this paper, we used only 2 or 3 classes for demonstration purposes, but any number can be used and visualised.

Keywords: Correspondence Analysis; encode scheme; Chymotrypsin; Sequence Weight; Ordination Technique

A comparison study of succinct data structures for use in GWAS

BMC BIOINFORMATICS, 2013, Vol. 14, No. 1, pp. 1-7

Authors: Putnam, Patrick P.; Zhang, Ge; Wilsey, Philip A. Sch Elect & Comp Syst, Expt Comp Lab, Cincinnati, OH 45221, USA; Cincinnati Childrens Hosp Med Ctr, Cincinnati, OH 45229, USA

Background: In recent years genetic data analysis has seen a rapid increase in the scale of data to be analyzed. Schadt et al (NRG 11:647-657, 2010) observed that, with data sets approaching the petabyte scale, data-related challenges such as formatting, management, and transfer are increasingly important topics which need to be addressed. The use of succinct data structures is one method of reducing the physical size of a data set without the use of expensive compression techniques. In this work, we consider the use of 2- and 3-bit encoding schemes for genotype data. We compare the computational performance of allele or genotype counting algorithms utilizing genotype data encoded in both schemes. Results: We perform a comparison of 2- and 3-bit genotype encoding schemes for use in genotype counting algorithms. We find that there is a 20% overhead when building simple frequency tables from 2-bit encoded genotypes. However, building pairwise count tables for genome-wide epistasis is 1.0% more efficient. Conclusions: In this work, we were concerned with comparing the performance benefits and disadvantages of using more densely packed genotype data representations in Genome-Wide Association Studies (GWAS). We implemented a 2-bit encoding for genotype data, and compared it against a more commonly used 3-bit encoding scheme. We also developed a C++ library, libgwaspp, which offers these data structures and implementations of several common GWAS algorithms. In general, the 2-bit encoding consumes less memory and is slightly more efficient in some algorithms than the 3-bit encoding.

Keywords: Graphic Processing Unit; Contingency Table; encode scheme; Frequency Table; Brute Force Algorithm
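
A 2-bit genotype encoding of the kind compared in the abstract can be sketched as follows. The code assignment (0, 1, 2 for the genotype classes and 3 for missing), the bit order, and the function names are assumptions made for illustration. This is not the layout used by libgwaspp, which the abstract does not describe.

```python
def pack_genotypes(genotypes):
    """Pack genotype codes (0-3) four per byte, lowest bits first."""
    packed = bytearray((len(genotypes) + 3) // 4)
    for i, g in enumerate(genotypes):
        if not 0 <= g <= 3:
            raise ValueError("genotype code must fit in 2 bits")
        packed[i // 4] |= g << (2 * (i % 4))
    return bytes(packed)


def count_genotypes(packed, n):
    """Build a frequency table of the first n packed genotype codes."""
    counts = [0, 0, 0, 0]
    for i in range(n):
        # Shift the wanted 2-bit field to the bottom and mask it out.
        counts[(packed[i // 4] >> (2 * (i % 4))) & 0b11] += 1
    return counts


# Assumed coding: 0/1/2 = genotype classes, 3 = missing call.
sample = [0, 1, 2, 1, 3, 0]
packed = pack_genotypes(sample)
table = count_genotypes(packed, len(sample))
```

The shift-and-mask step in `count_genotypes` is the kind of unpacking work that a denser representation adds to simple frequency counting, in exchange for a smaller memory footprint.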

Light-weight reference-based compression of FASTQ data

BMC BIOINFORMATICS, 2015, Vol. 16, No. 1, pp. 1-8

Authors: Zhang, Yongpeng; Li, Linsen; Yang, Yanli; Yang, Xiao; He, Shan; Zhu, Zexuan. Shenzhen Univ, Coll Comp Sci & Software Engn, Shenzhen 518060, Peoples R China; Broad Inst, Cambridge, MA 02142, USA; Univ Birmingham, Sch Comp Sci, Birmingham B15 2TT, W Midlands, England

Background: The exponential growth of next generation sequencing (NGS) data has posed big challenges to data storage, management and archiving. Data compression is one of the effective solutions, where reference-based compression strategies can typically achieve superior compression ratios compared to those not relying on any reference. Results: This paper presents a lossless light-weight reference-based compression algorithm, LW-FQZip, to compress FASTQ data. The three components of any given input, i.e., metadata, short reads and quality score strings, are first parsed into three data streams in which redundant information is identified and eliminated independently. In particular, well-designed incremental and run-length-limited encoding schemes are utilized to compress the metadata and quality score streams, respectively. To handle the short reads, LW-FQZip uses a novel light-weight mapping model to map them quickly against external reference sequence(s) and produce concise alignment results for storage. The three processed data streams are then packed together with general-purpose compression algorithms such as LZMA. LW-FQZip was evaluated on eight real-world NGS data sets and achieved compression ratios in the range of 0.111-0.201. This is comparable or superior to other state-of-the-art lossless NGS data compression algorithms. Conclusions: LW-FQZip is a program that enables efficient lossless FASTQ data compression. It contributes to state-of-the-art applications for NGS data storage and transmission. LW-FQZip is freely available online at: http://***/staff/zhuzx/LWFQZip.

Keywords: Quality Score; Compression Ratio; encode scheme; Next Generation Sequencing Data; FASTQ Format
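
As a rough illustration of the run-length-limited idea applied to quality score strings, the sketch below encodes a string as (symbol, run length) pairs and caps each run at a maximum length. This is a generic scheme written for this listing. The cap, the pair representation, and the function names are assumptions, and the actual LW-FQZip format is not described in the abstract.

```python
def rle_limited(quality, max_run=255):
    """Encode a string as (symbol, run_length) pairs with capped runs."""
    runs = []
    for ch in quality:
        if runs and runs[-1][0] == ch and runs[-1][1] < max_run:
            # Extend the current run while it is below the cap.
            runs[-1] = (ch, runs[-1][1] + 1)
        else:
            # New symbol, or the cap was reached: start a fresh run.
            runs.append((ch, 1))
    return runs


def rle_decode(runs):
    """Invert rle_limited; the round trip is lossless."""
    return "".join(ch * n for ch, n in runs)


encoded = rle_limited("IIIIIHH##", max_run=3)
restored = rle_decode(encoded)
```

Capping the run length keeps every count within a fixed number of bits, so each pair has a predictable size before a general-purpose compressor is applied.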

Prediction of mucin-type O-glycosylation sites in mammalian proteins using the composition of k-spaced amino acid pairs

BMC BIOINFORMATICS, 2008, Vol. 9, No. 1, p. 101

Authors: Chen, Yong-Zi; Tang, Yu-Rong; Sheng, Zhi-Ya; Zhang, Ziding. China Agr Univ, Coll Biol Sci, Bioinformat Ctr, Beijing 100094, Peoples R China; Natl Inst Biol Sci, Beijing 102206, Peoples R China

Background: As one of the most common protein post-translational modifications, glycosylation is involved in a variety of important biological processes. Computational identification of glycosylation sites in protein sequences becomes increasingly important in the post-genomic era. A new encoding scheme was employed to improve the prediction of mucin-type O-glycosylation sites in mammalian proteins. Results: A new protein bioinformatics tool, CKSAAP_OGlySite, was developed to predict mucin-type O-glycosylation serine/threonine (S/T) sites in mammalian proteins. Using the composition of k-spaced amino acid pairs (CKSAAP) based encoding scheme, the proposed method was trained and tested in a new and stringent O-glycosylation dataset with the assistance of Support Vector Machine (SVM). When the ratio of O-glycosylation to non-glycosylation sites in training datasets was set as 1:1, 10-fold cross-validation tests showed that the proposed method yielded a high accuracy of 83.1% and 81.4% in predicting O-glycosylated S and T sites, respectively. Based on the same datasets, CKSAAP_OGlySite resulted in a higher accuracy than the conventional binary encoding based method (about +5.0%). When trained and tested in 1:5 datasets, the CKSAAP encoding showed a more significant improvement than the binary encoding. We also merged the training datasets of S and T sites and integrated the prediction of S and T sites into one single predictor (i.e. S+T predictor). Either in 1:1 or 1:5 datasets, the performance of this S+T predictor was always slightly better than those predictors where S and T sites were independently predicted, suggesting that the molecular recognition of O-glycosylated S/T sites seems to be similar and the increase of the S+T predictor's accuracy may be a result of expanded training datasets. Moreover, CKSAAP_OGlySite was also shown to have better performance when benchmarked against two existing predictors. Conclusion: Because of CKSAAP encoding's ability

Keywords: Support Vector Machine; encode scheme; Support Vector Machine Model; Matthew Correlation Coefficient; Support Vector Machine Algorithm
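
The composition of k-spaced amino acid pairs can be sketched as a feature extractor: for each gap k, count every ordered residue pair separated by k positions and normalize by the number of pairs in the fragment. The fragment, the maximum gap, the normalization detail, and the function name below are assumptions for illustration; the abstract does not give the window size or the k range used by CKSAAP_OGlySite.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
# All 400 ordered pairs of the twenty standard residues.
PAIRS = ["".join(p) for p in product(AMINO_ACIDS, repeat=2)]


def cksaap(fragment, k_max=2):
    """Return pair frequencies for gaps k = 0..k_max (400 values per k)."""
    features = []
    for k in range(k_max + 1):
        counts = dict.fromkeys(PAIRS, 0)
        n_pairs = len(fragment) - k - 1
        for i in range(n_pairs):
            pair = fragment[i] + fragment[i + k + 1]
            if pair in counts:  # ignore pairs with non-standard residues
                counts[pair] += 1
        total = max(n_pairs, 1)  # avoid dividing by zero on short input
        features.extend(counts[p] / total for p in PAIRS)
    return features


# Hypothetical five-residue fragment, gaps k = 0 and k = 1.
vector = cksaap("ASTSA", k_max=1)
```

The resulting fixed-length vector (400 values per gap) is the kind of input that can be passed to an SVM, in contrast to a binary encoding that marks only which residue occupies each position.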

Stimulus encoding and correlates with behavior in area MT of visual cortex is dependent on spike phase

BMC Neuroscience, 2007, Vol. 8, No. 2, p. 1

Authors: Nicolas Y Masse; Erik P Cook. Department of Physiology, McGill University, Montreal, Quebec, Canada

Optimal information encoding for multiple, simultaneously presented stimuli

BMC Neuroscience, 2012, Vol. 13, No. 1, p. 1

Authors: Jan Pieczkowski; Jeanette Hellgren Kotaleski; Lawrence York; Mark van Rossum. Department of Computational Biology, CSC, Royal Institute of Technology, Stockholm, Sweden; Department of Informatics, Edinburgh University, Edinburgh EH8 9AB, UK; Department of Neuroscience, Karolinska Institutet, Stockholm, Sweden
