Background: With the rise of publicly available genomic data repositories, it is now common for scientists to rely on computational models and preprocessed data, either as controls or to discover new knowledge. However, different repositories adhere to different principles and guidelines, and data processing plays a significant role in the quality of the resulting datasets. Two popular repositories for transcription factor binding site data, ENCODE and Cistrome, process the same biological samples in alternative ways, and their results are not always consistent. Moreover, the output format of the processing (BED narrowPeak) exposes a feature, the signalValue, which is seldom used in consistency checks but can offer valuable insight into the quality of the *** provide evidence that data points with high signalValue (top 25% of values) are more likely to be consistent between ENCODE and Cistrome in human cell lines K562, GM12878, and HepG2. In addition, we show that filtering according to these high values improves the quality of predictions for a machine learning algorithm that detects transcription factor interactions based only on positional information. Finally, we provide a set of practices and guidelines, based on the signalValue feature, for scientists who wish to compare and merge narrowPeaks from ENCODE and *** signalValue feature is an informative feature that can be effectively used to highlight consistent areas of overlap between different sources of TF binding sites that expose it. Its applicability extends downstream to positional machine learning algorithms, making it a powerful tool for performance tweaking and data aggregation.
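The top-25% signalValue filter described in this abstract can be illustrated with a minimal sketch. This is not the authors' pipeline: the function names `parse_narrowpeak` and `top_fraction_by_signal` are hypothetical, the sample rows are made up, and the code assumes only the standard tab-separated narrowPeak layout in which signalValue is the seventh column.

```python
from typing import Iterable, List, NamedTuple


class Peak(NamedTuple):
    chrom: str
    start: int
    end: int
    signal_value: float


def parse_narrowpeak(lines: Iterable[str]) -> List[Peak]:
    """Parse tab-separated narrowPeak rows, keeping coordinates and signalValue."""
    peaks = []
    for line in lines:
        line = line.strip()
        # Skip blank lines and optional header lines.
        if not line or line.startswith(("#", "track", "browser")):
            continue
        fields = line.split("\t")
        # Column 7 (index 6) of narrowPeak is signalValue.
        peaks.append(Peak(fields[0], int(fields[1]), int(fields[2]), float(fields[6])))
    return peaks


def top_fraction_by_signal(peaks: List[Peak], fraction: float = 0.25) -> List[Peak]:
    """Return the highest-signalValue peaks, keeping roughly `fraction` of them."""
    if not peaks:
        return []
    keep = max(1, int(len(peaks) * fraction))  # always keep at least one peak
    ranked = sorted(peaks, key=lambda p: p.signal_value, reverse=True)
    return ranked[:keep]


# Made-up rows for illustration only (not real ENCODE or Cistrome data).
example_lines = [
    f"chr1\t{i * 100}\t{i * 100 + 50}\tpeak{i}\t0\t.\t{float(i)}\t-1\t-1\t25"
    for i in range(1, 9)
]
high_confidence = top_fraction_by_signal(parse_narrowpeak(example_lines))
```

The same filter would be applied to each source separately before comparing or merging peaks; whether the cutoff should be computed per file or per transcription factor is a design choice the abstract does not specify.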
Maintaining patient record confidentiality in the age of telemedicine and e-health has become a difficult task. The digitization of medical data has increased the obligation of medical professionals to protect patient...
ISBN: (print) 9783030638351; 9783030638368
Genomic experiments produce large sets of data, many of which are publicly available. Investigating these datasets using bioinformatics data mining techniques may reveal novel biological knowledge. We developed a bioinformatics pipeline to investigate ChIP-seq datasets of DNA-binding proteins for the HepG2 liver cancer cell line, downloaded from the ENCODE project. Of 276 datasets, 175 passed our proposed quantity control testing. A pair-wise DNA co-location analysis tool that we developed revealed a cluster of 19 proteins significantly co-locating on DNA binding regions. The results were confirmed by tools from other labs. Narrowing down our bioinformatics analysis showed a strong enrichment of the DNA-binding protein SIN3A at activator (H3K79me2) and repressor (H3K27me3) marks, indicating that SIN3A plays an important regulatory role in vital liver functions. To test whether this enrichment varies in liver infection, we compared histone modifications between HepG2 and HepG2.2.15 cells (HepG2-derived cells stably expressing hepatitis B virus (HBV)) and observed an increase in SIN3A enrichment in promoter regions (H3K4me3), confirming a known biological phenotype. The mechanistic role of the SIN3A protein in liver injury or insult during liver infection warrants further dry and wet lab investigations.
Skin pigmentation in humans is a complex trait that varies widely, both within and between human populations. The exact players governing skin pigmentation remain elusive to date. Various Genome Wide Association Studies (GWAS) have shown the association of different genomic variants with normal human skin pigmentation, often indicating genes with no direct implications in melanin biosynthesis or distribution. Little has been explained about how the associated Single-Nucleotide Polymorphisms (SNPs) function in modulating the skin pigmentation phenotype. In the present study, which, to our knowledge, is the first of its kind, we analyzed and prioritized 519 non-coding SNPs and 24 3′UTR SNPs emerging from 14 different human skin pigmentation-related GWAS, primarily using several ENCODE-based web tools such as rSNPBase, RegulomeDB, and HaploReg, most of which incorporate experimentally validated evidence in their predictions. Using this comprehensive in-silico analytical approach, we prioritized all the pigmentation-associated GWAS SNPs and attempted to annotate them with pigmentation-related functionality, which would pave the way for a deeper understanding of the molecular basis of human skin pigmentation variation.
Characterizing the tissue-specific binding sites of transcription factors (TFs) is essential to reconstruct gene regulatory networks and predict functions for non-coding genetic variation. DNase-seq footprinting enables the prediction of genome-wide binding sites for hundreds of TFs simultaneously. Despite the public availability of high-quality DNase-seq data from hundreds of samples, a comprehensive, up-to-date resource for the locations of genomic footprints is lacking. Here, we develop a scalable footprinting workflow using two state-of-the-art algorithms: Wellington and HINT. We apply our workflow to detect footprints in 192 ENCODE DNase-seq experiments and predict the genomic occupancy of 1,515 human TFs in 27 human tissues. We validate that these footprints overlap true-positive TF binding sites from ChIP-seq. We demonstrate that the locations, depth, and tissue specificity of footprints predict effects of genetic variants on gene expression and capture a substantial proportion of genetic risk for complex traits.
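The validation step mentioned in this abstract, checking that footprints overlap ChIP-seq binding sites, can be sketched as an interval-overlap computation. The following is only an illustration under stated assumptions, not the workflow used in the paper: the helper names `merge_intervals` and `fraction_overlapping` are hypothetical, the coordinates are made up, and intervals are treated as half-open BED-style `(chrom, start, end)` tuples.

```python
from bisect import bisect_left
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

Interval = Tuple[str, int, int]  # (chrom, start, end), half-open like BED


def merge_intervals(intervals: Iterable[Interval]) -> Dict[str, List[Tuple[int, int]]]:
    """Group intervals by chromosome and merge overlapping or touching ones."""
    by_chrom = defaultdict(list)
    for chrom, start, end in intervals:
        by_chrom[chrom].append((start, end))
    merged = {}
    for chrom, spans in by_chrom.items():
        spans.sort()
        out = [spans[0]]
        for start, end in spans[1:]:
            if start <= out[-1][1]:
                out[-1] = (out[-1][0], max(out[-1][1], end))
            else:
                out.append((start, end))
        merged[chrom] = out
    return merged


def fraction_overlapping(footprints: List[Interval], peaks: Iterable[Interval]) -> float:
    """Fraction of footprints sharing at least one base with any peak."""
    merged = merge_intervals(peaks)
    starts = {chrom: [s for s, _ in spans] for chrom, spans in merged.items()}
    hits = 0
    for chrom, start, end in footprints:
        if chrom not in merged:
            continue
        # Last merged peak whose start lies before the footprint's end.
        i = bisect_left(starts[chrom], end) - 1
        if i >= 0 and merged[chrom][i][1] > start:
            hits += 1
    return hits / len(footprints) if footprints else 0.0


# Made-up coordinates for illustration only.
toy_peaks = [("chr1", 100, 200), ("chr1", 150, 300), ("chr2", 50, 60)]
toy_footprints = [("chr1", 190, 210), ("chr1", 300, 310), ("chr2", 55, 56)]
toy_fraction = fraction_overlapping(toy_footprints, toy_peaks)
```

Merging the peaks first keeps each lookup to a single binary search, because after merging the last peak starting before the footprint's end is the only candidate that can overlap it.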
This paper examines a specific kind of part-whole relation that exists in the molecular genetic domain. The central question is under which conditions a particular molecule, such as a DNA sequence, is a biological part of the human genome. I address this question by analyzing how biologists in fact partition the human genome into parts. This paper thus presents a case study in the metaphysics of biological practice. I develop a metaphysical account of genomic parthood by analyzing the investigative and reasoning practices in the ENCODE (ENCyclopedia Of DNA Elements) project. My account reveals two conditions that determine whether a molecule is a part of the human genome (i.e., a genomic part). First, genomic parts must possess a causal role function in the genome as a whole, that is, their functions must contribute to the genome directing the overall functioning of the cell. Second, genomic parts must have a specific chemical structure and be actual segments of the DNA sequence of the genome.
Large-scale genomic data have been utilized to generate unprecedented biological findings and new hypotheses. To delineate functional elements in the human genome, the Encyclopedia of DNA Elements (ENCODE) project has...
The Encyclopedia of DNA Elements (ENCODE) web portal hosts genomic data generated by the ENCODE Consortium, Genomics of Gene Regulation, the NIH Roadmap Epigenomics Consortium, and the modENCODE and modERN projects. T...
The ENCODE project represents a major leap from merely describing and comparing genomic sequences to surveying them for direct indicators of function. The astounding quantity of data produced by the ENCODE consortium can serve as a map to locate specific landmarks, guide hypothesis generation, and lead us to principles and mechanisms underlying genome biology. Despite its broad appeal, the size and complexity of the repository can be intimidating to prospective users. We present here some background about the ENCODE data, survey the resources available for accessing them, and describe a few simple principles to help prospective users choose the data type(s) that best suit their needs, where to get them, and how to use them to their best advantage.