Search results - Inner Mongolia University Library

SSRN, 2025

Authors: Liu, Tianyu; Zhang, Xiangyu; Ying, Rex; Zhao, Hongyu. Affiliations: Interdepartmental Program in Computational Biology & Bioinformatics, Yale University, New Haven, CT 06511, United States; Department of Biostatistics, Yale University, New Haven, CT 06511, United States; Department of Computer Science, Yale University, New Haven, CT 06511, United States

Sequence-to-function models can predict gene expression from sequence data and be used to link genetic information with transcriptomics data to understand regulatory processes and their effects on complex phenotypes. The genomic language models are pre-trained with large-scale DNA sequences and can generate robust representations of these sequences by learning the genomic context. However, few studies can estimate the predictability of gene expression levels and bridge these two classes of models together to explore individualized gene expression prediction. In this manuscript, we propose UKBioBERT as a DNA language model pre-trained with genetic variants from UK BioBank. We demonstrate that UKBioBERT generates informative embeddings capable of identifying gene functions, and improving gene expression prediction in cell lines, thereby enhancing our understanding of gene expression predictability. Building upon these embeddings, we combine UKBioBERT with state-of-the-art sequence-to-function architectures, Enformer and Borzoi, to create UKBioFormer and UKBioZoi. These models exhibit better performance in predicting highly predictable gene expression levels and can be generalized across different cohorts. Furthermore, UKBioFormer effectively captures the relationship between genetic variants and expression variations, enabling in-silico mutation analyses. Collectively, our findings underscore the value of integrating genomic language models and sequence-to-function approaches for advancing functional genomics research. © 2025, The Authors. All rights reserved.

Keywords: Genome

Building a unified model for drug synergy analysis powered by large language models

Nature Communications, 2025, vol. 16, no. 1, pp. 1-17

Authors: Liu, Tianyu; Chu, Tinyi; Luo, Xiao; Zhao, Hongyu. Affiliations: Interdepartmental Program in Computational Biology & Bioinformatics, Yale University, New Haven, CT, United States; Department of Biostatistics, Yale University, New Haven, CT, United States; Department of Computer Science, University of California Los Angeles, Los Angeles, CA, United States

Drug synergy prediction is a challenging and important task in the treatment of complex diseases including cancer. In this manuscript, we present a unified model, known as BAITSAO, for tasks related to drug synergy prediction with a unified pipeline to handle different datasets. We construct the training datasets for BAITSAO based on the context-enriched embeddings from Large Language Models for the initial representation of drugs and cell lines. After demonstrating the relevance of these embeddings, we pre-train BAITSAO with a large-scale drug synergy database under a multi-task learning framework with rigorous selections of tasks. We demonstrate the superiority of the model architecture and the pre-training strategies of BAITSAO over other methods through comprehensive benchmark analysis. Moreover, we investigate the sensitivity of BAITSAO and illustrate its promising functions including drug discoveries, drug combinations-gene interaction, and multi-drug synergy predictions. © The Author(s) 2025.

Learning Multi-cellular Representations of Single-Cell Transcriptomics Data Enables Characterization of Patient-Level Disease States

29th International Conference on Research in Computational Molecular Biology, RECOMB 2025

Authors: Liu, Tianyu; De Brouwer, Edward; Kuo, Tony; Diamant, Nathaniel; Missarova, Alsu; Wang, Hanchen; Hao, Minsheng; Bravo, Hector Corrada; Scalia, Gabriele; Regev, Aviv; Heimberg, Graham. Affiliations: Research and Early Development, Genentech, South San Francisco, CA 94080, United States; Interdepartmental Program in Computational Biology and Bioinformatics, Yale University, New Haven, CT 06511, United States; Roche Informatics, F. Hoffmann-La Roche Ltd., Mississauga, Canada; Department of Computer Science, Stanford University, Palo Alto, CA 94035, United States

ISBN: 9783031902512 (print)

Single-cell RNA-seq (scRNA-seq) has become a prominent tool for studying human biology and disease. The availability of massive scRNA-seq datasets and advanced machine learning techniques has recently driven the development of single-cell foundation models that provide informative and versatile cell representations based on expression profiles. However, to understand disease states, we need to consider entire tissue ecosystems, simultaneously considering many different interacting cells. Here, we tackle this challenge by generating patient-level representations derived from multi-cellular expression context measured with scRNA-seq of tissues. We develop PaSCient, a novel model that employs a multi-level representation learning paradigm and provides importance scores at the individual cell and gene levels for fine-grained analysis across multiple cell types and gene programs characteristic of a given disease. We apply PaSCient to learn a disease model across a large-scale scRNA-seq atlas of 12.5 million cells from over 2,700 patients. Comprehensive and rigorous benchmarking demonstrates the superiority of PaSCient in disease classification and its multiple downstream applications, including dimensionality reduction, gene/cell type prioritization, and patient subgroup discovery. © The Author(s), under exclusive license to Springer Nature Switzerland AG 2025.

Keywords: Biomolecules

ChromActivity: integrative epigenomic and functional characterization assay based annotation of regulatory activity across diverse human cell types

Genome Biology, 2025, vol. 26, no. 1, pp. 1-30

Authors: Dincer, Tevfik Umut; Ernst, Jason. Affiliations: Bioinformatics Interdepartmental Program, University of California Los Angeles, Los Angeles, CA 90095, United States; Department of Biological Chemistry, University of California Los Angeles, Los Angeles, CA 90095, United States; Eli and Edythe Broad Center of Regenerative Medicine and Stem Cell Research at University of California Los Angeles, Los Angeles, CA 90095, United States; Computer Science Department, University of California Los Angeles, Los Angeles, CA 90095, United States; Jonsson Comprehensive Cancer Center, University of California Los Angeles, Los Angeles, CA 90095, United States; Molecular Biology Institute, University of California Los Angeles, Los Angeles, CA 90095, United States; Department of Computational Medicine, University of California Los Angeles, Los Angeles, CA 90095, United States

We introduce ChromActivity, a computational framework for predicting and annotating regulatory activity across the genome through integration of multiple epigenomic maps and various functional characterization datasets. ChromActivity generates genomewide predictions of regulatory activity associated with each functional characterization dataset across many cell types based on available epigenomic data. It then for each cell type produces ChromScoreHMM genome annotations based on the combinatorial and spatial patterns within these predictions and ChromScore tracks of overall predicted regulatory activity. ChromActivity provides a resource for analyzing and interpreting the human regulatory genome across diverse cell types. © The Author(s) 2025.

Keywords: CRISPR screens; Epigenome; Gene regulation; Genome annotation; Hidden Markov model; Machine learning; Massively parallel reporter assays

MetNetAPI: A flexible method to access and manipulate biological network data from MetNet

BMC Research Notes, 2010, vol. 3, no. 1, pp. 1-9

Authors: Sucaet, Yves; Wurtele, Eve Syrkin. Affiliations: Department of Genetics, Development and Cell Biology, Iowa State University, Ames, IA 50011, United States; Interdepartmental Program in Bioinformatics and Computational Biology, Iowa State University, Ames, IA 50011, United States

Background. Convenient programmatic access to different biological databases allows automated integration of scientific knowledge. Many databases support a function to download files or data snapshots, or a webservice that offers "live" data. However, the functionality that a database offers cannot be represented in a static data download file, and webservices may consume considerable computational resources from the host server. Results. MetNetAPI is a versatile Application Programming Interface (API) to the MetNetDB database. It abstracts, captures and retains operations away from a biological network repository and website. A range of database functions, previously only available online, can be immediately (and independently from the website) applied to a dataset of interest. Data is available in four layers: molecular entities, localized entities (linked to a specific organelle), interactions, and pathways. Navigation between these layers is intuitive (e.g. one can request the molecular entities in a pathway, as well as request in what pathways a specific entity participates). Data retrieval can be customized: Network objects allow the construction of new and integration of existing pathways and interactions, which can be uploaded back to our server. In contrast to webservices, the computational demand on the host server is limited to processing data-related queries only. Conclusions. An API provides several advantages to a systems biology software platform. MetNetAPI illustrates an interface with a central repository of data that represents the complex interrelationships of a metabolic and regulatory network. As an alternative to data-dumps and webservices, it allows access to a current and "live" database and exposes analytical functions to application developers. Yet it only requires limited resources on the server-side (thin server/fat client setup). The API is available for Java, *** and R programming environments and offers flexible query and bro

Keywords: Application program Interface; Molecular Entity; Biological Database; Command Line Interface; Network Object

Worm Perturb-Seq: massively parallel whole-animal RNAi and RNA-seq

Nature Communications, 2025, vol. 16, no. 1, pp. 1-21

Authors: Hefei Zhang; Xuhang Li; Shivani Nanda; Albertha J. M. Walhout; Dongyuan Song; Jingyi Jessica Li; Onur Yukselen; Alper Kucukural; Manuel Garber. Affiliations: Department of Systems Biology, University of Massachusetts Chan Medical School, Worcester, MA, USA; Bioinformatics Interdepartmental Ph.D. Program, University of California, Los Angeles, CA, USA; Department of Statistics and Data Science, Department of Biostatistics, Department of Computational Medicine, and Department of Human Genetics, University of California, Los Angeles, CA, USA; Via Scientific Inc., Cambridge, MA, USA; Department of Genomics and Computational Biology, University of Massachusetts Chan Medical School, Worcester, MA, USA

Transcriptomes provide highly informative molecular phenotypes that, combined with gene perturbation, can connect genotype to phenotype. An ultimate goal is to perturb every gene and measure transcriptome changes; however, this is challenging, especially in whole animals. Here, we present ‘Worm Perturb-Seq (WPS)’, a method that provides high-resolution RNA-sequencing profiles for hundreds of replicate perturbations at a time in living animals. WPS introduces multiple experimental advances combining strengths of Caenorhabditis elegans genetics and multiplexed RNA-sequencing with a novel analytical framework, EmpirDE. EmpirDE leverages the unique power of large transcriptomic datasets and improves statistical rigor by using gene-specific empirical null distributions to identify DEGs. We apply WPS to 103 nuclear hormone receptors (NHRs) and find a striking ‘pairwise modularity’ in which pairs of NHRs regulate shared target genes. We envision the advances of WPS to be useful not only for C. elegans, but broadly for other models, including human cells.

Doubly-stochastic normalization of the Gaussian kernel is robust to heteroskedastic noise

arXiv, 2020

Authors: Landa, Boris; Coifman, Ronald R.; Kluger, Yuval. Affiliations: Program in Applied Mathematics, Yale University; Interdepartmental Program in Computational Biology and Bioinformatics, Yale University; Department of Pathology, Yale University School of Medicine

A fundamental step in many data-analysis techniques is the construction of an affinity matrix describing similarities between data points. When the data points reside in Euclidean space, a widespread approach is to form an affinity matrix by the Gaussian kernel with pairwise distances, and to follow with a certain normalization (e.g. the row-stochastic normalization or its symmetric variant). We demonstrate that the doubly-stochastic normalization of the Gaussian kernel with zero main diagonal (i.e. no self loops) is robust to heteroskedastic noise. That is, the doubly-stochastic normalization is advantageous in that it automatically accounts for observations with different noise variances. Specifically, we prove that in a suitable high-dimensional setting where heteroskedastic noise does not concentrate too much in any particular direction in space, the resulting (doubly-stochastic) noisy affinity matrix converges to its clean counterpart with rate $m^{-1/2}$, where $m$ is the ambient dimension. We demonstrate this result numerically, and show that in contrast, the popular row-stochastic and symmetric normalizations behave unfavourably under heteroskedastic noise. Furthermore, we provide a prototypical example of simulated single-cell RNA sequence data with strong intrinsic heteroskedasticity, where the advantage of the doubly-stochastic normalization for exploratory analysis is evident. Copyright © 2020, The Authors. All rights reserved.

Keywords: Stochastic systems
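
To make the normalization described in this abstract concrete, the following is a minimal Python sketch of a doubly-stochastic scaling of a zero-diagonal Gaussian kernel. It is an illustration under stated assumptions, not the authors' implementation: the function name `doubly_stochastic_gaussian`, the bandwidth parameter `eps`, and the damped symmetric fixed-point iteration are all choices made here for the example.

```python
import numpy as np

def doubly_stochastic_gaussian(X, eps, n_iter=1000, tol=1e-10):
    """Scale a zero-diagonal Gaussian kernel so that rows and columns sum to 1.

    X   : (n, p) array of data points (hypothetical input layout).
    eps : kernel bandwidth (assumed parameterization exp(-||xi - xj||^2 / eps)).
    """
    # Pairwise squared Euclidean distances, clipped at 0 against round-off.
    sq = np.sum(X ** 2, axis=1)
    D2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * (X @ X.T), 0.0)

    # Gaussian kernel with zero main diagonal (no self loops).
    K = np.exp(-D2 / eps)
    np.fill_diagonal(K, 0.0)

    # Look for d > 0 with d_i * (K d)_i = 1, so that W = diag(d) K diag(d)
    # is symmetric and doubly stochastic. The square root damps the plain
    # update d <- 1 / (K d), which can oscillate.
    d = np.ones(K.shape[0])
    for _ in range(n_iter):
        d_new = np.sqrt(d / (K @ d))
        converged = np.max(np.abs(d_new - d)) <= tol * np.max(d)
        d = d_new
        if converged:
            break

    return d[:, None] * K * d[None, :]

# Usage on synthetic data (values chosen only for illustration).
rng = np.random.default_rng(0)
X = rng.normal(size=(30, 3))
W = doubly_stochastic_gaussian(X, eps=2.0)
print(np.abs(W.sum(axis=1) - 1.0).max())  # deviation of row sums from 1
```

Because the kernel is symmetric, a single scaling vector suffices; a general Sinkhorn scheme with separate row and column factors would be an alternative design.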

An approach to comparing tiling array and high throughput sequencing technologies for genomic transcript mapping

BMC Research Notes, 2009, vol. 2, no. 1, pp. 1-7

Authors: Sasidharan, Rajkumar; Agarwal, Ashish; Rozowsky, Joel; Gerstein, Mark. Affiliations: Molecular Biophysics and Biochemistry Department, Yale University, New Haven, CT 06520, United States; Department of Plant Biology, Carnegie Institution for Science, Stanford, CA 94305, United States; Interdepartmental Program in Computational Biology and Bioinformatics, Yale University, New Haven, CT 06520, United States; Department of Computer Science, Yale University, New Haven, CT 06520, United States

Background. There are two main technologies for transcriptome profiling, namely, tiling microarrays and high-throughput sequencing. Recently there has been a tremendous amount of excitement about the latter because of the advent of next-generation sequencing technologies and its promises. Consequently, the question of the moment is how these two technologies compare. Here we attempt to develop an approach to do a fair comparison of transcripts identified from tiling microarray and MPSS sequencing data. Findings. This comparison is a challenging task because the sequencing data is discrete while the tiling array data is continuous. We use the published rice and Arabidopsis datasets which provide currently best matched sets of arrays and sequencing experiments using a slightly earlier generation of sequencing, the MPSS tag sequencing technology. After scoring the arrays consistently in both the organisms, a first pass comparison reveals a surprisingly small overlap in transcripts of 22% and 66% respectively, in rice and Arabidopsis. However, when we do the analysis in detail, we find that this is an underestimate. In particular, when we map the probe intensities onto the sequencing tags and then look at their intensity distribution, we see that they are very similar to exons. Furthermore, restricting our comparison to only protein-coding gene loci revealed a very good overlap between the two technologies. Conclusion. Our approach to compare genome tiling microarray and MPSS sequencing data suggests that there is actually a reasonable overlap in transcripts identified by the two technologies. This overlap is distorted by the scoring and thresholding in the tiling array scoring procedure. © 2009 Sasidharan et al;licensee BioMed Central Ltd.

Keywords: Tiling Array; Transcript Mapping; Tile Path; Tiling Microarray; Intergenic Transcript

Use of Bayesian networks to dissect the complexity of genetic disease: application to the Genetic Analysis Workshop 17 simulated data

BMC Proceedings, 2011, vol. 5, Suppl. 9, p. S37

Authors: Jia Kang; Wei Zheng; Lun Li; Joon Sang Lee; Xiting Yan; Hongyu Zhao. Affiliation: Interdepartmental Program in Computational Biology and Bioinformatics, Yale University, PO Box 208009, New Haven, CT 06520-8114, USA. Contact: hongyu.zhao@yale.edu

Complex diseases are often the downstream event of a number of risk factors, including both environmental and genetic variables. To better understand the mechanism of disease onset, it is of great interest to systematically investigate the crosstalk among various risk factors. Bayesian networks provide an intuitive graphical interface that captures not only the association but also the conditional independence and dependence structures among the variables, resulting in sparser relationships between risk factors and the disease phenotype than traditional correlation-based methods. In this paper, we apply a Bayesian network to dissect the complex regulatory relationships among disease traits and various risk factors for the Genetic Analysis Workshop 17 simulated data. We use the Bayesian network as a tool for the risk prediction of disease outcome.

Keywords: Bayesian Network; Disease Phenotype; Area Under Curve; Conditional Independence; Risk Prediction Model

Hyperbolic Procrustes analysis using Riemannian geometry

Proceedings of the 35th International Conference on Neural Information Processing Systems

Authors: Ya-Wei Eileen Lin; Yuval Kluger; Ronen Talmon. Affiliations: Viterbi Faculty of Electrical and Computer Engineering, Technion; Program in Applied Mathematics, Yale University, and Interdepartmental Program in Computational Biology and Bioinformatics, Yale University, and Department of Pathology, Yale University

ISBN: 9781713845393 (print)

Label-free alignment between datasets collected at different times, locations, or by different instruments is a fundamental scientific task. Hyperbolic spaces have recently provided a fruitful foundation for the development of informative representations of hierarchical data. Here, we take a purely geometric approach for label-free alignment of hierarchical datasets and introduce hyperbolic Procrustes analysis (HPA). HPA consists of new implementations of the three prototypical Procrustes analysis components: translation, scaling, and rotation, based on the Riemannian geometry of the Lorentz model of hyperbolic space. We analyze the proposed components, highlighting their useful properties for alignment. The efficacy of HPA, its theoretical properties, stability and computational efficiency are demonstrated in simulations. In addition, we showcase its performance on three batch correction tasks involving gene expression and mass cytometry data. Specifically, we demonstrate high-quality unsupervised batch effect removal from data acquired at different sites and with different technologies that outperforms recent methods for label-free alignment in hyperbolic spaces.
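
For readers unfamiliar with the Lorentz model of hyperbolic space mentioned in this abstract, the short Python sketch below shows its two basic ingredients: the Lorentzian inner product and the geodesic distance it induces on the hyperboloid. It does not implement HPA's translation, scaling, or rotation steps; the helper names `lorentz_inner`, `lift_to_hyperboloid`, and `lorentz_distance` are hypothetical and chosen only for this illustration.

```python
import numpy as np

def lorentz_inner(x, y):
    """Lorentzian inner product <x, y>_L = -x0*y0 + x1*y1 + ... + xn*yn."""
    return -x[..., 0] * y[..., 0] + np.sum(x[..., 1:] * y[..., 1:], axis=-1)

def lift_to_hyperboloid(v):
    """Map v in R^n to (sqrt(1 + ||v||^2), v), a point with <x, x>_L = -1."""
    x0 = np.sqrt(1.0 + np.sum(v ** 2, axis=-1, keepdims=True))
    return np.concatenate([x0, v], axis=-1)

def lorentz_distance(x, y):
    """Geodesic distance arccosh(-<x, y>_L) between hyperboloid points."""
    # Clip at 1 so round-off cannot push the argument outside arccosh's domain.
    return np.arccosh(np.maximum(-lorentz_inner(x, y), 1.0))

# Usage: the origin of the hyperboloid and a point at hyperbolic distance 1.
origin = lift_to_hyperboloid(np.array([0.0, 0.0]))
point = lift_to_hyperboloid(np.array([np.sinh(1.0), 0.0]))
print(lorentz_distance(origin, point))
```

The second point is (cosh 1, sinh 1, 0), so its distance from the origin is arccosh(cosh 1) = 1 up to floating-point error.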
