ISBN: (print) 9781665453967
In this digital era, we are exposed to a large amount of data. This includes biological data, which store information about living organisms, including deoxyribonucleic acid (DNA), genes, and proteins. With the development of information technology and information systems, most available biological data are stored in online public databases. Many of these databases are free to access and easy to use, which helps users, especially researchers, make use of the data. Among the known public biological databases are the University of California Santa Cruz (UCSC) Genome Browser Database and the Rat Genome Database (RGD). These two databases provide access to biological data from different organisms. This paper aims to describe the technology of public biological databases. It also elucidates the differences in features between the UCSC Genome Browser Database and the RGD. Our results showed that the UCSC database contains much more biological data and many more features than the RGD. However, the UCSC genome browser has a complex display, while the RGD has a simple one. Overall, users can choose whichever of the two databases suits them best.
Motivation: Bulk RNA-Seq is a widely used method for studying gene expression across a variety of contexts. The significance of RNA-Seq studies has grown with the advent of high-throughput sequencing technologies. Com...
It is challenging to guide neural network (NN) learning with prior knowledge. In contrast, many known properties, such as spatial smoothness or seasonality, are straightforward to model by choosing an appropriate kern...
Bayesian multinomial logistic-normal (MLN) models are popular for the analysis of sequence count data (e.g., microbiome or gene expression data) due to their ability to model multivariate count data with complex covariance structure. However, existing implementations of MLN models are limited to small datasets due to the non-conjugacy of the multinomial and logistic-normal distributions. Motivated by the need to develop efficient inference for Bayesian MLN models, we develop two key ideas. First, we develop the class of Marginally Latent Matrix-T Process (Marginally LTP) models. We demonstrate that many popular MLN models, including those with latent linear, non-linear, and dynamic linear structure, are special cases of this class. Second, we develop an efficient inference scheme for Marginally LTP models with specific accelerations for the MLN subclass. Through application to MLN models, we demonstrate that our inference scheme is both highly accurate and often 4-5 orders of magnitude faster than MCMC.
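To make the model class concrete, the following is a minimal sketch of the generative side of a multinomial logistic-normal model: a latent Gaussian vector is mapped to the simplex and used as multinomial probabilities. The function name `sample_mln`, its parameters, and the choice of the inverse additive log-ratio transform (last category as reference) are assumptions made for illustration; the abstract does not specify them, and the sketch does not implement the Marginally LTP inference scheme.

```python
import numpy as np

def sample_mln(mean, cov, depth, rng):
    """Draw one count vector from a multinomial logistic-normal model.

    mean, cov : parameters of the latent Gaussian in R^(D-1) (hypothetical inputs)
    depth     : total number of counts (e.g. sequencing depth)
    """
    # Latent Gaussian vector with D-1 coordinates.
    eta = rng.multivariate_normal(mean, cov)
    # Inverse additive log-ratio transform: append a zero for the reference
    # category, then apply a numerically stable softmax.
    logits = np.append(eta, 0.0)
    logits = logits - logits.max()
    pi = np.exp(logits) / np.exp(logits).sum()
    # Observed counts are multinomial given the simplex-valued probabilities.
    counts = rng.multinomial(depth, pi)
    return counts, pi

rng = np.random.default_rng(0)
counts, pi = sample_mln(np.zeros(3), np.eye(3), depth=1000, rng=rng)
```

The Gaussian prior on `eta` is not conjugate to the multinomial likelihood, which is the source of the computational difficulty the abstract refers to.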
The rise of antibiotic resistance (AR) poses substantial threats to human and animal health, food security, and economic stability. Wastewater-based surveillance (WBS) has emerged as a powerful strategy for population...
Flow matching in the continuous simplex has emerged as a promising strategy for DNA sequence design, but struggles to scale to higher simplex dimensions required for peptide and protein generation. We introduce Gumbel...
With increasing interest in the digitization of ancient manuscripts, ancient character recognition has become one of the most important areas in automated document image analysis. In this regard, we propose a Convolutional Neural Network (CNN)-based classifier to recognize ancient Sundanese characters obtained from a digital collection of Southeast Asian palm leaf manuscripts. In this work, we utilize two different preprocessing techniques for the dataset. The first technique uses geometric transformations, background noise addition, and brightness adjustment to augment the imbalanced samples fed into the classifier. The second technique uses Otsu’s thresholding method to binarize the characters and applies only the usual geometric transformations for data augmentation. The proposed network is trained on the training set and tested on the testing set under each data augmentation process. The classifier trained with the second technique (image binarization) outperforms the one trained with the first, achieving a testing accuracy of 97.74%.
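The binarization step in the second technique can be illustrated with a minimal NumPy sketch of Otsu's method, which picks the gray level that maximizes the between-class variance of the histogram. The function names `otsu_threshold` and `binarize` are hypothetical, the input is assumed to be an 8-bit grayscale array, and this is not the authors' preprocessing code.

```python
import numpy as np

def otsu_threshold(gray):
    """Return Otsu's threshold for an 8-bit grayscale image (values 0..255)."""
    hist = np.bincount(gray.ravel(), minlength=256).astype(float)
    prob = hist / hist.sum()
    levels = np.arange(256)
    w0 = np.cumsum(prob)                # weight of the class with pixels <= t
    mu_cum = np.cumsum(prob * levels)   # cumulative first moment up to t
    mu_total = mu_cum[-1]
    denom = w0 * (1.0 - w0)
    # Between-class variance for every candidate threshold t.
    with np.errstate(divide="ignore", invalid="ignore"):
        sigma_b = (mu_total * w0 - mu_cum) ** 2 / denom
    sigma_b[denom == 0] = 0.0           # thresholds that leave one class empty
    return int(np.argmax(sigma_b))

def binarize(gray):
    """Map pixels above the Otsu threshold to 1 and the rest to 0."""
    return (gray > otsu_threshold(gray)).astype(np.uint8)

# Synthetic example: dark strokes (50) on a bright background (200).
image = np.full((8, 8), 200, dtype=np.uint8)
image[2:6, 3:5] = 50
mask = binarize(image)
```

Whether foreground should map to 0 or 1 depends on the manuscript images; the polarity chosen here is arbitrary.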
Adopting deep learning models for bird sound classification has become common practice for building robust automated bird sound detection systems. In this paper, we employ a four-layer Convolutional Neural Network (CNN) to classify different species of Indonesian scops owls based on their vocal sounds. Two widely used representations of an acoustic signal, the log-scaled mel-spectrogram and Mel Frequency Cepstral Coefficients (MFCC), are extracted from each sound file and fed into the network separately to compare model performance across inputs. A more complex CNN that can process the two extracted acoustic representations simultaneously is proposed to provide a direct comparison with the baseline model. The dual-input network is the best-performing model in our experiment, achieving a Mean Average Precision (MAP) of 97.55%. Meanwhile, the baseline model achieves a MAP of 94.36% for the mel-spectrogram input and 96.08% for the MFCC input.
The number of findings in cancer genomics research has grown rapidly in the last decade due to the decline in the cost of human sequencing and genotyping. However, the majority of the reported significant markers associated with cancer traits are based on European and East Asian populations. Large populations such as the South Asian and South-East Asian populations are under-represented in genomics research. In this study, we explored the possibility of computing a Polygenic Risk Score (PRS) for colorectal cancer on our test sample based on reported significant Single Nucleotide Polymorphisms (SNPs). The SNPs used to compute the risk score were collected from GWAS Central and the GWAS Catalog. Significant SNPs from the IC3 study were used as a benchmark. The results show that calculating a colorectal cancer risk score using reported significant markers from different population groups is possible. The p-value of our statistical model shows a significant difference between the risk scores of the case and control groups.
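The abstract does not state the scoring formula, so the following sketch assumes the common weighted-sum definition of a polygenic risk score: each individual's score is the sum over SNPs of the risk-allele dosage (0, 1, or 2) times a reported effect size. The function name, the dosage matrix, and the effect sizes are hypothetical illustration values, not data from the study.

```python
import numpy as np

def polygenic_risk_score(dosages, effect_sizes):
    """Weighted-sum PRS (assumed definition, not taken from the abstract).

    dosages      : (n_individuals, n_snps) risk-allele counts in {0, 1, 2}
    effect_sizes : (n_snps,) per-SNP weights, e.g. log odds ratios from a GWAS
    """
    dosages = np.asarray(dosages, dtype=float)
    effect_sizes = np.asarray(effect_sizes, dtype=float)
    if dosages.ndim != 2 or dosages.shape[1] != effect_sizes.shape[0]:
        raise ValueError("dosages must be (n_individuals, n_snps) matching effect_sizes")
    return dosages @ effect_sizes

# Two hypothetical individuals genotyped at three hypothetical SNPs.
dosages = [[0, 1, 2],
           [2, 0, 1]]
effect_sizes = [0.1, 0.2, 0.3]
scores = polygenic_risk_score(dosages, effect_sizes)
```

Scores computed this way for cases and controls can then be compared with whatever statistical model the analysis calls for; the model used in the study is not specified in the abstract.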
We introduce Cell2Sentence (C2S), a novel method to directly adapt large language models to a biological context, specifically single-cell transcriptomics. By transforming gene expression data into "cell sentences," C2S bridges the gap between natural language processing and biology. We demonstrate that cell sentences enable the fine-tuning of language models for diverse tasks in biology, including cell generation, complex cell-type annotation, and direct data-driven text generation. Our experiments reveal that GPT-2, when fine-tuned with C2S, can generate biologically valid cells based on cell type inputs and accurately predict cell types from cell sentences. This illustrates that language models, through C2S fine-tuning, can acquire a significant understanding of single-cell biology while maintaining robust text generation capabilities. C2S offers a flexible, accessible framework to integrate natural language processing with transcriptomics, utilizing existing models and libraries for a wide range of biological applications. Copyright 2024 by the author(s)
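As a rough illustration of what a "cell sentence" could look like, the sketch below assumes that a cell's expression vector is turned into text by listing gene names in order of decreasing expression and dropping unexpressed genes. The abstract does not spell out the transformation, so this ordering rule, the function name `cell_to_sentence`, the `top_k` parameter, and the example values are all assumptions.

```python
def cell_to_sentence(expression, gene_names, top_k=None):
    """Turn one cell's expression vector into a space-separated gene-name string.

    Assumed rule: keep genes with expression > 0, order them from highest to
    lowest expression, and optionally keep only the first top_k names.
    """
    if len(expression) != len(gene_names):
        raise ValueError("expression and gene_names must have the same length")
    expressed = [(name, value) for name, value in zip(gene_names, expression) if value > 0]
    # sorted() is stable, so genes with equal expression keep their input order.
    ranked = sorted(expressed, key=lambda pair: -pair[1])
    names = [name for name, _ in ranked]
    if top_k is not None:
        names = names[:top_k]
    return " ".join(names)

# Hypothetical cell with four genes; values are illustrative counts.
genes = ["CD3E", "MS4A1", "GAPDH", "ACTB"]
counts = [5, 0, 12, 3]
sentence = cell_to_sentence(counts, genes)
```

A string of this form can be tokenized by an ordinary language model, which is what makes fine-tuning on such data possible without a custom architecture.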