Ancestry estimation, which provides family history information, is one of the most popular services in direct-to-consumer genomic testing. It is also an important task for reducing confounding by ancestry in the relationship between genotypes and disease risk in association studies. Several methods have been developed to generate accurate ancestry estimates, but some of them remain computationally inefficient. In this paper, a method combining KMeans clustering and PCA is proposed to estimate ancestry from SNP genotyping data. The method was compared with a baseline model, fastSTRUCTURE, in terms of clustering quality and computation time. Public data from the 1000 Genomes Project were used to train and evaluate both the proposed model and the baseline model. The proposed model generates clusters with better accuracy than fastSTRUCTURE (91.02% versus 90.39%). More importantly, it is about 100 times faster than fastSTRUCTURE (4.86 seconds versus 490 seconds).
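The abstract does not give implementation details, so the following is only a minimal sketch of the general idea of combining PCA with k-means on a samples-by-SNPs genotype matrix. The function names `pca_project` and `kmeans`, the farthest-point initialisation, and the synthetic genotypes are all assumptions made for illustration; they are not the authors' code or data.

```python
import numpy as np

def pca_project(genotypes, n_components):
    """Project a samples-by-SNPs matrix onto its top principal components."""
    centered = genotypes - genotypes.mean(axis=0)
    # Rows of vt are the principal axes of the centered data.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:n_components].T

def kmeans(points, k, n_iter=100):
    """Lloyd's algorithm with a simple farthest-point initialisation (an assumption)."""
    centers = [points[0]]
    for _ in range(1, k):
        # Distance of every point to its nearest already-chosen center.
        nearest = np.min([np.linalg.norm(points - c, axis=1) for c in centers], axis=0)
        centers.append(points[nearest.argmax()])
    centers = np.array(centers)
    for _ in range(n_iter):
        dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        new_centers = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels

# Synthetic genotypes (0, 1 or 2 copies of an allele) for two made-up populations
# with very different allele frequencies; purely illustrative.
rng = np.random.default_rng(1)
genotypes = np.vstack([
    rng.binomial(2, 0.1, size=(30, 200)),
    rng.binomial(2, 0.9, size=(30, 200)),
]).astype(float)
labels = kmeans(pca_project(genotypes, n_components=2), k=2)
```

Running PCA first reduces each sample to a few coordinates, so the clustering step works on a much smaller matrix than the raw SNP data, which is one plausible source of the speed-up the abstract reports.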
Jakarta lifted its lockdown after more than 50 days of large-scale social activity restrictions and began a phased opening toward the new normal. To analyse Jakarta's air quality after the lockdown, a data engineering pipeline is needed. Time series data acquired from *** are stored in a time-series database system developed in the Python programming language with its fundamental libraries, namely Pandas, NumPy, and SQLite. After the PM2.5 data are pre-processed into hourly averages and grouped by period (pre-lockdown, lockdown, and phase opening), the pattern of PM2.5 in South Jakarta is revealed using the data visualization library Matplotlib. The PM2.5 peak occurs earlier during lockdown (04:00) and phase opening (02:00) than during the normal, pre-lockdown period (08:00), although the PM2.5 minimum still occurs at the same time (16:00 – 17:00).
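The pre-processing step described above (hourly averages grouped by period) could be sketched as follows. This is an illustration only: it uses Python's built-in `sqlite3` module directly rather than Pandas, and the table name `readings`, its columns `ts` and `pm25`, the function names, and the period boundary dates passed in are all hypothetical, since the abstract does not specify them.

```python
import sqlite3

def hourly_profile(conn, lockdown_start, reopening_start):
    """Mean PM2.5 per hour of day for each period: {period: {hour: mean}}.

    Timestamps are assumed to be ISO text ('YYYY-MM-DD HH:MM:SS'), so plain
    string comparison against 'YYYY-MM-DD' boundaries orders them correctly.
    """
    query = """
        SELECT CASE
                 WHEN ts < :lockdown_start THEN 'pre-lockdown'
                 WHEN ts < :reopening_start THEN 'lockdown'
                 ELSE 'phase opening'
               END AS period,
               CAST(strftime('%H', ts) AS INTEGER) AS hour,
               AVG(pm25) AS mean_pm25
        FROM readings
        GROUP BY period, hour
    """
    params = {"lockdown_start": lockdown_start, "reopening_start": reopening_start}
    profile = {}
    for period, hour, mean_pm25 in conn.execute(query, params):
        profile.setdefault(period, {})[hour] = mean_pm25
    return profile

def peak_hour(hourly_means):
    """Hour of day with the highest mean PM2.5."""
    return max(hourly_means, key=hourly_means.get)

# Made-up readings and made-up boundary dates, for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE readings (ts TEXT, pm25 REAL)")
conn.executemany("INSERT INTO readings VALUES (?, ?)", [
    ("2020-01-01 08:00:00", 60.0),
    ("2020-01-01 08:30:00", 80.0),
    ("2020-01-01 16:00:00", 20.0),
    ("2020-01-05 04:00:00", 50.0),
    ("2020-01-05 16:00:00", 10.0),
])
profile = hourly_profile(conn, "2020-01-03", "2020-01-10")
```

Comparing `peak_hour(profile[p])` across periods gives the kind of shift in peak time that the abstract reports.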
In general, performing a nonlinear time series analysis when modeling data can yield robust results and increase their quality. Wavelet methods have successfully been applied in a great variety of appli...
ISBN (digital): 9783030170837
ISBN (print): 9783030170820
Protein design algorithms that model continuous sidechain flexibility and conformational ensembles better approximate the in vitro and in vivo behavior of proteins. The previous state of the art, iMinDEE-A∗-K∗, computes provable ε-approximations to partition functions of protein states (e.g., bound vs. unbound) by computing provable, admissible pairwise-minimized energy lower bounds on protein conformations and using the A∗ enumeration algorithm to return a gap-free list of lowest-energy conformations. iMinDEE-A∗-K∗ runs in time sublinear in the number of conformations, but can be trapped in loosely-bounded, low-energy conformational wells containing many conformations with highly similar energies. That is, iMinDEE-A∗-K∗ is unable to exploit the correlation between protein conformation and energy: similar conformations often have similar energy. We introduce two new concepts that exploit this correlation: Minimization-Aware Enumeration and Recursive K∗. We combine these two insights into a novel algorithm, Minimization-Aware Recursive K∗ (MARK∗), that tightens bounds not on single conformations, but instead on distinct regions of the conformation space. We compare the performance of iMinDEE-A∗-K∗ vs. MARK∗ by running the BBK∗ algorithm, which provably returns sequences in order of decreasing K∗ score, using either iMinDEE-A∗-K∗ or MARK∗ to approximate partition functions. We show on 200 design problems that MARK∗ not only enumerates and minimizes vastly fewer conformations than the previous state of the art, but also runs up to two orders of magnitude faster. Finally, we show that MARK∗ not only efficiently approximates the partition function, but also provably approximates the energy landscape. To our knowledge, MARK∗ is the first algorithm to do so. We use MARK∗ to analyze the change in energy landscape of the bound and unbound states of the HIV-1 capsid protein C-terminal domain in complex with camelid VHH, and measure the change in conformati
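To make the idea of an ε-approximation to a partition function concrete, here is a simplified sketch of the stopping rule for enumeration in order of energy lower bounds. It is not MARK∗ and not the authors' implementation: the function name, the input format, the value of `RT`, and the toy energies are assumptions for illustration. It relies only on the properties stated in the abstract, namely that the lower bounds are admissible and that conformations arrive as a gap-free list in order of increasing lower bound.

```python
import math

RT = 0.592  # kcal/mol, roughly room temperature; an assumed value for illustration

def approx_partition_function(conformations, total_count, epsilon):
    """Accumulate Boltzmann weights until q_star >= (1 - epsilon) * q is guaranteed.

    conformations: (lower_bound, minimized_energy) pairs sorted by ascending
                   lower_bound, with minimized_energy >= lower_bound.
    total_count:   number of conformations in the whole space.
    Returns (q_star, number_of_conformations_minimized).
    """
    q_star = 0.0
    n_done = 0
    for lower_bound, minimized_energy in conformations:
        q_star += math.exp(-minimized_energy / RT)
        n_done += 1
        # Every conformation not yet seen has energy >= its own lower bound
        # >= the current lower bound, so its weight is at most exp(-lower_bound / RT).
        rest_upper = (total_count - n_done) * math.exp(-lower_bound / RT)
        if q_star >= (1.0 - epsilon) * (q_star + rest_upper):
            break
    return q_star, n_done

# Toy energies in kcal/mol, invented for illustration.
toy = [(-10.0, -9.5), (-9.0, -8.8), (-3.0, -2.5), (-2.9, -2.0)]
q_star, n_minimized = approx_partition_function(toy, total_count=len(toy), epsilon=0.1)
```

In this toy case the loop stops before minimizing every conformation, because the remaining ones cannot contribute more than the allowed ε fraction. The abstract's point is that this per-conformation scheme stalls when many conformations have nearly equal lower bounds, which is the situation that bounding whole regions of conformation space is meant to address.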
Phosphorylation is one of the most common and important types of post-translational modification (PTM). Protein phosphorylation regulates the activation of various enzymes and receptors, including those in signaling pathways. Many significant studies have sought to predict phosphorylation sites using various machine learning methods. Recently, several researchers have claimed that deep learning based methods are the best methods for phosphorylation site prediction. However, the performance of these methods relied on the massive training data used in those studies. In this paper, we study the performance of a simple deep neural network on the limited data generally used before deep learning was adopted. The results show that a deep neural network can still achieve comparable performance in limited-data settings.
The transmission of severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) in Indonesia appears to be increasing uncontrollably, which urges the government to scale up its capacity for detecting the disease. Real-time polymerase chain reaction (RT-PCR), rapid tests, and computed tomography (CT) scans are the most common methods for determining whether a person has been infected, regardless of whether the common symptoms of Coronavirus Disease 2019 (COVID-19) appear. Among these three, RT-PCR is considered the gold standard for qualitative and quantitative detection of SARS-CoV-2. The present paper elaborates on the framework of Roche's RT-PCR machine as employed specifically for SARS-CoV-2 detection by Genetics Indonesia, which is deemed efficient and relatively quicker than other detection kits. The RT-PCR machine detected SARS-CoV-2 with an RNA amplification curve equal to 10 RNA copies, below the cut-off value of the crossing point (Cp) of the positive control. The paper also describes the implementation of EAV RNA and the LightCycler® 96 RT-PCR System, through which the analysis time, the amount of sample required per individual, and the reagents can be reduced.
Translational research requires data at multiple scales of biological organization. Advancements in sequencing and multi-omics technologies have increased the availability of these data, but researchers face significa...
Climate anomalies are considered an important factor closely related to many disasters that cause heavy human losses, such as airline crashes, wildfires, droughts, and flooding in many areas. Many researchers have projecte...
In clinical microbiology, matrix-assisted laser desorption ionization-time-of-flight mass spectrometry (MALDI-TOF MS) is frequently employed for rapid microbial identification. However, rapid identification of antimicrobial resistance (AMR) in Escherichia coli based on a large amount of MALDI-TOF MS data has not yet been reported. This may be because building a prediction model that covers all E. coli isolates is challenging given the high diversity of the E. coli population. This study aimed to develop a MALDI-TOF MS-based, data-driven, two-stage framework for characterizing different AMRs in E. coli. Specifically, amoxicillin (AMC), ceftazidime (CAZ), ciprofloxacin (CIP), ceftriaxone (CRO), and cefuroxime (CXM) were used. In the first stage, we split the data into two groups based on informative peaks selected according to random forest importance. In the second stage, prediction models were constructed using four different machine learning algorithms: logistic regression, support vector machine, random forest, and extreme gradient boosting (XGBoost). The findings demonstrate that XGBoost outperformed the other three machine learning models. The values of the area under the receiver operating characteristic curve were 0.62, 0.72, 0.87, 0.72, and 0.72 for AMC, CAZ, CIP, CRO, and CXM, respectively. This implies that a data-driven, two-stage framework could improve accuracy by approximately 2.8%. As a result, we developed AMR prediction models for E. coli using a data-driven two-stage framework, which is promising for assisting physicians in making decisions. Further, the analysis of informative peaks in future studies could potentially reveal new insights. Based on a large amount of matrix-assisted laser desorption ionization-time-of-flight mass spectrometry (MALDI-TOF MS) clinical data, comprising 37,918 Escherichia coli isolates, a data-driven two-stage framework was established to evaluate the antimicrobial resistance of E. coli. Five antibiotics, including a