Traditional generic search engines using textual keyword matching do not support domain specific searches. However, different domains may have different domain specific searches. For example, Chemical research is mole...
详细信息
Traditional generic search engines using textual keyword matching do not support domain specific searches. However, different domains may have different domain specific searches. For example, Chemical research is molecule centric rather than document centric. Usually a chemical molecule can be represented in multiple ways, e.g., textual chemical entities such as chemical names and formulae, and chemical structures such as 2D graphs and 3D graphs. Thus, in chemoinformatics, chemical entity searches and chemical structure searches are more important than simple document searches using keyword matching. In this work, we show how to build a domain specific search engine that enables both entity and 2D graph searches for chemical molecules. First of all, documents are collected from the Web, and then preprocessed using document classification and segmentation. We apply Support Vector Machines for classification and propose a novel method of text segmentation. Then chemical entities in the documents are tagged and indexed to provide fast searches. Simultaneously, chemical structure information are collected, processed, and indexed for fast graph *** issues exist to support textual chemical entity searches. Chemical names and formulae usually appear in chemical documents when corresponding molecules are mentioned, but a chemical molecule can have different textual representative ways. A simple keyword search would retrieve only the exact match and not the others. Additionally, ambiguous non-chemical terms such as "He" are retrieved. We show how chemical entity searches can improve the relevance of returned documents by avoiding those ambiguous terms. Our search engine first extracts chemical entities from text, performs novel indexing suitable for chemical names and formulae, and supports different query models that a scientist may require. We propose a model of hierarchical conditional random fields for entity tagging that considers long-term dependencies at the
Current pandemics propelled research efforts in unprecedented fashion, primarily triggering computational efforts towards new vaccine and drug development as well as drug repurposing. There is an urgent need to design...
详细信息
Current pandemics propelled research efforts in unprecedented fashion, primarily triggering computational efforts towards new vaccine and drug development as well as drug repurposing. There is an urgent need to design novel drugs with targeted biological activity and minimum adverse reactions that may be useful to manage viral outbreaks. Hence an attempt has been made to develop Machine Learning based predictive models that can be used to assess whether a compound has the potency to be antiviral or not. To this end, a set of 2358 antiviral compounds were compiled from the CAS COVID-19 antiviral SAR dataset whose activity was reported based on IC50 value. A total 1157 two-dimensional molecular descriptors were computed among which, the most highly correlated descriptors were selected using Tree-based, Correlation-based and Mutual information-based feature selection methods. Seven Machine Learning algorithms i. e., Random Forest, XGBoost, Support Vector Machine, KNN, Decision Tree, MLP Classifier and Logistic Regression were benchmarked. The best performance was achieved by the models developed using Random Forest and XGBoost algorithms in all the feature selection methods. The maximum predictive accuracy of both these models was 88 % with internal validation. Whereas, with an external dataset, a maximum accuracy of 93.10 % for XGBoost and 100 % for Random Forest based model was achievable. Furthermore, the study demonstrated scaffold analysis of the molecules as a pragmatic approach to explore the importance of structurally diverse compounds in data driven studies.
chemoinformatics is a well established research field concerned with the discovery of molecule's properties through informational techniques. Computer science's research fields mainly concerned by the chemoinf...
详细信息
ISBN:
(纸本)9783642208447
chemoinformatics is a well established research field concerned with the discovery of molecule's properties through informational techniques. Computer science's research fields mainly concerned by the chemoinformatics field are machine learning and graph theory. From this point of view, graph kernels provide a nice framework combining machine learning techniques with graph theory. Such kernels prove their efficiency on several chemoinformatics problems. This paper presents two new graph kernels applied to regression and classification problems within the chemoinformatics field. The first kernel is based on the notion of edit distance while the second is based on sub trees enumeration. Several experiments show the complementary of both approaches.
BackgroundOCaml is a functional programming language with strong static types, Hindley-Milner type inference and garbage collection. In this article, we share our experience in prototyping chemoinformatics and structu...
详细信息
BackgroundOCaml is a functional programming language with strong static types, Hindley-Milner type inference and garbage collection. In this article, we share our experience in prototyping chemoinformatics and structural bioinformatics software in ***, we introduce the language, list entry points for chemoinformaticians who would be interested in OCaml and give code examples. Then, we list some scientific open source software written in OCaml. We also present recent open source libraries useful in chemoinformatics. The parallelization of OCaml programs and their performance is also shown. Finally, tools and methods useful when prototyping scientific software in OCaml are *** our experience, OCaml is a programming language of choice for method development in chemoinformatics and structural bioinformatics.
chemoinformatics is an interface science aimed primarily at discovering novel chemical entities that will ultimately result in the development of novel treatments for unmet medical needs, although these same methods a...
详细信息
chemoinformatics is an interface science aimed primarily at discovering novel chemical entities that will ultimately result in the development of novel treatments for unmet medical needs, although these same methods are also applied in other fields that ultimately design new molecules. The field combines expertise from, among others, chemistry, biology, physics, biochemistry, statistics, mathematics, and computer science. In this general review of chemoinformatics the emphasis is placed on describing the general methods that are routinely applied in molecular discovery and in a context that provides for an easily accessible article for computer scientists as well as scientists from other numerate disciplines.
Virtual compound libraries are increasingly being used in computer-assisted drug discovery applications and have led to numerous successful cases. This paper aims to examine the fundamental concepts of library design ...
详细信息
Virtual compound libraries are increasingly being used in computer-assisted drug discovery applications and have led to numerous successful cases. This paper aims to examine the fundamental concepts of library design and describe how to enumerate virtual libraries using open source tools. To exemplify the enumeration of chemical libraries, we emphasize the use of pre-validated or reported reactions and accessible chemical reagents. This tutorial shows a step-by-step procedure for anyone interested in designing and building chemical libraries with or without chemoinformatics experience. The aim is to explore various methodologies proposed by synthetic organic chemists and explore affordable chemical space using open-access chemoinformatics tools. As part of the tutorial, we discuss three examples of design: a Diversity-Oriented-Synthesis library based on lactams, a bis-heterocyclic combinatorial library, and a set of target-oriented molecules: isoindolinone based compounds as potential acetylcholinesterase inhibitors. This manuscript also seeks to contribute to the critical task of teaching and learning chemoinformatics.
Charge-Related Topological Index - Various chemoinformatics Applications. by Kochev, Nikolay T; Bangov, Ivan; Petrov, Emil; Moskovkina, Marina; Stoyanov, Borislav; published by
Charge-Related Topological Index - Various chemoinformatics Applications. by Kochev, Nikolay T; Bangov, Ivan; Petrov, Emil; Moskovkina, Marina; Stoyanov, Borislav; published by
Background Phytochemicals or secondary metabolites are low molecular weight organic compounds with little function in plant growth and development. Nevertheless, the metabolite diversity govern not only the phenetics ...
详细信息
Background Phytochemicals or secondary metabolites are low molecular weight organic compounds with little function in plant growth and development. Nevertheless, the metabolite diversity govern not only the phenetics of an organism but may also inform the evolutionary pattern and adaptation of green plants to the changing environment. Plant chemoinformatics analyzes the chemical system of natural products using computational tools and robust mathematical algorithms. It has been a powerful approach for species-level differentiation and is widely employed for species classifications and reinforcement of previous classifications. Results This study attempts to classify Angiosperms using plant sulfur-containing compound (SCC) or sulphated compound information. The SCC dataset of 692 plant species were collected from the comprehensive species-metabolite relationship family (KNApSAck) database. The structural similarity score of metabolite pairs under all possible combinations (plant species-metabolite) were determined and metabolite pairs with a Tanimoto coefficient value > 0.85 were selected for clustering using machine learning algorithm. Metabolite clustering showed association between the similar structural metabolite clusters and metabolite content among the plant species. Phylogenetic tree construction of Angiosperms displayed three major clades, of which, clade 1 and clade 2 represented the eudicots only, and clade 3, a mixture of both eudicots and monocots. The SCC-based construction of Angiosperm phylogeny is a subset of the existing monocot-dicot classification. The majority of eudicots present in clade 1 and 2 were represented by glucosinolate compounds. These clades with SCC may have been a mixture of ancestral species whilst the combinatorial presence of monocot-dicot in clade 3 suggests sulphated-chemical structure diversification in the event of adaptation during evolutionary change. Conclusions Sulphated chemoinformatics informs classification of Angiospe
This chapter provides a brief overview of chemoinformatics and its applications to chemical library design. It is meant to be a quick starter and to serve as an invitation to readers for more in-depth exploration of t...
详细信息
This chapter provides a brief overview of chemoinformatics and its applications to chemical library design. It is meant to be a quick starter and to serve as an invitation to readers for more in-depth exploration of the field. The topics covered in this chapter are chemical representation, chemical data and data mining, molecular descriptors, chemical space and dimension reduction, quantitative structure–activity relationship, similarity, diversity, and multiobjective optimization. less
暂无评论