ISBN: 9781479912940 (print)
Data set curation in cheminformatics is largely ignored, and many publications do not provide the specific chemical structures used in their experiments. Access to chemical structures is vital for experiment reproducibility and comparison of competing methods. To address this limitation, the KU Chemical Biology Database (KUChemBio) has established a collection of 69 data sets for computational chemical biology experiments. The data sets fall into several categories, including ADME, toxicity, binding affinity, solubility, and melting points. Chemical structures are provided in SDF or SMILES format along with binary or real-valued activity labels. The data sets have been consolidated from other online repositories, and content from recent publications has been added as well. KUChemBio is located at http://***/kuchembio.
Despite intense growth in investment and technology development, a bottleneck in drug discovery and development has been observed over the past decade. NIH started the Molecular Libraries Initiative (MLI) in 2004 to enlarge the pool of potential drug targets, especially from the “undruggable” part of the human genome, and of potential drug candidates drawn from a much broader range of drug-like small molecules. In this paper we used the concepts of network biology to integrate MLI data with other biological databases such as DrugBank and UniHI, and evaluated the potential of MLI target proteins as new drug targets. Our analysis provided some measures of the value of the MLI data as a resource for both basic chemical biology research and future therapeutic discovery.
ISBN: 9781605584225 (print)
Structured data, including sets, sequences, trees, and graphs, pose significant challenges to fundamental aspects of data management such as efficient storage, indexing, and similarity search. With the fast accumulation of graph databases, similarity search in graph databases has emerged as an important research topic. Graph similarity search has applications in a wide range of domains, including cheminformatics, bioinformatics, sensor network management, social network management, and XML documents. Most current graph indexing methods focus on subgraph query processing, i.e. determining the set of database graphs that contain the query graph, and hence do not directly support similarity search. In data mining and machine learning, various graph kernel functions have been designed to capture the intrinsic similarity of graphs. Though successful in constructing accurate predictive and classification models for supervised learning, graph kernel functions (i) have high computational complexity and (ii) are difficult to index in a graph database. Our objective is to bridge graph kernel functions and similarity search in graph databases by proposing (i) a novel kernel-based similarity measurement and (ii) an efficient indexing structure for graph data management. Our method of similarity measurement builds upon local features extracted from each node and its neighboring nodes in a graph. A hash table is used to support efficient storage and fast search of the extracted local features. Using the hash table, a graph kernel function is defined to capture the intrinsic similarity of graphs and to support fast similarity query processing. We have implemented our method, which we have named G-hash, and have demonstrated its utility on large chemical graph databases. Our results show that the G-hash method achieves state-of-the-art performance for k-nearest neighbor (k-NN) classification. Most importantly, the new similarity measurement and the inde
Classifying objects that are sampled jointly from two or more domains has many applications. The tensor product feature space is useful for modeling interactions between feature sets in different domains, but feature selection in the tensor product feature space is challenging. Conventional feature selection methods ignore the structure of the feature space and may not provide optimal results. In this paper we propose methods for selecting features in the original feature spaces of the different domains. We obtain sparsity through two approaches, one using integer quadratic programming and the other using L1-norm regularization. Experimental studies on biological data sets validate our approach.
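To make the tensor product feature space concrete, the following is a minimal Python sketch, not the method of the paper. It assumes each object is described by one feature vector per domain and that the product space is the flattened outer product of the two vectors. The function names, the vectors `x` and `y`, and the index lists `keep_x` and `keep_y` are hypothetical. The sketch only illustrates why selecting features in the original spaces keeps the structure of the product space: keeping m features in one domain and n in the other keeps exactly m × n product features. How the kept indices are chosen (integer quadratic programming or L1-norm regularization) is not shown.

```python
from itertools import product


def tensor_product_features(x, y):
    """Return the flattened outer product of two feature vectors."""
    return [xi * yj for xi, yj in product(x, y)]


def select_then_product(x, y, keep_x, keep_y):
    """Select features in each original space, then form the product space."""
    xs = [x[i] for i in keep_x]
    ys = [y[j] for j in keep_y]
    return tensor_product_features(xs, ys)


# Hypothetical feature vectors for one object seen from two domains.
x = [1.0, 2.0, 3.0]
y = [10.0, 20.0]

full = tensor_product_features(x, y)  # 3 * 2 = 6 interaction features
reduced = select_then_product(x, y, keep_x=[0, 2], keep_y=[1])  # 2 * 1 = 2
```

Selecting in the original spaces removes whole rows and columns of the interaction matrix at once, which is the structure that a selection method working directly on the flattened product features would ignore.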
Similarity search in chemical structure databases is an important problem with many applications in chemical genomics, drug design, and efficient chemical probe screening, among others. It is widely believed that structure-based methods provide an efficient way to perform such queries. Recently, various graph kernel functions have been designed to capture the intrinsic similarity of graphs. Though successful in constructing accurate predictive and classification models, graph kernel functions cannot be applied to large chemical compound databases because of their high computational complexity and the difficulty of indexing them for similarity search. To bridge graph kernel functions and similarity search in chemical databases, we applied a novel kernel-based similarity measurement, developed in our team, to measure the similarity of graph-represented chemicals. In our method, we use a hash table to support the definition of a new graph kernel function, efficient storage, and fast search. We have applied our method, named G-hash, to large chemical databases. Our results show that the G-hash method achieves state-of-the-art performance for k-nearest neighbor (k-NN) classification. Moreover, the similarity measurement and the index structure are scalable to large chemical databases, with a smaller index size and faster query processing time than state-of-the-art indexing methods such as Daylight fingerprints, C-tree, and GraphGrep.
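The idea of pairing a hash table with a kernel over local node features can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not the G-hash implementation. It assumes a graph is given as a list of node labels plus an edge list, that a node's local feature is its own label together with the sorted labels of its neighbors, and that the kernel counts exact matches between such features. The function names and the toy molecules are hypothetical.

```python
from collections import Counter, defaultdict


def local_features(labels, edges):
    """Count hashable node keys: (node label, sorted neighbor labels)."""
    neighbors = defaultdict(list)
    for u, v in edges:
        neighbors[u].append(labels[v])
        neighbors[v].append(labels[u])
    return Counter(
        (labels[n], tuple(sorted(neighbors[n]))) for n in range(len(labels))
    )


def build_index(graphs):
    """Hash table: feature key -> {graph id: number of nodes with that key}."""
    index = defaultdict(dict)
    for gid, (labels, edges) in graphs.items():
        for key, count in local_features(labels, edges).items():
            index[key][gid] = count
    return index


def kernel_scores(index, query):
    """Score indexed graphs by summing count products over shared keys."""
    labels, edges = query
    scores = Counter()
    for key, q_count in local_features(labels, edges).items():
        # Only graphs that share this key are visited, never the whole database.
        for gid, count in index.get(key, {}).items():
            scores[gid] += q_count * count
    return scores


# Toy heavy-atom graphs (hypothetical data): node labels and undirected edges.
database = {
    "ethanol": (["C", "C", "O"], [(0, 1), (1, 2)]),
    "propane": (["C", "C", "C"], [(0, 1), (1, 2)]),
}
index = build_index(database)
scores = kernel_scores(index, database["ethanol"])
nearest = scores.most_common(1)[0][0]
```

Because the query touches only the hash buckets of its own features, the cost of a similarity query depends on the number of matching entries and not on the size of the database, which is the property that makes this kind of kernel indexable. A k-NN classifier would take the top k entries of `scores` and vote on their labels.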
ISBN: 9781848161085 (print)
In this paper we propose new methods of chemical structure classification based on the integration of graph database mining from data mining and graph kernel functions from machine learning. In our method, we first identify a set of general graph patterns in chemical structure data. These patterns are then used to augment a graph kernel function that calculates the pairwise similarity between molecules. The resulting similarity matrix is used as input to classify chemical compounds via a kernel machine such as the support vector machine (SVM). Our results indicate that a pattern-based approach to graph similarity yields performance comparable to, and sometimes exceeding, that of existing state-of-the-art approaches. In addition, the identification of highly discriminative patterns for activity classification provides evidence that our methods can make generalizations about a compound's function given its chemical structure. While we evaluated our methods on molecular structures, they are designed to operate on general graph data and hence could easily be applied to other domains in bioinformatics.
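The step from mined patterns to a pairwise similarity matrix can be illustrated with a short Python sketch. It is a simplified stand-in, not the kernel of the paper: it assumes each molecule has already been reduced to the set of pattern identifiers it contains, and it uses only the pattern part of the similarity (a cosine-normalized count of shared patterns) with no underlying graph kernel to augment. The pattern names and molecules are hypothetical.

```python
import math


def pattern_kernel(a, b):
    """Cosine-normalized count of patterns shared by two pattern sets."""
    if not a or not b:
        return 0.0
    return len(a & b) / math.sqrt(len(a) * len(b))


def gram_matrix(pattern_sets):
    """Pairwise similarity matrix over all molecules."""
    return [[pattern_kernel(a, b) for b in pattern_sets] for a in pattern_sets]


# Hypothetical molecules, each described by the patterns found in it.
molecules = [
    {"ring6", "C-O", "C=O"},
    {"ring6", "C-O"},
    {"C-N"},
]
K = gram_matrix(molecules)
```

The matrix `K` is symmetric with ones on the diagonal, and it is the kind of precomputed similarity matrix that a kernel machine such as an SVM can take as input in place of explicit feature vectors.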
Graph data mining is an active research area. Graphs are general modeling tools to organize information from heterogeneous sources and have been applied in many scientific, engineering, and business fields. With the f...