Background: Mutagenicity is the capability of a substance to cause genetic mutations. This property is of high public concern because it has a close relationship with carcinogenicity and potentially with reproductive ...
详细信息
Background: Mutagenicity is the capability of a substance to cause genetic mutations. This property is of high public concern because it has a close relationship with carcinogenicity and potentially with reproductive toxicity. Experimentally, mutagenicity can be assessed by the Ames test on Salmonella with an estimated experimental reproducibility of 85%;this intrinsic limitation of the in vitro test, along with the need for faster and cheaper alternatives, opens the road to other types of assessment methods, such as in silico structure-activity prediction models. A widely used method checks for the presence of known structural alerts for mutagenicity. However the presence of such alerts alone is not a definitive method to prove the mutagenicity of a compound towards Salmonella, since other parts of the molecule can influence and potentially change the classification. Hence statistically based methods will be proposed, with the final objective to obtain a cascade of modeling steps with custom-made properties, such as the reduction of false negatives. Results: A cascade model has been developed and validated on a large public set of molecular structures and their associated Salmonella mutagenicity outcome. The first step consists in the derivation of a statistical model and mutagenicity prediction, followed by further checks for specific structural alerts in the "safe" subset of the prediction outcome space. In terms of accuracy (i.e., overall correct predictions of both negative and positives), the obtained model approached the 85% reproducibility of the experimental mutagenicity Ames test. Conclusions: The model and the documentation for regulatory purposes are freely available on the CAESAR website. The input is simply a file of molecular structures and the output is the classification result.
The major technical objectives of the RC-NSPES are to provide a framework for the concurrent operation of reactive and pro-active security functions to deliver efficient and optimised intrusion detection schemes as we...
详细信息
ISBN:
(纸本)9781424436880
The major technical objectives of the RC-NSPES are to provide a framework for the concurrent operation of reactive and pro-active security functions to deliver efficient and optimised intrusion detection schemes as well as enhanced and highly correlated rule sets for more effective alerts management and root-cause analysis. The design and implementation of the RC-NSPES solution includes a number of innovative features in terms of real-time programmable embedded hardware (FPGA) deployment as well as in the integrated management station. These have been devised so as to deliver enhanced detection of attacks and contextualised alerts against threats that can arise from both the network layer and the application layer protocols. The resulting architecture represents an efficient and effective framework for the future deployment of network security systems.
Background: Population-based investigations aimed at uncovering genotype-trait associations often involve high-dimensional genetic polymorphism data as well as information on multiple environmental and clinical parame...
详细信息
Background: Population-based investigations aimed at uncovering genotype-trait associations often involve high-dimensional genetic polymorphism data as well as information on multiple environmental and clinical parameters. machinelearning (ML) algorithms offer a straightforward analytic approach for selecting subsets of these inputs that are most predictive of a pre-defined trait. The performance of these algorithms, however, in the presence of covariates is not well characterized. Methods and Results: In this manuscript, we investigate two approaches: Random Forests (RFs) and Multivariate Adaptive Regression Splines (MARS). Through multiple simulation studies, the performance under several underlying models is evaluated. An application to a cohort of HIV-1 infected individuals receiving anti-retroviral therapies is also provided. Conclusion: Consistent with more traditional regression modeling theory, our findings highlight the importance of considering the nature of underlying gene-covariate-trait relationships before applying ML algorithms, particularly when there is potential confounding or effect mediation.
To more accurately extract the characteristics of application flows, this paper proposes a set of flow attributes to characterize the possible negotiation behaviors of each flow in application layer perspective. The d...
详细信息
ISBN:
(纸本)9781424420742
To more accurately extract the characteristics of application flows, this paper proposes a set of flow attributes to characterize the possible negotiation behaviors of each flow in application layer perspective. The discriminators are available in the early stage, so they are suitable to support real-time based traffic classification and engineering. The ability of flow attributes was tested with several machine learning algorithms. On the other hand, we also compare the accuracy of our method with other related works that addressed real-time traffic classification problem based on the same sample traffic. The result shows that our method outperforms other previous works in protocol level identification with more than 8%similar to 21% accuracy improvement based on fixed-ratio sample flow sets. Furthermore, the proposed method is also suitable to identify encrypted protocols.
Background: Targeting older clients for rehabilitation is a clinical challenge and a research priority. We investigate the potential of machine learning algorithms - Support Vector machine (SVM) and K-Nearest Neighbor...
详细信息
Background: Targeting older clients for rehabilitation is a clinical challenge and a research priority. We investigate the potential of machine learning algorithms - Support Vector machine (SVM) and K-Nearest Neighbors (KNN) - to guide rehabilitation planning for home care clients. Methods: This study is a secondary analysis of data on 24,724 longer-term clients from eight home care programs in Ontario. Data were collected with the RAI-HC assessment system, in which the Activities of Daily Living Clinical Assessment Protocol (ADLCAP) is used to identify clients with rehabilitation potential. For study purposes, a client is defined as having rehabilitation potential if there was: i) improvement in ADL functioning, or ii) discharge home. SVM and KNN results are compared with those obtained using the ADLCAP. For comparison, the machine learning algorithms use the same functional and health status indicators as the ADLCAP. Results: The KNN and SVM algorithms achieved similar substantially improved performance over the ADLCAP, although false positive and false negative rates were still fairly high (FP > .18, FN > .34 versus FP > .29, FN. > .58 for ADLCAP). Results are used to suggest potential revisions to the ADLCAP. Conclusion: machine learning algorithms achieved superior predictions than the current protocol. machinelearning results are less readily interpretable, but can also be used to guide development of improved clinical protocols.
Background: Altering a protein's function by changing its sequence allows natural proteins to be converted into useful molecular tools. Current protein engineering methods are limited by a lack of high throughput ...
详细信息
Background: Altering a protein's function by changing its sequence allows natural proteins to be converted into useful molecular tools. Current protein engineering methods are limited by a lack of high throughput physical or computational tests that can accurately predict protein activity under conditions relevant to its final application. Here we describe a new synthetic biology approach to protein engineering that avoids these limitations by combining high throughput gene synthesis with machinelearning-based design algorithms. Results: We selected 24 amino acid substitutions to make in proteinase K from alignments of homologous sequences. We then designed and synthesized 59 specific proteinase K variants containing different combinations of the selected substitutions. The 59 variants were tested for their ability to hydrolyze a tetrapeptide substrate after the enzyme was first heated to 68 C for 5 minutes. Sequence and activity data was analyzed using machine learning algorithms. This analysis was used to design a new set of variants predicted to have increased activity over the training set, that were then synthesized and tested. By performing two cycles of machinelearning analysis and variant design we obtained 20-fold improved proteinase K variants while only testing a total of 95 variant enzymes. Conclusion: The number of protein variants that must be tested to obtain significant functional improvements determines the type of tests that can be performed. Protein engineers wishing to modify the property of a protein to shrink tumours or catalyze chemical reactions under industrial conditions have until now been forced to accept high throughput surrogate screens to measure protein properties that they hope will correlate with the functionalities that they intend to modify. By reducing the number of variants that must be tested to fewer than 100, machine learning algorithms make it possible to use more complex and expensive tests so that only protein properties
The agent technology has recently become one of the most vibrant and fastest growing areas in information technology. This technology is developed to use more than one agent in the system;this is called Multi Agent sy...
详细信息
ISBN:
(纸本)9780955301827
The agent technology has recently become one of the most vibrant and fastest growing areas in information technology. This technology is developed to use more than one agent in the system;this is called Multi Agent systems (MAS). The system that depends on this technology should have been studied extensively. One of the most important characteristic of this is its ability to learn and adapt itself, where it has been done using one of the machine learning algorithms. Repertory Grid (RG) which has become a widely used and accepted technique for knowledge elicitation and has been implemented as a major component for many computer-based knowledge acquisition systems. In this paper, RG has been developed to be one of the machine learning algorithms and then used in MAS.
Objective: Inpatient length of stay (LOS) is an important measure of hospital activity, health care resource consumption, and patient acuity. This research work aims at developing an incremental expectation maximizati...
详细信息
Objective: Inpatient length of stay (LOS) is an important measure of hospital activity, health care resource consumption, and patient acuity. This research work aims at developing an incremental expectation maximization (EM) based learning approach on mixture of experts (ME) system for on-line prediction of LOS. The use of a batchmode learning process in most existing artificial neural networks to predict LOS is unrealistic, as the data become available over time and their pattern change dynamically. In contrast, an on-line process is capable of providing an output whenever a new datum becomes available. This on-the-spot information is therefore more useful and practical for making decisions, especially when one deals with a tremendous amount of data. Methods and material: The proposed approach is illustrated using a real example of gastroenteritis LOS data. The data set was extracted from a retrospective cohort study on all infants born in 1995-1997 and their subsequent admissions for gastroenteritis. The total number of admissions in this data set was n = 692. Linked hospitalization records of the cohort were retrieved retrospectively to derive the outcome measure, patient demographics, and associated co-morbidities information. A comparative study of the incremental learning and the batch-mode learningalgorithms is considered. The performances of the learningalgorithms are compared based on the mean absolute difference (MAD) between the predictions and the actual LOS, and the proportion of predictions with MAD < 1 day (Prop(MAD < 1)). The significance of the comparison is assessed through a regression analysis. Results: The incremental learningalgorithm provides better on-line prediction of LOS when the system has gained sufficient training from more examples (MAD = 1.77 days and Prop(MAD < 1) = 54.3%), compared to that using the batch-mode learning. The regression analysis indicates a significant decrease of MAD (p-value = 0.063) and a significant (p-value =
Clustered linear regression (CLR) is a new machine learning algorithm that improves the accuracy of classical linear regression by partitioning training space into subspaces. CLR makes some assumptions about the domai...
详细信息
Clustered linear regression (CLR) is a new machine learning algorithm that improves the accuracy of classical linear regression by partitioning training space into subspaces. CLR makes some assumptions about the domain and the data set. Firstly, target value is assumed to be a function of feature values. Second assumption is that there axe some linear approximations for this function in each subspace. Finally, there are enough training instances to determine subspaces and their linear approximations successfully. Tests indicate that if these approximations hold, CLR outperforms all other well-known machine-learningalgorithms. Partitioning may continue until linear approximation fits all the instances in the training set-that generally occurs when the number of instances in the subspace is less than or equal to the number of features plus one. In other case, each new subspace will have a better fitting linear approximation. However, this will cause over fitting and gives less accurate results for the test instances. The stopping situation can be determined as no significant decrease or an increase in relative error. CLR uses a small portion of the training instances to determine the number of subspaces. The necessity of high number of training instances makes this algorithm suitable for data mining applications. (C) 2002 Elsevier Science B.V. All rights reserved.
暂无评论