We discuss several algorithms for sampling from unnormalized probability distributions in statistical physics, but using the language of statistics and machine learning. We provide a self-contained introduction to some key ideas and concepts of the field, before discussing three well-known problems: phase transitions in the Ising model, the melting transition on a two-dimensional plane and simulation of an all-atom model for liquid water. We review the classical Metropolis, Glauber and molecular dynamics sampling algorithms before discussing several more recent approaches, including cluster algorithms, novel variations of hybrid Monte Carlo and Langevin dynamics and piecewise deterministic processes such as event chain Monte Carlo. We highlight cross-over with statistics and machine learning throughout and present some results on event chain Monte Carlo and sampling from the Ising model using tools from the statistics literature. We provide a simulation study on the Ising and XY models, with reproducible code freely available online, and following this we discuss several open areas for interaction between the disciplines that have not yet been explored and suggest avenues for doing so.
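As an illustration of the classical Metropolis algorithm mentioned in this abstract, the following is a minimal sketch of a single-spin-flip sampler for the two-dimensional Ising model. It is not the paper's code: the function names, the coupling J = 1, zero external field, periodic boundaries and the default parameters are all assumptions made for the example.

```python
import math
import random


def metropolis_ising(L=8, beta=0.4, sweeps=200, seed=0):
    """Single-spin-flip Metropolis for the 2D Ising model.

    Assumptions (not from the abstract): coupling J = 1, no external
    field, periodic boundary conditions, random initial configuration.
    """
    rng = random.Random(seed)
    spins = [[rng.choice((-1, 1)) for _ in range(L)] for _ in range(L)]
    for _ in range(sweeps * L * L):
        i, j = rng.randrange(L), rng.randrange(L)
        # Sum of the four nearest neighbours (indices wrap around).
        nb = (spins[(i + 1) % L][j] + spins[(i - 1) % L][j]
              + spins[i][(j + 1) % L] + spins[i][(j - 1) % L])
        # Energy change if spin (i, j) is flipped, for H = -sum s_i s_j.
        dE = 2 * spins[i][j] * nb
        # Metropolis rule: accept with probability min(1, exp(-beta * dE)).
        if dE <= 0 or rng.random() < math.exp(-beta * dE):
            spins[i][j] = -spins[i][j]
    return spins


def magnetization(spins):
    """Average spin per site, a value in [-1, 1]."""
    n = len(spins) * len(spins[0])
    return sum(sum(row) for row in spins) / n
```

Calling `magnetization(metropolis_ising(L=16, beta=0.6))` gives one sample of the order parameter; averaging over many independent runs at different `beta` is one simple way to look at the phase transition the abstract refers to.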
Stroke is a serious disease that has a significant impact on the quality of life and safety of patients. Accurately predicting stroke risk is of great significance for preventing and treating stroke. In the past few years, machine learning methods have shown potential in predicting stroke risk. However, due to the imbalance of stroke data and the challenges of feature selection and model selection, stroke risk prediction still faces some *** article aims to compare the performance of different sampling algorithms and machine learning methods in stroke risk prediction. This study used over-sampling algorithms (Random Oversampling and SMOTE), under-sampling algorithms (Random Undersampling and ENN), and a hybrid sampling algorithm (SMOTE-ENN), and combined them with common machine learning methods such as K-Nearest Neighbors, Logistic Regression, Decision Tree and Support Vector Machine to build the prediction *** the analysis of experimental results, and found that SMOTE combined with the LR model showed good performance in stroke risk prediction, with a high F1 score. In addition, this study found that the overall performance of the under-sampling algorithms is better than that of the over-sampling and hybrid sampling *** research results provide useful references for predicting stroke risk and provide a foundation for further research and application. Future research can continue to explore more sampling algorithms, machine learning methods, and feature engineering techniques to further improve the accuracy and interpretability of stroke risk prediction and promote its application in clinical practice.
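To make the SMOTE over-sampling step named in this abstract concrete, here is a minimal sketch of its core idea: each synthetic point is an interpolation between a minority sample and one of its nearest minority neighbours. This is not the study's implementation; the function name `smote`, its parameters and the use of squared Euclidean distance are assumptions for illustration only.

```python
import random


def smote(minority, n_new, k=3, seed=0):
    """Generate n_new synthetic minority samples by interpolation.

    `minority` is a list of equal-length numeric tuples. The name and
    signature are hypothetical; this is a sketch of the SMOTE idea,
    not a drop-in replacement for a library implementation.
    """
    if len(minority) < 2:
        raise ValueError("need at least two minority samples")
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        idx = rng.randrange(len(minority))
        x = minority[idx]
        # Rank the other minority points by squared Euclidean distance to x.
        others = [p for j, p in enumerate(minority) if j != idx]
        others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(x, p)))
        neighbour = rng.choice(others[:k])
        # Place the new point uniformly on the segment from x to the neighbour.
        t = rng.random()
        synthetic.append(tuple(a + t * (b - a) for a, b in zip(x, neighbour)))
    return synthetic
```

Because every synthetic point lies on a segment between two existing minority points, the new samples stay inside the convex hull of the minority class.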
ISBN (digital): 9781665467087
ISBN (print): 9781665467087
The sequencing of sampling algorithms has been shown to be a promising approach for generating balanced versions of unbalanced data. Sequencing allows different under-sampling and/or over-sampling algorithms to be performed in sequence, producing a resulting balanced database. However, defining the most appropriate sequence of sampling algorithms is challenging. This article treats the sequencing problem as a combinatorial optimization task and proposes a multi-objective optimization method to seek promising solutions that maximize the performance of classifiers in both accuracy and F1-score. The results showed that the proposed method was capable of finding optimized sequences that improved the performance of the classifiers, obtaining statistically better results, mainly in F1-score, when compared with competing methods, in most of the selected unbalanced problems.
ISBN (print): 9781728171227
Sampling is a core process in IoT systems. It determines the data volume circulating within the network as well as the energy consumption on the IoT devices. Adaptive sampling aims to control the volume of generated data to reduce energy and bandwidth consumption without undermining data quality. Within this context, we propose two new adaptive sampling techniques: a lightweight adaptive sampling algorithm and an optimized uniform sampling method. We tested our methods using various real datasets and compared their performance against state-of-the-art adaptive sampling algorithms in terms of data quality and data volume. The results show that the proposed methods are consistently among the best with a noticeable reduction in computational load.
Drinking water quality data sets used in learning models have been highly imbalanced, which has weakened the prediction ability of models for drinking water quality. Although some efforts have been made to address the issue of imbalance, little is known about the suitable technologies for drinking water quality prediction. Here, a total of 16 common learning models were applied individually to compare the drinking water quality prediction performance based on a large-scale highly imbalanced drinking water quality data set. Our results showed that ensemble, cost-sensitive learning models with higher F1-scores were more suitable for predicting drinking water quality, compared to other models tested in this study. In addition, the learning model performance could be enhanced by the introduction of two mainstream sampling algorithms: the synthetic minority oversampling technique (SMOTE) combined with either the Tomek links technique (SMOTE + TLTE) or the edited nearest neighbor technique (SMOTE + ENNTE). In particular, the F1-scores of deep cascade forest (DCF) with SMOTE + TLTE or SMOTE + ENNTE reached 94.54 ± 2.51% and 94.68 ± 2.72%, respectively. As a consequence, DCF with these two sampling algorithms has greater potential to be applied in drinking water quality monitoring and prediction, as well as other fields that have suffered from issues of imbalanced data.
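The Tomek links cleaning step referred to in this abstract can be sketched as follows. A Tomek link is a pair of samples from different classes that are each other's nearest neighbour; such pairs are usually treated as borderline or noisy and removed after over-sampling. The function name `tomek_links` and the use of squared Euclidean distance are assumptions for this example, not details taken from the paper.

```python
def tomek_links(X, y):
    """Return index pairs (i, j), i < j, that form Tomek links.

    X is a list of equal-length numeric tuples and y the matching class
    labels. Hypothetical helper for illustration; distance ties are
    broken in favour of the lower index.
    """
    def nearest(i):
        # Index of the closest other point under squared Euclidean distance.
        return min(
            (j for j in range(len(X)) if j != i),
            key=lambda j: sum((a - b) ** 2 for a, b in zip(X[i], X[j])),
        )

    nn = [nearest(i) for i in range(len(X))]
    # Keep mutual nearest neighbours that carry different labels.
    return [
        (i, j)
        for i, j in enumerate(nn)
        if i < j and nn[j] == i and y[i] != y[j]
    ]
```

In a SMOTE + Tomek links pipeline, the points returned here (either both members of each pair or only the majority-class member, depending on the variant) would be dropped after the synthetic minority samples have been added.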
Statistical mechanics bridges the fields of physics and probability theory, providing critical insights into both disciplines. Statistical physics models capture key features of macroscopic phenomena and consist of a set of configurations satisfying various constraints. Markov chain Monte Carlo algorithms are often used to sample from distributions over the exponentially large state space of these models to gain insight about the system and estimate its thermodynamic properties. Similar problems arise throughout machine learning, optimization, and counting complexity. In this dissertation, we present several new techniques based on random walks for analyzing sampling algorithms and the dynamics of various lattice models from statistical physics. We start by investigating the mixing time of Glauber dynamics for the six-vertex model in its ordered phases. We show that for every Boltzmann weight in the ferroelectric phase, there exist boundary conditions such that local Markov chains require exponential time to converge to equilibrium. This is the first rigorous result about the mixing time of Glauber dynamics for the six-vertex model in the ferroelectric phase. We also analyze the Glauber dynamics with free boundary conditions in the antiferroelectric phase and significantly extend the region for which local Markov chains are known to be slow mixing. In separate lines of work, we use techniques from the theory of random walks and electrical networks to give nearly tight bounds for the transience class of the Abelian sandpile model, closing an open problem of Babai and Gorodezky. The Abelian sandpile model is the canonical dynamical system used to study the phenomenon of self-organized criticality, and the transience class measures the time needed for the process to reach steady-state behavior. We also explore a new approach for approximately sampling elements with fixed rank from graded posets that relies solely on the mixing time of biased Markov chains. This allows
ISBN (print): 9781509035663
Imbalanced data sets originating from real-world problems, such as medical diagnosis, are pervasive. Learning from imbalanced data sets poses its own challenges, as common classifiers assume a balanced class distribution in the data. Sampling techniques overcome the imbalance in the data by modifying the class distribution of the examples. Unfortunately, selecting a sampling technique together with its parameters is still an open problem. Current solutions include the brute-force approach (try as many techniques as possible), and the random search approach (choose the most appropriate from a random subset of techniques). In this work, we propose a new method to select sampling techniques for imbalanced data sets. It uses Meta-Learning and works by recommending a technique for an imbalanced data set based on solutions to previous problems. Our experimentation compared the proposed method against the brute-force approach, all techniques with their default parameters, and the random search approach. The results of our experimentation show that the proposed method is comparable to the brute-force approach, outperforms the techniques with their default parameters most of the time, and always surpasses the random search approach.
We give an efficient perfect sampling algorithm for weighted, connected induced subgraphs (or graphlets) of rooted, bounded degree graphs. Our algorithm utilizes a vertex-percolation process with a carefully chosen rejection filter and works under a percolation subcriticality condition. We show that this condition is optimal in the sense that the task of (approximately) sampling weighted rooted graphlets becomes impossible in finite expected time for infinite graphs and intractable for finite graphs when the condition does not hold. We apply our sampling algorithm as a subroutine to give near linear-time perfect sampling algorithms for polymer models and weighted non-rooted graphlets in finite graphs, two widely studied yet very different problems. This new perfect sampling algorithm for polymer models gives improved sampling algorithms for spin systems at low temperatures on expander graphs and unbalanced bipartite graphs, among other applications.
High-entropy materials are composed of multiple elements on comparatively simple lattices. Because of the multi-component nature of such materials, atomic-scale sampling is computationally expensive owing to its combinatorial complexity. This study proposes a genetic algorithm-based methodology for sampling such complex chemically disordered materials. Genetic Algorithm-based Atomistic Sampling Protocol (GAASP) variants can generate low- as well as high-energy structures. The GAASP low-energy variant, in conjunction with the Metropolis criterion, avoids premature convergence and ensures the detailed balance condition. GAASP can be employed to generate low-energy structures for thermodynamic predictions, and diverse structures can be generated for machine-learning applications.
We study the mixing time of the single-site update Markov chain, known as the Glauber dynamics, for generating a random independent set of a tree. Our focus is obtaining optimal convergence results for arbitrary trees. We consider the more general problem of sampling from the Gibbs distribution in the hard-core model where independent sets are weighted by a parameter lambda > 0; the special case lambda = 1 corresponds to the uniform distribution over all independent sets. Previous work of Martinelli, Sinclair and Weitz (2004) obtained optimal mixing time bounds for the complete Delta-regular tree for all lambda. However, Restrepo, Stefankovic, Vera, Vigoda, and Yang (2014) showed that for sufficiently large lambda there are bounded-degree trees where optimal mixing does not hold. Recent work of Eppstein and Frishberg (2022) proved a polynomial mixing time bound for the Glauber dynamics for arbitrary trees, and more generally for graphs of bounded tree-width. We establish an optimal bound on the relaxation time (i.e., inverse spectral gap) of O(n) for the Glauber dynamics for unweighted independent sets on arbitrary trees. We stress that our results hold for arbitrary trees and there is no dependence on the maximum degree Delta. Interestingly, our results extend (far) beyond the uniqueness threshold which is on the order lambda = O(1/Delta). Our proof approach is inspired by recent work on spectral independence. In fact, we prove that spectral independence holds with a constant independent of the maximum degree for any tree, but this does not imply mixing for general trees as the optimal mixing results of Chen, Liu, and Vigoda (2021) only apply for bounded-degree graphs. We instead utilize the combinatorial nature of independent sets to directly prove approximate tensorization of variance via a non-trivial inductive proof.
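The single-site Glauber dynamics for the hard-core model described in this abstract can be sketched in a few lines. At each step a uniformly random vertex is resampled conditional on its neighbours: it becomes occupied with probability lambda / (1 + lambda) if no neighbour is occupied, and is left unoccupied otherwise. The function name, the adjacency-dictionary representation and the default parameters are assumptions for illustration; the sketch says nothing about how many steps are needed for mixing, which is the subject of the paper.

```python
import random


def glauber_hardcore(adj, lam=1.0, steps=10000, seed=0):
    """Run heat-bath Glauber dynamics for the hard-core model.

    `adj` maps each vertex to an iterable of its neighbours (assumed
    symmetric). Starts from the empty independent set and returns the
    set of occupied vertices after `steps` single-site updates.
    """
    rng = random.Random(seed)
    vertices = list(adj)
    occupied = set()
    for _ in range(steps):
        v = rng.choice(vertices)
        wants_in = rng.random() < lam / (1.0 + lam)
        blocked = any(u in occupied for u in adj[v])
        if wants_in and not blocked:
            occupied.add(v)
        else:
            # If v is blocked it is already unoccupied, so this is a no-op.
            occupied.discard(v)
    return occupied
```

For example, `glauber_hardcore({0: [1, 2], 1: [0], 2: [0]}, lam=1.0)` runs the chain on a three-vertex star; with `lam=1.0` the stationary distribution is uniform over its independent sets.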