There is an increasing prevalence of streaming data generation in diverse fields like healthcare, finance, social media, and weather forecasting. In order to acquire helpful insights from these massive datasets, timely analysis is essential. In this article, we assume that the streaming data are analysed in batches. Traditional offline methods, which involve storing and analysing all individual records, can be repeatedly applied to the cumulative data, but encounter significant challenges in storage and computing costs. Existing online methods offer faster approximations, but most neglect model uncertainty, causing overconfidence and instability. To bridge this gap, we propose novel online Bayesian approaches for generalized linear models (GLMs) that incorporate model uncertainty within a Bayesian model averaging (BMA) framework. We develop computationally efficient methods to update the posterior using individual records from the latest batch of data and summary statistics from previous batches. We demonstrate using simulation studies and real data that our methods can offer much faster analysis compared to traditional methods, with no substantial drop in accuracy.
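The batch-update idea above can be illustrated with a toy analogue: for a Gaussian linear model with known noise variance, the exact posterior depends on the data only through the summary statistics Σx² and Σxy, so accumulating those per-batch totals reproduces the offline posterior without storing individual records. This is a hedged sketch, not the paper's BMA method for GLMs; the prior, noise variance, and all names are illustrative.

```python
# Toy analogue of batch-wise posterior updating via summary statistics.
# Model: y = beta * x + noise, known noise variance, Gaussian prior on beta.
def offline_posterior(xs, ys, prior_prec=1.0, noise_var=1.0):
    """Posterior (mean, precision) for beta from all raw records at once."""
    s_xx = sum(x * x for x in xs)
    s_xy = sum(x * y for x, y in zip(xs, ys))
    post_prec = prior_prec + s_xx / noise_var
    post_mean = (s_xy / noise_var) / post_prec
    return post_mean, post_prec

class OnlineGaussianPosterior:
    """Accumulate batch summary statistics instead of storing raw records."""
    def __init__(self, prior_prec=1.0, noise_var=1.0):
        self.prior_prec = prior_prec
        self.noise_var = noise_var
        self.s_xx = 0.0   # running sum of x^2 across all batches seen so far
        self.s_xy = 0.0   # running sum of x*y across all batches seen so far
    def update(self, xs, ys):
        self.s_xx += sum(x * x for x in xs)
        self.s_xy += sum(x * y for x, y in zip(xs, ys))
    def posterior(self):
        post_prec = self.prior_prec + self.s_xx / self.noise_var
        return (self.s_xy / self.noise_var) / post_prec, post_prec
```

Feeding the batches one at a time gives the same posterior as the offline fit on the pooled data, which is the storage/computation saving the abstract describes.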
Generalized linear model (GLM) applications have become very popular in recent years. However, when there is a high degree of correlation among the independent variables, the problem of multicollinearity arises in these models. In this paper, we introduce new first-order approximated (FOA) estimators for the case of gamma-distributed response variables in GLMs. We also generalize some estimation methods for the ridge and Liu parameters in gamma regression models (GRMs). The superiority of these estimators is assessed by the estimated mean squared error (EMSE) in a Monte Carlo simulation study in which the response follows a gamma distribution with the log link function. We finally consider a real data application, and the proposed estimators are compared and interpreted.
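At IRLS convergence, ridge estimators for GLMs share the algebraic form β(k) = (X′WX + kI)⁻¹X′Wz. As a hedged illustration only, the sketch below uses W = I (ordinary ridge, not the paper's FOA gamma estimators) and made-up numbers to show how the shrinkage parameter k stabilizes coefficients under multicollinearity:

```python
# Ordinary ridge estimator for two predictors, solved with an explicit
# 2x2 inverse; data and the choice k = 1.0 are purely illustrative.
def xtx_xty(X, y):
    """Return the entries of X'X (a, b, d) and of X'y (t0, t1)."""
    a = sum(r[0] * r[0] for r in X)
    b = sum(r[0] * r[1] for r in X)
    d = sum(r[1] * r[1] for r in X)
    t0 = sum(r[0] * yi for r, yi in zip(X, y))
    t1 = sum(r[1] * yi for r, yi in zip(X, y))
    return (a, b, d), (t0, t1)

def ridge_2d(X, y, k):
    """beta(k) = (X'X + kI)^{-1} X'y; k = 0 recovers OLS."""
    (a, b, d), (t0, t1) = xtx_xty(X, y)
    a, d = a + k, d + k                      # add k to the diagonal
    det = a * d - b * b
    return ((d * t0 - b * t1) / det, (a * t1 - b * t0) / det)
```

On two nearly collinear predictors the OLS solution has wildly inflated coefficients of opposite sign, while a small k > 0 shrinks the coefficient norm sharply, which is the EMSE advantage the abstract evaluates.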
A fundamental aspect of statistics is the integration of data from different sources. Classically, Fisher and others focused on how to integrate homogeneous (or only mildly heterogeneous) sets of data. More recently, as data are becoming more accessible, the question of whether data sets from different sources should be integrated is becoming more relevant. The current literature treats this as a question with only two answers: integrate or don't. Here we take a different approach, motivated by information-sharing principles coming from the shrinkage estimation literature. In particular, we deviate from the do/don't perspective and propose a dial parameter that controls the extent to which two data sources are integrated. How far this dial parameter should be turned is shown to depend, for example, on the informativeness of the different data sources as measured by Fisher information. In the context of generalized linear models, this more nuanced data integration framework leads to relatively simple parameter estimates and valid tests/confidence intervals. Moreover, we demonstrate both theoretically and empirically that setting the dial parameter according to our recommendation leads to more efficient estimation compared to other binary data integration schemes.
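For a single scalar parameter, the dial idea can be sketched as an information-weighted combination in which α = 0 ignores the second source and α = 1 fully pools the two. The specific formula below is an illustrative inverse-variance analogue, not the paper's estimator:

```python
def dial_estimate(theta1, info1, theta2, info2, alpha):
    """Combine two estimates; alpha in [0, 1] dials how much source 2 is used.

    info1/info2 play the role of Fisher information (illustrative only):
    alpha = 0 returns theta1 unchanged; alpha = 1 is full inverse-variance
    pooling; intermediate alpha interpolates between the two regimes.
    """
    return (info1 * theta1 + alpha * info2 * theta2) / (info1 + alpha * info2)
```

Note that the more informative source (larger `info`) dominates the combination for any fixed α, mirroring the abstract's point that the right dial setting depends on the relative informativeness of the sources.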
Sample multiplexing enables pooled analysis during single-cell RNA sequencing workflows, thereby increasing throughput and reducing batch effects. A challenge for all multiplexing techniques is to link sample-specific barcodes with cell-specific barcodes, then demultiplex sample identity post-sequencing. However, existing demultiplexing tools fail under many real-world conditions where barcode cross-contamination is an issue. We therefore developed deMULTIplex2, an algorithm inspired by a mechanistic model of barcode cross-contamination. deMULTIplex2 employs generalized linear models and expectation-maximization to probabilistically determine the sample identity of each cell. Benchmarking reveals superior performance across various experimental conditions, particularly on large or noisy datasets with unbalanced sample compositions.
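The expectation-maximization alternation behind such demultiplexing can be sketched generically. The toy below fits a two-component Gaussian mixture to (say) log-transformed barcode counts and returns per-cell responsibilities; deMULTIplex2's actual model (GLM-based and contamination-aware) is more elaborate, and every name and number here is illustrative.

```python
import math

def em_two_gaussians(data, iters=50):
    """Toy EM for a two-component Gaussian mixture (illustrative only)."""
    xs = sorted(data)
    mu = [xs[len(xs) // 4], xs[3 * len(xs) // 4]]   # crude initialization
    var = [1.0, 1.0]
    pi = [0.5, 0.5]
    resp = []
    for _ in range(iters):
        # E-step: posterior probability that each point came from each component
        resp = []
        for x in data:
            dens = [pi[k] * math.exp(-(x - mu[k]) ** 2 / (2 * var[k]))
                    / math.sqrt(2 * math.pi * var[k]) for k in range(2)]
            s = dens[0] + dens[1]
            resp.append([d / s for d in dens] if s > 0 else [0.5, 0.5])
        # M-step: re-estimate weights, means, variances from responsibilities
        for k in range(2):
            nk = max(sum(r[k] for r in resp), 1e-9)
            pi[k] = nk / len(data)
            mu[k] = sum(r[k] * x for r, x in zip(resp, data)) / nk
            var[k] = max(sum(r[k] * (x - mu[k]) ** 2
                             for r, x in zip(resp, data)) / nk, 1e-6)
    return mu, var, pi, resp
```

The returned responsibilities play the role of probabilistic sample-identity calls: a cell is assigned to the component with the larger responsibility, or flagged as ambiguous when the two are close.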
This paper presents a nonparametric bootstrap method for estimating the proportions of inliers and outliers in robust regression models. Our approach is based on the concept of stability, providing robustness against distributional assumptions and eliminating the need for pre-specified confidence levels. Through numerical experiments, we demonstrate that this method yields more accurate and stable estimates than existing alternatives. Additionally, the generated instability paths offer a valuable graphical tool for understanding the inlier and outlier distributions within the data. The method naturally extends to generalized linear models, where we find that variance-stabilizing transformations produce residuals that are well-suited for outlier detection. Applications to two real-world datasets further illustrate the practical utility of our approach in identifying outliers.
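A minimal sketch of the bootstrap idea, assuming a location-only model with MAD-scaled residuals rather than the paper's regression setting or its stability criterion; the threshold c and all data are illustrative:

```python
import random
import statistics

def outlier_fraction(data, c=3.0):
    """Share of points whose MAD-scaled residual from the median exceeds c."""
    med = statistics.median(data)
    mad = statistics.median(abs(x - med) for x in data) or 1e-9
    # 1.4826 rescales the MAD to be consistent with a normal std deviation
    return sum(abs(x - med) > c * 1.4826 * mad for x in data) / len(data)

def bootstrap_fraction(data, n_boot=200, c=3.0, seed=0):
    """Nonparametric bootstrap distribution of the outlier proportion."""
    rng = random.Random(seed)
    fracs = [outlier_fraction([rng.choice(data) for _ in data], c)
             for _ in range(n_boot)]
    return statistics.mean(fracs), statistics.stdev(fracs)
```

Plotting the bootstrap fractions across a range of thresholds gives a crude analogue of the instability paths the abstract describes: proportions that stay flat across resamples indicate a stable inlier/outlier split.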
We propose two novel one-sample Mendelian randomization (MR) approaches to causal inference from count-type health outcomes, tailored to both equidispersion and overdispersion conditions. Selecting valid single-nucleotide polymorphisms (SNPs) as instrumental variables (IVs) poses a key challenge for MR approaches, as it requires meeting the necessary IV assumptions. To bolster the proposed approaches by addressing violations of IV assumptions, we incorporate a process for removing invalid SNPs that violate the assumptions. In simulations, our proposed approaches demonstrate robustness to the violations, delivering valid estimates as well as interpretable type-I error rates and statistical power. This increases the practical applicability of the models. We applied the proposed approaches to evaluate the causal effect of fetal hemoglobin (HbF) on vaso-occlusive crisis and acute chest syndrome (ACS) events in patients with sickle cell disease (SCD) and revealed the causal relation between HbF and ACS events in these patients. We also developed a user-friendly Shiny web application to facilitate researchers' exploration of causal relations.
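At its core, one-sample MR with a single valid SNP reduces to the Wald ratio cov(G, Y)/cov(G, X). The sketch below shows this for a continuous outcome (the paper's methods handle count outcomes with equi/overdispersion, which this toy ignores); the data are constructed so a hidden confounder biases naive regression of Y on X but not the IV estimate.

```python
def wald_ratio(g, x, y):
    """IV (Wald ratio) estimate of the effect of x on y: cov(g,y)/cov(g,x)."""
    n = len(g)
    mg, mx, my = sum(g) / n, sum(x) / n, sum(y) / n
    cov_gy = sum((gi - mg) * (yi - my) for gi, yi in zip(g, y))
    cov_gx = sum((gi - mg) * (xi - mx) for gi, xi in zip(g, x))
    return cov_gy / cov_gx
```

In the test data, genotype g takes values in {0, 1, 2}, a confounder u affects both exposure and outcome, and the true causal effect of x on y is 2.0; because u is orthogonal to g, the Wald ratio recovers the causal effect exactly.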
Model selection techniques have existed for many years; however, to date, simple, clear and effective methods of visualising the model building process are scarce. This article describes graphical methods that assist in the selection of models and the comparison of many different selection criteria. Specifically, we describe, for logistic regression, how to visualise measures of description loss and of model complexity to address the model selection dilemma. We advocate the use of the bootstrap to assess the stability of selected models and to enhance our graphical tools. We demonstrate which variables are important using variable inclusion plots and show that these can be invaluable for the model building process. We show with two case studies how these proposed tools are useful for learning more about important variables in the data and how they can assist the understanding of the model building process. Copyright (c) 2013 John Wiley & Sons, Ltd.
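The bootstrap variable inclusion idea can be sketched as: refit a selection rule on many resamples and record how often each variable is chosen. The stand-in rule below (absolute correlation with the response above a cutoff, rather than a full logistic-regression selection criterion) and all names and parameters are illustrative.

```python
import random

def select_vars(X, y, cutoff=0.3):
    """Stand-in selection rule: keep variables whose |corr with y| > cutoff."""
    n, p = len(y), len(X[0])
    my = sum(y) / n
    sy = (sum((yi - my) ** 2 for yi in y)) ** 0.5 or 1e-12
    chosen = []
    for j in range(p):
        col = [row[j] for row in X]
        mj = sum(col) / n
        sj = (sum((c - mj) ** 2 for c in col)) ** 0.5 or 1e-12
        r = sum((c - mj) * (yi - my) for c, yi in zip(col, y)) / (sj * sy)
        if abs(r) > cutoff:
            chosen.append(j)
    return chosen

def inclusion_proportions(X, y, n_boot=200, seed=1):
    """How often each variable is selected across bootstrap resamples."""
    rng = random.Random(seed)
    counts = [0] * len(X[0])
    for _ in range(n_boot):
        idx = [rng.randrange(len(y)) for _ in y]
        for j in select_vars([X[i] for i in idx], [y[i] for i in idx]):
            counts[j] += 1
    return [c / n_boot for c in counts]
```

Plotting these proportions against variable names gives a crude variable inclusion plot: variables selected in nearly all resamples are stable, while those selected sporadically owe their inclusion to sampling noise.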
Recursive partitioning algorithms separate a feature space into a set of disjoint rectangles. Then, usually, a constant is fitted in every partition. While this is a simple and intuitive approach, it may still lack interpretability as to how a specific relationship between dependent and independent variables may look. Or a certain model may be assumed or of interest, and there are a number of candidate variables that may non-linearly give rise to different model parameter values. We present an approach that combines generalized linear models (GLMs) with recursive partitioning, offering enhanced interpretability over classical trees as well as an explorative way to assess a candidate variable's influence on a parametric model. This method conducts recursive partitioning of a GLM by (1) fitting the model to the data set, (2) testing for parameter instability over a set of partitioning variables, and (3) splitting the data set with respect to the variable associated with the highest instability. The outcome is a tree in which each terminal node is associated with a GLM. We show the method's versatility and its suitability for gaining additional insight into the relationship between dependent and independent variables with two examples, modelling voting behaviour and a failure model for debt amortization, and compare it to alternative approaches.
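A one-split sketch of the partitioning loop above, simplified to a Gaussian (identity-link) linear model in each node and an exhaustive residual-sum-of-squares search in place of the parameter-instability test; variable names and data are illustrative:

```python
# One level of model-based partitioning: fit y = a + b*x in each candidate
# half, choose the cut on partitioning variable z that most reduces the SSE.
def fit_line(xs, ys):
    """OLS fit of y = a + b*x; returns (a, b, sse)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs) or 1e-12
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    a = my - b * mx
    sse = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
    return a, b, sse

def best_split(xs, ys, zs, min_leaf=3):
    """Return (cut, sse): best split z <= cut, or (None, base_sse) if none."""
    base = fit_line(xs, ys)[2]
    best = (None, base)
    for cut in sorted(set(zs))[:-1]:
        left = [i for i, z in enumerate(zs) if z <= cut]
        right = [i for i in range(len(zs)) if i not in left]
        if len(left) < min_leaf or len(right) < min_leaf:
            continue
        sse = (fit_line([xs[i] for i in left], [ys[i] for i in left])[2]
               + fit_line([xs[i] for i in right], [ys[i] for i in right])[2])
        if sse < best[1]:
            best = (cut, sse)
    return best
```

Applied recursively to each half, this yields a tree whose terminal nodes each carry their own fitted model, which is the structure the abstract describes; the real method's instability tests avoid the selection bias of a raw SSE search.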
At-haulback mortality of blue shark (Prionace glauca) captured by the Portuguese pelagic longline fishery targeting swordfish in the Atlantic was modeled. Data were collected by onboard fishery observers who monitored 762 fishing sets (1 005 486 hooks) and recorded information on 26 383 blue sharks. The size distribution of the sample ranged from 40 to 305 cm fork length, with 13.3% of the specimens captured dead at-haulback. Data modeling was carried out with generalized linear models (GLMs) and generalized estimating equations (GEEs), given the fishery-dependent source of the data. The explanatory variables influencing blue shark mortality rates were year, specimen size, fishing location, sex, season and branch line material. Model diagnostics and validation were performed with residual analysis, the Hosmer-Lemeshow test, a receiver operating characteristic (ROC) curve, and a 10-fold cross-validation procedure. One important conclusion of this study was that blue shark sizes are important predictors for estimating at-haulback mortality rates, with the probability of dying at-haulback decreasing with increasing specimen size. The effects in terms of odds ratios are non-linear: the odds of surviving change most rapidly for smaller sharks (as sharks grow in size) and then stabilize as sharks reach larger sizes. The models presented in this study seem valid for predicting blue shark at-haulback mortality in this fishery, and can be used by fisheries management organizations for assessing the efficacy of management and conservation initiatives for the species in the future. (C) 2013 Elsevier B.V. All rights reserved.
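The core mortality model here is a binomial GLM, logit P(dead at haulback) = β₀ + β₁·size; the study additionally uses GEEs to handle within-set correlation, which this sketch omits. A hedged, self-contained Newton-Raphson fit on made-up data (sizes in metres, outcomes invented):

```python
import math

def fit_logistic(sizes, died, iters=60):
    """Damped Newton-Raphson MLE for logit P(died) = a + b * size."""
    a, b = 0.0, 0.0
    for _ in range(iters):
        ga = gb = haa = hab = hbb = 0.0
        for x, y in zip(sizes, died):
            p = 1.0 / (1.0 + math.exp(-(a + b * x)))
            w = p * (1.0 - p)
            ga += y - p              # gradient of the log-likelihood
            gb += (y - p) * x
            haa += w                 # (negative) Hessian entries
            hab += w * x
            hbb += w * x * x
        det = haa * hbb - hab * hab
        da = (hbb * ga - hab * gb) / det
        db = (haa * gb - hab * ga) / det
        step = max(1.0, abs(da), abs(db))   # damp large steps for stability
        a += da / step
        b += db / step
    return a, b
```

A negative fitted slope b corresponds to the abstract's finding: the probability of dying at-haulback decreases with size, and the per-unit odds ratio exp(b) implies odds of survival that change fastest for small sharks.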
An effective methodology for dealing with data extracted from clinical surveys on heart failure linked to the Public Health Database is proposed. A model for recurrent events is used for modelling the occurrence of hospital readmissions in time, thus deriving a suitable way to compute individual cumulative hazard functions. Estimated cumulative hazard trajectories are then treated as functional data, and they are used as covariates along with clinical survey data within the framework of generalized linear models with functional covariates.
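One standard way to obtain the individual cumulative hazard trajectories mentioned above is the Nelson-Aalen estimator, H(t) = Σ_{t_i ≤ t} d_i/n_i; evaluating H on a common time grid then yields a functional covariate. A hedged toy version (no censoring, which real readmission data would require handling):

```python
def nelson_aalen(event_times, n_at_risk_start):
    """Cumulative hazard at each distinct event time (toy: no censoring)."""
    out = []
    h = 0.0
    at_risk = n_at_risk_start
    for t in sorted(set(event_times)):
        d = event_times.count(t)      # events observed at time t
        h += d / at_risk              # Nelson-Aalen increment d_i / n_i
        out.append((t, h))
        at_risk -= d                  # those who had the event leave the risk set
    return out
```

Each patient's estimated trajectory, sampled on a shared grid, can then enter a GLM with functional covariates alongside the scalar clinical-survey variables.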