In recent years, subgroup analysis has emerged as an important tool to identify unknown subgroup memberships. However, subgroup analysis is still under-studied for longitudinal data. In this paper, we propose a structured mixed-effects approach for longitudinal data to model subgroup distribution and identify subgroup membership simultaneously. In the proposed structured mixed-effects model, the heterogeneous treatment effect is modeled as a random effect from a two-component mixture model, while the membership of the mixture model is incorporated using a logistic model with respect to some covariates. One advantage of our approach is that we are able to derive the estimation of the treatment effects through an EM-type algorithm that keeps the subgroup membership unchanged over time. Our numerical studies and real data example demonstrate that the proposed model outperforms other competing methods.
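As a rough illustration of the core computation, the sketch below runs an EM-type algorithm for a two-component Gaussian mixture whose membership probability follows a logistic model in a covariate. All names, simulation settings, and the simple gradient-ascent update of the logistic gate are illustrative assumptions, not the authors' implementation (which handles longitudinal data and keeps membership fixed over time).

```python
import numpy as np

rng = np.random.default_rng(42)

# Simulated cross-sectional data: covariate x drives subgroup membership
# via a logistic gate; the response mean differs by subgroup (illustrative).
n = 500
x = rng.normal(size=n)
p_true = 1.0 / (1.0 + np.exp(-2.0 * x))          # gate: a=0, b=2 (assumed)
z = rng.random(n) < p_true                       # latent membership
y = np.where(z, 3.0, 0.0) + rng.normal(scale=0.5, size=n)

def normal_pdf(v, mu, sigma):
    return np.exp(-0.5 * ((v - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

# Starting values: split the response at its quartiles
mu0, mu1 = np.quantile(y, 0.25), np.quantile(y, 0.75)
sigma, a, b = y.std(), 0.0, 0.0

for _ in range(100):                             # EM iterations
    p = 1.0 / (1.0 + np.exp(-(a + b * x)))
    # E-step: responsibility of component 1 for each observation
    f1 = p * normal_pdf(y, mu1, sigma)
    f0 = (1 - p) * normal_pdf(y, mu0, sigma)
    r = f1 / (f1 + f0)
    # M-step: weighted component means and a pooled scale
    mu1 = (r * y).sum() / r.sum()
    mu0 = ((1 - r) * y).sum() / (1 - r).sum()
    sigma = np.sqrt((r * (y - mu1) ** 2 + (1 - r) * (y - mu0) ** 2).mean())
    # M-step for the logistic gate: a few gradient-ascent steps on the
    # expected complete-data log-likelihood (soft labels r)
    for _ in range(20):
        p = 1.0 / (1.0 + np.exp(-(a + b * x)))
        grad = r - p
        a += 0.5 * grad.mean()
        b += 0.5 * (grad * x).mean()

print(round(mu0, 2), round(mu1, 2))
```

With well-separated components the responsibilities quickly become near-binary, which is why even this crude gate update suffices in the sketch.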
A new empirical Bayes approach to variable selection in the context of generalized linear models is developed. The proposed algorithm scales to situations in which the number of putative explanatory variables is very large, possibly much larger than the number of responses. The coefficients in the linear predictor are modeled as a three-component mixture allowing the explanatory variables to have a random positive effect on the response, a random negative effect, or no effect. A key assumption is that only a small (but unknown) fraction of the candidate variables have a nonzero effect. This assumption, in addition to treating the coefficients as random effects, facilitates an approach that is computationally efficient. In particular, the number of parameters that have to be estimated is small, and remains constant regardless of the number of explanatory variables. The model parameters are estimated using a generalized alternating maximization algorithm which is scalable, and leads to significantly faster convergence compared with simulation-based fully Bayesian methods. Supplementary materials for this article are available online.
The heteroscedastic nonlinear regression model (HNLM) is an important tool in data modeling. In this paper we propose a HNLM considering skew scale mixtures of normal (SSMN) distributions, which allows fitting asymmetric and heavy-tailed data simultaneously. Maximum likelihood (ML) estimation is performed via the expectation-maximization (EM) algorithm. The observed information matrix is derived analytically to obtain standard errors. In addition, diagnostic analysis is developed using case-deletion measures and the local influence approach. A simulation study is conducted to examine the empirical distribution of the likelihood ratio statistic, the power of the test for homogeneity of variances, and the effects of misspecifying the structure function. The proposed method is also illustrated by analyzing a real dataset.
In this paper, we consider joint modeling of repeated measurements and competing risks failure time data to allow for more than one distinct failure type in the survival endpoint. Hence, we can fit a cause-specific hazards submodel to allow for competing risks, with a separate latent association between longitudinal measurements and each cause of failure. We also consider the possible masked causes of failure in joint modeling of repeated measurements and competing risks failure time data. We further derive a score test to identify longitudinal biomarkers or surrogates for a time-to-event outcome in competing risks data which contain masked causes of failure. With a carefully chosen definition of complete data, the maximum likelihood estimation of the cause-specific hazard functions and of the masking probabilities is performed via an expectation-maximization (EM) algorithm. Simulations are used to explore how the number of individuals, the number of time points per individual, and the functional form of the random effects from the longitudinal biomarkers, allowing for heterogeneous baseline hazards across individuals, influence the power to detect the association between a longitudinal biomarker and the survival time.
Phylogenetic trees are types of networks that describe the temporal relationship between individuals, species, or other units that are subject to evolutionary diversification. Many phylogenetic trees are constructed from molecular data that is often only available for extant species, and hence they lack all or some of the branches that did not make it into the present. This feature makes inference on the diversification process challenging. For relatively simple diversification models, analytical or numerical methods to compute the likelihood exist, but these do not work for more realistic models in which the likelihood depends on properties of the missing lineages. In this article, we study a general class of species diversification models, and we provide an expectation-maximization framework in combination with a uniform sampling scheme to perform maximum likelihood estimation of the parameters of the diversification process.
Background: Associations between haplotypes and quantitative traits provide valuable information about the genetic basis of complex human diseases. Haplotypes also provide an effective way to deal with untyped SNPs. Two major challenges arise in haplotype-based association analysis of family data. First, haplotypes may not be inferred with certainty from genotype data. Second, the trait values within a family tend to be correlated because of common genetic and environmental factors.
Results: To address these challenges, we present an efficient likelihood-based approach to analyzing associations of quantitative traits with haplotypes or untyped SNPs. This approach properly accounts for within-family trait correlations and can handle general pedigrees with arbitrary patterns of missing genotypes. We characterize the genetic effects on the quantitative trait by a linear regression model with random effects and develop efficient likelihood-based inference procedures. Extensive simulation studies are conducted to examine the performance of the proposed methods. An application to family data from the Childhood Asthma Management Program Ancillary Genetic Study is provided. A computer program is freely available.
Conclusions: Results from extensive simulation studies show that the proposed methods for testing the haplotype effects on quantitative traits have correct type I error rates and are more powerful than some existing methods.
Principal component analysis (PCA) is a popular statistical tool. However, despite numerous advantages, the good practice of imputing missing data before PCA is not common. In the present work, we evaluated the hypothesis that the expectation-maximization (EM) algorithm for missing data imputation is a reliable and advantageous procedure when using PCA to derive biomarker profiles and dietary patterns. To this end, we used numerical simulations designed to mimic real data commonly observed in nutritional research. Finally, we showed the advantages and pitfalls of the EM algorithm for missing data imputation applied to plasma fatty acid concentrations and nutrient intakes from real data sets derived from the US National Health and Nutrition Examination Survey. PCA applied to simulated data having missing values resulted in biased eigenvalues with respect to the original data set without missing values. The bias between the eigenvalues from the original data set and from the data set with missing values increased with the number of missing values and appeared independent of the correlation structure among variables. On the other hand, when data were imputed, the mean of the eigenvalues over the 10 imputation runs overlapped with those derived from the PCA applied to the original data set. These results were confirmed when real data sets from the National Health and Nutrition Examination Survey were analyzed. We accept the hypothesis that the EM algorithm for missing data imputation applied before PCA to derive biochemical profiles and dietary patterns is an effective technique, especially for relatively small sample sizes. (C) 2020 Elsevier Inc. All rights reserved.
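The imputation step can be sketched as iterated conditional-mean (regression) imputation under a multivariate-normal model, i.e. the E-step of an EM algorithm, followed by a comparison of eigenvalues. The simulated "nutrient-like" data, the one-factor structure, and all settings below are illustrative assumptions, not the NHANES analysis itself.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated correlated data with a strong common factor (illustrative).
n, p = 300, 5
factor = rng.normal(size=n)
loadings = np.array([1.0, 0.9, 0.8, 0.7, 0.6])
X_full = np.outer(factor, loadings) + 0.3 * rng.normal(size=(n, p))

# Introduce ~10% missing values completely at random.
mask = rng.random((n, p)) < 0.10                 # True = missing
X_miss = X_full.copy()
X_miss[mask] = np.nan

def em_impute(X, n_iter=25):
    """Fill with column means, then iterate conditional-mean imputation
    under a multivariate-normal model until it stabilizes."""
    Z = X.copy()
    miss = np.isnan(X)
    col_means = np.nanmean(X, axis=0)
    Z[miss] = np.take(col_means, np.where(miss)[1])
    for _ in range(n_iter):
        mu = Z.mean(axis=0)
        S = np.cov(Z, rowvar=False)
        for i in range(Z.shape[0]):
            m = miss[i]
            if not m.any() or m.all():
                continue
            o = ~m
            # E[x_m | x_o] = mu_m + S_mo S_oo^{-1} (x_o - mu_o)
            Z[i, m] = mu[m] + S[np.ix_(m, o)] @ np.linalg.solve(
                S[np.ix_(o, o)], Z[i, o] - mu[o])
    return Z

X_imp = em_impute(X_miss)

# Baseline: naive column-mean imputation, for comparison.
X_mean = X_miss.copy()
X_mean[mask] = np.take(np.nanmean(X_miss, axis=0), np.where(mask)[1])

rmse_em = np.sqrt(np.mean((X_imp[mask] - X_full[mask]) ** 2))
rmse_mean = np.sqrt(np.mean((X_mean[mask] - X_full[mask]) ** 2))

eig_full = np.linalg.eigvalsh(np.cov(X_full, rowvar=False))[::-1]
eig_imp = np.linalg.eigvalsh(np.cov(X_imp, rowvar=False))[::-1]
print(rmse_em, rmse_mean, eig_full[0], eig_imp[0])
```

With correlated variables the conditional-mean fill recovers missing entries far better than column means, so the leading eigenvalues of the imputed data stay close to those of the complete data, matching the pattern the abstract reports.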
Mixture regression models have been widely used in business, marketing and social sciences to model mixed regression relationships arising from a clustered and thus heterogeneous population. The unknown mixture regression parameters are usually estimated by maximum likelihood estimators using the expectation-maximisation algorithm based on the normality assumption of component error density. However, it is well known that the normality-based maximum likelihood estimation is very sensitive to outliers or heavy-tailed error distributions. This paper aims to give a selective overview of the recently proposed robust mixture regression methods and compare their performance using simulation studies.
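One common robustification replaces the normal component errors with Student-t errors, whose EM scale weights automatically downweight outlying residuals. The sketch below is a minimal version of that idea for two regression lines with fixed degrees of freedom; the simulated data, the residual-sign initialization, and all settings are illustrative assumptions rather than any specific method from the overview.

```python
import math
import numpy as np

rng = np.random.default_rng(1)

# Two illustrative regression lines with heavy-tailed (Student-t3) errors.
n = 400
x = rng.uniform(0.0, 2.0, size=n)
z = rng.random(n) < 0.5
noise = 0.3 * rng.standard_t(3, size=n)
y = np.where(z, 2.0 + 2.0 * x, -2.0 - 1.0 * x) + noise

X = np.column_stack([np.ones(n), x])
nu = 3.0                                   # fixed t degrees of freedom

def t_pdf(e, sigma, nu):
    c = math.gamma((nu + 1) / 2) / (
        math.gamma(nu / 2) * math.sqrt(nu * math.pi) * sigma)
    return c * (1 + (e / sigma) ** 2 / nu) ** (-(nu + 1) / 2)

def wls(X, y, w):
    XtW = X.T * w
    return np.linalg.solve(XtW @ X, XtW @ y)

# Initialize by splitting on the sign of residuals from a single global fit.
hi = (y - X @ wls(X, y, np.ones(n))) > 0
betas = [wls(X, y, hi + 1e-6), wls(X, y, ~hi + 1e-6)]
sigmas, pis = [1.0, 1.0], [0.5, 0.5]

for _ in range(100):                       # EM iterations
    res = [y - X @ b for b in betas]
    dens = np.array([pis[k] * t_pdf(res[k], sigmas[k], nu) for k in range(2)])
    r = dens / dens.sum(axis=0)            # E-step: responsibilities
    for k in range(2):
        # Latent t scale weights: large residuals get small weight (robustness).
        u = (nu + 1) / (nu + (res[k] / sigmas[k]) ** 2)
        w = r[k] * u
        betas[k] = wls(X, y, w)
        res_k = y - X @ betas[k]
        sigmas[k] = math.sqrt((w * res_k ** 2).sum() / r[k].sum())
        pis[k] = r[k].mean()

slopes = sorted(b[1] for b in betas)
print(slopes)
```

Setting the scale weights `u` to 1 recovers the normality-based EM, which is exactly the estimator the overview describes as sensitive to outliers.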
In the discipline of software development, effort estimation plays a pivotal role. For the successful development of a project, an accurate estimate is essential. However, there is no standard estimation method applicable to all projects. Hence, finding the best way to estimate effort becomes an indispensable need of the project manager. Mathematical models alone achieve only mediocre estimation accuracy. On that account, we opt for analogy-based effort estimation by means of soft computing techniques, which rely on historical effort data from successfully completed projects to estimate the effort. To improve accuracy, models are generated for clusters of the datasets, on the premise that data within a cluster share similar properties. This paper focuses mainly on the analysis of techniques to improve effort prediction accuracy. The research starts by analyzing the correlation coefficients of the selected datasets. The process then moves through the analysis of classification accuracy, clustering accuracy, mean magnitude of relative error, and prediction accuracy based on several machine learning methods. Finally, a bio-inspired firefly algorithm with fuzzy analogy is applied to the datasets to produce good estimation accuracy.
The behaviour of ecological systems relies mainly on the interactions between the species involved. We consider the problem of inferring the species interaction network from abundance data. To be relevant, any network inference methodology needs to handle count data and to account for possible environmental effects. It also needs to distinguish between direct interactions and indirect associations, and graphical models provide a convenient framework for this purpose. A simulation study shows that the proposed methodology compares well with state-of-the-art approaches, even when the underlying graph differs strongly from a tree. The analysis of two datasets highlights the influence of covariates on the inferred network. Accounting for covariates is critical to avoid spurious edges. The proposed approach could be extended to perform network comparison or to look for missing species.
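The distinction between direct interactions and indirect associations can be illustrated with a Gaussian graphical model, where direct links correspond to nonzero partial correlations read off the precision matrix. This is only a simplified Gaussian stand-in (the abstract's setting involves count data and covariates), and the three-variable chain is an assumption chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(7)

# A chain A -> B -> C (illustrative "species"): A and C interact only
# through B, so their marginal correlation is high while their partial
# correlation (the direct edge in a Gaussian graphical model) is near zero.
n = 2000
a = rng.normal(size=n)
b = a + 0.5 * rng.normal(size=n)
c = b + 0.5 * rng.normal(size=n)
X = np.column_stack([a, b, c])

P = np.linalg.inv(np.cov(X, rowvar=False))   # precision matrix
d = np.sqrt(np.diag(P))
partial = -P / np.outer(d, d)                # off-diagonal: partial correlations

marg_ac = np.corrcoef(a, c)[0, 1]
print(round(marg_ac, 2), round(partial[0, 2], 2), round(partial[0, 1], 2))
```

The strong marginal A-C correlation vanishes once B is conditioned on, which is exactly the spurious-edge phenomenon graphical models are designed to avoid.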