In medical studies, composite indices and scores are routinely used for predicting the medical conditions of patients. These indices are usually developed from observed data on certain disease risk factors, and it has been demonstrated in the literature that single-index models can provide a powerful tool for this purpose. In practice, the observed data on disease risk factors are often longitudinal, in the sense that they are collected at multiple time points for individual patients, and there are often multiple aspects of a patient's medical condition that are of concern. However, most existing single-index models are developed for cases with independent data and a single response variable, which is inappropriate for the problem just described, in which within-subject observations are usually correlated and multiple mutually correlated response variables are involved. This paper aims to fill this methodological gap by developing a single-index model for analyzing longitudinal data with multiple responses. Both theoretical and numerical justifications show that the proposed method provides an effective solution to this research problem. The method is also demonstrated using a dataset from the English Longitudinal Study of Aging.
Length-biased data occur often in many scientific fields, including clinical trials, epidemiological surveys and genome-wide association studies, and many methods have been proposed for their analysis under various situations. In this article, we consider the situation where one faces length-biased and partly interval-censored failure time data under the proportional hazards model, for which no established method seems to exist. For estimation, we propose an efficient nonparametric maximum likelihood method that incorporates the distribution information of the observed truncation times. For the implementation of the method, a flexible and stable EM algorithm via two-stage data augmentation is developed. By employing empirical process theory, we establish the asymptotic properties of the resulting estimators. A simulation study conducted to assess the finite-sample performance of the proposed method suggests that it works well and is more efficient than the conditional likelihood approach. An application to an AIDS cohort study is also provided.
We study the statistical properties of an estimator derived by applying a gradient ascent method with multiple initializations to a multi-modal likelihood function. We derive the population quantity that is the target of this estimator and study the properties of confidence intervals (CIs) constructed from asymptotic normality and from the bootstrap. In particular, we analyze the coverage deficiency due to the finite number of random initializations. We also investigate CIs obtained by inverting the likelihood ratio test, the score test, and the Wald test, and we show that the resulting CIs may be very different. We propose a two-sample test procedure that applies even when the maximum likelihood estimator is intractable. In addition, we analyze the performance of the EM algorithm under random initializations and derive the coverage of a CI with a finite number of initializations. Supplementary materials for this article are available online.
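The multi-start strategy this abstract analyzes can be sketched in a few lines. The objective below is a toy multi-modal function standing in for a log-likelihood, and the step size, iteration count, and number of starts are all invented for illustration; none of this reproduces the paper's setup:

```python
import numpy as np

def loglik(theta):
    # Toy multi-modal objective: two bumps, the left one (at -2) is higher.
    return np.exp(-0.5 * (theta - 1.0) ** 2) + 1.5 * np.exp(-0.5 * (theta + 2.0) ** 2)

def grad(theta, eps=1e-6):
    # Numerical gradient via central differences.
    return (loglik(theta + eps) - loglik(theta - eps)) / (2 * eps)

def gradient_ascent(theta0, lr=0.5, n_steps=2000):
    theta = theta0
    for _ in range(n_steps):
        theta += lr * grad(theta)
    return theta

rng = np.random.default_rng(0)
starts = rng.uniform(-5, 5, size=20)                    # random initializations
local_optima = np.array([gradient_ascent(t0) for t0 in starts])
best = local_optima[np.argmax(loglik(local_optima))]    # keep the highest mode found
```

With finitely many random starts, the estimator returns the best *visited* mode; the paper's coverage analysis concerns exactly the event that no start lands in the basin of the global maximum.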
The mixture cure model is widely used to analyze survival data in the presence of a cured subgroup. Standard logistic regression-based approaches to modeling the incidence may lead to poor predictive accuracy of cure, specifically when the covariate effect is non-linear. Supervised machine learning techniques can serve as a better classifier than logistic regression due to their ability to capture non-linear patterns in the data. However, interpretability hangs in the balance due to the trade-off between interpretability and predictive accuracy. We propose a new mixture cure model where the incidence part is modeled using a decision tree-based classifier and the proportional hazards structure for the latency part is preserved. The proposed model is very easy to interpret, closely mimics the human decision-making process, and provides flexibility to gauge both linear and non-linear covariate effects. For the estimation of model parameters, we develop an expectation-maximization algorithm. A detailed simulation study shows that the proposed model outperforms the logistic regression-based and spline regression-based mixture cure models, both in terms of model fitting and predictive accuracy. An illustrative example with data from a leukemia study is presented to further support our conclusion.
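Several abstracts in this listing fit their models by expectation-maximization. As a generic illustration of the E-step/M-step alternation (a plain two-component Gaussian mixture, not the tree-based cure model above; data and initial values are invented), one might write:

```python
import numpy as np

rng = np.random.default_rng(1)
# Simulated data: a minority component near 0 and a majority component near 4.
x = np.concatenate([rng.normal(0.0, 1.0, 300), rng.normal(4.0, 1.0, 700)])

# EM for a two-component Gaussian mixture with unit variances, for brevity.
pi, mu0, mu1 = 0.5, -1.0, 1.0
for _ in range(200):
    # E-step: posterior probability that each point belongs to component 1.
    d0 = (1 - pi) * np.exp(-0.5 * (x - mu0) ** 2)
    d1 = pi * np.exp(-0.5 * (x - mu1) ** 2)
    w = d1 / (d0 + d1)
    # M-step: update mixing proportion and component means.
    pi = w.mean()
    mu0 = ((1 - w) * x).sum() / (1 - w).sum()
    mu1 = (w * x).sum() / w.sum()
```

In a mixture cure model the latent indicator plays the role of `w` here (cured versus susceptible), with the incidence and latency submodels updated in the M-step.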
In this paper, we formulate and estimate a flexible model of job mobility and wages with two-sided heterogeneity. The analysis extends the finite mixture approach of Bonhomme, Lamadon, and Manresa (2019) and Abowd, McKinney, and Schmutte (2019) to develop a new Classification Expectation-Maximization algorithm that ensures both worker and firm latent-type identification using wage and mobility variations in the data. Workers receive job offers in worker-type segmented labor markets. Offers are accepted according to a logit form that compares the value of the current job with that of the new job. In combination with flexibly estimated layoff and job finding rates, the analysis quantifies four different sources of sorting: job preferences, segmentation, layoffs, and job finding. Job preferences are identified through job-to-job moves in a revealed preference argument. In the model, they are structurally independent of the identified job wages, possibly as a reflection of the presence of amenities. We find evidence of a strong pecuniary motive in job preferences. While the correlation between preferences and current job wages is positive, the net present value of the future earnings stream given the current job correlates much more strongly with preferences for it. This is more so for short- than long-tenure workers. In the analysis, we distinguish between type sorting and wage sorting. Type sorting is quantified by means of the mutual information index. Wage sorting is captured through the correlation between identified wage types. While layoffs are less important than the other channels, we find all channels to contribute substantially to sorting. As workers age, job arrival processes are the key determinant of wage sorting, whereas job preferences dictate type sorting. Over the life cycle, job preferences intensify, type sorting increases, and pecuniary considerations wane.
Spatial transcriptomics is a groundbreaking technology that allows gene activity to be measured while retaining the information about where in the tissue the activity occurs. This technology has enabled the study of the spatial variation of genes across the tissue. Comprehending gene functions and interactions in different areas of the tissue is of great scientific interest, as it might lead to a deeper understanding of several key biological mechanisms, such as cell-cell communication or tumor-microenvironment interaction. To do so, one can group cells of the same type and genes that exhibit similar expression patterns. However, adequate statistical tools that exploit the previously unavailable spatial information to more coherently group cells and genes are still lacking. In this work we introduce SPARTACO, a new statistical model that clusters the spatial expression profiles of the genes according to a partition of the tissue. This is accomplished by performing a co-clustering, that is, inferring the latent block structure of the data and inducing two types of clustering: of the genes, using their expression across the tissue, and of the image areas, using the gene expression in the spots where the RNA is collected. Our proposed methodology is validated with a series of simulation experiments, and its usefulness in responding to specific biological questions is illustrated with an application to a human brain tissue sample processed with the 10X-Visium protocol.
The mixture cure model has become increasingly popular in biostatistics, where some individuals may never experience the event of interest during a study. In most cases, the effects of continuous covariates are assumed to be linear. However, the traditional linear assumption often fails in practice because real-life effects are usually nonlinear. We propose a linear spline Cox cure model in which a spline approximates the unknown smooth functional form of a continuous covariate's effect, in order to identify a nonlinear functional relationship. The justification and estimation procedure start from a Laplace approximation of the marginal log-likelihood function and lead to a penalized log-likelihood. The expectation-maximization algorithm is used to estimate the model parameters, and the proposed methodology can then be used to assess the linearity of the continuous covariate effect via a likelihood ratio procedure. An extensive simulation study is conducted to investigate the performance of the proposed lack-of-fit test for the linearity of the continuous covariate effect. The practical use of the methodology is illustrated with fibrous histiocytoma data from the Surveillance, Epidemiology, and End Results (SEER) program database.
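The linear-spline idea underlying this abstract can be illustrated outside the Cox cure setting. The sketch below builds a truncated power basis (intercept, slope, and one hinge per knot) and fits a smooth nonlinear effect by ordinary least squares; the target function, knot placement, and grid are all invented for demonstration:

```python
import numpy as np

def linear_spline_basis(x, knots):
    # Truncated power basis for a linear spline: [1, x, (x - k)_+ for each knot k].
    cols = [np.ones_like(x), x]
    for k in knots:
        cols.append(np.maximum(x - k, 0.0))
    return np.column_stack(cols)

# Approximate a smooth nonlinear effect f(x) = sin(x) with a linear spline.
x = np.linspace(0, 2 * np.pi, 200)
y = np.sin(x)
knots = np.linspace(0.5, 2 * np.pi - 0.5, 8)

B = linear_spline_basis(x, knots)                 # 200 x (2 + 8) design matrix
coef, *_ = np.linalg.lstsq(B, y, rcond=None)      # least-squares spline fit
fit = B @ coef
max_err = np.abs(fit - y).max()                   # piecewise-linear approximation error
```

In the paper's setting the same basis would enter the Cox cure model's linear predictor, with a penalty on the hinge coefficients and the linearity test corresponding to all hinge coefficients being zero.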
In this paper, we propose a Bayesian additive Cox model for analyzing current status data based on the expectation-maximization variable selection method. This model concurrently estimates unknown parameters and identifies risk factors, which improves both model interpretability and predictive ability. To identify risk factors, we assign appropriate priors to the indicator variables that denote whether each risk factor is included. By assuming partially linear effects of the covariates, the proposed model offers flexibility in accounting for the relationship between risk factors and survival time. The baseline cumulative hazard function and the nonlinear effects are approximated via penalized B-splines to reduce the dimension of the parameters. An easy-to-implement expectation-maximization algorithm is developed using a two-stage data augmentation procedure involving latent Poisson variables. Finally, the performance of the proposed method is investigated through simulations and a real data analysis, which show promising results for the proposed Bayesian variable selection method.
Recent evidence highlights the usefulness of DNA methylation (DNAm) biomarkers as surrogates for exposure to risk factors for noncommunicable diseases in epidemiological studies and randomized trials. DNAm variability has been demonstrated to be tightly related to lifestyle behavior and exposure to environmental risk factors, ultimately providing an unbiased proxy of an individual's state of health. At present, the creation of DNAm surrogates relies on univariate penalized regression models, with the elastic-net regularizer being the gold standard for the task. Nonetheless, more advanced modeling procedures are required in the presence of multivariate outcomes with a structured dependence pattern among the study samples. In this work we propose a general framework for mixed-effects multitask learning in the presence of high-dimensional predictors to develop a multivariate DNAm biomarker from a multicenter study. A penalized estimation scheme, based on an expectation-maximization algorithm, is devised in which any penalty criterion for fixed-effects models can be conveniently incorporated into the fitting process. We apply the proposed methodology to create novel DNAm surrogate biomarkers for multiple correlated risk factors for cardiovascular diseases and comorbidities. We show that the proposed approach, modeling multiple outcomes together, outperforms state-of-the-art alternatives both in predictive power and in the biomolecular interpretation of the results.
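The univariate elastic-net baseline this abstract refers to can be sketched with scikit-learn. This is not the authors' multitask mixed-effects estimator, only the single-outcome "gold standard" it is compared against, run on invented toy data (500 CpG-like predictors, 5 truly associated):

```python
import numpy as np
from sklearn.linear_model import ElasticNet

rng = np.random.default_rng(2)
# Toy stand-in for DNAm data: 100 samples, 500 CpG-like predictors,
# with only the first 5 truly associated with the outcome.
X = rng.normal(size=(100, 500))
beta = np.zeros(500)
beta[:5] = 2.0
y = X @ beta + rng.normal(scale=0.5, size=100)

# Elastic net mixes L1 (sparsity) and L2 (grouping of correlated predictors).
model = ElasticNet(alpha=0.1, l1_ratio=0.5, max_iter=10000).fit(X, y)
selected = np.flatnonzero(model.coef_)            # indices with nonzero coefficients
```

A surrogate biomarker is then the fitted linear score `X @ model.coef_ + model.intercept_` evaluated on new methylation profiles; the paper's contribution is to fit such scores for several correlated outcomes jointly while modeling the multicenter dependence.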
One-way layouts of count data with over/under dispersion arise in many practical situations. For example, in the mice toxicology data, Barnwal and Paul (1988, Biometrika, 75(2), 215-222) sought to assess whether the means of several groups of count data are equal in the presence of such over/under dispersion. Specifically, they developed and studied five statistics, two of which are score tests, while the other three are based on data transformed to normality. After an extensive simulation study they recommended the score tests. Saha (2008, J. Stat. Plan. Inference, 138(7), 2067-2081) developed two similar test statistics for the homogeneity of the means in over/under dispersed count data situations in which no likelihood exists. Again through extensive simulations, Saha recommended a score-type statistic using a double extended quasi-likelihood (Lee and Nelder 2001, Biometrika, 88(4), 987-1006). However, as in continuous and some other discrete data situations, some observations might be missing in the one-way layout of count data. The purpose of this paper is to (i) develop estimation procedures for the parameters involved in the one-way layout of count data under different missing data scenarios, (ii) study the comparative behaviour of the score tests developed by Barnwal and Paul (1988, Biometrika, 75(2), 215-222) and the score-type statistic developed by Saha (2008, J. Stat. Plan. Inference, 138(7), 2067-2081) for complete data, and (iii) study the comparative effect of missing data on the score and score-type statistics under different missing data scenarios. Extensive Monte Carlo simulations and real-life data analysis show that for complete data as well as for data under different missing data scenarios, the score-type statistic (Saha 2008, J. Stat. Plan. Inference, 138(7), 2067-2081) has some edge in terms of power over the score test statistic (Barnwal and Paul 1988, Biometrika, 75(2), 215-222), showing that the estimation under missing data met