Browsing by Subject "Bootstrap"
Item: Massive data K-means clustering and bootstrapping via A-optimal Subsampling (2019-08)
Zhou, Dali; Tan, Fei; Peng, Hanxiang; Boukai, Benzion; Sarkar, Jyotirmoy; Li, Peijun
For massive data analysis, computational bottlenecks arise in two ways. First, the data may be so large that they are not easy to store and read. Second, the computation time may be too long. To tackle these problems, parallel computing algorithms such as Divide-and-Conquer have been proposed, but one drawback is that some correlations may be lost when the data are divided into chunks. Subsampling is another way to address both problems of massive data analysis while taking correlation into consideration. Uniform sampling is simple and fast but inefficient; see the detailed discussions in Mahoney (2011) and Peng and Tan (2018). The bootstrap approach uses uniform sampling and is computationally intensive, which becomes enormously challenging when the data size is massive. k-means clustering is a standard method in data analysis. It iterates to find centroids, which becomes difficult when the data size is massive. In this thesis, we propose optimal subsampling for massive data bootstrapping and massive data k-means clustering. We seek the sampling distribution that minimizes the trace of the variance-covariance matrix of the resulting subsampling estimators; this is referred to as A-optimal in the literature. Equivalently, we define the optimal sampling distribution by minimizing the sum of the component variances of the subsampling estimators. We show that the subsampling k-means centroids consistently approximate the full data centroids, and prove asymptotic normality using empirical process theory. We perform extensive simulations to evaluate the numerical performance of the proposed optimal subsampling approach through the empirical MSE and running times. We also apply the subsampling approach to real data.
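The A-optimality criterion described above can be illustrated with a toy analogue. The following is a minimal sketch and not the thesis's estimator: it assumes the target is the full-data mean vector, that the subsample is drawn with replacement, and that each draw is weighted by the inverse of its sampling probability. Under those assumptions, the trace of the estimator's covariance is minimized by probabilities proportional to the row norms (a Cauchy-Schwarz argument). All function names are hypothetical.

```python
import math
import random


def uniform_probs(points):
    """Uniform sampling distribution over the n data points."""
    n = len(points)
    return [1.0 / n] * n


def norm_probs(points):
    """Probabilities proportional to Euclidean row norms.
    Assumption of this sketch: this minimizes the covariance trace
    for the inverse-probability-weighted mean estimator below."""
    norms = [math.sqrt(sum(v * v for v in x)) for x in points]
    total = sum(norms)
    return [w / total for w in norms]


def trace_variance(points, probs, r):
    """Trace of the covariance of the subsample estimator of the mean,
    (1/r) * sum_j x_{i_j} / (n * p_{i_j}), with r draws with replacement."""
    n = len(points)
    dim = len(points[0])
    mean = [sum(x[k] for x in points) / n for k in range(dim)]
    # Second moment of a single weighted draw; points with p == 0 are skipped.
    second = sum(
        sum(v * v for v in x) / (n * n * p)
        for x, p in zip(points, probs)
        if p > 0
    )
    return (second - sum(m * m for m in mean)) / r


def subsample_mean(points, probs, r, rng):
    """Inverse-probability-weighted estimate of the full-data mean."""
    n = len(points)
    dim = len(points[0])
    idx = rng.choices(range(n), weights=probs, k=r)
    est = [0.0] * dim
    for i in idx:
        for k in range(dim):
            est[k] += points[i][k] / (n * probs[i] * r)
    return est


if __name__ == "__main__":
    rng = random.Random(0)
    # Synthetic data with a few large rows, where uniform sampling is inefficient.
    data = [[rng.gauss(0, 1) * (50 if i % 100 == 0 else 1) for _ in range(3)]
            for i in range(1000)]
    r = 50
    print("trace, uniform:", trace_variance(data, uniform_probs(data), r))
    print("trace, norm-proportional:", trace_variance(data, norm_probs(data), r))
    print("estimate:", subsample_mean(data, norm_probs(data), r, rng))
```

The comparison of the two traces mirrors the abstract's comparison of A-optimal and uniform subsampling, but only for this simplified mean-estimation setting; the k-means and regression estimators studied in the thesis have their own optimal distributions.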
For the massive data bootstrap, we conducted a large simulation study in the framework of linear regression, based on the A-optimal theory proposed by Peng and Tan (2018). We focus on the performance of confidence intervals computed from A-optimal subsampling, including coverage probabilities, interval lengths, and running times. In both bootstrap and clustering, we compared A-optimal subsampling with uniform subsampling.

Item: Statistical methods to study heterogeneity of treatment effects (2015-09-25)
Taft, Lin H.; Shen, Changyu; Li, Xiaochun; Chen, Peng-Sheng; Wessel, Jennifer
Randomized studies are designed to estimate the average treatment effect (ATE) of an intervention. Individuals may derive effects that differ quantitatively, or even qualitatively, from the ATE; this is called heterogeneity of treatment effect. It is important to detect the existence of heterogeneity in the treatment responses and to identify the different sub-populations. Two corresponding statistical methods will be discussed in this talk: a hypothesis testing procedure and a mixture-model based approach. The hypothesis testing procedure was constructed to test for the existence of a treatment effect in sub-populations. The test is nonparametric and can be applied to all types of outcome measures. A key innovation of this test is to build stochastic search into the test statistic to detect signals that may not be linearly related to the multiple covariates. Simulations were performed to compare the proposed test with existing methods. A power calculation strategy was also developed for the proposed test at the design stage. The mixture-model based approach was developed to identify and study the sub-populations with different treatment effects from an intervention. A latent binary variable was used to indicate whether or not a subject was in a sub-population with average treatment benefit.
The mixture model combines a logistic formulation of the latent variable with proportional hazards models. The parameters of the mixture model were estimated by the EM algorithm. The properties of the estimators were then studied by simulation. Finally, all of the above methods were applied to a real randomized study in a low ejection fraction population that compared the implantable cardioverter defibrillator (ICD) with conventional medical therapy in reducing total mortality.