Massive data K-means clustering and bootstrapping via A-optimal Subsampling

Zhou, Dali

dc.contributor.advisor	Tan, Fei
dc.contributor.advisor	Peng, Hanxiang
dc.contributor.author	Zhou, Dali
dc.contributor.other	Boukai, Benzion
dc.contributor.other	Sarkar, Jyotirmoy
dc.contributor.other	Li, Peijun
dc.date.accessioned	2019-07-30T12:44:14Z
dc.date.available	2019-07-30T12:44:14Z
dc.date.issued	2019-08
dc.degree.date	2019	en_US
dc.degree.discipline	Mathematical Sciences	en
dc.degree.grantor	Purdue University	en_US
dc.degree.level	Ph.D.	en_US
dc.description	Purdue University West Lafayette (PUWL)	en_US
dc.description.abstract	Massive data analysis faces two computational bottlenecks. First, the data may be too large to store and read easily. Second, the computation time may be too long. To tackle these problems, parallel computing algorithms such as Divide-and-Conquer have been proposed, but one drawback is that some correlations may be lost when the data are divided into chunks. Subsampling is another way to address both problems simultaneously while taking correlation into consideration. Uniform sampling is simple and fast, but it is inefficient; see the detailed discussions in Mahoney (2011) and Peng and Tan (2018). The bootstrap approach uses uniform sampling and is computationally intensive, which becomes a serious challenge when the data size is massive. k-means clustering is a standard method in data analysis. It iterates to find centroids, which becomes difficult when the data size is massive. In this thesis, we propose optimal subsampling for massive data bootstrapping and massive data k-means clustering. We seek the sampling distribution that minimizes the trace of the variance-covariance matrix of the resulting subsampling estimators. This criterion is referred to as A-optimality in the literature. Equivalently, we define the optimal sampling distribution by minimizing the sum of the component variances of the subsampling estimators. We show that the subsampling k-means centroids consistently approximate the full data centroids, and we prove asymptotic normality using empirical process theory. We perform extensive simulations to evaluate the numerical performance of the proposed optimal subsampling approach through the empirical MSE and running times. We also apply the subsampling approach to real data. For the massive data bootstrap, we conduct a large simulation study in the framework of linear regression, based on the A-optimal theory proposed by Peng and Tan (2018). We focus on the performance of confidence intervals computed from A-optimal subsampling, including coverage probabilities, interval lengths, and running times. For both the bootstrap and clustering, we compare A-optimal subsampling with uniform subsampling.	en_US
dc.identifier.uri	https://hdl.handle.net/1805/20024
dc.identifier.uri	http://dx.doi.org/10.7912/C2/2409
dc.language.iso	en_US	en_US
dc.subject	Kmeans	en_US
dc.subject	Bootstrap	en_US
dc.subject	Subsampling	en_US
dc.title	Massive data K-means clustering and bootstrapping via A-optimal Subsampling	en_US
dc.type	Thesis	en
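
The A-optimality criterion named in the abstract (minimizing the trace of the variance-covariance matrix of a subsampling estimator) can be illustrated on a much simpler problem than the ones treated in the thesis. The sketch below is not the thesis's method for k-means or regression. It assumes a toy setting: estimating the full-data mean vector from r draws taken with replacement, using inverse-probability weights. In that setting the trace is minimized by sampling each point with probability proportional to its Euclidean norm, which follows from the Cauchy-Schwarz inequality. The function names `a_optimal_probs`, `subsample_mean`, and `trace_variance` are hypothetical and chosen only for this illustration.

```python
import math
import random


def a_optimal_probs(points):
    """Toy A-optimal probabilities for the mean: proportional to each point's norm.

    Assumes at least one point is nonzero, so the norms do not all vanish.
    """
    norms = [math.sqrt(sum(c * c for c in p)) for p in points]
    total = sum(norms)
    return [nrm / total for nrm in norms]


def subsample_mean(points, probs, r, rng):
    """Inverse-probability-weighted estimate of the full-data mean.

    Draws r indices with replacement according to probs; each drawn point is
    weighted by 1 / (n * r * probs[i]) so that the estimate is unbiased.
    """
    n, d = len(points), len(points[0])
    drawn = rng.choices(range(n), weights=probs, k=r)
    est = [0.0] * d
    for i in drawn:
        w = 1.0 / (n * r * probs[i])
        for j in range(d):
            est[j] += w * points[i][j]
    return est


def trace_variance(points, probs, r):
    """Exact trace of the covariance matrix of subsample_mean under probs."""
    n, d = len(points), len(points[0])
    mean = [sum(p[j] for p in points) / n for j in range(d)]
    # Points with probability zero are skipped; under norm-proportional
    # sampling these are exactly the zero vectors, which contribute nothing.
    second = sum(
        sum(c * c for c in p) / (n * n * q)
        for p, q in zip(points, probs)
        if q > 0
    )
    return (second - sum(m * m for m in mean)) / r


if __name__ == "__main__":
    rng = random.Random(0)
    # Synthetic data (an assumption for the demo): mostly small points,
    # with every tenth point stretched along the first coordinate.
    data = [
        [rng.gauss(0, 1) * (10 if i % 10 == 0 else 1), rng.gauss(0, 1)]
        for i in range(1000)
    ]
    uniform = [1.0 / len(data)] * len(data)
    optimal = a_optimal_probs(data)
    print("trace, uniform sampling:  ", trace_variance(data, uniform, 50))
    print("trace, norm-proportional: ", trace_variance(data, optimal, 50))
    print("one subsample estimate:   ", subsample_mean(data, optimal, 50, rng))
```

Running the script prints the exact trace under uniform and under norm-proportional sampling for the same subsample size, so the two schemes can be compared directly, in the same spirit as the uniform-versus-A-optimal comparison the abstract describes.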

Files

Original bundle

Name: Dali_Zhou_s_Dissertation.pdf
Size: 1.01 MB
Format: Adobe Portable Document Format
Description: Main article

Collections

Mathematical Sciences Department Theses and Dissertations