Regression analysis of big count data via a-optimal subsampling

dc.contributor.advisor: Tan, Fei
dc.contributor.advisor: Peng, Hanxiang
dc.contributor.author: Zhao, Xiaofeng
dc.date.accessioned: 2018-07-27T19:36:46Z
dc.date.available: 2018-07-27T19:36:46Z
dc.date.issued: 2018-07-19
dc.degree.date: 2018 (en_US)
dc.degree.discipline: Mathematical Sciences (en)
dc.degree.grantor: Purdue University (en_US)
dc.degree.level: Ph.D. (en_US)
dc.description: Indiana University-Purdue University Indianapolis (IUPUI) (en_US)
dc.description.abstract: There are two computational bottlenecks in Big Data analysis: (1) the data are too large to store on a desktop computer, and (2) the computation takes too long to finish. While the divide-and-conquer approach easily breaks the first bottleneck, the subsampling approach overcomes both at once. Uniform sampling and nonuniform leverage-score sampling are frequently used in recently developed fast randomized algorithms. However, as Peng and Tan (2018) have demonstrated, neither approach is effective in extracting important information from data. In this thesis, we conduct regression analysis for big count data via A-optimal subsampling. We derive A-optimal sampling distributions by minimizing the trace of certain dispersion matrices in general estimating equations (GEE). We point out that computing the A-optimal distributions takes the same running time as computing the full-data M-estimator. To compute the distributions quickly, we propose the A-optimal Scoring Algorithm, which can be implemented with parallel computing, can be updated sequentially for streaming data, and runs faster than the full-data M-estimator. We establish asymptotic normality of the estimators in GEEs and in generalized count regression. We also introduce a data truncation method. We conduct extensive simulations to evaluate the numerical performance of the proposed sampling distributions. We apply the proposed A-optimal subsampling method to two real count data sets, the Bike Sharing data and the Blog Feedback data. Our results on both simulated and real data indicate that the A-optimal distributions substantially outperform the uniform distribution and have faster running times than the full-data M-estimators. (en_US)
dc.identifier.doi: 10.7912/C2TW8W
dc.identifier.uri: https://hdl.handle.net/1805/16870
dc.identifier.uri: http://dx.doi.org/10.7912/C2/2404
dc.language.iso: en_US (en_US)
dc.subject: Big Count Data (en_US)
dc.subject: A-optimal (en_US)
dc.subject: Regression (en_US)
dc.title: Regression analysis of big count data via a-optimal subsampling (en_US)
dc.type: Thesis (en)
Files
Original bundle
Name: Zhao_thesis.pdf
Size: 1.44 MB
Format: Adobe Portable Document Format
Description: Main thesis