Mathematical Sciences Department Theses and Dissertations

Permanent URI for this collection

https://hdl.handle.net/1805/2071

For more information about the Mathematical Sciences graduate programs, visit http://www.science.iupui.edu

Regression analysis of big count data via A-optimal subsampling
(2018-07-19) Zhao, Xiaofeng; Tan, Fei; Peng, Hanxiang
There are two computational bottlenecks for Big Data analysis: (1) the data are too large for a desktop to store, and (2) the computing task takes too long to finish. While the Divide-and-Conquer approach easily breaks the first bottleneck, the Subsampling approach beats both of them simultaneously. Uniform sampling and nonuniform sampling (Leverage Scores sampling) are frequently used in the recent development of fast randomized algorithms. However, both approaches, as Peng and Tan (2018) have demonstrated, are not effective in extracting important information from data. In this thesis, we conduct regression analysis for big count data via A-optimal subsampling. We derive A-optimal sampling distributions by minimizing the trace of certain dispersion matrices in general estimating equations (GEE). We point out that the A-optimal distributions have the same running time as the full data M-estimator. To compute the distributions quickly, we propose the A-optimal Scoring Algorithm, which is implementable by parallel computing, is sequentially updatable for stream data, and has a faster running time than the full data M-estimator. We present asymptotic normality for the estimates in GEEs and in generalized count regression. A data truncation method is introduced. We conduct extensive simulations to evaluate the numerical performance of the proposed sampling distributions. We apply the proposed A-optimal subsampling method to analyze two real count data sets, the Bike Sharing data and the Blog Feedback data. Our results in both simulations and real data sets indicate that the A-optimal distributions substantially outperform the uniform distribution and have faster running times than the full data M-estimators.
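To make the general subsampling idea in this abstract concrete, here is a minimal sketch of inverse-probability-weighted Poisson regression fitted on a random subsample. It is not the thesis's A-optimal Scoring Algorithm: the abstract does not state the A-optimal sampling distribution, so the example uses uniform probabilities only, and every name in it (`weighted_poisson_fit`, `subsample_estimate`, the simulated data) is a hypothetical chosen for illustration.

```python
import numpy as np

def weighted_poisson_fit(X, y, w, n_iter=25):
    """Fit a log-link Poisson regression by Newton's method with weights w."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        mu = np.exp(X @ beta)                    # fitted means
        grad = X.T @ (w * (y - mu))              # weighted score
        hess = X.T @ (X * (w * mu)[:, None])     # weighted information
        step = np.linalg.solve(hess, grad)
        beta = beta + step
        if np.max(np.abs(step)) < 1e-10:
            break
    return beta

def subsample_estimate(X, y, probs, r, rng):
    """Draw r rows with replacement using probs, then fit with weights 1/(n*probs)."""
    n = X.shape[0]
    idx = rng.choice(n, size=r, replace=True, p=probs)
    w = 1.0 / (n * probs[idx])                   # inverse-probability weights
    return weighted_poisson_fit(X[idx], y[idx], w)

# Simulated count data (assumption: not the Bike Sharing or Blog Feedback data).
rng = np.random.default_rng(0)
n = 20000
X = np.column_stack([np.ones(n), rng.normal(size=n)])
beta_true = np.array([0.5, -0.3])
y = rng.poisson(np.exp(X @ beta_true))

uniform = np.full(n, 1.0 / n)                    # uniform sampling distribution
beta_sub = subsample_estimate(X, y, uniform, r=2000, rng=rng)
beta_full = weighted_poisson_fit(X, y, np.ones(n))
print("subsample estimate:", beta_sub)
print("full-data estimate:", beta_full)
```

Any nonuniform distribution can be passed as `probs` in place of `uniform`, provided it is positive and sums to one; the inverse-probability weights keep the subsample estimating equation unbiased for the full-data one.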
Sample Size Determination for Subsampling in the Analysis of Big Data, Multiplicative Models for Confidence Intervals and Free-Knot Changepoint Models
(2024-05) Zhang, Sheng; Peng, Hanxiang; Tan, Fei; Sarkar, Jyoti; Boukai, Ben
The dissertation consists of three parts. Motivated by subsampling in the analysis of Big Data and by data-splitting in machine learning, sample size determination for multidimensional parameters is presented in the first part. In the second part, we propose a novel approach to the construction of confidence intervals based on improved concentration inequalities. We provide the missing factor for the tail probability of a random variable, which generalizes Talagrand’s (1995) result on the missing factor in Hoeffding’s inequalities. We give the procedure for constructing confidence intervals and illustrate it with simulations. In the third part, we study irregular change-point models using free-knot splines. The consistency and asymptotic normality of the least squares estimators are proved for the irregular models in which the linear spline is not differentiable. Simulations are carried out to explore the numerical properties of the proposed models. The results are used to analyze the US COVID-19 data.
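For context on the second part, the sketch below builds a confidence interval for the mean of bounded observations from the classical two-sided Hoeffding inequality, P(|mean - mu| >= t) <= 2 exp(-2 n t^2 / (b - a)^2). This is only the baseline that the dissertation improves on; the missing-factor refinement is not given in the abstract and is not reproduced here. The function name `hoeffding_interval` and the sample values are hypothetical.

```python
import math

def hoeffding_interval(sample, lower, upper, alpha=0.05):
    """Two-sided (1 - alpha) interval for the mean of i.i.d. values in [lower, upper].

    Uses the classical Hoeffding bound, not the improved inequality of the dissertation.
    """
    n = len(sample)
    mean = sum(sample) / n
    # Solve 2 * exp(-2 * n * t**2 / (upper - lower)**2) = alpha for t.
    half_width = (upper - lower) * math.sqrt(math.log(2.0 / alpha) / (2.0 * n))
    return mean - half_width, mean + half_width

# Illustrative data, assumed to lie in [0, 1].
sample = [0.2, 0.7, 0.4, 0.9, 0.5, 0.3, 0.6, 0.8, 0.1, 0.5]
lo, hi = hoeffding_interval(sample, 0.0, 1.0, alpha=0.05)
print(f"95% Hoeffding interval: [{lo:.3f}, {hi:.3f}]")
```

The interval is distribution-free but conservative, which is the slack that sharper concentration inequalities aim to reduce.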
