Sample size determination for multidimensional parameters and the A-optimal subsampling in a big data linear regression model

Date
2025
Language
English
Found At
Taylor & Francis
Abstract

To efficiently approximate the least squares estimator (LSE) in a Big Data linear regression model by subsampling, we derived optimal sampling distributions by minimizing the trace norm of the covariance matrix of a smooth function of the subsampling LSE. We developed an algorithm that computes the subsampling LSE in significantly less time than the full-sample LSE requires, and we showed that, under suitable conditions, the subsampling LSE is asymptotically normal almost surely for an arbitrary sampling distribution. Motivated by the need for subsampling in Big Data analysis and for data splitting in machine learning, we also investigated sample size determination (SSD) for multidimensional parameters and derived analytical formulas for calculating sample sizes. Through extensive simulations and real-world data applications, we assessed the numerical properties of both the subsampling approach and the SSD methodology. The A-optimal subsampling method significantly outperformed uniform and leverage-score subsampling, and the SSD formulas provided a theoretical basis for selecting sample sizes.
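As a rough illustration of the subsampling idea summarized in the abstract, the Python sketch below draws rows with replacement from a sampling distribution and fits an inverse-probability-weighted least squares on the draw. The abstract does not state the paper's sampling distribution, so the function `a_opt_style_probs` is an assumption: it uses a form common in the subsampling literature (absolute pilot residual times the norm of the inverse moment matrix applied to each row), mixed with a uniform component so that no probability is zero. The function names, the pilot size `r0`, and the mixing weight `alpha` are hypothetical and are not taken from the article.

```python
import numpy as np


def subsample_lse(X, y, probs, r, rng):
    """Weighted LSE on r rows drawn with replacement according to probs."""
    idx = rng.choice(X.shape[0], size=r, replace=True, p=probs)
    w = 1.0 / (r * probs[idx])            # inverse-probability weights
    sw = np.sqrt(w)
    beta, *_ = np.linalg.lstsq(X[idx] * sw[:, None], y[idx] * sw, rcond=None)
    return beta


def a_opt_style_probs(X, y, r0, rng, alpha=0.1):
    """Illustrative A-optimality-style probabilities (an assumed form,
    not necessarily the distribution derived in the article)."""
    n = X.shape[0]
    pilot = rng.choice(n, size=r0, replace=False)    # uniform pilot subsample
    Xp, yp = X[pilot], y[pilot]
    beta0, *_ = np.linalg.lstsq(Xp, yp, rcond=None)  # pilot estimate
    M_inv = np.linalg.inv(Xp.T @ Xp / r0)            # inverse moment matrix
    resid = np.abs(y - X @ beta0)                    # |pilot residual| per row
    norms = np.linalg.norm(X @ M_inv, axis=1)        # ||M_inv x_i|| per row
    scores = resid * norms
    # Mix with the uniform distribution so every probability stays positive.
    return (1.0 - alpha) * scores / scores.sum() + alpha / n


# Small synthetic demonstration (simulated data, not from the article).
rng = np.random.default_rng(0)
n, p, r = 20_000, 5, 1_000
X = rng.standard_normal((n, p))
beta_true = np.arange(1.0, p + 1.0)
y = X @ beta_true + rng.standard_normal(n)

probs = a_opt_style_probs(X, y, r0=500, rng=rng)
beta_sub = subsample_lse(X, y, probs, r, rng)
beta_unif = subsample_lse(X, y, np.full(n, 1.0 / n), r, rng)
print("weighted subsample error:", np.linalg.norm(beta_sub - beta_true))
print("uniform subsample error: ", np.linalg.norm(beta_unif - beta_true))
```

A single run like this only shows the mechanics; comparing sampling distributions meaningfully would require repeating the draw many times and averaging the errors.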

Cite As
Zhang, S., Tan, F., & Peng, H. (2025). Sample size determination for multidimensional parameters and the A-optimal subsampling in a big data linear regression model. Journal of Statistical Computation and Simulation, 95(3), 628–653. https://doi.org/10.1080/00949655.2024.2434669
Journal
Journal of Statistical Computation and Simulation
Type
Article
Version
Author's manuscript
This item is under embargo.