Browsing by Author "Peng, Hanxiang"
Now showing 1 - 10 of 12
Item: A-Optimal Subsampling For Big Data General Estimating Equations (2019-08)
Cheung, Chung Ching; Peng, Hanxiang; Rubchinsky, Leonid; Boukai, Benzion; Lin, Guang; Al Hasan, Mohammad

A significant hurdle for analyzing big data is the lack of effective technology and statistical inference methods. A popular approach for analyzing data with large sample sizes is subsampling. Many subsampling probabilities have been introduced in the literature (Ma et al., 2015) for the linear model. In this dissertation, we focus on generalized estimating equations (GEE) for big data and derive the asymptotic normality of the estimators with and without resampling. We also give asymptotic representations of the biases of the estimators with and without resampling, and show that the bias becomes significant when the data are high-dimensional. We further present a novel subsampling method, called A-optimal subsampling, derived by minimizing the trace of certain dispersion matrices (Peng and Tan, 2018), and derive the asymptotic normality of the estimator based on A-optimal subsampling. We conduct extensive simulations on large, high-dimensional samples to evaluate the performance of the proposed methods, using the MSE as the criterion. High-dimensional data are investigated further, and we show through simulations that minimizing the asymptotic variance does not imply minimizing the MSE when the bias is not negligible. We apply the proposed subsampling method to a real data set, the gas sensor data, which has more than four million data points. In both the simulations and the real data analysis, our A-optimal method outperforms the traditional uniform subsampling method.
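To make the A-optimal idea in the entry above concrete, here is a minimal sketch for the plain linear model rather than the dissertation's GEE setting. The subsampling probabilities are taken proportional to |residual_i| * ||(X'X)^{-1} x_i||, a common A-optimal-style choice in the subsampling literature; the exact distributions in the entry above are those derived in Peng and Tan (2018), and all sizes and names below are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic "full" data: n points, p features, y = X beta + noise.
n, p, r = 100_000, 10, 2_000
X = rng.standard_normal((n, p))
beta_true = np.ones(p)
y = X @ beta_true + rng.standard_normal(n)

# Pilot fit on a small uniform subsample to get rough residuals.
pilot = rng.choice(n, size=1_000, replace=False)
beta_pilot, *_ = np.linalg.lstsq(X[pilot], y[pilot], rcond=None)
resid = np.abs(y - X @ beta_pilot)

# A-optimal-style probabilities |resid_i| * ||(X'X)^{-1} x_i||, normalized
# (an illustrative stand-in for the derived trace-minimizing distribution).
M_inv = np.linalg.inv(X.T @ X)
pi = resid * np.linalg.norm(X @ M_inv, axis=1)
pi /= pi.sum()

# Draw the subsample and solve weighted least squares with weights 1/pi;
# inverse-probability weighting keeps the estimating equation unbiased.
idx = rng.choice(n, size=r, replace=True, p=pi)
w = np.sqrt(1.0 / pi[idx])
beta_sub, *_ = np.linalg.lstsq(X[idx] * w[:, None], y[idx] * w, rcond=None)

# Uniform subsample of the same size, for comparison.
idx_u = rng.choice(n, size=r, replace=True)
beta_unif, *_ = np.linalg.lstsq(X[idx_u], y[idx_u], rcond=None)

print("A-optimal-style MSE:", np.mean((beta_sub - beta_true) ** 2))
print("uniform MSE:        ", np.mean((beta_unif - beta_true) ** 2))
```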
Item: Asymptotic normality of quadratic forms with random vectors of increasing dimension (Elsevier, 2018-03)
Peng, Hanxiang; Schick, Anton; Mathematical Sciences, School of Science

This paper provides sufficient conditions for the asymptotic normality of quadratic forms of averages of random vectors of increasing dimension and improves on conditions found in the literature. Such results are needed in applications of Owen's empirical likelihood when the number of constraints is allowed to grow with the sample size; indeed, the results of this paper are already used for this purpose in Peng and Schick (2013). We also demonstrate how our results can be used to obtain the asymptotic distribution of the empirical likelihood with an increasing number of constraints under contiguous alternatives, and we discuss further potential applications. The first example focuses on a chi-square test with an increasing number of cells. The second treats testing for the equality of the marginal distributions of a bivariate random vector. The third generalizes a result of Schott (2005) by showing that a standardized version of his test for diagonality of the dispersion matrix of a normal random vector is asymptotically standard normal even if the dimension increases faster than the sample size; Schott's result requires the dimension and the sample size to be of the same order.

Item: Change Point Modeling of Covid-19 Data in the United States (Society of Statistics, Computer and Applications (SSCA), 2020-07-28)
Zhang, Sheng; Xu, Ziyue; Peng, Hanxiang; Mathematical Sciences, School of Science

To simultaneously model the change point and the possibly nonlinear relationship in the Covid-19 data of the US, a continuous second-order free-knot spline model was proposed. Using the least squares method, the change point of the daily new cases against the total confirmed cases up to the previous day was estimated to be 04 April 2020. Before that point, the daily new cases were proportional to the total cases with a ratio of 0.287, suggesting that each patient had a 28.7% chance of infecting another person each day. After the point, however, the ratio was no longer maintained and the daily new cases decreased slowly. At the individual state level, most states were found to have change points. Before its change point, each state's daily new cases were likewise proportional to its total cases, and the ratios were about the same across states except for New York State, where the ratio was much higher (probably due to its high population density and heavy use of public transportation). After the points, however, different states showed different patterns. One interesting observation was that a state's change point lagged about three weeks behind its declaration of emergency. This might suggest that there was a lag period, which could help identify possible causes of the second wave. In the end, consistency and asymptotic normality of the estimates were briefly discussed for the case where the criterion functions are continuous but not differentiable (irregular).
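A minimal sketch of the change-point fit described in the entry above, reading "second-order" in the spline-order convention (order 2 = continuous piecewise linear) and profiling least squares over a grid of candidate knots; the function name and all numbers are illustrative, and the actual analysis may differ in the number of knots and the optimizer.

```python
import numpy as np

def fit_free_knot_line(x, y, n_grid=200):
    """Continuous piecewise-linear fit with one free knot.

    Model: y = b0 + b1*x + b2*max(x - tau, 0), linear in (b0, b1, b2)
    once the knot tau is fixed, so we profile: solve least squares on
    a grid of tau values and keep the best fit.
    """
    best = None
    for tau in np.linspace(np.quantile(x, 0.05), np.quantile(x, 0.95), n_grid):
        D = np.column_stack([np.ones_like(x), x, np.maximum(x - tau, 0.0)])
        coef, *_ = np.linalg.lstsq(D, y, rcond=None)
        rss = np.sum((y - D @ coef) ** 2)
        if best is None or rss < best[0]:
            best = (rss, tau, coef)
    return best  # (rss, knot estimate, coefficients)

# Toy data mimicking "daily new cases vs. cumulative cases": slope 0.287
# before the change point, much flatter afterwards.
rng = np.random.default_rng(1)
total = np.sort(rng.uniform(0, 300_000, 400))
new = 0.287 * np.minimum(total, 120_000) + 0.02 * np.maximum(total - 120_000, 0)
new += rng.normal(0, 500, total.size)

rss, knot, coef = fit_free_knot_line(total, new)
print(f"estimated knot: {knot:.0f}, slope before: {coef[1]:.3f}, "
      f"slope after: {coef[1] + coef[2]:.3f}")
```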
Item: Combining Multivariate Statistical Methods and Spatial Analysis to Characterize Water Quality Conditions in the White River Basin, Indiana, U.S.A. (2011-02-25)
Gamble, Andrew Stephan; Babbar-Sebens, Meghna; Tedesco, Lenore P.; Peng, Hanxiang

This research performs a comparative study of techniques for combining spatial data and multivariate statistical methods to characterize water quality conditions in a river basin. The study was performed on the White River basin in central Indiana and uses sixteen physical and chemical water quality parameters collected from 44 monitoring sites, along with spatial data on land use and land cover, soil characteristics, terrain characteristics, eco-regions, and related factors. Parameters derived from the spatial data were analyzed using ArcHydro tools and included in the multivariate analyses to create classification equations that relate spatial and spatio-temporal attributes of the watershed to water quality data at the monitoring stations. The study compares various statistical estimates (mean, geometric mean, trimmed mean, and median) of the monitored water quality variables for representing annual and seasonal water quality conditions. The relationship between these estimates and the spatial data is then modeled via linear and non-linear multivariate methods. The linear method uses a combination of principal component analysis, cluster analysis, and discriminant analysis, whereas the non-linear method uses a combination of Kohonen self-organizing maps, cluster analysis, and support vector machines. The final models were tested with recent, independent data collected from stations in the Eagle Creek watershed, within the White River basin. In 6 of 20 models the support vector machine classified the Eagle Creek stations more accurately, and in 2 of 20 models the linear discriminant analysis model achieved better results; neither approach had an apparent advantage in the remaining 12 models. This research provides insight into the variability and uncertainty in the interpretation of the various statistical estimates and statistical models when water quality monitoring data are combined with spatial data to characterize general spatial and spatio-temporal trends.
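A schematic of the linear pipeline named in the entry above (principal component analysis, then clustering, then discriminant analysis), shown on synthetic stand-in data; the cluster count, component count, and all variable names are illustrative assumptions, not values from the study.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(2)

# Stand-in for 44 monitoring sites x 16 water quality parameters.
sites = rng.standard_normal((44, 16))

# 1) PCA reduces the correlated parameters to a few components.
scores = PCA(n_components=4).fit_transform(StandardScaler().fit_transform(sites))

# 2) Cluster the sites in PCA space into water quality classes.
classes = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(scores)

# 3) Discriminant analysis: classification equations predicting the class
#    from (stand-in) spatial attributes of each site's drainage area.
spatial = rng.standard_normal((44, 6))   # e.g., land use fractions, slope, ...
lda = LinearDiscriminantAnalysis().fit(spatial, classes)

# New, independent stations (the study used Eagle Creek stations here).
new_spatial = rng.standard_normal((5, 6))
print("predicted classes:", lda.predict(new_spatial))
```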
Item: Efficient Inference and Dominant-Set Based Clustering for Functional Data (2024-05)
Wang, Xiang; Wang, Honglang; Boukai, Benzion; Tan, Fei; Peng, Hanxiang

This dissertation addresses three progressively fundamental problems in functional data analysis: (1) to perform efficient inference for the functional mean model while accounting for within-subject correlation, we propose a refined and bias-corrected empirical likelihood method; (2) to identify functional subjects potentially drawn from different populations, we propose a dominant-set based unsupervised clustering method using a similarity matrix; and (3) to learn the similarity matrix from various similarity metrics for functional data clustering, we propose a modularity-guided and dominant-set based semi-supervised clustering method. For the first problem, the empirical likelihood method is used for inference about the mean function of functional data by constructing a refined and bias-corrected estimating equation. The proposed estimating equation not only improves efficiency but also makes empirical likelihood inference practically feasible by properly incorporating within-subject correlation, which previous studies had not achieved. For the second problem, the dominant-set based unsupervised clustering method maximizes within-cluster similarity and applies to functional data with a flexible choice of similarity measures between curves. It is a hierarchical bipartition procedure under a penalized optimization framework, with the tuning parameter selected by maximizing the modularity of the resulting two clusters; the approach is inspired by the concept of a dominant set in graph theory and is solved by replicator dynamics from game theory. It is robust both to imbalanced group sizes and to outliers, overcoming a limitation of many existing clustering methods. For the third problem, a metric-based semi-supervised clustering method is proposed in which the similarity metric is learned by modularity maximization and then passed to the dominant-set based clustering procedure above. In the semi-supervised setting, where some cluster memberships are known, the goal is to determine the best linear combination of candidate similarity metrics to serve as the final metric and enhance clustering performance. Besides this global metric-based algorithm, a second algorithm learns an individual metric for each cluster, permitting overlapping cluster membership, which is innovatively different from many existing methods. The method applies to functional data with various similarity metrics between curves and inherits the robustness to imbalanced group sizes that is intrinsic to the dominant-set based clustering approach. In all three problems, the advantages of the proposed methods are demonstrated through extensive empirical investigations using simulations as well as real data applications.

Item: Inference about the slope in linear regression: an empirical likelihood approach (Springer, 2017)
Müller, Ursula U.; Peng, Hanxiang; Schick, Anton; Mathematical Sciences, School of Science

We present a new, efficient maximum empirical likelihood estimator for the slope in linear regression with independent errors and covariates. The estimator does not require estimation of the influence function, in contrast to other approaches, and is easy to obtain numerically. Our approach can also be used in the model with responses missing at random, for which we recommend a complete case analysis; this suffices thanks to results by Müller and Schick (Bernoulli 23:2693-2719, 2017), which demonstrate that efficiency is preserved. We provide confidence intervals and tests for the slope, based on the limiting chi-square distribution of the empirical likelihood, and a uniform expansion for the empirical likelihood ratio. The article concludes with a small simulation study.

Item: Large-sample estimation and inference in multivariate single-index models (Elsevier, 2019-05)
Wu, Jingwei; Peng, Hanxiang; Tu, Wanzhu; Mathematical Sciences, School of Science

By optimizing index functions against different outcomes, we propose a multivariate single-index model (SIM) for the development of medical indices that simultaneously work with multiple outcomes. Fitting a multivariate SIM is not fundamentally different from fitting a univariate SIM, as the former can be written as a sum of multiple univariate SIMs with appropriate indicator functions. What have not been carefully studied are the theoretical properties of the parameter estimators; because of the lack of asymptotic results, no formal inference procedure has been made available for multivariate SIMs. In this paper, we examine the asymptotic properties of the multivariate SIM parameter estimators. We show that, under mild regularity conditions, estimators for the multivariate SIM parameters are indeed …
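A minimal sketch of the replicator-dynamics step used by the dominant-set clustering in the "Efficient Inference and Dominant-Set Based Clustering for Functional Data" entry above. Given a nonnegative similarity matrix A, the iteration x <- x * (Ax) / (x'Ax) converges to a local solution whose support corresponds to a dominant set; thresholding the support extracts one cluster. The helper name, the Gaussian-kernel similarity, the threshold, and the stopping rule are all illustrative choices, not the dissertation's exact procedure.

```python
import numpy as np

def dominant_set(A, n_iter=2000, tol=1e-10, support_tol=1e-5):
    """Extract one dominant set of a nonnegative similarity matrix A
    via discrete replicator dynamics: x <- x * (A x) / (x' A x)."""
    n = A.shape[0]
    x = np.full(n, 1.0 / n)            # start from the barycenter
    for _ in range(n_iter):
        Ax = A @ x
        x_new = x * Ax / (x @ Ax)
        if np.linalg.norm(x_new - x, 1) < tol:
            x = x_new
            break
        x = x_new
    return x > support_tol             # membership of the dominant set

# Toy functional data: two groups of curves observed on a common grid.
rng = np.random.default_rng(3)
t = np.linspace(0, 1, 50)
curves = np.vstack([np.sin(2 * np.pi * t) + 0.3 * rng.standard_normal((12, 50)),
                    np.cos(2 * np.pi * t) + 0.3 * rng.standard_normal((8, 50))])

# One possible similarity between curves: Gaussian kernel of L2 distance.
d = np.linalg.norm(curves[:, None, :] - curves[None, :, :], axis=2)
A = np.exp(-(d / d.mean()) ** 2)
np.fill_diagonal(A, 0.0)               # zero diagonal, standard for dominant sets

print("first extracted cluster:", np.where(dominant_set(A))[0])
```

Bipartitioning as in the entry would repeat this extraction on the remaining curves and pick the split maximizing modularity.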
Item: Massive Data K-means Clustering and Bootstrapping via A-Optimal Subsampling (2019-08)
Zhou, Dali; Tan, Fei; Peng, Hanxiang; Boukai, Benzion; Sarkar, Jyotirmoy; Li, Peijun

For massive data analysis, computational bottlenecks arise in two ways: the data may be too large to store and read easily, and the computation may take too long. To tackle these problems, parallel computing algorithms such as Divide-and-Conquer have been proposed, though one drawback is that some correlations may be lost when the data are divided into chunks. Subsampling is another way to address massive data analysis while taking correlation into consideration. Uniform sampling is simple and fast but inefficient; see the detailed discussions in Mahoney (2011) and Peng and Tan (2018). The bootstrap uses uniform sampling and is computing-time intensive, which becomes enormously challenging when the data size is massive. K-means clustering is a standard method in data analysis; it iterates to find centroids, which is difficult when the data size is massive. In this thesis, we propose A-optimal subsampling for massive data bootstrapping and massive data k-means clustering. We seek the sampling distribution that minimizes the trace of the variance-covariance matrix of the resulting subsampling estimators, which is referred to as A-optimality in the literature; equivalently, we define the optimal sampling distribution by minimizing the sum of the component variances of the subsampling estimators. We show that the subsampling k-means centroids consistently approximate the full-data centroids, and we prove their asymptotic normality using empirical process theory. We perform extensive simulations to evaluate the numerical performance of the proposed optimal subsampling approach through the empirical MSE and running times, and we also apply the subsampling approach to real data. For massive data bootstrap, we conducted a large simulation study in the framework of linear regression based on the A-optimal theory proposed by Peng and Tan (2018), focusing on the performance of confidence intervals computed from A-optimal subsampling, including coverage probabilities, interval lengths, and running times. In both the bootstrap and the clustering settings we compared A-optimal subsampling with uniform subsampling.

Item: Maximum empirical likelihood estimation and related topics (IMS, 2018)
Peng, Hanxiang; Schick, Anton; Mathematical Sciences, School of Science

This article develops a theory of maximum empirical likelihood estimation and empirical likelihood ratio testing with irregular and estimated constraint functions that parallels the theory for parametric models and is tailored to semiparametric models. The key is a uniform local asymptotic normality condition for the local empirical likelihood ratio, which is shown to hold under mild assumptions on the constraint function. We discuss applications of our results to inference problems about quantiles under possibly additional information on the underlying distribution and to residual-based inference about quantiles.
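For readers unfamiliar with the empirical likelihood machinery behind the two entries above, here is a minimal univariate sketch: Owen's empirical likelihood ratio for a mean, with the Lagrange multiplier found by a root search. The entries above use the same ingredients with estimated, irregular, and growing sets of constraint functions; the helper name and data below are illustrative.

```python
import numpy as np
from scipy.optimize import brentq

def el_log_ratio(x, mu):
    """-2 log empirical likelihood ratio for the mean mu (Owen).

    Weights p_i = 1 / (n * (1 + lam * z_i)) with z_i = x_i - mu, where
    lam solves sum_i z_i / (1 + lam * z_i) = 0.
    """
    z = x - mu
    if z.min() >= 0 or z.max() <= 0:
        return np.inf                     # mu outside the convex hull
    # lam must keep every 1 + lam * z_i positive:
    lo, hi = -1.0 / z.max(), -1.0 / z.min()
    eps = 1e-10 * (hi - lo)
    score = lambda lam: np.sum(z / (1.0 + lam * z))  # decreasing in lam
    lam = brentq(score, lo + eps, hi - eps)
    return 2.0 * np.sum(np.log1p(lam * z))

rng = np.random.default_rng(4)
x = rng.exponential(scale=2.0, size=200)  # true mean 2.0

# -2 log R is asymptotically chi-square(1); 3.84 is the 95% cutoff.
for mu in (1.6, 2.0, 2.4):
    print(f"mu = {mu}: -2 log R = {el_log_ratio(x, mu):.2f}")
```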
Item: Regression Analysis of Big Count Data via A-Optimal Subsampling (2018-07-19)
Zhao, Xiaofeng; Tan, Fei; Peng, Hanxiang

There are two computational bottlenecks in big data analysis: (1) the data are too large for a desktop to store, and (2) the computing task takes too long to finish. While the Divide-and-Conquer approach easily breaks the first bottleneck, the subsampling approach simultaneously beats both. Uniform sampling and nonuniform sampling, such as leverage-score sampling, are frequently used in the recent development of fast randomized algorithms; however, as Peng and Tan (2018) have demonstrated, both are ineffective in extracting important information from data. In this thesis, we conduct regression analysis for big count data via A-optimal subsampling. We derive A-optimal sampling distributions by minimizing the trace of certain dispersion matrices in general estimating equations (GEE). We point out that computing the A-optimal distributions takes the same running time as the full-data M-estimator; to compute the distributions quickly, we propose the A-optimal Scoring Algorithm, which can be implemented with parallel computing, is sequentially updatable for streaming data, and runs faster than the full-data M-estimator. We present asymptotic normality for the estimates in GEEs and in generalized count regression, and we introduce a data truncation method. We conduct extensive simulations to evaluate the numerical performance of the proposed sampling distributions, and we apply the proposed A-optimal subsampling method to two real count data sets, the Bike Sharing data and the Blog Feedback data. Our results in both the simulations and the real data sets indicate that the A-optimal distributions substantially outperform the uniform distribution and have faster running times than the full-data M-estimators.
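As a closing illustration of the count-data entry above, a hedged sketch of two-step subsampling for Poisson regression: a pilot fit supplies rough fitted means, probabilities of the form |y_i - mu_i| * ||x_i|| act as an illustrative surrogate for the A-optimal distributions (which the thesis derives from GEE dispersion matrices), and the subsample is refit with inverse-probability weights. `PoissonRegressor` is scikit-learn's Poisson GLM, unpenalized when `alpha=0`.

```python
import numpy as np
from sklearn.linear_model import PoissonRegressor

rng = np.random.default_rng(5)

# Synthetic big count data: y ~ Poisson(exp(x' beta)).
n, p, r = 200_000, 5, 2_000
X = rng.standard_normal((n, p)) * 0.3
beta_true = np.linspace(0.2, 1.0, p)
y = rng.poisson(np.exp(X @ beta_true))

# Step 1: pilot estimate from a small uniform subsample.
pilot = rng.choice(n, size=1_000, replace=False)
pilot_fit = PoissonRegressor(alpha=0.0).fit(X[pilot], y[pilot])
mu = pilot_fit.predict(X)               # fitted means on the full data

# Step 2: A-optimal-style probabilities |y - mu| * ||x||, normalized.
pi = np.abs(y - mu) * np.linalg.norm(X, axis=1)
pi = np.maximum(pi, 1e-8)               # keep every point reachable
pi /= pi.sum()

# Step 3: refit on the subsample with inverse-probability weights.
idx = rng.choice(n, size=r, replace=True, p=pi)
fit = PoissonRegressor(alpha=0.0).fit(X[idx], y[idx],
                                      sample_weight=1.0 / pi[idx])
print("subsample estimate:", np.round(fit.coef_, 3))
print("true coefficients: ", beta_true)
```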