Browsing by Author "Tan, Fei"
Now showing 1 - 10 of 12

Item: Avoiding Bad Control in Regression for Partially Qualitative Outcomes, and Correcting for Endogeneity Bias in Two-Part Models: Causal Inference from the Potential Outcomes Perspective (2021-05)
Asfaw, Daniel Abebe; Terza, Joseph; Ottoni-Wilhelm, Mark; Tennekoon, Vidhura; Tan, Fei

The general potential outcomes framework (GPOF) is an essential structure that facilitates clear and coherent specification, identification, and estimation of causal effects. This dissertation utilizes and extends the GPOF to specify, identify, and estimate a causally interpretable (CI) effect parameter (EP) for an outcome of interest that manifests as either a value in a specified subset of the real line or a qualitative event: a partially qualitative outcome (PQO). The limitations of the conventional GPOF for casting a regression model for a PQO are discussed; in this setting, the GPOF can only deliver an EP that is subject to bias due to bad control. The dissertation proposes an outcome measure that maintains all of the essential features of a PQO, is entirely real-valued, and is not subject to the bad control critique: the P-weighted outcome, i.e., the outcome weighted by the probability that it manifests as a quantitative (real) value. I detail a regression-based estimation method for this EP and, using simulated data, demonstrate its implementation and validate its consistency for the targeted EP. The practicality of the proposed approach is demonstrated by estimating the causal effect, on a new measure of birth weight, of a fully effective policy banning smoking during pregnancy. The dissertation also proposes a Generalized Control Function (GCF) approach for modeling and estimating a CI parameter in the context of a fully parametric two-part model (2PM) for a continuous outcome in which the causal variable of interest is continuous and endogenous. The proposed approach is cast within the GPOF. Given a fully parametric specification for the causal variable and under regular Instrumental Variables (IV) assumptions, the approach is shown to satisfy the conditional independence assumption that is often difficult to satisfy under alternative approaches. A full information maximum likelihood (FIML) estimator is derived for the "deep" parameters of the model, and, using simulated data, the Average Incremental Effect (AIE) estimator based on these deep parameter estimates is shown to outperform other conventional estimators. I apply the method to estimate the medical care cost of obesity in youth in the US.
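One plausible formalization of the P-weighted outcome described in this abstract (the notation here is ours, not the dissertation's): with $D \in \{0,1\}$ indicating that the outcome manifests as a real value, $Y$ its value when it does, and $X$ the covariates, the weighted outcome is, roughly,

\[
Y^{P} \;=\; \Pr(D = 1 \mid X)\cdot Y ,
\]

which is entirely real-valued while still carrying the probability of the qualitative manifestation. The dissertation's exact construction may differ in detail.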
Item: Comparison of an alternative schedule of extended care contacts to a self-directed control: a randomized trial of weight loss maintenance (BMC, 2017-08-15)
Dutton, Gareth R.; Gowey, Marissa A.; Tan, Fei; Zhou, Dali; Ard, Jamy; Perri, Michael G.; Lewis, Cora E.; Mathematical Sciences, School of Science

Background: Behavioral interventions for obesity produce clinically meaningful weight loss, but weight regain following treatment is common. Extended care programs attenuate weight regain and improve weight loss maintenance. However, less is known about the most effective ways to deliver extended care, including contact schedules.

Methods: We compared the 12-month weight regain of an extended care program utilizing a non-conventional, clustered campaign treatment schedule and a self-directed program among individuals who previously achieved ≥5% weight reductions. Participants (N = 108; mean age = 51.6 years; mean weight = 92.6 kg; 52% African American; 95% female) who achieved ≥5% weight loss during an initial 16-week behavioral obesity treatment were randomized into a 2-arm, 12-month extended care trial. The clustered campaign condition included 12 group-based visits delivered in three 4-week clusters. The self-directed condition included provision of the same printed intervention materials but no additional treatment visits. The study was conducted in a U.S. academic medical center from 2011 to 2015.

Results: Prior to randomization, participants lost an average of 7.55 ± 3.04 kg. Participants randomized to the 12-month clustered campaign program regained significantly less weight (0.35 ± 4.62 kg) than self-directed participants (2.40 ± 3.99 kg), a significant between-group difference of 2.28 kg (p = 0.0154) after covariate adjustments. This corresponded to maintaining 87% and 64% of lost weight in the clustered campaign and self-directed conditions, respectively, a significant between-group difference of 29% maintenance of lost weight after covariate adjustments (p = 0.0396).

Conclusions: In this initial test of a clustered campaign treatment schedule, this novel approach effectively promoted 12-month maintenance of lost weight. Future trials should directly compare clustered campaigns with conventional (e.g., monthly) extended care schedules.

Trial registration: Clinicaltrials.gov NCT02487121. Registered 06/26/2015 (retrospectively registered).

Item: Efficient Inference and Dominant-Set Based Clustering for Functional Data (2024-05)
Wang, Xiang; Wang, Honglang; Boukai, Benzion; Tan, Fei; Peng, Hanxiang

This dissertation addresses three progressively more fundamental problems in functional data analysis: (1) to perform efficient inference for the functional mean model accounting for within-subject correlation, we propose a refined and bias-corrected empirical likelihood method; (2) to identify functional subjects potentially drawn from different populations, we propose a dominant-set based unsupervised clustering method using the similarity matrix; (3) to learn the similarity matrix from various similarity metrics for functional data clustering, we propose a modularity-guided and dominant-set based semi-supervised clustering method.

In the first problem, the empirical likelihood method is used to draw inference on the mean function of functional data by constructing a refined and bias-corrected estimating equation. The proposed estimating equation not only improves efficiency but also enables practically feasible empirical likelihood inference by properly incorporating within-subject correlation, which had not been achieved by previous studies.

In the second problem, the dominant-set based unsupervised clustering method is proposed to maximize within-cluster similarity and is applied to functional data with a flexible choice of similarity measures between curves. The method is a hierarchical bipartition procedure under a penalized optimization framework, with the tuning parameter selected by maximizing the modularity of the resulting two clusters; it is inspired by the concept of a dominant set in graph theory and solved by replicator dynamics from game theory. The approach is robust not only to imbalanced group sizes but also to outliers, overcoming a limitation of many existing clustering methods.

In the third problem, a metric-based semi-supervised clustering method is proposed in which the similarity metric is learned by modularity maximization and then passed to the dominant-set based clustering procedure above. In the semi-supervised setting, where some clustering memberships are known, the goal is to determine the best linear combination of candidate similarity metrics as the final metric to enhance clustering performance. Besides the global metric-based algorithm, a second algorithm is proposed that learns an individual metric for each cluster, permitting overlapping cluster membership, which distinguishes it from many existing methods. The method applies to functional data with a wide range of similarity metrics between curves and inherits the robustness to imbalanced group sizes that is intrinsic to the dominant-set based clustering approach.

In all three problems, the advantages of the proposed methods are demonstrated through extensive empirical investigations using simulations as well as real data applications.
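The replicator-dynamics step used to extract a dominant set from a similarity matrix can be sketched as follows. This is a generic illustration of the standard iteration only; the dissertation's full procedure (penalization, hierarchical bipartition, modularity-based tuning) is not reproduced here, and the function name and tolerances are our own.

```python
import numpy as np

def dominant_set(A, tol=1e-8, max_iter=1000):
    """Approximate a dominant set of a nonnegative similarity matrix A
    (zero diagonal) via discrete replicator dynamics."""
    n = A.shape[0]
    x = np.full(n, 1.0 / n)            # start at the barycenter of the simplex
    for _ in range(max_iter):
        Ax = A @ x
        x_new = x * Ax / (x @ Ax)      # replicator update: boosts high-payoff items
        if np.abs(x_new - x).sum() < tol:
            x = x_new
            break
        x = x_new
    return x > 1e-6                    # support of x = members of the dominant set

# Toy similarity matrix: curves 0-2 are mutually similar, curve 3 is an outlier.
A = np.array([[0.0, 0.9, 0.8, 0.1],
              [0.9, 0.0, 0.85, 0.1],
              [0.8, 0.85, 0.0, 0.2],
              [0.1, 0.1, 0.2, 0.0]])
print(dominant_set(A))  # expected: [True, True, True, False]
```

Bipartitioning the data into the extracted set and its complement, then recursing, yields the kind of hierarchical procedure the abstract describes.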
Item: Impact of Body Mass Index on Prognosis for Breast Cancer Patients (JScholar, 2017)
Tan, Fei; Xiao, Hong; Gummadi, Sriharsha; Koniaris, Leonidas G.; Feldman, Jason David; Ali, Ayalew; Adunlin, George; Huang, Youjie; Mathematical Sciences, School of Science

This study investigates the impact of body mass index (BMI) on the prognosis for patients with breast cancer within the context of race (African-American versus Caucasian) and ethnicity (Hispanic versus Non-Hispanic). Overall, this study included 1,368 female breast cancer patients diagnosed between 2007 and 2010, with electronic medical record data accrued from a large Florida hospital network. Non-Hispanic black patients comprised 8.77% of the cohort and Hispanic patients made up 7.56%. Multivariate analysis revealed that the breast cancer death rate was increased over 2.6-fold for underweight patients, regardless of race or ethnicity. Overweight and obese patients did not have an increased hazard rate compared with those of normal weight. Importantly, the mechanism underlying the poorer prognosis of underweight patients remains to be defined. We suggest the use of a low BMI as a high-risk factor for breast-cancer mortality in all racial and ethnic populations.

Item: Massive data K-means clustering and bootstrapping via A-optimal Subsampling (2019-08)
Zhou, Dali; Tan, Fei; Peng, Hanxiang; Boukai, Benzion; Sarkar, Jyotirmoy; Li, Peijun

For massive data analysis, there are two computational bottlenecks: first, the data may be too large to store and read conveniently; second, the computation may take too long. To tackle these problems, parallel computing algorithms such as Divide-and-Conquer have been proposed, though one of their drawbacks is that some correlations may be lost when the data are divided into chunks. Subsampling is another way to address both problems simultaneously while taking correlation into consideration. Uniform sampling is simple and fast but inefficient; see the detailed discussions in Mahoney (2011) and Peng and Tan (2018). The bootstrap approach uses uniform sampling and is computationally intensive, which becomes enormously challenging when the data size is massive. K-means clustering is a standard method in data analysis; its iterative search for centroids encounters difficulty when the data size is massive.

In this thesis, we propose the approach of optimal subsampling for massive data bootstrapping and massive data k-means clustering. We seek the sampling distribution that minimizes the trace of the variance-covariance matrix of the resulting subsampling estimators, referred to as A-optimality in the literature; that is, we define the optimal sampling distribution by minimizing the sum of the component variances of the subsampling estimators. We show that the subsampled k-means centroids consistently approximate the full-data centroids, and we prove asymptotic normality using empirical process theory. We perform extensive simulations to evaluate the numerical performance of the proposed optimal subsampling approach through the empirical MSE and running times, and we also apply the subsampling approach to real data. For massive data bootstrap, we conducted a large simulation study in the framework of linear regression based on the A-optimal theory proposed by Peng and Tan (2018), focusing on the performance of confidence intervals computed from A-optimal subsampling, including coverage probabilities, interval lengths, and running times. In both bootstrapping and clustering, we compared A-optimal subsampling with uniform subsampling.
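The subsample-then-cluster workflow the abstract describes can be sketched as follows. The sampling distribution `probs` is the plug-in point: the dissertation's A-optimal probabilities (which minimize the trace of the estimators' variance-covariance matrix) are not reproduced here, so this sketch runs with uniform weights as the baseline comparator; all names and data are ours.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

def subsample_kmeans(X, k, r, probs):
    """Draw r rows of X according to `probs`, then run k-means on the
    subsample. A-optimal subsampling would supply nonuniform `probs`."""
    idx = rng.choice(len(X), size=r, replace=True, p=probs)
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X[idx])
    return km.cluster_centers_

# Toy massive-data stand-in: two Gaussian blobs.
X = np.vstack([rng.normal(0.0, 1.0, size=(50_000, 2)),
               rng.normal(5.0, 1.0, size=(50_000, 2))])

uniform = np.full(len(X), 1.0 / len(X))
print(subsample_kmeans(X, k=2, r=500, probs=uniform))
# The subsample centroids should approximate the full-data centroids (0,0) and (5,5).
```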
Item: Pre-Treatment and During-Treatment Weight Trajectories in Black and White Women (Elsevier, 2022)
Schneider-Worthington, Camille R.; Kinsey, Amber W.; Tan, Fei; Zhang, Sheng; Borgatti, Alena; Davis, Andrea; Dutton, Gareth R.; Mathematical Sciences, School of Science

Introduction: Black participants often lose less weight than White participants in response to behavioral weight-loss interventions. Many participants experience significant pretreatment weight fluctuations (between baseline measurement and treatment initiation), which have been associated with treatment outcomes. Pretreatment weight gain has been shown to be more prevalent among Black participants and may contribute to racial differences in treatment responses. The purpose of this study was to (1) examine the associations between pretreatment weight change and treatment outcomes and (2) examine racial differences in pretreatment weight change and weight loss among Black and White participants.

Methods: Participants were Black and White women (n = 153, 60% Black) enrolled in a 4-month weight loss program. Weight changes occurring during the pretreatment period (41 ± 14 days) were categorized as weight stable (±1.15% of baseline weight), weight gain (≥ +1.15%), or weight loss (≤ −1.15%). Recruitment and data collection occurred from 2011 to 2015; statistical analyses were performed in 2021.

Results: During the pretreatment period, most participants (56%) remained weight stable. Pretreatment weight trajectories did not differ by race (p = 0.481). At 4 months, those who lost weight before treatment experienced 2.63% greater weight loss than those who were weight stable (p < 0.005), whereas those who gained weight before treatment experienced 1.91% less weight loss (p < 0.01).

Conclusions: Pretreatment weight changes can impact weight outcomes after initial treatment, although no differences between Black and White participants were observed. Future studies should consider the influence of pretreatment weight change on long-term outcomes (e.g., weight loss maintenance) along with potential racial differences in these associations.

Item: Predicting program attendance and weight loss in obesity interventions: Do triggering events help? (Sage, 2021)
Borgatti, Alena; Tang, Ziting; Tan, Fei; Salvy, Sarah-Jeanne; Dutton, Gareth; Mathematical Sciences, School of Science

Medical events that "trigger" motivation to lose weight may improve treatment outcomes compared to non-medical or no triggering events. However, previous findings include only long-term successful participants, not those initiating treatment. The current study compared those with medical triggering events or non-medical triggering events to those with no triggering events on attendance and weight loss during a weight management program. Medical-triggering-event participants lost 1.8 percent less weight (p = 0.03) than no-triggering-event participants. Non-medical-triggering-event participants attended 1.45 more sessions (p = 0.04) and were 1.83 times more likely to complete the program (p = 0.03) than no-triggering-event participants. These findings fail to support a benefit of medical triggering events when beginning treatment for obesity.

Item: Regression analysis of big count data via a-optimal subsampling (2018-07-19)
Zhao, Xiaofeng; Tan, Fei; Peng, Hanxiang

There are two computational bottlenecks for Big Data analysis: (1) the data are too large for a desktop to store, and (2) the computing task takes too long to finish. While the Divide-and-Conquer approach easily breaks the first bottleneck, the subsampling approach simultaneously beats both. Uniform sampling and nonuniform leverage-score sampling are frequently used in the recent development of fast randomized algorithms. However, as Peng and Tan (2018) have demonstrated, neither approach is effective in extracting important information from data. In this thesis, we conduct regression analysis for big count data via A-optimal subsampling. We derive A-optimal sampling distributions by minimizing the trace of certain dispersion matrices in general estimating equations (GEE). We point out that computing the A-optimal distributions takes the same running time as the full-data M-estimator. To compute the distributions fast, we propose the A-optimal Scoring Algorithm, which is implementable by parallel computing, sequentially updatable for stream data, and faster than the full-data M-estimator. We present asymptotic normality for the estimates in GEEs and in generalized count regression, and we introduce a data truncation method. We conduct extensive simulations to evaluate the numerical performance of the proposed sampling distributions and apply the proposed A-optimal subsampling method to two real count data sets, the Bike Sharing data and the Blog Feedback data. Our results in both simulations and real data sets indicate that the A-optimal distributions substantially outperform the uniform distribution and have faster running times than the full-data M-estimators.
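The subsampled count-regression estimator the abstract builds on can be sketched generically: draw rows with probabilities p_i, then solve an inverse-probability-weighted estimating equation on the subsample. This is only an illustration of the mechanics, with a uniform placeholder distribution where the A-optimal probabilities would go; function names and data are ours, and the dissertation's A-optimal Scoring Algorithm is not reproduced.

```python
import numpy as np

def weighted_poisson_fit(X, y, w, n_iter=25):
    """Fisher scoring for Poisson regression with inverse-probability
    weights w_i = 1 / (r * p_i), which keep the subsample score
    (approximately) unbiased for the full-data estimating equation."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        mu = np.exp(X @ beta)
        score = X.T @ (w * (y - mu))            # weighted score vector
        info = (X * (w * mu)[:, None]).T @ X    # weighted Fisher information
        beta += np.linalg.solve(info, score)
    return beta

rng = np.random.default_rng(2)
n, r = 100_000, 1_000
X = np.column_stack([np.ones(n), rng.normal(size=n)])
y = rng.poisson(np.exp(X @ np.array([0.5, 0.3])))

# Placeholder sampling distribution (uniform); the A-optimal probabilities,
# which minimize the trace of the estimator's dispersion matrix, would
# replace `p` here.
p = np.full(n, 1.0 / n)
idx = rng.choice(n, size=r, p=p)
w = 1.0 / (r * p[idx])
print(weighted_poisson_fit(X[idx], y[idx], w))  # ≈ [0.5, 0.3]
```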
Item: Sample Size Determination for Subsampling in the Analysis of Big Data, Multiplicative Models for Confidence Intervals and Free-Knot Changepoint Models (2024-05)
Zhang, Sheng; Peng, Hanxiang; Tan, Fei; Sarkar, Jyoti; Boukai, Ben

The dissertation consists of three parts. Motivated by subsampling in the analysis of Big Data and by data-splitting in machine learning, the first part presents sample size determination for multidimensional parameters. In the second part, we propose a novel approach to the construction of confidence intervals based on improved concentration inequalities: we provide the missing factor for the tail probability of a random variable, generalizing Talagrand's (1995) result on the missing factor in Hoeffding's inequalities, and we give the procedure for constructing confidence intervals and illustrate it with simulations. In the third part, we study irregular change-point models using free-knot splines. Consistency and asymptotic normality of the least squares estimators are proved for the irregular models in which the linear spline is not differentiable. Simulations are carried out to explore the numerical properties of the proposed models, and the results are used to analyze the US Covid-19 data.
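A single-free-knot linear-spline changepoint fit of the kind the abstract studies can be sketched by profiling least squares over candidate knots. This is a generic illustration (the dissertation's irregular, non-differentiable case and its asymptotic theory are more delicate); the function name, grid, and toy data are ours.

```python
import numpy as np

def fit_one_knot_spline(x, y, n_grid=200):
    """Least-squares fit of y ≈ b0 + b1*x + b2*(x - k)_+ with a free knot k,
    profiling the residual sum of squares over a grid of candidate knots.
    The hinge (x - k)_+ makes the fit piecewise linear, non-differentiable at k."""
    best = (np.inf, None, None)
    for k in np.linspace(np.quantile(x, 0.05), np.quantile(x, 0.95), n_grid):
        design = np.column_stack([np.ones_like(x), x, np.maximum(x - k, 0.0)])
        beta, *_ = np.linalg.lstsq(design, y, rcond=None)
        rss = float(np.sum((y - design @ beta) ** 2))
        if rss < best[0]:
            best = (rss, k, beta)
    return best  # (rss, knot, coefficients)

# Toy data with a slope change at x = 3.
rng = np.random.default_rng(1)
x = np.sort(rng.uniform(0, 6, 300))
y = 1.0 + 0.5 * x + 2.0 * np.maximum(x - 3.0, 0.0) + rng.normal(0, 0.3, 300)
rss, knot, beta = fit_one_knot_spline(x, y)
print(f"estimated knot ≈ {knot:.2f}, coefficients {beta}")
```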
Item: Sample Size Determination in Multivariate Parameters With Applications to Nonuniform Subsampling in Big Data High Dimensional Linear Regression (2021-12)
Wang, Yu; Peng, Hanxiang; Li, Fang; Sarkar, Jyoti; Tan, Fei

Subsampling is an important method in the analysis of Big Data. Subsample size determination (SSSD) plays a crucial part in extracting information from data and in overcoming the challenges resulting from huge data sizes. In this thesis, (1) sample size determination (SSD) is investigated for multivariate parameters, and sample size formulas are obtained for the multivariate normal distribution; (2) sample size formulas are obtained based on concentration inequalities; (3) improved bounds for McDiarmid's inequalities are obtained; (4) the results are applied to nonuniform subsampling in Big Data high-dimensional linear regression; and (5) numerical studies are conducted.

The sample size formula for the univariate normal distribution is a staple of elementary statistics, yet its generalization to the multivariate normal (or, more generally, to multivariate parameters) appears not to have received much attention, to the best of our knowledge. In this thesis, we introduce a definition for SSD and obtain explicit formulas for the multivariate normal distribution, in gratifying analogy with the univariate normal sample size formula. Commonly used concentration inequalities provide exponential rates, and sample sizes based on these inequalities are often loose. Talagrand (1995) provided the missing factor to sharpen these inequalities; we obtain the numeric values of the constants in the missing factor and slightly improve his results. Furthermore, we provide the missing factor in McDiarmid's inequality. These improved bounds are used to give shrunken sample sizes.
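For context, the elementary univariate formula the abstract alludes to (textbook material, not the thesis's multivariate result): to estimate a normal mean with known standard deviation $\sigma$ so that a $100(1-\alpha)\%$ confidence interval has half-width at most $E$, one needs

\[
n \;\ge\; \left( \frac{z_{\alpha/2}\,\sigma}{E} \right)^{2},
\]

where $z_{\alpha/2}$ is the upper $\alpha/2$ standard normal quantile. The thesis develops explicit analogues of formulas of this type for multivariate parameters, which are not reproduced here.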