Browsing by Author "Boukai, Benzion"
Now showing 1 - 10 of 10
Item: A-Optimal Subsampling For Big Data General Estimating Equations (2019-08)
Cheung, Chung Ching; Peng, Hanxiang; Rubchinsky, Leonid; Boukai, Benzion; Lin, Guang; Al Hasan, Mohammad

A significant hurdle in analyzing big data is the lack of effective technology and statistical inference methods. A popular approach to analyzing data with large sample sizes is subsampling, and many subsampling probabilities have been introduced in the literature for the linear model (Ma et al., 2015). In this dissertation, we focus on generalized estimating equations (GEE) for big data and derive the asymptotic normality of the estimators with and without resampling. We also give the asymptotic representation of the bias of each estimator and show that the bias becomes significant when the data are high-dimensional. We further present a novel subsampling method, called A-optimal, which is derived by minimizing the trace of certain dispersion matrices (Peng and Tan, 2018), and we derive the asymptotic normality of the estimator based on A-optimal subsampling. We conduct extensive simulations on large, high-dimensional samples to evaluate the performance of the proposed methods using MSE as the criterion. High-dimensional data are investigated further, and we show through simulations that minimizing the asymptotic variance does not imply minimizing the MSE when the bias is not negligible. We apply the proposed subsampling method to a real data set, gas sensor data with more than four million data points. In both the simulations and the real data analysis, the A-optimal method outperforms traditional uniform subsampling.

Item: classCleaner: A Quantitative Method for Validating Peptide Identification in LC-MS/MS Workflows (2020-05)
Key, Melissa Chester; Boukai, Benzion; Ragg, Susanne; Katz, Barry; Mosley, Amber

Because label-free liquid chromatography-tandem mass spectrometry (LC-MS/MS) shotgun proteomics infers the peptide sequence of each measurement, there is inherent uncertainty in the identity of each peptide and its originating protein. Removing misidentified peptides can improve the accuracy and power of downstream analyses when differences between proteins are of primary interest. In this dissertation I present classCleaner, a novel algorithm designed to identify misidentified peptides from each protein using the available quantitative data. The algorithm is based on the idea that distances between peptides belonging to the same protein are stochastically smaller than those between peptides in different proteins. The method first determines a threshold based on the estimated distributions of these two groups of distances. This threshold is then used to create a decision rule for each peptide based on counting the number of within-protein distances smaller than the threshold. Using simulated data, I show that classCleaner always reduces the proportion of misidentified peptides, with better results for larger proteins (by number of constituent peptides), smaller inherent misidentification rates, and larger sample sizes. ClassCleaner is also applied to an LC-MS/MS proteomics data set and the Congressional Voting Records data set from the UCI machine learning repository. The latter is used to demonstrate that the algorithm is not specific to proteomics.
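
A minimal sketch of the counting decision rule described in the classCleaner abstract above (not the published implementation): the distance metric, the threshold t, and the minimum count k are placeholder inputs here, whereas the actual algorithm derives the threshold from the estimated within- and between-protein distance distributions.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

def flag_suspect_peptides(X, protein, t, k=1):
    """Flag peptides with too few 'close' within-protein neighbours.

    X       : (n_peptides, n_samples) matrix of quantitative measurements
    protein : length-n array of protein labels, one per peptide
    t       : distance threshold separating 'within' from 'between' distances
    k       : minimum number of within-protein distances below t needed to keep a peptide
    """
    D = squareform(pdist(X, metric="correlation"))   # any (quasi-)distance could be used
    protein = np.asarray(protein)
    suspect = np.zeros(len(protein), dtype=bool)
    for i in range(len(protein)):
        same = protein == protein[i]
        same[i] = False                       # exclude the self-distance
        n_close = np.sum(D[i, same] < t)      # within-protein distances below the threshold
        suspect[i] = n_close < k              # too few close peers -> likely misidentified
    return suspect
```
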
Item: Continuous Statistical Models: With or Without Truncation Parameters? (Springer, 2015)
Vancak, V.; Goldberg, Y.; Bar-Lev, S. K.; Boukai, Benzion; Department of Mathematical Sciences, School of Science

Lifetime data are usually assumed to stem from a continuous distribution supported on [0, b) for some b ≤ ∞. The continuity assumption implies that the support of the distribution has no atom points, particularly not at 0. Accordingly, it seems reasonable that with an accurate measurement tool all data observations will be positive, which suggests that the true support may be truncated from the left. In this work we investigate the effects of adding a left truncation parameter to a continuous lifetime data statistical model. We consider two main settings: right truncation parametric models with possible left truncation, and exponential family models with possible left truncation. We analyze the performance of some optimal estimators constructed under the assumption of no left truncation when left truncation is present, and vice versa, and we investigate both the asymptotic and the finite-sample behavior of the estimators. We show that when left truncation is not assumed but is, in fact, present, the estimators have a constant bias term and therefore result in inaccurate and inefficient estimation. We also show that assuming left truncation when there is in fact none typically does not result in substantial inefficiency, and some estimators in this case are asymptotically unbiased and efficient.

Item: Efficient Inference and Dominant-Set Based Clustering for Functional Data (2024-05)
Wang, Xiang; Wang, Honglang; Boukai, Benzion; Tan, Fei; Peng, Hanxiang

This dissertation addresses three progressively fundamental problems in functional data analysis: (1) to conduct efficient inference for the functional mean model accounting for within-subject correlation, we propose a refined and bias-corrected empirical likelihood method; (2) to identify functional subjects potentially from different populations, we propose a dominant-set based unsupervised clustering method using the similarity matrix; (3) to learn the similarity matrix from various similarity metrics for functional data clustering, we propose a modularity-guided and dominant-set based semi-supervised clustering method.

In the first problem, the empirical likelihood method is used to conduct inference for the mean function of functional data by constructing a refined and bias-corrected estimating equation. The proposed estimating equation not only improves efficiency but also enables practically feasible empirical likelihood inference by properly incorporating within-subject correlation, which has not been achieved in previous studies.

In the second problem, the dominant-set based unsupervised clustering method is proposed to maximize the within-cluster similarity and is applied to functional data with a flexible choice of similarity measures between curves. The proposed method is a hierarchical bipartition procedure under a penalized optimization framework, with the tuning parameter selected by maximizing the clustering criterion, the modularity of the resulting two clusters; it is inspired by the concept of the dominant set in graph theory and solved by replicator dynamics from game theory. The approach is robust not only to imbalanced group sizes but also to outliers, overcoming a limitation of many existing clustering methods.

In the third problem, a metric-based semi-supervised clustering method is proposed in which the similarity metric is learned by modularity maximization, followed by the dominant-set based clustering procedure proposed above. Under the semi-supervised setting, where some clustering memberships are known, the goal is to determine the best linear combination of candidate similarity metrics as the final metric to enhance the clustering performance. Besides the global metric-based algorithm, another algorithm is proposed to learn individual metrics for each cluster, which permits overlapping membership in the clustering, an approach that differs from many existing methods. The method applies readily to functional data with various similarity metrics between curves, while also inheriting the robustness to imbalanced group sizes that is intrinsic to the dominant-set based clustering approach.

In all three problems, the advantages of the proposed methods are demonstrated through extensive empirical investigations using simulations as well as real data applications.
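
The dominant-set extraction that underlies the clustering procedures in the second and third problems of the dissertation above can be sketched with the standard replicator dynamics on a similarity matrix. The code below is a generic illustration of that building block under the usual conventions (non-negative similarities, zero diagonal), not the dissertation's full hierarchical bipartition algorithm; the function name and stopping rules are illustrative.

```python
import numpy as np

def dominant_set(A, max_iter=1000, tol=1e-8, cutoff=1e-6):
    """Extract one dominant set from a similarity matrix via replicator dynamics.

    A : (n, n) symmetric non-negative similarity matrix with zero diagonal,
        e.g. built from pairwise similarities between functional curves.
    Returns the indices whose converged weight exceeds `cutoff`.
    """
    n = A.shape[0]
    x = np.full(n, 1.0 / n)                  # start at the barycenter of the simplex
    for _ in range(max_iter):
        Ax = A @ x
        x_new = x * Ax / (x @ Ax)            # discrete replicator dynamics update
        if np.abs(x_new - x).sum() < tol:
            x = x_new
            break
        x = x_new
    return np.flatnonzero(x > cutoff)        # support of x approximates the dominant set

# Repeatedly extracting a dominant set and splitting off the remaining curves,
# with modularity guiding the tuning, gives a bipartition-style clustering.
```
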
Item: The Generalized Gamma Distribution as a Useful RND under Heston’s Stochastic Volatility Model (MDPI, 2022)
Boukai, Benzion; Mathematical Sciences, School of Science

We present the Generalized Gamma (GG) distribution as a possible risk-neutral distribution (RND) for modeling European option prices under Heston’s stochastic volatility (SV) model. We demonstrate that under a particular reparametrization, this distribution, which is a member of the scale-parameter family of distributions with the mean being the forward spot price, satisfies Heston’s solution and hence could be used for the direct risk-neutral valuation of the option price under Heston’s SV model. Indeed, this distribution is especially useful in situations in which the spot’s price follows a negatively skewed distribution, for which Black–Scholes-based (i.e., log-normal) modeling is largely inapt. We illustrate the applicability of the GG distribution as an RND by modeling market option data on three large market-index exchange-traded funds (ETFs), namely the SPY, IWM and QQQ, as well as on the TLT (an ETF that tracks an index of long-term US Treasury bonds). As of the writing of this paper (August 2021), the option chain of each of the three market-index ETFs shows a pronounced skew in its volatility ‘smile’, which indicates a likely distortion in the Black–Scholes modeling of such option data. Reflective of entirely different market expectations, this distortion in the volatility ‘smile’ appears not to exist in the TLT option data. We provide a thorough modeling of the option data we have on each ETF (with the 15 October 2021 expiration) based on the GG distribution and compare it to the option pricing and RND modeling obtained directly from a well-calibrated Heston SV model (both theoretically and empirically, using Monte Carlo simulations of the spot’s price). All three market-index ETFs exhibited negatively skewed distributions, which are well matched by those derived under the GG distribution as RND. The inadequacy of the Black–Scholes modeling in such instances, which involve negatively skewed distributions, is further illustrated by its impact on the hedging factor, delta, and the immediate implications for the retail trader. Similarly, the closely related Inverse Generalized Gamma (IGG) distribution is also proposed as a possible RND for Heston’s SV model in situations involving positively skewed distributions. In all, utilizing the Generalized Gamma distributions as possible RNDs for direct option valuation under Heston’s SV model is seen as particularly useful to retail traders who do not have the numerical tools or the know-how to fine-calibrate this SV model.
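
The direct risk-neutral valuation step described in the abstract above amounts to integrating the option payoff against the GG density, with the scale pinned down so that the RND mean equals the forward price. The sketch below only illustrates that step using SciPy's gengamma(a, c) parametrization; it does not reproduce the paper's particular reparametrization or its link to Heston's calibrated parameters, and the function name and example parameters are hypothetical.

```python
import numpy as np
from scipy import stats
from scipy.integrate import quad
from scipy.special import gamma as gammafn

def gg_call_price(S0, K, r, T, a, c):
    """European call price from a Generalized Gamma RND for the terminal spot price.

    The GG scale is fixed by the martingale condition E[S_T] = S0*exp(r*T) (no dividends);
    a and c are the remaining shape parameters in SciPy's gengamma parametrization.
    """
    F = S0 * np.exp(r * T)                          # forward price of the underlying
    scale = F * gammafn(a) / gammafn(a + 1.0 / c)   # gengamma mean = scale*G(a+1/c)/G(a)
    rnd = stats.gengamma(a, c, scale=scale)
    integral, _ = quad(lambda s: (s - K) * rnd.pdf(s), K, np.inf)
    return np.exp(-r * T) * integral                # discounted expected payoff under the RND

# Example usage (illustrative parameters only): different (a, c) choices give
# differently skewed RNDs.
# print(gg_call_price(S0=440.0, K=445.0, r=0.01, T=0.25, a=2.0, c=3.0))
```
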
Item: IUPUI Center for Mathematical Biosciences (Office of the Vice Chancellor for Research, 2010-04-09)
Boukai, Benzion; Chin, Ray; Dziubek, Andrea; Fokin, Vladimir; Ghosh, Samiran; Kuznetsov, Alexey; Li, Fang; Li, Jiliang; Rader, Andrew; Rubchinsky, Leonid; Sarkar, Jyotirmoy; Guidoboni, Giovanna; Worth, Robert; Zhu, Luoding

At-Large Mission: “to serve as an umbrella center for spearheading research and programmatic activities in the general bio-mathematics area”
• promote and facilitate faculty excellence in mathematical and computational research in the biosciences;
• provide a mechanism and an environment that fosters collaborative research activities across the mathematical sciences and the life and health sciences schools at IUPUI, specifically with the IUSOM;
• provide foundations and resources for further strategic development in targeted areas of mathematical and computational biosciences research; and
• create greater opportunities and increase competitiveness in seeking and procuring extramural funding.

Item: Massive data K-means clustering and bootstrapping via A-optimal Subsampling (2019-08)
Zhou, Dali; Tan, Fei; Peng, Hanxiang; Boukai, Benzion; Sarkar, Jyotirmoy; Li, Peijun

In massive data analysis, computational bottlenecks arise in two ways: the data may be too large to store and read easily, and the computation time may be too long. To tackle these problems, parallel computing algorithms such as divide-and-conquer have been proposed, but one of their drawbacks is that some correlations may be lost when the data are divided into chunks. Subsampling is another way to address massive data analysis while taking correlation into consideration. Uniform subsampling is simple and fast but inefficient; see the detailed discussions in Mahoney (2011) and Peng and Tan (2018). The bootstrap uses uniform sampling and is computationally intensive, which becomes an enormous challenge when the data size is massive. k-means clustering is a standard method in data analysis; it iterates to find centroids, which also becomes difficult when the data size is massive. In this thesis, we propose optimal subsampling for massive data bootstrapping and massive data k-means clustering. We seek the sampling distribution that minimizes the trace of the variance-covariance matrix of the resulting subsampling estimators, referred to as A-optimal in the literature; equivalently, the optimal sampling distribution minimizes the sum of the component variances of the subsampling estimators. We show that the subsampling k-means centroids consistently approximate the full-data centroids, and we prove their asymptotic normality using empirical process theory. We perform extensive simulations to evaluate the numerical performance of the proposed optimal subsampling approach through the empirical MSE and running times, and we apply the subsampling approach to real data. For the massive data bootstrap, we conducted a large simulation study in the framework of linear regression based on the A-optimal theory proposed by Peng and Tan (2018). We focus on the performance of confidence intervals computed from A-optimal subsampling, including coverage probabilities, interval lengths, and running times. In both the bootstrap and the clustering settings, we compare A-optimal subsampling with uniform subsampling.
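
A two-step subsampled k-means along the lines described in the thesis abstract above can be sketched as follows. The influence score used to build the sampling probabilities here (distance to the nearest pilot centroid) is only a stand-in: the actual A-optimal probabilities minimize the trace of the asymptotic variance-covariance matrix of the subsampling estimator, as derived in the thesis and in Peng and Tan (2018). Function names and defaults are illustrative.

```python
import numpy as np
from sklearn.cluster import KMeans

def subsampled_kmeans(X, k, r, pilot_size=2000, seed=0):
    """Two-step subsampled k-means for a large data matrix X (n x p).

    Step 1: a uniform pilot subsample gives pilot centroids.
    Step 2: a non-uniform subsample of size r is drawn with probabilities proportional
            to a per-point influence score, and weighted k-means is run on it.
    The score below is only a placeholder for the A-optimal subsampling probabilities.
    """
    rng = np.random.default_rng(seed)
    n = X.shape[0]

    pilot_idx = rng.choice(n, size=min(pilot_size, n), replace=False)
    pilot = KMeans(n_clusters=k, n_init=10, random_state=seed).fit(X[pilot_idx])

    score = pilot.transform(X).min(axis=1) + 1e-12   # distance to nearest pilot centroid
    p = score / score.sum()                          # subsampling probabilities

    idx = rng.choice(n, size=r, replace=True, p=p)
    w = 1.0 / (r * p[idx])                           # inverse-probability weights
    km = KMeans(n_clusters=k, n_init=10, random_state=seed).fit(X[idx], sample_weight=w)
    return km.cluster_centers_
```
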
Item: Optimal Policies in Reliability Modelling of Systems Subject to Sporadic Shocks and Continuous Healing (2022-12)
Chatterjee, Debolina; Sarkar, Jyotirmoy; Boukai, Benzion; Li, Fang; Wang, Honglang

Recent years have seen a growth in research on system reliability and maintenance. Various studies in the scientific fields of reliability engineering, quality and productivity analyses, risk assessment, software reliability, and probabilistic machine learning are being undertaken in the present era. The dependency of human life on technology has made it more important to maintain such systems and maximize their potential. In this dissertation, some methodologies are presented that maximize certain measures of system reliability, explain the underlying stochastic behavior of certain systems, and mitigate the risk of system failure. An overview of the dissertation is provided in Chapter 1, where we briefly discuss some useful definitions and concepts in probability theory and stochastic processes and present some mathematical results required in later chapters. Thereafter, we present the motivation and outline of each subsequent chapter.

In Chapter 2, we compute the limiting average availability of a one-unit repairable system supported by repair facilities and spare units. Formulas for the limiting average availability of a repairable system exist only in some special cases: (1) either the lifetime or the repair time is exponential; or (2) there is one spare unit and one repair facility. In contrast, we consider a more general setting involving several spare units and several repair facilities, and we allow arbitrary life- and repair-time distributions. Under periodic monitoring, which essentially discretizes the time variable, we compute the limiting average availability. The discretization approach closely approximates the existing results in the special cases and demonstrates, as anticipated, that the limiting average availability increases with additional spare units and/or repair facilities.

In Chapter 3, the system experiences two types of sporadic impact: valid shocks that cause damage instantaneously and positive interventions that induce partial healing. Whereas each shock inflicts a fixed magnitude of damage, the accumulated effect of k positive interventions nullifies the damaging effect of one shock. The system is said to be in Stage 1, when it can possibly heal, until the net count of impacts (valid shocks registered minus valid shocks nullified) reaches a threshold $m_1$. The system then enters Stage 2, where no further healing is possible, and it fails when the net count of valid shocks reaches another threshold $m_2 (> m_1)$. The inter-arrival times between successive valid shocks and those between successive positive interventions are independent and follow arbitrary distributions; thus, we remove the restrictive assumption of an exponential distribution often found in the literature. We find the distributions of the sojourn time in Stage 1 and of the failure time of the system. Finally, we find the optimal values of the choice variables that minimize the expected maintenance cost per unit time for three different maintenance policies.

In Chapter 4, the Stage 1 defined above is further subdivided into two parts: in the early part, called Stage 1A, healing happens faster than in the later part, called Stage 1B. The system stays in Stage 1A until the net count of impacts reaches a predetermined threshold $m_A$; then the system enters Stage 1B and stays there until the net count reaches another predetermined threshold $m_1 (> m_A)$. Subsequently, the system enters Stage 2, where it can no longer heal, and it fails when the net count of valid shocks reaches another predetermined higher threshold $m_2 (> m_1)$. All other assumptions are the same as those in Chapter 3. We calculate the percentage improvement in the lifetime of the system due to the subdivision of Stage 1. Finally, we make optimal choices to minimize the expected maintenance cost per unit time for two maintenance policies.

Next, we eliminate the restrictive assumptions that all valid shocks and all positive interventions have equal magnitude and that the boundary threshold is a preset constant. In Chapter 5, we study a system that experiences damaging external shocks of random magnitude at stochastic intervals, continuous degradation, and self-healing. The system fails if the cumulative damage exceeds a time-dependent threshold. We develop a preventive maintenance policy to replace the system so that its lifetime is utilized prudently. Further, we consider three variations on the healing pattern: (1) shocks heal for a fixed finite duration $\tau$; (2) a fixed proportion of shocks are non-healable (that is, $\tau = 0$); (3) there are two types of shocks: self-healable shocks, which heal for a finite duration, and non-healable shocks. We implement the proposed preventive maintenance policy and compare the optimal replacement times in these new cases with those in the original case, where all shocks heal indefinitely.

Finally, in Chapter 6, we present a summary of the dissertation with conclusions and future research potential.
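
The Stage 1/Stage 2 net-count mechanism of Chapter 3 above lends itself to a short Monte Carlo sketch. This is not the dissertation's analytical derivation of the sojourn- and failure-time distributions; it only simulates one sample path under assumed conventions (for example, the net count is not allowed to drop below zero), with a hypothetical function name and user-supplied samplers for the arbitrary inter-arrival laws.

```python
import numpy as np

def simulate_failure_time(shock_iat, heal_iat, k, m1, m2, horizon=1e6, seed=0):
    """Simulate one path of the shock/healing model sketched from Chapter 3.

    shock_iat, heal_iat : callables rng -> one inter-arrival time (arbitrary laws)
    k   : number of positive interventions needed to nullify one registered shock
    m1  : net count at which the system leaves Stage 1 (no healing afterwards)
    m2  : net count at which the system fails (m2 > m1)
    Returns (time_of_entering_Stage_2, failure_time).
    """
    rng = np.random.default_rng(seed)
    t, t_shock, t_heal = 0.0, shock_iat(rng), heal_iat(rng)
    net, heals, stage1_end = 0, 0, None
    while t < horizon:
        if t_shock <= t_heal:                     # next event is a valid shock
            t, t_shock = t_shock, t_shock + shock_iat(rng)
            net += 1
        else:                                     # next event is a positive intervention
            t, t_heal = t_heal, t_heal + heal_iat(rng)
            if stage1_end is None:                # healing only possible in Stage 1
                heals += 1
                if heals == k:                    # k interventions nullify one shock
                    heals = 0
                    net = max(net - 1, 0)         # assumption: net count capped at zero
        if stage1_end is None and net >= m1:
            stage1_end = t                        # system enters Stage 2
        if net >= m2:
            return stage1_end, t                  # system failure
    return stage1_end, np.inf

# Example: Weibull shock inter-arrivals, Gamma intervention inter-arrivals.
# s1_end, t_fail = simulate_failure_time(lambda rng: rng.weibull(1.5),
#                                        lambda rng: rng.gamma(2.0, 0.2),
#                                        k=3, m1=5, m2=10)
```

Averaging many replications approximates the Stage 1 sojourn and failure-time distributions, which can then be fed into a numerical search over the maintenance-policy variables.
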
Item: A Statistical Testing Procedure for Validating Class Labels (arXiv, 2020)
Key, Melissa C.; Boukai, Benzion; Biostatistics, School of Public Health

Motivated by an open problem of validating protein identities in label-free shotgun proteomics work-flows, we present a testing procedure to validate class/protein labels using available measurements across instances/peptides. More generally, we present a solution to the problem of identifying instances that are deemed, based on some distance (or quasi-distance) measure, to be outliers relative to the subset of instances assigned to the same class. The proposed procedure is non-parametric and requires no specific distributional assumption on the measured distances. The only assumption underlying the testing procedure is that measured distances between instances within the same class are stochastically smaller than measured distances between instances from different classes. The test is shown to simultaneously control the Type I and Type II error probabilities whilst also controlling the overall error probability of the repeated testing invoked in the validation procedure of initial class labeling. The theoretical results are supplemented with results from an extensive numerical study, simulating a typical setup for labeling validation in proteomics work-flow applications. These results illustrate the applicability and viability of our method. Even with up to 25% of instances mislabeled, our testing procedure maintains a high specificity and greatly reduces the proportion of mislabeled instances.
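
A cut-point that balances the two error rates can be illustrated directly from empirical within-class and between-class distance samples. The sketch below is not the paper's exact construction (which also controls the overall error of the repeated testing); it only shows the kind of threshold choice implied by the stochastic-ordering assumption, and it is a natural companion to the classCleaner decision rule sketched earlier. All names are illustrative.

```python
import numpy as np

def pick_threshold(within_d, between_d, alpha=0.05, beta=0.05):
    """Choose a distance cut-point from empirical within- and between-class distances.

    within_d, between_d : 1-d arrays of observed distances
    Returns (t, type1_hat, type2_hat), where
      type1_hat = fraction of within-class distances above t   (missed 'close' calls)
      type2_hat = fraction of between-class distances below t  (false 'close' calls)
    """
    lo = np.quantile(within_d, 1 - alpha)     # at most ~alpha of within-distances above t
    hi = np.quantile(between_d, beta)         # at most ~beta of between-distances below t
    t = 0.5 * (lo + hi) if lo <= hi else lo   # both targets met when the two laws separate
    type1_hat = np.mean(within_d > t)
    type2_hat = np.mean(between_d < t)
    return t, type1_hat, type2_hat
```
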
Item: Study designs and statistical methods for pharmacogenomics and drug interaction studies (2016-04-01)
Zhang, Pengyue; Li, Lang; Boukai, Benzion; Shen, Changyu; Zeng, Donglin; Liu, Yunlong

Adverse drug events (ADEs) are injuries resulting from drug-related medical interventions. ADEs can be induced either by a single drug or by a drug-drug interaction (DDI). In order to prevent unnecessary ADEs, many regulatory agencies in public health maintain pharmacovigilance databases for detecting novel drug-ADE associations. However, pharmacovigilance databases usually contain a significant portion of false associations due to their inherent structure (e.g., false drug-ADE associations caused by co-medications). Besides pharmacovigilance studies, the risks of ADEs can be minimized by understanding their mechanisms, which include abnormal pharmacokinetics/pharmacodynamics due to genetic factors and synergistic effects between drugs. During the past decade, pharmacogenomics studies have successfully identified several predictive markers to reduce ADE risks, but such studies are usually limited by sample size and budget. In this dissertation, we develop statistical methods for pharmacovigilance and pharmacogenomics studies. First, we propose an empirical Bayes mixture model to identify significant drug-ADE associations. The proposed approach can be used for both signal generation and ranking, and the proportion of false associations among the detected signals can be well controlled. Second, we propose a mixture dose-response model to investigate the functional relationship between the increased dimensionality of drug combinations and ADE risks; this approach can also be used to identify high-dimensional drug combinations associated with escalated ADE risks at significantly low local false discovery rates. Finally, we propose a cost-efficient design for pharmacogenomics studies. To pursue further cost-efficiency, the proposed design combines DNA pooling with a two-stage design approach. Compared to a traditional design, the cost under the proposed design is reduced dramatically with an acceptable compromise in statistical power. The proposed methods are examined by extensive simulation studies. Furthermore, the proposed methods for analyzing pharmacovigilance databases are applied to the FDA Adverse Event Reporting System database and a local electronic medical record (EMR) database. For different pharmacogenomics study scenarios, optimized designs to detect a functional rare allele are given as well.
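
The signal-generation idea in the first aim above (an empirical Bayes mixture over association statistics, with the proportion of false associations controlled through posterior null probabilities) can be sketched generically. The code below fits a simple two-group Gaussian mixture by EM with a fixed N(0,1) null component; it is a generic illustration of the two-group empirical Bayes idea, not the dissertation's actual model for pharmacovigilance data, and all names are illustrative.

```python
import numpy as np
from scipy.stats import norm

def two_group_eb(z, n_iter=200):
    """Fit a two-group empirical Bayes mixture to association z-scores.

    Component 0 is a fixed N(0,1) 'no association' null; component 1 is a free
    Gaussian for true signals.  Returns the posterior null probability (local fdr)
    of each score, which can be used for both signal ranking and error control.
    """
    z = np.asarray(z, dtype=float)
    pi0 = 0.9                                     # initial null proportion
    mu1 = np.mean(z[z > np.quantile(z, 0.9)])     # crude start for the signal component
    sd1 = np.std(z)
    for _ in range(n_iter):
        f0 = pi0 * norm.pdf(z, 0.0, 1.0)
        f1 = (1.0 - pi0) * norm.pdf(z, mu1, sd1)
        w1 = f1 / (f0 + f1)                       # E-step: posterior signal probability
        pi0 = 1.0 - w1.mean()                     # M-step updates
        mu1 = np.sum(w1 * z) / np.sum(w1)
        sd1 = np.sqrt(np.sum(w1 * (z - mu1) ** 2) / np.sum(w1))
    return 1.0 - w1                               # posterior null probability (local fdr)
```

Ranking drug-ADE pairs by this posterior null probability and reporting only those whose cumulative average stays below a target level is one standard way to keep the expected proportion of false associations among the reported signals under control.
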