Browsing by Subject "Big Data"
Now showing 1 - 10 of 13
Item: A-Optimal Subsampling For Big Data General Estimating Equations (2019-08)
Cheung, Chung Ching; Peng, Hanxiang; Rubchinsky, Leonid; Boukai, Benzion; Lin, Guang; Al Hasan, Mohammad
A significant hurdle for analyzing big data is the lack of effective technology and statistical inference methods. A popular approach for analyzing large-sample data is subsampling. Many subsampling probabilities have been introduced in the literature (Ma et al., 2015) for linear models. In this dissertation, we focus on generalized estimating equations (GEE) with big data and derive the asymptotic normality of the estimator without resampling and of the estimator with resampling. We also give the asymptotic representation of the bias of both estimators. We show that the bias becomes significant when the data are high-dimensional. We also present a novel subsampling method called A-optimal, which is derived by minimizing the trace of some dispersion matrices (Peng and Tan, 2018). We derive the asymptotic normality of the estimator based on A-optimal subsampling methods. We conduct extensive simulations on large-sample, high-dimensional data to evaluate the performance of our proposed methods using MSE as a criterion. High-dimensional data are further investigated, and we show through simulations that minimizing the asymptotic variance does not imply minimizing the MSE because the bias is not negligible. We apply our proposed subsampling method to analyze a real data set, gas sensor data with more than four million data points.
In both simulations and real data analysis, our A-optimal method outperforms the traditional uniform subsampling method.

Item: Big Data Edge on Consumer Devices for Precision Medicine (IEEE, 2022)
Stauffer, Jake; Zhang, Qingxue; Biomedical Engineering and Informatics, Luddy School of Informatics, Computing, and Engineering
Consumer electronics like smartphones and wearable computers are furthering precision medicine significantly by capturing and leveraging big data on the edge for real-time, interactive healthcare applications. Here we propose a big data edge platform that can not only capture and manage different biomedical dynamics but also enable real-time visualization of big data. The big data can also be uploaded to the cloud for long-term management. The system has been evaluated on a real-world application based on biomechanical data and has demonstrated its effectiveness in big data management and interactive visualization. This study is expected to greatly advance big data-driven precision medicine applications.

Item: Correlations of Online Search Engine Trends with Coronavirus Disease (COVID-19) Incidence: Infodemiology Study (JMIR Publications, 2020-05-21)
Higgins, Thomas S.; Wu, Arthur W.; Sharma, Dhruv; Illing, Elisa A.; Rubel, Kolin E.; Ting, Jonathan Y.; Otolaryngology -- Head and Neck Surgery, School of Medicine
Background: The coronavirus disease (COVID-19) is the latest pandemic of the digital age. With the internet harvesting large amounts of data from the general population in real time, public databases such as Google Trends (GT) and the Baidu Index (BI) can be an expedient tool to assist public health efforts. Objective: The aim of this study is to apply digital epidemiology to the current COVID-19 pandemic to determine the utility of providing adjunctive epidemiologic information on outbreaks of this disease and evaluate this methodology in the case of future pandemics.
Methods: An epidemiologic time series analysis of online search trends relating to the COVID-19 pandemic was performed from January 9, 2020, to April 6, 2020. BI was used to obtain online search data for China, while GT was used for worldwide data, the countries of Italy and Spain, and the US states of New York and Washington. These data were compared to real-world confirmed cases and deaths of COVID-19. Chronologic patterns were assessed in relation to disease patterns, significant events, and media reports. Results: Worldwide search terms for shortness of breath, anosmia, dysgeusia and ageusia, headache, chest pain, and sneezing had strong correlations (r>0.60, P<.001) to both new daily confirmed cases and deaths from COVID-19. GT COVID-19 (search term) and GT coronavirus (virus) searches predated real-world confirmed cases by 12 days (r=0.85, SD 0.10 and r=0.76, SD 0.09, respectively, P<.001). Searches for symptoms of diarrhea, fever, shortness of breath, cough, nasal obstruction, and rhinorrhea all had a negative lag greater than 1 week compared to new daily cases, while searches for anosmia and dysgeusia peaked worldwide and in China with positive lags of 5 days and 6 weeks, respectively, corresponding with widespread media coverage of these symptoms in COVID-19. Conclusions: This study demonstrates the utility of digital epidemiology in providing helpful surveillance data of disease outbreaks like COVID-19. 
Although certain online search trends for this disease were influenced by media coverage, many search terms reflected clinical manifestations of the disease and showed strong correlations with real-world cases and deaths.

Item: Developing Automated Computer Algorithms to Track Periodontal Disease Change from Longitudinal Electronic Dental Records (MDPI, 2023-03-08)
Patel, Jay S.; Kumar, Krishna; Zai, Ahad; Shin, Daniel; Willis, Lisa; Thyvalikakath, Thankam P.
Objective: To develop two automated computer algorithms to extract information from clinical notes, and to generate three cohorts of patients (disease improvement, disease progression, and no disease change) to track periodontal disease (PD) change over time using longitudinal electronic dental records (EDR). Methods: We conducted a retrospective study of 28,908 patients who received a comprehensive oral evaluation between 1 January 2009, and 31 December 2014, at Indiana University School of Dentistry (IUSD) clinics. We utilized various Python libraries, such as Pandas, TensorFlow, and PyTorch, and a natural language tool kit to develop and test computer algorithms. We tested the performance through a manual review process by generating a confusion matrix. We calculated precision, recall, sensitivity, specificity, and accuracy to evaluate the performances of the algorithms. Finally, we evaluated the density of longitudinal EDR data for the following follow-up times: (1) None; (2) Up to 5 years; (3) > 5 and ≤ 10 years; and (4) >10 and ≤ 15 years. Results: Thirty-four percent (n = 9954) of the study cohort had up to five years of follow-up visits, with an average of 2.78 visits with periodontal charting information. For clinician-documented diagnoses from clinical notes, 42% of patients (n = 5562) had at least two PD diagnoses to determine their disease change.
In this cohort, with clinician-documented diagnoses, 72% of patients (n = 3919) did not have a disease status change between their first and last visits, 669 (13%) patients’ disease status progressed, and 589 (11%) patients’ disease improved. Conclusions: This study demonstrated the feasibility of utilizing longitudinal EDR data to track disease changes over 15 years during the observation study period. We provided detailed steps and computer algorithms to clean and preprocess the EDR data and generated three cohorts of patients. This information can now be utilized for studying clinical courses using artificial intelligence and machine learning methods.

Item: Developing Bottom-Up, Integrated Omics Methodologies for Big Data Biomarker Discovery (2020-11)
Kechavarzi, Bobak David; Wu, Huanmei; Doman, Thompson; Dow, Ernst; Liu, Yunlong; Liu, Xiaowen; Yan, Jingwen
The availability of highly distributed computing complements the proliferation of next generation sequencing (NGS) and genome-wide association studies (GWAS) datasets. These datasets are often complex, poorly annotated, or require complex domain knowledge to sensibly manage. These novel datasets provide a rare, multi-dimensional omics (proteomics, transcriptomics, and genomics) view of a single sample or patient. Previously, biologists assumed a strict adherence to the central dogma: replication, transcription and translation. Recent studies in genomics and proteomics emphasize that this is not the case. We must employ big-data methodologies to understand not only the biogenesis of these molecules but also their disruption in disease states. The Cancer Genome Atlas (TCGA) provides high-dimensional patient data and illustrates the trends that occur in expression profiles and their alteration in many complex disease states. I will ultimately create a bottom-up multi-omics approach to observe biological systems using big data techniques.
I hypothesize that big data and systems biology approaches can be applied to public datasets to identify important subsets of genes in cancer phenotypes. By exploring these signatures, we can better understand the role of amplification and transcript alterations in cancer.

Item: Distributed graph decomposition algorithms on Apache Spark (2018-04-20)
Mandal, Aritra; Hasan, Mohammad Al; Mohler, George; Song, Fengguang
Structural analysis and mining of large and complex graphs for describing the characteristics of a vertex or an edge in the graph have widespread use in graph clustering, classification, and modeling. There are various methods for structural analysis of graphs, including the discovery of frequent subgraphs or network motifs, counting triangles or graphlets, spectral analysis of networks using eigenvectors of the graph Laplacian, and finding highly connected subgraphs such as cliques and quasi-cliques. Unfortunately, the algorithms for solving most of the above tasks are quite costly, which makes them not scalable to large real-life networks. Two very popular decompositions, k-core and k-truss, give very useful insight into the vertices and edges of a graph, respectively. These decompositions have been applied to protein function reasoning on protein-protein networks, fraud detection, and missing link prediction problems. k-core decomposition, with its linear time complexity, is scalable to large real-life networks as long as the input graph fits in main memory. k-truss, on the other hand, is computationally more intensive because its definition relies on triangles, and there is no linear time algorithm available for it. In this paper, we propose distributed algorithms on Apache Spark for k-truss and k-core decomposition of a graph. We also compare the performance of our algorithms with state-of-the-art Map-Reduce and parallel algorithms using openly available real-world network data.
Our proposed algorithms have shown substantial performance improvement.

Item: Learning Analytics and the Academic Library: Professional Ethics Commitments at a Crossroads (ACRL, 2018)
Jones, Kyle M. L.; Library and Information Science, School of Informatics and Computing
In this paper, the authors address learning analytics and the ways academic libraries are beginning to participate in wider institutional learning analytics initiatives. Since there are moral issues associated with learning analytics, the authors consider how data mining practices run counter to ethical principles in the American Library Association’s “Code of Ethics.” Specifically, the authors address how learning analytics implicates professional commitments to promote intellectual freedom; protect patron privacy and confidentiality; and balance intellectual property interests between library users, their institution, and content creators and vendors. The authors recommend that librarians should embed their ethical positions in technological designs, practices, and governance mechanisms.

Item: Nursing in the spotlight: Talk about nurses and the nursing profession on Twitter during the early COVID-19 pandemic (Elsevier, 2022)
Miller, Wendy R.; Malloy, Caeli; Mravec, Michelle; Sposato, Margaret F.; Groves, Doyle; School of Nursing
Background: Nurses comprise the largest portion of healthcare workers and are integral to the COVID-19 response. Twitter has become a popular platform for the public, including nurses, to engage in pandemic-related discourse. Purpose: We sought to analyze the representation of the nursing profession and characterize nurses’ experiences during the pandemic from tweets published in April 2020. Methods: We analyzed tweets using natural language processing, Word Adjacency Graph (WAG) Modeling, and thematic analysis. Authors independently reviewed 10% of raw tweets in each WAG-generated topic, qualitatively analyzed tweets, and identified emerging themes.
Findings: Six themes emerged: Support and Recognition of Nurses, Military Metaphors, Superhuman/Spiritual Metaphors, Advocacy, Personal Experiences with Nurses, and Social/Political Commentary. Public perception of nurses was positive, but nurses conveyed harsh realities of their work. Discussion: Findings highlight discrepancies in nursing experiences and public perceptions of nursing. Further research should accurately identify and convey the complexities of the nursing profession.

Item: Sample Size Determination for Subsampling in the Analysis of Big Data, Multiplicative Models for Confidence Intervals and Free-Knot Changepoint Models (2024-05)
Zhang, Sheng; Peng, Hanxiang; Tan, Fei; Sarkar, Jyoti; Boukai, Ben
The dissertation consists of three parts. Motivated by subsampling in the analysis of Big Data and by data-splitting in machine learning, sample size determination for multidimensional parameters is presented in the first part. In the second part, we propose a novel approach to the construction of confidence intervals based on improved concentration inequalities. We provide the missing factor for the tail probability of a random variable which generalizes Talagrand’s (1995) result of the missing factor in Hoeffding’s inequalities. We give the procedure for constructing confidence intervals and illustrate it with simulations. In the third part, we study irregular change-point models using free-knot splines. The consistency and asymptotic normality of the least squares estimators are proved for the irregular models in which the linear spline is not differentiable. Simulations are carried out to explore the numerical properties of the proposed models. The results are used to analyze the US Covid-19 data.

Item: A Smart and Interactive Edge-Cloud Big Data System (2021-08)
Stauffer, Jake; Zhang, Qingxue; King, Brian; Fang, Shiaofen
Data and information have increased exponentially in recent years. The promising era of big data is advancing many new practices.
One of the emerging big data applications is healthcare. Large quantities of data with varying complexities have been leading to a great need for smart and secure big data systems. The mobile edge, more specifically the smartphone, is a natural source of big data and is ubiquitous in our daily lives. Smartphones offer a variety of sensors, which make them a very valuable source of data that can be used for analysis. Since this data comes directly from personal phones, it is sensitive and must be handled in a smart and secure way. In addition to generating data, it is also important to interact with the big data. Therefore, it is critical to create edge systems that enable users to access their data and ensure that these applications are smart and secure. As the first major contribution of this thesis, we have implemented a mobile edge system, called s2Edge. This edge system leverages Amazon Web Services (AWS) security features and is backed by an AWS cloud system. The implemented mobile application securely logs in, signs up, and signs out users, as well as connects users to the vast amounts of data they generate. With a high interactive capability, the system allows users (like patients) to retrieve and view their data and records, as well as communicate with the cloud users (like physicians). The resulting mobile edge system is promising and is expected to demonstrate the potential of smart and secure big data interaction. The smart and secure transmission and management of big data on the cloud is essential for healthcare big data, including both patient information and patient measurements. The second major contribution of this thesis is to demonstrate a novel big data cloud system, s2Cloud, which can help enhance healthcare systems to better monitor patients and give doctors critical insights into their patients' health.
s2Cloud achieves big data security through secure sign-up and login for the doctors, as well as data transmission protection. The system allows the doctors to manage both patients and their records effectively. The doctors can add and edit the patient and record information through the interactive website. Furthermore, the system supports both real-time and historical modes for big data management. Therefore, the patient measurement information can not only be visualized and demonstrated in real time but also be retrieved for further analysis. The smart website also allows doctors and patients to interact with each other effectively through instantaneous chat. Overall, the proposed s2Cloud system, empowered by smart secure design innovations, has demonstrated the feasibility and potential for healthcare big data applications. This study will further broadly benefit and advance other smart home and world big data applications.