Browsing by Author "Dundar, Murat"
Now showing 1 - 10 of 28
Item: Applications of Data Mining in Healthcare (2019-05)
Peng, Bo; Mohler, George; Dundar, Murat; Zheng, Jiang Yu

With increases in the quantity and quality of healthcare-related data, data mining tools have the potential to improve people's standard of living through personalized and predictive medicine. In this thesis we improve the state of the art in data mining for several problems in the healthcare domain. In problems such as drug-drug interaction prediction and Alzheimer's Disease (AD) biomarker discovery and prioritization, current methods either require tedious feature engineering or have unsatisfactory performance. New, effective computational tools are needed to tackle these complex problems. In this dissertation, we develop new algorithms for two healthcare problems: high-order drug-drug interaction prediction and amyloid imaging biomarker prioritization in Alzheimer's Disease. Drug-drug interactions (DDIs) and their associated adverse drug reactions (ADRs) represent a significant detriment to public health. Existing research on DDIs primarily focuses on pairwise DDI detection and prediction; effective computational methods for high-order DDI prediction are needed. In this dissertation, I present D3I, a deep-learning-based model for cardinality-invariant and order-invariant high-order DDI prediction. The proposed models achieve an F1 score of 0.740 and an AUC of 0.847 on high-order DDI prediction, and outperform classical methods on order-2 DDI prediction. These results demonstrate the strong potential of D3I and deep-learning-based models in tackling the prediction problems of high-order DDIs and their induced ADRs. The second problem I consider in this thesis is amyloid imaging biomarker discovery, for which I propose an innovative machine learning paradigm enabling precision medicine in this domain. The paradigm tailors the imaging biomarker discovery process to the individual characteristics of a given patient.
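The cardinality- and order-invariant prediction idea described above can be illustrated with a minimal permutation-invariant set classifier. This is a hypothetical sketch, not the paper's D3I architecture: the embedding size, mean pooling, and random weights are all assumptions standing in for learned parameters.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: each drug is mapped to a learned embedding vector.
EMB_DIM = 8
drug_embeddings = {d: rng.normal(size=EMB_DIM) for d in ["d1", "d2", "d3", "d4"]}

# Toy weights standing in for a trained prediction head.
W = rng.normal(size=(EMB_DIM, 1))
b = 0.1

def interaction_score(drug_set):
    """Score a drug set of any size: mean-pool embeddings, then a linear head.

    Mean pooling makes the score invariant to the order of the drugs and
    well-defined for any set cardinality.
    """
    pooled = np.mean([drug_embeddings[d] for d in drug_set], axis=0)
    logit = float(pooled @ W + b)
    return 1.0 / (1.0 + np.exp(-logit))  # sigmoid -> interaction probability

s_abc = interaction_score(["d1", "d2", "d3"])
s_cba = interaction_score(["d3", "d2", "d1"])   # same set, different order
s_pair = interaction_score(["d1", "d4"])        # different cardinality
print(s_abc, s_cba, s_pair)
```

Because the pooling operation is symmetric, the score is unchanged under any reordering of the drug set, and the same head scores order-2 and order-3 sets alike.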
I implement this paradigm using PLTR, a newly developed learning-to-rank method. The PLTR model seamlessly integrates two objectives for joint optimization: pushing up relevant biomarkers and ranking among relevant biomarkers. The empirical study of PLTR conducted on the ADNI data yields promising results in identifying and prioritizing individual-specific amyloid imaging biomarkers based on an individual's structural MRI data. The resulting top-ranked imaging biomarkers have the potential to aid personalized diagnosis and disease subtyping.

Item: Automated Assessment of Disease Progression in Acute Myeloid Leukemia by Probabilistic Analysis of Flow Cytometry Data (Institute of Electrical and Electronics Engineers, 2017-05)
Rajwa, Bartek; Wallace, Paul K.; Griffiths, Elizabeth A.; Dundar, Murat; Computer and Information Science, School of Science

OBJECTIVE: Flow cytometry (FC) is a widely acknowledged technology in the diagnosis of acute myeloid leukemia (AML) and has been indispensable in determining progression of the disease. Although FC plays a key role as a post-therapy prognosticator and evaluator of therapeutic efficacy, the manual analysis of cytometry data is a barrier to optimizing reproducibility and objectivity. This study investigates the utility of our recently introduced nonparametric Bayesian framework in accurately predicting the direction of change in disease progression in AML patients using FC data. METHODS: The highly flexible nonparametric Bayesian model, based on an infinite mixture of infinite Gaussian mixtures, is used to jointly model data from multiple FC samples and automatically identify functionally distinct cell populations and their local realizations. Phenotype vectors are obtained by characterizing each sample by the proportions of recovered cell populations, which are, in turn, used to predict the direction of change in disease progression for each patient.
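The phenotype-vector step described above can be sketched at toy scale. The per-event population labels below are fabricated stand-ins for the output of the nonparametric Bayesian clustering; only the proportion-vector computation is illustrated, not the paper's model.

```python
import numpy as np

# Illustrative cluster assignments: each FC sample is reduced to a vector of
# per-event cell-population labels recovered by some upstream clustering step.
K = 3  # number of recovered cell populations (assumed known here)
samples = {
    "t0": np.array([0, 0, 0, 1, 2, 2]),  # baseline sample
    "t1": np.array([0, 1, 1, 1, 2, 2]),  # follow-up sample
}

def phenotype_vector(labels, k):
    """Characterize a sample by the proportion of events in each population."""
    counts = np.bincount(labels, minlength=k)
    return counts / counts.sum()

v0 = phenotype_vector(samples["t0"], K)
v1 = phenotype_vector(samples["t1"], K)

# Change between time points, e.g. fed to a downstream direction classifier.
delta = v1 - v0
print(v0, v1, delta)
```

Each phenotype vector sums to one, so the difference vector isolates shifts in population proportions between time points.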
RESULTS: We used 200 diseased and nondiseased immunophenotypic panels for training and tested the system with 36 additional AML cases collected at multiple time points. The proposed framework identified the direction of change in disease progression with accuracies of 90% (nine out of ten) for relapsing cases and 100% (26 out of 26) for the remaining cases. CONCLUSIONS: We believe that these promising results are an important first step toward the development of automated predictive systems for disease monitoring and continuous response evaluation. SIGNIFICANCE: Automated measurement and monitoring of therapeutic response is critical not only for objective evaluation of disease status and prognosis but also for timely assessment of treatment strategies.

Item: Batch Discovery of Recurring Rare Classes toward Identifying Anomalous Samples (ACM, 2014)
Dundar, Murat; Yerebakan, Halid Ziya; Rajwa, Bartek; Computer and Information Science, School of Science

We present a clustering algorithm for discovering rare yet significant recurring classes across a batch of samples in the presence of random effects. We model each sample's data by an infinite mixture of Dirichlet-process Gaussian-mixture models (DPMs), with each DPM representing the noisy realization of its corresponding class distribution in a given sample. We introduce dependencies across multiple samples by placing a global Dirichlet process prior over the individual DPMs. This hierarchical prior introduces a sharing mechanism across samples and allows for identifying local realizations of classes across samples. We use a collapsed Gibbs sampler for inference to recover local DPMs and identify their class associations. We demonstrate the utility of the proposed algorithm by processing a flow cytometry data set containing two extremely rare cell populations, and report results that significantly outperform competing techniques.
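A single-sample flavor of the Dirichlet-process Gaussian mixtures above can be approximated with scikit-learn's truncated variational `BayesianGaussianMixture`. This is an assumption-laden sketch: it is one sample with variational inference, not the hierarchical collapsed-Gibbs model of the paper, and the data are synthetic, with a deliberately rare third population.

```python
import numpy as np
from sklearn.mixture import BayesianGaussianMixture

rng = np.random.default_rng(0)

# Synthetic "sample": two abundant populations plus one rare one.
X = np.vstack([
    rng.normal([0.0, 0.0], 0.3, size=(200, 2)),
    rng.normal([4.0, 4.0], 0.3, size=(200, 2)),
    rng.normal([0.0, 4.0], 0.3, size=(8, 2)),   # rare population
])

# Truncated DP mixture: unused components get near-zero weight automatically,
# so the effective number of populations is inferred rather than fixed.
dpgmm = BayesianGaussianMixture(
    n_components=10,
    weight_concentration_prior_type="dirichlet_process",
    covariance_type="full",
    random_state=0,
    max_iter=500,
).fit(X)

labels = dpgmm.predict(X)
n_effective = int(np.sum(dpgmm.weights_ > 0.01))
print("effective components:", n_effective)
```

The truncation level (10 components) is only an upper bound; the DP prior concentrates the mixture weights on as many components as the data support.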
The source code of the proposed algorithm is available on the web at http://cs.iupui.edu/~dundar/aspire.htm.

Item: Bayesian Non-Exhaustive Classification A Case Study: Online Name Disambiguation using Temporal Record Streams (ACM, 2016-10)
Zhang, Baichuan; Dundar, Murat; Al Hasan, Mohammad; Department of Computer and Information Science, School of Science

The name entity disambiguation task aims to partition the records of multiple real-life persons so that each partition contains records pertaining to a unique person. Most of the existing solutions for this task operate in a batch mode, where all records to be disambiguated are initially available to the algorithm. However, more realistic settings require that the name disambiguation task be performed in an online fashion, in addition to being able to identify records of new ambiguous entities having no preexisting records. In this work, we propose a Bayesian non-exhaustive classification framework for solving the online name disambiguation task. Our proposed method uses a Dirichlet process prior with a Normal × Normal × Inverse-Wishart data model, which enables identification of new ambiguous entities who have no records in the training data. For online classification, we use a one-sweep Gibbs sampler, which is very efficient and effective. As a case study we consider bibliographic data in a temporal stream format and disambiguate authors by partitioning their papers into homogeneous groups. Our experimental results demonstrate that the proposed method outperforms existing methods for online name disambiguation.

Item: Bayesian Zero-Shot Learning (Springer, 2020)
Badirli, Sarkhan; Akata, Zeynep; Dundar, Murat; Computer and Information Science, School of Science

Object classes that surround us have a natural tendency to emerge at varying levels of abstraction.
We propose a Bayesian approach to zero-shot learning (ZSL) that introduces the notion of meta-classes and implements a Bayesian hierarchy around these classes to effectively blend data likelihood with local and global priors. Local priors, driven by data from seen classes (i.e., classes available at training time), become instrumental in recovering unseen classes (i.e., classes that are missing at training time) in a generalized ZSL (GZSL) setting. Hyperparameters of the Bayesian model offer a convenient way to optimize the trade-off between seen- and unseen-class accuracy. We conduct experiments on seven benchmark datasets, including the large-scale ImageNet, and show that our model produces promising results in the challenging GZSL setting.

Item: Benchmarking authorship attribution techniques using over a thousand books by fifty Victorian era novelists (2018-04-03)
Gungor, Abdulmecit; Dundar, Murat

Authorship attribution (AA) is the process of identifying the author of a given text; from the machine learning perspective, it can be seen as a classification problem. The literature offers many classification methods paired with feature extraction techniques. In this thesis, we explore information retrieval techniques such as Word2Vec, paragraph2vec, and other useful feature selection and extraction techniques for a given text with different classifiers. We have performed experiments on novels extracted from the GDELT database using different features such as bag of words, n-grams, and newly developed techniques like Word2Vec. To improve our success rate, we have combined several useful features, including a diversity measure of the text, bag of words, bigrams, and specific words that are spelled differently by English and American authors. Support vector machine classifiers of the nu-SVC type are observed to give the best success rates on the stacked feature set.
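A minimal version of the nu-SVC authorship pipeline described above can be sketched with scikit-learn. The toy corpus, the character n-gram range, and the `nu` value are all assumptions for illustration; the thesis's actual stacked feature set is much richer.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import NuSVC

# Tiny toy corpus standing in for per-author book excerpts.
texts = [
    "It was the best of times, it was the worst of times.",
    "It is a far, far better thing that I do, than I have ever done.",
    "Whether I shall turn out to be the hero of my own life.",
    "The colour of the ship was grey, as grey as the harbour fog.",
    "The color of the wagon was gray under the prairie sky.",
    "My neighbor favored gray flannel and plain labor.",
]
authors = ["brit", "brit", "brit", "brit", "amer", "amer"]

# Character n-grams capture sub-word habits such as -our vs -or spellings,
# one of the cues (British vs American spelling) mentioned in the abstract.
clf = make_pipeline(
    TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 4)),
    NuSVC(nu=0.3, kernel="linear"),
)
clf.fit(texts, authors)

pred = clf.predict(["The grey harbour lay under fog."])[0]
print(pred)
```

In nu-SVC, `nu` bounds the fraction of margin errors from above and the fraction of support vectors from below, which is often easier to tune than the C parameter of standard SVC.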
The main purpose of this work is to lay the foundations of feature extraction techniques in AA, including lexical, character-level, syntactic, semantic, and application-specific features. We also aim to offer a new data resource for the authorship attribution research community and demonstrate how it can be used to extract features for any kind of AA problem. The dataset we introduce consists of works of Victorian era authors, and the main feature extraction techniques are shown with exemplary code snippets for audiences in different knowledge domains. Feature extraction approaches and their implementation with different classifiers are presented simply enough to also serve as a beginner's introduction to AA. Some feature extraction techniques introduced in this work are also meant to be employed in other NLP tasks, such as sentiment analysis with Word2Vec or text summarization. Using the introduced NLP tasks and feature extraction techniques, one can start to implement them on our dataset. We also introduce several ways to use the extracted features, such as feature stacking with different classifiers or using Word2Vec to create sentence-level vectors.

Item: A Case Study for Massive Text Mining: K Nearest Neighbor Algorithm on PubMed data (Office of the Vice Chancellor for Research, 2015-04-17)
Do, Nhan; Dundar, Murat

The US National Library of Medicine (NLM) has a huge collection of millions of books, journals, and other publications relating to the medical domain. NLM maintains a database called MEDLINE to store and link citations to these publications, allowing researchers and students to access and find medical articles easily. The public can search MEDLINE through a system called PubMed. When new PubMed documents become available online, curators have to manually decide the labels for them.
The process is tedious and time-consuming because there are more than 27,149 descriptors (MeSH terms). Although the curators already use a system called MTI for MeSH term suggestion, its performance needs to be improved. This research explores the use of text classification to annotate new PubMed documents automatically, efficiently, and with reasonable accuracy. The data are gathered from the BioASQ Contest and contain 4 million abstracts. The research process includes preprocessing the data, reducing the feature space, classifying, and evaluating the results. We focus on the K nearest neighbor algorithm in this case study.

Item: Classifying the Unknown: Identification of Insects by Deep Open-set Bayesian Learning (bioRxiv, 2021-09-17)
Badirli, Sarkhan; Picard, Christine J.; Mohler, George; Akata, Zeynep; Dundar, Murat

Insects represent a large majority of biodiversity on Earth, yet only 20% of the estimated 5.5 million insect species are currently described (1). While describing new species typically requires specific taxonomic expertise to identify morphological characters that distinguish them from other potential species, DNA-based methods have aided in providing additional evidence of separate species (2). Machine learning (ML) is emerging as a potential new approach to identifying new species, given that this analysis may be more sensitive to subtle differences humans may not perceive. Existing ML algorithms are limited by image repositories that do not include undescribed species. We developed a Bayesian deep learning method for the open-set classification of species. The proposed approach forms a Bayesian hierarchy of species around corresponding genera and uses deep embeddings of images and barcodes together to identify insects at the lowest level of abstraction possible. To demonstrate proof of concept, we used a database of 32,848 insect instances from 1,040 described species split into training and test data.
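The K nearest neighbor pipeline from the PubMed case study above — preprocess, reduce the feature space, classify — can be sketched at toy scale. The mini abstracts, labels, and choice of k are invented for illustration, not drawn from the BioASQ data.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import KNeighborsClassifier

# Toy stand-ins for PubMed abstracts with curator-assigned MeSH-like labels.
abstracts = [
    "amyloid plaques in alzheimer disease and cognitive decline",
    "tau pathology and memory loss in alzheimer patients",
    "flow cytometry analysis of leukemia cell populations",
    "immunophenotyping of acute myeloid leukemia by cytometry",
]
labels = ["neurodegeneration", "neurodegeneration", "hematology", "hematology"]

# Step 1: preprocess + reduce the feature space with tf-idf term weighting
# (stop-word removal discards uninformative dimensions).
vectorizer = TfidfVectorizer(stop_words="english")
X = vectorizer.fit_transform(abstracts)

# Step 2: classify a new abstract by its nearest labeled neighbor(s).
knn = KNeighborsClassifier(n_neighbors=1, metric="cosine")
knn.fit(X, labels)

query = vectorizer.transform(["cytometry of leukemia blasts"])
pred = knn.predict(query)[0]
print(pred)
```

Cosine distance on tf-idf vectors is a common choice for text kNN because it compares term-weight direction rather than document length.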
The test data included 243 species not present in the training data. Our results demonstrate that using DNA sequences and images together, insect instances of described species can be classified with 96.66% accuracy, while achieving 81.39% accuracy in identifying the genera of insect instances of undescribed species. The proposed deep open-set Bayesian model demonstrates a powerful new approach that can be used for the gargantuan task of identifying new insect species.

Item: Classifying the unknown: Insect identification with deep hierarchical Bayesian learning (Wiley, 2023)
Badirli, Sarkhan; Picard, Christine Johanna; Mohler, George; Richert, Frannie; Akata, Zeynep; Dundar, Murat

1. Classifying insect species involves a tedious process of identifying distinctive morphological insect characters by taxonomic experts. Machine learning can harness the power of computers to potentially create an accurate and efficient method for performing this task at scale, given that its analytical processing can be more sensitive to subtle physical differences in insects, which experts may not perceive. However, existing machine learning methods are designed to classify insect samples only into described species, thus failing to identify samples from undescribed species.
2. We propose a novel deep hierarchical Bayesian model for insect classification, given the taxonomic hierarchy inherent in insects. This model can classify samples of both described and undescribed species; described samples are assigned a species, while undescribed samples are assigned a genus, which is a pivotal advancement over just identifying them as outliers. We demonstrated this proof of concept on a new database containing paired insect image and DNA barcode data from four insect orders, including 1040 species, which far exceeds the number of species used in existing work. A quarter of the species were excluded from the training set to simulate undescribed species.
3.
With the proposed classification framework using combined image and DNA data in the model, species classification accuracy for described species was 96.66% and genus classification accuracy for undescribed species was 81.39%. Including both data sources in the model resulted in significant improvement over including image data only (39.11% accuracy for described species and 35.88% genus accuracy for undescribed species), and modest improvement over including DNA data only (73.39% genus accuracy for undescribed species).
4. Unlike current machine learning methods, the proposed deep hierarchical Bayesian learning approach can simultaneously classify samples of both described and undescribed species, a functionality that could become instrumental in biodiversity monitoring across the globe. This framework can be customized for any taxonomic classification problem for which image and DNA data can be obtained, making it relevant across all biological kingdoms.

Item: Clustering patient mobility patterns to assess effectiveness of health-service delivery (BMC, 2017-07-04)
Delil, Selman; Çelik, Rahmi Nurhan; San, Sayın; Dundar, Murat; Computer and Information Science, School of Science

BACKGROUND: Analysis of patient mobility in a country not only gives an idea of how the health-care system works but can also serve as a guideline for determining the quality of health care and health disparities among regions. Even though determining patient movement is important, it is not often realized that patient mobility can have a unique pattern beyond health-related endowments (e.g., facilities, medical staff). This study therefore addresses the following research question: Is there a way to identify regions with similar patterns using the spatio-temporal distribution of patient mobility? The aim of the paper is to answer this question and improve a classification method useful for populous countries like Turkey that have many administrative areas.
METHODS: The data used in the study consist of spatio-temporal information on patient mobility for the period between 2009 and 2013. Patient mobility patterns, based on the number of patients attracted/escaping across the 81 provinces of Turkey, are illustrated graphically. Hierarchical clustering is used to group provinces in terms of the mobility characteristics revealed by the patterns. Clustered groups of provinces are analyzed using non-parametric statistical tests to identify potential correlations between the clustered groups and selected basic health indicators. RESULTS: Ineffective health-care delivery in certain regions of Turkey was determined by identifying patient mobility patterns. High escape values obtained for a large number of provinces suggest poor health-care accessibility. On the other hand, over the period studied, visualization of temporal mobility revealed a considerable decrease in the escape ratio for inadequately equipped provinces. Among four of the twelve clusters created using hierarchical clustering, which include 64 of the 81 Turkish provinces, there was a statistically significant relationship between the patterns and the selected basic health indicators of the clusters. The remaining eight clusters included 17 provinces and showed anomalies. CONCLUSIONS: The most important contribution of this study is the development of a way to identify patient mobility patterns by analyzing patient movements across clusters. These results provide strong evidence that patient mobility patterns are a useful tool for decisions concerning the distribution of health-care services and the provision of health-care equipment to the provinces.
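The hierarchical clustering step in METHODS can be sketched as follows. The province names and the two-dimensional mobility features (patients attracted, patients escaping) are fabricated stand-ins for the study's spatio-temporal summaries, and the linkage method and cut level are assumptions.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

# Fabricated per-province features: [patients attracted, patients escaping].
provinces = ["A", "B", "C", "D", "E", "F"]
features = np.array([
    [900.0, 50.0],   # A: strong attractor
    [880.0, 60.0],   # B: strong attractor
    [40.0, 700.0],   # C: high escape
    [55.0, 720.0],   # D: high escape
    [300.0, 310.0],  # E: balanced
    [290.0, 295.0],  # F: balanced
])

# Ward linkage builds the dendrogram; cutting it into a fixed number of
# groups yields cluster labels per province.
Z = linkage(features, method="ward")
groups = fcluster(Z, t=3, criterion="maxclust")
clusters = dict(zip(provinces, groups))
print(clusters)
```

The cluster labels can then be cross-tabulated against health indicators, as the abstract describes, to test whether mobility-defined groups differ significantly.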