Browsing by Author "Shah, Setu"
Are Recent Terrorism Trends Reflected in Social Media? (IEEE, 2017-10) Terziyska, Ivana; Shah, Setu; Luo, Xiao; Engineering Technology, School of Engineering and Technology
Social media plays an important role in shaping the beliefs and sentiments of an audience regarding an event. A comparison between public data sets that have holistic features and social media data sets that include more user features would give insight into the spread of misinformation and the aspects of events that are reflected in user behavior. In this research, we compare the trends identified in a public data set, the Global Terrorism Database (GTD), with the trends reflected in social media data obtained using the Twitter API. The unsupervised learning algorithm Self-Organizing Map (SOM) is used to identify the features and trends summarized by the clusters. The results show discrepancies between the features and related trends of terrorism events in the GTD data set and those in the obtained Twitter data set, which suggests some media bias and public perception of terrorism.

Biomedical concept association and clustering using word embeddings (2018-12) Shah, Setu; Luo, Xiao; El-Sharkawy, Mohamed; King, Brian
Biomedical data exists in the form of journal articles, research studies, electronic health records, care guidelines, etc. While text mining and natural language processing tools have been widely employed across various domains, they are just taking off in the healthcare space. A primary hurdle that makes it difficult to build artificial intelligence models that use biomedical data is the limited amount of labelled data available. Since most models rely on supervised or semi-supervised methods, generating large amounts of pre-processed labelled data that can be used for training purposes becomes extremely costly. Even for datasets that are labelled, the lack of normalization of biomedical concepts further affects the quality of the results produced and limits the application to a restricted dataset.
This affects the reproducibility of the results and techniques across datasets, making it difficult to deploy research solutions to improve healthcare services. The research presented in this thesis focuses on reducing the need to create labels for biomedical text mining by using unsupervised recurrent neural networks. The proposed method utilizes word embeddings to generate vector representations of biomedical concepts based on semantics and context. Experiments with unsupervised clustering of these biomedical concepts show that concepts that are similar to each other are clustered together. While this clustering captures different synonyms of the same concept, it also captures the similarities between various diseases and the symptoms associated with those diseases. To test the performance of the concept vectors on corpora of documents, a document vector generation method that utilizes these concept vectors is also proposed. The document vectors thus generated are used as input to clustering algorithms, and the results show that across multiple corpora, the proposed methods of concept and document vector generation outperform the baselines and provide more meaningful clustering. This document clustering has broad applications, especially in the search and retrieval space, providing clinicians, researchers and patients with more holistic and comprehensive results than relying on the exact term that they search for. Finally, a framework is presented for extracting clinical information from preventive care guidelines that can be mapped to electronic health records. The extracted information can be integrated with the clinical decision support system of an electronic health record. A visualization tool to better understand and observe patient trajectories is also explored.
Both of these methods have the potential to improve the preventive care services provided to patients.

Comparison of Deep Learning based Concept Representations for Biomedical Document Clustering (IEEE, 2018) Shah, Setu; Luo, Xiao; Computer and Information Science, School of Science
In this research, document representations based on distributed representations of the concepts, along with new weighting schemes for the documents, are explored. The baseline weighting scheme is the traditional Term Frequency-Inverse Document Frequency (TF-IDF) of the concepts, whereas the two newly proposed schemes consider both local content, using the TF-IDF, and associations between concepts. The distributed representations of the concepts are measured using a deep learning algorithm. The evaluation of the proposed document representations is based on the k-means clustering results. The results show that the document representation based on TF-IDF in combination with the term-based distributed representations for concepts outperforms the other two on the returned evaluation metrics: F1-measure (80.21%) and Purity (77.1%).

Concept embedding-based weighting scheme for biomedical text clustering and visualization (BioMed Central, 2018-11-01) Luo, Xiao; Shah, Setu; Computer Information and Graphics Technology, School of Engineering and Technology
Biomedical text clustering is a text mining technique used to provide better document search, browsing, and retrieval in biomedical and clinical text collections. In this research, the document representation based on the concept embedding along with the proposed weighting scheme is explored. The concept embedding is learned through neural networks to capture the associations between the concepts. The proposed weighting scheme makes use of the concept associations to build document vectors for clustering.
We evaluate two types of concept embedding and the new weighting scheme for text clustering and visualization on two different biomedical text collections. The returned results demonstrate that the concept embedding along with the new weighting scheme performs better than the baseline tf–idf for clustering and visualization. Based on the internal clustering evaluation metric, the Davies–Bouldin index, and the visualization, the concept embedding generated from aggregated word embedding can form well-separated clusters, whereas the intact concept embedding can better identify more clusters of specific diseases and achieve a better F-measure.

Differential Learning for Outliers: A Case Study of Water Demand Prediction (MDPI, 2018-11) Shah, Setu; Ben Miled, Zina; Schaefer, Rebecca; Berube, Steve; Electrical and Computer Engineering, School of Engineering and Technology
Predicting water demands is becoming increasingly critical because of the scarcity of this natural resource. In fact, the subject has been the focus of numerous studies by a large number of researchers around the world. Several models have been proposed that are able to predict water demands using both statistical and machine learning techniques. These models have successfully identified features that can impact water demand trends for rural and metropolitan areas. However, while the above models, including recurrent network models proposed by the authors, are able to predict normal water demands, most have difficulty estimating potential deviations from the norms. Outliers in water demand can be due to various reasons, including high temperatures and voluntary or mandatory consumption restrictions imposed by the water utility companies. Estimating these deviations is necessary, especially for water utility companies with a small service footprint, in order to efficiently plan water distribution. This paper proposes a differential learning model that can help model both over-consumption and under-consumption.
The proposed differential model builds on a previously proposed recurrent neural network model that was successfully used to predict water demand in central Indiana.

Exploring diseases based biomedical document clustering and visualization using self-organizing maps (IEEE, 2017-10) Shah, Setu; Luo, Xiao; Computer and Information Science, School of Science
Document clustering is a text mining technique used to provide better document search and browsing in digital libraries or online corpora. In this research, a vector representation of concepts of diseases and a similarity measurement between concepts are proposed. They identify the closest concepts of diseases in the context of a corpus. Each document is represented using the vector space model. A weighting scheme is proposed to consider both local content and associations between concepts. Self-Organizing Maps (SOM) are often used as a document clustering algorithm. The vector projection and visualization features of SOM enable visualization and analysis of the cluster distribution and relationships in the two-dimensional space. The Davies-Bouldin index is used to validate the clusters based on the visualized cluster distributions. The results show that the proposed document clustering framework generates meaningful clusters and can facilitate clustering visualization and information retrieval based on the concepts of diseases.

A natural language processing framework to analyse the opinions on HPV vaccination reflected in twitter over 10 years (2008 - 2017) (Taylor & Francis, 2019-07-16) Luo, Xiao; Zimet, Gregory; Shah, Setu; Computer Information and Graphics Technology, School of Engineering and Technology
In this research, we developed a natural language processing (NLP) framework to investigate the opinions on HPV vaccination reflected on Twitter over a 10-year period, 2008–2017. The NLP framework includes sentiment analysis, entity analysis, and artificial intelligence (AI)-based phrase association mining.
The sentiment analysis demonstrates the sentiment fluctuation over the past 10 years. The results show that there are more negative tweets in 2008 to 2011 and 2015 to 2016. The entity extraction and analysis help to identify the organization, geographical location and event entities associated with the negative and positive tweets. The results show that organization entities such as FDA, CDC and Merck occur in both negative and positive tweets of almost every year, whereas the geographical location entities mentioned in both negative and positive tweets change from year to year. This is due to the specific events that happened in those different locations. The objective of the AI-based phrase association mining is to identify the main topics reflected in both negative and positive tweets and the detailed tweet content. Through the phrase association mining, we found that the main negative topics on Twitter include “injuries”, “deaths”, “scandal”, “safety concerns”, and “adverse/side effects”, whereas the main positive topics include “cervical cancers”, “cervical screens”, “prevents”, and “vaccination campaigns”. We believe the results of this research can help public health researchers better understand the nature of social media influence on HPV vaccination attitudes and develop strategies to counter the proliferation of misinformation.

Neural networks for mining the associations between diseases and symptoms in clinical notes (Springer, 2018-11-28) Shah, Setu; Luo, Xiao; Kanakasabai, Saravanan; Tuason, Ricardo; Klopper, Gregory; Engineering Technology, School of Engineering and Technology
Analyzing the narrative clinical notes in Electronic Health Records (EHRs) is challenging because of their unstructured nature. Mining the associations between the clinical concepts within the clinical notes can support physicians in making decisions, and provide researchers with evidence about disease development and treatment.
In this paper, in order to model and analyze disease and symptom relationships in the clinical notes, we present a concept association mining framework that is based on word embedding learned through neural networks. The approach is tested using 154,738 clinical notes from 500 patients, which are extracted from the Indiana University Health’s Electronic Health Records system. All patients are diagnosed with more than one type of disease. The results show that this concept association mining framework can identify related diseases and symptoms. We also propose a method to visualize a patient’s diseases and related symptoms in chronological order. This visualization can provide physicians with an overview of the medical history of a patient and support decision making. The presented approach can also be expanded to analyze the associations of other clinical concepts, such as social history, family history, medications, etc.

A Water Demand Prediction Model for Central Indiana (AAAI, 2018) Shah, Setu; Hosseini, Mahmood; Miled, Zina Ben; Shafer, Rebecca; Berube, Steve; Electrical and Computer Engineering, School of Engineering and Technology
Due to limited natural water resources and the increase in population, managing water consumption is becoming an increasingly important subject worldwide. In this paper, we present and compare different machine learning models that are able to predict water demand for Central Indiana. The models are developed for two different time scales: daily and monthly. The input features for the proposed model include weather conditions (temperature, rainfall, snow), social features (holiday, median income), date (day of the year, month), and operational features (number of customers, previous water demand levels). The importance of these input features as accurate predictors is investigated.
The results show that the daily and monthly models based on recurrent neural networks produced the best results, with an average prediction error of 1.69% and 2.29%, respectively, for 2016. These models achieve high accuracy with a limited set of input features.