- Computer & Information Science Department Theses and Dissertations
Computer & Information Science Department Theses and Dissertations
Permanent URI for this collection
For more information about the Computer & Information Science graduate programs visit: https://science.indianapolis.iu.edu.
Browse
Recent Submissions
Item Network Alignment Using Topological and Node Embedding Features(2024-08) Almulhim, Aljohara; Al Hasan, Mohammad; Tuceryan, Mihran; Durresi, Arjan; Mukhopadhyay, Snehasis; Jang, HyejuIn today's big data environment, development of robust knowledge discovery solutions depends on integration of data from various sources. For example, intelligence agencies fuse data from multiple sources to identify criminal activities; e-commerce platforms consolidate user activities on various platforms and devices to build better user profile; scientists connect data from various modality to develop new drugs, and treatments. In all such activities, entities from different data sources need to be aligned---first, to ensure accurate analysis and more importantly, to discover novel knowledge regarding these entities. If the data sources are networks, aligning entities from different sources leads to the task of network alignment, which is the focus of this thesis. The main objective of this task is to find an optimal one-to-one correspondence among nodes in two or more networks utilizing graph topology and nodes/edges attributes. In existing works, diverse computational schemes have been adopted for solving the network alignment task; these schemes include finding eigen-decomposition of similarity matrices, solving quadratic assignment problems via sub-gradient optimization, and designing iterative greedy matching techniques. Contemporary works approach this problem using a deep learning framework by learning node representations to identify matches. Node matching's key challenges include computational complexity and scalability. However, privacy concerns or unavailability often prevent the utilization of node attributes in real-world scenarios. In light of this, we aim to solve this problem by relying solely on the graph structure, without the need for prior knowledge, external attributes, or guidance from landmark nodes. Clearly, topology-based matching emerges as a hard problem when compared to other network matching tasks. In this thesis, I propose two original works to solve network topology-based alignment task. The first work, Graphlet-based Alignment (Graphlet-Align), employs a topological approach to network alignment. Graphlet-Align represents each node with a local graphlet count based signature and use that as feature for deriving node to node similarity across a pair of networks. By using these similarity values in a bipartite matching algorithm Graphlet-Align obtains a preliminary alignment. It then uses high-order information extending to k-hop neighborhood of a node to further refine the alignment, achieving better accuracy. We validated Graphlet-Align's efficacy by applying it to various large real-world networks, achieving accuracy improvements ranging from $20\%$ to $72\%$ over state-of-the-art methods on both duplicated and noisy graphs. Expanding on this paradigm that focuses solely on topology for solving graph alignment, in my second work, I develop a self-supervised learning framework known as Self-Supervised Topological Alignment (SST-Align). SST-Align uses graphlet-based signature for creating self-supervised node alignment labels, and then use those labels to generate node embedding vectors of both the networks in a joint space from which node alignment task can be effectively and accurately solved. It starts with an optimization process that applies average pooling on top of the extracted graphlet signature to construct an initial node assignment. Next, a self-supervised Siamese network architecture utilizes both the initial node assignment and graph convolutional networks to generate node embeddings through a contrastive loss. By applying kd-tree similarity to the two networks' embeddings, we achieve the final node mapping. Extensive testing on real-world graph alignment datasets shows that our developed methodology has competitive results compared to seven existing competing models in terms of node mapping accuracy. Additionally, we establish the Ablation Study to evaluate the two-stage accuracy, excluding the learning representation part and comparing the mapping accuracy accordingly. This thesis enhances the theoretical understanding of topological features in the analysis of graph data for network alignment task, hence facilitating future advancements toward the field.Item Interactive Mitigation of Biases in Machine Learning Models(2024-08) Van Busum, Kelly; Fang, Shiaofen; Mukhopadhyay, Snehasis; Xia, Yuni; Tuceryan, MihranBias and fairness issues in artificial intelligence algorithms are major concerns as people do not want to use AI software they cannot trust. This work uses college admissions data as a case study to develop methodology to define and detect bias, and then introduces a new method for interactive bias mitigation. Admissions data spanning six years was used to create machine learning-based predictive models to determine whether a given student would be directly admitted into the School of Science under various scenarios at a large urban research university. During this time, submission of standardized test scores as part of a student’s application became optional which led to interesting questions about the impact of standardized test scores on admission decisions. We developed and analyzed predictive models to understand which variables are important in admissions decisions, and how the decision to exclude test scores affects the demographics of the students who are admitted. Then, using a variety of bias and fairness metrics, we analyzed these predictive models to detect biases the models may carry with respect to three variables chosen to represent sensitive populations: gender, race, and whether a student was the first in his/her family to attend college. We found that high accuracy rates can mask underlying algorithmic bias towards these sensitive groups. Finally, we describe our method for bias mitigation which uses a combination of machine learning and user interaction. Because bias is intrinsically a subjective and context-dependent matter, it requires human input and feedback. Our approach allows the user to iteratively and incrementally adjust bias and fairness metrics to change the training dataset for an AI model to make the model more fair. This interactive bias mitigation approach was then used to successfully decrease the biases in three AI models in the context of undergraduate student admissions.Item Investigation of Backdoor Attacks and Design of Effective Countermeasures in Federated Learning(2024-08) Palanisamy Sundar, Agnideven; Zou, Xukai; Li, Feng; Luo, Xiao; Hu, Qin; Tuceryan, MihranFederated Learning (FL), a novel subclass of Artificial Intelligence, decentralizes the learning process by enabling participants to benefit from a comprehensive model trained on a broader dataset without direct sharing of private data. This approach integrates multiple local models into a global model, mitigating the need for large individual datasets. However, the decentralized nature of FL increases its vulnerability to adversarial attacks. These include backdoor attacks, which subtly alter classification in some categories, and byzantine attacks, aimed at degrading the overall model accuracy. Detecting and defending against such attacks is challenging, as adversaries can participate in the system, masquerading as benign contributors. This thesis provides an extensive analysis of the various security attacks, highlighting the distinct elements of each and the inherent vulnerabilities of FL that facilitate these attacks. The focus is primarily on backdoor attacks, which are stealthier and more difficult to detect compared to byzantine attacks. We explore defense strategies effective in identifying malicious participants or mitigating attack impacts on the global model. The primary aim of this research is to evaluate the effectiveness and limitations of existing server-level defenses and to develop innovative defense mechanisms under diverse threat models. This includes scenarios where the server collaborates with clients to thwart attacks, cases where the server remains passive but benign, and situations where no server is present, requiring clients to independently minimize and isolate attacks while enhancing main task performance. Throughout, we ensure that the interventions do not compromise the performance of both global and local models. The research predominantly utilizes 2D and 3D datasets to underscore the practical implications and effectiveness of proposed methodologies.Item Trustworthy and Causal Artificial Intelligence in Environmental Decision Making(2024-05) Uslu, Suleyman; Durresi, Arjan; Tuceryan, Mihran; Dundar, Murat; Hu, QinWe present a framework for Trustworthy Artificial Intelligence (TAI) that dynamically assesses trust and scrutinizes past decision-making, aiming to identify both individual and community behavior. The modeling of behavior incorporates proposed concepts, namely trust pressure and trust sensitivity, laying the foundation for predicting future decision-making regarding community behavior, consensus level, and decision-making duration. Our framework involves the development and mathematical modeling of trust pressure and trust sensitivity, drawing on social validation theory within the context of environmental decision-making. To substantiate our approach, we conduct experiments encompassing (i) dynamic trust sensitivity to reveal the impact of learning actors between decision-making, (ii) multi-level trust measurements to capture disruptive ratings, and (iii) different distributions of trust sensitivity to emphasize the significance of individual progress as well as overall progress. Additionally, we introduce TAI metrics, trustworthy acceptance, and trustworthy fairness, designed to evaluate the acceptance of decisions proposed by AI or humans and the fairness of such proposed decisions. The dynamic trust management within the framework allows these TAI metrics to discern support for decisions among individuals with varying levels of trust. We propose both the metrics and their measurement methodology as contributions to the standardization of trustworthy AI. Furthermore, our trustability metric incorporates reliability, resilience, and trust to evaluate systems with multiple components. We illustrate experiments showcasing the effects of different trust declines on the overall trustability of the system. Notably, we depict the trade-off between trustability and cost, resulting in net utility, which facilitates decision-making in systems and cloud security. This represents a pivotal step toward an artificial control model involving multiple agents engaged in negotiation. Lastly, the dynamic management of trust and trustworthy acceptance, particularly in varying criteria, serves as a foundation for causal AI by providing inference methods. We outline a mechanism and present an experiment on human-driven causal inference, where participant discussions act as interventions, enabling counterfactual evaluations once actor and community behavior are modeled.Item Towards Representation Learning for Robust Network Intrusion Detection Systems(2024-05) Hosler, Ryan; Zou, Xukai; Li, Feng; Tsechpenakis, Gavriil; Durresi, Arjan; Hu, QinThe most cost-effective method for cybersecurity defense is prevention. Ideally, before a malicious actor steals information or affects the functionality of a network, a Network Intrusion Detection System (NIDS) will identify and allow for a complete prevention of an attack. For this reason, there are commercial availabilities for rule-based NIDS which will use a packet sniffer to monitor all incoming network traffic for potential intrusions. However, such a NIDS will only work on known intrusions, therefore, researchers have devised sophisticated Deep Learning methods for detecting malicious network activity. By using statistical features from network flows, such as packet count, connection duration, flow bytes per second, etc., a Machine Learning or Deep Learning NIDS may identify an advanced attack that would otherwise bypass a rule-based NIDS. For this research, the presented work will develop novel applications of Deep Learning for NIDS development. Specifically, an image embedding algorithms will be adapted to this domain. Moreover, novel methods for representing network traffic as a graph and applying Deep Graph Representation Learning algorithms for an NIDS will be considered. When compared to the existing state-of-the-art methods within NIDS literature, the methods developed in the research manage to outperform them on numerous Network Traffic Datasets. Furthermore, an NIDS was deployed and successfully configured to a live network environment. Another domain in which this research is applied to is Android Malware Detection. By analyzing network traffic produced by either a benign or malicious Android Application, current research has failed to accurately detect Android Malware. Instead, they rely on features which are extracted from the APK file itself. Therefore, this research presents a NIDS inspired Graph-Based model which demonstrably distinguishes benign and malicious applications through analysis of network traffic alone, which outperforms existing sophisticated malware detection frameworks.Item CyberWater: An Open Framework for Data and Model Integration(2024-05) Chen, Ranran; Liang, Yao; Song, Fengguang; Xia, Yuni; Zheng, JiangyuWorkflow management systems (WMSs) are commonly used to organize/automate sequences of tasks as workflows to accelerate scientific discoveries. During complex workflow modeling, a local interactive workflow environment is desirable, as users usually rely on their rich, local environments for fast prototyping and refinements before they consider using more powerful computing resources. This dissertation delves into the innovative development of the CyberWater framework based on Workflow Management Systems (WMSs). Against the backdrop of data-intensive and complex models, CyberWater exemplifies the transition of intricate data into insightful and actionable knowledge and introduces the nuanced architecture of CyberWater, particularly focusing on its adaptation and enhancement from the VisTrails system. It highlights the significance of control and data flow mechanisms and the introduction of new data formats for effective data processing within the CyberWater framework. This study presents an in-depth analysis of the design and implementation of Generic Model Agent Toolkits. The discussion centers on template-based component mechanisms and the integration with popular platforms, while emphasizing the toolkits ability to facilitate on-demand access to High-Performance Computing resources for large-scale data handling. Besides, the development of an asynchronously controlled workflow within CyberWater is also explored. This innovative approach enhances computational performance by optimizing pipeline-level parallelism and allows for on-demand submissions of HPC jobs, significantly improving the efficiency of data processing. A comprehensive methodology for model-driven development and Python code integration within the CyberWater framework and innovative applications of GPT models for automated data retrieval are introduced in this research as well. It examines the implementation of Git Actions for system automation in data retrieval processes and discusses the transformation of raw data into a compatible format, enhancing the adaptability and reliability of the data retrieval component in the adaptive generic model agent toolkit component. For the development and maintenance of software within the CyberWater framework, the use of tools like GitHub for version control and outlining automated processes has been applied for software updates and error reporting. Except that, the user data collection also emphasizes the role of the CyberWater Server in these processes. In conclusion, this dissertation presents our comprehensive work on the CyberWater framework’s advancements, setting new standards in scientific workflow management and demonstrating how technological innovation can significantly elevate the process of scientific discovery.Item Crime Detection from Pre-crime Video Analysis(2024-05) Kilic, Sedat; Tuceryan, Mihran; Zheng, Jiang Yu; Tsechpenakis, Gavriil; Durresi, ArjanThis research investigates the detection of pre-crime events, specifically targeting behaviors indicative of shoplifting, through the advanced analysis of CCTV video data. The study introduces an innovative approach that leverages augmented human pose and emotion information within individual frames, combined with the extraction of activity information across subsequent frames, to enhance the identification of potential shoplifting actions before they occur. Utilizing a diverse set of models including 3D Convolutional Neural Networks (CNNs), Graph Neural Networks (GNNs), Recurrent Neural Networks (RNNs), and a specially developed transformer architecture, the research systematically explores the impact of integrating additional contextual information into video analysis. By augmenting frame-level video data with detailed pose and emotion insights, and focusing on the temporal dynamics between frames, our methodology aims to capture the nuanced behavioral patterns that precede shoplifting events. The comprehensive experimental evaluation of our models across different configurations reveals a significant improvement in the accuracy of pre-crime detection. The findings underscore the crucial role of combining visual features with augmented data and the importance of analyzing activity patterns over time for a deeper understanding of pre-shoplifting behaviors. The study’s contributions are multifaceted, including a detailed examination of pre-crime frames, strategic augmentation of video data with added contextual information, the creation of a novel transformer architecture customized for pre-crime analysis, and an extensive evaluation of various computational models to improve predictive accuracy.Item Unraveling Complexity: Panoptic Segmentation in Cellular and Space Imagery(2024-05) Plebani, Emanuele; Dundar, Murat; Tuceryan, Mihran; Tsechpenakis, Gavriil; Al Hasan, MohammadAdvancements in machine learning, especially deep learning, have facilitated the creation of models capable of performing tasks previously thought impossible. This progress has opened new possibilities across diverse fields such as medical imaging and remote sensing. However, the performance of these models relies heavily on the availability of extensive labeled datasets. Collecting large amounts of labeled data poses a significant financial burden, particularly in specialized fields like medical imaging and remote sensing, where annotation requires expert knowledge. To address this challenge, various methods have been developed to mitigate the necessity for labeled data or leverage information contained in unlabeled data. These encompass include self-supervised learning, few-shot learning, and semi-supervised learning. This dissertation centers on the application of semi-supervised learning in segmentation tasks. We focus on panoptic segmentation, a task that combines semantic segmentation (assigning a class to each pixel) and instance segmentation (grouping pixels into different object instances). We choose two segmentation tasks in different domains: nerve segmentation in microscopic imaging and hyperspectral segmentation in satellite images from Mars. Our study reveals that, while direct application of methods developed for natural images may yield low performance, targeted modifications or the development of robust models can provide satisfactory results, thereby unlocking new applications like machine-assisted annotation of new data. This dissertation begins with a challenging panoptic segmentation problem in microscopic imaging, systematically exploring model architectures to improve generalization. Subsequently, it investigates how semi-supervised learning may mitigate the need for annotated data. It then moves to hyperspectral imaging, introducing a Hierarchical Bayesian model (HBM) to robustly classify single pixels. Key contributions of include developing a state-of-the-art U-Net model for nerve segmentation, improving the model's ability to segment different cellular structures, evaluating semi-supervised learning methods in the same setting, and proposing HBM for hyperspectral segmentation. The dissertation also provides a dataset of labeled CRISM pixels and mineral detections, and a software toolbox implementing the full HBM pipeline, to facilitate the development of new models.Item Deep Learning Based Methods for Automatic Extraction of Syntactic Patterns and their Application for Knowledge Discovery(2023-12-28) Kabir, Md. Ahsanul; Hasan, Mohammad Al; Mukhopadhyay, Snehasis; Tuceryan, Mihran; Fang, ShiaofenSemantic pairs, which consist of related entities or concepts, serve as the foundation for comprehending the meaning of language in both written and spoken forms. These pairs enable to grasp the nuances of relationships between words, phrases, or ideas, forming the basis for more advanced language tasks like entity recognition, sentiment analysis, machine translation, and question answering. They allow to infer causality, identify hierarchies, and connect ideas within a text, ultimately enhancing the depth and accuracy of automated language processing. Nevertheless, the task of extracting semantic pairs from sentences poses a significant challenge, necessitating the relevance of syntactic dependency patterns (SDPs). Thankfully, semantic relationships exhibit adherence to distinct SDPs when connecting pairs of entities. Recognizing this fact underscores the critical importance of extracting these SDPs, particularly for specific semantic relationships like hyponym-hypernym, meronym-holonym, and cause-effect associations. The automated extraction of such SDPs carries substantial advantages for various downstream applications, including entity extraction, ontology development, and question answering. Unfortunately, this pivotal facet of pattern extraction has remained relatively overlooked by researchers in the domains of natural language processing (NLP) and information retrieval. To address this gap, I introduce an attention-based supervised deep learning model, ASPER. ASPER is designed to extract SDPs that denote semantic relationships between entities within a given sentential context. I rigorously evaluate the performance of ASPER across three distinct semantic relations: hyponym-hypernym, cause-effect, and meronym-holonym, utilizing six datasets. My experimental findings demonstrate ASPER's ability to automatically identify an array of SDPs that mirror the presence of these semantic relationships within sentences, outperforming existing pattern extraction methods by a substantial margin. Second, I want to use the SDPs to extract semantic pairs from sentences. I choose to extract cause-effect entities from medical literature. This task is instrumental in compiling various causality relationships, such as those between diseases and symptoms, medications and side effects, and genes and diseases. Existing solutions excel in sentences where cause and effect phrases are straightforward, such as named entities, single-word nouns, or short noun phrases. However, in the complex landscape of medical literature, cause and effect expressions often extend over several words, stumping existing methods, resulting in incomplete extractions that provide low-quality, non-informative, and at times, conflicting information. To overcome this challenge, I introduce an innovative unsupervised method for extracting cause and effect phrases, PatternCausality tailored explicitly for medical literature. PatternCausality employs a set of cause-effect dependency patterns as templates to identify the key terms within cause and effect phrases. It then utilizes a novel phrase extraction technique to produce comprehensive and meaningful cause and effect expressions from sentences. Experiments conducted on a dataset constructed from PubMed articles reveal that PatternCausality significantly outperforms existing methods, achieving a remarkable order of magnitude improvement in the F-score metric over the best-performing alternatives. I also develop various PatternCausality variants that utilize diverse phrase extraction methods, all of which surpass existing approaches. PatternCausality and its variants exhibit notable performance improvements in extracting cause and effect entities in a domain-neutral benchmark dataset, wherein cause and effect entities are confined to single-word nouns or noun phrases of one to two words. Nevertheless, PatternCausality operates within an unsupervised framework and relies heavily on SDPs, motivating me to explore the development of a supervised approach. Although SDPs play a pivotal role in semantic relation extraction, pattern-based methodologies remain unsupervised, and the multitude of potential patterns within a language can be overwhelming. Furthermore, patterns do not consistently capture the broader context of a sentence, leading to the extraction of false-positive semantic pairs. As an illustration, consider the hyponym-hypernym pattern the w of u which can correctly extract semantic pairs for a sentence like the village of Aasu but fails to do so for the phrase the moment of impact. The root cause of this limitation lies in the pattern's inability to capture the nuanced meaning of words and phrases in a sentence and their contextual significance. These observations have spurred my exploration of a third model, DepBERT which constitutes a dependency-aware supervised transformer model. DepBERT's primary contribution lies in introducing the underlying dependency structure of sentences to a language model with the aim of enhancing token classification performance. To achieve this, I must first reframe the task of semantic pair extraction as a token classification problem. The DepBERT model can harness both the tree-like structure of dependency patterns and the masked language architecture of transformers, marking a significant milestone, as most large language models (LLMs) predominantly focus on semantics and word co-occurrence while neglecting the crucial role of dependency architecture. In summary, my overarching contributions in this thesis are threefold. First, I validate the significance of the dependency architecture within various components of sentences and publish SDPs that incorporate these dependency relationships. Subsequently, I employ these SDPs in a practical medical domain to extract vital cause-effect pairs from sentences. Finally, my third contribution distinguishes this thesis by integrating dependency relations into a deep learning model, enhancing the understanding of language and the extraction of valuable semantic associations.Item Identifying High Acute Care Users Among Bipolar and Schizophrenia Patients(2023-12) Li, Shuo; Ben-Miled, Zina; Fang, Shiaofen; Zheng, Jiang YuThe electronic health record (EHR) documents the patient’s medical history, with information such as demographics, diagnostic history, procedures, laboratory tests, and observations made by healthcare providers. This source of information can help support preventive health care and management. The present thesis explores the potential for EHR-driven models to predict acute care utilization (ACU) which is defined as visits to an emergency department (ED) or inpatient hospitalization (IH). ACU care is often associated with significant costs compared to outpatient visits. Identifying patients at risk can improve the quality of care for patients and can reduce the need for these services making healthcare organizations more cost-effective. This is important for vulnerable patients including those suffering from schizophrenia and bipolar disorders. This study compares the ability of the MedBERT architecture, the MedBERT+ architecture and standard machine learning models to identify at risk patients. MedBERT is a deep learning language model which was trained on diagnosis codes to predict the patient’s at risk for certain disease conditions. MedBERT+, the architecture introduced in this study is also trained on diagnosis codes. However, it adds socio-demographic embeddings and targets a different outcome, namely ACU. MedBERT+ outperformed the original architecture, MedBERT, as well as XGB achieving an AUC of 0.71 for both bipolar and schizophrenia patients when predicting ED visits and an AUC of 0.72 for bipolar patients when predicting IH visits. For schizophrenia patients, the IH predictive model had an AUC of 0.66 requiring further improvements. One potential direction for future improvement is the encoding of the demographic variables. Preliminary results indicate that an appropriate encoding of the age of the patient increased the AUC of Bipolar ED models to up to 0.78.