Browsing by Author "Hasan, Mohammad Al"
Deep Learning Based Methods for Automatic Extraction of Syntactic Patterns and their Application for Knowledge Discovery
(2023-12-28) Kabir, Md. Ahsanul; Hasan, Mohammad Al; Mukhopadhyay, Snehasis; Tuceryan, Mihran; Fang, Shiaofen
Semantic pairs, which consist of related entities or concepts, serve as the foundation for comprehending the meaning of language in both written and spoken forms. These pairs enable one to grasp the nuances of relationships between words, phrases, or ideas, forming the basis for more advanced language tasks like entity recognition, sentiment analysis, machine translation, and question answering. They allow one to infer causality, identify hierarchies, and connect ideas within a text, ultimately enhancing the depth and accuracy of automated language processing. Nevertheless, the task of extracting semantic pairs from sentences poses a significant challenge, which is where syntactic dependency patterns (SDPs) become relevant. Thankfully, semantic relationships adhere to distinct SDPs when connecting pairs of entities. This fact underscores the importance of extracting these SDPs, particularly for specific semantic relationships like hyponym-hypernym, meronym-holonym, and cause-effect associations. The automated extraction of such SDPs carries substantial advantages for various downstream applications, including entity extraction, ontology development, and question answering. Unfortunately, this pivotal facet of pattern extraction has remained relatively overlooked by researchers in the domains of natural language processing (NLP) and information retrieval. To address this gap, I introduce an attention-based supervised deep learning model, ASPER. ASPER is designed to extract SDPs that denote semantic relationships between entities within a given sentential context.
I rigorously evaluate the performance of ASPER across three distinct semantic relations: hyponym-hypernym, cause-effect, and meronym-holonym, utilizing six datasets. My experimental findings demonstrate ASPER's ability to automatically identify an array of SDPs that mirror the presence of these semantic relationships within sentences, outperforming existing pattern extraction methods by a substantial margin. Second, I use the SDPs to extract semantic pairs from sentences, choosing to extract cause-effect entities from medical literature. This task is instrumental in compiling various causality relationships, such as those between diseases and symptoms, medications and side effects, and genes and diseases. Existing solutions excel in sentences where cause and effect phrases are straightforward, such as named entities, single-word nouns, or short noun phrases. However, in the complex landscape of medical literature, cause and effect expressions often extend over several words. This stumps existing methods and results in incomplete extractions that provide low-quality, non-informative, and at times conflicting information. To overcome this challenge, I introduce an innovative unsupervised method for extracting cause and effect phrases, PatternCausality, tailored explicitly for medical literature. PatternCausality employs a set of cause-effect dependency patterns as templates to identify the key terms within cause and effect phrases. It then utilizes a novel phrase extraction technique to produce comprehensive and meaningful cause and effect expressions from sentences. Experiments conducted on a dataset constructed from PubMed articles reveal that PatternCausality significantly outperforms existing methods, achieving an order-of-magnitude improvement in the F-score metric over the best-performing alternatives. I also develop various PatternCausality variants that utilize diverse phrase extraction methods, all of which surpass existing approaches.
PatternCausality and its variants exhibit notable performance improvements in extracting cause and effect entities in a domain-neutral benchmark dataset, wherein cause and effect entities are confined to single-word nouns or noun phrases of one to two words. Nevertheless, PatternCausality operates within an unsupervised framework and relies heavily on SDPs, motivating me to explore the development of a supervised approach. Although SDPs play a pivotal role in semantic relation extraction, pattern-based methodologies remain unsupervised, and the multitude of potential patterns within a language can be overwhelming. Furthermore, patterns do not consistently capture the broader context of a sentence, leading to the extraction of false-positive semantic pairs. As an illustration, consider the hyponym-hypernym pattern "the w of u", which correctly extracts a semantic pair from a phrase like "the village of Aasu" but fails to do so for the phrase "the moment of impact". The root cause of this limitation lies in the pattern's inability to capture the nuanced meaning of words and phrases in a sentence and their contextual significance. These observations have spurred my exploration of a third model, DepBERT, a dependency-aware supervised transformer model. DepBERT's primary contribution lies in introducing the underlying dependency structure of sentences to a language model with the aim of enhancing token classification performance. To achieve this, I must first reframe the task of semantic pair extraction as a token classification problem. The DepBERT model can harness both the tree-like structure of dependency patterns and the masked language architecture of transformers, marking a significant milestone, as most large language models (LLMs) predominantly focus on semantics and word co-occurrence while neglecting the crucial role of dependency architecture. In summary, my overarching contributions in this thesis are threefold.
First, I validate the significance of the dependency architecture within various components of sentences and publish SDPs that incorporate these dependency relationships. Subsequently, I employ these SDPs in a practical medical domain to extract vital cause-effect pairs from sentences. Finally, my third contribution distinguishes this thesis by integrating dependency relations into a deep learning model, enhancing the understanding of language and the extraction of valuable semantic associations.

Distributed graph decomposition algorithms on Apache Spark
(2018-04-20) Mandal, Aritra; Hasan, Mohammad Al; Mohler, George; Song, Fengguang
Structural analysis and mining of large and complex graphs for describing the characteristics of a vertex or an edge in the graph have widespread use in graph clustering, classification, and modeling. There are various methods for structural analysis of graphs, including the discovery of frequent subgraphs or network motifs, counting triangles or graphlets, spectral analysis of networks using eigenvectors of the graph Laplacian, and finding highly connected subgraphs such as cliques and quasi-cliques. Unfortunately, the algorithms for solving most of the above tasks are quite costly, which makes them unable to scale to large real-life networks. Two very popular decompositions, the k-core and the k-truss of a graph, give very useful insight about the graph's vertices and edges, respectively. These decompositions have been applied to protein function reasoning on protein-protein networks, fraud detection, and missing link prediction problems. k-core decomposition, with its linear time complexity, is scalable to large real-life networks as long as the input graph fits in main memory. k-truss, on the other hand, is computationally more intensive because its definition relies on triangles, and there is no linear-time algorithm available for it.
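To make the linear-time claim for k-core decomposition concrete, here is a minimal single-machine sketch of the standard bucket-based peeling idea: repeatedly remove a vertex of minimum remaining degree and record the largest minimum degree seen so far as its core number. This is an illustration only, not the distributed Spark algorithm described in this abstract; the function name `core_numbers` and the edge-list input format are assumptions made for the example.

```python
from collections import defaultdict


def core_numbers(edges):
    """Return {vertex: core number} for an undirected graph given as (u, v) pairs.

    Hypothetical helper for illustration; assumes the graph fits in memory.
    """
    # Build adjacency sets, ignoring self-loops and duplicate edges.
    adj = defaultdict(set)
    for u, v in edges:
        if u != v:
            adj[u].add(v)
            adj[v].add(u)

    # degree[v] tracks the number of neighbours of v that are not yet removed.
    degree = {v: len(nbrs) for v, nbrs in adj.items()}
    max_deg = max(degree.values(), default=0)

    # buckets[d] holds the remaining vertices whose current degree is d.
    buckets = [set() for _ in range(max_deg + 1)]
    for v, d in degree.items():
        buckets[d].add(v)

    core = {}
    k = 0  # largest minimum degree seen so far
    d = 0  # index of the bucket currently being scanned
    while len(core) < len(adj):
        # Advance to the lowest non-empty bucket.
        while not buckets[d]:
            d += 1
        k = max(k, d)
        v = buckets[d].pop()
        core[v] = k
        # Removing v lowers the degree of each remaining neighbour by one.
        for w in adj[v]:
            if w not in core:
                buckets[degree[w]].discard(w)
                degree[w] -= 1
                buckets[degree[w]].add(w)
        # A neighbour may have dropped into bucket d - 1, so step back one.
        d = max(d - 1, 0)
    return core
```

For a triangle on vertices 1, 2, 3 with a pendant vertex 4 attached to vertex 3, the triangle vertices get core number 2 and the pendant vertex gets core number 1. A k-truss computation has no comparably simple peeling over vertices, since it needs per-edge triangle counts, which is the extra cost the abstract refers to.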
In this paper, we propose distributed algorithms on Apache Spark for k-truss and k-core decomposition of a graph. We also compare the performance of our algorithms with state-of-the-art Map-Reduce and parallel algorithms using openly available real-world network data. Our proposed algorithms have shown substantial performance improvement.

GPU Accelerated Browser for Neuroimaging Genomics
(Springer, 2018-10) Zigon, Bob; Li, Huang; Yao, Xiaohui; Fang, Shiaofen; Hasan, Mohammad Al; Yan, Jingwen; Moore, Jason H.; Saykin, Andrew J.; Shen, Li; Alzheimer’s Disease Neuroimaging Initiative; Computer and Information Science, School of Science
Neuroimaging genomics is an emerging field that provides exciting opportunities to understand the genetic basis of brain structure and function. The unprecedented scale and complexity of the imaging and genomics data, however, have presented critical computational bottlenecks. In this work we present our initial efforts towards building an interactive visual exploratory system for mining big data in neuroimaging genomics. A GPU accelerated browsing tool for neuroimaging genomics is created that implements the ANOVA algorithm for single nucleotide polymorphism (SNP) based analysis and the VEGAS algorithm for gene-based analysis, and executes them at interactive rates. The ANOVA algorithm is 110 times faster than the 4-core OpenMP version, while the VEGAS algorithm is 375 times faster than its 4-core OpenMP counterpart. This approach lays a solid foundation for researchers to address the challenges of mining large-scale imaging genomics datasets via interactive visual exploration.

Marginal Regression Analysis of Clustered and Incomplete Event History Data
(2022-12) Zhou, Wenxian; Bakoyannis, Giorgos; Zhang, Ying; Yiannoutsos, Constantin T.; Zang, Yong; Hasan, Mohammad Al
Event history data, including competing risks and more general multistate process data, are commonly encountered in biomedical studies.
In practice, such event history data are often subject to intra-cluster correlation in multicenter studies and are complicated due to informative cluster size, a situation where the outcomes under study are associated with the size of the cluster. In addition, outcomes or covariates are frequently incompletely observed in real-world settings. Ignoring these statistical issues will lead to invalid inferences. In this dissertation, I develop a series of marginal regression methods to address these statistical issues with competing risks and more general multistate process data. The motivation for this research comes from a large multicenter HIV study and a multicenter randomized oncology trial. First, I propose a marginal regression method for clustered competing risks data with missing cause of failure. I consider the semiparametric proportional cause-specific hazards model and propose a maximum partial pseudolikelihood estimator under a plausible missing at random assumption. Second, I consider more general clustered multistate process data and propose a marginal regression framework for the transient state occupation probabilities. The proposed method is based on a weighted functional generalized estimating equation approach. A nonparametric hypothesis test for the covariate effect is also provided. Third, I extend the proposed framework in the second part of the dissertation to account for missing covariates, via a weighted functional pseudo-expected estimating equation approach. I conduct extensive simulation studies to evaluate the finite sample performance of the proposed methods. 
The proposed methods are applied to the motivating multicenter HIV study and oncology trial datasets.

Pathway and network analysis in proteomics
(Elsevier, 2014-12-07) Wu, Xiaogang; Hasan, Mohammad Al; Chen, Jake Yue; Department of BioHealth Informatics, School of Informatics and Computing
Proteomics is inherently a systems science that studies not only measured proteins and their expressions in a cell, but also the interplay of proteins, protein complexes, signaling pathways, and network modules. There has been a rapid accumulation of Proteomics data in recent years. However, Proteomics data are highly variable, with results sensitive to data preparation methods, sample condition, instrument types, and analytical methods. To address the challenge in Proteomics data analysis, we review current tools being developed to incorporate biological function and network topological information. We categorize these tools into four types: tools with basic functional information and few topological features (e.g., GO category analysis), tools with rich functional information and few topological features (e.g., GSEA), tools with basic functional information and rich topological features (e.g., Cytoscape), and tools with rich functional information and rich topological features (e.g., PathwayExpress). We first review the potential application of these tools to Proteomics; then we review tools that can achieve automated learning of pathway modules and features, and tools that help perform integrated network visual analytics.

Privacy-Preserving Facial Recognition Using Biometric-Capsules
(2020-05) Phillips, Tyler S.; Zou, Xukai; Li, Feng; Hasan, Mohammad Al
In recent years, developers have used the proliferation of biometric sensors in smart devices, along with recent advances in deep learning, to implement an array of biometrics-based recognition systems.
Though these systems demonstrate remarkable performance and have seen wide acceptance, they present unique and pressing security and privacy concerns. One proposed method which addresses these concerns is the elegant, fusion-based Biometric-Capsule (BC) scheme. The BC scheme is provably secure, privacy-preserving, cancellable and interoperable in its secure feature fusion design. In this work, we demonstrate that the BC scheme is uniquely fit to secure state-of-the-art facial verification, authentication and identification systems. We compare the performance of unsecured, underlying biometrics systems to the performance of the BC-embedded systems in order to directly demonstrate the minimal effects of the privacy-preserving BC scheme on underlying system performance. Notably, we demonstrate that, when the BC scheme is seamlessly embedded into state-of-the-art FaceNet and ArcFace verification systems, which achieve accuracies of 97.18% and 99.75% on the benchmark LFW dataset, the BC-embedded systems achieve accuracies of 95.13% and 99.13%, respectively. Furthermore, we also demonstrate that the BC scheme outperforms or performs as well as several other proposed secure biometric methods.

RASMA: a reverse search algorithm for mining maximal frequent subgraphs
(BMC, 2021-03-16) Salem, Saeed; Alokshiya, Mohammed; Hasan, Mohammad Al; Computer and Information Science, School of Science
Background: Given a collection of coexpression networks over a set of genes, identifying subnetworks that appear frequently is an important research problem known as mining frequent subgraphs. Maximal frequent subgraphs are a representative set of frequent subgraphs; a frequent subgraph is maximal if it does not have a super-graph that is frequent.
In the bioinformatics discipline, methodologies for mining frequent and/or maximal frequent subgraphs can be used to discover interesting network motifs that elucidate complex interactions among genes, reflected through the edges of the frequent subnetworks. Further study of frequent coexpression subnetworks enhances the discovery of biological modules and biological signatures for gene expression and disease classification.
Results: We propose a reverse search algorithm, called RASMA, for mining frequent and maximal frequent subgraphs in a given collection of graphs. A key innovation in RASMA is a connected subgraph enumerator that uses a reverse-search strategy to enumerate connected subgraphs of an undirected graph. Using this enumeration strategy, RASMA obtains all maximal frequent subgraphs very efficiently. To overcome the computationally prohibitive task of enumerating all frequent subgraphs while mining for the maximal frequent subgraphs, RASMA employs several pruning strategies that substantially improve its overall runtime performance. Experimental results show that on large gene coexpression networks, the proposed algorithm efficiently mines biologically relevant maximal frequent subgraphs.
Conclusion: Extracting recurrent gene coexpression subnetworks from multiple gene expression experiments enables the discovery of functional modules and subnetwork biomarkers. We have proposed a reverse search algorithm for mining maximal frequent subnetworks. Enrichment analysis of the extracted maximal frequent subnetworks reveals that subnetworks that are frequent are highly enriched with known biological ontologies.

Rewiring Police Officer Training Networks to Reduce Forecasted Use of Force
(2023-08) Pandey, Ritika; Mohler, George; Hill, James; Hasan, Mohammad Al; Mukhopadhyay, Snehasis
Police use of force has become a topic of significant concern, particularly given the disparate impact on communities of color.
Research has shown that police officer-involved shootings, misconduct, and excessive use of force complaints exhibit network effects, where officers are at greater risk of being involved in these incidents when they socialize with officers who have a history of use of force and misconduct. Given that use of force and misconduct behavior appear to be transmissible across police networks, we ask whether police networks can be altered to reduce use of force and misconduct events in a limited scope. In this work, we analyze a novel dataset from the Indianapolis Metropolitan Police Department on officer field training, subsequent use of force, and the role of network effects from field training officers. We construct a network survival model for analyzing time-to-event of use of force incidents involving new police trainees. The model includes network effects of the diffusion of risk from field training officers (FTOs) to trainees. We then introduce a network rewiring algorithm to maximize the expected time to use of force events upon completion of field training. We study several versions of the algorithm, including constraints that encourage demographic diversity of FTOs. The results show that FTO use of force history is the best predictor of a trainee's time to use of force in the survival model, and that rewiring the network can increase the expected time (in days) to a recruit's first use of force incident by 8%. We then discuss the potential benefits and challenges associated with implementing such an algorithm in practice.

Sampling Triples from Restricted Networks Using MCMC Strategy
(ACM, 2014) Rahman, Mahmudur; Hasan, Mohammad Al; Department of Computer Science, IUPUI
In large networks, the connected triples are useful for solving various tasks including link prediction, community detection, and spam filtering. Existing works in this direction are mostly concerned with the exact or approximate counting of connected triples that are closed (aka, triangles).
Evidently, the task of triple sampling has not been explored in depth, although sampling is a more fundamental task than counting, and the former is useful for solving various other tasks, including counting. In recent years, some works on triple sampling have been proposed that are based on direct sampling, solely for the purpose of triangle count approximation. They sample only from a uniform distribution, and are not effective for sampling triples from an arbitrary user-defined distribution. In this work we present two indirect triple sampling methods that are based on a Markov Chain Monte Carlo (MCMC) sampling strategy. Both of these methods are highly efficient compared to a direct sampling-based method, specifically for the task of sampling from a non-uniform probability distribution. Another significant advantage of the proposed methods is that they can sample triples from networks that have restricted access, on which a direct sampling-based method is simply not applicable.

Text Mining for Social Harm and Criminal Justice Applications
(2020-08) Pandey, Ritika; Mohler, George; Hasan, Mohammad Al; Mukhopadhyay, Snehasis
Increasing rates of social harm events and a plethora of text data call for text mining techniques, not only to better understand the causes of these events but also to develop optimal prevention strategies. In this work, we study three social harm issues: crime topic models, transitions into drug addiction, and homicide investigation chronologies. Topic modeling for the categorization and analysis of crime report text allows for more nuanced categories of crime compared to official UCR categorizations. This study has important implications in hotspot policing. We investigate the extent to which topic models that improve coherence lead to higher levels of crime concentration. We further explore the transitions into drug addiction using Reddit data.
We propose a prediction model to classify users’ transitions from a casual drug discussion forum to a recovery drug discussion forum and to estimate the likelihood of such transitions. Through this study we offer insights into modern drug culture and provide tools with potential applications in combating opioid crises. Lastly, we present a knowledge graph based framework for homicide investigation chronologies that may aid investigators in analyzing homicide case data and also allow for post hoc analysis of key features that determine whether a homicide is ultimately solved. For this purpose we perform named entity recognition to determine witnesses, detectives, and suspects from the chronology, use keyword expansion to identify various evidence types, and finally link these entities and evidence to construct a homicide investigation knowledge graph. We compare the performance of several choices of methodology for these sub-tasks and analyze the association between network statistics of the knowledge graph and homicide solvability.