Browsing by Subject "Clustering"
Item Comparing PSO-based clustering over contextual vector embeddings to modern topic modeling (Elsevier, 2022-05) Miles, Samuel; Yao, Lixia; Meng, Weilin; Black, Christopher M.; Miled, Zina Ben; Electrical and Computer Engineering, School of Engineering and Technology
Efficient topic modeling is needed to support applications that aim at identifying main themes from a collection of documents. In the present paper, a reduced vector embedding representation and particle swarm optimization (PSO) are combined to develop a topic modeling strategy that is able to identify representative themes from a large collection of documents. Documents are encoded using a reduced, contextual vector embedding from a general-purpose pre-trained language model (sBERT). A modified PSO algorithm (pPSO) that tracks particle fitness on a dimension-by-dimension basis is then applied to these embeddings to create clusters of related documents. The proposed methodology is demonstrated on two datasets. The first dataset consists of posts from the online health forum r/Cancer and the second dataset is a standard benchmark for topic modeling which consists of a collection of messages posted to 20 different news groups. When compared to the state-of-the-art generative document models (i.e., ETM and NVDM), pPSO is able to produce interpretable clusters. The results indicate that pPSO is able to capture both common topics as well as emergent topics. Moreover, the topic coherence of pPSO is comparable to that of ETM and its topic diversity is comparable to NVDM. The assignment parity of pPSO on a document completion task exceeded 90% for the 20NewsGroups dataset.
This rate drops to approximately 30% when pPSO is applied to the same Skip-Gram embedding derived from a limited, corpus-specific vocabulary which is used by ETM and NVDM.
Item Continuum modeling of clustering of myxobacteria (IOP, 2013) Harvey, Cameron W.; Alber, Mark; Tsimring, Lev S.; Aranson, Igor S.; Medicine, School of Medicine
In this paper we develop a continuum theory of clustering in ensembles of self-propelled inelastically colliding rods with applications to the collective dynamics of the common gliding bacterium Myxococcus xanthus. A multiphase hydrodynamic model that couples densities of oriented and isotropic phases is described. This model is used for the analysis of an instability that leads to spontaneous formation of directionally moving dense clusters within an initially dilute isotropic "gas" of myxobacteria. Numerical simulations of this model confirm the existence of stationary dense moving clusters and also elucidate the properties of their collisions. The results are shown to be in qualitative agreement with experiments.
Item Data-driven clustering identifies features distinguishing multisystem inflammatory syndrome from acute COVID-19 in children and adolescents (Elsevier, 2021-08-31) Geva, Alon; Patel, Manish M.; Newhams, Margaret M.; Young, Cameron C.; Son, Mary Beth F.; Kong, Michele; Maddux, Aline B.; Hall, Mark W.; Riggs, Becky J.; Singh, Aalok R.; Giuliano, John S.; Hobbs, Charlotte V.; Loftis, Laura L.; McLaughlin, Gwenn E.; Schwartz, Stephanie P.; Schuster, Jennifer E.; Babbitt, Christopher J.; Halasa, Natasha B.; Gertz, Shira J.; Doymaz, Sule; Hume, Janet R.; Bradford, Tamara T.; Irby, Katherine; Carroll, Christopher L.; McGuire, John K.; Tarquinio, Keiko M.; Rowan, Courtney M.; Mack, Elizabeth H.; Cvijanovich, Natalie Z.; Fitzgerald, Julie C.; Spinella, Philip C.; Staat, Mary A.; Clouser, Katharine N.; Soma, Vijaya L.; Dapul, Heda; Maamari, Mia; Bowens, Cindy; Havlin, Kevin M.; Mourani, Peter M.; Heidemann, Sabrina M.; Horwitz, Steven M.; Feldstein, Leora R.; Tenforde, Mark W.; Newburger, Jane W.; Mandl, Kenneth D.; Randolph, Adrienne G.; Overcoming COVID-19 Investigators; Pediatrics, School of Medicine
Background: Multisystem inflammatory syndrome in children (MIS-C) consensus criteria were designed for maximal sensitivity and therefore capture patients with acute COVID-19 pneumonia. Methods: We performed unsupervised clustering on data from 1,526 patients (684 labeled MIS-C by clinicians) <21 years old hospitalized with COVID-19-related illness admitted between 15 March 2020 and 31 December 2020. We compared prevalence of assigned MIS-C labels and clinical features among clusters, followed by recursive feature elimination to identify characteristics of potentially misclassified MIS-C-labeled patients. Findings: Of 94 clinical features tested, 46 were retained for clustering. Cluster 1 patients (N = 498; 92% labeled MIS-C) were mostly previously healthy (71%), with mean age 7·2 ± 0·4 years, predominant cardiovascular (77%) and/or mucocutaneous (82%) involvement, high inflammatory biomarkers, and mostly SARS-CoV-2 PCR negative (60%). Cluster 2 patients (N = 445; 27% labeled MIS-C) frequently had pre-existing conditions (79%, with 39% respiratory), were similarly 7·4 ± 2·1 years old, and commonly had chest radiograph infiltrates (79%) and positive PCR testing (90%). Cluster 3 patients (N = 583; 19% labeled MIS-C) were younger (2·8 ± 2·0 y), PCR positive (86%), with less inflammation. Radiographic findings of pulmonary infiltrates and positive SARS-CoV-2 PCR accurately distinguished cluster 2 MIS-C labeled patients from cluster 1 patients. Interpretation: Using a data-driven, unsupervised approach, we identified features that cluster patients into a group with high likelihood of having MIS-C. Other features identified a cluster of patients more likely to have acute severe COVID-19 pulmonary disease, and patients in this cluster labeled by clinicians as MIS-C may be misclassified.
These data-driven phenotypes may help refine the diagnosis of MIS-C.
Item Distributed graph decomposition algorithms on Apache Spark (2018-04-20) Mandal, Aritra; Hasan, Mohammad Al; Mohler, George; Song, Fengguang
Structural analysis and mining of large and complex graphs for describing the characteristics of a vertex or an edge in the graph have widespread use in graph clustering, classification, and modeling. There are various methods for structural analysis of graphs including the discovery of frequent subgraphs or network motifs, counting triangles or graphlets, spectral analysis of networks using eigenvectors of the graph Laplacian, and finding highly connected subgraphs such as cliques and quasi-cliques. Unfortunately, the algorithms for solving most of the above tasks are quite costly, which makes them unable to scale to large real-life networks. Two very popular decompositions, k-core and k-truss, give useful insight into the vertices and edges of a graph, respectively. These decompositions have been applied to protein function reasoning on protein-protein networks, fraud detection, and missing link prediction problems. k-core decomposition, with its linear time complexity, is scalable to large real-life networks as long as the input graph fits in main memory. k-truss, on the other hand, is computationally more intensive because its definition relies on triangles, and there is no linear time algorithm available for it. In this paper, we propose distributed algorithms on Apache Spark for k-truss and k-core decomposition of a graph. We also compare the performance of our algorithms with state-of-the-art Map-Reduce and parallel algorithms using openly available real-world network data.
Our proposed algorithms have shown substantial performance improvement.
Item Efficient Inference and Dominant-Set Based Clustering for Functional Data (2024-05) Wang, Xiang; Wang, Honglang; Boukai, Benzion; Tan, Fei; Peng, Hanxiang
This dissertation addresses three progressively fundamental problems for functional data analysis: (1) To do efficient inference for the functional mean model accounting for within-subject correlation, we propose the refined and bias-corrected empirical likelihood method. (2) To identify functional subjects potentially from different populations, we propose the dominant-set based unsupervised clustering method using the similarity matrix. (3) To learn the similarity matrix from various similarity metrics for functional data clustering, we propose the modularity guided and dominant-set based semi-supervised clustering method. In the first problem, the empirical likelihood method is utilized to do inference for the mean function of functional data by constructing the refined and bias-corrected estimating equation. The proposed estimating equation not only improves efficiency but also enables practically feasible empirical likelihood inference by properly incorporating within-subject correlation, which has not been achieved by previous studies. In the second problem, the dominant-set based unsupervised clustering method is proposed to maximize the within-cluster similarity and applied to functional data with a flexible choice of similarity measures between curves. The proposed unsupervised clustering method is a hierarchical bipartition procedure under the penalized optimization framework with the tuning parameter selected by maximizing the clustering criterion called modularity of the resulting two clusters, which is inspired by the concept of dominant set in graph theory and solved by replicator dynamics in game theory.
This approach is robust not only to imbalanced group sizes but also to outliers, which overcomes a limitation of many existing clustering methods. In the third problem, the metric-based semi-supervised clustering method is proposed, with the similarity metric learned by modularity maximization and followed by the dominant-set based clustering procedure proposed above. Under the semi-supervised setting, where some cluster memberships are known, the goal is to determine the best linear combination of candidate similarity metrics as the final metric to enhance the clustering performance. Besides the global metric-based algorithm, another algorithm is also proposed to learn individual metrics for each cluster, which permits overlapping membership for the clustering. This differs from many existing methods. The method is well suited to functional data with various similarity metrics between functional curves, while also exhibiting robustness to imbalanced group sizes, which is intrinsic to the dominant-set based clustering approach. In all three problems, the advantages of the proposed methods are demonstrated through extensive empirical investigations using simulations as well as real data applications.
Item Genetic clustering on the hippocampal surface for genome-wide association studies (Springer Nature, 2013) Hibar, Derrek P.; Medland, Sarah E.; Stein, Jason L.; Kim, Sungeun; Shen, Li; Saykin, Andrew J.; de Zubicaray, Greig I.; McMahon, Katie L.; Montgomery, Grant W.; Martin, Nicholas G.; Wright, Margaret J.; Djurovic, Srdjan; Agartz, Ingrid; Andreassen, Ole A.; Thompson, Paul M.; Radiology and Imaging Sciences, School of Medicine
Imaging genetics aims to discover how variants in the human genome influence brain measures derived from images. Genome-wide association scans (GWAS) can screen the genome for common differences in our DNA that relate to brain measures.
In small samples, GWAS has low power as individual gene effects are weak and one must also correct for multiple comparisons across the genome and the image. Here we extend recent work on genetic clustering of images to analyze surface-based models of anatomy using GWAS. We performed spherical harmonic analysis of hippocampal surfaces, automatically extracted from brain MRI scans of 1254 subjects. We clustered hippocampal surface regions with common genetic influences by examining genetic correlations (rg) between the normalized deformation values at all pairs of surface points. Using genetic correlations to cluster surface measures, we were able to boost effect sizes for genetic associations, compared to clustering with traditional phenotypic correlations using Pearson's r.
Item Genetic Spectrum and Distinct Evolution Patterns of SARS-CoV-2 (Frontiers Media, 2020-09-25) Liu, Sheng; Shen, Jikui; Fang, Shuyi; Li, Kailing; Liu, Juli; Yang, Lei; Hu, Chang-Deng; Wan, Jun; Medical and Molecular Genetics, School of Medicine
Four signature groups of frequently occurring single-nucleotide variants (SNVs) were identified in over twenty-eight thousand high-quality and high-coverage SARS-CoV-2 complete genome sequences, representing different viral strains. Some SNVs predominated but occurred mutually exclusively in patients from different countries and areas. These major SNV signatures exhibited distinguishable evolution patterns over time. A few hundred patients were detected with multiple viral strain-representing mutations simultaneously, which may stand for possible co-infection or potential homogenous recombination of SARS-CoV-2 in the environment or within the viral host.
Interestingly, nucleotide substitutions among SARS-CoV-2 genomes tended to switch between the bat RaTG13 coronavirus sequence and the Wuhan-Hu-1 genome, indicating higher genetic instability or tolerance of mutations at those sites, or suggesting that major viral strains might exist between Wuhan-Hu-1 and the RaTG13 coronavirus.
Item Genomic clustering analysis identifies molecular subtypes of thymic epithelial tumors independent of World Health Organization histologic type (Impact Journals, 2021-06-08) Padda, Sukhmani K.; Gökmen-Polar, Yesim; Hellyer, Jessica A.; Badve, Sunil S.; Singh, Neeraj K.; Vasista, Sumanth M.; Basu, Kabya; Kumar, Ansu; Wakelee, Heather A.; Pathology and Laboratory Medicine, School of Medicine
Further characterization of thymic epithelial tumors (TETs) is needed. Genomic information from 102 evaluable TETs from The Cancer Genome Atlas (TCGA) dataset and from the IU-TAB-1 cell line (type AB thymoma) underwent clustering analysis to identify molecular subtypes of TETs. Six novel molecular subtypes (TH1-TH6) of TETs from the TCGA were identified, and there was no association with WHO histologic subtype. The IU-TAB-1 cell line clustered into the TH4 molecular subtype, and in vitro testing of candidate therapeutics was performed. The IU-TAB-1 cell line was noted to be resistant to everolimus (mTORC1 inhibitor) and sensitive to nelfinavir (AKT1 inhibitor) across the endpoints measured. Sensitivity to nelfinavir was due to the IU-TAB-1 cell line’s gain-of-function (GOF) mutation in PIK3CA and amplification of genes observed from array comparative genomic hybridization (aCGH), including AURKA, ERBB2, KIT, PDGFRA and PDGFB, that are known to upregulate AKT, while resistance to everolimus was primarily driven by upregulation of downstream signaling of KIT, PDGFRA and PDGFB in the presence of mTORC1 inhibition.
We present a novel molecular classification of TETs independent of WHO histologic subtype, which may be used for preclinical validation studies of potential candidate therapeutics of interest for this rare disease.
Item Integrate Model and Instance Based Machine Learning for Network Intrusion Detection (2018-12) Ara, Lena; Luo, Xiao; King, Brian; El-Sharkawy, Mohamed
In computer networks, convenient internet access facilitates internet services but also increases the spread of malicious software, which could represent an attack or unauthorized access. This makes intrusion detection an important area to explore for detecting these unwanted activities. This thesis concentrates on combining model-based and instance-based machine learning for detecting intrusions through a series of algorithms, starting from clustering similar hosts. Similar hosts have been found based on supervised machine learning techniques such as Support Vector Machines, Decision Trees and K Nearest Neighbors using our proposed Data Fusion algorithm. Maximal cliques from graph theory have been explored to find the clusters. A recursive way is proposed to merge the decision areas of the best features. The idea is to implement a combination of model-based and instance-based machine learning and analyze how it performs compared to a conventional machine learning algorithm such as Random Forest for intrusion detection. The system has been evaluated on three datasets from CTU-13. The results show that our proposed method gives a better detection rate than traditional methods, which might overfit the data.
Prior research on model merging, instance-based learning, random forests, data mining and ensemble learning with regard to intrusion detection has been studied and taken as reference.
Item Medical Imaging Centers in Central Indiana: Optimal Location Allocation Analyses (2016-01) Seger, Mandi J.; Banerjee, Aniruddha; Wilson, Jeffrey S.; Lulla, Vijay O.; Wiehe, Sarah Elizabeth
While optimization techniques have been studied since 300 B.C. when Euclid first considered the minimal distance between a point and a line, it wasn’t until 1966 that location optimization was first applied to a problem in healthcare. Location optimization techniques are capable of increasing efficiency and equity in the placement of many types of services, including those within the healthcare industry, thus enhancing quality of life. Medical imaging is a healthcare service which helps to determine medical diagnoses in acute and preventive care settings. It provides physicians with information guiding treatment and returning a patient to optimal health. In this study, a retrospective analysis of the locations of current medical imaging centers in central Indiana is performed, and alternate placement as determined using optimization techniques is considered and compared. This study focuses on reducing the drive time experienced by the population within the study area to their nearest imaging facility. Location optimization models such as the P-Median model, the Maximum Covering model, and Clustering and Partitioning are often used in the field of operations research to solve location problems, but are less well known within the discipline of Geographic Information Science. This study was intended to demonstrate the capabilities of these powerful algorithms and to increase understanding of how they may be applied to problems within healthcare.
While the P-Median model is effective at reducing the overall drive time for a given network set, individuals within the network may experience lengthy drive times. The results further indicate that while the Maximum Covering model is more equitable than the P-Median model, it produces large sets of assigned individuals overwhelming the capacity of one imaging center. Finally, the Clustering and Partitioning method is effective at limiting the number of individuals assigned to a given imaging center, but it does not provide information regarding average drive time for those individuals. In the end, it is determined that a capacitated Maximal Covering model would be the preferred method for solving this particular location problem.
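As a rough illustration of the two objectives contrasted in the abstract above, the following minimal sketch solves a toy P-Median problem and a toy covering problem by brute force. It is not the study's method: the drive-time matrix, population weights, 10-minute coverage threshold, and the function names `p_median` and `max_covering` are all hypothetical, and real analyses would use a dedicated solver over a road network.

```python
from itertools import combinations

# Hypothetical drive times in minutes: rows are demand areas, columns are candidate sites.
DRIVE_TIME = [
    [5, 20, 35],
    [10, 15, 30],
    [25, 5, 20],
    [40, 20, 5],
]
# Hypothetical population of each demand area.
POPULATION = [100, 200, 150, 50]


def nearest_time(row, sites):
    """Drive time from one demand area to its nearest chosen site."""
    return min(row[s] for s in sites)


def p_median(drive_time, population, p):
    """Choose p sites minimizing total population-weighted drive time (brute force)."""
    n_sites = len(drive_time[0])

    def total_cost(sites):
        return sum(w * nearest_time(row, sites) for row, w in zip(drive_time, population))

    return min(combinations(range(n_sites), p), key=total_cost)


def max_covering(drive_time, population, p, threshold):
    """Choose p sites maximizing the population within `threshold` minutes (brute force)."""
    n_sites = len(drive_time[0])

    def covered(sites):
        return sum(
            w for row, w in zip(drive_time, population)
            if nearest_time(row, sites) <= threshold
        )

    return max(combinations(range(n_sites), p), key=covered)


if __name__ == "__main__":
    print("P-Median choice:", p_median(DRIVE_TIME, POPULATION, 1))
    print("Covering choice:", max_covering(DRIVE_TIME, POPULATION, 1, 10))
```

On this toy data the two objectives pick different single sites, which mirrors the trade-off described in the abstract: minimizing total drive time and maximizing coverage within a threshold need not agree. Neither function enforces a capacity limit per site; a capacitated variant would add that constraint.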