Browsing by Author "Department of Computer & Information Science, School of Science"
Item DMAP: a connectivity map database to enable identification of novel drug repositioning candidates (BioMed Central, 2015-09-25) Huang, Hui; Nguyen, Thanh; Ibrahim, Sara; Shantharam, Sandeep; Yue, Zongliang; Chen, Jake Yue; Department of Computer & Information Science, School of Science
BACKGROUND: Drug repositioning is a cost-efficient and time-saving approach to drug development compared to traditional techniques. A systematic method for drug repositioning is to identify candidate drugs' gene expression profiles on target disease models and determine how similar these profiles are to those of approved drugs. Databases such as the CMAP have been developed recently to help with systematic drug repositioning.
METHODS: To overcome the limitation of connectivity maps on data coverage, we constructed a comprehensive in silico drug-protein connectivity map called DMAP, which contains directed drug-to-protein effects and effect scores. The drug-to-protein effect scores are compiled from all database entries in which an effect between the drug and the protein has been previously observed, and they provide a confidence measure on the quality of such drug-to-protein effects.
RESULTS: In DMAP, we have compiled the direct effects between 24,121 PubChem Compound IDs (CIDs), which were mapped from 289,571 chemical entities recognized from public literature, and 5,196 reviewed UniProt proteins. DMAP compiles a total of 438,004 chemical-to-protein effect relationships. Compared to CMAP, DMAP shows a 221-fold increase in the number of chemicals and a 1.92-fold increase in the number of ATC codes. Furthermore, by overlapping DMAP chemicals with the approved drugs with known indications from the TTD database and literature, we obtained 982 drugs and 622 diseases, whereas we obtained only 394 drugs with known indications from CMAP.
To validate the feasibility of applying the new DMAP for systematic drug repositioning, we compared the performance of DMAP and the well-known CMAP database on two popular computational techniques: a drug-drug-similarity-based method with leave-one-out validation and a Kolmogorov-Smirnov scoring-based method. In the drug-drug-similarity-based method, the drug repositioning prediction using DMAP achieved an Area-Under-Curve (AUC) score of 0.82, compared with an AUC of 0.64 using CMAP. For the Kolmogorov-Smirnov scoring-based method, with DMAP we were able to retrieve several drug indications that could not be retrieved using CMAP. DMAP data can be queried using the existing C2MAP server or downloaded freely at: http://bio.informatics.iupui.edu/cmaps
CONCLUSIONS: Reliable measurements of how drugs affect disease-related proteins are critical to ongoing drug development in the genome medicine era. We demonstrated that DMAP can help drug development professionals assess drug-to-protein relationship data and improve chances of success for systematic drug repositioning efforts.

Item Enhancing and Implementing Fully Transparent Internet Voting (IEEE, 2015-08) Butterfield, Kevin; Li, Huian; Zou, Xukai; Li, Feng; Department of Computer & Information Science, School of Science
Voting over the internet has been the focus of significant research with the potential to solve many problems. Current implementations typically suffer from a lack of transparency, where the connection between vote casting and result tallying is seen as a black box by voters. A new protocol was recently proposed that allows full transparency, never obfuscating any step of the process, and splits authority between mutually constraining, conflicting parties. Achieving such transparency brings with it challenging issues.
In this paper, we propose an efficient algorithm for generating unique, anonymous identifiers (voting locations) that is based on the Chinese Remainder Theorem; we extend the functionality of an election to allow for races with multiple winners; and we introduce a prototype of this voting system implemented as a multiplatform web application.

Item FS3: A Sampling based method for top-k Frequent Subgraph Mining (2015) Saha, Tanay Kumar; Al Hasan, Mohammad; Department of Computer & Information Science, School of Science
Mining labeled subgraphs is a popular research task in data mining because of its potential application in many different scientific domains. All the existing methods for this task explicitly or implicitly solve the subgraph isomorphism task, which is computationally expensive, so they scale poorly when the graphs in the input database are large. In this work, we propose FS3, which is a sampling-based method. It mines a small collection of subgraphs that are most frequent in the probabilistic sense. FS3 performs Markov Chain Monte Carlo (MCMC) sampling over the space of fixed-size subgraphs such that the potentially frequent subgraphs are sampled more often. In addition, FS3 is equipped with an innovative queue manager. It stores the sampled subgraphs in a finite queue over the course of mining in such a manner that the top-k positions in the queue contain the most frequent subgraphs. Our experiments on databases of large graphs show that FS3 is efficient, and it obtains subgraphs that are the most frequent amongst the subgraphs of a given size.

Item The Infinite Mixture of Infinite Gaussian Mixtures (2015) Yerebakan, Halid Z.; Rajwa, Bartek; Dundar, Murat; Department of Computer & Information Science, School of Science
Dirichlet process mixture of Gaussians (DPMG) has been used in the literature for clustering and density estimation problems.
However, many real-world data sets exhibit cluster distributions that cannot be captured by a single Gaussian. Modeling such data sets by DPMG creates several extraneous clusters even when clusters are relatively well-defined. Herein, we present the infinite mixture of infinite Gaussian mixtures (I2GMM) for more flexible modeling of data sets with skewed and multi-modal cluster distributions. Instead of using a single Gaussian for each cluster as in the standard DPMG model, the generative model of I2GMM uses a single DPMG for each cluster. The individual DPMGs are linked together through centering of their base distributions at the atoms of a higher-level DP prior. Inference is performed by a collapsed Gibbs sampler that also enables partial parallelization. Experimental results on several artificial and real-world data sets suggest the proposed I2GMM model can predict clusters more accurately than existing variational Bayes and Gibbs sampler versions of DPMG.

Item LOCALIZED TEMPORAL PROFILE OF SURVEILLANCE VIDEO (IEEE, 2014-07) Bagheri, Saeid; Zheng, Jiang Yu; Department of Computer & Information Science, School of Science
Surveillance videos are recorded pervasively, and their retrieval still relies on human operators. As an intermediate representation, this work develops a new temporal profile of video to convey accurate temporal information in the video while keeping certain spatial characteristics of targets of interest for recognition. The profile is obtained at critical positions where major target flow appears. We set a sampling line crossing the motion direction to profile passing targets in the temporal domain. In order to add spatial information to the temporal profile to a certain extent, we integrate multiple profiles from a set of lines with a blending method to reflect the target motion direction and position in the temporal profile.
Different from mosaicing/montage methods for video synopsis in the spatial domain, our temporal profile has no limit on the time length, and the created profile significantly reduces the data size for brief indexing and fast search of video.

Item Modeling and Implementation of an Asynchronous Approach to Integrating HPC and Big Data Analysis (Elsevier, 2016-06) Fu, Yuankun; Song, Fengguang; Zhu, Luoding; Department of Computer & Information Science, School of Science
With the emergence of exascale computing and big data analytics, many important scientific applications require the integration of computationally intensive modeling and simulation with data-intensive analysis to accelerate scientific discovery. In this paper, we create an analytical model to steer the optimization of the end-to-end time-to-solution for the integrated computation and data analysis. We also design and develop an intelligent data broker to efficiently intertwine the computation stage and the analysis stage to practically achieve the optimal time-to-solution predicted by the analytical model. We perform experiments on both synthetic applications and real-world computational fluid dynamics (CFD) applications. The experiments show that the analytical model exhibits an average relative error of less than 10%, and the application performance can be improved by up to 131% for the synthetic programs and by up to 78% for the real-world CFD application.

Item Monitoring Routing Topology in Dynamic Wireless Sensor Network Systems (IEEE, 2015) Liu, Rui; Liang, Yao; Zhong, Xiaoyang; Department of Computer & Information Science, School of Science
In large-scale multi-hop wireless sensor networks (WSNs) for data collection, the ability to monitor per-packet routing paths at the sink is essential for better understanding network dynamics and for improving routing protocols, topology control, energy conservation, anomaly detection, and load balancing in WSN deployments.
In this study, we consider this important problem under tremendous WSN routing dynamics, which cannot be addressed by previous methods based on a routing tree model. We formulate the WSN topology inference as a novel optimization problem, and devise efficient decoding algorithms to recover WSN routing topology at the sink in real time using a small fixed-size path measurement attached to each packet. Rigorous complexity analysis of the devised algorithms is given. Performance evaluation is conducted via extensive simulations. The results reveal that our approach significantly outperforms other state-of-the-art methods including MNT, Pathfinder, and CSPR. Furthermore, we validate our approach extensively with a real-world outdoor WSN deployment running the collection tree protocol for environmental data collection.

Item A Naïve Bayesian Classifier in Categorical Uncertain Data Streams (IEEE, 2014-10) Ge, Jiaqi; Xia, Yuni; Wang, Jian; Department of Computer & Information Science, School of Science
This paper proposes a novel naïve Bayesian classifier in categorical uncertain data streams. Uncertainty in categorical data is usually represented by a vector-valued discrete pdf, which has to be carefully handled to guarantee the underlying performance in data mining applications. In this paper, we map the probabilistic attribute to deterministic points in the Euclidean space and design distance-based and density-based algorithms to measure the correlations between feature vectors and class labels. We also devise a new pre-binning approach to guarantee bounded computation and memory cost in uncertain data stream classification.
Experimental results on real uncertain data streams show that our density-based naïve classifier is efficient, accurate, and robust to data uncertainty.

Item Name Disambiguation from link data in a collaboration graph using temporal and topological features (Springer, 2015-12) Saha, Tanay Kumar; Zhang, Baichuan; Al Hasan, Mohammad; Department of Computer & Information Science, School of Science
In a social community, multiple persons may share the same name, phone number, or some other identifying attributes. This, along with other phenomena such as name abbreviation, name misspelling, and human error, leads to erroneous aggregation of records of multiple persons under a single reference. Such mistakes affect the performance of document retrieval, web search, and database integration and, more importantly, cause improper attribution of credit (or blame). The task of entity disambiguation partitions the records belonging to multiple persons with the objective that each partition is composed of records of a unique person. Existing solutions to this task use either biographical attributes or auxiliary features that are collected from external sources, such as Wikipedia. However, for many scenarios, such auxiliary features are not available, or they are costly to obtain. Moreover, collecting biographical or external data carries the risk of privacy violation. In this work, we propose a method for solving the entity disambiguation task from timestamped link information obtained from a collaboration network. Our method does not intrude on privacy, as it uses only the graph topology of an anonymized network.
Experimental results on two real-life academic collaboration networks show that the proposed method has satisfactory performance.

Item A non-parametric Bayesian model for joint cell clustering and cluster matching: identification of anomalous sample phenotypes with random effects (Springer (Biomed Central Ltd.), 2014) Dundar, Murat; Akova, Ferit; Yerebakan, Halid Z.; Rajwa, Bartek; Department of Computer & Information Science, School of Science
BACKGROUND: Flow cytometry (FC)-based computer-aided diagnostics is an emerging technique utilizing modern multiparametric cytometry systems. The major difficulty in using machine-learning approaches for classification of FC data arises from limited access to a wide variety of anomalous samples for training. In consequence, any learning with an abundance of normal cases and a limited set of specific anomalous cases is biased towards the types of anomalies represented in the training set. Such models do not accurately identify anomalies, whether previously known or unknown, that may exist in future samples tested. Although one-class classifiers trained using only normal cases would avoid such a bias, robust sample characterization is critical for a generalizable model. Owing to sample heterogeneity and instrumental variability, arbitrary characterization of samples usually introduces feature noise that may lead to poor predictive performance. Herein, we present a non-parametric Bayesian algorithm called ASPIRE (anomalous sample phenotype identification with random effects) that identifies phenotypic differences across a batch of samples in the presence of random effects. Our approach involves simultaneous clustering of cellular measurements in individual samples and matching of discovered clusters across all samples in order to recover global clusters using probabilistic sampling techniques in a systematic way.
RESULTS: We demonstrate the performance of the proposed method in identifying anomalous samples in two different FC data sets, one of which represents a set of samples including acute myeloid leukemia (AML) cases, and the other a generic 5-parameter peripheral-blood immunophenotyping. Results are evaluated in terms of the area under the receiver operating characteristic curve (AUC). ASPIRE achieved AUCs of 0.99 and 1.0 on the AML and generic blood immunophenotyping data sets, respectively.
CONCLUSIONS: These results demonstrate that anomalous samples can be identified by ASPIRE with almost perfect accuracy without a priori access to samples of anomalous subtypes in the training set. The ASPIRE approach is unique in its ability to form generalizations regarding normal and anomalous states given only very weak assumptions regarding sample characteristics and origin. Thus, ASPIRE could become highly instrumental in providing unique insights about observed biological phenomena in the absence of full information about the investigated samples.
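
Note on the AUC figures: the ASPIRE and DMAP entries in this listing both report results as an area under the ROC curve. As a reading aid only, the sketch below shows one standard way to compute that quantity, namely the fraction of (positive, negative) pairs in which the positive sample receives the higher score, with ties counted as half. The function name `auc_from_scores` and the toy labels and scores are hypothetical and are not taken from either paper; neither abstract states how its AUC was actually computed.

```python
def auc_from_scores(labels, scores):
    """Area under the ROC curve via pairwise comparison.

    labels: iterable of 0/1 (1 = positive, e.g. anomalous sample).
    scores: iterable of floats, higher means "more likely positive".
    Illustrative only; the data passed in below is made up.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative sample")

    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # ties count as half
    return wins / (len(pos) * len(neg))


# Hypothetical toy data: two positives, three negatives.
toy_labels = [1, 1, 0, 0, 0]
toy_scores = [0.9, 0.4, 0.5, 0.2, 0.1]
toy_auc = auc_from_scores(toy_labels, toy_scores)

# A perfectly separated ranking gives an AUC of exactly 1.0.
perfect_auc = auc_from_scores([1, 1, 0, 0], [0.8, 0.7, 0.3, 0.1])
```

In the toy data, five of the six positive/negative pairs are ordered correctly, so `toy_auc` is 5/6. The quadratic pairwise loop is chosen for readability; a rank-based formulation gives the same value more efficiently on large data sets.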