ScholarWorksIndianapolis

Browsing by Author "Zhang, Baichuan"

Now showing 1 - 10 of 10
  • Bayesian Non-Exhaustive Classification A Case Study: Online Name Disambiguation using Temporal Record Streams
    (ACM, 2016-10) Zhang, Baichuan; Dundar, Murat; Al Hasan, Mohammad; Department of Computer and Information Science, School of Science
    The name entity disambiguation task aims to partition the records of multiple real-life persons so that each partition contains records pertaining to a unique person. Most of the existing solutions for this task operate in a batch mode, where all records to be disambiguated are initially available to the algorithm. However, more realistic settings require that the name disambiguation task be performed in an online fashion, in addition to being able to identify records of new ambiguous entities having no preexisting records. In this work, we propose a Bayesian non-exhaustive classification framework for solving the online name disambiguation task. Our proposed method uses a Dirichlet process prior with a Normal × Normal × Inverse Wishart data model, which enables identification of new ambiguous entities who have no records in the training data. For online classification, we use a one-sweep Gibbs sampler, which is very efficient and effective. As a case study, we consider bibliographic data in a temporal stream format and disambiguate authors by partitioning their papers into homogeneous groups. Our experimental results demonstrate that the proposed method is better than existing methods for performing the online name disambiguation task.
  • A Combined Representation Learning Approach for Better Job and Skill Recommendation
    (ACM, 2018-10) Dave, Vachik S.; Al Hasan, Mohammad; Zhang, Baichuan; AlJadda, Khalifeh; Korayem, Mohammed; Computer and Information Science, School of Science
    Job recommendation is an important task for the modern recruitment industry. An excellent job recommender system not only recommends a higher-paying job that is maximally aligned with the skill set of the current job, but also suggests a few additional skills that are required to assume the new position. In this work, we created three types of information networks from the historical job data: (i) job transition network, (ii) job-skill network, and (iii) skill co-occurrence network. We provide a representation learning model which can utilize the information from all three networks to jointly learn the representation of the jobs and skills in a shared k-dimensional latent space. In our experiments, we show that by jointly learning the representation for the jobs and skills, our model provides better recommendations for both jobs and skills. We also present some case studies which validate our claims.
  • Feature Selection for Classification under Anonymity Constraint
    (2017) Zhang, Baichuan; Mohammed, Noman; Dave, Vachik S.; Al Hasan, Mohammad; Computer and Information Science, School of Science
    Over the last decade, the proliferation of various online platforms and their increasing adoption by billions of users have heightened the privacy risk of a user enormously. In fact, security researchers have shown that sparse microdata containing information about a user's online activities, although anonymous, can still be used to disclose the identity of the user by cross-referencing the data with other data sources. To preserve the privacy of a user, existing works propose several methods (k-anonymity, l-diversity, differential privacy) that ensure a dataset meant to be shared or published bears a small identity disclosure risk. However, the majority of these methods modify the data in isolation, without considering their utility in subsequent knowledge discovery tasks, which makes these datasets less informative. In this work, we consider labeled data that are generally used for classification, and propose two methods for feature selection considering two goals: first, on the reduced feature set the data has a small disclosure risk, and second, the utility of the data is preserved for performing a classification task. Experimental results on various real-world datasets show that the method is effective and useful in practice.
  • Incremental eigenpair computation for graph Laplacian matrices: theory and applications
    (Springer, 2018-12) Chen, Pin-Yu; Zhang, Baichuan; Al Hasan, Mohammad; Computer and Information Science, School of Science
    The smallest eigenvalues and the associated eigenvectors (i.e., eigenpairs) of a graph Laplacian matrix have been widely used for spectral clustering and community detection. However, in real-life applications, the number of clusters or communities (say, K) is generally unknown a priori. Consequently, the majority of the existing methods either choose K heuristically or repeat the clustering method with different choices of K and accept the best clustering result. The first option, more often than not, yields a suboptimal result, while the second option is computationally expensive. In this work, we propose an incremental method for constructing the eigenspectrum of the graph Laplacian matrix. This method leverages the eigenstructure of the graph Laplacian matrix to obtain the Kth smallest eigenpair of the Laplacian matrix given a collection of all previously computed
  • Name Disambiguation from link data in a collaboration graph
    (Office of the Vice Chancellor for Research, 2015-04-17) Zhang, Baichuan; Saha, Tanay Kumar; Al Hasan, Mohammad
    The entity disambiguation task partitions the records belonging to multiple persons with the objective that each partition is composed of records of a unique person. Existing solutions to this task use either biographical attributes, or auxiliary features that are collected from external sources, such as Wikipedia. However, for many scenarios, such auxiliary features are not available, or they are costly to obtain. Besides, attempting to collect biographical or external data carries the risk of privacy violation. In this work, we propose a method for solving the entity disambiguation task from link information obtained from a collaboration network. Our method is non-intrusive of privacy as it uses only the timestamped graph topology of an anonymized network. Experimental results on two real-life academic collaboration networks show that the proposed method has satisfactory performance.
  • Name Disambiguation from link data in a collaboration graph using temporal and topological features
    (Springer, 2015-12) Saha, Tanay Kumar; Zhang, Baichuan; Al Hasan, Mohammad; Department of Computer & Information Science, School of Science
    In a social community, multiple persons may share the same name, phone number, or some other identifying attributes. This, along with other phenomena such as name abbreviation, name misspelling, and human error, leads to erroneous aggregation of records of multiple persons under a single reference. Such mistakes degrade the performance of document retrieval, web search, and database integration and, more importantly, cause improper attribution of credit (or blame). The task of entity disambiguation partitions the records belonging to multiple persons with the objective that each partition is composed of records of a unique person. Existing solutions to this task use either biographical attributes, or auxiliary features that are collected from external sources, such as Wikipedia. However, for many scenarios, such auxiliary features are not available, or they are costly to obtain. Besides, attempting to collect biographical or external data carries the risk of privacy violation. In this work, we propose a method for solving the entity disambiguation task from timestamped link information obtained from a collaboration network. Our method is non-intrusive of privacy as it uses only the graph topology of an anonymized network. Experimental results on two real-life academic collaboration networks show that the proposed method has satisfactory performance.
  • Name Disambiguation in Anonymized Graphs using Network Embedding
    (ACM, 2017) Zhang, Baichuan; Al Hasan, Mohammad; Computer and Information Science, School of Science
    In the real world, our DNA is unique, but many people share names. This phenomenon often causes erroneous aggregation of documents of multiple persons who are namesakes of one another. Such mistakes deteriorate the performance of document retrieval and web search and, more seriously, cause improper attribution of credit or blame in digital forensics. To resolve this issue, the name disambiguation task aims to partition the documents associated with a name reference such that each partition contains documents pertaining to a unique real-life person. Existing solutions to this task substantially rely on feature engineering, such as biographical feature extraction, or construction of auxiliary features from Wikipedia. However, for many scenarios, such features may be costly to obtain or unavailable due to the risk of privacy violation. In this work, we propose a novel name disambiguation method. Our proposed method is non-intrusive of privacy because, instead of using attributes pertaining to a real-life person, it leverages only relational data in the form of anonymized graphs. Methodologically, the proposed method uses a novel representation learning model to embed each document in a low-dimensional vector space where name disambiguation can be solved by a hierarchical agglomerative clustering algorithm. Our experimental results demonstrate that the proposed method is significantly better than the existing name disambiguation methods working in a similar setting.
  • NOUS: Construction and Querying of Dynamic Knowledge Graphs
    (IEEE, 2017-04) Choudhury, Sutanay; Agarwal, Khushbu; Purohit, Sumit; Zhang, Baichuan; Pirrung, Meg; Smith, Will; Thomas, Mathew; Computer and Information Science, School of Science
    The ability to construct domain-specific knowledge graphs (KG) and perform question-answering or hypothesis generation is a transformative capability. Despite their value, automated construction of knowledge graphs remains an expensive technical challenge that is beyond the reach of most enterprises and academic institutions. We propose an end-to-end framework for developing custom knowledge-graph-driven analytics for arbitrary application domains. The uniqueness of our system lies in A) its combination of curated KGs with knowledge extracted from unstructured text, B) its support for advanced trending and explanatory questions on a dynamic KG, and C) its ability to answer queries where the answer is embedded across multiple data sources.
  • Predicting interval time for reciprocal link creation using survival analysis
    (Springer, 2018-12) Dave, Vachik S.; Al Hasan, Mohammad; Zhang, Baichuan; Reddy, Chandan K.; Computer and Information Science, School of Science
    The majority of directed social networks, such as Twitter, Flickr and Google+, exhibit reciprocal altruism, a social psychology phenomenon, which drives a vertex to create a reciprocal link with another vertex which has created a directed link toward the former. In existing works, scientists have already predicted the possibility of the creation of a reciprocal link, a task known as “reciprocal link prediction”. However, an equally important problem is determining the interval time between the creation of the first link (also called the parasocial link) and its corresponding reciprocal link. No existing works have considered solving this problem, which is the focus of this paper. Predicting the reciprocal link interval time is a challenging problem for two reasons: first, there is a lack of effective features, since well-known link prediction features are designed for undirected networks and for the binary classification task; hence, they do not work well for interval time prediction; second, the presence of ever-waiting links (i.e., parasocial links for which a reciprocal link is not formed within the observation period) makes traditional supervised regression methods unsuitable for such data. In this paper, we propose a solution for the reciprocal link interval time prediction task. We map this problem to a survival analysis task and show through extensive experiments on real-world datasets that survival analysis methods perform better than traditional regression, neural network-based models and support vector regression for solving reciprocal interval time prediction.
  • Simplicity of Kmeans versus Deepness of Deep Learning: A Case of Unsupervised Feature Learning with Limited Data
    (IEEE, 2015-12) Dundar, Murat; Kou, Qiang; Zhang, Baichuan; He, Yicheng; Rajwa, Bartek; Department of Computer and Information Sciences, School of Science
    We study a bio-detection application as a case study to demonstrate that Kmeans-based unsupervised feature learning can be a simple yet effective alternative to deep learning techniques for small data sets with limited intra- as well as inter-class diversity. We investigate the effect on classifier performance of data augmentation as well as feature extraction with multiple patch sizes and at different image scales. Our data set includes 1833 images from four different classes of bacteria, with each bacterial culture captured at three different wavelengths and the overall data collected during a three-day period. The limited number and diversity of images present, potential random effects across multiple days, and the multi-mode nature of class distributions pose a challenging setting for representation learning. Using images collected on the first day for training, on the second day for validation, and on the third day for testing, Kmeans-based representation learning achieves 97% classification accuracy on the test data. This compares very favorably to 56% accuracy achieved by deep learning and 74% accuracy achieved by handcrafted features. Our results suggest that data augmentation or dropping connections between units offers little help for deep-learning algorithms, whereas a significant boost can be achieved by Kmeans-based representation learning by augmenting data and by concatenating features obtained at multiple patch sizes or image scales.