Enumerating k-cliques in a large network using Apache Spark

Date
2017
Language
American English
Embargo Lift Date
Department
Committee Chair
Committee Members
Degree
M.S.
Degree Year
2017
Department
Grantor
Purdue University
Journal Title
Journal ISSN
Volume Title
Found At
Abstract

Network analysis is an important research task which explains the relationships among various entities in a given domain. Most of the existing approaches of network analysis compute global properties of a network, such as transitivity, diameter, and all-pair shortest paths. They also study various non-random properties of a network, such as graph densifi cation with shrinking diameter, small diameter, and scale-freeness. Such approaches enable us to understand real-life networks with global properties. However, the discovery of the local topological building blocks within a network is an important task, and examples include clique enumeration, graphlet counting, and motif counting. In this paper, my focus is to fi nd an efficient solution of k-clique enumeration problem. A clique is a small, connected, and complete induced subgraph over a large network. However, enumerating cliques using sequential technologies is very time-consuming. Another promising direction that is being adopted is a solution that runs on distributed clusters of machines using the Hadoop mapreduce framework. However, the solution suffers from a general limitation of the framework, as Hadoop's mapreduce performs substantial amounts of reading and writing to disk. Thus, the running times of Hadoop-based approaches suffer enormously. To avoid these problems, we propose an e cient, scalable, and distributed solution, kc-spark , for enumerating cliques in real-life networks using the Apache Spark in-memory cluster computing framework. Experiment results show that kc-spark can enumerate k-cliques from very large real-life networks, whereas a single commodity machine cannot produce the same desired result in a feasible amount of time. We also compared kc-spark with Hadoop mapreduce solutions and found the algorithm to be 80-100 percent faster in terms of running times. On the other hand, we compared with the triangle enumeration with Hadoop mapreduce and results shown that kc-spark is 8-10 times faster than mapreduce implementation with the same cluster setup. Furthermore, the overall performance of kc-spark is improved by using Spark's inbuilt caching and broadcast transformations.

Description
Indiana University-Purdue University Indianapolis (IUPUI)
item.page.description.tableofcontents
item.page.relation.haspart
Cite As
ISSN
Publisher
Series/Report
Sponsorship
Major
Extent
Identifier
Relation
Journal
Rights
Source
Alternative Title
Type
Thesis
Number
Volume
Conference Dates
Conference Host
Conference Location
Conference Name
Conference Panel
Conference Secretariat Location
Version
Full Text Available at
This item is under embargo {{howLong}}