Enumerating k-cliques in a large network using Apache Spark

Dheekonda, Raja Sekhar Rao

Enumerating k-cliques in a large network using Apache Spark

Files

RajaDheekonda Thesis Final Copy_approved.pdf (613.03 KB)

Date

2017

Authors

Dheekonda, Raja Sekhar Rao

Language

American English

Committee Chair

Al Hasan, MOHAMMAD

Degree

M.S.

Degree Year

2017

Grantor

Purdue University

Abstract

Network analysis is an important research task which explains the relationships among various entities in a given domain. Most of the existing approaches of network analysis compute global properties of a network, such as transitivity, diameter, and all-pair shortest paths. They also study various non-random properties of a network, such as graph densifi cation with shrinking diameter, small diameter, and scale-freeness. Such approaches enable us to understand real-life networks with global properties. However, the discovery of the local topological building blocks within a network is an important task, and examples include clique enumeration, graphlet counting, and motif counting. In this paper, my focus is to fi nd an efficient solution of k-clique enumeration problem. A clique is a small, connected, and complete induced subgraph over a large network. However, enumerating cliques using sequential technologies is very time-consuming. Another promising direction that is being adopted is a solution that runs on distributed clusters of machines using the Hadoop mapreduce framework. However, the solution suffers from a general limitation of the framework, as Hadoop's mapreduce performs substantial amounts of reading and writing to disk. Thus, the running times of Hadoop-based approaches suffer enormously. To avoid these problems, we propose an e cient, scalable, and distributed solution, kc-spark , for enumerating cliques in real-life networks using the Apache Spark in-memory cluster computing framework. Experiment results show that kc-spark can enumerate k-cliques from very large real-life networks, whereas a single commodity machine cannot produce the same desired result in a feasible amount of time. We also compared kc-spark with Hadoop mapreduce solutions and found the algorithm to be 80-100 percent faster in terms of running times. On the other hand, we compared with the triangle enumeration with Hadoop mapreduce and results shown that kc-spark is 8-10 times faster than mapreduce implementation with the same cluster setup. Furthermore, the overall performance of kc-spark is improved by using Spark's inbuilt caching and broadcast transformations.

Description

Indiana University-Purdue University Indianapolis (IUPUI)

Keywords

k-clique, Apache Spark, Enumeration, Distributed, Graph Mining

Rights

Type

Thesis

Permanent Link

https://hdl.handle.net/1805/12282
http://dx.doi.org/10.7912/C2/2341

Collections

Computer & Information Science Department Theses and Dissertations

Full item page