Benchmarking authorship attribution techniques using over a thousand books by fifty Victorian era novelists

Gungor, Abdulmecit

Files

abdulmecits-purdue-thesis.pdf (1.11 MB)

Date

2018-04-03

Authors

Gungor, Abdulmecit

Language

American English

Committee Chair

Dundar, Murat

Degree

M.S.

Degree Year

2018

Grantor

Purdue University

Abstract

Authorship attribution (AA) is the process of identifying the author of a given text; from a machine learning perspective, it can be seen as a classification problem. The literature offers many classification methods, each relying on some form of feature extraction. In this thesis, we explore information retrieval techniques such as Word2Vec and paragraph2vec, along with other feature selection and extraction techniques, and apply them to a given text with different classifiers. We performed experiments on novels extracted from the GDELT database using features such as bag of words and n-grams, as well as newer techniques such as Word2Vec. To improve our success rate, we combined several useful features, including a diversity measure of the text, bag of words, bigrams, and specific words that are written differently by English and American authors. Support vector machine classifiers of the nu-SVC type were observed to give the best success rates on the stacked feature set.
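As a rough illustration of what stacking such features can look like, the sketch below builds one feature dictionary per text from unigram counts, bigram counts, a type-token ratio as the diversity measure, and counts of British versus American spellings. It is not taken from the thesis: the function names, the `BRITISH_AMERICAN` word list, and the choice of type-token ratio are all assumptions made for the example.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny spelling list (British -> American).
# A real experiment would need a far larger, curated list.
BRITISH_AMERICAN = {"colour": "color", "honour": "honor", "centre": "center"}


def tokenize(text):
    """Lowercase the text and return its alphabetic word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def stacked_features(text):
    """Combine several feature families into one flat dictionary."""
    tokens = tokenize(text)
    unigrams = Counter(tokens)                   # bag of words
    bigrams = Counter(zip(tokens, tokens[1:]))   # adjacent word pairs

    # Diversity measure (assumed here to be the type-token ratio).
    diversity = len(unigrams) / len(tokens) if tokens else 0.0

    # Counts of spellings typical of British vs. American writers.
    british = sum(unigrams[w] for w in BRITISH_AMERICAN)
    american = sum(unigrams[w] for w in BRITISH_AMERICAN.values())

    features = {f"uni:{w}": c for w, c in unigrams.items()}
    features.update({f"bi:{a}_{b}": c for (a, b), c in bigrams.items()})
    features["diversity"] = diversity
    features["british_spellings"] = british
    features["american_spellings"] = american
    return features


sample = stacked_features("The colour of the centre was a color")
print(sample["diversity"], sample["british_spellings"], sample["american_spellings"])
```

Dictionaries of this shape could then be turned into a numeric matrix (for example with scikit-learn's `DictVectorizer`) and passed to a classifier such as `sklearn.svm.NuSVC`; that pairing is a suggestion for experimentation, not a description of the thesis pipeline.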

The main purpose of this work is to lay the foundations of feature extraction techniques in AA. These comprise lexical, character-level, syntactic, semantic, and application-specific features. We also aim to offer a new data resource for the authorship attribution research community and to demonstrate how it can be used to extract features, as in any kind of AA problem. The dataset we introduce consists of works by Victorian-era authors, and the main feature extraction techniques are shown with example code snippets for audiences from different knowledge domains. The feature extraction approaches and their implementation with different classifiers are kept simple so that the work can also serve as a first step into AA for beginners. Some of the feature extraction techniques introduced here can also be applied to other NLP tasks, such as sentiment analysis with Word2Vec or text summarization. Using the NLP tasks and feature extraction techniques introduced here, readers can begin implementing them on our dataset. We have also introduced several ways of using the extracted features, such as feature stack engineering with different classifiers and using Word2Vec to create sentence-level vectors.
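One common way to obtain a sentence-level vector from word vectors is to average the vectors of the words in the sentence. The sketch below shows that idea only; the abstract does not say which aggregation the thesis uses, so averaging is an assumption, and `toy_embeddings` is a hypothetical two-dimensional stand-in for vectors that would normally come from a trained Word2Vec model.

```python
def sentence_vector(sentence, embeddings, dim):
    """Average the vectors of all known words in the sentence.

    Words missing from `embeddings` are skipped; if no word is known,
    a zero vector of length `dim` is returned.
    """
    words = [w for w in sentence.lower().split() if w in embeddings]
    if not words:
        return [0.0] * dim
    return [
        sum(embeddings[w][i] for w in words) / len(words)
        for i in range(dim)
    ]


# Hypothetical toy vectors, used only to make the example runnable.
toy_embeddings = {"dark": [1.0, 0.0], "night": [0.0, 1.0]}

print(sentence_vector("Dark night falls", toy_embeddings, dim=2))
```

Averaging discards word order, which keeps the example simple but may lose stylistic signal; weighted averages or document-level models such as paragraph2vec are alternatives worth comparing.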

Description

Indiana University-Purdue University Indianapolis (IUPUI)

Keywords

Authorship Attribution, Word2Vec, Doc2Vec, Word2Vec Inversion, Unsupervised Feature Learning in Text Mining, Word Scoring

Rights

Attribution-NonCommercial-NoDerivs 3.0 United States

Type

Thesis

Permanent Link

https://hdl.handle.net/1805/15938
http://dx.doi.org/10.7912/C2/2352

Collections

Computer & Information Science Department Theses and Dissertations
