Benchmarking authorship attribution techniques using over a thousand books by fifty Victorian era novelists

Gungor, Abdulmecit

dc.contributor.advisor	Dundar, Murat
dc.contributor.author	Gungor, Abdulmecit
dc.date.accessioned	2018-04-26T21:16:11Z
dc.date.available	2018-04-26T21:16:11Z
dc.date.issued	2018-04-03
dc.degree.date	2018	en_US
dc.degree.grantor	Purdue University	en_US
dc.degree.level	M.S.	en_US
dc.description	Indiana University-Purdue University Indianapolis (IUPUI)	en_US
dc.description.abstract	Authorship attribution (AA) is the process of identifying the author of a given text; from a machine learning perspective, it can be seen as a classification problem. The literature offers many classification methods, each relying on some form of feature extraction. In this thesis, we explore information retrieval techniques such as Word2Vec and paragraph2vec, along with other feature selection and extraction techniques, applied to a given text with different classifiers. We have performed experiments on novels extracted from the GDELT database using features such as bag of words, n-grams, and newer techniques like Word2Vec. To improve our success rate, we have combined several useful features, including a diversity measure of the text, bag of words, bigrams, and specific words that are written differently by English and American authors. Support vector machine classifiers of the nu-SVC type are observed to give the best success rates on the stacked feature set. The main purpose of this work is to lay the foundations of feature extraction techniques in AA: lexical, character-level, syntactic, semantic, and application-specific features. We also aim to offer a new data resource for the authorship attribution research community and to demonstrate how it can be used to extract features, as in any kind of AA problem. The dataset we introduce consists of works by Victorian era authors, and the main feature extraction techniques are shown with exemplary code snippets for audiences in different knowledge domains. Feature extraction approaches and their implementation with different classifiers are presented in simple ways so that the work can also serve as a first step into AA for beginners. Some feature extraction techniques introduced in this work are also meant to be employed in other NLP tasks, such as sentiment analysis with Word2Vec or text summarization.
Readers can start applying the introduced NLP tasks and feature extraction techniques to our dataset. We have also introduced several ways to use the extracted features, such as feature stack engineering with different classifiers or using Word2Vec to create sentence-level vectors.	en_US
dc.identifier.doi	10.7912/C21T01
dc.identifier.uri	https://hdl.handle.net/1805/15938
dc.identifier.uri	http://dx.doi.org/10.7912/C2/2352
dc.language.iso	en_US	en_US
dc.rights	Attribution-NonCommercial-NoDerivs 3.0 United States
dc.rights.uri	http://creativecommons.org/licenses/by-nc-nd/3.0/us/
dc.subject	Authorship Attribution	en_US
dc.subject	Word2Vec	en_US
dc.subject	Doc2Vec	en_US
dc.subject	Word2Vec Inversion	en_US
dc.subject	Unsupervised Feature Learning in Text Mining	en_US
dc.subject	Word Scoring	en_US
dc.title	Benchmarking authorship attribution techniques using over a thousand books by fifty Victorian era novelists	en_US
dc.type	Thesis	en
thesis.degree.discipline	Computer & Information Science	en

Files

Original bundle

Name: abdulmecits-purdue-thesis.pdf
Size: 1.11 MB
Format: Adobe Portable Document Format
Description: Full Thesis

License bundle

Name: license.txt
Size: 1.99 KB
Format: Item-specific license agreed upon to submission
Description:

Collections

Computer & Information Science Department Theses and Dissertations