Benchmarking authorship attribution techniques using over a thousand books by fifty Victorian era novelists

dc.contributor.advisor: Dundar, Murat
dc.contributor.author: Gungor, Abdulmecit
dc.date.accessioned: 2018-04-26T21:16:11Z
dc.date.available: 2018-04-26T21:16:11Z
dc.date.issued: 2018-04-03
dc.degree.date: 2018 [en_US]
dc.degree.grantor: Purdue University [en_US]
dc.degree.level: M.S. [en_US]
dc.description: Indiana University-Purdue University Indianapolis (IUPUI) [en_US]
dc.description.abstract: Authorship attribution (AA) is the process of identifying the author of a given text; from a machine learning perspective, it can be seen as a classification problem. The literature describes many classification methods, each of which depends on feature extraction. In this thesis, we explore information retrieval techniques such as Word2Vec and paragraph2vec, along with other feature selection and extraction techniques, and evaluate them with different classifiers. We performed experiments on novels extracted from the GDELT database, using features such as bag of words, n-grams, and more recently developed techniques like Word2Vec. To improve our success rate, we combined several useful features, including a diversity measure of the text, bag of words, bigrams, and specific words that English and American authors write differently. Support vector machine classifiers of the nu-SVC type were observed to give the best success rates on the stacked feature set. The main purpose of this work is to lay the foundations of feature extraction techniques in AA: lexical, character-level, syntactic, semantic, and application-specific features. We also aim to offer a new data resource for the authorship attribution research community and to demonstrate how it can be used to extract features, as in any kind of AA problem. The dataset we introduce consists of works by Victorian era authors, and the main feature extraction techniques are shown with example code snippets for audiences in different knowledge domains. The feature extraction approaches and their use with different classifiers are implemented in simple ways, so that the work can also serve as a first step into AA for beginners. Some of the feature extraction techniques introduced here are also meant to be used in other NLP tasks, such as sentiment analysis with Word2Vec or text summarization, and readers can start applying these tasks and techniques to our dataset. We also introduce several ways of using the extracted features, such as feature stack engineering with different classifiers and using Word2Vec to create sentence-level vectors. [en_US]
dc.identifier.doi: 10.7912/C21T01
dc.identifier.uri: https://hdl.handle.net/1805/15938
dc.identifier.uri: http://dx.doi.org/10.7912/C2/2352
dc.language.iso: en_US [en_US]
dc.rights: Attribution-NonCommercial-NoDerivs 3.0 United States
dc.rights.uri: http://creativecommons.org/licenses/by-nc-nd/3.0/us/
dc.subject: Authorship Attribution [en_US]
dc.subject: Word2Vec [en_US]
dc.subject: Doc2Vec [en_US]
dc.subject: Word2Vec Inversion [en_US]
dc.subject: Unsupervised Feature Learning in Text Mining [en_US]
dc.subject: Word Scoring [en_US]
dc.title: Benchmarking authorship attribution techniques using over a thousand books by fifty Victorian era novelists [en_US]
dc.type: Thesis [en]
thesis.degree.discipline: Computer & Information Science [en]
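
The abstract describes stacking several feature groups (bag of words, bigrams, a diversity measure, and words spelled differently by English and American authors) into one feature set. The following is a minimal sketch of that idea, not the thesis's own code. The function names, the three-pair spelling list, and the use of the type-token ratio as the diversity measure are all assumptions made for illustration; the record does not specify them.

```python
import re
from collections import Counter

# Hypothetical, tiny spelling list (British, American). The word list
# actually used in the thesis is not given in this record.
BRITISH_AMERICAN_PAIRS = [("colour", "color"), ("honour", "honor"), ("centre", "center")]


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def stacked_features(text, vocabulary, bigram_vocabulary):
    """Concatenate several feature groups into one flat vector."""
    tokens = tokenize(text)
    total = max(len(tokens), 1)  # avoid division by zero on empty text
    word_counts = Counter(tokens)
    bigram_counts = Counter(zip(tokens, tokens[1:]))

    # Bag of words: relative frequency of each vocabulary word.
    bow = [word_counts[w] / total for w in vocabulary]
    # Bigrams: relative frequency of each selected word pair.
    bigrams = [bigram_counts[b] / total for b in bigram_vocabulary]
    # Diversity measure (assumed here): type-token ratio.
    diversity = [len(word_counts) / total]
    # Spelling preference: British minus American count, normalized.
    british = sum(word_counts[b] for b, _ in BRITISH_AMERICAN_PAIRS)
    american = sum(word_counts[a] for _, a in BRITISH_AMERICAN_PAIRS)
    spelling = [(british - american) / total]

    return bow + bigrams + diversity + spelling


sample = "The colour of the sky was the colour of honour."
features = stacked_features(
    sample,
    vocabulary=["the", "of", "sky"],
    bigram_vocabulary=[("of", "the"), ("the", "sky")],
)
print(features)
```

One vector per book, built this way, could then be passed to a classifier; the abstract reports the best results with a nu-SVC support vector machine, for which scikit-learn's `NuSVC` would be one possible implementation.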
Files
Original bundle
Name: abdulmecits-purdue-thesis.pdf
Size: 1.11 MB
Format: Adobe Portable Document Format
Description: Full Thesis
License bundle
Name: license.txt
Size: 1.99 KB
Format: Item-specific license agreed upon to submission
Description: