BENCHMARKING AUTHORSHIP ATTRIBUTION TECHNIQUES USING OVER A THOUSAND BOOKS BY FIFTY VICTORIAN ERA NOVELISTS A Thesis Submitted to the Faculty of Purdue University by Abdulmecit Gungor In Partial Fulfillment of the Requirements for the Degree of Master of Science May 2018 Purdue University Indianapolis, Indiana ii THE PURDUE UNIVERSITY GRADUATE SCHOOL STATEMENT OF COMMITTEE APPROVAL Dr. Murat Dundar, Chair Department of Computer & Information Science Dr. George Mohler Department of Computer & Information Science Dr. Mihran Tuceryan Department of Computer & Information Science Approved by: Dr. Shiaofen Fang Head of the Graduate Program iii Dedicated to my inspiring parents, colleagues, friends and teachers, for being the role models, cheer-leading squad and sounding boards I have needed in my life. iv ACKNOWLEDGMENTS First and foremost, I would like to thank my parents, Cengiz and Fatma Gungor for providing me with the support to pursue my degree in Computer Science. Without their support I may not have found myself at Purdue University, nor had the courage to engage in this task and see it through. I’m very grateful to have such a supportive family. Importantly, I would like to thank my thesis advisor, Dr. Murat Dundar for the guidance, advice, and hours of struggling through this challenging process. Without his assistance, I wouldn’t have found myself at Purdue University nor focus on the topic of Natural Language Processing. In my thesis, he has helped me to set the aim of our work as change and our methodology to achieve our aim as reflecting on experience and goal setting for future practice. Thank you Dr. Dundar, for making me feel that working with you is a fruitful, exploratory, and joyful experience. Lastly, I would like to thank my teammates and lab-mates whom I learned a lot while working with them: Halid Ziya Yerebakan, Sarkhan Badirli, Yicheng Cheng, Ziyin Wang, Sarun Gulyanon, Huiwen Cheng, Sepehr Farhand, Asimenia Dimokranitou. Halid, Sarkhan and I have teamed up to participate in Hack-Ohio challenge. We have created a navigation system in which parking problem have been easily solved. Our project has won the Best-Hack reward among more than seven hundred participants. On another occasion, Sarun, Sarkhan and I have finished second in Code4life University Challenge organized by Roche. We have created a data driven and visually stunning chatbot. Thank you team, for your hard work and making our time at Purdue worthwhile. v PREFACE This basis for this research originally stemmed from our idea exchanges with Dr. Dundar. Our passion has been developing better feature representation of sentences and paragraphs that can beat the performance of traditional methods. As the world moves further into the digital age, generating vast amounts of text data, there will be a greater need to create new methodologies using predefined common norms that can break down the lack of efficiency in different application domains. Hence, the aim of this work is to create a research community around it by providing them with the benchmark methodologies and a new dataset. vi TABLE OF CONTENTS Page LIST OF TABLES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . viii LIST OF FIGURES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ix ABSTRACT . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . x 1 INTRODUCTION . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1 1.1 Forensic Linguistic . . . . . . . . . . . . . . . . . . . . . . . . . . 
. . . 2 1.2 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3 2 ADVANCES IN NATURAL LANGUAGE PROCESSING . . . . . . . . . . 7 2.1 NLP Tasks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7 2.2 Data Availability . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14 2.3 Neural Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16 2.4 Cloud Computing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19 3 DATASET PREPARATION . . . . . . . . . . . . . . . . . . . . . . . . . . . 22 3.1 50-Author Dataset . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23 3.2 3-Author Dataset . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28 4 PRACTICAL FEATURE EXTRACTION . . . . . . . . . . . . . . . . . . . 30 4.1 Lexical Features . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30 4.2 Character Features . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42 4.3 Syntactic Features . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44 4.4 Semantic Features . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47 4.5 Application Specific Features . . . . . . . . . . . . . . . . . . . . . . . . 49 5 CLASSIFICATION METHODS IN AUTHORSHIP ATTRIBUTION . . . . 52 5.1 Working on Dataset Without Stop Words . . . . . . . . . . . . . . . . . 53 5.2 Feature Engineering with Different Classifiers . . . . . . . . . . . . . . 58 5.3 Sentence and Paragraph Generating Model . . . . . . . . . . . . . . . . 64 vii Page 5.4 Ensemble Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68 5.5 Defining Sentence Vectors . . . . . . . . . . . . . . . . . . . . . . . . . 69 5.6 Specific Word Usage Score . . . . . . . . . . . . . . . . . . . . . . . . . 72 5.7 Unsupervised Feature Learning . . . . . . . . . . . . . . . . . . . . . . 76 5.8 Inversion with Word Embeddings . . . . . . . . . . . . . . . . . . . . . 80 6 SUMMARY . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82 7 RECOMMENDATIONS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84 REFERENCES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86 viii LIST OF TABLES Table Page 2.1 Positivity and Negativity Score of Sentences . . . . . . . . . . . . . . . . . 11 3.1 Author Book Number Distribution . . . . . . . . . . . . . . . . . . . . . . 25 4.1 N-Grams Distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34 4.2 Stylometric Feature Density Distributions . . . . . . . . . . . . . . . . . . 38 4.3 Highest Tf-Idf Pairs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41 4.4 Highest Character N-gram . . . . . . . . . . . . . . . . . . . . . . . . . . . 43 4.5 Top 5 Adjectives, Nouns, Verbs Usage . . . . . . . . . . . . . . . . . . . . 46 5.1 Mean F1 Scores for Known and Unknown Book Id 5.2 Mean F1 Scores for Different Experimentation Settings . . . . . . . . . . . 59 5.3 Mean F1 Scores for Sentence Vectors . . . . . . . . . . . . . . . . . . . . . 71 5.4 Mean F1 Scores for WSJL Algorithm . . . . . . . . . . . . . . . . . . . . . 76 5.5 UFL F1 Score on 50-Author Dataset . . . . . . . . . . . . . . . . . . . . . 79 . . . . . . . . . . . . . 57 ix LIST OF FIGURES Figure Page 3.1 Distribution of Authors in Training Set . . . . . . . . . . . . . . . . . . . . 27 3.2 Distribution of Authors in Training Set . . . . . . . . . . . . . . . . . . . . 29 4.1 Oliver Twist, Charles Dickens . . . . . . . . . . . . . . . . . . . . . . . . . 
32 4.2 Horse Tale, Mark Twain . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33 4.3 Bigram Network Diagram for Author Id:1, 22, 24 . . . . . . . . . . . . . . 36 4.4 Vocabulary Diversity for 50-Author Dataset . . . . . . . . . . . . . . . . . 37 4.5 Power Word Density for 50-Author Dataset . . . . . . . . . . . . . . . . . 39 4.6 Function Words Density for 50-Author Dataset . . . . . . . . . . . . . . . 40 4.7 Apostrophe Usage for 3-Author Dataset . . . . . . . . . . . . . . . . . . . 44 4.8 Adjective, Verb, Noun Density for 50-Author Dataset . . . . . . . . . . . . 45 4.9 Positivity and Negativity Comparison . . . . . . . . . . . . . . . . . . . . . 48 4.10 Word2Vec 2-D Closest Words for ’listen’ . . . . . . . . . . . . . . . . . . . 50 5.1 Xgboost Feature Distribution . . . . . . . . . . . . . . . . . . . . . . . . . 62 5.2 Confusion Matrix for SVM Combined Features . . . . . . . . . . . . . . . . 64 5.3 Character Frequency for A. Doyle, C. Dickens, J. Baldwin . . . . . . . . . 67 5.4 Word Scoring Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72 5.5 One vs Others Most Common 20 Words . . . . . . . . . . . . . . . . . . . 74 x ABSTRACT Gungor, Abdulmecit M.S., Purdue University, May 2018. Benchmarking Authorship Attribution Techniques Using Over a Thousand Books by Fifty Victorian Era Novelists. Major Professor: Murat Dundar. Authorship attribution (AA) is the process of identifying the author of a given text and from the machine learning perspective, it can be seen as a classification problem. In the literature, there are a lot of classification methods for which feature extraction techniques are conducted. In this thesis, we explore information retrieval techniques such as Word2Vec, paragraph2vec, and other useful feature selection and extraction techniques for a given text with different classifiers. We have performed experiments on novels that are extracted from GDELT database by using different features such as bag of words, n-grams or newly developed techniques like Word2Vec. To improve our success rate, we have combined some useful features some of which are diversity measure of text, bag of words, bigrams, specific words that are written differently between English and American authors. Support vector machine classifiers with nu-SVC type is observed to give best success rates on the stacked useful feature set. The main purpose of this work is to lay the foundations of feature extraction techniques in AA. These are lexical, character-level, syntactic, semantic, application specific features. We also have aimed to offer a new data resource for the author attribution research community and demonstrate how it can be used to extract features as in any kind of AA problem. The dataset we have introduced consists of works of Victorian era authors and the main feature extraction techniques are shown with exemplary code snippets for audiences in different knowledge domains. Feature extraction approaches and implementation with different classifiers are employed in xi simple ways such that it would also serve as a beginner step to AA. Some feature extraction techniques introduced in this work are also meant to be employed in different NLP tasks such as sentiment analysis with Word2Vec or text summarization. Using the introduced NLP tasks and feature extraction techniques one can start to implement them on our dataset. 
We have also introduced several methods to implement extracted features in different methodologies such as feature stack engineering with different classifiers, or using Word2Vec to create sentence level vectors. 1 1. INTRODUCTION In a general point of view, the authorship attribution problems refer to all the issues that arise in the association of an author to a specific document. A common example is that when a piece of historical text is being discovered we would like to know who wrote it and when it was written. In addition to that, we may also ask other questions such as the nationality of the author, the characteristic styles of the author, genre of the text. One of examples of authorship attribution problems could be as follows: Given three texts below from 19th century authors Edgar Allan Poe, Mary Shelley, and Howard Phillips Lovecraft, we want to identify the author of each sentences. (a) “But the Raven, sitting lonely on the placid bust, spoke only that one word, as if his soul in that one word he did outpour.” (b) “Again there is a sound as of a human voice, but hoarser; it comes from the cabin where the remains of Frankenstein still lie.” (c) “Great Cthulhu is their cousin, yet can he spy them only dimly. ” Having a closer look at the sentences would help us capture a few useful words that would pinpoint the authors of each text. As we know Edgar Allan Poe is famous with his poem “The Raven”, Mary Shelley is famous with her book “The Modern Prometheus” in which she tells a story about Frankenstein, and HP Lovecraft has an imaginary cosmic creature Cthulhu in his book “The Call of Cthulhu”. The task of naming the author for each text is not a challenging one in this case. It will, however, become more challenging when an author is describing a daily activity with most commonly used English words, in which case it will require more sophisticated approaches to complete the task. 2 1.1 Forensic Linguistic From a broader contextual perspective authorship attribution is also a part of Forensic Linguistics Science and the history of Forensic Linguistic Science goes back to 1968 when a professor of linguistics analyzed police statements of falsely accused felon for the death of his wife and his infant daughter. The felon later was convicted of these two murders and hanged. After three years from his trial, his downstairs neighbor was found to be a serial killer who also had murdered six other women [1]. Back in those days, the authenticity of police statements was questioned due to the specific format of statements rather than the suspect’s own words leaving out the important details. One of the key advantages of Forensic Linguistics in crime solving cases is that it provides a list of suspects. A good example would be the Unabomber case, which is an acronym derived from the combination of University and Airlines Bomber [2]. The suspect of the investigation was sending packages using USPS system targeting the academic and technology leaders. After or before the bombings, the suspect was sending his manifesto stating that technological improvement should be removed from our life. It was FBI’s longest and costliest investigation in the late 20th century [3]. The suspect was active from 1978 to 1995, sending 16 bombs one of which was to an airplane. All the evidence and writings of Unabomber collected in this time period were used to create his profile yet it was not enough to identify the suspect. 
In his last acquainted event he requested that his manifesto, “Industrial Society and Its Future”, to be published in well-known newspapers or he would send another bomb to a plane at the Los Angeles Airport [3]. By the Attorney Generals’ order his manifesto was granted to be published, which later led to his brother sharing some of the suspect’s writings with the law enforcement officers. The writings provided were from before the suspect’s active bombings time and Linguistics analysis showed that the author of the essay papers and the Unabomber’s manifesto was almost the same. 3 Forensic linguistics has a diverse range of topics and its application varies vastly. The authenticity of emergency calls, suicide letters, ransom demands and social media statements analysis are a few examples of forensic linguistics. The question that is tackled in Forensic Linguistic is mainly about “who wrote a specific document”. The concept of linguistic fingerprinting put forward by some scholars also led to new applications of AA in law enforcement. 1.2 Related Work There are different types of authorship attribution studies in the literature such as predicting the date of authorship of historical texts or text genre detection [4], [5]. Vast majority of previous works focuses on authorship identification by taking into consideration the stylistic features of authors such as use of grammar, function words, frequent word allocations [4], [5], [6], [7], [8], [9]. Some of the well-known problems in authorship attribution are disputed Federalist Papers classification, Shakespearean Authorship Dispute, Author of New Testament, and Author of The Dark Tower. The Federalist Papers are a collection of 85 articles and essays written by Alexander Hamilton, James Madison, and John Jay to persuade the citizens of New York to ratify the U.S. Constitution. Authorship of twelve of these papers has been in dispute. To address this problem, using linear support vector machines as classifier and relative frequencies of words as features a study identified these papers to be written by James Madison [10]. Another dispute in authorship attribution among scholars across the world is whether William Shakespeare wrote the works attributed to him or not. It was argued that Shakespeare wasn’t even educated and more than 80 authors were suggested to be the author of the writings that were under the name of Shakespeare. Christopher Marlowe is considered the most likely candidate to write these works under the name of Shakespeare when he was in jail. In order to analyze the stylistic fingerprint of Shakespeare and Marlowe and non-Shakespearean authors, namely Chapman, Jonson, 4 Middleton, a corpus has been put together. By taking into account “stop words, Part-of Speech (POS) tags, bigram probabilities” the following two questions were analyzed: • With these features can non-Shakespeare-authors be classified accurately? • If these features are useful and Shakespeare is truly “Late Marlowe”, then classifying other texts by Shakespeare and Marlowe should not be easy using the same features. The classification results for non-Shakespearean author candidates turned out to be highly accurate (Johnson % 100, Chapman % 92.9 and Middleton %88.9). The results supported the hypothesis that writing styles of Marlowe and Shakespeare were as distinguishable as other authors unless Marlowe did not show a linear change in style over time. Meaning, Marlowe has found not to be the authors of Shakespearean writings [11]. 
One major application of authorship attribution has also been identifying the author of newly found texts and there has been substantial work done in related areas. One of the earliest papers in this area is about newly found poem called “Taylor Poem” and whether it is actually written by Shakespeare or not. Thisted and Efron have suggested three tests for authorship: based upon the number of new words observed, number of rare words observed and a slope test that uses Poisson regression to combine data. The result of their test is that Taylor poem is found to fit previous Shakespeare writings [12]. Thompson and Rasp have also suggested adding “uniformity test of various p-values” that is computed directly from Poisson probability distribution in addition to Thisted and Efron’s approaches, and then applied to The Dark Tower which was attributed to C.S. Lewis when it was published. Their conclusion is also consistent with the claim that Lewis is not the author of The Dark Tower unless he had written the book in a bad day or his writing style was changed drastically [6]. Another interesting study on the unknown texts is also done based on word-level features, vocabulary richness and syntactic features by using 5 Liblinear SVM for classification purposes [13]. Even though the classification accuracy results are not as high as other related works features like “number of unique words” should be noted for use in any attribution problem. Authorship of biblical texts has also been a part of previous studies. The New Testament of the Bible that consists of 27 books, 13 of which are contributed by St. Paul is computationally studied by using topic modeling and affinity propagation clustering. All Pauline letters are found to be divided into six groups which also match with church tradition. An anonymous letter in the New Testament, the letters to Hebrews, is also compared with Pauline letters and it was found that the book of Hebrews was not authored by Paul [14]. More detailed analysis of “the Letters to Hebrews” is done by considering the word recurrence interval technique, trigram Markov method, stylometric measures such as the frequency of function words, and multiple discriminant analysis of frequency of function words. Mahalanobis distance is used for the comparison of text centroids. It is concluded that Hebrews is not likely to be written by Paul, Matthew, Mark, Luke, or John yet it’s stylistically very similar to texts written by Barnabas [15]. The performances of these various features and classification methods are also analyzed and experiments are performed on newspaper articles. Stylometric analysis of texts such as number of sentences, words, commas, colons, semicolons, incomplete sentences, periods in an article, vocabulary diversity or richness measure, bag of words representation and frequency of function words are extracted as features. During these experiments histogram method, K-nearest method and parzen windows, Bayes Classifier, k-means clustering, support vector machines with tf − idf approach and combination of classifiers are applied as classifiers. Best results are observed with Gaussian classifiers by using stylometric features and function words. Support vector machine classifiers also perform well with bag of words feature set [8]. Usefulness of function words in authorship attribution is introduced by Mosteller and Wallace in their work on Federalist papers [16]. 
Argamon and Levitan has compared the characteristic features of frequent words, pairs and collocations using the 6 SMO algorithm, and implemented it for two class (American or British) author nationality classification problem. Their results conclude that function words are useful as stylistic text attribution and frequent words are the best features among others. The reason behind it is that a given same size frequent collocations has less different words comparing to frequent words so it carries less discriminatory features [7]. In summary, there has been substantial work done in authorship attribution and mainly people in forensic linguistic or computer scientists aim to build “stylistic fingerprint of author” by using several features of a given text such as function words, stylometry. It is a classification problem and several classifiers are used such as Naı̈veBayes, SVM. Among them SVM is observed to fit best for these kinds of problems. 7 2. ADVANCES IN NATURAL LANGUAGE PROCESSING Natural language processing (NLP) is the machine activity of analyze, understand, alter, or generate natural language data. One of the earliest studies was the Georgetown experiment in which Russian sentences were translated into English by machine interpretations [17]. Authorship attribution problems are also a part of NLP and the advances in NLP proportionally affect the developments of new algorithms. In this chapter, recently developed NLP techniques, publicly available dataset will be introduced to get the attention of researchers in different domains. 2.1 NLP Tasks In order to extract features from a given text and use it in computations, the preprocessing of the text must be done. The outline of the work can be application specific. On this regard, the most recent developed techniques will be introduced. Stemming Because of the grammatical reasons, in text documents words coming from same root utilized differently based on their usage as adjectives, adverbs, nouns, tenses of the verbs. For example, “am, is, are” come from the same root as “be”, “democracy, democratic, democratization” are different deviation of same word “democracy”, or “go, went, gone” are different inflicted forms of “go”.The objective of stemming is to reduce the variation of related words to same stem. Using Porter’s algorithm in Python [18], stemming can be accomplished as follows: 8 #Use "pip install stemming" to install the package from stemming.porter2 import stem stem("democratization") >>Out: 'democrat' Lemmatization In computational linguistics, Lemmatization is closely related to stemming. The major difference is that stemming operates without the knowledge of the context of the word whilst in lemmatization task it depends on identifying the meaning of word by utilizing “part of speech”. Python’s NLTK library can be deployed for this task. The “pos” variable in the function can be changed and same task can also be achieved by using different libraries such as “spacy”, or “Standford NLP”. from nltk.stem import WordNetLemmatizer my_lemmatizer = WordNetLemmatizer() my_list = ['good', 'better', 'best'] for word in my_list: print(my_lemmatizer.lemmatize(word, pos="n")) >>Out: good better best Part of Speech Tagging It is process of analyzing sequence of words and attaching a category for each word in the sequence. It is an important task during text-to-speech processing since the pronunciation of the word needs to be known before the speech task. 
import nltk
nltk.download('averaged_perceptron_tagger')
from nltk import word_tokenize
my_text = word_tokenize(
    "I am writing my thesis to graduate from my master degree")
nltk.pos_tag(my_text)
>>Out:
[('I', 'PRP'), ('am', 'VBP'), ('writing', 'VBG'), ('my', 'PRP$'),
 ('thesis', 'NN'), ('to', 'TO'), ('graduate', 'NN'), ('from', 'IN'),
 ('my', 'PRP$'), ('master', 'NN'), ('degree', 'NN')]

Some of the abbreviations in the result are present tense verb (VBP), noun (NN), and preposition (IN), which give information about the grammatical structure of the sentence. "NLTK" or "spacy" can both be used to perform the POS tagging task.

Named Entity Recognition

It is the task of identifying entities in a given text and categorizing each of these entities as person, organization, date, location, time, etc. It is a crucial task since the word "apple" in "I want to eat apple" refers to the fruit, but in "I want to work for Apple" it refers to the company.

from nltk.tag import StanfordPOSTagger
from nltk import word_tokenize
jar = 'path/stanford-postagger.jar'
model = 'path/english-left3words-distsim.tagger'
pos_tagger = StanfordPOSTagger(model, jar, encoding='utf8')
sentence1 = "I want to eat apple now and "
sentence2 = "I want to work at Apple in the future."
sentence = sentence1 + sentence2
my_text = pos_tagger.tag(word_tokenize(sentence))
print(my_text)
#Or second method
nltk.download('maxent_ne_chunker')
nltk.download('words')
from nltk import word_tokenize, pos_tag, ne_chunk
print(ne_chunk(pos_tag(word_tokenize(sentence))))
>>Out:
(S I/PRP want/VBP to/TO eat/VB apple/NN now/RB and/CC I/PRP want/VBP
   to/TO work/VB at/IN (ORGANIZATION Apple/NNP) in/IN the/DT future/NN ./.)

Sentiment Analysis

It has a vast range of applications in NLP where the positivity, negativity, or neutrality of a sentence is investigated. The applications could be movie reviews on "Rotten Tomatoes", customer product reviews, or determining the mood of a speaker via voice analysis. In order to showcase an example, a simple model is built using a Naive Bayes classifier. In equation 2.1, the important task is identifying whether each word is positive, negative, or neutral. To achieve this goal, we can create a simple database of our own choosing to label the words.

Positivity = Number of Positive Words / Number of All Words    (2.1)

Negativity = Number of Negative Words / Number of All Words    (2.2)

Table 2.1 shows an example case of two sentences and their sentiment analysis measurement results using the algorithm developed here 1 .

Table 2.1.: Positivity and Negativity Score of Sentences
Sentences                                       Positivity   Negativity
"Awesome movie, great actors, I liked it"          0.71         0.14
"The sound effects were bad, terrible movie"       0.27         0.43

Semantic Text Similarity

It is the task of measuring the degree of equivalence in the underlying semantics of given text pieces. Words can be similar in two ways: lexically and semantically. Text similarity approaches can be put into three categories: string-based, corpus-based, and knowledge-based similarities [19]. The work of Wael and Aly is a good survey introducing the available text similarity approaches [19]. You can look up words or compare sentence similarities using both NLTK and Gensim. The following snippet simply shows the words that are close to "book".
from nltk.corpus import wordnet as wn print(wn.synsets('book', 'n')) >>Out: 1 https://github.com/agungor2/Authorship_Attribution/blob/master/Sentiment_ Analysis.ipynb 12 [Synset('book.n.01'), Synset('book.n.02'), Synset('record.n.05'), Synset('script.n.01'), Synset('ledger.n.01'), Synset('book.n.06'), Synset('book.n.07'), Synset('koran.n.01'), Synset('bible.n.01'), Synset('book.n.10'), Synset('book.n.11')] Text Summarization It is the process of shortening a given long text while keeping the informative part. The main goal of the task is to keep as much information as possible from the long text and represent it in as short format as possible. The repository represented by Google Brain team is an extensive work in text summarization using sequence to sequence model and rouge score is being utilized to measure the model performance [20]. It is also implemented in Gensim module and easy use case is being provided below. from gensim.summarization import summarize sentence = "User choice" #Replace it with your choice print(summarize(sentence)) Word Embeddings The area of word embeddings has been vastly growing since its first introduction by Mikolov [21]. He introduced Skip-gram model which was an efficient method for learning high quality vector representations of words from large amounts of text data. One of the main advantages of Skip-Gram model is that it can capture different semantics of same words such as Apple as a company or as a fruit. Secondly and most importantly, it allows words to be represented in a vector form which would then allow some fascinating data manipulation to be discovered. For example, the result of a vector calculation vec(“Madrid”) - vec(“Spain”) + vec(“France”) can be calculated with word embeddings and the resulting vector can also be searched within the vector space to find the nearest word to it. In Mikolov’s work [21], the result has been found 13 that the resulting vector is closer to vec(“Paris”) than to any other word. Some of the fundamental data handling techniques with word embeddings are given below as a sample. #1. Load pre-trained models: It allows dealing with #previously trained text data and using their word vectors. #The vectors are more accurately represented because #they're being trained on large corpus of text data. #Some of the available pre-trained sets: #Google's Word2Vec #Glove Word2Vec(dimension of Word2Vec is from 50 to 300) #fastText #LexVec #Meta-Embeddings from gensim.models import Word2Vec model = gensim.models.KeyedVectors.load_word2vec_format( 'GoogleNews-vectors-negative300.bin', binary=True) #2. Extracting word vectors from the imported model print(model['NLP']) #3. We can do vector operation and find the closest vector #vec(\Madrid") - vec(\Spain") + vec(\France") #The numbers following the words show the similarity measure print(model.most_similar(positive=['Madrid', 'France'], negative=['Spain'])) >>Out: [('Paris', 0.75), ('Marseille', 0.61), ('French', 0.60), ('Colombes', 0.59), ('Hopital_Europeen_Georges_Pompidou', 0.58), 14 ('Toulouse', 0.57), ('Parisian', 0.57), ('Cergy_Pontoise', 0.56), ('Marseilles', 0.55), ('Strasbourg', 0.54)] #4. Cluster analysis print(model.doesnt_match("breakfast lunch dinner milk".split())) >>Out:milk print(model.doesnt_match("China Turkey France Italy".split())) >>Out:China #5. 
Train your own model import pandas as pd my_text = pd.read_csv("my_text.txt", header=["text"]) model = gensim.models.Word2Vec(my_text, min_count=1, size=300,workers=4) The techniques introduced here are some of the fundamental ones that can help beginners kick start their analysis. Using these strategies one can build application specific more sophisticated NLP toolbox of his own. 2.2 Data Availability In every data analytics task, one of the main necessities to get a perfect result is to have a large enough and suitable dataset that can be worked on. Thanks to the advances in social media, optical character reader machines, and several projects that aim to make documents available online. We are filled with large corpuses of text data sets publicly available. In order to inspire and inform the readers the existence of such big corpuses, they are categorized and stated as follows. 15 • Text Classification: It refers to labeling texts documents such as filtering emails as spam or movie reviews sentiment analysis. – Apache Software Foundation Public Mail Archives – Stanford Collection of Amazon Reviews – Reddit Comments – Wesbury Lab Usenet Corpus – Reuters Newswire Topic Classification Data – Yelp Reviews – IMDB Movie Review – Stackoverflow and Twitter Sentiment analysis data • Language Modeling: There are several application specific texts that are used to generate models. – Common Crawl – Yahoo! N-Grams – Google Books Ngram Corpus – American National Corpus – Wikipedia XML Data or Dbpedia – ClueWeb09 – Personae Corpus – Turkish National Corpus – NTU-Multilingual Corpus – Open Multilingual Wordnet • Authorship Attribution: It is, as described before, the task of identifying the author of a given text. 16 – Project Gutenberg – GDELT – C50 (subset of RCV1) and PAN dataset • Text Summarization: It is the task of creating short and meaningful text after analyzing a large document. – TAC Dataset – Australian Legal Cases – RTLTDS Dataset – 17-timelines Summarization Dataset – ArXiv Summarization Dataset 2.3 Neural Networks A neural network is a computational nonlinear model that is inspired from the neural structure of the human brain. It consists of three interconnected layers as input, hidden and output layers. Every neuron has weighted inputs, an activation function which could be a linear, step, sigmoid, or rectified linear function, and one output. One of earliest studies of language models using neural network techniques to calculate the joint probability function of sequence of words was introduced more than a decade ago by Bengio [22]. In his work, he states that one of the difficult problems of language modeling is the curse of dimensionality when one wants to find out the joint distribution of consecutive words because of the fact that there are enormous amount of free parameters to consider. The cure to this problem lies within the structure of sentences. A language model can be represented by the conditional probability of the next word given all the previous ones in the sequence [22]. His idea has then led to the development of several different NLP techniques that are categorized as follows. 17 Multi Layer Perceptron It has three or more layers and helps classify data that is not linearly separable. It is mainly utilized in speech recognition and machine translation tasks. 
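Before turning to specific applications, a minimal structural sketch may help. The snippet below trains a small scikit-learn MLPClassifier with one hidden layer of 5 units on the two toy review sentences used later in this section; it is only an illustration with plain word-count vectors, whereas the tutorial referenced below builds its sentence vectors from Google's pre-trained Word2Vec model instead.

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.neural_network import MLPClassifier

# Two toy training sentences, labeled 0 (negative) and 1 (positive)
train_texts = ["I hate this movie. It is horrible.",
               "Great movie and amazing actors."]
train_labels = [0, 1]

vectorizer = CountVectorizer()
X_train = vectorizer.fit_transform(train_texts)

# Multi layer perceptron with a single hidden layer of 5 units
clf = MLPClassifier(hidden_layer_sizes=(5,), max_iter=2000, random_state=0)
clf.fit(X_train, train_labels)

# Predictions on unseen sentences; with such a tiny training set they only
# demonstrate the API, not a meaningful classifier
X_test = vectorizer.transform(["Terrible and useless movie.",
                               "Fantastic. I love it."])
print(clf.predict(X_test))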
Using MLPs for phonetic event detection system is introduced by researchers from Microsoft in [23] and the performance improvement of speech recognition systems on the Broadcast News task has been achieved using Segmental Conditional Random Fields. In order to show case the usage of multi layer perceptron on a text classification task, a sample tutorial workload has been created 2 . The main idea is to create a sentence vector representation of a given text by utilizing Google’s pre-trained datasets and “Gensim” to semantically analyze comments. In order to build the model, hidden layer size has been set to 5. To train the model the following sentences are being used as training set “I hate this movie. It is horrible.”, and “Great movie and amazing actors.”. The negative sentence is classified as “0” and positive one as “1”. After training the model, to test the model “Terrible and useless movie.”, and “Fantastic. I love it.” are used to predict. Their results were found to be matching the expected outcome. Convolutional Neural Network It contains one or more convolutional layers and uses a variation of multilayer perceptrons. The layers in the network use a convolution operation to the input passing the result to the next layer which then makes the network to be deeper with less parameters. In [24], the author presents a model built on top of Word2Vec, tests the model with several other benchmarks such as movie reviews, or Standford Sentiment Treebank. With little tuning of hyperparameters, a simple CNN structure with one layer of convolution improve the performance [24]. Building CNN model 2 https://github.com/agungor2/Authorship_Attribution/blob/master/Classification_ MLP.ipynb 18 for a text classification task can be easily possible by Mxnet, Tensorflow, or Keras libraries. Recurrent Neural Network Unlike a feed-forward neural network, connections between neurons make a directed cycle which results in the output to be affected both by present inputs and the previous step’s neuron state. In the earliest work of Mikolov, he introduced his language model based on the RNN structure [25]. He later described his Skip-gram and CBOW language model which outperformed better than RNN model [21]. RNN structure can also be used to perform text classification task as in [26]. Their model outperformed traditional methods such as SVM, LDA, or CNN on the experimented 4 datasets that are 20Newsgroups, Fudan set, ACL Anthology Network, and Stanford Sentiment Treebak. Long Short-Term Memory It is also a RNN structure but it is designed to model temporal sequences and their long-range dependencies more accurately than normal RNN. It does not have an activation function nor the gradient vanishes during the training. It is implemented in the units of “blocks”. It is a highly appreciated method by the researchers in different domains and in [27] authors show the achievement of the state of-the-art performance for large scale acoustic modeling using LSTM. Using Keras library and IMDB movie review dataset a simple LSTM structure has been built to test out the model. More details of the model can be found on this tutorial 3 3 https://github.com/agungor2/Authorship_Attribution/blob/master/LSTM_ classification.py 19 Sequence to Sequence Models It can be imagined as an encoder-decoder architecture system where both of encoder and decoder is a RNN structure. It is mainly used in chat-bots and machine translation systems. 
Utilizing LSTM on both side of the structure with limited data is being implemented in [28] and the performance of the system has outperformed traditional systems with large-scale training data. This suggests that it should do well on many other sequence learning problems. Building a sequence to sequence model is simple using Tensorflow. The example below shows how to create a simple encoder and decoder structure in Tensorflow. import tensorflow as tf # Build RNN cell for encoder encoder_cell = tf.nn.rnn_cell.BasicLSTMCell(num_units) # Build RNN cell for decoder decoder_cell = tf.nn.rnn_cell.BasicLSTMCell(num_units) # Build Helper helper = tf.contrib.seq2seq.TrainingHelper( decoder_emb_inp, decoder_lengths, time_major=True) # Build Decoder decoder = tf.contrib.seq2seq.BasicDecoder( decoder_cell, helper, encoder_state, output_layer=projection_layer) #For the optimizer, you can use AdamOptimizer as a start 2.4 Cloud Computing Some of the applications in text mining or NLP might require more storage than an individual computer can handle or it might need more powerful tools. To ease the pain of overcoming such large scale operation issues, cloud services and APIS provide solutions to researchers. Top providers of such services are Microsoft Azure, IBM 20 Bluemix, Amazon Elastic Compute Cloud (AWS), Verizon Cloud and Google Cloud platform. There are multiple different ways one can make effective use of such services in text mining tasks. For example, using Amazon’s Natural Language Processing service, also called Comprehend, one can start analyzing the datasets such as support emails, social media posts, online comments, telephone transcriptions to understand the positivity and negativity factors or it can be used to build a semantic search engines rather than basic keyword one. AWS also provides Machine learning services that anybody can build their classifiers or regression models based on the desired task. IBM Bluemix can also be used to analyze text data to extract meta-data from content such as concepts, entities, keywords, categories, relations and semantic roles. Another practical use case of Bluemix is that it could be used to develop data-driven chatbot systems. Dialogflow is also another great tool to start building a data-driven chatbot system. AWS, Bluemix, Dialogflow are tools that are meant for entry level NLP analysts to build their own models. In order to further make use of Cloud computing systems one can make use of Kaggle website or Google Apis. Kaggle has servers that can help you analyze gigabytes of data in fast pace and build your model on the language of your choosing. This would enable you not to worry about operating system library installation issues, gives you large storage availability and help you run your program faster than your own machine. Google has also provided several Apis such as text to speech, Google Maps Api, Natural Language processing Api, Youtube Analytics Api. In order to show case a simple use of Google Api with Python an example case is shown below. It takes input parameter as the longitude and latitude of one person’s starting point as well as the final destination point, then returns the time spent on the overall trip. The workload below has been used by the author to create a parking meter finder by the author and more can be found in the following link 4 . 
4 https://youtu.be/H6A4cQFhHWo 21 import httplib2 import json def getdistance(x1,y1,x2,y2,mode=""): command = "https://maps.googleapis.com/maps/api/distancematrix" +"/json?units=imperial&origins="+ str(y1) +","+ str(x1) +"&destinations="+ str(y2) +","+ str(x2) +"&mode=" + mode + "&key=put_your_key_here" #Parse the web page http = httplib2.Http() status, response = http.request(command) #print response el = json.loads(response).get("rows")[0].get("elements")[0]; if ("duration_in_traffic" in el.keys()): return int(el.get("duration_in_traffic").get("value")) else: return int(el.get("duration").get("value")) Google NLP Api can also be used to do sentiment analysis in Python. A simple example model has been built in this tutorial to help readers take the initial step in building Api supported tools 5 . 5 https://github.com/agungor2/Authorship_Attribution/blob/master/Google_NLP_api.py 22 3. DATASET PREPARATION From a machine learning perspective, AA is a multi-class, single-label text classification task. However, it is a bit different from other text classification problems since the style of writing and text content are important, too. One of the challenges here is to identify and create a dataset without introducing any bias to the problem to be studied. Writing style of an author can vary due to following top reasons: • The century that the author is living and writing his works: General writing styles may change significantly even throughout a century due to changes in literature or word usages. • The idea movement that the author has been inspired from: Authors are usually being affected by movements such as romanticism, modernism, or symbolist movement which then also affect their writing style. • The language in which the author is writing: Every language has its own grammatical structures. The writing style of a multi-lingual author is different than the mono-lingual writer. If a book is also being translated from its original language to another one, this translation also creates linguistic differences. • The topics that the author likes to write about such as science fiction, love, drama: Usage of words changes across different topics, which could make it easier to identify authors based on word usage. In this context, there are different problems that can be studied such as the classification of a given text based on the century it was written, classification of a given text as translated or original, or simply choosing authors from the same century, same idea movement, same topics, and classify text according to their authors. 23 In the beginning of this thesis workload, we aimed at achieving three goals. Firstly, we wanted to create practical features using traditional and new methods that could help us with author classification task. Secondly, we wanted to create a translated book identification problem by collecting books that are translated from non-English language text domains to English. Lastly, we plan to do an exploratory analysis on the style differences of authors from 19th century to 20th century. However, due to time limitations we have focused only one side of the problem that is the identification of the author of a given text. If the researchers are interested in conducting an analysis of English translated texts from French, German, Russian, or mixed language, they can find more details here 1 and can contact the author of this thesis for the whole raw dataset. 
The list provided consists of book names translated from other languages and the number of words each book has in it. There are totally 200 French, 200 German, 448 Russian, and 89 mixed languages translated book in the whole text corpus. 3.1 50-Author Dataset The GDELT Project is one of the largest publicly available digitized book database which has more than 3.5 million books published from 1800-2015. The GDELT Project is an open platform for research and analysis of global society and thus all datasets released by the GDELT Project are available for unlimited and unrestricted use for any academic, commercial, or governmental use of any kind without any fee [29]. The whole digitized dataset is publicly available and interested researchers can freely perform SQL queries using the Google big query platform. For example; the book names, publication year, quotations, themes, the original text of the book of “Mark Twain” which were written between 1890 to 1900 can be found as follows using the Big query platform of Google. 1 https://github.com/agungor2/Authorship_Attribution/blob/master/Translated_list. txt 24 SELECT Themes, V2Themes, Quotations, AllNames, TranslationInfo, BookMeta_Identifier, BookMeta_Title, BookMeta_Creator, BookMeta_Subjects, BookMeta_Year, FROM (TABLE_QUERY([gdelt-bq:internetarchivebooks], 'REGEXP_EXTRACT(table_id, r"(\d{4})") BETWEEN "1890" AND "1900"')) WHERE BookMeta_Creator CONTAINS "Mark Twain" LIMIT 50 To decrease the bias and create a reliable dataset the following criteria have been chosen to filter out authors: English language writing authors, authors that have enough books available (at least 5), 19th century authors. With these criteria 50 authors have been selected and their books were queried through Big Query Gdelt database. The next task has been cleaning the dataset due to OCR reading problems in the original raw form. To achieve that, firstly all books have been scanned through to get the overall number of unique words and each words frequencies. While scanning the texts, the first 500 words and the last 500 words have been removed to take out specific features such as the name of the author, the name of the book and other word specific features that could make the classification task easier. After this step, we have chosen top 10, 000 words that occurred in the whole 50 authors text data corpus. The words that are not in top 10, 000 words were removed while keeping the rest of the sentence structure intact2 . Afterwards, the words are represented with numbers from 1 to 10, 000 reverse ordered according to their frequencies. The entire book is split into text fragments with 1000 words each. We separately maintained author and book identification number for each one of them in different arrays. Text segments with less than 1000 words were filled with zeros to keep them in the dataset as well. 1000 words make approximately 2 pages of writing, which is long enough to extract 2 https://github.com/agungor2/Authorship_Attribution/blob/master/Clean_data.py 25 a variety of features from the document. The reason why we have represented top 10, 000 words with numbers is to keep the anonymity of texts and allow researchers to run feature extraction techniques faster. Dealing with large amounts of text data can be more challenging than numerical data for some feature extraction techniques. 
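The cleaning pipeline described above (keep the 10,000 most frequent words, replace each kept word with an integer id, and split every book into zero-padded 1000-word fragments) can be sketched in a few lines. This is a simplified illustration and not the Clean_data.py script referenced above; the tokenized book lists, the function names, and the exact id ordering are assumptions.

from collections import Counter

def build_vocabulary(books, vocab_size=10000):
    # books: one list of word tokens per book, with the first and last
    # 500 words of each book already removed
    counts = Counter(tok for book in books for tok in book)
    top_words = [w for w, _ in counts.most_common(vocab_size)]
    # map each kept word to an integer id from 1 to vocab_size by frequency rank
    return {w: i + 1 for i, w in enumerate(top_words)}

def encode_and_fragment(book, word2id, fragment_len=1000):
    # drop words outside the top vocabulary, keep the order of the rest
    ids = [word2id[tok] for tok in book if tok in word2id]
    fragments = []
    for start in range(0, len(ids), fragment_len):
        frag = ids[start:start + fragment_len]
        frag += [0] * (fragment_len - len(frag))  # zero-pad the last fragment
        fragments.append(frag)
    return fragments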
Table 3.1.: Author Book Number Distribution

Author ID  Author Names  Book Numbers  Total Words
1  Arthur Conan Doyle  16  1394980
2  Charles Darwin  10  513085
3  Charles Dickens  7  368314
4  Edith Wharton  27  2487978
5  George Eliot  22  1064661
6  Horace Greeley  9  624680
7  Jack London  16  2216225
8  James Baldwin  76  10220446
9  Jane Austen  12  1617346
10  John Muir  20  1162637
11  Joseph Conrad  8  484102
12  Mark Twain  12  988713
13  Nathaniel Hawthorne  15  815377
14  Ralph Emerson  28  3936896
15  Robert Louis Stevenson  21  1976693
16  Rudyard Kipling  10  402301
17  Sinclair Lewis  14  1051828
18  Theodore Dreiser  16  1839202
19  Thomas Hardy  22  2078753
20  Walt Whitman  10  839168
21  Washington Irving  62  3426646
22  William Carleton  16  696567
23  Albert Ross  16  844634
24  Anne Manning  9  827145
25  Arlo Bates  18  1531925
26  Bret Harte  64  6406984
27  Catharine Maria Sedgwick  18  453158
28  Charles Reade  20  1170720
29  Edward Eggleston  6  806316
30  Fergus Hume  20  1291603
31  Frances Hodgson Burnett  43  3464120
32  George Moore  9  1174176
33  George William Curtis  19  2180926
34  Helen Mathers  17  806600
35  Henry Rider Haggard  13  923255
36  Isabella Lucy Bird  20  1170895
37  Jacob Abbott  32  3316171
38  James Grant  30  1620580
39  James Payn  54  3667172
40  John Kendrick Bangs  12  665969
41  John Pendleton Kennedy  14  1251338
42  John Strange Winter  13  1358015
43  Lucas Malet  12  1677565
44  Marie Corelli  16  797761
45  Oliver Optic  36  3484099
46  Sarah Orne Jewett  16  1105534
47  Sarah Stickney Ellis  74  5296691
48  Thomas Anstey Guthrie  30  2740952
49  Thomas Nelson Page  15  1217070
50  William Black  18  1597066

Table 3.1 lists each author's id number in the dataset, the author name, the total number of books, and the total number of words remaining after filtering out infrequent words. The distribution of author text pieces in the training data is provided in Figure 3.1. In the training data, James Baldwin has the most text pieces with 6914 whilst Rudyard Kipling has the fewest with 183.

Figure 3.1.: Distribution of Authors in Training Set

In order to make the AA problem more realistic, when splitting the training and testing dataset the writings of George Eliot (5), Jack London (7), Frances Hodgson Burnett (31), Sarah Stickney Ellis (47), and Thomas Nelson Page (49) have been removed from the training set and added to the testing set. This creates a non-exhaustive training set with 45 authors while the test set contains 50 authors. In total this leads to 53678 training instances and 39922 testing instances, each consisting of a 1000-word text fragment. Another important aspect we have considered while splitting training and testing data is to keep all fragments of the same book either in the training or the testing dataset. This way we do not end up training and testing on the same books. Without this restriction the classification task would be much simpler, and a simple bag of words representation with SVM would give much higher F-1 scores. To study sentence structure and writing style from a different perspective, the same steps have also been applied to create another dataset in which stop words have been removed while keeping the word order of the remaining words.

3.2 3-Author Dataset

The dataset we have provided consists of 50 authors and 93600 instances, where each instance is a 1000-word text document.
Due to the scale of our data, some analyses take a long time to train and test. To address this problem, we were able to find a smaller author identification dataset on the Kaggle server [30]. Using CoreNLP's MaxEnt sentence tokenizer, Kaggle has chunked the works of Edgar Allan Poe (EAP), HP Lovecraft (HPL), and Mary Shelley (MWS) into sentences. Some sentences are 2-3 words and some are up to 100 words, so every instance in this dataset is very small compared to our 50-author dataset. There are 19579 instances for training and 8392 for testing. In the 50-author data, one of the challenges is the missing author problem, but for this 3-author dataset both training and test sets contain the same three authors. Another setting that matches our earlier criteria regarding author selection is that the works of these authors are categorized as fiction and they all lived in the 19th century.

Figure 3.2.: Distribution of Authors in Training Set

The distribution of authors in the text data is shown in Figure 3.2, plotted using seaborn and matplotlib 3 . There is only a slight difference between HPL and MWS, whereas EAP dominates the other two authors.

3 https://github.com/agungor2/Authorship_Attribution/blob/master/Author_Distribution.py

4. PRACTICAL FEATURE EXTRACTION

In order to identify the authorship of an unknown text document using machine learning, the document needs to be quantified first. The simple and natural way to characterize a document is to consider it as a sequence of tokens grouped into sentences, where each token can be one of three types: word, number, or punctuation mark. To quantify the overall writing style of an author, stylometric features are defined and studied in different domains. Computations of stylometric features can be categorized into five groups: lexical, character, semantic, syntactic, and application specific features. Lexical and character features mainly consider a text document as a sequence of word tokens or characters, respectively, which makes computation easier compared to the other feature types. On the other hand, syntactic and semantic features require deeper linguistic analysis and more computation time. Application specific features are defined based on the text domain or language. These five feature groups are studied and the methods to extract them are provided for interested readers.

4.1 Lexical Features

Lexical features relate to the words or vocabulary of a language. They are the plainest way of representing sentence structure, which consists of words, numbers, and punctuation marks. These features were among the first attempts to attribute authorship in earlier studies [7], [11], [13], [14], [31]. The main advantage of lexical features is that they are universal and can be applied to any language easily. These features include the bag of words representation, word N-grams, vocabulary richness, number of punctuation marks, average number of words in a sentence, and many more. Even though the number of possible lexical features is large, not all of them are good for every authorship attribution problem. That is why it is important to know how to extract these features and try out different combinations with different classifiers.

Bag of Words

It is the representation of a sentence with the frequency of its words. It is a simple and efficient solution but it disregards word-order information. In order to apply it on the 50-author dataset, a model has been built here 1 .
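A minimal sketch of this representation is given below. It assumes each fragment is already a list of integer word ids between 1 and 10,000, as prepared in Chapter 3, and it is an illustration rather than the notebook referenced above.

import numpy as np

def bow_matrix(fragments, vocab_size=10000):
    # fragments: list of fragments, each a list of integer word ids (0 = padding)
    X = np.zeros((len(fragments), vocab_size), dtype=np.int32)
    for row, frag in enumerate(fragments):
        for word_id in frag:
            if word_id > 0:               # skip the zero padding
                X[row, word_id - 1] += 1  # column j counts occurrences of id j+1
    return X

Each row of X is then the count vector of one 1000-word fragment and can be fed directly to a scikit-learn classifier; restricting the columns to a sub-range of ids selects only the most or least frequent words.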
In the approach, for each text fragment the number of instances of each unique word is found to create a vector representation of word counts. Since there are uniquely 10, 000 words in the whole dataset you can choose the range of the words you want to use in your feature vector. We have also shown how to extract top words usage author-wise in every author’s text corpus. As expected top words are determiners that every writer use while constructing an English sentence. For example, for Arthur Conan Doyle top 20 words are “the, to, of, he, a, and, i, that, in, you, was, it, she, his, her, had, as, not, with, for” but for Charles Dickens they are “the, to, of, i, and, a, in, that, it, is, he, she, be, her, you, was, as, not, with, for” in decreasing order. Even though the two sets are mostly the same the orders are different for most authors. The main assumption with authorship attribution problems is that every authors word usage and content differs and based on these differences the work of one author can be differentiated from the other. In order to illustrate this assumption, the content and word usages for every book can be seen as a picture of some forms as illustrated for two sample books in the 50-author dataset in Figure 4.1 & 4.2 for Charles Dickens’s famous book Oliver Twist and Mark Twain’s book Horse Tale. Afterwards, stop words (common words) are removed and using the frequency of remaining words we have plotted it on a boy’s figure and a horse’s figure, respectively in Figure 4.1 & 4.2 using word-cloud library here2 . In both figures, most frequent words are written in bigger fonts and these words are recognized firstly when looking at the figures. 1 2 https://github.com/agungor2/Authorship_Attribution/blob/master/BOW_50author.html https://github.com/agungor2/Authorship_Attribution/blob/master/oliver_horse.py 32 Figure 4.1.: Oliver Twist, Charles Dickens The same methodology can be applied to 3-author dataset as well. Using Vectorizer3 a simple model is being shown on the tutorial. It simply takes the whole corpus and vectorize it based on the frequency of each word. It is found that in the training set, there are 25068 unique words whereas in the testing set the number of unique words is 17546. It is also shown how to extract total number of appearance in the whole corpus. As an example a search has been done on “Frankenstein” and it is found to appear “6301” times. In order to apply the same methodology to 50-author dataset a conversion algorithm has been provided for readers 4 . 3 4 https://github.com/agungor2/Authorship_Attribution/blob/master/bow_3authors.py https://github.com/agungor2/Authorship_Attribution/blob/master/convert_data.py 33 Figure 4.2.: Horse Tale, Mark Twain One important aspect of the authors in our dataset is that they are English language writers. However, some of these authors like Charles Dickens or William Black are originally from United Kingdom and some like Thomas Nelson Page are from United States. There are slightly differences between British English and American English. These differences also affect the usage of some of the words in the author’s work. Words in American English such as “humor, color, honor, endeavor, theater” are used as “humour, colour, honour, endeavour, theatre” in British English. Using these features could improve the accuracy of the model and a simple algorithm has been provided to use it with Word2Vec here 5 . 
5 https://github.com/agungor2/Authorship_Attribution/blob/master/word2vec_extract_data.py

Word N-grams

A word N-gram model is a type of probabilistic language model for predicting the next item from the preceding (n-1) items. N-grams are useful because bag of words loses word order: for example, the phrasal verb "take on" is lost by a bag-of-words representation that treats "take" and "on" as two separate words. N-grams also underlie the "skip-gram" language model. An N-gram is a consecutive subsequence of length N of some sequence of words, while a skip-gram is an N-length subsequence whose components occur at a distance of at most k from each other [21]. In order to extract N-grams from a given text, a model has been built and tested on the 50-author and 3-author datasets 6. Stop words have not been considered while constructing the N-grams. In the 3-author dataset, the most frequent uni-gram, bi-gram, tri-gram, and four-gram are "upon, let us, general john b, general john b c" for Edgar Allan Poe, "one, lord raymond, let us go, nearest town took post" for Mary Shelley, and "one, old man, heh heh heh, oonai city lutes dancing" for HP Lovecraft. Features such as "lord raymond" or "heh heh heh" could be good identifiers, as the first refers to a specific character in a book while the second shows an author's preference when phrasing an emotion.

6 https://github.com/agungor2/Authorship_Attribution/blob/master/n-grams.py

Table 4.1.: N-Grams Distribution

Author id     1-gram   2-gram           3-gram
A. Doyle      would    young man        love gone astray
A. Doyle      said     sugar princess   sugar princess chapter
A. Doyle      one      marriage bond    drew long breath
W. Carleton   said     mr george        said mr george
W. Carleton   would    yes said         said yes said
W. Carleton   one      said mr          yes sir said
A. Manning    mr       sir john         original hymns poems
A. Manning    one      years ago        two three years
A. Manning    would    mr hill          sir walter scott

Table 4.1 contains uni-grams, bi-grams, and tri-grams for the authors Arthur Conan Doyle, William Carleton, and Anne Manning. It also reveals characteristic features about the persona in their books, such as "sugar princess, mr george, sir john". The same task can also be carried out while keeping the stop words, and the result can be better visualized as in Figure 4.3. For this task we simply scanned through all consecutive word pairs, saved them in a dictionary, sorted the dictionary by the frequency of each bi-gram, and plotted the result on a network diagram 7.

Vocabulary Richness

It is also referred to as vocabulary diversity. It attempts to quantify the diversity of the vocabulary of a text. It is simply the ratio V/N, where V refers to the total number of unique tokens and N refers to the total number of tokens in the considered texts [31]. In order to apply this feature to both the 50-author and the 3-author dataset, a model has been built here 8. For the 3-author dataset, the large number of Edgar Allan Poe texts dominated the overall richness number, but when normalized by the number of texts for each author the value became around 0.9.
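A minimal sketch of this V/N computation (function and variable names are illustrative):

def vocabulary_richness(tokens):
    """Ratio of unique tokens (V) to total tokens (N) in a text fragment."""
    if not tokens:
        return 0.0
    return len(set(tokens)) / len(tokens)

fragment = "the old man and the old sea".split()
print(vocabulary_richness(fragment))  # 5 unique tokens / 7 tokens ~= 0.714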
For the 50-author dataset, the vocabulary richness is found to be lowest for William Carleton and highest for Henry Haggard. The overall distribution of vocabulary richness is plotted in Figure 4.4.

7 https://github.com/agungor2/Authorship_Attribution/blob/master/n-grams2.py
8 https://github.com/agungor2/Authorship_Attribution/blob/master/vocab_diversity.py

Figure 4.3.: Bigram Network Diagram for Author Id: 1, 22, 24

Figure 4.4.: Vocabulary Diversity for 50-Author Dataset

Stylometric features

These are features such as the number of sentences in a text piece, number of words in a text piece, average number of words in a sentence, average word length, number of periods, exclamation marks, commas, colons, and semicolons, number of incomplete sentences, and the number of upper case, title case, camel case, and lower case letters. During the preprocessing of the 50-author dataset we excluded punctuation, so these stylometric features cannot be calculated for the 50-author dataset at the current stage. However, we can compute them for the 3-author dataset as shown here 9. Some of the features introduced above are applied and the resulting density measures are calculated for each author and shown in Table 4.2. Among these five features, the number of punctuation marks and the number of stop words vary the most across authors, and hence they can be better discriminators than the other features.

9 https://github.com/agungor2/Authorship_Attribution/blob/master/stylometric_features.py

Table 4.2.: Stylometric Feature Density Distributions

Feature                   EAP     HPL     MWS
Number of Punctuations    4.10    3.21    3.83
Number of Title Case      2.10    2.33    2.12
Upper Case Words          0.55    0.50    0.75
Average Word Length       4.64    4.63    4.60
Number of Stopwords       12.62   12.94   13.74

Power Words

These are words that affect the reader emotionally; they could also be listed under the category of "semantic features". Such words create emotional excitement or fear and serve as an emotional roller coaster. A great example comes from Winston Churchill:

"We have before us an ordeal of the most grievous kind. We have before us many, many long months of struggle and of suffering. You ask, what is our policy? I can say: It is to wage war, by sea, land and air, with all our might and with all the strength that God can give us; to wage war against a monstrous tyranny, never surpassed in the dark, lamentable catalog of human crime. That is our policy. You ask, what is our aim? I can answer in one word: It is victory, victory at all costs, victory in spite of all terror, victory, however long and hard the road may be; for without victory, there is no survival."

The frequency of such words in the dataset can be studied to see how affectively the authors use them 10. The power words present in the 50-author dataset are provided, and their frequency for every author has been calculated. Their density value, defined as the total number of power words divided by the number of text fragments for each author, has then been recorded. Figure 4.5 shows the distribution of these densities for every author. The writings of Jane Austen have the highest power-word density, while William Carleton has the lowest usage of such words.
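A minimal sketch of this density computation (the power-word list here is a tiny illustrative sample, not the list used in the thesis):

POWER_WORDS = {"victory", "terror", "struggle", "suffering", "survival"}

def power_word_density(fragments):
    """Total power-word occurrences divided by the number of text fragments for one author."""
    total = sum(sum(1 for token in fragment.lower().split() if token in POWER_WORDS)
                for fragment in fragments)
    return total / len(fragments)

author_fragments = ["victory at all costs victory in spite of all terror",
                    "many long months of struggle and of suffering"]
print(power_word_density(author_fragments))  # 5 power words / 2 fragments = 2.5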
10 https://github.com/agungor2/Authorship_Attribution/blob/master/power_words.py

Figure 4.5.: Power Word Density for 50-Author Dataset

Function Words

Function words are words that have little meaning on their own but are necessary to construct a sentence in the English language. They express grammatical relationships among other words within a sentence, or specify the attitude or mood of the speaker. Examples of function words are prepositions, pronouns, auxiliary verbs, conjunctions, and grammatical articles. Words that are not function words are called content words, and they can also be studied to further analyze their usefulness in authorship attribution problems. In order to explore the use of function words in our dataset, a model is built 11. The list of commonly used function words is taken from Robert Layton's book on data mining [32], and the overall function word density is plotted in Figure 4.6. Among the authors in the training set, George Moore has the lowest usage of function words whereas Bret Harte has the highest function-word density.

11 https://github.com/agungor2/Authorship_Attribution/blob/master/function_word.py

Figure 4.6.: Function Words Density for 50-Author Dataset

Tf-Idf

It stands for term frequency-inverse document frequency and is often used as a weight in feature extraction techniques. The reason why Tf-Idf is a good feature can be explained with an example. Assume that a text needs to be summarized using a few keywords. One strategy is to pick the most frequently occurring terms, i.e., words with high term frequency (tf). The problem is that raw frequency is a poor metric on its own, since words like 'a' and 'the' occur very frequently across all documents. Hence, a measure of how unique a word is across all text documents is needed as well (idf). The product tf x idf then measures how frequent a word is in the document, weighted by how unique the word is with respect to the entire corpus of documents. Words with a high tf-idf score are assumed to provide the most information about that specific text [31].

TF(t) = \frac{\text{number of times term } t \text{ appears in a document}}{\text{total number of terms in the document}}

IDF(t) = \log_e \left( \frac{\text{total number of documents}}{\text{number of documents with term } t \text{ in them}} \right)

\text{Tf-Idf}(t) = TF(t) \times IDF(t)     (4.1)

By considering every author's texts alone within the text corpus, a Tf-Idf model has been built for both the 3 and 50-author datasets 12. In the model, not only the single forms of word tokens but also their n-grams are considered. For the 3-author dataset, "afforded means, aware fact, dungeon make, perfectly uniform wall" have the highest Tf-Idf scores for Edgar Allan Poe, "fumbling mere mistake, occurred fumbling, mistake, fumbling" for HP Lovecraft, and "looked, beneath speckled, cheering fair, cottages wealthier, counties spread, happy cottages wealthier, lovely spring" for Mary Shelley. Table 4.3 provides the top 10 words and n-grams with the highest Tf-Idf scores for the 50-author dataset. Comparing Tables 4.1 and 4.3, new meaningful words appear that could serve as new features for each author, such as "row, canal, passenger" for A. Doyle, or "writer, virtue, tale" for H. Greeley.

Table 4.3.: Highest Tf-Idf Pairs A. Doyle C. Darwin C. Dickens E. Wharton H.
Greeley rowed mr priest george writer canal mrs women said george virtue time listen sir john half girl tale boat try nut path stray neighbours effect fine lady forest hotel expensive Continued on next page 12 https://github.com/agungor2/Authorship_Attribution/blob/master/tfidf_example.py 42 Table 4.3 – continued from previous page A. Doyle C. Darwin craft old mr dead miss said george tale humble city mr says old square pump room row don know old square church drive preface passenger lady settlement pump habits noise dr square church coach pecuniary 4.2 C. Dickens E. Wharton H. Greeley obligation Character Features Based on these features a sentence consists of a sequence of characters. Some of the character-level features are alphabetic characters count, digit characters count, uppercase and lowercase character counts, letter frequencies, character n-gram frequencies. This type of feature extraction techniques has been found quite useful to quantify the writing style [33]. A more practical approach in character-level features are the extraction of n-gram characters. This procedure of extracting such features are language independent and requires less complex toolboxes. On the other hand, comparing to word n-grams approach the dimensionality of these approaches are vastly increased and it has a curse of dimensionality problem. A simple way of explaining what a character ngrams could be with the following example: assume that a word “student” is going to be represented by 2 character grams. So, the resulting sets of points will be “st, tu, ud, de, en, nt”. 43 Table 4.4.: Highest Character N-gram A. Doyle C. Darwin C. Dickens E. Wharton H. Greeley the the the the the and and and and and her ing ing ing her ing her her her ing hat you hat hat hat that ther that that that ould with with ther ther ther that ould with with with ould ther ould tion thin said thin here ould In order to apply the character n-gram models an algorithm has been developed and provided here 13 . Table 4.4 also shows the top five three and four character grams. Since most common stop words have 3 or 4 letters when constructing character level 3 or 4-grams these words also appear in Table 4.4. Another useful feature extraction method to consider is the usage of “n’t, or not” and “is, or ’s”. Since there is no punctuation in 50-author dataset we only apply this to the 3-author dataset. Using the moses tokenizer a model has been built for every author and their respected usage of “n’t, not, is, ’s” 14 . Figure 4.7 also shows the distribution of the usage in the training data across authors. It is found that Mary Shelley does not use “n’t” but Edgar Allan Poe prefers to use “is” more often than “’s”. 13 https://github.com/agungor2/Authorship_Attribution/blob/master/char_ngram.py https://github.com/agungor2/Authorship_Attribution/blob/master/moses_tokinezer. py 14 44 Figure 4.7.: Apostrophe Usage for 3-Author Dataset 4.3 Syntactic Features For certain text grammatical and syntactic features could be more useful com- pared to lexical or character level features. However, this kind of feature extraction techniques requires specific usage of Part of Speech taggers. Some of these features consider the frequency of nouns, adjectives, verbs, adverbs, prepositions, and tense information (past tense,etc). The motivation for extracting these features is that authors tend to use similar syntactic patterns unconsciously [31]. 
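A minimal sketch of how such part-of-speech frequencies can be extracted with NLTK's default tagger (the density definition follows the description in this chapter; the actual repository script may differ):

import nltk
# Requires: nltk.download("punkt"); nltk.download("averaged_perceptron_tagger")

def pos_densities(text):
    """Fraction of tokens tagged as adjectives (JJ*), nouns (NN*), and verbs (VB*)."""
    tags = [tag for _, tag in nltk.pos_tag(nltk.word_tokenize(text))]
    total = max(len(tags), 1)
    density = lambda prefix: sum(tag.startswith(prefix) for tag in tags) / total
    return {"adjective": density("JJ"), "noun": density("NN"), "verb": density("VB")}

print(pos_densities("The old man walked slowly towards the dark house."))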
Some researchers are also interested in exploring different dialects of the same language and building classifiers based on features derived from syntactic characteristic of the text. One great example is the work that aims to discriminate between texts written in either the Netherlandic or the Flemish variant of the Dutch language [34]. The feature set in this case consists of lexical, syntactic and word-n grams build on different classifiers and F1-score has been recorded for each cases. Employed syntactic features are function words ratio, descriptive words to nominal words ra- 45 tio, personal pronouns ratio, question words ratio, question mark ratio, exclamation mark ratio [34]. Some of these features can also be implemented by using 3-author or 50-author dataset. By making use of the part of speech tagging, a model has been built to analyze the usage of adjectives, nouns, and verbs in every author’s training text corpus 15 . The main steps of the model consists of tokenizing text pieces author-wise and searching through the adjective, noun, verb list among the tokens. The measure of density is being defined as number of searched element divided by the total number of tokens for every author. In 3-author dataset, it is found that Edgar Allan Poe’s preferred sets of top 3 adjectives, verbs, nouns list “little, other, more, was, is, had, time man day”, HP Lovecraft uses “old, great, many, was, had, were, man, night, time”, and Mary Shelley’s set is “own, other, many, was, had, be, life, heart, Raymond”. Figure 4.8.: Adjective, Verb, Noun Density for 50-Author Dataset Same methodology has also been implemented on the 50-author dataset. Figure 4.8 shows the density measure comparison across authors who are A. Doyle, C. Dar15 https://github.com/agungor2/Authorship_Attribution/blob/master/syntactic_ features.py 46 win, C. Dickens, E. Wharton, H. Greeley. List of top five adjectives, verbs, and nouns used by every author has also been recorded in Table 4.5. Noun variations across different authors is much significant than adjective and verb variations as expected. Table 4.5.: Top 5 Adjectives, Nouns, Verbs Usage A. Doyle C. Darwin C. Dickens E. Wharton H. Greeley young good little little little little little good other poor other old more old new good young much great good more much old good old was was is was was had said be had is have is was is had is had have have have be be had be be time time time man jane mr man literature time mrs man mr way face mother way day mrs eyes man nothing mother man men mr 47 4.4 Semantic Features Features that we discussed so far aim at analyzing the structural concept of a text such. Semantic feature extraction from text data is a bit challenging. That might explain why there is limited work in this area. One example is the work of Yang who has proposed combination of lexical and semantic features for short text classification [35]. Their approach consists of choosing a broader domain related to target categories and then applying topic models such as Latent Dirichlet Allocation to learn a certain number of topics from longer documents. The most discriminative feature words of short text are then mapped to corresponding topics in longer documents [35]. Their experimental results show significant improvements compared to other related techniques studying short text classification. Positivity, neutrality, and negativity index, and synonym usage preference are good examples of semantic features. 
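The topic-model step used in such short-text approaches [35] can be sketched with scikit-learn's LatentDirichletAllocation; this is an illustration of the general idea, not the cited authors' implementation, and the toy documents are placeholders.

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Learn a few topics from longer domain documents, then map a short text onto them.
long_documents = ["whales and ships and the stormy sea",
                  "castles knights swords and kings",
                  "ships sail the sea under dark skies"]
vectorizer = CountVectorizer(stop_words="english")
counts = vectorizer.fit_transform(long_documents)

lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(counts)
short_text = vectorizer.transform(["a knight sails the sea"])
print(lda.transform(short_text))  # topic-distribution features for the short text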
Distributed representation of words, Word2Vec, is also an attempt to extract and represent the semantic features of a word, sentence, and paragraph [25]. The usage of Word2Vec in authorship attribution tasks has not yet been studied explicitly. Due to the application domain dependency of Word2Vec features their usage will be introduced when discussing application specific feature sets. Positivity and Negativity Index In order to understand the general mood and the preference of positive and negative sentence structure in each author’s work, a positivity and negativity score model has been built 16 . In the algorithm, the sentences that have positive polarity score have been labeled as positive and the negative polarity scored ones are labeled as negative. In 3-author dataset, the most negative person has been found to be HP Lovecraft whereas the most positive one is Mary Shelley. 16 https://github.com/agungor2/Authorship_Attribution/blob/master/P_N_index.py 48 Figure 4.9.: Positivity and Negativity Comparison In the 50-author training set, again the works of the authors A. Doyle, C. Darwin, C. Dickens, E. Wharton, H. Greeley have been chosen and their polarity scores have been calculated. Figure 4.9 shows the calculated positivity and negativity index for the chosen authors. Among these five authors the most negative one is found to be E. Wharton and the most positive one is C. Darwin. Synonym Usage The preference to use synonyms and antonyms in different text structure could be an identifiable feature in different tasks as well. However, extracting such features and modeling it could not be an easy task. The simple approach could be creating a domain knowledge where the pairs of synonyms and antonyms are paired. Then, a simple brute force approach can be used to find such words within a specific window size of sentences. Another approach is to employ the word vectors and represent 49 a sentence with the average of all word vectors in the sentence. Using these two approaches an example model has been created 17 . In the example model, given a synonym. its antonym set can be retrieved using NLTK wordnet library. Also, a similarity score can be calculated comparing the average vector forms of two sentences. Synonym and antonym word usage in 3 and 50-author dataset has not yet been studied and applied. Interested researchers can use the example model to further explore their affect in the authorship attribution problems. 4.5 Application Specific Features When the application domain of the authorship attribution problems are different such as email messages or online forum messages, author style can be better characterized using structural, content specific, and language specific features. In such domains, the use of greetings and farewells, types of signatures, use of indentation, paragraph lengths, font color, font size could be good features [31]. Word2Vec can still be implemented in such domains as well as in 3 and 50-author dataset, but the way to use such vector forms depend on the creativity of one’s approach. Vector embeddings of words (Word2Vec) It gives the ability to represent a word in a vector dimension of your choosing. The ways to make use of Word2Vec in 3-author and 50-author dataset is various. For example, a Word2Vec model can either be built by considering every authors text data separately, or can be imported using previously trained word vectors on other large text corpus. 
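Both options can be sketched with gensim (this assumes gensim 4.x; the path to the pretrained Google News vectors is a placeholder and the tiny training corpus is purely illustrative):

from gensim.models import Word2Vec, KeyedVectors

# Option 1: train word vectors on a single author's text corpus (profile-based).
author_sentences = [["listen", "to", "the", "sound"], ["her", "lips", "were", "parted"]]
author_model = Word2Vec(author_sentences, vector_size=300, window=5, min_count=1)
print(author_model.wv.most_similar("listen", topn=5))

# Option 2: import word vectors pretrained on a large external corpus.
# pretrained = KeyedVectors.load_word2vec_format("GoogleNews-vectors-negative300.bin", binary=True)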
It can, then, be plotted into two dimensional vector space by using dimensionality reduction techniques. We built a model based on profile based Word2Vec training and using TSNE to decrease it to two dimensions18 . In the 17 https://github.com/agungor2/Authorship_Attribution/blob/master/synonym_example. py 18 https://github.com/agungor2/Authorship_Attribution/blob/master/word2vec_tsne.py 50 model, our baseline approach is to extract A. Doyle and E. Wharton’s text data and train Word2Vec on both of these authors text sets separately. Then, we have checked the word closeness for “listen” in both of these authors using 300 dimensional word vectors. Figure 4.10 shows the closest words in 2 dimensions for E. Wharton. The same comparison can also be done between pre-trained word vectors of Google or Glove to see the difference of usages in such words between an author and a pretrained word vector. Figure 4.10.: Word2Vec 2-D Closest Words for ’listen’ Moving with the idea of training Word2Vec per author, one can also do a cosine distance measure for the same word or same sentence. The measured cosine distance for A. Doyle and E. Wharton regarding the usage of “listen” is 0.094. In order to apply this strategy on sentence level, we can have a few ways to do so. One way is by simply taking the scaled average of Word2Vec vectors in the sentence. Another one is to employ Tf-Idf score of each word as a gain when calculating a sentence vector. We 51 then take the scaled average of all word vectors in the text piece. For the simplicity, we only consider taking the average vector without Tf-Idf gain for now. In this case, “her lips were parted” has been compared for both A. Doyle and E. Wharton. The cosine distance has been recorded as 0.258 which is much larger than the distance for the word “listen”. The reason is that “her lips were parted” is an exact phrase that is extracted from A. Doyle whereas “listen” is a common verb for both authors. The same comparison can also be done by considering the Google’s pretrained Word2Vec. The cosine distance for “listen” between A. Doyle and pretrained set is found as 0.015 whereas for E. Wharton, it is 0.012. As for “her lips were parted”, the cosine difference for A. Doyle and pretrained set is -0.009, and for E. Wharton, it is 0.012. This implies that E. Wharton uses “listen” close to the pretrained set which was trained on large corpus of text data. As for sentence comparison, the sentence average vector is closer to A. Doyle stating that this sentence is more likely to be written by A. Doyle then E. Wharton. The way to achieve this comparison criteria has been provided here for readers to move forward with this methodology19 . 19 https://github.com/agungor2/Authorship_Attribution/blob/master/word2vec_cos_ distance.py 52 5. CLASSIFICATION METHODS IN AUTHORSHIP ATTRIBUTION In authorship attribution problems, there is a set of candidate authors and a set of text samples in the training set covering some of the works of the authors. In the test dataset, there are sample of texts and each of them needs to be attributed to a candidate author. Considering 50-author dataset, we make the task of attributing the author for a given test text sample by placing only 45 authors in the training set, and by distributing the different books across training and testing set. However, for the simplicity purposes 3-author dataset does not have unknown author in the testing set. 
In our approaches, we distinguish authorship attribution techniques based on whether each training text is treated individually or cumulatively (per author). By treating every text piece individually we mean extracting features without considering the other available text samples in the training set. In author-wise or profile-based approaches, we keep all available texts per author in one file and try to extract relevant features from it. In order to compare results, one can follow the different measures shown in Equation 5.1. To evaluate a training model, we can either create a validation set and compare the performance of each model, or, since we have the correct labels for both the 3 and 50-author test datasets, simply run the model on the test data. There will be a difference between the two; that is why we only consider the performance of the model on the test data with the F1 score.

\text{Logloss} = -\frac{1}{N} \sum_{i=1}^{N} \left[ y_i \log p_i + (1 - y_i) \log (1 - p_i) \right]

F_1 = 2 \times \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}     (5.1)

\text{MeanAccuracy} = \frac{\sum_{i=1}^{N} \text{Accuracy of class } i}{\text{Total number of classes}}

5.1 Working on Dataset Without Stop Words

At the early stage of our work, we considered removing all stop words from the raw text data while keeping the order of the remaining words. By doing so, we eliminated 76% of the whole word corpus; the training set then contains 17153 instances and the test set 12728 instances. Each text piece again consists of 1000 words.

Feature Extraction

Several features are deployed to analyze their usefulness for this dataset. In this application not only the content but also stylometry and other features are useful. The features extracted to perform classification are as follows:

• Vocabulary diversity

• Bag of words

• N-gram models: 2, 3, and 4-gram models are built. However, only about 3500 texts contain the 3-gram features and about 700 texts contain the 4-gram features. When three-grams and four-grams are introduced into our SVM model, they are observed to act as noise.

• Word2Vec: build Word2Vec author-wise (profile-based) or use Google's pretrained Word2Vec directly.

• Paragraph vectors: use 8 as the window baseline and divide the text into 8 different segments. Average the Word2Vec of each word weighted by its Tf-Idf score over the 1000-word text fragment.

• Non-occurrence word list: we can define a set of union features that contains the union of all non-occurring words for each author.

While running Word2Vec, one observation is that some words, such as "theatre" or "centre", cannot be found in Google's pretrained Word2Vec model. The reason is the minor spelling difference between American and British English. This also gave us the idea that both British and American authors exist in the dataset and that these differences can be used as features. To compare the usage of every unique word for each author, we created a window of size 10 and evaluated the weights by adding 1/(|n| + 1), where n ranges from -4 to 4, and adding 1 for the word itself. With this method, we can create a similarity matrix for every word and compare its usage across different authors. This can be useful when building a one-versus-all classifier to identify unknown authors in our list.

Building Classifiers

Various classifiers can be applied to our task. The following are used for comparison to find the best-fitting classifier for our dataset.
• Naı̈ve Bayes Classifier: It’s the first classifier that was applied to our dataset. Due to better performance with SVM, we didn’t continue to further explore this classifier. However, a better performance comparing to what we have tried could be achieved by defining prior probabilities as the class percentage for each author in the training set. 55 • Support Vector Machines: They are always advised to be useful in text classification and authorship attribution problems [1], [6], [7], [8], [9]. One very important characteristic of SVM is the training datasets normalization techniques before feeding it into SVM model. The boundaries of SVM are strictly affected by the feature dataset. After finding the proper normalization technique (min-max normalization) we also need to tune “C” parameters which would also fit best to our feature sets. • Ensemble Methods: RUSBoost, random forest, Adaboost are used to build our classifiers. In order to decrease computation time, PCA and data whitening technique is applied but due to low cross validation scores we didn’t continue on building our analyses based on them. The reason why we choose RUSBoost is that it’s suggested to perform better for imbalanced dataset. • K-means clustering is used to find out the nearest words by taking their cosine differences. It’s also made useful in unknown author classification task. • Image Classification: We also think of paragraph vectors or averaged sum of TfIdf multiplied Word2Vec as pixels of an image and try to apply spatial descriptor which makes use of estimation and spectral and coarsely localized relationship of pixels [36]. • Convolutional Neural Networks: Using Google’s pretrained Word2Vec with one convolution layer Kim Yoon’s CNN-nonstatic model is modified to multi class classification problem and CV score of a 0.49 is achieved with considering the book id distribution in classes. It’s not a bad result comparing to other classifiers. However, we still proceed with SVM classifiers due to their faster and higher performance results [24]. 56 Experimentation We have identified two significant problems for our dataset. One of them is the imbalanced data and another one is the unknown authors in our test set. In order to address imbalanced class distribution, we have followed two different approaches. Firstly, the training sets are provided with book ids that are identical for each text and the order of these texts are not being modified. This means that we can combine the texts that share the same book id and reproduce training texts that are from these books again. Alternatively, we can randomly increase the number of minority classes by using randsample of Matlab or using SMOTE which is observed not to perform well on this dataset. To identify the weight of each class in overall f1 score performance we have dropped out each class one by one and used 5 folds cross validation to capture mean F1 scores with bag of words representation. Maximum captured mean F1 score is 0.81 with Albert Ross’s work and the minimum one is 0.75 for Lucas Malet. Second problem is to identify right cross validation technique 1 . One way to solve is that while taking the instances for cross validation we can take the same book id as a reference for the books of all authors which would make it closer to final test score. 
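One way to realize this book-id-aware cross validation is scikit-learn's GroupKFold, which keeps all text fragments from the same book in one fold. This is a minimal sketch, not the Matlab procedure referenced in the footnote; the C value follows the tuning reported below.

from sklearn.model_selection import GroupKFold, cross_val_score
from sklearn.svm import LinearSVC

# X: feature matrix, y: author labels, book_ids: book id of each 1000-word fragment.
def book_grouped_cv_score(X, y, book_ids, n_splits=5):
    """Mean F1 (macro) with folds split so fragments of one book never cross folds."""
    cv = GroupKFold(n_splits=n_splits)
    scores = cross_val_score(LinearSVC(C=3), X, y, groups=book_ids, cv=cv, scoring="f1_macro")
    return scores.mean()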
• Stylometric Features with SVM: The bag-of-words representation, the diversity of a text (defined as the number of unique words over the total number of words), 2, 3, and 4-grams, and the British versus American spellings of some specific words are extracted. To find the best classifier, we first used only the bag-of-words representation with the top 3000 most frequent words. Naive Bayes, Random Forest, and ensemble techniques score 0.56, 0.52, and 0.49 with the unknown-book-id cross validation technique. SVM outperforms them all with a score of 0.61. To further improve the SVM performance we tried different normalization techniques, such as the log of nonzero elements and row- and column-wise normalization, with different numbers of bag-of-words features. The best normalization technique for our dataset is min-max normalization. When applying min-max normalization we considered training and test sets together; after normalization, we split them again. We also selected the words labeled from 7000 to 10000 and compared them with the full bag-of-words feature set. The full-vocabulary feature set gives a better result, 0.84 on the known-book-id cross validation, while the reduced one scores 0.78.

1 https://github.com/agungor2/Authorship_Attribution/blob/master/check_oversampling.m

Table 5.1.: Mean F1 Scores for Known and Unknown Book Id

Features                    Known   Unknown
Bow 3000                    0.78    0.68
Bow all                     0.84    0.71
Bow all & bigrams           0.87    0.73
Bow all, 2 & 3 grams        0.86    0.72
Bow all, 2, 3 & 4 grams     0.83    0.69
All useful features         0.87    0.73

As summarized in Table 5.1, the best scores are achieved by concatenating the full bag-of-words representation, bigrams, the 74 words spelled differently by British and American authors, and the vocabulary diversity measure. During the experimentation the C parameter is also tuned, and the best score is achieved with C=3. The cross validation indexes are kept the same for each experiment, since different cross validation configurations may cause minor differences. Since these results are cross validation results, F1 scores on the test data will be closer to the unknown-book-id setting, but lower.

• Word2Vec: Using pretrained word vectors, averaging them over all 1000 words in a text segment, and weighting each word by its Tf-Idf score gives a cross validation score of 0.57 with known book ids. The same experiment with the Tf-Idf-divided feature set scores 0.54. Using a window size of 8, as suggested in Mikolov's work [25], we defined paragraph vectors over 125-word text pieces. However, we did not proceed further with this idea due to the low results of the first Word2Vec approaches.

Even though Word2Vec and CNNs are well-established techniques for semantic feature extraction, they are not as useful as simpler techniques such as bag of words or n-grams on our stop-word-removed dataset. The reason is that their fundamental working principle relies on the sequential order of words in each text, and in this version of the dataset that information has been lost. Nonetheless, working with them revealed the existence of American and British authors in our dataset. A simple bag of words with n-grams and vocabulary diversity provides good features for this classification problem. The way to improve this technique is to try out different normalizations, since the SVM boundaries depend on them.
The increase was observed with 0.05 on mean F1 score with just pursuing a different normalization technique which wasn’t observed even with Word2Vec or CNN. 5.2 Feature Engineering with Different Classifiers In order to work on the semantic and syntactic features of text we have put back the stop words in the dataset. In this setting, again there are 1000 word pieces of texts and the total number of unique words in the whole corpus is set to be 10,000. Total number of training set 53678 whilst total number of test set is 38809. In the training set, there are 45 authors and in the test set there are 50 authors. G. Eliot, J. London, F. H. Burnett, S. Ellis, T. Page are the missing authors in the training set which consists of %34 of all testing data. 59 Table 5.2.: Mean F1 Scores for Different Experimentation Settings Dataset Features Classifier F1 Score 3-author Tf-Idf train, test separate Logistic R. 0.806 3-author Tf-Idf train, test together Logistic R. 0.808 3-author Countvectorizer train, test separate Logistic R. 0.801 3-author Countvectorizer train, test together Logistic R. 0.801 50-author Tf-Idf train, test separate Logistic R. 0.46 50-author Tf-Idf train, test together Logistic R. 0.47 50-author Tf-Idf train, test together svd 120 SVM 0.52 50-author Tf-Idf train, test together svd 120 Xgboost 0.496 50-author Tf-Idf train, test together scaled svd 120 Xgboost 0.495 50-author BOW words 9501:10000 SVM 0.5416 50-author BOW words 7000:10000 SVM 0.5958 50-author BOW words 5000:10000 SVM 0.6110 50-author BOW words 3000:10000 SVM 0.6172 50-author BOW words 7000:10000 Xgboost 0.516 50-author combined features only SVM 0.017 50-author Combined, BOW words 7000:10000 SVM 0.6104 50-author Combined, BOW words 3000:10000 SVM 0.6361 50-author BOW words 1:10000 SVM 0.6410 50-author Combined, BOW words 1:10000 SVM 0.6425 In most of NLP tasks, Tf-Idf and bag of words representations are the first features to implement into a classifier of choice. To test out the performance of these two modules, we have implemented on 3-author dataset using Tf-Idf vectorizer and Count vectorizer. In both of these two settings, we consider the number of unique words, 60 bi-grams and tree grams. To test out their performance we can do two adjustments. Before feeding the text into Tf-Idf or Count vectorizer, it is possible to combine all training and test set together to feed in or get these vectors by separately considering training and test texts. To take the measure of F1 scores for both of these two settings, we have implemented it on Logistic Regression classifier by keeping the C=1. The result for Tf-Idf vectors which are considered separately for the training and test set is 0.806 whilst when considered together it is 0.808. When we tested the same experiment on Count vectors, F1 scores are 0.801 in both settings. We observed not much of a difference for using Tf-Idf or Count vectors as a feature, so we picked one of them, Tf-Idf, and implemented for 50-author dataset using Logistic Regression Classifier. We had two settings as before when building TfIdf vectors. The advantage of combining training and testing set for feature selection is that it serves as a semi-supervised learning. We saw better performance for semisupervised learning case with 0.47 mean F1 score comparing to separated training and testing set. By considering the performance of semi-supervised learning setting, we have applied it on SVM and Xgboost Classifiers as well. 
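A minimal sketch of the Tf-Idf plus Logistic Regression setting from Table 5.2, with the vectorizer fit on training and test texts together (the "semi-supervised" feature extraction described above); the repository script referenced in this section gives the full version, and names here are illustrative.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score

def tfidf_logreg_f1(train_texts, train_labels, test_texts, test_labels):
    """Fit Tf-Idf on train+test together, train Logistic Regression (C=1), report mean F1."""
    vectorizer = TfidfVectorizer(ngram_range=(1, 3), sublinear_tf=True)
    vectorizer.fit(train_texts + test_texts)          # vocabulary learned from both splits
    clf = LogisticRegression(C=1, max_iter=1000)
    clf.fit(vectorizer.transform(train_texts), train_labels)
    predictions = clf.predict(vectorizer.transform(test_texts))
    return f1_score(test_labels, predictions, average="macro")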
Since the dimension of Tf-Idf vectorizer is huge and it is sparse matrix, we have implemented Singular value decomposition (svd) and selected 120 components for both training and testing set. From our experimentation we record that a good range of svd component choosing for this problem is between 120 to 200. For SVM we choose C=1 and for Xgboost max depth is 7, number of estimators are 200, and learning rate is chosen as 0.1. With svd 120 components, SVM and Xgboost performed 0.52 and 0.496 respectively. We scaled and standardized 120 svd features by removing the mean and scaling to unit variance. The scaled dataset has then been implemented to Xgboost with same settings and has not been observed a performance improvement. We chose 120 components from svd but it could go up to 200 components for better results. The models that have been implemented so far are provided here 2 . 2 https://github.com/agungor2/Authorship_Attribution/blob/master/feature_ engineering.py 61 Comparing the result of SVM and Xgboost classifier, Xgboost did not perform good enough. There is a way to increase the performance by hyper-parameter tuning or so called Grid search. The parameters we choose can be optimized by testing out different settings one by one with each other and creating a stop condition to see if there is a change up to 50 rounds of iteration. Hashing vectorizer can also be used instead of count vectorizer or Tf-Idf vectorizer. The difference is that with hashing vectorizer it uses the hashing trick to find the token string name to feature integer index mapping. It is very low memory scalable to large datasets as there is no need to store a vocabulary dictionary in memory. We can also choose the number of components we want to store with this technique which allows smaller size of sparse matrix. Another major task to improve the score is to create probability of the text per author and feed this as a feature to classifier of choice. As mentioned, using count vectorizer is fast for 3-author dataset but is challenging for 50-author dataset due to size of 50-author dataset. In order to implement feature extraction techniques simply, we have built the dataset by representing each word with a unique number and storing it in array. To apply the bag of words representation a simple model is built here 3 . In this model we can select the word of our choice as a feature and measure the frequency of these words per text. Since the boundaries of SVM heavily relies on the feature data, we have also implemented min-max normalization before feeding the data into SVM. Even by only checking the most occurring 500 words, SVM performs a lot more than previous settings with a score of 0.5416. We see a performance improvement as we increase the number of features to 3000, 5000, and 7000 and the measured scores follow as 0.5958, 0.6110, and 0.6172 respectively. We also checked the performance of top 3000 frequencies with Xgboost by considering the same parameters as in previous cases. We measured 0.516 F1 score and Figure 5.1 shows the importance of the top 20 words on the F1 score measure. Us3 https://github.com/agungor2/Authorship_Attribution/blob/master/stylometric_ feature_svm.html 62 Figure 5.1.: Xgboost Feature Distribution ing the word ids provided here one can decrease the number of features and retrain the Xgboost. Another comparison was also done by choosing the top 500 important words and implementing it in both of Xgboost and SVM. After experimentation, we have not observed any difference. 
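The svd-then-classify step described above can be sketched as follows; 120 components and C=1 follow the text, while the min-max scaling step is an assumption carried over from the SVM normalization discussion.

from sklearn.decomposition import TruncatedSVD
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import LinearSVC

def svd_svm(train_tfidf, train_labels, test_tfidf, n_components=120):
    """Reduce the sparse Tf-Idf matrices to dense components, scale, then fit a linear SVM."""
    svd = TruncatedSVD(n_components=n_components, random_state=0)
    X_train = svd.fit_transform(train_tfidf)
    X_test = svd.transform(test_tfidf)
    scaler = MinMaxScaler().fit(X_train)
    clf = LinearSVC(C=1).fit(scaler.transform(X_train), train_labels)
    return clf.predict(scaler.transform(X_test))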
To increase the F1 score we can understand that we can not keep increasing the number of bag of words feature as it contributes less when increased. One way to address this issue is to go through the model weight of each word features and take out the ones that has less weight. Another way is to introduce more features as a distinguisher. We have listed out several lexical, syntactic, and character level difference. In order to implement some of them to 50-author dataset a model has 63 been built 4 . These features are number of unique words, characters, stop words, adjectives, nouns, verbs, and polarity & subjectivity measures (sentiment scoring). We have implemented only these features to see the affect in overall mean F1 score with SVM. It gives a score of 0.017. When implemented these feature with top 3000 bag of words we see an increase in the score from 0.5958 to 0.6104. As you can see the difference between the results are almost same as unique contribution of the stylometric features. We have also implemented it with top 7000 bag of words representation and have recorded better measure of 0.6361. Figure 5.2 shows the confusion matrix of this experiment and we can see the misclassified areas near by the authors G. Eliot, J. London, F. H. Burnett, S. Ellis, T. Page because they are the missing classes in the test data. We also observe some misclassification for R. Kipling and A. Manning. There are 13173 texts misclassified that are written by G. Eliot, J. London, F. H. Burnett, S. Ellis, T. Page in the test data. 23464 are labeled to true classes and 2172 are misclassified that are member of 45 authors. The misclassified cases that are part of 45 authors consist of %5.6 of the test data. This is the room to improve the performance of our approach for further studies. It could be possible by hyper parameter tuning with Grid search approach and adding additional discriminative features. In the final stage, we also have implemented using the frequency of all words in data set and we have achieved 0.6410 F1 score. All bag of words features are also combined with the other useful features and have recorded a performance of 0.6425. We have not implemented syntactic n-grams (sn-grams) that are defined in this paper [37]. Sn-grams are combination of Part of Speech tagging and n-grams in which the advantage of sn-grams is that they are based on syntactic relations of words and, thus, each word is bound to its real neighbors. This allows ignoring the arbitrariness that is introduced by the surface structure [37]. Sn-grams are expected to bring 4 https://github.com/agungor2/Authorship_Attribution/blob/master/feature_ engineering4.py 64 Figure 5.2.: Confusion Matrix for SVM Combined Features grammatical perspective of the text that could serve as a good discriminative feature set. 5.3 Sentence and Paragraph Generating Model The methods we have introduced so far have been made use of frequency of words, characters and other stylometric distinguisher as a feature to stack them together and implement with different classifiers. Another possible usage of such information can 65 be to create character or sentence based language models using very simple conditional probability distributions. 
P(\text{word}) = P(c_1, c_2, \ldots, c_n) = \prod_{i=1}^{n} P(c_i)

P(\text{paragraph}) = P(w_1, w_2, \ldots, w_n) = \prod_{i=1}^{n} P(w_i)     (5.2)

The probability of a word can be written simply as the product of the probabilities of all characters in the word, and the same definition applies to a sentence or a paragraph. For a paragraph or sentence we can also think in terms of a combination of words and redefine its probability accordingly. Using Bayes' rule, the probability of a word or a paragraph given an author is

P(\text{word} \mid \text{Author}) = \frac{P(\text{Author} \mid \text{word}) \, P(\text{word})}{P(\text{Author})}

P(\text{paragraph} \mid \text{Author}) = \frac{P(\text{Author} \mid \text{paragraph}) \, P(\text{paragraph})}{P(\text{Author})}     (5.3)

We can then define the probability of a word being written by one of the authors as this conditional probability, and the same setting can be applied at the paragraph level built up from words. Here P(Author | word) can be taken as the normalized frequency of each word per author in the training set, and P(Author | paragraph) as the normalized distribution of the text pieces per author in the training set.

\text{Predicted Author} = \arg\max_{\text{Author}} \{ P(c_1, c_2, \ldots, c_n \mid \text{Author}) \, P(\text{Author}) \}

\text{Predicted Author} = \arg\max_{\text{Author}} \{ P(w_1, w_2, \ldots, w_n \mid \text{Author}) \, P(\text{Author}) \}     (5.4)

We then simply choose the author with the maximum probability, for the character or the word case, as shown in Equation 5.4. It is also possible to define a Markov Chain model by storing the probabilities of characters or words at different Markov Chain orders, as described in Equations 5.5 and 5.6 respectively.

P(c_1, c_2, \ldots, c_n) = P(c_1) \prod_{i=2}^{n} P(c_i \mid c_{i-1})

P(c_1, c_2, \ldots, c_n) = P(c_1) \, P(c_2 \mid c_1) \prod_{i=3}^{n} P(c_i \mid c_{i-1}, c_{i-2})     (5.5)

P(w_1, w_2, \ldots, w_n) = P(w_1) \prod_{i=2}^{n} P(w_i \mid w_{i-1})

P(w_1, w_2, \ldots, w_n) = P(w_1) \, P(w_2 \mid w_1) \prod_{i=3}^{n} P(w_i \mid w_{i-1}, w_{i-2})     (5.6)

For the Markov Chain models in Equations 5.5 and 5.6 we need to store the conditional probabilities so that they can be reused when defining the full conditional probability. Considering the size of the corpus, the character-level Markov model is computationally cheaper to implement than the word-level one. For our computations we have therefore not stored these prior probabilities and, for simplicity, we consider four settings. In the first two settings a text piece is treated as a combination of words, and the conditional probability of that text piece being written by one of the authors is defined as in Equation 5.3; for the author prior we take two options, first treating all authors as equally likely and second using the normalized number of text pieces per author in the training set 5. In the other two settings, the 1000-word text piece is treated as a combination of characters and the conditional probabilities are calculated accordingly 6. Figure 5.3 shows the character frequency differences over the entire training corpus for Arthur Doyle, Charles Dickens, and James Baldwin; the dissimilarities among them are visible for several characters. The same steps can also be applied to punctuation usage to extract more features.

5 https://github.com/agungor2/Authorship_Attribution/blob/master/word_freq_guess.py
6 https://github.com/agungor2/Authorship_Attribution/blob/master/Character_level_distribution.py

Figure 5.3.: Character Frequency for A. Doyle, C. Dickens, J.
Baldwin Another method using the Character distribution knowledge is to create a character level Bag of characters considering the 27 unique characters in the dataset and their bi-grams, tri-grams, and four-grams to feed it into any classifier as we have shown in the previous experiments. We have not implemented this methodology but interested readers could pursue this part or deploying the next steps of Markov Chain Model as described before. When doing the experimentation of (a), (b), (c), (d) we have firstly calculated the character level and word level distribution per author and then used this knowledge to calculate the conditional probability per author for each text. The scored measures are comparably a lot lower than simple bag of words approach. The reason why the measured scores low is because of large number of words in a given text piece. However, intended goal with this methodology is to lay the foundation of the model and help researchers implement such models. In order to improve scores, Markov Model steps are needed to get correlated relationship between characters and words. 68 Another method to improve scores is to use Markov Models and the conditional probabilities we use as a feature to train classifiers with other useful features to follow the steps of feature stacking methodologies. (a) In the first experiment, we have considered the character level distribution to create the whole text pieces and the prior probability of each author is taken equal. (b) In the second one, we have considered the same setting as in (a) but changed the prior probabilities with author normalized distribution in the training set. (c) Instead of considering characters, we took the word distribution as the probability per author and experimented when considering same probability for each author. (d) In the last experiment, we look at the performance of (c) when considered the normalized distribution of authors in the training set as prior probabilities. 5.4 Ensemble Methods As shown in Table 5.2, every model has a different performance on the same set of feature. We have introduced grid search strategy to improve the overall performance for each classifier. We have also introduced feature engineering strategies in which adding and concatenating extra discriminative features could help build better classifiers (feature ensemble). In the feature ensemble models, we have not introduced weight per feature which could also be a way to improve the performance. There is also an aspect of ensemble methods in which multiple different models can be built and calibrated to get better performance. By using multinomial Naive Bayes, SVM, Multivariate Bernoulli model, and Logistic Regression, an ensemble example case is built 7 . We have also used classifier calibration which makes use of 7 https://github.com/agungor2/Authorship_Attribution/blob/master/ensembling_ methods.py 69 a cross-validation generator and estimates for each split the model parameter on the train samples and the calibration of the test samples. The probabilities predicted for the folds are then averaged. In this setting, we can also define gains per model to try out different experimentation. The experimentation of 50-author dataset has decreased on Tf-Idf feature set to 0.432. However, it is expected to improve score when choosing with the well fitted weights and best parameters for each classifiers.The same methodology can be implemented on the ensembled feature set to see the performance change in further studies. 
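A minimal sketch of such a calibrated soft-voting ensemble in scikit-learn; the exact classifiers, weights, and calibration settings in the repository script may differ from this illustration.

from sklearn.calibration import CalibratedClassifierCV
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB, BernoulliNB
from sklearn.svm import LinearSVC

def calibrated_ensemble():
    """Soft-voting ensemble of probabilistic classifiers; per-model gains set via weights."""
    members = [
        ("mnb", MultinomialNB()),
        ("bnb", BernoulliNB()),
        ("lr", LogisticRegression(max_iter=1000)),
        ("svm", CalibratedClassifierCV(LinearSVC(), cv=3)),  # calibration gives the SVM predict_proba
    ]
    return VotingClassifier(estimators=members, voting="soft", weights=[1, 1, 2, 2])

# ensemble = calibrated_ensemble().fit(X_train_tfidf, y_train)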
5.5 Defining Sentence Vectors

We have introduced five possible use cases of word embeddings in different NLP tasks. Despite the rapid growth of word-embedding applications, it is still an open research question how these word vectors can best be used to represent a sentence or a paragraph [38]. For constructing sentence embeddings, naively averaging word vectors was recently shown to outperform LSTMs [39], and training sentence rather than word embeddings with the CBOW Word2Vec objective has also been studied recently [38]. In our approach, we introduce three ways to represent a sentence vector.

Paragraph Vector

The idea of paragraph vectors was first introduced by Le and Mikolov [40] as a continuation of their CBOW Word2Vec model. The idea is straightforward: treat a paragraph (or document) as just another vector, like a word vector, but call it a paragraph vector. This idea has since been implemented in gensim and in sent2vec, which is built on top of fastText [38]. Applying this methodology to the 50-author dataset is computationally expensive due to the size of the dataset, unless the C implementation of Word2Vec is built and multi-core CPU computation is enabled. To create paragraph vectors we can train on the training and test parts of our dataset separately or train them together. Since every text piece has 1000 words, paragraphs could be defined by dividing the text into smaller pieces; for simplicity, in our implementation we defined every 1000-word text piece as one paragraph and trained the training and test sets separately 8. During the training of paragraph vectors it is also possible to set the learning rate manually at each iteration and train a better model.

Average of Word Embeddings

A sentence vector can also be defined simply as the average of all word vectors in the sentence. There are two ways to obtain word vectors here. One is to train a Word2Vec model on every 1000-word text piece, record the word vectors, and then take their average. This method is somewhat questionable, since the order of the words in different sentences varies and it is unclear whether the trained word vectors are a good representation in vector space; the problem could be addressed by feeding pretrained Google or GloVe word vectors into the network as an embedding. The second method is simply to use a pretrained Word2Vec model to extract each word vector and use those to define the sentence vector. In our approach we deployed GloVe's 300-dimensional pretrained model and scaled the sentence vector by dividing it by the square root of the sum of squares of all vectors 9.

Average of Word2Vec with Tf-Idf

The main difference from the previous case is that each word's Tf-Idf score is used as a weight before defining the paragraph vector: each word vector is multiplied by its Tf-Idf score, and the sum of all word vectors is then scaled. This approach gives more weight to the words that are more meaningful when computing the sentence vector.

8 https://github.com/agungor2/Authorship_Attribution/blob/master/doc2vec_example.py
9 https://github.com/agungor2/Authorship_Attribution/blob/master/word2vec_xgb.py

Table 5.3.: Mean F1 Scores for Sentence Vectors

Dataset     Setting                         Classifier             F1 Score
50-author   Doc2Vec                         Logistic R.            0.085
50-author   Doc2Vec                         Simple XGB             0.140
50-author   Doc2Vec                         XGB Parameter Tuning   0.150
50-author   Doc2Vec                         SVM                    0.090
50-author   Average Word2Vec                Logistic R.            0.284
50-author   Average Word2Vec                Simple XGB             0.367
50-author   Average Word2Vec                XGB Parameter Tuning   0.434
50-author   Average Word2Vec                SVM                    0.371
50-author   Average Word2Vec with Tf-Idf    Logistic R.            0.302
50-author   Average Word2Vec with Tf-Idf    Simple XGB             0.384
50-author   Average Word2Vec with Tf-Idf    XGB Parameter Tuning   0.445
50-author   Average Word2Vec with Tf-Idf    SVM                    0.397

During the experimentation with the defined sentence vectors, we used a 300-dimensional Word2Vec representation as the feature vector of a given text. The results are summarized in Table 5.3. Doc2Vec did not perform well despite the different adjustments; one reason is that every text piece has 1000 words, which makes it challenging to represent as a single vector. One way to address this could be to split every 1000-word text piece into 10 fragments of 100 words each, which is a more reasonable paragraph size, and take the scaled average of the 10 fragment vectors. Averaged Word2Vec performed better at representing a text piece, and parameter tuning for Xgboost gives a performance boost from 0.367 to 0.434. Introducing Tf-Idf weights when defining the sentence vector outperformed the other approaches and increased the mean F1 score slightly in every case. The scores of the averaged Word2Vec approach could again be improved by dividing the text pieces into small fragments and concatenating the fragment vectors to create the feature set. Another option worth checking is point-wise multiplication or a convolution over the Word2Vec vectors to see whether it improves the overall score when defining a paragraph vector.

5.6 Specific Word Usage Score

So far, the common use of word vectors has been to define a sentence vector from pretrained or newly trained word vectors. In this approach, we instead look at how every author uses certain specific words. As described before, authors use specific words such as power words, stop words, and sentiment-flagging words (surprise, fear, jovial feelings) at different frequencies. Based on this behaviour we aim to define a specific word usage score. It is also possible to select random words to analyze; for our experiments, however, we used power words and stop words.

Figure 5.4.: Word Scoring Model

W_2 \text{ score} = \frac{|d(w_1, w_3)|}{|d(w_1, w_2)| + |d(w_2, w_3)|}     (5.7)

Let us take a sentence of five words, as shown in Figure 5.4, and assume w2 is the word we are trying to score. For a window of 3 words, the score is defined as the ratio of the distance between the words surrounding w2 to the sum of the distances from w2 to w1 and w3, as in Equation 5.7. It is also possible to extend the number of words in this window. For example, with a 5-word window the scoring measure for w3 can be defined as in Equation 5.8. The distance measure between vectors here is the cosine distance.

W_3 \text{ score} = \frac{|d(w_1, w_3)| + |d(w_3, w_5)|}{|d(w_1, w_2)| + |d(w_2, w_3)| + |d(w_3, w_4)| + |d(w_4, w_5)|}     (5.8)

After defining the word score, we can check the distribution of this score across all text documents in the training set for every author. A t test tells us how significant the differences between these distributions are.
The t-test can be run in two settings: one-vs-one and one-vs-all. In the one-vs-one setting, we take one author from the training set and compare their score distribution with each of the remaining authors in turn. In the one-vs-all setting, we compare each author against the combined distribution of all remaining authors. In each t-test we record the p-value. To see whether this approach is promising, we use the t-test to check how differently each author uses these words compared to the others. Figure 5.5 shows the p-values smaller than 0.05 for the top 20 words; the words shown in white squares are used differently from the other authors. These t-tests were run on 20 of the most common words, with word ids 9981 to 10000, but one can also run them on power words or flagging words to examine differences in usage (see https://github.com/agungor2/Authorship_Attribution/blob/master/ttest_one_all.m).

Figure 5.5.: One vs Others Most Common 20 Words

Now that we know some of the most common words are used differently by each author, we also apply the Bhattacharyya distance to characterize author usage. The Bhattacharyya distance is given in equation 5.9, where p and q are two distributions, \mu_p is the mean of distribution p, and \sigma_p^2 is its variance.

D_B(p, q) = \frac{1}{4} \ln\left(\frac{1}{4}\left(\frac{\sigma_p^2}{\sigma_q^2} + \frac{\sigma_q^2}{\sigma_p^2} + 2\right)\right) + \frac{1}{4}\,\frac{(\mu_p - \mu_q)^2}{\sigma_p^2 + \sigma_q^2} \quad (5.9)

The purpose of these two tests is to identify which words distinguish authors and how the authors use them. From the t-test we keep the words with p-values less than 0.05; from the Bhattacharyya distance we keep the word indexes whose distance is greater than the mean of the Bhattacharyya distance distribution across authors. After identifying these words, we calculate the mean and variance of each word's score for every author in the training set. To identify the author of a text piece in the test data, we compute the word score for the words selected by the Bhattacharyya distance and t-test analysis, and in the final stage we calculate the joint likelihood for every author given the mean and variance values estimated on the training set. The author with the maximum joint likelihood is assigned as the label in the testing set. The testing algorithm for this whole process is implemented at https://github.com/agungor2/Authorship_Attribution/blob/master/Bhat_5word_distance.m, and pseudocode of the algorithm steps is provided below for future development.

Algorithm 1 Word Scoring Joint Likelihood Algorithm (WSJL)
 1: procedure WSJL_part1(words, authors)
 2:   for range(length(words)) do
 3:     for range(length(authors)) do
 4:       Calculate word scoring
 5:       Calculate Bhattacharyya distance        (or do a t-test)
 6:   return Bhattacharyya distances
 7: Choose the words that are worth investigating
 8: procedure WSJL_part2(worthy words)            (in the training set)
 9:   for range(length(worthy words)) do
10:     for range(length(authors)) do
11:       Calculate word scoring
12:       Calculate mean and variance
13:   return means and variances
14: procedure WSJL_part3(means and variances)     (in the testing set)
15:   for range(length(testing set)) do
16:     for range(length(authors)) do
17:       for range(length(worthy words)) do
18:         Calculate word scoring
19:         Calculate likelihood
20:   Label each text piece with the author of maximum joint likelihood
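The core computations behind Algorithm 1 might be sketched in Python as follows. This is a simplified illustration, not the MATLAB implementation linked above; scores_author, scores_rest, word_scores, and author_stats are hypothetical containers holding the word score samples and the per-author statistics.

import numpy as np
from scipy import stats

def bhattacharyya(mu_p, var_p, mu_q, var_q):
    # Equation 5.9 for two univariate Gaussian score distributions
    return (0.25 * np.log(0.25 * (var_p / var_q + var_q / var_p + 2.0))
            + 0.25 * (mu_p - mu_q) ** 2 / (var_p + var_q))

def is_worthy(scores_author, scores_rest, alpha=0.05):
    # one-vs-all t-test on the score distribution of a single word
    _, p = stats.ttest_ind(scores_author, scores_rest, equal_var=False)
    return p < alpha

def joint_log_likelihood(word_scores, author_stats):
    # word_scores: {word: observed score in the test piece}
    # author_stats: {word: (mean, variance)} estimated on the training set for one author
    ll = 0.0
    for w, s in word_scores.items():
        mu, var = author_stats[w]
        ll += stats.norm.logpdf(s, loc=mu, scale=np.sqrt(var))
    return ll

# a test piece is assigned to the author whose joint log-likelihood is largest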
To test the performance of this algorithm, the 50-author dataset was reduced to 5 authors: W. Irving, F. H. Burnett, J. Abbott, J. Payn, and O. Optic. These authors were chosen because their numbers of text pieces are close to one another. In this testing set there was no class imbalance and there were no unknown authors.

Table 5.4.: Mean F1 Scores for the WSJL Algorithm

Author          Text Number   100 words   200 words   1000 words
J. Payn         3694          0.34        0.28        0.41
O. Optic        3504          0.15        0.28        0.26
F. H. Burnett   3487          0.12        0.10        0.17
W. Irving       3455          0.27        0.26        0.21
J. Abbott       3332          0.06        0.29        0.37

Table 5.4 summarizes the performance of this algorithm. We tested settings with 100, 200, and 1000 words, in each case chosen from the most common word list. One important note is that the number of words shrinks when the worthy words are selected; for example, in the 100-word setting the number of worthy words is 40, and in the 200-word setting it is 132. After identifying the worthy words among the selected words, we applied the joint likelihood. The overall F1 scores tend to increase as the number of words used to define the word scores increases.

5.7 Unsupervised Feature Learning

There are many publicly available algorithms for learning features from unlabeled data. However, training such algorithms can be tricky and difficult for some datasets. K-means clustering has been found to be a fast alternative training method [41], and the resulting features have proven effective for learning large-scale representations of images.

In this approach, a window and a stride are defined. The window is the number of words used in each calculation, and the stride is the step by which the window slides to select the next words in a given text piece. We then combine all the training and testing data together. For every 1000-word text piece, the window is used to create a sentence vector by averaging the Word2Vec vectors of all words in the window (Tf-Idf-weighted averaging can also be used here). By sliding the window by the stride, a new dataset is created; for example, for a 1000-word text piece with a window of 10 and a stride of 5, we create 201 new sentence vectors. This process is repeated for all of the 1000-word text pieces, so the overall number of new data points with window size 10 and stride size 5 becomes 18,813,600. After creating the new data, we normalize it row-wise. The steps are summarized in Algorithm 2.

Algorithm 2 Unsupervised Feature Learning (UFL)
 1: procedure UFL_part1(window, stride)
 2:   Combine the training and testing sets
 3:   for range(1000/s * length(combined data)) - 1 do   (divide each text piece)
 4:     Calculate sent2vec and create new data           (could use Tf-Idf)
 5:   return new data
 6: Normalize new data
 7: Find K cluster center points
 8: Normalize cluster center points
 9: procedure UFL_part2(cluster center points)
10:   for range(length(train or test)) do                (both train and test data)
11:     Create new data as in part 1 using w and s
12:     for range(length(new data)) do
13:       for range(length(K clusters)) do
14:         Calculate distance
15:       Set the distances greater than the mean to zero
16:     Create a feature set from the distances and normalize it
17:   return feature set
18: The feature set can be normalized again
19: Apply a classifier
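A rough Python sketch of Algorithm 2 is given below. It substitutes scikit-learn's MiniBatchKMeans for the vlfeat implementation used in our experiments, and word_vectors (pretrained word vectors) and all_pieces (the tokenized 1000-word pieces of train and test combined) are assumed inputs. The pooling of the thresholded distances into one feature vector per piece is one plausible reading of the algorithm rather than the exact original.

import numpy as np
from scipy.spatial.distance import cdist
from sklearn.cluster import MiniBatchKMeans

def window_vectors(tokens, word_vectors, w=10, s=5, dim=300):
    # slide a w-word window with stride s over a text piece and average the word vectors
    rows = []
    for start in range(0, max(len(tokens) - w + 1, 1), s):
        vecs = [word_vectors[t] for t in tokens[start:start + w] if t in word_vectors]
        rows.append(np.mean(vecs, axis=0) if vecs else np.zeros(dim))
    return np.asarray(rows)

# build the combined window-level matrix, normalize its rows, and cluster it
data = np.vstack([window_vectors(p, word_vectors) for p in all_pieces])
data /= np.linalg.norm(data, axis=1, keepdims=True) + 1e-12
kmeans = MiniBatchKMeans(n_clusters=1000, batch_size=10000, random_state=0).fit(data)

def piece_features(tokens, centers, word_vectors, w=10, s=5):
    # distance of every window to every cluster center; distances above the mean are
    # zeroed and the rest are pooled (summed here) into one feature vector per piece
    wins = window_vectors(tokens, word_vectors, w, s)
    d = cdist(wins, centers)
    d[d > d.mean()] = 0.0
    feat = d.sum(axis=0)
    return feat / (np.linalg.norm(feat) + 1e-12)

features = np.vstack([piece_features(p, kmeans.cluster_centers_, word_vectors)
                      for p in all_pieces])
# `features` can then be fed to an SVM or any other classifier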
Using the vlfeat implementation of K-means clustering, we find the K cluster center points. The identified cluster centers are then used on the training and testing data to compute the distance from each window-and-stride point to every cluster. For each data point, the distances greater than the mean distance are set to zero, and these distances are then used to train classifiers such as SVM. Using Word2Vec with Google's pretrained word vectors, we take w = 10 (imagine 10 words making up a sentence) and s = 5 as the stride. We then create the new dataset from all the points, run vl-kmeans to find 1000 centers, and treat them as our topics, checking the likelihood of each sentence having been written by each author. We have already shown that sentence vectors built from Word2Vec, especially with Tf-Idf averaging, are a useful representation; in this approach we divide all the data points into small sentence pieces and identify clusters (think of each cluster as a document topic) among them. To implement this algorithm we used the pretrained Google vectors to calculate the sentence vectors.

To evaluate this algorithm, text pieces were chosen from the works of W. Irving, F. H. Burnett, J. Abbott, J. Payn, and O. Optic, with no unknown authors in the test data. In this setting, our implementation of unsupervised feature learning reached 92% accuracy with a window size of 8 and a stride of 4. Adding a window size of 10 with a stride of 5 and concatenating the two feature sets reached 97% accuracy. In the same experimental setting, the bag of words accuracy was recorded as 99% (see https://github.com/agungor2/Authorship_Attribution/blob/master/Unsupervised_feature_learning.m). One of the main reasons why bag of words and unsupervised feature learning perform so well on this test data is the train/test split: as noted before, if the book ids are not uniquely distributed between the training and testing data, the classification task becomes easy.

Table 5.5.: UFL F1 Score on the 50-Author Dataset

Setting        K=100   K=1000
W=10, s=5      0.16    0.37
W=20, s=10     0.14    0.31
W=50, s=25     0.12    0.27
Concatenated   0.23    0.38

To investigate this approach further, given its good performance, we applied it to the 50-author dataset while keeping the same settings: 45 authors in training and 50 authors in testing. We experimented with window sizes of 10, 20, and 50 and stride sizes of 5, 10, and 25. After extracting the features, we also concatenated all the distances into one feature set and applied an SVM classifier. Table 5.5 shows the performance of each scenario. We conclude that small window sizes are good at producing useful features and that increasing the number of clusters also increases the overall F1 score. However, both the computational cost and the overall performance of unsupervised feature learning fall short of bag of words.

5.8 Inversion with Word Embeddings

Distributed language models that map words to a vector space are rich in information about word choice and composition. Using Bayes rule, a distributed language model can be turned into a probability model. This approach has been applied to text classification on the Yelp review dataset, which consists of reviews rated from 1 to 5 and contains over 2 million sentences [42]. The inversion method has been noted to perform as well as or better than complex purpose-built algorithms [42]. The steps of the algorithm consist of training a profile-based Word2Vec model for every author in the training set and producing a probability score per author for each text piece in the testing set; the author with the maximum likelihood is then chosen. To apply Bayes rule and go from P(text|author) to P(author|text), we make use of gensim's score model. In the implementation of the Word2Vec score model, a binary Huffman tree is employed to calculate each word's probability [42]. The steps are summarized below.

Algorithm 3 Inversion Word2Vec (IW)
1: procedure IW
2:   for range(Unique(Authors)) do          (author-wise training)
3:     Calculate Word2Vec on the training set
4:   for range(test data) do
5:     Measure P(author|text)
6:   return P(author|text)
7: Rescale P(author|text) to the 0-1 range
8: Pick the highest likelihood
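A minimal gensim-based sketch of Algorithm 3 could look like the following. It is an illustration, not our exact implementation: train_sentences_by_author is a hypothetical mapping from each author to their tokenized training sentences, and the hyperparameters are placeholders. Note that gensim's score() requires a model trained with hierarchical softmax (hs=1, negative=0).

import numpy as np
from gensim.models import Word2Vec

# one profile-based skip-gram model per author
author_models = {
    author: Word2Vec(sentences, vector_size=100, sg=1, hs=1, negative=0,
                     min_count=2, epochs=10)
    for author, sentences in train_sentences_by_author.items()
}

def author_posteriors(test_sentences):
    # log P(text | author): sum of per-sentence log probabilities under each author's model
    authors = list(author_models)
    loglik = np.array([author_models[a].score(test_sentences,
                                              total_sentences=len(test_sentences)).sum()
                       for a in authors])
    # Bayes rule with a uniform prior, rescaled to the 0-1 range
    post = np.exp(loglik - loglik.max())
    return dict(zip(authors, post / post.sum()))

# posteriors = author_posteriors(piece_sentences)
# predicted_author = max(posteriors, key=posteriors.get)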
One aspect of the size of our data can cause a problem when calculating P(author|text): since each text piece consists of 1000 words, the overall probability of an author having written that piece becomes vanishingly small. One way to address this is again the window and stride methodology. By defining windows and strides of different lengths we obtain a probability for each window, and by summing the log probabilities over the divided text piece we redefine a probability measure for every author. It is also possible to train document vectors for every author and obtain a per-author probability score that way [40]. We have implemented this methodology on the 50-author dataset (see https://github.com/agungor2/Authorship_Attribution/blob/master/inversion_word2vec.py). In the experiments, we used window=1000 with stride=500, and window=500 with stride=250; the measured F1 scores were 0.13 and 0.12, respectively. The simplicity of the approach makes it easy to apply to our dataset, but our text pieces are long enough to introduce noise, which is why the scores are low.

Another simple way to use document vectors for identifying the author of a text in the testing data is to take the document vectors trained in Section 5.5, create one center point per author by averaging the paragraph vectors of that author in the training data, and compute the cosine distance from each instance in the testing set to every author. In the final stage, we label each text piece with the closest author. Based on our experiments, this simple approach gives a good starting accuracy on the 50-author dataset. One way to improve it further is to create multiple center points per author and perform the same distance measure across all authors.
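A minimal sketch of this centroid-based labeling is shown below, assuming doc_vectors and labels hold the paragraph vectors and author labels produced by the Doc2Vec training in Section 5.5.

import numpy as np

def author_centroids(doc_vectors, labels):
    # average the training paragraph vectors of each author into a single centroid
    return {a: np.mean([v for v, l in zip(doc_vectors, labels) if l == a], axis=0)
            for a in set(labels)}

def cosine_distance(u, v):
    return 1.0 - np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-12)

def nearest_author(vec, centroids):
    # label a test paragraph vector with the author whose centroid is closest
    return min(centroids, key=lambda a: cosine_distance(vec, centroids[a]))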
6. SUMMARY

Methods for authorship attribution ranging from traditional NLP techniques to more sophisticated ones based on Word2Vec have been introduced. In the first chapter, an extended literature review of previously studied methodologies was presented. In the second chapter, to build a knowledge base and inspire new ideas, NLP tasks and the available NLP toolboxes were introduced with multiple examples; these tasks are crucial first steps in analyzing a text document. In the third chapter, detailed information about the datasets used was given. In the fourth chapter, feature extraction strategies categorized as lexical, character, syntactic, semantic, and application specific were implemented explicitly on both datasets, and the source code is available for future development. The fifth chapter analyzes these features under different classification experiment settings, and some novel approaches using Word2Vec are introduced to lay the groundwork for this set of problems.

One of the main objectives of this work was to create an author signature from commonly used words with publicly available pretrained word vectors. The several attempts we made (word vector averaging, paragraph vector creation, the specific word usage score, unsupervised feature learning from word vectors, and Word2Vec inversion) were all built around this objective. These newer techniques did not outperform traditional methods such as bag of words, n-grams, or other author-specific features; an SVM on a simple bag of words with stacked features outperformed all other experiments and produced the highest score. However, there is still room for further development of the methodologies introduced here, which is why the feature extraction methods and their analyses have been made publicly available so that new researchers can pick up where we left off.

The methodologies introduced in this work were also built around the idea of creating the largest available authorship attribution dataset for researchers and testing various methodologies on it as a benchmarking strategy. The feature extraction techniques and author signature models introduced here can be applied to other datasets and text mining domains as well.

7. RECOMMENDATIONS

We have introduced several methods to tackle authorship attribution problems. However, there are several other techniques still waiting to be applied to the 50-author dataset or other available ones. One possible future direction of this work is the application of neural networks to our dataset. A CNN was applied in a simple form to the dataset without stop words, but due to the computation time we did not investigate it further. To proceed with deep learning methods, the useful features introduced in this work can easily be stacked, reduced in dimensionality, and given as input to a simple convolutional neural network or an LSTM; how to implement such neural network algorithms has also been described. Another starting point for this dataset is to implement Markov models at the word and character level. We have introduced a simple Markov model implementation on our dataset, but to improve its performance further, the conditional distributions of characters and words need to be stored and the experiments redone.

One major advantage of our dataset is the large number of books available for reproduction. The given text pieces can be recombined into books and used in other domains such as per-author topic modeling or sentiment analysis. It is also possible to create one's own dataset from the provided resources, depending on the application. Some suggested tasks are Twitter author classification, Bitcoin founder identification, and applications of authorship attribution in forensic science. The problem with the Bitcoin founder is that they remain anonymous while several candidates have been named on the web: Nick Szabo, Hal Finney, Adam Back, Wei Dai, and Craig Wright.
These are real individuals with large amounts of text available on their own forums and websites. By scraping the writings of these suspects, one can create a dataset, follow the steps we have introduced, and compare their writing styles with that of the unknown founder of Bitcoin. There are also publicly known unsolved cases that can be turned into authorship attribution problems, and robust, fast methodologies could support suspect identification for law enforcement. Last but not least, one challenge we introduce in our dataset is the set of unknown authors, which makes up 34% of the testing set. Non-exhaustive learning techniques have not yet been studied on this dataset; using the best features we have defined, one can further investigate Bayesian non-exhaustive learning methods as in [43].

REFERENCES

[1] J. Olsson, Forensic Linguistics: An Introduction to Language, Crime and the Law, 2nd ed., London, 2008.
[2] Industrial Society and its Future. Washington Post, Sept. 22, 1995.
[3] M. Haberfeld and A. V. Hassell, A New Understanding of Terrorism: Case Studies, Trajectories and Lessons Learned.
[4] A. Tausz, Predicting the Date of Authorship of Historical Texts, 2011.
[5] B. Kessler, G. Nunberg, and H. Schutze, Automatic Detection of Text Genre. Proceedings of the 35th Annual Meeting of the Association for Computational Linguistics and Eighth Conference of the European Chapter of the Association for Computational Linguistics, pp. 32-38, 1997.
[6] J. R. Thompson and J. Rasp, Did C. S. Lewis write The Dark Tower?: An Examination of the Small-Sample Properties of the Thisted-Efron Tests of Authorship. Austrian Journal of Statistics, Volume 38, Number 2, pp. 71-82, 2009.
[7] S. Argamon and S. Levitan, Measuring the Usefulness of Function Words for Authorship Attribution. Proceedings of the ACH/ALLC Conference, 2005.
[8] I. Bozkurt, O. Baglioglu, and E. Uyar, Authorship Attribution Performance of Various Features and Classification Methods, 2007.
[9] S. Kim, H. Kim, T. Weninger, and J. Han, Authorship Classification: A Syntactic Tree Mining Approach. UP '10 Proceedings of the ACM SIGKDD Workshop on Useful Patterns, pp. 65-73, 2010.
[10] G. Fung, The Disputed Federalist Papers: SVM Feature Selection via Concave Minimization.
[11] N. Fox, O. Ehmoda, and E. Charniak, Statistical Stylometrics and the Marlowe-Shakespeare Authorship Debate. Providence, RI: Brown University M.A. Thesis, 2012.
[12] R. Thisted and B. Efron, Did Shakespeare Write a Newly Discovered Poem?, 1986.
[13] S. Stanko, D. Lu, and I. Hsu, Whose Book is it Anyway? Using Machine Learning to Identify the Author of Unknown Texts, 2013.
[14] W. Hu, Study of Pauline Epistles in the New Testament Using Machine Learning. Sociology Mind, Vol. 3, No. 2, pp. 193-203, 2013.
[15] T. Putnins, D. Signoriello, M. B. Samant Jain, and D. Abbott, Advanced Text Authorship Detection Methods and Their Application to Biblical Texts. The International Society for Optical Engineering, pp. 1-13, 2006.
[16] F. Mosteller, Inference and Disputed Authorship: The Federalist, 1964.
[17] J. Hutchins, History of Machine Translation in a Nutshell, 2005.
[18] M. Porter, An Algorithm for Suffix Stripping. Program, Vol. 14, Issue 3, pp. 130-137, 1980.
[19] W. H. Gomaa and A. A. Fahmy, A Survey of Text Similarity Approaches. International Journal of Computer Applications, Volume 68, No. 13.
[20] X. Pan and P. Liu, https://github.com/tensorflow/models/tree/master/research/textsum, 2016.
[21] T. Mikolov, K. Chen, G. Corrado, and J. Dean, Efficient Estimation of Word Representations in Vector Space. ICLR Workshop, 2013.
[22] Y. Bengio, R. Ducharme, and P. Vincent, A Neural Probabilistic Language Model.
[23] S. Thomas, P. Nguyen, G. Zweig, and H. Hermansky, MLP Based Phoneme Detectors for Speech Recognition. ICASSP, 2011.
[24] Y. Kim, Convolutional Neural Networks for Sentence Classification. EMNLP, 2014.
[25] T. Mikolov, M. Karafiat, L. Burget, J. Cernocky, and S. Khudanpur, Recurrent Neural Network Based Language Model. Proceedings of Interspeech, 2010.
[26] S. Lai, L. Xu, K. Liu, and J. Zhao, Recurrent Convolutional Neural Networks for Text Classification. AAAI, 2015.
[27] H. Sak, A. Senior, and F. Beaufays, Long Short-Term Memory Recurrent Neural Network Architectures for Large Scale Acoustic Modeling. INTERSPEECH, 2014.
[28] I. Sutskever, O. Vinyals, and Q. V. Le, Sequence to Sequence Learning with Neural Networks. NIPS, 2014.
[29] The GDELT Project. https://www.gdeltproject.org/about.html, 2017.
[30] Kaggle Spooky Author Identification. https://www.kaggle.com/c/spooky-author-identification/data, 2017.
[31] E. Stamatatos, A Survey of Modern Authorship Attribution Methods. Journal of the American Society for Information Science and Technology, 2009.
[32] R. Layton, Learning Data Mining with Python, 2015.
[33] J. Grieve, Quantitative Authorship Attribution: An Evaluation of Techniques. Literary and Linguistic Computing, 2007.
[34] C. Lee and A. Bosch, Exploring Lexical and Syntactic Features for Language Variety Identification. VarDial, 2017.
[35] L. Yang, C. Li, and Q. D. L. Li, Combining Lexical and Semantic Features for Short Text Classification. 17th International Conference on Knowledge-Based and Intelligent Information and Engineering Systems, 2013.
[36] A. Oliva and A. Torralba, Modeling the Shape of the Scene: A Holistic Representation of the Spatial Envelope. International Journal of Computer Vision, 2001.
[37] G. Sidorov, F. Velasquez, E. Stamatatos, A. Gelbukh, and L. C. Hernandez, Syntactic N-grams as Machine Learning Features for Natural Language Processing. 11th Mexican International Conference on Artificial Intelligence, 2012.
[38] M. Pagliardini, P. Gupta, and M. Jaggi, Unsupervised Learning of Sentence Embeddings Using Compositional n-Gram Features. NAACL, 2018.
[39] J. Wieting, M. Bansal, K. Gimpel, and K. Livescu, Towards Universal Paraphrastic Sentence Embeddings. International Conference on Learning Representations (ICLR), 2016.
[40] Q. Le and T. Mikolov, Distributed Representations of Sentences and Documents. Proceedings of the 31st International Conference on Machine Learning, 2014.
[41] A. Coates and A. Y. Ng, Learning Feature Representations with K-means. In Neural Networks: Tricks of the Trade, 2nd ed., Springer, 2012.
[42] M. Taddy, Document Classification by Inversion of Distributed Language Representations. ACL, 2015.
[43] M. Dundar, F. Akova, A. Qi, and B. Rajwa, Bayesian Nonexhaustive Learning for Online Discovery and Modeling of Emerging Classes. ICML, 2012.