Introduction

According to the Oxford Dictionary, clickbait “describes online content that is designed to encourage the user to click through to a certain web page”. It is a form of false advertising designed to attract attention and entice users to follow a particular link and read, view, or listen to the linked piece of online content. Considered deceptive and misleading, clickbait usually involves sensational headlines or images.

In this assignment I will build a Naïve Bayes text classifier capable of discriminating between clickbait and non-clickbait headlines. The dataset, downloaded from https://github.com/bhargaviparanjape/clickbait, is an enlarged version of the dataset used in the following paper:

Abhijnan Chakraborty, Bhargavi Paranjape, Sourya Kakarla, and Niloy Ganguly. “Stop Clickbait: Detecting and Preventing Clickbaits in Online News Media”. In Proceedings of the 2016 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining (ASONAM), San Francisco, US, August 2016.

The clickbait corpus consists of 16000 article headlines from ‘BuzzFeed’, ‘Upworthy’, ‘ViralNova’, ‘Thatscoop’, ‘Scoopwhoop’ and ‘ViralStories’. Another 16000 non-clickbait article headlines were collected from ‘WikiNews’, ‘New York Times’, ‘The Guardian’, and ‘The Hindu’. It is important to highlight that the authors assumed that all headlines from the former outlets were clickbait, while all headlines from the latter were non-clickbait. In other words, the individual headlines were never labelled by a human annotator, only by source. This can have consequences for the classification, as I will show later on.

Dataset preprocessing

The dataset downloaded from GitHub cannot be used directly, as it consists of two files: one contains the clickbait headlines (one per row) and the other the non-clickbait headlines. Thus, a number of transformations were needed.

# load files with clickbait and non-clickbait headlines
clickbait <- read.table("clickbait_data", header=FALSE, sep="\t", quote = "")
non_clickbait <- read.table("non_clickbait_data", header=FALSE, sep="\t", quote = "")
# Add column header and column with corresponding label ("Yes")
names(clickbait) <- c('Headline')
clickbait$Clickbait <- as.factor(rep("Yes",nrow(clickbait)))
# Add column header and column with corresponding label ("No")
names(non_clickbait) <- c('Headline')
non_clickbait$Clickbait <- as.factor(rep("No",nrow(non_clickbait)))
# bind the two dataframes together and shuffle the rows
click <- rbind(clickbait,non_clickbait)
# fix the random seed
set.seed(100417976)
click <- click[sample(nrow(click)),]
# reset row indices
rownames(click) <- NULL

The dataset was too large to hold all the generated objects in memory (in R Markdown), so I used only 20000 headlines (out of 32000).

# select a subset
click <- click[1:20000,]

Once the dataset is ready, we can have a look at its first entries:

# print first entries of dataframe
head(click)
##                                                                                      Headline Clickbait
## 1                                    33 Reasons Why The South Of Italy Will Ruin You For Life       Yes
## 2                                          30 Things You Might Not Know About Emilie De Ravin       Yes
## 3                                             Usain Bolt sets new world record in 100m sprint        No
## 4 This Bride Treated Herself To McDonald's After Her Groom Went Paintballing At Their Wedding       Yes
## 5                     Group claims Fred Thompson lobbied for abortion-rights, Thompson denies        No
## 6                                                           The 24 Best Fiction Books Of 2015       Yes

The corpus

The following code will create and clean the corpus for the clickbait dataset.

library(tm)
# retrieve full corpus
corpus <- Corpus(VectorSource(click$Headline))
# Translate all letters to lower case (wrapping tolower in content_transformer
# keeps the corpus documents intact, which plain tolower would not)
clean_corpus <- tm_map(corpus, content_transformer(tolower))
# Remove numbers and punctuation
clean_corpus <- tm_map(clean_corpus, removeNumbers)
clean_corpus <- tm_map(clean_corpus, removePunctuation)
# Remove stop words and excess white spaces
clean_corpus <- tm_map(clean_corpus, removeWords, stopwords("en"))
clean_corpus <- tm_map(clean_corpus, stripWhitespace)
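
As a quick sanity check (headline 1 is an arbitrary pick), we can compare a raw headline with its cleaned counterpart:

# raw vs. cleaned version of the same headline
content(corpus[[1]])
content(clean_corpus[[1]])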

Using word clouds, we can visually inspect the 30 most common words in clickbait (left) and non-clickbait (right) headlines:

library(wordcloud)
# find indices of clickbait (CB) and non-clickbait (NCB) headlines
CB_indices <- which(click$Clickbait == "Yes")
NCB_indices <- which(click$Clickbait == "No")
# plot them
par(mfrow=c(1,2))
# wordcloud for clickbait headlines
wordcloud(clean_corpus[CB_indices], max.words = 30, scale=c(3,1))
# wordcloud for non-clickbait headlines
wordcloud(clean_corpus[NCB_indices], max.words = 30, scale=c(3,1))

It is not difficult to tell one from the other: the word cloud corresponding to non-clickbait headlines displays words that one would expect in a “serious” newspaper (“killed”, “obama”, “china”, “government”), whereas the clickbait word cloud highlights words that readers could find attractive (“zodiac”, “love”, “favorite”). This suggests that, in principle, a text classifier will not have too many problems telling the two kinds of headlines apart.
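
To back up the visual impression with numbers, we can also tabulate the most frequent terms in each class. A minimal sketch using the slam package (on which tm’s sparse matrices are built); the cutoff of 10 terms is arbitrary:

library(slam)
# overall term frequencies within each class
dtm_CB <- DocumentTermMatrix(clean_corpus[CB_indices])
head(sort(col_sums(dtm_CB), decreasing = TRUE), 10)
dtm_NCB <- DocumentTermMatrix(clean_corpus[NCB_indices])
head(sort(col_sums(dtm_NCB), decreasing = TRUE), 10)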

Build the Clickbait filter

Train and test partitions

I will begin by splitting the headlines dataset and the clean corpus into train and test partitions. In this case, 75% of the instances go to the training partition and 25% to the test partition.

# total number of headlines
n_t <- dim(click)[1]
# 75% to the train set
n_tr <- round(0.75*n_t)
# split headlines into train set and test set
click_train <- click[1:n_tr,]
click_test <- click[(n_tr+1):n_t,]
# split the clean corpus
corpus_train <- clean_corpus[1:n_tr]
corpus_test <- clean_corpus[(n_tr+1):n_t]

We can check that both partitions have a similar proportion of clickbait and non-clickbait headlines (approximately 50/50):

# for the train
table(click_train$Clickbait)/length(click_train$Clickbait)
## 
##       Yes        No 
## 0.4969333 0.5030667
# for the test
table(click_test$Clickbait)/length(click_test$Clickbait)
## 
##    Yes     No 
## 0.4934 0.5066

Document-term matrix (dtm)

The next step is to construct the document-term matrix (dtm) and split it into training and test partitions:

# compute the frequency of terms
click_dtm <- DocumentTermMatrix(clean_corpus)
# split it in train and test sets
click_dtm_train <- click_dtm[1:n_tr,]
click_dtm_test <- click_dtm[(n_tr+1):n_t,]

Identify the words appearing at least 5 times in the training partition and “clean” the previous document-term matrices accordingly

five_times_words <- findFreqTerms(click_dtm_train, 5)
click_dtm_train <- DocumentTermMatrix(corpus_train,
                                    control=list(dictionary = five_times_words))
click_dtm_test <- DocumentTermMatrix(corpus_test,
                                   control=list(dictionary = five_times_words))
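
As a quick check, we can see how large the retained vocabulary is and confirm that the filtered matrices share it:

# number of terms kept after the frequency cut
length(five_times_words)
# dimensions (documents x terms) of the filtered train dtm
dim(click_dtm_train)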

Convert the count information to “Yes” or “No” (i.e., whether a word appears in a headline at all)

# define function to convert word counts into "Yes"/"No" presence indicators
convert_count <- function(x){
  y <- ifelse(x > 0, 1, 0)
  y <- factor(y, levels=c(0,1), labels=c("No", "Yes"))
  y
}
# apply function to the train dtm
click_dtm_train <- apply(click_dtm_train, 2, convert_count)
# apply function to the test dtm
click_dtm_test <- apply(click_dtm_test, 2, convert_count)
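
A small usage example of convert_count on a toy vector of counts (illustrative only):

# counts greater than zero become "Yes", zeros become "No"
convert_count(c(0, 2, 1, 0))
## [1] No  Yes Yes No 
## Levels: No Yes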

Naïve Bayes Classifier

Train a Naïve Bayes classifier using the training dataset and the corresponding document-term matrix.

# load needed library
library(e1071)
# Train the Naive Bayes Classifier
NB.clas <- naiveBayes(click_dtm_train, click_train$Clickbait)

The next step is to evaluate the performance of the classifier on the test partition and to check the results with the confusion matrix. In this case, I have used the confusionMatrix function from the caret package because it provides very valuable information about the accuracy of the predictions.

# evaluate the performance of the classifier
NB_predictions <- predict(NB.clas, newdata=click_dtm_test)
# check predictions against reality
caret::confusionMatrix(NB_predictions, click_test$Clickbait)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  Yes   No
##        Yes 2310  185
##        No   157 2348
##                                           
##                Accuracy : 0.9316          
##                  95% CI : (0.9242, 0.9384)
##     No Information Rate : 0.5066          
##     P-Value [Acc > NIR] : <2e-16          
##                                           
##                   Kappa : 0.8632          
##                                           
##  Mcnemar's Test P-Value : 0.1443          
##                                           
##             Sensitivity : 0.9364          
##             Specificity : 0.9270          
##          Pos Pred Value : 0.9259          
##          Neg Pred Value : 0.9373          
##              Prevalence : 0.4934          
##          Detection Rate : 0.4620          
##    Detection Prevalence : 0.4990          
##       Balanced Accuracy : 0.9317          
##                                           
##        'Positive' Class : Yes             
## 

As can be seen in the previous summary, the Naïve Bayes classifier that I have trained does a remarkable job of distinguishing between clickbait and non-clickbait headlines. Overall, the accuracy reaches 93.2%; in particular, 92.6% of the headlines it flags as clickbait and 93.7% of those it flags as non-clickbait are correctly labelled (the positive and negative predictive values).
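
These figures need not be read off the printout by hand; they can also be extracted from the object that caret returns (a short sketch, storing the result first):

# store the confusion matrix object and extract the quoted metrics
cm <- caret::confusionMatrix(NB_predictions, click_test$Clickbait)
cm$overall["Accuracy"]
cm$byClass[c("Pos Pred Value", "Neg Pred Value")]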

Bayesian Naïve Bayes Classifier

Even though the performance of the classical Naïve Bayes classifier is good, it might be marginally improved if the classifier is “taught” to deal with terms it has never seen in a given class during training. This can be incorporated into the previous classifier by applying Laplace smoothing.
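
As a toy illustration (the numbers are made up and not part of the pipeline): with a two-level presence indicator (“No”/“Yes”), add-one smoothing gives every level one pseudo-count, so a term never observed in a class no longer receives a conditional probability of zero:

# unsmoothed estimate for a term seen 0 times in a class of, say, 7500 headlines
0 / 7500
# Laplace-smoothed (laplace = 1) estimate: (count + 1) / (n + 1 * number of levels)
(0 + 1) / (7500 + 1 * 2)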

# Train the Bayesian Naive Bayes classifier incorporating Laplacian Smoothing
BNB.clas <- naiveBayes(click_dtm_train, click_train$Clickbait,laplace = 1)
# evaluate the performance of the new classifier
BNB.predictions <- predict(BNB.clas, newdata=click_dtm_test)
# check predictions against reality
caret::confusionMatrix(BNB.predictions, click_test$Clickbait)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  Yes   No
##        Yes 2339  106
##        No   128 2427
##                                          
##                Accuracy : 0.9532         
##                  95% CI : (0.947, 0.9589)
##     No Information Rate : 0.5066         
##     P-Value [Acc > NIR] : <2e-16         
##                                          
##                   Kappa : 0.9064         
##                                          
##  Mcnemar's Test P-Value : 0.1698         
##                                          
##             Sensitivity : 0.9481         
##             Specificity : 0.9582         
##          Pos Pred Value : 0.9566         
##          Neg Pred Value : 0.9499         
##              Prevalence : 0.4934         
##          Detection Rate : 0.4678         
##    Detection Prevalence : 0.4890         
##       Balanced Accuracy : 0.9531         
##                                          
##        'Positive' Class : Yes            
## 

As expected, the Bayesian filter gives better results than the classical one. With this classifier, the accuracy increases to a remarkable 95.3%. In particular, 95.7% of the headlines flagged as clickbait and 95% of those flagged as non-clickbait are correctly labelled, which is an extremely good balance. That around 4% of the non-clickbait headlines are incorrectly classified as clickbait could mean that the classifier has its limitations and/or that “serious” newspapers fall, inadvertently or not, into bad practices. Recall that the headlines were labelled as clickbait or non-clickbait solely because of the medium they came from, not because a human checked them.

As a final remark, it should be noted that in the original publication the authors reported an accuracy of 93% using a Support Vector Machine (SVM), although with a smaller dataset. In this sense, training the Bayesian filter with just 10000 headlines (out of 32000) reduces the accuracy to 93.8%, which is still higher than in the original paper.
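
For completeness, here is a sketch of how that smaller-sample experiment can be reproduced with the objects already in memory (run_experiment is my own helper, not from the paper, and it draws the 10000 headlines from the 20000-headline subset loaded above):

# re-run the pipeline on the first n headlines and return the test accuracy
run_experiment <- function(n, train_frac = 0.75) {
  n_tr <- round(train_frac * n)
  # vocabulary from the training part only, as above
  freq_words <- findFreqTerms(DocumentTermMatrix(clean_corpus[1:n_tr]), 5)
  dtm_tr <- apply(DocumentTermMatrix(clean_corpus[1:n_tr],
                                     control = list(dictionary = freq_words)),
                  2, convert_count)
  dtm_te <- apply(DocumentTermMatrix(clean_corpus[(n_tr + 1):n],
                                     control = list(dictionary = freq_words)),
                  2, convert_count)
  model <- naiveBayes(dtm_tr, click$Clickbait[1:n_tr], laplace = 1)
  pred <- predict(model, newdata = dtm_te)
  # proportion of correct predictions, i.e. the accuracy
  mean(as.character(pred) == as.character(click$Clickbait[(n_tr + 1):n]))
}
run_experiment(10000)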