• Warsame Words

How to build a fake news detection system with SOTA NLP

A hands-on tutorial using real news data, this article explores a range of SOTA NLP techniques to accurately classify news texts as either fake or real.

Table of Contents

  1. Introduction

  2. Raw Input Data

  3. Exploratory Data Analysis

  4. Feature Engineering

  5. Model Selection

  6. Conclusion

  7. References


Introduction

Social media's rise as a byproduct of the internet exemplifies the most profound societal trend of recent times. Since its widespread adoption, a considerable part of social life has moved online, whereas centuries of prior human exchange involved physical contact. 'Your network is your net worth' is a phrase that embodies exactly this, and illustrates how much power we ascribe to this new facet of modernity. Social media enables users to build social networks, pools of knowledge and news sharing, community movements and subcultures, and it has unlocked incredible economic fortunes. It has forever transformed our lives, so much so that modern workforces have no choice but to acquire digital skills to tap into its potential.

Yet that is not the whole story. Social media does not bring only glory and glamour. The many outlets in existence today, such as Instagram, Facebook, Twitter, TikTok and scores more, aren't merely oases of connectedness, nor are they entirely beneficial virtual trade hubs. As is always the case with innovations, there are also negative sides; this tutorial aims to explore one of them, namely fake news dissemination. The technology-powered proliferation of fake news has devastating consequences for society at large. It is therefore imperative that firms and governments further develop the currently available mechanisms for its detection and elimination.

But before jumping into the data, let us first pin down a definition of fake news. The term is often used inappropriately to dismiss disliked news content, and it isn't in fact that easy to define. So, while ploughing through a dozen academic papers, I asked myself: is there a universally accepted and applied definition? If so, how are unintentional blunders in reported facts or figures treated? Are they also considered fake news?

In the end, I’ve come to realise there is no such widely accepted definition. But published studies in the text classification branch of data science offer some sensible views on this. Rubin et al., for instance, describe three types of fake news:

  • Serious fabrications which are not published in any mainstream news outlet — also, due to their rarity, these are harder to gather using web scraping.

  • Uniquely written creative hoaxes which appear on multiple platforms on account of their viral nature.

  • Humorous fake news created for entertainment purposes, much like satire pieces. The authors argue that including this type makes it harder for algorithms to detect the former two types with conventional classification methods.

Consequently, we can conclude that fake news is entirely fabricated and partisan content presented by its propagators as factual, and mostly circulated on the web: on blogs, news sites and social media platforms. Generally speaking, two criteria differentiate fake news from real news: (1) the fabrication of events which never occurred in the reported form, and (2) an element of partisanship aimed at either praising or discrediting targeted persons, organisations or other entities. Both operate in unison in the spreading of false information.

Raw Input Data

The data science learning and competitions website Kaggle offers hundreds of rich datasets that researchers can exploit for every conceivable purpose. All one needs to do is register an account, and off you go in search of interesting datasets. I have chosen the dataset linked below, a collection of news articles published around the time when Donald Trump and Hillary Clinton were competing for the highest office in the US: (https://www.kaggle.com/javagarm/trump-fake-news-dataset)

I'm certain that there was an older submission with more detail on how the articles were scraped, and which exact time period they cover. In retrospect, we know that the fake news phenomenon gained the most traction in the wake of Trump's 2016 presidential election bid. His campaign was mired in allegations of Russian involvement, apparently in the form of armies of social media bots that helped him win. For this reason, it seems reasonable to assume our dataset is virtually identical to this more prominent one used in two academic papers: (https://www.kaggle.com/clmentbisaillon/fake-and-real-news-dataset?select=True.csv)

Using the head and info methods of Pandas provides us with a good overview:
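
As a minimal sketch of this overview step: the small in-memory frame below is a hypothetical stand-in for the Kaggle CSV (normally loaded with pd.read_csv), and the column names title, author, text and label are assumptions rather than a confirmed schema.

```python
import pandas as pd

# Hypothetical stand-in for the Kaggle data; column names are assumptions.
df = pd.DataFrame({
    "title": ["Headline A", "Headline B", "Headline C", "Headline D"],
    "author": ["Jane Doe", None, "John Roe", None],
    "text": ["Body of article A.", "Body of article B!",
             "Body of article C.", "Body of article D?"],
    "label": [0, 1, 0, 1],  # 1 = fake news, 0 = real news
})

# head() shows the first rows; info() lists dtypes and non-null counts,
# which is where missing author names first become visible.
print(df.head())
df.info()
```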

Exploratory Data Analysis

The very first step on this front should scrutinise whether or not we’re affected by a problem called class imbalance. In the same way, it is also worthwhile to examine whether the data exhibits a pattern in the missing values. We can see in the plot below that the news article dataset is perfectly balanced, as half of the rows are fake news and the other half real, suggesting human intervention in its creation. Such a find is always a relief, considering that class imbalance poses tremendous challenges in the application of prediction algorithms.
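
A quick numeric check of class balance could look like the sketch below. The toy label series and the 5% tolerance are assumptions chosen for illustration only.

```python
import pandas as pd

# Toy labels standing in for the 'label' column (1 = fake, 0 = real).
labels = pd.Series([1, 0, 1, 0, 1, 0], name="label")

# Share of each class; a perfectly balanced dataset gives 0.5 / 0.5.
class_share = labels.value_counts(normalize=True).sort_index()
print(class_share)

# Arbitrary tolerance for calling the data "balanced".
is_balanced = (class_share.max() - class_share.min()) < 0.05
```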

Turning now to the search for a pattern in the missing values, let's use the below code snippet to colour missing values yellow and complete values blue. Such an approach is particularly useful when dealing with time-series data, as recurring patterns can be spotted immediately with the naked eye.
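
One way to draw such a missing-value map is sketched below with plain matplotlib. This is an assumption about how the plot could be produced, not the original snippet; the function name plot_missing_values and the toy frame are hypothetical.

```python
import matplotlib.pyplot as plt
import pandas as pd
from matplotlib.colors import ListedColormap


def plot_missing_values(df):
    """Draw one cell per value: blue = present, yellow = missing."""
    mask = df.isna().to_numpy().astype(int)  # 1 where a value is missing
    fig, ax = plt.subplots(figsize=(6, 4))
    ax.imshow(mask, aspect="auto", interpolation="nearest",
              cmap=ListedColormap(["tab:blue", "gold"]), vmin=0, vmax=1)
    ax.set_xticks(range(df.shape[1]))
    ax.set_xticklabels(df.columns)
    ax.set_ylabel("row index")
    return fig, mask


# Toy frame with two missing author names.
df = pd.DataFrame({
    "title": ["A", "B", "C", "D"],
    "author": ["Jane Doe", None, "John Roe", None],
    "label": [0, 1, 0, 1],
})
fig, mask = plot_missing_values(df)
```

Calling plt.show() or fig.savefig(...) afterwards renders the figure.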

There clearly is an abundance of yellow in the 'author' column, indicating the presence of missing values. It is an observation mentioned already in the previous section. What's more interesting, however, is the following question: are fake news articles more likely to lack a corresponding author name than real news articles? Let us investigate this and further advance our understanding of our dataset.

Looking at a horizontal bar chart visualisation of the most frequent names appearing in the author column, we can indeed confirm that the name occurring most often in fake news articles is 'Unknown'. It is important to note, however, that I encoded the missing values as 'Unknown' before running the visualisation.
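
The encoding step and the per-class comparison can be sketched as follows. The toy rows are invented for illustration; only the 'author' and 'label' column names and the 'Unknown' placeholder come from the article.

```python
import pandas as pd

# Toy data: missing author names are None.
df = pd.DataFrame({
    "author": ["Jane Doe", None, "John Roe", None, "Jane Doe", None],
    "label":  [0, 1, 0, 1, 0, 0],  # 1 = fake, 0 = real
})

# Encode missing author names as 'Unknown' before counting.
df["author"] = df["author"].fillna("Unknown")

# Share of 'Unknown' authors within each class.
unknown_share = (df["author"] == "Unknown").groupby(df["label"]).mean()

# Most frequent author names among fake news articles.
top_fake_authors = df.loc[df["label"] == 1, "author"].value_counts().head(50)
print(unknown_share)
print(top_fake_authors)
```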

The visualisation plot function is a tweaked version of Yellowbrick's handy text visualisation class called Token Frequency Distribution (https://www.scikit-yb.org/en/latest/api/text/freqdist.html).

At any rate, what does the picture look like for the real news articles?

Scanning the graphic reveals there's no instance of the word 'Unknown' among the top 50 authors of real news, as one would expect. Almost all academic studies reviewed for this article point to exactly this conclusion: anonymity more often than not encourages the spread of fake news (Giachanou and Rosso, 2020). This is rather unsurprising, come to think of it. In a faceless, highly interconnected and digitalised world where individuals disguise themselves to post whatever they want, it is unabashed toxicity, more than anything else, that thrives without bounds. Thus one should always be cautious of an account run by an anonymous person, as the likelihood of it being dubious is high.

Feature Engineering

Turning now to the feature engineering part of this project, we will explore a range of methods from the latest advances in natural language processing (NLP). As a branch of machine learning and AI, it has witnessed incredible progress in recent times. In fact, every year or so the data science community delights at the news of a new technology that provides state-of-the-art (SOTA) toolkits for NLP use cases.

Yet that doesn’t mean all problems which may arise are outsourced. Hard choices still have to be made, and extracting text features remains difficult. From delicate regular expressions to grappling with complex concepts such as word vectors, it does require a bit of nitty gritty work.

Among the biggest challenges of text classification is the high-dimensionality problem. Just as with numerical datasets, one can extract a great many features from text datasets, for example in tasks involving the analysis of news articles, social interactions and product reviews. There is a vast vocabulary of idioms, words and phrases to work through, often resulting in computationally expensive operations. For this project, I have followed the classification process of Ahmed et al. 2017:

The main axiom underpinning my feature engineering approach is simple: real news articles differ from fake ones in that their authors follow the linguistic practices of journalism. Many of these journalists are professional writers who take great care in how they express themselves. That is why I thought it sensible to extract linguistic features first. Here’s a list of the basic features I extracted:

  • char_count: how many characters an article is composed of

  • word_count: the overall word count

  • nr_unique_words: how many unique words an article contains

  • nr_unknown_authors: whether the article was written by an unknown author

  • common_nouns: how many common nouns an article contains

  • proper_nouns: how many proper nouns an article contains

  • proper_common_ratio: the ratio of proper to common nouns

  • num_exclamation_marks: number of exclamation marks in an article

  • nr_question_marks: number of question marks in an article

  • readability score: Flesch Kincaid reading ease score
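
A few of the simpler features in this list can be sketched with plain string operations. The tokenisation below (whitespace splitting plus punctuation stripping) is an assumption, and the function name basic_text_features is hypothetical. The noun-based features are left out because they need a part-of-speech tagger.

```python
import pandas as pd


def basic_text_features(text: str) -> dict:
    """Count-based features for one article (naive whitespace tokenisation)."""
    words = text.split()
    # Lower-case and strip surrounding punctuation before counting unique words.
    unique_words = {w.lower().strip(".,!?;:\"'()") for w in words}
    unique_words.discard("")
    return {
        "char_count": len(text),
        "word_count": len(words),
        "nr_unique_words": len(unique_words),
        "num_exclamation_marks": text.count("!"),
        "nr_question_marks": text.count("?"),
    }


articles = pd.Series([
    "Breaking! You won't believe this! Really?",
    "The committee met on Tuesday.",
])
features = pd.DataFrame([basic_text_features(t) for t in articles])
print(features)
```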

The exploratory data analysis above showed that an unknown author is a strong indicator of fake news. Any article composed by an anonymous person is, therefore, more likely to be propagating falsehood. A reading of the literature also reveals that fake news articles tend to pack a lot into the title, usually the names of individuals, institutions or political parties. Proper nouns thus appear more frequently, relative to common nouns, as a form of clickbait. That being so, I concluded it may be useful to examine the ratio of proper nouns to common nouns in an article.

Furthermore, what I have encountered quite often in the world of social media is the use of exclamation marks in the text body, possibly as a trick to insinuate importance. This is why I’ve also decided to count how often an author used exclamation marks and question marks. Besides these simple symbol features, I thought it enriching to create linguistic features such as the Flesch-Kincaid reading ease score. Writing is undoubtedly a matter of style, and given that fake news propagators will try hard to disguise their falsehood, their writing could potentially differ in its style and language usage.
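
For the readability feature, here is a rough sketch assuming the score meant is the standard Flesch reading ease formula, 206.835 − 1.015 × (words / sentences) − 84.6 × (syllables / words). The syllable counter is a crude vowel-group heuristic, so the scores are approximate.

```python
import re


def count_syllables(word: str) -> int:
    """Crude heuristic: one syllable per group of consecutive vowels."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))


def flesch_reading_ease(text: str) -> float:
    """Approximate Flesch reading ease; higher means easier to read."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    if not sentences or not words:
        return 0.0
    syllables = sum(count_syllables(w) for w in words)
    return (206.835
            - 1.015 * (len(words) / len(sentences))
            - 84.6 * (syllables / len(words)))


simple = "The cat sat. The dog ran."
dense = "Institutional misrepresentation undermines international credibility."
print(flesch_reading_ease(simple), flesch_reading_ease(dense))
```

Short sentences built from short words score high, while long, polysyllabic sentences score low or even negative.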

Every now and then, feature engineering does indeed turn out to be a high-effort, low-yield activity. Taking a look at the Pearson correlation coefficient between the ‘label’ column (where 1 denotes fake news and 0 real news) and the engineered features, we can see that only the unknown author and number of unique words columns show any meaningful correlation. The rest are marginally better than random noise. Yet interestingly, we can still draw some noteworthy conclusions. To sum them up:

  • The more exclamation marks used in an article, the more likely it’s fake news.

  • The more proper nouns (names of people, places, religions and buildings), the more likely an article is fake news. This means that the clickbait hypothesis in the literature holds for our dataset.

  • The more unique words an author employs, the less likely they are to be spreading fake news.
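
The correlation check behind these observations can be sketched as below. The numbers are invented toy values chosen only to show the mechanics; they are not the article's results.

```python
import pandas as pd

# Toy feature table; values are made up for illustration.
features = pd.DataFrame({
    "num_exclamation_marks": [5, 0, 3, 1, 4, 0],
    "nr_unique_words":       [40, 120, 55, 110, 35, 130],
    "label":                 [1, 0, 1, 0, 1, 0],  # 1 = fake, 0 = real
})

# Pearson correlation (the pandas default) of every feature with the label.
correlations = (features.drop(columns="label")
                        .corrwith(features["label"])
                        .sort_values())
print(correlations)
```

A positive coefficient means the feature tends to be larger for fake news, a negative one that it tends to be larger for real news.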

But correlation is not causation, as the old adage of statistics goes. We have to go beyond mere correlative analysis and model the relationships using more sophisticated algorithms. In view of this, and before proceeding to the model selection part, I have decided to include a powerful NLP technique to make up for the poor explanatory power of the manually engineered features. The trick is called word vectors, a built-in functionality of the spaCy library. So, what are word vectors, then?

Prior to introducing these magical word vectors, let us first grapple with the notion of semantic similarity. Ponder for a minute: aren’t some words closer to each other in meaning than others? For instance, if you were to invite a friend to go for a nice burger meal (apologies to the vegetarians and vegans among you), could you not say: “Hey, shall we go to a fast food joint?” and this would theoretically include burger meals? Have a look at the image below and the idea will become a lot clearer.

Word vectors are numeric representations of text, computed by statistical models. Similarity measures such as cosine similarity are then used to express how close in meaning certain words are to each other. In the diagram above, we can see that just as the words boy and girl are similar to one another, so are the terms princess and prince. In spaCy, similarity values for related words typically fall between 0 (no similarity) and 1 (identical vectors), although cosine similarity can in principle drop below 0 for vectors pointing in opposite directions. See below function:
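
To make the similarity measure concrete, here is a minimal cosine similarity sketch in NumPy. The three-dimensional toy vectors are invented for illustration; real word vectors have hundreds of dimensions. In spaCy, the comparable call is the similarity method on Doc, Span and Token objects.

```python
import numpy as np


def cosine_similarity(u, v) -> float:
    """Cosine of the angle between two vectors, ranging from -1 to 1."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))


# Invented toy vectors; not taken from any real model.
toy_vectors = {
    "prince":   [0.90, 0.80, 0.10],
    "princess": [0.85, 0.90, 0.15],
    "burger":   [0.10, 0.05, 0.95],
}

royal = cosine_similarity(toy_vectors["prince"], toy_vectors["princess"])
unrelated = cosine_similarity(toy_vectors["prince"], toy_vectors["burger"])
print(round(royal, 3), round(unrelated, 3))
```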

Using word vectors from spaCy's “en_core_web_md” model, which ships with a large vocabulary of vectors accessed via the .vector attribute, I have generated 300 more features for our fake news prediction task. Subsequently, I have joined the word vector dataframe with our manually engineered features. Let us now see how our amalgamated construct fares in the actual task of classifying news articles into either fake or real news.
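
The joining step can be sketched without loading a spaCy model. The four-dimensional toy_embeddings lookup below is a hypothetical stand-in for the 300-dimensional vectors, and doc_vector mimics the default behaviour of spaCy's Doc.vector, which averages the token vectors.

```python
import numpy as np
import pandas as pd

DIM = 4  # stand-in for the 300 dimensions of the real vectors

# Hypothetical toy embeddings; values are invented.
toy_embeddings = {
    "trump":  np.array([0.9, 0.1, 0.0, 0.2]),
    "wins":   np.array([0.1, 0.8, 0.1, 0.0]),
    "senate": np.array([0.7, 0.2, 0.1, 0.3]),
    "votes":  np.array([0.2, 0.7, 0.2, 0.1]),
}


def doc_vector(text: str) -> np.ndarray:
    """Average the vectors of all known words; zeros if none are known."""
    vecs = [toy_embeddings[w] for w in text.lower().split() if w in toy_embeddings]
    return np.mean(vecs, axis=0) if vecs else np.zeros(DIM)


texts = ["Trump wins", "Senate votes today"]

# One column per vector dimension.
vec_df = pd.DataFrame([doc_vector(t) for t in texts],
                      columns=[f"vec_{i}" for i in range(DIM)])

# A manually engineered feature, joined column-wise with the vectors.
manual_df = pd.DataFrame({"word_count": [len(t.split()) for t in texts]})
X = pd.concat([manual_df, vec_df], axis=1)
print(X)
```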

Model Selection

There are dozens of modelling techniques employed by specialists of text classification in the academic literature. From Logistic Regression (LR) to Naive Bayes (NB) and neural network (NN) models, everything has been tried and tested on fake news datasets, both artificially constructed and scraped from the real world. But before finally embarking on predicting the fake news articles from Kaggle, and selecting a range of models, we have to split the data into a train set and a test set. I made the hard choice of randomly selecting 30% of our 20k-odd rows as the test dataset.
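
A 70/30 split of this kind can be sketched with scikit-learn's train_test_split. The synthetic data stands in for the real feature matrix, and the stratify argument is my own addition to keep the class shares identical in both sets; the article only mentions a random selection.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the feature matrix and the balanced labels.
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 5))
y = np.array([0, 1] * 100)

# Hold out 30% of the rows for testing; stratify is an assumption.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)
print(X_train.shape, X_test.shape)
```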

This is quite a large dataset compared with those used in many of the academic studies I have read. With no disrespect intended, reading only the methodology sections you’ll notice that the smaller the dataset, the better the reported accuracy tends to be. See below the code snippets I’ve used to run four algorithms in total: Logistic Regression Classifier, Decision Tree Classifier, Random Forest Classifier and XGB Classifier. The last model, untuned in its parameters, performed the best with 76% accuracy on the 30% held-out test dataset.

# models
import matplotlib.pyplot as plt
import pandas as pd
import xgboost as xgb
from IPython.core.interactiveshell import InteractiveShell
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import ConfusionMatrixDisplay
from sklearn.metrics import accuracy_score
from sklearn.metrics import average_precision_score
from sklearn.metrics import log_loss
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.tree import DecisionTreeClassifier
from yellowbrick.classifier import ConfusionMatrix

# Notebook setting: display the result of every expression in a cell
InteractiveShell.ast_node_interactivity = "all"

# One row of scores per model, filled in by the loop below
df_scores = pd.DataFrame(columns=['Trainset Accuracy', 'CV Accuracy',
                                  'Testset Accuracy', 'Testset Logloss'])
df_scores.index.name = "Model"

# individual models

# class_weight must be None, 'balanced' or a dict; a bare 0.5 raises an error.
# The classes are balanced here, so None is used.
m_log = LogisticRegression(C=0.8, class_weight=None, max_iter=2000, random_state=42)

m_dec = DecisionTreeClassifier(max_depth=4,
                               min_samples_leaf=5, random_state=42)

forest = RandomForestClassifier(n_estimators=50, max_depth=6, min_samples_leaf=6,
                                min_samples_split=50, random_state=42)

# random_state replaces the older seed argument of the scikit-learn wrapper
xgb_clf = xgb.XGBClassifier(objective='binary:logistic', n_estimators=200, random_state=42,
                            base_score=0.5, booster='gbtree', colsample_bylevel=1,
                            colsample_bynode=1, colsample_bytree=1, gamma=0,
                            learning_rate=0.1, max_delta_step=0, max_depth=2,
                            n_jobs=-1, reg_alpha=0, reg_lambda=1, subsample=0.7)

models = [m_log, m_dec, forest, xgb_clf]

# model_runner_* are helper functions that are not shown in this article;
# X_train, X_test, y_train and y_test come from the train/test split.
functions = [model_runner_log, model_runner_dec,
             model_runner_forest, model_runner_xgb]

for m, f in zip(models, functions):
    train_accuracy, cv_accuracy, test_accuracy, test_logloss, model_name = f(m, X_train, X_test, y_train, y_test)
    # .loc adds a new row keyed by the model name (.at needs a row and a column)
    df_scores.loc[model_name] = [train_accuracy, cv_accuracy, test_accuracy, test_logloss]
    # plot_confusion_matrix was removed in scikit-learn 1.2; this is its replacement
    ConfusionMatrixDisplay.from_estimator(m, X_test, y_test, normalize='true')
    plt.title('Fake News Confusion Matrix {a}'.format(a=model_name), fontsize=10, pad=10)
    plt.savefig('Fake News Confusion Matrix {a}.png'.format(a=model_name), format='png', dpi=1000)
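
The model_runner_* helpers are not shown in this article. Judging only by the five values they return, one of them might look roughly like the sketch below. The generic model_runner function, the 5-fold cross-validation and the synthetic demo data are all assumptions of mine, not the original implementation.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, log_loss
from sklearn.model_selection import cross_val_score, train_test_split


def model_runner(model, X_train, X_test, y_train, y_test, cv=5):
    """Fit a classifier and return train/CV/test accuracy, log loss and its name."""
    model.fit(X_train, y_train)
    train_accuracy = accuracy_score(y_train, model.predict(X_train))
    # cross_val_score clones the model, so the fitted one above is left untouched
    cv_accuracy = cross_val_score(model, X_train, y_train,
                                  cv=cv, scoring="accuracy").mean()
    test_accuracy = accuracy_score(y_test, model.predict(X_test))
    test_logloss = log_loss(y_test, model.predict_proba(X_test))
    return train_accuracy, cv_accuracy, test_accuracy, test_logloss, type(model).__name__


# Demo on synthetic data standing in for the article's features.
X, y = make_classification(n_samples=300, n_features=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)
scores = model_runner(LogisticRegression(max_iter=2000),
                      X_train, X_test, y_train, y_test)
print(scores)
```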

I learned the hard way that detecting fake news out there in the wild is actually a pretty difficult undertaking. Of course, my accuracy only reflects the linguistic features I’ve manually extracted and the word vectors from spaCy. Going above and beyond by deploying n-gram models, TF-IDF vectorisation techniques, more advanced neural nets and exhaustive hyperparameter tuning on my algorithms would probably have catapulted me above the 80% accuracy mark. But that is for another day, as my article is already longer than the recommended 5-minute read.


Examining the confusion matrix of our winning algorithm, the XGB Classifier, reveals a problem that was symptomatic of all tested models: it is much easier to detect real news than it is to detect fake news. 86% of the real news instances were correctly predicted as real news, and only 14% were wrongly flagged as fake (false positives). As an aside, it is important to note that we want to see more yellow and dark violet than any other colour in the confusion matrix plot.

But the picture looks very different from what we would like to see for fake news. Unfortunately, our algorithm only correctly spotted 66% of the fake news articles, while a staggering 34% of them slipped through and were classified as real news (false negatives), illustrating where the biggest challenges lie. Much of fake news is so well disguised that even more potent models are needed to draw the line.
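
As a small worked check, the per-class rates of a row-normalised confusion matrix can be combined into the overall accuracy. The sketch assumes the test set is split evenly between real and fake news, as the full dataset is.

```python
import numpy as np

# Rows: true class (real, fake); columns: predicted class (real, fake).
# Each row sums to 1 because the matrix is normalised over the true labels.
cm_normalized = np.array([
    [0.86, 0.14],   # real news: 86% correct, 14% flagged as fake
    [0.34, 0.66],   # fake news: 34% missed, 66% correctly spotted
])

# Assumed class shares in the test set (balanced data).
class_share = np.array([0.5, 0.5])

# Overall accuracy = weighted average of the per-class recall values.
accuracy = float(np.sum(class_share * np.diag(cm_normalized)))
print(round(accuracy, 2))
```

Under that assumption the two recall values average out to the 76% test accuracy reported for the XGB Classifier.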

Another possibility, which I cannot verify, is that satire pieces are mixed in with the 20,000 articles, throwing most of the selected algorithms off the path of more accurate detection. Whatever the case, I’ve learned a lot about fake news in this journey through academic papers and real-world data.

Indeed, as internet and social media usage becomes a reality for all corners of the globe, the spread of fake news will also become a more damaging and pervasive problem. Hundreds of millions of people are already connected on massive networks on social media platforms. Moreover, though these platforms have mechanisms for moderating posted content, there is still ample room for the spread of harmful content, which incentivises the spread of fake news for political or economic gain.

In this article, you have learned some of the techniques that are leveraged for the detection of such content. In a subsequent follow-up article I will explore more methods and contrast them with the ones employed here, possibly as a more advanced extension. Hopefully, I’ll be able to write more beginner-friendly posts in the future as well.

Feel free to let me know what your thoughts are. I would appreciate it if you subscribed to my blog newsletter or followed my Medium account (https://warsamewords.medium.com/). In case you want to connect with me on social media, my Twitter handle is @warsame_words. I welcome feedback and constructive criticism; for the latter, LinkedIn is a welcome avenue. Thank you for accompanying me on this journey.


References

  1. Ahmed et al. Detecting Fake News using Machine Learning: A Systematic Literature Review 2020. Cornell University, Computers and Society. arXiv:2102.04458 [cs.CY]

  2. Ahmed, H., Traore, I., & Saad, S. (2017). Detection of online fake news using n-gram analysis and machine learning techniques. Proceedings of the International Conference on Intelligent, Secure, and Dependable Systems in Distributed and Cloud Environments, 127–138, Springer, Vancouver, Canada, 2017. https://doi.org/10.1007/978-3-319-69155-8_9

  3. Shu, Sliva, Wang et al. (2017). Fake News Detection on Social Media: A Data Mining Perspective. https://dl.acm.org/newsletter/sigkdd

  4. Reis, J. C. S., Correia, A., Murai, F., Veloso, A., & Benevenuto, F. (2019). Supervised Learning for Fake News Detection. IEEE Intelligent Systems, vol. 34, no. 2, pp. 76–81, March-April 2019. doi: 10.1109/MIS.2019.2899143

  5. Agarwal, R. Make Your Pandas Apply Faster using Parallel Processing. https://towardsdatascience.com/make-your-own-super-pandas-using-multiproc-1c04f41944a1

  6. Zhou, Zafarani, Shu (2019). The Role of User Profiles for Fake News Detection. ASONAM ’19: Proceedings of the 2019 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining, August 2019, pages 436–439. https://doi.org/10.1145/3341161.3342927

  7. Giachanou, A., & Rosso, P. (2020). The Battle Against Online Harmful Information: The Cases of Fake News and Hate Speech.

