Analysis of a fake news dataset with Machine Learning

The data source is the European Union's East StratCom spin-off, which monitors the ongoing disinformation campaigns that aim to destabilise the European Union, NATO and western democracies.

They publish high-quality newsletters with disinformation reviews, techniques, fake news and disproofs.

With machine learning services, a couple of Python scripts and Elasticsearch, it is quite straightforward to collect a fake news dataset from the newsletters and set up a simple analytical dashboard.

Note: the dataset contains both fake news and disproofs, and they are not tagged.


The final result with Kibana and Elasticsearch looks something like this.

We collected 803 news articles (that was a few months ago; there should be many more by now), automatically extracting:

  • name of the author
  • language
  • article content
  • translation of the article to English
  • relevant terms (named entity recognition)
  • sentiment
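
Each extracted article can be represented as a flat JSON document. A minimal illustrative example, where the field names and values are assumptions for demonstration (the actual dataset on GitHub defines the real schema):

```python
import json

# Illustrative example of one extracted document; the fields mirror the
# list above, the values are invented for demonstration.
doc = {
    "author": "Jane Doe",                          # name of the author
    "language": "ru",                              # source language
    "content": "Оригинальный текст статьи ...",    # article content
    "content_en": "Original article text ...",     # English translation
    "entities": ["NATO", "European Union"],        # named entity recognition
    "sentiment": -0.4,                             # sentiment score in [-1, 1]
}

print(json.dumps(doc, ensure_ascii=False, indent=2))
```

A flat document like this maps directly onto an Elasticsearch index, which keeps the Kibana dashboard simple.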

[Image: analysis of the fake news dataset]

Alternatively, the fake news can also be analysed in a Jupyter notebook; here is a quick draft analysis.

PDF draft of the analysis of the fake news dataset.


In order to populate the fake news dataset and the dashboard, these were the steps, fully automated with Python scripts:

  1. Scrape the PDF newsletters from the Internet
  2. Extract the links from each PDF newsletter
  3. Scrape the article content with Diffbot, removing the menus and noisy content
  4. Translate the non-English articles to English with the Microsoft Translator API or the Google Cloud Translation API
  5. Apply named entity recognition and sentiment analysis with the Google Cloud Natural Language API
  6. Upload the news to Elasticsearch and create a Kibana dashboard to visualise the result
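
Step 2 above can be sketched with the standard library alone, once the PDF text has been extracted (e.g. with pdfminer). The regex below is a simplification of what the actual scripts may do, and the sample text is invented:

```python
import re

def extract_links(newsletter_text):
    """Extract http(s) links from text extracted out of a PDF newsletter."""
    url_pattern = re.compile(r"https?://[^\s)\]>\"']+")
    # Strip trailing punctuation that often clings to URLs in extracted text
    return [u.rstrip(".,;") for u in url_pattern.findall(newsletter_text)]

# Invented sample text standing in for one newsletter page
sample = (
    "Disinformation review: see https://example.com/fake-article-1, "
    "and the disproof at https://example.org/disproof."
)
print(extract_links(sample))
# → ['https://example.com/fake-article-1', 'https://example.org/disproof']
```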


The code is available on GitHub. The folders are:

  • Python: the Python scripts to scrape, process and load the news
  • Dataset: the news scraped and processed, exported as JSON text files
  • Resources: the downloaded newsletters
  • Jupyter: a Jupyter notebook
  • kibana: the Kibana dashboard
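
Loading the exported JSON files into Elasticsearch (step 6) can be done with the bulk API. A minimal standard-library sketch that builds the NDJSON request body, where the index name and field names are illustrative:

```python
import json

def build_bulk_body(docs, index="fake-news"):
    """Build an Elasticsearch _bulk request body (NDJSON) from documents.

    The "fake-news" index name is an assumption. The returned body can be
    POSTed to http://localhost:9200/_bulk with
    Content-Type: application/x-ndjson.
    """
    lines = []
    for doc in docs:
        lines.append(json.dumps({"index": {"_index": index}}))  # action line
        lines.append(json.dumps(doc, ensure_ascii=False))       # document line
    return "\n".join(lines) + "\n"  # a bulk body must end with a newline

# Invented documents for demonstration
docs = [
    {"author": "Jane Doe", "language": "ru", "sentiment": -0.4},
    {"author": "John Smith", "language": "de", "sentiment": 0.1},
]
print(build_bulk_body(docs))
```

The official Python Elasticsearch client offers `helpers.bulk` for the same purpose; building the body by hand just makes the wire format explicit.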


  • The translation to English was the most expensive step: the cost is about $10-15 per million characters. Both the Microsoft Translator API and the Google Translation API performed exceptionally well.
  • The article extraction with Diffbot can be replaced with a combination of the Readability library and the CLD2 library, both open source.
  • The sentiment analysis did not produce significant results, but it might be interesting in specific applications.
  • Overall, the cloud APIs already available over the Internet can produce very accurate results and allow a prototype application to be set up very quickly.
  • For large-scale applications the cloud API costs can be an issue, and alternative solutions can be developed.
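
The translation cost above can be turned into a back-of-the-envelope estimate for the 803 collected articles. The average article length is an assumption for illustration:

```python
# Back-of-the-envelope translation cost for the collected articles.
n_articles = 803
avg_chars = 3000                  # assumed average article length in characters
cost_per_million = (10, 15)       # $10-15 per million characters (see above)

total_chars = n_articles * avg_chars
low = total_chars / 1_000_000 * cost_per_million[0]
high = total_chars / 1_000_000 * cost_per_million[1]
print(f"~{total_chars:,} characters -> roughly ${low:.2f}-${high:.2f}")
```

Under these assumptions the whole dataset translates for a few tens of dollars, which matches the observation that translation dominated the cost.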