Analysis of a fake news dataset with Machine Learning

euvsdisinfo.eu is the European Union's East StratCom spin-off monitoring the ongoing disinformation campaigns which aim to destabilise the European Union, NATO and Western democracies.

They publish high-quality newsletters with disinformation reviews, techniques, fake news and disproofs.

With machine learning services, a couple of Python scripts and Elasticsearch, it is quite straightforward to collect a fake news dataset from the euvsdisinfo.eu newsletters and set up a simple analytical dashboard.

Note: The dataset contains both fake news and disproofs, and they are not tagged.

Dashboard

The final result with Kibana and Elasticsearch looks something like this.

We collected 803 news articles (a few months ago; by now there should be many more), automatically extracting:

  • name of the author
  • language
  • article content
  • translation of the article to English
  • relevant terms (named entity recognition)
  • sentiment
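Before indexing, each collected article can be represented as a simple document holding the fields above. The field names and values below are illustrative assumptions, not the exact schema used in the project.

```python
# Illustrative document structure for one collected article.
# Field names and values are assumptions, not the project's exact schema.
article = {
    "author": "Unknown",
    "language": "ru",
    "content": "...",                        # original article text
    "content_en": "...",                     # English translation
    "entities": ["NATO", "European Union"],  # named entity recognition
    "sentiment": -0.4,                       # e.g. a score in [-1, 1]
}

# Sanity check: all expected fields are present.
required = {"author", "language", "content", "content_en", "entities", "sentiment"}
assert required.issubset(article)
```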

Analysis of the fake news dataset

Alternatively, the fake news can also be analysed with a Jupyter notebook; here is the result of a quick draft analysis.

PDF draft of analysis of a fake news dataset.

Procedure

In order to populate the fake news dataset and dashboard, these were the steps, fully automated with Python scripts:

  1. Scrape the euvsdisinfo.eu PDF newsletters from the Internet
  2. Extract the links from each PDF newsletter
  3. Scrape the article content with https://www.diffbot.com, removing the menu and noisy content
  4. Translate the non-English articles to English with the Microsoft Translator API or the Google Cloud Translation API
  5. Apply named entity recognition and sentiment analysis with the Google Cloud Natural Language API
  6. Upload the news to Elasticsearch and create a Kibana dashboard to visualise the result
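Step 2 can be sketched with the standard library alone: once the text has been extracted from a PDF newsletter (with whatever PDF tool is at hand; the extraction library is not specified here), the article links can be pulled out with a regular expression. The function name and pattern are illustrative, not the project's actual code.

```python
import re

def extract_links(newsletter_text: str) -> list[str]:
    """Return the unique http(s) links found in extracted newsletter text."""
    # A deliberately simple pattern: stop at whitespace or closing punctuation.
    pattern = re.compile(r"https?://[^\s)>\]\"']+")
    links = pattern.findall(newsletter_text)
    # Remove duplicates while preserving order of first appearance.
    return list(dict.fromkeys(links))

sample = (
    "Disinformation review\n"
    "Read more: https://example.com/article-1 and "
    "https://example.com/article-2 (https://example.com/article-1)"
)
print(extract_links(sample))
# ['https://example.com/article-1', 'https://example.com/article-2']
```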

Code

The code is available on GitHub: https://github.com/melphi/fakenews-analysis. The folders are:

  • Python: the Python scripts to scrape, process and load the news
  • Dataset: the news scraped and processed, exported as JSON text files
  • Resources: the newsletters downloaded from euvsdisinfo.eu
  • Jupyter: a Jupyter notebook
  • Kibana: the Kibana dashboard

Considerations

  • The translation to English was the most expensive step; the cost is about $10-15 per million characters. Both the Microsoft Translator API and the Google Translation API performed exceptionally well.
  • The article extraction with Diffbot can be replaced with a combination of the Readability library and the CLD2 library, both open source.
  • The sentiment analysis did not produce significant results, but it might be interesting in specific applications.
  • Overall, the cloud APIs already available over the Internet can produce very accurate results and allow setting up a prototype application very quickly.
  • For large-scale applications the cloud API costs can be an issue, and alternative solutions can be developed.
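As a back-of-the-envelope check on the translation cost: assuming an average article length of 3,000 characters (an illustrative figure, not measured from the dataset), translating all 803 collected articles at the stated rate would cost roughly the following. This is an upper bound, since English articles need no translation.

```python
ARTICLES = 803
AVG_CHARS = 3_000                # assumed average article length (illustrative)
PRICE_LOW, PRICE_HIGH = 10, 15   # $ per million characters, as stated above

total_chars = ARTICLES * AVG_CHARS
cost_low = total_chars / 1_000_000 * PRICE_LOW
cost_high = total_chars / 1_000_000 * PRICE_HIGH
print(f"{total_chars:,} characters -> about ${cost_low:.0f} to ${cost_high:.0f}")
# 2,409,000 characters -> about $24 to $36
```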