This project is part of the Udacity Data Scientist Nanodegree Program: the Disaster Response Pipeline Project. The goal was to apply the data engineering skills learned in the course to analyze disaster data from Figure Eight and build a model for an API that classifies disaster messages. As always, let’s apply the CRISP-DM process (Cross-Industry Standard Process for Data Mining) to tackle the problem:
Business Understanding
Data Understanding
Prepare Data
Data Modeling
Evaluate the Results
Deploy
Business Understanding
During and immediately after a natural disaster there are millions of communications to disaster response organizations, either direct or through social media. Disaster response organizations have to filter and pull out the most important messages from this huge volume of communication and redirect specific requests or indications to the proper organization that takes care of medical aid, water, logistics, etc. Every second is vital in these situations, so handling each message correctly is key.
The project is divided into three sections:
Data Processing: build an ETL (Extract, Transform, and Load) Pipeline to extract data from the given dataset, clean the data, and then store it in a SQLite database
Machine Learning Pipeline: split the data into a training set and a test set. Then, create a machine learning pipeline that uses NLTK, as well as scikit-learn’s Pipeline and GridSearchCV, to output a final model that predicts message classifications for the 36 categories (multi-output classification)
Web development: develop a web application to classify messages in real time
Data Understanding
The dataset provided by Figure Eight contains 30,000 messages drawn from events including an earthquake in Haiti in 2010, an earthquake in Chile in 2010, floods in Pakistan in 2010, Superstorm Sandy in the U.S.A. in 2012, and news articles spanning a large number of years and hundreds of different disasters. The messages have been classified into 36 different categories related to disaster response and have been stripped of sensitive information in their entirety. A translation from the original language to English has also been provided. More information about the dataset here.
Prepare Data
The dataset provided is composed of two files:
disaster_categories.csv: Categories of the messages
disaster_messages.csv: Multilingual disaster response messages
Data preparation steps:
Merge the two datasets
Split categories into separate category columns
One-hot encode the categories
Remove duplicates
Upload to SQLite database
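For reference, a minimal sketch of these preparation steps with pandas and SQLAlchemy could look like the following; the file names match the two CSVs above, while the database and table names and the exact layout of the categories string (e.g. "related-1;request-0;...") are assumptions based on the Figure Eight format.

```python
import pandas as pd
from sqlalchemy import create_engine

# Merge the two datasets on their shared id column
messages = pd.read_csv("disaster_messages.csv")
categories = pd.read_csv("disaster_categories.csv")
df = messages.merge(categories, on="id")

# Split the single "categories" string into 36 separate category columns
categories = df["categories"].str.split(";", expand=True)
categories.columns = categories.iloc[0].str.split("-").str[0]

# One-hot encode: keep only the trailing 0/1 of each value
for column in categories.columns:
    categories[column] = categories[column].str[-1].astype(int)

# Replace the original column, then remove duplicates
df = pd.concat([df.drop(columns="categories"), categories], axis=1)
df = df.drop_duplicates()

# Upload to a SQLite database (database and table names are assumptions)
engine = create_engine("sqlite:///DisasterResponse.db")
df.to_sql("messages", engine, index=False, if_exists="replace")
```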
Data Modeling
Now we will use the data to train a model that takes in the message column as input and outputs classification results on the other 36 categories in the dataset:
Load the data
Create a ML Pipeline
Train the ML Pipeline
Test the model
Tune the model
Evaluate the results
Export the model
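As a sketch of the first step, the cleaned data can be read back from the SQLite database and split into a training and a test set; the database and table names follow the ETL sketch above, and the assumption that the category columns start at the fifth column is mine.

```python
import pandas as pd
from sqlalchemy import create_engine
from sklearn.model_selection import train_test_split

# Load the cleaned data produced by the ETL pipeline
engine = create_engine("sqlite:///DisasterResponse.db")
df = pd.read_sql_table("messages", engine)

X = df["message"]       # input: the message text
Y = df.iloc[:, 4:]      # output: the 36 category columns (assumed to start at column 4)

X_train, X_test, Y_train, Y_test = train_test_split(
    X, Y, test_size=0.2, random_state=42
)
```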
The components used in the pipeline are:
CountVectorizer: Convert a collection of text documents to a matrix of token counts
TfidfTransformer: Transform a count matrix to a tf-idf (term-frequency times inverse document-frequency) representation
MultiOutputClassifier: Multi target classification
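Put together, a minimal version of such a pipeline could look like this; the tokenize function and the choice of RandomForestClassifier as the base estimator are assumptions, not necessarily the exact configuration used in the project.

```python
import nltk
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.multioutput import MultiOutputClassifier
from sklearn.ensemble import RandomForestClassifier

nltk.download(["punkt", "wordnet"])

def tokenize(text):
    """Normalize, tokenize and lemmatize a message with NLTK."""
    lemmatizer = WordNetLemmatizer()
    return [lemmatizer.lemmatize(token) for token in word_tokenize(text.lower())]

pipeline = Pipeline([
    ("vect", CountVectorizer(tokenizer=tokenize)),              # text -> token counts
    ("tfidf", TfidfTransformer()),                              # counts -> tf-idf weights
    ("clf", MultiOutputClassifier(RandomForestClassifier())),   # one classifier per category
])

# pipeline.fit(X_train, Y_train)  # X_train and Y_train come from the split above
```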
As already mentioned, we then used GridSearchCV to perform an exhaustive search over specified parameter values for our estimator.
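A hedged example of what that tuning step could look like, building on the pipeline sketch above; the parameter grid itself is an assumption, not the exact one used in the project.

```python
from sklearn.model_selection import GridSearchCV

parameters = {
    "vect__ngram_range": [(1, 1), (1, 2)],           # unigrams vs. unigrams + bigrams
    "clf__estimator__n_estimators": [50, 100],       # trees in each random forest
    "clf__estimator__min_samples_split": [2, 4],
}

cv = GridSearchCV(pipeline, param_grid=parameters, cv=3, verbose=2)
# cv.fit(X_train, Y_train)   # the tuned model is then available as cv.best_estimator_
```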
TF-IDF is a numerical statistic that is intended to reflect how important a word is to a document in a collection. Check this fantastic Quora article for more information.
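To make the intuition concrete, here is a toy calculation using the textbook tf-idf formula (scikit-learn's TfidfTransformer uses a smoothed variant, so the exact numbers differ): a word that is frequent in one message but rare across the corpus gets a high weight.

```python
import math

n_documents = 1000   # messages in the corpus
tf = 3               # "water" appears 3 times in this message...
df = 20              # ...but in only 20 of the 1000 messages

idf = math.log(n_documents / df)   # rarer across the corpus -> larger idf
print(tf * idf)                    # tf-idf weight, roughly 11.7
```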
The model is finally saved so that it can be loaded later and used for real-time message classification.
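A sketch of that export step, assuming the model is serialized with pickle under the file name classifier.pkl; the project may use a different name or joblib instead.

```python
import pickle

# Save the tuned model so the web app can load it later
with open("classifier.pkl", "wb") as f:
    pickle.dump(cv.best_estimator_, f)

# Later, in the web app:
# with open("classifier.pkl", "rb") as f:
#     model = pickle.load(f)
```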
Evaluate the Results
The dataset is highly imbalanced, which is why the accuracy is high while the recall value is pretty low.
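One way to see this per category is scikit-learn's classification_report, which prints precision, recall and f1-score for every one of the 36 outputs; the variable names below follow the earlier sketches.

```python
from sklearn.metrics import classification_report

Y_pred = cv.best_estimator_.predict(X_test)

for i, category in enumerate(Y_test.columns):
    print(category)
    print(classification_report(Y_test.iloc[:, i], Y_pred[:, i]))
```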
There are many ways to tackle an imbalanced dataset, as shown in this really interesting Medium post.
Deploy
A Dash application has been developed as the user interface: it is possible to submit a message for classification and get an overview of some information about the training dataset.
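A minimal sketch of what such an interface could look like with Dash is shown below; the layout, component ids and model file name are assumptions that only illustrate the idea, not the actual application.

```python
import pickle

import pandas as pd
from dash import Dash, html, dcc, Input, Output, State
from sqlalchemy import create_engine

# Load the trained model and the category names saved earlier
with open("classifier.pkl", "rb") as f:
    model = pickle.load(f)
engine = create_engine("sqlite:///DisasterResponse.db")
category_names = pd.read_sql_table("messages", engine).columns[4:]

app = Dash(__name__)
app.layout = html.Div([
    dcc.Input(id="message", type="text", placeholder="Enter a disaster message"),
    html.Button("Classify", id="classify"),
    html.Ul(id="categories"),
])

@app.callback(
    Output("categories", "children"),
    Input("classify", "n_clicks"),
    State("message", "value"),
    prevent_initial_call=True,
)
def classify(n_clicks, message):
    # Predict the 36 category flags and list only the positive ones
    prediction = model.predict([message])[0]
    return [html.Li(cat) for cat, flag in zip(category_names, prediction) if flag == 1]

if __name__ == "__main__":
    app.run(debug=True)
```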
Outro
I hope the post was interesting and thank you for taking the time to read it. The code for this project can be found in this GitHub repository, on my Medium you can find a more in-depth story, and on my Blogspot you can find the same post in Italian. Let me know if you have any questions, and if you like the content that I create feel free to buy me a coffee.