Friday, May 13, 2022

Machine Learning in Practice: Shinkansen Travel Experience Hackathon



 
I had been wanting to practice data science for a while but kept putting it off, telling myself it required too much effort and that I was too busy. Over the last few weeks, I finally began my journey, thanks to the MIT Data Science Program and Great Learning. It was tough to begin, but very satisfying. If you are in a similar position, "just do it" and get started; the time will find itself.

A couple of weeks ago, I participated in a data science hackathon to test the concepts and Python knowledge I had recently learnt. At first, the thought of participating made me uncomfortable: 72 hours to solve a problem alongside my day job. Could I do it? I am a novice; could I really compete and complete it? I would never know if I never tried.

It was a great experience to apply the data science techniques I had learnt to a real-world problem. The objective was to predict whether a passenger was satisfied or not (a classification problem), based on their overall experience on the Shinkansen bullet train.

We were given train and test data separately. There were 24 variables to consider, ranging from customer demographic information to travel type and class, arrival and departure delays, seat comfort, catering, onboard Wi-Fi and entertainment, ease of online booking, legroom and check-in experience, to name a few.

I took the following approach:

Treatment of missing / null values: Given the number of missing values, dropping the affected records altogether was not a viable option. I adopted the following principles:
Numerical values: Replaced missing values with the mean (average) of the variable, in both the train and test sets
Categorical values: Used a combination of replacing missing values with the most frequent value and using models to predict the value
This gave a cleaner, more complete data set.
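
The simpler half of this step (mean and most-frequent-value filling) can be sketched roughly as follows. This is a minimal illustration rather than the exact code I used: the column names and values are made up, and it assumes the fill values are computed from the train set and reused on the test set. The model-based prediction of missing categories is not shown.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; column names and values are illustrative only.
train = pd.DataFrame({
    "Arrival_Delay": [5.0, np.nan, 15.0, 0.0],
    "Seat_Comfort": ["Good", np.nan, "Good", "Poor"],
})
test = pd.DataFrame({
    "Arrival_Delay": [np.nan, 10.0],
    "Seat_Comfort": [np.nan, "Poor"],
})


def impute(train_df, test_df):
    """Fill numeric gaps with the train mean and categorical gaps with the train mode."""
    train_out, test_out = train_df.copy(), test_df.copy()
    for col in train_df.columns:
        if pd.api.types.is_numeric_dtype(train_df[col]):
            fill = train_df[col].mean()  # NaN values are skipped by default
        else:
            fill = train_df[col].mode(dropna=True).iloc[0]  # most frequent value
        # Assumption: the same train-derived value fills both data sets.
        train_out[col] = train_out[col].fillna(fill)
        test_out[col] = test_out[col].fillna(fill)
    return train_out, test_out


train_filled, test_filled = impute(train, test)
print(test_filled)
```

Computing the fill values on the train set only is a design choice in this sketch; it keeps information from the test set out of the preparation step.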

Treating mismatches between the train and test data sets: For some variables, the train data set contained a rating category that did not appear in the test data set. For example, some customers had rated Cleanliness as "Extremely Poor" in the train data set, while the closest category in the test data set was "Poor". The train data set had to be adjusted to map such feedback to the appropriate category.
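
A small sketch of how such a mismatch can be detected and folded into an existing category. The label strings and the mapping below are assumptions made for illustration; the actual labels in the hackathon data may have differed.

```python
import pandas as pd

# Hypothetical ratings; the exact label strings are assumptions.
train = pd.DataFrame({"Cleanliness": ["Extremely Poor", "Poor", "Good", "Excellent"]})
test = pd.DataFrame({"Cleanliness": ["Poor", "Good", "Excellent"]})

# Categories that appear in train but never in test.
extra = set(train["Cleanliness"].unique()) - set(test["Cleanliness"].unique())
print(extra)

# Hypothetical mapping: fold the train-only label into its nearest category.
label_map = {"Extremely Poor": "Poor"}
train["Cleanliness"] = train["Cleanliness"].replace(label_map)

# After the mapping, no train-only categories should remain.
remaining = set(train["Cleanliness"].unique()) - set(test["Cleanliness"].unique())
print(remaining)
```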

Exploratory Data Analysis (EDA): I used EDA to visualize the data set and better understand the variables and their impact on overall customer experience. This uncovered key relationships between some variables and experience: for example, onboard entertainment, seat comfort, cleanliness and online support had a largely positive impact on customer experience, while travel class (economy), inconvenient arrival time and catering had a negative impact.
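
One simple way to look at such relationships, shown here on made-up data, is to compare the share of satisfied passengers across the levels of a variable. The column names and the 0/1 target encoding are assumptions for this sketch, not the hackathon's actual schema.

```python
import pandas as pd

# Hypothetical toy data: 1 = satisfied, 0 = not satisfied.
df = pd.DataFrame({
    "Seat_Comfort": ["Good", "Good", "Poor", "Poor", "Good", "Poor"],
    "Satisfied": [1, 1, 0, 0, 1, 1],
})

# Share of satisfied passengers within each rating level, highest first.
rate = (
    df.groupby("Seat_Comfort")["Satisfied"]
    .mean()
    .sort_values(ascending=False)
)
print(rate)
```

A large gap between levels suggests the variable is worth keeping an eye on during modelling; it does not by itself show cause and effect.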

Data preparation: I then encoded the categorical variables and scaled the data set in preparation for applying different algorithms and testing their accuracy.
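
As an illustration of this preparation step, here is a minimal scikit-learn sketch. It assumes one-hot encoding and standard scaling, which are common choices but not necessarily the ones I used; the column names are hypothetical.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy data; column names are illustrative only.
X_train = pd.DataFrame({
    "Travel_Class": ["Eco", "Business", "Eco", "Business"],
    "Arrival_Delay": [0.0, 10.0, 20.0, 30.0],
})
X_test = pd.DataFrame({"Travel_Class": ["Eco"], "Arrival_Delay": [15.0]})

preprocess = ColumnTransformer(
    [
        # One 0/1 column per category; unseen test categories become all zeros.
        ("cat", OneHotEncoder(handle_unknown="ignore"), ["Travel_Class"]),
        # Rescale numeric columns to zero mean and unit variance.
        ("num", StandardScaler(), ["Arrival_Delay"]),
    ],
    sparse_threshold=0,  # always return a dense array
)

# Fit on train only, then apply the same transformation to test.
X_train_ready = preprocess.fit_transform(X_train)
X_test_ready = preprocess.transform(X_test)
print(X_train_ready.shape, X_test_ready.shape)
```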

Applying different machine learning algorithms: The next step was to try the following algorithms:
- Linear Discriminant Analysis (LDA)
- Quadratic Discriminant Analysis (QDA)
- Logistic Regression 
- K-Nearest Neighbors
- Decision Tree
- Random Forest
- XGBoost
- Naive Bayes
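
A rough sketch of how several of these algorithms can be compared side by side. It runs on synthetic data from scikit-learn, not the hackathon data, and uses 5-fold cross-validation as an assumed way of scoring. XGBoost is left out because it lives in a separate package; Gaussian Naive Bayes is an assumed variant of Naive Bayes.

```python
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import (
    LinearDiscriminantAnalysis,
    QuadraticDiscriminantAnalysis,
)
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the prepared (encoded and scaled) data set.
X, y = make_classification(
    n_samples=300, n_features=8, n_informative=5, n_redundant=0, random_state=0
)

models = {
    "LDA": LinearDiscriminantAnalysis(),
    "QDA": QuadraticDiscriminantAnalysis(),
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "K-Nearest Neighbors": KNeighborsClassifier(),
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "Random Forest": RandomForestClassifier(random_state=0),
    "Naive Bayes": GaussianNB(),
}

# Mean accuracy over 5 cross-validation folds for each model.
scores = {
    name: cross_val_score(model, X, y, cv=5, scoring="accuracy").mean()
    for name, model in models.items()
}

for name, score in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: {score:.3f}")
```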

For each of these algorithms, I followed an iterative approach: first train the model on the train data set, apply it to the test data set, measure performance, and then tune the hyperparameters to further improve prediction accuracy. The XGBoost and Random Forest models delivered the highest accuracy (95.33%).
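
The train, measure and tune loop can be sketched with a grid search, shown here for Random Forest on synthetic data. The parameter grid, the held-out validation split and the data are all assumptions for illustration; they are not the settings behind the accuracy quoted above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in for the prepared data set.
X, y = make_classification(
    n_samples=300, n_features=8, n_informative=5, n_redundant=0, random_state=0
)

# Hold out a labelled validation set to measure performance after tuning.
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# Hypothetical grid: every combination is trained and scored by cross-validation.
param_grid = {"n_estimators": [50, 100], "max_depth": [3, None]}
search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid,
    cv=3,
    scoring="accuracy",
)
search.fit(X_train, y_train)

# The best combination is refit on all of X_train and used for prediction.
val_accuracy = accuracy_score(y_val, search.predict(X_val))
print(search.best_params_, round(val_accuracy, 3))
```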

I found the experience very enriching and fulfilling. 

Continue to learn and hone your skills, and keep reminding yourself to “just do it” every step of the way.