Flight Status Predictor
Unit 2 Build Project
Wouldn’t it be great to predict your FLIGHT STATUS before booking your next vacation?
Summary
I did some research and found flight data for over 2,000,000 domestic US flights from 2019 and 2020. After completing Exploratory Data Analysis (EDA) and predictive modeling, I found that the main problem you run into is data leakage. Another issue was the size of the class I am trying to predict: it makes up only 20% of the dataset, so we have an imbalanced dataset.
From our model we found a couple of factors that help us predict FLIGHT STATUS before booking the next flight. What this means is that our model can predict whether a flight will be ON-TIME or DELAYED with 64.5% accuracy. With this knowledge we can plan our next vacation with less of a headache and no longer fear being stuck at the airport.
About this Data
In the first two months of 2020, a whopping 18.9% of all domestic flights in the United States were delayed, according to the Bureau of Transportation Statistics (BTS). The collected data is publicly available online (https://www.kaggle.com/divyansh22/flight-delay-prediction). The features were manually chosen for a preliminary time series analysis. Since the data covers only January and February of 2019 and 2020, it could well be used to predict flight delays at the destination airport for those same months in upcoming years.
Splitting Data into Time Series Split
First, we will split the data into train/validation/test sets. The training set includes 1,000,000 records of flight information from January and February 2019; we will use it to fit our models. The validation set includes 600,000 records from January 2020; we will use it to compare different models and find a suitable one. Finally, the test set of 500,000 records will be held out for a final evaluation after the model has been fit on the training data and tuned against the validation data.
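A chronological split like this can be sketched with pandas. The YEAR and MONTH column names below are assumptions about the dataset's schema, and the tiny demo frame merely stands in for the 2,000,000-row dataset:

```python
import pandas as pd

def time_series_split(df):
    """Split flights chronologically: train on 2019 flights, validate on
    January 2020, and hold out February 2020 for the final test.
    YEAR and MONTH column names are assumed; adjust to your schema."""
    train = df[df["YEAR"] == 2019]
    val = df[(df["YEAR"] == 2020) & (df["MONTH"] == 1)]
    test = df[(df["YEAR"] == 2020) & (df["MONTH"] == 2)]
    return train, val, test

# Tiny demo frame standing in for the real dataset
flights = pd.DataFrame({
    "YEAR":  [2019, 2019, 2019, 2020, 2020, 2020],
    "MONTH": [1, 1, 2, 1, 1, 2],
})
train, val, test = time_series_split(flights)
```

Splitting by time rather than randomly ensures the model is always evaluated on flights that occur after the ones it was trained on, which mirrors how it would be used in practice.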
Exploratory Data Analysis
Load the data and have a look before diving deeper
Data Leakage
To begin with, there was some serious data leakage. The leakage was caused by data from the future, i.e., features recorded after the moment we are trying to predict. These features were deleted during data cleaning and feature engineering.
For this case, I will use these data-leakage/future-casting features to demonstrate data leakage and its effect on metrics. To construct a comparison set of “data leakage” metrics (accuracy, ROC AUC score, confusion matrix, precision, and recall), I used all the columns that would be considered data leakage and that my wrangle function would otherwise drop from the train/val/test sets.
After using a Random Forest Classifier as our exploratory model, we have the following metrics for our data leakage experiment.
This looks like the “best” model, but only because it contains data from after the event to be predicted. So, is the model “cheating”? Yes, it is. It produced a near-perfect confusion matrix that is unrealistic for real-world modeling, with a validation accuracy of 94.3%, a ROC AUC score of 0.937, a precision of 0.83, and a recall of 0.75.
Wow, but if it’s too good to be true, it probably is.
Before we continue, we will clean the data and engineer the features.
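The cleaning step can be sketched as a small wrangle function that drops any column only known after the flight has operated. The column names below are assumptions based on typical BTS on-time datasets, not the exact list used here:

```python
import pandas as pd

# Columns only known after the flight operates; names are assumptions
# based on typical BTS on-time datasets.
LEAKY_COLUMNS = ["DEP_TIME", "DEP_DEL15", "ARR_TIME", "CANCELLED", "DIVERTED"]

def wrangle(df):
    """Drop future-casting (data leakage) columns before modeling."""
    return df.drop(columns=[c for c in LEAKY_COLUMNS if c in df.columns])

raw = pd.DataFrame({
    "DEP_TIME_BLK": ["0800-0859"],  # scheduled block, known at booking: keep
    "ARR_TIME": [1015],             # known only after landing: drop
    "CANCELLED": [0],               # known only after the fact: drop
})
clean = wrangle(raw)
```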
Imbalanced Dataset
We also have an imbalanced dataset, with an 80/20 split in favor of ON-TIME. The imbalance was adjusted with a technique called under sampling: I under sampled the majority class so that the resulting split was 50/50. The reasoning for under sampling the majority class was to improve the precision and recall of our model, as we shall see in Modeling and Metrics.
I wrote a function to under sample the majority class by matching the size of the minority class, setting a random seed for reproducibility.
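Such a function can be sketched as follows; the FLIGHT_STATUS target column name is an assumption, and the toy frame stands in for the real training data:

```python
import pandas as pd

def undersample_majority(df, target="FLIGHT_STATUS", seed=42):
    """Downsample the majority class to the minority class size.
    A fixed seed makes the sample reproducible."""
    counts = df[target].value_counts()
    minority = df[df[target] == counts.idxmin()]
    majority = df[df[target] == counts.idxmax()].sample(
        n=counts.min(), random_state=seed)
    # Recombine and shuffle so the two classes are interleaved
    return pd.concat([minority, majority]).sample(frac=1, random_state=seed)

# Toy 80/20 frame standing in for the training set
df = pd.DataFrame({"FLIGHT_STATUS": ["ON-TIME"] * 8 + ["DELAYED"] * 2})
balanced = undersample_majority(df)
```

After the call, `balanced` holds an equal number of ON-TIME and DELAYED rows, giving the 50/50 split described above.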
Modeling and Metrics
With the majority class under sampled, the baseline is 50%. Next, I need to choose which metrics to use to evaluate the models.
Our metrics will be:
- Accuracy score — the ratio of the number of correct predictions to the total number of input samples.
- ROC AUC score — results are considered excellent for AUC values between 0.9 and 1, good between 0.8 and 0.9, fair between 0.7 and 0.8, poor between 0.6 and 0.7, and failed between 0.5 and 0.6.
- Precision (also called positive predictive value) — the fraction of relevant instances among the retrieved instances.
- Recall (also known as sensitivity) — the fraction of relevant instances that were actually retrieved.
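All four metrics (plus the confusion matrix) are available in scikit-learn. A toy example with made-up labels, where 1 stands for DELAYED and 0 for ON-TIME:

```python
from sklearn.metrics import (accuracy_score, confusion_matrix,
                             precision_score, recall_score, roc_auc_score)

# Made-up labels: 1 = DELAYED, 0 = ON-TIME
y_true = [0, 0, 1, 1, 0, 1]
y_pred = [0, 0, 1, 0, 0, 1]

acc = accuracy_score(y_true, y_pred)    # fraction of correct predictions
prec = precision_score(y_true, y_pred)  # of predicted delays, how many were real
rec = recall_score(y_true, y_pred)      # of real delays, how many were caught
auc = roc_auc_score(y_true, y_pred)     # ranking quality of the scores
cm = confusion_matrix(y_true, y_pred)   # 2x2 table of hits and misses
```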
For the Flight Status Predictor, I built five models in total. The first model was an exploration into the effect of data leakage on our model.
The next two models were a Logistic Regression and a Random Forest Classifier. The purpose of these models was to see whether either would beat the majority-class baseline of 80.6% after data cleaning and feature engineering. They both did: the Logistic Regression produced an 86.4% accuracy score with a precision of 0.00 and a recall of 0.00, while the Random Forest Classifier produced 85.7% accuracy with a precision of 0.22 and a recall of 0.00. The accuracy scores show improvement, but for classification models precision and recall are just as important.
To address the low recall, the next two models were built after under sampling was performed on the training dataset. After under sampling, the new baseline is 50%. The Logistic Regression produced a 62.1% accuracy score with a precision of 0.00 and a recall of 0.00, while the Random Forest Classifier produced a 64.5% accuracy score with a precision of 0.17 and a recall of 0.62. The Random Forest Classifier demonstrated a 0.60 improvement in recall with a 0.05 drop in precision.
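The head-to-head comparison follows the usual scikit-learn pattern of fitting on the training split and scoring on the validation split. The synthetic features below merely stand in for the cleaned flight data:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, recall_score

# Synthetic stand-in for the cleaned (and under sampled) flight data:
# five numeric features, a binary label driven mostly by feature 0.
rng = np.random.default_rng(42)
X = rng.normal(size=(400, 5))
y = (X[:, 0] + rng.normal(scale=0.5, size=400) > 0).astype(int)
X_train, X_val = X[:300], X[300:]
y_train, y_val = y[:300], y[300:]

for model in (LogisticRegression(max_iter=1000),
              RandomForestClassifier(random_state=42)):
    model.fit(X_train, y_train)
    pred = model.predict(X_val)
    print(type(model).__name__,
          "accuracy:", round(accuracy_score(y_val, pred), 3),
          "recall:", round(recall_score(y_val, pred), 3))
```

Evaluating both models on the same held-out split keeps the comparison fair; only the validation scores, never the training scores, decide which model moves forward.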
The Random Forest Classifier with under sampling of the majority class produced a 64.5% accuracy score, improving recall from 0.02 to 0.62 but dropping from 0.22 to 0.17 in precision.
Let’s describe our model with a look at feature importance, permutation importance, and PDP plots.
More details about the Flight Status Predictor Model
Feature importance and permutation importance show that the time of day, or DEP_TIME_BLK, is one of the top factors in predicting FLIGHT STATUS.
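Permutation importance works by shuffling one feature at a time and measuring how much the model's score drops. A sketch with synthetic data, where by construction only the first feature carries signal:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance

# Synthetic data: three features, but the label depends only on feature 0
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 3))
y = (X[:, 0] > 0).astype(int)

model = RandomForestClassifier(random_state=0).fit(X, y)
result = permutation_importance(model, X, y, n_repeats=5, random_state=0)
# Shuffling feature 0 wrecks the score; shuffling the noise features does not,
# so result.importances_mean ranks feature 0 highest.
```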
Is your model useful?
Yes and No.
Yes, this model is useful. Building it surfaced important issues such as time series splitting, data leakage, imbalanced datasets, and model selection. The workflow has shown me that it is beneficial to test different metrics and different models and to document the process, so that we can find better ways to handle different types of datasets.
No, the model is not useful because it is not ready for the real world. With an ROC AUC score of 0.579, the results are pretty close to a baseline guess. In general, an ROC AUC score of 0.5 suggests no discrimination, 0.7 to 0.8 is considered acceptable, and 0.8 to 0.9 is considered excellent.
Further Study
For further study, additional data such as environmental conditions, maintenance records, and operational data might prove beneficial in producing a better predictive model.