Predicting Airline Passenger Numbers with KNearest Neighbours
Last updated April 2017.
This post describes the solution we submitted to the Travel Meets Big Data Hackathon, hosted by Dublin Airport and HostelWorld by myself and my teammate James Lawlor, where our submission won the prize for Most Innovative Solution! The code for this project can be found here
@Lawlorino @GlennMoynihan presenting at #TravelMeetsBigData and winning 3rd prize! pic.twitter.com/vQNH5pLY57
— EntAnonSociety (@EntAnon) November 20, 2016
The Challenge
We decided to tackle the challenge issued by Dublin Airport which involved predicting the profile of passengers arriving for a given flight in 15minute intervals up to 3 hours before the departure time.
The challenge presented many interesting problems, namely because a given flight in the training data set could have up to five Prediction Dates  1 day; 1 week; 1 month; 6 months; or 1 year before the flight.
This meant we had to be careful with data leaking and make sure we were not using data in our training models that realistically wouldn’t be available.
The features of the available dataset contained a mix of numerical, ordinal and categorical data but for the purpose of our model we decided to focus only on the numerical and ordinal data:
Feature  Type  Range 

Flight Month  num  012 
Flight Week  num  051 
Flight Day  num  07 
Flight Hour  num  023 
Trip Duration  ord  15 
Bag Checkin  ord  15 
% Leisure  ord  15 
% Irish  ord  15 
% Female  ord  15 
% Party Size  ord  15 
% Arrive by Car  ord  15 
Our Model
We initially assumed that flights with extensive historical data should not be too difficult to predict since a future passenger profile can be predicted from a sensible averaging over the previous profiles.
The challenge lay in predicting the profiles for flights with little or no historical data. We needed a way to determine the flights that were similar enough to these which could then be used to formalate a prediction.
We thought the KNearestNeighbours (KNN) method seemed a sensible approach in this instance using the numerical and ordinal data as the features for the algorithm.
We used the NearestNeighbours functionality in sklearn where we initially preprocessed the data so that each feature was normalised between 01 and therefore equally influential.
We were also aware of periodicity in the temporal data, that is 23:00 at night is very close to 01:00 but KNN would considers these times far apart in featurespace. We could not think of a way to implement periodic boundary condtions in KNN in time for the submission but we don’t think it will affect the results detrimentally.
The next task was to decide on how many nearestneighbours was appropriate. We performed a convergence test on 500 random samples in a crossvalidatoin set with up to 100 NN. We found anywhere between 2050 NN to be reasonable so we chose 40.
Once the 40 NN were identifed for each flight we calculated the weightedaverage passenger number y_{j} at each 15 minute interval j with the weights given by the exponential of the Euclidean distance in featurespace to each NN flight i.
Results
The overall error of the full model against the test set was 8.67 which is consistent with the crossvalidation testing.
If we inspect a few flights we can get a sense of how the model is performing.
From below we can see that when the predicted flight profile coincides with historical data the model prediction is quite good. When the profile contains extremes however the fit suffers. We did not identify in any of the generated graphs a time when the model correctly predicted these outliers but the general gaussian trend was almost always reconstructed.
Flight: 5285 Model: 1 Day
Error = 3.30
The model predicted this flight very well as it lay very close to the historical data.
Flight: 3214 Model: 1 Day
Error = 9.71
The model didn’t do as well on this flight although there was a large spike in arrivals at the start.
Flight: 5127 Model: 1 Week
Error = 6.91
This 4 week model worked quite well also.
Flight: 4157 Model: 4 Weeks
Error = 5.69
The flight data here also fell in line with the historical data and the model prediction was not bad.
Possible Improvements

I would very much like to have implemented boundary conditions in the NN model. We could estimate the likely impact of this by looking more closely at the timeseries of various features and see if indeed they demonstrate periodicity.

The full model took some time to train (~40 minutes). I would like to bring this down by parallelising the NN algorithm over OpenMP or similar.

I would also have liked to experiment with different weightingschemes; it is not obvious if exponential weighting was the best practice here.

Another method worth investigating would be a clustering approach where similar flights are grouped according to their similar passenger profiles. This would reduce the training time for new data as the average group profile would be used as a predictive measure however the error would probably increase as it is no longer bespoke for each flight.