Last updated April 2017.

This post describes the solution we submitted to the Travel Meets Big Data Hackathon, hosted by Dublin Airport and HostelWorld by myself and my teammate James Lawlor, where our submission won the prize for Most Innovative Solution! The code for this project can be found here

The Challenge

We decided to tackle the challenge issued by Dublin Airport which involved predicting the profile of passengers arriving for a given flight in 15-minute intervals up to 3 hours before the departure time.

The challenge presented many interesting problems, namely because a given flight in the training data set could have up to five Prediction Dates - 1 day; 1 week; 1 month; 6 months; or 1 year before the flight.

This meant we had to be careful with data leaking and make sure we were not using data in our training models that realistically wouldn’t be available.

The features of the available dataset contained a mix of numerical, ordinal and categorical data but for the purpose of our model we decided to focus only on the numerical and ordinal data:

Feature Type Range
Flight Month num 0-12
Flight Week num 0-51
Flight Day num 0-7
Flight Hour num 0-23
Trip Duration ord 1-5
Bag Checkin ord 1-5
% Leisure ord 1-5
% Irish ord 1-5
% Female ord 1-5
% Party Size ord 1-5
% Arrive by Car ord 1-5
     
     

Our Model

We initially assumed that flights with extensive historical data should not be too difficult to predict since a future passenger profile can be predicted from a sensible averaging over the previous profiles.

The challenge lay in predicting the profiles for flights with little or no historical data. We needed a way to determine the flights that were similar enough to these which could then be used to formalate a prediction.

We thought the K-Nearest-Neighbours (KNN) method seemed a sensible approach in this instance using the numerical and ordinal data as the features for the algorithm.

We used the NearestNeighbours functionality in sklearn where we initially pre-processed the data so that each feature was normalised between 0-1 and therefore equally influential.

We were also aware of periodicity in the temporal data, that is 23:00 at night is very close to 01:00 but KNN would considers these times far apart in feature-space. We could not think of a way to implement periodic boundary condtions in KNN in time for the submission but we don’t think it will affect the results detrimentally.

The next task was to decide on how many nearest-neighbours was appropriate. We performed a convergence test on 500 random samples in a cross-validatoin set with up to 100 NN. We found anywhere between 20-50 NN to be reasonable so we chose 40.

Figure1

Once the 40 NN were identifed for each flight we calculated the weighted-average passenger number yj at each 15 minute interval j with the weights given by the exponential of the Euclidean distance in feature-space to each NN flight i.

Results

The overall error of the full model against the test set was 8.67 which is consistent with the cross-validation testing.

If we inspect a few flights we can get a sense of how the model is performing.

From below we can see that when the predicted flight profile coincides with historical data the model prediction is quite good. When the profile contains extremes however the fit suffers. We did not identify in any of the generated graphs a time when the model correctly predicted these outliers but the general gaussian trend was almost always reconstructed.

Flight: 5285 Model: 1 Day

Error = 3.30

The model predicted this flight very well as it lay very close to the historical data.

Figure1

Flight: 3214 Model: 1 Day

Error = 9.71

The model didn’t do as well on this flight although there was a large spike in arrivals at the start.

Figure1

Flight: 5127 Model: 1 Week

Error = 6.91

This 4 week model worked quite well also.

Figure1

Flight: 4157 Model: 4 Weeks

Error = 5.69

The flight data here also fell in line with the historical data and the model prediction was not bad.

Figure1

Possible Improvements

  1. I would very much like to have implemented boundary conditions in the NN model. We could estimate the likely impact of this by looking more closely at the time-series of various features and see if indeed they demonstrate periodicity.

  2. The full model took some time to train (~40 minutes). I would like to bring this down by parallelising the NN algorithm over OpenMP or similar.

  3. I would also have liked to experiment with different weighting-schemes; it is not obvious if exponential weighting was the best practice here.

  4. Another method worth investigating would be a clustering approach where similar flights are grouped according to their similar passenger profiles. This would reduce the training time for new data as the average group profile would be used as a predictive measure however the error would probably increase as it is no longer bespoke for each flight.