A Statistical Analysis of the CAO Points System
Last updated February 2017.
This post has since featured on an episode of the Atlantic 302 Podcast.
Introduction
Every August approximately 70,000 college applicants in Ireland receive their first round of offers for course places through the Central Applications Office (CAO). Each year, in the run-up to the offers, students, teachers, parents and the media obsess over trying to predict the points for various courses.
While some courses remain relatively predictable, most do not: their points can vary significantly from year to year depending on demand, leaving students guessing how many points they need to achieve in the Leaving Cert (LC) to get their desired course.
While the CAO (and presumably the media) have at their disposal an extensive internal database, the purpose of this project was to investigate what sort of predictive power could be achieved by analysing the publicly available data on the CAO website. It was also an excuse to familiarise myself with using Python pandas, SQL and some statistical techniques I have learned over the past year.
I’m not only interested to see how, or if, I could predict the points for 2017 but I also want to see what other insights can be derived from the sparse dataset.
Here you can access my full Course Points Dataset file, the awarded Leaving Cert Points file, and the Python notebook I used to analyse the data.
You can also find my 2016 Predictions which shall be compared to the actual results using Linear Regression, AutoRegression and ARIMA models. My Predictions for 2017 can also be downloaded.
Finally, you can also access full resolution PDFs of all images on my github, which you are free to use provided you cite this blog.
Acquiring and Cleaning the Data
While the points for each course are technically publicly available from the CAO website, there is no central database from which to draw. You have to search for the PDF of all the points for a given year to get your info, which is very cumbersome.
I wrote a short web scraper to acquire every points-related PDF and PHP file from the CAO website for every year after 2001 and, incidentally, found many PDFs in subdirectories that weren’t accessible via hyperlink.
Now that I had all the points data, as well as the distributions of LC points awarded each year, I wrote another Python script to extract the relevant data from the PDFs into a comprehensive CSV database (download here) comprising each course, its description, the college and the points for a given year.
Some Caveats:
The dataset, however, is not exhaustive, for the following reasons.
 The data pertains to final round offers only.
 I’m only focusing on Level 8 courses from 2001-2015.
 There will be individually missing datapoints where:
 All qualified applicants got a place, in which case no points are listed.
 A course is no longer running.
 Some courses may also have changed name or code.
 Mature or graduate courses are excluded since they don’t apply the same way LC students do.
 Trinity’s TR001 course is also excluded since it’s an ensemble of many courses.
 The number of places on a course may change over time and I cannot account for this. In the same vein, points will generally reflect ‘demand relative to number of places’.
I believe we can still draw some meaningful insights from the data that’s left; however, there remain some interesting challenges.
 How do I deal with the 25 bonus points for Honours maths introduced in 2012?
 What about courses that require an audition, interview or portfolio? Those points may exceed the maximum awarded points in the LC exams.
I will address these as they come up.
Exploratory Data Analysis
To get a macroscopic picture and get a sense of the data set and its limitations I first perform some exploratory data analysis. This will help me later on to decide which prediction model is best suited.
How does the number of LC students affect the number of courses?
Surprisingly, there is essentially no correlation between the number of LC students and the number of available courses.
I think there may be two reasons for this. The first is that mature students and repeat applicants may be influencing the number of courses although this is difficult to prove with this dataset. The second, more cynical, reason is that colleges are under increasing demand to procure government funding which is allocated according to the proportion of students in a college. Thus by offering a more diverse range of courses, colleges can attract more students and get awarded more funds.
How does the number of LC students affect the average course points?
The average course points in a given year seem to correlate reasonably strongly with the number of LC students, with an r value of 0.63. It would seem that increased student numbers are driving up the demand for college courses, especially since the financial crash of 2008.
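As a rough illustration of how an r value like this is obtained, here is a minimal sketch of a Pearson correlation using NumPy. The numbers are invented for the example and are not taken from the dataset, and the variable names are placeholders.

```python
import numpy as np

# Illustrative values only, one entry per year (NOT the real CAO figures).
lc_students = np.array([55000, 54000, 52000, 53000, 56000, 58000])
avg_course_points = np.array([365, 362, 360, 364, 372, 380])

# Pearson r is the off-diagonal entry of the 2x2 correlation matrix.
r = np.corrcoef(lc_students, avg_course_points)[0, 1]
print(f"r = {r:.2f}")
```

An r close to 1 means the two series rise and fall together; it says nothing on its own about which one drives the other.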
How does the average awarded points affect the average course points?
However, if we inspect how the average awarded points of a given year relates to the course points we begin to see the bigger picture emerge.
For starters, the average LC points has risen almost 60 points over the last ten years! Moreover it has increased almost every year in that timeframe. For a standardised test this is a highly surprising result as one would expect the figure to vary randomly about some mean value if the test were to be considered at all consistent. I can conceive of a number of reasons for this effect:
 Students are getting more intelligent each year.
 Students are getting better at gaming the system.
 The Leaving Cert exams are getting easier.
 The Leaving Cert is being marked easier.
The true cause is probably some combination of the above; nonetheless, it is a concerning statistic and should be further investigated.
From the above trends it is also clear that after 2008 the average course points are highly correlated with the average LC points with r=0.92.
This insight, in my view, supports a more concerning issue facing Irish third level education.
Before the crash the average awarded points increased every year yet had little impact on the average course points, since the number of courses increased and student numbers drastically dropped. In this instance, demand for college places was relatively low and the course points reflected that.
However, in 2008 there was a sudden increase in student numbers, which from then on has led to the average awarded points becoming the main driver behind the increase in course points, as seen in the last figure.
In effect, despite more available courses than ever before, demand is beginning to outstrip supply; the courses are now saturated with the sheer volume of applicants and the average course points are now being dictated by the awarded points, suggesting the system is currently at capacity.
An optimist would suggest that the system is at least in balance for now, with the awarded points and course points being approximately equal. However, if the two were to ever diverge it would signal a dire state of affairs wherein a significant proportion of students who wish to attend university would not be offered a place, and only the higher achievers would.
What is the distribution of course points?
From 2001-2016 the distribution of points for all the courses seems reasonably normal, with a mean of 379 and a standard deviation of 103. The courses that require an interview, audition or portfolio sometimes require more than 600 points and can be seen in the tail of the histogram.
If we eliminate these courses we get a tighter-fitting distribution, with the mean declining slightly to 370 and the standard deviation to 87, and some slight left-skew towards higher points.
What is the distribution of awarded points?
If we compare these distributions with the distribution of points awarded we see a marked difference.
While there was slight left-skew in the course points, there is a very strong right-skew in the awarded points, with a mean of 340 and a standard deviation of 157.
I found it highly surprising that almost 1 in 4 students achieved less than 200 points in their exams. This highlights the intense competition for places at Level 8: the course points are weighted upward while the awarded points are weighted heavily downward, with the average course points 30 points more than the average awarded points.
What has been the effect of Maths Bonus Points?
For the LC in 2012, 25 bonus points were automatically given to students who sat Honours level maths as an incentive to take the subject. Because of this some courses are now going to rise artificially, not because of increased demand but because of these bonus points, and it’s impossible to determine exactly which ones will be affected.
What I can do, however, is see how the bonus points affected all the courses on average by performing an autoregression (AR) on the data. I should then be able to determine if, and to what degree, the bonus points mattered in the grand scheme of things.
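To make the idea concrete, here is a minimal sketch under one assumption: that the fit is an ordinary least-squares line of each course's points in one year against the same course's points in the previous year. The five data points are invented and the variable names are placeholders.

```python
import numpy as np

# Hypothetical points for five courses in two consecutive years.
points_prev = np.array([300.0, 350.0, 400.0, 450.0, 500.0])  # year t-1
points_next = np.array([310.0, 355.0, 405.0, 450.0, 495.0])  # year t

# Fit points_next = slope * points_prev + intercept by least squares.
slope, intercept = np.polyfit(points_prev, points_next, 1)
print(f"slope = {slope:.3f}, intercept = {intercept:.1f}")
```

A slope below 1 with a positive intercept, as in this toy data, means the lower courses moved up and the higher courses moved down between the two years.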
If we examine the last 16 years in general we see some general stability in the course points. The intercept of 15.94 and slope of 0.96 mean that, generally speaking, lower points courses will see an increase while higher points courses remain relatively consistent. Here the black line indicates a slope = 1 and the fulcrum, the point where the two lines cross, lies at 454 points. This lends support to the average increase in course points over the years that we saw earlier.
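The fulcrum follows directly from the fit: setting slope * x + intercept equal to x gives x = intercept / (1 - slope). A small sketch, where the function name is my own and the inputs are the unrounded slope and intercept listed for the 2001-2016 interval:

```python
def fulcrum(slope: float, intercept: float) -> float:
    """Point where the line y = slope * x + intercept crosses y = x."""
    if slope == 1.0:
        raise ValueError("line is parallel to y = x: no single crossing point")
    return intercept / (1.0 - slope)

# Unrounded slope and intercept for the full 2001-2016 fit.
print(round(fulcrum(0.964934, 15.935371)))
```

Because 1 - slope is small, the result is sensitive to rounding: plugging in the rounded 0.96 and 15.94 gives roughly 398 rather than 454.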
As a matter of interest let’s investigate the pre and postcrash eras in isolation.
Between 2001-2008 we see the fulcrum lies toward the lower half at 341 points, which demonstrates how higher points courses in general fell in points during this time.
This contrasts with 2008-2016, during which the fulcrum has shifted dramatically to almost 570 points. This highlights the marked increase in course points that occurred during these years, particularly at the lower end of the scale.
We can therefore tell the general trend of the courses over the years by the slope and the position of the fulcrum relative to the median of 400 points.
Interestingly, there always seems to be a slight pinch in the scatter around 280 points, below which the data flares out and makes it all look sort of rocket-shaped. I can only surmise that this is because the popularity of the relatively small number of sub-280 courses varies more wildly from year to year, leading to such erratic points differences.
If we apply the same autoregression analysis on each yearly interval the following trend emerges:
========================================
Interval      Slope   Intercept  Fulcrum
----------------------------------------
2001-2002  0.958274   12.022932      288
2002-2003  0.962025   12.925657      340
2003-2004  1.006800   -1.049425      154
2004-2005  0.951657   15.142911      313
2005-2006  0.981118    0.164459        9
2006-2007  0.928320   30.025070      419
2007-2008  0.921694   28.744911      367
2008-2009  0.922315   35.030095      451
2009-2010  0.962466   18.565979      495
2010-2011  0.957871   23.662243      561
2011-2012  1.062564  -20.002919      320
2012-2013  0.915665   42.925100      509
2013-2014  0.986426    7.142493      526
2014-2015  0.949977   32.086148      641
2015-2016  0.984857    8.608068      568
2001-2016  0.964934   15.935371      454
2001-2008  0.956779   14.774989      342
2008-2016  0.966250   19.244853      570
========================================
For almost all years the slope of the AR fit is less than 1, meaning that, generally speaking, higher courses came down and lower courses came up, with the position of the fulcrum indicating which side responded more.
For instance, in the earlier years the fulcrum was less than 400 indicating that the higher points courses were generally in decline, whereas in the later years the fulcrum has shifted well above 400 indicating that higher course points are more stable while lower points are now rising.
For 2003-2004, while the slope is slightly greater than 1, the intercept is only slightly less than 0, meaning this line passes almost exactly through the origin and course points in general remained relatively stable.
However, for 2011-2012, while the slope is also only slightly greater than 1, the intercept is -20, which is still relatively large in magnitude and puts the fulcrum at 320 points. This means that higher points courses will in general see an increase. In fact you can clearly see this occurring in the graph, where the majority of points over 475 lie above the black line by as much as 30 points, a clear demonstration of this effect.
Remarkably, we have been able to identify the effect of maths bonus points even though in the 2011-2012 period average points didn’t rise all that much (due to a large drop in student numbers). The trend thereafter continues the usual pattern found between 2008-2016 of increasing average points coming primarily from the lower end of the scale.
However, the effect does not seem to me to be large enough across the whole range of points to justify any normalisation of the data to account for this artificial increase. Indeed, it is clear to me that in the period 2008-2016 the increasing number of students is the primary driver of the points, and affects the entire range, rather than the bonus points, which only affect a small section.
Findings Part 1

Course points have generally increased over the past 8 years due to the increasing number of applicants.

This is further reflected in the course points vs candidate points which were uncorrelated before 2007 but have since become highly linked with student performance, which is steadily increasing.

The increasing candidate points is a concerning trend and cannot simply be attributed to maths bonus points.

Maths bonus points did have an effect on higher points courses; however, the demand from students has a much greater effect.

The number of places in each course is not taken into account and I would expect this to have an effect.

What this indicates is a highly competitive application procedure with universities now at or above capacity despite providing more courses than ever each year.
Course and College Rankings
College Rankings based on average entry points
Below the colleges are ranked according to the average entry points for all available courses from 2001-2016. The colour scheme indicates intervals of 100 points.
We see a broad spectrum of points, from IADT at 548 to Portobello College at 227. Among the higher end are many arts colleges that may require portfolios, auditions or interviews.
If we omit courses with auditions, we see a largely similar trend.
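A ranking of this kind boils down to a group-by average. The sketch below shows the idea in pandas on toy rows; the column names ("college", "year", "points") and the values are assumptions for illustration, not the layout of the actual dataset file.

```python
import pandas as pd

# Toy rows standing in for the real dataset; column names are assumptions.
df = pd.DataFrame({
    "college": ["A", "A", "B", "B", "C"],
    "year":    [2015, 2016, 2015, 2016, 2016],
    "points":  [500, 520, 300, 320, 410],
})

# Average entry points per college over all courses and years, highest first.
ranking = df.groupby("college")["points"].mean().sort_values(ascending=False)
print(ranking)
```

Omitting the audition and portfolio courses would then just be a filter on the rows before the group-by.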
Course Rankings
If we rank the courses based on average entry points since 2001 we see that, over all courses, those pertaining to the entertainment industry such as Music, Film, Television and Theatre have the highest points, presumably due to the extra requirements such as audition, portfolio or other marking processes, notwithstanding the popularity of the career path.
All Courses
========================================================================
 #  code   Name                                              Points  std
------------------------------------------------------------------------
 1  DL832  Animation                                            918  105
 2  DL045  Film and Television Production                       883   35
 3  DL834  Film and Television Production                       880   53
 4  CR128  Popular Music Keyboards at CIT Cork School of…       873  159
 5  DL049  Design for Stage and Screen Makeup Design            854   71
 6  CR129  Popular Music Voice at CIT Cork School of Music      849   44
 7  DL048  Design for Stage and Screen Costume Design           826   95
 8  DL830  Design for Stage and Screen Makeup Design            787  147
 9  CR127  Popular Music Electric Guitar at CIT Cork School     776  104
10  DL042  Photography                                          758   67
11  LC102  Art and Design                                       750   51
12  DL826  Visual Communication Design                          748  142
13  CR126  Popular Music Drums at CIT Cork School of Music      748   36
14  CR121  Music at CIT Cork School of Music                    741   78
15  DL047  Design for Stage and Screen                          722   98
16  LC114  Design Fashion Knitwear and Textiles                 719   71
17  DL041  Animation                                            719   77
18  DL833  Photography                                          718   74
19  CR210  Contemporary Applied Art Ceramics Glass Textile      713  142
20  CR700  Theatre and Drama Studies at CIT Cork School …       707   80
========================================================================
Courses without Extra Requirements
If we omit courses with extra requirements the medical courses clearly dominate the rankings, with Medicine, Dentistry, Pharmacy and Physiotherapy all featuring. Business, Finance and Law contribute almost as heavily.
==============================================================
 #  code   Name                                    Points  std
--------------------------------------------------------------
 1  RC003  Medicine with Leaving Cert Scholarship     589    7
 2  DN670  Quantitative Business                      585    0
 3  TR051  Medicine                                   575    1
 5  CK701  Medicine                                   570    9
 6  TR017  Law and Business                           566   12
 7  TR018  Law and French                             566   10
 8  TR020  Law and Political Science                  566    9
 9  DN002  Medicine                                   566   11
10  TR052  Dental Science                             564   18
11  RC001  Medicine                                   564   11
12  GY501  Medicine                                   564   10
13  CK702  Dentistry                                  563   15
14  LM100  Physiotherapy                              561   10
15  DC119  Global Business Canada                     558   18
16  DN230  Actuarial and Financial Studies            557   14
17  DN616  Law with French Law BCL                    556   18
18  CK703  Pharmacy                                   556    9
20  TR072  Pharmacy                                   551    8
==============================================================
Highest Points for 2016
===================================================================
 #  code   Name                                              Points
-------------------------------------------------------------------
 1  CR210  Contemporary Applied Art Ceramics Glass Textile     1000
 2  SG244  Fine Art                                             935
 3  DL830  Design for Stage and Screen Makeup Design            900
 4  CR129  Popular Music Voice at CIT Cork School of Music      900
 5  DL834  Film and Television Production                       885
 6  DL826  Visual Communication Design                          845
 7  CR127  Popular Music Electric Guitar at CIT Cork School     830
 8  CR128  Popular Music Keyboards at CIT Cork School of…       825
 9  DL829  Design for Stage and Screen Costume Design           815
10  CR700  Theatre and Drama Studies at CIT Cork School …       800
11  DL832  Animation                                            800
12  DT545  Design Visual Communication                          775
13  CR126  Popular Music Drums at CIT Cork School of Music      770
14  DT559  Photography                                          765
15  DT506  Commercial Modern Music                              750
16  TR051  Medicine                                             730
17  GY501  Medicine 6 year and embedded PhD options             723
18  CR220  Fine Art at CIT Crawford College of Art and D…       710
19  CW858  Sport Management and Coaching Options GAA Rugby      700
20  CW038  Art Wexford                                          700
===================================================================
Highest Points for 2016 without Extra Requirements
===================================================================
 #  code   Name                                              Points
-------------------------------------------------------------------
 1  TR076  Nanoscience Physics and Chemistry of Advanced…       595
 2  DC116  Global Business USA                                  590
 3  TR052  Dental Science                                       585
 4  CK702  Dentistry                                            585
 5  TR017  Law and Business                                     585
 6  DN670  Quantitative Business                                585
 7  TR020  Law and Political Science                            575
 8  TR018  Law and French                                       575
 9  TR073  Human Genetics                                       570
10  CK703  Pharmacy                                             565
11  TR031  Mathematics                                          565
12  TR034  Management Science and Information Systems St…       565
13  DN230  Actuarial and Financial Studies                      560
14  TR072  Pharmacy                                             560
15  CK407  Mathematical Sciences                                560
16  TR015  Philosophy Political Science Economics and Socio     555
17  LM100  Physiotherapy                                        555
18  TR035  Theoretical Physics                                  555
19  DN615  BCL Matrise                                          555
20  DN440  Biomedical Health and Life Sciences                  550
===================================================================
Popularity of Particular Industries
The performance of certain industries also captures the media’s attention as some, particularly construction, are taken as indicators of the confidence in the economy, while others, such as STEM courses, are strongly promoted.
I performed a simple keyword search of all the courses and plotted the time series of the average points of the courses with that keyword in the course description.
The search isn’t 100% accurate however since searching a generic term like ‘Science’ will exclude courses like ‘Astrophysics’ and ‘Biology’, while conversely, I may be including extra courses that don’t technically belong, like ‘Social Science’. In general though, we’re just interested in a trend so it should work well enough.
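A keyword search like this can be sketched in a few lines of pandas. The column names ("description", "year", "points"), the helper name and the toy rows below are all assumptions for illustration rather than the real dataset layout.

```python
import re

import pandas as pd

# Toy rows; column names and values are assumptions, not the real dataset.
df = pd.DataFrame({
    "description": ["Civil Engineering", "Social Science", "Science",
                    "Astrophysics", "Computer Science"],
    "year":   [2015, 2015, 2016, 2016, 2016],
    "points": [420, 380, 450, 520, 400],
})

def keyword_trend(data, keywords):
    """Mean points per year over courses whose description has any keyword."""
    # Escape each keyword so it is matched literally, then OR them together.
    pattern = "|".join(re.escape(k) for k in keywords)
    mask = data["description"].str.contains(pattern, case=False, regex=True)
    return data[mask].groupby("year")["points"].mean()

trend = keyword_trend(df, ["Science"])
print(trend)
```

The toy data reproduces both caveats: 'Social Science' is counted under 'Science', while 'Astrophysics' is missed.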
Construction
Construction, Architecture
Did extremely well during the boom, collapsed with the economy in 2008 when the property bubble burst. Has seen some resurgence lately as the economy improves.
Law
Law
Not the trend I was expecting, especially considering the rankings above. It could be that newer, less popular courses involving Law are bringing down the points.
Medical
Medicine, Dentistry, Nursing, Physiotherapy, Pharmacy.
The medical field is always consistently high. The huge jump is the introduction of the HPAT in 2009.
Engineering
Engineering
Contrary to construction, Engineering went down during the boom but has since risen due to efforts in promoting STEM courses.
Science
Science, Physics, Biology, Chemistry, Geology, Mathematic
Science-related courses have exploded in popularity since the crash; the promotion of STEM has been extremely successful here and students obviously see the benefit of a degree in science.
Technology
Comput, Information, Digital, Programming
Again, like Engineering, computer and tech-related courses have seen a resurgence since the recession as Ireland becomes an emerging tech hub of Europe.
Irish
Irish, Gaelic, Celtic
Irish-related studies fluctuate a little but seem to remain reasonably stable over time.
European Languages
French, German, Spanish, Italian
The big 4 European languages are more popular than our own and also seem relatively stable within a 30-point margin.
Non-EU Languages
Chinese, Russian, Japanese, Arabic, Indian
Non-EU languages, however, don’t share the same popularity.
Performing Arts
Music, Theatre, Drama, Film
The Performing Arts have fared very well through the recession; however, the large jump in points would seem to suggest that an additional component to the exam, such as an audition or portfolio, came into effect. Apart from that jump, the sector appears quite stable.
Humanities
Arts, English, Philosophy, Sociology, Politics
This mixed bag of Arts degrees suffers a lot of volatility over the years but seems to have settled down.
Corporate
Business, Economics, Accounting, Finance
Business and financial courses have behaved similarly to Science, Technology and Engineering, becoming increasingly popular since the crash due to the lucrative career prospects on offer.
A clear trend emerged in the preceding graphs: the onset of the recession caused a dramatic uptake in courses linked with career prospects conventionally considered “safe”, namely Science, Technology, Engineering and Business, while softer subjects such as the Arts and Languages have remained relatively stable throughout the last decade, suggesting their demand has not increased as significantly as the others.
Predicting 2017 Points
In predicting the points for 2017 I have 3 strategies in mind:

Linear Regression: Probably the most naive model as there should not necessarily be a direct causal relationship between the year and the points at all. Still, it may prove to be a useful starting point.

AutoRegression AR(1): a somewhat more robust model for time series that makes more intuitive sense. It dictates that this year’s points will depend on what last year’s points were, under the assumption of constant mean and variance.

ARIMA(1,1,0): Slightly more sophisticated than AR(1), this will take into account any trend where the points are going up (or down) on average, as we have seen can be the case.
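To show how the three approaches differ mechanically, here is a simplified sketch. It fits each model by ordinary least squares with NumPy rather than the maximum-likelihood routines a statistics library would normally use, it treats ARIMA(1,1,0) as an AR(1) without constant on the yearly differences, and the series is made up. Treat it as an illustration of the ideas, not as the code behind my predictions.

```python
import numpy as np

def predict_linear(years, points, target_year):
    """Straight-line fit of points against year, evaluated at target_year."""
    x0 = years[0]  # centre the years to keep the fit well conditioned
    slope, intercept = np.polyfit(years - x0, points, 1)
    return slope * (target_year - x0) + intercept

def predict_ar1(points):
    """AR(1) with a constant: regress each value on the one before it."""
    phi, const = np.polyfit(points[:-1], points[1:], 1)
    return const + phi * points[-1]

def predict_arima110(points):
    """ARIMA(1,1,0) without drift: AR(1), no constant, on the differences."""
    diffs = np.diff(points)
    prev, curr = diffs[:-1], diffs[1:]
    phi = np.dot(prev, curr) / np.dot(prev, prev)
    return points[-1] + phi * diffs[-1]

# Made-up series for one course, 2009-2016 (not real CAO data).
years = np.arange(2009, 2017)
points = np.array([480.0, 490.0, 495.0, 510.0, 515.0, 530.0, 540.0, 545.0])

for name, value in [
    ("Linear Regression", predict_linear(years, points, 2017)),
    ("AR(1)", predict_ar1(points)),
    ("ARIMA(1,1,0)", predict_arima110(points)),
]:
    print(f"{name}: {value:.0f}")
```

Note that differencing costs one data point and the lagged regression on the differences costs another, which matters when each course only has a handful of years to begin with.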
I used the data from 2001-2016 as a training set in order to determine the best model to predict the points for 2017.
The predictions for the training set can be found in this data file, which includes the 95% prediction errors, where I have only considered courses that have more than 5 data points since 2009.
I decided to only use data from 2009 on as this pertains to the most recent era, where student numbers are driving the points. As such the prediction errors are quite large given the scarce amount of data I’m using.
The mean square error (MSE) of each model is summarised below where, surprisingly, all models are roughly equally valid. My reasoning for this is that we have already seen how some courses have seen a general increase in points since 2008, which would be well explained by either the Linear Regression or ARIMA(1,1,0) models, while others have remained relatively stable and thus would be more suited to the AR(1) model.
The best model will ultimately depend on the nature of the course itself.
=======================
Model               MSE
-----------------------
Linear Regression  39.5
AR(1)              36.3
ARIMA(1,1,0)       45.3
=======================
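For reference, here is a minimal sketch of the error measure, assuming the plain textbook definition of MSE as the average squared difference between predicted and actual points. The three values are invented and the function name is my own.

```python
def mean_squared_error(actual, predicted):
    """Average of the squared differences between paired values."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)

# Invented numbers for three courses (not real predictions or results).
actual = [500, 420, 355]
predicted = [510, 415, 350]
print(mean_squared_error(actual, predicted))
```

Squaring means a single badly missed course weighs much more heavily than several small misses.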
Using each model, I then predicted the points for 2017, including 95% prediction intervals, which can be found here.
To illustrate how each of these models work I will consider the model predictions for my own course Theoretical Physics in Trinity (TR035).
Case Study : Theoretical Physics TR035
Linear Regression
Prediction 2017 : 566 +/- 43 points
While the linear regression model for Points vs Year fits the data quite well, the equation of the line was always going to be bizarre because it assumes a linear increase in the points forever.
The prediction, however, is ok with a spread of almost 90 points: fairly wide but not atypical given the model MSE above.
AR(1)
Prediction 2017 : 531 +/- 67 points
The AR(1) model does not provide as nice a fit, however, as there is clear evidence of a trend in the data for which AR(1) cannot appropriately account.
ARIMA(1,1,0)
Prediction 2017 : 561 +/- 47 points
The ARIMA(1,1,0) model yields almost exactly the same prediction as the Linear Regression. This is not surprising as the two are designed to consider trends in the data.
In this instance the Linear Regression and ARIMA(1,1,0) models performed best as the points for Theoretical Physics have generally increased over the years. The simple AR(1) model is not equipped to deal with this shifting mean and so yielded a worse prediction.
Conclusion
Although the data was by no means exhaustive I was able to draw some interesting insights from publicly available data:

Course points in general are going up, due to increasing student population and despite a continuous rise in the number of courses. This is putting increasing stress on universities which are now operating at or above capacity.

The distributions of points awarded vs course points highlights the huge competition for places every year.

Maths bonus points have played a small but significant role for particular courses, but in general student demand is the primary driver.

The course rankings reflect what is common knowledge about the popularity of Medical careers and related courses. What was surprising was the decline in Law; I would have considered it one of the “safe bets”.

The relatively small number of datapoints somewhat limits the predictive power of my models but I can still make some decent ballpark estimations with reasonable errors that are very likely to include the actual results. Even though I believe the ARIMA(1,1,0) model is more statistically justified over the simple Linear Regression model, it is hampered by losing two precious datapoints that increase the overall error.

Ultimately, I found the best indicators for predicting the points for a course are, in decreasing importance:
 Last year’s points.
 The general trend in that course’s popularity in the last 8 years.
 The number of LC students applying that year.
Further Questions
I have a few remaining questions that I didn’t get time to code up, or for which the data wasn’t extensive enough to give an answer. In no particular order, they are:
 What is the relative performance of newer courses compared to more established ones?
 Which courses have suffered a decline in popularity?
 What was the biggest jump/fall in points in any given year for a particular course/college?
 How do CAO applicants who are not school leavers affect the points?
 What is the gender breakdown of each course/college?
If there are any comments, corrections or suggestions about this project, please feel free to email me.