Last updated February 2017.

This post has since featured on an episode of the Atlantic 302 Podcast.

Introduction

Every August, approximately 70,000 college applicants in Ireland receive their first round of offers for course places through the Central Applications Office (CAO). Each year, in the lead-up to the event, students, teachers, parents and the media obsess over trying to predict the points for various courses.

While some courses remain relatively predictable, most are not and can vary significantly from year to year depending on demand, leaving students guessing as to how many points they need to achieve in the Leaving Cert (LC) in order to get their desired course.

While the CAO (and presumably the media) have at their disposal an extensive internal database, the purpose of this project was to investigate what sort of predictive power could be achieved by analysing the publicly available data on the CAO website. It was also an excuse to familiarise myself with Python pandas, SQL and some statistical techniques I have learned over the past year.

I’m not only interested to see how, or if, I could predict the points for 2017 but I also want to see what other insights can be derived from the sparse dataset.

Here you can access my full Course Points Dataset file, the awarded Leaving Cert Points file, and the Python notebook I used to analyse the data.

You can also find my 2016 Predictions which shall be compared to the actual results using Linear Regression, Auto-Regression and ARIMA models. My Predictions for 2017 can also be downloaded.

Finally, you can also access full resolution PDFs of all images on my GitHub, which you are free to use provided you cite this blog.

Acquiring and Cleaning the Data

While the points for each course are technically publicly available from the CAO website, there is no central database from which to draw - you have to search for the PDF of all the points for a given year to get your info, which is very cumbersome.

I wrote a short web-scraper to acquire every single points-related PDF and php file from the CAO website for every year after 2001 and, incidentally, found many PDFs in subdirectories that weren’t accessible via hyperlink.
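The link-harvesting step of such a scraper might look roughly like the sketch below. It is an illustration only, not the script used for this post: the class name, function name, sample HTML and base URL are all hypothetical, and it only finds files that are actually hyperlinked, so the unlinked PDFs mentioned above would need a different approach.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class PointsLinkParser(HTMLParser):
    """Collect href targets of <a> tags that end in .pdf or .php."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value and value.lower().endswith((".pdf", ".php")):
                # Resolve relative links against the page they were found on.
                self.links.append(urljoin(self.base_url, value))


def extract_points_links(html, base_url):
    """Return absolute URLs of all .pdf/.php links in an HTML string."""
    parser = PointsLinkParser(base_url)
    parser.feed(html)
    return parser.links


# Hypothetical page fragment; the real CAO pages will look different.
sample_html = """
<a href="/points/l8_2016.pdf">Level 8 2016</a>
<a href="index.php">Home</a>
<a href="/about.html">About</a>
"""
links = extract_points_links(sample_html, "http://example.org/points/")
print(links)
```

In a real run the HTML would come from downloading each page, and each collected link would then be fetched and saved to disk.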

Now that I had all the points data as well as the distributions of LC points awarded each year, I wrote another Python script to scrape the relevant data from the PDFs into a comprehensive CSV database (download here) comprising each course, its description, the college and the points for a given year.
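To give a flavour of that second step, here is a minimal sketch that turns lines of text (as they might come out of a PDF-to-text tool) into CSV rows. The line layout "code, title, points" and the regular expression are assumptions made for illustration; the real PDFs are messier, and this is not the script used for the dataset.

```python
import csv
import io
import re

# Assumed layout: a 5-character course code, a title, then the points.
ROW = re.compile(r"^(?P<code>[A-Z]{2}\d{3})\s+(?P<name>.+?)\s+(?P<points>\d{2,4})$")


def parse_points_lines(lines, year):
    """Keep only lines that look like a course row and tag them with the year."""
    rows = []
    for line in lines:
        match = ROW.match(line.strip())
        if match:
            rows.append({
                "code": match["code"],
                "name": match["name"],
                "year": year,
                "points": int(match["points"]),
            })
    return rows


sample_lines = [
    "Course Code  Course Title  Points",          # header: skipped
    "TR035 Theoretical Physics 555",
    "DN230 Actuarial and Financial Studies 560",
    "XX999 No points listed",                     # no trailing number: skipped
]
rows = parse_points_lines(sample_lines, 2016)

# Write the rows as CSV (to an in-memory buffer here rather than a file).
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["code", "name", "year", "points"])
writer.writeheader()
writer.writerows(rows)
print(buffer.getvalue())
```

Rows without a listed points value are dropped here, which mirrors the missing datapoints described in the caveats that follow.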

Some Caveats:

The dataset, however, is not exhaustive, for the following reasons:

  1. The data pertains to final round offers only.
  2. I’m only focusing on Level 8 courses from 2001-2015.
  3. There will be individually missing datapoints where:
    • All qualified applicants got a place, in which case no points are listed.
    • A course is no longer running.
  4. Some courses may also have changed name or code.
  5. Mature or graduate courses are excluded since they don’t apply the same way LC students do.
  6. Trinity’s TR001 course is also excluded since it’s an ensemble of many courses.
  7. The number of places on a course may change over time and I cannot account for this. In the same vein, points will generally reflect demand relative to the number of places.

I believe we can still draw some meaningful insights from the data that’s left. However, there remain some interesting challenges.

  1. How do I deal with the 25 bonus points for Honours maths introduced in 2012?
  2. What about courses that require an audition, interview or portfolio? Those points may exceed the maximum awarded points in the LC exams.

I will address these as they come up.

Exploratory Data Analysis

To get a macroscopic picture and get a sense of the data set and its limitations I first perform some exploratory data analysis. This will help me later on to decide which prediction model is best suited.

How does the number of LC students affect the number of courses?

Figure1

Surprisingly, there is essentially no correlation between the number of LC students and the number of available courses.

I think there may be two reasons for this. The first is that mature students and repeat applicants may be influencing the number of courses, although this is difficult to prove with this dataset. The second, more cynical, reason is that colleges are under increasing pressure to procure government funding, which is allocated according to the proportion of students in a college. Thus, by offering a more diverse range of courses, colleges can attract more students and be awarded more funds.

How does the number of LC students affect the average course points?

Figure2

The average course points in a given year seem to correlate strongly with the number of LC students, with an r value of 0.63. It would seem that increased student numbers are driving up the demand for college courses, especially since the financial crash of 2008.
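For readers unfamiliar with the r value quoted here, the sketch below computes a Pearson correlation coefficient by hand. The yearly figures in it are made up purely to show the calculation and are not taken from the dataset; in practice pandas or NumPy would give the same number in one call.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


# Made-up yearly values, for illustration only.
lc_students = [55000, 54000, 52000, 53000, 56000]
avg_course_points = [360, 362, 355, 365, 372]

r = pearson_r(lc_students, avg_course_points)
print(f"r = {r:.2f}")
```

A value near +1 or -1 indicates a strong linear relationship, while a value near 0 (as in the previous figure) indicates essentially none.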

How does the average awarded points affect the average course points?

Figure3

However, if we inspect how the average awarded points of a given year relates to the course points we begin to see the bigger picture emerge.

For starters, the average LC points have risen almost 60 points over the last ten years! Moreover, they have increased almost every year in that timeframe. For a standardised test this is a highly surprising result, as one would expect the figure to vary randomly about some mean value if the test were to be considered at all consistent. I can conceive of a number of reasons for this effect:

  1. Students are getting more intelligent each year.
  2. Students are getting better at gaming the system.
  3. The Leaving Cert exams are getting easier.
  4. The Leaving Cert is being marked easier.

The true cause is probably some combination of the above. Nonetheless, it is a concerning statistic and should be further investigated.

From the above trends it is also clear that after 2008 the average course points are highly correlated with the average LC points with r=0.92.

This insight, in my view, supports a more concerning issue facing Irish third level education.

Before the crash the average awarded points increased every year yet had little impact on the average course points, since the number of courses increased and student numbers dropped drastically. In this instance, demand for college places was relatively low and the course points reflected that.

However, in 2008 there was a sudden increase in student numbers, which from then on has led to the average awarded points becoming the main driver behind the increase in course points, as seen in the last figure.

In effect, despite more available courses than ever before, demand is beginning to outstrip supply; the courses are now saturated with the sheer volume of applicants and the average course points are now being dictated by the awarded points - suggesting the system is currently at capacity.

An optimist would suggest that the system is at least in balance for now, with the awarded points and course points being approximately equal. However, if the two were to ever diverge it would signal a dire state of affairs wherein a significant proportion of students who wish to attend university would not be offered a place, and only the higher achievers would.

What is the distribution of course points?

Figure4

From 2001-2016 the points for all the courses seem reasonably normally distributed, with a mean of 379 and a standard deviation of 103. The courses that require an interview, audition or portfolio sometimes require more than 600 points and can be seen in the tail of the histogram.

Figure5

If we eliminate these courses we get a tighter fitting distribution with the mean declining slightly to 370 and standard deviation 87 with some slight left-skew towards higher points.

What is the distribution of awarded points?

If we compare these distributions with the distribution of points awarded we see a marked difference.

Figure6

While there was slight left-skew in the course points, there is a very strong right-skew in the awarded points with a mean of 340 and a standard deviation 157.

I found it highly surprising that almost 1 in 4 students achieved less than 200 points in their exams. This highlights the intense competition for places at Level 8, as the course points are weighted upward while the awarded points are weighted heavily downward, with the average course points 30 points more than the average awarded points.

What has been the effect of Maths Bonus Points?

For the LC in 2012, 25 bonus points were automatically given to students who sat Honours level maths as an incentive to take the subject. Because of this, some courses are now going to rise artificially, not because of increased demand but because of these bonus points, and it’s impossible to determine exactly which ones will be affected.

What I can do, however, is see how the bonus points affected all the courses on average by performing an auto-regression (AR) on the data. I should then be able to determine if, and to what degree, the bonus points mattered in the grand scheme of things.

If we examine the last 16 years in general we see some general stability in the course points. The intercept of 15.94 and slope of 0.96 mean that, generally speaking, lower points courses will see an increase while higher points courses remain relatively consistent. Here the black line indicates a slope = 1 and the fulcrum, where the two lines cross, lies at 454 points. This lends support to the average increase in course points over the years that we saw earlier.

Figure7
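The fulcrum is simply the point where the fitted line y = slope * x + intercept crosses the line y = x, i.e. x = intercept / (1 - slope). The sketch below shows one way the fit and the fulcrum could be computed; the function names and the toy course data are my own illustration, while the slope and intercept passed to `fulcrum` at the end are the 2001-2016 and 2011-2012 values reported in this post.

```python
def lagged_pairs(points_by_course):
    """Pair each course's points with the same course's points a year later."""
    pairs = []
    for series in points_by_course.values():
        pairs.extend(zip(series[:-1], series[1:]))
    return pairs


def fit_line(xs, ys):
    """Ordinary least-squares fit of ys = slope * xs + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def fulcrum(slope, intercept):
    """Points value where the fitted line crosses y = x (undefined if slope == 1)."""
    if slope == 1:
        raise ValueError("a line with slope 1 never crosses y = x")
    return intercept / (1.0 - slope)


# Toy data: consecutive yearly points for two hypothetical courses.
toy = {"A": [300, 310, 320], "B": [500, 500, 505]}
pairs = lagged_pairs(toy)
slope, intercept = fit_line([p[0] for p in pairs], [p[1] for p in pairs])
print(slope, intercept, fulcrum(slope, intercept))

# Slope and intercept reported for the whole 2001-2016 period.
overall_fulcrum = fulcrum(0.964934, 15.935371)
# Slope and intercept reported for the 2011-2012 interval.
bonus_year_fulcrum = fulcrum(1.062564, -20.002919)
print(round(overall_fulcrum), round(bonus_year_fulcrum))
```

Courses below the fulcrum are predicted to rise and courses above it to fall when the slope is less than 1; the roles swap when the slope is greater than 1.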

As a matter of interest let’s investigate the pre- and post-crash eras in isolation.

Between 2001-2008 we see the fulcrum lies toward the lower half at 342 points, which demonstrates how higher courses in general fell in points during this time.

Figure8

This contrasts with 2008-2016, during which the fulcrum shifted dramatically to almost 570 points. This highlights the marked increase in course points that occurred during these years, particularly at the lower end of the scale.

Figure9

We can therefore tell the general trend of the courses over the years by the slope and the position of the fulcrum relative to the median of 400 points.

Interestingly, there seems to always be a slight pinch in the scatter around 280 points, below which the data flares out and makes it all look sort of rocket-shaped. I can only surmise that this is because the popularity of the relatively small number of sub-280 courses varies more wildly from year to year leading to such erratic points differences.

If we apply the same auto-regression analysis on each yearly interval the following trend emerges:

================================

Interval Slope Intercept Fulcrum
2001-2002 0.958274 12.022932 288
2002-2003 0.962025 12.925657 340
2003-2004 1.006800 -1.049425 154
2004-2005 0.951657 15.142911 313
2005-2006 0.981118 0.164459 9
2006-2007 0.928320 30.025070 419
2007-2008 0.921694 28.744911 367
2008-2009 0.922315 35.030095 451
2009-2010 0.962466 18.565979 495
2010-2011 0.957871 23.662243 561
2011-2012 1.062564 -20.002919 320
2012-2013 0.915665 42.925100 509
2013-2014 0.986426 7.142493 526
2014-2015 0.949977 32.086148 641
2015-2016 0.984857 8.608068 568
2001-2016 0.964934 15.935371 454
2001-2008 0.956779 14.774989 342
2008-2016 0.966250 19.244853 570

================================

For almost all years the slope of the AR fit is less than 1, meaning that, generally speaking, higher courses came down and lower courses came up, with the position of the fulcrum indicating which side responded more.

For instance, in the earlier years the fulcrum was less than 400 indicating that the higher points courses were generally in decline, whereas in the later years the fulcrum has shifted well above 400 indicating that higher course points are more stable while lower points are now rising.

For 2003-2004, while the slope is slightly greater than 1, the intercept is also only slightly less than 0 meaning this line is almost exactly through the origin and course points in general remained relatively stable.

Figure10

However, for 2011-2012, while the slope is also only slightly greater than 1, the intercept is -20 and still relatively large which puts the fulcrum at 320 points. This means that higher points courses will in general see an increase. In fact you can clearly see this occurring in the graph where the majority of points over 475 lie above the black line by as much as 30 points - a clear demonstration of this effect.

Figure11

Remarkably, we have been able to identify the effect of maths bonus points even though in the 2011-2012 period average points didn’t rise all that much (due to a large drop in student numbers). The trend thereafter continues the usual pattern found between 2008-2016 of increasing average points coming primarily from the lower end of the scale.

However, the effect does not seem to me to be sufficiently large across the whole range of points to justify any normalisation of the data to account for this artificial increase. Indeed, it is clear to me that in the period of 2008-2016 the increasing number of students is the primary driver of the points, and affects the entire range, rather than the bonus points, which only affect a small section.

Findings Part 1

  • Course points have generally increased over the past 8 years due to the increasing number of applicants.

  • This is further reflected in the course points vs candidate points which were uncorrelated before 2007 but have since become highly linked with student performance, which is steadily increasing.

  • The increasing candidate points is a concerning trend and cannot simply be attributed to maths bonus points.

  • Maths bonus points did have an effect on higher points courses; however, the demand from students has a much greater effect.

  • The number of places in each course is not taken into account and I would expect this to have an effect.

  • What this indicates is a highly competitive application procedure with universities now at or above capacity despite providing more courses than ever each year.

Course and College Rankings

College Rankings based on average entry points

Below the colleges are ranked according to the average entry points for all available courses from 2001-2016. The colour scheme indicates intervals of 100 points.

Figure12

We see a broad spectrum of points from IADT at 548 to Portobello College at 227. Among the higher end are many arts colleges that may require portfolios, auditions or interviews.

If we omit courses with auditions, we see a largely similar trend.

Figure13

Course Rankings

If we rank the courses based on average entry points since 2001 we see that, over all courses, those pertaining to the entertainment industry such as Music, Film, Television and Theatre have the highest points, presumably due to the extra requirements such as an audition/portfolio/marking processes, notwithstanding the popularity of the career path.

All Courses

=======================================================

# code Name Points std
1 DL832 Animation 918 105
2 DL045 Film and Television Production 883 35
3 DL834 Film and Television Production 880 53
4 CR128 Popular Music Keyboards at CIT Cork School of… 873 159
5 DL049 Design for Stage and Screen Makeup Design 854 71
6 CR129 Popular Music Voice at CIT Cork School of Music 849 44
7 DL048 Design for Stage and Screen Costume Design 826 95
8 DL830 Design for Stage and Screen Makeup Design 787 147
9 CR127 Popular Music Electric Guitar at CIT Cork School 776 104
10 DL042 Photography 758 67
11 LC102 Art and Design 750 51
12 DL826 Visual Communication Design 748 142
13 CR126 Popular Music Drums at CIT Cork School of Music 748 36
14 CR121 Music at CIT Cork School of Music 741 78
15 DL047 Design for Stage and Screen 722 98
16 LC114 Design Fashion Knitwear and Textiles 719 71
17 DL041 Animation 719 77
18 DL833 Photography 718 74
19 CR210 Contemporary Applied Art Ceramics Glass Textile 713 142
20 CR700 Theatre and Drama Studies at CIT Cork School … 707 80

=======================================================
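A ranking of this kind boils down to grouping the yearly points by course code, taking the mean and standard deviation, and sorting. The sketch below illustrates the idea with plain Python; the record layout and the toy courses are hypothetical, and a pandas groupby would achieve the same thing more compactly.

```python
from collections import defaultdict
from statistics import mean, stdev


def rank_courses(records, top=20):
    """Rank courses by mean points; records are (code, name, year, points)."""
    points_by_code = defaultdict(list)
    names = {}
    for code, name, _year, points in records:
        points_by_code[code].append(points)
        names[code] = name

    table = []
    for code, pts in points_by_code.items():
        # stdev needs at least two values; report 0 for single-year courses.
        spread = stdev(pts) if len(pts) > 1 else 0.0
        table.append((code, names[code], mean(pts), spread))

    table.sort(key=lambda row: row[2], reverse=True)
    return table[:top]


# Toy records, for illustration only.
records = [
    ("AA001", "Course A", 2015, 500), ("AA001", "Course A", 2016, 520),
    ("BB002", "Course B", 2015, 400), ("BB002", "Course B", 2016, 380),
    ("CC003", "Course C", 2016, 450),
]
ranking = rank_courses(records)
for position, (code, name, avg, spread) in enumerate(ranking, start=1):
    print(position, code, name, round(avg), round(spread))
```

Note that a course with only one year of data ends up with a spread of 0 under this choice, so a small std is not always a sign of stability.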

Courses without Extra Requirements

If we omit courses with extra requirements the medical courses clearly dominate the rankings between Medicine, Dentistry, Pharmacy and Physiotherapy. Business, Finance and Law contribute almost as heavily.

===============================================

# code Name Points std
1 RC003 Medicine with Leaving Cert Scholarship 589 7
2 DN670 Quantitative Business 585 0
3 TR051 Medicine 575 1
5 CK701 Medicine 570 9
6 TR017 Law and Business 566 12
7 TR018 Law and French 566 10
8 TR020 Law and Political Science 566 9
9 DN002 Medicine 566 11
10 TR052 Dental Science 564 18
11 RC001 Medicine 564 11
12 GY501 Medicine 564 10
13 CK702 Dentistry 563 15
14 LM100 Physiotherapy 561 10
15 DC119 Global Business Canada 558 18
16 DN230 Actuarial and Financial Studies 557 14
17 DN616 Law with French Law BCL 556 18
18 CK703 Pharmacy 556 9
20 TR072 Pharmacy 551 8

===============================================

Highest Points for 2016

=====================================================

# code Name Points
1 CR210 Contemporary Applied Art Ceramics Glass Textile 1000
2 SG244 Fine Art 935
3 DL830 Design for Stage and Screen Makeup Design 900
4 CR129 Popular Music Voice at CIT Cork School of Music 900
5 DL834 Film and Television Production 885
6 DL826 Visual Communication Design 845
7 CR127 Popular Music Electric Guitar at CIT Cork School 830
8 CR128 Popular Music Keyboards at CIT Cork School of… 825
9 DL829 Design for Stage and Screen Costume Design 815
10 CR700 Theatre and Drama Studies at CIT Cork School … 800
11 DL832 Animation 800
12 DT545 Design Visual Communication 775
13 CR126 Popular Music Drums at CIT Cork School of Music 770
14 DT559 Photography 765
15 DT506 Commercial Modern Music 750
16 TR051 Medicine 730
17 GY501 Medicine 6 year and embedded PhD options 723
18 CR220 Fine Art at CIT Crawford College of Art and D… 710
19 CW858 Sport Management and Coaching Options GAA Rugby 700
20 CW038 Art Wexford 700

=====================================================

Highest Points for 2016 without Extra Requirements

=====================================================

# code Name Points
1 TR076 Nanoscience Physics and Chemistry of Advanced… 595
2 DC116 Global Business USA 590
3 TR052 Dental Science 585
4 CK702 Dentistry 585
5 TR017 Law and Business 585
6 DN670 Quantitative Business 585
7 TR020 Law and Political Science 575
8 TR018 Law and French 575
9 TR073 Human Genetics 570
10 CK703 Pharmacy 565
11 TR031 Mathematics 565
12 TR034 Management Science and Information Systems St… 565
13 DN230 Actuarial and Financial Studies 560
14 TR072 Pharmacy 560
15 CK407 Mathematical Sciences 560
16 TR015 Philosophy Political Science Economics and Socio 555
17 LM100 Physiotherapy 555
18 TR035 Theoretical Physics 555
19 DN615 BCL Matrise 555
20 DN440 Biomedical Health and Life Sciences 550

=====================================================

Popularity of Particular Industries

The performance of certain industries also captures the media’s attention as some, particularly construction, are taken as indicators of the confidence in the economy, while others, such as STEM courses, are strongly promoted.

I performed a simple key-word search of all the courses and plotted the time-series of the average points of the courses with that key-word in the course description.

The search isn’t 100% accurate however since searching a generic term like ‘Science’ will exclude courses like ‘Astrophysics’ and ‘Biology’, while conversely, I may be including extra courses that don’t technically belong, like ‘Social Science’. In general though, we’re just interested in a trend so it should work well enough.
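The keyword search described above can be sketched as a case-insensitive substring match followed by a per-year average. The function name, record layout and toy courses below are hypothetical; the example is chosen to reproduce the caveat that ‘Science’ picks up ‘Social Science’ but misses ‘Astrophysics’.

```python
from collections import defaultdict


def keyword_trend(records, keywords):
    """Average points per year for courses whose name contains any keyword.

    records are (course name, year, points); matching is case-insensitive.
    """
    lowered = [keyword.lower() for keyword in keywords]
    points_by_year = defaultdict(list)
    for name, year, points in records:
        if any(keyword in name.lower() for keyword in lowered):
            points_by_year[year].append(points)
    return {year: sum(pts) / len(pts) for year, pts in sorted(points_by_year.items())}


# Toy records, for illustration only.
records = [
    ("Computer Science", 2015, 400),
    ("Computing", 2015, 300),
    ("Social Science", 2015, 350),
    ("Astrophysics", 2015, 500),
    ("Computer Science", 2016, 420),
]

science = keyword_trend(records, ["Science"])   # includes Social Science, misses Astrophysics
tech = keyword_trend(records, ["Comput"])       # the stem matches Computer and Computing
print(science)
print(tech)
```

Using a stem such as ‘Comput’ rather than a full word is what lets a single keyword cover several course titles.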

Construction

Construction, Architecture

Did extremely well during the boom, collapsed with the economy in 2008 when the property bubble burst. Has seen some resurgence lately as the economy improves.

Figure14

Law

Law

Not the trend I was expecting, especially considering the rankings above. It could be that newer, less popular courses involving Law are bringing down the points.

Figure15

Medical

Medicine, Dentistry, Nursing, Physiotherapy, Pharmacy.

The medical field is consistently high. The huge jump is due to the introduction of the HPAT in 2009.

Figure16

Engineering

Engineering

Contrary to construction, Engineering went down during the boom but has since risen due to efforts in promoting STEM courses.

Figure17

Science

Science, Physics, Biology, Chemistry, Geology, Mathematic

Science related courses have exploded in popularity since the crash; the promotion of STEM has been extremely successful here and students obviously see the benefit of a degree in science.

Figure18

Technology

Comput, Information, Digital, Programming

Again, like Engineering, Computer and Tech related courses have seen a resurgence since the recession as Ireland becomes an emerging tech-hub of Europe.

Figure19

Irish

Irish, Gaelic, Celtic

Irish related studies fluctuate a little bit but seem to remain reasonably stable over time.

Figure20

European Languages

French, German, Spanish, Italian

The big 4 European languages are more popular than our own and also seem relatively stable within a 30-point margin.

Figure21

Non-EU Languages

Chinese, Russian, Japanese, Arabic, Indian

Non-EU languages, however, don’t share the same popularity.

Figure22

Performing Arts

Music, Theatre, Drama, Film

The Performing Arts have fared very well through the recession; however, the large jump in points would seem to suggest that an additional component to the exam, such as an audition, portfolio etc., came into effect. Yet the sector appears quite stable.

Figure23

Humanities

Arts, English, Philosophy, Sociology, Politics

This mixed bag of Arts degrees suffers a lot of volatility over the years but seems to have settled down.

Figure24

Corporate

Business, Economics, Accounting, Finance

Business and financial courses have behaved similarly to Science, Technology and Engineering, becoming increasingly popular since the crash due to the lucrative career prospects on offer.

Figure25

A clear trend emerged in the preceding graphs: the onset of the recession caused a dramatic uptake in courses linked with career prospects conventionally considered “safe”, namely Science, Technology, Engineering and Business, while softer subjects such as the Arts and Languages have remained relatively stable throughout the last decade, suggesting their demand has not increased as significantly as the others.

Predicting 2017 Points

In predicting the points for 2017 I have 3 strategies in mind:

  1. Linear Regression: Probably the most naive model as there should not necessarily be a direct causal relationship between the year and the points at all. Still, it may prove to be a useful starting point.

  2. Auto-Regression AR(1): a slightly more robust model for time series that makes more intuitive sense. It dictates that this year’s points will depend on last year’s points, under the assumption of constant mean and variance.

  3. ARIMA(1,1,0): Slightly more sophisticated than AR(1), this will take into account any trend where the points are going up (or down) on average, as we have seen can be the case.
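To make the three strategies concrete, here is a minimal sketch of a one-step-ahead point forecast for each, using plain least squares. It is not the code behind the predictions in this post: a proper analysis would normally use a statistics library, the ARIMA(1,1,0) version here assumes a drift (constant) term and is fitted by simple regression on the differenced series, prediction intervals are omitted, and the sample series is made up.

```python
def ols(xs, ys):
    """Least-squares slope and intercept; falls back to the mean of ys if xs is constant."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        return 0.0, mean_y
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def predict_linear(years, points, target_year):
    """Linear Regression: points as a straight-line function of the year."""
    slope, intercept = ols(years, points)
    return intercept + slope * target_year


def predict_ar1(points):
    """AR(1): next year's points regressed on this year's points."""
    slope, intercept = ols(points[:-1], points[1:])
    return intercept + slope * points[-1]


def predict_arima110(points):
    """ARIMA(1,1,0) with drift: an AR(1) fit on the year-to-year differences."""
    diffs = [later - earlier for earlier, later in zip(points, points[1:])]
    slope, intercept = ols(diffs[:-1], diffs[1:])
    next_diff = intercept + slope * diffs[-1]
    return points[-1] + next_diff


# Made-up points for a hypothetical course, 2009-2016.
years = list(range(2009, 2017))
points = [480, 495, 490, 510, 520, 515, 535, 555]

print("Linear Regression:", round(predict_linear(years, points, 2017)))
print("AR(1):            ", round(predict_ar1(points)))
print("ARIMA(1,1,0):     ", round(predict_arima110(points)))
```

Note how the ARIMA variant works with one fewer value after differencing and one fewer again after lagging, so with only eight yearly points it is fitted on six pairs.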

I used the data from 2001-2016 as a training set in order to determine the best model to predict the points for 2017.

The predictions for the training set can be found in this data file, which includes the 95% prediction intervals. I have only considered courses that have more than 5 data points since 2009.

I decided to only use data from 2009 on, as this pertains to the most recent era, in which student numbers are driving the points. As a result, the prediction errors are quite large given the small amount of data I’m using.

The mean square error (MSE) of each model is summarised below; surprisingly, all models are roughly equally valid. My reasoning for this is that we have already seen how some courses have seen a general increase in points since 2008, which would be well explained by either the Linear Regression or ARIMA(1,1,0) models, while others have remained relatively stable and thus would be more suited to the AR(1) model.

The best model will ultimately depend on the nature of the course itself.

==================

Model MSE
Linear Regression 39.5
AR(1) 36.3
ARIMA(1,1,0) 45.3

==================
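For reference, a mean square error is just the average of the squared differences between predicted and actual values. The sketch below shows the calculation on made-up numbers; it makes no claim about exactly how the figures in the table above were aggregated across courses.

```python
def mean_squared_error(actual, predicted):
    """Average of the squared differences between actual and predicted values."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)


# Made-up actual and predicted points for three hypothetical courses.
actual = [555, 560, 585]
predicted = [550, 565, 575]

mse = mean_squared_error(actual, predicted)
print(mse)
```

Because the errors are squared, a single badly missed course can dominate the figure, which is worth keeping in mind when comparing models on a small number of courses.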

Using each model, I then predicted the points for 2017, including 95% prediction intervals, which can be found here.

To illustrate how each of these models works I will consider the model predictions for my own course, Theoretical Physics in Trinity (TR035).

Case Study : Theoretical Physics TR035

Linear Regression

Prediction 2017 : 566 +/- 43 points

While the linear regression model for Points vs Year fits the data quite well, the equation of the line was always going to be bizarre because it assumes a linear increase in the points forever.

The prediction however is ok with a spread of almost 90 points - fairly wide but not untypical given the model MSE above.

Figure26

AR(1)

Prediction 2017 : 531 +/- 67 points

The AR(1) model does not provide as nice a fit, however, as there is clear evidence of a trend in the data for which AR(1) cannot appropriately account.

Figure27

ARIMA(1,1,0)

Prediction 2017 : 561 +/- 47 points

The ARIMA(1,1,0) model yields almost exactly the same prediction as the Linear Regression. This is not surprising, as both are designed to account for trends in the data.

Figure28

In this instance the Linear Regression and ARIMA(1,1,0) models performed best as the points for Theoretical Physics have generally increased over the years. The simple AR(1) model is not equipped to deal with this moving average and so yielded a worse prediction.

Conclusion

Although the data was by no means exhaustive, I was able to draw some interesting insights from publicly available data:

  1. Course points in general are going up, due to increasing student population and despite a continuous rise in the number of courses. This is putting increasing stress on universities which are now operating at or above capacity.

  2. The distributions of points awarded vs course points highlights the huge competition for places every year.

  3. Maths bonus points have played a small but significant role for particular courses, but in general student demand is the primary driver.

  4. The course rankings reflect what is common knowledge about the popularity of Medical careers and related courses. What was surprising was the decline in Law; I would have considered it one of the “safe bets”.

  5. The relatively small number of datapoints somewhat limits the predictive power of my models but I can still make some decent ballpark estimations with reasonable errors that are very likely to include the actual results. Even though I believe the ARIMA(1,1,0) model is more statistically justified over the simple Linear Regression model, it is hampered by losing two precious datapoints that increase the overall error.

  6. Ultimately, I found the best indicators for predicting the points for a course are, in decreasing importance:

    • Last year’s points.
    • The general trend in that course’s popularity in the last 8 years.
    • The number of LC students applying that year.

Further Questions

I have a few remaining questions that I didn’t get time to code up, or that the data wasn’t extensive enough to answer. In no particular order, they are:

  1. What is the relative performance of newer courses compared to more established ones?
  2. Which courses have suffered a decline in popularity?
  3. What was the biggest jump/fall in points in any given year for a particular course/college?
  4. How are CAO applicants who are not school leavers affecting the points?
  5. What is the gender breakdown of each course/college?

If you have any comments, corrections or suggestions about this project, please feel free to email me.