Last updated February 2017.
This post has since featured on an episode of the Atlantic 302 Podcast.
Every August, approximately 70,000 college applicants in Ireland receive their first round of offers for course places through the Central Applications Office (CAO). Each year, in the lead-up to the event, students, teachers, parents and the media obsess over trying to predict the points for various courses.
While some courses remain relatively predictable, most are not and can vary significantly from year to year depending on demand, leaving students guessing as to how many points they need to achieve in the Leaving Cert (LC) in order to get their desired course.
While the CAO (and presumably the media) have at their disposal an extensive internal database, the purpose of this project was to investigate what sort of predictive power could be achieved by analysing the publicly available data on the CAO website. It was also an excuse to familiarise myself with Python pandas, SQL and some statistical techniques I have learned over the past year.
I’m not only interested to see how, or if, I could predict the points for 2017 but I also want to see what other insights can be derived from the sparse dataset.
Finally, you can also access full resolution PDFs of all images on my GitHub, which you are free to use provided you cite this blog.
Acquiring and Cleaning the Data
While the points for each course are technically publicly available from the CAO website, there is no central database from which to draw - you have to search for the PDF of all the points for a given year to get your info, which is very cumbersome.
I wrote a short web-scraper to acquire every single points-related PDF and php file from the CAO website for every year after 2001 and, incidentally, found many PDFs in subdirectories that weren’t accessible via hyperlink.
Now that I had all the points data, as well as the distributions of LC points awarded each year, I wrote another Python script to scrape the relevant data from the PDFs into a comprehensive CSV database (download here) comprising each course, its description, the college and the points for a given year.
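The PDF-to-CSV step can be sketched roughly as follows. This is a minimal illustration, not the script used for this project: it assumes the PDF text has already been extracted into plain lines, and the line layout (course code, title, points), the helper names `parse_points_line` and `lines_to_csv`, and the sample rows are all hypothetical.

```python
import csv
import io
import re

# Hypothetical layout: a two-letter, three-digit course code, the title, then
# the points (optionally followed by "*"). Real CAO PDFs vary from year to
# year, so this pattern is an assumption made for illustration.
LINE_RE = re.compile(
    r"^(?P<code>[A-Z]{2}\d{3})\s+(?P<title>.+?)\s+(?P<points>\d{3,4})\*?$"
)


def parse_points_line(line, year):
    """Return a dict for one course line, or None if the line does not match."""
    match = LINE_RE.match(line.strip())
    if match is None:
        return None  # headers, blank lines, or courses with no points listed
    return {
        "year": year,
        "code": match["code"],
        "title": match["title"],
        "points": int(match["points"]),
    }


def lines_to_csv(lines, year):
    """Convert extracted text lines into CSV text with one row per course."""
    parsed = (parse_points_line(line, year) for line in lines)
    rows = [row for row in parsed if row is not None]
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=["year", "code", "title", "points"])
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Made-up sample lines, for illustration only.
sample = [
    "Course Code  Title  Points",
    "AB123  Example Course One  455",
    "AB124  Example Course Two",
    "CD456  Another Example  380*",
]
print(lines_to_csv(sample, 2016))
```

Lines that do not match the pattern are silently skipped, which is also how a course with no listed points would drop out of the dataset in this sketch.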
The dataset however is not exhaustive for the following reasons.
- The data pertains to final round offers only.
- I’m only focusing on Level 8 courses from 2001-2015.
- There will be individually missing datapoints where:
- All qualified applicants got a place, in which case no points are listed.
- A course is no longer running.
- Some courses may also have changed name or code.
- Mature or graduate courses are excluded since they don’t apply the same way LC students do.
- Trinity’s TR001 course is also excluded since it’s an ensemble of many courses.
- The number of places on a course may change over time and I cannot account for this. In the same vein, points will generally reflect ‘demand relative to number of places’.
I believe we can still draw some meaningful insights from the data that’s left; however, there remain some interesting challenges.
- How do I deal with the 25 bonus points for Honours maths introduced in 2012?
- What about courses that require an audition, interview or portfolio? Those points may exceed the maximum awarded points in the LC exams.
I will address these as they come up.
Exploratory Data Analysis
To get a macroscopic picture and get a sense of the data set and its limitations I first perform some exploratory data analysis. This will help me later on to decide which prediction model is best suited.
How does the number of LC students affect the number of courses?
Surprisingly, there is essentially no correlation between the number of LC students and the number of available courses.
I think there may be two reasons for this. The first is that mature students and repeat applicants may be influencing the number of courses although this is difficult to prove with this dataset. The second, more cynical, reason is that colleges are under increasing demand to procure government funding which is allocated according to the proportion of students in a college. Thus by offering a more diverse range of courses, colleges can attract more students and get awarded more funds.
How does the number of LC students affect the average course points?
The average course points in a given year seem to strongly correlate with the number of LC students, with an r value of 0.63. It would seem that increased student numbers are driving up the demand for college courses, especially since the financial crash of 2008.
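For readers who want to reproduce this kind of r value, here is a minimal, dependency-free sketch of the Pearson correlation coefficient. The yearly figures below are made up purely for illustration and are not the CAO data; with pandas, the same number would typically come from `Series.corr`.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length, at least 2 long")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Made-up yearly figures, for illustration only (not the CAO data).
lc_students = [55000, 54000, 52000, 53000, 56000, 58000]
mean_course_points = [360, 362, 358, 365, 372, 380]
print(round(pearson_r(lc_students, mean_course_points), 2))
```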
How does the average awarded points affect the average course points?
However, if we inspect how the average awarded points of a given year relates to the course points we begin to see the bigger picture emerge.
For starters, the average LC points has risen almost 60 points over the last ten years! Moreover it has increased almost every year in that timeframe. For a standardised test this is a highly surprising result as one would expect the figure to vary randomly about some mean value if the test were to be considered at all consistent. I can conceive of a number of reasons for this effect:
- Students are getting more intelligent each year.
- Students are getting better at gaming the system.
- The Leaving Cert exams are getting easier.
- The Leaving Cert is being marked more leniently.
The true cause is probably some combination of the above; nonetheless, it is a concerning statistic and should be investigated further.
From the above trends it is also clear that after 2008 the average course points are highly correlated with the average LC points with r=0.92.
This insight, in my view, supports a more concerning issue facing Irish third level education.
Before the crash the average awarded points increased every year yet had little impact on the average course points since the number of courses increased and student numbers drastically dropped. In this instance, demand for college places was relatively low and the course points reflected that.
However, in 2008 there was a sudden increase in student numbers which from then on has led to the average awarded points becoming the main driver behind the increase in course points, as seen in the last figure.
In effect, despite more available courses than ever before, demand is beginning to outstrip supply; the courses are now saturated with the sheer volume of applicants and the average course points are now being dictated by the awarded points - suggesting the system is currently at capacity.
An optimist would suggest that the system is at least in balance for now, with the awarded points and course points being approximately equal. However, if the two were to ever diverge it would signal a dire state of affairs wherein a significant proportion of students who wish to attend university would not be offered a place, and only the higher achievers would.
What is the distribution of course points?
From 2001-2016 the distribution of points for all the courses seems reasonably normally distributed with a mean of 379 with a standard deviation of 103. The courses that require an interview, audition or portfolio sometimes require more than 600 points and can be seen in the tail of the histogram.
If we eliminate these courses we get a tighter fitting distribution with the mean declining slightly to 370 and standard deviation 87 with some slight left-skew towards higher points.
What is the distribution of awarded points?
If we cross-examine these distributions with the distribution of points awarded we see a marked distinction between the distributions.
While there was slight left-skew in the course points, there is a very strong right-skew in the awarded points with a mean of 340 and a standard deviation 157.
I found it highly surprising that almost 1 in 4 students achieved fewer than 200 points in their exams. This highlights the intense competition for places at Level 8, as the course points are weighted upward while the awarded points are weighted heavily downward, with the average course points 30 points more than the average awarded points.
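The mean, standard deviation and direction of skew quoted for these distributions can be computed along the following lines. This is a minimal sketch using population moments; the `describe` helper and the sample values are made up for illustration. A positive skewness means the tail stretches towards high values (right-skew), a negative one means the tail stretches towards low values (left-skew).

```python
import math


def describe(values):
    """Return (mean, standard deviation, skewness) using population moments."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n  # second central moment
    m3 = sum((v - mean) ** 3 for v in values) / n  # third central moment
    std = math.sqrt(m2)
    # Skewness is undefined for constant data; report 0.0 in that case.
    skew = m3 / m2 ** 1.5 if m2 > 0 else 0.0
    return mean, std, skew


# Made-up points values, for illustration only.
bulk_low = [100, 150, 180, 200, 220, 250, 300, 450, 600]
mean, std, skew = describe(bulk_low)
print(round(mean), round(std), "right-skew" if skew > 0 else "left-skew")
```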
What has been the effect of Maths Bonus Points?
For the LC in 2012, 25 bonus points were automatically given to students who sat Honours level maths as an incentive to take the subject. Because of this, the points for some courses are now going to rise artificially, not because of increased demand but because of these bonus points, and it’s impossible to determine exactly which ones will be affected.
What I can do, however, is see how the bonus points affected all the courses on average by performing an auto-regression (AR) on the data. I should then be able to determine if, and to what degree, the bonus points mattered in the grand scheme of things.
If we examine the last 16 years as a whole we see some general stability in the course points. The intercept of 15.94 and slope of 0.96 mean that, generally speaking, lower points courses will see an increase while higher points courses remain relatively consistent. Here the black line indicates a slope = 1 and the fulcrum of the two lines lies at 454 points. This lends support to the average increase in course points over the years that we saw earlier.
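To make the "fulcrum" concrete: if this year's points are fitted as intercept + slope × last year's points, the fitted line crosses the slope = 1 line where points = intercept / (1 - slope). The sketch below fits such a line by ordinary least squares and computes that crossing point. The helper names and sample data are my own, for illustration. The fulcrum is very sensitive to rounding when the slope is close to 1: the rounded values 15.94 and 0.96 give roughly 398, while a slope nearer 0.965 gives about 454, so the quoted figure presumably comes from the unrounded fit.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of ys = intercept + slope * xs."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def fulcrum(intercept, slope):
    """Points value where the fitted line crosses the slope = 1 line."""
    if slope == 1:
        raise ValueError("line is parallel to slope = 1: no crossing point")
    return intercept / (1 - slope)


# Made-up (last year, this year) points for a handful of courses.
last_year = [300, 350, 400, 450, 500, 550]
this_year = [310, 355, 405, 450, 498, 545]
intercept, slope = fit_line(last_year, this_year)
print(round(intercept, 2), round(slope, 3), round(fulcrum(intercept, slope)))

# Fulcrum implied by the rounded coefficients quoted in the text.
print(round(fulcrum(15.94, 0.96), 1))
```

Courses below the fulcrum are predicted to rise and courses above it to fall when the slope is less than 1; the roles reverse when the slope is greater than 1.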
As a matter of interest let’s investigate the pre- and post-crash eras in isolation.
Between 2001-2008 we see the fulcrum lies toward the lower half at 341 points and demonstrates how higher courses in general fell in points during this time.
This contrasts with 2008-2016, during which the fulcrum has shifted dramatically to almost 570 points. This highlights the marked increase in course points that occurred during these years, particularly at the lower end of the scale.
We can therefore tell the general trend of the courses over the years by the slope and the position of the fulcrum relative to the median of 400 points.
Interestingly, there seems to always be a slight pinch in the scatter around 280 points, below which the data flares out and makes it all look sort of rocket-shaped. I can only surmise that this is because the popularity of the relatively small number of sub-280 courses varies more wildly from year to year leading to such erratic points differences.
If we apply the same auto-regression analysis on each yearly interval the following trend emerges:
For almost all years the slope of the AR fit is less than 1, meaning that, generally speaking, higher courses came down and lower courses came up, with the position of the fulcrum indicating which side responded more.
For instance, in the earlier years the fulcrum was less than 400 indicating that the higher points courses were generally in decline, whereas in the later years the fulcrum has shifted well above 400 indicating that higher course points are more stable while lower points are now rising.
For 2003-2004, while the slope is slightly greater than 1, the intercept is also only slightly less than 0 meaning this line is almost exactly through the origin and course points in general remained relatively stable.
However, for 2011-2012, while the slope is also only slightly greater than 1, the intercept is -20 and still relatively large which puts the fulcrum at 320 points. This means that higher points courses will in general see an increase. In fact you can clearly see this occurring in the graph where the majority of points over 475 lie above the black line by as much as 30 points - a clear demonstration of this effect.
Remarkably, we have been able to identify the effect of maths bonus points even though in the 2011-2012 period average points didn’t rise all that much (due to a large drop in student numbers). The trend thereafter continues the usual pattern found between 2008-2016 of increasing average points coming primarily from the lower end of the scale.
However, the effect does not seem to me to be large enough across the whole range of points to justify any normalisation of the data to account for this artificial increase. Indeed, it is clear to me that in the period of 2008-2016 the increasing number of students is the primary driver of the points, and affects the entire range, rather than the bonus points which only affect a small section.
Findings Part 1
Course points have generally increased over the past 8 years due to the increasing number of applicants.
This is further reflected in the course points vs candidate points which were uncorrelated before 2007 but have since become highly linked with student performance, which is steadily increasing.
The increase in candidate points is a concerning trend and cannot simply be attributed to maths bonus points.
Maths bonus points did have an effect on higher points courses; however, demand from students has a much greater effect.
The number of places in each course is not taken into account and I would expect this to have an effect.
What this indicates is a highly competitive application procedure with universities now at or above capacity despite providing more courses than ever each year.
Course and College Rankings
College Rankings based on average entry points
Below the colleges are ranked according to the average entry points for all available courses from 2001-2016. The colour scheme indicates intervals of 100 points.
We see a broad spectrum of points from IADT at 548 to Portobello College at 227. Among the higher end are many arts colleges that may require portfolios, auditions or interviews.
If we omit courses with auditions, we see a largely similar trend.
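A ranking like this amounts to grouping the records by college and averaging the points. The sketch below shows one way to do that, with a switch for leaving out courses that have extra requirements. The record layout, the `extra_requirements` flag and the sample colleges are hypothetical and are not taken from the actual CSV.

```python
from collections import defaultdict


def rank_colleges(records, include_extra=True):
    """Rank colleges by mean entry points, highest first."""
    totals = defaultdict(lambda: [0, 0])  # college -> [sum of points, count]
    for rec in records:
        if not include_extra and rec["extra_requirements"]:
            continue  # skip audition/interview/portfolio courses
        totals[rec["college"]][0] += rec["points"]
        totals[rec["college"]][1] += 1
    means = {college: total / count for college, (total, count) in totals.items()}
    return sorted(means.items(), key=lambda item: item[1], reverse=True)


# Made-up records, for illustration only.
sample_records = [
    {"college": "College A", "points": 850, "extra_requirements": True},
    {"college": "College A", "points": 350, "extra_requirements": False},
    {"college": "College B", "points": 500, "extra_requirements": False},
    {"college": "College B", "points": 460, "extra_requirements": False},
]
print(rank_colleges(sample_records))
print(rank_colleges(sample_records, include_extra=False))
```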
If we rank the courses based on average entry points since 2001 we see that, over all courses, those pertaining to the entertainment industry such as Music, Film, Television and Theatre have the highest points, presumably due to the extra requirements such as an audition/portfolio/marking processes, notwithstanding the popularity of the career path.
|2||DL045||Film and Television Production||883||35|
|3||DL834||Film and Television Production||880||53|
|4||CR128||Popular Music Keyboards at CIT Cork School of…||873||159|
|5||DL049||Design for Stage and Screen Makeup Design||854||71|
|6||CR129||Popular Music Voice at CIT Cork School of Music||849||44|
|7||DL048||Design for Stage and Screen Costume Design||826||95|
|8||DL830||Design for Stage and Screen Makeup Design||787||147|
|9||CR127||Popular Music Electric Guitar at CIT Cork School||776||104|
|11||LC102||Art and Design||750||51|
|12||DL826||Visual Communication Design||748||142|
|13||CR126||Popular Music Drums at CIT Cork School of Music||748||36|
|14||CR121||Music at CIT Cork School of Music||741||78|
|15||DL047||Design for Stage and Screen||722||98|
|16||LC114||Design Fashion Knitwear and Textiles||719||71|
|19||CR210||Contemporary Applied Art Ceramics Glass Textile||713||142|
|20||CR700||Theatre and Drama Studies at CIT Cork School …||707||80|
Courses without Extra Requirements
If we omit courses with extra requirements the medical courses clearly dominate the rankings between Medicine, Dentistry, Pharmacy and Physiotherapy. Business, Finance and Law contribute almost as heavily.
|1||RC003||Medicine with Leaving Cert Scholarship||589||7|
|6||TR017||Law and Business||566||12|
|7||TR018||Law and French||566||10|
|8||TR020||Law and Political Science||566||9|
|15||DC119||Global Business Canada||558||18|
|16||DN230||Actuarial and Financial Studies||557||14|
|17||DN616||Law with French Law BCL||556||18|
Highest Points for 2016
|1||CR210||Contemporary Applied Art Ceramics Glass Textile||1000|
|3||DL830||Design for Stage and Screen Makeup Design||900|
|4||CR129||Popular Music Voice at CIT Cork School of Music||900|
|5||DL834||Film and Television Production||885|
|6||DL826||Visual Communication Design||845|
|7||CR127||Popular Music Electric Guitar at CIT Cork School||830|
|8||CR128||Popular Music Keyboards at CIT Cork School of…||825|
|9||DL829||Design for Stage and Screen Costume Design||815|
|10||CR700||Theatre and Drama Studies at CIT Cork School …||800|
|12||DT545||Design Visual Communication||775|
|13||CR126||Popular Music Drums at CIT Cork School of Music||770|
|15||DT506||Commercial Modern Music||750|
|17||GY501||Medicine 6 year and embedded PhD options||723|
|18||CR220||Fine Art at CIT Crawford College of Art and D…||710|
|19||CW858||Sport Management and Coaching Options GAA Rugby||700|
Highest Points for 2016 without Extra Requirements
|1||TR076||Nanoscience Physics and Chemistry of Advanced…||595|
|2||DC116||Global Business USA||590|
|5||TR017||Law and Business||585|
|7||TR020||Law and Political Science||575|
|8||TR018||Law and French||575|
|12||TR034||Management Science and Information Systems St…||565|
|13||DN230||Actuarial and Financial Studies||560|
|16||TR015||Philosophy Political Science Economics and Socio||555|
|20||DN440||Biomedical Health and Life Sciences||550|
Popularity of Particular Industries
The performance of certain industries also captures the media’s attention as some, particularly construction, are taken as indicators of the confidence in the economy, while others, such as STEM courses, are strongly promoted.
I performed a simple key-word search of all the courses and plotted the time-series of the average points of the courses with that key-word in the course description.
The search isn’t 100% accurate however since searching a generic term like ‘Science’ will exclude courses like ‘Astrophysics’ and ‘Biology’, while conversely, I may be including extra courses that don’t technically belong, like ‘Social Science’. In general though, we’re just interested in a trend so it should work well enough.
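The keyword search and yearly averaging can be sketched as below. This is a minimal illustration with a hypothetical record layout and made-up course records. It uses case-insensitive substring matching, which reproduces the caveat just described: ‘Science’ picks up ‘Social Science’ but misses ‘Astrophysics’.

```python
from collections import defaultdict


def keyword_trend(records, keywords):
    """Mean points per year for courses whose title contains any keyword."""
    lowered = [keyword.lower() for keyword in keywords]
    by_year = defaultdict(list)
    for rec in records:
        title = rec["title"].lower()
        if any(keyword in title for keyword in lowered):
            by_year[rec["year"]].append(rec["points"])
    return {year: sum(pts) / len(pts) for year, pts in sorted(by_year.items())}


# Made-up records, for illustration only.
sample_records = [
    {"year": 2015, "title": "Science", "points": 500},
    {"year": 2015, "title": "Social Science", "points": 400},
    {"year": 2015, "title": "Astrophysics", "points": 550},
    {"year": 2016, "title": "Computer Science", "points": 450},
]
print(keyword_trend(sample_records, ["Science"]))
print(keyword_trend(sample_records, ["Comput"]))
```

A word stem such as ‘Comput’ matches both ‘Computer’ and ‘Computing’, at the cost of occasionally catching unrelated titles.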
Construction-related courses did extremely well during the boom, then collapsed with the economy in 2008 when the property bubble burst. They have seen some resurgence lately as the economy improves.
Not the trend I was expecting, especially considering the rankings above. It could be that newer, less popular courses involving Law are bringing down the points.
Medicine, Dentistry, Nursing, Physiotherapy, Pharmacy.
The medical field is always consistently high. The huge jump is the introduction of the HPAT in 2009.
Contrary to construction, Engineering went down during the boom but has since risen due to efforts in promoting STEM courses.
Science, Physics, Biology, Chemistry, Geology, Mathematic
Science related courses have exploded in popularity since the crash. The promotion of STEM has been extremely successful here and students obviously see the benefit of a degree in science.
Comput, Information, Digital, Programming
Again, like Engineering, Computer and Tech related courses have seen a resurgence since the recession as Ireland becomes an emerging tech-hub of Europe.
Irish, Gaelic, Celtic
Irish related studies fluctuate a little bit but seem to remain reasonably stable over time.
French, German, Spanish, Italian
The big 4 European languages are more popular than our own and also seem relatively stable within a 30-point margin.
Chinese, Russian, Japanese, Arabic, Indian
Non-EU languages, however, don’t share the same popularity.
Music, Theatre, Drama, Film
The Performing Arts have fared very well through the recession; however, the large jump in points would seem to suggest that an additional component to the exam, such as an audition or portfolio, came into effect. Otherwise the sector appears quite stable.
Arts, English, Philosophy, Sociology, Politics
This mixed bag of Arts degrees suffers a lot of volatility over the years but seems to have settled down.
Business, Economics, Accounting, Finance
Business and finance courses have behaved similarly to Science, Technology and Engineering, becoming increasingly popular since the crash due to the lucrative career prospects on offer.
A clear trend emerged in the preceding graphs: the onset of the recession caused a dramatic uptake in courses linked with career prospects conventionally considered “safe” - namely Science, Technology, Engineering and Business. Meanwhile, softer subjects such as the Arts and Languages have remained relatively stable throughout the last decade, suggesting their demand has not increased as significantly as the others.
Predicting 2017 Points
In predicting the points for 2017 I have 3 strategies in mind:
Linear Regression: Probably the most naive model as there should not necessarily be a direct causal relationship between the year and the points at all. Still, it may prove to be a useful starting point.
Auto-Regression AR(1): a somewhat more robust model for time-series that makes more intuitive sense. It dictates that this year’s points will depend on what last year’s points were, under the assumption of constant mean and variance.
ARIMA(1,1,0): Slightly more sophisticated than AR(1), this will take into account any trend where the points are going up (or down) on average, as we have seen can be the case.
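The three strategies can be sketched with nothing more than a least squares line fit. This is a simplified illustration and not the code used for the predictions here: the AR(1) and ARIMA(1,1,0) fits below use conditional least squares, whereas a statistics library would normally use maximum likelihood, and including a drift constant in the ARIMA sketch is my own assumption. The points series is made up.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of ys = intercept + slope * xs."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        return mean_y, 0.0  # no variation in xs: fall back to the mean
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def predict_linear(years, points, target_year):
    """Linear regression of points on year, extrapolated to target_year."""
    intercept, slope = fit_line(years, points)
    return intercept + slope * target_year


def predict_ar1(points):
    """AR(1): regress each year's points on the previous year's points."""
    intercept, phi = fit_line(points[:-1], points[1:])
    return intercept + phi * points[-1]


def predict_arima110(points):
    """ARIMA(1,1,0) with drift: an AR(1) fitted to year-on-year differences."""
    diffs = [later - earlier for earlier, later in zip(points, points[1:])]
    # n points give n - 1 differences and only n - 2 regression pairs.
    drift, phi = fit_line(diffs[:-1], diffs[1:])
    return points[-1] + drift + phi * diffs[-1]


# Made-up points history for a single course, for illustration only.
years = list(range(2009, 2017))
points = [480, 495, 490, 510, 525, 520, 540, 555]
print(round(predict_linear(years, points, 2017)))
print(round(predict_ar1(points)))
print(round(predict_arima110(points)))
```

Differencing is what lets the ARIMA sketch follow a trend, and it is also why that model has fewer data points to fit than the other two.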
I used the data from 2001-2016 as a training set in order to determine the best model to predict the points for 2017.
The predictions for the training set can be found in this data file, which includes the 95% prediction errors, where I have only considered courses that have more than 5 data points since 2009.
I decided to only use data from 2009 on as this pertains to the most recent era, where student numbers are driving the points. As such, the prediction errors are quite large given the small amount of data I’m using.
The mean square error (MSE) of each model is summarised below where, surprisingly, all models are roughly equally valid. My reasoning for this is that we have already seen how some courses have seen a general increase in points since 2008, which would be well explained by either the Linear Regression or ARIMA(1,1,0) models, while others have remained relatively stable and thus would be more suited to the AR(1) model.
The best model will ultimately depend on the nature of the course itself.
Using each model, I then predicted the points for 2017, including 95% prediction intervals, which can be found here.
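For reference, here is a minimal sketch of how an MSE and a rough 95% interval can be computed from a model's predictions. The interval uses a plain normal approximation (1.96 standard errors), which is an assumption on my part for illustration. With so few data points, a proper prediction interval based on the t-distribution would be somewhat wider. The numbers are made up.

```python
import math


def mean_squared_error(actual, predicted):
    """Average squared difference between actual and predicted points."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("need two non-empty sequences of equal length")
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)


def rough_interval(prediction, mse, z=1.96):
    """Approximate 95% interval, assuming normally distributed errors."""
    half_width = z * math.sqrt(mse)
    return prediction - half_width, prediction + half_width


# Made-up actual and fitted points for one course, for illustration only.
actual = [495, 490, 510, 525, 520, 540, 555]
fitted = [490, 500, 505, 515, 530, 535, 550]
mse = mean_squared_error(actual, fitted)
low, high = rough_interval(565, mse)
print(round(mse, 1), round(low), round(high))
```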
To illustrate how each of these models work I will consider the model predictions for my own course Theoretical Physics in Trinity (TR035).
Case Study : Theoretical Physics TR035
Prediction 2017 : 566 +/- 43 points
While the linear regression model for Points vs Year fits the data quite well, the equation of the line was always going to be bizarre because it assumes a linear increase in the points forever.
The prediction however is ok with a spread of almost 90 points - fairly wide but not untypical given the model MSE above.
Prediction 2017 : 531 +/- 67 points
The AR(1) model does not provide as nice a fit, however, as there is clear evidence of a trend in the data for which AR(1) cannot appropriately account.
Prediction 2017 : 561 +/- 47 points
The ARIMA(1,1,0) model yields almost exactly the same prediction as the Linear Regression. This is not surprising as the two are designed to consider trends in the data.
In this instance the Linear Regression and ARIMA(1,1,0) models performed best as the points for Theoretical Physics have generally increased over the years. The simple AR(1) model is not equipped to deal with this moving average and so yielded a worse prediction.
Although the data was by no means exhaustive I was able to draw some interesting insights from publicly available data:
Course points in general are going up, due to increasing student population and despite a continuous rise in the number of courses. This is putting increasing stress on universities which are now operating at or above capacity.
The distributions of points awarded vs course points highlights the huge competition for places every year.
Maths bonus points have played a small but significant role for particular courses but in general student demand is the primary driver.
The course rankings reflect what is common knowledge about the popularity for Medical careers and related courses. What was surprising was the decline in Law; I would have considered it one of the “safe bets”.
The relatively small number of datapoints somewhat limits the predictive power of my models but I can still make some decent ballpark estimations with reasonable errors that are very likely to include the actual results. Even though I believe the ARIMA(1,1,0) model is more statistically justified over the simple Linear Regression model, it is hampered by losing two precious datapoints that increase the overall error.
Ultimately, I found the best indicators for predicting the points for a course are, in decreasing importance:
- Last year’s points.
- The general trend in that course’s popularity in the last 8 years.
- The number of LC students applying that year.
I have a few remaining questions that I didn’t get time to code up, or the data wasn’t extensive enough to answer for me. In no particular order, they are
- What is the relative performance of newer courses compared to more established ones?
- Which courses have suffered a decline in popularity?
- What was the biggest jump/fall in points in any given year for a particular course/college?
- How are CAO applicants who are not school leavers affecting the points?
- What is the gender breakdown of each course/college?
If there are any comments, corrections or suggestions about this project, please feel free to email me.