A Statistical Analysis of the CAO Points System

Aug 16, 2016. Last updated February 2017.

This post has since featured on an episode of the Atlantic 302 Podcast.

Introduction

Every August approximately 70,000 college applicants in Ireland receive their first round of offers for course places through the Central Applications Office (CAO). Each year, in the lead-up to the offers, students, teachers, parents and the media obsess over trying to predict the points for various courses. While some courses remain relatively predictable, most are not and can vary significantly from year to year depending on demand, leaving students guessing as to how many points they need to achieve in the Leaving Cert (LC) in order to get their desired course.

While the CAO (and presumably the media) have at their disposal an extensive internal database, the purpose of this project was to investigate what sort of predictive power could be achieved by analysing the publicly available data on the CAO website. It was also an excuse to familiarise myself with Python pandas, SQL and some statistical techniques I have learned over the past year. I’m not only interested to see how, or if, I could predict the points for 2017, but I also want to see what other insights can be derived from the sparse dataset.

Here you can access my full Course Points Dataset file, the awarded Leaving Cert Points file, and the Python notebook I used to analyse the data. You can also find my 2016 Predictions, made using Linear Regression, Auto-Regression and ARIMA models, which are compared to the actual results. My Predictions for 2017 can also be downloaded. Finally, you can also access full resolution PDFs of all images on my GitHub, which you are free to use provided you cite this blog.
Acquiring and Cleaning the Data

While the points for each course are technically publicly available from the CAO website, there is no central database from which to draw. You have to search for the PDF of all the points for a given year to get your info, which is very cumbersome. I wrote a short web-scraper to acquire every single points-related PDF and php file from the CAO website for every year from 2001 onwards and, incidentally, found many PDFs in subdirectories that weren’t accessible via hyperlink. Now that I had all the points data, as well as the distributions of LC points awarded each year, I wrote another Python script to scrape the relevant data from the PDFs into a comprehensive CSV database (download here) comprising each course, its description, the college and the points for a given year.

Some Caveats: The dataset, however, is not exhaustive for the following reasons.

- The data pertains to final round offers only.
- I’m only focusing on Level 8 courses from 2001-2015.
- There will be individually missing datapoints where:
  - All qualified applicants got a place, in which case no points are listed.
  - A course is no longer running.
  - Some courses may also have changed name or code.
- Mature or graduate courses are excluded since they don’t apply the same way LC students do.
- Trinity’s TR001 course is also excluded since it’s an ensemble of many courses.
- The number of places on a course may change over time and I cannot account for this. In the same vein, points will generally reflect demand relative to the number of places.

I believe we can still draw some meaningful insights from the data that’s left; however, there remain some interesting challenges. How do I deal with the 25 bonus points for Honours maths introduced in 2012? What about courses that require an audition, interview or portfolio? Those points may exceed the maximum awarded points in the LC exams. I will address these as they come up.
Exploratory Data Analysis

To get a macroscopic picture of the data set and a sense of its limitations, I first perform some exploratory data analysis. This will help me later on to decide which prediction model is best suited.

How does the number of LC students affect the number of courses?

Surprisingly, there is essentially no correlation between the number of LC students and the number of available courses. I think there may be two reasons for this. The first is that mature students and repeat applicants may be influencing the number of courses, although this is difficult to prove with this dataset. The second, more cynical, reason is that colleges are under increasing pressure to procure government funding, which is allocated according to the proportion of students in a college. Thus by offering a more diverse range of courses, colleges can attract more students and get awarded more funds.

How does the number of LC students affect the average course points?

The average course points in a given year seem to correlate strongly with the number of LC students, with an r value of 0.63. It would seem that increased student numbers are driving up the demand for college courses, especially since the financial crash of 2008.

How does the average awarded points affect the average course points?

However, if we inspect how the average awarded points of a given year relate to the course points, we begin to see the bigger picture emerge. For starters, the average LC points has risen almost 60 points over the last ten years! Moreover, it has increased almost every year in that timeframe. For a standardised test this is a highly surprising result, as one would expect the figure to vary randomly about some mean value if the test were to be considered at all consistent. I can conceive of a number of reasons for this effect:

- Students are getting more intelligent each year.
- Students are getting better at gaming the system.
- The Leaving Cert exams are getting easier.
- The Leaving Cert is being marked more leniently.

The true cause is probably some combination of the above; nonetheless it is a concerning statistic and should be further investigated.

From the above trends it is also clear that after 2008 the average course points are highly correlated with the average LC points, with r=0.92. This insight, in my view, points to a more concerning issue facing Irish third level education. Before the crash the average awarded points increased every year yet had little impact on the average course points, since the number of courses increased and student numbers drastically dropped. In this instance, demand for college places was relatively low and the course points reflected that. However, in 2008 there was a sudden increase in student numbers, which from then on has led to the average awarded points becoming the main driver behind the increase in course points, as seen in the last figure. In effect, despite more available courses than ever before, demand is beginning to outstrip supply; the courses are now saturated with the sheer volume of applicants and the average course points are now being dictated by the awarded points - suggesting the system is currently at capacity.

An optimist would suggest that the system is at least in balance for now, with the awarded points and course points being approximately equal. However, if the two were to ever diverge it would signal a dire state of affairs wherein a significant proportion of students who wish to attend university would not be offered a place, and only the higher achievers would.

What is the distribution of course points?

From 2001-2016 the points for all the courses seem reasonably normally distributed, with a mean of 379 and a standard deviation of 103. The courses that require an interview, audition or portfolio sometimes require more than 600 points and can be seen in the tail of the histogram.
If we eliminate these courses we get a tighter fitting distribution, with the mean declining slightly to 370 and the standard deviation to 87, and with some slight left-skew towards higher points.

What is the distribution of awarded points?

If we compare these distributions with the distribution of points awarded we see a marked difference. While there was slight left-skew in the course points, there is a very strong right-skew in the awarded points, with a mean of 340 and a standard deviation of 157. I found it highly surprising that almost a quarter of students achieved less than 200 points in their exams. This does nothing but highlight the intense competition for places at Level 8, as the course points are weighted upward while the awarded points are weighted heavily downward, with the average course points 30 points more than the average awarded points.

What has been the effect of Maths Bonus Points?

For the LC in 2012, 25 bonus points were automatically given to students who sat Honours level maths as an incentive to take the subject. Because of this some courses are now going to rise artificially, not because of increased demand but because of these bonus points, and it’s impossible to determine exactly which ones will be affected. What I can do, however, is see how the bonus points affected all the courses on average by performing an auto-regression (AR) on the data, that is, regressing each course’s points in one year on its points in the previous year. I should then be able to determine if, and to what degree, the bonus points mattered in the grand scheme of things.

If we examine the last 16 years in general we see some general stability in the course points. The intercept of 15.94 and slope of 0.96 mean that, generally speaking, lower points courses will see an increase while higher points courses remain relatively consistent. Here the black line indicates a slope of 1 (no change from one year to the next), and the fulcrum, the point where the two lines cross, lies at 454 points. This lends support to the average increase in course points over the years that we saw earlier.
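The fulcrum quoted above can be reproduced from the fitted slope and intercept. The following is a minimal sketch in plain Python, not the code from the notebook: the function names `fit_line` and `fulcrum` are hypothetical, and the toy points are made up purely for illustration. It assumes the fit is an ordinary least-squares line of this year's points against last year's points, in which case the fitted line meets the no-change line y = x at intercept / (1 - slope).

```python
def fit_line(x, y):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def fulcrum(slope, intercept):
    """Points value at which the fitted line crosses the no-change line y = x.

    Solving x = slope * x + intercept gives x = intercept / (1 - slope).
    Undefined when slope == 1, since the lines are then parallel.
    """
    return intercept / (1.0 - slope)


# Toy, made-up pairs of (last year's points, this year's points).
last_year = [300, 350, 400, 450, 500]
this_year = [310, 355, 400, 445, 490]
slope, intercept = fit_line(last_year, this_year)
print(slope, intercept, fulcrum(slope, intercept))

# Coefficients reported in this post for 2001-2016.
print(round(fulcrum(0.964934, 15.935371)))  # 454
```

With a slope below 1, courses below the fulcrum are predicted to rise and courses above it to fall, which is the reading used in the rest of this section.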
As a matter of interest let’s investigate the pre- and post-crash eras in isolation. Between 2001-2008 we see the fulcrum lies toward the lower half at 342 points, which demonstrates how higher courses in general fell in points during this time. This contrasts with 2008-2016, during which the fulcrum shifted dramatically to about 570 points. This highlights the marked increase in course points that occurred during these years, particularly at the lower end of the scale. We can therefore tell the general trend of the courses over the years by the slope and the position of the fulcrum relative to the median of 400 points.

Interestingly, there always seems to be a slight pinch in the scatter around 280 points, below which the data flares out and makes it all look sort of rocket-shaped. I can only surmise that this is because the popularity of the relatively small number of sub-280 courses varies more wildly from year to year, leading to such erratic points differences.

If we apply the same auto-regression analysis on each yearly interval the following trend emerges:

=========================================
Interval    Slope      Intercept  Fulcrum
2001-2002   0.958274   12.022932      288
2002-2003   0.962025   12.925657      340
2003-2004   1.006800   -1.049425      154
2004-2005   0.951657   15.142911      313
2005-2006   0.981118    0.164459        9
2006-2007   0.928320   30.025070      419
2007-2008   0.921694   28.744911      367
2008-2009   0.922315   35.030095      451
2009-2010   0.962466   18.565979      495
2010-2011   0.957871   23.662243      561
2011-2012   1.062564  -20.002919      320
2012-2013   0.915665   42.925100      509
2013-2014   0.986426    7.142493      526
2014-2015   0.949977   32.086148      641
2015-2016   0.984857    8.608068      568
2001-2016   0.964934   15.935371      454
2001-2008   0.956779   14.774989      342
2008-2016   0.966250   19.244853      570
=========================================

For almost all years the slope of the AR fit is less than 1, meaning that, generally speaking, higher courses came down and lower courses came up, with the position of the fulcrum indicating which side responded more.
For instance, in the earlier years the fulcrum was less than 400, indicating that the higher points courses were generally in decline, whereas in the later years the fulcrum has shifted well above 400, indicating that higher course points are more stable while lower points are now rising.

For 2003-2004, while the slope is slightly greater than 1, the intercept is also only slightly less than 0, meaning this line passes almost exactly through the origin and course points in general remained relatively stable. However, for 2011-2012, while the slope is also only slightly greater than 1, the intercept of -20 is still relatively large, which puts the fulcrum at 320 points. This means that higher points courses will in general see an increase. In fact you can clearly see this occurring in the graph, where the majority of points over 475 lie above the black line by as much as 30 points - a clear demonstration of this effect.

Remarkably, we have been able to identify the effect of maths bonus points even though in the 2011-2012 period average points didn’t rise all that much (due to a large drop in student numbers). The trend thereafter continues the usual pattern found between 2008-2016 of increasing average points coming primarily from the lower end of the scale. However, the effect does not seem to me to be sufficiently large across the whole range of points to justify any normalisation of the data to account for this artificial increase. Indeed, it is clear to me that in the period of 2008-2016 the increasing number of students is the primary driver of the points, and affects the entire range, rather than the bonus points, which only affect a small section.

Findings Part 1

- Course points have generally been increasing over the past 8 years due to the increasing number of applicants.
- This is further reflected in the course points vs candidate points, which were uncorrelated before 2007 but have since become highly linked with student performance, which is steadily increasing.
- The increase in candidate points is a concerning trend and cannot simply be attributed to maths bonus points.
- Maths bonus points did have an effect on higher points courses; however, the demand from students is a much greater effect.
- The number of places in each course is not taken into account and I would expect this to have an effect.

What this indicates is a highly competitive application procedure, with universities now at or above capacity despite providing more courses than ever each year.

Course and College Rankings

College Rankings based on average entry points

Below the colleges are ranked according to the average entry points for all available courses from 2001-2016. The colour scheme indicates intervals of 100 points. We see a broad spectrum of points from IADT at 548 to Portobello College at 227. Among the higher end are many arts colleges that may require portfolios, auditions or interviews. If we omit courses with auditions, we see a largely similar trend.

Course Rankings

If we rank the courses based on average entry points since 2001 we see that, over all courses, those pertaining to the entertainment industry such as Music, Film, Television and Theatre have the highest points, presumably due to the extra requirements such as auditions, portfolios and their marking processes, notwithstanding the popularity of the career path.
All Courses

========================================================================
 #  code   Name                                              Points  std
 1  DL832  Animation                                            918  105
 2  DL045  Film and Television Production                       883   35
 3  DL834  Film and Television Production                       880   53
 4  CR128  Popular Music Keyboards at CIT Cork School of…       873  159
 5  DL049  Design for Stage and Screen Makeup Design            854   71
 6  CR129  Popular Music Voice at CIT Cork School of Music      849   44
 7  DL048  Design for Stage and Screen Costume Design           826   95
 8  DL830  Design for Stage and Screen Makeup Design            787  147
 9  CR127  Popular Music Electric Guitar at CIT Cork School     776  104
10  DL042  Photography                                          758   67
11  LC102  Art and Design                                       750   51
12  DL826  Visual Communication Design                          748  142
13  CR126  Popular Music Drums at CIT Cork School of Music      748   36
14  CR121  Music at CIT Cork School of Music                    741   78
15  DL047  Design for Stage and Screen                          722   98
16  LC114  Design Fashion Knitwear and Textiles                 719   71
17  DL041  Animation                                            719   77
18  DL833  Photography                                          718   74
19  CR210  Contemporary Applied Art Ceramics Glass Textile      713  142
20  CR700  Theatre and Drama Studies at CIT Cork School …       707   80
========================================================================

Courses without Extra Requirements

If we omit courses with extra requirements the medical courses clearly dominate the rankings between Medicine, Dentistry, Pharmacy and Physiotherapy. Business, Finance and Law contribute almost as heavily.
==============================================================
 #  code   Name                                    Points  std
 1  RC003  Medicine with Leaving Cert Scholarship     589    7
 2  DN670  Quantitative Business                      585    0
 3  TR051  Medicine                                   575    1
 5  CK701  Medicine                                   570    9
 6  TR017  Law and Business                           566   12
 7  TR018  Law and French                             566   10
 8  TR020  Law and Political Science                  566    9
 9  DN002  Medicine                                   566   11
10  TR052  Dental Science                             564   18
11  RC001  Medicine                                   564   11
12  GY501  Medicine                                   564   10
13  CK702  Dentistry                                  563   15
14  LM100  Physiotherapy                              561   10
15  DC119  Global Business Canada                     558   18
16  DN230  Actuarial and Financial Studies            557   14
17  DN616  Law with French Law BCL                    556   18
18  CK703  Pharmacy                                   556    9
20  TR072  Pharmacy                                   551    8
==============================================================

Highest Points for 2016

===================================================================
 #  code   Name                                              Points
 1  CR210  Contemporary Applied Art Ceramics Glass Textile     1000
 2  SG244  Fine Art                                             935
 3  DL830  Design for Stage and Screen Makeup Design            900
 4  CR129  Popular Music Voice at CIT Cork School of Music      900
 5  DL834  Film and Television Production                       885
 6  DL826  Visual Communication Design                          845
 7  CR127  Popular Music Electric Guitar at CIT Cork School     830
 8  CR128  Popular Music Keyboards at CIT Cork School of…       825
 9  DL829  Design for Stage and Screen Costume Design           815
10  CR700  Theatre and Drama Studies at CIT Cork School …       800
11  DL832  Animation                                            800
12  DT545  Design Visual Communication                          775
13  CR126  Popular Music Drums at CIT Cork School of Music      770
14  DT559  Photography                                          765
15  DT506  Commercial Modern Music                              750
16  TR051  Medicine                                             730
17  GY501  Medicine 6 year and embedded PhD options             723
18  CR220  Fine Art at CIT Crawford College of Art and D…       710
19  CW858  Sport Management and Coaching Options GAA Rugby      700
20  CW038  Art Wexford                                          700
===================================================================

Highest Points for 2016 without Extra Requirements

===================================================================
 #  code   Name                                              Points
 1  TR076  Nanoscience Physics and Chemistry of Advanced…       595
 2  DC116  Global Business USA                                  590
 3  TR052  Dental Science                                       585
 4  CK702  Dentistry                                            585
 5  TR017  Law and Business                                     585
 6  DN670  Quantitative Business                                585
 7  TR020  Law and Political Science                            575
 8  TR018  Law and French                                       575
 9  TR073  Human Genetics                                       570
10  CK703  Pharmacy                                             565
11  TR031  Mathematics                                          565
12  TR034  Management Science and Information Systems St…       565
13  DN230  Actuarial and Financial Studies                      560
14  TR072  Pharmacy                                             560
15  CK407  Mathematical Sciences                                560
16  TR015  Philosophy Political Science Economics and Socio     555
17  LM100  Physiotherapy                                        555
18  TR035  Theoretical Physics                                  555
19  DN615  BCL Matrise                                          555
20  DN440  Biomedical Health and Life Sciences                  550
===================================================================

Popularity of Particular Industries

The performance of certain industries also captures the media’s attention as some, particularly construction, are taken as indicators of the confidence in the economy, while others, such as STEM courses, are strongly promoted. I performed a simple key-word search of all the courses and plotted the time-series of the average points of the courses with that key-word in the course description. The search isn’t 100% accurate however since searching a generic term like ‘Science’ will exclude courses like ‘Astrophysics’ and ‘Biology’, while conversely, I may be including extra courses that don’t technically belong, like ‘Social Science’. In general though, we’re just interested in a trend so it should work well enough.

Construction
Keywords: Construction, Architecture
Did extremely well during the boom, collapsed with the economy in 2008 when the property bubble burst. Has seen some resurgence lately as the economy improves.

Law
Keywords: Law
Not the trend I was expecting, especially considering the rankings above. It could be that newer, less popular courses involving Law are bringing down the points.

Medical
Keywords: Medicine, Dentistry, Nursing, Physiotherapy, Pharmacy
The medical field is always consistently high. The huge jump is the introduction of the HPAT in 2009.
Engineering
Keywords: Engineering
Contrary to construction, Engineering went down during the boom but has since risen due to efforts in promoting STEM courses.

Science
Keywords: Science, Physics, Biology, Chemistry, Geology, Mathematic
Science related courses have exploded in popularity since the crash. The promotion of STEM has been extremely successful here and students obviously see the benefit of a degree in science.

Technology
Keywords: Comput, Information, Digital, Programming
Again, like Engineering, Computer and Tech related courses have seen a resurgence since the recession as Ireland becomes an emerging tech-hub of Europe.

Irish
Keywords: Irish, Gaelic, Celtic
Irish related studies fluctuate a little bit but seem to remain reasonably stable over time.

European Languages
Keywords: French, German, Spanish, Italian
The big 4 European languages are more popular than our own and also seem relatively stable within a 30-point margin.

Non-EU Languages
Keywords: Chinese, Russian, Japanese, Arabic, Indian
Non-EU languages, however, don’t share the same popularity.

Performing Arts
Keywords: Music, Theatre, Drama, Film
The Performing Arts have fared very well through the recession; however, the large jump in points would seem to suggest that an additional component to the exam, such as an audition or portfolio, came into effect. Yet the sector appears quite stable.

Humanities
Keywords: Arts, English, Philosophy, Sociology, Politics
This mixed bag of Arts degrees suffers a lot of volatility over the years but seems to have settled down.

Corporate
Keywords: Business, Economics, Accounting, Finance
Business and financial courses have behaved similarly to Science, Technology and Engineering, becoming increasingly popular since the crash due to the lucrative career prospects on offer.

A clear trend emerged in the preceding graphs: the onset of the recession caused a dramatic uptake in courses linked with career prospects conventionally considered “safe” - namely Science, Technology, Engineering and Business.
Softer subjects such as the Arts and Languages, by contrast, have remained relatively stable throughout the last decade, suggesting their demand has not increased as significantly as the others.

Predicting 2017 Points

In predicting the points for 2017 I have 3 strategies in mind:

- Linear Regression: Probably the most naive model, as there should not necessarily be a direct causal relationship between the year and the points at all. Still, it may prove to be a useful starting point.
- Auto-Regression AR(1): A somewhat more robust model for time-series that makes more intuitive sense. It dictates that this year’s points will depend on what last year’s points were, under the assumption of constant mean and variance.
- ARIMA(1,1,0): Slightly more sophisticated than AR(1), this will take into account any trend where the points are going up (or down) on average, as we have seen can be the case.

I used the data from 2001-2016 as a training set in order to determine the best model to predict the points for 2017. The predictions for the training set can be found in this data file, which includes the 95% prediction intervals, where I have only considered courses that have more than 5 data points since 2009. I decided to only use data from 2009 on as this pertains to the most recent era, where student numbers are driving the points. As such the prediction errors are quite large given the scarce amount of data I’m using.

The mean square error (MSE) of each model is summarised below where, surprisingly, all models are roughly equally valid. My reasoning for this is that we have already seen how some courses have seen a general increase in points since 2008, which would be well explained by either the Linear Regression or ARIMA(1,1,0) models, while others have remained relatively stable and thus would be more suited to the AR(1) model. The best model will ultimately depend on the nature of the course itself.
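To make the three strategies concrete, here is a minimal sketch of a one-step-ahead forecast from each, written in plain Python. It is an illustration under stated assumptions rather than the code from the notebook: the function names are hypothetical, the example series is made up, every model is fitted by simple least squares (a statistics library would typically use maximum likelihood and would also report prediction intervals), and the ARIMA(1,1,0) variant is assumed to include a constant drift term.

```python
def fit_line(x, y):
    """Least-squares fit of y = slope * x + intercept."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    if sxx == 0:
        # Constant predictor: fall back to predicting the mean of y.
        return 0.0, mean_y
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def predict_linear(years, points, target_year):
    """Linear regression of points on calendar year, extrapolated."""
    slope, intercept = fit_line(years, points)
    return slope * target_year + intercept


def predict_ar1(points):
    """AR(1): regress each year's points on the previous year's points."""
    slope, intercept = fit_line(points[:-1], points[1:])
    return slope * points[-1] + intercept


def predict_arima110(points):
    """ARIMA(1,1,0) with drift: an AR(1) fitted to year-on-year changes."""
    diffs = [b - a for a, b in zip(points, points[1:])]
    slope, intercept = fit_line(diffs[:-1], diffs[1:])
    next_diff = slope * diffs[-1] + intercept
    return points[-1] + next_diff


# Made-up points for a hypothetical course, 2009-2016.
years = list(range(2009, 2017))
points = [500, 510, 505, 520, 530, 525, 540, 550]

print("Linear Regression:", round(predict_linear(years, points, 2017)))
print("AR(1):            ", round(predict_ar1(points)))
print("ARIMA(1,1,0):     ", round(predict_arima110(points)))
```

Note that with n yearly values the AR(1) regression has n - 1 pairs to fit, and the differenced model only n - 2, so ARIMA(1,1,0) has the least data to work with on short series like these.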
==========================
Model                 MSE
Linear Regression    39.5
AR(1)                36.3
ARIMA(1,1,0)         45.3
==========================

Using each model, I then predicted the points for 2017, including 95% prediction intervals, which can be found here. To illustrate how each of these models works I will consider the model predictions for my own course, Theoretical Physics in Trinity (TR035).

Case Study: Theoretical Physics TR035

Linear Regression
Prediction 2017: 566 +/- 43 points
While the linear regression model for Points vs Year fits the data quite well, the equation of the line was always going to be bizarre because it assumes a linear increase in the points forever. The prediction however is ok, with a spread of almost 90 points - fairly wide but not untypical given the model MSE above.

AR(1)
Prediction 2017: 531 +/- 67 points
The AR(1) model does not provide as nice a fit, however, as there is clear evidence of a trend in the data for which AR(1) cannot appropriately account.

ARIMA(1,1,0)
Prediction 2017: 561 +/- 47 points
The ARIMA(1,1,0) model yields almost exactly the same prediction as the Linear Regression. This is not surprising as the two are designed to consider trends in the data.

In this instance the Linear Regression and ARIMA(1,1,0) models performed best, as the points for Theoretical Physics have generally increased over the years. The simple AR(1) model assumes a constant mean, so it is not equipped to deal with this upward trend and yielded a worse prediction.

Conclusion

Although the data was by no means exhaustive I was able to draw some interesting insights from publicly available data:

- Course points in general are going up, due to an increasing student population and despite a continuous rise in the number of courses. This is putting increasing stress on universities, which are now operating at or above capacity.
- The distributions of points awarded vs course points highlight the huge competition for places every year.
- Maths bonus points have played a small but significant role for particular courses, but in general student demand is the primary driver.
- The course rankings reflect what is common knowledge about the popularity of Medical careers and related courses.
- What was surprising was the decline in Law; I would have considered it one of the “safe bets”.
- The relatively small number of datapoints somewhat limits the predictive power of my models, but I can still make some decent ballpark estimations with reasonable errors that are very likely to include the actual results.
- Even though I believe the ARIMA(1,1,0) model is more statistically justified than the simple Linear Regression model, it is hampered by losing two precious datapoints, which increases the overall error.

Ultimately, I found the best indicators for predicting the points for a course are, in decreasing importance:

1. Last year’s points.
2. The general trend in that course’s popularity in the last 8 years.
3. The number of LC students applying that year.

Further Questions

I have a few remaining questions that I didn’t get time to code up, or that the data wasn’t extensive enough to answer. In no particular order, they are:

- What is the relative performance of newer courses compared to more established ones?
- Which courses have suffered a decline in popularity?
- What was the biggest jump/fall in points in any given year for a particular course/college?
- How do CAO applicants who are not school leavers affect the points?
- What is the gender breakdown of each course/college?

If there are any comments, corrections or suggestions about this project, please feel free to email me.