
HIGH SCHOOL ENROLLMENT PROJECTIONS

Time Series, Linear Regression, Ridge Regularization, ARIMA, Hybrid Model, Time Series Cross-Validation

CONTEXT AND OBJECTIVE

Forecasting the number of students who will enroll in high schools 15 years from now is crucial for the government. High school attendance is continually increasing, necessitating the regular construction of new schools to accommodate the demand. Since building a new school takes 10 to 15 years in Switzerland, from the project phase to actual construction, enrollment forecasts up to 15 years in the future are essential for effective planning.

Based on historical enrollment data and a standard statistical approach, reliable forecasts could only be generated for the next 10 years, i.e., up to 2033 at the time the project was carried out.

The goal of this project was to develop a machine learning model capable of providing projections up to 2040, thereby filling the gap in long-term forecasts.

 

WHAT WAS DONE

Using Python’s statsmodels and scikit-learn packages, the first step was to identify a model capable of learning the general trend from the historical data and from the forecasts provided by the reference statistical method. Several linear regression variants were tested, and the best results were obtained with linear and quadratic functions of the time step as features. Ridge regularization was added to limit overfitting.
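As an illustration of this first step, the following is a minimal sketch of a ridge regression on linear and quadratic time-step features. The data are synthetic placeholders, not the project's enrollment figures, and the regularization strength `alpha=1.0`, the feature scaling, and the 17-step horizon are assumptions chosen for the example rather than the settings used in the project.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Synthetic placeholder series (NOT the project's data): 30 yearly values
rng = np.random.default_rng(0)
n_years = 30
t = np.arange(n_years).reshape(-1, 1)  # time steps 0, 1, 2, ...
enrollment = (
    5000 + 60 * t.ravel() + 1.5 * t.ravel() ** 2 + rng.normal(0, 40, n_years)
)

# Trend model: features t and t^2, standardized, then ridge regression
trend_model = make_pipeline(
    PolynomialFeatures(degree=2, include_bias=False),
    StandardScaler(),
    Ridge(alpha=1.0),  # alpha is an assumed value, to be tuned in practice
)
trend_model.fit(t, enrollment)

# Extrapolate the trend over an assumed horizon of 17 future time steps
future_t = np.arange(n_years, n_years + 17).reshape(-1, 1)
trend_forecast = trend_model.predict(future_t)
```

Standardizing the features before the ridge penalty keeps the penalty from acting very differently on `t` and `t^2`, whose raw scales differ by orders of magnitude.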

In the second step, an ARIMA model was trained on the residuals of the regression, resulting in a final hybrid model whose forecast is the sum of the regression and ARIMA predictions.

The hybrid model’s ability to generalize and provide reliable predictions was assessed using time series cross-validation (TSCV).
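The idea behind TSCV can be sketched with scikit-learn's `TimeSeriesSplit`, which always trains on past observations and tests on the ones that follow. This example uses synthetic data and, for brevity, evaluates only a ridge trend model; the number of splits, the test window of 5 steps, and the error metric are assumptions, and any forecaster (including a hybrid) could be fitted inside the loop instead.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import TimeSeriesSplit

# Synthetic placeholder series (NOT the project's data)
rng = np.random.default_rng(2)
n = 60
t = np.arange(n, dtype=float)
y = 5000 + 60 * t + 1.5 * t**2 + rng.normal(0, 40, n)
X = np.column_stack([t, t**2])  # linear and quadratic time-step features

# Expanding-window splits: each fold tests on the 5 steps after its training set
tscv = TimeSeriesSplit(n_splits=5, test_size=5)

fold_mae = []
for train_idx, test_idx in tscv.split(X):
    # Temporal order is preserved: no future data leaks into training
    assert train_idx.max() < test_idx.min()
    model = Ridge(alpha=1.0).fit(X[train_idx], y[train_idx])
    predictions = model.predict(X[test_idx])
    fold_mae.append(mean_absolute_error(y[test_idx], predictions))

mean_mae = float(np.mean(fold_mae))
```

Unlike shuffled k-fold cross-validation, every fold here mimics the real forecasting situation, in which the model only ever sees data that precede the period it predicts.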

 

The Hybrid Model

The training data consisted of the history of past enrollments (black), along with short- (green), mid- (yellow), and long-term (red) forecasts generated by the reference method based on classical statistics. The long-term predictions included three different scenarios. The hybrid model’s role was to extend these three scenarios beyond 2033 using the training data. The model combined a linear regression, which captures the trend, and an ARIMA model that learned the year-by-year fluctuations.
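One way to write this decomposition, as a sketch under assumptions: $y_t$ is the enrollment at time step $t$, the trend is the linear-plus-quadratic regression, and the fluctuations $r_t$ follow a second-order AR and MA process without differencing. The symbols and the absence of differencing are assumptions made for illustration.

```latex
\begin{aligned}
y_t &= \underbrace{\beta_0 + \beta_1 t + \beta_2 t^2}_{\text{trend (ridge regression)}} + r_t \\
r_t &= \phi_1 r_{t-1} + \phi_2 r_{t-2} + \varepsilon_t + \theta_1 \varepsilon_{t-1} + \theta_2 \varepsilon_{t-2}
\end{aligned}
```

Here $\varepsilon_t$ is white noise, and the hybrid forecast for a future step is the sum of the extrapolated trend and the ARIMA forecast of $r_t$.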

Building the ARIMA Model

Autocorrelation and partial autocorrelation plots were used to choose the ARIMA hyperparameters. Both plots showed a significant correlation at a time lag of 2. Additional tests confirmed that second-order autoregressive (AR) and moving average (MA) components gave the best results.