


Luis Rei
Institut Jožef Stefan
Jožef Stefan International Postgraduate School
Jamova 39, 1000 Ljubljana, Slovenia
e-mail:


ABSTRACT
Car parks are common and essential infrastructure in modern-day cities. The availability of parking spaces can have an impact on a person's choice of transportation mode, departure time, or even on whether to depart at all. The city of Ljubljana provides the number of available spaces for each car park online. However, since decisions often need to be made ahead of time, the availability of parking spaces should also be provided ahead of time. In this paper we analyse the viability of providing such information through the use of predictive modelling. We begin by introducing the data collected from the Open Data Slovenia API and the steps taken to preprocess it into the format used to build our models. Next, we discuss how the models were built and evaluated, concluding that they are both viable and accurate. Finally, we discuss the results and how we believe they could be further improved.


1 INTRODUCTION
Available parking spaces for cars are a necessary and often scarce resource. It is common for people to decide their mode of transportation, time of departure, and even whether or not to attend a personal or public event based on the expected availability of parking spaces. The public company Javno podjetje Ljubljanska parkirišča in tržnice, d.o.o. manages the car parks owned by the Municipality of Ljubljana and provides the number of currently available spaces for each park on its website. This information is also provided by the Open Data Slovenia initiative in the form of a web-based Application Programming Interface (API). However, because plans often need to be made in advance, what is needed is the availability of parking spaces at the expected time of arrival, rather than its present value.

The goal of this paper is to assess the viability and accuracy of predicting the number of free spaces, in a given car park, at a future time. We build predictive models for several car parks and for different time intervals, and evaluate their accuracy. Our models rely on machine learning methods, namely linear regression, decision trees [1] and random forests [2], rather than more typical statistical methods such as autoregressive moving-average (ARMA) models [3]. We built models to predict the number of available spaces at each car park at the end of different time intervals: 30 min, 1 h, 2 h and 3 h from the current time.

2 DATA
We use the number of available parking spaces for a total of 11 car parks, obtained from the Open Data Slovenia API between 2011-09-12 and 2013-11-18. The data contains an entry every 5 minutes, which includes the name of the car park, a unique numerical identifier and a timestamp. During the preprocessing stage we also use the total number of parking spaces of each car park, provided by the same API. The API provides additional data which we do not use, such as the coordinates of the car park and the price per hour, in euros, of a parking space. Table 1 shows the names of the car parks used and their respective maximum capacities.

Park                Spaces     Park                 Spaces
PH Kozolec          248        Petkovskovo II       85
Tivoli I            360        Bežigrad             62
Mirje               110        Trg preko. brigad    98
Trg MDB             40         Kranjčeva            118
Gospodarsko raz.    550        Žale II              80
PH Kongresni trg    720

Table 1: Car parks used and their maximum capacity.

2.1 Time series resampling
In order to predict parking space availability at the end of discrete time intervals using this approach, we first need to resample the time series from the provided 5 min intervals into the intervals at which we will make predictions: 30 min, 1 h, 2 h and 3 h. This is accomplished by grouping the 5 min intervals into the new intervals and taking, for each group, the number of available spaces at the last of the group's 5 min intervals.


The choice of taking the last value as the resampling method, over alternatives such as the mean or the minimum, derives from our goal, which can be summarised as: how many spaces will be available at a certain car park at the end of the next time period? The resampling was performed using the Pandas library.
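The resampling step can be sketched with Pandas as follows. The series values and dates are invented for illustration; the paper does not list its exact code.

```python
import pandas as pd

# Hypothetical 5-minute readings of available spaces for one car park.
idx = pd.date_range("2013-11-01 00:00", periods=12, freq="5min")
available = pd.Series([240, 238, 231, 225, 220, 214,
                       210, 205, 199, 196, 190, 188], index=idx)

# Resample to 30-minute intervals, keeping the LAST reading in each
# interval -- the number of free spaces at the end of the period.
available_30min = available.resample("30min").last()
```

The same call with `"1h"`, `"2h"` or `"3h"` produces the other interval sizes.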

2.2 Sliding Window
The machine learning methods used have no built-in notion of ordering between examples, including time ordering; each example is just a list of feature values. In order to provide information about the value of a feature at a previous point in time, it must be included as a separate feature. We do this using a sliding window approach, where each example contains as distinct features the number of available parking spots at the end of the current time interval as well as at the end of a fixed number of previous time intervals. This number is referred to as the window size. We set the window size to 4 after running the experiment with all values between 1 and 20 and determining that it yielded the lowest average regression error over all time intervals and all parks combined, as can be seen in Figure 1.
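A minimal sketch of the sliding window construction, assuming the features are the current value (`lag_0`) plus the previous `window - 1` values and the target is the value at the end of the next interval. The function name and column names are hypothetical; handling of missing lag values is described in Section 2.3 and is omitted here.

```python
import pandas as pd

def sliding_window(series: pd.Series, window: int = 4) -> pd.DataFrame:
    """Build examples: lag_0 is the current interval's value, lag_k the
    value k intervals earlier, and target the next interval's value."""
    cols = {f"lag_{k}": series.shift(k) for k in range(window)}
    examples = pd.DataFrame(cols)
    examples["target"] = series.shift(-1)
    # Examples with a missing target are removed (Section 2.3); missing
    # lag values are left in place to be imputed later.
    return examples.dropna(subset=["target"])
```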

After transforming the time series data into examples using the sliding window technique, all examples with a missing value of the target variable are removed from our dataset. Missing values of the sliding window features are replaced by their training set mean. This replacement occurs automatically at model training and testing time. Table 2 shows the total number of examples contained in our data for each of the time intervals.

Interval    30 min    60 min    120 min    180 min
Examples    32704     16365     8191       5464

Table 2: Number of examples for each interval.
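The mean-replacement strategy described above can be sketched as follows; the helper name is hypothetical, and in practice this step happens automatically inside the modelling pipeline.

```python
import pandas as pd

def impute_with_training_mean(train: pd.DataFrame, test: pd.DataFrame):
    """Replace missing feature values with the corresponding
    training-set column mean, in both the training and test sets."""
    means = train.mean()
    return train.fillna(means), test.fillna(means)
```

Note that the test set is filled with the *training* means, so no information leaks from the held-out period.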

3. MACHINE LEARNING METHODS USED
We chose to use and compare three different methods for performing the actual regression: linear regression, regression trees and random forests.

3.1 Brief description of the methods used
The first model we implemented, linear regression fitted using ordinary least squares, is among the oldest, most well-studied and simplest methods of performing regression. In essence, it approximately solves the equation y = Ax + b by choosing the parameters A and b that minimise the sum of squared errors. This model has no hyperparameters to tune.

The second model we implemented was the regression tree, specifically the Classification And Regression Tree (CART). The tree performs regression by predicting the average of the training samples in the corresponding leaf. A tree is fitted to the data by choosing the split into subsets that results in the smallest error; this process is repeated on each derived subset in a recursive manner. Nodes are expanded until all leaves contain two or fewer samples. This value is a parameter of regression trees that can be tuned.

Random forest regression works by creating an ensemble of regression trees. Each tree is built from a sample drawn with replacement from the training set. In addition, when splitting a node during the construction of a tree, the split that is picked is the best split among a random subset of the features. The predictions of the individual trees are then averaged to obtain the ensemble's result. The main parameters to adjust in random forest regression are the number of estimators and the maximum number of features. The former is the number of trees in the forest: the larger the better, but also the longer the computation takes, and results stop improving significantly beyond a critical number of trees. The latter is the size of the random subset of features considered when splitting a node: the lower the value, the greater the reduction in variance, but also the greater the increase in bias.

We set the number of estimators to 20, while the maximum number of features is set to the actual number of features, a known good empirical default. For the remaining parameters, nodes are expanded as in the regression trees described previously. In light of the results obtained, we did not find it necessary, for the purposes of this paper, to tune these parameters further.
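With these settings, the three regressors could be instantiated, for example, with scikit-learn (the paper does not name its library). The data below is synthetic, standing in for the real windowed examples, and `min_samples_split=3` is one way to approximate leaves of two or fewer samples.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor

# Synthetic windowed data: 4 lag features, target tied to the newest lag.
rng = np.random.default_rng(42)
X = rng.integers(0, 248, size=(300, 4)).astype(float)
y = X[:, 0] + rng.normal(0.0, 5.0, size=300)

models = {
    "linear regression": LinearRegression(),
    "regression tree": DecisionTreeRegressor(min_samples_split=3),
    # 20 trees; max_features=None considers all features at each split.
    "random forest": RandomForestRegressor(n_estimators=20,
                                           max_features=None),
}
for name, model in models.items():
    model.fit(X, y)
```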

Figure 1: Global average of the error for different time window sizes.

2.3 Missing and Invalid Values
The original data appeared to contain information for 27 different park identifiers. Upon a first analysis, we determined that some of these parks were errors and contained either no relevant information or mixed information from other parks, while for some parks the data was only available for a short time period. The data for these parks was removed. The data for the remaining 11 parks is missing for many of the 5 min intervals in the time period analysed. Worse still, several entries contain clearly wrong or invalid values, such as negative numbers of available spaces or more available spaces than the total number of spaces in the car park. All of these values were removed prior to the resampling, with the result that several values are missing from the resampled time series.


3.2 Brief description of the evaluation criteria
The evaluation measure we chose was the root mean squared error, RMSE = sqrt(mean(e_i^2)), the standard deviation of the regression errors e_i, which corresponds to the square root of the mean squared error (MSE). While RMSE and MSE are the two most common error measures for regression, the RMSE is preferred because it is on the same scale as the data, making results simpler to understand.
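A direct implementation of this measure:

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error: the square root of the mean squared
    regression error, on the same scale as the data."""
    e = np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean(e ** 2)))
```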


4. RESULTS
We fitted the regressors using all of the data except for the last 3 months, which were held out for evaluation. For comparison, we use two baselines: the mean and the previous value. The former consists of predicting every value using the training set mean of the target variable, while the latter consists of predicting the previous value in the time sequence.

                     30 min    60 min    120 min    180 min
Mean                 41.2      41.4      41.6       41.3
Previous Value       10.1      16.3      26.6       33.9
Linear Regression    3.5       4.2       4.8        4.7
Regression Tree      0.5       0.8       0.4        0.5
Random Forest        0.4       0.5       0.6        0.5

Figure 3: Actual free places in PH Kongresni trg in blue and random forest regression in green. 120 min interval, 8 days of data.

Table 3: Average RMSE for each interval.
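The two baseline predictors can be sketched as follows; the helper name and the way the held-out tail is selected are hypothetical.

```python
import pandas as pd

def baseline_predictions(series: pd.Series, n_test: int):
    """Predictions for the last n_test points of the series using the
    two baselines: the training-set mean, and the previous value."""
    train = series.iloc[:-n_test]
    test = series.iloc[-n_test:]
    mean_baseline = pd.Series(train.mean(), index=test.index)
    previous_value = series.shift(1).iloc[-n_test:]
    return mean_baseline, previous_value
```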

Random forest regression was the best method, though not significantly better than simple regression trees. The mean is clearly a very poor predictor. Figure 2 shows the error for each park on the 120 min interval.

5. DISCUSSION
Figure 3 shows the behaviour of the target value for one particular car park, but it is representative of the others. Available spaces in the Ljubljana car parks follow a distribution closely tied to work schedules during the week. Between 01h and 07h the number of free spaces is nearly constant. It then increases to its maximum as people leave for work, reaching it at around 09h, when the park is empty or nearly so. As people arrive for work, the number of free parking spaces decreases sharply, reaching its minimum, with the park full or nearly so, at around 17h, when people begin to leave work.

The parks with the highest number of total places are the ones with the highest error across all measures. This relationship is easy to explain for the baseline predictors. Because the distribution illustrated in Figure 3 has high variance, with many of the values at or near the extremes, the further apart the minimum and the maximum are, the worse a predictor the mean becomes. The problem with using the previous value can be seen in Table 3: the bigger the time interval, the worse the predictions become. Larger intervals result in steeper ascent or descent curves, and thus the point at the next step is more distant, along the target axis, from the current point. A high number of total spaces increases the steepness, as bigger car parks simply free more places during the night while still being completely occupied during working hours.

While the machine learning methods are not as directly affected by these factors, they are affected by our strategy for replacing missing values: the training set mean. If the mean is a poor predictor of the target value, missing values will be replaced by values further from the real ones and will have a more negative impact. This is precisely what happens in Figure 3, where the regressor makes a very large error.
As can be seen in Table 4, less than 1% of the examples have missing values, yet these account for 71% of the regression error on our test dataset.

Figure 2: RMSE, all parks, 120 min interval. Mean in purple, previous value in cyan, linear regression in black and random forest in red.

Examples with missing values
Percentage of test set:        0.7%
Percentage of error (RMSE):    71%

Table 4: Missing values as a percentage of the test set and of the error.
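One way to attribute error to the imputed examples is the share of total squared error they contribute; this is a sketch, as the paper does not spell out its exact computation.

```python
import numpy as np

def missing_value_error_share(sq_errors, had_missing):
    """Fraction of the total squared error contributed by the flagged
    examples (e.g. examples whose feature values were imputed)."""
    sq_errors = np.asarray(sq_errors, dtype=float)
    had_missing = np.asarray(had_missing, dtype=bool)
    return float(sq_errors[had_missing].sum() / sq_errors.sum())
```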



6. CONCLUSION
In this work we showed that it is possible to accurately predict the number of available parking spaces in the car parks of Ljubljana at the end of 30 min, 1 h, 2 h and 3 h intervals. Regression using either regression trees or ensembles of trees provided very good results. While the results using random forests were slightly better, the fact that these models are built from a very small number of features leaves little room for gains from using an ensemble of trees; the best results were attained using only 4 features, as seen in Figure 1.

Most of the error in our predictions resulted from a poor strategy for replacing missing values: the mean. In future work, using the regressors' own predictions for missing attribute values is the most promising option and should significantly reduce the error. While this work uses only past values of the target variable, similarly to a classical statistical autoregressive model, additional data such as calendar, weather and event data could be incorporated to further lower the prediction error.

References
[1] L. Breiman, J. Friedman, R. Olshen and C. Stone. Classification and Regression Trees. Wadsworth, Belmont, CA, 1984.
[2] L. Breiman. Random Forests. Machine Learning, 45(1):5-32, 2001.
[3] P. Whittle. Hypothesis Testing in Time Series Analysis. Almqvist & Wiksells, 1951.