

Forecasting Italian retail sales at the aggregate level and down to 430 product categories

employing hierarchical time series models and a time series clustering application


In this thesis the forecasting accuracy of ETS, Arima and autoregressive neural network models is evaluated in the empirical case of total Italian retail sales. Using a dataset of 430 time series for product categories, hierarchical forecasts are built with the optimal combination approach developed in Hyndman et al. (2011). The predictive accuracy of these models is then evaluated against other hierarchical combination methods. Lastly, a different definition of the hierarchy, based on time series clustering procedures, is introduced with the aim of improving the predictive accuracy of the hierarchical models at the highest level of aggregation.


This thesis is concerned with forecasting Italian retail sales, first at the aggregate level; in the second part a wider dataset, with time series data for single product categories, is used as the basis for building hierarchical forecasting models.

The first hypothesis tested concerns whether the ETS method is the best performing automated forecasting technique for the time series of total Italian retail sales when compared with Arima and autoregressive artificial neural network models. These techniques are considered because they can model a trend and strong seasonal fluctuations, which are relevant features of the data analysed.

Retail sales also present a very clear hierarchical structure: according to the ECR classification, single product categories can be aggregated into sectors, which in turn sum up to eight departments. In the second part of this work total retail sales are disaggregated into three lower levels delineating a hierarchical structure, which is used as the basis for a battery of empirical tests of the forecasting performance of different hierarchical forecasting models.

The second hypothesis of this work is therefore that the optimal combination method is the best performing hierarchical forecasting method. In fact, the main issue in building forecasts for a hierarchy of time series is ensuring coherence across all levels, that is, controlling that the sum of the forecasts for a given level equals the forecast for their aggregate. Traditional approaches start by forecasting either the top level (top-down) or the bottom one (bottom-up) and then use these results to build the other levels of the hierarchy. Hyndman et al. (2011) introduce a novel approach: all series at all levels of the hierarchy are forecast separately, and a regression model is then used to combine the forecasts into a coherent result which, under some assumptions, can be proved to be unbiased and to have minimum variance among all combination forecasts.
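The mechanics of this combination step can be sketched on a toy hierarchy. The snippet below is a minimal illustration, not the thesis code: it applies the ordinary least squares projection used in the simplest variant of the optimal combination approach to a two-node hierarchy (total = A + B), where the summing matrix S maps bottom-level series to every series in the hierarchy.

```python
import numpy as np

# Toy hierarchy: total = A + B. The summing matrix S maps the two
# bottom-level series to all three series (total, A, B).
S = np.array([[1.0, 1.0],
              [1.0, 0.0],
              [0.0, 1.0]])

# Incoherent base forecasts for (total, A, B): 100 != 52 + 46.
y_hat = np.array([100.0, 52.0, 46.0])

# OLS reconciliation: project the base forecasts onto the coherent
# subspace spanned by the columns of S.
P = np.linalg.inv(S.T @ S) @ S.T
y_tilde = S @ P @ y_hat

# The reconciled forecasts are coherent by construction.
print(y_tilde)
assert abs(y_tilde[0] - (y_tilde[1] + y_tilde[2])) < 1e-9
```

The weighted variants discussed later (wls, MinT, nseries) replace the plain inverse with a weighted one; the projection idea is unchanged.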

The retail sales time series obtained from IRI data will provide the basis for a battery of empirical tests on the forecasting performance of different hierarchical models, aiming first to provide some additional empirical evidence which is lacking in the literature due to the relatively young age of the “optimal combination approach” and of some of its variations. Particular focus will be devoted to comparing the performance of different weighting procedures of the optimal combination approach, namely the “wls”, “MinT” and “nseries” procedures.

The third hypothesis is then that the forecasting accuracy for total Italian retail sales can be improved by employing hierarchical forecasting instead of considering only the time series for total sales and univariate time series models.

The last hypothesis tested in this work is that the predictive accuracy of the hierarchical forecasts

can be further improved by changing the hierarchical aggregation structure from that described by

the ECR classification to a custom one obtained through time series clustering procedures.

The next section provides a literature review; the data and the univariate econometric procedures are discussed in section 2; section 3 contains the analysis of aggregate retail sales. Section 4 discusses hierarchical forecasting models, which are then evaluated in their empirical application in section 5; section 6 deals with restructuring the hierarchy through time series clustering; section 7 concludes.

1. Literature review:

This work attempts to provide a forecasting framework which can aid industry practitioners in

dealing with organizational problems: Mentzer and Bienstock (1998) identify forecasts of sales as

essential inputs to many decisional challenges in fields such as marketing, sales, purchasing and

accounting. Alon et al. (2001) add that more accurate forecasts for aggregate retail sales can be

useful for large retail chains as changes in their sales are often systemic and they are more affected

by industry wide trends.

Zhang (2009) assesses that accurate demand forecasting plays a fundamental role in the profitability

of retail operations thanks to its importance in planning purchasing, production, transportation and

labour force. Moreover it seems important to note that improving the ability of retailing managers

to estimate future sales can lead to positive results in customer satisfaction, reduced loss of

products, increased sales revenue and more efficient production planning (Chen and Ou, 2011a).


Kremer et al. (2016) study common biases in judgemental hierarchical forecasts for the retail

industry and provide some guidelines on how to better communicate forecasts to industry practitioners.


On the topic of hierarchical models a substantial body of literature has been produced comparing the performance of bottom-up and top-down methods. Narasimhan et al. (1995) and Fliedner (1999) side in favour of the latter, pointing out that forecasts at the aggregate level are better built with aggregate data, while Edwards & Orcutt (1969) argue that aggregation of time series data causes a non-negligible loss of information and therefore that bottom-up forecasts are preferable.

In general, empirical tests on the performance of these two methods tend to come out in favour of the bottom-up method; one earlier example is Kinney (1971). More recently, Marcellino et al. (2002) show that inflation and economic activity in the Euro area can be forecast more accurately by aggregating the results of a different econometric model for each country than by considering directly the series at the aggregate level. Stock & Watson (2002) present similar results, identifying potential for the improvement of aggregate forecasts in the aggregation of country specific effects.

When estimation uncertainty is an issue, Giacomini & Granger (2004) show that aggregating forecasts from a space-time AR model is weakly more efficient than aggregating the forecasts from a VAR. Hendry et al. (2011) argue that forecasts for eurozone inflation can be improved by including variables at the disaggregate level in the model for the aggregate level.

Hyndman et al. (2011) introduce the optimal combination approach for hierarchical time series

which is going to be employed extensively in this work. Building on this novel approach Hyndman

et al. (2015) provide some improved computational techniques useful for large datasets and some

empirical simulations testing the performance of these hierarchical reconciliation techniques.

Another empirical application of the optimal combination approach can be seen in Ben Taieb et al. (2017), where it is applied to a hierarchy of electricity demand derived from smart meter data. This forecasting framework is further extended in Athanasopoulos et al. (2018), where a novel weighting procedure is introduced along with a battery of empirical tests on its forecasting performance when applied to Australian tourism data.

2. Forecasting models:

This section provides a brief description of the forecasting models which we are going to apply to

the time series data of Italian retail sales.

Exponential smoothing:

The term exponential smoothing describes a family of forecasting methods in which the forecasts are built as a weighted combination of past observations, with more recent observations receiving a higher weight than older ones. The weight of the observations decreases exponentially as they get older, hence the name exponential smoothing.
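The recursion behind the simplest member of the family can be sketched in a few lines. This is an illustrative implementation of simple exponential smoothing, not code from the thesis: the smoothed level is updated as a convex combination of each new observation and the previous level, which is algebraically equivalent to a weighted sum of past observations with weights α(1−α)^j decaying exponentially with age j.

```python
# Simple exponential smoothing: each update mixes the newest observation
# with the previous level, so older observations receive exponentially
# smaller weights alpha * (1 - alpha)**j.
def ses_forecast(y, alpha, l0=None):
    level = y[0] if l0 is None else l0   # naive initial level
    for obs in y:
        level = alpha * obs + (1 - alpha) * level
    return level                          # flat h-step-ahead forecast

y = [10.0, 12.0, 11.0, 13.0]
print(ses_forecast(y, alpha=0.5))         # -> 12.0
```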

The origin of this class of forecasting methods dates back to the 1950s, when they were first introduced by Robert G. Brown for the purpose of forecasting spare parts demand in the inventory system of the US Navy. In the same period, but independently, Charles Holt was also working on exponential smoothing for the US Office of Naval Research; his work appeared in an internal document (Holt 1957) which was widely circulated and quoted but which was published only in 2004 in the International Journal of Forecasting.

Holt is credited with a wide body of work on additive and multiplicative exponential smoothing, which became well known through a paper by his student Peter Winters (1960) providing empirical tests for Holt’s methods.

John Muth (1960), who collaborated with Holt, introduced two statistical models whose forecasts

are equal to those given by simple exponential smoothing. His models were the first of a long

series of statistical models related to forecasting using exponential smoothing, many of these are

state space models, including the ones introduced by Muth, for which the minimum mean squared

error forecasts are the forecasts from simple exponential smoothing.

State Space Models:

Let 𝑦𝑡 denote the observation at time t and let 𝑥𝑡 be the state vector containing the unobserved components, namely the level, trend and seasonality of the series. Following the notation first delineated by Anderson and Moore (1979), the linear innovations form of state space models can be written as:

𝑦𝑡 = 𝑤′𝑥𝑡−1 + 𝜀𝑡 (1.1a)

𝑥𝑡 = 𝐹𝑥𝑡−1 + 𝑔𝜀𝑡 (1.1b)

where 𝜀𝑡 is a white noise series and F, g and w are coefficients. Equation (1.1a) is known as the measurement equation, as it models the relationship between the unobserved states 𝑥𝑡−1 and the observation 𝑦𝑡.

On the other hand the second equation (1.1b) is known as the transition equation; it describes the

evolution of the states over time.
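A minimal concrete case may help fix ideas. The sketch below, with illustrative parameter values, simulates the simplest linear innovations model, the local level model (w = 1, F = 1, g = α, a single state): the same innovation ε enters both the measurement and the transition equation, which is what distinguishes this single-source-of-error form.

```python
import numpy as np

rng = np.random.default_rng(0)

# Local level innovations state space model: w = 1, F = 1, g = alpha.
alpha, n = 0.3, 200
x = 10.0                        # initial state x_0 (the level)
ys = []
for _ in range(n):
    eps = rng.normal(0.0, 1.0)
    y = x + eps                 # measurement: y_t = w' x_{t-1} + eps_t
    x = x + alpha * eps         # transition:  x_t = F x_{t-1} + g eps_t
    ys.append(y)

print(len(ys), round(ys[0], 3))
```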

This innovations formulation is the framework behind the ETS model of Hyndman et al. (2002) who, building on the work of Ord et al. (1997), introduce a class of state space models that underlies all of the exponential smoothing methods; moreover they propose a modelling framework which provides stochastic models allowing for likelihood calculations, the construction of prediction intervals and automatic model selection.

ETS models:

In the context of time series decomposition, four main components of a time series can be identified: trend, cycle, seasonal and error components. ETS models as described in Hyndman et al. (2008) consider only three of them, choosing to omit the cycle component.

As can be read in Hyndman et al. (2008), the idea behind the ETS model, namely Error, Trend, Seasonality, is that by combining these three components 30 state space models can be identified, representing the whole class of exponential smoothing. This allows for the construction of the very useful ets() function included in the R package “forecast”, which includes automated model selection and is therefore extremely useful for automated batch forecasting. Given that we have 430 time series for product sales available, the time cost of manually estimating parameters for each of them would be prohibitive; moreover we could also be interested in updating the models as new observations get recorded, thus justifying the need for automated forecasts.

Exponential smoothing places strong emphasis on the trend component, which is itself broken down into the level term (l) and the growth term (b). Let 𝑇ℎ define the trend h periods ahead and let 𝜑 be a damping parameter, with 0 < 𝜑 < 1. The taxonomy considers five variations of the trend component:

• None: 𝑇ℎ = 𝑙

• Additive: 𝑇ℎ = 𝑙 + 𝑏ℎ

• Additive damped: 𝑇ℎ = 𝑙 + (𝜑 + 𝜑^2 + … + 𝜑^ℎ)𝑏

• Multiplicative: 𝑇ℎ = 𝑙𝑏^ℎ

• Multiplicative damped: 𝑇ℎ = 𝑙𝑏^(𝜑 + 𝜑^2 + … + 𝜑^ℎ)
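The five trend variants can be sketched directly as code; the values below are purely illustrative, not taken from the retail data.

```python
# The five trend variants of the ETS taxonomy, for level l, growth b,
# damping parameter 0 < phi < 1 and horizon h.
def trend(kind, l, b, phi, h):
    phi_sum = sum(phi**i for i in range(1, h + 1))   # phi + phi^2 + ... + phi^h
    return {
        "none": l,
        "additive": l + b * h,
        "additive_damped": l + phi_sum * b,
        "multiplicative": l * b**h,
        "multiplicative_damped": l * b**phi_sum,
    }[kind]

l, b, phi, h = 100.0, 2.0, 0.9, 3
for kind in ("none", "additive", "additive_damped", "multiplicative"):
    print(kind, trend(kind, l, b, phi, h))
```

Note how the damped variants replace the horizon h with the geometric sum φ + φ² + … + φ^h, which flattens the trend as h grows.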

Besides the trend, the ETS framework must estimate a seasonal structure, which can be absent, multiplicative or additive. Lastly, the error component must be considered; it can be either multiplicative or additive. Summing up, these three components can be combined into 30 distinct models.

This taxonomy is based on the work of Gardner (1985), which was modified by Hyndman et al. (2002) and later extended in Taylor (2003a).

Most of the models resulting from variations of this taxonomy are already established and widely used exponential smoothing methods: for example, a model with no trend and no seasonality is simple exponential smoothing (SES), while Holt’s linear method is described by an additive trend and no seasonality. A seasonal component combined with an additive trend identifies the Holt-Winters method, which can be additive or multiplicative depending on the seasonal component.

Two models with the same parameters for seasonality and trend but different error terms, which can be either additive or multiplicative, will produce the same point forecasts but widely different prediction intervals.

Here we write the formulas for the state space model which underlies all 30 exponential smoothing models. Suppose 𝑥𝑡 is the state vector 𝑥𝑡 = (𝑙𝑡, 𝑏𝑡, 𝑠𝑡, 𝑠𝑡−1, …, 𝑠𝑡−𝑚+1)′; the state space structure appears as follows:

𝑦𝑡 = 𝑤(𝑥𝑡−1) + 𝑟(𝑥𝑡−1)𝜀𝑡 (2.1a)

𝑥𝑡 = 𝑓(𝑥𝑡−1) + 𝑔(𝑥𝑡−1)𝜀𝑡 (2.1b)

where {𝜀𝑡} is a Gaussian white noise process with variance 𝜎² and 𝜇𝑡 = 𝑤(𝑥𝑡−1). In the case of a model with additive errors 𝑟(𝑥𝑡−1) = 1, so that 𝑦𝑡 = 𝜇𝑡 + 𝜀𝑡, while a multiplicative error model has 𝑟(𝑥𝑡−1) = 𝜇𝑡, therefore setting the value of 𝑦𝑡 equal to 𝜇𝑡(1 + 𝜀𝑡).

In their excellent book “Forecasting with Exponential Smoothing: The State Space Approach”, Hyndman et al. (2008) provide complete formulas for each of the 30 models.

ETS algorithm

In the R package “forecast” Hyndman provides a forecasting function, ets(), which builds forecasts based on the exponential smoothing taxonomy previously described. Since we are later going to employ said function extensively, it is worth delving deeper into its workings; it is based on a forecasting algorithm structured as follows:

• All appropriate models are applied to the series and each model’s parameters are estimated.

• The best model is selected according to the AICc criterion.

• Point forecasts are built employing the chosen model.

• Prediction intervals are estimated for the selected model.
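The selection step can be sketched as follows. This is not the internals of ets(), only an illustration of AICc-based selection with hypothetical candidate models and made-up log-likelihoods:

```python
import math

# AICc: AIC corrected for small samples; n observations, k parameters.
def aicc(loglik, k, n):
    aic = -2.0 * loglik + 2.0 * k
    return aic + (2.0 * k * (k + 1)) / (n - k - 1)

# Hypothetical fitted candidates: (name, log-likelihood, n. of parameters).
candidates = [("ANN", -210.0, 2), ("AAN", -204.0, 4), ("AAA", -198.5, 16)]
n = 84   # e.g. seven years of monthly data
best = min(candidates, key=lambda m: aicc(m[1], m[2], n))
print(best[0])   # -> "AAN"
```

Note how the correction term penalises the heavily parameterised seasonal model on a short sample, even though its raw likelihood is highest.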

Hyndman et al. (2002) conducted some extensive empirical tests on the forecasting performance

of this technique which was tested on the M-competition data (Makridakis et al.1982) and IJF-M3

competition data (Makridakis and Hibon 2000).

These tests highlighted the performance of the model on short term forecasting horizons and on

series displaying strong seasonal patterns.

Initialization and estimation:

The vector of initial values is defined using a simple heuristic scheme based on Hyndman et al. (2002), which is set as follows:

• The initial seasonal component is computed starting from a 2 × m (where m is the seasonal frequency) moving average estimated on the first years of data. The data is then detrended, and the initial seasonal indices are the result of a two step process: first the average of the detrended data is taken in each seasonal period, then the indices thus obtained are normalized so that they sum up to zero for additive seasonal methods or to m for multiplicative seasonality.

• In order to estimate the starting level there are two options: in the case of seasonal data a linear trend is computed on the first ten seasonally adjusted values available, while for nonseasonal data it is sufficient to compute a linear trend on the first ten observations against a time vector from one to ten. In both cases the initial level is set equal to the intercept of the trend.

• Lastly, the initial growth is set to be the slope of the trend in the case of an additive trend, while the presence of a multiplicative trend requires the initial growth 𝑏0 = 1 + 𝑏/𝑎, where a is the intercept and b is the slope of the trend previously estimated.
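The seasonal part of this heuristic can be sketched for the additive case. The code below is an illustrative reconstruction of the scheme, not the ets() source: it detrends with a centred 2 × m moving average, averages the detrended values per season, and normalises the indices to sum to zero.

```python
import numpy as np

# Sketch of the heuristic additive seasonal initialisation: centred 2*m
# moving average detrending, per-season averages, zero-sum normalisation.
def initial_seasonal_additive(y, m):
    y = np.asarray(y, dtype=float)
    w = np.full(m + 1, 1.0 / m)              # centred 2*m MA weights
    w[0] = w[-1] = 0.5 / m
    trend = np.convolve(y, w, mode="valid")  # len(y) - m trend values
    detrended = y[m // 2 : m // 2 + len(trend)] - trend
    idx = np.array([detrended[i::m].mean() for i in range(m)])
    idx = np.roll(idx, -(m // 2))            # align index 0 with season 0
    return idx - idx.mean()                  # normalise to sum to zero

# Synthetic monthly series: linear trend plus a known seasonal pattern.
s = np.array([5, -5, 3, -3, 2, -2, 1, -1, 4, -4, 0, 0], dtype=float)
y = np.array([100 + 0.5 * t + s[t % 12] for t in range(48)])
print(np.round(initial_seasonal_additive(y, 12), 6))
```

On this clean synthetic series the procedure recovers the seasonal pattern exactly, since the centred moving average removes both the linear trend and the zero-sum seasonality.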

After computing the initial states, the estimation procedure can start: the initial states are refined while the maximum likelihood parameters are chosen.

While multiple sources of error state space models need to employ the Kalman filter, the innovations state space models involved in the ETS framework are single source of error models; therefore it can be demonstrated that

𝐿∗(𝜃, 𝑥0) = 𝑛 log(∑𝑡=1..𝑛 𝜀𝑡²) + 2 ∑𝑡=1..𝑛 log|𝑟(𝑥𝑡−1)|

is equal to twice the negative logarithm of the likelihood function conditional on the parameters vector 𝜃 and the initial states vector 𝑥0. Therefore the parameters and the improved initial states can be easily computed using the formulas for recursive calculations of the point forecasts and parameters specific to each model, and minimizing 𝐿∗.

Holt-Winters’ exponential smoothing

Of the 30 models which make an appearance in the ETS framework, I believe it is warranted to examine in further detail the ones dealing with Holt-Winters exponential smoothing, both in order to give an example of the structure of an exponential smoothing model and because it is the one most often selected when the model is applied to the retail sales time series. Holt (1957) introduced the first examples of exponential smoothing models, and Winters (1960) expanded on them and added the seasonal component, thus creating a very effective method for modelling series which display a strong seasonal pattern. The Holt-Winters seasonal method consists of four equations: the forecast equation and three smoothing equations, one for the level 𝑙𝑡, one for the trend 𝑏𝑡, and one for the seasonal component 𝑠𝑡. The corresponding smoothing parameters are α, β∗ and γ. The letter m denotes the seasonal frequency.

There are two versions of this method, which compute the seasonal component differently. The additive method performs better when the seasonal variations are constant through the series, while the multiplicative method is preferred when the seasonality increases proportionally to the level of the series. In the case of the additive method the seasonal component is subtracted in order to create the seasonally adjusted series; the multiplicative method reads the seasonal component as a percentage, and the seasonal adjustment is performed by dividing by the seasonal component.
The structural form of the additive method can therefore be written as

𝑦̂𝑡+ℎ|𝑡 = 𝑙𝑡 + ℎ𝑏𝑡 + 𝑠𝑡−𝑚+ℎ𝑚⁺

𝑙𝑡 = 𝛼(𝑦𝑡 − 𝑠𝑡−𝑚) + (1 − 𝛼)(𝑙𝑡−1 + 𝑏𝑡−1)

𝑏𝑡 = 𝛽∗(𝑙𝑡 − 𝑙𝑡−1) + (1 − 𝛽∗)𝑏𝑡−1

𝑠𝑡 = γ(𝑦𝑡 − 𝑙𝑡−1 − 𝑏𝑡−1) + (1 − γ)𝑠𝑡−𝑚

where ℎ𝑚⁺ = ((ℎ − 1) mod 𝑚) + 1

The multiplicative method differs slightly and can be expressed as follows:

𝑦̂𝑡+ℎ|𝑡 = (𝑙𝑡 + ℎ𝑏𝑡)𝑠𝑡−𝑚+ℎ𝑚⁺

𝑙𝑡 = 𝛼(𝑦𝑡/𝑠𝑡−𝑚) + (1 − 𝛼)(𝑙𝑡−1 + 𝑏𝑡−1)

𝑏𝑡 = 𝛽∗(𝑙𝑡 − 𝑙𝑡−1) + (1 − 𝛽∗)𝑏𝑡−1

𝑠𝑡 = γ 𝑦𝑡/(𝑙𝑡−1 + 𝑏𝑡−1) + (1 − γ)𝑠𝑡−𝑚

In our case, due to the strong seasonality displayed by our series, the multiplicative method performs better; moreover, adding a damped trend instead of a linear one can increase the predictive accuracy of the model.
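The multiplicative recursions above can be sketched as code. This is a minimal illustration on a synthetic series, with naive initial states rather than the full heuristic scheme described earlier, and illustrative smoothing parameters:

```python
# Sketch of multiplicative Holt-Winters with additive trend, following
# the recursions above; initial states are naive seasonal averages.
def holt_winters_mult(y, m, alpha, beta, gamma, h):
    level = sum(y[:m]) / m
    trend = (sum(y[m:2 * m]) / m - sum(y[:m]) / m) / m
    season = [y[i] / level for i in range(m)]
    for t, obs in enumerate(y):
        s_prev = season[t % m]
        l_prev, b_prev = level, trend
        level = alpha * (obs / s_prev) + (1 - alpha) * (l_prev + b_prev)
        trend = beta * (level - l_prev) + (1 - beta) * b_prev
        season[t % m] = gamma * obs / (l_prev + b_prev) + (1 - gamma) * s_prev
    return [(level + (k + 1) * trend) * season[(len(y) + k) % m]
            for k in range(h)]

# Synthetic series with period 4 and multiplicative seasonality.
s = [1.1, 0.9, 1.2, 0.8]
y = [(10 + 0.1 * t) * s[t % 4] for t in range(40)]
fc = holt_winters_mult(y, m=4, alpha=0.3, beta=0.1, gamma=0.1, h=4)
print([round(v, 2) for v in fc])
```

In practice the thesis relies on the ets() estimation machinery rather than hand-rolled recursions; this sketch only makes the structure of the updates concrete.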

Autoregressive Neural Network

Lately one does not have to look far to see multiple success stories of the application of machine learning techniques to the most disparate tasks, from image processing to playing Go, from asset allocation to forecasting energy demand; as the price of computational resources declines and big data becomes ever more widely available, machine learning models are employed ever more often.

Alon et al. (2001) present one of the few examples of an application of neural network forecasting models to retail sales: the authors draw a comparison of forecasting accuracy between neural networks and traditional models of time series analysis for the case of aggregate US retail sales in the period 1978-1995. They conclude that this model performs better than Box-Jenkins Arima models, and add a note praising Holt-Winters exponential smoothing for its performance despite its relative simplicity.

The model we are going to fit to the available retail sales data is the autoregressive neural network (AR-NN) model; the basic idea is to generalize the standard AR(p) model

𝑦𝑡 = 𝜑1𝑦𝑡−1 + … + 𝜑𝑝𝑦𝑡−𝑝 + 𝑎𝑡

switching from a linear model to a nonlinear one structured as follows:

𝑦𝑡 = 𝑓(𝑦𝑡−1, …, 𝑦𝑡−𝑝; 𝑤) + 𝑎𝑡

where 𝑎𝑡 is a noise process, w is the weight vector and 𝑓(𝑦𝑡−1, …, 𝑦𝑡−𝑝; 𝑤) is a feedforward neural network. The function we are using is the nnetar() function included in the R package forecast; it builds only single hidden layer perceptrons and works upon three basic parameters: the number of lags in the input layer, the number of seasonal lags and the number of nodes included in the hidden layer. Therefore an AR-NN(p,0,0) is equivalent to an AR(p) model except for the stationarity restrictions, while an AR-NN(2,1,2)12 would include as inputs 𝑦𝑡−1, 𝑦𝑡−2, 𝑦𝑡−12 and 2 neurons in the hidden layer.
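The way these parameters translate into the network's inputs can be illustrated by building the lagged design matrix explicitly. This is a sketch, not the nnetar() internals; the hidden layer itself is fitted elsewhere:

```python
import numpy as np

# Assembling the inputs of an AR-NN(p, P, k)_m model (sketch): p recent
# lags plus P seasonal lags form the input layer feeding k hidden nodes.
def lag_matrix(y, p, P, m):
    y = np.asarray(y, dtype=float)
    start = max(p, P * m)                   # first usable target index
    rows = [[y[t - j] for j in range(1, p + 1)] +
            [y[t - j * m] for j in range(1, P + 1)]
            for t in range(start, len(y))]
    return np.array(rows), y[start:]

# AR-NN(2,1,2)_12: inputs y_{t-1}, y_{t-2} and the seasonal lag y_{t-12}.
X, target = lag_matrix(np.arange(1.0, 37.0), p=2, P=1, m=12)
print(X[0], target[0])   # first training pair: inputs [12, 11, 1], target 13
```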

The data:

Monthly data were made available by IRI, recording LCC (“Largo consumo confezionato”, packaged consumer goods) retail sales for supermarkets, hypermarkets and “Libero Servizio” (LIS) stores, therefore covering all large retail industry sales except for the discount class. Hypermarkets are defined as retail stores with a sales area larger than 2,500 m², supermarkets are the stores between 400 m² and 2,500 m², and lastly stores with an area between 100 m² and 400 m² are classified as LIS. The only part of the Italian retail distribution network missing from our dataset is the discount class, accounting for 17% of total sales, defined as structures in which the assortment does not provide for the presence of branded products.

We have four time series for retail sales in the Italian macro areas; these time series start in 2002, end in September 2017, and can be aggregated to obtain total retail sales in Italy. Moreover, starting from the beginning of 2012, we can also make use of time series data for single product categories: we have 430 products available, which can be aggregated into 73 sectors which in turn sum up to 8 departments according to the ECR classification.

Due to the differences in length, we are going to employ two different sets of techniques for our analysis. Regarding the historical sales from 2002 to 2017, we are going to employ some classical time series modelling approaches, namely Arima, Holt-Winters exponential smoothing and ETS models, as well as a more experimental technique, autoregressive neural networks. On the other hand, in the case of the multiple time series available from 2012, we are going to build and analyse hierarchical models, building heavily on Hyndman et al. (2011).

3. Aggregate retail sales analysis:

Figure 1: aggregate Italian retail sales

In figure one the time series of monthly aggregate Italian LCC sales can be observed. The strength of the seasonal component of the series is easily noted: the yearly peak is the month of December, which registers an increase in sales of about 30% compared to the November average, a figure which can reach a 60% increase when measuring sales in the beverages department.

The LCC acronym stands for “Largo consumo confezionato”: according to the classification structure adopted by most Italian retail chains, total sales are divided into three groups: LCC, “peso variabile” and non-food. The second category accounts for around 20% of total sales and describes the sales of products which can be bought in a variable quantity chosen by the customer; the third category covers only about 5% of the distribution and relates to products which are, as the name suggests, not edible, like school supplies or clothing.

LCC is therefore the most important subset of the sales of Italian retail distributors and consists mostly, but not entirely, of alimentary products: of the 8 departments it is subdivided into, 2 are personal care and home cleaning products, which account respectively for 7% and 8% of LCC sales.

In terms of relative percentages, the heaviest department is general groceries, accounting for 36% of total LCC sales, followed by fresh products with 20% and beverages with 15%; a lower volume is covered by frozen products (5%), pet care supplies (2%) and fruits and vegetables (6%). The last figure's apparently low value can be better framed if one remembers that most sales of fruits and vegetables do not fall in the LCC category, as the customers can pick the quantities, and are thus not included in the dataset made available for this analysis.

The departments are subdivided into a total of 73 sectors which are unevenly distributed: general groceries alone is divided into 22 sectors, while pet care only has 3 and fruits and vegetables only two. Moreover, departments often contain products which display different characteristics: for example beverage sales are divided into wine, beer and liquors as well as water, sparkling beverages and juices; another example of the in-department variability is general groceries, as it contains bread and pasta as well as sweets, condiments and pickled food, beside other categories.

Figure 2: aggregate retail sales polar plot

As can be seen in figure one, retail sales display strong seasonal patterns which appear both at the aggregate level and at the lower levels; in fact, much of the improvement in accuracy gained from the construction of a hierarchical model stems from separately modelling the seasonal components of the products at a disaggregated level.

A better visualization of the seasonality at the aggregate level is provided by the polar plot in figure two: we can see that the general structure of the seasonal patterns does not vary much with time; from 2002 onwards the month with the highest sales is always December, with June and September also recording high revenue.

Regarding the months of March and April, we can see a high variability in sales; this is explained by the Easter effect: Easter is a powerful driver of retail sales and it can fall in either month, thus shifting sales from one to the other depending on calendar effects.

Rolling forecast origin cross validation:

In assessing the forecast accuracy of the various forecasting models on the empirical substrate of Italian retail sales, a time series rolling forecast origin cross validation approach is adopted.

This validation method is described in Athanasopoulos and Hyndman (2018, section 3.4). It starts by defining the minimum length (k) of the period necessary for estimating the model parameters; each model is then trained on the period from the starting point of the series up to k, and forecasts up to 12 months ahead are produced, along with measures of the forecast accuracy. Each model is then re-estimated using a rolling forecast origin, that is, stretching the estimation window by one additional month in each step; this procedure is repeated for every window from k up to the penultimate observation in our time series, producing T−k measures of the 1 step ahead forecast error and down to T−k−12 measures of the 12 step ahead forecast error.

The advantage of this approach is that it mimics the form of the inventory problem, where the planner must iteratively make decisions estimating future sales.
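The procedure can be sketched in a few lines. This illustration uses a placeholder flat forecast (the in-sample mean) instead of the ETS, Arima and AR-NN models actually employed, and a toy series; only the rolling-origin bookkeeping and the MAPE computation are the point here:

```python
# Rolling forecast origin cross-validation (sketch): for each origin the
# model is refit on y[:origin] and evaluated 1..h steps ahead with MAPE.
def rolling_origin_mape(y, k, h):
    errors = {step: [] for step in range(1, h + 1)}
    for origin in range(k, len(y)):
        train = y[:origin]
        forecast = sum(train) / len(train)    # placeholder flat forecast
        for step in range(1, h + 1):
            t = origin + step - 1
            if t < len(y):
                errors[step].append(abs(y[t] - forecast) / abs(y[t]) * 100)
    return {step: sum(v) / len(v) for step, v in errors.items() if v}

y = [100, 102, 98, 105, 103, 99, 107, 104]
res = rolling_origin_mape(y, k=5, h=2)
print(res)
```

With T = 8 observations and k = 5 this yields T−k = 3 one-step-ahead errors, matching the counting argument in the text.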

Forecasting model comparison:

Our time series for the aggregate Italian sales and for the four macro areas start in 2002 and include 15 years and 9 months of data; we somewhat arbitrarily set the minimum length necessary for estimating the models to seven years, and we proceed to apply the procedure described in the previous paragraph in order to assess the relative performance of the various models. As the total length of each time series is 189 monthly observations, we will have for each series 105 measures of the one step ahead forecasting accuracy, 104 of the 2 steps ahead, and so on, concluding with 93 observations of the forecasting error 12 months ahead.

The computations have all been carried out with the statistical software R (R Core Team 2018); specifically, the R distribution “Microsoft R Open” (2017) has been employed, with the Intel MKL package for parallel computing. For the construction of the forecasts the R package “forecast” by Hyndman and Khandakar (2008) has been used: specifically, the ets() and nnetar() functions working as described above have been employed, as well as the auto.arima() function, which is responsible for the automated estimation and forecasting of an Arima model; further details on the structure of the auto.arima() function can be found in Hyndman and Khandakar (2008).

It is interesting to note that for every estimation period and for each of the five time series studied, the ETS procedure always selects, out of its pool of 30 state space models, the Holt-Winters exponential smoothing model with multiplicative errors, additive damped trend and multiplicative seasonality (M, Ad, M). This observation is corroborated by the relevant literature, as this model is widely regarded as performing well in the context of modelling series with strong seasonal patterns. Regarding the neural network model employed, it is an AR-NN(13,1,7)12 model, meaning that we include 13 lagged observations as well as one seasonal lag of the series analysed, with seven nodes in the hidden layer.

forecast horizon (h)

          1     2     3     4     5     6     7     8     9     10    11    12    Average
NN-AR   1,54  1,81  1,89  1,87  1,92  2,02  2,35  2,34  2,18  2,33  2,45  2,56   2,10
Arima   1,55  1,61  1,73  1,78  1,96  2,08  2,27  2,46  2,61  2,76  2,89  2,99   2,22
ETS     1,81  1,89  2,00  2,04  2,13  2,25  2,40  2,47  2,61  2,68  2,80  2,86   2,33

NN-AR   1,68  1,68  1,73  1,66  1,63  1,57  1,66  1,64  1,66  1,72  1,79  1,74   1,68
Arima   1,30  1,33  1,45  1,43  1,56  1,57  1,70  1,84  1,89  2,00  2,05  2,17   1,69
ETS     1,50  1,50  1,51  1,43  1,52  1,54  1,60  1,65  1,68  1,67  1,72  1,69   1,59

NN-AR   1,65  1,92  1,98  2,15  1,97  2,14  2,47  2,43  2,42  2,76  2,65  2,69   2,27
Arima   1,66  1,72  1,90  1,99  2,09  2,23  2,44  2,70  2,94  3,17  3,34  3,54   2,48
ETS     1,56  1,63  1,70  1,67  1,83  1,79  2,03  2,11  2,21  2,39  2,50  2,44   1,99

NN-AR   2,11  2,36  2,60  2,74  2,49  2,74  2,79  2,74  2,95  2,96  3,01  2,89   2,70
Arima   2,25  2,40  2,48  2,56  2,68  2,88  2,93  3,06  3,13  3,09  3,13  3,31   2,83
ETS     2,07  2,17  2,25  2,30  2,33  2,48  2,59  2,61  2,68  2,79  2,84  2,81   2,49

NN-AR   2,75  3,30  3,52  3,72  3,51  3,54  3,86  3,92  4,26  4,46  4,48  4,81   3,84
Arima   2,72  2,86  2,93  3,12  3,24  3,36  3,58  3,76  3,82  4,05  4,17  4,37   3,50
ETS     2,88  2,91  3,15  3,26  3,25  3,31  3,52  3,72  3,92  4,13  4,10  4,26   3,53

Table 1: mean MAPE for out-of-sample forecasting with the different models; each three-row panel refers to one of the five series (aggregate sales and the four macro areas).

In table one the results of the rolling forecast origin cross validation procedure can be seen: as a measure of out-of-sample forecasting accuracy, the average MAPE on a given forecasting horizon is displayed, as well as the average MAPE across all 12 values of h.

There is not a model which is unequivocally better than the others but, as is often the case in time series analysis, different models perform better at different horizons. The Arima model displays a very strong performance on very short term forecasts, as it shows the lowest forecasting error measures at the aggregate level and for the Nord-Ovest and Sud macro areas for h<5. On the other hand, for forecasting horizons equal to or greater than five, the ETS model appears to be the best performing, with the NN-AR placing second; in fact, even if at the aggregate level the NN-AR shows the lowest forecast error, in all macro areas the ETS model proves to be the best performer.

4. Hierarchical models:

As an alternative to the aggregate sales time series analysed in the previous section, a more detailed set of time series is available at the product level starting from 2012. The dataset, obtained from IRI, consists of 430 series for the sales of classes of products; these series can be allocated to 73 sectors and 8 departments according to the ECR classification. This data framework is extremely suitable for the construction of a hierarchical model for forecasting sales both at the aggregate level and for single product categories.

Retail sales display very strong and varied seasonal patterns: for example, sales of cold beverages peak in the summer months while beef or apple sales tend to flatline. The biggest advantage of employing a hierarchical model is that it allows us to take into account these different seasonal and trend signals when computing aggregate forecasts.

Hyndman et al. (2011) introduced an innovative method of computing hierarchical forecasts, whereas previous approaches were either top-down or bottom-up methods or some combination of the two.

Top-down forecasting starts with the computation of a forecast for the highest aggregate level, which is then disaggregated onto the lower levels; the forecasts of the lower levels are constrained according to historical proportions so that they sum up to the forecast for the aggregate time series. Gross & Sohl (1990) describe several approaches to the computation of these proportions.

The opposite approach is the bottom-up method, where each series at the lowest level is forecast separately and the aggregate level is then obtained as a simple combination of the bottom-level forecasts.

The innovation introduced by Hyndman et al. (2011) consists in a new approach to hierarchical forecasting that performs better than both of these methods. This method independently computes a forecast for each series in the hierarchy at each level and then uses a regression model to optimally combine and reconcile these forecasts. Under some relatively simple assumptions, the forecasts produced by this method are proven in that paper to be unbiased and to have minimum variance.

Notation for Hierarchical forecasting

It seems useful to introduce the notation for hierarchical modelling and forecasting adopted in

Hyndman et al. (2011) as it will serve as a basis to provide a description of the statistical structure

of the hierarchical models which are tested later.

Let our observations be recorded at times t = 1, …, n and let t = n+1, …, n+h be the period we are interested in forecasting. We call 𝑌𝑡 the value of the aggregate at the highest level of the hierarchy; adding an index, we have 𝑌𝑡 = ∑𝑖 𝑌𝑖,𝑡 where 𝑌𝑖,𝑡 denotes the generic member of the first level. This property holds at every level: for the second level 𝑌𝑖,𝑡 = ∑𝑗 𝑌𝑖𝑗,𝑡, and so on, 𝑌𝑖𝑗,𝑡 = ∑𝑘 𝑌𝑖𝑗𝑘,𝑡.

We then call 𝑚𝑖 the number of series at level i, such that in our case 𝑚1 = 8 and 𝑚2 = 73; the total number of series in the hierarchy is 𝑚 = 1 + 𝑚1 + 𝑚2 + … + 𝑚𝐾, where K is the number of levels below the top, so that 𝑌𝐾,𝑡 represents the collection of all series at the bottom level of the hierarchy. Finally, defining 𝑌𝑖,𝑡 as the vector of all observations at a given level i and time t, 𝑌𝑡 will represent the collection of all series in the hierarchy, at all levels.

With these definitions Hyndman et al. (2011) propose a modelling framework which gives a common structure to all hierarchical forecasting models, be they bottom-up or top-down, based on the following equation:

𝑌𝑡 = 𝑆𝑌𝐾,𝑡

with S playing the crucial role of the "summing matrix", which defines the aggregating structure of the hierarchy up from the bottom level. The summing matrix has order 𝑚 × 𝑚𝐾, and its first row is a unit vector of length equal to the number of series in the bottom level of the hierarchy, which is 𝑚𝐾.
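As an illustration, the summing matrix for a toy two-level hierarchy can be built as follows. This is a sketch in Python/numpy rather than the R code used in the thesis, and the tiny hierarchy (two departments over five bottom series) is hypothetical, not the 430-category ECR tree:

```python
import numpy as np

# Toy hierarchy: total -> 2 departments, department A with 2 bottom
# series, department B with 3 (hypothetical, not the ECR tree).
groups = [[0, 1], [2, 3, 4]]   # bottom-series indices per department
m_K = 5                        # series in the bottom level

# S has one row per series in the whole hierarchy
# (1 total + 2 departments + 5 bottom series = 8 rows), m_K columns.
rows = [np.ones(m_K)]                          # top level: sums everything
for g in groups:                               # intermediate level
    r = np.zeros(m_K)
    r[g] = 1
    rows.append(r)
rows.extend(np.eye(m_K))                       # bottom level: identity
S = np.vstack(rows)

y_bottom = np.array([10., 20., 30., 40., 50.])
y_all = S @ y_bottom   # stacks total, department and bottom-level values
print(y_all)           # [150. 30. 120. 10. 20. 30. 40. 50.]
```

Pre-multiplying the bottom-level vector by S reproduces every series in the hierarchy, which is exactly the role the summing matrix plays in the equation above.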

Optimal combination approach

Now that we have introduced the notation for describing a hierarchy, we can present the generalised framework, also introduced in Hyndman et al. (2011), which can describe every currently used method for hierarchical forecasting. Given a hierarchy of time series data with K levels and m series, we start by computing independent forecasts for each one of the m series for periods 𝑛 + 1, …, 𝑛 + ℎ, calling 𝑌̂𝑋,𝑛(ℎ) the forecast h periods ahead for series X, so that 𝑌̂𝑛(ℎ) denotes the vector of all the base forecasts, built with the same structure as 𝑌𝑡.

Then, given that all currently used hierarchical forecasting methods consist of some linear combination of the base forecasts, each one of them can be described by introducing some matrix P of order 𝑚𝐾 × 𝑚, that is:

Ỹ𝑛(ℎ) = 𝑆𝑃𝑌̂𝑛(ℎ)

In fact every hierarchical method must produce coherent forecasts, in the sense that the forecasts for the lower levels must add up to those for the higher levels. This is implemented by the P matrix, which extracts and combines the relevant elements of the base forecasts; these are then summed up by the summing matrix in order to produce the final revised forecast Ỹ𝑛(ℎ).

This notation can describe both bottom-up and top-down hierarchical forecasting methods. Using

𝑃 = [0𝑚𝐾×(𝑚−𝑚𝐾) | 𝐼𝑚𝐾]

forecasts aggregated through the bottom-up method are obtained: the P matrix extracts only the bottom-level forecasts, as indicated by the 𝑚𝐾 × 𝑚𝐾 identity matrix, and these are then summed up according to the hierarchical structure defined by the summing matrix. With

𝑃 = [𝑝 | 0𝑚𝐾×(𝑚−1)]

a top-down aggregation is produced, given that 𝑝 = [𝑝1, 𝑝2, …, 𝑝𝑚𝐾]′ is a vector of top-down proportions summing to one; note that different top-down forecasting methods can be represented by different definitions of these proportions.
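The two P matrices can be sketched for a toy hierarchy (hypothetical numbers, Python/numpy rather than the thesis's R code; the top-down proportions are chosen arbitrarily for illustration):

```python
import numpy as np

# Toy hierarchy: m = 8 series in total (1 total + 2 departments +
# 5 bottom-level product categories), m_K = 5 at the bottom.
m, m_K = 8, 5
S = np.vstack([np.ones(m_K),
               [1, 1, 0, 0, 0],
               [0, 0, 1, 1, 1],
               np.eye(m_K)])

# Bottom-up: P = [0 | I] picks out the m_K bottom-level base forecasts.
P_bu = np.hstack([np.zeros((m_K, m - m_K)), np.eye(m_K)])

# Top-down: P = [p | 0] distributes the top-level base forecast
# according to proportions summing to one (values chosen arbitrarily).
p = np.array([0.1, 0.2, 0.3, 0.2, 0.2])
P_td = np.hstack([p.reshape(-1, 1), np.zeros((m_K, m - 1))])

# Incoherent base forecasts for all 8 series (toy numbers).
y_hat = np.array([100., 28., 80., 9., 22., 31., 38., 47.])
for P in (P_bu, P_td):
    y_rev = S @ (P @ y_hat)   # revised forecasts
    # Coherence: the bottom level sums exactly to the top level.
    assert np.isclose(y_rev[0], y_rev[3:].sum())
```

Whatever P is used, pre-multiplying by S guarantees the revised forecasts are coherent; the two methods differ only in which base forecasts P keeps.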

The first important result in Hyndman et al. (2011) stems directly from this set of definitions. Assume that the base forecasts are unbiased, that is 𝐸[𝑌̂𝑛(ℎ)] = 𝐸[𝑌𝑛+ℎ], and that the revised forecasts are preferable if also unbiased; then the following equality must be maintained: 𝐸[Ỹ𝑛(ℎ)] = 𝐸[𝑌̂𝑛(ℎ)] = 𝑆𝐸[𝑌𝐾,𝑛(ℎ)]. Let us define 𝛽𝑛(ℎ) = 𝐸[𝑌𝐾,𝑛+ℎ | 𝑌1, …, 𝑌𝑛] as the true mean of the future values of the bottom-level series. Then 𝐸[Ỹ𝑛(ℎ)] = 𝑆𝑃𝐸[𝑌̂𝑛(ℎ)] = 𝑆𝑃𝑆𝛽𝑛(ℎ), and the authors are able to state that the unbiasedness of the revised forecasts will hold if and only if 𝑆𝑃𝑆 = 𝑆.


Therefore, calling 𝛴ℎ the variance of the base forecasts, the authors state that the variance of the revised forecasts can be written as:

𝑉𝑎𝑟[Ỹ𝑛(ℎ)] = 𝑆𝑃𝛴ℎ𝑃′𝑆′

Also, the set of pre-reconciliation forecasts can be expressed as:

𝑌̂𝑛 (ℎ) = 𝑆𝛽𝑛 (ℎ) + 𝜀ℎ

Where 𝛽𝑛 (ℎ) is unknown and 𝜀ℎ has zero mean and covariance matrix 𝑉𝑎𝑟(𝜀ℎ ) = 𝛴ℎ .

The intuition which forms the basis for the main result of the paper is that the previous equation can be treated as a regression: given full knowledge of 𝛴ℎ, generalized least squares could be used to obtain the minimum variance unbiased estimate of 𝛽𝑛(ℎ).

As in most practical applications this is not the case, Hyndman et al. (2011) introduce one further assumption: that the distribution of the forecast errors displays the same aggregational structure as the original data, that is 𝜀ℎ ≈ 𝑆𝜀𝐾,ℎ, where 𝜀𝐾,ℎ is the vector of forecast errors of the bottom-level series. As a direct result of this assumption, the variance matrix can be expressed as 𝛴ℎ = 𝑆𝛺ℎ𝑆′, where 𝛺ℎ = 𝑉𝑎𝑟(𝜀𝐾,ℎ).

Theorem 1 in Hyndman et al. (2011) states: given 𝑌̂ = 𝑆𝛽 + 𝜀 with 𝑉𝑎𝑟(𝜀) = 𝛴ℎ = 𝑆𝛺ℎ𝑆′ and S a summing matrix, the generalized least squares estimate of 𝛽 obtained using the Moore-Penrose generalized inverse is independent of 𝛺ℎ:

𝛽̂ℎ = (𝑆′𝛴ℎ†𝑆)⁻¹𝑆′𝛴ℎ†𝑌̂ = (𝑆′𝑆)⁻¹𝑆′𝑌̂

with variance matrix 𝑉𝑎𝑟(𝛽̂ℎ) = 𝛺ℎ; this is the minimum variance linear unbiased estimate, and 𝛴ℎ† is the Moore-Penrose generalised inverse of 𝛴ℎ.

This theorem shows that OLS can be employed rather than GLS in building the set of revised forecasts; therefore the revised forecasts for the optimal combination approach to hierarchical forecasting can be expressed as:

Ỹ𝑛(ℎ) = 𝑆(𝑆′𝑆)⁻¹𝑆′𝑌̂𝑛(ℎ)

so that 𝑃 = (𝑆′𝑆)⁻¹𝑆′.

This result also shows that under the assumption 𝛴ℎ = 𝑆𝛺ℎ𝑆′ the optimal combination of base forecasts depends only on the aggregational structure, therefore allowing for the use of any set of weights when computing the revised forecasts.
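The whole optimal combination step is a single projection, which can be sketched as follows (Python/numpy, toy hierarchy and hypothetical base forecasts; not the R implementation used in the thesis):

```python
import numpy as np

# Toy summing matrix: 1 total + 2 departments + 5 bottom series.
S = np.vstack([np.ones(5),
               [1, 1, 0, 0, 0],
               [0, 0, 1, 1, 1],
               np.eye(5)])

# Incoherent base forecasts: 100 != 28 + 80, 28 != 9 + 22, and so on.
y_hat = np.array([100., 28., 80., 9., 22., 31., 38., 47.])

# Optimal combination: project the base forecasts onto the space of
# coherent forecasts with P = (S'S)^(-1) S'.
P = np.linalg.solve(S.T @ S, S.T)
y_rev = S @ (P @ y_hat)

# The revised forecasts are coherent by construction.
assert np.isclose(y_rev[0], y_rev[3:].sum())
assert np.isclose(y_rev[1], y_rev[3:5].sum())
assert np.isclose(y_rev[2], y_rev[5:].sum())
print(np.round(y_rev, 2))
```

Note that the projection uses information from every level, so the revised bottom-level forecasts generally differ from the base ones, unlike in the bottom-up method.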

Lastly, the authors point out that estimating the covariance matrix 𝛴ℎ can be avoided unless one is interested in building prediction intervals, for which the variance of the revised forecasts, 𝑉𝑎𝑟(Ỹ𝑛(ℎ)) = 𝑆𝑃𝛴ℎ𝑃′𝑆′, is needed.

Having described the forecasting framework, the basis has been laid for the empirical application: in the next section this framework for hierarchical time series forecasting is applied to Italian retail sales of packaged products.

5. Comparative analysis

In the following section we compare measures of forecast accuracy for the main hierarchical reconciliation methods, namely the classical top-down and bottom-up approaches and the optimal combination forecasts presented by Hyndman et al. (2011). We repeat this analysis with forecasts built with both Arima and ETS models in order to determine which model performs better when applied to the time series of Italian retail sales structured as a 4-level hierarchy. The aim is to identify the optimal formulation of hierarchical models when applied to retail sales in Italy and to provide further experimental evidence on the forecasting performance of the recently introduced optimal combination approach.

The dataset used as a basis for the empirical tests was provided by IRI and consists of 430 time series for the sales of the 430 product categories, starting in January 2012 and ending in September 2017. These form the bottom level of the hierarchy, with two intermediate levels, departments and sectors, respectively made up of 8 and 73 series, summing up to total sales at level zero. The aggregational structure of this hierarchy, and hence the summing matrix, is defined by the ECR tree of product categories, which is adhered to by most operators in the Italian retail industry.

When comparing the forecasting performance of hierarchical models, a rolling-origin time series cross-validation approach is adopted; the general structure is the same as the one described previously. We have 69 observations available for each time series, starting in January 2012 and ending in September 2017. The minimum length required for estimation is assumed to be 3 years, which is far shorter than would have been preferable but is forced by the relatively short span of the available data. Given the monthly seasonal frequency, at the shortest estimation length each model has only three observations per seasonal period, which hampers in particular the estimation of the seasonal Arima models, providing a possible reason for their relatively poor out-of-sample performance compared to the ETS model.

For each series in the hierarchy, the rolling-origin cross-validation procedure outputs 33 observations of the one-step-ahead forecasting accuracy and 21 observations of the twelve-step-ahead one.
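The rolling-origin scheme can be sketched as follows (Python, with a seasonal naive forecaster standing in for the ETS/Arima models actually used, and a synthetic series in place of the IRI data):

```python
import numpy as np

def seasonal_naive(train, h, m=12):
    """Placeholder forecaster: repeat the last observed seasonal cycle
    (standing in for the ETS/Arima models used in the thesis)."""
    last_cycle = train[-m:]
    return np.array([last_cycle[i % m] for i in range(h)])

def rolling_origin_errors(y, min_train=36, h_max=12):
    """For every origin t >= min_train, forecast up to h_max steps
    ahead and store the absolute error at each horizon."""
    errors = {h: [] for h in range(1, h_max + 1)}
    for t in range(min_train, len(y)):
        fc = seasonal_naive(y[:t], h_max)
        for h in range(1, h_max + 1):
            if t + h <= len(y):
                errors[h].append(abs(y[t + h - 1] - fc[h - 1]))
    return errors

# Synthetic monthly series with trend and seasonality, 69 observations.
y = np.sin(2 * np.pi * np.arange(69) / 12) + 0.01 * np.arange(69)
err = rolling_origin_errors(y)
# 33 one-step-ahead errors; the count at h = 12 depends on how the
# final origins are handled (22 under this indexing, 21 in the thesis).
print(len(err[1]), len(err[12]))
```

Longer horizons yield fewer error observations because the last origins do not have enough future data left to score against.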

As a measure of forecasting performance the MAE is adopted, a change from the MAPE reported in the previous table. This choice is motivated by the problem of variable magnitudes at any given level of the hierarchy: for example, at the departments level general grocery accounts for 37% of sales while pet care only has a weight of 2%. The problem is compounded when considering single product categories: sales of some seasonal products, of insecticides or of some innovative product categories like vegetarian cured meats are extremely hard to predict, as much of their sales depends on assortment and inventory decisions, on volatile seasonal factors or on other factors. Nonetheless these products account for a very low share of total sales, and it seems misleading to give them the same importance in the assessment of model accuracy as categories accounting for a far higher share of total revenue.

Therefore the tables below are built according to the following procedure: for each origin point t (k < t < T) forecasts are computed, then the MAEs of the individual series at each level are summed to obtain a measure of the total MAE for each level of the hierarchy at a single forecasting horizon and a single t. Finally, an average is taken of these MAE measures at each forecasting horizon, giving a final measure of the average forecasting error for a given level of the hierarchy at a given horizon. For example, the average of the 31 measures of the 3-months-ahead MAE is taken as the measure of the forecast error of a given method at a given level for h = 3.
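The aggregation just described reduces, for a fixed horizon, to a sum over series followed by an average over origins. A minimal sketch with random toy numbers in place of the actual errors (the IRI data are not reproduced here):

```python
import numpy as np

# Toy absolute errors for a fixed horizon h: one row per forecast
# origin, one column per series at the level (not the thesis data).
rng = np.random.default_rng(0)
abs_errors = {"departments": rng.uniform(0, 2, size=(31, 8)),
              "sectors":     rng.uniform(0, 2, size=(31, 73))}

summary = {}
for level, e in abs_errors.items():
    total_per_origin = e.sum(axis=1)          # sum the errors over series
    summary[level] = total_per_origin.mean()  # then average over origins
print({k: round(v, 1) for k, v in summary.items()})
```

Summing before averaging is what gives high-revenue categories proportionally more influence on the level totals, which is the intent stated above.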

Combination methods:

The tables below contain the measures of the out-of-sample forecast errors obtained through this procedure; table 2 and table 3 differ only in the computation of the base forecasts, which are ETS forecasts in the first case and Arima forecasts in the second. The software used for statistical modelling is R, with the “forecast” package used for the base forecasts and the “hts” package for hierarchical reconciliation. The computational time was 4 hours and 40 minutes for both tables, computed on a laptop with an i5-7200 CPU, clock speed 2.5 GHz, and 8 GB DDR4 RAM.

ETS forecasts forecast horizon (h)

1 2 3 4 5 6 7 8 9 10 11 12 Average
Top level: Italia
comb 40.1 39.7 44.2 45.5 51.3 53.3 54.7 56.1 61.9 58.5 66.0 68.2 53.3
bottom up 40.9 41.2 45.6 41.4 49.6 49.8 54.1 59.2 61.3 62.0 69.8 70.3 53.8
top down 41.6 38.4 45.0 45.0 52.9 55.9 55.5 56.5 63.4 59.6 66.7 68.4 54.1
Level 1: Departments
comb 66.3 68.5 74.4 75.5 79.6 80.1 80.2 82.3 87.9 86.1 91.8 95.6 80.7
bottom up 68.5 72.4 77.4 78.1 80.1 80.9 80.8 82.6 86.1 88.2 94.2 94.4 82.0
top down 70.9 71.4 78.6 82.7 85.1 85.9 89.0 86.2 94.8 92.8 98.8 101.9 86.5
Level 2: Sectors
comb 113.2 120.5 131.5 136.1 139.1 141.7 146.4 149.3 155.5 160.4 167.2 173.3 144.5
bottom up 114.3 123.1 135.7 140.6 143.3 143.8 148.3 149.6 157.7 162.3 169.0 175.2 146.9
top down 120.6 126.6 138.1 145.9 148.7 151.6 156.7 156.7 162.7 169.8 176.5 181.7 153.0
Level 3: Products
comb 163.4 183.7 204.5 214.6 217.2 217.9 221.9 224.4 231.8 238.1 246.0 254.2 218.1
bottom up 162.1 181.9 201.3 211.0 214.7 216.9 221.6 224.4 233.1 239.0 246.9 254.0 217.2
top down 171.0 189.4 208.5 218.4 222.7 224.8 230.1 231.3 235.8 243.2 251.4 259.3 223.8

Table 2: average of total MAE for out-of-sample forecasting of alternative aggregation methods for

hierarchical forecasting with ETS forecast method, comb stands for optimal combination approach

Arima forecasts forecast horizon (h)
1 2 3 4 5 6 7 8 9 10 11 12 Average
Top level: Italia
comb 66.1 69.6 67.8 69.7 73.0 71.1 79.0 76.5 82.4 85.7 86.2 93.2 76.7
bottom up 95.8 95.2 82.8 97.4 92.4 96.1 107.6 106.8 107.5 116.7 121.6 120.8 103.4
top down 49.9 51.1 63.6 65.6 71.6 67.6 73.8 70.9 68.4 69.8 71.9 81.6 67.2
Level 1: Departments
comb 113.4 130.3 139.6 144.2 151.3 151.6 156.3 148.7 149.2 149.1 152.0 155.2 145.1
bottom up 132.5 148.8 155.9 171.3 174.1 175.2 182.4 172.5 171.5 173.9 174.8 175.1 167.3
top down 127.4 142.2 137.9 145.0 150.8 137.4 152.0 143.2 150.7 154.8 156.0 165.7 146.9
Level 2: Sectors
comb 210.4 247.0 274.4 287.5 295.6 293.8 293.0 271.1 265.5 267.5 272.5 279.1 271.4
bottom up 213.5 249.5 271.7 287.6 291.8 293.5 294.4 281.8 272.7 272.3 273.4 271.1 272.8
top down 254.3 289.2 325.1 344.2 349.8 347.7 358.9 327.7 325.5 329.5 332.5 345.5 327.5
Level 3: Products
comb 317.0 362.0 398.1 421.0 423.8 425.5 434.0 417.2 414.7 416.8 422.7 425.9 406.6
bottom up 291.8 335.8 365.2 387.2 394.2 395.8 402.2 389.6 386.0 388.8 393.0 393.2 376.9
top down 351.2 405.5 447.9 470.4 471.8 469.6 486.5 462.4 458.6 462.8 469.0 472.8 452.4

Table 3: average of total MAE for out-of-sample forecasting of alternative aggregation methods for

hierarchical forecasting with Arima forecast method, comb stands for optimal combination approach

These results seem to confirm that the overall performance of the optimal combination reconciliation method is better than that of the bottom-up and top-down approaches. In table 2, where the base forecasts are built with the ETS model, both the optimal combination and the bottom-up methods clearly outperform the top-down approach; this observation is consistent with the relevant literature, as the bottom-up method is widely regarded as superior (Schwarzkopf et al., 1988) except when there is reason to believe the bottom-level series to be unreliable.

At the intermediate levels, both for departments and sectors, the optimal combination method is strictly superior to the bottom-up method, while at the bottom level the latter is the most accurate, even if only by a small margin. At the top level, that of total sales, the situation is less clear, as neither of the two is uniformly better than the other, but the average MAE is slightly lower for the optimal combination approach.

Table 3 presents slightly messier results: the bottom-up and top-down methods show the lowest MAE values at the bottom and at the top level respectively, while the combination method outperforms them both at the intermediate levels in terms of average MAE, though it is not uniformly better at every forecasting horizon. It should be noted, however, that the optimal combination approach is either the best performing or the second best at all 4 levels of the hierarchy.

Having confirmed that the optimal combination approach is in general the best performing one, in the next section we test different weighting procedures.

Weighting methods:

The optimal combination reconciliation method allows for the implementation of any set of weights, which can be independent of the observed time series (Hyndman et al., 2011). In this section the wls, MinT and nseries weighting procedures for hierarchical reconciliation are compared.

The nseries weights are the simplest, as they are based on the number of series present at each node of the hierarchy: they are equal to the inverse of the row sums of the summing matrix.
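A sketch of how such structural weights enter the combination step (toy hierarchy and hypothetical base forecasts, Python/numpy; the exact scaling used by the hts package may differ):

```python
import numpy as np

# Toy summing matrix: 1 total + 2 departments + 5 bottom series.
S = np.vstack([np.ones(5),
               [1, 1, 0, 0, 0],
               [0, 0, 1, 1, 1],
               np.eye(5)])

# nseries weights: the inverse of the row sums of S, i.e. one over the
# number of bottom-level series aggregated at each node.
weights = 1.0 / S.sum(axis=1)      # 1/5 for the top, 1/2 and 1/3 for
W = np.diag(weights)               # the departments, 1 at the bottom

# Weighted-least-squares version of the combination step.
y_hat = np.array([100., 28., 80., 9., 22., 31., 38., 47.])
P = np.linalg.solve(S.T @ W @ S, S.T @ W)
y_rev = S @ (P @ y_hat)
assert np.isclose(y_rev[0], y_rev[3:].sum())   # forecasts remain coherent
```

Whatever diagonal weight matrix is used, the reconciled forecasts stay coherent; the weights only change how much each node's base forecast influences the combination.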

Hyndman, Lee and Wang (2016) introduce the use of weighted least squares (wls) to produce reconciled forecasts in the context of optimal combination reconciliation and develop an improved algorithm for the computation of the forecast variances used to build the wls weights.

Athanasopoulos et al. (2018) point out that this method ignores the off-diagonal elements of the covariance matrix and, building on the approach introduced in Hyndman et al. (2011), introduce the MinT weighting procedure, which uses a full estimate of the covariance matrix to build the weights. This approach aims at minimizing the variances of the forecast errors under the assumption that the forecasts are unbiased.

Table 4 below displays the forecasting performance of these three weighting procedures. The evaluation is carried out with the same rolling-forecast-origin cross-validation procedure that was applied to the different combination methods in the previous section; the computations took about 2 hours on the same laptop as in the previous example.

method = comb forecast horizon (h)

ETS forecast 1 2 3 4 5 6 7 8 9 10 11 12 Average
Top level: Italia
wls 40.1 39.7 44.2 45.5 51.3 53.3 54.7 56.1 61.9 58.5 66.0 68.2 53.3
nseries 39.1 39.1 42.2 43.0 48.5 51.0 52.1 54.5 59.5 55.4 63.5 65.9 51.2
mint 40.1 40.6 45.8 44.1 51.5 54.0 56.7 61.4 62.7 63.8 72.8 73.6 55.6
Level 1: Departments
wls 66.3 68.5 74.4 75.5 79.6 80.1 80.2 82.3 87.9 86.1 91.8 95.6 80.7
nseries 65.9 68.4 73.5 75.3 78.0 78.2 77.7 80.4 84.1 83.7 89.3 92.6 78.9
mint 66.8 73.6 78.3 78.3 82.5 80.5 77.4 84.1 86.6 89.4 94.1 97.8 82.4
Level 2: Sectors
wls 113.2 120.5 131.5 136.1 139.1 141.7 146.4 149.3 155.5 160.4 167.2 173.3 144.5
nseries 113.0 120.8 132.7 137.1 139.5 141.3 145.7 148.6 155.5 159.8 166.8 172.3 144.4
mint 114.4 127.6 138.6 142.4 144.5 145.1 146.6 151.6 156.1 163.0 171.9 174.7 148.0
Level 3: Products
wls 163.4 183.7 204.5 214.6 217.2 217.9 221.9 224.4 231.8 238.1 246.0 254.2 218.1
nseries 166.9 187.9 208.9 219.1 222.0 222.8 226.5 229.5 237.6 244.1 251.6 260.1 223.1
mint 162.1 187.0 208.8 218.4 220.6 220.7 223.4 227.1 233.0 242.3 252.2 259.5 221.3

Table 4: average of total MAE for out-of-sample forecasting employing different weighting methods, the

base forecasts are built with ETS model and optimal combination reconciliation

The most important result is that in this empirical test the nseries weights outperform the other two methods at the higher levels of aggregation: this procedure is uniformly better for the two highest levels of the hierarchy and is essentially tied with the wls weights at the sector level, where its average MAE is only 0.1 points lower. The most likely reason behind the high accuracy of the nseries weights has to do with the structure of the hierarchy, as both at the departments and at the sectors level the ECR classification which defines our hierarchical structure is very imbalanced.

For example, the whole department of fruits and vegetables, despite being composed of 57 product categories, is divided in only two sectors, one for fruits and one for vegetables; pet care supplies, which account for only around 2% of total sales, are nonetheless counted as one of the 8 departments and are divided in 3 of the 73 sectors, one of which contains only a single product category. On the other hand, the department of general groceries accounts for 36% of total sales and contains a staggering 22 of the 73 sectors as well as 128 of the 430 product categories.

The nseries weights therefore seem able to outperform the other methods because they model the structure of our hierarchy better, by taking into account the different number of series at each node.

6. Alternative hierarchical structure:

The hypothesis underlying this section is that the hierarchical structure we have used so far may be suboptimal, and that changing it could improve the forecasting performance of the hierarchical models at the highest level of aggregation.

The ECR classification is produced by GS1, a non-profit organization that develops and maintains global standards, the best known of which is the barcode; it aims at improving the efficiency of supply chains and at providing the industry with a common nomenclature. Providing forecasts for the middle groupings, namely departments and sectors, is a worthwhile effort as they are universally adopted by industry practitioners; nonetheless, if we assume to be interested only in the forecast for sales at the aggregate level, then the aggregational structure of the hierarchy can be redefined freely.


In fact the hierarchy we have used so far is mainly built to address concerns about supply chain management and depends principally on assortment allocation; it may thus be suboptimal in a context of pure time series analysis, giving us reason to consider revisiting it. The observations in the previous section point to a very imbalanced hierarchical structure, which could be a source of inefficiency in our model; therefore, in the following section an alternative is going to be proposed.

Time series clustering:

Time series clustering methods are employed to build an alternative hierarchy, which is then tested in terms of forecasting accuracy at the aggregate level against the original hierarchy and against univariate time series forecasting models. The alternative hierarchy is obtained through partitional time series clustering, employing dynamic time warping as the distance measure and DBA centroids.

Partitional clustering is a clustering strategy which assigns each observation univocally to one of k clusters, where k is given at the beginning of the procedure; in this case 40 clusters are employed, a number chosen heuristically by observing the scree plot of the sum of squared within-cluster distances at different values of k, the silhouette curve and the CH curve.

This set of partitional clusters is built using the “tsclust” function included in the “dtwclust” R package. The clustering algorithm starts by initialising k centroids, created by randomly choosing k series from our database and taking them as the centroids of k individual clusters; then the following procedure is repeated either a set number of times or until no change in the cluster assignment is detected:

• The distance of each series from each centroid is calculated and each series is assigned to the cluster whose centroid is closest according to the chosen distance function.

• The centroids of the clusters which have changed in the previous step are recomputed.

• If any cluster is empty, it is reinitialised by setting its centroid equal to a random series in the dataset.
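The loop above can be sketched as follows (Python/numpy, with the Euclidean distance and mean centroids standing in for the DTW distance and DBA centroids used in the thesis, and synthetic data in place of the IRI series):

```python
import numpy as np

def partitional_cluster(series, k, n_iter=20, seed=0):
    """Sketch of the partitional loop: assign, recompute centroids,
    reinitialise empty clusters; Euclidean/mean stand in for DTW/DBA."""
    rng = np.random.default_rng(seed)
    # Initialise the k centroids as k randomly chosen series.
    centroids = series[rng.choice(len(series), k, replace=False)].copy()
    assign = np.full(len(series), -1)
    for _ in range(n_iter):
        # 1. Assign each series to the cluster with the closest centroid.
        d = np.linalg.norm(series[:, None, :] - centroids[None], axis=2)
        new_assign = d.argmin(axis=1)
        if np.array_equal(new_assign, assign):
            break                            # no change: stop early
        assign = new_assign
        for c in range(k):
            members = series[assign == c]
            if len(members):                 # 2. recompute changed centroids
                centroids[c] = members.mean(axis=0)
            else:                            # 3. reinitialise empty clusters
                centroids[c] = series[rng.integers(len(series))]
    return assign, centroids

# Two well-separated groups of ten series each (toy data).
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0.0, 0.1, size=(10, 12)),
               rng.normal(5.0, 0.1, size=(10, 12))])
labels, _ = partitional_cluster(X, k=2)
print(np.bincount(labels, minlength=2))
```

Swapping in a DTW distance and a DBA centroid update in steps 1 and 2 gives the structure that tsclust implements.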

The dynamic time warping distance:

The main issue in time series clustering is the definition of the distance function: as each observation in our dataset consists of a single time series, the distance function must provide a measure of the distance between each pair of time series. The simplest approach would be to use the Euclidean distance, computed from the pointwise differences between the two time series; as shown in Liao (2005), this metric can be improved by implementing some normalization procedure before the clustering.

As the results of the application of this measure appeared unsatisfactory, a more complex approach was deemed necessary, and the Dynamic Time Warping (DTW) distance was implemented; Giorgino (2009) provides a good review of the algorithm, which is implemented in the dtwclust R package (Espinosa, 2018). This measure, the namesake of the dtwclust package, works in general terms by first stretching or compressing a pair of time series in order to make them as similar as possible and then computing the sum of the pairwise distances between the aligned time series points.

The computation of the DTW distance starts by building an 𝑚 × 𝑛 matrix, called the Local Cost Matrix (LCM), where m and n are the lengths of the two series x and y being compared; each element of this matrix is defined as follows:

𝑙𝑐𝑚(𝑖,𝑗) = (|𝑥𝑖 − 𝑦𝑗|^𝑝)^(1/𝑝)

Thus each (𝑖,𝑗) entry of the LCM consists of the 𝑙𝑝 norm between 𝑥𝑖 and 𝑦𝑗, so the computation can get expensive for large datasets; fortunately in our case we only have 430 series of 69 observations, which, despite not being a trivial number, does not pose insurmountable computational issues.

Once the local cost matrix is established, the DTW algorithm must find the optimal path through it, starting from lcm(1,1) and ending at lcm(n,m); this process substantially amounts to optimally aligning the two series before calculating the pointwise distance between them. The path is built step by step, each time minimizing the increase in cost under a set of constraints chosen by the analyst. Among these the most important is the step pattern, which determines which moves are available from each point of the lcm matrix as well as the cost weight associated with each direction; in our case we employ the “symmetric2” pattern, which places a higher weight on diagonal moves compared to orthogonal ones.

Given 𝜙 = {(1,1), …, (𝑛,𝑚)}, the set of points which constitute this optimal path, the DTW distance between the two series is:

𝐷𝑇𝑊𝑝(𝑥, 𝑦) = (∑𝑘∈𝜙 𝑚𝜙,𝑘 · 𝑙𝑐𝑚(𝑘)^𝑝 / 𝑀𝜙)^(1/𝑝)

where 𝑚𝜙,𝑘 is a per-step weighting coefficient and 𝑀𝜙 is the normalization constant, both defined in depth in Giorgino (2009).
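The algorithm just described can be sketched in a few lines of plain dynamic programming (Python, p = 1, no global constraint; the dtwclust implementation is considerably more optimised):

```python
import numpy as np

def dtw_distance(x, y):
    """DTW with the symmetric2 step pattern: diagonal steps weighted 2,
    orthogonal steps weighted 1, normalised by M_phi = n + m."""
    n, m = len(x), len(y)
    lcm = np.abs(np.subtract.outer(x, y))      # local cost matrix, p = 1
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            c = lcm[i - 1, j - 1]
            D[i, j] = min(D[i - 1, j - 1] + 2 * c,   # diagonal move
                          D[i - 1, j] + c,           # vertical move
                          D[i, j - 1] + c)           # horizontal move
    return D[n, m] / (n + m)

a = np.array([0., 0., 1., 2., 1., 0.])
b = np.array([0., 1., 2., 1., 0., 0.])   # the same peak, shifted by one
# DTW aligns the shifted peaks, so the distance is 0 even though the
# pointwise l1 distance is far from zero.
print(dtw_distance(a, b), np.abs(a - b).mean())
```

The example illustrates why DTW suits retail series: two categories with the same seasonal shape but peaks in different months end up close under DTW while they look distant pointwise.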

This is the algorithm employed in the dtwclust package for the standard DTW distance; in order to improve computational times we also introduce global constraints, a slight modification commonly used by practitioners.

Figure 3: from Espinosa 2018, structure of a lower bound constraint in the lcm matrix for the calculation of DTW distance

A lower bound can be imposed on the lcm matrix, which is then restricted as seen in figure 3 above; this both restricts the possible optimal paths and reduces the computational load, as it permits us to skip the calculations for a large part of the lcm matrix which would normally not be part of the optimal path in any case. In our clustering model the “improved lower bound” (Lemire, 2009) is used, as it is described in Espinosa (2018) as one of the two most effective.

DBA centroids:

The partitional clustering procedure we are applying requires both the definition of a distance function, which has been given in the previous section, and the definition of a procedure for estimating the centroid of each cluster. To solve this second problem, the DTW barycenter averaging (DBA) procedure presented in Petitjean et al. (2011) is adopted, as it was developed specifically to deal efficiently with the estimation of centroids under the DTW distance.

This procedure is particularly useful as the resulting centroid is independent of the order in which the series enter the calculation, meaning that the computational requirements for updating the centroid in case of changes in the clusters are greatly reduced. The DBA algorithm starts by selecting a random series in the cluster as the initial centroid and then iteratively computes the DTW alignment of the centroid with each series in the cluster, in a procedure that ends with a centroid that is as closely aligned as possible to all the members of the cluster.

Clustering Structure:

Having defined the structure of the time series clustering algorithm employed, it is relevant to note that the structure of the clusters thus obtained is very different from that of the original hierarchy, as the new grouping is based mainly on quantitative features: the level of the series, their seasonality and their trend. The pictures below show the time series for the sales of the product categories belonging to two of the 40 clusters used.

Figure 4: product categories belonging to cluster 11

Cluster 11 seems to contain products with very strong seasonal sales; note that, thanks to the alignment procedure at the heart of the Dynamic Time Warping distance, series whose seasonal peaks happen in different months are put together in the same cluster. Cluster 30, on the other hand, groups together product categories with consistent sales fluctuating around 20 million euros per month and a strong positive trend.

Empirical application:

The aim of this section is twofold: measures of the forecast errors at the highest level of aggregation for hierarchical models built according to the ECR hierarchical structure, and those of the same models employing the clustering-derived classification, are going to be compared not only with each other but also against forecasts for the univariate time series of total sales computed with the three methods described in section one.

All the models considered display a negative mean error at all 12 forecast horizons. This seems preferable to a positive one: a negative mean error implies an overestimation of future sales, which can result in unsold stock, while a positive mean error is more easily interpreted as lost sales, the bane of any inventory management problem.
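The sign convention assumed here is error = actual − forecast, so a forecast that overshoots sales produces a negative error. A toy example with hypothetical numbers makes the convention explicit:

```python
# Sign convention: error = actual - forecast, so an overestimating
# forecast yields a negative error.
actual = [100, 110, 105]
forecast = [104, 112, 109]  # hypothetical forecasts that overshoot sales

errors = [a - f for a, f in zip(actual, forecast)]   # [-4, -2, -4]
mean_error = sum(errors) / len(errors)               # negative => overestimation
```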

In order to assess the distribution of the forecast errors related to the Arima forecasting model, we produced a series of simulations of two-year-ahead forecasts starting from September 2016. Figure 5 below displays the residuals for the forecasts built with the Arima model on the clustering-based hierarchy.

Figure 5: residuals for two years ahead forecast for the highest level of the clustering based hierarchical model with Arima forecasts

Table 5 below shows the results of the same rolling origin cross validation procedure employed in the previous sections, which produces the final set of results of this work. The sample for the computation of the out-of-sample forecasts is the same as in the previous example, that is, January 2015 to September 2017.
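As a sketch of the rolling origin procedure, the following illustrative Python snippet evaluates a simple seasonal naive forecaster (a stand-in for the ETS/Arima models, which this thesis fits in R) and returns the MAPE at each horizon; all names are hypothetical:

```python
import numpy as np

def seasonal_naive(history, h, period=12):
    """Seasonal naive forecast: repeat the last observed seasonal cycle."""
    history = np.asarray(history, dtype=float)
    last_cycle = history[-period:]
    return np.array([last_cycle[i % period] for i in range(h)])

def rolling_origin_mape(y, h=12, min_train=24, period=12):
    """Rolling-origin evaluation: for each origin, train on y[:origin],
    forecast h steps ahead, collect the absolute percentage error at
    each horizon, and return the per-horizon MAPE (in percent)."""
    y = np.asarray(y, dtype=float)
    errors = [[] for _ in range(h)]
    for origin in range(min_train, len(y) - h + 1):
        fc = seasonal_naive(y[:origin], h, period)
        actual = y[origin:origin + h]
        for step in range(h):
            errors[step].append(abs(actual[step] - fc[step])
                                / abs(actual[step]))
    return [100 * np.mean(e) for e in errors]
```

Averaging the errors over all origins at a fixed horizon, as above, is what produces one MAPE column per horizon h in Table 5.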

While time series data at the total sales level are available since 2002, in order to compare the accuracy of the univariate forecasting methods with that of the hierarchical forecasts, for which data are available only from 2012, the length of the estimation period for the univariate analysis has been reduced to match that of the hierarchical models.

The computational time was 15 minutes for the execution of the clustering algorithm and 2 hours for the rolling forecast origin cross validation procedure on the custom hierarchy. The program was executed on Microsoft R Open 3.4.3, using the Intel MKL for parallel computing, on a laptop with an i5-7200 CPU (2.5 GHz clock speed) and 8 GB of DDR4 RAM.

MAPE for total Italian retail sales, by forecast horizon (h)

Top level: Italy             1     2     3     4     5     6     7     8     9     10    11    12    Average
Clustering hierarchy ETS     0.91  0.83  0.95  0.95  1.07  1.13  1.13  1.18  1.30  1.24  1.39  1.40  1.12
ECR hierarchy ETS            1.03  1.00  1.07  1.10  1.19  1.30  1.38  1.49  1.52  1.58  1.78  1.82  1.35
Clustering hierarchy Arima   1.46  1.37  1.44  1.45  1.44  1.56  1.71  1.74  1.77  1.85  2.02  1.95  1.65
ECR hierarchy Arima          2.23  2.26  2.02  2.35  2.49  2.66  2.74  2.78  2.69  3.00  2.99  3.05  2.61
NN-AR                        1.77  1.97  2.05  2.03  2.08  2.23  2.24  2.26  2.29  2.44  2.55  2.53  2.20
Arima                        1.54  1.65  1.78  1.86  2.07  2.21  2.31  2.54  2.64  2.82  3.02  3.04  2.29
ETS                          1.75  1.81  1.88  1.91  1.98  2.06  2.11  2.18  2.28  2.33  2.43  2.51  2.10

Table 5: out-of-sample MAPE for forecasts of total Italian retail sales in the period 2015-2017 for different

forecasting methods at different horizons

These results are consistent with the relevant literature, as they show that the forecasts for the highest-level series obtained through hierarchical models are more accurate than the forecasts produced by univariate models.

It is also noteworthy that the forecast error of the hierarchical models structured according to the custom hierarchy is uniformly lower than that of the hierarchies built along the ECR classification. The accuracy gain is more pronounced for Arima models than for ETS models; this observation could be explained by the already lower forecast error of the ETS models. If we assume that the target series have a stochastic component that cannot be accurately modelled, we could also argue that the ETS model is already close to a forecast that accurately captures the deterministic component of total sales, thus limiting the potential gains from the modification of the hierarchy.

7. Conclusion:

In this last section the results of the empirical tests conducted in this work are summarised. Regarding the first hypothesis, about the best univariate time series forecasting model for total retail sales of packaged products at the Italian level, the results obtained are not conclusive. The Neural Network Autoregressive model shows the best performance in the case of total sales at the Italian level, with the Arima model a close second; however, when the forecasting accuracy was also tested on the sales of the 4 macro-areas into which Italy is divided, the former model fared worst, while the ETS model excelled, displaying the lowest forecast error in three macro-areas out of four.

Regarding the second hypothesis, the out-of-sample forecasting simulations built for hierarchical models on Italian retail sales produced results that are easier to assess: the outcome of the first battery of tests on the accuracy of different hierarchical reconciliation methods aligns with the results of Hyndman et al. (2011), showing that the optimal combination hierarchical approach outperforms both the bottom-up and the top-down methods when applied to Italian retail sales.
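The optimal combination approach referred to here reconciles the independent base forecasts by projecting them onto the space of coherent forecasts, which in the OLS case of Hyndman et al. (2011) is ỹ = S(S′S)⁻¹S′ŷ. A minimal sketch on a toy two-node hierarchy with hypothetical numbers:

```python
import numpy as np

# Tiny hierarchy: Total = A + B. The summing matrix S maps the
# bottom-level series to all series in the hierarchy (Total, A, B).
S = np.array([[1.0, 1.0],
              [1.0, 0.0],
              [0.0, 1.0]])

# Hypothetical, incoherent base forecasts for (Total, A, B):
# the independent forecasts do not add up (100 != 60 + 45).
y_hat = np.array([100.0, 60.0, 45.0])

# Optimal combination (OLS) reconciliation:
# y_tilde = S (S'S)^{-1} S' y_hat
beta = np.linalg.solve(S.T @ S, S.T @ y_hat)
y_tilde = S @ beta

# y_tilde is coherent: the reconciled Total equals reconciled A + B
```

The wls and nseries variants discussed earlier replace the identity weighting implicit in this OLS projection with diagonal weight matrices, but the reconciliation mechanics are the same.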

Moreover, it appears that the ETS forecasting model consistently outperforms the Arima models in producing forecasts for the single retail sales series; this is probably due to the strong seasonality and trend in these series, which are better modelled by exponential smoothing techniques.

The simulations conducted to test different weighting procedures also show easily interpretable results: in the out-of-sample forecast simulations conducted on the available data, the nseries weights are preferable if the analyst is interested in accurately modelling the top-level series, while the wls weights are more accurate at the bottom and department levels.

Regarding the third hypothesis, some interesting results have also been produced in section 6, which concludes that hierarchical models perform better than univariate ones when forecasting total Italian retail sales of packaged goods.

Moreover, section 6 provides some evidence pointing to potential reductions in forecast error achievable by changing the hierarchical structure employed from the standard ECR one to a custom hierarchy built through time series clustering. It should be noted, however, that this approach is only viable if one is not interested in the intermediate levels of the hierarchy but only in total sales, that is, in the series at the highest level of aggregation.

Summing up, the main conclusion of this work is that a hierarchical model with independent ETS forecasts, reconciled with the combination method using nseries or wls weights, seems the optimal approach to forecasting Italian retail sales of packaged goods, provided a good informational base is available in terms of time series of sales for single product categories.


Aghabozorgi, S., Seyed Shirkhorshidi, A., & Ying Wah, T. (2015). Time-series clustering - A decade
review. Information Systems, 53, 16–38.

Alon, I., Qi, M., & Sadowski, R. J. (2001). Forecasting aggregate retail sales: a comparison of artificial neural networks and traditional methods. Journal of Retailing and Consumer Services, 8(3), 147–156.

Arbelaitz, O., Gurrutxaga, I., Muguerza, J., Pérez, J. M., & Perona, I. (2013). An extensive
comparative study of cluster validity indices. Pattern Recognition, 46(1), 243–256.

Arlot, S., & Celisse, A. (2009). A survey of cross-validation procedures for model selection, 4, 40–

Athanasopoulos, G., Ahmed, R. A., Hyndman, R. J., Lee, A. J., Feinberg, E. A., Genethliou, D., … Fan, S. (2014). hts: An R Package for Forecasting Hierarchical or Grouped Time Series. Iranian Journal of Electrical and Electronic Engineering, 55(June), 146–166.

Ben Taieb, S., Taylor, J. W., & Hyndman, R. J. (2017). Hierarchical Probabilistic Forecasting of
Electricity Demand with Smart Meter Data, 1–30. Retrieved from http://souhaib-

Bergmeir, C. (2015). Validity of Cross-Validation for Evaluating Time Series Prediction. Elsevier,

Chen, F., & Ou, T. (2011). Constructing a Sales Forecasting Model by Integrating GRA and ELM: A Case Study for Retail Industry. International Journal of Electronic Business …, 9(2), 107–121.

Crone, S. F., Hibon, M., & Nikolopoulos, K. (2011). Advances in forecasting with neural networks?
Empirical evidence from the NN3 competition on time series prediction. International Journal
of Forecasting, 27(3), 635–660.

de Livera, A. M., Hyndman, R. J., & Snyder, R. D. (2011). Forecasting time series with complex
seasonal patterns using exponential smoothing. Journal of the American Statistical
Association, 106(496), 1513–1527.

Fan, S., Hyndman, R., & Hyndman, R. J. (2010). Short-term load forecasting based on a semi-
parametric additive model. Business, (August).

Fisher, W. D. (1979). A Note on Aggregation and Disaggregation. Econometrica, 47(3), 739–746.

Fliedner, G. (1999). An investigation of aggregate variable time series forecast strategies with
specific subaggregate time series statistical correlation. Computers & Operations Research,
26(10-11), 1133-1149.

Giacomini, R., & Granger, C. W. (2004). Aggregation of space-time processes. Journal of

econometrics, 118(1-2), 7-26.

Giorgino, T. (2009). Computing and Visualizing Dynamic Time Warping Alignments in R : The dtw
Package. Journal of Statistical Software, 31(7), 1–24.

Zhang, G., Patuwo, B. E., & Hu, M. Y. (1998). Forecasting with artificial neural networks: The state of the art. International Journal of Forecasting, 14(1), 35–62.

Hendry, D. F., & Hubrich, K. (2011). Combining disaggregate forecasts or combining disaggregate
information to forecast an aggregate. Journal of Business and Economic Statistics, 29(2), 216–

Hyndman, R. J., Koehler, A. B., Snyder, R. D., & Grose, S. (2002). A state space framework for automatic forecasting using exponential smoothing methods. International Journal of Forecasting, 18(3), 439–454.

Hyndman, R. J. (2008). Forecasting with exponential smoothing: the state space approach.

Hyndman, R. J., & Khandakar, Y. (2008). Automatic time series forecasting: the forecast package for R. Journal of Statistical Software, 27(3), 1–22.

Hyndman, R. J., & Lee, A. J. (2014). Fast computation of reconciled forecasts for hierarchical and grouped time series, 97(June), 16–32.

Hyndman, R. J., Ahmed, R. A., Athanasopoulos, G., & Shang, H. L. (2011). Optimal combination forecasts for hierarchical time series. Computational Statistics & Data Analysis, 55(9), 2579–2589.

Kang, Y., Hyndman, R. J., & Smith-miles, K. (2016). Visualising Forecasting Algorithm Performance
using Time Series Instance Spaces Visualising Forecasting Algorithm Performance using Time
Series Instance Spaces, (May).

Kinney, W. R. (1971). Predicting earnings: entity versus subentity data. Journal of Accounting
Research, 127-136.

Kotsialos, A., Papageorgiou, M., & Poulimenos, A. (2005). Long-term sales forecasting using holt-
winters and neural network methods. Journal of Forecasting, 24(5), 353–368.

Kremer, M., Siemsen, E., & Thomas, D. J. (2015). The Sum and Its Parts: Judgmental Hierarchical
Forecasting. Management Science, (December), mnsc.2015.2259.

Marcellino, M., Stock, J. H., & Watson, M. W. (2003). Macroeconomic forecasting in the Euro area:
Country specific versus area-wide information. European Economic Review, 47(1), 1–18.

Bienstock, C. C., Mentzer, J. T., & Bird, M. M. (1997). Measuring physical distribution service
quality. Journal of the Academy of Marketing Science, 25(1), 31.

Wolpert, D. H. (1992). Stacked generalization. Neural Networks, 5(2), 241–259.

Narasimhan, S. L., McLeavey, D. W., & Billington, P. J. (1995). Production planning and inventory
control (pp. 1-165). Englewood Cliffs: Prentice Hall.

Orcutt, G. H., Watts, H. W. & Edwards, J. B. (1968), ‘Data aggregation and information loss’, The
American Economic Review 58(4), 773–787.

Peña, D., Tiao, G. C., & Tsay, R. S. (2001). A Course in Time Series Analysis.

Podobnik, B., & Stanley, H. E. (2007). Detrended Cross-Correlation Analysis: A New Method for
Analyzing Two Non-stationary Time Series, 1–11.

Rodrigues, P. P., Gama, J., & Pedroso, J. P. (2008). Hierarchical clustering of time-series data
streams. IEEE Transactions on Knowledge and Data Engineering, 20(5), 615–627.

Sarda-Espinosa, A. (2017). Comparing Time-Series Clustering Algorithms in R Using the dtwclust Package, 41. Retrieved from https://cran.r-

Smith, K. A., & Gupta, J. N. (2000). Neural networks in business: techniques and applications for
the operations researcher. Computers & Operations Research, 27(11-12), 1023-1044.

Stock, J. H., & Watson, M. W. (2002). Forecasting using principal components from a large number
of predictors. Journal of the American statistical association, 97(460), 1167-1179.

Taieb, S. Ben, Taylor, J. W., & Hyndman, R. J. (2017). Coherent Probabilistic Forecasts for Hierarchical Time Series. Proceedings of the 34th International Conference on Machine Learning, 70(April), 3348–3357.

Wang, Y., & Powell, W. (2017). MOLTE: a Modular Optimal Learning Testing Environment, 27(3).

Warren Liao, T. (2005). Clustering of time series data - A survey. Pattern Recognition, 38(11),

Wickramasuriya, S. L., Athanasopoulos, G., & Hyndman, R. J. (2018). Optimal forecast reconciliation for hierarchical and grouped time series through trace minimization. Journal of the American Statistical Association, (December), 1–45.

Zhang, X. (2009). Retailers' multichannel and price advertising strategies. Marketing Science,
28(6), 1080-1094.

R packages:

R Core Team (2017). R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria.

Microsoft R Open 3.4.3. The enhanced R distribution from Microsoft, using the Intel MKL for parallel mathematical computing. Microsoft Corporation, 2017.

Hyndman RJ (2017). _forecast: Forecasting functions for time series and linear models_. R package version 8.2.

Hyndman and Khandakar (2008). “Automatic time series forecasting: the forecast package for R.”
_Journal of Statistical Software_, *26*(3), pp. 1-22.

Rob Hyndman, Alan Lee and Earo Wang (2017). hts: Hierarchical and Grouped Time Series. R package version 5.1.4.

Alexis Sarda-Espinosa (2017). dtwclust: Time Series Clustering Along with Optimizations for the
Dynamic Time Warping Distance. R package version 5.1.0.

Hadley Wickham, Romain Francois, Lionel Henry and Kirill Müller (2017). dplyr: A Grammar of Data Manipulation. R package version 0.7.4.

Hadley Wickham and Lionel Henry (2017). tidyr: Easily Tidy Data with 'spread()' and 'gather()' Functions. R package version 0.7.2.

Hadley Wickham. ggplot2: Elegant Graphics for Data Analysis. Springer-Verlag New York, 2009.