
2014 IEEE International Conference on Big Data

On the Impact of Socio-economic Factors on Power Load Forecasting

Yufei Han Xiaolan Sha Etta Grover-Silva Pietro Michiardi


GridPocket Eurecom GridPocket Eurecom

978-1-4799-5666-1/14/$31.00 ©2014 IEEE

Abstract

In this paper, we analyze a public dataset of electricity consumption collected from over 3,800 households over one and a half years. We show that some socio-economic factors are critical indicators for forecasting households' daily peak (and total) load. Using a random forests model, we show that the daily load can be predicted accurately at a fine temporal granularity. Unlike many state-of-the-art techniques based on support vector machines, our model allows deriving a set of heuristic rules that are highly interpretable and easy to fuse with human experts' domain knowledge. Lastly, we quantify the importance of each socio-economic feature in the prediction task.

1. Introduction

Electricity load forecasting is one of the main challenges in smart meter technology. Accurate load forecasting at various temporal granularities can help utilities design flexible energy supply strategies (e.g., by applying load switch schemes [1]) to meet the supply-demand balance, and directly benefit individual households by offering a transparent view of energy consumption, thus enabling learning mechanisms to compare and eventually reduce energy waste [2], [3], [4], [5], [6].

Energy load forecasting requires understanding individual behavioral patterns in order to build models that accurately predict future energy consumption. Many studies have used time series modeling techniques [2], [5], [7], [8], [9], [10] to achieve this goal, while little attention has been devoted to understanding the impact of socio-economic factors on the forecasting task. In this paper, we address this challenge, and explicitly introduce socio-economic factors into the forecasting task to overcome the limitations of existing approaches. Indeed, time series modeling techniques often require a large number of data points to make statistical inferences, which hinders the forecasting task due to the lack of historical information for households that are not equipped with smart metering solutions. We note that energy consumption is a direct product of (collective) human behavior, which in turn depends on several characteristics including demographic information [11], [12], [13], and socio-economic factors.

In this work, we focus on two load forecasting tasks: 1) daily peak load and 2) daily total electricity consumption. We first propose a variety of predictors ranging from meteorological to socio-economic features in Section 3. We integrate such features in a novel approach to the forecasting task, which is based on random forests. Then, in Section 4, we compare our approach to a state-of-the-art modeling technique based on support vector machines (SVM). Our results show improved prediction accuracy, and illustrate that random forest models allow deriving rule sets that can be easily interpreted and fused with human domain knowledge.

2. Background and Related Work

Most research work on load forecasting models can be categorized into three groups: time-series based models, end-user models and econometric models. Time-series models are often designed to identify temporal causality of the energy demand between past and future [4], [5], [6], [7], [10], [14], [15], [16]. With such models, the task of forecasting is accomplished by following the learned functional mapping between past and future energy usage patterns. Popular machine learning algorithms are applied in these models, such as the autoregressive integrated moving average model (ARIMA) [16], support vector machines (SVM) [4], [15] and neural networks [10], [14]. These algorithms serve as black-box tools to capture the temporal dynamics in energy consumption with high accuracy. However, they often offer no insights on the
underlying phenomenon behind the time evolution of energy demand, and are difficult to generalize.

To overcome the lack of interpretability, an alternative is to build end-user models [1], [2], [3], [5], [9], [17]. The principle of such models is to disaggregate daily energy consumption into elementary components (including heating/cooling, water usage, cooking and other behaviors), which are used to interpret the temporal variation of the energy demand from each household. End-user models can easily explain the relation between predictions and household information. However, their performance highly depends on the quality of available information, which makes them sensitive to noise.

Another group of popular models to forecast energy load are econometric models [2], [3], [5], [12], [13], [17], which combine the two techniques described above. These models identify the main factors that influence the consumption behavior of households, and learn the mapping between key factors and the energy consumption profiles. Though these models are appealing, they require human intervention to tune the input factor set.

To the best of our knowledge, the integration and ranking of socio-economic features for load forecasting through automatic learning from energy usage data has remained elusive in the literature. In this paper, we design a methodology to automatically select socio-economic factors to build an energy forecast model. In doing so, we unveil which features are the most relevant to accurate predictions, and which can be safely dismissed as they contribute little to the predictive power of our model.

Our forecasting model is based on random forests, which we briefly overview here. The central idea of random forests is to construct a set of independent decision trees by bootstrapping training data. The output of a random forest is obtained, e.g., through majority voting or with a simple average of the output of each individual tree model. Although bootstrap sampling introduces random bias into each decision tree as a byproduct, the voting scheme reduces the variance of the tree output, which compensates for the random bias and thus improves the model fitting accuracy. Each decision tree in a random forest is a tree-like rule chain based on input variables for classification or regression. Each rule is a branch-split operation comparing one input variable with a predefined threshold. The hierarchical split-branch operations form a coarse-to-fine white-box model, which can explain explicitly how input factors are combined to achieve a complete decision making procedure.

3. Load Forecasting using Socio-economic Factors

We now present the public dataset used in this work and describe the three types of features that are used in our forecasting models: temporal, environmental and socio-economic factors. As a general remark, we note that the volume of the available training samples in our dataset offers a large coverage of energy and socio-economic factors, which improves the generalizability of our model.

The dataset. The CER ISSDA dataset is a publicly available energy consumption trace [18], containing electricity consumption data of 4,225 private households and 485 small/medium enterprises; the trace covers 1.5 years (from July 2009 to December 2010). For each customer, the load curve is sampled every 30 minutes, and the trace is accompanied by a series of surveys describing the customer's socio-economic factors. 41 survey questions belonging to six categories were carefully selected as a complete list of features for our statistical model. They cover household profiles, income level of the chief owner, residential behavior related features, housing conditions, properties of the main electrical appliances (heating and cooking devices) and information about other auxiliary electrical appliances, which are commonly recognized as influencing factors in energy consumption behavior. After data curation, our dataset covers 3,822 households.

Predictors used in the forecasting task. In addition to the socio-economic factors outlined above, we include environmental features, which are significant for energy consumption profiling [14]. The first two are heating degree days and cooling degree days, which quantify the need to run heating or air conditioning appliances to keep an adequate environmental temperature. We also consider the daily average humidity, which represents the average air humidity level during a one-day interval. Finally, we include month of year, day of week and holiday index: the first feature represents the seasonal weather change, directly and strongly affecting the energy usage profiles of all involved customers; the other features differentiate occupancy patterns of residential customers.

Statistical profile of the load forecast task. We now build the statistical profile of our dataset, using well-known feature ranking statistics to provide formal
grounding for the key idea of this work, i.e., that socio-economic factors are important predictors for a forecasting model. Thus, we use the mutual information criterion [19] (see footnote 1) to compute the correlation between each input factor and the forecasting target: the larger the mutual information, the higher the correlation between the two random variables. In practice, we use the Kullback-Leibler divergence of the product of marginal distributions of two random variables x and y from their joint distribution:

$$ I(X; Y) = \sum_{y \in Y} \sum_{x \in X} p(x, y) \log \frac{p(x, y)}{p(x)\, p(y)} \qquad (1) $$

Footnote 1. Note that the mutual information is not used to rank predictors for the purpose of feature selection.

Tab. 1 reports the top-10 factors (sorted in descending order of mutual information) that are most related to daily peak loads and daily total energy usage, respectively. Occupancy attributes, the number of bedrooms, the age of the clients, the employment status of the clients, and the usage of tumble dryers and dishwashers are the common key elements influencing both peak loads and the total amount of energy consumption; they are related to the clients' consuming capacities, the potential electricity needs to keep proper temperatures, and daily housework.

Table 1: Top 10 factors in prediction tasks using mutual information criterion

Rank | Prediction of daily peak loads      | Prediction of daily total energy usage
1    | Total number of occupants           | Total number of occupants
2    | Number of occupants > 15 years old  | Number of occupants > 15 years old
3    | Whether there is a dishwasher       | Whether there is a dishwasher
4    | Whether there is a tumble dryer     | Number of bedrooms
5    | Number of occupants < 15 years old  | Whether there is a tumble dryer
6    | Whether there is an electric cooker | Number of occupants < 15 years old
7    | Number of bedrooms                  | Whether there is a stand-by freezer
8    | Type of the cooking appliances      | Average age of occupants
9    | Average age of occupants            | Type of the house
10   | Employment status                   | Employment status

For the purpose of the forecasting task, our analysis indicates that cooking appliances, because they consume a lot of energy, largely contribute to the magnitude of the daily peak energy consumption. Instead, such appliances only marginally affect the daily total energy consumption, as their usage is restricted to a small interval of time. In contrast, our analysis indicates that house category and usage of stand-alone freezer devices are key factors for the daily total energy consumption, but are weakly related to the daily peak load. Such results follow common sense: the type of the house is related to the thermal performance and inertia of the house, which crucially affects the total amount of energy required to maintain a desirable temperature in the house. Next, we show that a data-driven model based on socio-economic factors can indeed find associations between user profiles and their consumption behavior.

4. Experimental Evaluation

We now report our experimental results, where we evaluate the performance of our random forests model on the load forecasting task, and compare it to a state-of-the-art technique based on SVM. In addition, we present an analysis of the importance ranking of socio-economic and environmental factors.

4.1. Comparison of Forecasting Models

For each of the 3,822 users and for the 1.5 year duration of our dataset, we extract the daily peak load measurement (in kilowatts, kW) and the daily total electricity consumption (in kilowatt-hours, kWh) to build the target features of our model. Input predictors are the 41 features described in Sec. 3. Therefore, we obtain 1,615,541 training samples, containing the input-output pairs for each of the forecasting tasks.

Metrics. The forecasting accuracy is measured using the coefficient of determination R^2 [19], defined as one minus the ratio between the residual sum of squares of the regression and the total sum of squares of the forecasting target. Large values of R^2 indicate agreement between the model and the underlying real output, which translates into higher accuracy. Note that for heavily biased predictions, the coefficient can be zero or negative, while it is upper bounded by 1.

Forecasting models. The main parameter of random forests is the number of trees constructed in the ensemble model to perform a final vote. We vary this parameter in the range {50, 100, 200, 300, 400} to study the stability of the forecasting performance. Once the random forest model is constructed, the out-of-bag error [19] of the random forest is used directly as the estimate of the model generalization error. Note that the split-branch operation in the random
forest constructs a piecewise linear regression model, which can handle non-linearly distributed data without introducing additional algorithmic components.

Since the training set consists of more than 1.8 million records, a standard approach such as SVM cannot afford the construction of a huge kernel matrix in memory to enforce a non-linear regression: hence, we use a linear kernel SVM [20]. The training configuration of the SVM is carefully selected through 5-fold cross-validation. After fixing the configuration parameter, the SVM training ends by providing a linear regressor; 5-fold cross-validation is employed again to estimate its generalization error.

Overall, the computational complexity of building the random forest model is O(mn log(n)), where m is the number of trees and n is the number of bootstrapped training samples used to construct each tree in the model. Considering that m is usually much smaller than n, the complexity of random forest is comparable to or smaller than the quadratic programming cost of SVM, which is between O(n^2) and O(n^3).

Results. Tab. 2 summarizes the generalization performance, measured by R^2, of our method and SVM. In general, our modeling approach outperforms SVM: the highest accuracy is achieved with an ensemble of 100 trees, for both forecasting tasks. Indeed, although more trees can reduce variance, they might increase the model bias.

Table 2: Generalization performance (R^2) of peak load and total electricity consumption prediction with the random forest model (RF) and SVM. For RF, the number of trees used is given in parentheses.

Model   | Peak Load: R^2 | Total Load: R^2
RF(50)  | 0.5214         | 0.7151
RF(100) | 0.5240         | 0.7261
RF(200) | 0.5127         | 0.7236
RF(300) | 0.5208         | 0.7100
RF(400) | 0.5157         | 0.7070
SVM     | 0.4816         | 0.6572

There are two main reasons for the superiority of our approach. First, SVM treats categorical inputs as numerical variables, which loses the categorical information. Furthermore, due to the computational complexity and memory constraints, the selected SVM is linear, thus it cannot capture the non-linear relation between socio-economic features and energy consumption profiles. In contrast, the random forest method does not suffer from such limitations.

For illustrative purposes, Fig. 1a and 1b show two instances of the forecasting task for daily peak load and daily total consumption respectively: the solid line corresponds to the ground truth of daily peak load or daily total consumption measurements, while the dashed line represents the estimated values using the random forest model. These figures reveal that our method achieves higher predictive power for the total daily consumption than for the daily peak load, which confirms our findings in Tab. 2. Indeed, peak loads are easily affected by householders' occupancy duration, appliances' working status, and in general, time-of-day related behavior. Such effects inject randomness into daily peak load patterns, which are hardly reflected comprehensively in the questionnaire. In contrast, the daily total consumption, as a cumulative sum of energy consumption measurements, is less prone to random fluctuations due to time-of-day behavior. As a consequence, the model built with our approach captures a strong association with the static socio-economic features available in the dataset.

[Figure 1: An example of energy consumption indicator forecasting. (a) Prediction of daily peak load of one user during one month (x-axis: Day; y-axis: Peak load level in KiloWatts). (b) Prediction of daily total consumption usage of one user during one month (x-axis: Day; y-axis: Daily total consumption in KWH).]

4.2. Importance Ranking of the Socio-economic and Environmental Factors

Our method generates a quantitative score for each input factor to evaluate the importance of its association with daily energy usage profiles. All input factors are ranked by importance score in descending order. The mean and the standard deviation of each feature importance score are estimated. For all top-10 factors, the standard deviations are at least three times smaller than the mean values. This indicates that the ranking result is statistically stable.

Our experimental results are summarized in Tab. 3, which reports the top-10 input factors for predicting
peak loads and total energy consumption. The ranking obtained is consistent with that obtained using the mutual information criterion in Sec. 3. The number of occupants is ranked as the most important feature, affecting both the daily peak load level and the total amount of energy consumption, which follows the intuition of a direct relation between household size and energy consumption. Main appliances, such as cooking and heating devices, are common features for the two forecasting tasks. Indeed, they contribute to a large extent to instantaneous peaks in energy consumption, and are generally responsible for aggregate energy consumption as well. Appliances that have a daily cycle (e.g., freezers) and average daily heating degree days are especially useful for estimating daily consumption, rather than being representative of peak demand.

Table 3: Top 10 factors in prediction tasks.

Rank | Prediction of daily peak loads                        | Prediction of daily total energy usage
1    | Total number of occupants                             | Total number of occupants
2    | Number of occupants > 15 years old                    | Number of occupants > 15 years old
3    | Type of the cooking appliances                        | Average daily heating degree days
4    | Whether there is an electric cooker                   | Number of bedrooms
5    | Average daily heating degree days                     | Whether there is a dishwasher
6    | Whether there is a tumble dryer                       | Month of year
7    | Whether there is a dishwasher                         | Whether there is a tumble dryer
8    | Type of the appliances to heat water in the house     | Whether there is an electric heater
9    | Whether there is an instant electric shower facility  | Whether there is a stand-by freezer
10   | Whether there is an electric heater                   | Type of the appliances to heat water in the house

We remark that, for peak load forecasting, the random forest method boosts the importance of the water heating appliances, while it completely neglects indirect factors, such as the number of bedrooms, the social class of the chief income earner and employment status. For total daily energy consumption, the random forests method assigns higher importance scores to factors related to heating usage (including the environmental heating degree index, month of year, the year when the house was built, and the number of bedrooms), while it decreases the importance of social class, age and employment status of the owner. Interestingly, environmental temperature does not seem to play an important role in either forecasting task, which is counterintuitive. A possible reason for this result is related to high heating loads at all times due to Irish weather conditions. Furthermore, such results can also be explained by the lack of accurate temperature data that affects our study, since (due to privacy reasons) it was not possible to join our temperature dataset to specific geographical coordinates corresponding to the households in our study. Overall, the ranking produced by the random forest method is in line with that obtained through a manual inspection of the data by domain experts.

5. Conclusion

In this paper, the issue of forecasting daily electricity usage patterns and evaluating the influence of clients' socio-economic factors on energy consumption has been addressed.

A forecasting method based on random forests that seamlessly incorporates both clients' socio-economic factors and environmental factors has been proposed, resulting in an ensemble of split-branch decision rule chains, whereby a voting mechanism is used to achieve a stable prediction. The rule-chain based structure enables the explanation of how each input feature contributes to energy consumption forecasting, thus unveiling the underlying physical association between predictors and the daily energy usage pattern. These learned associations can either be used to discover unknown factors of energy consumption behaviors, or they can be applied as a complementary decision support to human experts.

Experimental results based on a large-scale energy consumption dataset showed that our methodology outperforms a state-of-the-art SVM-based approach, both in terms of prediction accuracy and in terms of scalability. Additionally, our work evaluated the importance of socio-economic factors in the forecasting tasks. Our results can guide the construction of socio-economic surveys without requiring human intervention, leading to a decreased intrusiveness of measurement campaigns.

As shown in the experiments conducted in this work, daily peak load forecasting is a difficult task compared to forecasting aggregate energy consumption. As such, an extension of this work will build detailed usage patterns of large power appliances that play a key role in determining peak loads and will integrate historical peak load information to extend the input set of the forecasting model.
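
To make the forecasting targets and the accuracy metric of Section 4.1 concrete, the following minimal Python sketch shows one way they could be computed. It is an illustration under stated assumptions, not necessarily the implementation used in the experiments: the function names are hypothetical, and each of the 48 half-hourly readings of a day is assumed to be an energy value in kWh, so that the average power over a slot is the reading divided by 0.5 h.

```python
from typing import Sequence, Tuple

SAMPLES_PER_DAY = 48   # one reading every 30 minutes
SLOT_HOURS = 0.5       # duration of one sampling slot, in hours


def daily_targets(readings_kwh: Sequence[float]) -> Tuple[float, float]:
    """Return (peak load in kW, total consumption in kWh) for one day.

    Assumption: each reading is the energy (kWh) consumed during one
    30-minute slot, so average power in that slot is energy / 0.5 h.
    """
    if len(readings_kwh) != SAMPLES_PER_DAY:
        raise ValueError("expected 48 half-hourly readings per day")
    peak_kw = max(readings_kwh) / SLOT_HOURS
    total_kwh = sum(readings_kwh)
    return peak_kw, total_kwh


def r_squared(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    if ss_tot == 0:
        raise ValueError("R^2 is undefined for a constant target")
    return 1.0 - ss_res / ss_tot
```

With this definition, a perfect prediction yields 1, predicting the mean of the target yields 0, and predictions worse than the mean yield negative values, which matches the bounds stated for the metric.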

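
The mutual information criterion of Eq. (1) in Section 3 can likewise be sketched for discrete-valued factors. This is a hedged illustration, not the procedure used in the experiments: the function name is hypothetical, probabilities are replaced by empirical frequencies (a plug-in estimate), the natural logarithm is used, and a continuous target such as the daily peak load is assumed to have been discretized into bins beforehand.

```python
import math
from collections import Counter
from typing import Hashable, Sequence


def mutual_information(xs: Sequence[Hashable], ys: Sequence[Hashable]) -> float:
    """Plug-in estimate of I(X; Y) from paired discrete samples (in nats)."""
    if len(xs) != len(ys) or len(xs) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    n = len(xs)
    joint = Counter(zip(xs, ys))   # counts of each (x, y) pair
    count_x = Counter(xs)          # marginal counts of x
    count_y = Counter(ys)          # marginal counts of y
    mi = 0.0
    for (x, y), c in joint.items():
        p_xy = c / n
        p_x = count_x[x] / n
        p_y = count_y[y] / n
        # Only pairs with p(x, y) > 0 appear in `joint`, so the log is defined.
        mi += p_xy * math.log(p_xy / (p_x * p_y))
    return mi
```

For two identical binary factors with balanced values the estimate equals log 2, and for two independent balanced binary factors it equals 0, which is the behavior the ranking in Tab. 1 relies on.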
References

[1] Aman, S., Simmhan, Y., and Prasanna, V.K., "Improving Energy Use Forecast for Campus Micro-grids using Indirect Indicators", in Proceedings of the International Workshop on Domain Driven Data Mining (DDDM), pp.1-9, 2011.

[2] Tennessee Valley Authority, "Energy Vision 2020: Integrated Resource Plan/Environmental Impact Statement", Chapter 6, December, 1995.

[3] Kolter, Z. and Ferreira, J., "A large-scale study on predicting and contextualizing building energy usage", in Proceedings of the Twenty-Fifth AAAI Conference on Artificial Intelligence, pp.330-338, 2011.

[4] Centra, M., "Hourly Electricity Load Forecasting: An Empirical Application to the Italian Railways", World Academy of Science, Engineering and Technology, vol.5, pp.888-895, 2011.

[5] Hesham K.A. and Nazeeruddin, M., "Electric load forecasting: literature survey and classification of methods", International Journal of Systems Science, vol.33, pp.23-34, 2002.

[6] Humeau, S.F.R.J., Wijaya, T. K., Vasirani, M. and Aberer, K., "Electricity Load Forecasting for Residential Customers: Exploiting Aggregation and Correlation between households", Sustainable Internet and ICT for Sustainability (SustainIT), pp.1-6, 2013.

[7] Aung, Z., Touky, M., Williams, J.R., and Herrero, S., "Towards Accurate Electricity Load Forecasting in Smart Grids", in Proceedings of The Fourth International Conference on Advances in Databases, Knowledge, and Data Applications, pp.52-57, 2012.

[8] Miranda, V., and Monteiro, C., "Fuzzy Inference Applied to Spatial Load Forecasting", in Proceedings of International Conference on Electric Power Engineering, pp.35-40, 1999.

[9] Fu, C.W., and Nguyen, T.T, "Models for long-term energy forecasting", in Proceedings of IEEE Power Engineering Society General Meeting, pp.235-239, 2003.

[10] Kermanshahi B. S., and Iwamiya H., "Up to year 2020 load forecasting using neural nets", Electric Power System Research, Elsevier, vol.24, pp.787-797, 2002.

[11] McLoughlin, F., Duffy, A., and Conlon, M., "Characterising domestic electricity consumption patterns by dwelling and occupant socio-economic variables: An Irish case study", Energy and Buildings, vol.48, pp.240-248, 2012.

[12] Beckel, C., Sadamori, L., and Santini, S., "Towards automatic classification of private households using electricity consumption data", in Proceedings of the Fourth ACM Workshop on Embedded Sensing Systems for Energy-Efficiency in Buildings, pp.169-176, 2012.

[13] Beckel, C., Sadamori, L., and Santini, S., "Automatic socio-economic classification of households using electricity consumption data", in Proceedings of the fourth international conference on Future energy systems, pp.75-86, 2013.

[14] Taylor, J.W., and Buizza, R., "Neural Network Load Forecasting With Weather Ensemble Prediction", IEEE Transactions on Power Systems, vol.17, pp.626-632, 2002.

[15] Mohandes, M., "Support vector machines for short-term electrical load forecasting", International Journal of Energy Research, vol.26, pp.335-345, 2002.

[16] El Desouky, A., and Elkateb, M., "Hybrid adaptive techniques for electric-load forecast using ANN and ARIMA", in Proceedings of Generation, Transmission and Distribution, vol.147, pp.213-217, 2000.

[17] Abreu, J., and Pereira, F., "Household Electricity Consumption Routines and Tailored Feedback", in Proceedings of ACEEE Summer Study on Energy Efficiency, pp.193-206, 2012.

[18] "Electricity Customer Behaviour Trial", Commission for Energy Regulation, Ireland, 2011.

[19] Bishop, C.M., "Pattern Recognition and Machine Learning", Springer Verlag, 2006.

[20] Fan, R.E., Chang, K.W., Hsieh, C.J., Wang, X.R., and Lin, C.J., "LIBLINEAR: A library for large linear classification", Journal of Machine Learning Research, vol.9, pp.1871-1874, 2008.
