3 Steps To Forecast Time Series - LSTM With TensorFlow Keras - Towards Data Science

13/10/2021 08:17 3 Steps to Forecast Time Series: LSTM with TensorFlow Keras | Towards Data Science
3 Steps to Forecast Time Series: LSTM with TensorFlow Keras
A Practical Example in Python with useful Tips
Lianne & Justin @ Just into Data Mar 22, 2020 · 8 min read
Source: Adobe Stock
In this tutorial, we present a deep learning time series analysis example with Python.
You’ll see:
How to preprocess/transform the dataset for time series forecasting.
How to handle large time series datasets when we have limited computer memory.
How to fit a Long Short-Term Memory (LSTM) neural network model with TensorFlow Keras.
And more.
If you want to analyze large time series datasets with machine learning techniques, you’ll
love this guide with practical tips.
Let’s begin now!
https://towardsdatascience.com/3-steps-to-forecast-time-series-lstm-with-tensorflow-keras-ba88c6f05237 1/16
The dataset we are using is the Household Electric Power Consumption dataset from
Kaggle. It provides measurements of electric power consumption in one
household with a one-minute sampling rate.
There are 2,075,259 measurements gathered within 4 years. Different electrical
quantities and some sub-metering values are available. But we’ll only focus on three
features:
Date: date in format dd/mm/yyyy
Time: time in format hh:mm:ss
Global_active_power: household global minute-averaged active power (in kilowatt)
In this project, we will predict the amount of Global_active_power 10 minutes ahead.
# import packages
import math

import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers
from tensorflow.keras.utils import Sequence
from datetime import timedelta
from sklearn.preprocessing import MinMaxScaler
from sklearn.metrics import mean_squared_error

import numpy as np
import pandas as pd
import time

import os

# read the dataset into python
df = pd.read_csv('household_power_consumption.txt', delimiter=';')
df.head()
Step #1: Preprocessing the Dataset for Time Series Analysis

To begin, let’s process the dataset to get ready for time series analysis.
We transform the dataset df by:
creating feature date_time in DateTime format by combining Date and Time.
converting Global_active_power to numeric and removing missing values (1.25%).
ordering the features by time in the new dataset.
%%time

# This code is copied from https://towardsdatascience.com/time-series-analysis-visualization-for
# with a few minor changes.
#
# Date is in dd/mm/yyyy format, so parse day-first to avoid swapping day and month.
df['date_time'] = pd.to_datetime(df['Date'] + ' ' + df['Time'], dayfirst=True)
df['Global_active_power'] = pd.to_numeric(df['Global_active_power'], errors='coerce')
df = df.dropna(subset=['Global_active_power'])

df['date_time'] = pd.to_datetime(df['date_time'])

df = df.loc[:, ['date_time', 'Global_active_power']]
df.sort_values('date_time', inplace=True, ascending=True)
df = df.reset_index(drop=True)

print('Number of rows and columns after removing missing values:', df.shape)
print('The time series starts from: ', df['date_time'].min())
print('The time series ends on: ', df['date_time'].max())
Now we have a dataset df as below.
df.info()

df.head(10)
Next, we split the dataset into training, validation, and test datasets.
df_test holds the last 7 days of data in the original dataset. df_val holds the 14
days before the test dataset. df_train has the rest of the data.
# Split into training, validation and test datasets.
# Since it's timeseries we should do it by date.
test_cutoff_date = df['date_time'].max() - timedelta(days=7)
val_cutoff_date = test_cutoff_date - timedelta(days=14)

df_test = df[df['date_time'] > test_cutoff_date]
df_val = df[(df['date_time'] > val_cutoff_date) & (df['date_time'] <= test_cutoff_date)]
df_train = df[df['date_time'] <= val_cutoff_date]

# check out the datasets
print('Test dates: {} to {}'.format(df_test['date_time'].min(), df_test['date_time'].max()))
print('Validation dates: {} to {}'.format(df_val['date_time'].min(), df_val['date_time'].max()))
print('Train dates: {} to {}'.format(df_train['date_time'].min(), df_train['date_time'].max()))
Related article: Time Series Analysis, Visualization & Forecasting with LSTM
That article forecast the Global_active_power only 1 minute ahead of the historical data.
In practice, we want to forecast over a more extended period, which we’ll do in this
article.
Step #2: Transforming the Dataset for TensorFlow Keras

Before we can fit the TensorFlow Keras LSTM, there are still other processes that need to
be done.
Let’s deal with them little by little!
Dividing the Dataset into Smaller Dataframes

As mentioned earlier, we want to forecast the Global_active_power that’s 10 minutes in
the future.
The graph below visualizes the problem: using the lagged data (from t-n to t-1) to
predict the target (t+10).
[Figure: timeline showing the lagged data (t-n to t-1) and the target (t+10). Unit: minutes]
It is not efficient to loop through the dataset while training the model. So we want to
transform the dataset with each row representing the historical data and the target.
[Figure: matrix in which each row holds the historical data and the target. Unit: minutes]
In this way, we only need to train the model using each row of the above matrix.
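To make the row layout concrete, here is a minimal sketch on a toy series, assuming NumPy is available. The helper name make_rows and the toy values are ours for illustration and are not part of the original code. Each row holds the n lagged values followed by the target.

```python
import numpy as np

def make_rows(series, history_length, target_step):
    """Hypothetical helper: one row per time t, holding the values at
    t-history_length .. t-1 followed by the target at t+target_step."""
    rows = []
    # t must leave room for a full history before it and the target after it.
    for t in range(history_length, len(series) - target_step):
        history = series[t - history_length:t]
        target = series[t + target_step]
        rows.append(np.append(history, target))
    return np.array(rows)

toy = np.arange(10.0)  # stand-in for the real series: 0.0, 1.0, ..., 9.0
rows = make_rows(toy, history_length=3, target_step=2)

print(rows.shape)        # (5, 4)
print(rows[0].tolist())  # [0.0, 1.0, 2.0, 5.0]
print(rows[-1].tolist()) # [4.0, 5.0, 6.0, 9.0]
```

With the real parameters, each row would be far wider, which is why the memory question below matters.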
Now here come the challenges:
How do we convert the dataset to the new structure?
How do we handle this larger new data structure when our computer memory is
limited?
To address both, we define the function create_ts_files:
to convert the original dataset to the new dataset above.
at the same time, to divide the new dataset into smaller files, which are easier to
process.
Within this function, we define the following parameters:
start_index: the earliest time to be included in all the historical data for forecasting.
In this practice, we want to include history from the very beginning, so we set it to 0.
end_index: the latest time to be included in all the historical data for forecasting.
In this practice, we want to include all the history, so we set it to None.
history_length: this is n mentioned earlier, which is the number of timesteps to look
back for each forecast.
step_size: the stride of the history window.
Global_active_power doesn’t change fast over time. So to be more efficient,
we can let step_size = 10. In this way, we downsample and use one value per 10 minutes
of past data to predict the future amount. We only look at t-1, t-11, t-21,
and so on back to t-n to predict t+10.
target_step: the number of periods in the future to predict.
As mentioned earlier, we are trying to predict the global_active_power 10 minutes
ahead. So this parameter = 10.
num_rows_per_file: the number of records to put in each file.
This is necessary to divide the large new dataset into smaller files.
data_folder: the one single folder that will contain all the files.
That’s a lot of complicated parameters!
In the end, just know that this function creates a folder with files.
And each file contains a pandas dataframe that looks like the new dataset in the chart
above.
Each of these dataframes has columns:
y, which is the target to predict. This will be the value at t + target_step (t + 10).
x_lag{i}, the value at time t + target_step - i (t + 10 - 11, t + 10 - 21, and so on),
i.e., the value lagged by i minutes compared to y.
At the same time, the function also returns the number of lags (len(col_names)-1) in the
dataframes. This number will be required when defining the shape for TensorFlow
models later.
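As a quick check of the column naming, the sketch below lists which positions of the series feed one row and what each column would be called. The helper lag_layout and the small parameter values are hypothetical and chosen only to keep the output short.

```python
def lag_layout(j, history_length, step_size, target_step):
    """Hypothetical helper: column names and source indices for the row built at timestep j."""
    # History indices j-1, j-1-step_size, ... within the last history_length steps,
    # sorted so the oldest value comes first.
    history = sorted(range(j - 1, j - history_length - 1, -step_size))
    target = j + target_step
    # Each lag is measured from the target, so the newest history point
    # has lag target_step + 1.
    names = [f'x_lag{target - idx}' for idx in history] + ['y']
    return names, history + [target]

names, indices = lag_layout(j=100, history_length=30, step_size=10, target_step=10)
print(names)    # ['x_lag31', 'x_lag21', 'x_lag11', 'y']
print(indices)  # [79, 89, 99, 110]
```

The newest input column is x_lag11 because the most recent history point (t-1) sits 11 minutes before the target (t+10).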
# Goal of the model:
#  Predict Global_active_power at a specified time in the future.
#   Eg. We want to predict how much Global_active_power will be ten minutes from now.
#       We can use all the values from t-1, t-2, t-3, .... t-history_length to predict t+10


def create_ts_files(dataset,
                    start_index,
                    end_index,
                    history_length,
                    step_size,
                    target_step,
                    num_rows_per_file,
                    data_folder):
    assert step_size > 0
    assert start_index >= 0

    if not os.path.exists(data_folder):
        os.makedirs(data_folder)

    time_lags = sorted(range(target_step+1, target_step+history_length+1, step_size), reverse=True)
    col_names = [f'x_lag{i}' for i in time_lags] + ['y']
    start_index = start_index + history_length
    if end_index is None:
        end_index = len(dataset) - target_step

    rng = range(start_index, end_index)
    num_rows = len(rng)
    num_files = math.ceil(num_rows/num_rows_per_file)

    # for each file.
    print(f'Creating {num_files} files.')
    for i in range(num_files):
        filename = f'{data_folder}/ts_file{i}.pkl'

        if i % 10 == 0:
            print(f'{filename}')

        # get the start and end indices.
        # Offset by start_index so every row has a full history (no negative indices).
        ind0 = i*num_rows_per_file + start_index
        ind1 = min(ind0 + num_rows_per_file, end_index)
        data_list = []

        # j is the current timestep. Will need j-n to j-1 for the history. And j + target_step for the target.
        for j in range(ind0, ind1):
            indices = range(j-1, j-history_length-1, -step_size)
            data = dataset[sorted(indices) + [j+target_step]]

            # append data to the list.
            data_list.append(data)

        df_ts = pd.DataFrame(data=data_list, columns=col_names)
        df_ts.to_pickle(filename)

    return len(col_names)-1
Before applying the function create_ts_files, we also need to:
scale the global_active_power to work with neural networks.
define n, the history_length, as 7 days (7*24*60 minutes).
define step_size within historical data to be 10 minutes.
set the target_step to be 10, so that we are forecasting the global_active_power 10
minutes after the historical data.
After these, we apply create_ts_files to:
create 158 files (each including a pandas dataframe) within the folder ts_data.
return num_timesteps as the number of lags.
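For intuition about the scaling step: with feature_range=(0, 1), min-max scaling maps each value x to (x - min) / (max - min), using the minimum and maximum of the data it was fitted on. The sketch below reproduces that arithmetic by hand with NumPy rather than calling scikit-learn. The values and the helper names scale and unscale are made up for illustration.

```python
import numpy as np

train = np.array([0.5, 1.0, 2.5, 4.5])  # made-up training values (kilowatts)
val = np.array([1.5, 5.0])              # made-up validation values

# "Fit": remember the training minimum and maximum only.
lo, hi = train.min(), train.max()

def scale(x):
    return (x - lo) / (hi - lo)

def unscale(x_scaled):
    return x_scaled * (hi - lo) + lo

train_scaled = scale(train)
val_scaled = scale(val)  # reuses the training min/max

print(train_scaled.tolist())         # [0.0, 0.125, 0.5, 1.0]
print(val_scaled.tolist())           # [0.25, 1.125]
print(unscale(val_scaled).tolist())  # [1.5, 5.0]
```

Note that a validation value above the training maximum scales to more than 1. That is expected when the scaler is fitted on the training data only.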
%%time

global_active_power = df_train['Global_active_power'].values

# Scaled to work with Neural networks.
scaler = MinMaxScaler(feature_range=(0, 1))
global_active_power_scaled = scaler.fit_transform(global_active_power.reshape(-1, 1)).reshape(-1, )

history_length = 7*24*60  # The history length in minutes.
step_size = 10  # The sampling rate of the history. Eg. If step_size = 10 then values every 10 minutes are used.
target_step = 10  # The time step in the future to predict. Eg. target_step = 10 predicts 10 minutes ahead.

# create_ts_files returns the number of lags (timesteps) in each row. We need this value below.
num_timesteps = create_ts_files(global_active_power_scaled,
                                start_index=0,
                                end_index=None,
                                history_length=history_length,
                                step_size=step_size,
                                target_step=target_step,
                                num_rows_per_file=128*100,
                                data_folder='ts_data')

# I found that the easiest way to do time series with tensorflow is by creating pandas files wit...
# the value to predict y = x{t+n}. We tried doing it using TFRecords, but that API is not very i...
# The resulting file using these parameters is over 17GB. If history_length is increased, or st...
# Hard to fit into laptop memory, so need to use other means to load the data from the hard drive.
As the function runs, it prints the name of every tenth file.
The folder ts_data is around 16 GB, and we were only using the past 7 days of data to
predict. Now you can see why it’s necessary to divide the dataset into smaller
dataframes!
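A rough back-of-the-envelope check of that size, assuming each value is stored as an 8-byte float64 and ignoring any pickle overhead (both are our assumptions):

```python
import math

history_length = 7 * 24 * 60   # minutes of history per row
step_size = 10
num_files = 158                # number of files reported above
num_rows_per_file = 128 * 100
bytes_per_value = 8            # assumption: float64

num_lags = math.ceil(history_length / step_size)
num_columns = num_lags + 1                # lags plus the target y
max_rows = num_files * num_rows_per_file  # upper bound: the last file may be smaller
max_bytes = max_rows * num_columns * bytes_per_value

print(num_columns)                # 1009
print(round(max_bytes / 1e9, 1))  # 16.3
```

The estimate lands close to the folder size we observed, and it shows that the width of each row (about a thousand lags) drives the total.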
Defining the Time Series Object Class

In this procedure, we create a class TimeSeriesLoader to transform and feed the
dataframes into the model.
Keras offers built-in options such as Keras Sequence and the tf.data API. But we found
them not very efficient for this purpose.
Within this class, we define:
__init__: the initial settings of the object, including:
- ts_folder, which will be ts_data that we just created.
- filename_format, which is the string format of the file names in the ts_folder.
For example, when the files are ts_file0.pkl, ts_file1.pkl, …, ts_file100.pkl, the
format would be ‘ts_file{}.pkl’.
num_chunks: the total number of files (chunks).
get_chunk: this method takes the dataframe from one of the files and processes it to be
ready for training.
shuffle_chunks: this method shuffles the order of the chunks that are returned in
get_chunk. This is a good practice for modeling.
The definitions might seem a little confusing. But keep reading, you’ll see this object in
action within the next step.
#
# So we can handle loading the data in chunks from the hard drive instead of having to load every...
#
# The reason we want to do this is so we can do custom processing on the data that we are feeding...
# LSTM requires a certain shape and it is tricky to get it right.
#
class TimeSeriesLoader:
    def __init__(self, ts_folder, filename_format):
        self.ts_folder = ts_folder
        # keep the format on the instance so get_chunk does not depend on a global variable.
        self.filename_format = filename_format

        # find the number of files.
        i = 0
        file_found = True
        while file_found:
            filename = self.ts_folder + '/' + self.filename_format.format(i)
            file_found = os.path.exists(filename)
            if file_found:
                i += 1

        self.num_files = i
        self.files_indices = np.arange(self.num_files)
        self.shuffle_chunks()

    def num_chunks(self):
        return self.num_files

    def get_chunk(self, idx):
        assert (idx >= 0) and (idx < self.num_files)

        ind = self.files_indices[idx]
        filename = self.ts_folder + '/' + self.filename_format.format(ind)
        df_ts = pd.read_pickle(filename)
        num_records = len(df_ts.index)

        features = df_ts.drop('y', axis=1).values
        target = df_ts['y'].values

        # reshape for input into LSTM. Batch major format.
        features_batchmajor = np.array(features).reshape(num_records, -1, 1)
        return features_batchmajor, target

    # this shuffles the order the chunks will be outputted from get_chunk.
    def shuffle_chunks(self):
        np.random.shuffle(self.files_indices)
After defining, we apply this TimeSeriesLoader to the ts_data folder.
ts_folder = 'ts_data'
filename_format = 'ts_file{}.pkl'
tss = TimeSeriesLoader(ts_folder, filename_format)
Now that the object tss points to our dataset, we are finally ready for LSTM!
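The reshape inside get_chunk is the part that is easy to get wrong, so here is a minimal sketch of it on a tiny stand-in array, assuming NumPy. The sizes are made up; in our data each record has 1008 lags.

```python
import numpy as np

num_records, num_lags = 4, 3
# Stand-in for df_ts.drop('y', axis=1).values: one row per record, one column per lag.
features = np.arange(num_records * num_lags, dtype=float).reshape(num_records, num_lags)

# Batch-major LSTM input: (records, timesteps, features per timestep).
features_batchmajor = features.reshape(num_records, -1, 1)

print(features.shape)                         # (4, 3)
print(features_batchmajor.shape)              # (4, 3, 1)
print(features_batchmajor[0, :, 0].tolist())  # [0.0, 1.0, 2.0]
```

The last axis has length 1 because we feed a single time series (one value per timestep).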
Step #3: Creating the LSTM Model
Long short-term memory (LSTM) is an artificial recurrent neural network (RNN)
architecture used in the field of deep learning.
LSTM networks are well-suited to classifying, processing and making predictions based on
time series data, since there can be lags of unknown duration between important events in a
time series.
As mentioned before, we are going to build an LSTM model based on the TensorFlow
Keras library.
As our guide explains, hyperparameter tuning is important. But in this
article, we are simply demonstrating the model fitting without tuning.
The procedures are below:
define the shape of the input dataset:
- num_timesteps, the number of lags in the dataframes we set in Step #2.
- the number of time series as 1, since we are only using one feature,
global_active_power.
define the number of units. With one input feature, 4*units*(units+2) is the number of
parameters of the LSTM.
The higher the number of units, the more parameters in the model.
define the dropout rate, which is used to prevent overfitting.
specify the output layer to have a linear activation function.
define the model.
# Create the Keras model.
# Use hyperparameter optimization if you have the time.

ts_inputs = tf.keras.Input(shape=(num_timesteps, 1))

# units=10 -> The cell and hidden states will be of dimension 10.
#             The number of parameters that need to be trained = 4*units*(units+2)
x = layers.LSTM(units=10)(ts_inputs)
x = layers.Dropout(0.2)(x)
outputs = layers.Dense(1, activation='linear')(x)
model = tf.keras.Model(inputs=ts_inputs, outputs=outputs)
Then we also define the optimization function and the loss function. Again, tuning these
hyperparameters to find the best option would be a better practice.
# Specify the training configuration.
model.compile(optimizer=tf.keras.optimizers.SGD(learning_rate=0.01),
              loss=tf.keras.losses.MeanSquaredError(),
              metrics=['mse'])
To take a look at the model we just defined before running, we can print out the
summary.
model.summary()
You can see that the output shape of the input layer looks good: the number of timesteps
is n / step_size (7*24*60 / 10 = 1008). The number of LSTM parameters that need to be
trained looks right as well (4*units*(units+2) = 480).
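Where does 4*units*(units+2) come from? An LSTM layer has four gates, and each gate has an input kernel, a recurrent kernel and a bias. The sketch below counts them for one input feature and also adds the output layer. The function names are ours, for illustration only.

```python
def lstm_param_count(units, input_dim):
    # Four gates, each with an input kernel (input_dim x units),
    # a recurrent kernel (units x units) and a bias (units).
    return 4 * (units * input_dim + units * units + units)

def dense_param_count(inputs, outputs):
    return inputs * outputs + outputs  # weights plus biases

units, input_dim = 10, 1
lstm_params = lstm_param_count(units, input_dim)
dense_params = dense_param_count(units, 1)

print(lstm_params)                 # 480
print(4 * units * (units + 2))     # 480
print(lstm_params + dense_params)  # 491
```

The shortcut formula only holds when input_dim is 1. Dropout adds no trainable parameters, so the whole model has 491.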
Let’s start modeling!
We train each chunk in batches, and only run for one epoch. Ideally, you would train for
multiple epochs for neural networks.
%%time

# train in batch sizes of 128.
BATCH_SIZE = 128
NUM_EPOCHS = 1
NUM_CHUNKS = tss.num_chunks()

for epoch in range(NUM_EPOCHS):
    print('epoch #{}'.format(epoch))
    for i in range(NUM_CHUNKS):
        X, y = tss.get_chunk(i)

        # model.fit does train the model incrementally. ie. Can call multiple times in batches.
        # https://github.com/keras-team/keras/issues/4446
        model.fit(x=X, y=y, batch_size=BATCH_SIZE)

    # shuffle the chunks so they're not in the same order next time around.
    tss.shuffle_chunks()
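To get a feel for how much work one epoch is, we can count the gradient updates. This is simple arithmetic based on the batch size, the chunk size and the 158 files from Step #2, and it gives an upper bound because the last chunk may be smaller.

```python
import math

BATCH_SIZE = 128
num_rows_per_file = 128 * 100  # rows in a full chunk
num_chunks = 158               # files reported in Step #2

steps_per_chunk = math.ceil(num_rows_per_file / BATCH_SIZE)
max_steps_per_epoch = steps_per_chunk * num_chunks  # upper bound

print(steps_per_chunk)      # 100
print(max_steps_per_epoch)  # 15800
```

Each call to model.fit on a full chunk therefore performs 100 updates before the next chunk is loaded from disk.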
After fitting the model, we can also evaluate its performance using the
validation dataset.
As with the training dataset, we create a folder of validation files, which
prepares the validation dataset for evaluation.
# evaluate the model on the validation set.
#
# Create the validation files like we did before with the training.
global_active_power_val = df_val['Global_active_power'].values
global_active_power_val_scaled = scaler.transform(global_active_power_val.reshape(-1, 1)).reshape(-1, )

history_length = 7*24*60  # The history length in minutes.
step_size = 10  # The sampling rate of the history. Eg. If step_size = 10 then values every 10 minutes are used.
target_step = 10  # The time step in the future to predict. Eg. target_step = 10 predicts 10 minutes ahead.

# create_ts_files returns the number of lags (timesteps) in each row. We need this value below.
num_timesteps = create_ts_files(global_active_power_val_scaled,
                                start_index=0,
                                end_index=None,
                                history_length=history_length,
                                step_size=step_size,
                                target_step=target_step,
                                num_rows_per_file=128*100,
                                data_folder='ts_val_data')
Besides testing using the validation dataset, we also test against a baseline model using
only the most recent history point (t + 10 - 11, i.e. t - 1).
The detailed Python code is below.
# If we assume that the validation dataset can fit into memory we can do this.
df_val_ts = pd.read_pickle('ts_val_data/ts_file0.pkl')


features = df_val_ts.drop('y', axis=1).values
features_arr = np.array(features)

# reshape for input into LSTM. Batch major format.
num_records = len(df_val_ts.index)
features_batchmajor = features_arr.reshape(num_records, -1, 1)


y_pred = model.predict(features_batchmajor).reshape(-1, )
y_pred = scaler.inverse_transform(y_pred.reshape(-1, 1)).reshape(-1, )

y_act = df_val_ts['y'].values
y_act = scaler.inverse_transform(y_act.reshape(-1, 1)).reshape(-1, )

print('validation mean squared error: {}'.format(mean_squared_error(y_act, y_pred)))

# baseline
y_pred_baseline = df_val_ts['x_lag11'].values
y_pred_baseline = scaler.inverse_transform(y_pred_baseline.reshape(-1, 1)).reshape(-1, )
print('validation baseline mean squared error: {}'.format(mean_squared_error(y_act, y_pred_baseline)))
On the validation dataset, the LSTM gives a Mean Squared Error (MSE) of 0.418, while the
baseline model has an MSE of 0.428. The LSTM does slightly better than the baseline.
We could do better with hyperparameter tuning and more epochs. Plus, some other
essential time series analysis techniques, such as accounting for seasonality, would help too.
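The baseline above is a persistence forecast: it predicts that the target equals the last known value. Here is a minimal sketch of that idea on made-up readings, assuming NumPy. The series, the gap of 3 steps and the mse helper are hypothetical; in our data the gap is 11 minutes (x_lag11). The last line puts the two reported MSE values into perspective.

```python
import numpy as np

def mse(y_true, y_pred):
    return float(np.mean((y_true - y_pred) ** 2))

series = np.array([1.0, 1.2, 1.1, 1.5, 1.4, 1.3, 1.8, 1.7])  # made-up readings
gap = 3  # hypothetical distance between the last known value and the target

y_act = series[gap:]             # values to predict
y_pred_baseline = series[:-gap]  # persistence: repeat the last known value

baseline_mse = mse(y_act, y_pred_baseline)
print(baseline_mse)

# Relative improvement of the LSTM over the baseline, from the MSE values reported above.
improvement_pct = (0.428 - 0.418) / 0.428 * 100
print(round(improvement_pct, 1))  # 2.3
```

An improvement of about 2.3% is small, which is why a persistence baseline is a useful sanity check before investing in tuning.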
Related article: Hyperparameter Tuning with Python: Complete Step-by-Step Guide
Thank you for reading!
Hope you found something useful in this guide. Leave a comment if you have any
questions.
Before you leave, don’t forget to sign up for the Just into Data newsletter! Or connect with
us on Twitter or Facebook, so you won’t miss any new data science articles from us!
Originally published at https://www.justintodata.com on March 22, 2020.