Lianne & Justin @ Just into Data Mar 22, 2020 · 8 min read
In this tutorial, we present a deep learning time series analysis example with Python.
You’ll see:
- How to handle large time series datasets when we have limited computer memory.
- How to fit a Long Short-Term Memory (LSTM) neural network model with TensorFlow Keras.
- And more.
If you want to analyze large time series datasets with machine learning techniques, you’ll
love this guide with practical tips.
https://towardsdatascience.com/3-steps-to-forecast-time-series-lstm-with-tensorflow-keras-ba88c6f05237 1/16
13/10/2021 08:17 3 Steps to Forecast Time Series: LSTM with TensorFlow Keras | Towards Data Science
# import packages
import math

import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers
from tensorflow.keras.utils import Sequence
from datetime import timedelta
from sklearn.preprocessing import MinMaxScaler
from sklearn.metrics import mean_squared_error

import numpy as np
import pandas as pd
import time

import os

# read the dataset into python
df = pd.read_csv('household_power_consumption.txt', delimiter=';')
df.head()
reading_data.py
%%time

# This code is copied from https://towardsdatascience.com/time-series-analysis-visualization-for
# with a few minor changes.
#
df['date_time'] = pd.to_datetime(df['Date'] + ' ' + df['Time'])
df['Global_active_power'] = pd.to_numeric(df['Global_active_power'], errors='coerce')
df = df.dropna(subset=['Global_active_power'])

df['date_time'] = pd.to_datetime(df['date_time'])

df = df.loc[:, ['date_time', 'Global_active_power']]
df.sort_values('date_time', inplace=True, ascending=True)
df = df.reset_index(drop=True)

print('Number of rows and columns after removing missing values:', df.shape)
print('The time series starts from: ', df['date_time'].min())
print('The time series ends on: ', df['date_time'].max())
preprocessing_data.py
df.info()

df.head(10)
time_series_data.py
Next, we split the dataset into training, validation, and test datasets.
df_test holds the data within the last 7 days in the original dataset. df_val has data 14
days before the test dataset. df_train has the rest of the data.
splitting_data_training_validation_test.py
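The split described above can be sketched as follows. This is a minimal illustration, not the article's own listing: the helper name split_by_time, its parameters, and the choice of which boundary is inclusive are all assumptions, and the toy dataframe only mimics the date_time and Global_active_power columns built earlier.

```python
from datetime import timedelta

import pandas as pd


def split_by_time(df, time_col="date_time", test_days=7, val_days=14):
    """Split a dataframe into train / validation / test sets by timestamp.

    Hypothetical helper: the last `test_days` go to the test set, the
    `val_days` before that go to validation, and the rest is training data.
    """
    last_time = df[time_col].max()
    test_start = last_time - timedelta(days=test_days)
    val_start = test_start - timedelta(days=val_days)

    df_test = df[df[time_col] > test_start]
    df_val = df[(df[time_col] > val_start) & (df[time_col] <= test_start)]
    df_train = df[df[time_col] <= val_start]
    return df_train, df_val, df_test


# Toy data: 30 days of one-minute readings with a constant value.
toy = pd.DataFrame({
    "date_time": pd.date_range("2020-01-01", periods=30 * 24 * 60, freq="min"),
    "Global_active_power": 1.0,
})
df_train, df_val, df_test = split_by_time(toy)
print(len(df_train), len(df_val), len(df_test))
```

Splitting by timestamp rather than at random keeps the validation and test periods strictly after the training period, which is what a forecasting setup needs.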
Related article: Time Series Analysis, Visualization & Forecasting with LSTM
This article forecasted the Global_active_power only 1 minute ahead of historical data.
But practically, we want to forecast over a more extended period, which we’ll do in this
article.
The graph below visualizes the problem: using the lagged data (from t-n to t-1) to
predict the target (t+10).
Unit: minutes
It is not efficient to loop through the dataset while training the model. So we want to
transform the dataset with each row representing the historical data and the target.
Unit: minutes
In this way, we only need to train the model using each row of the above matrix.
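To make the row layout concrete, here is a small NumPy sketch of the transformation, assuming a one-dimensional series indexed by minute. The function name make_lagged_rows is hypothetical, and the plain loop is chosen for readability rather than speed; it is not the article's own implementation.

```python
import numpy as np


def make_lagged_rows(series, history_length, step_size, target_step):
    """Turn a 1-D series into (history, target) rows.

    For each time t, the history is series[t - history_length : t] sampled
    every `step_size` points, and the target is series[t + target_step].
    """
    series = np.asarray(series, dtype=float)
    histories, targets = [], []
    for t in range(history_length, len(series) - target_step):
        histories.append(series[t - history_length:t:step_size])
        targets.append(series[t + target_step])
    return np.array(histories), np.array(targets)


# Toy series 0, 1, ..., 19 so each value equals its own time index.
X_toy, y_toy = make_lagged_rows(np.arange(20), history_length=6, step_size=2, target_step=3)
print(X_toy.shape, y_toy.shape)
print(X_toy[0], y_toy[0])
```

Each row of X_toy is one training example and the matching entry of y_toy is its target, so the model never needs to loop over the raw series.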
How do we handle this larger new data structure when our computer memory is
limited?
start_index: the earliest time to be included in all the historical data for forecasting.
In this exercise, we want to include history from the very beginning, so its default
is 0.
end_index: the latest time to be included in all the historical data for forecasting.
In this exercise, we want to include all the history, so its default is
None.
This is necessary to divide the large new dataset into smaller files.
data_folder: the one single folder that will contain all the files.
In the end, just know that this function creates a folder with files.
And each file contains a pandas dataframe that looks like the new dataset in the chart
above.
y, which is the target to predict. This will be the value at t + target_step (t + 10).
At the same time, the function also returns the number of lags (len(col_names)-1) in the
dataframes. This number will be required when defining the shape for TensorFlow
models later.
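The file-writing part of such a function might look like the sketch below. It is only an illustration under stated assumptions: the name save_in_chunks is hypothetical, it takes an already-built lag matrix instead of the raw series, and the simple x_lag1..x_lagN column numbering is a placeholder rather than the article's naming scheme.

```python
import os
import tempfile

import numpy as np
import pandas as pd


def save_in_chunks(X, y, num_rows_per_file, data_folder, filename_format="ts_file{}.pkl"):
    """Write the lag matrix X and targets y to several pickled dataframes.

    Returns the number of lag columns (len(col_names) - 1).
    """
    os.makedirs(data_folder, exist_ok=True)
    num_lags = X.shape[1]
    # Placeholder column names: oldest lag first, target last.
    col_names = ["x_lag{}".format(i) for i in range(num_lags, 0, -1)] + ["y"]

    for file_idx, start in enumerate(range(0, len(X), num_rows_per_file)):
        stop = start + num_rows_per_file
        chunk = pd.DataFrame(np.column_stack([X[start:stop], y[start:stop]]),
                             columns=col_names)
        chunk.to_pickle(os.path.join(data_folder, filename_format.format(file_idx)))

    return len(col_names) - 1


# Toy run: 10 rows with 3 lags each, written 4 rows per file.
demo_folder = tempfile.mkdtemp()
X_demo = np.arange(30, dtype=float).reshape(10, 3)
y_demo = np.arange(10, dtype=float)
num_timesteps_demo = save_in_chunks(X_demo, y_demo, num_rows_per_file=4,
                                    data_folder=demo_folder)
print(sorted(os.listdir(demo_folder)), num_timesteps_demo)
```

Only one chunk is held as a dataframe at a time, which is the point of splitting the output into files.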
Running the code below creates 158 files (each including a pandas dataframe) within the folder ts_data.
%%time

global_active_power = df_train['Global_active_power'].values

# Scaled to work with neural networks.
scaler = MinMaxScaler(feature_range=(0, 1))
global_active_power_scaled = scaler.fit_transform(global_active_power.reshape(-1, 1)).reshape(-1, )

history_length = 7*24*60  # The history length in minutes.
step_size = 10  # The sampling rate of the history. E.g. if step_size = 1, values from every minute are used;
                # if step_size = 10, values every 10 minutes.
target_step = 10  # The time step in the future to predict. E.g. if target_step = 10, predict the value at t + 10.

# create_ts_files returns the number of lags in each row. We need this value below to define the model's input shape.
num_timesteps = create_ts_files(global_active_power_scaled,
                                start_index=0,
                                end_index=None,
                                history_length=history_length,
                                step_size=step_size,
                                target_step=target_step,
                                num_rows_per_file=128*100,
                                data_folder='ts_data')

# I found that the easiest way to do time series with tensorflow is by creating pandas files wit...
# the value to predict y = x{t+n}. We tried doing it using TFRecords, but that API is not very i...
# The resulting file using these parameters is over 17GB. If history_length is increased, or st...
# Hard to fit into laptop memory, so need to use other means to load the data from the hard drive.
transform_data_neural_networks_time_series.py
The folder ts_data is around 16 GB, and we were only using the past 7 days of data to
predict. Now you can see why it’s necessary to divide the dataset into smaller
dataframes!
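A rough back-of-the-envelope check of that size, using the parameters above. The 8 bytes per value (float64) is an assumption, and counting every file as full gives an upper bound, since the last file may hold fewer rows.

```python
history_length = 7 * 24 * 60      # minutes of history per row
step_size = 10                    # one sample every 10 minutes
num_rows_per_file = 128 * 100     # rows written to each file
num_files = 158                   # files reported for ts_data
bytes_per_value = 8               # assumption: values stored as float64

num_lags = history_length // step_size   # lag columns per row
num_columns = num_lags + 1               # plus the target column y
max_rows = num_files * num_rows_per_file # upper bound on the row count
estimated_bytes = max_rows * num_columns * bytes_per_value

print(num_lags, max_rows)
print("{:.1f} GB".format(estimated_bytes / 1e9))
```

The estimate comes out at roughly 16.3 GB, in line with the folder size quoted above.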
Keras and TensorFlow do have built-in tools for this, such as the Keras Sequence class and the
tf.data API. But they are not very efficient for this purpose.
- filename_format, which is the string format of the file names in the ts_folder.
For example, when the files are ts_file0.pkl, ts_file1.pkl, …, ts_file100.pkl, the
format would be ‘ts_file{}.pkl’.
get_chunk: this method takes the dataframe from one of the files and processes it to be
ready for training.
shuffle_chunks: this method shuffles the order of the chunks that are returned in
get_chunk. This is a good practice for modeling.
The definitions might seem a little confusing. But keep reading, you’ll see this object in
action within the next step.
#
# So we can handle loading the data in chunks from the hard drive instead of having to load every...
#
# The reason we want to do this is so we can do custom processing on the data that we are feeding...
# LSTM requires a certain shape and it is tricky to get it right.
#
class TimeSeriesLoader:
    def __init__(self, ts_folder, filename_format):
        self.ts_folder = ts_folder
        # keep the format on the instance so get_chunk does not rely on a global variable.
        self.filename_format = filename_format

        # find the number of files.
        i = 0
        file_found = True
        while file_found:
            filename = self.ts_folder + '/' + self.filename_format.format(i)
            file_found = os.path.exists(filename)
            if file_found:
                i += 1

        self.num_files = i
        self.files_indices = np.arange(self.num_files)
        self.shuffle_chunks()

    def num_chunks(self):
        return self.num_files

    def get_chunk(self, idx):
        assert (idx >= 0) and (idx < self.num_files)

        ind = self.files_indices[idx]
        filename = self.ts_folder + '/' + self.filename_format.format(ind)
        df_ts = pd.read_pickle(filename)
        num_records = len(df_ts.index)

        features = df_ts.drop('y', axis=1).values
        target = df_ts['y'].values

        # reshape for input into LSTM. Batch major format.
        features_batchmajor = np.array(features).reshape(num_records, -1, 1)
        return features_batchmajor, target

    # this shuffles the order the chunks will be outputted from get_chunk.
    def shuffle_chunks(self):
        np.random.shuffle(self.files_indices)
ts_folder = 'ts_data'
filename_format = 'ts_file{}.pkl'
tss = TimeSeriesLoader(ts_folder, filename_format)
time_series_split.py
Now that the object tss points to our dataset, we are finally ready for LSTM!
LSTM networks are well-suited to classifying, processing and making predictions based on
time series data, since there can be lags of unknown duration between important events in a
time series.
As mentioned before, we are going to build an LSTM model based on the TensorFlow
Keras library.
We all know the importance of hyperparameter tuning based on our guide. But in this
article, we are simply demonstrating the model fitting without tuning.
- the number of time series as 1, since we are only using one feature,
global_active_power.
keras_model_lstm.py
Then we also define the optimization function and the loss function. Again, tuning these
hyperparameters to find the best option would be a better practice.
training_configuration.py
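The two steps above, defining the network and configuring its training, can be sketched together as follows. This is a hedged illustration, not the article's exact listing: units=10 is inferred from the 480-parameter count quoted further down, while the single Dense output layer, the Adam optimizer, its learning rate, and the mean squared error loss are assumptions.

```python
import tensorflow as tf
from tensorflow.keras import layers

# 1008 lags per row (history_length / step_size) and one feature per lag.
num_timesteps = 7 * 24 * 60 // 10
num_features = 1

inputs = tf.keras.Input(shape=(num_timesteps, num_features))
x = layers.LSTM(units=10)(inputs)   # assumption: 10 units
outputs = layers.Dense(1)(x)        # assumption: one linear output, the forecast
model = tf.keras.Model(inputs=inputs, outputs=outputs)

# Assumed training configuration: Adam optimizer with mean squared error loss.
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.001), loss="mse")
```

The input shape (num_timesteps, 1) matches the batch-major arrays returned by get_chunk, whose shape is (num_records, num_timesteps, 1).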
To take a look at the model we just defined before running, we can print out the
summary.
model.summary()
model_summary.py
You can see that the output shape looks good, which is n / step_size (7*24*60 / 10 =
1008). The number of parameters that need to be trained in the LSTM layer looks right as
well (4*units*(units+2) = 480).
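The parameter formula can be checked with a few lines of arithmetic. A standard LSTM layer has four gates, each with an input kernel, a recurrent kernel, and a bias, which gives 4*units*(units + input_dim + 1) parameters; with one input feature this reduces to 4*units*(units+2). The helper name lstm_param_count is hypothetical, and units=10 is inferred from the total of 480 rather than stated in the text.

```python
def lstm_param_count(units, input_dim):
    """Trainable parameters in a standard LSTM layer (with bias, no peepholes)."""
    input_kernel = units * input_dim
    recurrent_kernel = units * units
    bias = units
    # The same three blocks are repeated for each of the four gates.
    return 4 * (input_kernel + recurrent_kernel + bias)


print(lstm_param_count(units=10, input_dim=1))
```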
We train each chunk in batches, and only run for one epoch. Ideally, you would train for
multiple epochs for neural networks.
%%time

# train in batch sizes of 128.
BATCH_SIZE = 128
NUM_EPOCHS = 1
NUM_CHUNKS = tss.num_chunks()

for epoch in range(NUM_EPOCHS):
    print('epoch #{}'.format(epoch))
    for i in range(NUM_CHUNKS):
        X, y = tss.get_chunk(i)

        # model.fit does train the model incrementally. ie. Can call multiple times in batches.
        # https://github.com/keras-team/keras/issues/4446
        model.fit(x=X, y=y, batch_size=BATCH_SIZE)

    # shuffle the chunks so they're not in the same order next time around.
    tss.shuffle_chunks()
model_fit.py
After fitting the model, we may also evaluate the model performance using the
validation dataset.
As with the training dataset, we also create a folder of validation data, which
prepares the validation dataset for the model.
evaluate_model_validation.py
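One detail worth illustrating when preparing the validation data is the scaling step. The sketch below assumes the validation values are transformed with the scaler already fitted on the training data, rather than refitting it; that choice and the toy arrays are assumptions, not taken from the article's listing.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Toy stand-ins for the training and validation values.
train_values = np.array([1.0, 2.0, 3.0, 5.0])
val_values = np.array([2.0, 6.0])

# Fit the scaler on the training data only.
scaler_demo = MinMaxScaler(feature_range=(0, 1))
train_scaled = scaler_demo.fit_transform(train_values.reshape(-1, 1)).reshape(-1)

# Reuse the fitted scaler for validation: transform, not fit_transform.
val_scaled = scaler_demo.transform(val_values.reshape(-1, 1)).reshape(-1)

# inverse_transform maps scaled values back to the original units.
val_restored = scaler_demo.inverse_transform(val_scaled.reshape(-1, 1)).reshape(-1)
print(val_scaled, val_restored)
```

Validation values outside the training range land outside [0, 1], which is expected; refitting the scaler on validation data would put the two datasets on different scales.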
Besides testing using the validation dataset, we also test against a baseline model using
only the most recent history point (t + 10 - 11, i.e. t - 1).
# If we assume that the validation dataset can fit into memory we can do this.
df_val_ts = pd.read_pickle('ts_val_data/ts_file0.pkl')


features = df_val_ts.drop('y', axis=1).values
features_arr = np.array(features)

# reshape for input into LSTM. Batch major format.
num_records = len(df_val_ts.index)
features_batchmajor = features_arr.reshape(num_records, -1, 1)


y_pred = model.predict(features_batchmajor).reshape(-1, )
y_pred = scaler.inverse_transform(y_pred.reshape(-1, 1)).reshape(-1 ,)

y_act = df_val_ts['y'].values
y_act = scaler.inverse_transform(y_act.reshape(-1, 1)).reshape(-1 ,)

print('validation mean squared error: {}'.format(mean_squared_error(y_act, y_pred)))

#baseline
y_pred_baseline = df_val_ts['x_lag11'].values
y_pred_baseline = scaler.inverse_transform(y_pred_baseline.reshape(-1, 1)).reshape(-1 ,)
print('validation baseline mean squared error: {}'.format(mean_squared_error(y_act, y_pred_baseline)))
validation.py
On the validation dataset, the LSTM gives a Mean Squared Error (MSE) of 0.418, while the
baseline model has an MSE of 0.428. The LSTM does slightly better than the baseline.
We could do better with hyperparameter tuning and more epochs. Accounting for other
essential time series characteristics such as seasonality would help too.
Hope you found something useful in this guide. Leave a comment if you have any
questions.
Before you leave, don’t forget to sign up for the Just into Data newsletter! Or connect with
us on Twitter, Facebook.
So you won’t miss any new data science articles from us!