
Dr Mohd Hilmi Hasan

DATA ANALYTICS
OAU5362/DAM5362

May 2021
OUTCOMES & OUTLINE

OUTCOMES
At the end of this session, you will be able to:
• Explain data preparation and design the related Python script.

OUTLINE
• Overview of data preparation
• Data Understanding
• Data Extraction
• Data Cleaning
• Data Transformation
• Features Settings

OVERVIEW OF DATA PREPARATION

• Data preparation is difficult because it differs from dataset to dataset and is specific to each project, yet it is critical.

• The objectives are to make sure the dataset is accurate, complete, and relevant.

• Practitioners generally agree that:

o Garbage in, garbage out
o 70%-80% of a machine learning project's time is spent on data preparation

• Nevertheless, there are common processes that are implemented across projects.

• The processes are:


o Data quality/understanding
o Data acquiring/extraction
o Data cleaning
o Data transformation
o Features setting for modelling

[Figure: the data preparation cycle – Data Quality/Understanding, Data Acquiring/Extraction, Data Cleaning, Data Transformation, and Features Setting, arranged around Data Preparation]

DATA QUALITY / UNDERSTANDING

• A very important phase in machine learning project development.

• Normally conducted in the first stage of a project.

• Among the activities are:


o Understanding the infra/network/system setup
o Understanding the data and its source
o Determining bad values, non-numeric values, etc.
o Knowing the behavior of the equipment
o Identifying tags, the number of tags, and the related equipment/system
o Reviewing failure reports, FFN
o Checking data uniqueness and completeness

DATA ACQUIRING / EXTRACTION

• Acquire data from the source – engineers, servers, etc.

• E.g. data is acquired every 15 minutes from a server and contains data points of various sensor tags of different equipment.
• The tags and values need to be extracted by equipment.
• Save the extracted data accordingly in respective files (e.g. CSV).

DATA ACQUIRING / EXTRACTION
CASE STUDY
• Assume:
o The acquired data is in ex_acquired.csv – containing data of 20 sensor tags for 3 time frames.
o The 20 tags belong to 2 compressors – Comp1 and Comp2; this mapping is contained in ex_eqlist.csv.

• We are to extract the acquired data and save it into 2 separate files by equipment.

• Firstly, read the datasets and display the information:

import pandas as pd
from pandas import DataFrame

#read the datasets
df_acquired = pd.read_csv('ex_acquired.csv')
df_eqlist = pd.read_csv('ex_eqlist.csv')

#display info
df_acquired.head()
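
• head() displays the first 5 rows by default; df_eqlist can be inspected the same way (df_eqlist.head()).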
• Create 2 empty dataframes – df_comp1 and df_comp2:

df_comp1 = pd.DataFrame(columns=['Time', 'Tag', 'Value'])
df_comp2 = pd.DataFrame(columns=['Time', 'Tag', 'Value'])

• Select the data from ex_acquired.csv and compare it with the equipment data in ex_eqlist.csv. Then display the assigned equipment of each tag:

for i in range(len(df_acquired)):        #loop over the 60 observations
    for j in range(len(df_eqlist)):      #loop over the 20 tags
        if df_acquired.iloc[i,1] == df_eqlist.iloc[j,0]:
            print(df_acquired.iloc[i,1], " : ", df_eqlist.iloc[j,1])

• Copy and insert the data into df_comp1 or df_comp2 according to the equipment (this block goes inside the matching condition above):

if df_eqlist.iloc[j,1] == "Comp1":
    df_comp1 = df_comp1.append({'Time': df_acquired.iloc[i,0], 'Tag': df_acquired.iloc[i,1],
                                'Value': df_acquired.iloc[i,2]}, ignore_index=True)
else:
    df_comp2 = df_comp2.append({'Time': df_acquired.iloc[i,0], 'Tag': df_acquired.iloc[i,1],
                                'Value': df_acquired.iloc[i,2]}, ignore_index=True)

• The whole code:
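
Assembling the snippets above gives the following reconstruction of the full program (note that DataFrame.append was removed in pandas 2.0, where pd.concat is used instead):

import pandas as pd

#read the datasets
df_acquired = pd.read_csv('ex_acquired.csv')
df_eqlist = pd.read_csv('ex_eqlist.csv')

#create the two empty dataframes
df_comp1 = pd.DataFrame(columns=['Time', 'Tag', 'Value'])
df_comp2 = pd.DataFrame(columns=['Time', 'Tag', 'Value'])

#match each acquired observation against the equipment list and copy it
#into the dataframe of its assigned equipment
for i in range(len(df_acquired)):
    for j in range(len(df_eqlist)):
        if df_acquired.iloc[i,1] == df_eqlist.iloc[j,0]:
            row = {'Time': df_acquired.iloc[i,0], 'Tag': df_acquired.iloc[i,1],
                   'Value': df_acquired.iloc[i,2]}
            if df_eqlist.iloc[j,1] == "Comp1":
                df_comp1 = df_comp1.append(row, ignore_index=True)
            else:
                df_comp2 = df_comp2.append(row, ignore_index=True)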

• Save the dataframes to csv files:

df_comp1.to_csv('comp1_data.csv')
df_comp2.to_csv('comp2_data.csv')
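
• Note: by default, to_csv() also writes the dataframe index as an unnamed first column; pass index=False (e.g. df_comp1.to_csv('comp1_data.csv', index=False)) if the index is not needed.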

DATA CLEANING

• Certain tags contain bad values which need to be removed.

• The definition of bad values is based on the project.

• E.g.:
Tag Value       Definition      Value to be assigned
Alarm           Good value      1
Bad             Bad value       Garbage
Calc Failed     Bad value       Garbage
Configure       Bad value       Garbage
Connected       Good value      1
FALSE           Good value      0
FAULT           Good value      0
Good            Good value      1
I/O Timeout     Bad value       Garbage
Intf Shut       Bad value       Garbage
Normal          Good value      1
Not Connect     Bad value       Garbage
Out of Serv     Bad value       Garbage
Pt Created      Bad value       Garbage
Scan Off        Bad value       Garbage
TRUE            Good value      1

NOTE:
• Bad values are removed
• The remaining values in non-numerical form are converted into numeric values, e.g. 1 or 0
[Figure: data cleaning flow – search the data set against the bad value list; if a bad value is found, remove it; then convert the remaining non-numeric values (using the non-numeric values list) to obtain the cleaned data]

DATA CLEANING

CASE STUDY

• We want to remove bad values from ex2_raw.csv. The references for the bad/good values are contained in ex2_badvalref.csv.

• Firstly, retrieve the datasets:


import pandas as pd
from pandas import DataFrame

df_raw = pd.read_csv('ex2_raw.csv')
df_ref = pd.read_csv('ex2_badvalref.csv')

• You may check the dataframes’ contents – use head() or describe() functions.

• Create 2 new dataframes – one to store the cleaned data, another to store the removed data (note: in some cases, the info on the removed bad values is required by domain experts).

df_cleaned = df_raw.copy()   #copy, so that df_raw itself is left untouched
df_logrm = pd.DataFrame(columns=['Time', 'Tag', 'Value'])

• Evaluate the dataframes and do the cleaning:

for i in range(len(df_raw)):
    for j in range(len(df_ref)):
        if (df_raw.iloc[i,2] == df_ref.iloc[j,0]) and (df_ref.iloc[j,2] == 'Garbage'):
            #display the 5 bad values
            print(df_raw.iloc[i,2])
            #remove bad values from the dataframe
            df_cleaned = df_cleaned.drop(i)
            #log the removed bad values
            df_logrm = df_logrm.append({'Time': df_raw.iloc[i,0], 'Tag': df_raw.iloc[i,1],
                                        'Value': df_raw.iloc[i,2]}, ignore_index=True)

• Check the contents of df_cleaned and df_logrm.
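
• As a side note, the same cleaning can be done without explicit loops. A minimal sketch using boolean masks, assuming the column layout above (the value string in column 0 of df_ref, its 'Garbage' assignment in column 2, and the reading in column 2 of df_raw):

bad_vals = df_ref[df_ref.iloc[:,2] == 'Garbage'].iloc[:,0]   #values marked as garbage
mask = df_raw.iloc[:,2].isin(bad_vals)                       #True where a row holds a bad value
df_logrm = df_raw[mask].copy()                               #log of the removed rows
df_cleaned = df_raw[~mask].copy()                            #rows that are kept

This avoids the nested loop and scales far better on large data sets.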


• Now we change the non-numeric values in df_cleaned into their respective numeric values:

for i in range(len(df_ref)):
    for j in range(len(df_cleaned)):
        if df_ref.iloc[i,0] == df_cleaned.iloc[j,2]:
            df_cleaned.iloc[j,2] = df_ref.iloc[i,2]

• Check df_cleaned.
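
• Equivalently, the conversion can be vectorized by building a dictionary from df_ref and using Series.replace – a sketch assuming, as above, that the value strings are in column 0 of df_ref and their assignments in column 2:

mapping = dict(zip(df_ref.iloc[:,0], df_ref.iloc[:,2]))      #e.g. {'Good': 1, 'FALSE': 0, ...}
#bad values were already removed, so only the good values are mapped
df_cleaned.iloc[:,2] = df_cleaned.iloc[:,2].replace(mapping)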

• Save df_cleaned and df_logrm dataframes into csv:

df_cleaned.to_csv('ex2res_cleaned.csv')
df_logrm.to_csv('ex2res_logrm.csv')

DATA TRANSFORMATION

• Purpose: to transform the dataset's dimensions to follow the format required for modeling.

• E.g. cleaned data sets comprise data organized in a 3-columns × n-rows format, where the 3 columns are Tag ID, Time, and Value; that is, each row holds one tag with its time and value. This format requires transformation, as the modeling process expects each Tag ID to be a column, with the samples (rows) listed by Time.

• The data sets therefore need to be transformed: a pivot is applied to the cleaned data so that the Tag ID, originally listed down the rows, becomes the columns.

• The pivot can require a lot of computational power, since the data sets are typically huge in size. In that case the data sets are split into several pieces so that each small part can be computed in reasonable time; finally, the pivoted pieces are combined for modeling.

• Other activities that may be carried out:


o Imputation
o Aggregation/clustering

• E.g. transforming ex3_notrfm.csv into the format required for modeling:

import pandas as pd
from pandas import DataFrame

df_trfm = pd.read_csv('ex3_notrfm.csv')
df2 = df_trfm.pivot(index='Time', columns='Tag', values='Value')
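
• When the data set is too large to pivot in one go (as noted earlier), it can be split by time stamps, pivoted piece by piece, and recombined. A minimal sketch; the window of 1000 time stamps is an arbitrary choice:

times = df_trfm['Time'].unique()
pieces = []
for start in range(0, len(times), 1000):
    #take all rows belonging to this window of time stamps
    subset = df_trfm[df_trfm['Time'].isin(times[start:start+1000])]
    pieces.append(subset.pivot(index='Time', columns='Tag', values='Value'))
df2 = pd.concat(pieces)   #recombine the pivoted pieces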


• Another activity that may be required at this stage is imputation – filling up the missing values.

• After the pivot we may see a lot of cells containing NaN, which indicates missing values.

• Some of the techniques (a short sketch follows this list):


o Mean – replace missing values with the mean value
o Median – replace missing values with the median value
o Interpolation – take the points before and after the missing value, then fill in the values in between
o K-Nearest Neighbors – estimate missing values from the most similar rows

• In some cases, rows with missing values may need to be dropped (removed).
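
A minimal sketch of these options, assuming df2 is the pivoted dataframe from the earlier example and that its values are numeric; n_neighbors=5 is an arbitrary choice:

import pandas as pd
from sklearn.impute import KNNImputer

df_mean = df2.fillna(df2.mean())       #replace each NaN with its column mean
df_median = df2.fillna(df2.median())   #replace each NaN with its column median
df_interp = df2.interpolate()          #linear interpolation between neighbouring points
df_dropped = df2.dropna()              #or drop the rows that contain missing values

imputer = KNNImputer(n_neighbors=5)    #estimate missing values from the 5 most similar rows
df_knn = pd.DataFrame(imputer.fit_transform(df2), index=df2.index, columns=df2.columns)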

QUIZ

• Find the dataset “Assignment_cv19_dataset”.

• Using Python, generate three different types of visualization/plots/charts for the data.

• The visualization/plots/charts may represent any parts of the data or the whole data itself.

• Submission: individual assignment; copy and paste your code and the visualizations/plots/charts into a single PDF document.

• Deadline: Friday, 7/5/2021, 11.55PM (via ULearnX)
