
Dr Mohd Hilmi Hasan

DATA ANALYTICS
OAU5362/DAM5362

May 2021
OUTCOMES & OUTLINE

OUTCOMES
At the end of this session, you will be able to:
• Explain data preparation and design the related Python script.

OUTLINE
• Overview of data preparation
• Data Understanding
• Data Extraction
• Data Cleaning
• Data Transformation
• Features Settings

OVERVIEW OF DATA PREPARATION

• Data preparation is difficult because it differs from dataset to dataset and is specific to each project, yet it is critical.

• The objectives are to make sure the dataset is accurate, complete, and relevant.

• Practitioners generally agree that:

o Garbage in, garbage out
o 70%-80% of a machine learning project's time is spent on data preparation

• Nevertheless, there are common processes that are implemented across projects.

• The processes are:


o Data quality/understanding
o Data acquiring/extraction
o Data cleaning
o Data transformation
o Features setting for modelling

[Figure: the data preparation cycle – Data Quality/Understanding, Data Acquiring/Extraction, Data Cleaning, Data Transformation, and Features Setting, arranged around Data Preparation]

DATA QUALITY / UNDERSTANDING

• A very important phase in machine learning project development.

• Normally conducted in the first stage of a project.

• Among the activities are:


o Understanding the infra/network/system setup
o Understanding the data and its source
o Determining bad values, non-numeric values, etc.
o Knowing the behavior of the equipment
o Identifying tags, the number of tags, and the related equipment/system
o Reviewing failure reports, FFN
o Checking data uniqueness and completeness

DATA ACQUIRING / EXTRACTION

• Acquire data from the source – engineers, servers, etc.

• E.g. data is acquired every 15 minutes from a server and contains data points of various sensor tags of different equipment.
• The tags and values need to be extracted by equipment.
• Save the extracted data accordingly in respective files (e.g. CSV).

DATA ACQUIRING / EXTRACTION
CASE STUDY
• Assume:
o The acquired data is in ex_acquired.csv – containing data of 20 sensor tags for 3 time frames.
o The 20 tags belong to 2 compressors – Comp1 and Comp2; this mapping is contained in ex_eqlist.csv.

• We are to extract the acquired data and save it into 2 separate files by equipment.

• Firstly, read the datasets and display the information:

import pandas as pd
from pandas import DataFrame

#read the datasets
df_acquired = pd.read_csv('ex_acquired.csv')
df_eqlist = pd.read_csv('ex_eqlist.csv')

#display info
df_acquired.head()
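
• head() displays the first 5 rows by default; df_eqlist can be inspected the same way (df_eqlist.head()).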
• Create 2 empty dataframes – df_comp1 and df_comp2:

df_comp1 = pd.DataFrame(columns=['Time', 'Tag', 'Value'])
df_comp2 = pd.DataFrame(columns=['Time', 'Tag', 'Value'])

• Select the data from ex_acquired.csv and compare it with the equipment data in ex_eqlist.csv. Then display the assigned equipment of each tag:

for i in range(len(df_acquired)):        #loop over the 60 observations
    for j in range(len(df_eqlist)):      #loop over the 20 tags
        if df_acquired.iloc[i,1] == df_eqlist.iloc[j,0]:
            print(df_acquired.iloc[i,1], " : ", df_eqlist.iloc[j,1])

• Copy and insert the data into df_comp1 or df_comp2 according to the equipment (this block goes inside the matching condition above):

if df_eqlist.iloc[j,1] == "Comp1":
    df_comp1 = df_comp1.append({'Time': df_acquired.iloc[i,0], 'Tag': df_acquired.iloc[i,1],
                                'Value': df_acquired.iloc[i,2]}, ignore_index=True)
else:
    df_comp2 = df_comp2.append({'Time': df_acquired.iloc[i,0], 'Tag': df_acquired.iloc[i,1],
                                'Value': df_acquired.iloc[i,2]}, ignore_index=True)

• The whole code:
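
Assembling the snippets above gives the following reconstruction of the full program (note that DataFrame.append was removed in pandas 2.0, where pd.concat is used instead):

import pandas as pd

#read the datasets
df_acquired = pd.read_csv('ex_acquired.csv')
df_eqlist = pd.read_csv('ex_eqlist.csv')

#create the two empty dataframes
df_comp1 = pd.DataFrame(columns=['Time', 'Tag', 'Value'])
df_comp2 = pd.DataFrame(columns=['Time', 'Tag', 'Value'])

#match each acquired observation against the equipment list and copy it
#into the dataframe of its assigned equipment
for i in range(len(df_acquired)):
    for j in range(len(df_eqlist)):
        if df_acquired.iloc[i,1] == df_eqlist.iloc[j,0]:
            row = {'Time': df_acquired.iloc[i,0], 'Tag': df_acquired.iloc[i,1],
                   'Value': df_acquired.iloc[i,2]}
            if df_eqlist.iloc[j,1] == "Comp1":
                df_comp1 = df_comp1.append(row, ignore_index=True)
            else:
                df_comp2 = df_comp2.append(row, ignore_index=True)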

• Save the dataframes to csv files:

df_comp1.to_csv('comp1_data.csv')
df_comp2.to_csv('comp2_data.csv')
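
• Note: by default, to_csv() also writes the dataframe index as an unnamed first column; pass index=False (e.g. df_comp1.to_csv('comp1_data.csv', index=False)) if the index is not needed.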

DATA CLEANING

• Certain tags contain bad values which need to be removed.

• The definition of bad values is based on the project.

• E.g.:
Tag Value       Definition      Value to be assigned
Alarm           Good value      1
Bad             Bad value       Garbage
Calc Failed     Bad value       Garbage
Configure       Bad value       Garbage
Connected       Good value      1
FALSE           Good value      0
FAULT           Good value      0
Good            Good value      1
I/O Timeout     Bad value       Garbage
Intf Shut       Bad value       Garbage
Normal          Good value      1
Not Connect     Bad value       Garbage
Out of Serv     Bad value       Garbage
Pt Created      Bad value       Garbage
Scan Off        Bad value       Garbage
TRUE            Good value      1

NOTE:
• Bad values are removed
• The remaining values in non-numerical form are converted into numeric values, e.g. 1 or 0
[Figure: data cleaning flow – search the data set against the bad value list; if a bad value is found, remove it; then convert the remaining non-numeric values (using the non-numeric values list) to obtain the cleaned data]

DATA CLEANING

CASE STUDY

• We want to remove bad values from ex2_raw.csv. The references for the bad/good values are contained in ex2_badvalref.csv.

• Firstly, retrieve the datasets:


import pandas as pd
from pandas import DataFrame

df_raw = pd.read_csv('ex2_raw.csv')
df_ref = pd.read_csv('ex2_badvalref.csv')

• You may check the dataframes’ contents – use head() or describe() functions.

• Create 2 new dataframes – one to store the cleaned data, another to store the removed data (note: in some cases, the info on the removed bad values is required by domain experts).

df_cleaned = df_raw.copy()   #copy, so that df_raw itself is left untouched
df_logrm = pd.DataFrame(columns=['Time', 'Tag', 'Value'])

• Evaluate the dataframes and do the cleaning:

for i in range(len(df_raw)):
    for j in range(len(df_ref)):
        if (df_raw.iloc[i,2] == df_ref.iloc[j,0]) and (df_ref.iloc[j,2] == 'Garbage'):
            #display the 5 bad values
            print(df_raw.iloc[i,2])
            #remove bad values from the dataframe
            df_cleaned = df_cleaned.drop(i)
            #log the removed bad values
            df_logrm = df_logrm.append({'Time': df_raw.iloc[i,0], 'Tag': df_raw.iloc[i,1],
                                        'Value': df_raw.iloc[i,2]}, ignore_index=True)

• Check the contents of df_cleaned and df_logrm.
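
• As a side note, the same cleaning can be done without explicit loops. A minimal sketch using boolean masks, assuming the column layout above (the value string in column 0 of df_ref, its 'Garbage' assignment in column 2, and the reading in column 2 of df_raw):

bad_vals = df_ref[df_ref.iloc[:,2] == 'Garbage'].iloc[:,0]   #values marked as garbage
mask = df_raw.iloc[:,2].isin(bad_vals)                       #True where a row holds a bad value
df_logrm = df_raw[mask].copy()                               #log of the removed rows
df_cleaned = df_raw[~mask].copy()                            #rows that are kept

This avoids the nested loop and scales far better on large data sets.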


• Now we change the non-numeric values in df_cleaned into their respective numeric values:

for i in range(len(df_ref)):
    for j in range(len(df_cleaned)):
        if df_ref.iloc[i,0] == df_cleaned.iloc[j,2]:
            df_cleaned.iloc[j,2] = df_ref.iloc[i,2]

• Check df_cleaned.
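
• Equivalently, the conversion can be vectorized by building a dictionary from df_ref and using Series.replace – a sketch assuming, as above, that the value strings are in column 0 of df_ref and their assignments in column 2:

mapping = dict(zip(df_ref.iloc[:,0], df_ref.iloc[:,2]))      #e.g. {'Good': 1, 'FALSE': 0, ...}
#bad values were already removed, so only the good values are mapped
df_cleaned.iloc[:,2] = df_cleaned.iloc[:,2].replace(mapping)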

• Save df_cleaned and df_logrm dataframes into csv:

df_cleaned.to_csv('ex2res_cleaned.csv')
df_logrm.to_csv('ex2res_logrm.csv')

DATA TRANSFORMATION

• Purpose: to transform the dataset's dimensions to follow the format required for modeling.

• E.g. cleaned data sets comprise data organized in a 3-columns × n-rows format, where the 3 columns are Tag ID, Time, and Value; that is, each row holds one tag with its time and value. This format requires transformation, as the modeling process expects each Tag ID to be a column, with the samples (rows) listed by Time.

• The data sets therefore need to be transformed: a pivot is applied to the cleaned data so that the Tag ID, originally listed down the rows, becomes the columns.

• The pivot can require a lot of computational power, since the data sets are typically huge in size. In that case the data sets are split into several pieces so that each small part can be computed in reasonable time; finally, the pivoted pieces are combined for modeling.

• Other activities that may be carried out:


o Imputation
o Aggregation/clustering

• E.g. transforming ex3_notrfm.csv into the format required for modeling:

import pandas as pd
from pandas import DataFrame

df_trfm = pd.read_csv('ex3_notrfm.csv')
df2 = df_trfm.pivot(index='Time', columns='Tag', values='Value')
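
• When the data set is too large to pivot in one go (as noted earlier), it can be split by time stamps, pivoted piece by piece, and recombined. A minimal sketch; the window of 1000 time stamps is an arbitrary choice:

times = df_trfm['Time'].unique()
pieces = []
for start in range(0, len(times), 1000):
    #take all rows belonging to this window of time stamps
    subset = df_trfm[df_trfm['Time'].isin(times[start:start+1000])]
    pieces.append(subset.pivot(index='Time', columns='Tag', values='Value'))
df2 = pd.concat(pieces)   #recombine the pivoted pieces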


• Another activity that may be required at this stage is imputation – filling up the missing values.

• After the pivot we may see a lot of cells containing NaN, which indicates missing values.

• Some of the techniques (a short sketch follows this list):


o Mean – replace missing values with the mean value
o Median – replace missing values with the median value
o Interpolation – take the points before and after the missing value, then fill in the values in between
o K-Nearest Neighbors – estimate missing values from the most similar rows

• In some cases, rows with missing values may need to be dropped (removed).
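
A minimal sketch of these options, assuming df2 is the pivoted dataframe from the earlier example and that its values are numeric; n_neighbors=5 is an arbitrary choice:

import pandas as pd
from sklearn.impute import KNNImputer

df_mean = df2.fillna(df2.mean())       #replace each NaN with its column mean
df_median = df2.fillna(df2.median())   #replace each NaN with its column median
df_interp = df2.interpolate()          #linear interpolation between neighbouring points
df_dropped = df2.dropna()              #or drop the rows that contain missing values

imputer = KNNImputer(n_neighbors=5)    #estimate missing values from the 5 most similar rows
df_knn = pd.DataFrame(imputer.fit_transform(df2), index=df2.index, columns=df2.columns)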

QUIZ

• Find the dataset “Assignment_cv19_dataset”.

• Using Python, generate three different types of visualization/plots/charts for the data.

• The visualization/plots/charts may represent any parts of the data or the whole data itself.

• Submission: individual assignment; copy and paste your code and the visualizations/plots/charts into a single PDF document.

• Deadline: Friday, 7/5/2021, 11.55PM (via ULearnX)
