Intro To Py and ML - Part 3
DATA ANALYTICS
OAU5362/DAM5362
May 2021
OUTCOMES & OUTLINE

At the end of this session, you will be able to:
• Explain data preparation and design the related Python script.

Outline:
• Overview of data preparation
• Data Understanding
• Data Extraction
• Data Cleaning
• Data Transformation
• Features Settings
OVERVIEW OF DATA PREPARATION
• Data preparation is difficult because it differs by dataset and is specific to each project, yet it is critical.
• The objectives are to ensure the dataset is accurate, complete, and relevant.
• Nevertheless, there are common processes that are implemented across projects.
[Diagram: Data Preparation cycle — Data Understanding, Data Quality, Data Cleaning, Features Setting]
DATA QUALITY / UNDERSTANDING
DATA ACQUIRING / EXTRACTION
• E.g. data is acquired every 15 minutes from a server and contains data points of various sensor tags of different equipment.
• We need to extract the tags and values by equipment.
• Save the extracted data accordingly in a respective file (e.g. CSV).
CASE STUDY
• Assume:
o The acquired data is in ex_acquired.csv, containing data for 20 sensor tags over 3 time frames.
o The 20 tags belong to 2 compressor equipment units, Comp1 and Comp2; this information is contained in ex_eqlist.csv.
• We are to extract the acquired data and save it into 2 separate files by equipment.
import pandas as pd

# Load the acquired data and the equipment list
df_acquired = pd.read_csv('ex_acquired.csv')
df_eqlist = pd.read_csv('ex_eqlist.csv')

# Display info
df_acquired.head()
• Create 2 empty dataframes, df_comp1 and df_comp2.
• Select the data from ex_acquired.csv and compare it with the equipment data in ex_eqlist.csv. Then display the assigned equipment of each tag:
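The code for this step is not reproduced in the extract; a minimal sketch, using small in-memory stand-ins for ex_acquired.csv and ex_eqlist.csv (the column names Time/Tag/Value and Tag/Equipment, and the merge-based lookup, are assumptions):

```python
import pandas as pd

# Stand-ins for ex_acquired.csv and ex_eqlist.csv; with the real files
# you would use pd.read_csv(...) instead. Column names are assumed.
df_acquired = pd.DataFrame({
    'Time': ['10:00', '10:00', '10:15'],
    'Tag': ['T01', 'T11', 'T01'],
    'Value': [1.2, 3.4, 1.3],
})
df_eqlist = pd.DataFrame({'Tag': ['T01', 'T11'],
                          'Equipment': ['Comp1', 'Comp2']})

# Two empty dataframes to receive the per-equipment rows
df_comp1 = pd.DataFrame(columns=['Time', 'Tag', 'Value'])
df_comp2 = pd.DataFrame(columns=['Time', 'Tag', 'Value'])

# Display the assigned equipment of each acquired tag
assigned = df_acquired.merge(df_eqlist, on='Tag', how='left')
print(assigned[['Tag', 'Equipment']])
```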
• Copy and insert the data into df_comp1 or df_comp2 according to the equipment:
if (df_eqlist.iloc[j,1]==“Comp1"):
df_comp1=df_comp1.append({'Time': df_acquired.iloc[i,0], 'Tag’:
df_acquired.iloc[i,1],'Value': df_acquired.iloc[i,2]}, ignore_index=True)
else:
df_comp2=df_comp2.append({'Time': df_acquired.iloc[i,0], 'Tag': df_acquired.iloc[i,1],
'Value': df_acquired.iloc[i,2]}, ignore_index=True)
9
• Save the dataframes to CSV files:
df_comp1.to_csv('comp1_data.csv')
df_comp2.to_csv('comp2_data.csv')
DATA CLEANING
• E.g.:

Tag Value      Definition   Value to be assigned
Alarm          Good value   1
Bad            Bad value    Garbage
Calc Failed    Bad value    Garbage
Configure      Bad value    Garbage
Connected      Good value   1
FALSE          Good value   0
FAULT          Good value   0
Good           Good value   1
I/O Timeout    Bad value    Garbage
Intf Shut      Bad value    Garbage
Normal         Good value   1
Not Connect    Bad value    Garbage
Out of Serv    Bad value    Garbage
Pt Created     Bad value    Garbage
Scan Off       Bad value    Garbage
TRUE           Good value   1

NOTE:
• Bad values are removed.
• The remaining values in non-numerical form are converted into numeric values, e.g. 1 or 0.
[Flowchart: data cleaning — for each value, check it against the non-numeric values list; convert non-numeric values; output the cleaned data]
• We want to remove bad values from ex2_raw.csv. The references for the bad/good values are contained in ex2_badvalref.csv.
df_raw=pd.read_csv('ex2_raw.csv')
df_ref=pd.read_csv('ex2_badvalref.csv')
• You may check the dataframes’ contents – use head() or describe() functions.
• Create 2 new dataframes: one to store the cleaned data, another to store the removed data (Note: in some cases, the info on the removed bad values is required by domain experts).
df_cleaned=df_raw.copy()
df_logrm=pd.DataFrame(columns=['Time', 'Tag', 'Value'])
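The removal step itself was not captured in the extract; a minimal sketch, using small in-memory stand-ins for ex2_raw.csv and ex2_badvalref.csv (the column names and the vectorized isin() filter are assumptions, not the original slide's code):

```python
import pandas as pd

# Stand-ins for ex2_raw.csv and ex2_badvalref.csv; with the real files
# you would use pd.read_csv(...). Column names are assumed.
df_raw = pd.DataFrame({
    'Time': ['10:00', '10:00', '10:15'],
    'Tag': ['T01', 'T02', 'T01'],
    'Value': ['12.5', 'I/O Timeout', '12.7'],
})
df_ref = pd.DataFrame({'Value': ['Bad', 'I/O Timeout', 'Scan Off'],
                       'Definition': ['Bad value'] * 3})

bad_values = set(df_ref['Value'])

# Rows holding a bad value go to df_logrm; the rest stay in df_cleaned
mask = df_raw['Value'].isin(bad_values)
df_logrm = df_raw[mask].copy()
df_cleaned = df_raw[~mask].copy()
print(df_cleaned)
```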
• Now we change the non-numeric values in df_cleaned into respective numerical values:
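The conversion code is not shown in the extract; a minimal sketch using the good-value mapping from the earlier table (the value_map dict and the sample data are illustrative assumptions):

```python
import pandas as pd

# Mapping from the good non-numeric values to numbers, taken from the
# value table on the earlier slide
value_map = {'Alarm': 1, 'Connected': 1, 'Good': 1, 'Normal': 1,
             'TRUE': 1, 'FALSE': 0, 'FAULT': 0}

# Small stand-in for df_cleaned after bad values were removed
df_cleaned = pd.DataFrame({'Tag': ['T01', 'T02', 'T03'],
                           'Value': ['Good', 'FALSE', '12.5']})

# Replace the mapped strings, then coerce the whole column to numeric
df_cleaned['Value'] = pd.to_numeric(df_cleaned['Value'].replace(value_map))
print(df_cleaned['Value'].tolist())
```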
df_cleaned.to_csv('ex2res_cleaned.csv')
df_logrm.to_csv('ex2res_logrm.csv')
DATA TRANSFORMATION
• Purpose: To transform the dataset's dimensions into the format required for modeling.
• E.g. cleaned data sets comprise data organized in a 3 columns x n rows format. The 3 columns are Tag ID, Time and Value; that is, each row holds a tag with its time and value. This format requires transformation, as the modeling process expects the Tag IDs to be the columns, with samples (rows) listed by Time.
• This means the data sets must be transformed: a pivot process is applied to the cleaned data so that the Tag ID, originally listed down the rows, becomes the columns.
• The pivot process requires a lot of computational power since the data sets are huge. This requires the data sets to be split into several pieces so that each small part can be computed in reasonable time. Finally, the pieces are combined to be used in modeling.
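The split-pivot-combine idea above can be sketched as follows (the chunking by tag groups and the sample data are assumptions; with real data the chunk boundaries would be chosen to fit memory):

```python
import pandas as pd

# Long-format data (Time, Tag, Value), as produced by the cleaning step
df_long = pd.DataFrame({
    'Time': ['10:00', '10:00', '10:15', '10:15'],
    'Tag': ['T01', 'T02', 'T01', 'T02'],
    'Value': [1.0, 2.0, 1.1, 2.1],
})

# Pivot one chunk of tags at a time, then combine the pieces column-wise
pieces = []
for tags in (['T01'], ['T02']):          # chunk boundaries are arbitrary here
    part = df_long[df_long['Tag'].isin(tags)]
    pieces.append(part.pivot(index='Time', columns='Tag', values='Value'))
df_wide = pd.concat(pieces, axis=1)
print(df_wide)
```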
• Example: transform the dataset's dimensions into the required format using ex3_notrfm.csv.
import pandas as pd

df_trfm = pd.read_csv('ex3_notrfm.csv')
df2 = df_trfm.pivot(index='Time', columns='Tag', values='Value')
• Another activity that may be required at this stage is imputation, i.e. filling in the missing values.
• We may see many cells containing NaN, which indicate missing values.
o In some cases, rows with missing values may need to be dropped (removed).
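A minimal sketch of both options, imputation with fillna() and row removal with dropna(), on a small pivoted sample (the column-mean strategy is an illustrative assumption; the right choice depends on the data):

```python
import numpy as np
import pandas as pd

# Pivoted data with one missing reading (NaN)
df2 = pd.DataFrame({'T01': [1.0, np.nan, 1.2],
                    'T02': [2.0, 2.1, 2.2]},
                   index=['10:00', '10:15', '10:30'])

df_filled = df2.fillna(df2.mean())   # impute with each column's mean
df_dropped = df2.dropna()            # or drop rows that have missing values
print(df_filled)
```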
Quiz
• Using Python, generate three different types of visualizations/plots/charts for the data.
• The visualizations/plots/charts may represent any part of the data or the whole data set.
• Submission: individual assignment; copy and paste your code and the visualizations/plots/charts into a single PDF document.