Lecture #2/8
Data analysis pre-processing
gathering/preparation/summarization
Dr. Ghoniem Lawaty
GHONIEM.GHONIEM@GMAIL.COM
MIS, DA, ML, Digitization and Micro-Services, TOGAF, DEVOPS
Certified ATM for CMMI SCAMPI (A) method, DA, DS, ML, ICAgile
https://www.linkedin.com/in/ghoniem-abdel-azim-mostafa-33860691
Session Topics
Data analysis lifecycle model (Practical guide)
Problem definition approaches
Data definition and understanding
Data gathering process
Data preparation process
Data cleaning phase
Data Summarization
Case Study:
Budget Sample
DA Lifecycle, Brief intro
Define problem statement
Data understanding
Data preparation
Collection
Cleaning
Integration
Reduction
Transformation
Central tendency measures
Dispersion measures
Correlation measures
Anomalies detection
Forecasting
Data visualization
Data interpretation (Storytelling)
Data analysis process
Problem definition approaches
There are two approaches: forward and backward
Forward definition: determine what kinds of analytics the
available datasets can support, considering
the following:
Business domain standards
Organizational objectives
Business unit (BU) key measurements
Backward definition: start from what the organization needs, focusing on the following:
Problems the organization currently faces
Explanations the organization requires to support
decision making
Required forecasts of the future
Data definition and understanding
To understand the data, you need to do the
following:
Obtain domain knowledge
Understand the problem and the target objective
Understand the data objects and the relationships between them
Understand the object features (attributes) and the purpose of
each one
Understand the domain values of each feature; this helps you
improve the quality of your data through cleaning, grouping, and
handling missing data
Involve a business domain expert to
acquire the domain knowledge
Involve a data expert, who can clarify
the structure and content of the dataset
If you are the development team, most of the requirements mentioned above
will not apply, because you own the data engineering
process.
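A first pass at understanding features and their domain values can be sketched as a small profiling step. This is a minimal illustration only: the `records` sample, its attribute names, and the `profile` helper are hypothetical and do not come from the lecture.

```python
from collections import Counter

# Hypothetical sample records; None marks a missing value.
records = [
    {"department": "Sales", "budget": 1200},
    {"department": "sales", "budget": None},
    {"department": "HR", "budget": 800},
    {"department": None, "budget": 950},
]

def profile(rows):
    """Summarize each attribute: missing count and observed domain values."""
    summary = {}
    for name in rows[0]:
        values = [row[name] for row in rows]
        present = [v for v in values if v is not None]
        summary[name] = {
            "missing": len(values) - len(present),
            "domain": Counter(present),
        }
    return summary

report = profile(records)
for name, stats in report.items():
    print(name, "missing:", stats["missing"], "domain:", dict(stats["domain"]))
```

In this sample the profile shows both "Sales" and "sales" in the department domain, the kind of inconsistency that a domain expert or data expert can help resolve before cleaning.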
Data gathering phase
• Data gathering techniques
• Observations
• Surveys / Questionnaires
• Interviews
• Focus Groups
• Data sources
• Software applications
• Databases
• XLSX sheets
• Text files
• Big data engines
• Tools:
• Development tools
• Crawling
• Database management
• BI tools
Data preparation process
The objective of data preparation is to ensure data
quality and to enhance quality factors such as:
Accuracy
Completeness
Collection
What data to collect, in relation to your problem
Handling different data sources is a challenge
Data sources may be structured or unstructured
[Diagram: Problem definition; Data collection (what data to be collected); Data preparation (cleaning data, sampling data); Data transformation (scaling and normalization, decomposition, aggregation)]
Data preparation (cont.)
Preprocessing
Formatting: unify the format of your datasets according to the target
format on which you will build your DA models
Cleaning:
How will you handle missing and inconsistent data?
Sampling:
Which sample of the data will you select in order to
achieve your target level of accuracy?
Integration:
How will you integrate your different data sources
together?
Which attributes will you select to build the relations between them?
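The integration questions above can be made concrete with a small join on a shared key attribute. This is a sketch under assumptions: the two sources, the `dept_id` key, and the `inner_join` helper are hypothetical, and an inner join is only one of several possible join techniques.

```python
# Hypothetical sources: one from a database table, one from a spreadsheet export.
departments = [
    {"dept_id": 1, "name": "Sales"},
    {"dept_id": 2, "name": "HR"},
    {"dept_id": 3, "name": "IT"},
]
budgets = [
    {"dept_id": 1, "budget": 1200},
    {"dept_id": 2, "budget": 800},
    {"dept_id": 4, "budget": 500},  # no matching department
]

def inner_join(left, right, key):
    """Return merged rows for key values present in both sources."""
    index = {}
    for row in right:
        index.setdefault(row[key], []).append(row)
    merged = []
    for row in left:
        for match in index.get(row[key], []):
            merged.append({**row, **match})
    return merged

joined = inner_join(departments, budgets, "dept_id")
print(joined)
```

Rows without a partner in the other source (department 3 and budget row 4 here) are dropped by an inner join, so the choice of key attribute and join type directly affects completeness.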
Data cleaning phase
• Up to date:
• Check that the required sample is current and fits the objective
• Missing data
• Removal
• Filling with default values
• Filling with average values
• Filling with Min/Max values
• Filling with the nearest forecasted value
• Duplication:
• Removal
• Consolidation and grouping using a suitable function
• Outliers handling
• Define valid output structure
• Attributes reduction:
• Keep only the attributes that support your model
• Data merging using join techniques
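Several of the cleaning options listed above (removal, filling with default, average, or Min/Max values, and consolidating duplicates) can be sketched in a few lines. The sample values and the helper names `fill_missing` and `consolidate` are hypothetical; filling with a forecasted value and outlier handling are not covered in this sketch.

```python
from statistics import mean

# Hypothetical budget values; None marks a missing entry.
values = [100, None, 300, None, 200]

def fill_missing(data, strategy="mean", default=0):
    """Handle None entries with one of the listed strategies."""
    present = [v for v in data if v is not None]
    if strategy == "remove":
        return present
    fillers = {
        "default": default,
        "mean": mean(present),
        "min": min(present),
        "max": max(present),
    }
    fill = fillers[strategy]
    return [fill if v is None else v for v in data]

def consolidate(rows, key, field, func=sum):
    """Group rows that share a key and combine one field with func."""
    groups = {}
    for row in rows:
        groups.setdefault(row[key], []).append(row[field])
    return {k: func(v) for k, v in groups.items()}

# Hypothetical duplicated department rows.
rows = [
    {"dept": "Sales", "budget": 1000},
    {"dept": "Sales", "budget": 500},
    {"dept": "HR", "budget": 800},
]
totals = consolidate(rows, "dept", "budget")

print(fill_missing(values, "mean"))
print(totals)
```

Which strategy is suitable depends on the feature's domain values; for example, a default of 0 may be misleading for a budget attribute, while the mean preserves the average but reduces the spread.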
Data cleaning (cont.)
Transformation
Scaling
Equal width: every class has the same width (period), and the frequency
varies from class to class
Equal frequency: every class has the same frequency, and the class width
varies
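The difference between the two schemes is easiest to see on a small dataset with one large value. The following is a minimal sketch; the function names and the sample data are hypothetical, and the equal-width helper assumes the data contains at least two distinct values.

```python
def equal_width_bins(data, k):
    """Split the value range into k intervals of identical width."""
    low, high = min(data), max(data)
    width = (high - low) / k
    bins = [[] for _ in range(k)]
    for x in data:
        # Clamp so the maximum value falls into the last bin.
        i = min(int((x - low) / width), k - 1)
        bins[i].append(x)
    return bins

def equal_frequency_bins(data, k):
    """Split the sorted values into k bins holding (nearly) the same count."""
    ordered = sorted(data)
    n = len(ordered)
    return [ordered[i * n // k:(i + 1) * n // k] for i in range(k)]

# Hypothetical sample with one large value.
data = [1, 2, 3, 4, 5, 6, 7, 8, 30]
width_bins = equal_width_bins(data, 3)
freq_bins = equal_frequency_bins(data, 3)

print(width_bins)
print(freq_bins)
```

With equal width, the large value leaves the middle class empty and crowds the first one; with equal frequency, each class holds three values but the last class spans a much wider range.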
Reduction
Dimensionality reduction
Feature selection
Feature extraction
Projection
Grouping by aggregation
Numerosity reduction
Parametric
Regression
Non-Parametric
Sampling
Matrices
Histograms
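Numerosity reduction replaces the full dataset with a smaller representation. As a rough sketch, the parametric route below stores only the two parameters of a fitted line, and the non-parametric route keeps a random sample of the rows. The data, the `fit_line` helper, and the fixed seed are hypothetical choices for illustration.

```python
import random

def fit_line(xs, ys):
    """Least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

# Hypothetical monthly spend that follows y = 2x + 1 exactly.
months = [1, 2, 3, 4, 5, 6]
spend = [3, 5, 7, 9, 11, 13]

# Parametric: six points are summarized by two numbers.
slope, intercept = fit_line(months, spend)

# Non-parametric: keep a simple random sample of the rows.
rng = random.Random(0)  # fixed seed so the sketch is repeatable
sample = rng.sample(list(zip(months, spend)), k=3)

print(slope, intercept)
print(sample)
```

The parametric form is compact but only as good as the model's fit; sampling makes no assumption about the shape of the data but loses the rows that are not selected.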
Data Summarization (Grouping)
• How to group data
• Tally method
• Class intervals
• Class central point
• Frequency table
• Single
• Matrix
• Class frequency types
• Absolute frequency
• Absolute percentage
• Cumulative frequency (ascending and descending)
• Cumulative percent
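The grouping steps above can be combined into one frequency table. The sketch below assumes fixed-width class intervals of the form [start, end); the `frequency_table` helper, its parameters, and the sample scores are hypothetical.

```python
def frequency_table(data, low, width, classes):
    """Build class intervals with midpoint, absolute and cumulative figures."""
    counts = [0] * classes
    for x in data:
        i = int((x - low) // width)
        if 0 <= i < classes:  # values outside the classes are ignored
            counts[i] += 1
    total = sum(counts)
    table, running = [], 0
    for i, count in enumerate(counts):
        start = low + i * width
        running += count
        table.append({
            "interval": (start, start + width),       # [start, end)
            "midpoint": start + width / 2,            # class central point
            "frequency": count,                       # absolute frequency
            "percent": 100 * count / total,           # absolute percentage
            "cum_up": running,                        # ascending cumulative
            "cum_down": total - running + count,      # descending cumulative
            "cum_percent": 100 * running / total,
        })
    return table

# Hypothetical sample scores.
scores = [52, 55, 61, 64, 67, 70, 73, 78, 81, 95]
table = frequency_table(scores, low=50, width=10, classes=5)
for row in table:
    print(row)
```

The ascending cumulative column answers "how many values fall below the end of this class", while the descending column answers "how many values fall at or above the start of this class".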