
Data Analysis

From theory to implementation


Using Excel, Python, and Flourish

Lecture #2/8
Data analysis pre-processing
gathering/preparation/summarization

Dr. Ghoniem Lawaty
GHONIEM.GHONIEM@GMAIL.COM
MIS, DA, ML, Digitization and Micro-Services, TOGAF, DEVOPS
Certified ATM for CMMI SCAMPI (A) method, DA, DS, ML, ICAgile
https://www.linkedin.com/in/ghoniem-abdel-azim-mostafa-33860691
Session Topics
• Data analysis lifecycle model (practical guide)
• Problem definition approaches
• Data definition and understanding
• Data gathering process
• Data preparation process
• Data cleaning phase
• Data summarization
• Case study:
• Budget sample
DA Lifecycle, Brief intro
• Define problem statement
• Data understanding
• Data preparation
• Collection
• Cleaning
• Integration
• Reduction
• Transformation
• Central tendency measures
• Dispersion measures
• Correlation measures
• Anomaly detection
• Forecasting
• Data visualization
• Data interpretation (Storification)
Data analysis process
Problem definition approaches
• There are two methods: forward and backward
• Forward definition: ask what kinds of analytics the available datasets can support, considering the following:
• Business domain standards
• Organizational objectives
• Business unit (BU) key measurements
• Backward definition: start from what the organization needs, focusing on the following:
• Problems the organization currently faces
• Explanations the organization requires to support decision making
• Required forecasts of the future
Data definition and understanding
• To understand the data, you need to do the following:
• Obtain domain knowledge
• Understand the problem and the target objective
• Understand the data objects and the relationships between them
• Understand the object features (attributes) and the purpose of each one
• Understand the domain values of each feature, as this helps you raise the quality of your data through cleaning, grouping, and handling missing data
• Involve a business domain expert to acquire the domain knowledge
• Involve a data expert to clarify the structure and content of the dataset
• For a development team that owns the data engineering process, most of the requirements above do not apply
Data gathering phase
• Data gathering techniques
• Observations
• Surveys / Questionnaires
• Interviews
• Focus Groups
• Data sources
• Software applications
• Databases
• XLSX sheets
• Text files
• Big data engines
• Tools:
• Development tools
• Crawling
• Database management
• BI tools
Data preparation process
• The objective of data preparation is to achieve data quality and to enhance quality factors such as:
• Accuracy
• Completeness
• Consistency
• Timeliness
• Collection
• What data to collect, related to your problem
• The challenge of working with different data sources
• Data sources may be structured or unstructured
[Figure: data preparation flow. Problem definition; Data collection (what data to be collected); Data preparation (formatting data, cleaning data, sampling data); Data transformation (scaling and normalization, decomposition, aggregation)]
Data preparation (cont.)
• Preprocessing
• Formatting: unify the format of the datasets to match the target format on which you will build your DA models
• Cleaning
• How to handle missing and inconsistent data
• Sampling
• Which sample of the data to select in order to achieve your target level of accuracy
• Integration
• How to integrate your different data sources
• Which attributes to select to build the relations between them
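The formatting and integration steps can be sketched as follows. This is a minimal illustration, not a prescribed method: the two sources, the field names (`dept_id`, `date`, `amount`, `name`), and the helper functions `to_iso` and `left_join` are all hypothetical, and ISO dates are assumed as the target format.

```python
from datetime import datetime

# Hypothetical source A: transactions with dates written as DD/MM/YYYY.
sales = [
    {"dept_id": "D1", "date": "05/01/2024", "amount": 120.0},
    {"dept_id": "D2", "date": "17/02/2024", "amount": 80.0},
    {"dept_id": "D9", "date": "03/03/2024", "amount": 45.0},
]
# Hypothetical source B: a department lookup sharing the key attribute dept_id.
departments = [
    {"dept_id": "D1", "name": "Finance"},
    {"dept_id": "D2", "name": "Operations"},
]

def to_iso(date_text):
    """Formatting: convert DD/MM/YYYY into the assumed target format YYYY-MM-DD."""
    return datetime.strptime(date_text, "%d/%m/%Y").date().isoformat()

def left_join(rows, lookup, key):
    """Integration: attach the lookup attributes to every row through the shared key.

    Rows without a match are kept, and their lookup attributes are set to None.
    """
    index = {item[key]: item for item in lookup}
    extra_fields = [field for field in lookup[0] if field != key]
    joined = []
    for row in rows:
        match = index.get(row[key], {})
        joined.append({**row, **{field: match.get(field) for field in extra_fields}})
    return joined

formatted = [{**row, "date": to_iso(row["date"])} for row in sales]
integrated = left_join(formatted, departments, "dept_id")

for row in integrated:
    print(row)
```

Keeping unmatched rows (a left join) makes integration gaps visible as `None` values, which the cleaning phase can then handle.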
Data cleaning phase
• Up to date:
• Check that the data sample is current and fits the objective
• Missing data
• Removal
• Filling with default values
• Filling with average values
• Filling with min/max values
• Filling with the nearest forecasted value
• Duplication:
• Removal
• Consolidation and grouping using a suitable function
• Outlier handling
• Define a valid output structure
• Attribute reduction:
• Keep only the attributes that support your model
• Data merging using join techniques
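Three of the cleaning steps above (duplicate removal, outlier handling, and filling missing values with the average) can be sketched as follows. The records are hypothetical, and the 1.5 x IQR rule used to flag outliers is an assumption made for this illustration; the slides do not name a specific outlier test.

```python
from statistics import mean, quantiles

# Hypothetical records; None marks a missing value.
records = [
    {"id": 1, "amount": 100.0},
    {"id": 2, "amount": None},
    {"id": 3, "amount": 110.0},
    {"id": 3, "amount": 110.0},   # exact duplicate of the previous record
    {"id": 4, "amount": 90.0},
    {"id": 5, "amount": 105.0},
    {"id": 6, "amount": 95.0},
    {"id": 7, "amount": 1000.0},  # suspiciously large value
]

# Duplication: remove exact duplicates while keeping the original order.
seen = set()
unique = []
for record in records:
    key = (record["id"], record["amount"])
    if key not in seen:
        seen.add(key)
        unique.append(record)

# Outliers: flag values outside Q1 - 1.5*IQR .. Q3 + 1.5*IQR (assumed rule).
known = [r["amount"] for r in unique if r["amount"] is not None]
q1, _, q3 = quantiles(known, n=4)
low, high = q1 - 1.5 * (q3 - q1), q3 + 1.5 * (q3 - q1)
outliers = [r for r in unique if r["amount"] is not None and not low <= r["amount"] <= high]

# Missing data: fill with the average of the values that are not outliers.
average = mean(v for v in known if low <= v <= high)
filled = [
    {**r, "amount": average if r["amount"] is None else r["amount"]}
    for r in unique
]

print("outliers:", outliers)
print("filled:", filled)
```

Outliers are flagged before the average is computed so that a single extreme value does not distort the fill value.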
Data cleaning (cont.)
• Transformation
• Scaling
• Equal width: every class has the same width (interval), and the frequency varies from class to class
• Equal frequency: every class holds the same number of items, and the class width varies
• Reduction
• Dimensionality reduction
• Feature selection
• Feature extraction
• Projection
• Grouping by aggregation
• Numerosity reduction
• Parametric
• Regression
• Non-parametric
• Sampling
• Matrices
• Histograms
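The difference between equal-width and equal-frequency classes can be sketched as follows. The function names and the sample values are hypothetical, and the sketch assumes the values are not all identical (otherwise the class width would be zero).

```python
def equal_width_bins(values, k):
    """Split the range min..max into k classes of identical width."""
    lowest, highest = min(values), max(values)
    width = (highest - lowest) / k
    bins = [[] for _ in range(k)]
    for value in values:
        # The maximum value falls on the upper edge, so clamp it into the last class.
        index = min(int((value - lowest) / width), k - 1)
        bins[index].append(value)
    return bins

def equal_frequency_bins(values, k):
    """Split the sorted values into k classes holding (almost) the same count."""
    ordered = sorted(values)
    n = len(ordered)
    return [ordered[i * n // k:(i + 1) * n // k] for i in range(k)]

values = [1, 2, 3, 4, 5, 6, 7, 8, 50, 100, 101, 102]
width_bins = equal_width_bins(values, 3)
freq_bins = equal_frequency_bins(values, 3)

print([len(b) for b in width_bins])  # class sizes differ, widths are equal
print([len(b) for b in freq_bins])   # class sizes are equal, widths differ
```

With skewed data such as this sample, equal-width classes end up with very uneven counts, while equal-frequency classes stretch their intervals to keep the counts balanced.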
Data Summarization (Grouping)
• How to group data
• Tally method
• Class intervals
• Class central point
• Frequency table
• Single
• Matrix
• Class frequencies type
• Absolute frequency
• Absolute percentage
• Cumulative frequency (Up and down)
• Cumulative percent
• Select grouping function
• Min
• Max
• Sum
• Count
• Average
• Statistical measurements (Will be mentioned in the next slides)
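A single-variable frequency table with the frequency types listed above can be sketched as follows. The observations and the column names are hypothetical; "up" and "down" are read here as ascending and descending cumulative frequency.

```python
from collections import Counter

# Hypothetical categorical observations.
observations = ["A", "B", "A", "C", "B", "A", "A", "C", "B", "A"]

counts = Counter(observations)          # tally of each class
total = sum(counts.values())

table = []
cumulative = 0
for value in sorted(counts):
    frequency = counts[value]
    cumulative += frequency
    table.append({
        "class": value,
        "frequency": frequency,                                 # absolute frequency
        "percent": 100 * frequency / total,                     # absolute percentage
        "cumulative_up": cumulative,                            # ascending cumulative
        "cumulative_down": total - (cumulative - frequency),    # descending cumulative
        "cumulative_percent": 100 * cumulative / total,
    })

for row in table:
    print(row)
```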
Data Summarization (Grouping)
• How many groups (classes):
• K is the number of classes, chosen so that No. of items <= 2^K
• For example, 1000 items should have up to 10 classes according to this rule, because 2^10 = 1024
• First class start value = Min
• Range = Max - Min
• Class length = Range / K, rounded to a convenient value
• Class midpoint = (ML + LL) / 2, where ML and LL are the upper and lower limits of the class
• Class relative frequency = class frequency / total frequency
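The rules above can be worked through in a short sketch. Two choices here are assumptions rather than part of the slides: K is taken as the smallest value satisfying the 2^K rule, and the class length is rounded up to a whole number. The function names and the sample values are hypothetical.

```python
import math

def class_count(n_items):
    """Smallest K such that n_items <= 2**K (assumed reading of the 2^K rule)."""
    k = 1
    while 2 ** k < n_items:
        k += 1
    return k

def build_classes(values):
    """Build class intervals, midpoints, and frequencies for numeric values."""
    k = class_count(len(values))
    start = min(values)                      # first class starts at Min
    data_range = max(values) - start         # Range = Max - Min
    # Assumption: round the class length up to a whole number (at least 1).
    length = max(1, math.ceil(data_range / k))
    classes = []
    for i in range(k):
        lower = start + i * length
        upper = lower + length
        classes.append({
            "lower": lower,
            "upper": upper,
            "midpoint": (lower + upper) / 2,
            "frequency": 0,
        })
    for value in values:
        # A value on the final upper limit is counted in the last class.
        index = min(int((value - start) // length), k - 1)
        classes[index]["frequency"] += 1
    for item in classes:
        item["relative_frequency"] = item["frequency"] / len(values)
    return classes

values = [12, 15, 17, 21, 22, 25, 28, 30, 31, 33, 36, 38, 40, 44, 47, 52]
classes = build_classes(values)

print(class_count(1000))
for item in classes:
    print(item)
```

For the 16 sample values the rule gives K = 4, the range is 40, and the class length is 10, so the classes start at 12, 22, 32, and 42.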
Data Summarization: Budget Sample
• Dataset:
• https://github.com/Ghoniem-Ghoniem/DataAnalysis/blob/main/Budget.xlsx
• In this example, the data summarization is implemented using an Excel pivot table
• Steps:
• Go to the first tab: 01-Budget
• Go to Insert -> PivotTable
• Use the pivot table settings:
• Filters: the columns used to filter the pivot table
• Columns: the attributes used as the pivot table columns
• Rows: the attributes used as the pivot table rows
• Values: the fields to summarize, with a choice of aggregate function
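The same four pivot table settings (filters, columns, rows, values with an aggregate function) can be sketched in plain Python. The rows and the field names `department`, `quarter`, and `amount` are hypothetical and are not taken from the Budget.xlsx workbook; the `pivot` helper is written only for this illustration.

```python
from collections import defaultdict

# Hypothetical rows; the real workbook columns may differ.
budget_rows = [
    {"department": "IT", "quarter": "Q1", "amount": 100},
    {"department": "IT", "quarter": "Q2", "amount": 150},
    {"department": "HR", "quarter": "Q1", "amount": 80},
    {"department": "HR", "quarter": "Q1", "amount": 20},
    {"department": "HR", "quarter": "Q2", "amount": 60},
]

def pivot(rows, row_field, column_field, value_field, aggregate=sum, keep=lambda r: True):
    """Group values by (row, column) and apply the aggregate function to each cell.

    keep plays the role of the pivot table filter.
    """
    cells = defaultdict(list)
    for record in rows:
        if keep(record):
            cells[(record[row_field], record[column_field])].append(record[value_field])
    table = defaultdict(dict)
    for (row_key, column_key), cell_values in cells.items():
        table[row_key][column_key] = aggregate(cell_values)
    return dict(table)

totals = pivot(budget_rows, "department", "quarter", "amount")                  # Sum
counts = pivot(budget_rows, "department", "quarter", "amount", aggregate=len)   # Count
q1_only = pivot(budget_rows, "department", "quarter", "amount",
                keep=lambda r: r["quarter"] == "Q1")                            # Filter

print(totals)
print(counts)
print(q1_only)
```

Swapping `aggregate` for `min`, `max`, or `statistics.mean` gives the other grouping functions listed earlier in the lecture.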
