
CRISP-DM --------------------------------------------------> CRISP-ML(Q)

CRISP-DM: Cross Industry Standard Process for Data Mining – the earlier methodology.
CRISP-ML(Q): CRISP for Machine Learning with Quality Assurance – the methodology we currently work in.

PAST – Data Analyst (works in Tableau & BI tools)

PRESENT / FUTURE – Data Scientist (should also have the skills of a Data Analyst)

ALGORITHM – Data Engineers (who suggest which algorithm to use)

Stages of Analytics -
Descriptive – "What happened?" (Past) <-------- Data Analyst
Diagnostic – "Why did it happen?" (Past) <-------- Data Analyst
Predictive – "What will happen?" (Future) <-------- Data Scientist
Prescriptive – "How can we make it happen?" (Future) <-------- Data Scientist

PROJECT MANAGEMENT METHODOLOGY

CRISP-ML(Q) – Cross Industry Standard Process for Machine Learning with Quality Assurance.
The CRISP-ML(Q) process model describes 6 phases:

1. Business & Data Understanding
2. Data Preparation (Data Engineering)
3. Model Building & Tuning
4. Model Evaluation (Testing & Evaluation)
5. Deployment
6. Monitoring & Maintenance

1a. Understand Business Problem & Create Project Charter


What are the optimization terms of a business problem?
Objective (maximize or minimize something) & Constraints.
The objective & constraints should each be described concisely – at most 5-6 words and at minimum 2-3 words.

1b. Data Understanding

- Continuous vs Discrete data –


Key characteristics of discrete data
Discrete data is often used in simple statistical analysis because it's easy to summarize and compute. Let's
look at some of the other key characteristics of discrete data.

Discrete data includes discrete variables that are finite, numeric, countable, and non-negative integers (5,
10, 15, and so on).
Discrete data can be easily visualized and demonstrated using simple statistical methods such as bar
charts, line charts, or pie charts.
Discrete data can also be categorical - contain a finite number of data values, such as the gender of a
person.
Discrete data is distributed discretely in terms of time and space. Discrete distributions make analyzing
discrete values more practical.

Key characteristics of continuous data


Unlike discrete data, continuous data can be either numeric or distributed over date and time. This data
type uses advanced statistical analysis methods taking into account the infinite number of possible values.
Key characteristics of continuous data are:

Continuous data changes over time and can have different values at different time intervals.
Continuous data is made up of random variables, which may or may not be whole numbers.
Continuous data is measured using data analysis methods such as line graphs, skews, and so on.
Regression analysis is one of the most common types of continuous data analysis.

Continuous Data vs Discrete Data

Continuous data:
1. Falls on a continuous sequence; numerical data in decimal format that still makes sense is continuous.
2. It is measurable.
3. It can take any value within some interval.
4. Tabulation is known as a grouped frequency distribution.
5. A graph of a continuous function shows the points connected with an unbroken line.
6. It includes any value within a preferred range.
7. Graphical representation – histogram or line graph.
8. e.g., market price of a product, the weight of newborn babies, daily wind speeds, the temperature of a freezer (83.6 – 99.9 degrees).

Discrete data:
1. Has clear spaces between values; numerical data where decimal values would not make sense is discrete.
2. It is countable.
3. It can take only specific (distinct or separate) values.
4. Tabulation is known as an ungrouped frequency distribution.
5. A graph of a discrete function shows distinct points that remain unconnected.
6. It contains only distinct or separate values.
7. Graphical representation – bar graph.
8. e.g., days of the week, the number of customers who bought different items, the number of computers in each department, the number of items you buy at the grocery store each week.
Examples of discrete data
Discrete data can also be qualitative. The nationality you select on a form is a piece of discrete data. The
nationalities of everyone in your workplace, when grouped, can be valuable information in evaluating your
hiring practices.

The national census consists of discrete data, both qualitative and quantitative. Counting and collecting
this identifying information deepens our understanding of the population. It helps us make predictions
while documenting history. This is a great example of discrete data's power.

Examples of continuous data


When you think of experiments or studies involving constant measurements, they're likely to be
continuous variables to some extent. If you have a number like “2.86290” anywhere on a spreadsheet, it's
not a number you could have quickly arrived at yourself — think measurement devices like stopwatches,
scales, thermometers, and the like.

A task involving these tools probably applies to continuous data. For example, if we’re clocking every
runner in the Olympics, the times will be shown on a graph along an applicable line. Although our athletes
get faster and stronger over the years, there should never be an outlier that skews the rest of the data.
Even Usain Bolt is only a few seconds faster than the historical field when it comes down to it.

There are infinite possibilities along this line (for example, 5.77 seconds, 5.772 seconds, 5.7699 seconds,
etc.), but every new measurement is always somewhere within the range.

Not every example of continuous data falls neatly into a straight line. Still, over time a range becomes more
apparent, and you can bet on new data points sticking inside those parameters.

What is Nominal, Ordinal, Interval and Ratio Scales?


(Data type – scale of measurement)
Nominal, Ordinal, Interval, and Ratio are defined as the four fundamental levels of measurement scales
that are used to capture data in the form of surveys and questionnaires, each being a multiple-choice
question.

Discrete Data
Nominal variable (categorical) (least preferred)
- Data can be put into categories.
- They are variables with no numeric value.
- They cannot be assigned any order.
- They cannot be quantified, i.e., you cannot perform arithmetic operations on them (like addition or subtraction) or logical comparisons (like equal to or greater than).

Ordinal scale
- It classifies according to rank.
- It has all its variables in a specific order, beyond just naming them.

A major disadvantage of the ordinal scale compared with other scales is that the distance between
measurements is not always equal. If you have a list of numbers like 1, 2 and 3, you know that the distance
between the numbers is exactly 1. But if you had "very satisfied", "satisfied" and "neutral", there is nothing
to say whether the differences between the three ordinal values are equal. In a ranked list of favourite
movies, for example, there may be only a small difference in preference between the 1st and 2nd choices
but a huge difference between the 2nd and the last. This inability to tell how much lies between each value
is one reason why other scales of measurement are usually preferred in statistics.

Continuous Data
Interval scale
- It has values at equal intervals that mean something (e.g., a thermometer might have intervals of 10 degrees).
- Offers labels, order, as well as a specific, equal interval between each of its variable options.

Ratio scale – (most preferred data type)

- It is exactly the same as the interval scale,
- except that zero on a ratio scale is a true zero: it means the quantity doesn't exist.

Examples by scale of measurement:

Nominal – Gender, Colour, Country, Type of house / accommodation, Genotype (AA, Aa or aa), Religious preference, etc.
Ordinal – High school class ranking (1st, 9th, 87th, ...), Socioeconomic status (poor, middle class, rich), the Likert scale (strongly disagree, disagree, neutral, agree, strongly agree), Level of agreement (yes, maybe, no), Time of day (dawn, morning, noon, afternoon, evening, night), Political orientation (left, centre, right), Military rank, Clothing size (small, medium, large), etc.
Interval – Temperature, IQ rankings, SAT scores, Time on a clock with hands.
Ratio – Age, Weight, Height, Sales figures, Ruler measurements, Income earned in a week, Years of education, Number of children.

- Qualitative vs Quantitative data

Qualitative Data:
1. This type of data analysis is based on human understanding – how people think & feel.
2. Qualitative data is text based.
3. Collected using interviews & observation.
4. Analysed by grouping the data into meaningful themes & categories.
5. Qualitative data is subjective & dynamic.
6. e.g., My best friend has curly brown hair; they have green eyes; they have a friendly face & a contagious laugh; they can also be quite impatient & impulsive at times.

Quantitative Data:
1. This type of data analysis is based on numerical information & facts, using mathematical logic & techniques.
2. Quantitative data is numerical based.
3. Collected using surveys, measuring & counting.
4. Analysed using statistical analysis.
5. Quantitative data is fixed & universal.
6. e.g., My best friend is 5 feet & 7 inches tall; they have size 6 feet; my best friend has one older sibling & two younger siblings; they go swimming 4 times a week.

- Structured vs Semi-structured vs Unstructured Data

Structured Data:
1. Data with a high degree of predefined organization.
2. Data in a spreadsheet (Excel) or in tabular format.
3. e.g., formats – Excel sheet, comma separated values file (.csv), etc.

Semi-Structured Data:
1. Data with some degree of predefined organization & structure.
2. Data in a text file that has some structure (headers, paragraphs, etc.).
3. e.g., formats – HTML, XML, etc.

Unstructured Data:
1. Data with no predefined organizational form & no specific format.
2. Data that is neither structured nor semi-structured.
3. e.g., formats – images, videos, Word files, PDF files.

- Big Data vs Non-Big Data

Big Data – Any kind of data that gives you two problems, namely a computational burden & a storage burden, is called big data.
To deal with the storage problem we use Hadoop.
To deal with the computational problem we use Spark.

Non-Big Data – Data which is not big data, i.e., it does not create a computational & storage burden.

- Cross-sectional vs Time Series Data

Cross-sectional Data:
1. Observations coming from different individuals or groups at a single point in time.
2. Focuses on several variables at the same point in time.
3. e.g., the maximum temperature of several cities on a single day; the closing price of a group of 20 different stocks on December 15, 1986.
4. Day, time & sequence do not matter.

Time Series Data:
1. A set of observations collected at (usually equally spaced) time intervals.
2. Focuses on the same variable over a period of time.
3. e.g., the profit of an organization over a period of time; the daily closing price of a certain stock recorded over the last 6 weeks.
4. Day, time & sequence matter.

Longitudinal data = Cross-sectional data + Time Series data

- Online vs Offline Data


DATA COLLECTION -
e.g., Jio wants to launch a 5G tariff plan for villages in India.

1st approach ------------------- You take data online from the Vodafone company, but you can't use it as-is; you need to develop a data set according to your own needs.
For villages, we need to look at pricing & whether the people living there are employable, and whether they are high or low earners.
Use Google Maps to see how many people are doing farming & create a new data set (doing research about the pricing villagers can afford).
To know whether farmers are high or low earners, approach a drone company for data (data you buy is called syndicated data) to check which type of crop they are growing (you can evaluate which crop's pricing gives profit to whom).
The data we collect from Vodafone & the drone company may not have exactly the information we need, but it is neither time consuming nor costly.

2nd approach ----------------- You hire 20 people with feedback forms in a village having a population of 10,000. This will get exactly the information we need, but the whole process is costly & time consuming.

Primary Data – 2nd Approach


Secondary Data – 1st Approach, we always use the secondary data.

2. Data Cleansing / Preparation / Organizing


1. Typecasting - Converting one data type into another
e.g., integer into float, Array to series or data frame

2. Handling duplicates - create a new dataset by dropping all the duplicates (zero duplicates)
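A minimal pandas sketch of steps 1 and 2; the column names and values here are illustrative, not from the original notes:

import pandas as pd

# Hypothetical example data
df = pd.DataFrame({"sales": ["100", "250", "250"], "region": ["East", "West", "West"]})

# 1. Typecasting: convert the string column into a float column
df["sales"] = df["sales"].astype(float)

# 2. Handling duplicates: create a new dataset with all duplicate rows dropped
df_clean = df.drop_duplicates()
print(df_clean.dtypes)   # sales is now float64
print(df_clean)          # the duplicated row appears only once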
3. Outlier Analysis / Treatment -
3R technique: Rectify, Retain, Remove
Example:
While working with a company's data we find that some data points look very different from the others (e.g., an unusually high sale compared to the rest). First we Rectify: we check with the company whether the value is correct or wrong.
If it is correct, we proceed with it. Here we are Retaining.
But if the company is not certain whether the value is correct or wrong, we remove it. Here we are Removing.

Masking - Outlier exist but you fail to detect - False Negative

Swamping - Detecting non-outliers as Outliers - False Positive

Winsorization - This technique modifies the sample distribution of a random variable by capping outliers.
There are 2 capping methods – 1. IQR method (based on Q3 – Q1)
2. Gaussian method (based on the mean & standard deviation)

Example 1: 90% winsorization means all data below the 5th percentile is set to the 5th percentile & all data above the 95th percentile is set to the 95th percentile.

Example 2: Upper maximum value (100): all values above the upper maximum are changed to 100.
Lower minimum value (50): all values below the lower minimum are changed to 50.
Trimming - If alpha is set to 5%, then the lower & upper 5% of values are trimmed (removed).
Trimming is different from winsorization: winsorization caps extreme values, trimming removes them.

Parameters to specify for a winsorizer:

Capping method – IQR or Gaussian
Tail – which side to cap: upper / lower / both
Fold – e.g., 1.5 (the multiplier applied to the IQR or standard deviation)
Variables – the column name(s) to treat
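These parameters map onto the Winsorizer transformer; the sketch below assumes the third-party feature_engine library (import path and argument names may differ across versions):

import pandas as pd
from feature_engine.outliers import Winsorizer

# Hypothetical data with one extreme salary value
df = pd.DataFrame({"Salaries": [20000, 22000, 25000, 24000, 90000]})

winsor = Winsorizer(
    capping_method="iqr",    # capping method: "iqr" or "gaussian"
    tail="both",             # which side to cap: "right", "left" or "both"
    fold=1.5,                # fold = 1.5 (multiplier on the IQR)
    variables=["Salaries"],  # column name(s) to treat
)
df_capped = winsor.fit_transform(df)  # extreme values are capped, not removed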

4. Zero & Near Zero Variance features – Usually we ignore (drop) a column whose entries are all the same, or where the large majority of entries are the same.
e.g., all entries of a column called Country show the name USA.
5. Discretization – Converting continuous data into discrete data (e.g., grouping continuous values into bins or categories).
Binarization – Converting continuous data into data with 2 categories (Yes or No, True or False, etc.).
Rounding – Rounding to the nearest value, e.g., 5.6 becomes 6 & 5.4 becomes 5.
Binning – Fixed-width binning & adaptive binning.
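A small sketch of binarization and the two binning styles using pandas/numpy; the values are illustrative:

import numpy as np
import pandas as pd

ages = pd.Series([3, 17, 25, 40, 62, 78])   # illustrative continuous values

# Binarization: continuous data into 2 categories
adult = np.where(ages >= 18, "Yes", "No")

# Fixed-width binning: 4 equal-width intervals
fixed_bins = pd.cut(ages, bins=4)

# Adaptive binning: quantile-based bins with roughly equal counts per bin
adaptive_bins = pd.qcut(ages, q=4)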

6. Missing Values
Imputation – A technique for replacing missing data with substitute values so as to retain most of the information in the data set.
Missingness variants – MAR – Missing at Random
MNAR – Missing Not at Random
MCAR – Missing Completely at Random
Imputation techniques –
Model-based methods – Maximum likelihood
– Multiple imputation
Deletion methods – Simple strategies – case-wise deletion / list-wise deletion / complete case analysis
– pair-wise deletion / available case analysis

Single imputation methods – (most used) – Mean, median & mode imputation
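A minimal sketch of the single imputation methods (mean/median for numeric columns, mode for categorical ones); the data frame and column names are made up for illustration:

import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Salary": [20000, np.nan, 25000, 24000, np.nan],
    "City": ["Pune", "Delhi", None, "Delhi", "Delhi"],
})

# Mean or median imputation for a numeric column (median is robust to outliers)
df["Salary"] = df["Salary"].fillna(df["Salary"].median())

# Mode imputation for a categorical column
df["City"] = df["City"].fillna(df["City"].mode()[0])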

7. Dummy variables – a dummy variable is a dichotomous variable (0, 1), where the value indicates the presence or absence of something.

- One-hot encoding: converts categorical variables into numerical (0/1) columns. It can improve the accuracy and predictions of the model.
- Label encoding: replace each categorical value with a numerical value ranging between zero and the total number of classes minus one. E.g., if we have six classes, we use 0, 1, 2, 3, 4 and 5.
- Dummy coding scheme: similar to one-hot encoding. This categorical-data encoding method transforms the categorical variable into a set of binary variables, also known as dummy variables.
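The three encodings can be sketched with pandas as follows; the Colour column is an illustrative example:

import pandas as pd

df = pd.DataFrame({"Colour": ["Red", "Green", "Blue", "Green"]})

# One-hot encoding: one 0/1 column per category
one_hot = pd.get_dummies(df["Colour"], prefix="Colour")

# Dummy coding scheme: like one-hot but drops one category (n-1 dummy variables)
dummies = pd.get_dummies(df["Colour"], prefix="Colour", drop_first=True)

# Label encoding: each category mapped to an integer 0 .. number of classes - 1
labels = df["Colour"].astype("category").cat.codes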

8. Transformation – In order to make our distributions closer to normal, we use different kinds of transformations.
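The notes do not name specific transformations; log, square-root and Box-Cox are common choices and are sketched below (the values are illustrative, and Box-Cox requires positive data):

import numpy as np
from scipy import stats

x = np.array([1.2, 3.5, 7.8, 20.1, 55.0])   # right-skewed illustrative values

log_x = np.log(x)                 # log transformation
sqrt_x = np.sqrt(x)               # square-root transformation
boxcox_x, lam = stats.boxcox(x)   # Box-Cox picks the best power transform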

9. Standardization - Used as a scaling technique: it rescales the values so that the mean is 0 & the standard deviation is 1.

Normalization - The goal of normalization is to change the values of numeric columns in the dataset onto a common scale, without distorting differences in the ranges of values or losing information.
The normalization function (and its target range, commonly 0 to 1) can be user defined.
Syntax -
def norm_func(i):
    # norm_func is the function name for normalization; i means all the values (a pandas Series / column)
    x = (i - i.min()) / (i.max() - i.min())
    return x
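For comparison, a z-score standardization written in the same style as norm_func (this helper is not in the original notes):

def std_func(i):
    # i means all the values (a pandas Series / column)
    # the result has mean 0 & standard deviation 1
    x = (i - i.mean()) / i.std()
    return x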

Exploratory Data Analysis (EDA) is an approach to analysing data using visual techniques. It is
used to discover trends and patterns, or to check assumptions, with the help of statistical summaries & graphical
representations.
EDA covers 4 types of business decisions (the four business moments):

1. First Moment Business Decision, also called Measures of Central Tendency

Mean - Also called the average; it is influenced by outliers (use the median to handle this).
- It also takes the magnitude of the scores into account.

Median - The middle-most value of the dataset; it is not influenced by outliers.

Mode - The most repeated value is called the mode.

- If one value repeats most often, the distribution is called unimodal.

- If two values repeat most often, it is called bimodal.

- If more than two values repeat most often, it is called multimodal.

Central tendency – Population parameters (Greek notation) & sample statistics (regular / Roman notation)
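A quick pandas sketch of the three measures, using made-up data with one outlier to show why the median is preferred when outliers are present:

import pandas as pd

sales = pd.Series([10, 12, 12, 15, 18, 200])   # 200 is an outlier

mean_value = sales.mean()       # pulled up by the outlier
median_value = sales.median()   # middle-most value, not influenced by the outlier
mode_value = sales.mode()       # most repeated value(s): 12 here (unimodal)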

2. Second Moment business decision also called Measures of dispersion


Dispersion is the term used for the spread of data.

Variance - The average of the squared distances of each data point from the centre (mean) is defined as
variance.
We have two formulas: one for the population and one for the sample (shown below).

Disadvantage - In variance we square the distances, so the unit also gets squared; to get back to the original
units we use the standard deviation.
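For reference, the two standard formulas (population variance divides by the population size N; sample variance divides by n - 1):

\sigma^{2} = \frac{1}{N}\sum_{i=1}^{N}\left(x_i - \mu\right)^{2}
\qquad\qquad
s^{2} = \frac{1}{n-1}\sum_{i=1}^{n}\left(x_i - \bar{x}\right)^{2}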

Standard Deviation - The typical difference between each data point & the mean (the square root of the variance).
- When the values in the data set are grouped closely together, we get a smaller standard deviation.
- When the values in the data set are scattered or more spread out, we get a larger / higher standard deviation.
Merit - It uses the original unit of measurement.
We use the standard deviation when we have (approximately) normal data.

Range - The difference between the largest & smallest values in the data set.

We use the range when we have the same sample size across different data sets & there are no outliers.
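In pandas these dispersion measures can be computed directly (same illustrative series as above):

import pandas as pd

sales = pd.Series([10, 12, 12, 15, 18, 200])

variance = sales.var()                    # sample variance (divides by n - 1)
std_dev = sales.std()                     # standard deviation, in the original units
value_range = sales.max() - sales.min()   # range: largest minus smallest value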

3. Third Moment Business Decision - Measures of Asymmetry or Skewness

Symmetric graph (no skewness) vs asymmetric graph (skewed to one side).
- Skewness is a measure of asymmetry in a distribution, i.e., of whether the data are heavy-tailed on one side.
- If the number of records > 300, then a skewness between -2 and 2 is taken as the normal limit.
- If the data are symmetric (normally distributed), the skewness is zero.
- It can also point out which criteria do not have good numbers.
- A long tail in the left area refers to negatively skewed or left-skewed data.
- A long tail in the right area refers to positively skewed or right-skewed data.
- A normal curve with a perfectly symmetrical distribution has zero skewness.

By using this code in python, we can know skewness.


dataset.column.skew()

In measures of skewness, a simple measure is (mean - mode); Pearson's mode skewness divides this by the standard deviation.

4. Fourth Moment Business Decision - Measures of Peakedness (Kurtosis)

It gives information about the tailedness of our distribution: whether the tails stay close to the axis or sit far from it, and how fat or thin (peaked or flat) the distribution is.

Types of Kurtosis: -
1. No (excess) kurtosis – Mesokurtic distribution.
2. Negative kurtosis – Platykurtic distribution: wide, flat peak / thin tails.
3. Positive kurtosis – Leptokurtic distribution: sharp, thin peak / thick tails.
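Similar to the skewness code above, kurtosis can be read from pandas; note that pandas reports excess kurtosis, so 0 corresponds to a mesokurtic distribution (dataset and column are placeholders, as in the earlier snippet):

skewness = dataset.column.skew()   # third moment: asymmetry
kurtosis = dataset.column.kurt()   # fourth moment: excess kurtosis (0 = mesokurtic)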

FEATURE ENGINEERING
Feature Engineering - Machine learning algorithms don't always work on raw data, so we need to convert
that raw data into a format the machine learning algorithm can understand; this is called feature engineering.
It starts with your best guess about which features might influence the thing we are trying to predict.
It is an iterative process: 1. Create new features. 2. Add them to the model. 3. See whether the results improve.

e.g., Credit card transaction (Fraudulent/Not Fraudulent) -


Place of 1st transaction – Hyderabad – X1
Time of 1st transaction – 10:00am – X2
Place of 2nd transaction – Bengaluru – X3
Time of 2nd transaction – 10:05am – X4
X is the input (I/P) ------------------------------------ Y is the output (O/P)
The simpler the model, the better it is (if we have a complex equation, we make it simpler). Two principles express this idea:
1. Principle of Parsimony
2. Occam's razor
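A small illustrative sketch of deriving new features from the credit-card example above; the derived feature names are hypothetical:

import pandas as pd

tx = pd.DataFrame({
    "place": ["Hyderabad", "Bengaluru"],                                # X1, X3
    "time": pd.to_datetime(["2024-01-01 10:00", "2024-01-01 10:05"]),   # X2, X4
})

# New feature: minutes between consecutive transactions (NaN for the first row)
tx["minutes_since_prev"] = tx["time"].diff().dt.total_seconds() / 60

# New feature: did the city change compared with the previous transaction?
tx["city_changed"] = tx["place"].ne(tx["place"].shift()).astype(int)
# A 5-minute gap combined with a city change is a useful input (X) for predicting fraud (Y).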
3. DATA MINING / MACHINE LEARNING
Supervised vs Unsupervised Learning

Supervised Learning (Predictive learning):
1. Algorithms are trained using labelled data.
2. An SL model predicts the output (O/P).
3. In SL, input data is provided along with the output.
4. SL is not close to true AI; we have to first train the model on each data set & only then can it predict the output.
5. Algorithms include – Linear Regression, Logistic Regression, Multiclass Classification, Decision Tree, etc.

Unsupervised Learning (Descriptive learning):
1. Algorithms are trained using unlabelled data.
2. A USL model finds hidden patterns in the data.
3. In USL, only input data is provided to the model.
4. USL is closer to AI, as it learns in a similar way to how a child learns daily routine things from experience.
5. Algorithms include – Clustering, KNN & the Apriori algorithm.

Supervised Learning – Split Data

Compare the training error & the testing error -

- If both the training & testing errors are low & close to each other, it is called a right fit.
- If the training error is low & the testing error is high, it is called overfitting (high variance). To fix this problem,
each algorithm has its own set of techniques, called regularisation techniques.
- If the training error is high & the testing error is low, it is called underfitting (high bias). To fix this problem,
transform the data or perform better feature engineering to get more observations or more features.
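A minimal sketch of the split-and-compare workflow, assuming scikit-learn and a synthetic regression dataset standing in for a real problem:

from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic data (X = inputs, y = output)
X, y = make_regression(n_samples=200, n_features=5, noise=10, random_state=42)

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = LinearRegression().fit(X_train, y_train)
train_error = mean_squared_error(y_train, model.predict(X_train))
test_error = mean_squared_error(y_test, model.predict(X_test))
# Low & close errors -> right fit; low train / high test -> overfitting;
# high train error -> underfitting.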

UNSUPERVISED LEARNING
CLUSTERING / Segmentation - Types - Hierarchical Clustering & K-means Clustering
STP framework – Segmentation --- Targeting --- Positioning

-- Cluster analysis is also called data segmentation; it is an exploratory method.
-- It identifies homogeneous groups of records.
-- Similar items should be grouped together into homogeneous groups (cohesive within a cluster: the distance
between data points inside a cluster should be small).
-- Dissimilar items should be placed in different, heterogeneous groups (distinctive between clusters: the distance
between clusters should be large).

Distance between clusters (linkage methods):

-- Single linkage, also called nearest neighbour (minimum distance between members of 2 clusters).
-- Complete linkage, also called farthest neighbour (maximum distance between members of 2 clusters).
-- Average linkage – average of all distances between members of 2 clusters.
-- Centroid linkage – distance between the centroids of 2 clusters.
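A minimal hierarchical-clustering sketch, assuming scipy; the method argument selects the linkage described above ('single', 'complete', 'average' or 'centroid'):

import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

X = np.array([[1, 2], [1, 4], [5, 8], [6, 9], [9, 1]])   # illustrative 2-D points

Z = linkage(X, method="complete")                 # farthest-neighbour linkage
labels = fcluster(Z, t=2, criterion="maxclust")   # cut the dendrogram into 2 clusters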

You might also like