CRISP-ML(Q) – Cross Industry Standard Process for Machine Learning with Quality Assurance
CRISP-DM (Cross Industry Standard Process for Data Mining) is the earlier methodology; currently we work with CRISP-ML(Q).
A Data Analyst works on the PAST; a Data Scientist works on the FUTURE (and should have the qualities of a Data Analyst as well).
Stages of Analytics -
Descriptive – “What happened?” Past <<-------- Data Analyst
Diagnostic – “Why did it happen?” Past <<-------- Data Analyst
Predictive – “What will happen?” Future <<-------- Data Scientist
Prescriptive – “How can we make it happen?” Future <<-------- Data Scientist
CRISP-ML(Q) – Cross Industry Standard Process for Machine Learning with Quality Assurance.
The CRISP-ML(Q) process model describes 6 phases –
1. Business & Data Understanding
2. Data Preparation
3. Model Building / Engineering
4. Model Evaluation
5. Model Deployment
6. Monitoring & Maintenance
Discrete data includes discrete variables that are finite, numeric, countable, and non-negative integers (5,
10, 15, and so on).
Discrete data can be easily visualized and demonstrated using simple statistical methods such as bar
charts, line charts, or pie charts.
Discrete data can also be categorical - contain a finite number of data values, such as the gender of a
person.
Discrete data is distributed discretely in terms of time and space. Discrete distributions make analyzing
discrete values more practical
Continuous data changes over time and can have different values at different time intervals.
Continuous data is made up of random variables, which may or may not be whole numbers.
Continuous data is measured using data analysis methods such as line graphs, skews, and so on.
Regression analysis is one of the most common types of continuous data analysis.
Continuous vs. Discrete
1. Continuous data falls on a continuous sequence; discrete data has clear spaces between values.
   Numerical data where decimal values make sense is called continuous data; numerical data where decimal values do not make sense is called discrete data.
2. Continuous data is measurable; discrete data is countable.
3. Continuous data can take any value in some interval; discrete data can take only specific (distinct or separate) values.
4. Tabulation of continuous data is known as a grouped frequency distribution; tabulation of discrete data is known as an ungrouped frequency distribution.
5. A graph of a continuous function shows points connected by an unbroken line; a graph of a discrete function shows distinct points that remain unconnected.
6. Continuous data includes any value within a preferred range; discrete data contains distinct or separate values.
7. Graphical representation – histogram or line graph for continuous data; bar graph for discrete data.
The national census consists of discrete data, both qualitative and quantitative. Counting and collecting
this identifying information deepens our understanding of the population. It helps us make predictions
while documenting history. This is a great example of discrete data's power.
Tasks involving these tools (line graphs, regression analysis) usually apply to continuous data. For example, if we’re clocking every runner in the Olympics, the times will be shown on a graph along an applicable line. Although our athletes get faster and stronger over the years, there should never be an outlier that skews the rest of the data. Even Usain Bolt is only a few seconds faster than the historical field when it comes down to it.
There are infinite possibilities along this line (for example, 5.77 seconds, 5.772 seconds, 5.7699 seconds,
etc.), but every new measurement is always somewhere within the range.
Not every example of continuous data falls neatly into a straight line. Still, over time a range becomes more
apparent, and you can bet on new data points sticking inside those parameters.
Discrete Data
Nominal variable (categorical) (least preferred)
- Data can be put into categories.
- They are variables with no numeric value.
- They cannot be assigned any order.
- They cannot be quantified, i.e., you can’t perform arithmetic operations on them (like addition or subtraction) or logical operations (like equal to or greater than).
Ordinal scale
- It classifies according to rank.
- It has all its variables in a specific order, beyond just naming them.
A major disadvantage of the ordinal scale compared with other scales is that the distance between measurements is not always equal. If you have a list of numbers like 1, 2 and 3, you know that the distance between the numbers in this case is exactly 1. But if you had “very satisfied”, “satisfied” and “neutral”, there’s nothing to say whether the difference between the three ordinal values is equal. In the list of five movies mentioned above, there’s a small difference in my preference for Jaws or Children of Men, but a huge difference between Children of Men (which I enjoyed…twice!) and The Sound of Music (which I do not like at all). This inability to tell how much lies between each value is one reason why other scales of measurement are usually preferred in statistics.
Continuous Data
Interval scale
- It has values at equal intervals that mean something (e.g., a thermometer might have intervals of 10 degrees).
- It offers labels, order, as well as a specific interval between each of its variable options.
Big Data – any kind of data that gives you two problems, a computational burden and a storage burden, is called big data.
To deal with the storage problem we use Hadoop.
To deal with the computational problem we use Spark.
Non-Big Data – data that is not big data, i.e., data that does not create a computational or storage burden.
2nd approach ----------------- You hire 20 people with feedback forms in a village with a population of 10,000; this gets exactly the information we need, but the whole process is costly and time-consuming.
2. Handling duplicates - create a new dataset by dropping all the duplicates (zero duplicates)
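A minimal pandas sketch of this step (the DataFrame and column names here are made-up examples):
import pandas as pd

df = pd.DataFrame({"id": [1, 1, 2, 3], "sales": [100, 100, 250, 300]})  # toy data with one duplicate row
df_clean = df.drop_duplicates()      # new dataset with all duplicate rows dropped
print(df_clean.duplicated().sum())   # 0 duplicates remain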
3. Outlier Analysis / Treatment -
3R technique – Rectify, Retain, Remove
Example:
While working with a company's data we found that some records look quite different from the others (e.g., unusually high sales compared to the rest). Here we are Rectifying:
we check the data with the company to see whether it is correct or wrong. If it is correct, we proceed. Here we are Retaining.
But if the company is not certain or sure whether the data is correct or wrong, we remove it. Here we are Removing.
Winsorization – this technique modifies the sample distribution of random variables by capping outliers (replacing them with boundary values rather than deleting them).
There are 2 capping methods – 1. IQR method (Q3 – Q1)
2. Gaussian method (mean and standard deviation)
Example 1: 90% winsorization means all data below the 5th percentile is set to the 5th percentile, and all data above the 95th percentile is set to the 95th percentile.
Example 2: Upper maximum value (100): all values above 100 are changed to 100.
Lower minimum value (50): all values below 50 are changed to 50.
Trimming – if alpha is set to 5%, then the lower and upper 5% of values are trimmed (removed).
Trimming is different from winsorization: winsorization caps extreme values, while trimming removes them.
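A minimal sketch of both techniques, assuming numpy and scipy are available (the sample values are made up):
import numpy as np
from scipy.stats.mstats import winsorize

x = np.array([1, 4, 5, 6, 7, 8, 9, 10, 11, 200])  # 200 is an outlier
x_w = winsorize(x, limits=[0.05, 0.05])   # 90% winsorization: cap the bottom and top 5%
lo, hi = np.percentile(x, [5, 95])
x_t = x[(x >= lo) & (x <= hi)]            # trimming: remove values outside the 5th-95th percentile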
4. Zero & Near Zero Variance feature – Usually ignores the column with same entries throughout or if
majority of entries are same.
e.g., All entries of a column called Country shows the name as USA
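A minimal pandas sketch for spotting such columns (the DataFrame and column names are hypothetical):
import pandas as pd

df = pd.DataFrame({"Country": ["USA"] * 5, "Sales": [10, 12, 9, 14, 11]})
constant_cols = [c for c in df.columns if df[c].nunique() == 1]  # zero-variance columns
df = df.drop(columns=constant_cols)   # drops "Country"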
5. Discretization – converting continuous data to discrete data (e.g., grouping a continuous variable into a fixed set of bins).
Binarization – converting continuous data into 2-category data (Yes or No, True or False, etc.).
Rounding – rounding to the nearest value, e.g., 5.6 becomes 6 & 5.4 becomes 5.
Binning – fixed-width binning & adaptive binning.
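A minimal pandas sketch of these ideas (the ages are made-up sample data):
import pandas as pd

ages = pd.Series([22, 25, 31, 38, 45, 52, 61, 70])
fixed = pd.cut(ages, bins=4)       # fixed-width binning: equal-width intervals
adaptive = pd.qcut(ages, q=4)      # adaptive binning: quantile-based, roughly equal counts per bin
is_senior = (ages >= 60)           # binarization: two categories around a threshold
rounded = pd.Series([5.6, 5.4]).round()  # rounding: 5.6 -> 6.0, 5.4 -> 5.0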
6. Missing Values
Imputation – a technique for replacing missing data with a substitute value so as to retain most of the information in the data set.
Imputation variants – MAR – Missing at Random
MNAR – Missing Not at Random
MCAR – Missing Completely at Random
Imputation Techniques –
Model Based Method – Maximum likelihood
– Multiple Imputation
Deletion Method – Simple strategies – Case-wise deletion or list-wise deletion or complete case analysis
– Pairwise deletion or available case analysis
Single Imputation Methods – (Most used method) – Mean, Median & Mode Imputation
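A minimal sketch of the most-used single imputation methods, using pandas (the column names are hypothetical):
import numpy as np
import pandas as pd

df = pd.DataFrame({"age": [25, np.nan, 31, 40], "city": ["Pune", "Delhi", None, "Delhi"]})
df["age"] = df["age"].fillna(df["age"].mean())        # mean imputation (use .median() for median)
df["city"] = df["city"].fillna(df["city"].mode()[0])  # mode imputation for categorical data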
7. Dummy variables – a dummy variable is a dichotomous variable (0, 1), where the value indicates the presence or absence of something.
- One-hot encoding: it converts categorical variables to numerical values (0, 1), which can improve the accuracy and predictions of the model.
- Label encoding: we replace each categorical value with a numerical value ranging between zero and the number of classes minus one. Ex: if we have six classes, we use 0, 1, 2, 3, 4 and 5.
- Dummy coding scheme: similar to one-hot encoding. This categorical data encoding method transforms the categorical variable into a set of binary variables, also known as dummy variables.
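A minimal encoding sketch with pandas (the column and its values are toy examples):
import pandas as pd

df = pd.DataFrame({"color": ["red", "green", "blue", "green"]})
onehot = pd.get_dummies(df["color"])                    # one-hot: one 0/1 column per category
dummies = pd.get_dummies(df["color"], drop_first=True)  # dummy coding: k categories -> k-1 columns
labels = df["color"].astype("category").cat.codes       # label encoding: integers 0..k-1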
8. Transformation – in order to make our distributions normal we apply different kinds of transformations.
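A minimal sketch of common normalizing transformations, assuming numpy and scipy (the data is made up; Box-Cox needs strictly positive values):
import numpy as np
from scipy import stats

x = np.array([1.0, 2.0, 4.0, 8.0, 16.0, 64.0])  # right-skewed toy data
x_log = np.log(x)              # log transform
x_sqrt = np.sqrt(x)            # square-root transform
x_bc, lam = stats.boxcox(x)    # Box-Cox chooses the best power transform (lambda)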
9. Standardization – a scaling technique that rescales the data so that the mean is 0 and the standard deviation is 1 (the z-score).
Normalization – the goal of normalization is to change the values of numeric columns in the dataset to a common scale, without distorting differences in the ranges of values or losing information.
Normalization rescales values to a user-defined range (commonly 0 to 1).
Syntax -
def norm_func(i):                              # norm_func is a function for min-max normalization
    x = (i - i.min()) / (i.max() - i.min())    # i means all the values of a column
    return x
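For comparison, a z-score standardization function in the same style (this helper is a sketch, not part of the original notes):
def std_func(i):
    z = (i - i.mean()) / i.std()   # rescales to mean 0 & standard deviation 1
    return z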
Exploratory Data Analysis (EDA) is an approach to analyzing data using visual techniques. It is used to discover trends and patterns, or to check assumptions, with the help of statistical summaries & graphical representations.
EDA is organized around 4 business moments of decision – central tendency, dispersion, skewness and kurtosis.
Mean – also called the average; it is influenced by outliers (use the median to handle this).
It also takes the magnitude of the scores into account.
Median – the middle-most value of the dataset; it is not influenced by outliers.
Mode – the most frequently repeated value in the dataset.
- If one value repeats the most, the data is called unimodal.
- If two values repeat the most, the data is called bimodal.
Central tendency – population parameters (Greek notation) & sample statistics (regular notation).
Variance – the average of the squared distances of each data point from the center (mean) is defined as variance.
We have two predefined formulas: 1st – for the population, 2nd – for a sample.
Disadvantage – in variance we square the distances, but along with the distance the unit also gets squared, so to get back the original units we use the standard deviation.
Standard Deviation – the typical difference between each data point & the mean.
- When the values in the data set are grouped close together, we get a smaller standard deviation.
- When the values in the data set are scattered or more spread out, we get a higher standard deviation.
Merit – it uses the original unit of measurement.
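A minimal numpy sketch of these ideas; the ddof argument switches between the population and sample formulas:
import numpy as np

data = np.array([4, 8, 6, 5, 3, 7])
var_pop = np.var(data)             # population variance (divide by n)
var_sample = np.var(data, ddof=1)  # sample variance (divide by n-1)
std = np.std(data, ddof=1)         # standard deviation: back in the original units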
We use the range when we have normal data.
Types of Kurtosis –
1. No kurtosis – mesokurtic distribution.
2. Negative kurtosis – platykurtic distribution: wide peak / thin tails.
3. Positive kurtosis – leptokurtic distribution: sharp (thin) peak / thick tails.
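A minimal scipy sketch; by default scipy reports excess kurtosis, which is about 0 for a normal (mesokurtic) distribution (the samples here are randomly generated):
import numpy as np
from scipy.stats import kurtosis

rng = np.random.default_rng(0)
print(kurtosis(rng.normal(size=10_000)))            # close to 0 -> mesokurtic
print(kurtosis(rng.standard_t(df=5, size=10_000)))  # positive -> leptokurtic (thick tails)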
FEATURE ENGINEERING
Feature Engineering – machine learning algorithms don't always work on raw data, so we need to convert that raw data into a format the algorithms can understand; this is called feature engineering.
It starts with your best guess about which features might influence the thing we are trying to predict.
It's an iterative process: 1. Create new features. 2. Add them to the model. 3. See if the results improve.
UN-SUPERVISED LEARNING
CLUSTERING / Segmentation - Types - Hierarchical Clustering & K-means Clustering
STP framework – Segmentation --- Targeting --- Positioning
-- Single linkage, also called nearest neighbour (minimum distance between members of 2 clusters).
-- Complete linkage, also called farthest neighbour (maximum distance between members of 2 clusters).
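A minimal scipy sketch of the two linkage choices (the sample points are made up):
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.array([[1, 2], [1, 4], [5, 8], [6, 8]])
Z_single = linkage(X, method="single")      # nearest-neighbour linkage
Z_complete = linkage(X, method="complete")  # farthest-neighbour linkage
labels = fcluster(Z_complete, t=2, criterion="maxclust")  # cut the tree into 2 clusters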