FOR
DATA SCIENCE
Classification
Helps in segmenting, say, customers into similar groups based on their characteristics.
Statistics - Association
Methods
Which items match together?
Predictive Modeling
Methods
Different
types of
data
QUALITATIVE DATA (ATTRIBUTE)
Qualitative
Vs
Quantitative
QUANTITATIVE DATA
Ordinal variable
It has all the properties of nominal data and additionally can be ordered on a scale. However, the differences between the values are not really known. For example, is the difference between “OK” and “Unhappy” the same as the difference between “Very Happy” and “Happy”? We can’t say.
data
Example of an ordinal variable: unsatisfied, satisfied, very satisfied
Interval
Interval scales are numeric scales in which we know both the order and the exact differences between the values. For example, with temperature, the difference between 60 and 50 degrees is a measurable 10 degrees, as is the difference between 80 and 70 degrees. “Interval” itself means “space in between,” which is the important thing to remember: interval scales not only tell us about order, but also about the value between each item.
Ratio
It is similar to interval data; however, for data to be considered ratio data it must have a true zero, meaning that zero represents the complete absence of the quantity and negative values are not possible. An example of ratio data is measurements of height, be that in centimetres, metres, inches or feet. It is not possible to have a negative height. Comparing this to temperature makes the difference between interval and ratio data (which may be a little confusing at first!) easier to see: it is possible for the temperature to be -10 degrees, but nothing can be -10 inches tall.
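One practical consequence of the true zero can be sketched in Python. This is only an illustration with made-up values (the function names are hypothetical): on an interval scale the ratio between two values depends on the unit, while on a ratio scale it does not.

```python
def celsius_to_fahrenheit(celsius):
    """Convert a temperature from degrees Celsius to degrees Fahrenheit."""
    return celsius * 9 / 5 + 32


def cm_to_inches(cm):
    """Convert a length from centimetres to inches."""
    return cm / 2.54


# Interval scale (temperature): the ratio changes with the unit,
# so "20 degrees is twice as hot as 10 degrees" is not a meaningful claim.
ratio_celsius = 20 / 10
ratio_fahrenheit = celsius_to_fahrenheit(20) / celsius_to_fahrenheit(10)

# Ratio scale (height): the ratio is the same in every unit,
# so "200 cm is twice as tall as 100 cm" is meaningful.
ratio_cm = 200 / 100
ratio_inches = cm_to_inches(200) / cm_to_inches(100)

print("temperature ratios:", ratio_celsius, ratio_fahrenheit)
print("height ratios:", ratio_cm, ratio_inches)
```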
Discrete type
Data
location, length, or type.
Attributes
the features of a data set. For example you
Data
Sources Secondary
Data
1.9 Billion
Data that you get from third-party sources such as government departments, Gartner, McKinsey, etc.
Types of
Statistics
Descriptive Statistics
Descriptive statistics uses the data to provide descriptions of the population, either through numerical calculations or graphs.
Data Summarisation, Graphs/Charts, Tables
Inferential Statistics
Inferential statistics takes data from a sample and makes inferences about the larger population from which the sample was drawn. Because the goal of inferential statistics is to draw conclusions from a sample and generalize them to a population, we need to have confidence that our sample accurately reflects the population.
Drawing conclusions about the population based on the inferences from the sample
Common Terms
POPULATION
SAMPLE
PARAMETER
STATISTIC
Frequency
5 times): 2, 2, 3, 4, 6, 9, 9, 8, 5, 1, 1, 9, 9, 0,9, 6, 1.
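As a rough sketch, the list of values above can be tallied in Python with `collections.Counter`; the variable names are only illustrative.

```python
from collections import Counter

# The values listed on the slide.
values = [2, 2, 3, 4, 6, 9, 9, 8, 5, 1, 1, 9, 9, 0, 9, 6, 1]

# Counter maps each distinct value to the number of times it occurs.
frequency = Counter(values)

for value in sorted(frequency):
    print(value, "occurs", frequency[value], "time(s)")
```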
Central Tendency
This happens when you measure items of the same kind. You will realise that most of the items are clustered around the central value or the middle item. This value
Median
For example, take these house prices: 25, 27, 27, 29, and 250.
There are a total of 5 numbers. With the values sorted, the median is the ((n+1)/2)th value. Thus, the median is the ((5+1)/2)th value, that is the 3rd value, which is 27.
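The house price example can be checked with a short Python sketch. The position rule used here only applies when the number of values is odd; comparing against the mean is an added illustration of why the median is useful when one value (250) is extreme.

```python
from statistics import mean, median

prices = [25, 27, 27, 29, 250]

# Position rule from the text, valid when n is odd.
n = len(prices)
position = (n + 1) // 2                      # 1-based position of the median
median_by_rule = sorted(prices)[position - 1]

print("median by rule:", median_by_rule)
print("median from the statistics module:", median(prices))
print("mean, pulled upward by the 250 value:", mean(prices))
```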
Mode
That is the item that occurs most
often.
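A minimal Python sketch of the mode, reusing the house prices from the median example. The second list is made-up data showing that a data set can have more than one mode.

```python
from statistics import mode, multimode

prices = [25, 27, 27, 29, 250]
most_common_price = mode(prices)        # the single most frequent value

# Hypothetical data with a tie: both 1 and 2 occur twice.
tied_values = [1, 1, 2, 2, 3]
all_modes = multimode(tied_values)      # every value tied for most frequent

print(most_common_price, all_modes)
```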
Range
Inter-Quartile Range (IQR)
The Five Number Summary
The Empirical Rule
Chebyshev Rule
Definition
The variance (σ²) is a measure of how far each value in the data set is from the mean. Here is how it is defined:
• Subtract the mean from each value in the data. This gives you a measure of the distance of each value from the mean.
• Square each of these distances (so that they are all positive values), and add all of the squares together.
• Divide the sum of the squares by the number of values in the data set.
Variance Note
However, because of this squaring, the variance is no longer in the same unit of measurement as the original data. Taking the root of the variance means the standard deviation is restored to the original unit of measure and is therefore much easier to interpret.
TOTAL DATA SCIENCE PITCH DECK V 1.0
The standard deviation (σ) is
simply the (positive) square root of
the variance.
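The variance steps and the square root can be sketched in Python as follows. The data values are made up for illustration, and the result is compared against `statistics.pvariance` and `statistics.pstdev`, which compute the population versions (dividing by the number of values, as in the definition above).

```python
from math import sqrt
from statistics import pstdev, pvariance

# Illustrative data, not taken from the slides.
data = [2, 4, 4, 4, 5, 5, 7, 9]

# Step 1: subtract the mean from each value.
mean_value = sum(data) / len(data)
distances = [x - mean_value for x in data]

# Step 2: square each distance and add the squares together.
sum_of_squares = sum(d ** 2 for d in distances)

# Step 3: divide by the number of values to get the variance.
variance = sum_of_squares / len(data)

# The standard deviation is the positive square root of the variance.
std_dev = sqrt(variance)

print("variance:", variance, "standard deviation:", std_dev)
print("library check:", pvariance(data), pstdev(data))
```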
Coefficient of
Variation
Sample Example
The weights of the Baltimore Bullets professional football team have a mean of 224 pounds with a standard deviation of 18 pounds, while the mean weight and standard deviation of their
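A minimal sketch of the calculation for the team described above, assuming the usual definition of the coefficient of variation as the standard deviation divided by the mean, expressed as a percentage. The function name is hypothetical, and only the first team's figures are used because the second set is not given in the text.

```python
def coefficient_of_variation(mean_value, std_dev):
    """Return the coefficient of variation as a percentage of the mean."""
    return std_dev / mean_value * 100


# Figures from the example: mean 224 pounds, standard deviation 18 pounds.
cv_team = coefficient_of_variation(224, 18)
print(f"Coefficient of variation: {cv_team:.2f}%")
```

Because the result is a unit-free percentage, it can be compared across groups measured on different scales or with very different means.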
Boxplot
Quartile Range and the Mean.
It can be used to show the distribution of the data and particularly spot outliers easily.
Boxplot
Quartiles
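The numbers behind a boxplot can be sketched in Python. This example reuses the house prices from the median section. It assumes the "inclusive" quartile method of `statistics.quantiles` and the common 1.5 × IQR convention for flagging outliers; neither choice is stated in the slides.

```python
from statistics import quantiles

data = sorted([25, 27, 27, 29, 250])

# Quartiles split the sorted data into four parts.
q1, q2, q3 = quantiles(data, n=4, method="inclusive")
iqr = q3 - q1

# Five number summary: minimum, Q1, median, Q3, maximum.
five_number_summary = (data[0], q1, q2, q3, data[-1])

# Assumed convention: points beyond 1.5 * IQR from the box are outliers.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = [x for x in data if x < lower_fence or x > upper_fence]

print("five number summary:", five_number_summary)
print("IQR:", iqr, "outliers:", outliers)
```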
The Empirical Rule
• Approximately 68% of the data in a bell-shaped distribution lies within one standard deviation of the mean, or µ ± 1σ
• Approximately 95% of the data in a bell-shaped distribution lies within two standard deviations of the mean, or µ ± 2σ
• Approximately 99.7% of the data in a bell-shaped distribution lies within three standard deviations of the mean, or µ ± 3σ
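These percentages can be checked against an exact normal distribution with Python's `statistics.NormalDist`. This is a sketch for a standard normal (mean 0, standard deviation 1); real data that is only roughly bell-shaped will match the rule approximately.

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0, sigma=1)

# Share of the distribution within k standard deviations of the mean.
coverage = {
    k: standard_normal.cdf(k) - standard_normal.cdf(-k)
    for k in (1, 2, 3)
}

for k, share in coverage.items():
    print(f"within {k} standard deviation(s): {share:.4f}")
```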
Correlation
A correlation describes
the degree of
relationship between
two variables.
Some correlation
variables you can
measure
Coefficient
correlation coefficient value goes
If your caloric intake and your weight are correlated, it doesn't necessarily mean that caloric intake is causing your weight. It could be, but your weight might also be caused by other factors such as genetics, lack of exercise, excessive exercise, or something else.
Types of Correlation
Mainly four types of correlations:
Pearson correlation (widely used),
Kendall rank correlation,
Spearman correlation,
Point-Biserial correlation.
Pearson r correlation is the most widely
used correlation statistic to measure the
degree of the relationship between
linearly related variables. For example, in
the stock market, if we want to measure
measure the degree of relationship between the two.
rxy = Pearson r correlation coefficient between x and y
n = number of observations
xi = value of x (for ith observation)
yi = value of y (for ith observation)
In Python: corr()
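The formula itself did not survive in the text, so the following is a sketch using the standard sum-based form of Pearson's r, written with the symbols defined above (n, xi, yi). The function name and the data are hypothetical. In practice the pandas `DataFrame.corr()` method, which uses Pearson by default, is probably what the `corr()` note refers to.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    sum_x, sum_y = sum(x), sum(y)
    sum_xy = sum(xi * yi for xi, yi in zip(x, y))
    sum_x2 = sum(xi ** 2 for xi in x)
    sum_y2 = sum(yi ** 2 for yi in y)

    numerator = n * sum_xy - sum_x * sum_y
    denominator = sqrt(n * sum_x2 - sum_x ** 2) * sqrt(n * sum_y2 - sum_y ** 2)
    return numerator / denominator


# Made-up observations for illustration.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
r_xy = pearson_r(x, y)

# A perfectly linear relationship gives r = 1.
r_perfect = pearson_r(x, [2 * xi for xi in x])

print(round(r_xy, 4), r_perfect)
```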
Types of research questions a Pearson correlation
can examine:
Probability Concepts
EXPERIMENT
The process undertaken to obtain our outcome. For example, the process of tossing a coin is our experiment.
SAMPLE SPACE
The set of all outcomes of an experiment. For example: obtaining a head on the first trial, a tail on the second trial, a tail again on the 3rd trial, a head on the 4th trial, etc. All these combine to form the sample space.
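For a small experiment, the sample space can be enumerated directly. This sketch assumes an experiment of two coin tosses (the number of tosses is an illustrative choice) and uses `itertools.product` to list every possible sequence of heads and tails.

```python
from itertools import product

# Each toss can land on heads ("H") or tails ("T").
single_toss_outcomes = ["H", "T"]

# Sample space of an experiment consisting of two tosses.
number_of_tosses = 2
sample_space = list(product(single_toss_outcomes, repeat=number_of_tosses))

for outcome in sample_space:
    print(outcome)
```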
Probability is between 0 and 1
Probability
expression
Mutually Exclusive Events
Two events are mutually exclusive or disjoint if they cannot both occur at the same time. A clear example is the set of outcomes of a single coin toss, which can result in either heads or tails, but not both.
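If events are represented as sets of outcomes, mutual exclusivity amounts to the sets having no outcome in common. The sketch below uses a hypothetical helper name, and the die events in the second check are made up for illustration.

```python
def are_mutually_exclusive(event_a, event_b):
    """Two events are mutually exclusive if they share no outcomes."""
    return event_a.isdisjoint(event_b)


# Single coin toss: heads and tails cannot both occur.
heads = {"H"}
tails = {"T"}
coin_result = are_mutually_exclusive(heads, tails)

# Illustrative die events: "even" and "greater than 3" share 4 and 6,
# so they are not mutually exclusive.
even = {2, 4, 6}
greater_than_three = {4, 5, 6}
die_result = are_mutually_exclusive(even, greater_than_three)

print(coin_result, die_result)
```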
A die is rolled. Let us define
event E1 as the set of possible
outcomes where the number on
the face of the die is even and
event E2 as the set of possible
outcomes where the number on
Independent Events
Example
A jar contains 3 red, 5 green, 2 blue and 6 yellow marbles. A marble is chosen at random from the jar. After replacing it, a second marble is chosen. What is the probability of choosing a green and then a yellow marble?
Multiplication Rule (independent events)
If two events A and B are independent, then this rule says that the probability of the simultaneous occurrence of A and B is given by the product of the probabilities of A and B: P(A and B) = P(A) × P(B).
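Applied to the marble example, the rule can be sketched in Python with exact fractions. Because the first marble is replaced, the two draws are independent and the jar holds 16 marbles both times.

```python
from fractions import Fraction

marbles = {"red": 3, "green": 5, "blue": 2, "yellow": 6}
total = sum(marbles.values())                 # 16 marbles in the jar

# With replacement, the second draw does not depend on the first.
p_green = Fraction(marbles["green"], total)
p_yellow = Fraction(marbles["yellow"], total)

# Multiplication rule for independent events.
p_green_then_yellow = p_green * p_yellow

print(p_green_then_yellow, float(p_green_then_yellow))
```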
Rules for
Computing
probability that Event A occurs times the probability that Event B
occurs, given that A has occurred.
It is an extension of conditional
probability.
Bayes’ Theorem
Classify emails as spam or not
spam given the words in the
email.
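A minimal sketch of how Bayes' theorem applies to the spam example, for a single word. All probabilities below are hypothetical numbers chosen for illustration, and the function name is an assumption. A real spam filter would combine evidence from many words.

```python
def posterior_spam(p_spam, p_word_given_spam, p_word_given_not_spam):
    """P(spam | word) from Bayes' theorem."""
    p_not_spam = 1 - p_spam
    # Total probability of seeing the word in any email.
    p_word = p_word_given_spam * p_spam + p_word_given_not_spam * p_not_spam
    return p_word_given_spam * p_spam / p_word


# Hypothetical figures: 20% of emails are spam; the word appears in
# 50% of spam emails and in 5% of non-spam emails.
p_spam_given_word = posterior_spam(
    p_spam=0.2,
    p_word_given_spam=0.5,
    p_word_given_not_spam=0.05,
)

print(round(p_spam_given_word, 4))
```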
Chebyshev
Rule
Normal
Distribution
Skewed Distribution
Let's say these diagrams represent the salaries of workers of TDS in $1000s. In the left diagram, most of the workers earn between 80,000 and 120,000, with as many as 25 people earning 120,000, while few earn below 80,000. This shows unfairness in the distribution of the salary. In the same way, the diagram on the right shows the majority earning between 20,000 and 30,000, with few earning 40,000 and above. However, in the middle diagram, the salary is evenly distributed, with as many people earning high salaries as those earning low salaries.
In Python: displot/kde, to see the distribution of a particular variable
Central Limit
Theorem
The sampling distribution of the mean of any independent random variable will be approximately normal, provided the sample size is large enough.
This is the result of rolling the simulated die 100 times.
Plotting the resulting distribution of sample means.
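The die simulation can be sketched in Python as follows. The sample size, the number of samples, and the random seed are assumptions made for this illustration and are not taken from the slides. Individual rolls are uniformly distributed, yet the sample means cluster around 3.5.

```python
import random
from statistics import mean, stdev

rng = random.Random(42)     # fixed seed so the run is repeatable

SAMPLE_SIZE = 30            # rolls per sample (assumed)
NUM_SAMPLES = 1000          # number of samples drawn (assumed)


def roll_die():
    """Simulate one roll of a fair six-sided die."""
    return rng.randint(1, 6)


# Each sample mean is the average of SAMPLE_SIZE independent rolls.
sample_means = [
    mean([roll_die() for _ in range(SAMPLE_SIZE)])
    for _ in range(NUM_SAMPLES)
]

print("mean of sample means:", round(mean(sample_means), 3))
print("spread of sample means:", round(stdev(sample_means), 3))
```

Plotting a histogram of `sample_means` should show the bell shape described above.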
Hypothesis Testing
Hypothesis testing is the use of statistics to determine the
Example
The 600ml coke bottle contains 600ml of
coke.
But is it really true that the 600ml Coke bottle contains 600ml of Coke?
Null Hypothesis
the status quo
The 600ml coke bottle contains 600ml of coke.
Alternate Hypothesis
If we fail to reject the null hypothesis, it does not mean that we have proven the null hypothesis is true; it means we do not have enough evidence to reject it.
Type 1 and Type 2 errors
Therefore, the smaller the p-value, the stronger the evidence against the null hypothesis (the more statistically “significant” your results). That's why you have to set a significance level in order to have something to compare your results with.
P-Value
Alpha levels are controlled by the researcher and are related to confidence levels.
You get an alpha level by subtracting your confidence level from 100%. For
example, if you want to be 98 percent confident in your research, the alpha level
would be 2% (100% – 98%). When you run the hypothesis test, the test will give
you a value for p. Compare that value to your chosen alpha level. For example, let’s
say you chose an alpha level of 5% (0.05). If the results from the test give you:
A small p (≤ 0.05): reject the null hypothesis. This is strong evidence that the null hypothesis is invalid.
A large p (> 0.05) means the evidence against the null hypothesis is weak, so you do not reject the null.
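The decision rule can be sketched in Python using the Coke bottle example. The fill volumes and the population standard deviation are hypothetical. A two-sided z-test is assumed here only to keep the example within the standard library; with an unknown standard deviation a t-test would be the more usual choice.

```python
from statistics import NormalDist, mean


def two_sided_p_value(sample, mu0, sigma):
    """Two-sided z-test p-value, assuming the population sigma is known."""
    n = len(sample)
    z = (mean(sample) - mu0) / (sigma / n ** 0.5)
    return 2 * (1 - NormalDist().cdf(abs(z)))


def decide(p_value, alpha=0.05):
    """Compare the p-value with the chosen alpha level."""
    return "reject H0" if p_value <= alpha else "fail to reject H0"


# Hypothetical measured volumes (ml) of ten bottles labelled 600ml.
volumes = [598, 601, 597, 599, 600, 596, 598, 599, 597, 600]

# H0: the mean volume is 600ml. Assumed known sigma of 2ml.
p_value = two_sided_p_value(volumes, mu0=600, sigma=2.0)
decision = decide(p_value, alpha=0.05)

print(round(p_value, 4), decision)
```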