Intro To Statistics For Data Analysts
Statistics
Data analysis Program
Hi,
I’m Adi Cohen
● M.Sc. degree in Neuroscience
What comes to mind when you hear “Statistics”?
Data analysis example that appears to show that, on average, women used fewer medical services than men in 2021 (19% vs. 28%)
                                  Women   Men
Number having health insurance    2000    2000
Number using medical services     372     568
Users %                           19%     28%
By drilling down to the district level, you can see that there are no significant differences between genders.
If anything, women use more medical services.
                  Men                               Women
                  Insured   Using services   %      Insured   Using services   %
North District    1200      168              14%    1800      270              15%
South District    800       400              50%    200       102              51%
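The reversal between the overall figures and the district figures (a pattern often called Simpson's paradox) can be checked with a few lines of Python. This is a minimal sketch that only uses the counts from the two tables above; the function and variable names are illustrative.

```python
# (insured, users) per gender and district, copied from the tables above.
data = {
    "Men":   {"North": (1200, 168), "South": (800, 400)},
    "Women": {"North": (1800, 270), "South": (200, 102)},
}

def usage_rate(insured, users):
    """Share of insured people who used medical services."""
    return users / insured

def overall_rate(districts):
    """Usage rate after pooling all districts together."""
    insured = sum(i for i, _ in districts.values())
    users = sum(u for _, u in districts.values())
    return users / insured

for gender, districts in data.items():
    per_district = {d: round(usage_rate(*c), 2) for d, c in districts.items()}
    print(gender, per_district, "overall:", round(overall_rate(districts), 2))
```

Women have the higher rate in each district, yet the lower pooled rate, because most insured women live in the low-usage North District while a larger share of insured men live in the high-usage South District.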
X Y X Y X Y X Y
10.00 8.04 10.00 9.14 10.00 7.46 8.00 6.58
8.00 6.95 8.00 8.14 8.00 6.77 8.00 5.76
13.00 7.58 13.00 8.74 13.00 12.74 8.00 7.71
9.00 8.81 9.00 8.77 9.00 7.11 8.00 8.84
11.00 8.33 11.00 9.26 11.00 7.81 8.00 8.47
14.00 9.96 14.00 8.10 14.00 8.84 8.00 7.04
6.00 7.24 6.00 6.13 6.00 6.08 8.00 5.25
4.00 4.26 4.00 3.10 4.00 5.39 19.00 12.50
12.00 10.84 12.00 9.13 12.00 8.15 8.00 5.56
7.00 4.82 7.00 7.26 7.00 6.42 8.00 7.91
5.00 5.68 5.00 4.74 5.00 5.73 8.00 6.89
Average 8.06 8.06 8.06 8.06
Correlation 0.82 0.82 0.82 0.82
Standard Deviation 2.79 2.79 2.79 2.79
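The four X/Y pairs above appear to be the well-known data set called Anscombe's quartet, whose point is that very different data sets can share the same summary numbers. As a hedged sketch, the correlation row can be reproduced with a hand-written Pearson correlation (the helper name `pearson` is illustrative):

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Values copied column by column from the table above.
x_123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
x_4 = [8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8]
datasets = [
    (x_123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    (x_123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    (x_123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    (x_4, [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
]

correlations = [round(pearson(xs, ys), 2) for xs, ys in datasets]
print(correlations)
```

Plotting each pair as a scatter chart is what reveals how different the four sets really are, which the identical summary numbers hide.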
1 Arithmetic average
2 Median
3 Percentile
4 Mode
5 Standard deviation
6 Variance
7 Weighted average
Arithmetic Average
Definition: the sum of the values divided by their number
Example: 1,3,3,3,3,5,9,10,10
(1+3+3+3+3+5+9+10+10)/9 = 47/9 ≈ 5.2
Median
Definition: the 50th percentile – the value which divides the data set into two, having an equal number of values above it and below it
Example: 1,3,3,3,3,5,9,10,10
Median: 3 (the 5th of the 9 sorted values)
Example with an even number of values: 1,3,3,3,3,5,5,9,10,10
Median: 4 (the average of the two middle values, 3 and 5)
Percentile
Definition: the value below which a given percentage of the values in a data set fall
Example: 1,3,3,3,3,5,9,10,10
70th percentile: 9 (70% of 9 values is 6.3, which rounds up to the 7th item, and the 7th item is 9)
Mode
Definition: the value appearing most often in the sample
Example: 1,3,3,3,3,5,9,10,10
Mode: 3
Standard Deviation
Definition: a measure of how far the numeric values are spread around their average, based on their distances from the average
Example: 1,3,3,3,3,5,9,10,10 (average 5.2)
sqrt(((1-5.2)^2+(3-5.2)^2+(3-5.2)^2+(3-5.2)^2+(3-5.2)^2+(5-5.2)^2+(9-5.2)^2+(10-5.2)^2+(10-5.2)^2)/9) = 3.29
Variance
Definition: the standard deviation squared
Example: 1,3,3,3,3,5,9,10,10
Variance: 10.84
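The definitions above can be checked against the sample 1,3,3,3,3,5,9,10,10 with Python's standard `statistics` module. This is a sketch under two assumptions: the percentile uses the round-up ("nearest rank") rule from the percentile example, and the standard deviation divides by the number of values (the population form). The helper `nearest_rank_percentile` is an illustrative name, not a library function.

```python
import math
import statistics

values = [1, 3, 3, 3, 3, 5, 9, 10, 10]

def nearest_rank_percentile(data, percent):
    """Nearest-rank percentile: the ceil(percent% of n)-th value of the sorted data."""
    ordered = sorted(data)
    rank = max(1, math.ceil(percent / 100 * len(ordered)))
    return ordered[rank - 1]

mean = statistics.mean(values)            # sum of the values divided by their number
median = statistics.median(values)        # middle value of the sorted data
mode = statistics.mode(values)            # most frequent value
p70 = nearest_rank_percentile(values, 70)
std = statistics.pstdev(values)           # population standard deviation (divides by n)
variance = statistics.pvariance(values)   # population variance, equal to std squared

print(round(mean, 1), median, mode, p70, round(std, 2), round(variance, 2))
```

Spreadsheet percentile functions often interpolate between neighboring values, so they may return a slightly different percentile than the round-up rule used here.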
Weighted Average
Definition: an arithmetic average in which values are given different significance (“weight”).
The value of the weighted average is the sum of the products of each value and its weight, divided by the sum of the weights.
Example: grades weighted by study units
Student   Math grade   Math units   English grade   English units   Grade * units
John      86           5            56              5               86*5+56*5 = 710
Sara      94           3            68              3               94*3+68*3 = 486
Ben       66           3            100             5               66*3+100*5 = 698
Amy       78           4            82              4               78*4+82*4 = 640
Total                  15                           17              2534
Simple average of the eight grades: 78.75
Weighted average: 2534 / (17 + 15) ≈ 79.19
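As a small sketch, the simple and weighted averages of the grades table above can be computed side by side. The grades and study units are copied from the table; the data layout is just one possible choice.

```python
# (grade, study units) pairs per student: Math first, then English.
grades = {
    "John": [(86, 5), (56, 5)],
    "Sara": [(94, 3), (68, 3)],
    "Ben":  [(66, 3), (100, 5)],
    "Amy":  [(78, 4), (82, 4)],
}

pairs = [pair for student in grades.values() for pair in student]

# Simple average: every grade counts the same.
simple_average = sum(grade for grade, _ in pairs) / len(pairs)

# Weighted average: each grade is multiplied by its study units (the weight),
# and the total is divided by the sum of the weights.
weighted_average = sum(grade * units for grade, units in pairs) / sum(units for _, units in pairs)

print(simple_average, weighted_average)
```

The weighted result is higher than the simple one mainly because Ben's grade of 100 carries 5 units while his grade of 66 carries only 3.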
Statistics – basic terminology
Final Exercise Solution
Answers:
1. Average mother age: 25.54. Average father age: 28.9.
2. Height median: 52.
3. Weight 20th percentile: 2.768.
4. Mode of the number of cigarettes smoked by the mother: 0.
Mode of the number of cigarettes smoked by the father: 0.
5. Average head circumference: 34.59. Standard deviation of head circumference: 2.37.
6. Weighted average: 34.18.
The Road Ahead
Probability - Definition
Legend
P(A) – probability of an event A occurring
#A – the number of possible results in which event A occurs
#Ω – the total number of possible results
P(A) = #A / #Ω
Exercise
1. What is the probability of getting 2 in a fair dice roll?
2. What is the probability of getting an even number in a fair dice roll?
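The two exercise questions can be worked through by counting outcomes, following the legend (#A results for the event out of #Ω possible results). A minimal sketch, assuming a standard six-sided die; the `probability` helper is an illustrative name:

```python
from fractions import Fraction

outcomes = {1, 2, 3, 4, 5, 6}  # Ω: all possible results of one fair die roll

def probability(event):
    """P(A) = #A / #Ω, valid when all outcomes are equally likely."""
    favourable = {o for o in outcomes if event(o)}
    return Fraction(len(favourable), len(outcomes))

p_two = probability(lambda o: o == 2)        # exactly one favourable result
p_even = probability(lambda o: o % 2 == 0)   # favourable results: 2, 4, 6

print(p_two, p_even)
```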
Imagine you travel abroad and enter a casino. You decide to play a relatively simple game – the roulette.
[Figure: Red / Black]
The Gambler’s Fallacy
● Chance of receiving a positive diagnosis when checked by a pediatrician, for children with no ear infection: 10%
● Out of the 990 children with no ear infection, 99 will get a positive result when checked by a pediatrician.
A question - Given that a child was diagnosed with an ear infection, what is the chance of her actually having an infection?
Calculation - 9/(9 + 99) = 9/108 ≈ 0.083, where 9 is the number of children with an infection who get a positive result and 99 is the number of children without an infection who get a positive result.
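The calculation above divides the true positives by all positives. A short sketch of the same arithmetic follows; the 9 true positives are taken directly from the calculation, since the text that introduces them is not part of this section, and the variable names are illustrative.

```python
# Counts from the example above.
true_positives = 9            # children with an infection who get a positive diagnosis
healthy_children = 990        # children with no ear infection
false_positive_rate = 0.10    # chance of a positive diagnosis without an infection

false_positives = healthy_children * false_positive_rate   # 99 children

# Share of positive diagnoses that belong to children who really have an infection.
p_infection_given_positive = true_positives / (true_positives + false_positives)

print(round(p_infection_given_positive, 3))
```

Even with a seemingly low false positive rate, most positive diagnoses here come from healthy children, simply because healthy children are far more numerous.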
A Question in Probability
Total population
    Large Org: 0.75
        Satisfied: 0.80
        Unsatisfied: 0.20
    Small Org: 0.25
        Satisfied: 0.90
        Unsatisfied: 0.10
Class Exercise Solution
The probability of a satisfied small org member is: 0.25 * 0.9 = 0.225
The probability of a satisfied large org member is: 0.75 * 0.8 = 0.6
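The solution multiplies the probabilities along each branch of the tree. A minimal sketch of that rule, using only the branch values shown above (the dictionary names are illustrative):

```python
# Branch probabilities read from the tree.
p_org = {"Large Org": 0.75, "Small Org": 0.25}
p_satisfied_given_org = {"Large Org": 0.80, "Small Org": 0.90}

# Multiply along each branch: P(org and satisfied) = P(org) * P(satisfied | org).
p_joint = {org: p_org[org] * p_satisfied_given_org[org] for org in p_org}

# Adding the two "satisfied" branches gives the overall share of satisfied members.
p_satisfied = sum(p_joint.values())

print(p_joint, round(p_satisfied, 3))
```

Summing the branches is a natural follow-up question: 0.6 + 0.225 = 0.825 of the total population is satisfied.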
Try it yourself
Try it Yourself 1
Four blood types can be found in the population with the following distribution:
40% with blood type A
20% with blood type B
5% with blood type AB
Questions
1. What percent of the population has blood type O?
2. Those with blood type B can receive blood from those with O and B. What is the probability for a random donor
to donate blood to a patient with type B?
3. Those with blood type B can donate blood to those with AB and B. What is the probability for a random donor
with type B to donate blood to a random patient?
4. Those with blood type O can donate to everybody, but receive donation only from people with type O.
1. What is the probability of a type O donor to donate blood to a random patient?
2. What is the probability that a random donor will be able to donate blood to an O type patient?
Try it Yourself 2
A certain fictional country has rain on 0.34 of the days in a year.
On rainy days, there’s a probability of 0.5 for traffic jams. On non-rainy days, the probability is only 0.25.
If it’s raining and there are traffic jams, I’ll be late for work with probability 0.5. On non-rainy days with no jams, the
probability of me being late is only 0.125. On the rest of the days (rainy with no traffic jams, or not rainy with jams) the
probability of me being late is 0.25.
On a random day:
1. What is the probability of this being a not-rainy day, with traffic jams, and I’m not late for work?
2. What is the probability of me being late?
3. Given I was late for work today, what is the probability of this being a rainy day?
Solution 1
Questions
1. What percent of the population has blood type O? 35% = 100%-40%-20%-5%
2. Those with blood type B can receive blood from those with O and B. What is the probability for a random donor
to donate blood to a patient with type B? 55% = 35%+20%
3. Those with blood type B can donate blood to those with AB and B. What is the probability for a random donor
with type B to donate blood to a random patient? 25% = 5%+20%
4. Those with blood type O can donate to everybody, but receive donations only from people with type O.
1. What is the probability of an O type donor to donate blood to a random patient?
If O can donate to everyone, it’s 100% of the population
2. What is the probability that a random donor will be able to donate blood to an O type patient?
Only O donors can donate, and it’s 35% of the population
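The answers above all come from adding up population shares. The sketch below encodes only the compatibility rules stated in the exercise (for type B and type O donors), so it is not a full blood compatibility chart; the function names are illustrative.

```python
# Shares given in the exercise; type O is whatever remains.
share = {"A": 0.40, "B": 0.20, "AB": 0.05}
share["O"] = 1 - sum(share.values())

# Donor type -> recipient types, limited to the rules stated in the exercise.
can_donate_to = {
    "B": {"B", "AB"},
    "O": {"A", "B", "AB", "O"},
}

def p_random_patient_accepts(donor):
    """Probability that a random patient can receive blood from this donor type."""
    return sum(share[recipient] for recipient in can_donate_to[donor])

def p_random_donor_fits(recipient):
    """Probability that a random donor has a type that can give to this recipient."""
    return sum(share[donor] for donor, recipients in can_donate_to.items()
               if recipient in recipients)

print(round(share["O"], 2),
      round(p_random_donor_fits("B"), 2),
      round(p_random_patient_accepts("B"), 2),
      round(p_random_patient_accepts("O"), 2),
      round(p_random_donor_fits("O"), 2))
```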
Solution 2
Year: rainy 0.34, not rainy 0.66
1. What is the probability of this being a not-rainy day, with traffic jams, and I’m not late for work?
The probability of being late on such a day is 0.25, so the probability of not being late is 0.75:
0.66*0.25*0.75 ≈ 0.124
2. What is the probability of me being late?
0.66*0.75*0.125 + 0.66*0.25*0.25 + 0.34*0.5*0.25 + 0.34*0.5*0.5 = 0.230625
3. Given I was late for work today, what is the probability of this being a rainy day?
The probability of a rainy day on which I am late, divided by the probability of being late:
(0.34*0.5*0.25 + 0.34*0.5*0.5) / 0.230625 = 0.1275 / 0.230625 ≈ 0.553
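The three questions can be answered by walking the probability tree in code. This is a sketch that takes the lateness and traffic jam probabilities as stated in the exercise text and the rain probability 0.34 as used in the solution; the function and variable names are illustrative.

```python
p_rain = 0.34                                   # rain probability used in the solution
p_jam = {True: 0.5, False: 0.25}                # P(traffic jam | rainy?)
p_late = {                                      # P(late | rainy?, jam?)
    (True, True): 0.5,
    (False, False): 0.125,
    (True, False): 0.25,
    (False, True): 0.25,
}

def p_day(rainy, jam):
    """Probability of a day with the given weather and traffic."""
    p_weather = p_rain if rainy else 1 - p_rain
    p_traffic = p_jam[rainy] if jam else 1 - p_jam[rainy]
    return p_weather * p_traffic

# 1. Not rainy, traffic jams, not late.
q1 = p_day(False, True) * (1 - p_late[(False, True)])

# 2. Late on any kind of day: add the "late" leaf of every branch.
q2 = sum(p_day(r, j) * p_late[(r, j)] for r in (True, False) for j in (True, False))

# 3. Rainy given late: the rainy "late" leaves divided by all "late" leaves.
q3 = sum(p_day(True, j) * p_late[(True, j)] for j in (True, False)) / q2

print(round(q1, 3), round(q2, 3), round(q3, 3))
```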
The Road Ahead
Why do we want to identify Distribution?
Day-to-day Example
The weather forecast
"There’s an 80% chance of rain tomorrow"
● Characterizing the behavior enables estimating future behavior and preparing for it
Types of
Distribution
Uniform Distribution
Example 1: A train arrives at the station every 10 minutes. Assuming you arrive at the station at a random time, every waiting time between 0 and 10 minutes is equally likely.
Example 2: The probability of receiving any given number in a random dice roll is identical.
Normal Distribution
Gaussian distribution / Bell curve
This is the most important statistical distribution, used in all areas of science,
describing the distribution of values around their average value.
Normal Distribution
Example 1
Example 2
% of total sales
μ = 25
σ = 6
Normal Distribution, Calculation
Standard Score - Z
The standard score reflects the distance of the observation/desired outcome from the
average, measured in standard deviation units.
We use:
● X for the desired outcome
● μ for the average
● σ for the standard deviation
Z = (X - μ) / σ
Example: Z = 0.83
Probability from the Z table for Z = 0.83: 0.7967
Probability above it: 1 - 0.7967 = 0.2033
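The Z table lookup can be reproduced with `NormalDist` from Python's standard `statistics` module. A minimal sketch, using the standard score 0.83 from the example; the helper `z_score` is an illustrative name for the usual formula (X minus the average, divided by the standard deviation).

```python
from statistics import NormalDist

def z_score(x, mu, sigma):
    """Distance of x from the average, in standard deviation units."""
    return (x - mu) / sigma

z = 0.83                          # standard score from the example
p_below = NormalDist().cdf(z)     # area to the left of z under the standard normal curve
p_above = 1 - p_below             # area to the right of z

print(round(p_below, 4), round(p_above, 4))
```

`NormalDist()` with no arguments is the standard normal distribution (average 0, standard deviation 1), which is the distribution a Z table describes.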
Now you try it
A certain city has an average temperature in August of 25 degrees with a standard deviation of 6 degrees.
μ = 25
σ = 6
Based on the previous example, calculate the temperatures of the hottest 5% of days.
Now you try it Solution
1. We are looking for the top 5%, the hottest days.
2. To get back to the standard score Z, we must find the complementary probability:
1-0.05 = 0.95
3. We go back to the Z table and look for the standard score that gives a probability
of 0.95 -> 1.65
4. We solve the standard score formula for X: (X-25)/6 = 1.65, so X = 25 + 1.65*6
5. The result is 34.9 degrees, meaning that the hottest 5% of days
are above 34.9 degrees.
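The same steps can be sketched with `NormalDist.inv_cdf`, which goes from a probability back to a value and so replaces the reverse Z table lookup. The average and standard deviation are the ones given in the exercise.

```python
from statistics import NormalDist

august = NormalDist(mu=25, sigma=6)    # average and standard deviation of August temperatures
threshold = august.inv_cdf(1 - 0.05)   # temperature with 95% of the days below it

print(round(threshold, 1))
```

The unrounded result differs slightly from 25 + 1.65*6 because the Z table value 1.65 is itself rounded.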
The Road Ahead
1 Representative sample
2 Correlation
3 Statistical significance
4 Trendlines
Representative Sample
Data about part of the population that should represent the entire population and its behavior
Example:
● Your company runs a phone survey to test customer satisfaction
● You call customers using the phone numbers listed in your system
● The survey is conducted during the pollsters’ regular work hours (9am-5pm)
Representative Sample - Minimal Sample Size
The size of the minimal sample that allows reasonably accurate conclusions about the entire population to be drawn from the sample
https://www.surveysystem.com/sscalc.htm
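As a rough illustration of how such a minimal size can be estimated, here is a sketch of a common textbook formula for a proportion with a finite population correction. The formula, the default 95% confidence level (z = 1.96), the 5% margin of error and the function name are assumptions for illustration and are not taken from the slides; the linked calculator may use different conventions.

```python
import math

def sample_size(population, z=1.96, margin=0.05, p=0.5):
    """Estimated minimal sample size for a proportion (assumed textbook formula)."""
    n0 = z ** 2 * p * (1 - p) / margin ** 2    # size needed for a very large population
    n = n0 / (1 + (n0 - 1) / population)       # correction for a finite population
    return math.ceil(n)

print(sample_size(10_000))
```

Using p = 0.5 is the most cautious choice, since it gives the largest required sample.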
Example
Researchers found a new water source and mapped its population of marine organisms. The results indicated a limited list of medium and large organisms.
One year later, other researchers conducted a different study in the same water source and found, to their surprise, that dozens of types of small organisms inhabit it.
Selection Bias
Distortion of research data caused by a bias in the information collection method.
Ignoring a selection bias might lead to a wrong interpretation of the data and to false results.
Use Case 1
Correlation
A statistical index evaluating the consistency of the relationship between quantitative variables, i.e., whether a change in one variable is consistently accompanied by a change in the other.
A correlation value of 1 indicates a full positive correlation, and -1 a full negative correlation. A value of 0 indicates no linear relation at all.
[Figure: Import / Export]
Use Case 2
Use Case 3
Statistical Significance
Alpha is the maximal risk of reaching a false conclusion (false positive) that we are willing to accept as part of the analysis.
P-value is the probability of getting results at least as extreme as the ones observed if there were no real effect. A P-value below alpha is considered statistically significant.
TTEST(B2:B8,C2:C8,2,1) (two-tailed, paired test)
Day   Estimated entries   Actual entries
1     100                 102
2     100                 85
5     100                 7
15    100                 38
16    100                 71
17    100                 10
18    100                 47
20    100                 15
21    100                 47
22    100                 54
23    100                 23
24    100                 76
25    100                 106
P-values shown on the slide: 0.090, 0.001, 0.000
Trend
From the dictionary: direction, tendency
[Figure: Upward trend / Downward trend / No trend]
Trendlines
A trendline reflects the aggregated behavior of a data set. Trendlines are useful for a variety of applications, from optimizing an advertising campaign to data monitoring.
Types of trendlines:
● Linear
Moving Average
Calculation: choose a period over which to calculate an average, then repeat the calculation for each day while looking back over that period.
Advantages:
● Easy to calculate
● Flexible in the period chosen
● Provides a clear view of the data
Disadvantages:
● Choosing the period might cause bias
● Describes the past and cannot predict the future
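The calculation described above can be sketched in a few lines. The daily values below are hypothetical and only illustrate the look-back window; the function name is illustrative as well.

```python
def moving_average(values, period):
    """Average of the last `period` values, for each day that has a full look-back window."""
    if period < 1:
        raise ValueError("period must be at least 1")
    return [
        sum(values[i - period + 1 : i + 1]) / period
        for i in range(period - 1, len(values))
    ]

daily_entries = [10, 20, 30, 40, 50]   # hypothetical daily values, not from the slides
smoothed = moving_average(daily_entries, 3)
print(smoothed)
```

The first `period - 1` days get no value because they do not have a full period behind them, and a longer period gives a smoother but slower-reacting line, which is the period bias mentioned in the disadvantages.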
The Road Ahead