
Introduction to

Statistics
Data analysis Program
Hi,
I’m Adi Cohen
● M.Sc. degree in Neuroscience
● Researching human MRI properties and their relation to anxiety through computational analysis
● Trainer at Elevation: Python, SQL, Power BI


The Road Ahead

● The power and the risks of statistics | 10 min.
● Basic terminology | 30 min.
  Average, weighted average, median, percentile, mode, standard deviation, variation
  Practice yourself
● A taste of probability | 50 min.
  Probability of independent events vs. probability of dependent events
  Practice yourself
● Distribution | 40 min.
  Importance of distributions and how to use normal distribution
  Practice yourself
● Additional concepts | 60 min.
  Correlation, statistical significance, representative sample, trendlines
  Practice yourself
● Final practice | 30 min.
What comes to mind when you hear “Statistics”?
Scan the code or click the link and answer in one word
Data analysis example that appears to show that, on average, women used fewer medical services than men in 2021 (19% vs. 28%)

                                  Women    Men
Number having health insurance    2000     2000
Number using medical services     372      568
Users %                           19%      28%

By drilling down to the district level, you can see that there are no significant differences between genders.
If anything, women use more medical services.

                  Men                                        Women
                  Number having      Number using            Number having      Number using
                  health insurance   medical services   %    health insurance   medical services   %
North District    1200               168                14%  1800               270                15%
South District    800                400                50%  200                102                51%
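The drill-down above can be reproduced with a short sketch. It only uses the counts from the two tables; the dictionary layout and the helper name `usage_rate` are illustrative choices, not part of the original slides.

```python
# Counts taken from the tables above: (number insured, number using medical services)
data = {
    "North": {"Men": (1200, 168), "Women": (1800, 270)},
    "South": {"Men": (800, 400), "Women": (200, 102)},
}


def usage_rate(insured, users):
    """Share of insured people who used medical services."""
    return users / insured


# District-level rates
district_rates = {
    (district, gender): usage_rate(*counts)
    for district, genders in data.items()
    for gender, counts in genders.items()
}

# Country-level rates: add up the counts first, then divide
overall_rates = {}
for gender in ("Men", "Women"):
    insured = sum(data[d][gender][0] for d in data)
    users = sum(data[d][gender][1] for d in data)
    overall_rates[gender] = usage_rate(insured, users)

for key, rate in district_rates.items():
    print(key, f"{rate:.0%}")
for gender, rate in overall_rates.items():
    print("Overall", gender, f"{rate:.0%}")
```

Women have the higher rate in each district, yet the lower rate overall, because most women live in the low-usage North District and most men do not.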
Data set 1        Data set 2        Data set 3        Data set 4
X       Y         X       Y         X       Y         X       Y
10.00   8.04      10.00   9.14      10.00   7.46      8.00    6.58
8.00    6.95      8.00    8.14      8.00    6.77      8.00    5.76
13.00   7.58      13.00   8.74      13.00   12.74     8.00    7.71
9.00    8.81      9.00    8.77      9.00    7.11      8.00    8.84
11.00   8.33      11.00   9.26      11.00   7.81      8.00    8.47
14.00   9.96      14.00   8.10      14.00   8.84      8.00    7.04
6.00    7.24      6.00    6.13      6.00    6.08      8.00    5.25
4.00    4.26      4.00    3.10      4.00    5.39      19.00   12.50
12.00   10.84     12.00   9.13      12.00   8.15      8.00    5.56
7.00    4.82      7.00    7.26      7.00    6.42      8.00    7.91
5.00    5.68      5.00    4.74      5.00    5.73      8.00    6.89

                     Data set 1   Data set 2   Data set 3   Data set 4
Average              8.06         8.06         8.06         8.06
Correlation          0.82         0.82         0.82         0.82
Standard Deviation   2.79         2.79         2.79         2.79

From a 1973 article by Francis Anscombe


Basic Terminology
Basic Terms For Review
1 Average
2 Median
3 Percentile
4 Mode
5 Standard deviation
6 Variation
7 Weighted average
1 Arithmetic Average
Definition: sum of the values divided by their number
1,3,3,3,3,5,9,10,10
(1+3+3+3+3+5+9+10+10)/9 = 5.2
2 Median
Definition: the 50th percentile, the value which divides the data set into two, having an equal number of values above it and below it
1,3,3,3,3,5,9,10,10
Median: 3
When the number of values is even, the median is the average of the 2 central values
1,3,3,3,3,5,5,9,10,10
Median: (3+5)/2 = 4
3 Percentile
Definition: a value that a certain percentage of a set of values is equal to or lower than
1,3,3,3,3,5,9,10,10
70th percentile: 9
(70% of 9 values is 6.3, so we take the 7th item, which is 9)
4 Mode
Definition: the value appearing most often in the sample
1,3,3,3,3,5,9,10,10
Mode: 3
5 Standard deviation
Definition: the spread of numeric values around their average, depending on their distance from the average
1,3,3,3,3,5,9,10,10
sqrt(((1-5.2)^2+(3-5.2)^2+(3-5.2)^2+(3-5.2)^2+(3-5.2)^2+(5-5.2)^2+(9-5.2)^2+(10-5.2)^2+(10-5.2)^2)/9) = 3.29
6 Variation (variance)
Definition: standard deviation squared
1,3,3,3,3,5,9,10,10
Variation: 10.84
7 Weighted average
Definition: an arithmetic average in which values are given different significance (“weight”).
The value of the weighted average is the sum of the products of each value and its weight, divided by the sum of the weights.
The weights here are the study units.

Student name   Math grade   Math study units   English grade   English study units   Grade * units
John           86           5                  56              5                     86*5+56*5 = 710
Sara           94           3                  68              3                     94*3+68*3 = 486
Ben            66           3                  100             5                     66*3+100*5 = 698
Amy            78           4                  82              4                     78*4+82*4 = 640
Total                       15                                 17                    2534

Average (unweighted): 78.75
Weighted average: 2534 / (17 + 15) = 79.19
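A minimal sketch of the weighted-average calculation, using the grades and study units from the student table. The tuple layout and variable names are illustrative assumptions.

```python
# (student, math grade, math units, English grade, English units) from the table
students = [
    ("John", 86, 5, 56, 5),
    ("Sara", 94, 3, 68, 3),
    ("Ben", 66, 3, 100, 5),
    ("Amy", 78, 4, 82, 4),
]

grades, weights = [], []
for _name, math_grade, math_units, english_grade, english_units in students:
    grades += [math_grade, english_grade]
    weights += [math_units, english_units]

# Plain average: every grade counts the same
simple_average = sum(grades) / len(grades)

# Weighted average: each grade counts in proportion to its study units
weighted_sum = sum(grade * weight for grade, weight in zip(grades, weights))
weighted_average = weighted_sum / sum(weights)

print(simple_average, weighted_sum, sum(weights), round(weighted_average, 2))
```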
Statistics – basic terminology

● Average (AVG): sum of the values divided by their number
● Weighted average: arithmetic average with different weights for values
● Median (1/2): the value which divides the data set into two, having an equal number of values above it and below it
● Percentile (%): the value below which a certain percent of a value group falls
● Mode (CNT): the value appearing most in the sample
● Standard deviation (STD): the spread of numeric values around their average, depending on their distance from their average
● Variation (VAR): standard deviation squared
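The terms above can be checked on the sample 1,3,3,3,3,5,9,10,10 with Python's standard `statistics` module. This is a sketch: `nearest_rank_percentile` is a hypothetical helper written to mirror the "70% of 9 values is 6.3, take the 7th item" rule from the slides, and other tools may interpolate percentiles differently. The population versions `pstdev` and `pvariance` are used because the slides divide by the number of values (9).

```python
import math
import statistics

values = [1, 3, 3, 3, 3, 5, 9, 10, 10]


def nearest_rank_percentile(data, percent):
    """Percentile by the nearest-rank rule: percent% of the count, rounded up, picks the item."""
    ordered = sorted(data)
    rank = max(1, math.ceil(percent / 100 * len(ordered)))
    return ordered[rank - 1]


average = statistics.mean(values)
median = statistics.median(values)
median_even = statistics.median([1, 3, 3, 3, 3, 5, 5, 9, 10, 10])  # even count: mean of the 2 central values
percentile_70 = nearest_rank_percentile(values, 70)
mode = statistics.mode(values)
std_dev = statistics.pstdev(values)      # population standard deviation (divides by n)
variance = statistics.pvariance(values)  # standard deviation squared

print(round(average, 2), median, median_even, percentile_70, mode,
      round(std_dev, 2), round(variance, 2))
```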
When to use average & when to use median?
Use case 1
● Let’s consider the results of a customer satisfaction questionnaire from 9 customers on a 1-10 scale: 1, 3, 3, 3, 3, 5, 9, 10, 10
● The feedback average is 5.22
● Past experience shows that customers who give a score of 3 or below tend to abandon the company
When to use average & when to use median?
Use case 2
When to use average and when to use median?
When the sample fits a normal distribution, the average is sufficient
Final Exercise
This data set contains newborn measurements alongside information about the age and smoking habits of their parents.
Calculate the following:
1. Average age of mother and average age of father
2. Height median for newborns
3. 20th percentile of newborn weights
4. Mode for number of cigarettes smoked by mother and by father
5. Average and standard deviation of newborn head circumference
6. Weighted average of newborn head circumference, with the weight being the number of cigarettes per day the mother smokes

Link to file
Final Exercise Solution
Answers:
1. Average mother age: 25.54. Average father age: 28.9
2. Height median: 52
3. Weight 20th percentile: 2.768
4. Mode of number of cigarettes smoked by mother: 0. Mode of number of cigarettes smoked by father: 0.
5. Average head circumference: 34.59. Standard deviation for head circumference: 2.37.
6. Weighted average: 34.18.

Link to file
A Taste of Probability
Probability - Definition

Probability is a numeric expression of the likelihood of a certain event occurring.

An event’s probability can be anywhere between 0 and 1.

An impossible event has probability 0, and a certain event has probability 1.

Exercise
1. What is the probability of getting 2 in a fair dice roll?
2. What is the probability of getting an even number in a fair dice roll?
Probability - Definition

Probability is calculated as the share of results belonging to the desired event out of all possible results:

P(A) = #A / #Ω

Legend
P(A) – probability of an event A occurring
#A – the number of possible results in which event A occurs
#Ω – the total number of possible results
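As a sketch of the counting rule, the two dice exercises can be worked out by listing the sample space and counting the favourable results. The function name `probability` and the use of `Fraction` are illustrative choices; the rule assumes all results are equally likely.

```python
from fractions import Fraction


def probability(event, sample_space):
    """P(A) = #A / #Ω, assuming every result in the sample space is equally likely."""
    favourable = [result for result in sample_space if event(result)]
    return Fraction(len(favourable), len(sample_space))


die = [1, 2, 3, 4, 5, 6]

p_two = probability(lambda roll: roll == 2, die)        # exercise 1
p_even = probability(lambda roll: roll % 2 == 0, die)   # exercise 2

print(p_two, p_even)
```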
Imagine you travel abroad and enter a casino. You decide to play a relatively simple game – the roulette.

● The first round you pick black – and win
● The second round you pick black – and win
● The third round you pick black – and win again
● The fourth round you pick black – and win again
● The fifth round you pick black – and win again

What will you pick the sixth time around?

Red / Black
The Gambler's Fallacy

The fallacy occurs when the general probability of achieving a certain result over a set of events is confused with the probability of achieving that result in a particular case
Probability of Independent Events

Event A and event B are independent when information about event B contributes no relevant knowledge regarding event A
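The roulette story can be checked by enumeration. The sketch below makes a simplifying assumption that is not in the slides: each round has only two equally likely results, black or red (a real wheel also has green pockets). It compares the chance of black in the sixth round with and without knowledge of the first five rounds.

```python
from fractions import Fraction
from itertools import product

# Simplifying assumption: only black and red, equally likely, rounds independent
rounds = list(product(["black", "red"], repeat=6))


def probability(event, outcomes):
    """Share of equally likely outcomes in which the event happens."""
    matching = [outcome for outcome in outcomes if event(outcome)]
    return Fraction(len(matching), len(outcomes))


# Probability that the sixth round is black, knowing nothing else
p_sixth_black = probability(lambda o: o[5] == "black", rounds)

# Same question, restricted to sequences that start with five blacks
five_blacks_first = [o for o in rounds if o[:5] == ("black",) * 5]
p_sixth_black_given_streak = probability(lambda o: o[5] == "black", five_blacks_first)

# Probability of six blacks in a row, judged before any round is played
p_six_blacks = probability(lambda o: all(colour == "black" for colour in o), rounds)

print(p_sixth_black, p_sixth_black_given_streak, p_six_blacks)
```

The streak is rare when judged in advance (1/64), but once five blacks have happened, the sixth round is still 1/2: the earlier rounds contribute no knowledge about it.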
A Question in Probability
The probability of children under 5 getting an ear infection is 1%.
It is known that children with an ear infection have a 90% chance of being diagnosed when checked by a pediatrician.
Children with no ear infection have a 10% chance of receiving a positive diagnosis from a pediatrician.
Given that a child was diagnosed with an ear infection by a pediatrician, what is the chance of her having an ear infection?

Average response among 95 doctors was 75%
A Question in Probability
● 1%: the probability of children under 5 getting an ear infection.
  10 out of 1000 children under 5 get an ear infection.
● 90%: the chance of being diagnosed when checked by a pediatrician, for children with an ear infection.
  9 out of 10 children with an ear infection will get a positive result when checked by a pediatrician.
● 10%: the chance of receiving a positive diagnosis when checked by a pediatrician, for children with no ear infection.
  Out of the 990 children with no ear infection, 99 will get a positive result when checked by a pediatrician.

A question - Given that a child was diagnosed with an ear infection, what is the chance of her actually having an infection?
Calculation - 9/(9 + 99) = 9/108 = 0.083
A Question in Probability

The conditional probability of event A given event B is the chance of event A occurring, assuming event B has occurred.
Knowing that event B occurred limits the sample space.
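A sketch of the ear infection question as a conditional probability, using only the three percentages from the question. The variable names are illustrative.

```python
from fractions import Fraction

p_infection = Fraction(1, 100)             # 1% of children under 5
p_pos_given_infection = Fraction(90, 100)  # diagnosed when actually infected
p_pos_given_healthy = Fraction(10, 100)    # diagnosed although not infected

# The sample space is limited to children with a positive diagnosis
p_pos_and_infection = p_infection * p_pos_given_infection
p_pos_and_healthy = (1 - p_infection) * p_pos_given_healthy
p_positive = p_pos_and_infection + p_pos_and_healthy

p_infection_given_pos = p_pos_and_infection / p_positive

print(p_infection_given_pos, round(float(p_infection_given_pos), 3))
```

The result is 9/108 = 1/12, about 8%, far from the 75% average answer quoted above.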
Class Exercise
A certain country has two healthcare organizations.
75% of the population are members of the larger organization; the others are members of the smaller one.
Feedback questionnaires have indicated that 90% of the smaller organization's members are satisfied with its service, while only 80% of the members of the larger one are satisfied.
Thus, the smaller organization starts a huge advertising campaign claiming, “If you’re satisfied, you must be our member”.
Is the campaign truthful?
* Example taken from Wikipedia
Class Exercise Solution
Total population
    Large Org (0.75): Satisfied 0.80, Unsatisfied 0.20
    Small Org (0.25): Satisfied 0.90, Unsatisfied 0.10
The probability of a satisfied small org member is: 0.25 * 0.9 = 0.225
The probability of a satisfied large org member is: 0.75 * 0.8 = 0.6
A satisfied person is therefore more likely to be a member of the large organization, so the campaign's claim does not hold.
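The tree can also be turned into the conditional probability the campaign is really about: given that a person is satisfied, how likely is it that they belong to each organization? This is a sketch that goes one step beyond the slide, dividing each branch by the total probability of being satisfied. The dictionary names are illustrative.

```python
from fractions import Fraction

membership = {"large": Fraction(75, 100), "small": Fraction(25, 100)}
satisfied_rate = {"large": Fraction(80, 100), "small": Fraction(90, 100)}

# Probability of being a member of the organization AND satisfied
joint = {org: membership[org] * satisfied_rate[org] for org in membership}

# Probability of being satisfied at all, then the share of each organization within it
p_satisfied = sum(joint.values())
p_org_given_satisfied = {org: joint[org] / p_satisfied for org in joint}

for org, p in p_org_given_satisfied.items():
    print(org, p, round(float(p), 3))
```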
Try it yourself
Try it Yourself 1
Blood types are found in the population with the following distribution:
40% with blood type A
20% with blood type B
5% with blood type AB
Questions
1. What percent of the population has blood type O?
2. Those with blood type B can receive blood from those with O and B. What is the probability for a random donor
to donate blood to a patient with type B?
3. Those with blood type B can donate blood to those with AB and B. What is the probability for a random donor
with type B to donate blood to a random patient?
4. Those with blood type O can donate to everybody, but receive donations only from people with type O.
   a. What is the probability of a type O donor being able to donate blood to a random patient?
   b. What is the probability that a random donor will be able to donate blood to an O type patient?
Try it Yourself 2
A certain fictional country has rain on 0.34 of the days in a year.
On rainy days, there’s a probability of 0.5 for traffic jams. On non-rainy days, the probability is only 0.25.
If it’s raining and there are traffic jams, I’ll be late for work with probability 0.5. On non-rainy days with no jams, the probability of me being late is only 0.125. On the remaining days (rainy with no traffic jams or not rainy with jams) the probability of me being late is 0.25.
On a random day:
1. What is the probability of this being a not-rainy day, with traffic jams, and I’m not late for work?
2. What is the probability of me being late?
3. Given I was late for work today, what is the probability of this being a rainy day?
Solution 1
Questions
1. What percent of the population has blood type O? 100%-40%-20%-5% = 35%
2. Those with blood type B can receive blood from those with O and B. What is the probability for a random donor to donate blood to a patient with type B? 35%+20% = 55%
3. Those with blood type B can donate blood to those with AB and B. What is the probability for a random donor with type B to donate blood to a random patient? 5%+20% = 25%
4. Those with blood type O can donate to everybody, but receive donations only from people with type O.
   a. What is the probability of an O type donor being able to donate blood to a random patient?
      O can donate to everyone, so it’s 100% of the population
   b. What is the probability that a random donor will be able to donate blood to an O type patient?
      Only O donors can donate, and they are 35% of the population
Solution 2
Year
    Rainy (0.34)
        Jammed (0.5): late for work 0.5, on time 0.5
        Flowing (0.5): late for work 0.25, on time 0.75
    Not rainy (0.66)
        Jammed (0.25): late for work 0.25, on time 0.75
        Flowing (0.75): late for work 0.125, on time 0.875

1. What is the probability of this being a not-rainy day, with traffic jams, and I’m not late for work?
0.66*0.25*0.75 = 0.124
2. What is the probability of me being late?
0.66*0.75*0.125 + 0.66*0.25*0.25 + 0.34*0.5*0.25 + 0.34*0.5*0.5 = 0.230625
3. Given I was late for work today, what is the probability of this being a rainy day?
(0.34*0.5*0.25 + 0.34*0.5*0.5) / 0.230625 = 0.1275 / 0.230625 = 0.553
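A sketch that walks the probability tree in code. It takes the rain probability from the solution tree (0.34 rainy, 0.66 not rainy) and the lateness probabilities from the exercise text; the helper `branch` and the labels "dry", "jam" and "flow" are illustrative names.

```python
p_rain = 0.34  # value shown in the solution tree
p_jam = {"rainy": 0.5, "dry": 0.25}
p_late = {
    ("rainy", "jam"): 0.5,
    ("rainy", "flow"): 0.25,
    ("dry", "jam"): 0.25,
    ("dry", "flow"): 0.125,
}


def branch(weather, traffic, late):
    """Probability of one full path through the tree: weather, then traffic, then lateness."""
    p_weather = p_rain if weather == "rainy" else 1 - p_rain
    p_traffic = p_jam[weather] if traffic == "jam" else 1 - p_jam[weather]
    p_outcome = p_late[(weather, traffic)] if late else 1 - p_late[(weather, traffic)]
    return p_weather * p_traffic * p_outcome


# 1. Not rainy, traffic jams, not late
q1 = branch("dry", "jam", late=False)

# 2. Late on a random day: sum over all four weather/traffic paths
q2 = sum(branch(w, t, late=True) for w in ("rainy", "dry") for t in ("jam", "flow"))

# 3. Rainy given late: the rainy "late" paths divided by all "late" paths
q3 = sum(branch("rainy", t, late=True) for t in ("jam", "flow")) / q2

print(round(q1, 3), round(q2, 3), round(q3, 3))
```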
Distribution
What is distribution?

● A function that displays the possible values of a variable and how often each value occurs

● Each distribution is related to a graph illustrating the probability of each value occurring.
Example 1
Variable: result of a dice roll
Possible values: 1, 2, 3, 4, 5, 6
Frequency of each value occurring in 100 dice rolls: 1/6
Graphic illustration:
Example 2
Variable: result of rolling 2 dice
Possible values: 2,3,4,5,6,7,8,9,10,11,12
Frequency each value can occur in 100 dice rolls:

Value   Results combination for the value   Frequency calculation   Frequency
2       1+1                                 (1/6*1/6)*1             0.03
3       1+2, 2+1                            (1/6*1/6)*2             0.06
4       1+3, 3+1, 2+2                       (1/6*1/6)*3             0.08
5       1+4, 4+1, 2+3, 3+2                  (1/6*1/6)*4             0.11
6       1+5, 5+1, 2+4, 4+2, 3+3             (1/6*1/6)*5             0.14
7       1+6, 6+1, 2+5, 5+2, 3+4, 4+3        (1/6*1/6)*6             0.17
8       2+6, 6+2, 3+5, 5+3, 4+4             (1/6*1/6)*5             0.14
9       3+6, 6+3, 4+5, 5+4                  (1/6*1/6)*4             0.11
10      4+6, 6+4, 5+5                       (1/6*1/6)*3             0.08
11      5+6, 6+5                            (1/6*1/6)*2             0.06
12      6+6                                 (1/6*1/6)*1             0.03

Graphic illustration:
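The distribution of the sum of two dice can be built by listing all 36 equally likely combinations and counting how many give each sum. A minimal sketch with illustrative variable names:

```python
from collections import Counter
from fractions import Fraction
from itertools import product

# All 36 ordered combinations of two fair dice, counted by their sum
sum_counts = Counter(a + b for a, b in product(range(1, 7), repeat=2))

# Each combination has probability 1/6 * 1/6 = 1/36
distribution = {
    total: Fraction(count, 36) for total, count in sorted(sum_counts.items())
}

for total, p in distribution.items():
    # A simple text bar stands in for the chart
    print(f"{total:>2}  {float(p):.2f}  {'#' * sum_counts[total]}")
```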
When do we use Distribution?
Scan the code or use the link and give an example for using distribution
Why do we want to identify Distribution?

● Distribution indicates data behavior based on past data
● Characterizing the behavior enables estimating future behavior and preparing for it

Day-to-day Example
The weather forecast
"There’s an 80% chance of rain tomorrow"
Types of
Distribution
Uniform Distribution

● A train arrives at the station every 10 minutes. Assuming you arrive at the station randomly, every waiting time between 0 and 10 minutes is equally likely.
● Probability of receiving any number in a random dice roll
Normal Distribution
Gaussian distribution / Bell curve

This is the most important statistical distribution, used across the sciences, describing how values are distributed around their average value.
Normal Distribution
Example 1
Example 2
[Chart axes: % of total sales, Shoe size (American)]
Normal Distribution, Calculation
A certain city has an average temperature in August of 25 degrees with a standard deviation of 6 degrees (based on 50 years of measurements)

μ = 25

σ = 6
Standard Score - Z
The standard score reflects the distance of the observation/desired outcome from the average, measured in standard deviation units.
We use:
● X for the desired outcome
● μ for the average
● σ for the standard deviation

Z = (X-μ)/σ = (30-25)/6 = 5/6 = 0.83


Standard Score - Z
Z = 0.83
From the Z table, the probability of a value below Z = 0.83 is 0.7967.
The probability of a value above it is 1 - 0.7967 = 0.2033.
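The Z table lookup can be reproduced with `statistics.NormalDist` from the Python standard library. This sketch assumes the question behind the numbers is "what is the probability of a temperature above 30 degrees", and it rounds Z to two decimals as a printed Z table would.

```python
from statistics import NormalDist

mean, std_dev = 25, 6   # August temperature: average and standard deviation
x = 30                  # the desired outcome

# Standard score, rounded to two decimals as in a printed Z table
z = round((x - mean) / std_dev, 2)

# The standard normal distribution (average 0, standard deviation 1) plays the role of the Z table
p_below = NormalDist().cdf(z)
p_above = 1 - p_below

print(z, round(p_below, 4), round(p_above, 4))
```

Without the rounding of Z, `NormalDist(25, 6).cdf(30)` gives a slightly different value, since 5/6 is not exactly 0.83.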
Now you try it
A certain city has an average temperature in August of 25 degrees with a standard deviation of 6 degrees.

μ = 25

σ = 6
Based on the previous example, calculate the temperature above which the hottest 5% of days fall.
Now you try it Solution
1. We are looking for the last 5%, the highest.
2. To return to the standard score Z, we must find the complementary probability: 1-0.05 = 0.95
3. We return to the Z table and look for the standard score giving a probability of 0.95 -> 1.65
4. We solve for X: (X-25)/6 = 1.65, so X = 25 + 1.65*6
5. The result is 34.9 degrees, meaning that the hottest 5% of days are above 34.9 degrees.
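The reverse lookup (from a probability back to a value) can be sketched with `NormalDist.inv_cdf`. The variable names are illustrative; note that the exact standard score for 0.95 is about 1.645, which the Z table reading above rounds to 1.65.

```python
from statistics import NormalDist

august = NormalDist(mu=25, sigma=6)

# Temperature below which 95% of the days fall, so 5% of the days are hotter
threshold = august.inv_cdf(0.95)

# The matching standard score on the standard normal distribution
z = NormalDist().inv_cdf(0.95)

print(round(z, 3), round(threshold, 1))
```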
Additional Concepts
1 Representative sample
2 Correlation
3 Statistical significance
4 Trendlines
1 Representative sample
Exercise
● Your company runs a phone survey to test customer satisfaction
● You call customers’ phone numbers as listed in your system
● The survey is conducted during the pollsters’ regular work hours (9am-5pm)
Representative Sample
Data about part of the population that should represent the entire population and its behavior
Representative Sample
The minimal sample size that allows reasonably accurate conclusions about the entire population to be drawn from the sample
Factors: size of population, population variation
https://www.surveysystem.com/sscalc.htm
Example
Researchers found a new water source and mapped its population of marine organisms. The results indicated a limited list of fish and other medium and large organisms.
One year later, other researchers conducted a different study in the same water source and found, to their surprise, that tens of types of small organisms inhabit it.
Selection Bias
Distortion of research data caused by a bias in the information collection method.
Ignoring a selection bias might lead to a wrong interpretation of the data and to false results.
Use Case 1
2 Correlation
A statistical index evaluating the consistency of the relation between two quantitative variables, i.e., whether a change in one variable is consistently accompanied by a change in the other.
A correlation index of 1 indicates a full positive correlation, and -1 a full negative correlation. No linear relation at all is 0.

In Excel -> CORREL(array1, array2)

[Chart: State Foreign Trade of Country A (Import, Export)]
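A sketch of what a correlation index computes, written out by hand as the Pearson correlation coefficient (the measure Excel's CORREL returns). The function name `pearson` is illustrative, and the data are data sets 1 and 4 from the Anscombe table earlier in the deck.

```python
import math


def pearson(xs, ys):
    """Pearson correlation: how consistently the two variables move together, from -1 to 1."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    co_movement = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sum((x - mean_x) ** 2 for x in xs)
    spread_y = sum((y - mean_y) ** 2 for y in ys)
    return co_movement / math.sqrt(spread_x * spread_y)


# Data sets 1 and 4 from the Anscombe table
x1 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
y1 = [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]
x4 = [8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8]
y4 = [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]

r1 = pearson(x1, y1)
r4 = pearson(x4, y4)
print(round(r1, 2), round(r4, 2))
```

Both data sets give the same correlation although they look nothing alike when plotted, which is why the index should be read together with a chart.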
Use Case 2
Use Case 3
You launch a new Instagram campaign designed to increase traffic to your website.
A week into the campaign, you analyze the number of entries from Instagram and find a total of 506 during the first week.
Before the campaign you had no visitors from Instagram at all.
Based on this analysis, can you determine if the campaign is successful and if you should continue with it just the way it is?
3 Statistical significance
Statistical significance means that the results are likely caused by a real reason, not mere coincidence
Alpha is the maximal risk of reaching a false conclusion (False Positive) that we are willing to accept as part of the analysis.
P-value is the probability of getting results at least as extreme as those observed by coincidence alone, when there is no real effect.
Hence we aspire for the P-value to be as low as possible, and at least lower than Alpha.
Date          Actual entries   Estimated entries
01-Jan-2022   102              100
02-Jan-2022   85               100
03-Jan-2022   93               100
04-Jan-2022   45               100
05-Jan-2022   7                100
06-Jan-2022   109              100
07-Jan-2022   65               100
Total         506

*TTEST(B2:B8,C2:C8,2,1)
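The TTEST call above uses 2 tails and type 1, which in Excel means a two-tailed paired test. The sketch below computes the paired t statistic behind it with the standard library only. Turning the statistic into a P-value needs the t distribution, which the standard library does not provide, so the block stops at the statistic and the degrees of freedom.

```python
import math
import statistics

estimated = [100] * 7
actual = [102, 85, 93, 45, 7, 109, 65]   # first week of actual entries

# A paired test looks at the day-by-day differences
differences = [e - a for e, a in zip(estimated, actual)]

mean_diff = statistics.mean(differences)
std_diff = statistics.stdev(differences)   # sample standard deviation (divides by n - 1)
t_statistic = mean_diff / (std_diff / math.sqrt(len(differences)))
degrees_of_freedom = len(differences) - 1

print(round(mean_diff, 2), round(t_statistic, 3), degrees_of_freedom)
```

A t statistic of about 2.02 with 6 degrees of freedom falls between the usual two-tailed critical values for 10% and 5%, consistent with a P-value of roughly 0.09 after 7 days.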
Day   Actual entries   Estimated entries
1     102              100
2     85               100
3     93               100
4     45               100
5     7                100
6     109              100
7     65               100
8     60               100
9     55               100
10    5                100
11    48               100
12    83               100
13    94               100
14    12               100
15    38               100
16    71               100
17    10               100
18    47               100
19    99               100
20    15               100
21    47               100
22    54               100
23    23               100
24    76               100
25    106              100

P-Value after 7 days: 0.090
P-Value after 10 days: 0.010
P-Value after 15 days: 0.001
P-Value after 25 days: 0.000
Day   Actual entries   Estimated entries (first table)   Estimated entries (second table)
1     102              50                                150
2     85               50                                150
3     93               50                                150
4     45               50                                150
5     7                50                                150
6     109              50                                150
7     65               50                                150

P-Value after 7 days (estimated entries 50): 0.155
P-Value after 7 days (estimated entries 150): 0.001
4 Trendlines
Trend
From the dictionary: direction, tendency
[Charts: upward trend, downward trend, no trend]
Trendlines reflect the aggregated behavior of a data set. They are useful for a variety of applications, from optimizing an advertising campaign to data monitoring.
Types of trendlines:
● Moving average
● Linear
Moving Average
Goal: to flatten the curve and clean out noise.
Calculation: choose a period over which to calculate an average, then repeat the calculation for each day, looking back over that period.
Note: the longer the period, the less noisy the trendline becomes.
Moving Average
Advantages:
● Easy to calculate
● Flexible in the period chosen
● Provides a clear view of the data
Disadvantages:
● Choosing the period might cause bias
● Describes the past, cannot indicate the future
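A minimal sketch of a trailing moving average. The function name `moving_average` and the 3-day period are illustrative choices; the data are the first week of actual entries from the campaign table.

```python
def moving_average(values, period):
    """Trailing moving average: each point is the average of the last `period` values."""
    if period < 1 or period > len(values):
        raise ValueError("period must be between 1 and the number of values")
    return [
        sum(values[i - period + 1 : i + 1]) / period
        for i in range(period - 1, len(values))
    ]


daily_entries = [102, 85, 93, 45, 7, 109, 65]   # first week of actual entries
smoothed = moving_average(daily_entries, period=3)

print([round(value, 2) for value in smoothed])
```

The smoothed series starts only on day 3, because the first two days do not yet have a full period to look back at, and its swings are smaller than those of the raw data.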
Linear
Goal: to present an upward / downward / no-change trend and its degree
Calculation: finding the line that best fits all points
Linear
A formula representing the linear line: y=ax+b
a = the incline and degree of the trend. Negative number – downward trend. Positive number – upward trend. The larger the absolute value of a, the steeper the incline.
b = the meeting point of the line with the Y axis.
R squared
Measures how well a linear trendline fits the data. The higher the value, the more credible the trendline.
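A sketch of how a linear trendline y=ax+b and its R squared can be calculated with the least-squares method, which is one common way of finding the best-fitting line. The function name `linear_trendline` is illustrative, and the data are the first week of actual entries from the campaign table.

```python
def linear_trendline(xs, ys):
    """Least-squares line y = a*x + b and its R squared (assumes xs and ys are not constant)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    a = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
        (x - mean_x) ** 2 for x in xs
    )
    b = mean_y - a * mean_x
    # R squared: share of the variation in y that the line explains
    unexplained = sum((y - (a * x + b)) ** 2 for x, y in zip(xs, ys))
    total = sum((y - mean_y) ** 2 for y in ys)
    return a, b, 1 - unexplained / total


days = [1, 2, 3, 4, 5, 6, 7]
entries = [102, 85, 93, 45, 7, 109, 65]   # first week of actual entries

a, b, r_squared = linear_trendline(days, entries)
print(round(a, 2), round(b, 2), round(r_squared, 2))
```

Here a is negative, so the line points downward, but R squared is low, so the line describes this noisy week poorly and should not be trusted on its own.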
The Dangers of Linear
Final Exercise

Work in pairs for 20 minutes


Now it is your turn…
Q&A
Thanks!
