Analytics
The fundamentals of statistics
What is statistics?
Definition: The science that deals with the collection, analysis, and interpretation of data.
Types of Statistics
Descriptive: Describe what is going on in a data set. Results cannot be generalized to any other group.
Inferential: Allow us to infer/predict trends about a larger population based on a study of a sample taken from it.
The ultimate goal of statistics
The ultimate goal of statistics is to be able to predict outcomes, without using astrology. Simply describing our data is a minor goal.
Descriptive statistics
Mean
Mean (simple average): The average of all numbers. To calculate, add all of the numbers in a set and then divide the sum by the total count of numbers. Outliers can have a major effect on the mean.
Example
Employee age: 24, 25, 25, 26, 27, 29, 33, 35, 41, 43, 63
Mean = (24+25+25+26+27+29+33+35+41+43+63) / 11 = 33.7
How to calculate it in Excel
=AVERAGE(number1, [number2], …)
You can select an entire range instead of selecting each individual number.
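For readers working outside Excel, the same calculation can be sketched in Python. This is only an illustration using the standard library's statistics module; the variable names are made up for the example.

```python
from statistics import mean

# Employee ages from the example above
employee_ages = [24, 25, 25, 26, 27, 29, 33, 35, 41, 43, 63]

# Add all of the numbers, then divide by how many there are
average_age = mean(employee_ages)

print(round(average_age, 1))
```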
Mean
Outliers can have a major effect on the mean…
Employee age (the oldest employee is 63): 24, 25, 25, 26, 27, 29, 33, 35, 41, 43, 63
Employee age (the oldest employee is 43): 24, 24, 25, 25, 26, 27, 29, 33, 35, 41, 43
Average age drops by 3.5 years
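A quick way to see the outlier effect is to compute both means side by side. The sketch below uses the two age lists from this comparison; the variable names are just for illustration.

```python
from statistics import mean

# Same team size, but the oldest employee is 63 in one list and 43 in the other
ages_oldest_63 = [24, 25, 25, 26, 27, 29, 33, 35, 41, 43, 63]
ages_oldest_43 = [24, 24, 25, 25, 26, 27, 29, 33, 35, 41, 43]

# How far the average falls when the 63-year-old outlier is not in the data
drop_in_average = mean(ages_oldest_63) - mean(ages_oldest_43)

print(round(drop_in_average, 1))
```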
Example
Employee age: 24, 25, 25, 26, 27, 29, 33, 35, 41, 43, 63
Mode = 25 (occurs 2x while everything else occurs 1x)
How to calculate it in Excel
=MODE(number1, [number2], …)
You can select an entire range instead of selecting each individual number.
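As a rough Python equivalent, the standard library can count the values and return the most common one. This is a sketch on the same employee ages, not a replacement for the Excel function.

```python
from collections import Counter
from statistics import mode

employee_ages = [24, 25, 25, 26, 27, 29, 33, 35, 41, 43, 63]

# The value that occurs most often
most_common_age = mode(employee_ages)

# Counter shows how many times each age occurs, which explains the result
age_counts = Counter(employee_ages)

print(most_common_age, age_counts[most_common_age])
```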
Histogram
Histogram: Helps us visually show how our data is distributed and gives better meaning to mean, median, and mode.
[Figure: first histogram, Mean = 33.7, Median = 29, Mode = 25]
[Figure: second histogram, Mean = 33, Median = 33]
Skewness
Skewness: Measures the lopsidedness of a histogram.
Right skew: Mean > Median
Mean = 34, Median = 29
In a histogram or distribution, the ends of the chart are called “tails”. In a right skew, the tail is on the RIGHT.
Left skew: Mean < Median
Mean = 53, Median = 58
In a left skewed histogram, the tail is on the LEFT of the chart.
Skewness
Calculating skewness is super easy, but the number is fairly useless so we won’t worry about it.
The skewness number doesn’t help us with our analysis and doesn’t really give any good information.
Are you still concerned about calculating skewness? Use the SKEW() function in Excel to calculate it.
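If you are curious what such a function does, here is a minimal Python sketch of the adjusted sample skewness formula, which to my understanding is the one Excel documents for SKEW(). The function name sample_skewness is hypothetical, and the data is the employee age list from earlier.

```python
from statistics import mean, stdev

def sample_skewness(values):
    """Adjusted sample skewness: positive for a right skew, negative for a left skew."""
    n = len(values)
    avg = mean(values)
    sd = stdev(values)  # sample standard deviation
    cubed_z_scores = sum(((x - avg) / sd) ** 3 for x in values)
    return n / ((n - 1) * (n - 2)) * cubed_z_scores

employee_ages = [24, 25, 25, 26, 27, 29, 33, 35, 41, 43, 63]
skew = sample_skewness(employee_ages)

# Mean (33.7) is above the median (29), so we expect a positive number here
print(skew > 0)
```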
What is variance?
A histogram can show us how our data is distributed, but how can we know how spread out it is?
Example: Both charts have almost identical means and medians, but clearly the second chart’s data is grouped much closer to the mean. But how much so? That is where variance comes in…
Example: Two sets of test scores with the same mean (average)
Standard deviation
High standard deviation = Harder to predict outcomes
Low standard deviation = Easier to predict outcomes
Standard deviation
The actual formula for standard deviation can get ugly and knowing it won’t add a ton of value…
$SD = \sqrt{\dfrac{\sum |x - \mu|^2}{N}}$
Luckily, Excel has a built-in function to calculate this number for us…
=STDEV.S(): Use this if you are working with a sample, which will almost always be the case.
=STDEV.P(): You should only use this if you are working with a true population, which is rare.
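The sample versus population distinction also exists in Python's standard library, where statistics.stdev divides by n-1 and statistics.pstdev divides by N. The sketch below assumes the employee ages from earlier are a sample; it is only meant to show that the two numbers differ.

```python
from statistics import pstdev, stdev

employee_ages = [24, 25, 25, 26, 27, 29, 33, 35, 41, 43, 63]

# Sample standard deviation (divides by n - 1), the usual choice
sample_sd = stdev(employee_ages)

# Population standard deviation (divides by N), only for a true population
population_sd = pstdev(employee_ages)

# The sample version is always a little larger for the same data
print(round(sample_sd, 2), round(population_sd, 2))
```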
Coefficient of variation
Coefficient of variation: Don’t let the name scare you. It’s simply a way to “standardize” our standard deviation so we can compare the standard deviation across different data sets.
Formula
$\text{Coefficient of variation} = \dfrac{\text{Standard deviation}}{\text{Mean}} \times 100$
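To make the formula concrete, here is a small Python sketch. The function name coefficient_of_variation and the two data sets are hypothetical, chosen so that both have the same standard deviation but very different means.

```python
from statistics import mean, stdev

def coefficient_of_variation(values):
    """Standard deviation divided by the mean, expressed as a percentage."""
    return stdev(values) / mean(values) * 100

# Hypothetical data: same spread (standard deviation of 10), different scale
small_numbers = [10, 20, 30]
large_numbers = [1010, 1020, 1030]

cv_small = coefficient_of_variation(small_numbers)
cv_large = coefficient_of_variation(large_numbers)

# The spread matters far more relative to a mean of 20 than to a mean of 1020
print(cv_small, round(cv_large, 2))
```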
Reminder: Types of Statistics
Descriptive: Describe what is going on in a data set. Results cannot be generalized to any other group.
Inferential: Allow us to infer/predict trends about a larger population based on a study of a sample taken from it.
Probability distribution
Don’t worry, you’ll never, ever need to calculate the actual probability of all possible values occurring. That would be sheer madness!
Example: Below are the probability distributions for rolling a 1 through 6 on a single die and rolling a 2 through 12 on two dice.
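For something as small as dice, the distributions can actually be listed by brute force. The sketch below assumes fair six-sided dice and simply counts every possible outcome; the names are illustrative.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

faces = range(1, 7)

# One die: each face is equally likely
one_die = {face: Fraction(1, 6) for face in faces}

# Two dice: count how many of the 36 combinations give each total
total_counts = Counter(a + b for a, b in product(faces, repeat=2))
two_dice = {total: Fraction(count, 36) for total, count in sorted(total_counts.items())}

for total, probability in two_dice.items():
    print(total, probability)
```

The single die gives a flat distribution, while the two-dice totals pile up in the middle, with 7 being the most likely result.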
Normal distribution
About 68% of all data falls within ± 1 standard deviation
About 95% of all data falls within ± 2 standard deviations
About 99.7% (roughly 99%) of all data falls within ± 3 standard deviations
Normal distribution
Example: The average test score was 70% with a standard deviation of 10%. The test scores would fall in these ranges.
68% of test scores: 60% to 80% (± 1 SD)
95% of test scores: 50% to 90% (± 2 SD)
99% of test scores: 40% to 100% (± 3 SD)
Mean = 70%
Mean – 1 SD: 70% - 10% = 60%
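The ranges in this example can be reproduced with Python's statistics.NormalDist (available from Python 3.8). This is a sketch that assumes the test scores are exactly normal with a mean of 70 and a standard deviation of 10.

```python
from statistics import NormalDist

mean_score = 70
sd_score = 10
scores = NormalDist(mu=mean_score, sigma=sd_score)

# For 1, 2, and 3 standard deviations: the range and the share of scores inside it
ranges = {}
for k in (1, 2, 3):
    low = mean_score - k * sd_score
    high = mean_score + k * sd_score
    share = scores.cdf(high) - scores.cdf(low)
    ranges[k] = (low, high, share)
    print(f"{low}% to {high}%: {share:.1%} of test scores")
```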
Central limit theorem
Central limit theorem (CLT): This is a foundational principle of statistics. The CLT explains why we only need a relatively small sample to be able to make assumptions about the entire dataset AND why we can assume a normal distribution for our sample averages (even when the underlying data isn’t normal!).
Instead of giving you a boring definition, let’s explain it to you…
Central limit theorem
Central limit theorem (CLT) explained
Example: Let’s say a high school has 1,000 students in grades 8-12 with 200 students in each grade.
So, no matter what your original distribution looks like, the sample averages will be approximately normally distributed (as long as each sample is reasonably large)! Sweet!
Central limit theorem
Wow, that’s awesome! So what does that mean?
1. This new distribution we just created will have approximately the same mean as the total population.
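This first point is easy to check with a small simulation. The sketch below is built on assumptions of my own: it uses the 1,000-student school from the example, treats each student's grade level as the value being measured, and draws 2,000 random samples of 30 students each. None of those choices come from the slides.

```python
import random
from statistics import mean

random.seed(42)  # fixed seed so the run is repeatable

# 200 students in each of grades 8 through 12: a flat, non-normal distribution
population = [grade for grade in range(8, 13) for _ in range(200)]

# Assumed sampling plan: 2,000 samples of 30 students, averaging each sample
sample_means = [mean(random.sample(population, 30)) for _ in range(2000)]

population_mean = mean(population)
mean_of_sample_means = mean(sample_means)

# The two numbers should be very close to each other
print(population_mean, round(mean_of_sample_means, 2))
```

Plotting sample_means as a histogram would show the familiar bell shape, even though the original grade distribution is completely flat.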
[Figure: an estimate of 20 phones, shown with a range of 15 phones to 25 phones]
Intro to estimates
This all ties back to what we’ve learned about the normal distribution and standard deviation…
If we want to have 95% confidence that the actual number will fall within a range, we need to make sure it captures data ± 2 standard deviations from the mean.
[Figure: normal curve with 95% of data falling between 15 phones and 25 phones]
Calculating confidence
Formula: Mean ± t-score × Standard error
t-score table (rows are n-1; columns are two-tailed confidence levels)
n-1    90%     95%     98%     99%     99.8%   99.9%
5      2.015   2.571   3.365   4.032   5.893   6.869
6      1.943   2.447   3.143   3.707   5.208   5.959
7      1.895   2.365   2.998   3.499   4.785   5.408
8      1.860   2.306   2.896   3.355   4.501   5.041
9      1.833   2.262   2.821   3.250   4.297   4.781
10     1.812   2.228   2.764   3.169   4.144   4.587
t-score from table (for sample size use n-1) = 2.069
Standard error = 10 / √24 = 2.04
150 + 2.069 * 2.04 = 155 (rounded up)
150 - 2.069 * 2.04 = 145 (rounded down)
We can now tell our boss that, with 95% confidence, we estimate inventory needs to be between 145 and 155 units.
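The same arithmetic can be sketched in Python. The inputs below are read off the worked example (a mean of 150, a standard deviation of 10, and a sample size of 24, which gives n-1 = 23 for the t-score lookup); treat those readings, and the variable names, as assumptions.

```python
import math

sample_mean = 150   # average from the sample
sample_sd = 10      # standard deviation of the sample
sample_size = 24    # assumed sample size (n - 1 = 23 for the t-table)
t_score = 2.069     # 95% confidence, taken from the t-score table

# Standard error: standard deviation divided by the square root of the sample size
standard_error = sample_sd / math.sqrt(sample_size)

margin = t_score * standard_error
lower = sample_mean - margin
upper = sample_mean + margin

# Round outward so the reported range still covers the interval
print(math.floor(lower), math.ceil(upper))
```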
Intro to regression analysis and correlation
What is a regression analysis?
Regression analysis: Used to find LINEAR trends in our data. The regression provides us with an equation so that we can make predictions about our data. It also helps us understand how accurate our model is.
Example: You want to predict next month’s sales. Obviously there are dozens of factors that impact this number. A regression analysis is a way of sorting out which of these dozens of variables have an impact and belong in your predictive model. From your regression, you can figure out which factors matter most, which ones you can ignore, and how certain you are about all of those factors.
What is a regression analysis?
Major components of a regression analysis:
Dependent variable: The main thing you’re trying to understand or predict.
Independent variables: The factors you believe have an impact on your dependent variable.
[Diagram: Economy, Competition, and Weather (independent variables) pointing to Sales (dependent variable)]
But how can we know if these independent variables (ex. the economy) actually have a meaningful impact on the dependent variable (ex. sales)?
Scatter plot
Scatter plot: A chart that has a bunch of points showing the relationship between two sets of data.
Example: The scatter plot below shows the relationship between the outside temperature (independent variable) and daily sales (dependent variable).
Y (Vertical) axis = Dependent variable
X (Horizontal) axis = Independent variable
It is pretty clear that there is some type of relationship between the temperature increasing and daily sales increasing. But how do we quantify that? Correlation (which we’ll discuss in a minute).
Scatter plot
Creating a scatter plot in Excel
Make sure your independent variable is in the first column (ex. outside temperature) and your dependent variable (ex. sales) is in the second column. This will save you time later.
Select all your data, including the headers.
On the Ribbon, go to Insert and under the Charts area, select Scatter (the very first option).
Adjust the x and y axis as needed to zoom in on your data. To do this, right click on the axis and select Format Axis. From there, you can change the Bounds.
Correlation
Correlation: A quantity (between -1.0 and 1.0) measuring how much two items are dependent upon each other. The closer to 1.0 or -1.0, the higher the correlation.
Correlation = .71
In other words, there is a high,
positive relationship between the
increasing temperature outside
and the increase in sales.
Positive correlation (0-1.0)
An increase in X predicts there
will be an increase in Y. (ex.
smoking more cigarettes
predicts higher lung cancer
deaths.)
Luckily Excel has a built in function to calculate this number for us…
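In Excel, the CORREL function is one that computes this. For comparison, here is a minimal Python sketch of the Pearson correlation coefficient. The function name pearson_correlation and the temperature and sales figures are hypothetical, not the data behind the charts.

```python
import math
from statistics import mean

def pearson_correlation(xs, ys):
    """Pearson correlation: a value between -1.0 and 1.0."""
    x_avg, y_avg = mean(xs), mean(ys)
    covariance = sum((x - x_avg) * (y - y_avg) for x, y in zip(xs, ys))
    x_spread = math.sqrt(sum((x - x_avg) ** 2 for x in xs))
    y_spread = math.sqrt(sum((y - y_avg) ** 2 for y in ys))
    return covariance / (x_spread * y_spread)

# Hypothetical daily readings: outside temperature and sales
temperatures = [30, 42, 55, 61, 70, 78, 85]
daily_sales = [1080, 1150, 1140, 1210, 1190, 1260, 1240]

correlation = pearson_correlation(temperatures, daily_sales)
print(round(correlation, 2))
```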
y = The dependent variable we are solving for. This value is on the y axis. In our
example, daily sales.
m = The slope of the line. In our example, our slope is 2.89. This means sales
increase by $2.89 when it is 1.0 degree warmer outside.
x = The independent variable. This value is on the x axis. In our example, the
outside temperature.
b = The intercept.
Equation of the line
Using the equation of the line to predict sales…
Example: Temperatures are expected to drop significantly next week with highs around 25 degrees. What can we expect daily sales to be?
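One way to work through this example is to plug 25 into y = mx + b. The sketch below takes the slope (2.887) and intercept (1,007.648) from the Coefficients column of the regression output in this deck; the function name predict_daily_sales is made up for the illustration.

```python
# Coefficients from the regression output: Outside Temp and Intercept
SLOPE = 2.887
INTERCEPT = 1007.648

def predict_daily_sales(outside_temp):
    """Apply y = m*x + b to estimate daily sales for a given temperature."""
    return SLOPE * outside_temp + INTERCEPT

predicted_sales = predict_daily_sales(25)
print(round(predicted_sales, 2))
```

Keep in mind this is only the point on the line; the R Square discussed later tells us how much individual days can be expected to stray from it.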
How accurate is our model?
Regression Statistics
Multiple R 0.706
R Square 0.498
Adjusted R Square 0.496
Standard Error 51.421
Observations 374
ANOVA
df SS MS F Significance F
Regression 1 974,890.848 974,890.848 368.696 0.000
Residual 372 983,626.767 2,644.158
Total 373 1,958,517.615
Coefficients Standard Error t Stat P-value Lower 95% Upper 95% Lower 95.0% Upper 95.0%
Intercept 1,007.648 7.728 130.397 0.000 992.453 1,022.843 992.453 1,022.843
Outside Temp 2.887 0.150 19.201 0.000 2.591 3.182 2.591 3.182
The regression output
Regression statistics
Multiple R: This is just a fancy word for correlation.
R Square: This tells us how well our model fits the data. In our example, our R Square is .498 or 50%. This number means that 50% of the variation in sales can be explained by the change in temperature.
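These two numbers can be recovered from the ANOVA table in the same output, which is a handy sanity check. The sketch below uses the Regression and Total sums of squares (the SS column); the variable names are illustrative.

```python
import math

# Sums of squares from the ANOVA section of the regression output
ss_regression = 974_890.848   # variation explained by the model
ss_total = 1_958_517.615      # total variation in daily sales

# R Square: the share of total variation that the model explains
r_square = ss_regression / ss_total

# Multiple R: for a single-variable regression, the size of the correlation
multiple_r = math.sqrt(r_square)

print(round(r_square, 3), round(multiple_r, 3))
```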
[Figure: example scatter plots with R² = 40%, R² = 60%, and R² = 80%]
A quick overview of p-values
P-value: At its core, this value helps us understand whether the effect of a variable in our model is likely to have occurred randomly. If the variable is just some random number, we would not want to include it in our model, so we’d get rid of it.
In general, a value less than 0.05 is good. This means that, if the variable had no real effect, a result this strong would show up by chance less than 5% of the time. If the p-value is above 0.05, the chance that our result is just random noise is too high, so we get rid of the variable.
Example: This can be confusing so let’s break this down into a simple
example using a coin.
Let’s say we want to check to see if a coin is rigged or not. So we start
flipping it.
A quick overview of p-values
Example cont.: Let’s say we flip the coin 8 times…
Flip   Result   Probability
1      Heads    50.0%
2      Heads    25.0%
3      Heads    12.5%
4      Heads    6.3%
5      Heads    3.1%
6      Heads    1.6%
7      Heads    0.8%
8      Heads    0.4%
As we keep getting more and more heads, we are more and more certain that the coin is rigged. In other words, we are more and more certain that our results are not random.
In fact, by the 4th head, there is only a 6.3% chance that a fair coin would give us 4 heads in a row purely at random. By the 5th head, that chance is 3.1%.
This is what the p-value is telling us. We don’t want random variables in our model.
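The probabilities in the coin example come from halving the chance with every extra head. A short Python sketch, assuming a fair coin with a 50% chance of heads on each flip:

```python
# Chance that a fair coin gives heads on every one of the first n flips
chance_of_all_heads = {flips: 0.5 ** flips for flips in range(1, 9)}

for flips, chance in chance_of_all_heads.items():
    marker = "below 0.05" if chance < 0.05 else ""
    print(flips, f"{chance:.4f}", marker)
```

The chance first drops below the usual 0.05 cutoff at the 5th head in a row, which is the point where we would start to doubt that the coin is fair.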
The regression output
Back to our regression statistics…
ANOVA
df SS MS F Significance F
Regression 1 974,890.848 974,890.848 368.696 0.000
Residual 372 983,626.767 2,644.158
Total 373 1,958,517.615
As a “p-value”, Significance F should be below 0.05!
There is a lot of stuff here waaaay beyond this course. For us, we only really care about Significance F.
Significance F: This number is basically the p-value for the model. This number lets us know if our model is reliable or if it is just a bunch of random numbers. A value less than 0.05 is generally good.
For a simple regression, this number doesn’t help us much. This number is much more important once we have 2+ variables in our model.
The regression output
Coefficients Standard Error t Stat P-value Lower 95% Upper 95% Lower 95.0% Upper 95.0%
Intercept 1,007.648 7.728 130.397 0.000 992.453 1,022.843 992.453 1,022.843
Outside Temp 2.887 0.150 19.201 0.000 2.591 3.182 2.591 3.182
Coefficients: These tell us the equation of our line. So our equation would be y = 2.887x + 1,007.648.
P-value: The p-value for each variable. We want a value below 0.05. For a single variable regression, if this number is above 0.05, we need a whole new variable. For a multivariable regression, if any variable is above 0.05, we should get rid of it.
The p-value for the intercept doesn’t really matter, even if it’s .999.
The regression output
Tying it all together
We want the R-square (or adjusted R-square for a multivariable regression)
We want the Significance F to be below 0.05.
We use the coefficients to create the equation of our line so we can make predictions/forecasts.
SQL for Data Analytics: Learn the code to grab all the relevant data from your database
Stats for Data Analytics: Learn the key statistical tools to analyze our data
Intro to Data Analytics: Apply our statistical tools to real-life business problems
Dashboards and Power BI: Visually represent your analyses and gain more insights from them.