
Basics of Statistics for Analytics Using SAS/Excel

Contents

Statistics
➢ Data and Variables
➢ Types of Variables
➢ Population and Sample
➢ Types of Data Analysis

Descriptive Statistics
➢ Measure of Central Tendency
➢ Measure of Spread
➢ Summarization & Visualization Tools

Probability Distribution
➢ Uniform Distribution
➢ Poisson Distribution
➢ Normal Distribution
➢ Frequency Distribution
➢ Correlation and Covariance

Statistics Introduction
Data

“Data (/ˈdeɪtə/ day-tə, /ˈdætə/ da-tə, or /ˈdɑːtə/ dah-tə) is a set of values of qualitative or quantitative variables; restated, pieces of data are individual pieces of information. Data is measured, collected and reported, and analysed, whereupon it can be visualized using graphs or images. Data as a general concept refers to the fact that some existing information or knowledge is represented or coded in some form suitable for better usage or processing.

Raw data, i.e. unprocessed data, is a collection of numbers, characters; data processing commonly occurs by stages, and the "processed data" from one stage may be considered the "raw data" of the next.”
- Wikipedia

Statistics Introduction
What Are Variables?

Statistics Introduction
Types Of Variables

➢ Continuous- Can take any value within a permitted range. Examples- Height, Weight, Sales, Unemployment Rate

➢ Discrete- Can take only whole-number values within a permitted range. Example- Number of cars owned by a family

➢ Ordinal- Logically ordered categorical values. Example- Small, Medium, Large T-shirt sizes

➢ Nominal- Categorical values with no logical order. Examples- Gender, Nationality, Religion, Language
Statistics Introduction
Population and Sample

Population- The entire group

Sample- A portion of the group

➢ For practical reasons, a chosen subset of the population, called a sample, is studied, as opposed to compiling data about the entire group (an operation called a census)

➢ Descriptive statistics summarizes the data by describing what was observed in the sample, numerically or graphically

➢ Inferential statistics uses patterns in the sample data to draw inferences about the population represented, accounting for randomness

➢ To use a sample as a guide to an entire population, it is important that it truly represents the overall population
Statistics Introduction
Types of Data Analysis

➢ Univariate- One variable is analyzed at a time. The objective is to describe the variable. Example- How many students are graduating with an “Analytics” degree?

➢ Bivariate- Two variables are analyzed together for any possible association or empirical relationship. Example- What is the association between “Gender” and graduation with an “Analytics” degree?

➢ Multivariate- More than two variables are analyzed together for any possible association or interactions. Example- What is the association between “Gender”, “Country of Residence” and graduation with an “Analytics” degree?

Descriptive Statistics
Central Tendency
➢ Mean or Average- One of the most effective measures of the “center” of the data. This is simply the arithmetic average of the values of a variable

➢ While the arithmetic mean is often used to report central tendency, it is not a robust statistic, meaning that it is greatly influenced by outliers (values that are very much larger or smaller than most of the values)

➢ Mode- The most frequently occurring value of a variable

➢ Median- The middle value (50th percentile) of a variable in an ordered dataset. This is more robust to outliers than the mean
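
The three measures above can be sketched in Python with the standard `statistics` module. The values below are made up for illustration (they are not taken from Cars.csv), and 44.0 is a deliberate outlier that shows why the median is the more robust measure.

```python
import statistics

# Hypothetical MPG-like values; 44.0 is a deliberate outlier.
values = [18.0, 15.0, 18.0, 16.0, 17.0, 15.0, 18.0, 44.0]

mean_value = statistics.mean(values)      # arithmetic average
median_value = statistics.median(values)  # middle of the sorted values
mode_value = statistics.mode(values)      # most frequently occurring value

# Removing the outlier moves the mean far more than the median.
trimmed = [v for v in values if v != 44.0]
mean_shift = mean_value - statistics.mean(trimmed)
median_shift = median_value - statistics.median(trimmed)

print("mean:", mean_value, "median:", median_value, "mode:", mode_value)
print("shift in mean:", mean_shift, "shift in median:", median_shift)
```
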
Descriptive Statistics
Mean Median Mode

➢ Activity # 1- Compute Mean, Mode, Median using the formulae

➢ Activity # 2- Use the “Data Analysis” ToolPak add-in in Excel

Dataset- Cars.csv data (392 Observations, 9 Variables)

Descriptive Statistics
Measure of Spread

➢ Range- The difference between the maximum (largest) and the minimum (lowest) value of a variable

➢ Standard Deviation (σ)- Measures the spread around the mean of a variable. The higher the standard deviation, the higher the spread in the data

➢ Variance- The square of the standard deviation

➢ Interquartile Range (IQR)- The difference between the 75th percentile and the 25th percentile of the variable. The higher the IQR, the higher the spread in the data

Note: The 75th percentile is the level below which 75% of the values of the variable fall
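
As a rough sketch, the four measures of spread can be computed with Python's `statistics` module. The values are hypothetical, and the population forms (`pstdev`, `pvariance`) are used here as an assumption; spreadsheet tools and other packages may default to the sample forms or to a different quartile convention, so their results can differ slightly.

```python
import statistics

# Hypothetical values, already sorted for readability.
values = [15.0, 15.0, 16.0, 17.0, 18.0, 18.0, 18.0, 21.0]

value_range = max(values) - min(values)   # maximum minus minimum
std_dev = statistics.pstdev(values)       # population standard deviation
variance = statistics.pvariance(values)   # population variance = std_dev squared

# quantiles(..., n=4) returns the three cut points Q1, Q2 (median), Q3.
q1, q2, q3 = statistics.quantiles(values, n=4)
iqr = q3 - q1                             # interquartile range

print("range:", value_range)
print("standard deviation:", std_dev, "variance:", variance)
print("Q1:", q1, "Q3:", q3, "IQR:", iqr)
```
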
Descriptive Statistics
Standard Deviation and Variance

➢ Activity # 1- Compute Standard Deviation, Variance, IQR using formulae

➢ Activity # 2- Use the “Data Analysis” ToolPak add-in in Excel

Dataset- Cars.csv data (392 Observations, 9 Variables)

Statistics Introduction
Summarization and Visualization
➢ Cross-Tabulation
➢ Contingency Tables
➢ Box Plots
➢ Histograms
➢ Scatter Plots

Example pivot table:

Row Labels     Average of MPG
American       20.03346939
European       27.60294118
Japanese       30.45063291
(blank)
Grand Total    23.44591837
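
A group-average summary of the kind a pivot table produces can be sketched in a few lines of Python. The rows and the field names `origin` and `mpg` are assumptions made for illustration; they are not the actual contents of Cars.csv.

```python
from collections import defaultdict

# Hypothetical (origin, mpg) rows standing in for the cars data.
rows = [
    ("American", 18.0), ("American", 22.0),
    ("European", 26.0), ("European", 30.0),
    ("Japanese", 31.0), ("Japanese", 29.0),
]

# Accumulate a running [sum, count] pair for each group.
totals = defaultdict(lambda: [0.0, 0])
for origin, mpg in rows:
    totals[origin][0] += mpg
    totals[origin][1] += 1

average_mpg = {origin: s / n for origin, (s, n) in totals.items()}
grand_average = sum(mpg for _, mpg in rows) / len(rows)

for origin, avg in average_mpg.items():
    print(origin, avg)
print("Grand Total", grand_average)
```
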

Descriptive Statistics
Charts and Tables in Excel

➢ Activity # 1- Summarization and Visualization in Excel

➢ Activity # 2- Box plot in R

Dataset- Cars.csv data (392 Observations, 9 Variables)

Statistics Introduction
Probability Distribution Function

➢ A probability density function (pdf) is a function that describes the relative likelihood for a random variable (X) to take on a given value.

➢ For a discrete random variable, the distribution (strictly, a probability mass function) is an assignment of probabilities to each possible outcome
• Each probability should be between 0 and 1
• The sum of all probabilities should be equal to 1

➢ The pdf of a continuous random variable, denoted f(X), should meet the following criteria
• f(X) >= 0
• The total area under f(X) (its integral over all values of X) should be equal to 1
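
The criteria above can be checked numerically. The sketch below is illustrative only: the helper names `is_valid_discrete` and `area_under` are made up for this example, the discrete case uses a fair six-sided die, and the continuous case uses a constant density on the interval [2, 5].

```python
import math

def is_valid_discrete(probabilities, tol=1e-9):
    """Check that each probability is in [0, 1] and that they sum to 1."""
    in_range = all(0.0 <= p <= 1.0 for p in probabilities)
    return in_range and math.isclose(sum(probabilities), 1.0, abs_tol=tol)

def area_under(density, low, high, steps=10_000):
    """Approximate the integral of density over [low, high] (midpoint rule)."""
    width = (high - low) / steps
    return sum(density(low + (i + 0.5) * width) for i in range(steps)) * width

def uniform_density(x):
    """Constant density on [2, 5]: 1 / (5 - 2)."""
    return 1.0 / (5.0 - 2.0)

fair_die = [1 / 6] * 6
die_ok = is_valid_discrete(fair_die)
total_area = area_under(uniform_density, 2.0, 5.0)

print("fair die is a valid distribution:", die_ok)
print("area under the density:", total_area)
```
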

Statistics Introduction
Uniform Distribution

➢ The distribution is often abbreviated U(a,b), with a and b being the minimum and maximum values.
➢ The notation for the uniform distribution is: X ~ U(a,b), where a = the lowest value of x and b = the highest value of x.
➢ If u is a value sampled from the standard uniform distribution, then the value a + (b − a)u follows the uniform distribution parametrized by a and b.
➢ The uniform distribution is useful for sampling from arbitrary distributions.
➢ All outcomes are equally likely (constant probability)
➢ Examples- Each ticket in a lottery draw, each card picked from a deck, each face when rolling a fair die etc.
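
The scaling rule a + (b − a)u can be tried directly. This is a minimal sketch: the function name `sample_uniform`, the endpoints 10 and 20, and the seed are arbitrary choices for illustration.

```python
import random

def sample_uniform(a, b, rng):
    """Turn a standard uniform draw u into a draw from U(a, b)."""
    u = rng.random()              # u is in [0, 1)
    return a + (b - a) * u

rng = random.Random(42)           # fixed seed so the run is repeatable
draws = [sample_uniform(10.0, 20.0, rng) for _ in range(10_000)]

sample_mean = sum(draws) / len(draws)
print("min:", min(draws), "max:", max(draws))
print("sample mean (expected to be near (a + b) / 2 = 15):", sample_mean)
```
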

Probability Distribution Function
Poisson Distribution

➢ The Poisson Distribution is used for “the number of occurrences per unit interval”
➢ Only one parameter, “Lambda” (the average number of occurrences per interval), is required
➢ Examples- The number of traffic tickets issued per month, the number of calls arriving per hour at a call center, the number of typos per page etc.
➢ Excel has a built-in function for this distribution (POISSON.DIST)

Activity:

In a call center, on average 5 calls arrive per hour and this follows a Poisson Distribution.

• What is the probability of receiving exactly 9 calls in an hour?

• What is the probability of receiving 9 or fewer calls in an hour?
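
One way to check the activity outside Excel is to code the Poisson formula P(X = k) = e^(−λ) λ^k / k! directly. The helper names `poisson_pmf` and `poisson_cdf` are made up for this sketch; in Excel the same two quantities should correspond to POISSON.DIST(9, 5, FALSE) and POISSON.DIST(9, 5, TRUE).

```python
import math

def poisson_pmf(k, lam):
    """P(X = k) for a Poisson variable with mean lam."""
    return math.exp(-lam) * lam ** k / math.factorial(k)

def poisson_cdf(k, lam):
    """P(X <= k): add up the probabilities for 0, 1, ..., k."""
    return sum(poisson_pmf(i, lam) for i in range(k + 1))

lam = 5                                # average calls per hour
p_exactly_9 = poisson_pmf(9, lam)      # roughly 0.036
p_at_most_9 = poisson_cdf(9, lam)      # roughly 0.968

print("P(exactly 9 calls):", round(p_exactly_9, 4))
print("P(9 or fewer calls):", round(p_at_most_9, 4))
```
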

Probability Distribution Function
Normal Distribution

➢ Physical quantities that are expected to be the sum of many independent processes (such as measurement errors) often have a distribution very close to normal.
➢ The simplest case of normal distribution, known as the Standard Normal Distribution, has expected value zero and variance one.
➢ For normally distributed data, if the mean and standard deviation are known, then one essentially knows as much as if he or she had access to every point in the data set.
➢ The empirical rule is a handy quick estimate of the spread of the data given the mean and standard deviation of a data set that follows normal distribution: about 68%, 95% and 99.7% of the values lie within 1, 2 and 3 standard deviations of the mean.
➢ The normal distribution is the most used statistical distribution, since normality arises naturally in many physical, biological, and social measurement situations.
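
The empirical rule can be checked from the standard normal cumulative distribution. The sketch below builds that function from `math.erf`; the helper name `standard_normal_cdf` is an assumption for this example.

```python
import math

def standard_normal_cdf(z):
    """P(Z <= z) for a standard normal variable, via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

# Probability of falling within k standard deviations of the mean.
within = {
    k: standard_normal_cdf(k) - standard_normal_cdf(-k)
    for k in (1, 2, 3)
}

for k, p in within.items():
    print(f"within {k} standard deviation(s): {p:.4f}")
```
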
Probability Distribution Function
Normal Distribution

➢ A normal distribution is a symmetric distribution (bell shaped) in which the mean and median are equal. Most data are clustered in the center.
➢ An asymmetrical distribution is said to be positively skewed (or skewed to the right) when the tail on the right side of the histogram is longer than the left side.
➢ An asymmetrical distribution is said to be negatively skewed (or skewed to the left) when the tail on the left side of the histogram is longer than the right side.
➢ Distributions can also be uni-modal, bi-modal, or multi-modal.

Normal Probability Distribution
Z-Scores

➢ A positive z-score represents an observation above the mean, while a negative z-score represents an observation below the mean.
➢ We obtain a z-score through a conversion process known as standardizing or normalizing.
➢ Z-scores are most frequently used to compare a sample to a standard normal deviate (standard normal distribution, with μ = 0 and σ = 1).
➢ While z-scores can be defined without assumptions of normality, they can only be defined if one knows the population parameters.
➢ Z-scores provide an assessment of how off-target a process is operating.
Probability Distribution Function
Normal Distribution

➢ To calculate the area under a normal curve, we use a z-score table.
➢ In a z-score table, the leftmost column gives the number of standard deviations above the mean to 1 decimal place, the top row gives the second decimal place, and the intersection of a row and column gives the probability.
➢ For example, if we want to know the probability that a variable is no more than 0.51 standard deviations above the mean, we select the 6th row down (corresponding to 0.5) and the 2nd column (corresponding to 0.01).
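
The table lookup can be reproduced in code by rebuilding one row of the table. This is a sketch that assumes the usual cumulative table, rounded to four decimal places; the helper names `standard_normal_cdf` and `z_table_row` are made up for illustration.

```python
import math

def standard_normal_cdf(z):
    """P(Z <= z) for a standard normal variable."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def z_table_row(first_decimal):
    """One table row: probabilities for first_decimal + 0.00, 0.01, ..., 0.09."""
    return [
        round(standard_normal_cdf(first_decimal + j / 100), 4)
        for j in range(10)
    ]

row = z_table_row(0.5)       # the row labelled 0.5
probability = row[1]         # 2nd column (0.01), so z = 0.51

print("row for 0.5:", row)
print("P(Z <= 0.51):", probability)
```
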

Probability Distribution Function
Normal Distribution

➢ The random variable of a standard normal distribution (mean = 0, std. dev. = 1) is denoted by Z, instead of X.
➢ Unfortunately, in most cases in which the normal distribution plays a role, the mean is not 0 and the standard deviation is not 1.
➢ Fortunately, one can transform any normal distribution with a certain mean μ and standard deviation σ into a standard normal distribution, by the z-score conversion formula: z = (x − μ) / σ.
➢ Importantly, calculating z requires the population mean and the population standard deviation, not the sample mean or sample standard deviation.

Probability Distribution Function
Normal Distribution

Activity: Heights of American adult males follow a Normal Distribution with a mean of 70 inches and a standard deviation of 2 inches.

➢ What is the probability of randomly selecting a male who is 72 inches tall or less?
➢ What is the probability of randomly selecting a male who is between 73 and 75 inches tall?
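
A possible way to check the activity is to convert each height to a z-score, z = (x − μ) / σ, and read the probability from the standard normal cumulative distribution. The helper name `normal_cdf` is an assumption for this sketch; in Excel, NORM.DIST(72, 70, 2, TRUE) should give the first answer.

```python
import math

def normal_cdf(x, mu, sigma):
    """P(X <= x) for a normal variable with mean mu and std. dev. sigma."""
    z = (x - mu) / sigma                       # standardize to a z-score
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

mu, sigma = 70.0, 2.0                          # values given in the activity

p_72_or_less = normal_cdf(72.0, mu, sigma)     # z = 1.0, roughly 0.841
p_73_to_75 = normal_cdf(75.0, mu, sigma) - normal_cdf(73.0, mu, sigma)  # roughly 0.061

print("P(height <= 72):", round(p_72_or_less, 4))
print("P(73 <= height <= 75):", round(p_73_to_75, 4))
```
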

Frequency Distribution
Frequency Distribution-Skewness

Positively Skewed Distribution
This distribution is said to be positively skewed (or skewed to the right) because the tail on the right side of the histogram is longer than the left side.

Negatively Skewed Distribution
This distribution is said to be negatively skewed (or skewed to the left) because the tail on the left side of the histogram is longer than the right side.

Frequency Distribution
Box Plots
➢ Box Plot: A graphical summary of a numerical data sample through five statistics: median, lower quartile, upper quartile, and some indication of more extreme upper and lower values.

➢ Interquartile Range: The difference between the first and third quartiles; a robust measure of sample dispersion.

➢ Outlier: A value in a statistical sample which does not fit a pattern that describes most other data points; specifically, a value that lies more than 1.5 × IQR beyond the upper or lower quartile
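
The 1.5 × IQR rule can be sketched with Python's `statistics.quantiles`. The data are hypothetical, and the quartile convention is the module's default, which is an assumption here; Excel or R may place the quartiles slightly differently.

```python
import statistics

# Hypothetical sample; 50 is far from the rest of the values.
values = [1, 2, 3, 4, 5, 6, 7, 8, 9, 50]

q1, median, q3 = statistics.quantiles(values, n=4)
iqr = q3 - q1

# Fences: anything beyond them is flagged as an outlier.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = [v for v in values if v < lower_fence or v > upper_fence]

print("Q1:", q1, "median:", median, "Q3:", q3, "IQR:", iqr)
print("fences:", lower_fence, upper_fence)
print("outliers:", outliers)
```
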

Statistics Introduction
Correlation

➢ When y increases as x increases, we say that x and y have a positive association.

➢ Conversely, when y decreases as x increases, we say that they have a negative association.

➢ If both variables are qualitative, we can summarize them in a contingency table.

Statistics Introduction
Coefficient of Correlation

➢ The correlation coefficient was developed by Karl Pearson from a related idea introduced by Francis Galton in the 1880s.
➢ Pearson's correlation coefficient between two variables is defined as the covariance of the two variables divided by the product of their standard deviations.
➢ Pearson's correlation coefficient when applied to a sample is commonly represented by the letter r.
➢ The magnitude of r indicates the strength of the linear relationship between x and y.
➢ Values of r close to -1 or to +1 indicate a stronger linear relationship between x and y; values close to 0 indicate a weak linear relationship.
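
The definition above (covariance divided by the product of the standard deviations) translates directly into code. The function name `pearson_r` and the weight and MPG values are made up for illustration; they are not taken from the cars data.

```python
import math

def pearson_r(xs, ys):
    """Sample Pearson correlation: cov(x, y) / (sd(x) * sd(y))."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs) / (n - 1))
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys) / (n - 1))
    return cov / (sd_x * sd_y)

# Hypothetical data: heavier cars tend to have lower MPG.
weight = [2000, 2500, 3000, 3500, 4000]
mpg = [34, 30, 25, 21, 17]

r = pearson_r(weight, mpg)
print("r between weight and MPG:", round(r, 4))   # strongly negative
```
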

Statistics Introduction
Coefficient of Correlation

➢ Activity: Use the cars data and find the correlation among all numeric variables.
➢ Which variable is the most correlated with Miles Per Gallon?
➢ Use both the formula (CORREL) and the “Data Analysis” ToolPak

Appendix
References

➢ www.Wikipedia.org
➢ www.boundless.com
