Basics of Statistics For Analytics Using SAS/ Excel

Contents: Basic Statistics for Analytics
Statistics
➢ Data and Variables
➢ Types of Variables
➢ Population and Sample
➢ Types of Data Analysis
Descriptive Statistics
➢ Measure of Central Tendency
➢ Measure of Spread
➢ Summarization & Visualization Tools
Probability Distribution
➢ Uniform Distribution
➢ Poisson Distribution
➢ Normal Distribution
➢ Frequency Distribution
➢ Correlation and Covariance
Statistics Introduction
Data
“Data (/ˈdeɪtə/ day-tə, /ˈdætə/ da-tə, or /ˈdɑːtə/ dah-tə) is a set of values
of qualitative or quantitative variables; restated, pieces of data are individual
pieces of information. Data is measured, collected and reported, and
analysed, whereupon it can be visualized using graphs or images. Data as a
general concept refers to the fact that some existing information or
knowledge is represented or coded in some form suitable for better usage or
processing.
Raw data, i.e. unprocessed data, is a collection of numbers, characters; data
processing commonly occurs by stages, and the "processed data" from one
stage may be considered the "raw data" of the next.”
- Wikipedia
What Are Variables?
Types Of Variables
➢ Continuous- Can take any value within a permitted range.
Examples- Height, Weight, Sales, Unemployment Rate
➢ Discrete- Can take only whole-number values within a permitted range.
Example- Number of cars owned by a family
➢ Ordinal- Logically ordered categorical values.
Example- Small, Medium, Large T-shirt sizes
➢ Nominal- Categorical values with no logical order.
Examples- Gender, Nationality, Religion, Language
Population and Sample
Population- The entire group.
Sample- A portion of the group.
➢ For practical reasons, a chosen subset of the population, called a sample, is
studied, as opposed to compiling data about the entire group (an operation
called a census).
➢ Descriptive statistics summarizes the population data by describing what was
observed in the sample, numerically or graphically.
➢ Inferential statistics uses patterns in the sample data to draw inferences about
the population represented, accounting for randomness.
➢ To use a sample as a guide to an entire population, it is important that it
truly represents the overall population.
Types of Data Analysis
Types of analysis: Univariate, Bivariate, Multivariate
➢ Univariate- One variable is analyzed at a time. The objective is to describe
the variable. Example- How many students are graduating with an
“Analytics” degree?
➢ Bivariate- Two variables are analyzed together for any possible association
or empirical relationship. Example- What is the association between
“Gender” and graduation with an “Analytics” degree?
➢ Multivariate- More than two variables are analyzed together for any possible
association or interactions. Example- What is the association between
“Gender”, “Country of Residence” and graduation with an “Analytics” degree?
Central Tendency
➢ Mean or Average- One of the most effective measures of the “center” of the
data. This is simply the arithmetic average of the values of a variable.
➢ While the arithmetic mean is often used to report central tendency, it is
not a robust statistic, meaning that it is greatly influenced by outliers
(values that are very much larger or smaller than most of the values).
➢ Mode- The most frequently occurring value of a variable.
➢ Median- The middle value (50th percentile) of a variable in an ordered
dataset. It is more robust to outliers than the mean.
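The three measures of center can be sketched in a few lines of Python. This is only an illustration: the values below are made up (they are not from Cars.csv), and the standard-library `statistics` module stands in for the Excel formulae used in the activity.

```python
import statistics

# Hypothetical values, chosen only to show the effect of one outlier.
values = [18, 20, 20, 22, 25]
with_outlier = values + [100]

mean_value = statistics.mean(values)      # arithmetic average
median_value = statistics.median(values)  # middle value of the sorted data
mode_value = statistics.mode(values)      # most frequently occurring value

# Adding a single extreme value moves the mean far more than the median.
mean_shifted = statistics.mean(with_outlier)
median_shifted = statistics.median(with_outlier)

print("mean:", mean_value, "median:", median_value, "mode:", mode_value)
print("with outlier -> mean:", mean_shifted, "median:", median_shifted)
```

In this made-up sample the outlier pulls the mean from 21 to about 34, while the median only moves from 20 to 21, which is what "robust to outliers" means in practice.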
Mean, Median, Mode
➢ Activity # 1- Compute the Mean, Mode and Median using the formulae
➢ Activity # 2- Use the “Data Analysis” tool (Analysis ToolPak add-in) in Excel
Dataset- Cars.csv data (392 Observations, 9 Variables)
Measure of Spread
➢ Range- The difference between the maximum (largest) and the minimum
(lowest) value of a variable.
➢ Standard Deviation (σ)- Measures the spread around the mean of a variable.
The higher the standard deviation, the higher the spread in the data.
➢ Variance- The square of the standard deviation.
➢ Interquartile Range (IQR)- The difference between the 75th percentile and
the 25th percentile of the variable. The higher the IQR, the higher the spread
in the data.
Note: The 75th percentile is the level below which 75% of the values of the variable fall.
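A minimal sketch of the four measures of spread, again in Python with made-up values rather than the Cars.csv data. It assumes the population versions of standard deviation and variance (`pstdev`, `pvariance`); the sample versions are `statistics.stdev` and `statistics.variance`. Quartile conventions differ between tools, so the IQR from this sketch may not match Excel exactly.

```python
import statistics

# Hypothetical values, for illustration only.
values = [2, 4, 4, 4, 5, 5, 7, 9]

value_range = max(values) - min(values)   # maximum minus minimum
std_dev = statistics.pstdev(values)       # population standard deviation
variance = statistics.pvariance(values)   # population variance = std_dev squared

# quantiles(n=4) returns the three cut points: Q1, median, Q3.
q1, q2, q3 = statistics.quantiles(values, n=4)
iqr = q3 - q1                             # 75th percentile minus 25th percentile

print("range:", value_range)
print("std dev:", std_dev, "variance:", variance)
print("Q1:", q1, "Q3:", q3, "IQR:", iqr)
```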
Standard Deviation and Variance
➢ Activity # 1- Compute the Standard Deviation, Variance and IQR using formulae
➢ Activity # 2- Use the “Data Analysis” tool (Analysis ToolPak add-in) in Excel
Summarization and Visualization
➢ Cross-Tabulation
➢ Contingency Tables
➢ Box Plots
➢ Histograms
➢ Scatter Plots

Row Labels     Average of MPG
American       20.03346939
European       27.60294118
Japanese       30.45063291
(blank)
Grand Total    23.44591837
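The "Average of MPG" summary by row label is a grouped average, which a pivot table computes automatically. A rough Python equivalent is sketched below. The rows are hypothetical and are not taken from Cars.csv, so the numbers will not match the table above; `origin` and `mpg` are assumed column names.

```python
from collections import defaultdict

# Hypothetical (origin, mpg) rows; NOT the real Cars.csv data.
rows = [
    ("American", 18.0),
    ("American", 22.0),
    ("European", 26.0),
    ("Japanese", 31.0),
    ("Japanese", 29.0),
]

totals = defaultdict(float)  # running sum of mpg per origin
counts = defaultdict(int)    # number of rows per origin
for origin, mpg in rows:
    totals[origin] += mpg
    counts[origin] += 1

# One average per group, plus the grand average over all rows.
average_mpg = {origin: totals[origin] / counts[origin] for origin in totals}
grand_average = sum(mpg for _, mpg in rows) / len(rows)

for origin, avg in average_mpg.items():
    print(origin, avg)
print("Grand Total", grand_average)
```

Note that the grand average is computed over all rows, not as the average of the group averages, which is also how the pivot table's "Grand Total" row behaves.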
Chart and Tables in Excel
➢ Activity # 1- Summarization and Visualization in Excel
➢ Activity # 2- Box plot in R
Probability Distribution Function
➢ A probability density function (pdf) is a function that describes the
relative likelihood for a random variable (X) to take on a given value.
➢ The pdf of a discrete random variable (more precisely, its probability mass
function) is an assignment of probabilities to each possible outcome
• Each probability should be between 0 and 1
• The sum of all probabilities should be equal to 1
➢ The pdf of a continuous random variable, denoted f(X), should meet the
following criteria
• f(X) >= 0
• The total area under f(X) (its integral over all values of X) should be
equal to 1
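These conditions can be checked numerically. The sketch below uses two assumed examples that are not from the slides: a fair six-sided die for the discrete case, and a uniform density on [0, 2] for the continuous case, with the area approximated by a simple midpoint sum.

```python
from fractions import Fraction

# Discrete case (assumed example): a fair six-sided die.
pmf = {face: Fraction(1, 6) for face in range(1, 7)}
all_in_range = all(0 <= p <= 1 for p in pmf.values())
total_probability = sum(pmf.values())  # must equal 1


# Continuous case (assumed example): density of a uniform variable on [0, 2].
def uniform_density(x, a=0.0, b=2.0):
    """f(x) = 1 / (b - a) inside [a, b], and 0 outside."""
    return 1.0 / (b - a) if a <= x <= b else 0.0


# Approximate the area under f(x) on [0, 2] with a midpoint Riemann sum.
steps = 1000
width = 2.0 / steps
area = sum(uniform_density((i + 0.5) * width) * width for i in range(steps))

print("sum of die probabilities:", total_probability)
print("area under the density:", area)
```

For the continuous case it is the area, not the height of f(x), that behaves like a probability; a density can exceed 1 on a narrow range and still have total area 1.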
Uniform Distribution
➢ The distribution is often abbreviated U(a,b), with a and b being the
minimum and maximum values.
➢ The notation for the uniform distribution is: X ~ U(a,b), where a = the
lowest value of x and b = the highest value of x.
➢ If u is a value sampled from the standard uniform distribution, then the
value a + (b − a)u follows the uniform distribution parametrized by a and b.
➢ The uniform distribution is useful for sampling from arbitrary
distributions.
➢ All outcomes are equally likely (constant probability).
➢ Examples- Winners of a lottery, picking a “Heart” from a deck of cards,
rolling a “6” on a die, etc.
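The rescaling a + (b − a)u is easy to try out. In this sketch, the function name `sample_uniform`, the bounds 10 and 20, and the seed are all arbitrary choices made for illustration.

```python
import random


def sample_uniform(a, b, rng):
    """Draw from U(a, b) by rescaling a standard uniform draw u."""
    u = rng.random()        # standard uniform value in [0, 1)
    return a + (b - a) * u  # shift and stretch onto [a, b)


rng = random.Random(42)     # fixed seed so the run is repeatable
samples = [sample_uniform(10.0, 20.0, rng) for _ in range(10_000)]

sample_mean = sum(samples) / len(samples)
print("min:", min(samples), "max:", max(samples))
print("sample mean (expected to be near 15):", sample_mean)
```

Python's `random.uniform(a, b)` does the same job directly; the explicit version is shown only to mirror the formula.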
Poisson Distribution
➢ The Poisson Distribution is used for “the number of occurrences per unit
interval”.
➢ Only one parameter, “Lambda” (the average number of occurrences per
interval), is required.
➢ Examples- The number of traffic tickets issued per month, the number of
calls arriving per hour at a call center, the number of typos per page, etc.
➢ Excel has a built-in function for this distribution.
Activity:
In a call center, on average 5 calls arrive per hour and this follows a
Poisson Distribution.
• What is the probability of receiving exactly 9 calls in an hour?
• What is the probability of receiving 9 or fewer calls in an hour?
Normal Distribution
➢ Physical quantities that are expected to be the sum of many independent
processes (such as measurement errors) often have a distribution very close
to normal.
➢ The simplest case of a normal distribution, known as the Standard Normal
Distribution, has expected value zero and variance one.
➢ For a normal distribution, if the mean and standard deviation are known,
then one essentially knows as much as if one had access to every point in
the data set.
➢ The empirical rule is a handy quick estimate of the spread of the data,
given the mean and standard deviation of a data set that follows a normal
distribution.
➢ The normal distribution is the most used statistical distribution, since
normality arises naturally in many physical, biological, and social
measurement situations.
➢ A normal distribution is a symmetric distribution (bell shaped) in which
the mean and median are equal. Most data are clustered in the center.
➢ An asymmetrical distribution is said to be positively skewed (or skewed to
the right) when the tail on the right side of the histogram is longer than
the left side.
➢ An asymmetrical distribution is said to be negatively skewed (or skewed to
the left) when the tail on the left side of the histogram is longer than the
right side.
➢ Distributions can also be unimodal, bimodal, or multimodal.
Normal Probability Distribution
Z-Scores
➢ A positive z-score represents an observation above the mean, while a
negative z-score represents an observation below the mean.
➢ We obtain a z-score through a conversion process known as standardizing or
normalizing.
➢ Z-scores are most frequently used to compare a sample to a standard normal
deviate (standard normal distribution, with μ = 0 and σ = 1).
➢ While z-scores can be defined without assumptions of normality, they can
only be defined if one knows the population parameters.
➢ Z-scores provide an assessment of how off-target a process is operating.
➢ To calculate the area under a normal curve, we use a z-score table.
➢ In a z-score table, the leftmost column gives the number of standard
deviations above the mean to 1 decimal place, the top row gives the second
decimal place, and the intersection of a row and column gives the
probability.
➢ For example, if we want to know the probability that a variable is no more
than 0.51 standard deviations above the mean, we select the 6th row down
(corresponding to 0.5) and the 2nd column (corresponding to 0.01).
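The table lookup can be cross-checked numerically. This is a sketch, not part of the original activity: it relies on the standard identity Φ(z) = ½(1 + erf(z/√2)) for the standard normal cumulative probability, and `standard_normal_cdf` is a name chosen here.

```python
import math


def standard_normal_cdf(z):
    """P(Z <= z) for a standard normal variable, via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


# Probability of being no more than 0.51 standard deviations above the mean.
p = standard_normal_cdf(0.51)
print("P(Z <= 0.51):", round(p, 4))
```

The result is approximately 0.6950, which is the entry found at row 0.5 and column 0.01 of a standard cumulative z-table.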
➢ The random variable of a standard normal distribution (mean = 0,
std. dev. = 1) is denoted by Z, instead of X.
➢ Unfortunately, in most cases in which the normal distribution plays a role,
the mean is not 0 and the standard deviation is not 1.
➢ Fortunately, one can transform any normal distribution with a certain mean μ
and standard deviation σ into a standard normal distribution, by the z-score
conversion formula.
➢ Importantly, calculating z requires the population mean and the population
standard deviation, not the sample mean or sample standard deviation.
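The conversion formula referred to here is z = (x − μ) / σ. A small sketch follows, with hypothetical values for μ, σ and x, showing that a probability computed on the original scale matches the one computed on the standard scale after conversion. `NormalDist` comes from Python's standard `statistics` module.

```python
from statistics import NormalDist


def z_score(x, mu, sigma):
    """Number of standard deviations that x lies above (+) or below (-) mu."""
    return (x - mu) / sigma


# Hypothetical population parameters and observation, for illustration only.
mu, sigma = 100.0, 15.0
x = 130.0

z = z_score(x, mu, sigma)

# The same probability, computed on the original and on the standard scale.
p_original = NormalDist(mu, sigma).cdf(x)
p_standard = NormalDist(0, 1).cdf(z)

print("z-score:", z)
print("P(X <= x):", p_original, "P(Z <= z):", p_standard)
```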
Activity: Heights of American adult males follow a Normal Distribution with
a mean of 70 inches and a standard deviation of 2 inches.
➢ What is the probability of randomly selecting a male of 72 inches or less?
➢ What is the probability of randomly selecting a male between 73 and 75
inches tall?
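As a cross-check on the z-table answers, the activity can be sketched with `statistics.NormalDist` from Python's standard library. In Excel, `NORM.DIST(72, 70, 2, TRUE)` should give the first probability.

```python
from statistics import NormalDist

# Heights ~ Normal(mean = 70 inches, std. dev. = 2 inches), from the activity.
heights = NormalDist(mu=70, sigma=2)

# P(height <= 72): 72 inches is one standard deviation above the mean (z = 1).
p_72_or_less = heights.cdf(72)

# P(73 <= height <= 75): the difference of two cumulative probabilities
# (z = 2.5 and z = 1.5).
p_73_to_75 = heights.cdf(75) - heights.cdf(73)

print("P(72 inches or less):", round(p_72_or_less, 4))
print("P(between 73 and 75 inches):", round(p_73_to_75, 4))
```

The two probabilities should come out at about 0.84 and about 0.06.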
Frequency Distribution
Frequency Distribution-Skewness
Positively Skewed Distribution
This distribution is said to be positively skewed (or skewed to the right)
because the tail on the right side of the histogram is longer than the left
side.
Negatively Skewed Distribution
This distribution is said to be negatively skewed (or skewed to the left)
because the tail on the left side of the histogram is longer than the right
side.
Frequency Distribution
Box Plots
➢ Box Plot: A graphical summary of a numerical data sample through five
statistics: median, lower quartile, upper quartile, and some indication of
more extreme upper and lower values.
➢ Interquartile Range: The difference between the first and third quartiles;
a robust measure of sample dispersion.
➢ Outlier: A value in a statistical sample which does not fit a pattern that
describes most other data points; specifically, a value that lies more than
1.5 × IQR beyond the upper or lower quartile.
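The 1.5 × IQR rule can be sketched as follows. The data are made up, and the quartiles use the default method of Python's `statistics.quantiles`, which may differ slightly from the convention Excel or R uses when drawing a box plot.

```python
import statistics

# Hypothetical sample; 45 is deliberately far from the rest of the values.
values = [12, 14, 15, 15, 16, 17, 18, 19, 20, 45]

q1, median, q3 = statistics.quantiles(values, n=4)
iqr = q3 - q1

# Fences: anything beyond 1.5 * IQR from the quartiles is flagged as an outlier.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = [v for v in values if v < lower_fence or v > upper_fence]

print("Q1:", q1, "median:", median, "Q3:", q3, "IQR:", iqr)
print("fences:", lower_fence, upper_fence)
print("outliers:", outliers)
```

Only the value 45 falls outside the fences in this sample, so it is the point a box plot would draw separately from the whiskers.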
Correlation
➢ When y increases as x increases, we say that x and y have a positive
association.
➢ Conversely, when y decreases as x increases, we say that they have a
negative association.
➢ If both variables are qualitative, we can summarize them in a contingency
table.
Coefficient of Correlation
➢ The correlation coefficient was developed by Karl Pearson from a related
idea introduced by Francis Galton in the 1880s.
➢ Pearson's correlation coefficient between two variables is defined as the
covariance of the two variables divided by the product of their standard
deviations.
➢ Pearson's correlation coefficient when applied to a sample is commonly
represented by the letter r.
➢ The size of the correlation r indicates the strength of the linear
relationship between x and y.
➢ Values of r close to -1 or to +1 indicate a stronger linear relationship
between x and y.
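The definition above (covariance divided by the product of the standard deviations) translates directly into code. In this sketch the function name `pearson_r` and the weight/MPG pairs are hypothetical; the numbers are not from Cars.csv. Excel's `CORREL` function computes the same quantity.

```python
import math


def pearson_r(xs, ys):
    """Sample covariance of xs and ys divided by the product of their sample std devs."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs) / (n - 1))
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys) / (n - 1))
    return covariance / (sd_x * sd_y)


# Hypothetical pairs: heavier cars with lower mileage (negative association).
weight = [2000, 2500, 3000, 3500, 4000]
mpg = [34, 30, 25, 21, 17]

r = pearson_r(weight, mpg)
print("r between weight and mpg:", round(r, 3))
```

Here r comes out close to -1: the sign shows the direction of the association, and the magnitude shows how nearly linear it is.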
Coefficient of Correlation
➢ Activity: Use the cars data and find the correlation among all numeric
variables.
➢ Which variable is the most correlated with Miles Per Gallon?
➢ Use both the formula (CORREL) and the “Data Analysis” tool (Analysis
ToolPak add-in).
Appendix
References
➢ www.Wikipedia.org
➢ www.boundless.com
