Quick reference of Excel functions and topics covered below:
MODE: MODE.SNGL()
MEDIAN: MEDIAN()
QUARTILE: QUARTILE.INC()
INTERQUARTILE RANGE: QUARTILE.INC(data array, 3) - QUARTILE.INC(data array, 1)
Covariance: COVARIANCE.S() (if the data is a sample of the population) / COVARIANCE.P() (if the data is the total population). Range: -infinity to +infinity
Correlation: CORREL(). Range: -1 to +1
Normal distribution: =NORM.DIST(x, mean, std, TRUE). It gives the area to the left of the curve, i.e. the probability of being less than a certain value. To get the area to the right of the curve (higher than a certain value), use 1 - NORM.DIST(x, mean, std, TRUE), since the whole probability is 1.
CONFIDENCE.T: =CONFIDENCE.T(α, s, n)
Difference-in-means hypothesis testing: create a pivot table, then Data Analysis > t-Test (Paired Two Sample for Means, or Two-Sample Assuming Equal/Unequal Variances)
Measures of dispersion
Calculating the t value (for a population proportion this is done manually)
Linear regression
R-squared: a value of 0.X implies that the regression model is able to explain about X percent of the variation, or changes, in the unit sales of the toy; the remaining variation goes unexplained. R-squared will always increase when new independent variables are added.
Adjusted R-squared: the adjusted R-squared will only increase if the added X variable makes the model better. It may go down if the added X variable does not make sense.
Hypothesis testing of linear regression: use the table created by Data Analysis (upper and lower values, p-value)
Multicollinearity
Mean centering
These are a set of numbers that describe data. A data set may have many observations, and a summary
set of numbers that describes those multiple observations is called descriptive statistics.
There are many descriptive statistics. The more important and commonly used among them can be
categorized broadly into two categories: measures of central tendency, also known as the various
averages of the data, and measures of dispersion. The former tries to capture some central aspect
of the data, while measures of dispersion summarize how spread out or dispersed the data is.
Measures of central tendency: There are three important averages or measures of central tendency used
to summarize data: the mean, the median, and the mode.
Mean: "mean" most of the time refers to the arithmetic mean, though there are other means as well,
such as the geometric mean or the harmonic mean.
Median: the median of a set of ordered observations is the middle number that divides the data into two
parts.
When should you report the median and when should you report the mean?
The mean is influenced to a greater extent by extreme observations (values far higher or lower than
most of the data). So if you notice extreme observations in your data, then perhaps the median is a
better summary of the data than the mean.
The relationship between the mean and the median relates to the skewness of the data: if the mean is
higher than the median, the data is skewed to the right.
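A quick sketch of why an extreme observation pulls the mean but barely moves the median (the data values here are made up for illustration):

```python
# Made-up salary data to show how an outlier affects mean vs median.
from statistics import mean, median

salaries = [40, 42, 45, 47, 50]           # fairly symmetric data
print(mean(salaries), median(salaries))   # 44.8 and 45: close together

salaries_with_outlier = salaries + [500]  # one extreme observation
print(mean(salaries_with_outlier))        # mean jumps to ~120.67
print(median(salaries_with_outlier))      # median moves only to 46
```

The mean roughly triples while the median shifts by one unit, which is why the median is preferred when extreme observations are present.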
Mode: The Excel command to calculate the mode is MODE.SNGL(). The mode is not a very relevant statistic
when the data is essentially continuous. For example, consider the daily exchange rate between the US
dollar and the euro in a particular month. The mode is not very relevant because the nature of the data
is such that no value occurs more than once. Even if a value does occur more than once, the likelihood of
such an occurrence is low. Thus, little information is gained by knowing the mode.
Dispersion or spread in the data: How does one translate dispersion into a meaningful
descriptive statistic?
One way is to calculate the range of the data, which is simply the difference between the maximum
and minimum values in the data.
Another way is the interquartile range, or IQR. This defines the middle 50% of the data, leaving 25% of
the data to the right and 25% to the left. The median, incidentally, is the second quartile. The
minimum value in the data is the zeroth quartile, and the maximum value is the fourth quartile.
Finally, the interquartile range, or IQR, is the third quartile minus the first quartile.
QUARTILE.INC() takes your data array and the particular quartile (zeroth, first, second, third, or fourth)
that you wish to calculate. The interquartile range can then be calculated in Excel as the difference
between the third and the first quartiles.
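As a sketch, the same quartiles and IQR can be computed in Python; `statistics.quantiles` with `method='inclusive'` uses the same interpolation rule as Excel's QUARTILE.INC (the data values here are made up):

```python
# Quartiles and interquartile range, mirroring Excel's QUARTILE.INC().
from statistics import quantiles

data = [2, 4, 4, 5, 7, 8, 9, 11, 12, 15]
q1, q2, q3 = quantiles(data, n=4, method='inclusive')
iqr = q3 - q1          # third quartile minus first quartile
print(q1, q2, q3, iqr)  # 4.25 7.5 10.5 6.25
```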
Standard deviation: If we have data which is a sample from some larger population, which typically is
the case in the majority of business applications, we use the Excel command STDEV.S to calculate the
standard deviation, rather than STDEV.P, which is used to calculate the standard deviation of the whole
population. The S and P stand for sample and population.
Box plot:
Rule of thumb: It says that approximately 68% of data lie within one standard deviation, and
approximately 95% lie within 2 standard deviations from the mean.
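The rule of thumb above can be checked against the standard normal CDF, a minimal sketch:

```python
# Verify the 68/95 rule of thumb with the standard normal distribution.
from statistics import NormalDist

z = NormalDist()  # standard normal: mean 0, standard deviation 1
within_1_sd = z.cdf(1) - z.cdf(-1)
within_2_sd = z.cdf(2) - z.cdf(-2)
print(round(within_1_sd, 4))  # ~0.6827, i.e. approximately 68%
print(round(within_2_sd, 4))  # ~0.9545, i.e. approximately 95%
```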
These measures are called measures of association; they describe, in a set of numbers, the
relationship between two or more variables.
# A positive answer means there is a positive relationship, and a negative answer means there is a
negative relationship.
# Covariance is affected by the units of the variables. So the measure is fine as long as we only need
to know the direction of the relationship, that is, when one variable increases or decreases, what
happens to the other variable. But the covariance cannot be directly interpreted in terms of how
strong the relationship is.
# The covariance measure can theoretically vary between negative infinity and positive infinity. A
positive value of the covariance indicates a positive relationship between the two variables: if one
increases, the other increases, and if one decreases, the other decreases. A negative value indicates
a relationship where, when one variable increases, the other decreases, and vice versa.
Correlation: the correlation is not affected by a change of units, and it is always bounded between -1
and +1, with a positive value of the correlation indicating a positive relationship and a negative value
indicating a negative relationship. Further, the closer the correlation is to +1 or -1, the stronger the
positive or negative relationship between the two variables. Loosely speaking, a correlation in excess
of +0.5 is considered a strong positive relationship, and a correlation less than -0.5 is considered a
strong negative relationship between the two variables. The Excel function is CORREL().
Causation: the covariance and correlation tell us how two variables vary with respect to each
other. That is, if one variable increases or decreases, how is that related to the increase or decrease
of the other variable? We looked at the heights and weights of certain Olympic athletes and concluded
that there was a strong positive correlation between the weight of an athlete and his or her height.
This correlation, however, in no way proves that weight causes height or that height causes weight.
This is where causation comes in, which is a concept distinct from correlation. Unfortunately,
many times this distinction is glossed over.
Usually, to establish causation we need to have correlation and then a temporal distinction between
the two variables. That is, the variable causing the other variable needs to precede, or occur before,
the variable that it causes. Further, we need to rule out, or control for, the many other variables
that might be causing the variable in question.
Random experiment: any situation wherein a process leads to more than one possible outcome.
Examples:
A coin toss
The roll of a die
Random variable: a random variable is a variable that takes on values determined by the outcome of a
random experiment.
Statistical Distributions
Discrete distribution: a statistical distribution used for discrete (countable) data. Example: the
number of grains of sand in a bucket.
Continuous distribution: a statistical distribution used for continuous data. Examples: height, the
distance traveled in a road trip, the amount of water in a bucket. (These are continuous because there
is an infinite number of possible values: a distance can be 1 km, 1.1 km, 1.111 km, etc., limited only
by the measuring instrument.)
Discrete Data
Test of Discreteness
It is common in business applications to use a continuous distribution such as the normal (the bell
curve) even for discrete data, because the normal is the most well-understood distribution.
Normal distribution: one of the most popular statistical distributions is the normal distribution, which
importantly happens to be a continuous distribution. However, as mentioned earlier, it is often
acceptable in business applications to approximate even discrete data with a continuous distribution
such as the normal.
Probability mass function (PMF): a PMF is a rule that assigns probabilities (values between 0 and 1)
to the various possible values that a discrete random variable takes. Example: a coin toss, 0.5 for
heads and 0.5 for tails, for a total of 1.
Probability density function (PDF): a PDF is the analogous rule for a continuous random variable. It
assigns probability densities rather than probabilities; probabilities are obtained only for ranges of
values, as areas under the density curve.
Height is continuous data, as discussed in the previous lesson, because if you take any two heights,
for example 5' and 6', the possible heights that can occur between 5' and 6' are infinite. If
you then ask, what is the probability of someone's height being exactly 5'2"? The answer is 0, because
even if your friend has a height of 5'2", my response will be that if you get a better measuring
instrument, you will find that the height is not 5'2" but, say, 5'2.01". This kind of argument can be
made for any height that you come up with, implying that the probability of someone having one exact
height is always 0. This is the reason why, when using a continuous distribution, we always consider
ranges of outcomes. For example: what is the probability that someone's height is between 5'2" and
5'5"? A probability for a range of outcomes. What is the probability that someone's height is less
than 5'? Again, a range of outcomes.
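The range-of-outcomes idea can be sketched with the normal CDF; `NormalDist(mu, sigma).cdf(x)` plays the role of Excel's NORM.DIST(x, mean, std, TRUE). The population mean and standard deviation for heights here are assumed values:

```python
# Probabilities of ranges under an assumed normal model for heights (cm).
from statistics import NormalDist

heights = NormalDist(mu=170, sigma=8)  # assumed population parameters

p_less_than_160 = heights.cdf(160)               # P(height < 160): left tail
p_more_than_180 = 1 - heights.cdf(180)           # right tail = 1 - left tail
p_between = heights.cdf(180) - heights.cdf(160)  # probability of a range
print(round(p_less_than_160 + p_between + p_more_than_180, 6))  # total is 1.0
```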
Ques: What is the probability that on a particular day the demand for falafel
sandwiches is less than 300 at the restaurant? Ans: 0.4098
Ques: If the restaurant stocks 400 falafel sandwiches for a given day, what
is the probability that it will run out of these sandwiches on that day?
Ans: 0.0635
Another problem:
John can take either of two roads to the airport from his home (Road A or Road B). Owing to varying
traffic conditions, the travel times on the two roads are not fixed; rather, on a Friday around
midday the travel time distributions are as follows.
Road A: = 0.0912
Road B: = 0.1586
Each sample mean is normally distributed, and the mean of the sample means is the population mean.
Bernoulli process: multiple trials of the Bernoulli process.
Game of dice: win under a given outcome; lose (otherwise).
The binomial distribution (one of the two most commonly used discrete distributions): the number of
successes in n independent Bernoulli trials, where the probability of success in each trial is p and
the probability of failure is 1 - p.
Poisson distribution:
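The binomial idea above can be sketched with its PMF; the dice-game numbers here (10 rolls, "success" = rolling a six) are assumed for illustration:

```python
# Binomial PMF: probability of exactly k successes in n Bernoulli trials.
from math import comb

def binomial_pmf(k, n, p):
    # comb(n, k) ways to place the successes, each with prob p^k * (1-p)^(n-k)
    return comb(n, k) * p**k * (1 - p)**(n - k)

p_six = 1 / 6  # probability of "success" (rolling a six) on one trial
print(binomial_pmf(2, 10, p_six))                          # P(exactly 2 sixes)
print(sum(binomial_pmf(k, 10, p_six) for k in range(11)))  # PMF sums to 1
```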
T distribution: the t distribution, just like the standard normal, is a symmetric distribution centered
at zero; that is, it has a mean equal to zero. The spread, or standard deviation, of this distribution
depends on a single parameter called the degrees of freedom, commonly denoted df. As the degrees of
freedom increase, the t distribution becomes closer and closer to the standard normal distribution.
The t distribution, unlike the normal distribution, does not have any stand-alone business applications.
Rather, it is used as a tool in the calculations for confidence intervals and hypothesis testing, which
we will introduce in subsequent lessons.
Degrees of freedom is the parameter of the t distribution that uniquely identifies one t distribution
from another, just as, in the case of a normal distribution, the combination of two parameters, the
mean and the standard deviation, uniquely identifies one normal distribution from another.
The degrees of freedom is linked to the size of the data being used. So a t distribution using a larger
set of data has a greater degrees of freedom than a t distribution using a smaller set of data.
Example: the average starting salary of all business students who graduated last year in New York City.
When the population standard deviation is unknown, the sample standard deviation is used in place of
the population standard deviation. The statistic is then converted to a t distribution with (n - 1)
degrees of freedom.
Confidence interval calculation: a random sample of 20 observations from a population had a mean
equal to 70. The standard deviation of the population is 10. Find an 85% confidence interval for the
population mean.
The probability outside the confidence interval is referred to as α, and we wish to construct a (1 - α)
confidence interval for the population mean.
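A sketch of the example above (n = 20, sample mean 70, known population standard deviation 10, 85% confidence, so α = 0.15). With σ known we use the z (standard normal) critical value:

```python
# 85% confidence interval for the population mean with known sigma.
from math import sqrt
from statistics import NormalDist

n, xbar, sigma, alpha = 20, 70, 10, 0.15
z_crit = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided critical value
margin = z_crit * sigma / sqrt(n)
print(xbar - margin, xbar + margin)  # roughly (66.78, 73.22)
```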
63
For a population proportion/percentage (we have to do this manually):
The sample size, n, will be largest, and hence a conservative estimate, if the underlying expression
in the calculation, p-hat times (1 minus p-hat), is at its maximum. For this expression to be
maximum, p-hat has to be 0.5, or 50%.
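A sketch of why planning with p-hat = 0.5 is conservative: it maximizes p-hat(1 - p-hat) and therefore the required sample size. The 95% confidence level and 3% margin of error used here are assumed values:

```python
# Required sample size for estimating a proportion: n = (z/E)^2 * p(1-p).
from statistics import NormalDist

z = NormalDist().inv_cdf(0.975)  # 95% two-sided critical value, ~1.96

def sample_size(p_hat, margin):
    return (z / margin) ** 2 * p_hat * (1 - p_hat)

print(sample_size(0.5, 0.03))  # largest (conservative) n, ~1067
print(sample_size(0.3, 0.03))  # smaller n when p_hat is away from 0.5
```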
Hypothesis testing:
EXAMPLE:
We wish to test a claim that the average age of Men MBA students across various MBA programs in the
US is greater than 28 years. For this we collect data on average ages of men MBA students across a
sample of 40 MBA programs in the US.
Switching the order of the two samples in the hypothesis will not have any effect on the result.
Hypothesis testing for a population proportion: we will be using the z-statistic, unlike for the
population mean, where we have been using the t-statistic. All hypothesis tests involving a
population proportion use the z-statistic.
Example:
A medium-sized university in the US introduces a new lunch facility on campus on a trial basis. The
university operates the lunch facility for a few months and then decides to survey the student body.
Based on the survey, the university will either make the facility a permanent fixture or do away with
it. Specifically, if 70% or more of the student body approves of it, the facility will be made
permanent; otherwise it will be shut down. The university surveys 750 randomly selected students on
campus and finds that 510 of these students (68% of the sampled students) approve of the new facility
and the remaining 240 students (32%) do not. Based on the criteria set by the university, should the
facility be made permanent?
Error: Type 1 error & type 2 error
Example:
Your friend Sam claims that he can shoot 40 or more baskets in an hour from the 3-point line in a
Basketball court. So, Sam is making a claim about a population parameter, in this case it is his true
shooting ability from the 3-point line in a Basketball court. This can be likened to the population mean
mu. Thus Sam is claiming that the population mean mu of his shooting ability is greater than or equal to
40 baskets in an hour from the 3-point line in a Basketball court.
Example: An empirical study using data on heights of people claimed that the average height of men
aged 18 years to 45 years across the world was 173 cm. This study included men not only from the
sports fraternity but across a wide spectrum of professions and walks of life. One could argue that men
Olympians are likely to be taller than this claimed average height of 173 cm.
μ = 173 cm (the claimed average height)
n = sample size
The difference-in-means hypothesis testing:
These calculations can be done using the Excel Data Analysis tool.
When you need to compare before-and-after results in a hypothesis test, the paired two-sample test
for means is used. In that case the difference is computed for each pair, and then the differences are
averaged. In the unpaired tests, by contrast, an average is computed for each of the two samples
(for example, an average for the men and an average for the women) and the two averages are compared.
t-Test: Paired Two Sample for Means and t-Test: Two-Sample Assuming Equal/Unequal Variances will give
different cutoffs and different results.
In the first example we use the t-test assuming equal variances, because there is no sense of pairing
between the data: the data represent the ages of 40 different men and 40 different women, so there is
no way they could be paired.
In the second example we use the paired two-sample t-test, because there is a sense of pairing between
the data: both sales figures occur in the same month.
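The paired approach can be sketched directly: take the difference for each pair, then test whether the mean difference is zero with t = d-bar / (s_d / sqrt(n)). The before/after numbers here are made up:

```python
# Paired two-sample t-statistic computed from per-pair differences.
from math import sqrt
from statistics import mean, stdev

before = [12, 15, 11, 14, 13, 16, 12, 15]
after  = [14, 16, 13, 15, 15, 18, 13, 17]

diffs = [a - b for a, b in zip(after, before)]  # one difference per pair
d_bar, s_d, n = mean(diffs), stdev(diffs), len(diffs)
t_stat = d_bar / (s_d / sqrt(n))  # compare with the t cutoff, df = n - 1
print(round(d_bar, 3), round(t_stat, 3))
```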
Linear regression:
Interpreting beta 1:
Firstly, the interpretation is in terms of x1 increasing by one unit, not one percent. So if the unit
of x1 is kilograms, then a one-unit increase implies that x1 increases by one kilogram; if the unit is
1,000 kilograms, then a one-unit increase implies x1 increases by 1,000 kilograms. Secondly, the
interpretation says that y increases by beta-1 units. So, for example, if the y variable is measured in
millions of dollars, then beta-1 units imply beta-1 million dollars. Thirdly, the last part of the
interpretation is important: all of the other variables are kept at the same level. To interpret the
impact of a particular explanatory variable on the y variable, it is important that all of the other
variables are held constant at whatever level they are.
"All other variables remaining at the same level" implies that if ad expenditure and promotional
expenditure are kept at the same level (they are not changed), and only the price is increased by one
unit ($1), then we would expect the unit sales to drop by 5,055.27 units.
For beta 2:
The unit of ad expenditure here is $1,000, because that is what ad expenditure is measured in in our
data. So the interpretation is: for every $1,000 increase in advertising expenditure, the unit sales
increase by 648.61 units, all other variables remaining at the same level.
For beta 3:
The value of the coefficient is a positive 1802.65, implying that for every one-unit increase in
promotional expenditure (once again, measured in units of $1,000), that is, for every $1,000 increase
in promotional expenditure, we would expect unit sales to increase by about 1,802.65 units, all other
variables remaining at the same level.
The interpretation of the beta-zero coefficient: the estimate of beta zero is the value of the y
variable when all x variables are zero. So, in this case, it implies that the value of unit sales would
be a negative 25,096.83 when all the x variables are zero, that is, when the price is zero, the ad
expenditure is zero, and the promotional expenditure is also zero. This is the technical
interpretation of beta zero. Clearly, in this case, this technical interpretation has no managerial
relevance. Why? Because talking of a situation where you are selling a toy for free, pricing it at $0,
and then trying to see what the unit sales would be, does not make managerial sense.
Prediction:
In some cases the residuals are negative, which means that the model is over-predicting (the predicted
value is higher than the actual value).
R square value:
R-squared equal to 0.61899 implies that this regression model is able to explain about 61.9 percent of
the variation, or changes, in the unit sales of the toy. What happens to the remaining variation? It
goes unexplained. You may notice from the earlier regressions we carried out that increasing the
number of X variables increases R-squared; R-squared was higher in the model where we also included
the advertising expenditure and promotion expenditures. The higher the value of R-squared (the closer
it is to one), the greater the proportion of variation in the Y variable explained by the regression
model; in other words, the model fits the data well. The lower the value of R-squared (the closer it
is to zero), the smaller the proportion of variation in the Y variable explained by the regression
model; in other words, the model does not fit the data well. Unfortunately, there is no single value
of R-squared above which you can claim that you have a good-fitting model and below which you can
claim that you have a poor-fitting model.
Why do we have errors in the regression model which then lead to these observe residuals?
There could be a multitude of reasons for this. The major reasons tend to be omitted variables and the
functional relationship between the Y and X variables. Omitted variables mean that your model may be
missing some important explanatory or X variables, which may be aggravating these errors. While the
functional relationship means that there may be some non-linearity in the relationship between the Y
variable and the set of X variables, implying that a straight line relationship may not be most
appropriate.
One important assumption about the error is that it has a normal distribution with mean equal to
0 and some constant standard deviation. Visually, this assumption means that the vertical red
error bars shown in the scatterplot from the previous lesson tend to be approximately equally
distributed above and below the regression line, so that the average across the positive and negative
errors tends to be approximately 0. Further, the spread of these vertical error bars tends to be
similar across the entire regression line. Another way to think about this is that if you plotted a
histogram of all the error terms using your data, you would tend to get a bell-shaped curve centered
at 0.
The relationship between the betas and the b's is that the b's are estimates of the betas. Depending
on the sample used for estimation, the values of the b's may change. For example, in the toy sales
regression, had we used 36 months of data rather than 24 months, we may have obtained slightly
different estimates of the impact of price and the other variables on sales. This indicates that the
b's themselves can be considered random variables, which in turn have a distribution: a normal
distribution centered at the so-called true value of the betas. For example, b0 follows a normal
distribution centered at the true value of beta 0; b1 follows a normal distribution centered at the
true value of beta 1; and so on. The relationship between the betas and the b's is analogous to the
relationship between the population mean and the sample mean that we studied in course two of the
specialization: the population mean is fixed but unknown, and the sample mean can be thought of as a
random variable having a normal distribution centered at the population mean.
(b1 - beta 1) divided by the standard error of b1 follows the t distribution with n - k - 1 degrees of
freedom, and so on for the other estimated b's.
From this data set we see that the p-value for %_commercial is greater than 5%, so we cannot reject
the null hypothesis, which signifies that %_commercial does not have a significant effect in this model.
The value 5,000 falls between the upper limit and the lower limit, so we cannot reject the claim.
The p-value of REGB is greater than the alpha value, which indicates that area B does not have any
significant difference from area C.
Mean centering variables in a regression model: when the intercept does not have any managerial
significance, we can use mean-centered variables in the regression model. To do this, we replace the
height column with a column computed as (height - average of all heights). Running the regression on
this gives a meaningful intercept.
The intercept then indicates that when male = 0 (meaning female, because it is a dummy variable) and
centered height = 0 (meaning average height), the person weighs 69.62 kg; to get the male weight we
add 5.52 kg.
Interaction model:
Beta 2 must be for females, because female means male = 0, which removes the interaction term from
the equation.