Harvard eCourse
1 BASICS
1.2 Variability
1.2.1 The Standard Deviation

The standard deviation is a common measure for describing how much variability there is in a set of data. It tells us how far the data are spread out. A large standard deviation indicates that the data are widely dispersed; a smaller standard deviation tells us that the data points are more tightly clustered together. The standard deviation measures how much a data set varies from its mean. To calculate it, first compute the variance, s^2 = SUM((x_i - mean)^2) / (n - 1), and then take its square root: Std. Deviation = s = SQRT(variance). In Excel: =STDEV(range).

1.2.2 The Coefficient of Variation

We can translate this concept of relative volatility into a standardized measure called the coefficient of variation, which is simply the ratio of the standard deviation to the mean: Coefficient of Variation = s / mean. It can be interpreted as the standard deviation expressed as a percent of the mean. The coefficient of variation describes the standard deviation as a fraction of the mean, giving us a standard measure of variability. We can use it to compare variation in data sets of different scales or units.
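As an illustrative sketch (not part of the original course materials), the same calculations can be reproduced with Python's standard library; the sample data here are made up for demonstration:

import statistics

data = [12.0, 15.5, 9.8, 14.2, 11.7, 13.3]   # hypothetical sample

mean = statistics.mean(data)
sd = statistics.stdev(data)   # sample standard deviation: divides by n - 1, like Excel's STDEV
cv = sd / mean                # coefficient of variation: the SD as a fraction of the mean

print(f"mean = {mean:.2f}, std dev = {sd:.2f}, coefficient of variation = {cv:.1%}")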
Why do the outliers influence our measure of linearity so much? Because measures like correlation give more weight to points distant from the center of the data, outliers can strongly influence the correlation coefficient of the entire set. In these situations, our intuition and the measure we use to quantify our intuition can be quite different. We should always attempt to reconcile those differences by returning to the data.
Excel: =CORREL(var1;var2) returns the correlation coefficient of two data ranges.
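To see how strongly a single outlier can move the correlation coefficient, here is a small illustrative sketch (hypothetical data; statistics.correlation requires Python 3.10+):

import statistics

x = [1, 2, 3, 4, 5, 6]
y = [2.1, 3.9, 6.2, 8.1, 9.8, 12.2]          # roughly linear in x

print(statistics.correlation(x, y))           # close to 1: strong linear relationship

# Replace one point far from the center of the data with an outlier and recompute.
y_outlier = y[:-1] + [2.0]                    # last point now falls far below the trend
print(statistics.correlation(x, y_outlier))   # the coefficient drops noticeably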
2.1.1 Confidence Intervals

The sample mean is the best estimate of our population mean. However, it is only a point estimate: it does not give us a sense of how accurately the sample mean estimates the population mean. Creating a range around the sample mean is quite easy. First, we need to know three statistics of the sample: the mean (x-bar), the standard deviation (s), and the sample size (n).
We indicate our level of confidence by saying, for example, that we are "95% confident" that the range contains the true population mean, meaning there is a 95% chance that the range contains it.
Why is the normal distribution so important? First, the normal distribution's mean and median are equal; they are located exactly at the center of the distribution. Hence, the probability that a normal distribution will have a value less than the mean is 50%, and the probability that it will have a value greater than the mean is 50%. Second, the normal distribution has a unique symmetrical shape around this mean. How wide or narrow the curve is depends solely on the distribution's standard deviation. In fact, the location and width of any normal curve are completely determined by two variables: the mean and the standard deviation of the distribution. Large standard deviations make the curve very flat. Small standard deviations produce tight, tall curves with most of the values very close to the mean. Regardless of how wide or narrow the curve, it always retains its bell-shaped form.

Because of this unique shape, we can create a few useful "rules of thumb" for the normal distribution. For a normal distribution, about 68% (roughly two-thirds) of the probability is contained in the range reaching one standard deviation away from the mean on either side. If we go two standard deviations away from the mean, we'll cover about 95% of the probability. The amazing thing about normal distributions is that these rules of thumb hold for any normal distribution, no matter what its mean or standard deviation.
2.2.1 The z-statistic
For example, for a z-value of 1, the probability of being within one standard deviation of the mean (that is, between -1 and +1 on a standard normal curve) is about 68%. To know how far you must go from the mean to cover a certain area under the curve, you have to know the standard deviation of the distribution. Using the z-statistic, we can then "standardize" the distribution, making it into a standard normal distribution with a mean of 0 and a standard deviation of 1. We are translating the real value in its original units (inches in our example) into a z-value. The z-statistic translates any value into its corresponding z-value simply by subtracting the mean and dividing by the standard deviation: z = (x - mean) / std.
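An illustrative Python sketch of this translation (the height numbers here are hypothetical):

def z_value(x, mean, std):
    """Translate a value into its z-value: distance from the mean in standard deviations."""
    return (x - mean) / std

# Hypothetical example: a height of 70 inches in a population with mean 66 and SD 2.5
print(z_value(70, 66, 2.5))   # 1.6 standard deviations above the mean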
Sometimes we may want to go in the other direction, starting with the probability and figuring out how many standard deviations are necessary on either side of the mean to capture that probability. For example, suppose we want to know how many standard deviations we need to be from the mean to capture 95% of the probability. Our second rule of thumb tells us that when we move two standard deviations from the mean, we capture about 95% of the probability. More precisely, to capture exactly 95% of the probability, we must be within 1.96 standard deviations of the mean. This means that for a normal distribution, there is a 95% probability of falling between -1.96 and 1.96 standard deviations from the mean.

Probability    Std. deviations from the mean
38%            0.5
68%            1
95%            1.96
99%            2.58
Sometimes, probabilities are shown in other forms. If we start at the very left side of the distribution, the area underneath the curve is called the cumulative probability. For example, the probability of being less than the mean is 0.5, or 50%. This is just one example of a cumulative probability.
A cumulative probability of 70% corresponds to a point that has 70% of the area under the curve to its left.
To find the cumulative probability associated with a given z-value for a standard normal curve, we use the Excel function NORMSDIST. Note the S between the M and the D. It indicates we are working with a 'standard' normal curve with mean zero and standard deviation one. For example, to find the cumulative probability for the z-value 1, we enter the Excel function =NORMSDIST(1). The value returned, 0.84, is the area under the standard normal curve to the left of 1. This tells us that the probability of obtaining a value less than 1 for a standard normal curve is about 84%. We shouldn't be surprised that the probability of being less than 1 is 84%. Why? First, we know that the normal curve is symmetric, so there is a 50% chance of being below the mean. Next, we know that about 68% of the probability for a standard normal curve is between -1 and +1. Since the normal curve is symmetric, half of that 68% or 34% of the probability must lie between 0 and 1.
In Excel: =STANDARDIZE(x;mean;std) converts a value to its z-value, which we can pass to =NORMSDIST(z). Alternatively, =NORMDIST(x;mean;std;TRUE) works directly in the original units: the fourth argument TRUE requests the cumulative probability (the area to the left of x), while FALSE would return the height of the curve at x, which we are not interested in here.
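A Python equivalent of these Excel calls, using the standard library's NormalDist (the numbers match the course's examples):

from statistics import NormalDist

std_normal = NormalDist(0, 1)        # standard normal curve: mean 0, SD 1
print(std_normal.cdf(1))             # ~0.8413, like =NORMSDIST(1)

# Working directly in original units, like =NORMDIST(x;mean;std;TRUE):
print(NormalDist(26, 8).cdf(24))     # ~0.4013, the example in the table below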
Example
normal curve value   mean   std   standardized z   cum. probability
24                   26     8     -0.25            40%
Suppose we want to find the z-value associated with the cumulative probability 95%. To translate a cumulative probability back to a z-value on the standard normal curve, we use the Excel function NORMSINV. Note once again the S, which tells us we are working with a standard normal curve: =NORMSINV(0.95) = 1.64, so a cumulative probability of 95% corresponds to a z-value of 1.64. Example:
cum. prob.   z
0.95         1.64
If we want to translate a cumulative probability back to a value on a general normal curve, we use the NORMINV function. NORMINV requires three values: the cumulative probability, and the mean and standard deviation of the normal distribution in question: =NORMINV(cum.prob;mean;std). Example:
cum. prob.   mean   std   value
0.95         26     8     39.2
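The inverse direction in Python, mirroring =NORMSINV and =NORMINV with the same numbers as the examples above:

from statistics import NormalDist

print(NormalDist(0, 1).inv_cdf(0.95))    # ~1.645, like =NORMSINV(0.95)
print(NormalDist(26, 8).inv_cdf(0.95))   # ~39.16, like =NORMINV(0.95;26;8)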
Using the z-table

Let's use the z-table to find a cumulative probability.

Ex. #1: Women's heights are distributed normally, with mean around 63.5 inches and standard deviation 2.5 inches. What percentage of women are shorter than 65.6 inches?

z-value = (65.6 - 63.5) / 2.5 = 0.84

The cumulative probability is 0.7995. About 80% of women are shorter than 65.6 inches.

Ex. #2: Similarly, we might want to know what percentage of women are shorter than 61.6 inches.

z-value = (61.6 - 63.5) / 2.5 = -0.76, cum. prob. = 0.2236 = 22.36%

When a z-value is negative, we must first use the table to find the cumulative probability corresponding to the positive z-value, in this case +0.76. Then, since the normal curve is symmetric, we can conclude that the probability of being less than the negative z-value -0.76 is the same as the probability of being greater than +0.76. Since the probability of being less than a z-value of +0.76 is 0.7764, the probability of being greater than a z-value of +0.76 is 1 - 0.7764 = 0.2236. Thus, we can conclude that the probability of being less than a z-value of -0.76 is also 0.2236.

z = 0.76, cum. prob. = 0.7764, 1 - 0.7764 = 0.2236
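Both height examples can also be checked without a z-table; an illustrative Python sketch:

from statistics import NormalDist

heights = NormalDist(mu=63.5, sigma=2.5)   # women's heights, in inches
print(heights.cdf(65.6))   # ~0.7995: about 80% are shorter than 65.6 inches
print(heights.cdf(61.6))   # ~0.2236: about 22% are shorter than 61.6 inches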
How do the unique properties of the normal distribution help us when we use a random sample to infer something about the underlying population? It turns out that even if a population is not normally distributed, the properties of the normal distribution are very helpful to us in sampling. To see why, let's first learn about a well-established statistical fact known as the "Central Limit Theorem".
Furthermore, if we took enough of these samples, the mean of the resulting distribution of sample means would be equal to the true mean of the population. To repeat: no matter what type of distribution the population has (uniform, skewed, bi-modal, or completely bizarre), if we took enough samples, and the samples were sufficiently large, then the means of those samples would form a normal distribution centered around the true mean of the population.
The Central Limit Theorem tells us that the mean of that one sample is part of a normal distribution. More specifically, we know that the sample mean falls somewhere in a normal Distribution of Sample Means that is centered at the true population mean. The Central Limit Theorem is so powerful for sampling and estimation because it allows us to ignore the underlying distribution of the population we want to learn about. Since we know the Distribution of Sample Means is normally distributed and centered at the true population mean, we can completely disregard the underlying distribution of the population.
The distribution of the sample means will always form a normal distribution; this is exactly what the Central Limit Theorem predicts. The theorem states that the means of sufficiently large samples are always normally distributed, a key insight that will allow us to estimate the population mean from a sample.
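A quick simulation illustrates the theorem. This sketch (hypothetical, not from the course) draws samples from a heavily skewed population and shows that the sample means still cluster normally around the population mean:

import random
import statistics

random.seed(42)
# An exponential distribution with rate 1 is strongly skewed; its true mean is 1.0.

# Draw 2,000 samples of size 50 and record each sample's mean.
sample_means = [
    statistics.mean(random.expovariate(1.0) for _ in range(50))
    for _ in range(2000)
]

# The means form an approximately normal distribution centered at the true mean.
print(statistics.mean(sample_means))    # close to 1.0, the population mean
print(statistics.stdev(sample_means))   # close to sigma/sqrt(n) = 1/sqrt(50) ~ 0.141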
We won't prove this fact here, but simply note that it is true, and that it should confirm our general intuition about the Distribution of Sample Means. For example, if we have huge samples, we'd expect the means of those large samples to be tightly clustered around the true population mean, and thereby form a narrow distribution.
Excel provides a simple function for finding the appropriate t-value for a confidence interval. If we enter 1 minus the level of confidence we want, together with the degrees of freedom, into the Excel function TINV, Excel gives us the appropriate t-value: =TINV(1 - level of confidence; n - 1). For n = 16 and a 95% level of confidence: =TINV(1-0.95;16-1) = TINV(0.05;15) = 2.131.

Example: the Demiurgos hotel

sample size   sample mean   sample SD   95% conf. int.
16            $10,200       $4,800      [$7,642.26, $12,757.74]

conf. int. = mean +- t-value * (s / sqrt(n))
95% conf. int.: 10200 + 2.131 * (4800/4) and 10200 - 2.131 * (4800/4), i.e. [$7,642.26, $12,757.74]
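A sketch of the same interval in Python; the t quantile here comes from SciPy, which is an assumption (the standard library has no t-distribution):

import math
from scipy import stats   # assumed available; provides the t-distribution

n, mean, sd = 16, 10200, 4800
t = stats.t.ppf(0.975, df=n - 1)   # two-tailed 95%, like =TINV(0.05;15); ~2.131

half_width = t * sd / math.sqrt(n)
print(mean - half_width, mean + half_width)   # ~[7642.26, 12757.74]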
What if the Demiurgos' manager thinks this interval is too large? She will have to survey more guests. Increasing the sample size causes the t-value to decrease, and also increases the size of the denominator (the square root of n). Both factors narrow the confidence interval.
To find the sample size necessary to give us a specified distance d from the mean, we must have an estimate of sigma, the standard deviation of spending. If we do not have an estimate based on past data or some other source, we might take a preliminary survey to obtain a rough estimate of sigma.
Sample size: to obtain a confidence interval that extends a distance d on either side of the mean, we need n = (z * sigma / d)^2, rounded up to the next whole number.
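An illustrative calculator for this formula (the inputs are hypothetical):

import math

def required_sample_size(sigma, d, z=1.96):
    """Smallest n whose confidence interval half-width is at most d."""
    return math.ceil((z * sigma / d) ** 2)

# Hypothetical: spending SD estimated at $4,800; we want the mean pinned down to +- $1,000
print(required_sample_size(4800, 1000))   # 89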
Example:
First, you calculate a 95% confidence interval for the response rate. The 95% confidence interval for a proportion estimate p-bar is p-bar +- z * SQRT(p-bar * (1 - p-bar) / n); here it works out to 0.0412 to 0.1588, or 4.12% to 15.88%. Then, after giving Leo's questions some thought, you recommend to him how many guests he should send the mailing to. Based on the confidence interval for the proportion, the maximum percentage of people who are likely to respond to the discount offer (at the 95% confidence level) is 15.88%. So, if 15.88% of people were to respond and only 200 rooms are available, how many people should Leo send the mailing to? Simply divide 200 by 0.1588 to get the answer: Leo should send the mailing to at most 1,259 past customers.
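A sketch of the arithmetic; the sample of 100 guests with 10 responses is an assumption inferred from the quoted interval, not stated in the course text:

import math

n, responses = 100, 10        # assumed: consistent with the 4.12%-15.88% interval above
p = responses / n
half_width = 1.96 * math.sqrt(p * (1 - p) / n)
print(p - half_width, p + half_width)         # ~0.0412 to ~0.1588

rooms = 200
print(math.floor(rooms / (p + half_width)))   # at most ~1,259 mailings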
3 HYPOTHESIS TESTING
As the example suggests, in a hypothesis test we test the null hypothesis. Based on evidence we gather from a sample, there are only two possible conclusions we can draw from a hypothesis test: either we reject the null hypothesis or we do not reject it. "We cannot reject the null hypothesis if it is reasonably likely that our sample mean would come from a population with the mean stated by the null hypothesis. The null hypothesis may or may not be true: we simply don't have enough evidence to draw a definite conclusion." Note that having the sample's average defect rate very close to 3 does not "prove" that the mean is 3. Thus we never say that we "accept" the null hypothesis; we simply don't reject it. It is because we can never "accept" the null hypothesis that we do not pose the claim that we actually want to substantiate as the null hypothesis: such a test would never allow us to "accept" our claim! The only way we can substantiate our claim is to state it as the opposite of the null hypothesis, and then reject the null hypothesis based on the evidence. In a hypothesis test, if our evidence is not strong enough to reject the null hypothesis, that does not prove that the null hypothesis is true. We have simply failed to show it is false, and thus cannot reject it.
3.1.1 ONE-SIDED TESTS

Sometimes, we may want to know if the actual population mean differs from our initial value of the population mean in a specific direction. In these cases, our alternative hypothesis should clearly state which direction of change we want to test for. These kinds of tests are called one-sided hypothesis tests.
Example:
To find this range, all he needs to do is calculate its upper bound. For what value would 95% of all sample means be less than that value? To find out, we use what we know about the cumulative probability under the normal curve: a cumulative probability of 95% corresponds to a z-value of 1.645. Why is this different from the z-value for a two-sided test with a 95% confidence level? For a two-sided test, the z-value corresponds to a 97.5% cumulative probability, since 2.5% of the probability is excluded from each tail. For a one-sided test, the z-value corresponds to a 95% cumulative probability, since 5% of the probability is excluded from the upper tail.
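These two thresholds drop straight out of the inverse normal CDF; a quick check in Python:

from statistics import NormalDist

z = NormalDist(0, 1)
print(z.inv_cdf(0.95))    # ~1.645: one-sided test at 95% confidence
print(z.inv_cdf(0.975))   # ~1.960: two-sided test at 95% (2.5% in each tail)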
3.2 SINGLE POPULATION PROPORTIONS: HYPOTHESIS TESTS FOR A SINGLE POPULATION PROPORTION
3.3 P-VALUES
The "p-value" measures this likelihood: it tells us how likely it is to collect a sample mean that falls at least a certain distance from the null hypothesis mean. In the familiar hypothesis testing procedure, if the p-value is less than our threshold of 5%, we reject our null hypothesis. The p-value does more than simply answer the question of whether or not we can reject the hypothesis. It also indicates the strength of the evidence for rejecting the null hypothesis. For example, if the p-value is 0.049, we barely have enough evidence to reject the null hypothesis at the 0.05 level of significance; if it is 0.001, we have strong evidence for rejecting the hypothesis. Excel: NORMSDIST(z value) Example: the p-value 0.0027 is much smaller than 0.05. Thus, we can reject the null hypothesis at 0.0027, a much lower significance level. In other words, we can reject the null hypothesis with 99.73% confidence. In general, the lower the p-value, the higher our confidence in rejecting the null hypothesis.
4 REGRESSION
Regression is a statistical tool that goes even further: it can help us understand and characterize the specific structure of the relationship between two variables.
What kinds of questions can regression analysis help answer? How does regression help us as managers? It can help in two ways: first, it helps us forecast. For example, we can make predictions about future values of sales based on possible future values of advertising. Second, it helps us deepen our understanding of the structure of the relationship between two variables by expressing the relationship mathematically.

With regression, we can forecast sales for any advertising level within the range of advertising levels we've seen historically. We must be extremely cautious about forecasting sales for values of advertising beyond the range of values we have already observed. The further we are from the historical values of advertising, the more we should question the reliability of our forecast. Another critical caveat to keep in mind is that whenever we use historical data to predict future values, we are assuming that the past is a reasonable predictor of the future. Thus, we should only use regression to predict the future if the general circumstances that held in the past, such as competition, industry dynamics, and economic environment, are expected to hold in the future.

Regression can be used to deepen our understanding of the structural relationship between two variables. The linear relationship is written as a = b + c*d, where a is sales, d is advertising, b is the intercept, and c is the slope.
The more important term is the advertising coefficient, c, which gives us the slope of the line. The advertising coefficient tells us how sales have changed on average as advertising has increased.
The complete mathematical description of the relationship between the dependent and independent variables is y = a + bx + error. The y-value of any data point is exactly defined by these terms: the value y-hat given by the regression line, plus the error y - y-hat.
This measure, called the Sum of Squared Errors (SSE), or the Residual Sum of Squares, gives us a good measure of how accurately a line describes a set of data. The less well the line fits the data, the larger the errors and the higher the Sum of Squared Errors. To find the line that best fits a data set, we first need a measure of the accuracy of a line's fit: the Sum of Squared Errors. To find it, we calculate the vertical distances from the data points to the line, square the distances, and sum the squares.

4.1.1 IDENTIFYING THE REGRESSION LINE

We can calculate the Sum of Squared Errors for any line that passes through the data. Of course, different lines will give us different Sums of Squared Errors. The line we are looking for, the regression line, is the one with the smallest Sum of Squared Errors. The line that most accurately fits the data, the regression line, is the line for which the Sum of Squared Errors is minimized.
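An illustrative sketch (hypothetical advertising and sales figures; statistics.linear_regression requires Python 3.10+) that fits the least-squares line and reports its SSE:

import statistics

advertising = [10, 20, 30, 40, 50, 60]   # hypothetical spend
sales = [120, 155, 148, 210, 224, 250]   # hypothetical sales

fit = statistics.linear_regression(advertising, sales)
print(fit.intercept, fit.slope)          # the a and b of y = a + b*x

# Sum of Squared Errors: squared vertical distances from the points to the line.
sse = sum((y - (fit.intercept + fit.slope * x)) ** 2
          for x, y in zip(advertising, sales))
print(sse)   # no other straight line through these data gives a smaller SSE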
4.2.1.1 RESIDUAL ANALYSIS

Although the regression line is the line that best fits the observed data, the data points typically do not fall precisely on the line. Collectively, the vertical distances from the data to the line, the errors, measure how well the line fits the data. These errors are also known as residuals.
First, we measure the residuals: the distances from the data points to the regression line. Then we plot the residuals against the values of the independent variable. This graph, called a residual plot, helps us identify patterns in the residuals. A residual plot is often better than the original scatter plot for recognizing patterns because it isolates the errors from the general trend in the data. Residual plots are critical for studying error patterns in more advanced regressions with multiple independent variables. If we see a pattern in the distribution of the residuals, then we can infer that there is more to the behavior of the dependent variable than what is explained by our linear regression. Other factors may be influencing the dependent variable, or the assumption that the relationship is linear may be unwarranted. In this example, the residuals appear to be getting larger for higher values of the independent variable; this phenomenon is known as heteroskedasticity.
4.2.2 THE SIGNIFICANCE OF REGRESSION COEFFICIENTS

In fact, the regression line is almost never a perfect descriptor of the true linear relationship between the variables. Why? Because the data we use to find the regression line typically represent only a sample from the entire population of data pertaining to the relationship. Since each regression line comes from a limited set of data, it gives us only an approximation of the "true" linear relationship between the variables.
When we calculate the coefficients of a regression line from a set of observed data, the value a is an estimate of alpha, the intercept of the "true" line. Similarly, the value b is an estimate of beta, the slope of the true line. Since we don't know the exact value of the slope of the true advertising line, we might well question whether there actually IS a linear relationship between advertising and sales. How can we assure ourselves that the pattern we see in the sample data is not simply due to chance?
If there truly were no linear relationship, then in the full population of relevant data, changes in advertising would not correspond systematically to proportional changes in sales. In that case, the slope of the best-fitting line for the true relationship would be zero. There are two quick ways to test if the slope of the true line might be zero. One is simply to look at the confidence interval for the slope coefficient and see if it includes zero. The other is to look at the p-value reported for the slope coefficient. In general, if the p-value for a slope coefficient is less than 0.05, then we can reject the null hypothesis that the slope beta is zero, and conclude with 95% confidence that there is a linear relationship between the two variables. Moreover, the smaller the p-value, the more confident we are that a linear relationship exists. If the p-value for a slope coefficient is greater than 0.05, then we do not have enough evidence to conclude with 95% confidence that there is a significant linear relationship between the variables.
4.3 SUMMARY
So far, we haven't discussed how sample size affects the accuracy of a regression analysis. The larger the sample we use to conduct the regression analysis, the more precise the information we obtain about the true nature of the relationship under investigation. Specifically, the larger the sample, the better our estimates for the slope and the intercept, and the tighter the confidence intervals around those estimates.
5 MULTIPLE REGRESSION
In multiple regression, we adapt what we know about regression with one independent variable (often called simple regression) to situations in which we take into account the influence of several variables. Graphing data on more than two variables poses its own set of difficulties. Three variables can still be represented, but beyond that, visualization and graphical representation become essentially impossible.
The coefficients in the simple regression and the coefficients in the multiple regression have very different meanings. In the simple regression equation of price versus distance, we interpret the coefficient, -39,505, in the following way: for every additional mile farther from downtown, we expect house price to decrease by an average of $39,505. We describe this average decrease of $39,505 as a gross effect: it is an average computed over the range of variation of all other factors that influence price. In the multiple regression of price versus size and distance, the value of the distance coefficient, -55,006, is different, because it has a different meaning. Here, the coefficient tells us that, for every additional mile, we should expect the price to decrease by $55,006, provided the size of the house stays the same. In other words, among houses that are similarly sized, we expect prices to decrease by $55,006 per mile of distance to downtown. We refer to this decrease as the net effect of distance on price. Alternatively, we refer to it as "the effect of distance on price controlling for house size".
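A sketch of how gross and net effects can differ; the tiny data set is invented, and NumPy is an assumption here (the standard library has no multiple regression):

import numpy as np

# Hypothetical houses: distance to downtown (miles), size (sq ft), price ($)
distance = np.array([1, 2, 3, 4, 5, 6], dtype=float)
size = np.array([1500, 1800, 1600, 2400, 2200, 2800], dtype=float)
price = np.array([520_000, 505_000, 430_000, 480_000, 400_000, 420_000], dtype=float)

# Simple regression, price ~ distance: the gross effect of distance.
X1 = np.column_stack([np.ones_like(distance), distance])
(b0, gross), *_ = np.linalg.lstsq(X1, price, rcond=None)
print(gross)   # average price change per mile, with all other factors varying freely

# Multiple regression, price ~ distance + size: the net effect of distance,
# i.e. the per-mile change among houses of the same size.
X2 = np.column_stack([np.ones_like(distance), distance, size])
(b0, net, per_sqft), *_ = np.linalg.lstsq(X2, price, rcond=None)
print(net)     # typically differs from the gross effect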
5.1.2 RESIDUAL ANALYSIS
5.1.3
5.2.2 LAGGED VARIABLES

So far we have studied variables such as price, house size, lot size, and distance from the city center. These are cross-sectional data: we looked at a cross-section of the Silverhaven real estate market at a specific point in time. A time series, by contrast, is a set of data collected over a range of time: each data point pertains to a specific time period. We incorporate the delayed effect of an independent variable on a dependent variable using a lagged variable. Adding a lagged variable is costly in two ways. The loss of a data point decreases our sample size, which reduces the precision of our estimates of the regression coefficients. At the same time, because we are adding another variable, we decrease adjusted R-squared. Thus, we include a lagged variable only if we believe the benefits of adding it outweigh the loss of an observation and the "penalty" imposed by the adjustment to R-squared. Despite these costs, lagged variables can be very useful. Since they pertain to previous time periods, they are usually available ahead of time. Lagged variables are often good "leading indicators" that help us predict future values of a dependent variable.
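Constructing a lagged variable is just an alignment of time periods; a sketch with invented monthly data:

# Hypothetical monthly advertising and sales (same time index).
advertising = [10, 12, 9, 15, 14, 18, 16]
sales = [100, 108, 112, 104, 130, 128, 145]

# To regress this month's sales on LAST month's advertising, pair each
# sales figure with the advertising value one period earlier. The first
# sales observation has no prior advertising, so one data point is lost.
lagged_pairs = list(zip(advertising[:-1], sales[1:]))
print(lagged_pairs)   # [(10, 108), (12, 112), ..., (18, 145)]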
5.2.3 DUMMY VARIABLES

Many variables we study are qualitative or categorical: they do not naturally take on numerical values, but can be classified into categories.
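Encoding such a categorical variable for regression means mapping each category to 0 or 1; an illustrative sketch:

# Hypothetical: does each house have a renovated kitchen? Encode the
# category as a dummy variable: 1 for "yes", 0 for "no".
kitchen = ["yes", "no", "yes", "yes", "no"]
kitchen_dummy = [1 if k == "yes" else 0 for k in kitchen]
print(kitchen_dummy)   # [1, 0, 1, 1, 0] -> usable as a regression variable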
6 DECISION ANALYSIS
Costs that were incurred or committed to in the past, before a decision is made, contribute to the total costs of all scenarios that could possibly unfold after the decision is made. As such, these costs, called sunk costs, should not have any bearing on the decision, because we cannot devise a scenario in which they are avoided.
6.2.2 CONDITIONAL PROBABILITIES
6.2.3 STATISTICAL INDEPENDENCE