
SUMMARY OF E-COURSE: Quantitative Methods

Harvard eCourse
1 BASICS

1.1 Central Values for Data


1.1.1 The mean
Excel: =AVERAGE(value1;value2;…;valuen). The mean (average) is by far the most common measure used to describe the "center" or "central tendency" of a data set. If the distribution has a tail that extends out to one side (a skewed distribution), the values on that side will pull the mean towards them.

1.1.2 The median
Excel: =MEDIAN(value1:valuen). In cases like income, where the data are typically very skewed, the mean often isn't the best value to represent the data. In these cases, we can use another central value called the median. The median is the middle value of a data set whose values are arranged in numerical order: half the values are higher than the median, and half are lower. With an odd number of data points, listed in order, the median is simply the middle value. In a data set with an even number of points, we average the two middle values. The mean weighs the value of every data point, but is sometimes biased by outliers or by a highly skewed distribution. By contrast, the median is not biased by outliers and is often a better value to represent skewed data.

1.1.3 The mode
Excel: =MODE(val1:valn). A third value used to represent the "center" of a data set is its mode: the data set's most frequently occurring value. We might use the mode to represent data when knowing the average value isn't as important as knowing the most common value. In some cases, data may cluster around two or more points that occur especially frequently, giving the histogram more than one peak. A distribution that has two peaks is called a bimodal distribution.
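A minimal Python sketch of the three central values (mirroring Excel's AVERAGE, MEDIAN and MODE); the income figures are hypothetical, chosen only to show how a skewed tail pulls the mean above the median:

import statistics

incomes = [28_000, 31_000, 33_000, 35_000, 38_000, 41_000, 250_000]  # one large outlier
print(statistics.mean(incomes))          # about 65,143 - pulled up by the outlier
print(statistics.median(incomes))        # 35,000 - the middle value, unaffected
print(statistics.mode([2, 3, 3, 5, 7]))  # 3 - the most frequently occurring value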


1.2 Variability
1.2.1 The Standard Deviation
The standard deviation is a common measure for describing how much variability there is in a set of data: it tells us how far the data are spread out. A large standard deviation indicates that the data are widely dispersed; a smaller standard deviation tells us that the data points are more tightly clustered together. The standard deviation measures how much a data set varies from its mean. Excel: =STDEV(value1:valuen). First calculate the variance, Variance = SUM((xi − mean)²) / (n − 1), and then take its square root to get the standard deviation.

1.2.2 The Coefficient of Variation
We can translate this concept of relative volatility into a standardized measure called the coefficient of variation, which is simply the ratio of the standard deviation to the mean. It can be interpreted as the standard deviation expressed as a percent of the mean. Coefficient of variation = standard deviation / mean. The coefficient of variation describes the standard deviation as a fraction of the mean, giving you a standard measure of variability. We can use it to compare variation in data sets of different scales or units.
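A short Python sketch of both measures, using made-up numbers (statistics.stdev uses n − 1 in the denominator, like Excel's STDEV):

import statistics

data = [12, 15, 9, 14, 10, 18, 13, 11]   # hypothetical observations
mean = statistics.mean(data)             # 12.75
sd = statistics.stdev(data)              # sample standard deviation
cv = sd / mean                           # coefficient of variation: std dev as a fraction of the mean
print(mean, round(sd, 2), round(cv, 2))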

1.3 Relationships between variables


1.3.1 Two variables
This type of graph is called a "scatter diagram." Scatter diagrams provide a visual summary of the relationship between two variables. They are extremely helpful in recognizing patterns in a relationship. The more data points we have, the more apparent the relationship becomes.

1.3.1.1 Time as second variable
Time series will help us recognize seasonal patterns and yearly trends. But we must be careful: we shouldn't rely only on visual analysis when looking for relationships and patterns.

1.3.1.2 False relationship
Our intuition tells us that pairs of variables with a strong relationship on a scatter plot must be related to each other. But we must be careful: human intuition isn't foolproof, and often we infer relationships where there are none. Unless we have a reasonable theory about the connection between the two variables, the relationship is no more than an interesting coincidence.

1.3.1.3 Hidden variables
Even when two data sets seem to be directly related, we may need to investigate further to understand the reason for the relationship. We may find that the reason is not due to any fundamental connection between the two variables themselves, but that they are instead mutually related to another underlying factor. In many business contexts, hidden variables can complicate the investigation of a relationship between almost any two variables. Keep in mind that scatter plots don't prove anything about causality. They never prove that one variable causes the other; they simply illustrate how the data behave.

1.3.2 Correlation
Humans have an uncanny ability to discern patterns in visual displays of data. We "know" when the relationship between two variables looks strong, weak, linear, or nonlinear. The correlation coefficient is such a measure: it quantifies the extent to which there is a linear relationship between two variables. To describe the strength of a linear relationship, the correlation coefficient takes on values between -1 and +1. Even when the correlation coefficient is 0, a relationship might exist, just not a linear relationship. As we've seen, scatter plots can reveal patterns and help us better understand the business context the data describe.


Why do the outliers influence our measure of linearity so much? Because measures like correlation give more weight to points distant from the center of the data, outliers can strongly influence the correlation coefficient of the entire set. In these situations, our intuition and the measure we use to quantify our intuition can be quite different. We should always attempt to reconcile those differences by returning to the data.

Excel: =CORREL(var1;var2)
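A Python stand-in for Excel's =CORREL, on hypothetical advertising/sales pairs (statistics.correlation needs Python 3.10+):

import statistics

advertising = [10, 12, 15, 17, 20, 22]
sales = [90, 98, 110, 118, 132, 139]
r = statistics.correlation(advertising, sales)
print(round(r, 3))   # close to +1: a strong positive linear relationship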

2 SAMPLING & ESTIMATION


2.1 GENERATING RANDOM SAMPLES
Use a representative sample! This means that every person or thing in the population is equally likely to be selected. If there are 15,000 people in the industry, and we are choosing a sample of 1,000, then every person needs to have the same chance (1 out of 15) of being selected. Once we have decided how to select a sample, we have to ask how large our sample needs to be. Sometimes a sample size of 100 or even 50 might be enough when we are not that concerned about the accuracy of our estimate. Other times, we might need to sample thousands to obtain the accuracy we require. It's important to understand that the sample size depends on the level of accuracy we require, not on the size of the population.


2.1.1 Confidence intervals
The sample mean is the best estimate of our population mean. However, it is only a point estimate. It does not give us a sense of how accurately the sample mean estimates the population mean. Creating a range around the sample mean is quite easy. First, we need to know three statistics of the sample: the mean x-bar, the standard deviation s, and the sample size n.

We indicate our level of confidence by saying, for example, that we are "95% confident" that the range contains the true population mean. This means there is a 95% chance that the range contains the true population mean.
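A minimal sketch of such a range in Python, assuming a large sample and a 95% confidence level (z = 1.96); the sample numbers are illustrative, and the course later substitutes a t-value for z when samples are small:

import math

n, x_bar, s = 100, 85.0, 12.0            # hypothetical sample size, mean, standard deviation
margin = 1.96 * s / math.sqrt(n)         # z * s / sqrt(n)
print(x_bar - margin, x_bar + margin)    # the 95% confidence interval around the sample mean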


2.2 NORMAL DISTRIBUTION


The normal distribution is a probability distribution that is centered at the mean. It is shaped like a bell, and is sometimes called the "bell curve." The normal distribution is shown on two axes: the x-axis for the variable we're studying (women's heights, for example) and the y-axis for the likelihood that different values of the variable will occur. As it turns out, for a probability distribution like the normal distribution, the percent of all values falling into a specific range is equal to the area under the curve over that range.

Why is the normal distribution so important? First, the normal distribution's mean and median are equal: they are located exactly at the center of the distribution. Hence, the probability that a normal distribution will have a value less than the mean is 50%, and the probability that it will have a value greater than the mean is 50%. Second, the normal distribution has a unique symmetrical shape around this mean. How wide or narrow the curve is depends solely on the distribution's standard deviation. In fact, the location and width of any normal curve are completely determined by two variables: the mean and the standard deviation of the distribution. Large standard deviations make the curve very flat. Small standard deviations produce tight, tall curves with most of the values very close to the mean. Regardless of how wide or narrow the curve, it always retains its bell-shaped form.

Because of this unique shape, we can create a few useful "rules of thumb" for the normal distribution. For a normal distribution, about 68% (roughly two-thirds) of the probability is contained in the range reaching one standard deviation away from the mean on either side. If we go two standard deviations away from the mean on a standard normal curve, we'll cover about 95% of the probability. The amazing thing about normal distributions is that these rules of thumb hold for any normal distribution, no matter what its mean or standard deviation.
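A quick check of these rules of thumb with Python's standard library (NormalDist models a normal curve; cdf gives the cumulative probability):

from statistics import NormalDist

std_normal = NormalDist(mu=0, sigma=1)
within_one = std_normal.cdf(1) - std_normal.cdf(-1)   # about 0.68
within_two = std_normal.cdf(2) - std_normal.cdf(-2)   # about 0.95
print(round(within_one, 3), round(within_two, 3))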


2.2.1 The z-statistic

For example, for a z-value of 1, the probability of being within z standard deviations of the mean is about 68%: the probability of being between -1 and +1 on a standard normal curve. To know how far you must go from the mean to cover a certain area under the curve, you have to know the standard deviation of the distribution. Using the z-statistic, we can then "standardize" the distribution, making it into a standard normal distribution with a mean of 0 and a standard deviation of 1. We are translating the real value in its original units (inches in our example) into a z-value. The z-statistic translates any value into its corresponding z-value simply by subtracting the mean and dividing by the standard deviation: z = (x − mean) / standard deviation.


Sometimes we may want to go in the other direction, starting with the probability and figuring out how many standard deviations are necessary on either side of the mean to capture that probability. For example, suppose we want to know how many standard deviations we need to be from the mean to capture 95% of the probability. Our second rule of thumb tells us that when we move two standard deviations from the mean, we capture about 95% of the probability. More precisely, to capture exactly 95% of the probability, we must be within 1.96 standard deviations of the mean. This means that for a normal distribution, there is a 95% probability of falling between -1.96 and 1.96 standard deviations from the mean.

Probability | Standard deviations from the mean (either side)
38%  | 0.5
68%  | 1
95%  | 1.96
99%  | 2.58

Sometimes, probabilities are shown in other forms. If we start at the very left side of the distribution, the area underneath the curve is called the cumulative probability. For example, the probability of being less than the mean is 0.5, or 50%. This is just one example of a cumulative probability.

A cumulative probability of 70% corresponds to a point that has 70% of the area under the curve to its left.


To find the cumulative probability associated with a given z-value for a standard normal curve, we use the Excel function NORMSDIST. Note the S between the M and the D. It indicates we are working with a 'standard' normal curve with mean zero and standard deviation one. For example, to find the cumulative probability for the z-value 1, we enter the Excel function =NORMSDIST(1). The value returned, 0.84, is the area under the standard normal curve to the left of 1. This tells us that the probability of obtaining a value less than 1 for a standard normal curve is about 84%. We shouldn't be surprised that the probability of being less than 1 is 84%. Why? First, we know that the normal curve is symmetric, so there is a 50% chance of being below the mean. Next, we know that about 68% of the probability for a standard normal curve is between -1 and +1. Since the normal curve is symmetric, half of that 68% or 34% of the probability must lie between 0 and 1.

Excel: =STANDARDIZE(x;mean;std) followed by =NORMSDIST(standardized value), or directly =NORMDIST(x;mean;std;TRUE). The last argument of NORMDIST is TRUE for the cumulative probability; FALSE would return the height of the curve (the y-value), which we are not interested in here.


Example (general normal curve):
value = 24, mean = 26, std = 8 => standardized z-value = (24 − 26) / 8 = -0.25 => cumulative probability ≈ 40%
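A Python check of the example above (value 24, mean 26, std 8), using NormalDist as a stand-in for Excel's STANDARDIZE, NORMSDIST and NORMDIST:

from statistics import NormalDist

x, mean, std = 24, 26, 8
z = (x - mean) / std                    # -0.25, like =STANDARDIZE(24;26;8)
p1 = NormalDist(0, 1).cdf(z)            # like =NORMSDIST(-0.25)
p2 = NormalDist(mean, std).cdf(x)       # like =NORMDIST(24;26;8;TRUE)
print(z, round(p1, 3), round(p2, 3))    # both probabilities are about 0.40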

Suppose we want to find the z-value associated with the cumulative probability 95%. To translate a cumulative probability back to a z-value on the standard normal curve, we use the Excel function NORMSINV. Note once again the S, which tells us we are working with a standard normal curve. =NORMSINV(0.95) ≈ 1.645, i.e. 1.645 standard deviations from the mean.

Example: cumulative probability 0.95 => z ≈ 1.645

If we want to translate a cumulative probability back to a value on a general normal curve, we use the NORMINV function. NORMINV requires three values: the cumulative probability, and the mean and standard deviation of the normal distribution in question. Excel: =NORMINV(cum.prob;mean;std).

Example: cumulative probability 0.95, mean 26, std 8 => value = NORMINV(0.95;26;8) ≈ 39.2
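The reverse direction in Python, reproducing the two examples above (inv_cdf plays the role of NORMSINV and NORMINV):

from statistics import NormalDist

z = NormalDist(0, 1).inv_cdf(0.95)        # about 1.645, like =NORMSINV(0,95)
value = NormalDist(26, 8).inv_cdf(0.95)   # about 39.2, like =NORMINV(0,95;26;8)
print(round(z, 3), round(value, 1))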

Using the z-table
Let's use the z-table to find a cumulative probability.

Ex. #1: Women's heights are distributed normally, with mean around 63.5 inches and standard deviation 2.5 inches. What percentage of women are shorter than 65.6 inches?

z-value = (65.6 − 63.5) / 2.5 = 0.84
cum. probability = 0.7995

The cumulative probability is 0.7995: about 80% of women are shorter than 65.6 inches.

Ex. #2: What percentage of women are shorter than 61.6 inches?

z-value = (61.6 − 63.5) / 2.5 = -0.76
cum. probability = 0.2236 = 22.36%

When a z-value is negative, we must first use the table to find the cumulative probability corresponding to the positive z-value, in this case +0.76. Then, since the normal curve is symmetric, we can conclude that the probability of being less than the negative z-value -0.76 is the same as the probability of being greater than the positive z-value +0.76. Since the probability of being less than a z-value of +0.76 is 0.7764, the probability of being greater than a z-value of +0.76 is 1 - 0.7764 = 0.2236. Thus, we can conclude that the probability of being less than a z-value of -0.76 is also 0.2236.

How do the unique properties of the normal distribution help us when we use a random sample to infer something about the underlying population? It turns out that even if a population is not normally distributed, the properties of the normal distribution are very helpful to us in sampling. To see why, let's first learn about a well-established statistical fact known as the "Central Limit Theorem".

Furthermore, if we took enough of these samples, the mean of the resulting distribution of sample means would be equal to the true mean of the population. To repeat: no matter what type of distribution the population has (uniform, skewed, bi-modal, or completely bizarre), if we took enough samples, and the samples were sufficiently large, then the means of those samples would form a normal distribution centered around the true mean of the population.


The Central Limit Theorem tells us that the mean of that one sample is part of a normal distribution. More specifically, we know that the sample mean falls somewhere in a normal Distribution of Sample Means that is centered at the true population mean. The Central Limit Theorem is so powerful for sampling and estimation because it allows us to ignore the underlying distribution of the population we want to learn about. Since we know the Distribution of Sample Means is normally distributed and centered at the true population mean, we can completely disregard the underlying distribution of the population.

The distribution of the sample means will always form a normal distribution. This is what the Central Limit Theorem predicts. The Central Limit Theorem states that the means of sufficiently large samples are always normally distributed, a key insight that will allow you to estimate the population mean from a sample.
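A small simulation sketch of the Central Limit Theorem: the population below is strongly skewed (exponential, true mean 10), yet the means of 2,000 samples of size 50 cluster symmetrically around the true mean, with a spread close to sigma divided by the square root of n:

import random
import statistics

random.seed(1)
population_mean = 10                       # true mean of the skewed population
sample_means = [
    statistics.mean(random.expovariate(1 / population_mean) for _ in range(50))
    for _ in range(2_000)
]
print(round(statistics.mean(sample_means), 2))   # close to 10
print(round(statistics.stdev(sample_means), 2))  # close to 10 / sqrt(50), about 1.41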


We won't prove this fact here, but simply note that it is true, and that it should confirm our general intuition about the Distribution of Sample Means. For example, if we have huge samples, we'd expect the means of those large samples to be tightly clustered around the true population mean, and thereby form a narrow distribution.


Sample size: n ≥ 30 (the usual rule of thumb for relying on the Central Limit Theorem).


Samples smaller than 30: use the t-distribution (with n − 1 degrees of freedom) in place of the z-value when constructing the confidence interval.


Excel provides a simple function for finding the appropriate t-value for a confidence interval. If we enter 1 minus the level of confidence we want and the degrees of freedom into the Excel function TINV, Excel gives us the appropriate t-value: =TINV(1-level of confidence;n-1). For n = 16 and a 95% confidence level: =TINV(1-0.95;16-1) = TINV(0.05;15) = 2.131.

Example (Demiurgos):
sample size = 16, sample mean = $10,200.00, sample SD = $4,800.00, confidence level = 95%
degrees of freedom = 15, t-value = 2.131

conf. interval = mean ± t-value * (SD / sqrt(sample size))
95% conf. interval: 10,200 ± 2.131 * (4,800 / 4) = [$7,642.26, $12,757.74]
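The same Demiurgos interval reproduced in Python; scipy's t.ppf plays the role of Excel's TINV (a sketch of the arithmetic, not part of the course material):

import math
from scipy import stats

n, mean, sd = 16, 10_200, 4_800
t_value = stats.t.ppf(0.975, df=n - 1)            # two-sided 95% confidence -> about 2.131
margin = t_value * sd / math.sqrt(n)
print(round(mean - margin, 2), round(mean + margin, 2))   # 7642.26 and 12757.74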

What if the Demiurgos' manager thinks this interval is too large? She will have to survey more guests. Increasing the sample size causes the t-value to decrease, and also increases the size of the denominator (the square root of n). Both factors narrow the confidence interval.


To find the sample size necessary to give us a specified distance d from the mean, we must have an estimate of sigma, the standard deviation of spending. If we do not have an estimate based on past data or some other source, we might take a preliminary survey to obtain a rough estimate of sigma.
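A sketch of the implied sample-size calculation, assuming a 95% confidence level (z = 1.96) and illustrative numbers for sigma and d:

import math

sigma_estimate = 400      # rough (hypothetical) estimate of the standard deviation of spending
d = 50                    # desired distance from the mean
z = 1.96                  # 95% confidence level
n_required = (z * sigma_estimate / d) ** 2
print(math.ceil(n_required))   # round up to the next whole guest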


Step-by-step guide for confidence intervals

Confidence intervals and proportions

Sample size


Example:

First, you calculate a 95% confidence interval for the response rate. The 95% confidence interval for the proportion estimate is 0.0412 to 0.1588, or 4.12% and 15.88% (formula for confidence interval). Then after giving Leo's questions some thought, you recommend to him that he send the mailing to a specific number of guests. Based on the confidence interval for the proportion, the maximum percentage of people who are likely to respond to the discount offer (at the 95% confidence level) is 15.88%. So, if 15.88% of people were to respond for 200 rooms, how many people should Leo send out the survey to? Simply divide 200 by 0.1588 to get to the answer: Leo needs to send out the survey to at most 1,259 past customers.
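A sketch of the arithmetic in Python. The pilot figures (100 guests mailed, 10 responses) are an assumption chosen to be consistent with the interval quoted above, not numbers from the course:

import math

n_pilot, responses = 100, 10            # assumed pilot mailing, not from the course text
p_hat = responses / n_pilot             # 0.10
margin = 1.96 * math.sqrt(p_hat * (1 - p_hat) / n_pilot)
low, high = p_hat - margin, p_hat + margin
print(round(low, 4), round(high, 4))    # about 0.0412 and 0.1588
rooms = 200
print(math.floor(rooms / high))         # mail to at most about 1,259 past guests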


3 HYPOTHESIS TESTING

As the example suggests, in a hypothesis test, we test the null hypothesis. Based on evidence we gather from a sample, there are only two possible conclusions we can draw from a hypothesis test: either we reject the null hypothesis or we do not reject it. We cannot reject the null hypothesis if it is reasonably likely that our sample mean would come from a population with the mean stated by the null hypothesis. In that case, the null hypothesis may or may not be true: we simply don't have enough evidence to draw a definite conclusion.

Note that having the sample's average defect rate very close to 3 does not "prove" that the mean is 3. Thus we never say that we "accept" the null hypothesis; we simply don't reject it. It is because we can never "accept" the null hypothesis that we do not pose the claim that we actually want to substantiate as the null hypothesis: such a test would never allow us to "accept" our claim! The only way we can substantiate our claim is to state it as the opposite of the null hypothesis, and then reject the null hypothesis based on the evidence. In a hypothesis test, if our evidence is not strong enough to reject the null hypothesis, then that does not prove that the null hypothesis is true. We simply have failed to show it is false, and thus cannot reject it.


3.1 TESTS FOR SINGLE POPULATION MEANS


A hypothesis test with a 95% confidence level is said to have a 5% level of significance. A 5% significance level says that there is a 5% chance of a sample mean falling in the rejection region when the null hypothesis is true. This is what people mean when they say that something is "statistically significant at a 5% significance level."

If we increase our confidence level, we widen the range around the null hypothesis mean. At a 99% confidence level, our range captures 99% of all sample means. This reduces to 1% our chance of rejecting the null hypothesis erroneously. But doing this has a downside: by decreasing the chance of one type of error, we increase the chance of the other type. The higher the confidence level, the smaller the rejection region, and the less likely it is that we can reject the null hypothesis when it is in fact false. This decreases our chance of being able to substantiate the alternative hypothesis when it is true. As managers, we need to choose the confidence level of our test based on the relative costs of making each type of error.

The range of likely sample means should not be confused with a confidence interval. Confidence intervals are always constructed around sample means, never around population means. When we construct a confidence interval, we don't even have an initial estimate of the population mean. Constructing a confidence interval is a process for estimating the population mean, not for testing particular claims about that mean.


3.1.1 ONE-SIDED TESTS
Sometimes, we may want to know if the actual population mean differs from our initial value of the population mean in a specific direction. In these cases, our alternative hypothesis should clearly state which direction of change we want to test for. These kinds of tests are called one-sided hypothesis tests.

Example:

To find this range, all he needs to do is calculate its upper bound. For what value would 95% of all sample means be less than that value? To find out, we use what we know about the cumulative probability under the normal curve: a cumulative probability of 95% corresponds to a z-value of 1.645. Why is this different from the z-value for a two-sided test with a 95% confidence level? For a two-sided test, the z-value corresponds to a 97.5% cumulative probability, since 2.5% of the probability is excluded from each tail. For a one-sided test, the z-value corresponds to a 95% cumulative probability, since 5% of the probability is excluded from the upper tail.


3.2 HYPOTHESIS TEST FOR A SINGLE POPULATION PROPORTION

3.3 P-VALUES

The "p-value" measures this likelihood: it tells us how likely it is to collect a sample mean that falls at least a certain distance from the null hypothesis mean. In the familiar hypothesis testing procedure, if the p-value is less than our threshold of 5%, we reject our null hypothesis. The p-value does more than simply answer the question of whether or not we can reject the hypothesis. It also indicates the strength of the evidence for rejecting the null hypothesis. For example, if the p-value is 0.049, we barely have enough evidence to reject the null hypothesis at the 0.05 level of significance; if it is 0.001, we have strong evidence for rejecting the hypothesis. Excel: NORMSDIST(z value) Example: the p-value 0.0027 is much smaller than 0.05. Thus, we can reject the null hypothesis at 0.0027, a much lower significance level. In other words, we can reject the null hypothesis with 99.73% confidence. In general, the lower the p-value, the higher our confidence in rejecting the null hypothesis.


3.4 HYPOTHESIS TEST TO COMPARE TWO POPULATION MEANS


We conduct two-population tests to compare a characteristic of two groups for which we have access to sample data for each group. For example, we'd use a two-population test to study which of two educational software packages better prepares students for the GMAT. Do the students using package 1 perform better on the GMAT than the students using package 2? In two-population tests, we take two samples, one from each population. For each sample, we calculate the sample mean, standard deviation, and sample size. We can then use the two sets of sample data to test claims about differences between the two populations. For example, when we want to know whether two populations have different means, we formulate a null hypothesis stating that the means are not different: the first population mean is equal to the second.
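A sketch of such a comparison from summary statistics, using hypothetical GMAT numbers and scipy's ttest_ind_from_stats; the null hypothesis is that the two population means are equal:

from scipy import stats

t_stat, p_value = stats.ttest_ind_from_stats(
    mean1=620, std1=55, nobs1=40,    # sample of students using package 1 (hypothetical)
    mean2=595, std2=60, nobs2=45,    # sample of students using package 2 (hypothetical)
    equal_var=False,                 # do not assume equal population variances
)
print(round(t_stat, 2), round(p_value, 3))   # reject the null hypothesis if the p-value < 0.05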


3.5 HYPOTHESIS TEST FOR 2 POPULATION PROPORTIONS

4 REGRESSION
Regression is a statistical tool that goes even further: it can help us understand and characterize the specific structure of the relationship between two variables.

What kinds of questions can regression analysis help answer? How does regression help us as managers? It can help in two ways: first, it helps us forecast. For example, we can make predictions about future values of sales based on possible future values of advertising. Second, it helps us deepen our understanding of the structure of the relationship between two variables by expressing the relationship mathematically. With regression, we can forecast sales for any advertising level within the range of advertising levels we've seen historically. We must be extremely cautious about forecasting sales for values of advertising beyond the range of values we have already observed. The further we are from the historical values of advertising, the more we should question the reliability of our forecast. Another critical caveat to keep in mind is that whenever we use historical data to predict future values, we are assuming that the past is a reasonable predictor of the future. Thus, we should only use regression to predict the future if the general circumstances that held in the past, such as competition, industry dynamics, and economic environment, are expected to hold in the future. Regression can be used to deepen our understanding of the structural relationship between two variables. Linear relationship: sales = b + c * advertising, where b is the constant (intercept) and c is the advertising coefficient (slope).


The more important term is the advertising coefficient, c, which gives us the slope of the line. The advertising coefficient tells us how sales have changed on average as advertising has increased.

4.1 CALCULATING THE REGRESSION LINE


A regression line helps you understand the relationship between two variables and forecast future values of the dependent variable. The regression line depicts the best linear relationship between the two variables. We attribute the difference between the actual data points and the line to the influence that other variables have on sales, or to chance alone. Since the regression line does not pass through every point, the line does not fit the data perfectly. How accurately does the regression line represent the data? To quantify how accurately a line fits a data set, we measure the vertical distance between each data point and the line. From now on we will refer to this vertical distance between a data point and the line as the error in prediction or the residual error, or simply the error. The error is the difference between the observed value and the line's prediction for our dependent variable. This difference may be due to the influence of other variables or to plain chance.

The complete mathematical description of the relationship between the dependent and independent variables is y = a + bx + error. The y-value of any data point is exactly defined by these terms: the value y-hat given by the regression line plus the error, y - (y-hat).

This measure, called the Sum of Squared Errors (SSE), or the Residual Sum of Squares, gives us a good measure of how accurately a line describes a set of data. The less well the line fits the data, the larger the errors, and the higher the Sum of Squared Errors. To find the line that best fits a data set, we first need a measure of the accuracy of a line's fit: the Sum of Squared Errors. To find the Sum of Squared Errors, we calculate the vertical distances from the data points to the line, square the distances, and sum the squares.

4.1.1 IDENTIFYING THE REGRESSION LINE
We can calculate the Sum of Squared Errors for any line that passes through the data. Of course, different lines will give us different Sums of Squared Errors. The line we are looking for, the regression line, is the one with the smallest Sum of Squared Errors. The line that most accurately fits the data, the regression line, is the line for which the Sum of Squared Errors is minimized.
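A small Python sketch of that calculation for one candidate line y-hat = a + b*x, on illustrative data (not the course's):

data_x = [1, 2, 3, 4, 5]
data_y = [2.1, 3.9, 6.2, 8.1, 9.8]
a, b = 0.0, 2.0                                              # candidate intercept and slope

errors = [y - (a + b * x) for x, y in zip(data_x, data_y)]   # vertical distances to the line
sse = sum(e ** 2 for e in errors)                            # square the distances and sum
print(round(sse, 2))   # the regression line is the line that makes this as small as possible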

4.2 DEEPER INTO REGRESSION


4.2.1 QUANTIFYING THE PREDICTIVE POWER OF REGRESSION
Example: How much does the relationship between advertising and sales help us understand and predict sales? We'd like to be able to quantify the predictive power of the relationship in determining sales levels. How much more do we know about sales thanks to the advertising data? To answer this question we need a benchmark telling us how much we know about the behavior of sales without the advertising data. Only then does it make sense to ask how much more information the advertising data give us.

Without the advertising data, we have the sales data alone to work with. Using no information other than the sales data, the best predictor for future sales is simply the mean of previous sales. Thus, we use mean sales as our benchmark, and draw a "mean sales line" through the data. Let's compare the accuracy of the regression line and the mean sales line. We already have a measure of how accurately an individual line fits a set of data: the Sum of Squared Errors about the line. Now we want a measure of how much more accurate the regression line is than the mean line. To obtain such a measure, we'll calculate the Sum of Squared Errors for each of the two lines, and see how much smaller the error is around the regression line than around the mean line.

The Sum of Squared Errors for the mean sales line measures the total variation in the sales data. In fact, it is the same measure of variation we use to derive the standard deviation of sales. We call the Sum of Squared Errors for our benchmark, the mean sales line, the Total Sum of Squares. The Sum of Squared Errors for the regression line is often called the Residual Sum of Squared Errors, or the Residual Sum of Squares. The Residual Sum of Squares is the variation left "unexplained" by the regression.

A standardized measure of the regression line's explanatory power is called R-squared. R-squared is the fraction of the total variation in the dependent variable that is explained by the regression line. R-squared will always be between 0 and 1: at worst, the regression line explains none of the variation in sales; at best, it explains all of it.
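A sketch of R-squared on illustrative advertising/sales pairs (not the course's data), using statistics.linear_regression from Python 3.10+:

import statistics

advertising = [10, 12, 15, 17, 20, 22]
sales = [92, 95, 112, 115, 130, 142]

slope, intercept = statistics.linear_regression(advertising, sales)
predicted = [intercept + slope * x for x in advertising]

rss = sum((y - y_hat) ** 2 for y, y_hat in zip(sales, predicted))   # Residual Sum of Squares
tss = sum((y - statistics.mean(sales)) ** 2 for y in sales)         # Total Sum of Squares
print(round(1 - rss / tss, 3))   # R-squared: fraction of the total variation explained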


4.2.1.1 RESIDUAL ANALYSIS
Although the regression line is the line that best fits the observed data, the data points typically do not fall precisely on the line. Collectively, the vertical distances from the data to the line (the errors) measure how well the line fits the data. These errors are also known as residuals.

First, we measure the residuals: the distances from the data points to the regression line. Then we plot the residuals against the values of the independent variable. This graph, called a residual plot, helps us identify patterns in the residuals. A residual plot often is better than the original scatter plot for recognizing patterns because it isolates the errors from the general trend in the data. Residual plots are critical for studying error patterns in more advanced regressions with multiple independent variables. If we see a pattern in the distribution of the residuals, then we can infer that there is more to the behavior of the dependent variable than what is explained by our linear regression. Other factors may be influencing the dependent variable, or the assumption that the relationship is linear may be unwarranted. When the residuals appear to be getting larger for higher values of the independent variable, this phenomenon is known as heteroskedasticity.

4.2.2 THE SIGNIFICANCE OF REGRESSION COEFFICIENTS
In fact, the regression line is almost never a perfect descriptor of the true linear relationship between the variables. Why? Because the data we use to find the regression line typically represent only a sample from the entire population of data pertaining to the relationship. Since each regression line comes from a limited set of data, it gives us only an approximation of the "true" linear relationship between the variables.

When we calculate the coefficients of a regression line from a set of observed data, the value a is an estimate of alpha, the intercept of the "true" line. Similarly, the value b is an estimate of beta, the slope of the true line. Since we don't know the exact value of the slope of the true advertising line, we might well question whether there actually IS a linear relationship between advertising and sales. How can we assure ourselves that the pattern we see in the sample data is not simply due to chance?


If there truly were no linear relationship, then in the full population of relevant data, changes in advertising would not correspond systematically to proportional changes in sales. In that case, the slope of the best-fitting line for the true relationship would be zero. There are two quick ways to test if the slope of the true line might be zero. One is simply to look at the confidence interval for the slope coefficient and see if it includes zero. The other is to look at the p-value reported for the slope coefficient. In general, if the p-value for a slope coefficient is less than 0.05, then we can reject the null hypothesis that the slope beta is zero, and conclude with 95% confidence that there is a linear relationship between the two variables. Moreover, the smaller the p-value, the more confident we are that a linear relationship exists. If the p-value for a slope coefficient is greater than 0.05, then we do not have enough evidence to conclude with 95% confidence that there is a significant linear relationship between the variables.
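A sketch of checking a slope's significance with scipy's linregress, again on illustrative data; the reported p-value tests the null hypothesis that the true slope beta is zero:

from scipy import stats

advertising = [10, 12, 15, 17, 20, 22, 25, 27]
sales = [92, 95, 112, 115, 130, 142, 151, 160]

result = stats.linregress(advertising, sales)
print(round(result.slope, 2), round(result.pvalue, 5))
print(result.pvalue < 0.05)   # True here: evidence of a significant linear relationship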

4.3 SUMMARY


4.4 REVISITING R^2


Example:

So far, we haven't discussed how sample size affects the accuracy of a regression analysis. The larger the sample we use to conduct the regression analysis, the more precise the information we obtain about the true nature of the relationship under investigation. Specifically, the larger the sample, the better our estimates for the slope and the intercept, and the tighter the confidence intervals around those estimates.


5 MULTIPLE REGRESSION
In multiple regression, we adapt what we know about regression with one independent variable (often called simple regression) to situations in which we take into account the influence of several variables. Graphing data on more than two variables poses its own set of difficulties. Three variables can still be represented, but beyond that, visualization and graphical representation become essentially impossible.

5.1 ADAPTING BASIC CONCEPTS


5.1.1 INTERPRETING THE MULTIPLE REGRESSION EQUATION
y = a + b1*v1 + b2*v2, where a is the constant coefficient (intercept), b1 and b2 are the coefficients of the variables, and v1 and v2 are the independent variables.

The coefficients in the simple regression and the coefficients in the multiple regression have very different meanings. In the simple regression equation of price versus distance, we interpret the coefficient, -39,505, in the following way: for every additional mile farther from downtown, we expect house price to decrease by an average of $39,505. We describe this average decrease of $39,505 as a gross effect: it is an average computed over the range of variation of all other factors that influence price. In the multiple regression of price versus size and distance, the value of the distance coefficient, -55,006, is different, because it has a different meaning. Here, the coefficient tells us that, for every additional mile, we should expect the price to decrease by $55,006, provided the size of the house stays the same. In other words, among houses that are similarly sized, we expect prices to decrease by $55,006 per mile of distance to downtown. We refer to this decrease as the net effect of distance on price. Alternatively, we refer to it as "the effect of distance on price controlling for house size".
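A sketch of a price-versus-size-and-distance regression with numpy's least-squares solver; the six houses below are hypothetical and serve only to show how the net coefficients are obtained:

import numpy as np

size = np.array([1800, 2200, 1500, 2600, 2000, 1700])        # square feet
distance = np.array([3.0, 1.5, 5.0, 2.0, 4.0, 6.0])          # miles from downtown
price = np.array([430_000, 560_000, 310_000, 620_000, 450_000, 300_000])

X = np.column_stack([np.ones(len(size)), size, distance])    # intercept, size, distance columns
coefficients, *_ = np.linalg.lstsq(X, price, rcond=None)
intercept, b_size, b_distance = coefficients
print(round(b_size, 1), round(b_distance, 1))                # net effects of size and of distance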


5.1.2 RESIDUAL ANALYSIS

5.1.3 QUANTIFYING THE PREDICTIVE POWER OF MULTIPLE REGRESSION


5.2 NEW CONCEPTS IN MULTIPLE REGRESSION


5.2.1 MULTICOLLINEARITY
When a multiple regression delivers a surprising result such as this, we can usually attribute it to a relationship between two or more of the independent variables. When two of the independent variables are highly correlated, one is essentially a proxy for the other. This phenomenon is called multicollinearity. A common indication of lurking multicollinearity in a regression is a high adjusted R-squared value accompanied by low significance for one or more of the independent variables. One way to diagnose multicollinearity is to check whether the p-value on an independent variable rises when a new independent variable is added, suggesting strong correlation between those independent variables.

How much of a problem is multicollinearity? That depends on what we are using the regression analysis for. If we're using it to make predictions, multicollinearity is not a problem, assuming as always that the historically observed relationships among the variables continue to hold going forward. If we're trying to understand the net relationships of the independent variables, multicollinearity is a serious problem that must be addressed. One way to reduce multicollinearity is to increase the sample size: the more observations we have, the easier it will be to discern the net effects of the individual independent variables. We can also reduce or eliminate multicollinearity by removing one of the collinear independent variables.

5.2.2 LAGGED VARIABLES
Price, house size, lot size, and distance from the city center are cross-sectional data: we looked at a cross-section of the Silverhaven real estate market at a specific point in time. A time series, by contrast, is a set of data collected over a range of time: each data point pertains to a specific time period. We incorporate the delayed effect of an independent variable on a dependent variable using a lagged variable.

Adding a lagged variable is costly in two ways. The loss of a data point decreases our sample size, which reduces the precision of our estimates of the regression coefficients. At the same time, because we are adding another variable, we decrease adjusted R-squared. Thus, we include a lagged variable only if we believe the benefits of adding it outweigh the loss of an observation and the "penalty" imposed by the adjustment to R-squared. Despite these costs, lagged variables can be very useful. Since they pertain to previous time periods, they are usually available ahead of time. Lagged variables are often good "leading indicators" that help us predict future values of a dependent variable.
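A small sketch of constructing a one-period lagged variable for such a regression; the numbers are illustrative, and the first observation is lost exactly as described above:

advertising = [30, 32, 35, 33, 38, 40]      # periods 1 to 6
sales = [200, 210, 224, 231, 228, 250]      # periods 1 to 6

lagged_advertising = advertising[:-1]       # advertising in period t-1
sales_aligned = sales[1:]                   # sales in period t
print(list(zip(lagged_advertising, sales_aligned)))   # 5 usable observations instead of 6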


5.2.3 DUMMY VARIABLES
Many of the variables we study are qualitative or categorical: they do not naturally take on numerical values, but can be classified into categories.

6 DECISION ANALYSIS


6.1 DECISION TREES


Costs that were incurred or committed to in the past, before a decision is made, contribute to the total costs of all scenarios that could possibly unfold after the decision is made. As such, these costs, called sunk costs, should not have any bearing on the decision, because we cannot devise a scenario in which they are avoided.


6.2 CONDITIONAL PROBABILITIES


6.2.1 JOINT AND MARGINAL PROBABILITIES

6.2.2 CONDITIONAL PROBABILITIES

6.2.3 STATISTICAL INDEPENDENCE


6.3 CONDITIONAL PROBABILITIES IN DECISION ANALYSIS

6.4 THE VALUE OF INFORMATION


6.5 RISK ANALYSIS
