Business Statistics

Autumn 2009

Chicago Booth C. Alan Bester


About this Course
- Below is a link to the course website. Please visit and bookmark this site NOW.
  faculty.chicagobooth.edu/alan.bester/teaching/
- You can also find the course website on Chalk or Google “business statistics bester”.
- Everything you need to know is in the lecture notes. Everything you need for the class is on the course website.

About These Notes
- You will find links to data sets, examples, and other things we talk about throughout the notes.
- Due to the name change I’ve had to change all the links from ‘chicagogsb.edu’ to ‘chicagobooth.edu’. If you find one (in the notes or on the website) that doesn’t work, try changing ‘gsb’ to ‘booth’ in the URL.
- Yes, there are a lot of slides. I like to restate things and limit the number of concepts per slide. This course is actually about a small number of “big ideas” that we will develop throughout the quarter.

“Statistical Method”
(We’ll start here)

Formulate problem

Get some data

Visualize the data

Do some statistical calculations

Interpret results


Notes 1: Data: Plots and Summaries

1. Data
2. Looking at a Single Variable
   2.1 Tables
   2.2 Histograms
   2.3 Dotplots
   2.4 Time Series Plots
3. Summarizing a Single Numeric Variable
   3.1 The Mean and Median
   3.2 The Variance and Standard Deviation
   3.3 The Empirical Rule
   3.4 Percentiles, quartiles, and the IQR
4. Looking at Two Variables
   4.1 Categorical variables: the Two-way table
   4.2 Numeric variables: Scatter Plots
   4.3 Relating Numeric and Categorical variables
5. Summarizing Bivariate Relations
   5.1 In Tables
   5.2 Covariance and Correlation
6. Linearly related variables
   6.1 Linear functions
   6.2 Mean and variance of a linear function
   6.3 Linear combinations
   6.4 Mean and variance of a linear combination
7. Linear Regression
8. Pivot Tables (Optional)


1. Data
Here is some data (our sample):

age  sex  soc  edu  Reg  inc  cola  restE  juice  cigs  antiq  news  ender  friend  simp  foot
 67    2    3    1    3   12     1      0      1     0      1     0      0       0     0     0
 51    2    3    8    3   10     1      1      0     1      1     0      1       1     0     0
 63    2    3    1    2   13     1      1      0     1      1     0      1       0     0     0
 45    2    4    3    1   18     1      1      1     0      1     0      0       0     0     0
. . .
(many more rows!)

The data is from a large survey carried out by a marketing research company in Britain. (Marketing data) Each row corresponds to a household. Each column corresponds to a different feature of the household. The features are called variables. The rows are called observations.

Most data sets come in this form. A rectangular array. Rows are observations. Columns are variables.

Variables are the fundamental object in statistics. They come in several types.


The variable labeled "age" is simply the age (in years) of the responder. This is a numeric variable. This variable has units, and averages are interpretable. In contrast, the variable "Reg" is the geographical region of the household. Each "number" is really just a code for a region:
 1 "Scotland"
 2 "North West"
 3 "North"
 4 "Yorkshire & Humberside"
 5 "East Midlands"
 6 "East Anglia"
 7 "South East"
 8 "Greater London"
 9 "South West"
10 "Wales"
11 "West Midlands"

A variable like Reg is called categorical. Think of:
numeric vs. categorical
quantitative vs. qualitative

Instead of using numbers we could have used text strings in the data file. That is, instead of

Reg: 3 3 2 1 . .

we could have

Reg: North North North_West Scotland . .

But it is extremely common to use numeric codes. Another example: Which Democratic candidate do you support? 1 = Hillary Clinton, 2 = John Edwards, 3 = Barack Obama, 4 = Bill Richardson

The variable soc is categorical. It takes on codes 1-6, with meanings:
1 "A"
2 "B"
3 "C1"
4 "C2"
5 "D"
6 "E"

This is an ordered categorical variable. You can't think of it as a numerical measure, but the categories are ordered: A, B, ..., E. (“A” is actually the highest social grade, so the codes run from the highest grade to the lowest.) Soc is ordered like age, but does not have units. It does not really make sense to compute the difference or to average two soc measurements. It does make sense to difference two ages.

That pretty much covers it. Variables are either numeric, categorical, or ordered categorical. Of course a numeric variable is always ordered.

For numeric variables we also have: A variable is discrete if you can list its possible values. Otherwise it is called continuous.


For example, the amount of rainfall in the City of Chicago this month is usually thought of as being continuous. As a practical matter, any variable is discrete since we put it in the computer. What it comes down to is, if there are a lot of possible values, we think of it as continuous. (This is not really that important now; it will be later when we get to probability.) For example, you might think of age as continuous even though we measure it in years and can easily list its possible values. Number of children is more likely to be thought of as discrete.

Again, a good rule when working with a numeric variable is to keep in mind the units in which it is measured. For example age has units years. Percentages, which are numeric, don't have units. But there are always units somewhere. For example, if we look at the percentage of income a household spends on entertainment, we are looking at one quantity measured in units of currency divided by another.

Here are the definitions of all the variables in the survey data set:

age: age in years
sex: 1 means male, 2 means female
soc: we saw this
edu: education, terminal age of education

1 "14 Or Under"
2 "15"
3 "16"
4 "17"
5 "18"
6 "19"
7 "20"
8 "21-23"
9 "24 Or Over"

Reg: we saw this.

inc: income

VARIABLE LABELS V_842 "Total Family Income Before Tax".
VALUE LABELS V_842
 1 "£1,999 Or Less"
 2 "£2,000 - £2,999"
 3 "£3,000 - £3,999"
 4 "£4,000 - £4,999"
 5 "£5,000 - £5,999"
 6 "£6,000 - £6,999"
 7 "£7,000 - £7,999"
 8 "£8,000 - £8,999"
 9 "£9,000 - £9,999"
10 "£10,000 - £10,999"
11 "£11,000 - £11,999"
12 "£12,000 - £14,999"
13 "£15,000 - £19,999"
14 "£20,000 - £24,999"
15 "£25,000 - £29,999"
16 "£30,000 - £34,999"
17 "£35,000 - £39,999"
18 "£40,000 - £49,999"
19 "£50,000 Or Over"
20 "Not Stated"

Note:

Both edu and inc could have been numeric, but are broken down into ranges. They are thus ordered categorical. This is extremely common; with income there are actually good reasons for doing this!

cola, restE, juice, cigs indicate use of a product category. 1 if you use it, 0 if you don't. This is called a dummy variable. 1 indicates something "happened", 0 if not. So, cigs=1 means you purchase cigarettes. restE means "restaurants in the evening". This is extremely common. Often in statistics we are interested in “does something happen?”. Another example is approval ratings ( 1=approve ). We will work with a lot of dummy variables this quarter.


A dummy variable can take on two values, 0 or 1. We use dummy variables to indicate something, 1 if that something “happened”, 0 if it did not. The rest of the variables in the marketing data represent tv shows. They are dummies: 1 if you watch, 0 if you don't.

antiq: antiques roadshow
news: bbc news
enders: east enders
friend: friends
simp: simpsons
foot: "football" (soccer)

Now we can see that there are three types of variables in the data set. (i) Demographics: age through income (ii) Product category usage, (iii) Media exposure (tv shows). What is the point? Why collect this data? We want to see how product usage relates to demographics. What kind of people drink colas? We want to see how the media relates to product usage so that we can select the appropriate media to advertise in. If friends viewers tend to drink colas, that might be a good place to advertise your cola.


Important Note: You can always take a numeric variable and make it an ordered categorical variable by using bins. For example, instead of treating age as a numeric variable it is common to break it into ranges, for example:

0-20:  a1
21-30: a2
31-40: a3
41-50: a4
51-60: a5
61-70: a6
>70:   a7

The simplest case is a dummy variable:

$$d = \begin{cases} 1 & x > a \\ 0 & x \le a \end{cases}$$

where x is numeric.

For example, you could define someone to be "old" if older than 40 and "young" otherwise. d=1 then means "old" and d=0 means "young".
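The binning and dummy constructions above can be sketched in a few lines of Python. This is only an illustration: the function names `age_bin` and `dummy` are made up here, and the four ages are the first four rows of the sample shown earlier.

```python
def age_bin(age):
    """Map a numeric age to the ordered categories a1..a7 from the slide."""
    upper_edges = [20, 30, 40, 50, 60, 70]   # right edges of a1..a6; a7 is >70
    for k, edge in enumerate(upper_edges, start=1):
        if age <= edge:
            return f"a{k}"
    return "a7"


def dummy(x, a):
    """Return d = 1 if x > a and d = 0 if x <= a."""
    return 1 if x > a else 0


ages = [67, 51, 63, 45]                 # first four ages in the sample
bins = [age_bin(x) for x in ages]       # ordered categorical version of age
old = [dummy(x, 40) for x in ages]      # 1 means "old", 0 means "young"
print(bins, old)
```

Note that someone aged exactly 40 gets d = 0, since the rule is "older than 40".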


2. Looking at a Single Variable
The most interesting thing in statistics is understanding how variables relate to each other. "Friends watchers tend to drink colas". "Smokers tend to get cancer". But it is still very important to get a sense of what variables are like on their own. Note: We’ll use the term “distribution” informally to talk about what a variable looks like (what does a typical value look like, how spread out are its values, etc.) We will use the term more formally when we study probability.

2.1 Tables
To look at a categorical variable we use a table:
soc   count
1      28
2     151
3     310
4     235
5     156
6     120

How to make this table

We simply count how many of each category we have. Note: We have 1000 observations total, so the numbers in this table must add to 1000.
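As a rough sketch of the counting step, here is how such a table could be built in Python. The list `soc` below is a small made-up set of codes, not the survey data.

```python
from collections import Counter

# Hypothetical social grade codes for ten households (not the survey data).
soc = [3, 4, 2, 3, 6, 5, 3, 1, 4, 3]

counts = Counter(soc)                   # category -> number of observations
for category in sorted(counts):
    print(category, counts[category])

# The counts must add up to the number of observations.
print(sum(counts.values()), len(soc))
```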

I like to graph the table. This table makes it easy to see how different social grades are represented. Numbers at the bottom are categories. The height of each bar equals the number of observations in that category.
[Bar chart: social grade (categories 1 to 6) on the horizontal axis, count (0 to 350) on the vertical axis]

2.2 Histograms
We take a numeric variable, break it down into categories and then plot the table as on the previous slide. Remember, the height of each bar = # of observations or “frequency” in that category.
[Histogram for age: age categories (<=15, 15-20, 20-25, ..., 85-90, >90) on the horizontal axis, frequency (0 to 120) on the vertical axis]

The category 35-40 means (35,40], that is, 35 < x <= 40.
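The table behind a histogram can be sketched as follows, assuming bins of width 5 that are open on the left and closed on the right, as on the slide. The helper `bin_label` is a name invented for this sketch, and only the first four ages come from the sample; the rest are made up.

```python
import math
from collections import Counter


def bin_label(x, width=5):
    """Return (lo, hi) for the bin (lo, hi] of the given width that contains x."""
    hi = math.ceil(x / width) * width
    return (hi - width, hi)


ages = [67, 51, 63, 45, 40, 35, 36]     # last three values are hypothetical
freq = Counter(bin_label(x) for x in ages)
for (lo, hi), count in sorted(freq.items()):
    print(f"({lo},{hi}]: {count}")
```

An age of exactly 40 lands in (35,40], while 35 lands in (30,35].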

Time between arrivals at a bank, in minutes. (Bank data)
[Histogram for InterarrivalTime: interarrival time category (minutes) on the horizontal axis, frequency (0 to 70) on the vertical axis]

A histogram with a "heavy right tail" is called skewed right. You can guess what skewed left is.

Here’s a histogram of monthly hedge fund returns from 1994 to 2005. Notice anything interesting?

[Histogram of monthly hedge fund returns]

Source: Nicolas P. B. Bollen and Veronika K. Pool, “Do Hedge Fund Managers Misreport Returns? Evidence from the Pooled Distributions”; original data from Center for International Securities and Derivatives Markets, University of Massachusetts

Aside: Histograms can be displayed in different ways…

The observations here are starting players in the NFL (on offense). The numbers on the vertical axis correspond to rounds of the NFL draft, while the length of each blue bar is the percentage of starting players drafted at that position (forget the red bars). The plots on the right show only quarterbacks and fullbacks. (Source)

Don’t worry, all of our histograms will be like the previous two slides.

“Aside” or “Optional” on a slide means you are not responsible for the material on that slide on an exam!


2.3 Dotplots
It can be a hassle choosing the bins for a numeric variable. For discrete variables and/or small data sets, we can just put a dot on the number line for each value.
(Beer data)
nbeerm: the number of beers male MBA students claim they can drink without getting drunk
nbeerf: same for females

Note (1): Unfortunately StatPro doesn’t do dotplots. The dotplots in these slides were done in Minitab.
Note (2): The beer data is text, not Excel format. Use “Text to Columns”.

[Character dotplots of nbeerm (top) and nbeerf (bottom) on a common axis running from 0.0 to 20.0; one isolated point is marked]

We call a point like this an outlier.

Generally the males claim they can drink more: their numbers are centered, or located, at larger values. Note: The dot plot is giving you the same kind of information as the histogram.

2.4 Time Series Plots
The survey data is what we call cross-sectional. The households in our survey are a (hopefully representative) cross section of all British households at a particular point in time. In cross-sectional data, order doesn’t matter. We can sort our households by age, social grade, etc. and none of our results change as long as we keep each row intact. Other examples would be samples where every row corresponds to a firm, a plant, a machine... With a time series, each observation corresponds to a point in time.

Daily data on the Dow Jones index: (Dow data)
Date       Open      High      Low       Close     Volume
1-May-00   10749.4   11001.3   10622.2   10811.8    9663000
2-May-00   10805.6   10932.5   10580.7   10731.1   10115000
3-May-00   10732.2   10754.4   10345.2   10480.1    9916000
4-May-00   10478.9   10631.5   10293.1   10412.5    9258000
. . .

For time series data, the order of observations matters. (1-May-00 comes before 2-May-00, etc.) The easiest way to visualize time series data is often simply to plot the series in time order.

[Time series plot of Close: Date (5/1/2000 to 4/1/2002) on the horizontal axis, Close (7800 to 11400) on the vertical axis]

Time series plot of the close series.
How to make this plot


We could have data at various frequencies: daily, monthly, quarterly, annual. The kinds of patterns you will uncover can be very different depending on the frequency of the data. A current hot topic of research at Booth is "high frequency data".

Monthly US beer production. Do you see a pattern?

[Time series plot of b_prod: Index (10 to 70) on the horizontal axis, b_prod (12 to 20) on the vertical axis]

Would we see this pattern if we looked at annual data?

Time series plot of monthly returns on a portfolio of Canadian assets: (Country Portfolio returns)
On the vertical axis we have returns. On the horizontal axis we have “time”.

[Time series plot of canada: Index (20 to 100) on the horizontal axis, returns (-0.1 to 0.1) on the vertical axis]

Do you see a pattern?

Here is the histogram of the Canadian returns. Notes: (i) The histogram does not depend on the time order. (ii) The appearance of the histogram depends on the number of bins. Too many bins makes the histogram appear “spiky”.

[Two histograms of canada, with Frequency on the vertical axis, drawn with different numbers of bins]

Taken from David Greenlaw, Jan Hatzius, Anil Kashyap, and Hyun Shin, US Monetary Policy Forum Report No. 2, 2008

Be careful. What pattern do you see in this series? How about now?

From same paper as the previous slide.

Time series plots are also used to compare patterns across different variables over time, and sometimes to see the impact of past events (be very careful there, too).

3. Summarizing a Single Numeric Variable
We have looked at graphs. Suppose we are now interested in having numerical summaries of the data rather than graphical representations. Two important features of any numeric variable are: 1) What is a typical or average value? 2) How spread out or ‘variable’ are the values?


The mean and median capture a typical value. The variance/standard deviation capture the spread. For example we saw that the men tend to claim they can drink more. How can we summarize this?
[Character dotplots of nbeerm and nbeerf on a common axis running from 0.0 to 20.0]

Monthly returns on Canadian portfolio and Japanese portfolio. They seem to be centered roughly at the same place but Japan has more spread. How can we summarize this?

3.1 The Mean and Median
We will need some notation. Suppose we have n observations on a numeric variable which we call "x".

$$x_1, x_2, x_3, \ldots, x_n$$

Here $x_1$ is the first number and $x_n$ is the last number. n is the number of numbers, or the “number of observations.” You may also hear it referred to as the “sample size.”

$x_i$ is the value of x associated with the i-th observation (row).

Here, x is just a name for the set of numbers, we could just as easily use y. In a real data set we would use a meaningful name like "age".

x
5   (this is $x_1$)
2
8   (this is $x_3$)
6
2

n = 5

Sometimes the order of the observations means something. In our return data the first observation corresponds to the first time period. In the survey data, the order did not matter.

The sample mean is just the average of the numbers “x”:

$$\bar{x} = \frac{\text{sum}}{n} = \frac{x_1 + x_2 + \cdots + x_n}{n}$$

We often use the symbol $\bar{x}$ to denote the mean of the numbers x. We call it “x bar”.

Here is a more compact way to write the same thing. Consider

$$\frac{x_1 + x_2 + \cdots + x_n}{n}$$

We use a shorthand for the sum in the numerator (it is just notation):

$$\sum_{i=1}^{n} x_i = x_1 + x_2 + \cdots + x_n$$

This is summation notation.

Using summation notation we have: The sample mean:

$$\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$$
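As a small sketch, the sample mean formula maps directly onto code. The numbers are the five x values from the notation example earlier (5, 2, 8, 6, 2).

```python
import statistics

x = [5, 2, 8, 6, 2]          # the n = 5 numbers from the notation example
n = len(x)

x_bar = sum(x) / n           # (1/n) times the sum of the x_i
print(x_bar)

# The standard library computes the same quantity.
print(statistics.mean(x))
```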

Graphical interpretation of the sample mean
Here are the dot plots of the beer data for women and men. Which group claims to be able to drink more?
[Character dotplots of nbeerf (top) and nbeerm (bottom) on a common axis running from 0.0 to 12.5]

In some sense, the men claim to drink more. To summarize this we can compute the average value for each group (men and women). Note: I deleted the outlier; I do not believe him!

Mean of nbeerf = 4.2222
Mean of nbeerm = 7.8625
How to calculate these means

“On average women claim they can drink 4.2 beers. Men claim they can drink 7.9 beers”

In the picture, I think of the mean as the “center” of the data.
[Character dotplots of nbeerf and nbeerm on a common axis running from 0.0 to 12.5, with the means marked: 4.2 for nbeerf and 7.86 for nbeerm]

Let us compare the means of the Canadian and Japanese returns.
Mean of canada = 0.0090654
Mean of japan = 0.0023364

This is a big difference as a practical matter! (Average monthly return of 0.91% versus 0.23%.) It was hard to see this difference in the histograms because the difference is small compared to the variation.

More on summation notation (take this as an aside)

Let us look at summation in more detail.

$$\sum_{i=1}^{n} x_i$$

means that for each value of i, from 1 to n, we add to the sum the value indicated, in this case $x_i$ (add in this value for each i).

To understand how it works let us consider some examples. Think of each row as an observation on both x and y. To make things concrete, think of each row as corresponding to a year and let x and y be annual returns on two different assets.

year   x      y
1      0.07   0.11
2      0.06   0.05
3      0.04   0.09
4      0.03   0.03

In year 1 asset “x” had return 7%. In year 4 asset “y” had return 3%.

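Compute x bar:

$$\bar{x} = \frac{1}{4} \sum_{i=1}^{4} x_i = \frac{0.07 + 0.06 + 0.04 + 0.03}{4} = 0.05$$

Compute y bar:

$$\bar{y} = \frac{1}{4} \sum_{i=1}^{4} y_i = \frac{0.11 + 0.05 + 0.09 + 0.03}{4} = 0.07$$

The limits of a sum need not cover every observation, as in $\sum_{i=2}^{3} x_i = x_2 + x_3$ (here, we do not sum over all observations: we sum only over the second and the third observation).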

For each value of i, we can add in anything we want:

How to do these calculations using Excel

$$\sum_{i=1}^{4} (x_i - \bar{x})(y_i - \bar{y}) = (.02)(.04) + (.01)(-.02) + (-.01)(.02) + (-.02)(-.04) = .0012$$
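The sums in this aside can be checked with a short Python sketch using the four annual returns from the table. The variable names `partial` and `cross` are made up for the illustration.

```python
x = [0.07, 0.06, 0.04, 0.03]    # annual returns on asset x, years 1 to 4
y = [0.11, 0.05, 0.09, 0.03]    # annual returns on asset y, years 1 to 4
n = len(x)

x_bar = sum(x) / n              # sum over i = 1..n, divided by n
y_bar = sum(y) / n

# Sum over only the second and third observations (Python indexes from 0).
partial = sum(x[1:3])

# "Add in anything we want": products of deviations from the means.
cross = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
print(x_bar, y_bar, partial, cross)
```

Because these are floating-point numbers, the printed values may differ from 0.05, 0.07, 0.10 and 0.0012 in the last decimal places.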

The median

After ordering the data, the median is the middle value of the data. If there is an even number of data points, the median is the average of the two middle values.

Example
1,2,3,4,5      Median = 3
1,1,2,3,4,5    Median = (2+3)/2 = 2.5

Mean versus median

Although both the mean and the median are good measures of the center of a distribution of measurements, the median is less sensitive to extreme values. The median is not affected by extreme values because only the middle of the ordered data enters its computation; how far out the largest and smallest values lie does not matter.

Example
1,2,3,4,5      Mean: 3    Median: 3
1,2,3,4,100    Mean: 22   Median: 3
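The median rule and the mean-versus-median comparison can be reproduced with the standard library. This sketch simply reruns the examples from the slides.

```python
from statistics import mean, median

# Median: middle value, or the average of the two middle values.
print(median([1, 2, 3, 4, 5]))          # odd number of points
print(median([1, 1, 2, 3, 4, 5]))       # even number of points

# Replacing the largest value with 100 moves the mean but not the median.
a = [1, 2, 3, 4, 5]
b = [1, 2, 3, 4, 100]
print(mean(a), median(a))
print(mean(b), median(b))
```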

For the bank interarrival data:
Summary measures for selected variables

InterarrivalTime
Mean     4.163
Median   2.779

[Histogram for InterarrivalTime: interarrival time category (minutes) on the horizontal axis, frequency (0 to 70) on the vertical axis]

If data is right skewed the mean will be bigger than the median. You can think of this as the extreme “right tail” observations pulling the mean upward.

Median or Mean? At Booth professors are rated by students from 1-5 in several categories. In the past only the mean rating was reported. Some faculty members believe the median should be reported instead. This was actually a major debate at a faculty meeting a few years ago. What difference would this make? In fact, Booth now reports the mean and median, along with a histogram of all the ratings!

The Mean of a Dummy Variable Consider the "simpson" variable in the survey data set. Does it make sense to take the mean?
Summary measures for selected variables

simpsons
Count   1000.000
Mean       0.181

The sum of the 1's and 0's will equal the number of respondents who watch the simpsons. So the mean is the fraction of respondents who watch.

So, in general, the average of a dummy gives the fraction (percentage) of times that whatever dummy=1 signals happens. Another example: if a poll is conducted about a particular candidate where 1=approval, 0=disapproval, then the sample mean is the candidate’s approval rating. This may seem obvious, but we will get a lot of use out of this idea throughout the quarter.
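A tiny sketch of the idea, using a made-up dummy for ten respondents (not the survey data):

```python
# Hypothetical dummy: 1 if the respondent watches the show, 0 if not.
watch = [0, 1, 0, 0, 1, 0, 0, 0, 0, 0]

n_watchers = sum(watch)               # adding the 1's counts the watchers
fraction = n_watchers / len(watch)    # the mean of the dummy
print(n_watchers, fraction)
```

In the survey data the same calculation gives 181 watchers out of 1000, a mean of 0.181.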

3.2 The Variance and Standard Deviation
The mean and the median give us information about the central tendency of a set of observations, but they shed no light on the dispersion, or spread, of the data.

Example: Which data set is more variable?
5,5,5,5,5    Mean: 5
1,3,5,8,8    Mean: 5

If these were portfolio returns (in percent), means are average returns. What else might we want to measure?


The Sample Variance
[Character dotplots of x (top) and y (bottom) on a common axis running from 0.030 to 0.105]

The y numbers are more spread out than the x numbers. We want a numerical measure of variation or spread.

The basic idea is to view variability in terms of distance between each measurement and the mean.

$$x_i - \bar{x}$$

[Character dotplots of x (top) and y (bottom) on a common axis running from 0.030 to 0.105]

Overall, the distances from the mean for x are smaller than those for y.

We cannot just look at the distance between each measurement and the mean. We need an overall measure of how big the differences are (i.e., just one number like in the case of the mean). Also, we cannot just sum the individual distances because the negative distances cancel out with the positive ones giving zero always (Why?). The average squared distance would be

$$\frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})^2$$
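One way to see why the raw deviations always sum to zero is to use the definition of the mean, which gives $\sum x_i = n\bar{x}$. A short derivation, in the notation of these notes:

```latex
\sum_{i=1}^{n} (x_i - \bar{x})
  = \sum_{i=1}^{n} x_i - n\bar{x}
  = n\bar{x} - n\bar{x}
  = 0
```

Squaring each deviation before summing removes the signs, so the positive and negative deviations can no longer cancel.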

So, the sample variance of the x data is defined to be:

Sample variance:

$$s_x^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2$$
We use n-1 instead of n for technical reasons that will be discussed later (and because Excel does it this way). Think of it as the average squared distance of the observations from the mean.
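The definition can be sketched directly in Python. The function name `sample_variance` is made up here; the two data sets are the ones from the "which data set is more variable?" example.

```python
import statistics


def sample_variance(x):
    """Sum of squared deviations from the mean, divided by n - 1."""
    n = len(x)
    x_bar = sum(x) / n
    return sum((xi - x_bar) ** 2 for xi in x) / (n - 1)


a = [5, 5, 5, 5, 5]     # no spread at all
b = [1, 3, 5, 8, 8]     # same mean, more spread
print(sample_variance(a), sample_variance(b))

# statistics.variance also uses the n - 1 divisor.
print(statistics.variance(b))
```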

Questions 1) What is the smallest value a variance can be? 2) What are the units of the variance?

It is helpful to have a measure of spread which is in the original units. The sample variance is not in the original units. We now introduce a measure of dispersion that solves this problem: the sample standard deviation

The sample standard deviation

It is defined as the square root of the sample variance (easy).

The sample standard deviation:

$$s_x = \sqrt{s_x^2}$$

The units of the standard deviation are the same as those of the original data.

Example 1 (numerical)
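Assume as before:

$y_i - \bar{y}$ = .04, -.02, .02, -.04
$x_i - \bar{x}$ = .02, .01, -.01, -.02

Then

$$s_x^2 = \frac{(.02)^2 + (.01)^2 + (-.01)^2 + (-.02)^2}{3} = \frac{.0010}{3} \approx .000333, \qquad s_x \approx .0183$$

$$s_y^2 = \frac{(.04)^2 + (-.02)^2 + (.02)^2 + (-.04)^2}{3} = \frac{.0040}{3} \approx .001333, \qquad s_y \approx .0365$$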

The sample standard deviation for the y data is bigger than that for the x data. This numerically captures the fact that y has “more variation” about its mean than x.


Example 2 (graphical)
[Character dotplots of canada (top) and japan (bottom) on a common axis running from -0.160 to 0.240]

The standard deviations measure the fact that there is more spread in the Japanese returns.

Variable   N     Mean      StDev
canada     107   0.00907   0.03833
japan      107   0.00234   0.07368

3.3 The Empirical Rule
We now have two numerical summaries for the data

$\bar{x}$: where the data is
$s_x$: how spread out, how variable the data is

The mean is pretty easy to interpret (some sort of “center” of the data). We know that the bigger $s_x$ is, the more variable the data is, but how do we really interpret this number? What is a big $s_x$, what is a small one?

The empirical rule will help us understand $s_x$ and relate the numerical summaries back to our plots.

Empirical Rule

For “mound shaped data”:

Approximately 68% of the data is in the interval

$$(\bar{x} - s_x, \ \bar{x} + s_x) = \bar{x} \pm s_x$$

Approximately 95% of the data is in the interval

$$(\bar{x} - 2 s_x, \ \bar{x} + 2 s_x) = \bar{x} \pm 2 s_x$$
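A rough way to check the rule on a data set is to count the share of observations inside each interval. The sketch below uses simulated mound-shaped data as a stand-in (an assumption for illustration, not the returns data), and `share_within` is a made-up helper name. It also computes the two-standard-deviation interval for the Canadian returns from the summary numbers reported earlier.

```python
import random
import statistics


def share_within(x, k):
    """Fraction of the values in x strictly inside mean +/- k standard deviations."""
    x_bar = statistics.mean(x)
    s = statistics.stdev(x)              # sample standard deviation (n - 1 divisor)
    return sum(x_bar - k * s < xi < x_bar + k * s for xi in x) / len(x)


# Simulated mound-shaped data, used only to illustrate the rule.
rng = random.Random(0)
sim = [rng.gauss(0.0, 1.0) for _ in range(1000)]
print(share_within(sim, 1), share_within(sim, 2))

# Interval x_bar +/- 2 s_x for the Canadian returns (mean .00907, sd .03833).
x_bar, s_x = 0.00907, 0.03833
interval = (x_bar - 2 * s_x, x_bar + 2 * s_x)
print(interval)
```

The shares should come out near 68% and 95% for data like this; for skewed data the rule can be noticeably off.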

We can see this on a histogram of the Canadian returns

$\bar{x} = .00907$, $s_x = .03833$

The empirical rule says that roughly 95% of the observations are between the dashed lines and roughly 68% between the dotted lines. Looks reasonable.

[Histogram for canada: return categories from <=(-.1) to >.1 on the horizontal axis, frequency (0 to 30) on the vertical axis; dashed lines at $\bar{x} - 2 s_x$ and $\bar{x} + 2 s_x$, dotted lines at $\bar{x} - s_x$ and $\bar{x} + s_x$]

Same thing viewed from the perspective of the time series plot.
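[Time series plot of canada: Index (20 to 100) on the horizontal axis, returns (-0.1 to 0.1) on the vertical axis; horizontal lines at $\bar{x}$, $\bar{x} + 2 s_x$, and $\bar{x} - 2 s_x$]

n=107, so 5% outside would be about 5 points. There are 4 points outside, which is pretty close.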

Sample Exam Question
[Histogram of score_diff: score_diff (-20 to 30) on the horizontal axis, Frequency (0 to 14) on the vertical axis]

You can use the empirical rule to “back out” the sample standard deviation from a dot plot or histogram.

(c) The sample mean of score_diff is
(i) -8.32 (ii) 4.79 (iii) 13.29 (iv) 18.71

Of the choices, 4.79 looks closest to the “center” of the data.

(d) The sample standard deviation of score_diff is
(i) 1.21 (ii) 6.84 (iii) 11.03 (iv) 20.59

With a sample s.d. of 11.03, about 68% of the data should be between -6 and 16; 95% of the data should be between -17 and 27.

You should also be able to do this from a time series plot (see previous slide). Don’t worry about this now; I provide multiple sample exams.

A little finance: comparing mutual funds
Let us use the means and standard deviations to compare mutual funds. For 9 different assets we compute the means and standard deviations. Later, we plot the means versus the standard deviations. The assets are:
#C1 - R22 Drefus (growth)
#C2 - R30 Fidelity Trend fund (growth)
#c3 - R55 Keystone Speculative fund (max capital gain)
#c4 - R92 Putnam Income Fund (income)
#c5 - R99 Scudder Income
#c6 - R129 Windsor Fund (growth)
#c7 - equally weighted market
#c8 - value weighted market
#c9 - tbill rate
# sample period monthly returns 1:68 - 12-82

Variable   N     Mean      StDev
drefus     180   0.00677   0.04724
fidel      180   0.00470   0.05659
keystne    180   0.00654   0.08424
Putnminc   180   0.00552   0.03008
scudinc    180   0.00443   0.03597
windsor    180   0.01002   0.04864
eqmrkt     180   0.01082   0.06856
valmrkt    180   0.00681   0.04800
tbill      180   0.00598   0.00252

The speculative fund (keystne) has a higher mean and standard deviation than the income fund (Putnminc). Later we’ll see how to look at this information graphically.

3.4 Percentiles, quartiles, and the IQR
Again, this just applies to numeric variables. The 10th percentile is the number such that 10% of the values are less than it and 90% are bigger. The median is the 50th percentile. Percentiles are also known as quantiles. “95th percentile”, “.95 quantile”, and “95% quantile” all mean the same thing.

For the age variable in the survey data:
Summary measures for selected variables

age
Count             1000.000
5th percentile      25.000
10th percentile     28.000
90th percentile     71.000
95th percentile     75.000

5% of the 1000 “age” values are less than 25. 90% of people in the sample are less than 71 years old. 5% of the people in the sample are over 75 years of age. For now don’t worry about “strictly less than” vs. “less than or equal to”.


Summary measures for selected variables

age
Count                 1000.000
Mean                    48.312
Median                  48.000
Standard deviation      15.718
Variance               247.062
First quartile          35.000
Third quartile          60.000
Interquartile range     25.000

The first, second, and third quartiles are the 25th, 50th, and 75th percentiles. The interquartile range is the difference between the third and first quartile. The interquartile range is used as a measure of spread (IQR is to variance as median is to mean).
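Quartiles and the IQR can be sketched with the standard library. The nine ages below are made up, not the survey data, and different software packages interpolate between data points in slightly different ways, so exact quartile values can differ a little from one tool to another.

```python
import statistics

# Hypothetical ages (not the survey data).
ages = [25, 28, 35, 41, 48, 52, 60, 71, 75]

# n=4 asks for the three cut points that split the data into quarters.
q1, q2, q3 = statistics.quantiles(ages, n=4)
iqr = q3 - q1                           # interquartile range
print(q1, q2, q3, iqr)
```

For the survey data the reported quartiles are 35 and 60, so the IQR is 60 - 35 = 25.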

[Histogram for age: age categories (<=15, 15-20, ..., 85-90, >90) on the horizontal axis, frequency (0 to 120) on the vertical axis, with the first quartile marked]

first quartile = 35 years

We can interpret quantiles graphically on the histogram. 25% of the area of the colored bars is to the left of the first quartile.

The empirical rule is actually a statement about quantiles. What does it say? For a variable with a “mound shaped” histogram… What quantile is two standard deviations below the mean? 2.5% What quantile is one standard deviation above? 84% To see this yourself, draw the picture! We’ll learn later that the empirical rule is based on a very important probability model.
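The two answers can be worked out from the 68% and 95% figures, assuming the mound is symmetric so that the data outside each interval splits evenly between the two tails:

```latex
\begin{align*}
\text{fraction below } \bar{x} - 2 s_x &\approx \frac{1 - 0.95}{2} = 0.025 \\
\text{fraction below } \bar{x} + s_x &\approx 0.68 + \frac{1 - 0.68}{2} = 0.84
\end{align*}
```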


[Figure 3. Indexed Real Wages for Men by Percentile 1967-1997: Year on the horizontal axis, Indexed Real Wage (0.90 to 1.30) on the vertical axis; series for the 10th percentile (o), 50th percentile (+), and 90th percentile]
Source: Murphy, Kevin and Finis Welch, “Wage Differentials in the 1990s: Is the Glass Half-full or Half-empty?”

Aside: We won’t use percentiles much in this class, but above is an interesting time series plot of the 90th (top line), median (middle line), and 10th percentiles of real wages in the U.S. from the late 1960s to late 1990s. This widening income gap is a major concern for economists… or is it?

4. Looking at Two Variables

While it is important to look at variables one at a time, many interesting business problems concern how two (or more) variables are related to each other.


4.1 Categorical variables: the Two-way Table
Let’s look at the relationship between two categorical variables, x and y. If x has two categories and y has two as well, then there are four categories using both x and y. We can then just count the number of observations in each category. If x has r1 categories and y has r2, then we have r1*r2 possibilities. We can arrange these possibilities in a two-way table.

This is the two way table relating viewership of the simpsons with cola use. 146 of the 1000 view simpsons and consume colas. Raw counts:
                 simpsons
colas            0         1         Grand Total
0                387       35        422
1                432       146       578
Grand Total      819       181       1000

Percent of total:
                 simpsons
colas            0         1         Grand Total
0                38.70%    3.50%     42.20%
1                43.20%    14.60%    57.80%
Grand Total      81.90%    18.10%    100.00%

Percent of column:
                 simpsons
colas            0         1         Grand Total
0                47%       19%       42%
1                53%       81%       58%
Grand Total      100%      100%      100%

Percent of row:
                 simpsons
colas            0         1         Grand Total
0                92%       8%        100%
1                75%       25%       100%
Grand Total      82%       18%       100%

How to make these tables
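As a sketch of the arithmetic behind the percentage tables (not a substitute for the spreadsheet instructions), the row, column, and total percentages can be computed from the four raw counts. The dictionary layout and names below are choices made for this illustration.

```python
# counts[(colas, simpsons)] from the raw-count table
counts = {(0, 0): 387, (0, 1): 35, (1, 0): 432, (1, 1): 146}
total = sum(counts.values())

# Marginal totals: rows are colas, columns are simpsons.
row_tot = {c: sum(v for (ci, _), v in counts.items() if ci == c) for c in (0, 1)}
col_tot = {s: sum(v for (_, si), v in counts.items() if si == s) for s in (0, 1)}

pct_total = {k: 100 * v / total for k, v in counts.items()}
pct_row = {k: 100 * v / row_tot[k[0]] for k, v in counts.items()}
pct_col = {k: 100 * v / col_tot[k[1]] for k, v in counts.items()}

# Share of simpsons viewers who consume colas versus share of non-viewers.
print(round(pct_col[(1, 1)]), round(pct_col[(1, 0)]))
```

The column percentages are the ones that answer "what fraction of simpsons viewers consume colas?", because each column is rescaled to add to 100%.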


A picture of the table:
[Bar chart of the two-way table: simpsons (0, 1) on the horizontal axis, count (0 to 900) on the vertical axis, bars split by colas (0, 1)]

A much higher fraction of the simpsons viewers consumes colas.

How does social grade relate to cigarette use? Now one variable has 2 categories and the other has 6 so combined there are 12. Row percentages:
social         cigs=0    cigs=1   Grand Total
1              75.00%    25.00%       100.00%
2              80.13%    19.87%       100.00%
3              74.19%    25.81%       100.00%
4              70.64%    29.36%       100.00%
5              62.18%    37.82%       100.00%
6              64.17%    35.83%       100.00%
Grand Total    71.20%    28.80%       100.00%

[Bar chart of the row percentages by social grade (1 to 6), with series cigs = 0 and cigs = 1.]

The highest cigarette use is in the two lower social grades.


4.2 Numeric variables: Scatter Plots
For two numeric variables we have the scatter plot.
Each row is an observation corresponding to a person. Each person has two numbers associated with him/her, # beers and weight.
nbeer   weight
12.0    192
12.0    160
5.0     155
5.0     120
7.0     150
13.0    175
4.0     100
12.0    165
12.0    165
12.0    150
...     ...

How are they related?

Is the number of beers you can drink related to your weight?

[Scatter plot: nbeer (vertical axis, 0 to 20) against weight (horizontal axis, 100 to 200).]

You can think of a scatterplot as a ‘2D dot plot’. Each point corresponds to an observation: weight determines the position on the horizontal axis, nbeer on the vertical.

Notice our outlier is back (circled)... and is he really an outlier?!

In addition to relating two variables, a scatterplot also gives you all the information you’d get from a dotplot of either variable.
[The same scatter plot of nbeer against weight.]

Same idea for nbeer, though the vertical axis can be a little harder to picture (Hint: rotate the paper)

Sample Exam Question
The sample mean of weight is (i) 105 (ii) 130 (iii) 155 (iv) 180

Imagine the dots on the scatterplot being pulled downward by gravity – you’d get a dotplot of weight! The sample SD of weight is around 28, so roughly 68% of observations lie between 127 and 183 pounds.

Example
Are returns on a mutual fund related to market returns? Each point corresponds to a month.
[Scatter plot: windsor (vertical axis, -0.1 to 0.2) against valmrkt (horizontal axis, -0.1 to 0.2).]

Like the histogram, scatterplots can also be used with time series data, and the resulting plot does not depend on the time ordering.

Example
Here’s another example of an “outlier”. This data is from a poker website that went through a major cheating scandal.
[Scatter plot with WINRATE on one axis.]

A similar scandal surfaced recently. Is the evidence as compelling?

In finance we often use a different type of 2-D plot to compare asset returns. Here each point is a mutual fund. The horizontal and vertical location of each point reflects the sample standard deviation and sample mean of its returns within the same sample period.
[Plot of Mean (vertical axis, 0.004 to 0.011) against StDev (horizontal axis, 0.00 to 0.09) for tbill, valmrkt, eqmrkt, windsor, drefus, keystne, Putnminc, scudinc and fidel.]

If you’re a fund manager, where do you want to be on this plot?

Let us compare some countries (Country returns data), based on monthly returns from ‘88 to ‘96.
[Plot of Mean (vertical axis, 0.00 to 0.02) against StDev (horizontal axis, 0.03 to 0.08) for honkong, singapor, france, belgium, canada, australi, germany, finalnd, italy, japan and usa.]

4.3 Relating a Numeric to a Categorical variable
How do you plot a numeric variable vs a categorical variable? This is not so obvious. An easy thing to do is make the numeric variable categorical by binning it, like we did when making a histogram.


Cigarette usage and age:
age            cigs=0    cigs=1   Grand Total
16-25          50.98%    49.02%       100.00%
26-35          63.64%    36.36%       100.00%
36-45          67.69%    32.31%       100.00%
46-55          64.76%    35.24%       100.00%
56-65          79.76%    20.24%       100.00%
66-75          91.13%     8.87%       100.00%
76-85          88.10%    11.90%       100.00%
86-95         100.00%     0.00%       100.00%
Grand Total    71.20%    28.80%       100.00%

[Bar chart of the row percentages by age bin (16-25 to 86-95), with series cigs = 0 and cigs = 1.]

Quick — what is the relationship between age and cigarette usage? Plots are a great way to identify patterns, but careful… How strong is the evidence?


5. Summarizing Bivariate Relations

Can we numerically summarize the strength of a bivariate relationship? For categorical variables, I don't think there is a generally accepted approach. For two numeric variables, we introduce two summary statistics called covariance and correlation.


5.1 In Tables
There does not seem to be a standard way to summarize the strength of the relationship in a table. Sometimes I use the difference between a “marginal” proportion and a “conditional” proportion.
Percent of column:
colas          simpsons=0   simpsons=1   Grand Total
0                  47.25%       19.34%        42.20%
1                  52.75%       80.66%        57.80%
Grand Total       100.00%      100.00%       100.00%

Percent of total:
colas          simpsons=0   simpsons=1   Grand Total
0                  38.70%        3.50%        42.20%
1                  43.20%       14.60%        57.80%
Grand Total        81.90%       18.10%       100.00%

In this case it would be: |.578 - .8066| =.2286 The difference between the percent of cola drinkers and percent of simpsons viewers that are cola drinkers.
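
As a quick check of that arithmetic, here is a small Python sketch (my own illustration, not something from the original notes). The variable names are made up; the counts are the ones from the simpsons/colas table.

```python
n = 1000                 # respondents in the sample
cola_drinkers = 578      # all cola drinkers
viewers = 181            # all simpsons viewers
viewers_who_drink = 146  # simpsons viewers who also drink cola

marginal = cola_drinkers / n               # share of everyone who drinks cola
conditional = viewers_who_drink / viewers  # share of viewers who drink cola
gap = abs(marginal - conditional)

print(f"marginal = {marginal:.4f}, conditional = {conditional:.4f}, gap = {gap:.4f}")
```

If the two variables were unrelated, the conditional proportion would be close to the marginal one and the gap would be near zero.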


5.2 Covariance and Correlation
In the beer data (beers vs weight) and mutual fund data (windsor vs valmrkt), it looks like there is a relationship. Even more, the relationship looks linear in that it looks like we could draw a line through the plot to capture the pattern. Covariance and correlation summarize how strong a linear relationship there is between two variables. In our first example weight and # beers were two variables. In our second example our two variables were two kinds of returns. In general, we think of the two variables as x and y.

The sample covariance between x and y:

$$s_{xy} = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})$$

The sample correlation between x and y:

$$r_{xy} = \frac{s_{xy}}{s_x s_y}$$

So, the correlation is just the covariance divided by the two standard deviations. What are the units?


We will get some intuition about these formulae, but first let us see them in action. How do they summarize data for us? Let us start with the correlation.

Correlation, the facts of life:

−1 ≤ rxy ≤ 1
The closer r is to 1 the stronger the linear relationship is with a positive slope. When one goes up, the other tends to go up. The closer r is to -1 the stronger the linear relationship is with a negative slope. When one goes up, the other tends to go down.

The correlations corresponding to the two scatter plots we looked at are:
Correlation of valmrkt and windsor = 0.923
Correlation of nbeer and weight = 0.692

The larger correlation between valmrkt and windsor indicates that the linear relationship is stronger. Let us look at some more examples.
[Scatter plots: windsor against valmrkt (left) and nbeer against weight (right).]

[Scatter plot of y1 against x1.] Correlation of y1 and x1 = 0.019

[Scatter plot of y2 against x2.] Correlation of y2 and x2 = 0.995

[Scatter plot of y3 against x3.] Correlation of y3 and x3 = 0.586

[Scatter plot of y4 against x4.] Correlation of y4 and x4 = -0.982

[Scatter plot of y5 against x5; y5 runs from 0 to 9 and x5 from -3 to 3.] Correlation of y5 and x5 = 0.210

IMPORTANT: Correlation only measures linear relationships (here the value is small but there is a strong nonlinear relationship between y5 and x5.)
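
To see this point in numbers, here is a minimal Python sketch using made-up data (these are NOT the y5 and x5 values from the plot, which are not listed in the notes). I take y to be exactly x squared, so y is completely determined by x, yet the sample correlation comes out as zero. The function names are my own.

```python
from statistics import mean, stdev

def sample_cov(x, y):
    """Sample covariance with the n - 1 divisor."""
    x_bar, y_bar = mean(x), mean(y)
    return sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y)) / (len(x) - 1)

def sample_corr(x, y):
    """Sample correlation: covariance divided by both standard deviations."""
    return sample_cov(x, y) / (stdev(x) * stdev(y))

# Hypothetical data: a perfect, but nonlinear, relationship.
x = [-3, -2, -1, 0, 1, 2, 3]
y = [v ** 2 for v in x]

r = sample_corr(x, y)
print(f"correlation of y and x = {r:.3f}")
```

The positive contributions from the right half of the plot are exactly cancelled by the negative contributions from the left half, so a correlation near zero does not mean "no relationship".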


Example: The country data
Which countries go up and down together? I have data on 23 countries. That would be a lot of plots!

[Scatter plot: canada (vertical axis, -0.1 to 0.1) against usa (horizontal axis, -0.1 to 0.1).]

The correlation matrix is a table of all sample correlations between each possible pair of a set of variables.
australi belgium 0.189 canada 0.507 finalnd 0.387 france 0.275 germany 0.226 honkong 0.334 italy 0.159 japan 0.251 usa 0.360 singapor 0.409 usa singapor japan 0.246 0.407 belgium 0.357 0.183 0.734 0.691 0.301 0.367 0.418 0.429 0.355 usa 0.473 canada 0.386 0.342 0.302 0.558 0.334 0.271 0.651 0.478 finalnd france germany honkong italy

Why is this blank?
0.176 0.304 0.355 0.389 0.307 0.264 0.391 0.709 0.359 0.352 0.421 0.501 0.408 0.339 0.465 0.318 0.372 0.467

0.261 0.219 0.429 0.647

0.426 0.240 0.416

Make this table in StatPro

StatPro will also make the covariance matrix, which displays covariances with variances on the diagonal.

Understanding the covariance and correlation formulae
How do these weird looking formulae for covariance and correlation capture the relationship? To get a feeling for this, let us go back to the simple example and compute the covariance and correlation.

x      y
0.07   0.11
0.06   0.05
0.04   0.09
0.03   0.03

First, let us compute the covariance (which is a necessary ingredient to compute the correlation). The sample means are $\bar{x} = .05$ and $\bar{y} = .07$, so:

$$\begin{aligned}
s_{xy} &= \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y}) \\
&= \frac{1}{3}\big((.07-.05)(.11-.07) + (.06-.05)(.05-.07) + (.04-.05)(.09-.07) + (.03-.05)(.03-.07)\big) \\
&= \frac{1}{3}\big(.02 \cdot .04 + .01 \cdot (-.02) + (-.01) \cdot .02 + (-.02) \cdot (-.04)\big) \\
&= \frac{1}{3}(.0008 - .0002 - .0002 + .0008) = \frac{1}{3}(.0012) = .0004
\end{aligned}$$

Each of the 4 points makes a contribution to the sum. Let us see which point does what.

[Scatter plot of the four points, divided into quadrants (I) to (IV) by lines at the sample means of x and y.]

$$(x_1 - \bar{x})(y_1 - \bar{y}) = .02 \cdot .04 = .0008$$
$$(x_2 - \bar{x})(y_2 - \bar{y}) = .01 \cdot (-.02) = -.0002$$
$$(x_3 - \bar{x})(y_3 - \bar{y}) = (-.01) \cdot .02 = -.0002$$
$$(x_4 - \bar{x})(y_4 - \bar{y}) = (-.02) \cdot (-.04) = .0008$$

Points in (I) have both x and y bigger than their means so we get a positive contribution to the covariance. Points in (III) have both x and y less than their means so we get a positive contribution to the covariance. In (II) and (IV) one of x and y is less than its mean and the other is greater so we get a negative contribution. The further out the point is, the bigger the contribution.

[Scatter plot of windsor against valmrkt, divided into quadrants at the sample means. Two opposite quadrants are labeled "Lots of positive contributions"; the other two are labeled "just a few relatively small contributions".]

We saw before that this mutual fund’s returns are positively correlated with the market.

So:
A positive covariance means that when a variable is above its average the other one tends to be above as well. They move up and down together.
A negative covariance means that when one is up the other tends to be down. They move in opposite directions.
A small covariance means that their movements are almost (linearly) unrelated.
Now let’s compute the correlation…

We just finish the example.

$$r_{xy} = \frac{s_{xy}}{s_x s_y} = \frac{.0004}{(.0183)(.0365)} = .6$$

The division by the standard deviations standardizes the covariance so that the correlation is always between -1 and +1.
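
If you want to check the simple example yourself, here is a short Python sketch (not from the original notes, which use a spreadsheet). It applies the covariance and correlation formulas directly to the four (x, y) points; the variable names are my own.

```python
from math import sqrt

x = [0.07, 0.06, 0.04, 0.03]
y = [0.11, 0.05, 0.09, 0.03]
n = len(x)

x_bar = sum(x) / n
y_bar = sum(y) / n

# One contribution per point: positive when x and y are on the same side
# of their means, negative when they are on opposite sides.
contributions = [(xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)]
s_xy = sum(contributions) / (n - 1)

s_x = sqrt(sum((xi - x_bar) ** 2 for xi in x) / (n - 1))
s_y = sqrt(sum((yi - y_bar) ** 2 for yi in y) / (n - 1))
r_xy = s_xy / (s_x * s_y)

for i, c in enumerate(contributions, start=1):
    print(f"point {i}: contribution {c:+.4f}")
print(f"s_xy = {s_xy:.4f}, s_x = {s_x:.4f}, s_y = {s_y:.4f}, r_xy = {r_xy:.2f}")
```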

The sign of the correlation contains the same information as the sign of the covariance (in fact, they have the same sign because the standard deviations are always positive).
Positive sign: positive relationship
Negative sign: negative relationship
The correlation can be more informative, though, because by construction it is unit-less and always between –1 and 1. Hence, it is a more easily interpretable measure of the strength of the relationship.
Close to 1: strong positive relationship
Close to -1: strong negative relationship

6 Linearly Related Variables
We have studied data sets that display some kind of relation between variables (the mutual fund returns and the market returns, for instance). Sometimes there is an exact linear relation between variables:

y = c0 + c1 x

In this linear relationship, c0 is called the intercept and c1 is called the slope. Suppose we had started with x and we already knew its sample mean and variance. Can we figure out the sample mean and variance of the “new” variable, y?

6.1 Linear functions
Example
Suppose we have a sample of temperatures in Celsius and we convert them to Fahrenheit.

cel    fahr
10     50
15     59
20     68
25     77
40     104
30     86
50     122
70     158

How are the cel values related to the fahr values?
fahr = 32 + (9/5) * cel
Note that the sample mean of cel is 32.5 and scel = 20. We could find the sample mean of fahr and sfahr using a spreadsheet.

Note: if we make a scatter plot of fahr versus cel, what do we see ?
[Scatter plot: fahr (vertical axis, 50 to 150) against cel (horizontal axis, 10 to 70). The points lie exactly on a line.]

Correlation of cel and fahr = 1.000

In general, we like to use the symbols y and x for the two variables.

The variable y is a linear function of the variable x if:

$$y = c_0 + c_1 x$$

c0: the intercept
c1: the slope
We think of the c’s as constants (fixed numbers) while x and y vary.

Example Suppose your client is a movie star. She has a deal which pays her a $10 million fee per movie + 10% of the gross ticket revenues. How is our star’s income related to the gross? Let I denote income. Let G denote Gross.

I = 10 + .1G

Note: Don’t forget units! When we write it this way we need to make sure all our numbers are in millions of dollars.

6.2 Mean and variance of a linear function
Suppose y (i.e., each value of the variable y) is a linear function of x. How are the mean and variance (standard deviation) of y related to those of x? Let us look at our temperature example. Suppose we first multiply by (9/5) and then add 32.

mul = (9/5) * cel
fahr = 32 + mul = 32 + (9/5)*cel

Variable   Mean     StDev
cel        32.50    20.00
mul        58.5     36.0
fahr       90.5     36.0

[Dotplots of cel, mul and fahr on a common horizontal axis running from 0 to 150.]

Interpret When we multiply cel by 9/5 we affect (increase) both the mean and the standard deviation proportionally. If we add a constant (32 in our case) we simply increase the mean (by the value of the constant) but leave the overall dispersion unaffected.


[Three histograms on the same horizontal scale:
X: Mean: 1, Stdev: 1
X+2: Mean: 3, Stdev: 1
2X: Mean: 2, Stdev: 2]

Sample mean and variance of a linear function

Suppose

$$y = c_0 + c_1 x$$

Then,

$$\bar{y} = c_0 + c_1 \bar{x}$$
$$s_y^2 = c_1^2 s_x^2$$
$$s_y = |c_1| \, s_x$$

Example
So, instead of using a spreadsheet, we could have used our linear formulas. We knew that

fahr = 32 + (9/5) * cel

(here y = fahr, x = cel, c0 = 32 and c1 = 9/5).

Our handy linear formulas tell us:
mean of fahr = c0 + c1 * (mean of cel) = 32 + (9/5)*32.5 = 90.5
sfahr = |c1| * scel = |9/5| * 20 = 36

Of course, these are the same answers we got before!!
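
Here is a minimal Python sketch of the same comparison, for anyone who would rather check it in code than in a spreadsheet. It is my own illustration rather than part of the original notes: it computes the mean and standard deviation of fahr the "long way" from the converted data and then again from the linear formulas.

```python
from statistics import mean, stdev

cel = [10, 15, 20, 25, 40, 30, 50, 70]
c0, c1 = 32, 9 / 5

# The long way: convert every observation, then summarize.
fahr = [c0 + c1 * c for c in cel]
mean_direct = mean(fahr)
sd_direct = stdev(fahr)

# The short way: apply the linear formulas to the summaries of cel.
mean_formula = c0 + c1 * mean(cel)
sd_formula = abs(c1) * stdev(cel)

print(f"mean of fahr: direct {mean_direct:.2f}, formula {mean_formula:.2f}")
print(f"sd of fahr:   direct {sd_direct:.2f}, formula {sd_formula:.2f}")
```

Note the absolute value on c1: a negative slope flips the data around but cannot produce a negative standard deviation.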

Aside: Why does this work? Look back 3 slides…

Step             Variable            Mean                                  Std. dev.
Start            $x$                 $\bar{x}$                             $s_x$
Multiply by c1   $c_1 x$             $c_1 \bar{x}$                         $|c_1| s_x$
Add c0           $c_0 + c_1 x = y$   $c_0 + c_1 \bar{x} = \bar{y}$         $|c_1| s_x = s_y$

Multiply by c1: the mean changes by a factor of c1 and the std. dev. by a factor of |c1|.
Add c0: the mean increases by c0 and the std. dev. is unchanged.

Aside: Why? (The hard way)

Start from $y_i = c_0 + c_1 x_i$. For the mean:

$$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$$
$$\bar{y} = \frac{1}{n}\sum_{i=1}^{n}(c_0 + c_1 x_i) = \frac{1}{n}\sum_{i=1}^{n} c_0 + \frac{1}{n}\sum_{i=1}^{n} c_1 x_i = c_0 + c_1 \bar{x}$$

For the variance:

$$s_x^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2$$
$$\begin{aligned}
s_y^2 &= \frac{1}{n-1}\sum_{i=1}^{n}\big(c_0 + c_1 x_i - (c_0 + c_1 \bar{x})\big)^2 \\
&= \frac{1}{n-1}\sum_{i=1}^{n}(c_0 + c_1 x_i - c_0 - c_1 \bar{x})^2 \\
&= \frac{1}{n-1}\sum_{i=1}^{n} c_1^2 (x_i - \bar{x})^2 = c_1^2 s_x^2
\end{aligned}$$

NOTE: This is way more math than we will typically need in this course. BUT you should know these formulas are properties of our summary statistics, not just some coincidence. AND they come up again when we do probability!

Example

Suppose our movie star made 10 pictures last year and the sample mean and sample variance of the gross on the films are 100 and 900, respectively. What are the sample mean and variance of the star’s income?

Each Income number is 10 + .1* the corresponding Gross number.

Gross    Income
115.8    21.58
128.9    22.89
109.5    20.95
127.1    22.71
87.2     18.72
111.2    21.12
62.5     16.25
129.4    22.94
87.2     18.72
41.2     14.12

See the file "moviestar1.xls".

Gross    Income
115.8    21.6
128.9    22.9
109.5    21.0
127.1    22.7
87.2     18.7
111.2    21.1
62.5     16.2
129.4    22.9
87.2     18.7
41.2     14.1

The average of the Gross numbers = 100
The sample variance of the Gross numbers = 900
The standard deviation of the Gross numbers = 30
The average of the Income numbers = 20
The sample variance of the Income numbers = 9
The standard deviation of the Income numbers = 3

Remember, I = 10 + .1G (here y = I, x = G, c0 = 10 and c1 = .1). So,

$$\bar{I} = c_0 + c_1 \bar{G} = 10 + .1 \cdot 100 = 20$$
$$s_I^2 = c_1^2 s_G^2 = (.1)^2 \cdot 900 = 9$$
$$s_I = |c_1| \, s_G = .1 \cdot 30 = 3$$

[Scatter plot: Income (vertical axis, 14 to 24) against Gross (horizontal axis, 40 to 140). The points lie exactly on a line.]

Note
With only one x, we can get the sample standard deviation of y either by using $s_y^2 = c_1^2 s_x^2$ and then taking the square root, or by using $s_y = |c_1| s_x$ directly. We get the same answer either way, because sample standard deviation is always the square root of sample variance.

Why are these formulas useful? We could always just type everything into a spreadsheet and use spreadsheet functions to get the answers. Really, though, the reason for these formulas will become apparent when we study probability, statistical inference, and regression. You cannot understand statistics or regression without a solid understanding of linear relationships.
In other words, yes, I recognize these formulas are probably the least fun part of the course (and considering this is basic stats, that’s saying something). But you absolutely must know them.

Example

Suppose x has mean 100 and standard deviation 10. What are the mean, standard deviation and variance of:
(i) y = 2x? (c0=0, c1=2)
(ii) y = 5+x? (c0=5, c1=1)
(iii) y = 5-2x? (c0=5, c1=-2)

Answers:
         Mean    SD    Variance
(i)      200     20    400
(ii)     105     10    100
(iii)    -195    20    400

6.3 Linear combinations
We may want a variable to be related to several others instead of just one. We will assume that Y is a function of X,Z,…rather than just a function of X.

When a variable y is linearly related to several others, we call it a linear combination.

$$y = c_0 + c_1 x_1 + c_2 x_2 + \cdots + c_k x_k$$

We say, “y is a linear combination of the x’s”.
c0 is called the intercept or just “the constant”.
ci is called the coefficient of xi.

Example
Suppose in addition to the flat $10 million fee and 10 percent of ticket revenues, our movie star also gets 5 percent of all sales of the soundtrack (on CD) released with the movie. How is the star’s income related to the film’s gross and CD sales (in millions of dollars)?
Let I, G, C denote income, gross, and CD sales.

I = 10 + .1G + .05C

(Here y = I, c0 = 10, c1 = .1, x1 = G, c2 = .05 and x2 = C.)

Important example: Portfolios Suppose you have $100 to invest. Let x1 be the return on asset 1. If x1 = .1, and you put all your money into asset 1, then you will have $100*(1+.1) = $110 at the end of the period. Let x2 be the return on asset 2. If x2 = .15, and you put all your money into asset 2, then you will have $100*(1+.15) = $115 at the end of the period. Suppose you put ½ of your money into asset 1 the other ½ of your money into asset 2. What will happen?

At the end of the period you will have

.5*(100)*(1+.1) + .5*(100)*(1+.15) = 100*[ 1+(.5*.1)+(.5*.15) ]

The first term is the investment in asset 1 (which grows to $55), the second is the investment in asset 2 (which grows to $57.50), and the bracket on the right contains the return on the portfolio:

55 + 57.50 = $112.50

So the return is (.5*.1) + (.5*.15) = .125
In other words, when we put ½ of our money into asset 1 and the other ½ into asset 2, the return on the resulting portfolio is
Rp = ( ½ )*x1 + ( ½ )*x2
The return on a portfolio is a linear combination of the returns on the individual assets.

It turns out this is true in general. Suppose you have $M to invest in two assets with returns x1 and x2. Let w1 be the fraction of your wealth you choose to invest in asset 1 and w2 the fraction you invest in asset 2. At the end of the period you will have:

$$w_1 M(1 + x_1) + w_2 M(1 + x_2) = M(w_1 + w_2 + w_1 x_1 + w_2 x_2) = M(1 + w_1 x_1 + w_2 x_2)$$

Note: For this to work, we need w1 + w2 = 1.

The portfolio return is:

$$R_p = w_1 x_1 + w_2 x_2$$

The portfolio return is a linear combination of the individual asset returns. The coefficients are the “portfolio weights” (fraction of wealth invested in each asset).

Notice that the portfolio weights always sum up to one. (If I invest 30% of my wealth in asset 1, then I have to invest 70% of my wealth in asset 2). When we’re talking about portfolios, we use “w1, w2, …” instead of “c1, c2, …” to remind us that weights have to sum to one. Our linear formulas work the same way in either case. Most of the time when we do portfolios, we don’t worry about the constant (c0=0). Question for those with some finance experience: Can portfolio weights be negative?

Suppose we have m assets. The return on the ith asset is xi. Put wi fraction of your wealth into asset i. Your portfolio is determined by the portfolio weights wi. Then, the return on the portfolio is:

$$R_p = w_1 x_1 + w_2 x_2 + \cdots + w_m x_m = \sum_{i=1}^{m} w_i x_i$$

Your portfolio return is always a linear combination of individual asset returns, with coefficients equal to the fraction of wealth invested.
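
Here is a minimal Python sketch of this formula (my own illustration, not from the original notes). The function name `portfolio_return` is made up; it checks that the weights sum to one and then forms the weighted sum of asset returns. The example reuses the two assets from a few slides back, with returns .10 and .15 and half of a $100 investment in each.

```python
def portfolio_return(weights, returns):
    """Return on a portfolio: the weighted sum of the individual asset returns."""
    if len(weights) != len(returns):
        raise ValueError("need one weight per asset")
    if abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("portfolio weights must sum to one")
    return sum(w * x for w, x in zip(weights, returns))

rp = portfolio_return([0.5, 0.5], [0.10, 0.15])
ending_wealth = 100 * (1 + rp)
print(f"portfolio return = {rp:.3f}, ending wealth = ${ending_wealth:.2f}")
```

Nothing in the function stops a weight from being negative, which is one way to think about the question on the earlier slide about short positions.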

6.4 Mean and variance of a linear combination
First, we consider the case where we have only two x’s.

2 inputs: Suppose

$$y = c_0 + c_1 x_1 + c_2 x_2$$

Then,

$$\bar{y} = c_0 + c_1 \bar{x}_1 + c_2 \bar{x}_2$$
$$s_y^2 = c_1^2 s_{x_1}^2 + c_2^2 s_{x_2}^2 + 2 c_1 c_2 s_{x_1 x_2}$$

For linear combinations of 2 or more variables, variance also depends on the covariance between the x’s!! More on this later…

Example
For each film she does, our movie star makes $10 million plus 10% of gross ticket revenues and 5% of CD sales. Here is the data for ten movies she made last year, together with her income for each film:

Gross         Cd          Income
115.763100    5.412503    21.8
128.904400    6.539900    23.2
109.524600    5.878809    21.2
127.133700    4.984490    23.0
87.234720     3.544932    18.9
111.248000    5.602628    21.4
62.455030     3.954600    16.4
129.397300    5.387244    23.2
87.171460     5.092816    19.0
41.167710     3.602078    14.3

Remember,

I = 10 + .1G + .05C

So each number in the Income column equals 10 plus .1 times the Gross value plus .05 times the Cd value.

Note: All numbers are in millions of $.

Like before, we could type everything in and get the sample mean and variance of income using a spreadsheet. But let’s suppose, as her agent, we already knew that:

$$\bar{G} = 100, \quad s_G = 30, \quad \bar{C} = 5, \quad s_C = 1, \quad r_{CG} = 0.8$$

Like before, we know that I = 10 + .1G + .05C (so c0 = 10, c1 = .1 and c2 = .05). So:

$$\bar{I} = c_0 + c_1 \bar{G} + c_2 \bar{C} = 10 + .1(100) + .05(5) = 20.25$$
$$s_I^2 = c_1^2 s_G^2 + c_2^2 s_C^2 + 2 c_1 c_2 s_{CG} = (.1)^2(30)^2 + (.05)^2(1)^2 + 2(.1)(.05)(30)(1)(.8) = 9.24$$

In the last term we used sCG = (30)(1)(.8). See next slide…

Reminder: Remember, we defined sample correlation as the covariance divided by the standard deviations

$$r_{xy} = \frac{s_{xy}}{s_x s_y}$$

So, if we know the correlation and both standard deviations, we can get back sample covariance

$$s_{xy} = s_x s_y r_{xy}$$

So, if we know the sample standard deviations and either of correlation or covariance, we can figure out the other. We used this trick to calculate sCG on the previous slide.
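
The movie star calculation can be sketched in a few lines of Python (again, my own illustration rather than something from the original notes). It recovers the covariance from the correlation, applies the two-input formulas, and also shows what happens if the correlation between G and C is set to zero. All variable names are made up.

```python
from math import sqrt

c0, c1, c2 = 10, 0.1, 0.05   # I = 10 + .1G + .05C
mean_g, sd_g = 100, 30       # summaries of Gross
mean_c, sd_c = 5, 1          # summaries of Cd
r_cg = 0.8                   # sample correlation of Cd and Gross

cov_cg = sd_c * sd_g * r_cg  # covariance recovered from the correlation

mean_i = c0 + c1 * mean_g + c2 * mean_c
var_i = c1**2 * sd_g**2 + c2**2 * sd_c**2 + 2 * c1 * c2 * cov_cg
sd_i = sqrt(var_i)

# Same calculation if G and C were uncorrelated (covariance term drops out).
var_i_uncorrelated = c1**2 * sd_g**2 + c2**2 * sd_c**2
sd_i_uncorrelated = sqrt(var_i_uncorrelated)

print(f"mean of income = {mean_i:.2f}")
print(f"variance = {var_i:.4f}, sd = {sd_i:.6f}")
print(f"sd with zero correlation = {sd_i_uncorrelated:.6f}")
```

The variance comes out as 9.2425, which the slide rounds to 9.24.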

In Excel ("moviestar2.xls") I have the same Gross, Cd and Income columns as before, along with these summaries:

mean of Gross: 100     sd of Gross: 30
mean of Cd: 5          sd of Cd: 1
corr(G,C): 0.8

I = 10 + .1G + .05C

Mean of Income: 20.25
10 + .1*(mean of G) + .05*(mean of C): 20.25

Standard Deviation of Income: 3.040148
sqrt(.1^2*(variance of G) + .05^2*(variance of C) + 2*.1*.05*30*1*corr(G,C)): 3.040148
sqrt(.1^2*(variance of G) + .05^2*(variance of C)): 3.000417

The first square root uses $s_I^2 = .1^2 s_G^2 + .05^2 s_C^2 + 2(.1)(.05)(30 \cdot 1 \cdot .8)$; the second drops the covariance term and uses $s_I^2 = .1^2 s_G^2 + .05^2 s_C^2$.

[Scatter plot: Cd (vertical axis, 3.5 to 7.0) against Gross (horizontal axis, 40 to 140).]

How would the answer change if the correlation between G and C were zero?

Example (the country data again)
Let us use our country data and suppose that we had put .5 into USA and .5 into Hong Kong. What would our returns have been?

port = .5*honkong + .5*usa

honkong    usa      port
0.02       0.04     0.030
0.06       -0.03    0.015
0.02       0.01     0.015
-0.03      0.01     -0.010
0.08       0.05     0.065
........

For each month, we get the portfolio return as ½*honkong + ½*usa.

port = .5*honkong + .5*usa, so w1 (= c1) = .5 and w2 (= c2) = .5.

The sample means are:
mean of honkong = 0.02103
mean of usa = 0.01346

The sample mean of our portfolio returns is:
mean of port = w1*(mean of honkong) + w2*(mean of usa) = .5*.02103 + .5*.01346 = .01724

Let us do the same exercise for the variance:
Diagonals are variances, off-diagonals are covariances; StatPro will make this table for you automatically!

Covariances
           honkong       usa           port
honkong    0.00521497
usa        0.00103037    0.00110774
port       0.00312267    0.00106906    0.00209586

port = .5*honkong + .5*usa

As before, we apply the formula:

$$s_{port}^2 = w_1^2 s_{honkong}^2 + w_2^2 s_{usa}^2 + 2 w_1 w_2 s_{honkong,usa} = (.5)^2(.00521) + (.5)^2(.00111) + 2(.5)(.5)(.00103) = .0021$$

(Note that $s_{port} = (.0021)^{1/2} \approx .046$.)

What if we had put 25% into USA and 75% into Hong Kong?

Covariances
           honkong       usa           port2
honkong    0.00521497
usa        0.00103037    0.00110774
port2      0.00416882    0.00104972    0.00338905

port2 = .75*honkong + .25*usa

To get the variance of port2, just use the SAME formula from the previous slide, except now with w1=.75 and w2=.25:
(.75)^2*(.00521) + (.25)^2*(.00111) + 2*(.25)*(.75)*(.00103) = .00339
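
Here is a small Python sketch (mine, not from the original notes, which use StatPro) that applies the two-asset variance formula to both sets of weights, using the unrounded variances and covariance from the tables. The function name `two_asset_variance` is made up.

```python
from math import sqrt

def two_asset_variance(w1, w2, var1, var2, cov12):
    """Variance of w1*x1 + w2*x2 from the variances and covariance of x1, x2."""
    return w1**2 * var1 + w2**2 * var2 + 2 * w1 * w2 * cov12

var_honkong = 0.00521497
var_usa = 0.00110774
cov_honkong_usa = 0.00103037

var_port = two_asset_variance(0.5, 0.5, var_honkong, var_usa, cov_honkong_usa)
var_port2 = two_asset_variance(0.75, 0.25, var_honkong, var_usa, cov_honkong_usa)

print(f"port:  variance = {var_port:.8f}, sd = {sqrt(var_port):.4f}")
print(f"port2: variance = {var_port2:.8f}, sd = {sqrt(var_port2):.4f}")
```

Both results agree with the diagonal entries for port and port2 in the covariance tables, up to rounding.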

How do the returns on the w1=w2=.5 portfolio compare with those of Hong Kong and USA?
It looks like the mean for my portfolio (mean of port = .0172) is right in between the means of USA and Hong Kong. What about the standard deviation?

[Plot of Mean (vertical axis, 0.013 to 0.021) against StDev (horizontal axis, 0.03 to 0.07) for usa, port and honkong, with sport = .046 marked.]

The sample standard deviation is less than halfway between susa and shonkong … what happened?

Why is covariance important?

We just used the formula from this slide:

$$s_{x_1 x_2} = s_{x_1} s_{x_2} r_{x_1 x_2}$$

Often useful to rewrite the variance formula as

$$s_y^2 = c_1^2 s_{x_1}^2 + c_2^2 s_{x_2}^2 + 2 c_1 c_2 s_{x_1} s_{x_2} r_{x_1 x_2}$$

Remember, correlations are between -1 and 1! IF x1 and x2 are perfectly correlated (r=1), then

$$s_y^2 = c_1^2 s_{x_1}^2 + c_2^2 s_{x_2}^2 + 2 c_1 c_2 s_{x_1} s_{x_2} = (c_1 s_{x_1} + c_2 s_{x_2})^2$$

So in this case,

$$s_y = c_1 s_{x_1} + c_2 s_{x_2}$$

BUT in general, when c1 and c2 are positive,

$$s_y < c_1 s_{x_1} + c_2 s_{x_2}$$

The basic idea here is: when we take averages, variance gets smaller. The smaller the correlation, the faster this happens. This is actually one of the most important ideas in statistics – we’ll see it again!! It is also one of the most important ideas in finance, because it leads to diversification.
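
To make "the smaller the correlation, the faster this happens" concrete, here is a minimal Python sketch with made-up inputs (two variables that each have standard deviation 1; these numbers are an assumption for illustration, not data from the notes). It computes the standard deviation of the equal-weighted average .5*x1 + .5*x2 for several correlations. The function name is my own.

```python
from math import sqrt

def equal_weight_sd(s1, s2, r):
    """Std. dev. of .5*x1 + .5*x2 given the two std. devs. and their correlation."""
    variance = 0.25 * s1**2 + 0.25 * s2**2 + 2 * 0.25 * s1 * s2 * r
    return sqrt(max(variance, 0.0))  # guard against tiny negative rounding error

# Hypothetical inputs: both x's have standard deviation 1.
correlations = [1.0, 0.5, 0.0, -0.5, -1.0]
sds = [equal_weight_sd(1.0, 1.0, r) for r in correlations]

for r, sd in zip(correlations, sds):
    print(f"r = {r:+.1f}: sd of the average = {sd:.4f}")
```

Only at r = 1 does the average keep the full standard deviation of 1; at r = -1 the two variables cancel exactly and the standard deviation of the average is 0.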

Example (Optional)
y = .5x1 + .5x2
The variances and covariance are:

      x1           x2
x1    1.334636
x2    -1.208679    1.106238

[Scatter plot of x2 against x1. At each point we plot the value of y. The dashed lines are drawn at the mean of x1 and x2.]

Then, the variance of y is

0.0058105 = .5*.5*1.3346 + .5*.5*1.106 + 2*.5*.5*(-1.208679)

Why is the variance of y so much smaller than those of the x’s ?

Example (Optional)
y = .5x1 + .5x2
The variances and covariance are:

      x1          x2
x1    1.158167
x2    1.046490    0.9609463

[Scatter plot of x2 against x1. At each point we plot the value of y. The dashed lines are drawn at the mean of x1 and x2.]

Then, the variance of y is

1.053 = .5*.5*1.158 + .5*.5*.961 + 2*.5*.5*1.0465

Why is the variance of y similar to those of the x’s ?

Example (Optional)
y = .5x1 + .5x2
The variances and covariance are:

      x1           x2
x1    1.3870537
x2    0.1976187    0.8247886

[Scatter plot of x2 against x1. At each point we plot the value of y. The dashed lines are drawn at the mean of x1 and x2.]

Then, the variance of y is

0.65175 = .5*.5*1.387 + .5*.5*.8248 + 2*.5*.5*.1976

Why is the variance of y less than those of x1 and x2 ?

3 inputs: Suppose

$$y = c_0 + c_1 x_1 + c_2 x_2 + c_3 x_3$$

Then,

$$\bar{y} = c_0 + c_1 \bar{x}_1 + c_2 \bar{x}_2 + c_3 \bar{x}_3$$

The formula for the sample mean is basically the same, just one more term because there’s one more x.

$$s_y^2 = c_1^2 s_{x_1}^2 + c_2^2 s_{x_2}^2 + c_3^2 s_{x_3}^2 + 2\left(c_1 c_2 s_{x_1 x_2} + c_1 c_3 s_{x_1 x_3} + c_2 c_3 s_{x_2 x_3}\right)$$

Note that there are now THREE “covariance terms”, one for each PAIR of x’s.

Example: Portfolio with 3 inputs
port = .1*fidel + .4*eqmrkt + .5*windsor

Covariances
           port          fidel         eqmrkt        windsor
port       0.00306760
fidel      0.00280224    0.00320210
eqmrkt     0.00369384    0.00319150    0.00470021
windsor    0.00261967    0.00241087    0.00298922    0.00236580

$$s_{port}^2 = w_1^2 s_{fidel}^2 + w_2^2 s_{eqmrkt}^2 + w_3^2 s_{windsor}^2 + 2 w_1 w_2 s_{fidel,eqmrkt} + 2 w_1 w_3 s_{fidel,windsor} + 2 w_2 w_3 s_{eqmrkt,windsor}$$

.0030676 = (.1)*(.1)*.00320 + (.4)*(.4)*.00470 + (.5)*(.5)*.00236 + 2*[ (.1)*(.4)*.00319 + (.1)*(.5)*.00241 + (.4)*(.5)*.00299 ]
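
With three or more inputs the bookkeeping gets tedious, so here is a minimal Python sketch of one way to organize it (my own illustration, not from the original notes). It stores the full symmetric covariance matrix and sums c_i * c_j * cov[i][j] over every ordered pair (i, j). Each covariance then shows up twice, once as (i, j) and once as (j, i), which is where the factor of 2 in the formula comes from. The function name is made up.

```python
def linear_combination_variance(coefs, cov):
    """Variance of sum_i coefs[i] * x_i, given the covariance matrix of the x's.

    cov must be the full symmetric matrix: variances on the diagonal and
    covariances off the diagonal.
    """
    k = len(coefs)
    return sum(coefs[i] * coefs[j] * cov[i][j] for i in range(k) for j in range(k))

# Order of assets: fidel, eqmrkt, windsor (values from the covariance table).
cov = [
    [0.00320210, 0.00319150, 0.00241087],
    [0.00319150, 0.00470021, 0.00298922],
    [0.00241087, 0.00298922, 0.00236580],
]
weights = [0.1, 0.4, 0.5]

var_port = linear_combination_variance(weights, cov)
print(f"variance of port = {var_port:.8f}")
```

The same function handles any number of inputs, so it also covers the two-asset portfolios from earlier.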

Let us try a portfolio with three stocks. Let us go short on Canada (i.e., we borrow Canada to invest in the other stocks):
port = -.5*canada + usa + .5*honkong

[Plot of Mean (vertical axis, 0.010 to 0.020) against StDev (horizontal axis, 0.03 to 0.07) for canada, usa, honkong and port.]

Clearly, forming portfolios is an interesting thing to do!

Aside: We can show (using our linear formulas) that all portfolios that can be formed with a given set of assets lie on a hyperbola in mean-s.d. space. Your investments class will call this the “portfolio possibilities curve” or just the “efficient frontier”.

Aside: Why would we form portfolios? Maybe the portfolio has a nice mean and variance (i.e. nice “average return” and nice “risk”) Because portfolio returns are linear combinations of returns on individual assets, we can apply our linear formulas to find the average return and risk of any possible portfolio as long as we know the means and variances of the individual asset returns. These formulae are fundamental tools for those who really understand finance. And remember our “when we take averages, variance gets smaller” idea? In finance, that’s known as diversification…


Example (Optional) Cut from a Finance Textbook:


K inputs (Optional): Suppose

$$y = c_0 + c_1 x_1 + c_2 x_2 + c_3 x_3 + \cdots + c_k x_k$$

then,

$$\bar{y} = c_0 + c_1 \bar{x}_1 + c_2 \bar{x}_2 + c_3 \bar{x}_3 + \cdots + c_k \bar{x}_k$$

$$s_y^2 = c_1^2 s_{x_1}^2 + c_2^2 s_{x_2}^2 + \cdots + c_k^2 s_{x_k}^2 + 2 \sum_{i<j} c_i c_j s_{x_i x_j}$$

The last term is 2 times the sum of all the different covariance terms, each multiplied by the product of its two c's.

I won’t ask you to do calculations by hand for more than 3 inputs, this is just to give you an idea of what the formulas look like.

7. Linear Regression
This is data on 128 homes (Housing data): x = size (square feet), y = price (dollars).
[Scatter plot: Price (vertical axis, 50000 to 225000) against SqFt (horizontal axis, 1400 to 2600).]

Clearly, the data are correlated:

Table of correlations
         SqFt     Price
SqFt     1.000
Price    0.553    1.000

But what is the equation of the line you would draw through the data? Linear regression fits a line to the plot.

When I "run a regression" I get values for the intercept and the slope.
y = (intercept) + (slope) * x

Regression coefficients
            Coefficient
Constant    -10091.1299    (intercept)
SqFt        70.2263        (slope)

Here is the scatter plot with the line drawn through it. Looks reasonable!


It turns out the formulas for the slope and the intercept are

$$\text{slope} = \frac{s_{xy}}{s_x^2}$$
$$\text{intercept} = \bar{y} - \text{slope} \cdot \bar{x}$$

We’ll see these later when we study regression. But it isn’t that hard to see what they do! The slope formula takes covariance and “standardizes” it so that its units are (units of y)/(units of x). The intercept formula makes our line pass through the point $(\bar{x}, \bar{y})$.
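
Here is a minimal Python sketch of these two formulas (my own illustration, not from the original notes). Since the housing data are not listed in the notes, I reuse the four (x, y) points from the simple covariance example instead; the function name `fit_line` is made up.

```python
from statistics import mean, variance

def fit_line(x, y):
    """Intercept and slope of the fitted line, from sample covariance and variance."""
    x_bar, y_bar = mean(x), mean(y)
    s_xy = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y)) / (len(x) - 1)
    slope = s_xy / variance(x)          # covariance divided by the variance of x
    intercept = y_bar - slope * x_bar   # forces the line through (x_bar, y_bar)
    return intercept, slope

# The four points from the simple covariance example (not the housing data).
x = [0.07, 0.06, 0.04, 0.03]
y = [0.11, 0.05, 0.09, 0.03]

intercept, slope = fit_line(x, y)
print(f"intercept = {intercept:.4f}, slope = {slope:.4f}")
print(f"fitted value at the mean of x: {intercept + slope * mean(x):.4f}")
```

For these four points the covariance is .0004 and the variance of x is about .000333, so the slope works out to 1.2 and the line passes through (.05, .07).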


Regression and Prediction
You have a house on the market with size = 2200 sqft. Can we predict at what price the house will sell?
[Histogram of Price (in $1,000s): sample mean of Price = $130.4k, sample standard deviation sPrice = $26.9k.]

We might use the sample mean or median as our prediction. But this doesn’t take size into account.


Regression and Prediction
Now let’s use linear regression to predict the price of our house (size = 2200 sqft). Just plug the size into the equation of the line we fitted to our data:
predicted price = -10091.1299 + 70.2263*2200 = $144,407
Regression allows us to use other information (here size) to predict!
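
The plug-in step is simple enough to do by hand, but here it is as a tiny Python sketch (my own, not from the original notes) so you can try other sizes. The coefficients are the ones reported in the regression output; the function name `predict_price` is made up.

```python
INTERCEPT = -10091.1299  # "Constant" from the regression output
SLOPE = 70.2263          # "SqFt" coefficient: dollars per square foot

def predict_price(sqft):
    """Predicted price in dollars for a house of the given size."""
    return INTERCEPT + SLOPE * sqft

print(f"predicted price for 2200 sqft: ${predict_price(2200):,.0f}")
print(f"extra price per additional 100 sqft: ${predict_price(2300) - predict_price(2200):,.0f}")
```

Keep in mind the homes in the plot run from roughly 1400 to 2600 square feet, so predictions for sizes far outside that range lean on the line much harder than the data can support.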


Summary of Regression
Because they are using other information, the predictions we make are (hopefully!) better in some sense. One of the homework problems asks you to explore this.

Most importantly, though, regression is based on the same concepts (sample means, standard deviations, and covariance) that we’ve studied in these notes. It’s simply a new way to display (and use!) this information.

There’s nothing magical or mysterious about linear regression! If you understand the basics well, regression is both intuitive and incredibly useful.

Limitations of Regression
One thing to notice about regression is that it is not “symmetric”. As we’ve seen, the sample correlation (or covariance) between x and y is the same as between y and x. In regression, it matters which variable is on the left hand side of the “ = ” (the ‘dependent variable’). A regression with y = Size and x = Price gives a different answer.

Remember: Correlation is not causation! Just because we regress y on x doesn’t mean changes in x ‘cause’ changes in y.
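A small sketch can make the asymmetry visible. The helper names cov and slope and the five data points are made up for illustration. The covariance is identical either way around, but the slope from regressing y on x is not simply the reciprocal of the slope from regressing x on y; in fact the two slopes multiply to the squared correlation.

```python
from statistics import mean

def cov(a, b):
    """Sample covariance of a and b (divides by n - 1)."""
    a_bar, b_bar = mean(a), mean(b)
    return sum((ai - a_bar) * (bi - b_bar) for ai, bi in zip(a, b)) / (len(a) - 1)

def slope(x, y):
    """Slope from regressing y on x: s_xy / s_x^2."""
    return cov(x, y) / cov(x, x)

# Hypothetical data, for illustration only.
x = [1, 2, 3, 4, 5]
y = [2, 1, 4, 3, 6]

slope_y_on_x = slope(x, y)  # y on the left hand side
slope_x_on_y = slope(y, x)  # x on the left hand side
r = cov(x, y) / (cov(x, x) ** 0.5 * cov(y, y) ** 0.5)

print("cov(x, y) =", cov(x, y), " cov(y, x) =", cov(y, x))
print("slope of y on x:", slope_y_on_x)
print("slope of x on y:", slope_x_on_y)
print("product of slopes:", slope_y_on_x * slope_x_on_y, " r squared:", r ** 2)
```

Unless the points lie exactly on a line (r = 1 or r = -1), the two fitted lines differ.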

8. Pivot Tables (Optional)
Up till now, we have tried to look at pairs of variables. Of course, it would be interesting to look at more than two at a time. The Pivot Table utility in Excel uses tables to do this. But the tables can be "more than two way" and you can put a summary for another variable in each cell. The simple two way tables we looked at earlier were also created using pivot tables.

In each cell is printed the average of the cigs dummy. This gives the percentage of smokers. The cells are determined by a binned version of age and sex.
What do you think is going on here?

Average of cigs    age
sex            16-25   26-35   36-45   46-55   >56     Grand Total
1              0.42    0.42    0.37    0.35    0.16    0.28
2              0.53    0.33    0.28    0.39    0.23    0.29
Grand Total    0.49    0.36    0.32    0.37    0.19    0.29

In the age group 16-25, 53% of female respondents are smokers. This table attempts to look at 3 variables at the same time!!
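Outside of Excel, the same idea is just "group the rows into cells, then average a third variable within each cell". Here is a minimal sketch in plain Python. The function pivot_mean and the six respondents are made up for illustration and are not the survey data behind the table above.

```python
from collections import defaultdict

def pivot_mean(rows, row_key, col_key, value_key):
    """Average of value_key within each (row_key, col_key) cell."""
    cells = defaultdict(list)
    for r in rows:
        cells[(r[row_key], r[col_key])].append(r[value_key])
    return {cell: sum(vals) / len(vals) for cell, vals in cells.items()}

# Hypothetical respondents: cigs is a dummy (1 = smoker, 0 = non-smoker).
respondents = [
    {"sex": 1, "age_group": "16-25", "cigs": 1},
    {"sex": 1, "age_group": "16-25", "cigs": 0},
    {"sex": 2, "age_group": "16-25", "cigs": 1},
    {"sex": 2, "age_group": "26-35", "cigs": 0},
    {"sex": 2, "age_group": "26-35", "cigs": 0},
    {"sex": 1, "age_group": "26-35", "cigs": 1},
]

table = pivot_mean(respondents, "sex", "age_group", "cigs")
for (sex, age_group), avg in sorted(table.items()):
    print(sex, age_group, avg)
```

Because cigs is a 0/1 dummy, each cell average is the fraction of smokers in that cell.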

Here is the pivot chart.

[Figure: pivot chart of average cigs by age group (16-25 through >56) and sex (1, 2)]


The Hockey Data
We have data on every penalty called in the NHL from 95-96 to 2001-2002. Data below is a subsample of size 5000.

oppcall = 1 if penalty switches, that is, if A is playing B and the last penalty was on B, then oppcall = 1 if this penalty is on A.

oppcall   timespan   laghome   goaldiff   inrow2   laghomeT   inrowT
0         14.75      0         -1         0        v          one
0          6.90      1          2         0        h          one
1          8.45      1          2         1        h          two
0         11.75      0          0         0        v          one
1          6.30      1          1         0        h          one
1          3.33      1         -1         1        h          two
1          5.93      0         -1         1        v          two
...

Each row corresponds to a penalty. (Can't have first penalty in game.)
timespan = time between penalties (mins)
laghome = 1 if last penalty was on the home team
goaldiff = lead of last penalized team
inrow2 = 1 if last two penalties were on the same team
laghomeT: h if laghome = 1
inrowT: two if inrow2 = 1
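As a sketch of how the text labels relate to the dummies, the snippet below recodes laghome and inrow2 into laghomeT and inrowT and then averages oppcall within each cell. It uses only the seven sample rows shown above, so the averages are purely illustrative. The helper name recode is made up, and using "v" and "one" for the zero values is read off the sample rows rather than stated in the variable definitions.

```python
from collections import defaultdict

# (oppcall, laghome, inrow2) for the seven sample rows shown above.
sample = [
    (0, 0, 0),
    (0, 1, 0),
    (1, 1, 1),
    (0, 0, 0),
    (1, 1, 0),
    (1, 1, 1),
    (1, 0, 1),
]

def recode(laghome, inrow2):
    """Turn the two dummies into the text labels used in the pivot table."""
    lag_t = "h" if laghome == 1 else "v"        # "v" for 0 is inferred from the sample
    inrow_t = "two" if inrow2 == 1 else "one"   # "one" for 0 is inferred from the sample
    return lag_t, inrow_t

cells = defaultdict(list)
for oppcall, laghome, inrow2 in sample:
    lag_t, inrow_t = recode(laghome, inrow2)
    cells[(inrow_t, lag_t)].append(oppcall)

# Average of the oppcall dummy = fraction of penalties that switched teams.
avg_oppcall = {cell: sum(v) / len(v) for cell, v in cells.items()}
for cell, avg in sorted(avg_oppcall.items()):
    print(cell, avg)
```

With the full subsample of 5000 penalties, adding a binned goaldiff as a third key would give the four-variable table.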


The table attempts to look at 4 variables at one time!!!!

Average of oppcall        goaldiff
inrowT      laghomeT      <-2     -2--2   -1--1   0-0     1-2     >2      Grand Total
one         h             0.66    0.64    0.64    0.60    0.52    0.46    0.58
one         v             0.61    0.62    0.59    0.55    0.49    0.40    0.54
one Total                 0.63    0.63    0.61    0.57    0.50    0.44    0.56
two         h             0.71    0.77    0.75    0.70    0.60    0.51    0.67
two         v             0.72    0.74    0.61    0.65    0.55    0.48    0.62
two Total                 0.72    0.75    0.67    0.67    0.58    0.50    0.64
Grand Total               0.66    0.69    0.63    0.61    0.53    0.46    0.59


Reading the same table (the callouts from the annotated slide):

Highlighted cell: the home team is behind and you just called two in a row on them.
If the last penalty was on the home team, the next call is more likely to switch.
If you just called two in a row on the same team, the next call is more likely to switch.
If the last penalized team was ahead, the next call is less likely to switch.

[Figure: pivot chart of average oppcall by inrowT (one, two), laghomeT (h, v), and goaldiff (<-2 through >2)]
