CHAPTER 1
BIVARIATE DATA
• 1.01 Variables & Summary Data
• 1.02 Two Way Frequency Table
• 1.03 Relationships / Associations
• 1.04 Displaying Relationship: Scatter Plot
• 1.05 Analysing Relationships: Correlation Coefficient
• 1.06 Fitting Least Squares Regression Lines
• 1.07 Reliability of Regression Model
Objective:
What is Bivariate?
Bivariate means “two variables”; in other words, two values are recorded for each individual or item.
So with bivariate data, we are interested in comparing the two sets of data and identifying any relationship between them.
1.01 VARIABLES & SUMMARY DATA
Variables are characteristics that we can measure, observe or count.
• Height
• Waistline
• Weight
• Exam Scores
• Pollution level
• Distance from Earth
• Gaming hours
• Velocity
• Number of people
• Amount of fertilizer
• Population
• Sales
TYPES OF DATA
• There are 2 types of data: categorical and numerical.
• Categorical data place each observation into a category, e.g. male or female.
• Numerical data are counted or measured, e.g. number of people or height.
Calculating Measures of Centre
1) Mean – the arithmetic average of the data values. The mean is not an appropriate measure of centre if there are outliers in the data set.
2) Median – the middle score when the data values are ordered from smallest to largest. The median is largely unaffected by outliers.
3) Mode – the most common data value, or the data value with the highest frequency.
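The three measures of centre can be checked with a short Python sketch. The exam scores below are hypothetical (they are not from these notes), and 98 is included deliberately as an outlier so that its effect on the mean and the median can be compared.

```python
from statistics import mean, median, mode

# Hypothetical exam scores; 98 is a deliberate outlier.
scores = [52, 55, 55, 58, 60, 61, 98]
without_outlier = scores[:-1]  # the same scores with 98 removed

print("mean:  ", mean(scores), "vs", mean(without_outlier))
print("median:", median(scores), "vs", median(without_outlier))
print("mode:  ", mode(scores))

# How far each measure moves when the outlier is included.
mean_shift = mean(scores) - mean(without_outlier)
median_shift = median(scores) - median(without_outlier)
```

In this example the outlier moves the mean by several marks but the median by only 1.5, which is why the median is preferred when outliers are present.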
Calculating Measures of Spread
A measure of spread indicates how the data is spread out.
1) Range = highest value – lowest value. The range depends only on the two most extreme values and therefore is not an appropriate measure of spread when outliers are present.
2) Interquartile Range: $IQR = Q_3 - Q_1$, where $Q_1$, the lower quartile, is the median of the lower half of the data and $Q_3$, the upper quartile, is the median of the upper half of the data. The $IQR$ is largely unaffected by outliers.
3) Standard deviation – is calculated using a formula. The formula is seldom applied by hand; using a CAS calculator is the most efficient method.
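As a rough check on calculator output, the measures of spread can be sketched in Python. The data set is hypothetical, the helper `quartiles` is an illustrative name, and it assumes the "median of each half" convention described above (leaving out the middle value when the number of values is odd); a CAS calculator or other software may use a slightly different quartile convention. These notes do not say whether the sample or population standard deviation is expected, so both are computed.

```python
from statistics import median, pstdev, stdev

def quartiles(values):
    """Return (Q1, Q3) as the medians of the lower and upper halves.

    Assumes at least two values. When the count is odd, the middle
    value is left out of both halves.
    """
    ordered = sorted(values)
    half = len(ordered) // 2
    lower = ordered[:half]
    upper = ordered[-half:]
    return median(lower), median(upper)

# Hypothetical data; 40 is a deliberate outlier.
data = [3, 5, 7, 8, 9, 11, 13, 15, 40]

data_range = max(data) - min(data)
q1, q3 = quartiles(data)
iqr = q3 - q1

print("range:", data_range)
print("Q1, Q3, IQR:", q1, q3, iqr)
print("sample standard deviation:", stdev(data))
print("population standard deviation:", pstdev(data))
```

Here the single outlier stretches the range to 37, while the IQR stays at 8.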
USING CAS CALCULATOR
1.02 Two Way Frequency Table
● Data can be represented in a 2-way frequency table.
● E.g. suppose that the numbers of male and female Members of Parliament (MPs) from two parties are tabulated.
● The table below shows the totals in the form of a 2-way frequency table.
• From the table, we can get their proportions based on the total number, or
even their percentages.
• We can then make observations:

1) 51% of all of the MPs from the two parties are males from Party A
2) 76% of all the MPs from the two parties are male.
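The way such percentages are obtained can be sketched in Python. The counts below are hypothetical: they are not the figures from the original table, and were invented only so that they reproduce the two percentages quoted in the observations.

```python
# Hypothetical 2-way frequency table (NOT the original figures).
counts = {
    "Male":   {"Party A": 51, "Party B": 25},
    "Female": {"Party A": 14, "Party B": 10},
}

# Grand total of all MPs in the table.
total = sum(sum(row.values()) for row in counts.values())

# Percentage of all MPs who are males from Party A.
male_party_a_pct = 100 * counts["Male"]["Party A"] / total

# Percentage of all MPs who are male (either party).
male_pct = 100 * sum(counts["Male"].values()) / total

print(f"{male_party_a_pct:.0f}% of all the MPs are males from Party A")
print(f"{male_pct:.0f}% of all the MPs are male")
```

Each percentage is a cell count or a row total divided by the grand total, then multiplied by 100.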
Suppose that a survey asked a number of people their age and which main course they most preferred at a restaurant, out of beef, chicken, fish or other, and gave rise to the two-way table shown below.
(a) How many people responded to the survey?

(b) How many of the respondents chose beef as their most preferred main
course at a restaurant?
(c) What percentage of those choosing beef as their most preferred course at
a restaurant were aged 31 to 43?
Display – Stacked 100% column Graph:
We can display the percentages of the previous example as two proportionally divided columns (also known as a stacked 100% column graph).
☺Row Percentages
Party A Party B Total
Male
Female
Display – Stacked 100% column Graph:
☺Column Percentages
Party A Party B
Male
Female
Total

• Suppose we have the following data and table of percentages.
• How would we draw the graph to correctly display the data?
The proportional column graph shows that, whether male or female, the proportions for the modes of transport are identical. Thus, NO ASSOCIATION.
• If changing the category of one variable heavily influences the other variable, then the two variables are ASSOCIATED.
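The comparison behind a stacked 100% column graph can be sketched in Python. Both tables below are hypothetical, and `group_percentages` is an illustrative helper name: it converts each group's counts into percentages of that group's own total, which are the heights of the segments in that group's column.

```python
def group_percentages(table):
    """Convert each group's counts into percentages of that group's total."""
    result = {}
    for group, row in table.items():
        group_total = sum(row.values())
        result[group] = {
            category: round(100 * count / group_total, 1)
            for category, count in row.items()
        }
    return result

# Hypothetical counts: identical proportions for both genders.
no_association = {
    "Male":   {"Car": 30, "Bus": 18, "Walk": 12},
    "Female": {"Car": 20, "Bus": 12, "Walk": 8},
}

# Hypothetical counts: proportions change with gender.
association = {
    "Male":   {"Car": 42, "Bus": 12, "Walk": 6},
    "Female": {"Car": 12, "Bus": 20, "Walk": 8},
}

pct_a = group_percentages(no_association)
pct_b = group_percentages(association)

print(pct_a)  # same percentages in both columns: no association
print(pct_b)  # different percentages: the variables are associated
```

Note that the raw counts differ between males and females in both tables; it is the percentages within each group that reveal whether an association exists.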
(c)
Comment on association:
Describing association:
1.03 RELATIONSHIPS / ASSOCIATIONS
• What are relationships?

• What do relationships have?
• How would you describe a relationship?
• What can have relationships?
Definition: The way in which two or more concepts, objects, or people are connected, or the state of being connected.
• When comparing variables for relationships, we need to know:
– Explanatory / independent variable (EV),
– Response / dependent variable (RV).
• Tip: The response variable depends on the explanatory variable.
• Identifying the explanatory and response variables is important. It will affect how
you would represent data and conduct your analysis.
RV depends on the EV
For the following sets of variables, state which is the explanatory variable and
which is the response variable.
1. Temperature and number of ice creams sold
2. Exam scores and time spent studying
3. Time travelling and distance travelled
4. Working hours and wage
5. Caffeine consumption and heart rate
6. Time spent dating and Couple happiness level
1.04 Displaying Relationship: Scatter Plot
1. FORM
2. DIRECTION
3. STRENGTH
1.05 Analysing Relationships: Correlation Coefficient
• To quantify the degree of correlation between 2 variables, we calculate Pearson’s correlation coefficient, denoted by the letter $r$.
• The calculation produces a value between −1 and 1. The sign gives the direction of the relationship and the size gives its strength.
• Note: This correlation coefficient is used only to analyse the relationship between 2 variables that show a linear relationship. Thus, it is used only in ‘linear regression models’.
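A minimal Python sketch of how $r$ can be computed from raw data is given here. The temperature and ice cream figures are hypothetical, `pearson_r` is an illustrative function name, and in practice a CAS calculator would normally be used instead.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson's correlation coefficient for paired lists xs and ys."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of products and squares of the deviations from the means.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

# Hypothetical data: temperature (explanatory), ice creams sold (response).
temperature = [20, 22, 25, 27, 30, 33]
ice_creams = [110, 125, 150, 160, 190, 205]

r = pearson_r(temperature, ice_creams)
print(f"r = {r:.3f}")

# A perfectly linear data set gives r = 1 (up to rounding).
r_perfect = pearson_r([1, 2, 3], [2, 4, 6])
```

A value of $r$ close to 1 here would be described as a strong, positive, linear relationship.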
Linear, negative and weak
CAUSALITY
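• Causality is the relationship between cause and effect.
• A relationship is causal only when changes in the explanatory variable directly bring about changes in the response variable.
• A strong correlation on its own does not establish causality; a third variable may influence both.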
1. A negative correlation exists between the number of ice-creams sold and the
number of flu cases reported. Does ice cream prevent the flu? Comment.
2. There exists a strong positive correlation between the number of televisions and the life expectancy for the world’s nations. Does having multiple televisions increase life expectancy? Comment.
OUTLIER
• The correlation coefficient is also affected by outliers.
• Outliers are extreme data points that do not seem to belong with the rest of the data points. This may be due to extreme cases, special conditions or inaccurate data.
• Outliers often reduce the strength of the correlation coefficient, although an outlier that lies along the trend can inflate it.
• An analyst may consider removing the outliers, when there is a justifiable reason such as inaccurate data, and then recalculating the correlation coefficient.
1.06 Fitting Least Squares Regression Lines
● Recall the general equation for the regression line:
$y = a + bx$
● Interpret/Comment on the $a$ value ($y$-intercept):
‘The $y$-variable is $a$ units when the $x$-variable is zero units.’
● Interpret/Comment on the $b$ value (gradient):
‘The $y$-variable increases/decreases by $b$ units for every 1 unit increase in the $x$-variable.’
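To make the roles of $a$ and $b$ concrete, the least squares line can be sketched in Python using the standard formulas $b = S_{xy} / S_{xx}$ and $a = \bar{y} - b\bar{x}$. The study-time data are hypothetical and `least_squares_line` is an illustrative name; a CAS calculator would normally do this calculation.

```python
def least_squares_line(xs, ys):
    """Return (a, b) for the least squares line y = a + b*x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    b = sxy / sxx              # gradient
    a = mean_y - b * mean_x    # y-intercept
    return a, b

# Hypothetical data: hours spent studying (x) and exam score (y).
hours = [1, 2, 3, 4, 5]
scores = [52, 58, 61, 67, 72]

a, b = least_squares_line(hours, scores)
print(f"score = {a:.1f} + {b:.1f} x hours")
```

With these figures the interpretation would read: ‘The exam score is about 47.3 marks when the study time is zero hours’ and ‘The exam score increases by about 4.9 marks for every 1 hour increase in study time.’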
MAKING PREDICTIONS
Once a regression line has been found, the equation can be used to make predictions.
Data was collected from people aged between 7 and 19 years and a linear regression line was found with the equation:
Height (cm) = 100 + 2.5 × age (years)
What is the predicted height for an 8-year-old? For a 21-year-old?
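The two predictions can be checked with a short Python sketch. The equation and the age range of 7 to 19 come from the example above; the function names `predict_height` and `prediction_type` are illustrative only.

```python
# Age range of the collected data, as stated in the example.
AGE_MIN, AGE_MAX = 7, 19

def predict_height(age):
    """Predicted height in cm from the regression line in the example."""
    return 100 + 2.5 * age

def prediction_type(age):
    """Inside the data range is interpolation; outside is extrapolation."""
    if AGE_MIN <= age <= AGE_MAX:
        return "interpolation"
    return "extrapolation"

for age in (8, 21):
    print(age, "years:", predict_height(age), "cm,", prediction_type(age))
```

The 8-year-old prediction (120 cm) lies within the range of the data, while the 21-year-old prediction (152.5 cm) lies outside it and should be treated with caution.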
1.07 Reliability of Regression Model
☺The regression line model can be used to make predictions.
☺These predictions are estimates, not exact values. However, we can determine whether a prediction is reliable or not.
☺The following points can be used to determine reliability:
1. Strength of Correlation Coefficient
2. Coefficient of Determination
3. Prediction: Interpolation or Extrapolation
4. Existence of Outliers
5. Shape of Residual Plot
1. Strength Of Correlation Coefficient
• From the correlation coefficient, we can judge the reliability of a prediction.
• A strong correlation suggests that the prediction is reliable.
• A weak correlation suggests that the prediction is not reliable.
2. Coefficient Of Determination, $r^2$
• The coefficient of determination is used to determine how well our regression line represents our set of data.
• It has a numerical value from 0 to 1.
• When interpreting, a general sentence can be used:
‘$r^2 \times 100\%$ of the variation in the response (dependent) variable can be explained by the variation in the explanatory (independent) variable.’
• As $r^2$ gets higher,
1. Relationship between the variables gets stronger
2. Linear regression line becomes a more appropriate model for the data
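As an illustration of the general sentence, the sketch below computes $r^2$ for a hypothetical data set (hours studied and exam scores, not taken from these notes) and fills in the sentence. The function name `r_squared` is illustrative.

```python
def r_squared(xs, ys):
    """Coefficient of determination for a linear fit of ys on xs."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    # r = sxy / sqrt(sxx * syy), so r squared is:
    return sxy ** 2 / (sxx * syy)

# Hypothetical data: hours spent studying and exam score.
hours = [1, 2, 3, 4, 5]
scores = [52, 58, 61, 67, 72]

r2 = r_squared(hours, scores)
print(
    f"{r2 * 100:.1f}% of the variation in exam score can be explained "
    "by the variation in hours spent studying."
)
```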
3. Prediction: Interpolation Or Extrapolation
4. Existence Of Outlier
• Outliers appear as extreme data points on the scatter plot.
• These outliers can reduce the strength of the correlation coefficient and affect the regression line equation.
• Where there is a justifiable reason (e.g. inaccurate data), excluding outliers can improve the reliability of predictions made from the regression line.
5. Shape Of Residual Plot
• From the residual plot, we can determine the ‘linearity’ of the scatter plot.
• A residual plot with scattered/random points suggests that the scatter plot is linear. This suggests that the linear regression model is suitable for the data.
• A residual plot with pattern/shape suggests that the scatter plot is non-linear. This suggests that the linear regression model is not suitable for the data.
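The values shown on a residual plot can be sketched in Python, taking each residual as the observed value minus the value predicted by the least squares line. Both data sets are hypothetical and the function names are illustrative: the first is roughly linear, while the second follows a curve ($y = x^2$).

```python
def least_squares_line(xs, ys):
    """Return (a, b) for the least squares line y = a + b*x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    b = sxy / sxx
    return mean_y - b * mean_x, b

def residuals(xs, ys):
    """Residual = observed y minus the y predicted by the fitted line."""
    a, b = least_squares_line(xs, ys)
    return [y - (a + b * x) for x, y in zip(xs, ys)]

xs = [1, 2, 3, 4, 5]
linear_ys = [52, 58, 61, 67, 72]   # hypothetical, roughly linear
curved_ys = [1, 4, 9, 16, 25]      # hypothetical, y = x squared

linear_res = residuals(xs, linear_ys)
curved_res = residuals(xs, curved_ys)

print([round(e, 1) for e in linear_res])  # small, no clear pattern
print([round(e, 1) for e in curved_res])  # positive, negative, positive
```

For the curved data the residuals start positive, dip below zero and rise again, which is the kind of pattern that suggests a linear model is not suitable.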