
CHAPTER 1

BIVARIATE DATA
• 1.01 Variables & Summary Data
• 1.02 Two Way Frequency Table
• 1.03 Relationships / Associations.
• 1.04 Displaying Relationship: Scatter Plot
• 1.05 Analysing Relationships: Correlation Coefficient
• 1.06 Fitting Least Square Regression Lines
• 1.07 Reliability of Regression Model

Objective:

What is Bivariate?

Bivariate means “two variables”; in other words, two sets of data are
recorded for each subject.

So with bivariate data, we are interested in comparing the two sets of
data and identifying any relationship between them.

1.01 VARIABLES & SUMMARY DATA

Variables are characteristics that we can measure, observe or count.

• Height
• Waistline
• Weight
• Exam scores
• Pollution level
• Distance from Earth
• Gaming hours
• Velocity
• Number of people
• Amount of fertilizer
• Population
• Sales

TYPES OF DATA

• There are 2 types of data, categorical and numerical.


• Categorical data can be grouped into categories.
• Numerical data can be counted or measured.


Calculating Measures of Centre

1) Mean – is the arithmetic average of the data values. The mean is not an
appropriate measure of centre if there are outliers in the data set.

2) Median – is the middle score when the data values are ordered from smallest to
largest. The median is not affected by outliers.

3) Mode – the most common data value, or the data value with the highest frequency.
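
The three measures of centre can be compared on a small data set. The sketch below uses Python's standard `statistics` module; the exam scores are hypothetical, with 98 included as a deliberate outlier to show how it pulls the mean but not the median.

```python
from statistics import mean, median, mode

# Hypothetical exam scores (not from the slides); 98 is a deliberate outlier.
scores = [52, 55, 55, 58, 60, 61, 98]

centre_mean = mean(scores)      # arithmetic average, pulled upward by the outlier
centre_median = median(scores)  # middle value of the ordered data
centre_mode = mode(scores)      # value with the highest frequency

print(f"mean   = {centre_mean:.2f}")
print(f"median = {centre_median}")
print(f"mode   = {centre_mode}")
```

Here the mean is larger than the median because of the single extreme score, which is why the median is the preferred measure of centre when outliers are present.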

Calculating Measures of Spread

A measure of spread indicates how the data is spread out.

1) Range = highest value – lowest value. The range measures the spread of all the
data and therefore is not an appropriate measure of spread when outliers are
present.

2) Interquartile Range: $IQR = Q_3 - Q_1$, where $Q_1$, the lower quartile, is the
median of the lower half of the data and $Q_3$, the upper quartile, is the median of
the upper half of the data. The $IQR$ is not affected by outliers.

3) Standard deviation – is calculated using a formula. The formula is seldom applied
by hand; using a CAS calculator is the most efficient method.
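
As a rough cross-check for calculator output, the measures of spread can be sketched in Python. The data set is hypothetical, and the helper `quartiles` is an illustrative name that follows the "median of each half" definition given above (for an odd number of values the middle value is left out of both halves). Whether the sample or the population standard deviation is wanted is an assumption to confirm against the course; both are shown.

```python
from statistics import median, pstdev, stdev

def quartiles(values):
    """Return (Q1, Q3) as the medians of the lower and upper halves.

    Assumes at least two data values.
    """
    ordered = sorted(values)
    half = len(ordered) // 2
    lower = ordered[:half]    # values below the median position
    upper = ordered[-half:]   # values above the median position
    return median(lower), median(upper)

# Hypothetical data (not from the slides); 98 is a deliberate outlier.
data = [52, 55, 55, 58, 60, 61, 98]

data_range = max(data) - min(data)
q1, q3 = quartiles(data)
iqr = q3 - q1
sample_sd = stdev(data)       # divides by n - 1
population_sd = pstdev(data)  # divides by n

print(f"range = {data_range}, Q1 = {q1}, Q3 = {q3}, IQR = {iqr}")
print(f"sample sd = {sample_sd:.2f}, population sd = {population_sd:.2f}")
```

The outlier inflates the range but leaves the IQR small, which matches the statements about outliers in points 1) and 2).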

USING CAS CALCULATOR

1.02 Two Way Frequency Table
● Data can be represented in a 2-way frequency table.

● E.g. suppose that the numbers of male and female ‘Members of Parliament
(MPs)’ from two parties are tabulated.

● The table below shows the totals in the form of a 2-way frequency table.

• From the table, we can get their proportions based on the total number, or
even their percentages.

• We can then make observations:


1) 51% of all of the MPs from the two parties are males from Party A
2) 76% of all the MPs from the two parties are male.
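
The percentage calculations behind observations like these can be sketched as follows. The original table did not survive in this copy, so the counts below are hypothetical, chosen only so that the two quoted percentages come out the same.

```python
# Hypothetical counts (the original table is not available); chosen so that
# the percentages quoted in the observations are reproduced.
counts = {
    "Male":   {"Party A": 51, "Party B": 25},
    "Female": {"Party A": 14, "Party B": 10},
}

# Grand total of all MPs in the table.
total = sum(sum(row.values()) for row in counts.values())

# Each cell as a percentage of the grand total.
pct_of_total = {
    sex: {party: 100 * n / total for party, n in row.items()}
    for sex, row in counts.items()
}

# Percentage of all MPs who are male (row total over grand total).
male_pct = 100 * sum(counts["Male"].values()) / total

print(f"Males from Party A: {pct_of_total['Male']['Party A']:.0f}% of all MPs")
print(f"All males: {male_pct:.0f}% of all MPs")
```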

Suppose that a survey asking a number of people their age and which they
most preferred as a main course at a restaurant, out of beef, chicken, fish or
other, gave rise to the two way table shown below.

(a) How many people responded to the survey?


(b) How many of the respondents chose beef as their most preferred main
course at a restaurant?
(c) What percentage of those choosing beef as their most preferred course at
a restaurant were aged 31 to 43?



We can display the percentages of the previous example as two proportionally
divided columns (also known as a stacked 100% column graph).

Display – Stacked 100% column graph:

☺Row Percentages

Party A Party B Total

Male

Female


☺Column Percentages

Party A Party B

Male

Female

Total



• Suppose we have the following data and table of percentages.
• How would we draw the graph to correctly display the data?


The proportional column graph shows that, whether male or female, the
proportions for the modes of transport are identical. Thus, there is NO
ASSOCIATION.

• If changing the category of one variable heavily influences the
proportions of the other variable, then the two variables are
ASSOCIATED.
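
The comparison that the proportional column graph makes visually can also be done numerically with row percentages. This is a minimal sketch; the transport counts are hypothetical (the original data table is not available here) and `row_percentages` is an illustrative helper name.

```python
def row_percentages(table):
    """Convert each row of counts into percentages of that row's total."""
    result = {}
    for row_label, row in table.items():
        row_total = sum(row.values())
        result[row_label] = {
            col: round(100 * n / row_total, 1) for col, n in row.items()
        }
    return result

# Hypothetical counts: mode of transport by sex.
transport = {
    "Male":   {"Car": 20, "Bus": 10, "Walk": 10},
    "Female": {"Car": 30, "Bus": 15, "Walk": 15},
}

percentages = row_percentages(transport)
for sex, row in percentages.items():
    print(sex, row)

# Identical rows mean the columns of the stacked 100% graph would look the same.
no_association = percentages["Male"] == percentages["Female"]
print("No association" if no_association else "Possible association")
```

With real data the rows are rarely exactly identical, so the judgement is about whether the percentages differ heavily rather than whether they match exactly.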

(c)
Comment on association:

Describing association:

1.03 RELATIONSHIPS / ASSOCIATIONS

• What are relationships?


• What do relationships have?
• How would you describe a relationship?
• What can have relationships?

Definition: The way in which two or more concepts, objects, or people
are connected, or the state of being connected.



• When comparing variables for relationships, we need to know:
– Explanatory / independent variable (EV),
– Response / dependent variable (RV).

• Tip: The response variable depends on the explanatory variable.

• Identifying the explanatory and response variables is important. It will affect how
you would represent data and conduct your analysis.

RV depends on the EV

For the following sets of variables, state which is the explanatory variable and
which is the response variable.
1. Temperature and number of ice creams sold
2. Exam scores and time spent studying
3. Time travelling and distance travelled
4. Working hours and wage
5. Caffeine consumption and heart rate
6. Time spent dating and Couple happiness level

1.04 Displaying Relationship: Scatter Plot


1. FORM


2. DIRECTION


3. STRENGTH

1.05 Analyzing Relationships: Correlation Coefficient

• To quantify the degree of correlation between 2 variables, we calculate
Pearson’s correlation coefficient, denoted by the letter $r$.

• Calculating the correlation coefficient produces a value between −1 and 1. The sign
of this value gives the direction of the relationship, and its size gives the strength.

• Note: This correlation coefficient is used only to analyze the relationship between
2 variables that show a linear relationship. Thus, it is used only with ‘linear
regression models’.
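
In this course the coefficient is normally read off a CAS calculator, but the calculation can be sketched directly from the standard definition of Pearson's $r$. The temperature and ice cream figures below are hypothetical, and `pearson_r` is an illustrative helper name.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson's correlation coefficient for two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of products and squares of deviations from the means.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

# Hypothetical data: temperature (explanatory) and ice creams sold (response).
temps = [18, 21, 24, 27, 30, 33]
sales = [20, 26, 31, 40, 44, 52]

r = pearson_r(temps, sales)
print(f"r = {r:.3f}")
```

A value this close to 1 would be described as a strong, positive, linear relationship.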


Linear, negative and weak


CAUSALITY
• Causality is the relationship between cause and effect.
• The term applies when a change in the explanatory variable directly causes the
change in the response variable. A correlation on its own does not prove causality.

1. A negative correlation exists between the number of ice-creams sold and the
number of flu cases reported. Does ice cream prevent the flu? Comment.

2. There exists a strong positive correlation between the number of televisions and
the life expectancy for the world’s nations. Does having multiple televisions
increase life expectancy? Comment.


OUTLIER
• The correlation coefficient is also affected by outliers.

• Outliers are extreme data points that do not seem to belong with the
rest of the data points. This may be due to extreme cases, special
conditions or inaccurate data.

• Outliers that lie away from the general trend reduce the strength of
the correlation coefficient.

• An analyst may consider removing the outliers, with justification, to
improve the strength of the correlation coefficient.
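
The effect of a single outlier on $r$ can be seen in a small experiment. The sketch below uses hypothetical data that is almost perfectly linear, then adds one point far from the trend; `pearson_r` is an illustrative helper implementing the standard formula.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson's correlation coefficient for two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

# Hypothetical, nearly linear data.
xs = [1, 2, 3, 4, 5, 6]
ys = [2.1, 3.9, 6.2, 8.0, 9.8, 12.1]
r_without = pearson_r(xs, ys)

# Same data with one extra point that lies far below the trend.
r_with = pearson_r(xs + [7], ys + [1.0])

print(f"r without outlier = {r_without:.3f}")
print(f"r with outlier    = {r_with:.3f}")
```

One stray point is enough to turn a strong correlation into a weak one here, which is why outliers should be identified before $r$ is interpreted.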

1.06 Fitting Least Square Regression Lines


● Recall the general equation for the regression line:

$y = a + bx$

● Interpret/comment on the $a$ value ($y$-intercept):

‘The $y$-variable is $a$ units when the $x$-variable is zero units.’

● Interpret/comment on the $b$ value (gradient):

‘The $y$-variable increases/decreases by $b$ units for every 1 unit increase in
the $x$-variable.’
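
A CAS calculator normally supplies $a$ and $b$, but they can also be sketched from the standard least squares formulas, $b = S_{xy}/S_{xx}$ and $a = \bar{y} - b\bar{x}$. The data is hypothetical and `least_squares_line` is an illustrative helper name.

```python
def least_squares_line(xs, ys):
    """Return (a, b) for the line y = a + b*x that minimises squared residuals."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    b = sxy / sxx              # gradient
    a = mean_y - b * mean_x    # y-intercept
    return a, b

# Hypothetical data: temperature (explanatory) and ice creams sold (response).
temps = [18, 21, 24, 27, 30, 33]
sales = [20, 26, 31, 40, 44, 52]

a, b = least_squares_line(temps, sales)
print(f"sales = {a:.2f} + {b:.2f} x temperature")
```

For this data the gradient is about 2.12, so sales increase by about 2.12 ice creams for every 1 degree increase in temperature. The intercept of about −18.66 has no sensible meaning here, since a temperature of zero lies well outside the data.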

MAKING PREDICTIONS

Once a regression line has been found, the equation can be used to make predictions.

Data was collected from people aged between 7 and 19 years and a linear
regression line was found with the equation:
Height (cm) = 100 + 2.5 × age (years)
What is the predicted height for an 8-year-old? For a 21-year-old?
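
The two predictions can be worked through by substituting each age into the equation. The sketch below also checks whether each age lies inside the 7 to 19 year range of the data; the function names are illustrative only.

```python
AGE_MIN, AGE_MAX = 7, 19  # age range of the collected data, from the question

def predict_height(age):
    """Predicted height in cm from the regression line in the question."""
    return 100 + 2.5 * age

def is_interpolation(age):
    """True when the age lies within the range of the collected data."""
    return AGE_MIN <= age <= AGE_MAX

for age in (8, 21):
    kind = "interpolation" if is_interpolation(age) else "extrapolation"
    print(f"age {age}: predicted height {predict_height(age)} cm ({kind})")
# age 8: predicted height 120.0 cm (interpolation)
# age 21: predicted height 152.5 cm (extrapolation)
```

The prediction for the 21-year-old lies outside the data range, so it should be treated with caution.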

1.07 Reliability of Regression Model

☺The regression line model can be used to make predictions.

☺These predictions are estimates, not exact values. However, we can determine
whether a prediction is reliable or not.

☺The following points can be used to determine reliability:


1. Strength of Correlation Coefficient
2. Coefficient of Determination
3. Prediction: Interpolation or Extrapolation
4. Existence of Outliers
5. Shape of Residual Plot


1. Strength Of Correlation Coefficient

• From the correlation coefficient, we can determine the reliability of a
prediction.

• A strong correlation coefficient suggests that the prediction is reliable.

• A weak correlation coefficient suggests that the prediction is not reliable.


2. Coefficient Of Determination, $r^2$
• The coefficient of determination is used to determine how well our
regression line represents our set of data.
• It has a numerical value between 0 and 1.
• When interpreting, a general sentence can be used:
‘$r^2 \times 100\%$ of the variation in the response (dependent) variable can be
explained by the variation in the explanatory (independent) variable.’

• As $r^2$ gets higher,

1. The linear relationship between the variables gets stronger
2. The linear regression line becomes a more appropriate model for the data
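
To connect the interpretation sentence to the numbers, the sketch below computes $r^2$ in two ways on hypothetical data: by squaring $r$, and as the proportion of variation in the response that the fitted line accounts for. For a least squares line with one explanatory variable the two agree.

```python
from math import sqrt

# Hypothetical data: temperature (explanatory) and ice creams sold (response).
temps = [18, 21, 24, 27, 30, 33]
sales = [20, 26, 31, 40, 44, 52]

n = len(temps)
mean_x = sum(temps) / n
mean_y = sum(sales) / n
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(temps, sales))
sxx = sum((x - mean_x) ** 2 for x in temps)
syy = sum((y - mean_y) ** 2 for y in sales)  # total variation in the response

# Method 1: square the correlation coefficient.
r = sxy / sqrt(sxx * syy)
r_squared = r ** 2

# Method 2: 1 - (unexplained variation / total variation), using the fitted line.
b = sxy / sxx
a = mean_y - b * mean_x
sse = sum((y - (a + b * x)) ** 2 for x, y in zip(temps, sales))
explained = 1 - sse / syy

print(f"{r_squared * 100:.1f}% of the variation in ice cream sales can be "
      "explained by the variation in temperature.")
```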


3. Prediction: Interpolation Or Extrapolation


4. Existence Of Outlier

• Outliers appear as extreme data points on the scatter plot.

• These outliers can reduce the strength of the correlation coefficient and affect
the regression line equation.

• Where exclusion can be justified (for example, inaccurate data), excluding
outliers improves the reliability of predictions made from the regression line.

5. Shape Of Residual Plot


• From the residual plot, we can determine the ‘linearity’ of the scatter plot.

• A residual plot with randomly scattered points suggests that the scatter plot
is linear. This suggests that the linear regression model is suitable for the
data.

• A residual plot with a clear pattern or shape suggests that the scatter plot is
non-linear. This suggests that the linear regression model is not suitable for
the data.
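
A residual is the observed value minus the value predicted by the regression line. The sketch below fits a line to hypothetical data that actually follows a curve ($y = x^2$) and lists the residuals, to show what a patterned residual plot looks like in numbers.

```python
# Hypothetical non-linear data: y = x squared.
xs = [1, 2, 3, 4, 5]
ys = [1, 4, 9, 16, 25]

# Least squares line y = a + b*x.
n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
sxx = sum((x - mean_x) ** 2 for x in xs)
b = sxy / sxx
a = mean_y - b * mean_x

# Residual = observed value - predicted value.
residuals = [y - (a + b * x) for x, y in zip(xs, ys)]
print(residuals)  # [2.0, -1.0, -2.0, -1.0, 2.0]
```

The residuals run positive, negative, then positive again, forming a U shape rather than a random scatter. That pattern indicates the linear model is not suitable for this data.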
