
Level I - Quantitative Methods

Organizing, Visualizing and Describing Data

www.ift.world

Graphs, charts, tables, examples, and figures are copyright 2022, CFA Institute.
Reproduced and republished with permission from CFA Institute. All rights reserved.

Contents and Introduction
1. Introduction
2. Data Types
3. Organizing Data for Quantitative Analysis
4. Summarizing Data Using Frequency Distribution
5. Summarizing Data Using a Contingency Table
6. Data Visualization
7. Measures of Central Tendency
8. Quantiles
9. Measures of Dispersion
10. Downside Deviation and Coefficient of Variation
11. The Shape of Distributions
12. Correlation Between Two Variables

2. Data Types
What is data?
• Collection of numbers, characters, words and text to represent facts or information
• Can be in the form of images, audio, and video
• Can be in raw or organized format

1. Numerical versus Categorical Data


2. Cross-Sectional versus Time-Series versus Panel Data
3. Structured versus Unstructured Data

2.1 Numerical Versus Categorical Data
• Numerical (quantitative) data
 Continuous data
 Discrete data

• Categorical (qualitative) data


 Nominal data
 Ordinal data

Practice Question
Identify the data type for each of the following items:

• Number of coupon payments for a bond

• Dividends paid by a stock

• Credit ratings for corporate bonds

• Hedge fund classification types

2.2 Cross-Sectional versus Time-Series versus Panel Data
This classification is based on how data are collected
What is a variable (field, attribute, feature)?
• Characteristic or quantity that can be measured, counted, or categorized
• Subject to change
• Observation: value of a specific variable collected at a point in time or
over a specified period of time

• Time-series data
• Cross-sectional data
• Panel data

2.3 Structured versus Unstructured Data

• Structured data

• Unstructured data

3. Organizing Data for Quantitative Analysis

• One-dimensional array

• Two-dimensional rectangular array (data table)

4. Summarizing Data Using Frequency Distribution (1/2)
Frequency distribution (one-way table)
Constructing a frequency distribution of a categorical variable:
1. Count the number of observations for each unique value of the variable
2. Construct a table listing each unique value and the corresponding counts
3. Sort the records by number of counts in descending or ascending order

Summary data for 200 companies across four sectors:

Sector        Absolute Frequency   Relative Frequency
Technology            22                  11%
Healthcare            50                  25%
Financial             58                  29%
Industrial            70                  35%
Total                200                 100%

4. Summarizing Data Using Frequency Distribution (2/2)
Constructing a frequency distribution for numerical data:
1. Sort the data in ascending order.
2. Calculate the range of the data.
3. Decide on the number of bins (k).
4. Determine bin width.
5. Determine bins.
6. Determine the number of observations in each bin.
7. Construct a table of the bins listed from smallest to largest.

Summary data for 100 stocks with prices ranging between 45.00 and 65.00:

Stock Price Range   Absolute    Cumulative   Relative    Cumulative Relative
(Min – Max)         Frequency   Frequency    Frequency   Frequency
45.00 – 50.00          25           25         0.25          0.25
50.00 – 55.00          35           60         0.35          0.60
55.00 – 60.00          29           89         0.29          0.89
60.00 – 65.00          11          100         0.11          1.00
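
The cumulative and relative columns of the stock price table follow mechanically from the absolute frequencies. A minimal Python sketch of that arithmetic, with illustrative variable names that are not part of the source:

```python
from itertools import accumulate

# Bins and absolute frequencies copied from the stock price table above.
bins = ["45.00 – 50.00", "50.00 – 55.00", "55.00 – 60.00", "60.00 – 65.00"]
absolute = [25, 35, 29, 11]

total = sum(absolute)                                  # number of observations
cumulative = list(accumulate(absolute))                # running total of counts
relative = [count / total for count in absolute]       # share of each bin
cumulative_relative = [c / total for c in cumulative]  # running share

for row in zip(bins, absolute, cumulative, relative, cumulative_relative):
    print("{:<15} {:>4} {:>4} {:>6.2f} {:>6.2f}".format(*row))
```

The last cumulative relative frequency should always equal 1.00, which is a quick check that no observation was dropped or counted twice.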

Practice Question
The actual number of observations in a given interval is called the:
A. Absolute frequency
B. Relative frequency
C. Cumulative absolute frequency

Answer: A

The actual number of observations in a given interval is known as absolute frequency. Relative
frequency is the absolute frequency of each interval divided by the total number of observations.
Cumulative absolute frequency is the running total of all absolute frequencies.

Practice Question
Which of the following is most likely to be accurate?
A. An observation can fall in more than one interval.
B. The data is sorted in a descending order for the construction of a frequency
distribution.
C. The cumulative relative frequency tells the observer the fraction of the
observations that are less than the upper limit of each interval.

Answer: C

The cumulative relative frequency tells the observer the fraction of the observations that are less
than the upper limit of each interval. An observation cannot fall in more than one interval. The
data is sorted in an ascending order for the construction of a frequency distribution.

5. Summarizing Data Using a Contingency Table
A contingency table is a tabular format that displays the frequency distributions of two or more
categorical variables simultaneously.

                             Market Capitalization Variable (3 Levels)
Sector Variable (4 Levels)   Small    Mid    Large   Total
Financial                      44      38      20     102
FMCG                          130      54      46     230
Information Technology         57      34      21     112
Real Estate                    30      16      10      56
Total                         261     142      97     500

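Joint frequency: the count in an interior cell, e.g. 44 small-cap Financial companies
Marginal frequency: a row or column total, e.g. 102 Financial companies or 261 small-cap companies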
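
The marginal frequencies are the row and column sums of the joint frequencies. A short Python sketch using the counts from the table above (the dictionary layout is an assumption made for illustration):

```python
# Joint frequencies from the sector / market capitalization table above.
joint = {
    "Financial": {"Small": 44, "Mid": 38, "Large": 20},
    "FMCG": {"Small": 130, "Mid": 54, "Large": 46},
    "Information Technology": {"Small": 57, "Mid": 34, "Large": 21},
    "Real Estate": {"Small": 30, "Mid": 16, "Large": 10},
}

# Marginal frequency of each sector: sum across the row.
row_totals = {sector: sum(caps.values()) for sector, caps in joint.items()}

# Marginal frequency of each size bucket: sum down the column.
col_totals = {}
for caps in joint.values():
    for cap, count in caps.items():
        col_totals[cap] = col_totals.get(cap, 0) + count

grand_total = sum(row_totals.values())
print(row_totals)
print(col_totals)
print(grand_total)
```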

Contingency Table Applications
Evaluate the performance of a classification model

Suppose we have a model for classifying companies into two groups:
1. those that default on their bond payments
2. those that do not default
The table shows the confusion matrix for a sample of 1,000 non-investment-grade bonds.

Predicted Default   Actual Default: Yes   Actual Default: No   Total
Yes                         150                    10            160
No                            6                   834            840
Total                       156                   844          1,000

Investigate potential association between two categorical variables


One way to test potential association:
1. Add the marginal frequencies and overall total to the contingency table
2. Use the marginal frequencies to construct a table with expected values of the observations
3. Compare with chi-square value for a given level of significance

Example
Suppose we randomly pick 200 mutual funds and classify them based on two parameters:
Fund style – Growth versus value
Risk level – Low risk versus high risk

          Low Risk   High Risk
Growth       67         19
Value        98         16

Describe how the contingency table is used to set up a test for independence between fund style and risk level.

Add the marginal frequencies and overall total to the contingency table.

Observed Values
          Low Risk   High Risk   Total
Growth       67         19         86
Value        98         16        114
Total       165         35        200

Observed Values (as a percentage of the row total)
          Low Risk   High Risk   Total
Growth      78%        22%       100%
Value       86%        14%       100%

Use the marginal frequencies to construct a table with expected values of the observations.

Expected Values
          Low Risk   High Risk   Total
Growth     70.95      15.05        86
Value      94.05      19.95       114
Total     165         35          200

Expected Values (as a percentage of the row total)
          Low Risk   High Risk   Total
Growth     82.5%      17.5%      100%
Value      82.5%      17.5%      100%

The actual values and the expected values are used to derive the chi-square test statistic. This is then compared to a value from the chi-square distribution table for a given level of significance. If the test statistic is greater than the chi-square distribution value, then we can conclude that there is significant association between the categorical variables.
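
The expected values and the test statistic in the mutual fund example can be reproduced in a few lines. This is a minimal Python sketch, assuming the usual rule that each expected count equals row total × column total / overall total; the variable names are illustrative:

```python
# Observed joint frequencies from the mutual fund example above.
observed = {
    "Growth": {"Low": 67, "High": 19},
    "Value": {"Low": 98, "High": 16},
}

row_totals = {style: sum(cells.values()) for style, cells in observed.items()}
col_totals = {"Low": 0, "High": 0}
for cells in observed.values():
    for risk, count in cells.items():
        col_totals[risk] += count
grand_total = sum(row_totals.values())

# Expected count under independence = row total * column total / grand total.
expected = {
    style: {
        risk: row_totals[style] * col_totals[risk] / grand_total
        for risk in col_totals
    }
    for style in observed
}

# Chi-square statistic: sum of (observed - expected)^2 / expected over all cells.
chi_square = sum(
    (observed[s][r] - expected[s][r]) ** 2 / expected[s][r]
    for s in observed
    for r in col_totals
)
print(expected)
print(chi_square)
```

The statistic would then be compared with a chi-square critical value with (2 − 1)(2 − 1) = 1 degree of freedom at the chosen level of significance.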

6. Data Visualization
Visualization is the presentation of data in a pictorial or graphical format for the
purpose of increasing understanding and for gaining insights into the data
1. Histogram and Frequency Polygon
2. Bar Chart
3. Tree-Map
4. Word Cloud
5. Line Chart
6. Scatter Plot
7. Heat Map
8. Guide to Selecting among Visualization Types

6.1 Histogram and Frequency Polygon
Histogram: presents distribution of numerical data by using height of column to represent absolute frequency; shows where most of the data lies
Frequency polygon: plots the midpoints of each interval on the X-axis and the absolute frequency of that interval on the Y-axis
Cumulative frequency distribution: plots the cumulative frequency or cumulative relative frequency against the upper interval limit; shows how many or what percent of the observations lie below a certain value

Price Range      # Stocks
46.00 – 51.00        20
51.00 – 56.00        60
56.00 – 61.00       100
61.00 – 65.00        20

[Figures: Histogram, Frequency Polygon, and Cumulative Frequency Distribution for the four price ranges above]

Slope interpretation
Practice Question
Which of the following statements is most likely to be inaccurate about histograms?
A. A histogram is the graphical equivalent of a frequency distribution.
B. A histogram is a form of a bar chart.
C. In a histogram, the height represents the relative frequency for each interval.

Answer: C

In a histogram, the height represents the absolute frequency for each interval.

6.2 Bar Chart (1/3)
A bar chart is used to plot the frequency distribution of categorical data
Each bar represents a distinct category
Bar’s height (or length) is proportional to frequency
A grouped bar chart (also called a clustered bar chart) can be used to show the frequency distribution of multiple categorical variables simultaneously

6.2 Bar Chart (2/3)
A stacked bar chart is an alternative form for presenting the frequency distribution of multiple categorical variables
simultaneously.

6.2 Bar Chart (3/3)

Normally height is proportional to the value it depicts
If the y-axis is truncated the graph needs to be evaluated more carefully

6.3 Tree-Map
A tree-map is a graphical tool to display categorical data

Consists of a set of colored rectangles to represent distinct groups
Area of each rectangle is proportional to the value of the corresponding group
Additional dimensions of categorical data can be displayed by nested rectangles

6.4 Word Cloud
A word cloud is a visual device for representing textual data
The size of each distinct word is proportional to the frequency with which it appears in the given text

Color can be used to add another dimension
For example, with a word cloud based on analyst reports related to a particular company, different colors can be used for positive, negative and neutral sentiment words.

6.5 Line Chart
A line chart is a type of graph used to visualize ordered observations
• Often used to display the change of data series over time
• Can plot more than one set of data points, which helps in making comparisons

A bubble line chart is a special type of line chart that uses varying-sized bubbles as data points to represent an additional dimension of data.

6.6 Scatter Plot (1/2)
A scatter plot helps visualize the joint variation in two numerical variables.
• Constructed by drawing dots to indicate the values of the two variables plotted against the corresponding axes.
• Dots are drawn to indicate the values of the two variables at different points in time.

A scatter plot can reveal
• Sign of association
• Degree of association
• Whether association is linear or non-linear
• Maximum and minimum values
• Outliers/extreme values

6.6 Scatter Plot (2/2)
A scatter plot matrix organizes scatter plots between pairs of variables into a matrix format.

This makes it easy to inspect all pairwise relationships in one combined visual.

6.7 Heat Map
A heat map organizes and summarizes data in a tabular format and represents it using a color
spectrum.

Number of stocks in a portfolio categorized by sector and market capitalization

Uses:
Display frequency distributions
Visualize the degree of correlation among different variables

6.8 Guide to Selecting among Visualization Types
Select visualization technique based on purpose:
1) exploring/presenting relationships 2) exploring/presenting distributions 3) making comparisons

Common Pitfalls
• Improper chart type

• Selectively plotted data

• Truncated graph where y-axis does not start at 0

• Improper scaling of axes

7. Measures of Central Tendency
1. The Arithmetic Mean
2. The Median
3. The Mode
4. Other Concepts of Mean

Population: all members of a specified group
A ‘parameter’ describes the characteristics of a population
Sample: a subset drawn from a population
A ‘sample statistic’ describes the characteristic of a sample

7.1 The Arithmetic Mean
Arithmetic mean: sum of the observation values divided by the number of observations

Sample mean: $\bar{X} = \frac{\sum_{i=1}^{n} X_i}{n}$

If the sample data is: 2, 4, 4, 6, 10, 10, 12, 12, and 12


Sample mean = 72/9 = 8

The arithmetic mean is the ‘center of gravity’ of the data set

The arithmetic mean is very sensitive to outliers (extreme high or low values)
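
A quick Python check of the sample mean, followed by an illustration of its sensitivity to outliers. The extra value of 1000 is a hypothetical outlier added for illustration and is not part of the source data:

```python
from statistics import mean

# Sample data from the slide above.
sample = [2, 4, 4, 6, 10, 10, 12, 12, 12]
sample_mean = mean(sample)               # 72 / 9

# Hypothetical outlier appended to show how far one extreme value pulls the mean.
with_outlier = sample + [1000]
mean_with_outlier = mean(with_outlier)   # 1072 / 10

print(sample_mean)
print(mean_with_outlier)
```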

Dealing with Outliers
Option 1: Do nothing

Option 2: Delete all the outliers
A 5% trimmed mean discards the lowest and highest 2.5%

Option 3: Replace outliers with another value
A 95% winsorized mean sets the
• bottom 2.5% of values equal to the value at or below which 2.5% of all the values lie
• top 2.5% of values equal to the value at or below which 97.5% of all the values lie
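
The two treatments can be compared on a small data set. The sketch below is a simplified Python illustration, not a definitive implementation: it cuts whole observations from each tail rather than interpolating percentiles, the function names are hypothetical, and the 40 values (including the outlier 1000) are made up so that 2.5% of the sample is exactly one observation.

```python
def trimmed_mean(values, total_pct):
    """Mean after discarding total_pct percent of observations, half per tail."""
    ordered = sorted(values)
    k = len(ordered) * total_pct // 200      # observations cut from each tail
    kept = ordered[k:len(ordered) - k]
    return sum(kept) / len(kept)


def winsorized_mean(values, total_pct):
    """Mean after replacing each tail with the nearest retained observation."""
    ordered = sorted(values)
    n = len(ordered)
    k = n * total_pct // 200
    low, high = ordered[k], ordered[n - k - 1]
    clipped = [min(max(v, low), high) for v in ordered]
    return sum(clipped) / n


# Hypothetical data: the integers 1 to 39 plus one extreme value.
data = list(range(1, 40)) + [1000]

plain = sum(data) / len(data)
trimmed = trimmed_mean(data, 5)        # 5% trimmed mean
winsorized = winsorized_mean(data, 5)  # 95% winsorized mean
print(plain, trimmed, winsorized)
```

Both adjusted means land near the middle of the bulk of the data, while the plain arithmetic mean is dragged upward by the single outlier.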

7.2 The Median
Middle item of a set of items that has been sorted into ascending or descending order

For 2, 5, 7, 11, 14 Median = 7

For 3, 9, 10, 20 Median = (9 + 10) /2 = 9.5

Not impacted by extreme values

Does not use all the information
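
Both median examples can be checked with Python's standard library. The third data set, with 14 replaced by 1400, is a hypothetical variation added to show that an extreme value leaves the median unchanged:

```python
from statistics import median

odd_count = median([2, 5, 7, 11, 14])       # middle item of five values
even_count = median([3, 9, 10, 20])         # average of the two middle items

# Hypothetical variation: replace the largest value with an extreme one.
with_extreme = median([2, 5, 7, 11, 1400])

print(odd_count, even_count, with_extreme)
```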

7.3 The Mode
Most frequently occurring value in a distribution

2, 4, 5, 5, 7, 8, 8, 8, 10, 12 Mode = 8

Data sets can have more than one mode (bimodal, trimodal, etc.)

A data set might not have any mode

With continuous data → modal interval
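
A short Python sketch of the mode using the standard library. The first list is the example above; the second list is hypothetical and was chosen to have two modes:

```python
from statistics import mode, multimode

data = [2, 4, 5, 5, 7, 8, 8, 8, 10, 12]
single_mode = mode(data)                 # most frequently occurring value

# Hypothetical bimodal data: 4 and 6 each appear twice.
bimodal = [2, 4, 4, 6, 6, 9]
all_modes = multimode(bimodal)           # every value tied for most frequent

print(single_mode)
print(all_modes)
```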

7.4 Other Concepts of Mean

• Weighted Mean

• Geometric Mean

• Harmonic Mean

Weighted Mean
Different observations are given different proportional influence on the mean

Consider the following portfolio:
Stock A = $40 million
Stock B = $60 million
Stock C = $100 million

If returns were 5% on A, 7% on B and 9% on C, what was the portfolio return?
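
The portfolio return is the weighted mean of the stock returns, with each weight equal to the stock's share of total portfolio value. A minimal Python sketch with illustrative variable names:

```python
# Holdings (in $ millions) and returns from the portfolio above.
holdings = {"A": 40, "B": 60, "C": 100}
returns = {"A": 0.05, "B": 0.07, "C": 0.09}

total_value = sum(holdings.values())
weights = {name: value / total_value for name, value in holdings.items()}

# Weighted mean: sum of weight * return over all holdings.
portfolio_return = sum(weights[name] * returns[name] for name in holdings)
print(weights)
print(portfolio_return)
```

With weights of 20%, 30% and 50%, the result works out to 0.20(5%) + 0.30(7%) + 0.50(9%) = 7.6%.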

Practice Question
A portfolio manager wishes to compute the weighted mean of a portfolio that has the
following asset allocation:
Local Equities: 25%
International Equities: 13%
Bonds: 27%
Mortgage: 18%
Gold: 17%
The returns on the above mentioned assets on December 31, 2012, were 5.4%, 8.9%, -2.5%,
-7%, 11%, respectively. What is the weighted mean for the portfolio?

The Geometric Mean
The geometric mean is frequently used to average rates of change over time
$R_G = \left[(1 + R_1)(1 + R_2) \cdots (1 + R_n)\right]^{1/n} - 1$

The return over the last four periods for a given stock is: 10%, 8%, -5% and 2%. Calculate the GM.

The geometric mean represents the growth rate or compound rate of return on an investment
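
The four-period example can be worked through directly from the formula. A minimal Python sketch (the function name is hypothetical):

```python
import math


def geometric_mean_return(period_returns):
    """Compound the (1 + R) growth factors, take the n-th root, subtract 1."""
    growth = math.prod(1 + r for r in period_returns)
    return growth ** (1 / len(period_returns)) - 1


period_returns = [0.10, 0.08, -0.05, 0.02]
gm = geometric_mean_return(period_returns)
print(f"{gm:.4%}")
```

The product of the growth factors is 1.10 × 1.08 × 0.95 × 1.02 ≈ 1.1512, so the geometric mean return is roughly 3.58% per period.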

Using Geometric and Arithmetic Means
The geometric mean is appropriate to measure past performance over multiple periods.

The portfolio returns for the past two years were 100% in year 1 and -50% in year 2.
What is the geometric mean return? What is the arithmetic mean return?

GM = AM if there is no variability, otherwise GM < AM


As the variability increases the difference increases

The arithmetic return is appropriate for forecasting single period returns.
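
The two-year example shows why the geometric mean is the better description of past multi-period performance. A short Python sketch; the starting value of 1 is an assumption used only to track wealth:

```python
yearly_returns = [1.00, -0.50]    # +100% in year 1, -50% in year 2

arithmetic_mean = sum(yearly_returns) / len(yearly_returns)

# Ending value of 1 unit invested at the start: 1 * 2.0 * 0.5.
ending_value = 1.0
for r in yearly_returns:
    ending_value *= 1 + r

geometric_mean = ending_value ** (1 / len(yearly_returns)) - 1

print(arithmetic_mean)
print(geometric_mean)
print(ending_value)
```

The investment ends where it started, which matches a geometric mean of 0%, even though the arithmetic mean is 25%.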

The Harmonic Mean
The harmonic mean is a special type of weighted mean in which an observation’s weight is inversely
proportional to its magnitude.
The harmonic mean is appropriate for averaging ratios (amount per unit) when ratios are repeatedly
applied to a fixed quantity to yield a variable number of units. A classic example is cost averaging.

An investor purchases $1,000 of a security each month for three months. The share prices are $10, $15 and $20
at the three purchase dates. Calculate the average purchase price per share for the security purchased.
$\bar{X}_H = \frac{n}{\sum_{i=1}^{n} (1/X_i)}$

HM = GM = AM if there is no variability, otherwise HM < GM < AM


As the variability increases the difference increases
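
For the cost averaging example, the average purchase price can be computed two ways: total dollars spent divided by total shares bought, or the harmonic mean of the three prices. A minimal Python sketch showing that the two agree (variable names are illustrative):

```python
from statistics import harmonic_mean

prices = [10, 15, 20]          # share price at each purchase date
monthly_investment = 1000      # dollars invested each month

# Direct calculation: total dollars spent / total shares purchased.
shares_bought = sum(monthly_investment / p for p in prices)
average_cost = monthly_investment * len(prices) / shares_bought

# Harmonic mean of the prices: n / sum(1 / X_i).
hm = harmonic_mean(prices)

print(round(average_cost, 2))
print(round(hm, 2))
```

Both come to about $13.85 per share, below the arithmetic mean price of $15, because more shares are bought when the price is low.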

Which Mean to Use?
• Arithmetic Mean

• Geometric Mean

• Weighted Mean

• Harmonic Mean

• Trimmed Mean

• Winsorized Mean

8. Quantiles

1. Quartiles, Quintiles, Deciles, and Percentiles

2. Quantiles in Investment Practice

8.1 Quartiles, Quintiles, Deciles, and Percentiles
Interquartile range (IQR)

Py is the value at or below which y% of the observations lie

$L_y = (n + 1)\frac{y}{100}$

With a small sample the percentile location calculation is approximate

When Ly is a whole number, the location corresponds to an actual observation

When Ly is not a whole number or integer, Ly lies between the two closest integer
numbers and we use linear interpolation between those two places to determine Py.

8.1 Quartiles, Quintiles, Deciles, and Percentiles

Box and Whisker

Practice Question
Consider the data set:
47 35 37 32 40 39 36 34 35 31 44
Using Ly = (n + 1) (y/100)
Find the 75th percentile point
Find the 1st quartile point
Find the 5th decile point

• First arrange the data in ascending order: 31, 32, 34, 35, 35, 36, 37, 39, 40, 44, 47
• Location of the 75th percentile: L75 = (11 + 1) (75/100) = 9, i.e. the 9th value, so P75 = 40
• Location of the 1st quartile: L25 = (11 + 1) (25/100) = 3, i.e. the 3rd value, so P25 = 34
• Location of the 5th decile: L50 = (11 + 1) (50/100) = 6, i.e. the 6th value, so P50 = 36
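
The location formula and the linear interpolation rule can be sketched in Python as follows. The function name and the handling of locations that fall outside the data (clamping to the smallest or largest value) are assumptions made for this illustration:

```python
def percentile(values, y):
    """y-th percentile using L_y = (n + 1) * y / 100 with linear interpolation."""
    ordered = sorted(values)
    n = len(ordered)
    location = (n + 1) * y / 100      # 1-based position in the sorted data
    lower = int(location)             # whole-number part of the location
    fraction = location - lower       # distance toward the next observation
    if lower < 1:                     # assumption: clamp below the first value
        return ordered[0]
    if lower >= n:                    # assumption: clamp above the last value
        return ordered[-1]
    return ordered[lower - 1] + fraction * (ordered[lower] - ordered[lower - 1])


data = [47, 35, 37, 32, 40, 39, 36, 34, 35, 31, 44]
p75 = percentile(data, 75)   # location 9
p25 = percentile(data, 25)   # location 3
p50 = percentile(data, 50)   # location 6
p80 = percentile(data, 80)   # location 9.6: interpolate between 9th and 10th values
print(p75, p25, p50, p80)
```

The 80th percentile is included to show the interpolation case: it lies 60% of the way from the 9th value (40) to the 10th value (44).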

8.2 Quantiles in Investment Practice

• Portfolio performance evaluation: The performance of investment managers is often evaluated in terms of the percentile or quartile in which they fall relative to the performance of their peers.

• Investment research: For example, companies can be ranked based on their market capitalization
and sorted into deciles. The first decile contains companies with smallest market values and the
tenth decile contains companies with the largest market values. Such a classification allows analysts
to compare the performance of small companies with large ones.

9. Measures of Dispersion

1. The Range

2. The Mean Absolute Deviation

3. Sample Variance and Sample Standard Deviation

9.1 The Range
Range: difference between the maximum and minimum values in a data set

Range = Max value – Min value

Range is from <max value> to <min value>

Annual returns data: 10%, -5%, 10%, 25%. What is the range?

Advantage: easy to calculate

Disadvantage: only considers two data points

9.2 The Mean Absolute Deviation
Mean Absolute Deviation (MAD): average of the absolute values of deviations from the mean

$\text{MAD} = \frac{\sum_{i=1}^{n} |X_i - \bar{X}|}{n}$

Data: 8, 12, 10, 8 and 5. What is the MAD?

$\bar{X} = (8 + 12 + 10 + 8 + 5) / 5 = 8.6$

$\text{MAD} = \frac{|8 - 8.6| + |12 - 8.6| + |10 - 8.6| + |8 - 8.6| + |5 - 8.6|}{5}$

$\text{MAD} = \frac{0.6 + 3.4 + 1.4 + 0.6 + 3.6}{5} = 1.92$
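
The same MAD calculation as a minimal Python sketch (variable names are illustrative):

```python
data = [8, 12, 10, 8, 5]

mean = sum(data) / len(data)                        # 8.6
absolute_deviations = [abs(x - mean) for x in data]
mad = sum(absolute_deviations) / len(data)          # average absolute deviation

print(round(mad, 2))
```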

9.3 Sample Variance and Standard Deviation
Variance is defined as the average of the squared deviations around the mean

Sample variance applies when we are dealing with a subset, or sample, of the total population
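$s^2 = \frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n - 1}$

Relationship between the geometric and arithmetic means: $\bar{X}_G \approx \bar{X} - \frac{s^2}{2}$

Data: 8, 12, 10, 8 and 5. What is the sample variance?

$s^2 = \frac{(8 - 8.6)^2 + (12 - 8.6)^2 + (10 - 8.6)^2 + (8 - 8.6)^2 + (5 - 8.6)^2}{5 - 1}$

$s^2 = 27.2 / 4 = 6.80$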

Sample standard deviation is defined as the positive square root of the sample variance
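
Python's standard library distinguishes the sample and population versions, which makes the role of the n − 1 divisor easy to see. A minimal sketch using the data above:

```python
from statistics import pvariance, stdev, variance

data = [8, 12, 10, 8, 5]

sample_variance = variance(data)        # squared deviations divided by n - 1
sample_std_dev = stdev(data)            # positive square root of the sample variance
population_variance = pvariance(data)   # squared deviations divided by n

print(round(sample_variance, 2))
print(round(sample_std_dev, 4))
print(round(population_variance, 2))
```

The sum of squared deviations is 27.2, so the sample variance is 27.2 / 4 = 6.80 while the population variance would be 27.2 / 5 = 5.44.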

Using the Calculator
Annual returns data: 10%, -5%, 10%, 25%. What are the population and sample standard deviations?

Keystrokes                  Explanation                          Display
[2nd] [DATA]                Enter data entry mode
[2nd] [CLR WRK]             Clear data registers                 X01
10 [ENTER]                                                       X01 = 10
[↓] [↓] 5+/- [ENTER]                                             X02 = -5
[↓] [↓] 10 [ENTER]                                               X03 = 10
[↓] [↓] 25 [ENTER]                                               X04 = 25
[2nd] [STAT] [ENTER]        Puts calculator into stats mode.
[2nd] [SET]                 Press repeatedly till you see        1-V
[↓]                         Number of data points                N = 4
[↓]                         Mean                                 X̄ = 10
[↓]                         Sample standard deviation            Sx = 12.25
[↓]                         Population standard deviation        σx = 10.61

Practice Problem
The dividend yield for five hypothetical companies from a list of ten companies is given below. What is
the sample variance?
Paknama 10.50%
Genie Ltd. 16.25%
Mirinda Corp. 27.00%
Tina Travels Ltd. 12.00%
Thomas Press Ltd. 7.80%

10. Downside Deviation
Target downside deviation (target semideviation) is a measure of the risk of being below a given target
It is calculated as the square root of the average squared deviations from the target

$s_{\text{Target}} = \sqrt{\frac{\sum_{\text{for all } X_i \le B} (X_i - B)^2}{n - 1}}$

Month   Observation   Deviation from    Deviation below   Squared deviations
                      the 4% target     the target        below the target
Jan          6              2                 -                   -
Feb          4              0                 -                   -
Mar         -2             -6                -6                  36
Apr         -5             -9                -9                  81
May          5              1                 -                   -
Jun          2             -2                -2                   4
Jul          1             -3                -3                   9
Aug          0             -4                -4                  16
Sep          4              0                 -                   -
Oct          3             -1                -1                   1
Nov          0             -4                -4                  16
Dec          2             -2                -2                   4
Sum                                                             167

Target semideviation = $\sqrt{167 / 11}$ = 3.8964%
Target downside deviation < standard deviation
As the target is increased the target downside deviation will increase
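
The target semideviation in the table can be reproduced with a few lines of Python. This is a minimal sketch with illustrative variable names; note that the divisor is n − 1 for the full sample of 12 months, not the number of months below target:

```python
import math

# Monthly observations (in %) from the table above, January to December.
observations = [6, 4, -2, -5, 5, 2, 1, 0, 4, 3, 0, 2]
target = 4   # the 4% target, B in the formula

# Keep only observations at or below the target and square their shortfalls.
squared_shortfalls = [(x - target) ** 2 for x in observations if x <= target]
sum_of_squares = sum(squared_shortfalls)

target_semideviation = math.sqrt(sum_of_squares / (len(observations) - 1))

print(sum_of_squares)
print(round(target_semideviation, 4))
```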

10.1 Coefficient of Variation
Coefficient of variation expresses how much dispersion exists relative to the mean of a distribution
$\text{CV} = \frac{s}{\bar{X}}$

Example: Investment A has a mean return of 7% and a std dev of 0.05. Investment B has a mean return of 12% and a std dev of 0.07. Which is riskier?

Allows for direct comparison of dispersion across different data sets


Used in investment analysis to compare relative risks
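
Applying the formula to the two investments in the example above, as a minimal Python sketch (variable names are illustrative):

```python
# Mean return and standard deviation for each investment, as decimals.
investments = {
    "A": {"mean": 0.07, "std_dev": 0.05},
    "B": {"mean": 0.12, "std_dev": 0.07},
}

# CV = standard deviation / mean: dispersion per unit of mean return.
cv = {name: stats["std_dev"] / stats["mean"] for name, stats in investments.items()}
riskier = max(cv, key=cv.get)

for name, value in cv.items():
    print(name, round(value, 3))
print("Higher risk per unit of return:", riskier)
```

Investment A has the lower standard deviation but the higher coefficient of variation (about 0.71 versus 0.58), so it carries more risk per unit of mean return.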

Practice Problem
The table below provides data for three securities. Which security has the lowest risk per unit of return?

Asset   Arithmetic mean return (%)   Standard deviation of return (%)
A                16.4%                          4.9%
B                12.6%                          3.5%
C                14.8%                          4.2%

Answer: B

11. The Shape of the Distributions
Normal distribution
• Completely described by the mean and variance
• Mode = Median = Mean
• Skewness = 0

Positively skewed unimodal distribution
• Limited but frequent downside returns
• Unlimited but less frequent upside returns
• Mode < Median < Mean
• Skewness > 0
• Investors prefer positive skewness

Negatively skewed unimodal distribution
• Limited but frequent upside returns
• Unlimited but less frequent downside returns
• Mean < Median < Mode
• Skewness < 0

Practice Problem
Which of the following distributions is most likely characterized by frequent small
losses and a few extreme gains?
A. Normal distribution
B. Negatively skewed
C. Positively skewed

Answer: C

A positively skewed distribution has frequent small losses and a few extreme gains. A negatively
skewed distribution has frequent small gains and a few extreme losses. A normal distribution is
symmetrical.

Practice Problem
Which of the following is most likely to be true for a negatively skewed distribution?
A. Mean < median < mode
B. Mode < median < mean
C. Median < mean < mode

Answer: A

For a negatively skewed distribution, the mean is less than the median, which is less than the
mode.

11.1 The Shape of the Distributions: Kurtosis
Kurtosis: measure of combined weight of tails relative to the rest of the distribution

• Mesokurtic

• Leptokurtic

• Platykurtic

Excess Kurtosis

Interpretation of Skewness and Kurtosis

12. Correlation between Two Variables
Covariance is a measure of how two variables move together
The formula for computing the sample covariance of X and Y is:
$s_{XY} = \frac{\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}{n - 1}$

Correlation is a standardized measure of the linear relationship between two variables


Range is between -1 and +1.

$r_{XY} = \frac{s_{XY}}{s_X \, s_Y}$
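
The two formulas can be applied step by step. The sketch below is a minimal Python illustration; the paired observations are hypothetical and the function names are not from the source:

```python
import math


def sample_covariance(xs, ys):
    """Sum of products of deviations from the means, divided by n - 1."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)


def sample_correlation(xs, ys):
    """Covariance standardized by the two sample standard deviations."""
    s_x = math.sqrt(sample_covariance(xs, xs))   # covariance of X with itself is its variance
    s_y = math.sqrt(sample_covariance(ys, ys))
    return sample_covariance(xs, ys) / (s_x * s_y)


# Hypothetical paired observations.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

s_xy = sample_covariance(x, y)
r_xy = sample_correlation(x, y)
print(s_xy)
print(round(r_xy, 4))
```

Covariance depends on the units of the two variables, while the correlation is unit-free and always falls between -1 and +1.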

1. Properties of Correlation

2. Limitations of Correlation Analysis

12.1 Properties of Correlation
1. Correlation ranges from -1 to +1.

2. A correlation of 0 → absence of any linear relationship

3. A correlation of +1 → perfect positive linear relationship

4. A correlation of -1 → perfect negative linear relationship

Interpreting the Correlation Coefficient

Covariance

Correlation

12.2 Limitations of Correlation Analysis
• Two variables can have a strong non-linear relation and still have a very low correlation

• Correlation can be unreliable when outliers are present

• Correlation does not imply causation

• Correlation does not tell the whole story

• Correlation may be spurious

Spurious correlation refers to the following situations:
• The correlation between two variables that reflects chance relationships in a particular data set.
• The correlation induced by a calculation that mixes each of two variables with a third variable.
• The correlation between two variables arising not from a direct relation between them, but from their relation to a third variable.

Summary
• Data types
• Organizing data for quantitative analysis
• Frequency distributions
• Contingency tables
• Data visualization
• Guide to selecting among visualization types
• Measures of central tendency
• Other concepts of mean
• Other measures of location: quantiles
• Measures of dispersion
• Target downside deviation
• Skewness
• Kurtosis
• Correlation between two variables

Conclusion

• Learning Outcomes
• IFT Notes
• IFT Qbank
• Curriculum Practice Problems
• Curriculum Examples

