www.ift.world
Graphs, charts, tables, examples, and figures are copyright 2022, CFA Institute.
Reproduced and republished with permission from CFA Institute. All rights reserved.
Contents and Introduction
1. Introduction
2. Data Types
3. Organizing Data for Quantitative Analysis
4. Summarizing Data Using Frequency Distribution
5. Summarizing Data Using a Contingency Table
6. Data Visualization
7. Measures of Central Tendency
8. Quantiles
9. Measures of Dispersion
10. Downside Deviation and Coefficient of Variation
11. The Shape of Distributions
12. Correlation Between Two Variables
2. Data Types
What is data?
• Collection of numbers, characters, words and text to represent facts or information
• Can be in the form of images, audio, and video
• Can be in raw or organized format
2.1 Numerical Versus Categorical Data
• Numerical (quantitative) data
Continuous data
Discrete data
Practice Question
Identify the data type for each of the following items:
2.2 Cross-Sectional versus Time-Series versus Panel Data
This classification is based on how data are collected
What is a variable (field, attribute, feature)?
• Characteristic or quantity that can be measured, counted, or categorized
• Subject to change
• Observation: value of a specific variable collected at a point in time or
over a specified period of time
• Time-series data
• Cross-sectional data
• Panel data
2.3 Structured versus Unstructured Data
• Structured data
• Unstructured data
3. Organizing Data for Quantitative Analysis
• One-dimensional array
4. Summarizing Data Using Frequency Distribution (1/2)
Frequency distribution (one-way table)
Constructing a frequency distribution of a categorical variable:
1. Count the number of observations for each unique value of the variable
2. Construct a table listing each unique value and the corresponding counts
3. Sort the records by number of counts in descending or ascending order
Summary data for 200 companies across four sectors:
Sector        Absolute Frequency   Relative Frequency
Technology    22                   11%
Healthcare    50                   25%
Financial     58                   29%
Industrial    70                   35%
Total         200                  100%
4. Summarizing Data Using Frequency Distribution (2/2)
Constructing a frequency distribution for numerical data:
1. Sort the data in ascending order.
2. Calculate the range of the data.
3. Decide on the number of bins (k).
4. Determine bin width.
5. Determine bins.
6. Determine the number of observations in each bin.
7. Construct a table of the bins listed from smallest to largest.
Summary data for 100 stocks with prices ranging between 45.00 and 65.00:
Stock Price Range (Min – Max)   Absolute Frequency   Cumulative Frequency   Relative Frequency   Cumulative Relative Frequency
45.00 – 50.00                   25                   25                     0.25                 0.25
50.00 – 55.00                   35                   60                     0.35                 0.60
55.00 – 60.00                   29                   89                     0.29                 0.89
60.00 – 65.00                   11                   100                    0.11                 1.00
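The seven steps for numerical data can be sketched in Python. This is a minimal illustration, not the curriculum's own procedure: the function name, the equal-width binning rule, and the short `prices` list are assumptions made for the example (the 100 stock prices behind the table are not given). Each bin includes its lower limit and excludes its upper limit, except the last bin, which also includes the maximum.

```python
def frequency_distribution(data, k):
    """Tabulate k equal-width bins; assumes the data are not all identical."""
    values = sorted(data)                      # step 1: sort ascending
    lo, hi = values[0], values[-1]
    width = (hi - lo) / k                      # steps 2-4: range divided by k
    counts = [0] * k
    for v in values:
        # steps 5-6: find the bin; the maximum is placed in the last bin
        idx = min(int((v - lo) / width), k - 1)
        counts[idx] += 1
    n = len(values)
    rows, cumulative = [], 0
    for i, count in enumerate(counts):         # step 7: smallest to largest
        cumulative += count
        rows.append((lo + i * width, lo + (i + 1) * width,
                     count, cumulative, count / n, cumulative / n))
    return rows

# Hypothetical prices, used only to exercise the function.
prices = [45.0, 46.5, 48.0, 51.0, 52.5, 54.0, 56.0, 58.5, 61.0, 65.0]
table = frequency_distribution(prices, k=4)
for lower, upper, absolute, cum_abs, relative, cum_rel in table:
    print(f"{lower:.2f} – {upper:.2f}  {absolute}  {cum_abs}  {relative:.2f}  {cum_rel:.2f}")
```

Because every observation lands in exactly one bin, the last cumulative frequency equals the number of observations and the last cumulative relative frequency equals 1.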
Practice Question
The actual number of observations in a given interval is called the:
A. Absolute frequency
B. Relative frequency
C. Cumulative absolute frequency
Answer: A
The actual number of observations in a given interval is known as absolute frequency. Relative
frequency is the absolute frequency of each interval divided by the total number of observations.
Cumulative absolute frequency is the running total of all absolute frequencies.
Practice Question
Which of the following is most likely to be accurate?
A. An observation can fall in more than one interval.
B. The data is sorted in a descending order for the construction of a frequency
distribution.
C. The cumulative relative frequency tells the observer the fraction of the
observations that are less than the upper limit of each interval.
Answer: C
The cumulative relative frequency tells the observer the fraction of the observations that are less
than the upper limit of each interval. An observation cannot fall in more than one interval. The
data is sorted in an ascending order for the construction of a frequency distribution.
5. Summarizing Data Using a Contingency Table
A contingency table is a tabular format that displays the frequency distributions of two or more
categorical variables simultaneously.
Joint frequency
Marginal frequency
Contingency Table Applications
Evaluate the performance of a classification model
Suppose we have a model for classifying companies into two groups:
1. those that default on their bond payments
2. those that do not default
The table shows the confusion matrix for a sample of 1,000 non-investment-grade bonds.
Predicted Default   Actual Default: Yes   Actual Default: No   Total
Yes                 150                   10                   160
No                  6                     834                  840
Total               156                   844                  1,000
Example
Suppose we randomly pick 200 mutual funds and classify them based on two parameters:
Fund style – Growth versus value
Risk level – Low risk versus high risk
          Low Risk   High Risk
Growth    67         19
Value     98         16
Describe how the contingency table is used to set up a test for independence between fund style and risk level.
Add the marginal frequencies and overall total to the contingency table.
Observed Values
          Low Risk   High Risk   Total
Growth    67         19          86
Value     98         16          114
Total     165        35          200
Observed Values (as a percentage of each row total)
          Low Risk   High Risk   Total
Growth    78%        22%         100%
Value     86%        14%         100%
Use the marginal frequencies to construct a table with expected values of the observations.
Expected Values
          Low Risk   High Risk   Total
Growth    70.95      15.05       86
Value     94.05      19.95       114
Total     165        35          200
Expected Values (as a percentage of each row total)
          Low Risk   High Risk   Total
Growth    82.5%      17.5%       100%
Value     82.5%      17.5%       100%
The actual values and the expected values are used to derive the chi-square test statistic. This is then compared to a value from the chi-square distribution table for a given level of significance. If the test statistic is greater than the chi-square distribution value, then we can conclude that there is a significant association between the categorical variables.
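The calculation for the mutual fund example can be sketched in Python. The function names are illustrative, and the 5% significance level is an assumption, since the slide leaves the level open. For a 2 × 2 table the degrees of freedom are (2 − 1)(2 − 1) = 1, and the corresponding 5% critical value is 3.841.

```python
def expected_counts(observed):
    """Expected count per cell = row total * column total / overall total."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    total = sum(row_totals)
    return [[r * c / total for c in col_totals] for r in row_totals]

def chi_square_statistic(observed):
    """Sum of (observed - expected)^2 / expected over all cells."""
    expected = expected_counts(observed)
    return sum((o - e) ** 2 / e
               for obs_row, exp_row in zip(observed, expected)
               for o, e in zip(obs_row, exp_row))

# Rows: Growth, Value. Columns: Low Risk, High Risk.
observed = [[67, 19], [98, 16]]
expected = expected_counts(observed)      # [[70.95, 15.05], [94.05, 19.95]]
chi2 = chi_square_statistic(observed)

CRITICAL_5PCT_DF1 = 3.841                 # assumed 5% level, 1 degree of freedom
significant = chi2 > CRITICAL_5PCT_DF1
print(f"chi-square = {chi2:.4f}, significant at 5%: {significant}")
```

With these counts the statistic is roughly 2.20, which is below 3.841, so at the assumed 5% level the data do not show a significant association between fund style and risk level.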
6. Data Visualization
Visualization is the presentation of data in a pictorial or graphical format for the
purpose of increasing understanding and for gaining insights into the data
1. Histogram and Frequency Polygon
2. Bar Chart
3. Tree-Map
4. Word Cloud
5. Line Chart
6. Scatter Plot
7. Heat Map
8. Guide to Selecting among Visualization Types
6.1 Histogram and Frequency Polygon
Histogram: Presents distribution of numerical data by using height of column to represent absolute frequency; shows where most of the data lies
Frequency polygon: Plots the midpoints of each interval on the X-axis and the absolute frequency of that interval on the Y-axis
Cumulative frequency distribution: Plots the cumulative frequency or cumulative relative frequency against the upper interval limit; shows how many or what percent of the observations lie below a certain value
Price Range     # Stocks
46.00 – 51.00   20
51.00 – 56.00   60
56.00 – 61.00   100
61.00 – 65.00   20
[Figures: Histogram, Frequency Polygon, and Cumulative Frequency Distribution of the price ranges above]
Slope interpretation
Practice Question
Which of the following statements is most likely to be inaccurate about histograms?
A. A histogram is the graphical equivalent of a frequency distribution.
B. A histogram is a form of a bar chart.
C. In a histogram, the height represents the relative frequency for each interval.
Answer: C
In a histogram, the height represents the absolute frequency for each interval.
6.2 Bar Chart (1/3)
A bar chart is used to plot the frequency distribution of categorical data
Each bar represents a distinct category
Bar’s height (or length) proportional to frequency
A grouped bar chart (also called a clustered bar chart) can be used to show the frequency distribution of multiple categorical variables simultaneously
6.2 Bar Chart (2/3)
A stacked bar chart is an alternative form for presenting the frequency distribution of multiple categorical variables
simultaneously.
6.2 Bar Chart (3/3)
6.3 Tree-Map
A tree-map is a graphical tool to display categorical data
6.4 Word Cloud
A word cloud is a visual device for representing textual data
The size of each distinct word is proportional to the frequency with which it appears in the given text
Graphs, charts, tables, examples, and figures are copyright 2019, CFA Institute.
Reproduced and republished with permission from CFA Institute. All rights reserved.
6.5 Line Chart
A line chart is a type of graph used to visualize ordered
observations
• Often used to display the change of data series over
time
• Can plot more than one set of data points, which
helps in making comparisons
6.6 Scatter Plot (1/2)
A scatter plot helps visualize the joint variation in two
numerical variables.
• Constructed by drawing dots to indicate the
values of the two variables plotted against the
corresponding axes.
• Dots are drawn to indicate the values of the two
variables at different points in time.
6.6 Scatter Plot (2/2)
A scatter plot matrix organizes scatter
plots between pairs of variables into a
matrix format.
6.7 Heat Map
A heat map organizes and summarizes data in a tabular format and represents it using a color
spectrum.
Uses:
Display frequency distributions
Visualize the degree of correlation among different variables
6.8 Guide to Selecting among Visualization Types
Select visualization technique based on purpose:
1) exploring/presenting relationships 2) exploring/presenting distributions 3) making comparisons
Common Pitfalls
• Improper chart type
7. Measures of Central Tendency
1. The Arithmetic Mean
2. The Median
3. The Mode
Population: all members of a specified group
A ‘parameter’ describes the characteristics of a population
Sample: a subset drawn from a population
A ‘sample statistic’ describes the characteristic of a sample
7.1 The Arithmetic Mean
Arithmetic mean: sum of the observation values divided by the number of observations
Sample mean: $\bar{X} = \frac{\sum_{i=1}^{n} X_i}{n}$
The arithmetic mean is very sensitive to outliers (extreme high or low values)
Dealing with Outliers
Option 1: Do nothing
7.2 The Median
Middle item of a set of items that has been sorted into ascending or descending order
7.3 The Mode
Most frequently occurring value in a distribution
2, 4, 5, 5, 7, 8, 8, 8, 10, 12 Mode = 8
Data sets can have more than one mode (bimodal, trimodal, etc)
7.4 Other Concepts of Mean
• Weighted Mean
• Geometric Mean
• Harmonic Mean
Weighted Mean
Different observations are given different
proportional influence on the mean
Practice Question
A portfolio manager wishes to compute the weighted mean of a portfolio that has the
following asset allocation:
Local Equities: 25%
International Equities: 13%
Bonds: 27%
Mortgage: 18%
Gold: 17%
The returns on the above mentioned assets on December 31, 2012, were 5.4%, 8.9%, -2.5%,
-7%, 11%, respectively. What is the weighted mean for the portfolio?
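One way to work the question is to multiply each asset's return by its portfolio weight and sum the products. The sketch below uses illustrative names and checks that the weights sum to one before averaging.

```python
def weighted_mean(values, weights):
    """Sum of weight * value; the weights are expected to sum to 1."""
    if abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")
    return sum(w * x for w, x in zip(weights, values))

# Order: Local Equities, International Equities, Bonds, Mortgage, Gold
weights = [0.25, 0.13, 0.27, 0.18, 0.17]
returns_pct = [5.4, 8.9, -2.5, -7.0, 11.0]   # returns in percent

portfolio_return = weighted_mean(returns_pct, weights)
print(f"Weighted mean return: {portfolio_return:.3f}%")
```

The products are 1.350, 1.157, −0.675, −1.260 and 1.870, which sum to a weighted mean return of 2.442%.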
The Geometric Mean
The geometric mean is frequently used to average rates of change over time
$R_G = \left[(1 + R_1)(1 + R_2) \cdots (1 + R_n)\right]^{1/n} - 1$
The return over the last four periods for a given stock is: 10%, 8%, -5% and 2%. Calculate the GM.
The geometric mean represents the growth rate or compound rate of return on an investment
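The four-period example can be checked with a short Python sketch. The function name is illustrative, and returns are entered as decimals (10% as 0.10).

```python
import math

def geometric_mean_return(returns):
    """Compound the (1 + R) factors, take the n-th root, subtract 1."""
    growth = math.prod(1 + r for r in returns)
    return growth ** (1 / len(returns)) - 1

period_returns = [0.10, 0.08, -0.05, 0.02]
gm = geometric_mean_return(period_returns)
print(f"Geometric mean return: {gm:.2%}")
```

The compounded growth factor is 1.10 × 1.08 × 0.95 × 1.02 = 1.151172, and its fourth root less one gives a geometric mean of approximately 3.58% per period.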
Using Geometric and Arithmetic Means
The geometric mean is appropriate to measure past performance over multiple periods.
The portfolio returns for the past two years were 100% in year 1 and -50% in year 2.
What is the geometric mean return? What is the arithmetic mean return?
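A small sketch makes the contrast concrete. The names and the starting value of 100 are assumptions for illustration only.

```python
import math

yearly_returns = [1.00, -0.50]    # +100% in year 1, -50% in year 2

arithmetic = sum(yearly_returns) / len(yearly_returns)
geometric = math.prod(1 + r for r in yearly_returns) ** (1 / len(yearly_returns)) - 1

# Hypothetical starting value, used only to show the ending wealth.
start_value = 100.0
end_value = start_value * math.prod(1 + r for r in yearly_returns)

print(f"Arithmetic mean: {arithmetic:.0%}")
print(f"Geometric mean:  {geometric:.0%}")
print(f"Ending value of {start_value:.0f}: {end_value:.0f}")
```

The arithmetic mean is 25%, yet the portfolio ends where it started, which is what the geometric mean of 0% reports. This is why the geometric mean is the better description of past multi-period performance.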
The Harmonic Mean
The harmonic mean is a special type of weighted mean in which an observation’s weight is inversely
proportional to its magnitude.
The harmonic mean is appropriate for averaging ratios (amount per unit) when ratios are repeatedly
applied to a fixed quantity to yield a variable number of units. A classic example is cost averaging.
An investor purchases $1,000 of a security each month for three months. The share prices are $10, $15 and $20
at the three purchase dates. Calculate the average purchase price per share for the security purchased.
$\bar{X}_H = \frac{n}{\sum_{i=1}^{n} \frac{1}{X_i}}$
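The cost-averaging example can be verified two ways: with the harmonic mean formula, and by counting the shares actually bought. The sketch below uses illustrative names.

```python
def harmonic_mean(values):
    """n divided by the sum of reciprocals."""
    return len(values) / sum(1 / x for x in values)

share_prices = [10, 15, 20]
monthly_investment = 1000

avg_price = harmonic_mean(share_prices)

# Cross-check: total dollars invested divided by total shares purchased.
total_shares = sum(monthly_investment / p for p in share_prices)
avg_cost = monthly_investment * len(share_prices) / total_shares

print(f"Harmonic mean price:    {avg_price:.2f}")
print(f"Average cost per share: {avg_cost:.2f}")
```

Both routes give about $13.85 per share, below the arithmetic mean price of $15, because a fixed dollar amount buys more shares when the price is low.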
Which Mean to Use?
• Arithmetic Mean
• Geometric Mean
• Weighted Mean
• Harmonic Mean
• Trimmed Mean
• Winsorized Mean
8. Quantiles
8.1 Quartiles, Quintiles, Deciles, and Percentiles
Interquartile range (IQR)
$L_y = (n + 1)\,\frac{y}{100}$
When Ly is not a whole number or integer, Ly lies between the two closest integer
numbers and we use linear interpolation between those two places to determine Py.
Box and Whisker
Practice Question
Consider the data set:
47 35 37 32 40 39 36 34 35 31 44
Using Ly = (n + 1) (y/100)
Find the 75th percentile point
Find the 1st quartile point
Find the 5th decile point
• First arrange the data in ascending order: 31, 32, 34, 35, 35, 36, 37, 39, 40, 44, 47
• Location of the 75th percentile: L75 = (11 + 1) (75/100) = 9th value. i.e. P75 = 40
• Location of the 1st quartile: L25 = (11 + 1) (25/100) = 3rd value. i.e. P25 = 34
• Location of the 5th decile: L50 = (11 + 1) (50/100) = 6th value. i.e. P50 = 36
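The location formula, together with linear interpolation for non-integer locations, can be sketched as follows. The function name is illustrative, and returning the smallest or largest observation when the location falls outside 1..n is an assumption of this sketch rather than something stated on the slides.

```python
def percentile(data, y):
    """y-th percentile using L_y = (n + 1) * y / 100 with linear interpolation."""
    values = sorted(data)
    n = len(values)
    location = (n + 1) * y / 100
    lower = int(location)            # position of the value just below L_y (1-based)
    fraction = location - lower
    if lower < 1:                    # assumption: clamp to the minimum
        return values[0]
    if lower >= n:                   # assumption: clamp to the maximum
        return values[-1]
    return values[lower - 1] + fraction * (values[lower] - values[lower - 1])

data = [47, 35, 37, 32, 40, 39, 36, 34, 35, 31, 44]
p75 = percentile(data, 75)   # L = 9  -> 9th value
p25 = percentile(data, 25)   # L = 3  -> 3rd value
p50 = percentile(data, 50)   # L = 6  -> 6th value
print(p75, p25, p50)
```

In this data set all three locations are whole numbers, so no interpolation is needed. For a location such as 9.5, the function would return the value halfway between the 9th and 10th observations.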
8.2 Quantiles in Investment Practice
• Investment research: For example, companies can be ranked based on their market capitalization
and sorted into deciles. The first decile contains companies with smallest market values and the
tenth decile contains companies with the largest market values. Such a classification allows analysts
to compare the performance of small companies with large ones.
9. Measures of Dispersion
1. The Range
9.1 The Range
Range: difference between the maximum and minimum values in a data set
Annual returns data: 10%, -5%, 10%, 25%. What is the range?
9.2 The Mean Absolute Deviation
Mean Absolute Deviation (MAD): average of the absolute values of deviations from the mean
$\text{MAD} = \frac{\sum_{i=1}^{n} |X_i - \bar{X}|}{n}$
$\bar{X} = (8 + 12 + 10 + 8 + 5) / 5 = 8.6$
$\text{MAD} = \frac{|8 - 8.6| + |12 - 8.6| + |10 - 8.6| + |8 - 8.6| + |5 - 8.6|}{5}$
$\text{MAD} = \frac{0.6 + 3.4 + 1.4 + 0.6 + 3.6}{5} = 1.92$
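The same MAD calculation as a short Python sketch (the function name is illustrative):

```python
def mean_absolute_deviation(data):
    """Average absolute distance of each observation from the mean."""
    mean = sum(data) / len(data)
    return sum(abs(x - mean) for x in data) / len(data)

observations = [8, 12, 10, 8, 5]
mad = mean_absolute_deviation(observations)
print(f"MAD = {mad:.2f}")
```

Absolute values are needed because the signed deviations from the mean always sum to zero.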
9.3 Sample Variance and Standard Deviation
Variance is defined as the average of the squared deviations around the mean
Sample variance applies when we are dealing with a subset, or sample, of the total population
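$s^2 = \frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n - 1}$
Relationship between the geometric and arithmetic means: $\bar{X}_G \approx \bar{X} - \frac{s^2}{2}$
Data: 8, 12, 10, 8 and 5. What is the sample variance?
$s^2 = 27.2 / 4 = 6.80$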
Sample standard deviation is defined as the positive square root of the sample variance
Using the Calculator
Annual returns data: 10%, -5%, 10%, 25%. What is the population and sample standard deviation?
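As an alternative to the calculator keystrokes, Python's standard `statistics` module gives both figures. `pstdev` divides the sum of squared deviations by n (population), while `stdev` divides by n − 1 (sample). Returns are entered in percent.

```python
import statistics

annual_returns = [10, -5, 10, 25]   # percent

mean_return = statistics.mean(annual_returns)
population_sd = statistics.pstdev(annual_returns)   # divisor n
sample_sd = statistics.stdev(annual_returns)        # divisor n - 1

print(f"Mean:                  {mean_return:.2f}%")
print(f"Population std. dev.:  {population_sd:.4f}%")
print(f"Sample std. dev.:      {sample_sd:.4f}%")
```

The mean is 10% and the squared deviations sum to 450, so the population standard deviation is √(450/4) ≈ 10.61% and the sample standard deviation is √(450/3) ≈ 12.25%. The sample figure is always the larger of the two because of the smaller divisor.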
Practice Problem
The dividend yield for five hypothetical companies from a list of ten companies is given below. What is
the sample variance?
Paknama 10.50%
Genie Ltd. 16.25%
Mirinda Corp. 27.00%
Tina Travels Ltd. 12.00%
Thomas Press Ltd. 7.80%
10. Downside Deviation
Target downside deviation (target semideviation) is a measure of the risk of being below a given target
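It is calculated as the square root of the average squared deviations from the target, using only the observations at or below the target
Target semideviation: $s_{\text{Target}} = \sqrt{\frac{\sum_{X_i \le B} (X_i - B)^2}{n - 1}}$, where B is the target
Example: $\sqrt{167 / 11} = 3.8964\%$
Target downside deviation < standard deviation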
As the target is increased the target downside deviation will increase
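A minimal sketch of the calculation follows. The function name and the five returns are hypothetical (the observations behind the 3.8964% figure are not reproduced on the slide). Only observations at or below the target contribute to the sum, while the divisor stays at n − 1 for the full sample.

```python
import math

def target_downside_deviation(returns, target):
    """Square root of summed squared shortfalls below target, divided by n - 1."""
    squared_shortfalls = [(r - target) ** 2 for r in returns if r <= target]
    return math.sqrt(sum(squared_shortfalls) / (len(returns) - 1))

# Hypothetical returns in percent, used only for illustration.
sample_returns = [5, -2, 3, -4, 6]

tdd_low_target = target_downside_deviation(sample_returns, target=0)
tdd_high_target = target_downside_deviation(sample_returns, target=4)

print(f"Target 0%: {tdd_low_target:.4f}%")
print(f"Target 4%: {tdd_high_target:.4f}%")
```

Raising the target from 0% to 4% brings more observations below the target and enlarges each shortfall, so the measure rises (from about 2.24% to about 5.02% in this made-up sample).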
10.1 Coefficient of Variation
Coefficient of variation expresses how much dispersion exists relative to the mean of a distribution
$CV = \frac{s}{\bar{X}}$
Practice Problem
The table below provides data for three securities. Which security has the lowest risk per unit of return?
Answer: B
11. The Shape of the Distributions
Normal distribution
• Completely described by the mean and variance
• Mode = Median = Mean
• Skewness = 0
Practice Problem
Which of the following distributions is most likely characterized by frequent small
losses and a few extreme gains?
A. Normal distribution
B. Negatively skewed
C. Positively skewed
Answer: C
A positively skewed distribution has frequent small losses and a few extreme gains. A negatively
skewed distribution has frequent small gains and a few extreme losses. A normal distribution is
symmetrical.
Practice Problem
Which of the following is most likely to be true for a negatively skewed distribution?
A. Mean < median < mode
B. Mode < median < mean
C. Median < mean < mode
Answer: A
For a negatively skewed distribution, the mean is less than the median, which is less than the
mode.
11.1 The Shape of the Distributions: Kurtosis
Kurtosis: measure of combined weight of tails relative to the rest of the distribution
• Mesokurtic
• Leptokurtic
• Platykurtic
Excess Kurtosis
Interpretation of Skewness and Kurtosis
12. Correlation between Two Variables
Covariance is a measure of how two variables move together
The formula for computing the sample covariance of X and Y is:
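$s_{XY} = \frac{\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}{n - 1}$
The sample correlation coefficient is the sample covariance divided by the product of the two sample standard deviations:
$r_{XY} = \frac{s_{XY}}{s_X \, s_Y}$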
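Both formulas can be sketched in a few lines of Python. The function names and the two short data series are hypothetical, chosen so that Y is exactly twice X.

```python
import math

def sample_covariance(x, y):
    """Sum of cross-products of deviations from the means, divided by n - 1."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    return sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / (n - 1)

def sample_correlation(x, y):
    """Covariance scaled by the product of the sample standard deviations."""
    s_x = math.sqrt(sample_covariance(x, x))   # covariance of x with itself = variance
    s_y = math.sqrt(sample_covariance(y, y))
    return sample_covariance(x, y) / (s_x * s_y)

# Hypothetical data with an exact linear relation, Y = 2X.
x = [1, 2, 3, 4, 5]
y = [2, 4, 6, 8, 10]

cov_xy = sample_covariance(x, y)
r_xy = sample_correlation(x, y)
print(f"Covariance: {cov_xy:.2f}, correlation: {r_xy:.2f}")
```

The covariance of 5 depends on the units of X and Y, whereas the correlation of +1 is unit-free, which is why correlation is easier to interpret.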
1. Properties of Correlation
12.1 Properties of Correlation
1. Correlation ranges from -1 to +1.
Interpreting the Correlation Coefficient
Covariance
Correlation
12.2 Limitations of Correlation Analysis
• Two variables can have a strong non-linear relation and still have a very low correlation
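A small constructed example illustrates the point. Here Y is fully determined by X through Y = X², yet the linear correlation is zero because X is symmetric around zero. The data and the helper function are hypothetical.

```python
import math

def correlation(x, y):
    """Sample correlation; the n - 1 divisors cancel, so raw sums suffice."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cross = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    return cross / math.sqrt(ss_x * ss_y)

x = [-2, -1, 0, 1, 2]
y = [v ** 2 for v in x]        # perfect non-linear relation: y = x^2

r = correlation(x, y)
print(f"Correlation between x and x^2: {r:.2f}")
```

A correlation near zero therefore rules out a linear relation only. A scatter plot is still needed to check for other patterns.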
Summary
• Data types
• Organizing data for quantitative analysis
• Frequency distributions
• Contingency tables
• Data visualization
• Guide to selecting among visualization types
• Measures of central tendency
• Other concepts of mean
• Other measures of location: quantiles
• Measures of dispersion
• Target downside deviation
• Skewness
• Kurtosis
• Correlation between two variables
Conclusion
• Learning Outcomes
• IFT Notes
• IFT Qbank
• Curriculum Practice Problems
• Curriculum Examples