
Reading 2 Organizing, Visualizing, and Describing Data

1. INTRODUCTION

FinQuiz Notes – 2022
Data is the key input for security analysis and investment management. The rapid growth in technology has created a data-rich environment featuring large volume, high velocity, and a wide variety of data, which has led investors to embrace big data in their investment strategies.

Organizing, cleaning, and analyzing data is highly important and is a foundation of a successful investment strategy. The data is then examined to detect important relationships, valuable insights, underlying structures, and outliers within the dataset.

2. DATA TYPES

Data represent facts or information in a raw or organized form. Data can be a collection of words, text, characters, numbers, dates, images, audio, or video.

To summarize and analyze data effectively, we need to distinguish among different classes of data types such as:

• numerical versus categorical data
• cross-sectional vs. time-series vs. panel data
• structured versus unstructured data

2.1 Numerical versus Categorical Data

Statistically, data can be classified into numerical data and categorical data.

2.1.1 Numerical Data (a.k.a. quantitative data): values that are numbers. They can be divided into two types:

a) Continuous data – infinite number of values between whole numbers.
   Data that can take any numerical value within a specific range of values, such as the future value of an investment, price returns of a stock, or cash dividends per share.

b) Discrete data – finite values.
   Values that result from a counting process, such as the number of coupon payments (semi-annually, annually) or the discrete compounding frequency (monthly, quarterly, yearly, etc.).

2.1.2 Categorical Data (a.k.a. qualitative data): values that are classified into various types based on the quality or characteristics of the dataset. These values can be mutually exclusive, such as dividend-paying versus non-dividend-paying, small-cap versus large-cap, etc.

Nominal data: In this classification, data is categorized into various types without any order or rank. Nominal data is commonly represented by text labels or numerical values/codes, provided these codes do not represent ranking.

For example, the Global Industry Classification Standard (GICS) is a method for assigning companies into sectors, industry groups, industries, and sub-industries based on the nature of the companies' businesses and operations.

Ordinal data: This scale classifies data into various categories and also ranks them in an order based on some characteristic. However, the intervals separating the ranks in ordinal data cannot be compared with each other.

Example: Under Morningstar and Standard & Poor's star ratings for mutual funds, a fund that is assigned:
• 1 star represents a fund with relatively poor performance.
• 5 stars represents a fund with relatively superior performance.

Rule of Thumb: How to distinguish numerical data from categorical data coded in numerical format?

Arithmetic operations can be performed meaningfully on numerical data but not on categorical data coded in numerical format.

Practice: Example 2,
Curriculum Volume 1, Reading 2.

2.2 Cross-Sectional versus Time-Series versus Panel Data

Two data-related terms:

a) Variable (field, attribute, or feature) – a characteristic that can be measured or categorized and whose value is subject to change. For example, stock price, dividend yield, earnings per share.

b) Observation – the value of a specific variable at a specific time or over a specified period. For example, last quarter's EPS of LRT Inc. is $2.50.

Following are three data classifications based on how data are collected:

a) Cross-sectional data
b) Time-series data
c) Panel data

1. Cross-sectional data: observations of a specific variable collected at the same point in time from multiple observational units.

   E.g., 2020 year-end book value per share (the variable) for all New York Stock Exchange-listed companies (the observational units).

2. Time-series data: a set of observations of a specific variable for a single observational unit, collected at different times at discrete and equally spaced time intervals.

   E.g., monthly returns (the variable) of UNP for the past 5 years.

3. Panel data: a mix of time-series and cross-sectional data that is frequently used in financial analysis. It is a set of observations on one or more variables for multiple observational units collected at different times.

   E.g., the annual inflation rate (the variable) of the Eurozone countries (the observational units) over a 5-year period.

2.3 Structured versus Unstructured Data

Structured data - Highly organized data in a systematic format with repeating patterns that are readily searchable and readable by computers for processing and analyzing.

For example:
• Market data – data issued by stock exchanges
• Fundamental data – data in financial reports issued by companies
• Analytical data – data obtained from analytics, cash flow projections, forecasted earnings.

Unstructured data - Data that do not follow any systematically organized format.

Unstructured data are usually alternative data gathered from unconventional sources, such as audio/video/text generated from satellites, financial news, social media posts, and presentations/filings generated by companies in regular business operations.

Unstructured data must be converted into a format usable by traditional modeling methods designed for structured inputs.

Based on the sources from which the data are generated, three types of unstructured data are the data generated by:

1. Individuals – web searches, social media posts
2. Business processes – credit card transactions, corporate filings such as Form 10-Q.
3. Sensors – satellite imagery, traffic by mobile devices

Practice: Example 2,
Curriculum Volume 1, Reading 2.

2.4 Data Summarization

For quantitative analysis, raw data must be transformed into structured data for cleaning and formatting purposes. Depending on the variables, raw data is organized into a one-dimensional array or a two-dimensional array to find patterns and relations between variables.

• Frequency distribution is a useful tool to summarize one-variable data.
• Contingency tables efficiently sum up two-variable data.

3. ORGANIZING DATA FOR QUANTITATIVE ANALYSIS

One-dimensional array

• Simplest format to organize information.
• Suitable for compiling data with a single variable. For example, time-series data, such as the closing price of TSLA for the first 10 trading days in January 2021.
• Time-series format facilitates:
  o future data updates to the current dataset.
  o observing trends or patterns in the data over time.

Two-dimensional rectangular array (a.k.a. data table)

One of the most popular forms of organizing data for computers or humans.

Data tables are similar to an Excel spreadsheet, where columns hold multiple variables and rows hold multiple observations, typically organized in a time-ordered sequence.

Practice: Example 3,
Curriculum Volume 1, Reading 2.
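
The two layouts described in this section can be sketched in plain Python. This is only an illustration: the prices and volumes are hypothetical placeholder values (not actual market data), and the helper names `column` and `append_observation` are invented for this sketch.

```python
# One-dimensional array: a single variable observed over time.
# Hypothetical closing prices, for illustration only.
closing_prices = [100.0, 101.5, 99.8, 102.3, 103.1]

# Two-dimensional data table: each row is one observation (a trading day),
# each key plays the role of a column, i.e. a variable.
data_table = [
    {"day": 1, "close": 100.0, "volume": 1200},
    {"day": 2, "close": 101.5, "volume": 1350},
    {"day": 3, "close": 99.8, "volume": 1100},
    {"day": 4, "close": 102.3, "volume": 1500},
    {"day": 5, "close": 103.1, "volume": 1425},
]


def column(table, name):
    """Return one variable (column) of the table as a one-dimensional list."""
    return [row[name] for row in table]


def append_observation(table, row):
    """Add a new observation; in a time-ordered layout an update is an append."""
    table.append(row)


print(column(data_table, "close"))
```

Pulling one column out of the table recovers the one-dimensional array, which is one way to see that the data table generalizes the simpler format.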

4. SUMMARIZING DATA USING FREQUENCY DISTRIBUTIONS

Frequency distribution is a useful tool to summarize data for one variable.

• Frequency distribution (also known as a one-way table): a tabular display where data is categorized into mutually exclusive groups (categorical data) or numerically ordered bins (numerical data) and that shows the number of observations in each group or bin.

• Absolute frequency: The actual number of observations for each unique value of a variable is called the absolute frequency or simply frequency.

• Relative frequency:
  $\text{Relative frequency} = \frac{\text{Absolute frequency}}{\text{Total number of observations}}$

Constructing a frequency distribution of a categorical variable:

• Count the number of observations for each unique value of the variable.
• List unique values along with corresponding counts in a table in ascending or descending order.

Consider a portfolio containing 80 stocks, organized in five sectors. The frequency distribution of the portfolio's stock holdings by sectors is provided below.

Sector (variable)    Absolute Frequency    Relative Frequency
Health care          21                    26.25%
Financials           19                    23.75%
Consumer goods       17                    21.25%
Utilities            14                    17.50%
Real estate          9                     11.25%
Total                80                    100.00%

A frequency distribution table provides valuable information. For example, the industry sector with the largest number of stocks in the portfolio is the 'health care' sector, which contains 21 stocks and accounts for 26.25% of the total stocks of the portfolio.

Constructing a frequency distribution of numerical data:

Step 1:
Arrange the data in ascending order.

Step 2:
Calculate the range of the data.
Range = Maximum value - Minimum value

Step 3:
Choose the appropriate number of bins (a.k.a. intervals) (k) based on your judgement. Bins are a set of values within which an observation lies.

Step 4:
Determine the bin width using the formula:
$\text{Bin width} = \frac{\text{Range}}{\text{No. of bins}}$

Example: Suppose,
Max. value = 3.5%

Min. value = 1.5%
Bins or intervals = 4

$\text{Bin width} = \frac{3.5\% - 1.5\%}{4} = 0.5\%$

Important
• If too few intervals are used, then the data is over-summarized and may ignore important characteristics.
• If too many intervals are used, then the data is under-summarized.
• The smaller (greater) the value of k, the larger (smaller) the interval.

Step 5:
Determine the endpoints of each bin.

Compute the endpoint of the first bin by adding the bin width to the minimum value.

Then compute the 2nd bin's endpoint by adding the bin width to the endpoint of the first bin.

Determine the next bins by successively adding the bin width to the endpoint of the previous bin.

The last bin is the one that includes the maximum value.

Step 6:
Count the number of observations in each bin.

Step 7:
Construct a table presenting the set of bins listed in ascending order, showing the number of observations falling into each bin.

It is important to note that:
• Bins do not overlap.
• Each observation should fall into one bin only.
• Start the first bin/interval with the nearest whole number below the minimum value.
• To ensure that the final interval includes the maximum value of the data, always round up (not down).

Example:

Suppose an investment fund has the following 10 observations of monthly returns.

1.5%, -2.2%, 6.2%, 11.8%, 9.9%, 6.5%, 5.2%, 4.3%, 7.2%, 10.1%

Step 1: Sorting the returns in ascending order

-2.2%, 1.5%, 4.3%, 5.2%, 6.2%, 6.5%, 7.2%, 9.9%, 10.1%, 11.8%

Step 2: Determine range
Range = Max. value – Min. value = 11.8% – (-2.2%) = 14%

Step 3: Set number of bins
Suppose we set k = 4, i.e., four bins.

Step 4: Determine bin width
$\text{Bin width} = \frac{\text{Range}}{k} = \frac{14\%}{4} = 3.5\%$

Step 5: Determine the endpoints of the bins

Endpoints of bins      Bin limits        Observation (obs.)
-2.2 + 3.5 = 1.3       [-2.2 to 1.3)     -2.2 ≤ obs. < 1.3
1.3 + 3.5 = 4.8        [1.3 to 4.8)      1.3 ≤ obs. < 4.8
4.8 + 3.5 = 8.3        [4.8 to 8.3)      4.8 ≤ obs. < 8.3
8.3 + 3.5 = 11.8       [8.3 to 11.8]     8.3 ≤ obs. ≤ 11.8

Note:
Square bracket [ ] indicates endpoints are included in the bin.
Parentheses ( ) indicate endpoints are not included in the bin.
The last bin is closed at both ends so that it includes the maximum value (11.8).

Steps 6 and 7:
Construct a table presenting the set of bins listed in ascending order, displaying the number of observations falling into each bin.

Bin    Observation (obs.)      Absolute Frequency    Relative Frequency
A      -2.2 ≤ obs. < 1.3       1                     10%
B      1.3 ≤ obs. < 4.8        2                     20%
C      4.8 ≤ obs. < 8.3        4                     40%
D      8.3 ≤ obs. ≤ 11.8       3                     30%

Cumulative Absolute Frequency is computed by adding up the absolute frequencies. It reflects the number of observations that are less than the upper limit of each interval.

Cumulative Relative Frequency is computed by adding up the relative frequencies. It reflects the percentage of observations that are less than the upper limit of each interval.

Bin   Observation (obs.)    Absolute    Relative    Cumulative Absolute   Cumulative Relative
                            Frequency   Frequency   Frequency             Frequency
A     -2.2 ≤ obs. < 1.3     1           10%         1                     10%
B     1.3 ≤ obs. < 4.8      2           20%         3                     30%
C     4.8 ≤ obs. < 8.3      4           40%         7                     70%
D     8.3 ≤ obs. ≤ 11.8     3           30%         10                    100%

Practice: Example 4,
Volume 1, Reading 2.
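
The seven construction steps can be sketched as a short Python function applied to the ten monthly returns from the example. This is a minimal sketch rather than a reference implementation: the function name `frequency_distribution` and the dictionary keys are invented here, bins are treated as half-open [lower, upper) with the last bin closed so the maximum is counted, and `Decimal` is used only to keep the bin boundaries exact.

```python
from decimal import Decimal

# The ten monthly returns (%) from the example, as exact decimals.
returns = [Decimal(x) for x in
           ("1.5", "-2.2", "6.2", "11.8", "9.9", "6.5", "5.2", "4.3", "7.2", "10.1")]


def frequency_distribution(values, k):
    """Build k equal-width bins and count observations per bin.

    Bins are [lower, upper), except the last one, which also includes
    the maximum value.
    """
    data = sorted(values)                      # Step 1: sort ascending
    lo, hi = data[0], data[-1]
    width = (hi - lo) / k                      # Steps 2-4: range / number of bins
    counts = [0] * k
    for x in data:                             # Step 6: count per bin
        idx = min(int((x - lo) / width), k - 1)
        counts[idx] += 1

    n = len(data)
    rows, running = [], 0
    for i, count in enumerate(counts):         # Step 7: assemble the table
        running += count
        rows.append({
            "lower": lo + i * width,           # Step 5: bin endpoints
            "upper": lo + (i + 1) * width,
            "absolute": count,
            "relative": Decimal(count) / n,
            "cum_absolute": running,
            "cum_relative": Decimal(running) / n,
        })
    return width, rows


width, rows = frequency_distribution(returns, k=4)
print("bin width:", width)
for row in rows:
    print(row)
```

For these ten returns with k = 4 the sketch gives a bin width of 3.5 and absolute frequencies of 1, 2, 4, and 3, so the cumulative absolute frequencies are 1, 3, 7, and 10.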

5. SUMMARIZING DATA USING A CONTINGENCY TABLE

A contingency table is a powerful tool to summarize data and to find patterns for two or more variables simultaneously. A contingency table with two variables is called a two-way table.

Constructing a two-way contingency table:

List all levels (categories) of one variable in rows (R) and all the levels of the other variable in columns (C). An R x C table refers to R levels of one variable in rows and C levels of the other variable in columns.

• Each variable should have a finite number of levels.
• Levels can comprise either ordered data or unordered data.

Consider a portfolio of 80 stocks. The table below shows a 5 x 3 contingency table that summarizes the stocks of the portfolio by two variables – sector and size (market capitalization).

• Sectors have five levels – i) Health care ii) Financials iii) Consumer goods iv) Utilities v) Real estate
• Size has three levels – i) small cap ii) mid cap iii) large cap
• Each cell below shows the number of stocks in each sector with a certain level of market cap.

                   Market Capitalization
Sectors            Small cap    Mid cap    Large cap    Total
Health care        9            7          5            21
Financials         4            8          7            19
Consumer goods     8            4          5            17
Utilities          4            6          4            14
Real estate        4            3          2            9
TOTAL              29           28         23           80

• Blue cells in the above contingency table (the sector-by-size cells) are called joint frequencies, i.e., joining observations of rows and columns.
• Green cells in the above contingency table (the Total row and Total column) are called marginal frequencies, i.e., joint frequencies added across rows and columns.
• Small cap 'Health care' stocks are the portfolio's largest subgroup in terms of frequency, with 9 stocks.
• Large cap 'Real estate' stocks are the portfolio's smallest subgroup, with 2 stocks.

Relative Frequency as percentage of Total count

$\text{Relative frequency of each cell} = \frac{\text{No. of observations}}{\text{Total observations}}$

                   Market Capitalization
Sectors            Small cap    Mid cap    Large cap    Total
Health care        11.25%       8.75%      6.25%        26.25%
Financials         5.0%         10.0%      8.75%        23.75%
Consumer goods     10.0%        5.0%       6.25%        21.25%
Utilities          5.0%         7.5%       5.0%         17.5%
Real estate        5.0%         3.75%      2.5%         11.25%
TOTAL              36.25%       35.0%      28.75%       100%
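
As a rough illustration, the marginal and relative frequencies of the 80-stock portfolio can be recomputed from the joint frequencies alone. The nested-dictionary layout and the variable names below are choices made for this sketch, not part of the curriculum.

```python
# Joint frequencies from the portfolio example: sector -> size -> number of stocks.
joint = {
    "Health care":    {"Small cap": 9, "Mid cap": 7, "Large cap": 5},
    "Financials":     {"Small cap": 4, "Mid cap": 8, "Large cap": 7},
    "Consumer goods": {"Small cap": 8, "Mid cap": 4, "Large cap": 5},
    "Utilities":      {"Small cap": 4, "Mid cap": 6, "Large cap": 4},
    "Real estate":    {"Small cap": 4, "Mid cap": 3, "Large cap": 2},
}

# Marginal frequencies: joint frequencies summed across each row and column.
row_totals = {sector: sum(sizes.values()) for sector, sizes in joint.items()}
col_totals = {}
for sizes in joint.values():
    for size, count in sizes.items():
        col_totals[size] = col_totals.get(size, 0) + count
total = sum(row_totals.values())

# Relative frequency of each cell as a percentage of the total count.
relative = {
    sector: {size: 100 * count / total for size, count in sizes.items()}
    for sector, sizes in joint.items()
}

# Largest and smallest subgroups by joint frequency.
cells = [(count, sector, size)
         for sector, sizes in joint.items()
         for size, count in sizes.items()]
largest = max(cells)
smallest = min(cells)

print(row_totals, col_totals, total)
print(largest, smallest)
```

Dividing by the grand total is only one choice of denominator; dividing by a row or column total instead would give relative frequencies within a sector or within a size group.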

Uses of Contingency tables

Contingency tables can be used to examine the potential association between two variables. One method used to test for the potential association between variables is the Chi-square test of independence.

Confusion matrix is a special type of two-way contingency table where one dimension represents actual data, and another dimension represents predicted data. This table gives better insights into the portions where the model is creating errors and where the model is correct.

Refer to the paragraph above Example 5.

Practice: Example 5,
Volume 1, Reading 2.
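
A confusion matrix can be tallied the same way as any two-way table, by counting (actual, predicted) pairs. The labels and the eight observations below are hypothetical and exist only to show the mechanics.

```python
from collections import Counter

# Hypothetical actual outcomes and model predictions, for illustration only.
actual =    ["up", "up",   "down", "down", "up", "down", "up", "down"]
predicted = ["up", "down", "down", "up",   "up", "down", "up", "down"]

# Each key is an (actual, predicted) pair; each value is its joint frequency.
confusion = Counter(zip(actual, predicted))

# Cells on the diagonal (actual == predicted) are the correct predictions.
correct = sum(count for (a, p), count in confusion.items() if a == p)
accuracy = correct / len(actual)

for (a, p), count in sorted(confusion.items()):
    print(f"actual={a:<5} predicted={p:<5} count={count}")
print("accuracy:", accuracy)
```

The off-diagonal cells show where the model errs, which is the insight the confusion matrix is meant to provide.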

6. DATA VISUALIZATION

Visualization – presenting data in a pictorial/graphical format – is a useful tool to recognize potential associations and comparisons among data.

6.1 Histogram and Frequency Polygon

Histogram
A histogram is the graphical representation of the frequency distribution (absolute frequency or relative frequency) of numerical data.

• The bins of the variable are plotted on the horizontal axis.
• The absolute/relative frequencies are plotted on the vertical axis.
• The heights of the bars of the histogram represent the frequencies, i.e., the tallest bar would be the bin that has the highest frequency.
• Since the bins have no gaps between them, there would be no gaps between the bars. However, gaps can be added between the bars to improve readability.

Advantage: Histogram is useful for a quick inspection of the frequency distribution (shape, center or spread) of a large numerical dataset.

Frequency polygon: another tool to graphically represent the frequency distribution.

• The mid-point of each bin/interval is plotted on the horizontal axis.
• The corresponding absolute frequency of the bin is plotted on the vertical axis.
• The points representing the intersections of the midpoints and class frequencies are connected by a line (the black line in the original chart).

Cumulative frequency distribution: This graph can be used to determine the number or the percentage of the observations lying between certain values. In this graph,

• Cumulative absolute or cumulative relative frequency is plotted on the vertical axis.
• The upper interval limit of the corresponding bin is plotted on the horizontal axis.

  o For extreme values (both negative and positive), the cumulative distribution tends to flatten out.
  o Steeper (flatter) slope of the curve indicates large (small) frequencies (number of observations).

6.2 Bar Charts

• Bar charts are similar to a histogram, with the difference that bar charts represent the frequency distribution of categorical data.
• Each bar indicates a distinct category arranged with no logical ordering.
• Bars can be plotted horizontally or vertically.
• The height of the bar represents the frequency of the corresponding category.
• Bar charts are used to show comparisons between categories of data.
• Bar chart 1 below represents the frequency distribution of one categorical variable – sector.

Note:
Pareto chart – a bar chart plus a line, where categories represented by bars are arranged in descending order and the line represents cumulative relative frequency.

Bar charts for more than one categorical variable

• Grouped bar chart (a.k.a. clustered bar chart) is used to show joint frequencies of more than one categorical variable.

  For example, Bar chart 2 below is a grouped bar chart. Three bars within each sector represent size (market cap levels) denoted by small cap, mid cap and large cap.

• Stacked bar chart is another style to present joint frequencies of more than one categorical variable. In this chart, a single bar is divided into subsections differentiated by various colors/patterns. The height of the bar signifies the marginal frequency for the category.

Note: Bar charts are also used when categorical data are associated with numerical data.
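
The data behind a Pareto chart can be prepared without any plotting library: sort the categories by frequency in descending order and accumulate the relative frequency. The sketch below uses the sector counts from the 80-stock portfolio example; the function name `pareto_table` is invented for illustration.

```python
# Sector counts from the portfolio example, deliberately listed out of order.
sector_counts = {
    "Utilities": 14,
    "Health care": 21,
    "Real estate": 9,
    "Financials": 19,
    "Consumer goods": 17,
}


def pareto_table(counts):
    """Return (category, frequency, cumulative relative frequency in %) rows,
    ordered from the most to the least frequent category."""
    total = sum(counts.values())
    ordered = sorted(counts.items(), key=lambda item: item[1], reverse=True)
    rows, running = [], 0
    for category, count in ordered:
        running += count
        rows.append((category, count, 100 * running / total))
    return rows


pareto_rows = pareto_table(sector_counts)
for category, count, cumulative in pareto_rows:
    print(f"{category:<15} {count:>3} {cumulative:>7.2f}%")
```

The second column would drive the bars and the third column the cumulative line, which always ends at 100%.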

6.3 Tree map

Tree-map is another visual tool to present categorical data. A tree-map consists of a set of colored rectangles that represent distinct groups. The size of each rectangle is proportional to the value of the corresponding group.

A tree-map below is an example of two categorical variables. The rectangles are first split into five sectors. Each sector is then subdivided into three sub-rectangles: small cap, mid cap, large cap.

When more than three levels are used, tree-maps become hard to read.

6.4 Word Cloud (a.k.a. Tag cloud)

Word cloud is used to display unstructured (text) data, where key words are sized and displayed based on their frequency in the data file. The higher the frequency of a word, the bigger its size. Common words such as it, the, etc. are not included.

Source: https://worditout.com/word-cloud/create

6.5 Line Chart

A line chart plots data over time and is used to examine the changes in data and trends and to predict future data series.

• Ordered observations are plotted on the vertical axis (Y-axis).
• Time is plotted on the horizontal axis (X-axis).
• The data points over time are connected using a line.

Advantage: Line charts are useful for:
• visualizing large amounts of data.
• comparing more than one set of data points.

Bubble Line Chart

Bubble line chart is a type of line chart used to display multi-dimensional data in one chart. In a bubble line chart, data points are replaced by varying-sized, color-coded bubbles to add another dimension.

6.6 Scatter Plot

A scatter plot graphically shows the relationship between two variables, i.e., how two sets of data are related.

• Each observation in the scatter plot is represented by a point, and the points are not connected.
• If the points on the scatter plot cluster together in a straight line, the two variables have a strong linear relation.
• Randomly distributed points on the scatter plot may indicate no clear relationship between the variables.
• The strength of the association is determined by how closely the data points are clustered together.

A scatter plot matrix is a useful tool to inspect bivariate (pairwise) relationships between combinations of variables in one visual. The scatter plot matrix provides a brief visual summary of variables and the potential correlation among them.
Refer to Exhibit 32 'Pairwise Plot Matrix', Reading 2, CFA Program Curriculum.

6.7 Heat Map

Heat map is a graphic summary of data in a tabular format using a color spectrum to differentiate high values from low values. Heat maps are typically used to see the degree of correlation among different variables.

Practice: Example 6,
Volume 1, Reading 2.

6.8 Guide to Selecting among Visualization Types

Data visualization is a useful tool to gain insight. Different visual types are appropriate for different purposes. Some charts are suitable for numerical data, and some are suitable for categorical data.

Following are the four common pitfalls that analysts should avoid for ethical and appropriate use of data visuals.

1. Selecting the wrong chart
2. Selectively plotting data for biased conclusions
3. Truncated graph problem – a y-axis that does not start at zero. Such a problem may inaccurately imply that variables are significantly different when there is only a small difference.
4. Improper scaling of axis.

Practice: Example 7,
Volume 1, Reading 2.

7. MEASURES OF CENTRAL TENDENCY

A measure of central tendency indicates the center of the data. The most used measures of central tendency are:

• Arithmetic mean
• Median
• Mode
• Weighted mean
• Geometric mean
• Harmonic mean

7.1 The Arithmetic Mean

It is the sum of the observations in the dataset divided by the number of observations in the dataset.

The terms 'mean' and 'average' are used interchangeably.

7.1.1 The Sample Mean
The sample mean is the arithmetic mean value of a sample; it is computed as:

$\text{Sample mean } \bar{X} = \frac{\sum_{i=1}^{n} X_i}{n}$

where,
X_i = ith observation
n = number of observations in the sample

• The sample mean can be computed for individual units or over time.
• It is not unique, i.e., for a given population, different samples may have different means.

Practice: Example 8,
Volume 1, Reading 2.

7.1.2 Properties of the Arithmetic Mean

Property 1:
The sum of the deviations* around the mean is always equal to 0.

*The difference between each outcome and the mean is called a deviation.

Property 2:
The arithmetic mean is sensitive to extreme values, i.e., it can be biased upward or downward by extremely large or small observations, respectively.

Advantages of Arithmetic Mean:

• The mean uses all the information regarding the size and magnitude of the observations.
• The mean is also easy to calculate.
• Easy to work with algebraically.

Limitation: The arithmetic mean is highly affected by outliers (extreme values).

7.1.3 Outliers

Extreme values (outliers) in a dataset may reflect a rare value in the population or an error.

Three ways of dealing with the outliers are:

1. Do nothing. Use the data without any adjustment.

   Choose this option when it is important to represent the whole set of observations and/or outliers contain meaningful information.

2. Delete all the outliers.

   When this option is chosen, the measure of central tendency is the trimmed mean.

   Trimmed mean is the arithmetic mean of the distribution computed after excluding a stated small % of the lowest and highest values.

3. Replace the outliers with some other value.

   When this option is used, the mean is called the winsorized mean.
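
The zero-sum property of deviations and the two outlier treatments can be checked on a small dataset. The ten values below are hypothetical (one deliberate outlier, 96), and the helpers are simplified sketches: `trimmed_mean` drops the k lowest and k highest values, and `winsorized_mean` replaces them with the nearest retained value instead of a computed percentile.

```python
# Hypothetical observations with one obvious outlier (96).
values = [1, 3, 4, 5, 6, 7, 8, 9, 11, 96]


def mean(xs):
    """Arithmetic mean: sum of observations divided by their number."""
    return sum(xs) / len(xs)


def trimmed_mean(xs, k):
    """Mean after dropping the k lowest and k highest observations."""
    data = sorted(xs)
    return mean(data[k:len(data) - k])


def winsorized_mean(xs, k):
    """Mean after replacing the k lowest and k highest observations with
    the nearest retained values (a simplification of percentile cut-offs)."""
    data = sorted(xs)
    low, high = data[k], data[-k - 1]
    clipped = [min(max(x, low), high) for x in data]
    return mean(clipped)


x_bar = mean(values)
deviation_sum = sum(x - x_bar for x in values)   # Property 1: equals 0

print("mean:", x_bar)
print("sum of deviations:", deviation_sum)
print("trimmed mean (k=1):", trimmed_mean(values, 1))
print("winsorized mean (k=1):", winsorized_mean(values, 1))
```

With k = 1 on ten observations, each tail treated is 10% of the data. The plain mean of 15 is pulled well above nine of the ten values, while the trimmed and winsorized means stay below 7.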

Winsorized mean: In a winsorized mean, a stated % of the lowest values is assigned a specified low value and a stated % of the highest values is assigned a specified high value, and then a mean is computed from the restated data.

E.g., in a 95% winsorized mean,
o The bottom 2.5% of values are set equal to the 2.5th percentile value.
o The upper 2.5% of values are set equal to the 97.5th percentile value.

7.2 The Median

Median is the middle value of a sorted (ascending or descending) list of items.

Steps to compute the Median:
1. Arrange all observations in ascending order, i.e., from the smallest to the largest.
2. When the number of observations (n) is odd, the median is the center observation in the ordered list, i.e.,

   Median will be located at the $\frac{n+1}{2}$ position.

   • (n+1)/2 only identifies the location of the median, not the median itself.

3. When the number of observations (n) is even, the median is the mean of the two center observations in the ordered list, i.e.,

   Median will be located at the mean of the $\frac{n}{2}$ and $\frac{n+2}{2}$ positions.

Advantage: Median is not affected by extreme observations (outliers).

Limitations:

• It is time consuming and difficult to compute the median.
• It does not use all the information about the size and magnitude of the observations.
• It only focuses on the relative position of the ranked observations.

Example:
Suppose current P/Es of three firms are 16.73, 22.02, and 29.30.
n = 3 → (n + 1) / 2 = 4 / 2 = 2nd position.
Thus, the median P/E is 22.02.

7.3 The Mode

The mode is the most frequently occurring value in a distribution.

Unimodal Distribution: A distribution that has only one mode is called a unimodal distribution.

Bimodal Distribution: A distribution that has two modes is called a bimodal distribution.

Trimodal Distribution: A distribution that has three modes is called a trimodal distribution.

A distribution would have no mode when all the values in a data set are different.

Modal Interval: Data with continuous distribution (e.g., stock returns) may not have a modal outcome. In such cases, a modal interval is found, i.e., an interval with the largest number of observations (highest frequency). The modal interval always has the highest bar in the histogram.

Important to note: The mode is the only measure of central tendency that can be used with nominal data.

7.4 Other Concepts of Mean

7.4.1) The Weighted Mean

Weighted mean: It is the arithmetic mean in which observations are assigned different weights. It is computed as:

$\bar{X}_w = \sum_{i=1}^{n} w_i X_i = (w_1 X_1 + w_2 X_2 + \dots + w_n X_n)$

where,
X_1, X_2, …, X_n = observed values
w_1, w_2, …, w_n = corresponding weights, which sum to 1.

• An arithmetic mean is a special case of weighted mean where all observations are equally weighted by the factor 1/n (or 1/N).
• A positive weight represents a long position and a negative weight represents a short position.
• Expected value: When a weighted mean is computed for forward-looking data, it is referred to as the expected value.

Example:
Weight of stocks in a portfolio = 0.60
Weight of bonds in a portfolio = 0.40
Return on stocks = –1.6%
Return on bonds = 9.1%

A portfolio's return is the weighted average of the returns on the assets in the portfolio, i.e.,

Portfolio return = (w_stock × R_stock) + (w_bonds × R_bonds)
= 0.60(–1.6%) + 0.40(9.1%) = 2.7%.

Practice: Example 9,
Volume 1, Reading 2.

7.4.2) The Geometric Mean

Geometric mean (GM): The geometric mean can be used to compute the mean value over time in order to compute the growth rate of a variable.

$G = \sqrt[n]{X_1 X_2 X_3 \dots X_n}$
with X_i ≥ 0 for i = 1, 2, …, n.

Or

$\ln G = \frac{1}{n} \ln(X_1 X_2 X_3 \dots X_n)$

or as

$\ln G = \frac{\sum_{i=1}^{n} \ln X_i}{n}$

$G = e^{\ln G}$

• It should be noted that the geometric mean can be computed only when the product under the radical sign is non-negative.

The geometric mean return over the time period can be computed as:

$R_G = [(1 + R_1)(1 + R_2) \dots (1 + R_T)]^{1/T} - 1$

• Geometric mean returns are also known as compound returns.

Advantages of Measures of Central Tendency:

• Widely recognized.
• Easy to compute.
• Easy to apply.

Geometric mean versus Arithmetic mean:

• The geometric mean return represents the growth rate or compound rate of return on an investment.
• The arithmetic mean return represents an average single-period return on an investment.
• The geometric mean is always ≤ arithmetic mean.
• When there is no variability in the observations (i.e., when all the observations in the series are the same), geometric mean = arithmetic mean.
• The greater the variability of returns over time, the further the geometric mean falls below the arithmetic mean.
• The geometric mean return decreases with an increase in standard deviation (holding the arithmetic mean return constant).

In addition, the geometric mean may rank funds differently than the arithmetic mean does.

Practice: Example 10,
Volume 1, Reading 2.

7.4.3) The Harmonic Mean

$\text{Harmonic mean } \bar{X}_H = \frac{n}{\sum_{i=1}^{n} (1/X_i)}$
with X_i > 0 for i = 1, 2, …, n.

• It is a special case of the weighted mean in which each observation's weight is inversely proportional to its magnitude.

Cost Averaging is an investment strategy involving periodic investments of a fixed amount of money. The harmonic mean is appropriate when averaging ratios that are repeatedly applied to a fixed quantity to yield a variable number of units.

In cost averaging, the ratios to be averaged are prices per share at the date of the purchase, and those prices are applied to a constant amount of money to yield a variable number of shares.

Practice: Example 11 and 12
Volume 1, Reading 2.

Important to note:

• Harmonic mean formula cannot be used to compute average price paid when different amounts of money are invested at each date.
• When all the observations in the data set are the same, geometric mean = arithmetic mean = harmonic mean.
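
The weighted, geometric, and harmonic means can each be written in a few lines. In the sketch below the portfolio weights and returns come from the weighted-mean example, while the three period returns, the two purchase prices, and the fixed investment amount are hypothetical values chosen for illustration; the function names are likewise invented here.

```python
# Weighted mean: the portfolio example (weights sum to 1, returns in %).
weights = [0.60, 0.40]
asset_returns = [-1.6, 9.1]
portfolio_return = sum(w * r for w, r in zip(weights, asset_returns))


def geometric_mean_return(period_returns):
    """Compound per-period return: [(1+R_1)...(1+R_T)]^(1/T) - 1."""
    growth = 1.0
    for r in period_returns:
        growth *= 1 + r
    return growth ** (1 / len(period_returns)) - 1


def harmonic_mean(xs):
    """n divided by the sum of reciprocals; requires every x > 0."""
    return len(xs) / sum(1 / x for x in xs)


# Hypothetical period returns (as decimals) for the geometric mean.
period_returns = [0.10, -0.05, 0.20]
arithmetic = sum(period_returns) / len(period_returns)
geometric = geometric_mean_return(period_returns)

# Cost averaging: the same hypothetical amount invested at each price.
prices = [10.0, 15.0]
amount = 1000.0
shares_bought = sum(amount / p for p in prices)
average_cost = amount * len(prices) / shares_bought

print("portfolio return (%):", round(portfolio_return, 2))
print("arithmetic vs geometric:", arithmetic, geometric)
print("average cost vs harmonic mean:", average_cost, harmonic_mean(prices))
```

Because the same amount is invested at each date, the average cost per share coincides with the harmonic mean of the prices (12 here, versus an arithmetic average price of 12.5).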

• When there is variability in the observations, harmonic mean < geometric mean < arithmetic mean.

Practice: Example 13
Volume 1, Reading 2.

Refer to Exhibit 42, CFA Program Curriculum, Reading 2 for "Deciding which central tendency measure to use?"
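
Python's standard `statistics` module offers ready-made versions of several of these measures, which makes it easy to compare them on one dataset. The sketch below reuses the three P/E ratios from the median example; the `ratings` list is hypothetical, and `geometric_mean` and `multimode` assume Python 3.8 or later.

```python
import statistics

# P/E ratios from the median example.
pe_ratios = [16.73, 22.02, 29.30]

arithmetic = statistics.mean(pe_ratios)
geometric = statistics.geometric_mean(pe_ratios)
harmonic = statistics.harmonic_mean(pe_ratios)
median = statistics.median(pe_ratios)

# Hypothetical ordinal ratings, used only to illustrate the mode.
ratings = [3, 4, 4, 5, 2, 4]
modes = statistics.multimode(ratings)   # list of the most frequent value(s)

print("harmonic < geometric < arithmetic:", harmonic, geometric, arithmetic)
print("median:", median)
print("mode(s):", modes)
```

Note that `multimode` returns every value tied for the highest count, so when all values differ it returns all of them; that case corresponds to a distribution with no mode in the sense used above.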

8. QUANTILES

Quantile or Fractile is a general term used for a value at or below which a stated fraction of the data lies.

The following four measures collectively are called quantiles.

1. Quartiles
2. Quintiles
3. Deciles
4. Percentiles

8.1 Quartiles, Quintiles, Deciles, and Percentiles

1) Quartiles divide the distribution into four different parts.

• First Quartile = Q1 = 25th percentile, i.e., 25% of the observations lie at or below it.
• Second Quartile = Q2 = 50th percentile, i.e., 50% of the observations lie at or below it.
• Third Quartile = Q3 = 75th percentile, i.e., 75% of the observations lie at or below it.

2) Quintiles divide the distribution into five different parts. In terms of percentiles, they can be specified as P20, P40, P60, & P80.

3) Deciles divide the distribution into ten different parts.

4) Percentiles divide the distribution into hundred different parts. The position of a percentile in an array with n entries arranged in ascending order is determined as follows:

$L_y = (n + 1) \frac{y}{100}$

where,
y = % point at which the distribution is being divided.
L_y = location (L) of the percentile (P_y).
n = number of observations.

• The larger the sample size, the more accurate the calculation of percentile location.

Example:

Dividend Yields on the components of the DJ Euro STOXX 50

No.   Company                            Dividend Yield (%)
1     AstraZeneca                        0.00
2     BP                                 0.00
3     Deutsche Telekom                   0.00
4     HSBC Holdings                      0.00
5     Credit Suisse Group                0.26
6     L'Oreal                            1.09
7     Swiss Re                           1.27
8     Roche Holding                      1.33
9     Munich Re Group                    1.36
10    General Assicurazioni              1.39
11    Vodafone Group                     1.41
12    Carrefour                          1.51
13    Nokia                              1.75
14    Novartis                           1.81
15    Allianz                            1.92
16    Koninklijke Philips Electronics    2.01
17    Siemens                            2.16
18    Deutsche Bank                      2.27
19    Telecom Italia                     2.27
20    AXA                                2.39
21    Telefonica                         2.49
Reading 2 Organizing, Visualizing, and Describing Data

Dividend
No. Company Thus,
Yield(%)
P10 = X5 + (5.1 – 5) (X6 – X5) = 0.26 + 0.1 (1.09 –
22 Nestle 2.55 0.26)
= 0.34%
23 Royal Bank of 2.60
Scotland Group
Calculating 90th percentile (P90):
24 ABN-AMRO Holding 2.65 L90 = (50 + 1) × (90 / 100) = 45.9
25 BNP Paribas 2.65
• It implies that 90th percentile lies
26 UBS 2.65 between the 45th observation (X45 =
5.15) and 46th observation (X46 = 5.66).
27 Tesco 2.95
28 Total 3.11 Thus,
29 GlaxoSmithKline 3.31 P90 = X45 + (45.9 – 45) (X46 – X45) = 5.15 + 0.90
(5.66 – 5.15) = 5.61%
30 BT Group 3.34
Calculating 1stQuartile (i.e.P25):
31 Unilever 3.53
L25 = (50 + 1) × (25 / 100) = 12.75
32 BASF 3.59
33 Santander Central 3.66 • It implies that 25th percentile lies
Hispano between the 12th observation (X12 =
1.51) and 13th observation (X13 = 1.75).
34 Banco Bilbao 3.67
Thus,
VizcayaArgentaria
P25 = Q1 = X12 + (12.75 – 12) (X13 – X12) = 1.51 +
35 Diageo 3.68 0.75 (1.75 – 1.51) = 1.69%

36 HBOS 3.78 Calculating 2nd Quartile (i.e.P50):


37 E.ON 3.87 L50 = (50 + 1) × (50 / 100) = 25.5

38 Shell Transport and 3.88


• It implies that P50 lies between the 25th
Co.
observation (X25 = 2.65) and 26th
39 Barclays 4.06 observation (X26 = 2.65).
• Since, X25 = X26 = 2.65, no interpolation is
40 Royal Dutch 4.27 needed.
Petroleum Co.
41 Fortus 4.28 Thus,
P50 = Q2 = 2.65% = Median
42 Bayer 4.45
43 DiamlerChrysler 4.68 Calculating 3rd Quartile (i.e.P75):
L75 = (50 + 1) × (75 / 100) = 38.25
44 Suez 5.13
45 Aviva 5.15 • It implies that P75 lies between the 38th
observation (X38 = 3.88) and 39th
46 Eni 5.66
observation (X39 = 4.06).
47 ING Group 6.16
Thus,
48 Prudential 6.43
P75 = Q3 = X38 + (38.25 – 38) (X39 – X38)
49 Lloyds TSB 7.68 = 3.88 + 0.25 (4.06 – 3.88)
= 3.93%
50 AEGON 8.14
Calculating 20th percentile (P20) = 1st Quintile:
Calculating 10th percentile (P10): Total number L20 = (50 +1) × (20 /100) = 10.2
of observations in the table above = n = 50
• It implies that P20 lies between the 10th
L10 = (50 + 1) × (10 / 100) = 5.1
observation (X10 = 1.39) and 11th
observation (X11 = 1.41).
• It implies that 10th percentile lies
between 5th observation (X5 = 0.26) and Thus,
6th observation (X6 = 1.09).
1st quintile = P20 = X10 + (10.2 – 10) (X11 – X10) = 1.39 + 0.20 (1.41 – 1.39) = 1.394% or 1.39%

8.2 Quantiles in Investment Practice

Quantiles are frequently used by investment analysts to rank performance i.e., portfolio performance. For example, an analyst may rank the portfolio of companies based on their market values to compare performance of small companies with large ones i.e.

• 1st decile contains the portfolio of companies with the smallest market values.
• 10th decile contains the portfolio of companies with the largest market values.

Quantiles are also used for investment research purposes.

Interquartile range (IQR) = Third quartile – First quartile
= Q3 – Q1

• It reflects the length of the interval that contains the middle 50% of the data.
• The larger the interquartile range, the greater the dispersion, all else constant.

Refer to Curriculum, Reading 2, Exhibit 44 and 45 for Box and Whisker Chart

Practice: Example 14 and 15, Volume 1, Reading 2.
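The percentile-location rule Ly = (n + 1) × (y / 100), followed by linear interpolation between the two neighbouring observations, can be sketched in a few lines. The data are the 50 dividend yields from the example table; the function name percentile and the clamping at the ends of the array are assumptions made for this illustration and are not part of the notes.

```python
import math

# The 50 dividend yields (%) from the example, already in ascending order
yields = [0.00, 0.00, 0.00, 0.00, 0.26, 1.09, 1.27, 1.33, 1.36, 1.39,
          1.41, 1.51, 1.75, 1.81, 1.92, 2.01, 2.16, 2.27, 2.27, 2.39,
          2.49, 2.55, 2.60, 2.65, 2.65, 2.65, 2.95, 3.11, 3.31, 3.34,
          3.53, 3.59, 3.66, 3.67, 3.68, 3.78, 3.87, 3.88, 4.06, 4.27,
          4.28, 4.45, 4.68, 5.13, 5.15, 5.66, 6.16, 6.43, 7.68, 8.14]

def percentile(sorted_values, y):
    """y-th percentile using location L = (n + 1) * y / 100 and interpolation."""
    n = len(sorted_values)
    location = (n + 1) * y / 100          # 1-based position in the sorted array
    lower = int(math.floor(location))     # whole part of the position
    fraction = location - lower           # fractional part used to interpolate
    # Assumption: clamp to the first/last observation if the location falls outside 1..n
    if lower < 1:
        return sorted_values[0]
    if lower >= n:
        return sorted_values[-1]
    x_lower = sorted_values[lower - 1]
    x_upper = sorted_values[lower]
    return x_lower + fraction * (x_upper - x_lower)

q1 = percentile(yields, 25)
q3 = percentile(yields, 75)
iqr = q3 - q1
for y in (10, 20, 25, 50, 75, 90):
    print(f"P{y} = {percentile(yields, y):.3f}%")
print(f"IQR = {iqr:.3f}%")
```

Statistical libraries often use slightly different location conventions, so their percentile results may not match this rule exactly.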

9. Measures of Dispersion

The variability around the mean is called Dispersion. The measures of dispersion provide information regarding the spread or variability of the data values.

Relative dispersion: It refers to the amount of dispersion/variation relative to a reference value or benchmark e.g., coefficient of variation. (It is discussed below).

Absolute Dispersion: It refers to the variation around the mean value without comparison to any reference point or benchmark. Measures of absolute dispersion include:

i. Range
ii. Mean absolute deviation
iii. Variance
iv. Standard deviation

9.1 The Range

Range = Maximum value – Minimum value

Advantage: It is easy to compute.

Disadvantages:

• It does not provide information regarding the shape of the distribution of data.
• It only reflects extremely large or small outcomes that may not be representative of the distribution.

9.2 The Mean Absolute Deviation

Mean absolute deviation (MAD) is the average of the absolute values of deviations from the mean.

MAD = [Σ |Xi – X̄|] / n, with the sum taken over i = 1 to n

where,
X̄ = Sample mean
n = Number of observations in the sample

• The greater the MAD, the riskier the asset.

Example:
Suppose there are 4 observations i.e., 15, –5, 12, 22.

Mean = (15 – 5 + 12 + 22)/4 = 11%
MAD = (|15 – 11| + |–5 – 11| + |12 – 11| + |22 – 11|)/4 = 32/4 = 8%

Advantage: MAD is superior relative to range because it is based on all the observations in the sample.

Drawback: MAD is difficult to compute relative to range.
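As a quick check of the range and MAD definitions, the sketch below recomputes both for the four observations used in the MAD example (15, –5, 12, 22). The function names are illustrative choices, not part of the notes.

```python
def value_range(xs):
    # Range = maximum value - minimum value
    return max(xs) - min(xs)

def mean_absolute_deviation(xs):
    # Average of the absolute deviations from the sample mean
    mean = sum(xs) / len(xs)
    return sum(abs(x - mean) for x in xs) / len(xs)

observations = [15, -5, 12, 22]  # returns in %, from the MAD example

print("Range:", value_range(observations))
print("MAD:", mean_absolute_deviation(observations))
```

The range uses only the two extreme observations, while MAD uses all four, which is the advantage noted above.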
Practice: Example 16, CFA Program Curriculum, Volume 1, Reading 2.

9.3 Sample Variance and Sample Standard Deviation

Variance: Variance is the average of the squared deviations around the mean.

Standard deviation (S.D.): Standard deviation is the positive square root of the variance. It is easy to interpret relative to variance because standard deviation is expressed in the same unit of measurement as the observations.

9.3.1) Sample Variance

It is computed as:

s² = [Σ (Xi – X̄)²] / (n – 1), with the sum taken over i = 1 to n

where,
X̄ = Sample mean
n = Number of observations in the sample

• The sample mean is defined as an unbiased estimator of the population mean.
• (n – 1) is known as the number of degrees of freedom in estimating the population variance.

9.3.2) Sample Standard Deviation

It is computed as:

s = √{ [Σ (Xi – X̄)²] / (n – 1) }, with the sum taken over i = 1 to n

Important to note:

• The MAD will always be ≤ S.D. because the S.D. gives more weight to large deviations than to small ones.
• When a constant amount is added to each observation, S.D. and variance remain unchanged.

Refer to Curriculum, Reading 2, Exhibit 46 for Steps to Calculate Sample Standard Deviation and Variance

Practice: Example 17, CFA Program Curriculum, Volume 1, Reading 2.

9.3.3) Dispersion and the Relationship between Arithmetic and the Geometric Means

Geometric mean return ≈ Arithmetic mean return – (Variance of return / 2)

X̄G ≈ X̄ – s² / 2

The larger the variance of the sample, the wider will be the difference between the geometric mean and the arithmetic mean.
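The sample variance and standard deviation formulas, and the approximation linking the geometric and arithmetic mean returns, can be illustrated as follows. This is a sketch under stated assumptions: the first data set reuses the four observations 15, –5, 12, 22 from the MAD example, and the decimal returns used for the geometric mean check are hypothetical.

```python
import math

def sample_variance(xs):
    # Sum of squared deviations from the mean, divided by (n - 1)
    mean = sum(xs) / len(xs)
    return sum((x - mean) ** 2 for x in xs) / (len(xs) - 1)

def sample_sd(xs):
    # Positive square root of the sample variance
    return math.sqrt(sample_variance(xs))

observations = [15, -5, 12, 22]           # returns in %
shifted = [x + 3 for x in observations]   # same data with a constant added

print("variance:", sample_variance(observations))
print("S.D.:", sample_sd(observations))
print("S.D. after adding a constant:", sample_sd(shifted))

# Hypothetical returns in decimal form for the geometric mean approximation
returns = [0.10, -0.05, 0.20, 0.05]
arithmetic = sum(returns) / len(returns)
growth = math.prod(1 + r for r in returns)
geometric = growth ** (1 / len(returns)) - 1
approximation = arithmetic - sample_variance(returns) / 2

print("geometric mean return:", round(geometric, 4))
print("arithmetic mean - variance/2:", round(approximation, 4))
```

The approximation is close but not exact, and the gap between the two means widens as the variance of the returns grows.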

10. DOWNSIDE DEVIATION AND COEFFICIENT OF VARIATION

Downside deviation is a risk measure that focuses on returns that fall below a minimum threshold or minimum acceptable return, as investors are typically only concerned about the values that fall below some minimum target return.

Standard deviation considers all deviations from the mean. Downside deviation considers only the deviations below the threshold. Therefore, when the threshold is the mean, downside deviation is no greater than the standard deviation (S.D.).

Target semideviation is the measure of dispersion of observations below the stated target.

sTarget = √{ [Σ (Xi – B)²] / (n – 1) }, with the sum taken over all Xi ≤ B

where,
B = target value,
n = number of observations.

Example: Stock returns = 16.2%, 20.3%, 9.3%, –11.1% and –17.0%.
Target return = B = 10%

Target semideviation = √{ [(9.3 – 10.0)² + (–11.1 – 10.0)² + (–17.0 – 10.0)²] / (5 – 1) }

Target semideviation = √293.675 = 17.14%
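The target semideviation example can be reproduced with a short sketch. It uses the five stock returns and the 10% target from the example; the function name is an illustrative choice. Note that only observations at or below the target enter the sum, while the divisor is still n – 1 for the full sample.

```python
import math

def target_semideviation(xs, target):
    # Squared deviations are summed only for observations at or below the target,
    # but the divisor uses the full sample size: (n - 1)
    downside = sum((x - target) ** 2 for x in xs if x <= target)
    return math.sqrt(downside / (len(xs) - 1))

stock_returns = [16.2, 20.3, 9.3, -11.1, -17.0]  # returns in %
target = 10.0                                    # target return B in %

result = target_semideviation(stock_returns, target)
print(f"Target semideviation = {result:.2f}%")
```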
Practice: Example 18 and 19, CFA Program Curriculum, Volume 1, Reading 2.

10.1 Coefficient of Variation

Relative dispersion is the amount of dispersion relative to a reference value or benchmark. One such measure is the coefficient of variation.

Coefficient of Variation (CV) is the ratio of the standard deviation of a set of values to their mean value.

Coefficient of Variation (CV) measures the amount of risk (S.D.) per unit of mean value.

CV = s / X̄

When stated in %, CV is:

CV = (s / X̄) × 100%

where,
s = sample S.D.
X̄ = sample mean.

• CV is a scale-free measure (i.e., has no units of measurement); therefore, it can be used to directly compare dispersion across different data sets.
• Interpretation of CV: The greater the value of CV, the higher the risk.
• An inverse CV = X̄ / s. It indicates the units of mean value (e.g., % of return) per unit of S.D.

Practice: Example 20, CFA Program Curriculum, Volume 1, Reading 2.
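To illustrate how the coefficient of variation allows dispersion to be compared across data sets with different means, here is a minimal sketch. Both return series are hypothetical, and the function name is an illustrative choice; statistics.mean and statistics.stdev (sample standard deviation, n – 1 divisor) come from the Python standard library.

```python
import statistics

def coefficient_of_variation(xs):
    # CV = sample standard deviation / sample mean (mean must be non-zero)
    return statistics.stdev(xs) / statistics.mean(xs)

asset_a = [4, 6, 5, 7, 3]      # hypothetical returns in %, mean 5
asset_b = [10, 20, 0, 15, 5]   # hypothetical returns in %, mean 10

cv_a = coefficient_of_variation(asset_a)
cv_b = coefficient_of_variation(asset_b)

print(f"CV of A = {cv_a:.4f} ({cv_a * 100:.2f}%)")
print(f"CV of B = {cv_b:.4f} ({cv_b * 100:.2f}%)")
print(f"Inverse CV of A = {1 / cv_a:.4f}")
```

In this made-up data, asset B has the higher mean but also the higher CV, i.e. more risk per unit of mean return.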

11. THE SHAPE OF DISTRIBUTIONS

Symmetrical return distribution or Normal distribution:

It is a return distribution that is symmetrical about its mean i.e. equal loss and gain intervals have the same frequencies. The normal distribution is a symmetrical distribution.

• A symmetrical distribution has skewness = 0

Characteristics of the normal distribution:
1) In a normal distribution, mean = median.
2) A normal distribution is completely described by two parameters i.e. its mean and variance.

Skewed distribution: The distribution that is not symmetrical around the mean is called skewed.

a) Positively skewed or right-skewed Distribution: It is a return distribution that reflects frequent small losses and a few extreme gains i.e. limited but frequent downside.

• It has a long tail on its right side.
• It has skewness > 0.
• In a positively skewed unimodal distribution, mode < median < mean.
• Generally, investors prefer positive skewness (all else equal).

b) Negatively skewed or left-skewed Distribution: It is a return distribution that reflects frequent small gains and a few extreme losses i.e. limited but frequent upside.

• It has a long tail on its left side.
• It has skewness < 0.
• In a negatively skewed unimodal distribution, mean < median < mode.
Sample skewness (for large values of n, i.e. n ≥ 100) is computed as follows:

SK ≈ (1 / n) × [Σ (Xi – X̄)³] / s³, with the sum taken over i = 1 to n

where,
n = number of observations in the sample
s = sample S.D.

Note: Cubing in the formula preserves the sign of the deviation from the mean.

11.1 The Shape of the Distributions: Kurtosis

Kurtosis is used to identify how peaked or flat the distribution is relative to a normal distribution.

Leptokurtic: It is a distribution that is more peaked (i.e., greater number of observations closely clustered around the mean value) and has fatter tails (i.e., greater number of observations with large deviations from the mean value) than the normal distribution.

• It has more frequent extremely large deviations from the mean than a normal distribution.
• Ignoring fatter tails in analysis results in underestimation of the probability of extreme outcomes.
• The more leptokurtic the distribution is, the higher the risk.

Platykurtic: It is a distribution that is less peaked than normal.

Mesokurtic: It is a distribution that is identical to the normal distribution.

The sample excess kurtosis (for larger sample size n) is computed as:

KE ≈ { (1 / n) × [Σ (Xi – X̄)⁴] / s⁴ } – 3, with the sum taken over i = 1 to n

• For a normal distribution (mesokurtic), kurtosis = 3.0.
• For a leptokurtic distribution, kurtosis > 3.
• For a platykurtic distribution, kurtosis < 3.

NOTE: Kurtosis is free of scale (i.e., it has no units of measurement).

It is always a positive number because the deviations are raised to the 4th power.

Excess kurtosis = Kurtosis – 3

• A normal or mesokurtic distribution has excess kurtosis = 0.
• A leptokurtic distribution has excess kurtosis > 0.
• A platykurtic distribution has excess kurtosis < 0.

Practice: Example 21, CFA Program Curriculum, Volume 1, Reading 2.
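The large-sample skewness and excess kurtosis approximations can be sketched as below. Both data sets are hypothetical and far smaller than the n ≥ 100 for which the approximations are intended, so the numbers only illustrate the mechanics: cubing keeps the sign of each deviation, while the fourth power makes every term non-negative.

```python
import math

def sample_sd(xs):
    mean = sum(xs) / len(xs)
    return math.sqrt(sum((x - mean) ** 2 for x in xs) / (len(xs) - 1))

def skewness(xs):
    # (1/n) * sum of cubed deviations, scaled by the cubed sample S.D.
    n = len(xs)
    mean = sum(xs) / n
    return (1 / n) * sum((x - mean) ** 3 for x in xs) / sample_sd(xs) ** 3

def excess_kurtosis(xs):
    # (1/n) * sum of deviations to the 4th power, scaled by s^4, minus 3
    n = len(xs)
    mean = sum(xs) / n
    return (1 / n) * sum((x - mean) ** 4 for x in xs) / sample_sd(xs) ** 4 - 3

symmetric = [1, 2, 3, 4, 5]      # hypothetical, symmetric about its mean
right_tail = [1, 1, 1, 2, 10]    # hypothetical, one large positive outlier

print("skewness (symmetric):", skewness(symmetric))
print("skewness (right tail):", round(skewness(right_tail), 4))
print("excess kurtosis (symmetric):", round(excess_kurtosis(symmetric), 4))
```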

12. CORRELATION BETWEEN TWO VARIABLES

Correlation measures the linear relationship between two variables.

First, determine how the two variables vary together, i.e., their covariance.

Sample covariance measures how two variables in a sample move together i.e., it measures the joint variability of two random variables.

The sample covariance is calculated as:

sXY = [Σ (Xi – X̄)(Yi – Ȳ)] / (n – 1), with the sum taken over i = 1 to n

where,
n = sample size
Xi = ith observation on variable X
X̄ = mean of the variable X observations
Yi = ith observation on variable Y
Ȳ = mean of the variable Y observations

Positive Covariance: When both variables tend to move in the same direction, they are referred to as positively correlated and have positive covariance.

Negative Covariance: When both variables tend to move in opposite directions, they are referred to as negatively correlated and have negative covariance.

Covariance can range from –∞ to +∞.

The covariance number doesn’t tell if the relationship between two variables is strong or weak. It only tells the direction of the relationship.

Correlation coefficient measures the direction and strength of linear association between two variables. The correlation coefficient between two assets X and Y can be calculated using the following formula:

rXY = (covariance of X and Y) / [(sample standard deviation of X) × (sample standard deviation of Y)]

= covXY / (sX × sY)

or

r = cov(x, y) / [√var(x) × √var(y)]

NOTE:
Unlike Covariance, Correlation has no unit of measurement; it is a simple number.

Example:
CovXY = 47.78, sX² = 40, sY² = 250

r = 47.78 / √[(40)(250)] = 0.478

12.1 Properties of Correlation

1. The correlation coefficient can range from –1 to +1.
2. Two variables are perfectly positively correlated if the correlation coefficient is +1.
3. Correlation coefficient of –1 indicates a perfect inverse (negative) linear relationship.
4. When the correlation coefficient equals 0, there is no linear relationship.
5. The closer the correlation coefficient is to +1 or –1, the stronger the relationship.

A scatter plot is a useful tool for a sensible interpretation of a correlation coefficient, as it shows the relationship graphically.

Refer to Exhibit 51, Reading 2, CFA Curriculum for “Scatter Plots Showing Various Degrees of Correlation”.

Practice: Example 22, CFA Program Curriculum, Volume 1, Reading 2.

12.2 Limitations of Correlation Analysis

1. Linearity: Correlation only measures linear relationships properly.

2. Outliers: Correlation may be an unreliable measure when outliers are present in one or both of the series.

3. No proof of causation: Based on correlation we cannot assume x causes y; there could be a third variable causing change in both variables.

4. Spurious Correlations: Spurious correlation is a correlation in the data without any causal relationship. This may occur when:

i. two variables have only chance relationships.
ii. two variables that are uncorrelated appear correlated when each is mixed with a third variable.
iii. correlation between two variables results from a third variable.

NOTE: Spurious correlation may suggest investment strategies that appear profitable but actually would not be so, if implemented.

5. Correlation does not tell the whole story: Knowing two variables’ means, standard deviations and their correlation does not tell the whole story.

For details, refer to the Anscombe’s Quartet case, Exhibit 55, Reading 2, CFA Program Curriculum.

Practice: End of Chapter Questions from CFA Institute’s Curriculum & FinQuiz Question-bank.
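The sample covariance and correlation formulas from this section can be checked with a short sketch. The first part reuses the figures from the worked example (covariance 47.78, variances 40 and 250); the two small series are hypothetical and were chosen so that the correlations are exactly +1 and –1. Function names are illustrative.

```python
import math

def sample_covariance(xs, ys):
    # Sum of cross-products of deviations, divided by (n - 1)
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)

def correlation(xs, ys):
    # Covariance divided by the product of the two sample standard deviations
    sd_x = math.sqrt(sample_covariance(xs, xs))  # cov(X, X) is the variance of X
    sd_y = math.sqrt(sample_covariance(ys, ys))
    return sample_covariance(xs, ys) / (sd_x * sd_y)

# Worked example from the notes: cov = 47.78, var(X) = 40, var(Y) = 250
r_example = 47.78 / math.sqrt(40 * 250)
print(f"r from the example = {r_example:.4f}")

# Hypothetical series with a perfect positive and a perfect inverse relationship
x = [1, 2, 3, 4, 5]
y_up = [2, 4, 6, 8, 10]
y_down = [10, 8, 6, 4, 2]
print("covariance(x, y_up):", sample_covariance(x, y_up))
print("correlation(x, y_up):", round(correlation(x, y_up), 6))
print("correlation(x, y_down):", round(correlation(x, y_down), 6))
```

Rescaling either series changes the covariance but leaves the correlation unchanged, which is why correlation, and not covariance, indicates the strength of the linear relationship.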
