Professional Documents
Culture Documents
November 2019
Sponsored by
FDP Institute Study Guide
November 2019
Sponsored by
Copyright © FDP Institute, 2019
Outline of FDP Exam Study Guide Contents
Introduction to FDP Program ............................................................................................................................... 3
The FDP is a self-study program designed to educate financial professionals about the
emerging field of data science and its practical applications in the financial sector, and to
award the FDP charter to candidates who complete its exam successfully. The FDP curriculum
provides breadth and depth by first requiring candidates to complete the prerequisites of the
FDP program, which consist of one or more short online classes covering the foundations of
mathematics, statistics, and Python programming. Candidates who satisfy the prerequisites
will be able to sit for the FDP exam using the readings and learning objectives discussed in
this document.
The university faculty and industry practitioners who have helped create the FDP Charter
program bring years of experience in the financial services industry. Consequently, the
curriculum is consistent with recent advances in the applications of data science to the
financial industry.
Passing the FPD examination is an important accomplishment and will require a significant
amount of preparation. All candidates will need to study and become familiar with the FDP
curriculum material in order to develop the knowledge and skills necessary to be successful on
examination day.
This study guide is organized to facilitate quick learning and easy retention. Each topic is
structured around learning objectives that define the content to be tested on the exam.
The learning objectives are an important way for candidates to organize their study, as they
form the basis for examination questions. All learning objectives reflect the content in the
FDP curriculum and all exam questions are written to directly address the learning objectives.
A candidate who is able to meet all learning objectives in the study guide should be well
prepared for the exam. For these reasons, we believe that the FDP Institute has built a
rigorous program with high standards, while also maintaining an awareness of the value of
candidates’ time.
Candidates for the FDP exam are required to have completed the online prerequisite classes
and therefore are assumed to have an understanding of the central concepts
of quantitative analysis. In addition, since the FDP program is designed for finance
professionals, it is assumed that candidates have an understanding of finance. This
includes awareness of the roles and characteristics of various financial institutions and
instruments as well as the financial models employed by these institutions to value the
instruments and measure risk. These concepts are covered in CAIA®, CFA®, and FRM® exams,
and in dedicated undergraduate or graduate courses covering financial markets, investments,
and risk management.
Data Science
Understanding data science begins with three basic areas:
•M
athematics, probability and statistics: Data scientists use concepts from these disciplines
to explore data, develop and test a hypothesis, and present the results to end users.
•C
omputer programming: Coding allows a data scientist to manipulate large data sets,
calculate critical variables needed for statistical analysis and create output that is useful to
decision makers.
•D
omain knowledge: This distinguishes a data scientist from a statistician or a computer
programmer. A data scientist must be able to formulate the questions that are critical to
a data-driven decision-making process, identify the variables that will enter the algorithm,
and present the results in a manner that is most useful to decision makers.
Finance Math/Stat
FDP
Coding
FDP candidates can satisfy the prerequisite requirements by completing a set of short online
classes offered by a select group of online educational organizations approved by the FDP
Institute. Only one set of classes offered by these providers must be completed to meet the
pre-requisite requirements of the FDP program. Depending on the candidate’s background, it
is estimated to take 25-60 hours to complete the prerequisites. No programming background
is required to complete the pre-requisite requirement.
As of now, the following organizations have been designated to satisfy a set of prerequisites to
the core topics of data science and its applications in the financial services industry.
• Datacamp: https://www.datacamp.com/
• Dataquest: https://www.dataquest.io/
• Metis: https://www.thisismetis.com/
The list of prerequisite classes for each approved online provider appears on FDP Institute’s
website as well as in this document.
The approved online courses offered by Dataquest and Datacamp are available as soon as
a candidate registers on their respective sites. The approved course offered through Metis
is offered once a month throughout the year. All three providers offer limited free access to
their classes. Candidates should take advantage of the limited free access to determine which
platform’s approach is more suited to their needs. Candidates cannot mix and match classes
from different providers. Finally, the approved classes offered by the three platforms assume
no prior knowledge of Python language or any specific computer programming language.
Datacamp
Candidates can access Datacamp classes through their website at https://www.datacamp.com/.
Candidates are responsible for the cost of classes offered at Datacamp. Candidates are
encouraged to take advantage of limited free access offered by Datacamp to evaluate its
method of teaching. The classes listed below are short and, depending on the candidate’s
background, each one can be completed within 3-7 hours.
1. Introduction to Python
https://www.datacamp.com/courses/intro-to-python-for-data-science
An introduction to the basic concepts of Python. Candidates will learn how to use Python
both interactively and through a script. Candidates will learn about NumPy, a very important
Python package, variables, and Python’s basic data types.
3. P
andas Foundations
https://www.datacamp.com/courses/pandas-foundations
Pandas dataframes are the most widely used representation of complex data collections
within Python. Whether in finance, scientific fields, or data science, familiarity with Pandas
is essential. This class teaches candidates to work with real-world data sets containing both
string and numeric data, often structured around time series.
Datacamp continued
6. Statistical thinking with Python I
https://www.datacamp.com/courses/statistical-thinking-in-python-part-1
In this class, candidates will start building the foundation they need to think statistically.
This refresher class on statistics will use Python to demonstrate the applications of
important statistical techniques.
Dataquest
Candidates can access Dataquest classes through their website at https://www.dataquest.io/.
Candidates are responsible for the cost of classes offered at Dataquest. Candidates are
encouraged to take advantage of limited free access offered by Dataquest to evaluate its
method of teaching. The classes listed below are short and, depending on the candidate’s
background, each one can be completed within 3-7 hours.
5. Statistics Fundamentals
https://www.dataquest.io/course/sql-fundamentals/
This class builds on the prior knowledge of data analysis and helps candidates to work with
more advanced tools offered by Python. The course uses Python to provide an introductory
refresher course in statistics.
Dataquest continued
7. Probability and Statistics in Python: Intermediate
https://www.dataquest.io/course/probability-statistics-intermediate/
This class will complete the refresher classes in probability and statistics. Candidates will
learn to use Python to perform basic statistical techniques that most candidates are likely
to be familiar with already.
Metis
Candidates can access the Metis course through their website at https://www.thisismetis.com/.
Candidates are responsible for the cost of the course offered at Metis. Candidates are
encouraged to take advantage of free sample videos that are offered by Metis to evaluate its
method of teaching.
Unlike the classes offered through Dataquest and Datacamp, which consist of pre-recorded
videos and texts, Metis offers a live online classes with dedicated instructors who are ready
to answer your questions during the live sessions as well as later during office hours. Further,
while Dataquest and Datacamp classes can be taken at any time by a candidate, Metis’s live
classes are offered monthly throughout the year. Enrolled candidates will be able to watch a
video of the class should they miss a session. The approved prerequisite course offered by
Metis lasts 6 weeks.
The single course offered by Metis and required as prerequisite for the FDP exam is titled
Beginner Python and Math for Data Science, and it consists of the following 6 topics, which is
usually covered over 12 live sessions.
https://www.thisismetis.com/courses/beginner-python-and-math-for-data-science
1. P
ython Basics
Candidates are introduced to programming in Python. Candidates will learn about
Jupyter Notebooks – a popular platform for running Python programs. This part of the course
will cover the basics of programming, including data structures, data operations, if else
statements, for and while loops, and logical operations.
2. Python Advanced
This segment of the course covers advanced functionality in Python, including functions,
debugging, error handling, string manipulations, and writing efficient code.
4. Linear Algebra
This segment of the course is a refresher in linear algebra. It will cover the fundamentals
of linear algebra, including vectors, and vector manipulations, matrices and matrix
manipulations, linear equations and solutions, eigenvalues and eigenvectors.
Metis continued
5. Calculus and Probability
This module is a refresher in the fundamentals of calculus. It reintroduces students to
such central concepts of calculus such as derivatives, integrals, determining local maximum
and minimum, and limits. In addition, the module provides a refresher on central concepts
of probability such as random variables, mean, variance, probability mass and density
functions, and cumulative distribution functions.
6. Statistics
This final refresher module covers a few important statistical concepts such as ANOVA,
hypothesis testing and p-value, and confidence intervals.
FDP Examination
The FDP examination, administered twice annually, is a four-hour computer-administered
examination that is offered at test centers throughout the world. The FDP examination is
comprised of 75 multiple choice questions weighted as 60% of the total points and two to
three constructed response questions (multi-part essay type) weighted as 40% of the total
points. The FDP exam will not contain any Python programming questions.
The FDP examination is based on its study guide, which is organized to facilitate quick
learning and easy retention. Each topic is structured around learning objectives and keywords
that define the content to be tested on the exam. The learning objectives and keywords are an
important way for candidates to organize their study, as they form the basis for examination
questions. All learning objectives reflect content in the FDP curriculum and all examination
questions are written to directly address the learning objectives.
For additional information about the FDP examination please see the Candidate’s Handbook,
which can be found on FDP Institute website.
6. Data Mining & Machine Learning: Representing & Mining Text 5-10
7. Big Data, Data Mining & Machine Learning: Ethical & Privacy Issues 5-10
Big Data, Data Mining & Machine Learning: Ethical & Privacy Issues
o Institute of Business Ethics. (2016, June). Business Ethics and Big Data (IBE Issue 52).
London, U.K.
o Institute of Business Ethics. (2018, January). Business Ethics and Artificial Intelligence
(IBE Issue 58). London, U.K.
o Institute of Business Ethics. (2018, May). Beyond Law: Ethical Culture and GDPR (IBE Issue
62). London, U.K.
Learning Objectives
Topic 1. Introduction to Data Science & Big Data
Readings
1. Provost, F. and T. Fawcett. (2019). Data Science for Business. Sebastopol, CA: O’Reilly
Media Inc. Chapters 1& 2. Retrieved from http://data-science-for-biz.com/
https://www.amazon.com/Data-Science-Business-Data-Analytic-Thinking-ebook/dp/
B00E6EQ3X4
2. Getman, G. and Q. Hasan. (2019). Big Data is a Big Deal: An investor’s guide to the
applications and challenges of alternative data, New York, NY: Lombard Odier, Retrieved
from https://www.lombardodier.com/contents/news/world-in-transition/2019/april/big-
data-is-a-big-deal.html
3. Kaul, S. (2016). Chapter 8. Big Data and Investment Management. In Citi GPS: Global
Perspectives & Solutions, (pages 58-65). Retrieved from
https://www.citivelocity.com/citigps/investment-themes-2016/
4. Guida, T. (2019). Big Data and Machine Learning in Quantitative Investments. West Sussex,
UK: John Wiley & Sons Ltd. Chapters 2, 4 & 5.
https://www.amazon.com/Machine-Learning-Quantitative-Investment-Finance/
dp/1119522196
Keywords
Data mining (p. 2) Data science (p. 4)
Churn (p. 4) Data-driven decision making (p. 5)
Data engineering (p. 5, 7) Data-analytic thinking (p. 12)
Target (p. 24) Label (p.24)
Unsupervised data mining (p. 24) Supervised data mining (p. 25)
Learning Objectives
Demonstrate proficiency in the following areas:
Reading 2. Getman, G. and Q. Hasan. (2019). Big Data is a Big Deal: An investor’s guide to
the applications and challenges of alternative data, New York, NY: Lombard Odier
Keywords
Alternative data (p. 5) Discretionary (p. 9)
Big Data (p. 5) Quantitative (p. 9)
Data Science (p. 5) Financial Data (p. 9)
Data Analytics (p. 5) Market Data (p. 9)
Moore’s Law (p. 5) Tagging (p. 11)
Bezos Law (p. 5) Incrementalism (p.12)
Unstructured (p. 7)
Learning Objectives
Demonstrate proficiency in the areas of:
3) Challenges and unique skill sets needed to translate big data into actionable insights
For example:
o Describe barriers to entry and the learning curve.
o Explain the need to critically evaluate data sources.
o Discuss three major facets of refining data.
o Compare discretionary and quantitative managers in the context of aligning investment
culture with big data requirements.
4) Implementation and costs involved in utilizing alternative data in the investment process
For example:
o Describe considerations in developing realistic expectations of the benefits of blending
artificial intelligence with discretionary decision making.
o Discuss resource requirements and methods of reducing their costs.
Reading 3. Kaul, S. (2016). Chapter 8. Big Data and Investment Management. In Citi GPS:
Global Perspectives & Solutions, (pages 58-65).
Keywords
Quantitative fundamental analysis (p. 60, 62)
Keywords
Alternative data life cycle (p.145) Quantamental investing (p. 127)
Fundamental prediction (p.124) Exhaust data (p.151)
Fundamental law of active management (p.127) Nowcasts (p.151)
Quant quake (p.110)
Learning Objectives
Demonstrate proficiency in the following areas:
1) Taming big data (Ch. 2)
For example:
o Discriminate between alternative data and big data.
o Contrast drivers of adoption of alternative data with its challenges in the
investment community.
o Identify the largest categories of alternative data types in use today.
o Evaluate the usefulness of an alternative data set.
o Describe the likely attributes that differentiate alternative data sets in terms of cost.
o Discuss some of the biggest alternative data trends.
TABLE 4.1
Average annualized return of dollar-neutral, equally-weighted portfolios of liquid US equities
Readings
1. James, G., D. Witten, T. Hastie and R. Tibshirani. (2013). An Introduction to Statistical
Learning: with applications in R. New York, NY: Springer. Chapters 1, 2.1 & 2.2.
h ttp://www-bcf.usc.edu/~gareth/ISL/
https://www.amazon.com/Introduction-Statistical-Learning-Applications-Statistics/
dp/1461471370
2. Nielsen, M. A. (2015). Using Neural Networks to Recognize Handwritten Digits. In Neural
Networks and Deep Learning, Determination Press. Retrieved from
http://neuralnetworksanddeeplearning.com/chap1.html
Keywords
Statistical learning (p. 1) Cross validation (p. 33)
Classification problems (p.28) Expected test MSE (p. 34)
Semi-supervised learning (p. 28) Bias (p. 35)
Quantitative variables (p. 28) Bias-variance trade-off (p. 36)
Qualitative response (p. 28) Error rate (p.37)
Binary response (p.28) Indicator variable (p. 37)
Regression (p. 28) Training error (p. 37)
Predictors (p. 29) Test error (p.37)
Mean squared error (MSE) (p. 29) Bayes classifier (p. 37)
Training MSE (p. 30) Conditional probability (p. 37)
Test data (p. 30) Bayes decision boundary (p. 38)
Test MSE (p. 30) Bayes error rate (p. 38)
Degrees of freedom (p. 32) K-nearest neighbors (p. 39)
Learning Objectives
Demonstrate proficiency in the following areas:
1) O
rganization and resources of the book An Introduction to Statistical Learning: with
applications in R (Ch. 1).
This chapter is assigned to facilitate your studies but no exam questions will be drawn
from this chapter.
Keywords
Perceptron neurons (p. 3) Feedforward neural networks (p. 17)
Weights (p.4) Recurrent networks (p. 17)
Threshold value (p. 4) Cost function (p. 24)
Layer (p. 6) Loss function (p. 24)
Bias (p. 6) Objective function (p. 24)
NAND gate (p. 7) Quadratic cost function (p. 25)
Input layer (p. 9) Mean Squared error (MSE) (p. 25)
Learning algorithms (p. 10) Gradient descent algorithm (p. 25)
Sigmoid neuron (p. 11) Learning rate (p. 30)
Sigmoid function (p. 12) Stochastic gradient descent (p. 34)
Activation function (p. 14) Mini-batch (p. 34)
Input neurons (p. 16) Epoch (p. 35)
Output neurons (p. 16) Validation set (p. 37)
Hidden layer (p. 16) Hyper-parameters (p. 37)
Multilayer perceptrons (p. 16) Deep neural networks (p.55)
Learning Objectives
Demonstrate proficiency in the following areas:
1) Motivation for using neural nets to recognize handwritten digits
For example:
o Explain the use of a training set as an alternative to a rules-based program to
recognize digits.
2) Perceptron neurons
For example:
o Identify a modern version of a perceptron that is in common use today.
o Identify an equation that describes how perceptron neurons work in its simplest form.
o Describe the intuition of a perceptron as a decision-making device.
oD
escribe a perceptron as a NAND gate and what it implies for perceptron networks
with respect to computing logical functions.
o Explain how perceptron neurons are more than new types of NAND gates.
Topic 3. Data Mining & Machine Learning: Regression, LASSO, Predictive Models,
Times Series & Tree Models
Readings
1. Provost, F. and T. Fawcett. (2019). Data Science for Business. Sebastopol, CA: O’Reilly
Media Inc., Chapters 3, 4, 5, & 9.
2. James, G., D. Witten, T. Hastie and R. Tibshirani. (2013). An Introduction to Statistical
learning: with applications in R. New York, NY: Springer. Chapters 3 & 6.
3. Aas, K. and X. K. Dimakos. (2004). Statistical modeling of financial time series:
An introduction. Oslo Norway: Norwegian Computing Center. Retrieved from
https://www.nr.no/files/samba/bff/SAMBA0804.pdf
Reading 1. Provost, F. and T. Fawcett. (2019). Data Science for Business. Sebastopol, CA:
O’Reilly Media Inc., Chapters 3, 4, 5, & 9.
Keywords
Information (p. 43) Linear discriminant (p. 86)
Tree induction (p. 44) Support vector machine (p. 92)
Predictive model (p. 45) Hinge-loss (p. 94)
Prediction (p. 45) Zero-one loss (p. 95)
Descriptive modeling (p. 46) Squared error (p. 95)
Target variable (p. 46) Odds ratio (p. 98)
Attributes or features (p. 46) Log-odds (p. 99)
Model induction (p. 47) Logistic function (p. 101)
Deduction (p. 47) Generalization (p. 112)
Training data (p. 47) Overfitting (p. 113)
Labeled data (p. 47) Fitting graph (p. 113)
Supervised segmentation (p. 48) Holdout data (p. 113)
Information gain (p. 51) Base rate (p. 115)
Entropy (p. 51) Sweet spot (p. 117)
Parent set (p. 52) Cross-validation (p. 126)
Child set (p. 52) Folds (p. 127)
Variance (p. 56) Learning curve (p. 131)
Entropy graph/chart (p. 58) Sub-training set (p. 134)
Classification tree (p. 63) Validation set (p. 134)
Decision nodes (p. 63) Nested holdout testing (p. 134)
Probability estimation tree (p. 64) Nested cross validation (p. 135)
Tree induction (p. 64) Sequential forward selection (p. 135)
Decision surface or boundary (p. 69) Sequential backward elimination (p. 135)
Frequency-based estimation of class Independent events (p. 236)
membership probability (p. 72) Joint probability using conditional
probability (p. 237)
Keywords continued
Laplace correction (p. 73) Bayes’ Rule (p. 237)
Linear classifier (p. 85) Naïve Bayes Classifier (p. 242)
Posterior probability (p. 238) Generative Model (p. 244)
Prior (p. 238) Lift (p. 244)
Likelihood (p. 240) Naïve-Naïve Bayes (p. 245)
Conditional independence (p. 241)
Learning Objectives
Demonstrate proficiency in the following areas:
1) Models, Induction and Prediction
For example:
o Define information and tree induction.
o Define prediction in the context of data science.
o Compare and contrast predictive modeling with descriptive modeling.
o Define attributes or features.
o Describe model induction.
o Compare and contrast induction with deduction.
o Define training data and labeled data.
2) S
upervised Segmentation
For example:
o Describe supervised segmentation.
o List the complications arising from selecting informative attributes.
o Define entropy and information gain.
o Recognize and apply entropy with a set containing two distinct groups.
o Recognize and apply entropy with maximum and minimum disorder.
o Contrast parent set with child set.
o Calculate information gain for a child relative to a parent.
o Discuss the issues with the numerical variables for supervised segmentation.
o Define variance and discuss its application to numeric variables for
supervised segmentation.
o Define an entropy graph/chart.
o Describe how an entropy chart can be used to select informative variable.
o Define a classification tree, decision nodes, probability estimation tree, and tree induction.
Keywords
Residual (p. 62) Power (p. 101)
Residual sum of squares (p. 63) Variance inflation factor (p. 101)
Population regression line (p. 63) Best subset selection (p. 205)
Least squares line (p. 63) Deviance (p.206)
Bias (p. 65) Forward stepwise selection (p. 207)
Unbiased (p. 65) Backward stepwise selection (p. 208)
Standard error (p. 65) Cp (p. 211)
Residual standard error (p. 66) Akaike information criterion (AIC) (p. 211)
Confidence interval (p. 66) Bayesian information criterion (BIC) (p. 211)
Null hypothesis (p. 67) Adjusted R2 (p. 211)
Alternative hypothesis (p. 67) Ridge regression (p. 215)
t-statistic (p. 67) Tuning parameter (p. 215)
R2 statistic (p. 70) Shrinkage penalty (p. 215)
Total sum of squares (p. 70) l2 norm (p. 216)
F-statistic (p. 75) Scale equivalent (p. 217)
Forward selection (p. 78) Sparse (p. 219)
Backward selection (p. 79) Dimension reduction (p. 229)
Mixed selection (p. 79) Linear combination (p. 229)
Dummy variable (p. 84) Principal component analysis (p. 230)
Additive linear (p. 86) Principal component regression (p. 233)
Hierarchical principle (p. 89) Partial least squares (p. 237)
Polynomial regression (p. 90) Low-dimensional (p. 238)
Heteroscedasticity (p. 95) High dimensional (p. 239)
Multicollinearity (p. 101) Curse of dimensionality (p. 242)
Learning Objectives
Demonstrate proficiency in the following areas:
1) Simple Linear Regression
For example:
o Define residual and RSS.
o Recognize and apply the least squares coefficient estimates.
o Interpret the least squares coeffcients.
o Define population regression line and least squares line.
4) Shrinkage methods
For example:
o Define ridge regression, tuning parameter, and shrinkage penalty.
o Define l2 norm and scale equivalent.
o Define standardizing the predictors.
o Describe bias-variance tradeoff.
o Describe the ridge regression.
o Describe how ridge regression improves upon least squares.
o Describe the advantage of Lasso over the ridge regression.
o Define a sparse model.
o Describe the variable selection property of the Lasso.
o Compare the Lasso to the Ridge regression.
o Describe how to select the tuning parameter.
Reading 3. Aas, K. and X. K. Dimakos. (2004). Statistical modeling of financial time series:
An introduction. Oslo Norway: Norwegian Computing Center.
Keywords
Arithmetic return (p. 3) Autoregressive model (p. 8)
Geometric return (p. 3) AR(1) model (p. 8)
Time resolution (p. 5) Stationarity (p. 9)
Time horizon (p. 5) Autocorrelation function (p. 10)
Random walk model (p. 7) GARCH (1,1) (p. 13)
Learning Objectives
Demonstrate proficiency in the following areas:
1) Concepts of time series
For example:
o Define arithmetic and geometric returns.
o Recognize and apply the relationship between arithmetic and geometric returns.
o Describe the shape of the plotted line when geometric returns are plotted against
arithmetic returns.
o Define time resolution and time horizon.
o Describe how time resolution and time horizon affect the distribution of financial data.
2) Statsitical models
For example:
o Describe a random walk model and an autoregressive model.
o Recognize and apply an AR(1) model.
o Recognize and apply the variances of a random walk and an autoregressive model.
o Define stationarity and autocorrelation function.
o Recognize and apply the formula for the autocorrelation function.
Readings
1. Provost, F. and T. Fawcett. (2019). Data Science for Business. Sebastopol, CA: O’Reilly
Media Inc., Chapter 6.
2. James, G., D. Witten, T. Hastie and R. Tibshirani. (2013). An Introduction to Statistical
Learning: with applications in R. New York, NY: Springer. Chapters 4 & 10.
Reading 1. Provost, F. and T. Fawcett. (2019). Data Science for Business. Sebastopol, CA:
O’Reilly Media Inc., Chapter 6.
Keywords
Euclidean distance (p. 144) Clustering (p. 164)
Nearest neighbors (p. 144) Hierarchical clustering (p. 165)
Combining function (p. 147) Dendogram (p. 165)
Weighted voting (p. 150) Linkage function (p. 167)
Manhattan distance (p. 159) Cluster center or centroid (p. 170)
Cosine distance (p. 160) k-means clustering (p. 170)
Edit distance or Levenshtein metric (p. 161) Distortion (p. 173)
Learning Objectives
Demonstrate proficiency in the following areas:
1) Similarity and distance
For example:
o Recognize and apply the formula for calculating the general Euclidian distance.
o Define nearest neighbors and combining function.
o Explain how combining function can be used for classification.
o Define weighted voting or similarity moderated voting.
o Calculate contributions for weighted voting for classification.
o Explain how k in k-NN can be used to address overfitting.
o Discuss issues with nearest-neighbor methods with focus on
• Intelligibility,
• Dimensionality and domain knowledge, and
• Computational efficiency.
Reading 2. James, G., D. Witten, T. Hastie and R. Tibshirani. (2013). An Introduction to
Statistical Learning: with applications in R. New York, NY: Springer. Chapters 4 & 10.
Keywords
Logistic function (p. 132) Loadings (p. 375)
Odds (p. 132) Bottom-up agglomerative clustering (p. 390)
Log odds (p. 132) Linkage (p. 394)
Likelihood function (p. 133) Inversion (p. 395)
Principal component analysis (p. 375)
3) Clustering methods
For example:
o Describe the algorithm for k-means clustering.
o Define bottom-up agglomerative clustering.
o Describe how to interpret a dendogram.
o Define linkage as well as the four types of linkage.
o Define inversion.
o Describe the choice of dissimilarity measure for clustering.
o Describe the practical issues with clustering.
Readings
o Provost, F. and T. Fawcett. (2019). Data Science for Business. Sebastopol, CA: O’Reilly
Media Inc., Chapters 7 & 8.
o Arnott, R., C. B. Harvey, and H. Markowitz. (2019). A Backtesting Protocol in the Era of
Machine Learning. Journal of Financial Data Science, 1(1), 64-74.
DOI: https://doi.org/10.3905/jfds.2019.1.06
o Colquhoun, D. (2014). An investigation of the false discovery rate and the misinterpretation
of p-values. Royal Society Open Science, London, U.K.: Royal Society Open Science,
Retrieved from https://royalsocietypublishing.org/doi/full/10.1098/rsos.140216
o López de Prado, M. (2019). A Data Science Solution to the Multiple-Testing Crisis in
Financial Research. Journal of Financial Data Science, 1(1), 99-110.
DOI: https://doi.org/10.3905/jfds.2019.1.099
Reading 1. Provost, F. and T. Fawcett. (2019). Data Science for Business. Sebastopol, CA:
O’Reilly Media Inc., Chapters 7 & 8.
Keywords
Accuracy (p. 189) Recall (p. 204)
Confusion matrix (p. 189) F-measure (p. 204)
False positive (p. 190) Profit curve (p. 212)
False negative (p. 190) ROC graph (p. 215)
Expected value (p. 194) AUC (p. 219)
Class prior (p. 201) Cumulative response curve (p. 219)
Precision (p. 204)
Learning Objectives
Demonstrate proficiency in the following areas:
1) Evaluating classifiers
For example:
o Define accuracy.
o Describe a confusion matrix.
o Define false positives and false negatives.
o Describe the problems with unbalanced data.
o Describe the problems with unequal costs and benefits.
3) V
isualizing model performance
For example:
o Describe a ranking classifier.
o Define a profit curve.
o Describe the ROC graph.
o Describe the four corners and the diagonal of the ROC graph.
o Describe how the ROC space can be used to evaluate classifiers.
o Define AUC.
o Describe a cumulative response curve.
Reading 2. Arnott, R., C. B. Harvey, and H. Markowitz. (2019). A Backtesting Protocol in the
Era of Machine Learning. Journal of Financial Data Science, 1(1), 64-74.
Keywords
Exaggerated positive (p. 68)
Reading 3. Colquhoun, D. (2014). An investigation of the false discovery rate and the
misinterpretation of p-values. Royal Society Open Science, London, U.K.: Royal Society
Open Science.
Keywords
Specificity (p. 2) Power (p. 4)
Sensitivity (p. 2)
Learning Objectives
Demonstrate proficiency in the following areas:
1) An investigation of the false discovery rate and the misinterpretation of p-values
For example:
o Define specificity and sensitivity.
o Describe the false discovery rate with the help of a tree diagram.
o Define power of a test.
o Describe the false discovery rate in simulated t-tests.
o Describe underpowered study.
o Describe the inflation effect in the context of false discovery.
o Describe what happens when we consider p=0.05 rather than p<=0.05.
o Describe Berger’s approach.
Keywords
Selection bias under multiple testing (p. 99)
Learning Objectives
Demonstrate proficiency in the following areas:
1) A Data Science Solution to the Multiple-Testing Crisis
For example:
o Define selection bias under multiple testing.
o Describe the three properties satisfied by the heatmap of the correlation matrix of
the returns of all trials.
o Describe clustering of trials.
o Describe cluster statistics and how it can reduce the probability of selecting a
false positive.
o Describe the implications for authors, journals, and financial firms.
Topic 6. Data Mining & Machine Learning: Representing & Mining Text
1. Provost, F. and T. Fawcett. (2019). Data Science for Business. Sebastopol, CA: O’Reilly
Media Inc., Chapter 10.
2. Jurafsky, D. and J. Martihttps. (2018). Chapter 4. Naïve Bayes and Sentiment Classification,
In Speech and Language Processing. Retrieved from https://web.stanford.edu/~jurafsky/
slp3/4.pdf
Reading 1. Provost, F. and T. Fawcett. (2019). Data Science for Business. Sebastopol, CA:
O’Reilly Media Inc., Chapter 10.
Keywords
Linguistic structure (p. 252) Term frequency (p. 254)
Dirty data (p. 252) Stemmed (p.255)
Document (p. 253) Stopwords (p. 25)
Token (p. 253) Inverse document frequency (p. 256)
Terms (p. 253) n-grams (p. 265)
Corpus (p. 253) Latent information model (p. 268)
Bag of words (p. 254) Information triage (p. 276)
Learning Objectives
Demonstrate proficiency in the following areas:
1) Broad issues involved in mining text
For example:
o Explain why text is “dirty” which makes mining text is difficult.
2) Text representation
For example:
o Understand the meaning of terms when used in the field of information retrieval.
o Describe the “bag of words” approach including the following steps:
• Measuring term frequency (TF)
• Measuring sparseness: inverse document frequency (IDF)
• Combining them: TFIDF
o Apply appropriate methods to search an example set of documents.
o Express entropy in terms of the IDF.
Keywords
Sentiment analysis (p. 1) Gold labels (p.11)
Probabilistic classifier (p. 2) Precision (p.12)
Generative classifier (p. 2) Recall (p.12)
Discriminative classifier (p. 2) F-measure (p.13)
Linear classifier (p. 5) Macroaveraging (p.13)
Sentiment lexicon (p. 9) Microaverageing (p.13)
Learning Objectives
Demonstrate proficiency in the following areas:
1) Classification
For example:
o Describe common applications of classifying text.
o Describe tasks often involved in classifying text.
o Compare alternative methods of classification.
1. Institute of Business Ethics. (2016, June). Business Ethics and Big Data (IBE Issue 52).
London, U.K.
2. Institute of Business Ethics. (2018, January). Business Ethics and Artificial Intelligence
(IBE Issue 58). London, U.K.
3. Institute of Business Ethics. (2018, May). Beyond Law: Ethical Culture and GDPR
(IBE Issue 62). London, U.K.
Reading 1. Institute of Business Ethics. (2016, June). Business Ethics and Big Data (IBE Issue
52). London, U.K.
Key Terms
Data trust deficit (p. 2) Veracity (p. 6)
Learning Objectives
Demonstrate proficiency in the following areas:
1) Big data for business
For example:
o Discuss the potential and concerns of big data for business.
o Explain how the new term “data trust deficit” developed.
2) Ethical issues
For example:
o List five methods of protecting human rights in the ‘Era of Big Data.’
o Provide an example of a concern for each of three main areas of privacy issues:
customer profiling, group privacy and data security.
o Discuss what constitutes informed consent.
o Provide an example of how to improve the veracity of data.
Keywords
Artificial intelligence (p. 1) Code of ethics (p. 6)
Learning Objectives
Demonstrate proficiency in the following areas:
1) The nature of and business risks of artificial intelligence (AI)
For example:
o List three main features characterizing artificial intelligence.
o List three immediate risks of artificial intelligence.
Keywords
General Data Protection Regulation (p. 1) People risk (p. 3)
Learning Objectives
Demonstrate proficiency in the following areas:
1) General Data Protection Regulation (GDPR)
For example:
o Describe the primary purpose of the GDPR.
o Describe the key changes in data protection regulation including the meaning of
• rights of the individual
• informed consent
• notification
• data portability
• supervision and enforcement, and
• liability
Readings
1. Financial Stability Board. (2017) Artificial Intelligence and Machine Learning in Financial
Services: Market Developments and Financial Stability Implications. Retrieved from
http://www.fsb.org/wp-content/uploads/P011117.pdf
2. Phoon, K. and F. Koh. (2018). Robo-Advisors and Wealth Management. Journal of
Alternative Investments, 20(3), 79-94. DOI: https://doi.org/10.3905/jai.2018.20.3.079
3. Monk, A., M. Prins, and D. Rook. (2019). Rethinking Alternative Data in Institutional
Investment. Journal of Financial Data Science, 1(1), 14-31. DOI: https://doi.org/10.3905/
jfds.2019.1.1.014
4. Simonian, J., C. Wu, D. Itano and V. Narayanan. (2019). A Machine Learning Approach to
Risk Factors: A Case Study Using the Fama–French–Carhart Model. Journal of Financial
Data Science, 1(1), 32-44. DOI: https://doi.org/10.3905/jfds.2019.1.032
5. Rasekhschaffe, K. and R. Jones. (2019). Machine Learning for Stock Selection.
Financial Analyst Journal, Forthcoming. Retrieved from https://papers.ssrn.com/sol3/
papers.cfm?abstract_id=3330946
6. Gu, S., B. Kelly, and D. Xiu. (2018). Empirical Asset Pricing via Machine Learning.
Retrieved from https://papers.ssrn.com/sol3/papers.cfm?abstract_id=3281018
7. h
ttps://www.bryankellyacademic.org/ http://dachxiu.chicagobooth.edu/download/ML.pdf
The paper posted on the FDP Institute’s website will be the version used for
exam questions.
8. López de Prado, M. (2018). The 10 Reasons Most Machine Learning Funds Fail. The Journal
of Portfolio Management, 44 (6) 120-133; DOI: https://doi.org/10.3905/jpm.2018.44.6.120
9. Harvey, C. R. and Y. Liu (2014). Evaluating Trading Strategies. [Special 40th Anniversary
Issue]. The Journal of Portfolio Management, 40(5), 108-118. DOI: https://doi.org/10.3905/
jpm.2014.40.5.108
10. Guida, T. (2019). Big Data and Machine Learning in Quantitative Investments. West Sussex,
UK: John Wiley & Sons Ltd. Chapter 10.
11. Das, S., S. Kim and B. Kothari. (2019). Zero-Revelation RegTech: Detecting Risk through
Linguistic Analysis of Corporate Emails and News. The Journal of Financial Data Science,
1(2), 8-34. DOI: https://doi.org/10.3905/jfds.2019.1.2.008
Keywords
Big data (p. 4) Fraud detection (p. 11)
Artificial intelligence (p. 4) RegTech (p. 11)
Natural language processing (p. 5) InsurTech (p. 13)
Machine learning (ML) (p. 4) Chatbots (p. 14)
Supervised learning (p. 5) Know your customer (KYC) (p. 20)
Unsupervised learning (p. 5) SupTech (p. 21) Reinforcement learning (p. 5) Auditability (p.33)
Deep learning (p. 5) Fintech (p. 35)
Sentiment indicators (p. 10) Rob-advisors (p.35)
Trading signals (p. 11) Tonality analysis (p.36)
Learning Objectives
Demonstrate proficiency in the following areas:
1) Categories of machine learning algorithms
For example:
o Define four categories of machine learning algorithms based on the degree of
human intervention.
o Describe the role of machine learning algorithms in determining causality vs correlation.
o Define ‘augmented intelligence’.
4) The micro effects of uses of artificial intelligence and machine learning in the
financial sector.
For example:
o Describe the uses of artificial intelligence and machine learning in information gathering
and processing their potential impacts on financial markets.
o Describe the uses of artificial intelligence and machine learning in improving efficiency
of financial institutions.
o Describe the uses of artificial intelligence and machine learning by financial institutions
and their potential impacts on customers and investors.
5) The macro effects of uses of artificial intelligence and machine learning in the financial sector
For example:
o Describe economic growth and enhanced economic efficiency that could result from
the applications of artificial intelligence and machine learning to financial services.
o Describe the implications of uses of artificial intelligence and machine learning by
financial institutions for market concentration and systemic importance of
those institutions.
o Describe how the uses of artificial intelligence and machine learning by financial
institutions could be sources of greater stability and vulnerability in financial markets.
o Describe how the uses of artificial intelligence and machine learning by insurance
industry could affect both moral hazard and adverse selection problems.
o Describe challenges posed by the lack of interpretability or auditability in applications
of artificial intelligence and machine learning in the financial industry.
Keywords
Fintech (p. 79) D2C platforms (p. 86)
Robo-advisor (p. 80) Hybrid (p. 86)
Work-flow (p. 83) B2B platforms (p. 86)
Learning Objectives
Demonstrate proficiency in the following areas:
1) Robo-advisors, their characteristics and services offered by them
For example:
o List the three areas of passive strategies.
o Compare and contrast North American and Asian robo-advisors.
o Describe the generic workflow employed by robo-advisors to provide financial advice.
o List the key characteristics of robo-advisors.
o List the three main areas of private wealth management.
o Describe areas of private wealth management that robo-advisors could be more effective.
• Investment advisory and asset management
• Retirement planning
• Estate planning
• Tax planning
Reading 3. Monk, A., M. Prins, and D. Rook. (2019). Rethinking Alternative Data in
Institutional Investment. Journal of Financial Data Science, 1(1), 14-31.
Keywords
Alternative data (p. 14) Volume (p. 21)
Social media (p. 14) Velocity (p. 21)
Microdata (p. 14) Variety (p. 21)
Data exhaust (p.14) Veracity (p. 21)
Rivalry (p. 16) Granularity (p. 21)
Excludability (p.16) Relationality (p. 21)
Defensive strategies (p. 17) Flexibility (p. 21)
Defensible strategies (p. 18) Actionability (p. 22)
Operational alpha (p. 19) Excludable (p. 28)
Aggregation (p. 19) Data hoarding (p. 29)
Disaggregation (p. 19)
Keywords
Factors (p. 32) CART (p. 34)
Linear (p. 32) Binary recursive partitioning (p. 34)
Nonlinear (p. 32) Bagging (p. 34)
Random forest (p. 33) Out-of-bag data (p. 34)
Supervised (p. 34) Feature importance (p. 35)
Unsupervised (p. 34) Mean decrease accuracy (p. 34)
Root node (p. 34) Mean decrease accuracy (p. 34)
Decision node (p. 34) Fama-French-Carhart (p. 37)
Terminal node (p. 34) Probabilistic Sharpe ratio (p. 42)
Keywords
Machine learning (p. 3) Forecast horizon (p. 10)
Overfitting (p. 5) Bagging (p. 11)
Signal-to-noise ratio (p. 4) Boosting (p. 11)
Ensemble algorithms (p. 6) Fundamental factors (p. 13)
Feature engineering (p. 8) Technical factors (p. 11)
Learning Objectives
Demonstrate proficiency in the following areas:
1) The applications of machine learning algorithms to stock selection
For example:
o Describe the role of signal to noise ratio in creating overfitted models.
o Discuss the implications of the paper’s findings with regard to in-sample versus
out-of-sample errors as the number of boosting iterations increase.
Keywords
Machine Learning (p. 2) I Intrinsic error (p. 15)
Regularization (p. 2) Terminal nodes (p. 16)
Mean squared error (p. 9) Impurity (p. 16)
Ordinary least squares (p. 9) Weak learners (p. 17)
Heavy tails (p. 11) Boosting (p. 18)
Huber loss function (p. 12) Random forest (p. 18)
Penalized linear models (p. 11) Gradient boosted regression trees (p. 18)
Loss function (p. 11) Neural network (p. 19)
Penalty function (p. 11) Hidden layers (p. 19)
Net elastic (p. 11) Feed-forward networks (p. 19)
Hyperparameters (p. 12) Input layer (p. 20)
Tuning parameters (p. 12) Output layer (p. 20)
LASSO (p. 13) ReLU function (p. 21)
Principal components (p. 13) Stochastic gradient descent (p. 21)
Partial least squares (p. 13) Early stopping (p. 22)
Approximation error (p. 15) Sharpe ratio (p. 36)
Estimation error (p. 15)
Learning Objectives
Demonstrate proficiency in the following areas:
1) Applications of machine learning algorithms to empirical asset pricing
For example:
o Describe the three components of the definition of machine learning.
o Describe the three aspects of empirical asset pricing model that makes it attractive
for the applications of machine learning algorithms.
Keywords
Backtesting (p. 122) Precision (p. 128)
Volume clock (p. 123) Recall (p. 128)
Dollar bars (p. 123) F1-score (p. 128)
Stationary (p. 123) Walk-forward approach (p. 129)
Integer differentiation (p. 123) Leakage (p. 129)
Fractional differentiation (p. 124) Probabilistic Sharpe ratio (p. 132)
Triple barrier method (p. 127) Deflated Sharpe ratio (p. 132)
Keywords
T-statistics (p. 110) BHY hurdle (p. 112)
Family-wise error rate (p. 111) Type I error (p. 113)
False discovery rate (p. 111) Type II error (p. 113)
Bonferroni test (p. 112) p-value (p. 114)
Holm test (p. 112)
Keywords
Mainstream (p. 336) Part of Speech Tagging (p. 349)
Primary source (p. 336) Stemming (p. 350)
Social media (p. 337) Lemmatization (p. 350)
Sentiment analysis (p. 339) Naïve Bayes (p. 355)
Natural language processing (p.347) FNN (p. 363)
Tokenization (p. 348) RNN (p. 363)
Word filter (p. 348) CNN (p. 363)
Keywords
Natural language processing (p.8) Episode (p. 11)
RegTech (p. 8) Topic Net sentiment (p. 16)
Fintech (p. 11) Polarity (p. 16)
Fabula (p. 11) Disagreement (p. 16)
Syuzhet (p. 11) Term document matrix(TDM) (p.29)
Topic modeling (p. 11) Document term matrix (DTM) (p. 29)
Learning Objectives
Demonstrate proficiency in the following areas:
1) Using linguistic analysis to perform risk analysis of investments.
For example:
o Explain the difficulties associated with manual parsing of unstructured text.
o Describe how content and structure of emails could be used for risk analysis.
o Explain the effectiveness of textual versus hard numbers in corporate risk analysis.
o Define RegTech.
o Apply the net sentiment metric to calculate the polarity of a text using its Pos and
Neg figures.
o Calculate the disagreement measure of a text using Pos and Neg figures of a text.
o Describe the findings of the paper regarding the effectiveness of email length as
predictor of risk analysis of Enron.
Action Words
In each of the above learning objectives, action words are used to direct your study focus. Below
is a list of all action words used in this study guide, along with definitions and two examples of
usage, in a question example and in a description. Should you not understand what is required
for any learning objective, we suggest you refer to the table below for clarification.
NOTE: The question examples in this table are NOT sample questions for the current exam.
B. The short convertible may be called in and the position must Correct Answer: B
be delivered, forcing the hedge to be unwound at an
inopportune time.
Apply Make use of Alicia Weeks, CFA, Real Estate Investment Advisor, works in an You have to apply
Asian country where there are no securities laws or regulations. CFA Institute Standard I to
According to CFA Institute Standard I, Fundamental find the correct answer.
Note: If you are
Responsibilities, Alicia:
asked to apply
a model to A. Must adhere to the standards as defined in a neighboring
data, you will country that has the strictest laws and regulations. Correct Answer: C
be expected
to have the B. Need not concern herself with ethics codes and standards.
appropriate
equation C. Must adhere to the CFA Institute’s codes and standards.
memorized,
unless the D. Must adhere to the standards as defined in a neighboring
question also country that has the least strict laws and regulations.
contains the
action word
“recognize”.
Argue Prove by Why did the shape of the supply curve for venture capital funds You have to describe how
reason or by change after 1979? the curve has changed
presenting AND argue why it changed
the associated by providing reasons and
pros and cons; supporting the reasons with
debate statements of facts (e.g.,
change in regulations).
Calculate Determine Consider a set of 100 people. Eighty percent have feature A and You have to calculate
a value twenty percent do not have feature A. What is the entropy for entropy based on the
mathematically this set? given probabilities.
A. 0.72
Note: You will
be expected to B. 0.88
have the Correct Answer: B
appropriate C. 0.93
equation
memorized,
unless the
question also
contains the
action word
“recognize”.
Compare Describe Which of the following least accurately compares the Sharpe You must compare the ratios
similarities and Treynor ratios? based on their most important
and similarities and their most
differences D. Both ratios contain excess return in the numerator. important differences.
C. The present value of the payment streams of the two loan types
are the same.
Construct Make or form A reverse convertible arbitrage hedge consists of a: You must combine positions to
by combining or construct the hedge.
arranging parts A. Short convertible position plus a put option on the stock.
or elements
B. Long convertible position plus a put option on the stock.
C. Short convertible position plus a call option on the stock. Correct Answer: D
Contrast Expound on Which of the following best characterizes a difference between You have to contrast the
the differences value at risk (VaR) and modified VaR? assumptions of the first
model to those of the
A. Modified VaR is expressed as a percent while VaR is a second model so that the
dollar value. differences are clear.
Define State the The interest rate charged by banks with excess reserves at a Federal You must define, in this case,
precise meaning Reserve Bank to banks needing overnight loans to meet reserve the federal funds rate.
requirements is called the:
A. Prime rate.
Correct Answer: C
B. Discount rate.
B. Average
Differentiate Constitute What type of convertible hedge entails shorting a convertible and You must differentiate one
the distinction going long in the underlying stock? type of hedge from another.
between;
distinguish A. Call-option hedge
D. Reverse hedge
Discuss Examine Discuss the limitations of private equity data. You must present a
or consider discussion of a set of ideas in
a subject a list or paragraph.
Explain Illustrate 1. Explain why return on assets (ROA) rather than return on equity 1. You must place a series
the meaning (ROE) might be the preferred measure of performance in the case of thoughts together as an
of hedge funds. explanation of a term
or issue.
or
Identify Establish The investments that have historically performed best during periods You must identify the term
the identity of recession are: that best meets the criterion
of the question
A. Commodities.
B. Treasury bills.
D. Gold.
Illustrate Clarify through For two types of convergence hedges, what situations present You must provide an example
examples or profitable opportunities, how are the hedges set up, and what are for each hedge or compare
comparisons the associated risks? the two to illustrate how
they work.
C. 30-day CD if you expect that interest rates will fall in the future.
D. 2-year CD if you expect that interest rates will fall in the future.
List Create a series List the determinants of real interest rates. You must differentiate from
of items a list those items that are
consistent with the question.
Outline Summarize hich of the following best characterizes the steps in computing a
W You must outline the study’s
tersely geometric mean return based on a series of periodic returns from most important findings
T time periods? rather than explain them
in detail.
A. Add one to each return, add them together, divide by T and
subtract one.
B. Add one to each return, multiply them together, divide by T Correct Answer: D
and subtract one.
C. Add one to each return, add them together, take the Tth root
and subtract one.
D. Add one to each return, multiply them together, take the Tth
root and subtract one.
Recognize Recall the What is the following equation called and used for in the context You must recognize that
purpose of a of artificial neural networks? this is a sigmoid function, also
given equation referred to as a logistic
or term, and function, which is used as the
its name when basis for a sigmoid neuron.
appropriate.
The activities of investment advisors in the venture capital market Correct Answer: C
No part of this publication may be reproduced or used in any form (graphic, electronic or
mechanical, including photocopying, recording, taping or information storage and retrieval
systems) without permission by Financial Data Professional Institute, Inc. (“FDP”). The views
and opinions expressed in the book are solely those of the authors. This book is intended to
serve as a study guide only; it is not a substitute for seeking professional advice.
FDP disclaims all warranties with respect to any information presented herein, including
all implied warranties of merchantability and fitness. All content contained herein is provided
“AS IS” for general informational purposes only. In no event shall FDP be liable for any
special, indirect or consequential changes or any damages whatsoever, whether in an
action of contract, negligence or other action, arising out of or in connection with the content
contained herein. The information presented herein is not financial advice and should not be
taken as financial advice. The opinions and statements made in all articles and introductions
herein do not necessarily represent the views or opinions of FDP.