An Empirical Study on Predicting Future Preferences of Mutual Fund Investors in Turkey Using Data Mining Methods

Experiment Findings · February 2021
All content following this page was uploaded by Mete Çobanoğlu on 18 February 2021.

Boğaziçi University
Financial Engineering FE 572 Project
An Empirical Study on Predicting Future Preferences of

Mutual Fund Investors in Turkey using Data Mining Methods
by
Mete ÇOBANOĞLU
Project Coordinator: Refik Güllü, Professor
February 2021
ABSTRACT
The mutual funds market in Turkey has been on a rapid and robust growth trend in recent years,
supported by the increasing integration of Turkish capital markets with global markets. Owing
to improved access to modern financial investments, Turkish investors have shifted from
conventional saving methods and financial investments to more complex and modern
instruments by investing in mutual and hedge funds. As the mutual fund market expanded, the
Capital Markets Board of Turkey (CMB), the regulatory and supervisory authority of all capital
and securities markets in Turkey, sought to ensure the stability, efficiency, and transparency of
the market. In 2013 it took the authorization to issue and manage funds away from banks,
investment companies, insurance companies, and all other types of company and gave it to
portfolio management companies. Following this comprehensive change, the number of
portfolio management companies rose considerably and, by virtue of the increasing marketing
activities of these companies, the number of mutual fund investments grew to almost two and a
half times its size. Marketing units are responsible for reaching out to potential fund investors
and for convincing and directing them to a suitable type of fund, in order to increase the size of
the funds and to collect more management fees. Accurate and timely marketing strategies are
therefore crucial. Besides the historical success of the fund management team, marketing units
may use present trends and near-future expectations of benchmarks such as stock markets,
foreign exchange rates, or fixed income yields to forecast fund investors' decisions about
entering or leaving the funds market. Additionally, the future tendency of fund investors among
fund types can be predicted by observing related benchmark trends. Such insight may help
portfolio management companies steer their limited workforce and capital more precisely.
The objective of this empirical study is to determine the relationship between the amount of
mutual funds held by investors in the past and the historical returns of the benchmarks, using
data mining methods that learn from large historical data. Multiple linear and logistic
regressions are used to learn from more than forty-six thousand observations and nineteen
variables. Backtesting the models against real data yields results ranging from quite successful
to inadequate, depending on fund type. An important point to consider is that the backtesting
period starts in January 2020 and ends in November 2020, and so includes the severe effects of
the Covid-19 pandemic on both humanity and economies. As expected, growth rates turned
negative, commodities like gold and silver skyrocketed, interest rates declined to the zero band
or below, and foreign exchange rates fluctuated globally. The impact of the pandemic on stock
markets, however, was extraordinary: while economies around the world were shrinking, stock
markets, contrary to expectations, hit all-time highs and generated immense returns. This
anomaly conflicts with the models, which are based on the past behavior of the variables, and
may lower the accuracy of the predictions.
CHAPTER ONE
Introduction
1.1 Outlook of the market
The first mutual fund in Turkey was launched by T. İş Bankası in 1986 [I], and other state and
private banks followed in subsequent years. By the 1990s, investment and insurance companies
had established mutual funds as well, and the number of funds in the market rose from sixty to
two hundred. Until 2007, the transparency and disclosure mechanisms of the market were not
sufficient to obtain data or statistics about the funds. After 2007, the CMB started to publish
monthly data on the mutual funds market, ranging from fund portfolio allocation to the number
of fund units sold. Graph-1 shows the historical change in the number of funds issued, the
number of units sold, and the number of portfolio management companies from January 2007
to January 2020 [II].
Graph-1
The CMB has changed the classification of funds by type several times. Until 2013, funds were
classified by the riskiness of the assets they invested in. Funds holding risky investments such
as derivatives, stocks, or foreign exchange were named Type-A. Funds investing in relatively
less risky assets, such as fixed income debt instruments or depo assets, were classified as
Type-B; these are more liquid than Type-A funds. Following an extensive update of the legal
framework in 2013, funds have been classified and named according to the assets held in their
portfolios [II]. Graph-2 shows the portfolio allocation of funds, classified by major asset class.
I
Bülent Öztürk, Capital Markets Board of Turkey, Research Report, Ankara, May 2002.
https://spk.gov.tr/SiteApps/Yayin/YayinGoster/948
II
Capital Markets Board of Turkey, Monthly Statistics Bulletin.
https://www.spk.gov.tr/SiteApps/Yayin/AylikIstatistikBultenleri
Graph-2
1.2 Motivation and Importance
The performance of fund managers and the evaluation of the success of mutual funds have been
fundamental topics of academic research for many years. Developments in data mining and
prediction techniques have extended these studies toward forecasting, but only in a limited way.
Most studies concern fund returns and the performance of managers. No similar study on
mutual fund investor behavior or investor preference forecasting using data mining tools has
yet been conducted in Turkey. With this motivation, I have tried to develop accurate and sound
models for the Turkish fund market, with the aim of contributing a pioneering study to the
literature.
1.3 Research Objectives
The primary objective of this research is to interpret the behavior of Turkish mutual fund
investors and, accordingly, to predict their near-future inclination toward mutual funds.
Subject to the success of the study and its results, a user-friendly application consisting of
effective and accurate tailor-made models, developed for certain criteria or target groups, may
come in handy for the sector. Insight into the near-future preferences of investors might help
companies adjust their marketing strategies faster and more wisely.
The research tries to answer two questions. First, is there a meaningful relationship between
the benchmarks and the fund units sold on which a model can be built, and how strong is that
relationship? Second, how accurate is the model when backtests are performed?
1.4 Outline of The Project
The project consists of six chapters. The first chapter introduces the general structure of the
project. The second chapter reviews the literature on the methods used in the models. The third
chapter explains the methodologies followed in the data mining and prediction processes and
how the data was obtained and handled. The fourth chapter explains how the models are
generated and run in R scripts. Chapter five describes how the backtests were carried out and
what the outcomes are. Chapter six summarizes and discusses the results of the project. The
appendix and references can be found at the end of the study.
CHAPTER TWO
Literature Review
2.1 Literature Review
2.1.1 Data Mining
Data mining is the process of analyzing massive volumes of data to capture meaningful
relationships between observations and to derive patterns that can be used for prediction.
Advances in computer processors and the exponential growth of digital storage capacity in
recent years have enabled computers to perform complex calculations at great speed.
Institutions and academic researchers now use data mining in almost every field to improve
processes and to make more precise decisions. Statistics is at the core of all data mining
algorithms; however, data mining has significant advantages over statistics. Because of
computational limitations, statistics typically focuses on a sample of a population, whereas data
mining takes the whole population into account no matter how large the data is. Additionally,
data mining can easily deal with non-numeric problems. The process followed in this study is
called Knowledge Discovery in Databases [III] (KDD; Fayyad, Piatetsky-Shapiro, & Smyth,
1996; Han, Kamber, & Pei, 2011). KDD describes a series of steps in the general process of
converting data into knowledge. Figure-1 [IV] illustrates the steps of a KDD process and
Table-1 describes them.
Figure-1
1. Selection: Selecting a data set with a subset of variables or data samples. Selection is
guided by prior domain knowledge and end-user goals.
2. Preprocessing: Cleaning the data set; this includes removing outliers or noise and
handling missing data.
3. Transformation: The data are further transformed into a form more useful for the data
mining task; this can include reducing the number of feature variables to the most
relevant, or projecting the features to a more useful space, such as a logarithmic rather
than a linear scale.
4. Data mining: Applying the appropriate task and method to the data; tasks include
Classification, Regression, Clustering, and Subgroup Discovery.
5. Interpretation: Task-dependent evaluation of the patterns learned via data mining;
domain knowledge is used to assess whether these patterns make sense with respect to
the domain to avoid spurious results.
Table-1
A distinctive feature of the process is that it allows iteration between steps, which builds
confidence in the results of the discovery process.
III
Demchenko, Grosso, de Laat, & Membrey, 2013, Addressing Big Data Issues in Scientific Data Infrastructure
https://www.researchgate.net/publication/256082290_Addressing_Big_Data_Issues_in_Scientific_Data_Infrastructure
Gandomi & Haider, 2015, Beyond the hype: Big data concepts, methods, and analytics
https://www.sciencedirect.com/science/article/pii/S0268401214001066
Ishwarappa & Anuradha, 2015, A Brief Introduction on Big Data 5Vs Characteristics and Hadoop Technology
https://www.researchgate.net/publication/282536587_A_Brief_Introduction_on_Big_Data_5Vs_Characteristics_and_Hadoop_Technology
IV
Joseph C. Mellor, Michael A. Stone, John Keane, Application of Data Mining to “Big Data” Acquired
in Audiology: Principles and Potential,
https://journals.sagepub.com/doi/full/10.1177/2331216518776817#
2.1.2 Multiple Linear Regression
In simple linear regression, a single predictor variable X is used to model the response
variable Y. In many applications, more than one factor influences the response. Multiple
regression models describe how a single response variable Y depends linearly on a number of
predictor variables.
A multiple linear regression model with k predictor variables X1, X2, ..., Xk and a response Y
can be written as
$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_k x_k + \varepsilon$$
The ε are the residual terms of the model, and the distributional assumption placed on the
residuals later allows inference on the remaining model parameters. Among the regression
coefficients β0, β1, β2, ..., βk, β0 is the intercept and each βj is the expected change in y for a
one-unit increase in xj when the other predictors are held fixed.
The simplest multiple regression model, with two predictor variables, is
$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \varepsilon$$
and the surface that corresponds to the model is a plane in three-dimensional space with
different slopes in the x1 and x2 directions.
Graph-3
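The two-predictor model above can be illustrated with a small least-squares fit. The following is a minimal sketch in Python rather than the R code used in the study; the function names and the toy data are assumptions made for illustration, and the coefficients are obtained by solving the normal equations with plain Gaussian elimination.

```python
def solve_linear_system(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting.

    Assumes the system is non-singular (no perfectly collinear predictors).
    """
    n = len(b)
    # Augmented matrix [a | b], so row operations apply to both sides.
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    # Back substitution on the upper-triangular system.
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def fit_ols(rows, y):
    """Return [b0, b1, ..., bk] that minimise the sum of squared residuals."""
    design = [[1.0] + list(r) for r in rows]  # leading 1.0 is the intercept column
    p = len(design[0])
    xtx = [[sum(d[i] * d[j] for d in design) for j in range(p)] for i in range(p)]
    xty = [sum(d[i] * yi for d, yi in zip(design, y)) for i in range(p)]
    return solve_linear_system(xtx, xty)


# Hypothetical toy data generated from the plane y = 1 + 2*x1 - 3*x2 (no noise).
rows = [(0, 0), (1, 0), (0, 1), (1, 1), (2, 1), (1, 3)]
y = [1 + 2 * x1 - 3 * x2 for x1, x2 in rows]
coefficients = fit_ols(rows, y)
print([round(c, 6) for c in coefficients])
```

Because the toy data lies exactly on a plane, the fitted coefficients recover the intercept and the two slopes of that plane up to floating-point error.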
III
M. Bremer, University of Wollongong, MATH 261A, supplement 5 - multiple regression, 2012
https://www.coursehero.com/file/24577572/supplement-5-multiple-regressionpdf/
A multiple linear regression can be evaluated by holding out a sample of the data as test data
and using the model as a predictor on it. R², also called the coefficient of determination,
measures the degree to which the input variables explain the variation of the output variable.
It ranges from 0 to 1; the higher the R², the better the inputs explain the change in the output
variable. Adding new variables to the model may produce a higher R²; however, this does not
necessarily improve the model. On the contrary, it may lead to over-fitting and reduce the
effectiveness of the model. To avoid this, adjusted R² can be used to evaluate the model.
Adjusted R² is a modified version of R² that penalizes predictors that do not contribute to a
regression model. In other words, adjusted R² shows whether adding predictors improves a
regression model or not. It indicates how well the terms fit the regression line. Thus, if adding
new variables to the model lowers the adjusted R², then the additional variables are not adding
value to the model. Under this criterion, the best model among all variable combinations is the
one with the highest adjusted R².
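The difference between R² and adjusted R² can be made concrete with the standard formulas. The sketch below is a minimal Python illustration, not the study's R code; the data values and function names are hypothetical.

```python
def r_squared(actual, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1.0 - ss_res / ss_tot


def adjusted_r_squared(r2, n_obs, n_predictors):
    """Adjusted R²: penalises R² for the number of predictors used."""
    return 1.0 - (1.0 - r2) * (n_obs - 1) / (n_obs - n_predictors - 1)


# Hypothetical observations and model predictions.
actual = [10.0, 12.0, 15.0, 19.0, 24.0, 30.0]
predicted = [9.5, 12.5, 15.5, 18.5, 24.5, 29.5]

r2 = r_squared(actual, predicted)
# Same fit quality, but attributed to 1 predictor versus 3 predictors.
adj_small = adjusted_r_squared(r2, n_obs=len(actual), n_predictors=1)
adj_large = adjusted_r_squared(r2, n_obs=len(actual), n_predictors=3)
print(r2, adj_small, adj_large)
```

For the same R², the model that needs more predictors gets the lower adjusted R², which is why the measure discourages adding variables that do not improve the fit.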
2.1.3 Logistic Regression
Logistic regression is one of the most widely used machine learning algorithms for binary
classification [IV, V]. It measures the relationship between the dependent variable and one or
more independent variables by estimating probabilities with its underlying logistic function.
These probabilities must then be transformed into binary values in order to make a prediction.
The logistic function, also called the sigmoid function, is an S-shaped curve that maps any
real-valued number to a value between 0 and 1, without ever reaching those limits exactly. The
values between 0 and 1 are then converted to either 0 or 1 using a threshold classifier.
Figure-2 below illustrates the steps that logistic regression goes through to produce the desired
output, and Graph-4 shows the sigmoid function.
Figure-2 Graph-4
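The two steps described above, mapping a linear score to a probability and then thresholding it, can be sketched as follows. This is an illustrative Python example; the weights, intercept, and the 0.5 threshold are assumptions, not values from the study.

```python
import math


def sigmoid(z):
    """Logistic function: maps a real number to the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def predict_class(features, intercept, weights, threshold=0.5):
    """Return (probability, class label) for one observation."""
    z = intercept + sum(w * x for w, x in zip(weights, features))
    probability = sigmoid(z)
    return probability, int(probability >= threshold)


# Hypothetical coefficients and two hypothetical observations.
prob_up, label_up = predict_class([0.8, -0.2], intercept=-0.1, weights=[1.5, 2.0])
prob_down, label_down = predict_class([-2.0, 0.0], intercept=-0.1, weights=[1.5, 2.0])
print(prob_up, label_up, prob_down, label_down)
```

The threshold is a modelling choice; 0.5 is a common default, but it can be moved when the two kinds of misclassification have different costs.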
Goodness of fit can be measured by the improvement over the null deviance. The null deviance
shows how well the response variable is predicted by a model that includes only the intercept.
The residual deviance is calculated after including the independent variables. If the added
variables improve the model, the residual deviance is lower than the null deviance.
The deviance is defined as
$$D(y, \hat{\mu}) = 2\left(\log p(y \mid \theta_S) - \log p(y \mid \theta_0)\right)$$
where
• y is the outcome.
• μ^ is the estimate of the model.
• θS and θ0 are the parameters of the fitted saturated and proposed models, respectively.
A saturated model has as many parameters as it has training points, that is, p = n, and
thus fits perfectly. The proposed model can be any other model; for the null deviance it
is the intercept-only model.
• p(y|θ) is the likelihood of the data given the model.
The deviance indicates the extent to which the likelihood of the saturated model exceeds the
likelihood of the proposed model. If the proposed model has a good fit, the deviance will be
small. If the proposed model has a bad fit, the deviance will be high.
Another way to evaluate the goodness of a model is the Akaike Information Criterion (AIC).
AIC assesses the quality of a model through comparison with related models. It is based on the
deviance but penalizes the model for becoming more complicated. Much like adjusted
R-squared, its intent is to prevent the model from including irrelevant predictors. However,
unlike adjusted R-squared, the number itself is not meaningful. It should therefore be used to
compare models, where the model with the smaller AIC has the better fit.
AIC is defined as
$$\mathrm{AIC} = 2k - 2\ln(\hat{L})$$
where k is the number of estimated parameters in the model and \hat{L} is the maximum of the
likelihood function.
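For a binary outcome, the quantities above can be computed directly from the predicted probabilities. The sketch below is a Python illustration under stated assumptions: outcomes are coded 0/1, so the saturated model has log-likelihood zero and the deviance reduces to minus twice the log-likelihood; the fitted probabilities and the parameter counts are hypothetical.

```python
import math


def log_likelihood(y, p):
    """Bernoulli log-likelihood of 0/1 outcomes y given probabilities p."""
    return sum(math.log(pi) if yi == 1 else math.log(1.0 - pi)
               for yi, pi in zip(y, p))


def deviance(y, p):
    """Deviance for 0/1 data: the saturated log-likelihood is 0, so D = -2 * logL."""
    return -2.0 * log_likelihood(y, p)


def aic(y, p, n_parameters):
    """Akaike Information Criterion: 2k - 2 * logL."""
    return 2.0 * n_parameters - 2.0 * log_likelihood(y, p)


y = [1, 0, 1, 1, 0, 1, 0, 1]

# Intercept-only model: every observation gets the overall share of ones.
base_rate = sum(y) / len(y)
null_deviance = deviance(y, [base_rate] * len(y))

# Hypothetical probabilities from a model with an intercept and two predictors.
fitted_p = [0.9, 0.2, 0.8, 0.7, 0.3, 0.6, 0.1, 0.9]
residual_deviance = deviance(y, fitted_p)

aic_null = aic(y, [base_rate] * len(y), n_parameters=1)
aic_fitted = aic(y, fitted_p, n_parameters=3)
print(null_deviance, residual_deviance, aic_null, aic_fitted)
```

In this toy case the residual deviance falls below the null deviance and the fitted model also has the smaller AIC, so the extra parameters are justified by the improvement in fit.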
IV
R-Bloggers, 2015 https://www.r-bloggers.com/2015/08/evaluating-logistic-regression-models/
V
Machine Learning Glossary, 2017
https://ml-cheatsheet.readthedocs.io/en/latest/logistic_regression.html
CHAPTER THREE
Methodology
3.1 Data Preparation
3.1.1 Fund Data
As mentioned at the beginning of the study, proper, consistent, reliable, and, most importantly,
accessible fund data was first published on the CMB website in January 2007 [VI]. The data is
distributed at monthly frequency and contains fundamental details about every live fund in the
market on the related date. Fund name, fund type, issuer, number of fund units sold, unit price,
net asset value, and detailed portfolio allocation by asset class can be found in the bulletin in
Excel file format. All fund data was retrieved from the website with a Python script and
consolidated in a single Excel file. The CMB has changed the format and layout of the bulletin
several times over the years, so a second script, this time written in Visual Basic, was used to fit
the whole data set into one layout. This study depends heavily on the change in the number of
fund units sold over time. When this change was visualized to detect outliers, noise, and missing
data, the raw results were not suitable for use in a regression model. To bring the data into a
more compact range, a threshold of 98% was set. This is the largest negative change ratio
observed in a fund during its life span, excluding the termination period, in which all units in
the market are called back to the issuer and the negative percentages become extreme.
Consequently, observations with an absolute percentage change greater than 98% were defined
as outliers and removed from the data.
When the large-scale graphs are inspected (Graph-5 and Graph-6), extreme noise data points
can be seen in red circles. These probably arose from typos or data format convention errors
made by the data provider while generating the bulletin.
Graph-5 Graph-6
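The outlier rule described above can be sketched in a few lines. This is a minimal Python illustration, not the scripts used in the study; the unit counts are hypothetical and the function names are assumptions.

```python
def monthly_change(units):
    """Relative change between consecutive monthly fund unit counts."""
    return [(curr - prev) / prev for prev, curr in zip(units, units[1:])]


def drop_outliers(changes, threshold=0.98):
    """Keep only changes whose absolute value does not exceed the threshold."""
    return [c for c in changes if abs(c) <= threshold]


# Hypothetical series with one data-entry spike in the fourth month.
units = [1000.0, 1100.0, 990.0, 250000.0, 1050.0, 1200.0]
changes = monthly_change(units)
cleaned = drop_outliers(changes)
print(len(changes), len(cleaned))
```

A single erroneous month produces two extreme changes, one into the spike and one out of it, and both are removed by the 98% rule.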
After the cleaning process, the mean and standard deviation of the data improved noticeably
(Table-2). The visualization of the data distribution also shows the data moving closer to the
mean (Graph-7 and Graph-8).
Before After
Mean 16% 4%
Standard Deviation 1282% 34%
Max 196590% 98%
Min -98% -98%
Table-2
Graph-7
Graph-8
Collecting the missing data was the most strenuous part of the study. Three months were
missing, and the only way to acquire the data was to search manually through Public Disclosure
Platform (PDP) announcements [VII]. PDP is a government-empowered institution that ensures
the transparency of the capital markets by setting announcement rules and supervising the
announcement activities of capital market members in Turkey.
Unfortunately, all announcements made by fund issuers on PDP are in PDF format, the paths of
these files are cryptic, and the layout of the files varies. Hence, the process could not be
automated, and all missing observations, more than seven hundred with nineteen variables
each, were obtained manually from PDF files. Another difficulty encountered during data
preparation was detecting faulty inputs and, once again, replacing them by hand with the
correct values found on PDP. In the end, the data set of the study comprised 44,845
observations including twenty-six parameters, with a total of 97 trillion data to be used in
model generation.
The fund data used in the models is primarily classified by the major asset class in each fund's
most recent portfolio. To avoid defects in the models, funds with merger or acquisition events
in their history are excluded from the data because of possible past changes in fund strategy.
After classification, the sum of fund units sold in every period, by fund type, is used in the
models as the dependent variable.
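The aggregation step that produces the dependent variable can be sketched as a grouped sum. The example below is illustrative Python with hypothetical records; the tuple layout (period, fund type, units sold) is an assumption about how the consolidated data might be arranged.

```python
from collections import defaultdict


def total_units_by_type(records):
    """Sum fund units sold per (period, fund type) pair."""
    totals = defaultdict(float)
    for period, fund_type, units in records:
        totals[(period, fund_type)] += units
    return dict(totals)


# Hypothetical fund-level records: (period, fund type, units sold).
records = [
    ("2019-12", "Stocks", 1200.0),
    ("2019-12", "Stocks", 800.0),
    ("2019-12", "Eurobond", 500.0),
    ("2020-01", "Stocks", 2100.0),
]
totals = total_units_by_type(records)
print(totals[("2019-12", "Stocks")])
```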
3.1.2 Benchmark Data
The benchmarks used in the models were chosen according to their direct relationship with the
asset classes found in fund portfolios and their related indices. The table below shows which
benchmarks were chosen for each fund type (Table-3).
Benchmark  Debt Instruments  Eurobond  Foreign  Money Market  Participation  Precious Metals  Stocks  Variable
USD/TRY FX Rate X X X X X
EUR/TRY FX Rate X X X X X
XAU/USD FX Rate X X X X
XU100 Index X X X
XUTUM Index X X X
SP500 Index X X X
CBT Policy Rate X X X X X
Inflation Rate in Turkey X X X X X
Treasury Bonds 5 Year Benchmark Rate X X X X X
O/N Repo Rate X X X X X
Turkey 5 Year CDS Rate X X X X
FED Policy Rate X X X X
Table-3
The benchmarks are limited to twelve to reduce the complexity of the models and to avoid
multicollinearity between variables. Benchmark data has the same structure as the fund data,
with monthly frequency from January 2007 to January 2020, to match the model. The
benchmark data set was retrieved from both private [VIII] and public [IX] sources and
cross-checked where possible.
VI
Capital Markets Board of Turkey, Monthly Statistics Bulletin
VII
Public Disclosure Platform https://www.kap.org.tr/en/
VIII
Thomson Reuters Eikon, https://eikon.thomsonreuters.com/index.html
IX
Central Bank of Turkey, Electronic Data Delivery System, https://evds2.tcmb.gov.tr/
3.2 Methodology
3.2.1 Data Mining
Predictive models are generated with data mining techniques in the empirical part of the study
using R Studio. R Studio is user-friendly, open-source, free-to-use software backed by
statistical techniques. Among popular data mining software alternatives such as Python,
RapidMiner, and SQL, it is one of the most preferred in the field. According to polls conducted
in 2015 and 2016 by KDnuggets.com, R was the most widely used data science software
globally [X] (Graph-9).
Graph-9
3.2.2 Model Generating
Given the quantitative structure of the data sets and the goal of the study, multiple linear
regression and generalized multiple linear regression (logistic regression) models are used to
produce regression and classification forecasts, respectively. The dependent variable in the
models, the number of fund units sold, is by nature a multi-digit number ranging from
thousands to billions, whereas the independent variables are numerically small in comparison.
The variables also differ in how much they change from month to month. The monthly change
in an independent variable such as the USD/TRY foreign exchange rate has never exceeded the
41% seen in February 2001 during the economic crisis. A fund with only a few investors,
however, may see a drastic decline in fund units when investors leave, or a massive upsurge
when an agreement is made with an institution. Therefore, to normalize the figures and
percentage changes in the data set, various normalization and return calculation practices are
used, as explained in the table below (Table-4). These models are built to determine how the
dependent and independent variables moved over time and to measure the strength of the
relationship between them.
X
The 17th Annual KDnuggets Software Poll Research, 2016,
https://www.kdnuggets.com/2016/06/r-python-top-analytics-data-mining-data-science-software.html
Model # | Dependent Variable Structure | Independent Variable Structure | Regression Model
1 | Natural logarithm value of the dependent variable. | Natural logarithm values of the independent variables. | Multiple Linear Regression
2 | Dependent variable is used as it is in the data set. | Independent variables are used as they are in the data set. | Multiple Linear Regression
3 | Natural logarithm change of the natural logarithm value of the dependent variable. | Natural logarithm change of the natural logarithm values of the independent variables. | Multiple Linear Regression
4 | Movement side of the change (up, down, or neutral) in the dependent variable. | Movement side of the change (up, down, or neutral) in the independent variables. | Generalized Multiple Linear Regression (Logistic Regression)
Table-4
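The variable structures in Table-4 can be illustrated with three small transformations. This is a Python sketch under stated assumptions: the "natural logarithm change" of Model 3 is interpreted here as the log return ln(x_t / x_{t-1}), which may differ from the exact calculation used in the study, and the function names are hypothetical. The sample values are the USD/TRY rates for February to May 2020 from Table-5.

```python
import math


def log_levels(series):
    """Model 1 style: natural logarithm of each value."""
    return [math.log(v) for v in series]


def log_changes(series):
    """Model 3 style (assumed interpretation): log return between months."""
    return [math.log(curr / prev) for prev, curr in zip(series, series[1:])]


def directions(series):
    """Model 4 style: direction of the month-over-month move."""
    labels = []
    for prev, curr in zip(series, series[1:]):
        if curr > prev:
            labels.append("up")
        elif curr < prev:
            labels.append("down")
        else:
            labels.append("neutral")
    return labels


usd_try = [6.2387, 6.6110, 6.9829, 6.8182]  # Table-5, February to May 2020
print(log_levels(usd_try))
print(log_changes(usd_try))
print(directions(usd_try))
```

Log returns have the convenient property that they add up over time, so the sum of the monthly values equals the log return over the whole window.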
Before moving on to the prediction process, rather than going forward with only the regression
models above, sub-models are derived from the original ones using the regsubsets [XI]
function, a model selection function of the leaps package in R. The function is run with the
algorithm called the "exhaustive method", and the selection strategy followed here combines
forward selection and backward selection. Forward selection [XII] starts with no predictors in
the model, iteratively adds the most contributive predictors, and stops when the improvement is
no longer statistically significant. Backward selection [XII] starts with all predictors in the
model, iteratively removes the least contributive predictors, and stops when all predictors are
statistically significant. Stepwise selection [XII] starts with no predictors and then sequentially
adds the most contributive predictors, like forward selection. After each new variable is added,
any variables that no longer improve the model fit are removed, like backward selection. Note
that in the leaps package an exhaustive search evaluates every subset of predictors, whereas
stepwise selection corresponds to its sequential replacement method. Improvement of the model
is judged by the adjusted R-squared. To ensure the integrity of the research, every model
mentioned above is divided into sub-models, and the three variations with the highest adjusted
R-squared scores are chosen. As the very last step, a multicollinearity check is made using the
Variance Inflation Factor [XIII] (VIF) function.
Multicollinearity is the situation in which collinearity exists between three or more variables
even if no pair of variables has a particularly high correlation. In the presence of
multicollinearity, the solution of the regression model becomes unstable. For a given predictor,
multicollinearity can be assessed by computing the VIF, which measures how much the
variance of a regression coefficient is inflated due to multicollinearity in the model. The
smallest possible value of VIF is one and, as a rule of thumb, a VIF value that exceeds 5 or 10
indicates a problematic amount of collinearity. Accordingly, after the multicollinearity check
and variable adjustments, the models became appropriate to use in predictions.
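The VIF of a predictor is 1 / (1 - R²), where R² comes from regressing that predictor on all the other predictors. The sketch below shows only the simplest special case, a model with exactly two predictors, where that R² equals the squared Pearson correlation between them. It is illustrative Python, not the R function used in the study, and the function names are assumptions. The sample values are the USD/TRY and EUR/TRY rates for February to July 2020 from Table-5.

```python
import math


def pearson_correlation(x, y):
    """Pearson correlation coefficient between two equally long series."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)


def vif_two_predictors(x1, x2):
    """VIF shared by both predictors in a two-predictor model: 1 / (1 - r^2)."""
    r = pearson_correlation(x1, x2)
    return 1.0 / (1.0 - r * r)


usd_try = [6.2387, 6.6110, 6.9829, 6.8182, 6.8500, 6.9702]  # Table-5
eur_try = [6.8821, 7.2968, 7.6515, 7.5739, 7.6959, 8.2345]  # Table-5
print(vif_two_predictors(usd_try, eur_try))

# Uncorrelated predictors give the minimum possible VIF of one.
print(vif_two_predictors([1.0, 2.0, 3.0, 4.0], [1.0, -1.0, -1.0, 1.0]))
```

With three or more predictors the pairwise correlation is no longer enough, and each predictor has to be regressed on all the others to obtain its own R².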
XI
Thomas Lumley, regsubsets function.
https://www.rdocumentation.org/packages/leaps/versions/2.1-1/topics/regsubsets
XII
Alboukadel Kassambara PhD, Articles - Model Selection Essentials in R, 2018.
http://www.sthda.com/english/articles/37-model-selection-essentials-in-r/154-stepwise-regression-essentials-in-r/
XIII
Alboukadel Kassambara PhD, Articles - Model Selection Diagnostics in R, 2018.
http://www.sthda.com/english/articles/39-regression-model-diagnostics/160-multicollinearity-essentials-and-vif-in-r/
3.2.3 Prediction Function
The underlying theory of prediction modeling is that any strategy that worked well in the past
is likely to work well in the future and, conversely, any strategy that performed poorly in the
past is likely to perform poorly in the future. Prediction is provided by predict [XIII], a built-in
generic function of R. It applies the previously fitted regression models to known values of the
independent variables to forecast the dependent variable. The independent variables used in
the prediction model and their closing (last) rate or index values are listed in the table below
(Table-5).
Month | USD/TRY FX Rate | EUR/TRY FX Rate | XAU/USD FX Rate | XU100 Index | XUTUM Index | SP500 Index | CBT Policy Rate | Inflation Rate in Turkey | Treasury Bonds 5 Year Benchmark Rate | O/N Repo Rate | Turkey 5 Year CDS Rate | FED Policy Rate
February-20 | 6.2387 | 6.8821 | 1,584.74 | 1,059.94 | 1,123.04 | 2,954.22 | 10.75 | 12.37 | 12.62 | 11.80 | 346.92 | 1.75
March-20 | 6.6110 | 7.2968 | 1,571.05 | 896.44 | 947.61 | 2,584.59 | 9.75 | 11.86 | 12.59 | 8.75 | 554.28 | 0.25
April-20 | 6.9829 | 7.6515 | 1,680.09 | 1,011.10 | 1,085.18 | 2,912.43 | 8.75 | 10.94 | 10.61 | 7.64 | 579.04 | 0.25
May-20 | 6.8182 | 7.5739 | 1,726.30 | 1,055.20 | 1,137.91 | 3,044.31 | 8.25 | 11.39 | 10.73 | 9.30 | 548.56 | 0.25
June-20 | 6.8500 | 7.6959 | 1,780.67 | 1,165.25 | 1,263.40 | 3,100.29 | 8.25 | 12.62 | 10.46 | 8.00 | 493.08 | 0.25
July-20 | 6.9702 | 8.2345 | 1,974.69 | 1,126.90 | 1,248.28 | 3,271.12 | 8.25 | 11.76 | 12.02 | 7.01 | 550.01 | 0.25
August-20 | 7.3466 | 8.7718 | 1,969.75 | 1,078.61 | 1,206.98 | 3,500.31 | 8.25 | 11.77 | 13.42 | 9.72 | 527.22 | 0.25
September-20 | 7.7157 | 9.0485 | 1,885.44 | 1,145.24 | 1,308.30 | 3,363.00 | 10.25 | 11.75 | 12.90 | 12.50 | 542.57 | 0.25
October-20 | 8.3448 | 9.7207 | 1,877.95 | 1,112.37 | 1,285.09 | 3,269.96 | 10.25 | 11.89 | 14.29 | 14.75 | 547.54 | 0.25
November-20 | 7.8284 | 9.3398 | 1,777.02 | 1,283.58 | 1,463.83 | 3,621.63 | 15.00 | 14.03 | 12.37 | 15.75 | 383.18 | 0.25
Table-5
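The prediction step can be sketched for a log-log model of the Model-1 kind: take logarithms of the benchmark values, apply the fitted coefficients, and exponentiate to return to fund units. The example is illustrative Python, not R's predict function. The intercept and coefficients are purely hypothetical and are not results from the study; only the benchmark values are taken from Table-5.

```python
import math


def predict_log_log(intercept, coefficients, values):
    """Predict the dependent variable from a model fitted on natural logarithms."""
    log_prediction = intercept + sum(
        coefficients[name] * math.log(values[name]) for name in coefficients
    )
    # The model works in logs, so exponentiate to get back to fund units.
    return math.exp(log_prediction)


# Hypothetical fitted parameters, for illustration only.
hypothetical_intercept = 20.0
hypothetical_coefficients = {"USD/TRY": -1.2, "XU100": 0.8}

february_2020 = {"USD/TRY": 6.2387, "XU100": 1059.94}  # Table-5
march_2020 = {"USD/TRY": 6.6110, "XU100": 896.44}      # Table-5

pred_feb = predict_log_log(hypothetical_intercept, hypothetical_coefficients, february_2020)
pred_mar = predict_log_log(hypothetical_intercept, hypothetical_coefficients, march_2020)
print(pred_feb, pred_mar)
```

With these made-up signs, a weaker lira and a lower stock index both push the predicted units down, so the March prediction is below the February one.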
3.2.4 Backtesting Process
Beyond statistical analysis and evaluation techniques, the goodness of fit and prediction
accuracy of a model can be assessed directly by comparing its predictions with the actual
values of the dependent variable. Actual fund units sold in the backtesting period, from
February 2020 to November 2020, are listed by fund type in the table below (Table-6).
Depending on the output of the backtest, a model may turn out to be useful or meaningless.
Given the nature of regression modeling and prediction, a meaningless model should not be
called a failure; rather, it should be seen as an opportunity to tune the models for better results.
However, the overfitting phenomenon should be kept in mind at all times during the search for
a perfectly fitting model. A biased model may lead to faulty results and ineffective predictions
of the future behavior of the dependent variable.
Month Debt Instruments Eurobond Foreign Money Market Participation Precious Metals Stocks Variable
February-20 396,253,949,590 27,901,549,888 15,801,585,930 504,705,470,376 42,126,777,538 38,634,835,152 30,240,044,162 181,712,084,886
March-20 355,775,674,031 24,150,268,771 16,239,643,011 481,572,541,812 28,802,938,967 38,135,464,422 31,920,040,986 125,866,124,748
April-20 402,161,133,286 26,663,971,576 16,744,708,093 503,199,381,142 30,443,669,553 40,223,672,326 33,337,464,223 102,453,176,941
May-20 421,638,310,259 28,180,082,531 17,998,078,760 552,092,342,486 35,441,255,192 45,400,242,637 35,426,718,067 140,715,606,000
June-20 415,378,567,153 30,977,083,989 17,962,517,949 559,742,755,125 41,036,363,618 52,070,266,871 35,052,058,505 149,926,946,709
July-20 388,159,312,977 36,654,781,004 21,993,872,526 533,825,719,357 37,794,292,681 65,592,273,686 36,824,739,809 169,344,760,193
August-20 319,909,928,291 35,484,014,316 27,304,715,036 434,374,949,357 31,647,158,125 73,305,160,955 37,468,295,988 161,342,711,151
September-20 289,238,990,573 35,862,761,766 29,746,866,162 386,717,696,974 28,428,097,857 78,741,241,343 39,013,997,412 161,374,826,841
October-20 286,764,186,250 37,675,468,575 34,332,349,109 380,837,255,412 25,528,338,109 76,677,187,597 39,219,072,050 164,326,261,695
November-20 266,973,041,701 37,490,925,521 34,967,745,934 347,487,829,187 21,095,794,048 74,247,285,284 35,329,124,855 155,327,198,461
Table-6
XIII
Chambers, J. M. and Hastie, T. J. (1992) Statistical Models in S. Wadsworth & Brooks/Cole.
CHAPTER FOUR
Research Findings
4.1 Backtest Results

In all models, exactly the same processes and decision points are followed to ensure consistency.
4.1.1 Multiple Linear Regressions Backtest Results

These regression models are used to predict the value of the dependent variable.
4.1.1.1 Debt Instruments Funds

Over the full backtesting period, none of the models was able to predict the behavior of the
actual data, neither its path nor its values. However, Model-1 closely predicted the actual
values of Debt Instruments fund units sold for the first 6 months.
DEBT INSTRUMENTS FUNDS
PREDICTIONS
DATES ACTUAL MODEL-1 MODEL-2 MODEL-3
Jan -2020 471,544,506,130.20 471,544,506,130.20 471,544,506,130.20 471,544,506,130.20
Feb -2020 396,253,949,590.00 401,519,550,386.43 554,264,002,997.00 475,835,947,149.67
Mar -2020 355,775,674,031.00 362,381,454,296.09 396,755,465,242.00 484,887,546,967.15
Apr -2020 402,161,133,286.00 399,803,540,523.60 586,948,443,650.00 529,121,428,401.52
May -2020 421,638,310,259.00 412,464,057,271.82 686,821,517,723.00 550,218,588,801.31
Jun -2020 415,378,567,153.00 424,229,336,303.00 670,450,757,267.00 565,731,323,422.10
Jul -2020 388,159,312,977.00 446,934,513,338.58 622,676,420,779.00 586,634,261,341.76
Aug -2020 319,909,928,291.00 462,118,439,005.08 677,124,599,156.00 598,458,623,742.73
Sep -2020 289,238,990,573.00 466,185,756,423.36 840,789,999,308.00 601,275,608,878.75
Oct -2020 286,764,186,250.00 483,696,168,944.09 799,747,249,743.00 613,841,540,796.16
Nov -2020 266,973,041,701.00 453,956,658,024.33 847,285,673,580.00 594,656,439,408.67
Table-7
Graph-10
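The split in Model-1's accuracy before and after June 2020 can be quantified with a simple error measure. The sketch below is not part of the original study: it computes the mean absolute percentage error (MAPE) of Model-1 against the actual values in Table-7, and the choice of MAPE and the helper name `mape` are assumptions made for illustration.

```python
# Values copied from Table-7 (Debt Instruments funds), February to November 2020.
# January is left out because every model starts from the actual January value.
actual = [
    396253949590.00, 355775674031.00, 402161133286.00, 421638310259.00,
    415378567153.00, 388159312977.00, 319909928291.00, 289238990573.00,
    286764186250.00, 266973041701.00,
]
model_1 = [
    401519550386.43, 362381454296.09, 399803540523.60, 412464057271.82,
    424229336303.00, 446934513338.58, 462118439005.08, 466185756423.36,
    483696168944.09, 453956658024.33,
]


def mape(actual_values, predicted_values):
    """Mean absolute percentage error, in percent (illustrative helper)."""
    errors = [abs(p - a) / a for a, p in zip(actual_values, predicted_values)]
    return 100.0 * sum(errors) / len(errors)


# Split the backtest at June 2020: the first five entries are Feb to Jun.
early = mape(actual[:5], model_1[:5])
late = mape(actual[5:], model_1[5:])
print(f"Model-1 MAPE, Feb-Jun 2020: {early:.1f}%")
print(f"Model-1 MAPE, Jul-Nov 2020: {late:.1f}%")
```

The same function can be applied to the Model-2 and Model-3 columns, or to the tables of the other fund types, to compare models on a common scale.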
4.1.1.2 Eurobond Funds
Up to June 2020, all models followed the path of the actual data with moderate differences.
After June 2020, only Model-3 reproduced the flat pattern of the actual data, though with a
major discrepancy in level.
EUROBOND FUNDS
PREDICTIONS
DATES ACTUAL MODEL-1 MODEL-2 MODEL-3
Jan -2020 24,223,936,400.07 24,223,936,400.07 24,223,936,400.07 24,223,936,400.07
Feb -2020 27,901,549,888.00 24,944,647,787.36 31,791,581,061.00 21,595,556,190.49
Mar -2020 24,150,268,771.00 22,265,967,632.83 22,671,052,620.00 19,218,258,819.65
Apr -2020 26,663,971,576.00 24,521,567,221.71 26,945,600,565.00 20,201,275,846.63
May -2020 28,180,082,531.00 24,325,431,294.64 29,437,156,481.00 20,697,322,068.36
Jun -2020 30,977,083,989.00 25,244,888,641.52 36,066,727,487.00 21,422,037,557.78
Jul -2020 36,654,781,004.00 25,555,654,028.53 35,119,829,692.00 21,665,546,490.08
Aug -2020 35,484,014,316.00 28,406,544,837.13 35,953,975,905.00 22,898,603,692.15
Sep -2020 35,862,761,766.00 30,498,853,071.08 45,159,558,879.00 23,583,989,732.37
Oct -2020 37,675,468,575.00 34,200,323,804.90 49,555,791,064.00 24,560,179,767.19
Nov -2020 37,490,925,521.00 35,383,791,878.25 51,394,850,193.00 26,885,644,392.22
Table-8
Graph-11
4.1.1.3 Foreign Funds
All models closely followed the flat pattern of the actual data until June 2020; after that,
all except Model-2 diverged from the actual data. Model-2, however, predicted the pattern of
the actual data with a lag of two periods.
FOREIGN FUNDS
PREDICTIONS
DATES ACTUAL MODEL-1 MODEL-2 MODEL-3
Jan -2020 13,583,854,409.29 13,583,854,409.29 13,583,854,409.29 13,583,854,409.29
Feb -2020 15,801,585,930.00 15,596,849,900.82 17,867,225,139.00 13,385,248,581.53
Mar -2020 16,239,643,011.00 15,158,482,669.54 14,162,370,552.00 12,964,775,967.20
Apr -2020 16,744,708,093.00 16,001,146,813.93 14,553,385,308.00 13,774,761,085.27
May -2020 17,998,078,760.00 16,493,068,022.29 14,896,104,764.00 14,237,696,946.01
Jun -2020 17,962,517,949.00 16,857,956,469.97 18,374,878,817.00 14,577,822,658.73
Jul -2020 21,993,872,526.00 17,428,372,383.01 16,997,526,519.00 15,117,849,798.39
Aug -2020 27,304,715,036.00 18,099,224,456.21 17,202,356,451.00 15,755,809,457.98
Sep -2020 29,746,866,162.00 18,151,061,684.81 23,273,251,988.00 15,799,078,526.27
Oct -2020 34,332,349,109.00 18,274,851,598.78 27,363,296,383.00 15,909,891,886.12
Nov -2020 34,967,745,934.00 19,186,366,035.72 28,350,328,026.00 16,787,886,471.18
Table-9
Graph-12
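The two-period lag noted for Model-2 in this subsection can be checked numerically. The sketch below is an illustration rather than the procedure used in the study: it compares each Model-2 prediction with the actual value from a few months earlier and picks the lag with the smallest mean absolute gap. The helper names `mean_abs_gap` and `best_lag` and the maximum lag of four months are assumptions.

```python
# Values copied from Table-9 (Foreign funds), January to November 2020.
actual = [
    13583854409.29, 15801585930.00, 16239643011.00, 16744708093.00,
    17998078760.00, 17962517949.00, 21993872526.00, 27304715036.00,
    29746866162.00, 34332349109.00, 34967745934.00,
]
model_2 = [
    13583854409.29, 17867225139.00, 14162370552.00, 14553385308.00,
    14896104764.00, 18374878817.00, 16997526519.00, 17202356451.00,
    23273251988.00, 27363296383.00, 28350328026.00,
]


def mean_abs_gap(actual_values, predicted_values, lag):
    """Mean absolute gap between predicted[t] and actual[t - lag]."""
    pairs = [
        (actual_values[t - lag], predicted_values[t])
        for t in range(lag, len(predicted_values))
    ]
    return sum(abs(p - a) for a, p in pairs) / len(pairs)


def best_lag(actual_values, predicted_values, max_lag=4):
    """Lag (in periods) at which the prediction sits closest to the actual data."""
    return min(
        range(max_lag + 1),
        key=lambda lag: mean_abs_gap(actual_values, predicted_values, lag),
    )


for lag in range(5):
    print(lag, round(mean_abs_gap(actual, model_2, lag) / 1e9, 2), "billion units")
print("Best lag:", best_lag(actual, model_2))
```

On these values the smallest gap occurs at a lag of two periods, in line with the observation above. With only eleven monthly points, and fewer overlapping points at longer lags, the comparison should be read as indicative only.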
4.1.1.4 Money Market Funds
None of the models was able to track the pattern of the actual data throughout the backtest period.
MONEY MARKET FUNDS
PREDICTIONS
DATES ACTUAL MODEL-1 MODEL-2 MODEL-3
Jan -2020 559,266,672,754.37 559,266,672,754.37 559,266,672,754.37 559,266,672,754.37
Feb -2020 504,705,470,375.55 530,025,700,461.45 392,911,949,002.00 594,642,199,516.02
Mar -2020 481,572,541,811.89 562,476,987,903.70 470,571,701,345.00 632,455,154,097.38
Apr -2020 503,199,381,142.09 574,868,913,755.65 558,662,632,455.00 635,799,728,954.69
May -2020 552,092,342,486.26 567,741,128,235.77 546,442,500,534.00 631,306,069,225.14
Jun -2020 559,742,755,125.48 567,306,890,809.32 541,352,118,325.00 629,702,421,429.34
Jul -2020 533,825,719,356.87 603,440,796,355.03 580,256,743,061.00 658,590,110,287.23
Aug -2020 434,374,949,357.34 626,141,134,515.53 535,792,878,336.00 670,098,149,798.66
Sep -2020 386,717,696,974.08 684,297,022,468.45 601,450,315,303.00 734,786,161,741.34
Oct -2020 380,837,255,412.04 714,152,879,866.54 585,425,829,367.00 751,709,037,304.80
Nov -2020 347,487,829,186.60 751,604,186,962.25 551,921,493,132.00 797,364,067,429.09
Table-10
Graph-13
4.1.1.5 Participation Funds
All models failed to predict the behavior of the actual data.
PARTICIPATION FUNDS
PREDICTIONS
DATES ACTUAL MODEL-1 MODEL-2 MODEL-3
Jan -2020 64,864,385,754.23 64,864,385,754.23 64,864,385,754.23 64,864,385,754.23
Feb -2020 42,126,777,538.00 42,531,574,543.77 56,382,468,863.00 61,100,300,030.04
Mar -2020 28,802,938,967.00 58,452,981,365.35 60,675,269,616.00 58,659,865,819.24
Apr -2020 30,443,669,553.00 64,961,553,723.30 80,561,269,603.00 59,284,859,727.33
May -2020 35,441,255,192.00 67,053,031,936.17 82,715,328,187.00 61,283,064,342.87
Jun -2020 41,036,363,618.00 69,631,732,762.85 95,410,821,201.00 63,805,292,392.05
Jul -2020 37,794,292,681.00 71,775,892,891.74 112,846,280,151.00 66,614,241,265.13
Aug -2020 31,647,158,125.00 74,430,246,589.42 112,303,135,304.00 73,989,510,844.84
Sep -2020 28,428,097,857.00 75,154,118,696.92 138,414,288,332.00 72,068,781,801.43
Oct -2020 25,528,338,109.00 78,308,392,296.91 144,577,407,374.00 78,213,918,743.75
Nov -2020 21,095,794,048.00 75,493,435,742.20 148,924,708,059.00 80,936,113,671.93
Table-11
Graph-14
4.1.1.6 Precious Metals Funds
Model-1 and Model-3 closely forecast the movement of the actual data during the first five
months but lost their predictive ability afterwards. Model-2 followed a similar pattern of ups
and downs, but with an increasing discrepancy from the actual data.
PRECIOUS METALS FUNDS
PREDICTIONS
DATES ACTUAL MODEL-1 MODEL-2 MODEL-3
Jan -2020 31,484,682,102.81 31,484,682,102.81 31,484,682,102.81 31,484,682,102.81
Feb -2020 38,634,835,152.00 40,544,182,227.40 28,934,075,739.00 32,399,365,514.02
Mar -2020 38,135,464,422.00 40,063,423,910.46 28,831,199,443.00 37,421,604,017.53
Apr -2020 40,223,672,326.00 43,914,884,710.37 33,687,430,731.00 42,755,591,072.28
May -2020 45,400,242,637.00 44,943,476,266.86 36,132,863,871.00 43,443,792,040.07
Jun -2020 52,070,266,871.00 44,207,782,571.37 38,097,032,241.00 43,551,222,623.10
Jul -2020 65,592,273,686.00 47,331,761,335.08 48,754,125,680.00 44,571,456,071.44
Aug -2020 73,305,160,955.00 46,830,274,697.29 49,982,340,380.00 44,183,962,316.10
Sep -2020 78,741,241,343.00 44,798,110,049.51 54,430,103,862.00 43,628,091,638.60
Oct -2020 76,677,187,597.00 43,762,213,784.08 57,439,052,691.00 43,645,807,697.64
Nov -2020 74,247,285,284.00 38,670,818,492.22 50,148,290,654.00 39,289,328,240.11
Table-12
Graph-15
4.1.1.7 Stocks Funds
Model-1 and Model-2 were able to imitate the pattern of the actual data with minor disparities.
STOCKS FUNDS
PREDICTIONS
DATES ACTUAL MODEL-1 MODEL-2 MODEL-3
Jan -2020 26,982,950,041.26 26,982,950,041.26 26,982,950,041.26 26,982,950,041.26
Feb -2020 30,240,044,162.00 32,130,485,221.33 24,853,251,079.00 28,664,972,994.24
Mar -2020 31,920,040,986.00 34,974,270,350.65 23,720,696,913.00 31,253,939,027.14
Apr -2020 33,337,464,223.00 34,251,701,540.49 29,933,413,952.00 30,894,911,431.79
May -2020 35,426,718,067.00 34,414,648,838.56 32,244,895,665.00 31,171,570,654.08
Jun -2020 35,052,058,505.00 33,587,765,661.90 33,518,636,470.00 30,579,190,648.59
Jul -2020 36,824,739,809.00 33,980,550,095.83 36,932,550,172.00 31,388,736,452.93
Aug -2020 37,468,295,988.00 35,012,742,673.89 38,119,015,428.00 32,336,911,143.96
Sep -2020 39,013,997,412.00 33,907,800,281.85 43,671,044,564.00 31,144,701,943.73
Oct -2020 39,219,072,050.00 34,772,208,036.77 43,587,957,499.00 31,923,299,876.67
Nov -2020 35,329,124,855.00 31,893,385,742.73 44,506,811,652.00 29,382,176,201.96
Table-13
Graph-16
4.1.1.8 Variable Funds
None of the models fit the actual data well.
VARIABLE FUNDS
PREDICTIONS
DATES ACTUAL MODEL-1 MODEL-2 MODEL-3
Jan -2020 179,475,856,444.17 179,475,856,444.17 179,475,856,444.17 179,475,856,444.17
Feb -2020 181,712,084,886.00 181,607,914,199.45 153,981,785,130.00 183,886,894,832.75
Mar -2020 125,866,124,748.00 169,410,479,503.85 141,977,516,695.00 193,080,531,023.12
Apr -2020 102,453,176,941.00 184,488,076,420.15 180,221,884,802.00 205,719,770,281.75
May -2020 140,715,606,000.00 189,838,954,567.54 187,782,977,112.00 211,685,987,112.17
Jun -2020 149,926,946,709.00 192,202,777,403.78 204,238,483,200.00 211,096,686,457.48
Jul -2020 169,344,760,193.00 198,788,663,866.63 215,382,055,606.00 220,565,919,523.16
Aug -2020 161,342,711,151.00 201,088,234,451.87 218,383,268,703.00 225,157,986,810.55
Sep -2020 161,374,826,841.00 200,115,231,627.06 261,427,031,913.00 220,912,476,362.49
Oct -2020 164,326,261,695.00 202,416,267,924.43 268,228,294,871.00 224,323,041,917.18
Nov -2020 155,327,198,461.00 189,308,908,723.04 279,100,646,435.00 207,996,852,101.18
Table-14
Graph-17
4.1.2 Generalized Multiple Linear Regression Backtest Results
A logistic regression model is used to predict the movement pattern of the dependent variable,
coded as a categorical variable with three levels: up, down, and neutral.
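To make the three movement categories concrete, the sketch below labels month-over-month changes in the actual Stocks fund units from Table-13. It is an illustration only: the study does not state how the categories were derived, so the 1% tolerance band for the middle category and the helper name `classify_movement` are assumptions.

```python
# Actual Stocks fund units from Table-13, January to November 2020.
stocks_actual = [
    26982950041.26, 30240044162.00, 31920040986.00, 33337464223.00,
    35426718067.00, 35052058505.00, 36824739809.00, 37468295988.00,
    39013997412.00, 39219072050.00, 35329124855.00,
]


def classify_movement(previous, current, tolerance=0.01):
    """Label a relative change as 'up', 'down', or 'neutral'.

    The tolerance band is a hypothetical choice, not taken from the study.
    """
    change = (current - previous) / previous
    if change > tolerance:
        return "up"
    if change < -tolerance:
        return "down"
    return "neutral"


# One label per month-over-month step (Feb vs Jan, Mar vs Feb, ...).
labels = [
    classify_movement(prev, curr)
    for prev, curr in zip(stocks_actual, stocks_actual[1:])
]
print(labels)
```

Labels of this kind could then serve as the categorical response of a logistic regression. Widening or narrowing the tolerance band changes how many months fall into the middle category.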
4.1.2.1 Periodic Movement
In the graphs below, the Participation, Precious Metals, and Stocks funds models follow almost
the same ups and downs as the actual data, with lags. The other models failed to capture the
movement of the actual data.
Graph-18
4.1.2.2 Cumulative Movement
Cumulative movement graphs show how precisely the models track the movement of the actual
data. The Debt Instruments, Eurobond, and Participation fund models fit the pattern of the
actual data almost perfectly. The other models failed to predict the movement of the
dependent variable.
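One way to build a cumulative movement path is sketched below. The study does not describe how its cumulative graphs were constructed, so this is an assumption: each period is scored +1 for up, 0 for neutral, and -1 for down, and the scores are summed over time. The label sequences are hypothetical and do not come from the study's data.

```python
from itertools import accumulate

# Assumed scoring of the three movement categories.
SCORES = {"up": 1, "neutral": 0, "down": -1}


def cumulative_path(labels):
    """Running sum of movement scores, one value per period."""
    return list(accumulate(SCORES[label] for label in labels))


# Hypothetical label sequences for illustration only.
actual_labels = ["up", "up", "down", "neutral", "up"]
predicted_labels = ["up", "down", "down", "up", "up"]

actual_path = cumulative_path(actual_labels)
predicted_path = cumulative_path(predicted_labels)

# Period-by-period distance between the predicted and actual paths.
gap = [p - a for a, p in zip(actual_path, predicted_path)]
print(actual_path, predicted_path, gap)
```

A gap that stays near zero indicates that the predicted path tracks the actual one. Because the paths are running sums, a single early misclassification shifts every later value of the gap.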
4.2 Summary
Mutual funds hold a variety of instruments in their portfolios, and this complexity makes the
search for a model that predicts investor behavior a challenging task. Determining an
investor's behavior toward a single asset is already demanding; predicting investors'
preferences for a combination of multiple assets requires considerably more effort. Making
predictions about such complex structures requires proper and clean data, many regressions,
and sound knowledge of the related literature and practice.
The objective of this study was not to resolve all of the above difficulties and produce a
perfect result, but to create a framework for more detailed and tailor-made studies. With
results ranging from accurate to unsuccessful, all outputs gave insight into the relationships
among the variables.
Before interpreting the output of the study, one should keep in mind that the effects of a
global pandemic on financial markets had not been witnessed before in human history.
Therefore, when this study was conducted, there was not sufficient knowledge about the impact
of the Covid-19 pandemic on markets to remove its effects from the data. Another point to note
is the anomaly seen in stock markets, as stated in the abstract of this study. Despite the
devastating effect of Covid-19 on economies, both stock indices used in this study (S&P 500
and BIST100) reached all-time highs. This anomaly might have distorted the results,
especially for Foreign funds.
Starting with the unsuccessful results, the attempt to predict the behavior of Participation
fund investors was by far the least effective. Given the relatively accurate regression
results for other fund types, the failure may be explained by the short history of this fund
type: Participation funds were first issued in 2013 and have only started to develop in recent
years. Another possibility is that investors in this fund type are highly sensitive to
interest because of their religious views.
Debt Instruments and Money Market funds mainly invest in liquid domestic fixed income
instruments such as Treasury bonds and bills, corporate bonds, or overnight repo and deposit
assets. For these funds, the inaccuracy of the results may stem from the purpose for which
they are established. They are generally called liquid funds, meaning that investors can
return their units to the issuer at any time and receive their money back immediately.
Therefore, the past behavior of investors in these fund types may carry little information
for regression models.
Variable funds consist of many assets, ranging from riskless fixed income assets to extremely
risky derivatives. They hold a remarkable amount of the “others” asset class in their
portfolios, the details of which are not disclosed by the issuers. As a result, no well-fitting
model was found to predict the change in units of these funds.
Prediction of funds holding foreign assets can be classified as successful when the results
are interpreted. Eurobond, Foreign, and Precious Metals funds all invest mostly in assets
denominated in currencies other than TRY. Past data for these fund types show that their
popularity rises when global risks surge or when the economic outlook for Turkey does not seem
promising in the short run. Accordingly, the impact of Covid-19 brought a jump in units sold
of these funds. This rise was forecast well by all models.
The best results were seen in predicting the movements of Stocks funds. All three multiple
linear regression models estimated the path of the funds well. By regulation, these funds
must maintain a minimum 90% correlation with the BIST30 index. When the Istanbul Stock
Exchange is bullish, the number of stock investors rises, and when a bearish market emerges,
the number of investors falls. Accordingly, the same investor pattern can be seen in this type
of fund, thanks to its high correlation with stock indices.
When the results are examined by model structure, the models to which normalizing tools were
applied performed better than the raw models.
Model-4, the logistic regression used to predict the movement path, displayed results similar
to those of the linear regressions on the same fund types. To keep the study consistent, the
variables used and the processes followed in the linear regressions were applied to the
logistic regression as well. Customizing the variables or changing the categorical levels may
generate better outputs.
To conclude, regardless of the success or failure of individual regressions, these models can
be applied at more specific and smaller scales and may yield better results. Focusing on one
fund type, on a group of people, or on funds issued by the same manager may lead to more
accurate results.