
CRYPTOCURRENCY PRICE

PREDICTION AND ANALYSIS

J COMPONENT REPORT

MACHINE LEARNING CSE4020


Embedded Project
WINTER SEMESTER 2017-18

E2 SLOT

Submitted by

NAME                          REG. NO.
MAYANK GUPTA 15BCE0477
HARSHIT SHARMA 15BCE0506
PIYUSH JAISWAL 15BCE0611

Submitted to
Prof. Vijayasherly V., SCOPE

School of Computer Science and Engineering

DECLARATION

I hereby declare that the J Component report entitled “Cryptocurrency Price Prediction
and Analysis” submitted by me to Vellore Institute of Technology, Vellore-14 in partial
fulfilment of the requirement for the award of the degree of B.Tech in Computer Science
and Engineering is a record of bonafide work undertaken by me under the supervision of Prof.
Vijayasherly V. I further declare that the work reported in this report has not been submitted
and will not be submitted, either in part or in full, for the award of any other degree or
diploma in this institute or any other institute or university.

Signature Signature Signature


Name: Mayank Gupta Name: Harshit Sharma Name: Piyush Jaiswal

Reg. Number: 15BCE0477 Reg. Number: 15BCE0506 Reg. Number: 15BCE0611

Table of Contents

Abstract

1. Introduction

2. Literature Survey

3. Methodology

   3.1 Multiple Linear Regression

   3.2 Random Forest Algorithm

   3.3 Support Vector Machine

   3.4 Polynomial Regression

   3.5 Ensemble Learning

4. Results

5. Conclusion

References

Appendix: Sample Code

ABSTRACT

This project addresses the problem of predicting the direction of movement of cryptocurrency
prices for the top ten ranked cryptocurrencies in the current market. The study compares four
prediction models: Multiple Linear Regression, Support Vector Machine (SVM), Random
Forest Regression and Polynomial Regression. We take the closing price, the high and low range
and the spread for the previous day to predict the opening price for the following day. Each
of the four algorithms is used to predict the opening price, and an ensemble average weighting
algorithm applied to all of these predictions yields the final, most appropriate value as our
result. The opening price's relation with each of the four input variables is analysed and the
result is produced in graphical form. Each of the four algorithms produces an output with
respect to each of the four variables, except for Multiple Linear Regression, which uses all
the variables at once. The thirteen outcomes produced are then checked by the ensemble
classifier and marked valid or invalid, and finally the valid outcomes are subjected to the
weighting algorithm to produce the final result.

1. Introduction

Digital currencies have attracted increasing attention in recent years, and this attention has
inevitably reached academia, finance, and public policy circles. From the academic perspective,
this importance arises from the fact that digital currencies have features that generate several
conflicts in political and financial environments. Even the definition is ambiguous: as a product
of information technology, a digital currency can be defined as a protocol, a platform, a
currency or a payment method (Athey et al. 2016). Among digital currencies, Bitcoin has
captured almost all of this attention; this virtual currency was created in 2009 and serves as a
peer-to-peer version of electronic cash that allows transactions on the internet without the
intermediation of the financial system (Nakamoto 2008).

Digital coins or cryptocurrencies, named for their characteristic use of encryption systems that
regulate the creation of coins and transfers, also have to be examined from an economic
analysis perspective. Hence, it is important to examine which social, financial and
macroeconomic factors determine their price in order to understand their scope and
consequences for the economy.

2. Literature Survey

Paper – 1

Title: REGIME CHANGE AND TREND PREDICTION FOR CRYPTOCURRENCY TIME SERIES DATA

Authors: Osamu Kodama, Lukáš Pichl, Taisei Kaizoji

Among the notable attempts to model the prediction of extreme events in a systematic way
are those of Hallerberg et al., (2008) assessing under what circumstances the extreme events
may be more predictable the bigger they are, or the recent work by Franzke (2012) who
develops a nonlinear stochastic-dynamical model. In the economic context, extreme events
mean a bubble formation or a bubble burst, and their precursors are of vital importance in
risk management. To extract the causal extent (deterministic segment) buried in the noisy
data, various techniques have been proposed, for instance recurrent neural network with
memory feedback (Elman, 1990) or support vector machines (Cortes and Vapnik, 1995). A
survey of recent methods can be found in the work of Akansu et al. (2016). Binary classifiers
separating the upward and downward trend (positive or negative sign of logarithmic return),
which are easily evaluated against the dataset in terms of hit ratios (the precision of the binary
classifier output), are common.

Paper – 2

Title: Predicting Price Changes in Cryptocurrency

Authors: Matthew Chen, Neha Narwal and Mila Schultz

Specifically, their Naive Bayes model assumes that the input features (i.e., the groups of six
price points) are conditionally independent given the label (i.e., a positive price change [+1]
or a negative price change [-1]). Like logistic regression, the support vector machine algorithm
yields a binary classification model while making very few assumptions about the dataset.
Analysis of price data from Coinbase (2017) shows that between August 30th, 2015 and
October 19th, 2017, Bitcoin had a monthly volatility of 21.73%. Over that same time span,
Ether had a monthly volatility of 77.91%. For comparison, the S&P 500 has a historical
monthly volatility of about 14%, suggesting that the price of Ether is significantly less
predictable than that of either Bitcoin or common stock. The authors also assume that even as
the price of the cryptocurrency varies, the distribution describing the magnitude of the price
changes between iterations stays constant. This assumption guided their idea to use price
changes (and the sign of the price change) as input features for an SVM-based model, but the
model underperformed the ARIMA-based model all the same, even when the data was
standardized and/or normalized.

Their results substantiate earlier research by Madan, Saluja, and Zhao (2014), who found that
by using the Bitcoin price sampled every 10 minutes as the primary feature for a random-forest
model, they could predict the direction of the next change in the Bitcoin price with 57.4%
accuracy. The primary dataset consists of the price of Ether sampled at approximately one-hour
intervals between August 30, 2015 and December 2, 2017 (Etherchain, 2017). Because the
variance of the dataset is large relative to its mean, the authors initially attempted to reduce
variance by truncating the dataset to only include data points occurring after February 26th,
2017.

Most of these models used six price points as the input feature and were based on binary
classification algorithms, including Logistic Regression, Support Vector Machine, Random
Forest and Naive Bayes. Hegazy and Mumford (2016) compute an exponentially-smoothed
cryptocurrency price every eight minutes; using the first five left derivatives of this price as
features in a decision-tree based algorithm, they predict the direction of the next change in the
price with 57.11% accuracy. (For reference, a naive model which takes no input and always
predicts that the price will increase yields a baseline accuracy of 55.8%.) The authors
interrogated the dataset with t-SNE, LDA, and PCA and found that the classes are not
qualitatively well separated by any of those methods.

Final results are obtained by training on the training and development sets and testing on the
test set. Features were generated by grouping the original data points, which contained Ether
prices, into series of six points, such that each point was separated from its neighbours by one
hour. The truncation date marks a natural inflection point in the cryptocurrency's price history,
which seemed to support Bovaird's hypothesis, and roughly indicates when large institutions
increased their interest in the cryptocurrency (since if there was no such interest, the
institutions would not have announced the consortium's formation). However, truncating the
dataset in this way did not significantly change the results. Other types of features were also
tested, including the price change between time points and the sign of the price change between
time points, as well as normalized and standardized versions of all the features already
described.

The ratio of positive to negative price changes in the dataset is almost 1:1; as such, the models
are evaluated based on their prediction accuracy.

3. Methodology

After merging the entire dataset, we apply the following four algorithms to it to predict the
opening price and finally compare the accuracy of each model.

3.1 Multiple Linear Regression

This model generalizes simple linear regression in two ways. It allows the mean
function E(y) to depend on more than one explanatory variable and to have shapes
other than straight lines, although it does not allow for arbitrary shapes.

Let y denote the dependent (or study) variable that is linearly related to k
independent (or explanatory) variables X1, X2, ..., Xk through the parameters
β1, β2, ..., βk, and we write

y = β1 X1 + β2 X2 + ... + βk Xk + ε.

This is called the multiple linear regression model. The parameters β1, β2, ..., βk are
the regression coefficients associated with X1, X2, ..., Xk respectively, and ε is the
random error component reflecting the difference between the observed and fitted
linear relationship. There can be various reasons for such a difference, e.g., the joint
effect of variables not included in the model, random factors which cannot be
accounted for in the model, etc.
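
A minimal scikit-learn sketch of this model, using made-up previous-day values
(close, high, low, spread) purely for illustration:

import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical previous-day features: [close, high, low, spread] (illustrative values)
X = np.array([[756.0, 798.0, 567.0, 231.0],
              [762.0, 810.0, 601.0, 209.0],
              [771.0, 805.0, 640.0, 165.0]])
# Next-day opening prices (illustrative values)
y = np.array([760.0, 768.0, 775.0])

model = LinearRegression()                 # fits y = b0 + b1*x1 + ... + bk*xk + error
model.fit(X, y)
print(model.intercept_, model.coef_)       # estimated regression coefficients
print(model.predict([[756.0, 798.0, 567.0, 231.0]]))   # predicted next-day open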

3.2 Random Forest Algorithm

Random Forest is a supervised learning algorithm. As the name suggests, it creates a
forest and makes it somewhat random. The “forest” it builds is an ensemble of
decision trees, most of the time trained with the “bagging” method. The general idea
of the bagging method is that a combination of learning models increases the overall
result.

Why the Random Forest Algorithm?

- The same random forest algorithm (as a classifier or a regressor) can be used
for both classification and regression tasks.

- The random forest classifier can handle missing values.

- With more trees in the forest, the random forest classifier is less prone to
overfitting.

- The random forest classifier can also model categorical values.

Pseudo Code for the Algorithm

1. Randomly select “k” features from the total “m” features, where k << m.

2. Among the “k” features, calculate the node “d” using the best split point.

3. Split the node into daughter nodes using the best split.

4. Repeat steps 1 to 3 until “l” nodes have been reached.

5. Build the forest by repeating steps 1 to 4 “n” times to create “n” trees.
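
A brief scikit-learn sketch of the same idea (illustrative data only; n_estimators plays
the role of “n” and max_features the role of “k”):

import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Illustrative previous-day highs and next-day opening prices
X = np.array([[798.0], [810.0], [805.0], [820.0]])
y = np.array([760.0, 768.0, 775.0, 790.0])

# n_estimators = number of trees "n"; max_features = features "k" tried at each split
forest = RandomForestRegressor(n_estimators=1000, max_features="sqrt", random_state=0)
forest.fit(X, y)
print(forest.predict([[815.0]]))           # prediction averaged over all trees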

3.3 Support Vector Machine

In machine learning, support vector machines (SVMs, also known as support vector
networks) are supervised learning models with associated learning algorithms that
analyze data used for classification and regression analysis. Given a set of training
examples, each marked as belonging to one or the other of two categories, an SVM
training algorithm builds a model that assigns new examples to one category or the
other, making it a nonprobabilistic binary linear classifier (although methods such as
Platt scaling exist to use SVM in a probabilistic classification setting). An SVM
model is a representation of the examples as points in space, mapped so that the
examples of the separate categories are divided by a clear gap that is as wide as
possible. New examples are then mapped into that same space and predicted to
belong to a category based on which side of the gap they fall.
In addition to performing linear classification, SVMs can efficiently perform a non-
linear classification using what is called the kernel trick, implicitly mapping their
inputs into high-dimensional feature spaces.
When data are not labeled, supervised learning is not possible, and an unsupervised
learning approach is required, which attempts to find natural clustering of the data to
groups, and then map new data to these formed groups. The support vector clustering
algorithm created by Hava Siegelmann and Vladimir Vapnik, applies the statistics of
support vectors, developed in the support vector machines algorithm, to categorize
unlabeled data, and is one of the most widely used clustering algorithms in industrial
applications.
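
Since this project applies the regression form of the model (SVR), a minimal sketch
with the usual feature scaling, on illustrative values only, might look like:

import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

X = np.array([[798.0], [810.0], [805.0], [820.0]])   # previous-day highs (illustrative)
y = np.array([760.0, 768.0, 775.0, 790.0])           # next-day opens (illustrative)

sc_X, sc_y = StandardScaler(), StandardScaler()
X_s = sc_X.fit_transform(X)
y_s = sc_y.fit_transform(y.reshape(-1, 1)).ravel()

svr = SVR(kernel="rbf")                    # RBF kernel: non-linear fit via the kernel trick
svr.fit(X_s, y_s)

pred_scaled = svr.predict(sc_X.transform([[815.0]]))
print(sc_y.inverse_transform(pred_scaled.reshape(-1, 1)))   # back on the price scale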

3.4 Polynomial Regression
Polynomial regression is a form of regression analysis in which the relationship
between the independent variable x and the dependent variable y is modelled as
an nth-degree polynomial in x. Polynomial regression fits a nonlinear relationship
between the value of x and the corresponding conditional mean of y, denoted E(y | x),
and has been used to describe nonlinear phenomena such as the growth rate of
tissues, the distribution of carbon isotopes in lake sediments, and the progression of
disease epidemics. Although polynomial regression fits a nonlinear model to the
data, as a statistical estimation problem it is linear, in the sense that the regression
function E(y | x) is linear in the unknown parameters that are estimated from the data.
For this reason, polynomial regression is considered to be a special case of multiple
linear regression.
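
A minimal sketch of this special case in scikit-learn, expanding a single feature into
polynomial terms before fitting an ordinary linear model (illustrative data; the degree
is an arbitrary choice here):

import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

X = np.array([[567.0], [601.0], [640.0], [655.0]])   # previous-day lows (illustrative)
y = np.array([760.0, 768.0, 775.0, 781.0])           # next-day opens (illustrative)

poly = PolynomialFeatures(degree=4)        # expands x into [1, x, x^2, x^3, x^4]
X_poly = poly.fit_transform(X)

model = LinearRegression()                 # still linear in the unknown parameters
model.fit(X_poly, y)
print(model.predict(poly.transform([[620.0]])))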

3.5 Ensemble Learning (Weighted Classifier)


An ensemble contains a number of learners which are usually called base learners.
The generalization ability of an ensemble is usually much stronger than that of its
base learners. Ensemble learning is appealing because it is able to boost weak
learners, which are only slightly better than random guessing, into strong learners
which can make very accurate predictions. For this reason, “base learners” are also
referred to as “weak learners”. It is noteworthy, however, that although most
theoretical analyses work with weak learners, base learners used in practice are not
necessarily weak, since using not-so-weak base learners often results in better
performance.
Base learners are usually generated from training data by a base learning algorithm,
which can be a decision tree, a neural network or another kind of machine learning
algorithm. Most ensemble methods use a single base learning algorithm to produce
homogeneous base learners, but there are also methods which use multiple learning
algorithms to produce heterogeneous learners. In the latter case there is no single
base learning algorithm, and thus some people prefer to call the learners individual
learners or component learners rather than “base learners”, although the names
“individual learners” and “component learners” can also be used for homogeneous
base learners.
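
A minimal sketch of the average weighting scheme used in this project, distilled from
the appendix code: a model's prediction is treated as valid only if it falls between the
previous day's low and high, and the valid predictions are averaged (the numbers
below are illustrative):

def ensemble_average(predictions, low, high):
    # Keep only predictions that fall inside the previous day's [low, high] range
    valid = [p for p in predictions if low <= p <= high]
    if not valid:                          # no model produced a plausible value
        return None
    return sum(valid) / len(valid)

# Illustrative outputs from the individual regressors; 1020.4 would be rejected
preds = [761.2, 755.8, 1020.4, 770.1]
print(ensemble_average(preds, low=567.0, high=798.0))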

4. Results

Input:
Closing: 756
High: 798
Low: 567
Spread: 231

Output:

Analysis:

5. Conclusion

Cryptocurrencies have gained a sudden boom in the market and are disrupting the online
trading industry. With digitalization gaining a foothold in the market, cryptocurrencies can
become a great asset to countries and even show promise when it comes to security. After the
analysis we concluded that the prices of cryptocurrencies do not follow any linearity or pattern.
The analysis shows a distinct graph for the relationship of the cryptocurrency price with each
of the input attributes. This uncertainty is the reason a person cannot rely on a single algorithm
to predict their prices. Our model makes use of four different prediction algorithms and,
through ensemble learning, combines their best attributes to produce a result far superior to
what any individual algorithm could produce. The dataset used to train the model contains
10,000 data entries, and the accuracy obtained with our model on the test set is more than
90%. We therefore conclude that our model is well suited to cryptocurrency price prediction,
and as the training set grows over time its accuracy should keep improving.

References

[1] www.kaggle.com

[2] www.blockchain.info

[3] Isaac Madan, Shaurya Saluja and Aojia Zhao. Automated Bitcoin Trading via Machine
Learning Algorithms.

[4] Osamu Kodama, Lukáš Pichl and Taisei Kaizoji. Regime Change and Trend Prediction
for Cryptocurrency Time Series Data.

[5] T. B. Trafalis and H. Ince. Support vector machine for regression and applications to
financial forecasting. IJCNN 2000, 348-353.
http://www.svms.org/regression/TrIn00.pdf

[6] H. Yang, L. Chan and I. King. Support vector machine regression for volatile stock
market prediction. Proceedings of the Third International Conference on Intelligent Data
Engineering and Automated Learning, 2002.
http://www.cse.cuhk.edu.hk/~lwchan/papers/ideal2002.pdf

[7] Matthew Chen, Neha Narwal and Mila Schultz. Predicting Price Changes in
Cryptocurrency. Stanford University, Stanford, CA 94305.

Appendix: Sample Code

# Importing the libraries


import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

print ("Select the Cryptocurrency to predict the opening price\n\n")


print ("1 : Bitcoin\n")
print ("2 : Ethereum\n")
print ("3 : Ripple\n")
print ("4 : Bitcoin Cash\n")
print ("5 : Cardano\n")
print ("6 : Stellar\n")
print ("7 : Litecoin\n")
print ("8 : NEO\n")
print ("9 : EOS\n")
print ("10 : NEM\n\n")
print ("Input your choice\n\n")

choice = int(input())
if choice == 1:
    a, b, c = 0, 1492, "Bitcoin"
elif choice == 2:
    a, b, c = 1493, 2397, "Ethereum"
elif choice == 3:
    a, b, c = 2398, 4035, "Ripple"
elif choice == 4:
    a, b, c = 4036, 4224, "Bitcoin Cash"
elif choice == 5:
    a, b, c = 4225, 4343, "Cardano"
elif choice == 6:
    a, b, c = 4344, 5615, "Stellar"
elif choice == 7:
    a, b, c = 5616, 7351, "Litecoin"
elif choice == 8:
    a, b, c = 7352, 7810, "NEO"
elif choice == 9:
    a, b, c = 7811, 8018, "EOS"
elif choice == 10:
    a, b, c = 8019, 9051, "NEM"
else:
    print("Invalid Choice\n")
    print("Please select a number between (1-10)\n")
    ch = int(input())
    if ch == 1:
        a, b, c = 0, 1492, "Bitcoin"
    elif ch == 2:
        a, b, c = 1493, 2397, "Ethereum"
    elif ch == 3:
        a, b, c = 2398, 4035, "Ripple"
    elif ch == 4:
        a, b, c = 4036, 4224, "Bitcoin Cash"
    elif ch == 5:
        a, b, c = 4225, 4343, "Cardano"
    elif ch == 6:
        a, b, c = 4344, 5615, "Stellar"
    elif ch == 7:
        a, b, c = 5616, 7351, "Litecoin"
    elif ch == 8:
        a, b, c = 7352, 7810, "NEO"
    elif ch == 9:
        a, b, c = 7811, 8018, "EOS"
    elif ch == 10:
        a, b, c = 8019, 9051, "NEM"

# Importing the dataset


dataset = pd.read_csv('ML_proj_dataset.csv')
X_mlr = dataset.iloc[a:b, [6,7,8,12]].values
X_svr = dataset.iloc[a:b, 6:7].values
X_close = dataset.iloc[a:b, 8:9].values
X_high = dataset.iloc[a:b, 6:7].values
X_low = dataset.iloc[a:b, 7:8].values
X_spread = dataset.iloc[a:b, 12:13].values
y = dataset.iloc[a+1:b+1, 5].values
y_svr = dataset.iloc[a+1:b+1, 5:6].values
y_svr = y_svr.reshape(-1,1)

print("\n")
print ("Input the parameters")

high = float(input())

low = float(input())
close = float(input())
spread = float(input())

arr = np.array([[high,low,close,spread]])
p_sum = 0
count = 0

# Feature Scaling for SVR


from sklearn.preprocessing import StandardScaler
sc_X = StandardScaler()
sc_y = StandardScaler()
X_svr = sc_X.fit_transform(X_svr)
y_svr = sc_y.fit_transform(y_svr)

# Fitting the SVR to the dataset


from sklearn.svm import SVR
regressor = SVR(kernel = 'rbf')
regressor.fit(X_svr, y_svr.ravel())        # SVR expects a 1-D target array

# Predicting a new result with SVR

y_pred4 = sc_y.inverse_transform(
    regressor.predict(sc_X.transform(np.array([[high]]))).reshape(-1, 1))[0, 0]

if low <= y_pred4 <= high:
    p_sum = p_sum + y_pred4
    count = count + 1

# Splitting the dataset into the Training set and Test set for Multiple Linear Regression
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X_mlr, y, test_size = 0.2, random_state = 0)

#Fitting Multiple linear regression to the training set


from sklearn.linear_model import LinearRegression
regressor = LinearRegression()
regressor.fit(X_train, y_train)

#Predicting the test set results using MLR
y_pred1 = regressor.predict(arr)[0]

if low <= y_pred1 <= high:
    p_sum = p_sum + y_pred1
    count = count + 1

# Fitting the Random Forest Regression Model to the dataset and predicting the result

# High
from sklearn.ensemble import RandomForestRegressor
regressor = RandomForestRegressor(n_estimators = 1000, random_state = 0)
regressor.fit(X_high, y)

y_pred3_1 = regressor.predict(np.array([[high]]))[0]   # predict expects a 2-D array

if low <= y_pred3_1 <= high:
    p_sum = p_sum + y_pred3_1
    count = count + 1

# Close
from sklearn.ensemble import RandomForestRegressor
regressor = RandomForestRegressor(n_estimators = 1000, random_state = 0)
regressor.fit(X_close, y)

y_pred3_2 = regressor.predict(np.array([[close]]))[0]

if low <= y_pred3_2 <= high:
    p_sum = p_sum + y_pred3_2
    count = count + 1

# Low
from sklearn.ensemble import RandomForestRegressor
regressor = RandomForestRegressor(n_estimators = 1000, random_state = 0)
regressor.fit(X_low, y)

y_pred3_3 = regressor.predict(np.array([[low]]))[0]

if low <= y_pred3_3 <= high:
    p_sum = p_sum + y_pred3_3
    count = count + 1

# Spread
from sklearn.ensemble import RandomForestRegressor
regressor = RandomForestRegressor(n_estimators = 1000, random_state = 0)
regressor.fit(X_spread, y)

y_pred3_4 = regressor.predict(np.array([[spread]]))[0]

if low <= y_pred3_4 <= high:
    p_sum = p_sum + y_pred3_4
    count = count + 1

# Fitting Polynomial Regression to the dataset and predicting the result

print ("\n\nThe dependence of the close value against the various input parameters is shown
as follows:\n\n")

# High
from sklearn.preprocessing import PolynomialFeatures
poly_reg = PolynomialFeatures(degree = 4)
X_poly = poly_reg.fit_transform(X_high)
lin_reg2 = LinearRegression()
lin_reg2.fit(X_poly, y)

y_pred2_1 = lin_reg2.predict(poly_reg.transform(np.array([[high]])))[0]

if low <= y_pred2_1 <= high:
    p_sum = p_sum + y_pred2_1
    count = count + 1

plt.scatter(X_high, y, color = 'red')
plt.plot(X_high, lin_reg2.predict(poly_reg.transform(X_high)), color = 'blue')
plt.title('Opening Price Vs High')
plt.xlabel('High')
plt.ylabel('Open')
plt.show()

# Close
from sklearn.preprocessing import PolynomialFeatures
poly_reg = PolynomialFeatures(degree = 4)
X_poly = poly_reg.fit_transform(X_close)
lin_reg2 = LinearRegression()
lin_reg2.fit(X_poly, y)

y_pred2_2 = lin_reg2.predict(poly_reg.transform(np.array([[close]])))[0]

if low <= y_pred2_2 <= high:
    p_sum = p_sum + y_pred2_2
    count = count + 1

plt.scatter(X_close, y, color = 'red')
plt.plot(X_close, lin_reg2.predict(poly_reg.transform(X_close)), color = 'blue')
plt.title('Opening Price Vs Closing Price')
plt.xlabel('Close')
plt.ylabel('Open')
plt.show()

# Low
from sklearn.preprocessing import PolynomialFeatures
poly_reg = PolynomialFeatures(degree = 4)
X_poly = poly_reg.fit_transform(X_low)
lin_reg2 = LinearRegression()
lin_reg2.fit(X_poly, y)

y_pred2_3 = lin_reg2.predict(poly_reg.transform(np.array([[low]])))[0]

if low <= y_pred2_3 <= high:
    p_sum = p_sum + y_pred2_3
    count = count + 1

plt.scatter(X_low, y, color = 'red')
plt.plot(X_low, lin_reg2.predict(poly_reg.transform(X_low)), color = 'blue')
plt.title('Opening Price Vs Low')
plt.xlabel('Low')
plt.ylabel('Open')
plt.show()

# Spread
from sklearn.preprocessing import PolynomialFeatures
poly_reg = PolynomialFeatures(degree = 4)
X_poly = poly_reg.fit_transform(X_spread)
lin_reg2 = LinearRegression()
lin_reg2.fit(X_poly, y)

y_pred2_4 = lin_reg2.predict(poly_reg.transform(np.array([[spread]])))[0]

if low <= y_pred2_4 <= high:
    p_sum = p_sum + y_pred2_4
    count = count + 1

plt.scatter(X_spread, y, color = 'red')
plt.plot(X_spread, lin_reg2.predict(poly_reg.transform(X_spread)), color = 'blue')
plt.title('Opening Price Vs Spread')
plt.xlabel('Spread')
plt.ylabel('Open')
plt.show()

# Calculating the Final Predicted Value

prediction = p_sum / count   # average of the predictions judged valid

print("\n\nThe Predicted Opening Price for " + c + " for the following day is : " + str(prediction))

