
GOLD RATE PREDICTION USING LINEAR REGRESSION

MACHINE LEARNING:

Simple terms: Making machine intelligent

ARTHUR SAMUEL (a pioneer of AI) defined ML as: 'The field of study that gives
computers the ability to learn without being explicitly programmed.'

Machine learning techniques:

1. Supervised Learning:
Used in applications where labelled historical data is used to predict future events.

2. Unsupervised Learning:

Used when the historical data is not labelled. These techniques are used to discover
unknown patterns in the data.

3. Semi-Supervised Learning:

An extension of supervised learning that uses both labelled and unlabelled data.

Applications: text processing, video indexing, informatics, web page and news classification.

4. Reinforcement Learning:
Discovers which actions yield the maximum reward through trial and error.
Applications: robotics, gaming and navigation.

STEPS INVOLVED IN SUPERVISED LEARNING:

 Train the machine with known data so that it learns from it.
 The machine classifies a new, unknown data point using the knowledge gained in the previous
step.
 The model is evaluated based on how accurately it classifies the unknown data.

TECHNIQUES OF SUPERVISED LEARNING:

1. CLASSIFICATION:
Used to predict discrete results
Eg: whether an email is spam or not
whether a transaction is fraudulent
2. REGRESSION:
Used to predict continuous numeric results
Eg: price of a car
delivery time
credit limit

PROCESS INVOLVED IN MACHINE LEARNING:

Feature Engineering is the process of using domain knowledge to select or create
significant features from the historical data relevant to the problem statement.
This engineered data is divided into two sets: train data and test data.
Data science models are built using the train data, and then the performance of the model is
evaluated on the test data.
This validated model is used for taking various decisions on new/unseen data points.
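A minimal sketch of the train/test split described above, using scikit-learn; the toy data and the 80/20 split ratio are assumptions for illustration only:

import numpy as np
from sklearn.model_selection import train_test_split

x = np.arange(10).reshape(-1, 1)   # toy feature matrix (10 samples, 1 feature)
y = np.arange(10)                  # toy target values
# split the engineered data into train data and test data (here 80% / 20%)
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2)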

REGRESSION MODEL:
SIMPLE LINEAR REGRESSION: Y = mX + C

MULTIPLE LINEAR REGRESSION: Y = m1X1 + m2X2 + m3X3 + … + C

Y = target/dependent variable

X = set of predictors/independent variables

m = coefficient (slope)

C = intercept

The residuals (errors) are normally distributed with constant variance.
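For example (purely illustrative numbers): if m = 2 and C = 5, then for X = 3 the model predicts Y = 2 × 3 + 5 = 11.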

There are four assumptions associated with a linear regression model:

1. Linearity: The relationship between the independent variables and the mean of
the dependent variable is linear.
2. Homoscedasticity: The variance of the residuals is constant for all values of the independent variables.
3. Independence: Observations are independent of each other.
4. Normality: The dependent variable is normally distributed for any fixed value
of an independent variable.

WHY LINEAR REGRESSION?

1. The linear regression method is very easy to use. If the relationship between the
variables (independent and dependent) is known, we can easily apply the appropriate
regression method (linear regression for a linear relationship).
2. After performing linear regression, we get the best-fit line, which is used for
prediction according to the business requirement.
3. Easy to implement, interpret and efficient to train.
4. Being a simple model, it is less prone to overfitting than more complex models.

Effects the model undergoes during training (which cause poor predictions):

1. Overfitting:
The most common problem in ML.
The model learns the details of the training data well, along with its noise.
Here the model is very precise on the training data, but not accurate on new data.
2. Underfitting:
The model cannot capture the underlying pattern in the data.

USES OF LINEAR REGRESSION:


 SALES FORECASTING
 RISK ANALYSIS
 HOUSING APPLICATIONS: TO PREDICT PRICE AND OTHER FACTORS
 FINANCE APPLICATIONS: TO PREDICT STOCK PRICE, INVESTMENT EVALUATION etc

TOOLS FOR MACHINE LEARNING:


R, PYTHON, SPARK, WEKA, JULIA etc
WHY PYTHON?
 Open source, general-purpose programming language
 Interpreted language
 High-level language - deals with “WHAT TO DO” instead of “HOW TO DO”.
 Easy to learn
 Portable
 Libraries: because of these, coding is made easier (a major reason Python gained
popularity in data science)

USES OF LIBRARIES:
 Faster application development
 Enhances code efficiency
 Achieve code modularization

LIBRARIES USED IN THIS PROGRAM:

1. NUMPY:
Used for scientific computation to get mathematical and structural understanding of data.
2. PANDAS:
Data structure and Data analysis.
3. MATPLOTLIB:
Plotting and visualization.
Eg: box plot, scatter plot, histogram.
4. SCIKIT-LEARN (sklearn):
Used to build predictive models

PROJECT
STEP 1: Import all the libraries that are required for the program.

Here matplotlib.pyplot is used for 2D graphics.

PLOTTING is of 2 types:
MATLAB-STYLE and OBJECT ORIENTED

MATLAB-STYLE: (matplotlib.pyplot). It is the simpler way, used for box plots, scatter plots and histograms.
OBJECT ORIENTED: Used for more control and customization of the plot.
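A minimal sketch of the imports used in the following steps; the exact import list of the original program may differ:

import numpy as np                                    # scientific computation
import pandas as pd                                   # data structures and data analysis
import matplotlib.pyplot as plt                       # 2D plotting (MATLAB-style interface)
from sklearn.linear_model import LinearRegression    # predictive model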

STEP 2: HISTORICAL DATA

The input data is read; data is available in multiple formats (eg: 'json', 'csv', 'xlsx', …).
CSV = COMMA SEPARATED VALUES

print(df.shape)
It describes the shape of the dataframe, i.e. the number of rows and columns.
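A minimal sketch of this step; the filename 'gold_rate.csv' is a hypothetical placeholder for the historical gold-rate CSV file:

df = pd.read_csv('gold_rate.csv')   # read the CSV file into a pandas dataframe
print(df.shape)                     # (number of rows, number of columns)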

STEP 3: PREPROCESSING
Since the data does not contain a uniform data format, this can be changed either in Excel or in
Python.

df.info()
It gives a complete summary of the dataframe (our 2D data structure). It includes
 A list of all columns with their data types.
 The number of non-null values in each column, etc…

From the above output it is very clear that DATE IS IN THE FORM OF AN OBJECT AND NOT IN
DATE FORM.

So for that we use pd.to_datetime(),

which converts the object format to the datetime format.

Finally, let's convert the data to the required format by using dt.strftime('%m-%d-%Y').

%m, %d, %Y are the format codes for month, day and year.

NOTE: "%Y" represents the year with century, e.g. 2011, 2012…,

while "%y" represents the year without century, e.g. 11, 12…


Output: [4954 rows x 6 columns]
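A minimal sketch of the preprocessing above; the column name 'Date' is taken from the next step, and strftime is applied only for display so the column stays in datetime form:

df.info()                                           # columns, data types, non-null counts
df['Date'] = pd.to_datetime(df['Date'])             # convert the object column to datetime
print(df['Date'].dt.strftime('%m-%d-%Y').head())    # show the dates as month-day-year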

STEP 4: FEATURE EXTRACTION/DIMENSION REDUCTION

Since y = mx + c and X = DATE is in date format, we need to convert it to an integer.

(df['Date'] - df['Date'][0]):

Here we add an extra column, Date1, to which the difference in days from the first date is
added.

np.timedelta64(1, 'D'):

The differences are still in days (timedelta) format; we convert them to integers by dividing
by this value,

where 'D' represents days.
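A minimal sketch of this step, using the column name 'Date1' from the description above:

# difference in days from the first date, converted from timedelta to integer
df['Date1'] = (df['Date'] - df['Date'][0]) / np.timedelta64(1, 'D')
df['Date1'] = df['Date1'].astype(int)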

STEP 5: ML ALGORITHM AND MODEL CREATION:

iloc:

It is a pandas indexer,

which helps us to locate specific rows or columns from the dataset.

df.iloc[a,b]

Here a- Row

b- Column

reshape(-1, 1):

Used to reshape a one-dimensional array into a two-dimensional array with a single column
(the shape scikit-learn expects). Only the shape changes; the data itself does not change.

fit(x, y):

It fits the linear regression model to the given data x and y.

model.intercept_:

It gives the "c" (intercept) value.

model.coef_:

It gives the "m" (coefficient/slope) value.
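A minimal sketch of the model creation; the column positions passed to iloc are assumptions about where Date1 and the gold rate sit in this dataframe:

x = df.iloc[:, -1].values.reshape(-1, 1)   # Date1 (integer days) as a single-column 2D array
y = df.iloc[:, 1].values                   # gold-rate column (position assumed)
model = LinearRegression()
model.fit(x, y)                            # fit the linear regression model
print(f'c = {model.intercept_}')           # intercept "c"
print(f'm = {model.coef_}')                # coefficient "m"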


Similarly for the standard gold rate:
In order to add labels to the comparison plot, we use legend() (show() only displays the plot; it does not add the labels).
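A minimal sketch of the comparison plot; the label names are illustrative:

plt.scatter(x, y, label='Actual gold rate')              # actual data points
plt.plot(x, model.predict(x), label='Predicted rate')    # best-fit line from the model
plt.legend()                                             # add the labels to the plot
plt.show()                                               # display the plot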

Advantages of using SCATTER PLOT:

1. Simplicity: It is a simple and non-mathematical method of studying the correlation between the
two variables.
2. Easily understandable: It can be easily understood and interpreted. It enables us to know the
presence or absence of correlation at a single glance of the diagram.
3. Not affected by extreme items: It is not influenced by the size of extreme values, whereas most of
the mathematical methods lack this quality.
4. First Step: It is the first step in investigating the relationship between two variables.
