
Multiple Linear Regression using Python Machine Learning

Objective: The objective of this exercise is to predict Net
Primary Productivity (NPP), a major indicator of ecosystem
health, from climate and land use data for the Upper Blue Nile
Basin, Ethiopia, East Africa (Figure 1). NPP is derived from
Gross Primary Productivity (GPP), an ecosystem-level parameter
that refers to the rate at which green plants produce organic
matter by assimilating carbon dioxide using solar energy through
photosynthesis (Liang et al., 2012). NPP is the difference
between GPP and plant autotrophic respiration. Approximately 50%
of the organic matter generated by gross primary production is
released into the atmosphere through plant respiration. The other
half, which constitutes NPP, is the biomass produced in a given
time (Liang et al., 2012). The following variables were used:
• The NPP dataset (dependent variable) for the years 2001 to 2010
  was downloaded from NASA’s Reverb/ECHO website. Data from 2001
  were used for the regression analysis.
• Precipitation: GPCC (Global Precipitation Climatology Centre)
  raster image.
• Land use/land cover classification images for 2001 and 2010
  were acquired from MODIS Land Cover (MCDQ12) via Reverb/ECHO.
• Fraction of Photosynthetically Active Radiation (fAPAR): SPOT
  satellite, AfSIS raster image(
• Digital Elevation Model (DEM)-
• Minimum Temperature, Vapor Pressure, WSI (Water Stress Index)
  derived from Potential Evapotranspiration and Actual
  Evapotranspiration of the CRU 3.22 time-series data (Climatic
  Research Unit, University of East Anglia).

Kaleab Woldemariam, June 2017


Figure 1: Location of the Study Area.

In this exercise, a total of 2,377 random sample points were
collected from the raster data using ArcGIS 10.3. I used the
Pandas module to load the comma-delimited (CSV) file, the NumPy
module to convert the data into arrays, scikit-learn to compute
the multiple linear regression, and the Matplotlib module to plot
the results.

Certain assumptions about the dataset must be met before
conducting multiple linear regression. In ecological studies,
both the statistical and the spatial context must be considered
in modeling. For simplicity, the statistical assumptions are
treated as met in this exercise. Multiple linear regression
assumes:

(i) Normality
(ii) Homogeneity of Variance
(iii) Fixed X (X represents explanatory variables)
(iv) Independence
(v) Correct model specification (Zuur et al., 2007).
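Two of these assumptions, normality and homogeneity of variance, are usually checked on the residuals of the fitted model. The following is a minimal sketch of such a check, assuming synthetic data in place of the real sample points; the variable names and the simple moment-based checks are illustrative choices, not the procedure used in this exercise.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)

# Synthetic stand-in for the real predictors and response (hypothetical data)
X = rng.normal(size=(500, 3))
y = 2.0 + X @ np.array([1.5, -0.7, 0.3]) + rng.normal(scale=0.5, size=500)

lm = LinearRegression().fit(X, y)
fitted = lm.predict(X)
residuals = y - fitted

# Normality: a normal distribution has skewness near 0 and kurtosis near 3
z = (residuals - residuals.mean()) / residuals.std()
skewness = np.mean(z ** 3)
kurtosis = np.mean(z ** 4)

# Homogeneity of variance: the residual spread should be similar
# for low and high fitted values, so the ratio should be near 1
order = np.argsort(fitted)
low_half, high_half = residuals[order[:250]], residuals[order[250:]]
variance_ratio = low_half.var() / high_half.var()

print(f"skewness={skewness:.2f}, kurtosis={kurtosis:.2f}, "
      f"variance ratio={variance_ratio:.2f}")
```

Values far from these reference points would suggest that a transformation or a different model should be considered.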

Note that the land use-land cover (LULC) data are categorical and
needed to be converted to dummy variables (0/1 values). I used
the Pandas function pd.get_dummies to encode the nominal LULC
data so that it could be included in predicting NPP.
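As a small illustration of what pd.get_dummies produces, the sketch below encodes a made-up LULC column. The class labels and the "LULC" prefix are hypothetical; the real class codes come from the land cover product.

```python
import pandas as pd

# Hypothetical LULC class labels, for illustration only
df = pd.DataFrame(
    {"LULC": ["Cropland", "Forest", "Grassland", "Forest", "Cropland"]}
)

# One 0/1 column per class; each row has exactly one 1
dummies = pd.get_dummies(df["LULC"], prefix="LULC", dtype=int)
print(dummies)
```

pd.get_dummies also accepts drop_first=True, which drops one reference class. This is a common choice when the model includes an intercept, since the full set of dummy columns sums to one in every row.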

To separate the numerical and categorical data, I put the
Precipitation, fAPAR, Minimum Temperature, Vapor Pressure and WSI
features (numerical independent variables) in a separate pandas
DataFrame, data1, and the categorical LULC data in dummies, and
then joined the two datasets as a NumPy array “X”. The dependent
variable NPP2001 was also converted to an array “y” using NumPy.
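The join described above can be sketched on a tiny table. The column names "Precip", "fAPAR" and "LULC" and all values are hypothetical, since the real CSV headers are not shown in the text; only NPP2001, data1, dummies, X and y follow the names used in this exercise.

```python
import numpy as np
import pandas as pd

# Hypothetical column names and values, for illustration only
df = pd.DataFrame({
    "Precip": [1200.0, 950.0, 1430.0, 1100.0],
    "fAPAR": [0.41, 0.35, 0.58, 0.47],
    "LULC": ["Cropland", "Forest", "Forest", "Grassland"],
    "NPP2001": [0.62, 0.55, 0.91, 0.70],
})

data1 = df[["Precip", "fAPAR"]]                     # numerical predictors
dummies = pd.get_dummies(df["LULC"], dtype=float)   # LULC as 0/1 columns

# Join on the shared row index, then convert to plain arrays
X = np.array(data1.join(dummies))   # (n_samples, n_numeric + n_classes)
y = np.array(df["NPP2001"])         # (n_samples,)

print(X.shape, y.shape)
```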

The model is trained on samples with known outputs, then
evaluated on test data, and can afterwards be applied to data it
was not trained on. The training data (X_train, y_train) are used
to fit the regression model (make a linear model). The test data
(X_test, y_test) are used to check the prediction ability
(accuracy) of the model. The fitted model is then used to predict
NPP2001 from the independent variables.
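The train/test workflow can be tried end to end on synthetic data before running it on the real sample points. The sketch below assumes random predictors and made-up coefficients; only the 2,377 sample size and the 25% test share are taken from this exercise, and random_state is fixed purely to make the run repeatable.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)

# Synthetic stand-in for the 2,377 sample points (hypothetical values)
X = rng.normal(size=(2377, 5))
true_coefs = np.array([0.8, -0.5, 0.3, 0.0, 1.2])
y = 1.0 + X @ true_coefs + rng.normal(scale=0.5, size=2377)

# Hold out 25% of the samples for testing
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

lm = LinearRegression()
lm.fit(X_train, y_train)             # learn coefficients from training data only
predictions = lm.predict(X_test)     # predict the held-out samples
r2_test = lm.score(X_test, y_test)   # R2 on data the model has not seen

print(X_train.shape, X_test.shape, round(r2_test, 3))
```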

'''Regression for predicting NPP using features (independent variables) in Machine Learning'''

import math
import numpy as np
import pandas as pd
from sklearn import preprocessing,svm
from sklearn.preprocessing import StandardScaler
from sklearn import model_selection,metrics

from sklearn.linear_model import LinearRegression

from sklearn.model_selection import (train_test_split, KFold,
                                     cross_val_score, cross_val_predict)

import matplotlib.pyplot as plt

from matplotlib import style
import datetime
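df = pd.read_csv(raw_data)  # raw_data: path to the comma-delimited file of sample points
# Create a DataFrame for numerical features (Precipitation, fAPAR, Minimum
# Temperature, Vapor Pressure, WSI); the column names are cut off in the source
data1 = pd.DataFrame(df, columns=[...])

# Create a DataFrame for categorical features (LULC); the right-hand side of
# the assignment is cut off in the source
cols_to_transform = ...
dummies = pd.get_dummies(cols_to_transform)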
# Join data1 and dummies using Numpy and yield as array
X = np.array(data1.join(dummies))

# Specify the dependent variable as array

y = np.array(df['NPP2001'])

lm = LinearRegression(n_jobs=-1)

'''To check the accuracy/confidence level of the prediction,
we use 25% of the data as the test set, while 75% is used for training.'''
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25)
print (X_train.shape, y_train.shape)
print (X_test.shape, y_test.shape)
# First we fit a model
model = lm.fit(X_train, y_train)
# Print the coefficients
print("The linear coefficients", model.coef_)
# Try to predict y (NPP_Predict) for the test data features (independent variables)
predictions = lm.predict(X_test)
# Accuracy of the prediction (R2 on the test set)
confidence = lm.score(X_test, y_test)
print("This is predicted NPP2001 Values", predictions)
print("This is the prediction accuracy", confidence)
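plt.title("Actual NPP2001 vs. NPP2001_Predict", size=10)
plt.scatter(y_test, predictions, color='c', marker='.')
plt.xlabel("Actual NPP2001", size=10)
plt.ylabel("NPP2001_Predict", size=10)
plt.show()

# Residual plot. The plotting call is missing in the source; plotting the
# residuals against the actual values is an assumption.
plt.scatter(y_test, y_test - predictions, color='c', marker='.')
plt.title("Homogeneity of Variance")
plt.xlabel("Actual NPP2001")
plt.ylabel("Residual")
plt.show()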
# Perform 10-fold Cross Validation (KFold)
# (the cross_val_score call is missing in the source and is reconstructed here)
scores = cross_val_score(lm, X, y, cv=10)
print("Cross Validated Scores", scores)
kf = KFold(n_splits=10, random_state=None, shuffle=True)
for train_index, test_index in kf.split(X):
    print("TRAIN", train_index, "TEST", test_index)
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
# Make Cross Validated predictions (call reconstructed from the text)
predictions2 = cross_val_predict(lm, X, y, cv=10)
# Check the R2, the proportion of variance in the dependent variable explained by the predictors
accuracy = metrics.r2_score(y, predictions2)
print("This is R2", accuracy)
plt.scatter(y, predictions2, color='c', marker='.')
plt.xlabel("Actual NPP2001", size=10)
plt.ylabel("NPP2001_Predict", size=10)
plt.title("Actual and Predicted NPP2001 Values using 10 Fold Cross Validation")
plt.show()

The steps used so far are:

• Load the data.
• Convert categorical variables to dummies and join them to the
  numerical variables.
• Split the sample (2,377 points) into training and test sets.
• Use the training data to fit a regression model.
• Make predictions based on the X_test data.
• Compute the accuracy of the prediction (score).

A single train/test split is not enough to guarantee that the
test samples are representative. If the split is not
representative, the test score can be misleading and overfitting
may go undetected. Overfitting means the model is “too well
trained”: it fits the training data closely but cannot be applied
to other data. It typically happens when the model uses too many
predictors; while it works very well on the training set, it
fails on new, unseen data. This means we cannot make reliable
inferences from our model.
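The effect of too many predictors can be reproduced with a small synthetic experiment. In this sketch, which assumes made-up data with one useful predictor and 40 irrelevant ones, the model scores almost perfectly on the training set and clearly worse on the held-out set.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)

n_samples = 60
signal = rng.normal(size=(n_samples, 1))        # one genuinely useful predictor
noise_cols = rng.normal(size=(n_samples, 40))   # 40 predictors unrelated to y
y = 3.0 * signal[:, 0] + rng.normal(scale=1.0, size=n_samples)

X = np.hstack([signal, noise_cols])
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# 45 training samples and 41 predictors: the model can fit the noise
lm = LinearRegression().fit(X_train, y_train)
r2_train = lm.score(X_train, y_train)
r2_test = lm.score(X_test, y_test)

print(f"train R2 = {r2_train:.3f}, test R2 = {r2_test:.3f}")
```

A large gap between the two scores is the usual symptom of overfitting.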

A cross-validation method called K-Folds Cross Validation is used
to split the sample into k different subsets (or folds). Each
fold in turn is held out as test data while the remaining k-1
folds are used to train the model. The scores from the k folds
are then averaged before the model is finalized, after which it
can be tested against the test set. Cross-validated predictions
are made by supplying the cross_val_predict function with the
model, X (the original independent variables, not only the test
subset), y (the dependent variable) and cv (the number of
cross-validation folds). Because every sample is predicted once,
while it is in the held-out fold, the plot contains one point for
each of the 2,377 samples rather than only the points of the 25%
test set.
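The K-fold procedure can be sketched with scikit-learn on synthetic data. The data, the seed and the explicit KFold object passed as cv are assumptions made for illustration; the names scores, predictions2 and accuracy mirror those used in this exercise.

```python
import numpy as np
from sklearn import metrics
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_predict, cross_val_score

rng = np.random.default_rng(7)

# Synthetic stand-in for the sample points (hypothetical values)
X = rng.normal(size=(300, 4))
y = 0.5 + X @ np.array([1.0, -2.0, 0.5, 0.0]) + rng.normal(scale=1.0, size=300)

lm = LinearRegression()
kf = KFold(n_splits=10, shuffle=True, random_state=0)

scores = cross_val_score(lm, X, y, cv=kf)           # one R2 per fold
predictions2 = cross_val_predict(lm, X, y, cv=kf)   # one prediction per sample
accuracy = metrics.r2_score(y, predictions2)        # R2 over all out-of-fold predictions

print("Cross Validated Scores", scores)
print("Mean score", scores.mean())
print("This is R2", accuracy)
```

Note that predictions2 has the same length as y: each sample is predicted exactly once, by the model trained on the folds that do not contain it.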
#Perform 10 fold Cross Validation (KFold)
print ("Cross Validated Scores",scores)

Cross Validated Scores [ 0.34638801 0.56139146 0.61525375 0.7076254 0.70162425
 0.49563864 0.61883974 0.52543957 0.33933734 0.10156286]

# Make Cross Validated predictions


Finally, R2, the proportion of variance explained by the
predictors, is given by:


The result indicates that the predictors account for 70.2% of the
variance in the Net Primary Productivity for year 2001.
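For reference, R2 is one minus the ratio of the residual sum of squares to the total sum of squares. The sketch below computes it by hand on a few made-up values and compares the result with metrics.r2_score; the numbers are hypothetical and unrelated to the NPP data.

```python
import numpy as np
from sklearn import metrics

# Hypothetical observed and predicted values, for illustration only
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.5, 5.5, 7.0, 9.0])

ss_res = np.sum((y_true - y_pred) ** 2)          # residual sum of squares
ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total sum of squares
r2_manual = 1.0 - ss_res / ss_tot

r2_sklearn = metrics.r2_score(y_true, y_pred)
print(r2_manual, r2_sklearn)
```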

The linear equation:
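A fitted scikit-learn model stores the intercept in intercept_ and the slopes in coef_, so the equation can be assembled from those two attributes. The sketch below assumes hypothetical feature names and noise-free synthetic data with known coefficients; it shows only how the equation is put together, not the coefficients obtained for NPP2001.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical feature names and noise-free synthetic data,
# chosen so that the known coefficients are recovered
feature_names = ["Precip", "fAPAR", "Tmin"]
rng = np.random.default_rng(3)
X = rng.normal(size=(100, 3))
y = 4.0 + 2.0 * X[:, 0] - 1.0 * X[:, 1] + 0.5 * X[:, 2]

lm = LinearRegression().fit(X, y)

# One signed term per predictor, in the same order as the columns of X
terms = [f"{coef:+.3f}*{name}" for coef, name in zip(lm.coef_, feature_names)]
equation = f"NPP2001 = {lm.intercept_:.3f} " + " ".join(terms)
print(equation)
```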



cross-validation-in-python-80b61beca4b6 retrieved on June 28,

Liang, S., Li, X., Wang, J., 2012. Advanced Remote Sensing:
Terrestrial Information Extraction and Applications, Academic
Press, pp. 800.

Zuur, A. K., Ieno, E. N., Smith, G. M., 2007. Statistics for
Biology and Health: Analyzing Ecological Data, Springer Science +
Business Media, LLC.
