
1/21/22, 10:10 PM Predictive Modelling - Secondary

1.1 Read the data and do exploratory data analysis. Describe the data briefly. (Check the null
values, data types, shape, EDA). Perform Univariate and Bivariate Analysis. (8 marks)

In [1]:
### Load the libraries needed for the analysis and the model

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn import metrics
import matplotlib.style
import scipy.stats as stats
from warnings import filterwarnings

# Silence library warnings for cleaner notebook output
filterwarnings('ignore')

In [2]:
df = pd.read_csv('Firm_level_data (1) (1) (1).csv')

In [3]:
df.head()

Out[3]:
   Unnamed: 0        sales      capital  patents        randd  employment sp500     tobinq         value  institutions
0           0   826.995050   161.603986       10   382.078247    2.306000    no  11.049511   1625.453755         80.27
1           1   407.753973   122.101012        2     0.000000    1.860000    no   0.844187    243.117082         59.02
2           2  8407.845588  6221.144614      138  3296.700439   49.659005   yes   5.205257  25865.233800         47.70
3           3   451.000010   266.899987        1    83.540161    3.071000    no   0.305221     63.024630         26.88
4           4   174.927981   140.124004        2    14.233637    1.947000    no   1.063300     67.406408         49.46

In [4]:
df.tail()

Out[4]:
     Unnamed: 0        sales     capital  patents       randd  employment sp500    tobinq       value  institutions
754         754  1253.900196  708.299935       32  412.936157   22.100002   yes  0.697454  267.119487         33.50
755         755   171.821025   73.666008        1    0.037735    1.684000    no       NaN  228.475701         46.41
756         756   202.726967  123.926991       13   74.861099    1.460000    no  5.229723  580.430741         42.25
757         757   785.687944  138.780992        6    0.621750    2.900000   yes  1.625398  309.938651         61.39
758         758    22.701999   14.244999        5   18.574360    0.197000    no  2.213070   18.940140          7.50

In [5]:
df.shape

Out[5]: (759, 10)

In [6]:
df.info()




<class 'pandas.core.frame.DataFrame'>

RangeIndex: 759 entries, 0 to 758

Data columns (total 10 columns):

# Column Non-Null Count Dtype

--- ------ -------------- -----

0 Unnamed: 0 759 non-null int64

1 sales 759 non-null float64

2 capital 759 non-null float64

3 patents 759 non-null int64

4 randd 759 non-null float64

5 employment 759 non-null float64

6 sp500 759 non-null object

7 tobinq 738 non-null float64

8 value 759 non-null float64

9 institutions 759 non-null float64

dtypes: float64(7), int64(2), object(1)

memory usage: 59.4+ KB

In [7]:
### Data Description

df.describe(include = 'all').T

Out[7]: count unique top freq mean std min 25% 50%

Unnamed: 0 759.0 NaN NaN NaN 379.0 219.248717 0.0 189.5 379.0

sales 759.0 NaN NaN NaN 2689.705158 8722.060124 0.138 122.92 448.577082

capital 759.0 NaN NaN NaN 1977.747498 6466.704896 0.057 52.650501 202.179023

patents 759.0 NaN NaN NaN 25.831357 97.259577 0.0 1.0 3.0

randd 759.0 NaN NaN NaN 439.938074 2007.397588 0.0 4.628262 36.864136

employment 759.0 NaN NaN NaN 14.164519 43.321443 0.006 0.9275 2.924

sp500 759 2 no 542 NaN NaN NaN NaN NaN

tobinq 738.0 NaN NaN NaN 2.79491 3.366591 0.119001 1.018783 1.680303

value 759.0 NaN NaN NaN 2732.73475 7071.072362 1.971053 103.593946 410.793529

institutions 759.0 NaN NaN NaN 43.02054 21.685586 0.0 25.395 44.11

In [8]:
### Null value check

df.isnull().sum()

Out[8]: Unnamed: 0 0

sales 0

capital 0

patents 0

randd 0

employment 0

sp500 0

tobinq 21

value 0

institutions 0

dtype: int64

In [9]:
# Percentage of missing values

100 * df.isnull().sum() / len(df)

Out[9]: Unnamed: 0 0.000000

sales 0.000000


capital 0.000000

patents 0.000000

randd 0.000000

employment 0.000000

sp500 0.000000

tobinq 2.766798

value 0.000000

institutions 0.000000

dtype: float64

In [10]:
df2=df.drop(['Unnamed: 0'], axis = 1)

In [11]:
df2.head()

Out[11]:
         sales      capital  patents        randd  employment sp500     tobinq         value  institutions
0   826.995050   161.603986       10   382.078247    2.306000    no  11.049511   1625.453755         80.27
1   407.753973   122.101012        2     0.000000    1.860000    no   0.844187    243.117082         59.02
2  8407.845588  6221.144614      138  3296.700439   49.659005   yes   5.205257  25865.233800         47.70
3   451.000010   266.899987        1    83.540161    3.071000    no   0.305221     63.024630         26.88
4   174.927981   140.124004        2    14.233637    1.947000    no   1.063300     67.406408         49.46

In [12]:
### check for duplicates in data

dups = df2.duplicated()

print('Number of duplicate rows = %d' % (dups.sum()))

Number of duplicate rows = 0

In [13]:
### unique values for categorical variables

for column in df2.columns:
    if df2[column].dtype == 'object':
        print(column.upper(), ': ', df2[column].nunique())
        print(df2[column].value_counts().sort_values())
        print('\n')

SP500 : 2

yes 217

no 542

Name: sp500, dtype: int64

In [14]:
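# Treat the missing values: fill tobinq with its column mean

mean_value = df2['tobinq'].mean()

# Assign the result back instead of calling fillna(inplace=True) on a column selection
df2['tobinq'] = df2['tobinq'].fillna(mean_value)

print('Updated Dataframe:')

print(df2)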

Updated Dataframe:

sales capital patents randd employment sp500 \

0 826.995050 161.603986 10 382.078247 2.306000 no

1 407.753973 122.101012 2 0.000000 1.860000 no

2 8407.845588 6221.144614 138 3296.700439 49.659005 yes

3 451.000010 266.899987 1 83.540161 3.071000 no

4 174.927981 140.124004 2 14.233637 1.947000 no

.. ... ... ... ... ... ...

754 1253.900196 708.299935 32 412.936157 22.100002 yes


755 171.821025 73.666008 1 0.037735 1.684000 no

756 202.726967 123.926991 13 74.861099 1.460000 no

757 785.687944 138.780992 6 0.621750 2.900000 yes

758 22.701999 14.244999 5 18.574360 0.197000 no

tobinq value institutions

0 11.049511 1625.453755 80.27

1 0.844187 243.117082 59.02

2 5.205257 25865.233800 47.70

3 0.305221 63.024630 26.88

4 1.063300 67.406408 49.46

.. ... ... ...

754 0.697454 267.119487 33.50

755 2.794910 228.475701 46.41

756 5.229723 580.430741 42.25

757 1.625398 309.938651 61.39

758 2.213070 18.940140 7.50

[759 rows x 9 columns]
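
The cell above fills the 21 missing `tobinq` values with the column mean. Since `tobinq` turns out to be right-skewed, the median may be a more robust fill value. The following is only a minimal sketch on made-up numbers (the `toy` Series is hypothetical and is not the firm data) comparing the two choices:

```python
import numpy as np
import pandas as pd

# Hypothetical right-skewed sample with two missing entries (not the firm data).
toy = pd.Series([0.5, 0.8, 1.0, 1.2, 1.5, 20.0, np.nan, np.nan])

mean_fill = toy.fillna(toy.mean())      # fill value is pulled upward by the single large entry
median_fill = toy.fillna(toy.median())  # fill value stays close to the bulk of the data

print("mean used for filling:  ", round(toy.mean(), 3))
print("median used for filling:", round(toy.median(), 3))
print("missing after fill:", int(mean_fill.isna().sum()), int(median_fill.isna().sum()))
```

With only about 2.8% of `tobinq` missing, either choice is unlikely to move the model much; the sketch only illustrates why the two can differ on skewed data.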

In [15]:
df2.isnull().sum()

Out[15]: sales 0

capital 0

patents 0

randd 0

employment 0

sp500 0

tobinq 0

value 0

institutions 0

dtype: int64

In [16]:
df2.describe()

Out[16]: sales capital patents randd employment tobinq val

count 759.000000 759.000000 759.000000 759.000000 759.000000 759.000000 759.0000

mean 2689.705158 1977.747498 25.831357 439.938074 14.164519 2.794910 2732.7347

std 8722.060124 6466.704896 97.259577 2007.397588 43.321443 3.319629 7071.0723

min 0.138000 0.057000 0.000000 0.000000 0.006000 0.119001 1.9710

25% 122.920000 52.650501 1.000000 4.628262 0.927500 1.036000 103.5939

50% 448.577082 202.179023 3.000000 36.864136 2.924000 1.741800 410.7935

75% 1822.547366 1075.790020 11.500000 143.253403 10.050001 3.082979 2054.1603

max 135696.788200 93625.200560 1220.000000 30425.255860 710.799925 20.000000 95191.5911

In [17]:
## Univariate analysis: histogram with KDE and box plot for each numeric column

def plot_distribution(data, columns, width=12, row_height=4):
    sns.set_style('whitegrid')
    # One row per column so that no pair of plots is drawn on top of another
    fig, axes = plt.subplots(nrows=len(columns), ncols=2, squeeze=False)
    fig.set_size_inches(width, row_height * len(columns))
    for row, column in enumerate(columns):
        # Left panel: distribution; right panel: box plot of the same column
        sns.histplot(data[column], kde=True, ax=axes[row][0])
        axes[row][0].set_title(column + " Distribution", fontsize=10)
        sns.boxplot(y=data[column], ax=axes[row][1])
        axes[row][1].set_title(column + " Distribution", fontsize=10)
    fig.tight_layout()

plot_distribution(df2, ['sales', 'patents', 'capital', 'randd', 'employment', 'tobinq', 'value'])

[Figure: distribution and box plot for each of sales, patents, capital, randd, employment, tobinq, value]


In [18]:
# Same pair of plots for sales, patents and institutions
plot_cols = ['sales', 'patents', 'institutions']

sns.set_style('whitegrid')
fig, axes = plt.subplots(nrows=len(plot_cols), ncols=2, squeeze=False)
fig.set_size_inches(12, 5 * len(plot_cols))

for row, column in enumerate(plot_cols):
    # Left panel: distribution; right panel: box plot of the same column
    sns.histplot(df2[column], kde=True, ax=axes[row][0])
    axes[row][0].set_title(column + " Distribution", fontsize=10)
    sns.boxplot(y=df2[column], ax=axes[row][1])
    axes[row][1].set_title(column + " Distribution", fontsize=10)

[Figure: distribution and box plot for each of sales, patents, institutions]


In [19]:
df2.columns

Out[19]: Index(['sales', 'capital', 'patents', 'randd', 'employment', 'sp500', 'tobinq',

'value', 'institutions'],

dtype='object')


In [20]:
# Skewness of the numeric columns only (sp500 is a string column)
df2.skew(numeric_only=True)

Out[20]: sales 9.219023

capital 7.555091

patents 7.766943

randd 10.270483

employment 9.068875

tobinq 3.332006

value 6.075996

institutions -0.168071

dtype: float64
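
Every column except `institutions` shows strong positive skewness. One common way to tame this kind of skew is a log transform; `np.log1p` is used below because `randd` and `patents` contain zeros. This is only a sketch on a synthetic heavy-tailed sample (the `toy` Series and its parameters are assumptions, not the firm data), and the notebook itself does not apply this transform:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Hypothetical heavy-tailed, non-negative sample standing in for a column such as sales.
toy = pd.Series(rng.lognormal(mean=5.0, sigma=1.5, size=1000))

skew_raw = toy.skew()
skew_log = np.log1p(toy).skew()   # log(1 + x) is defined at x = 0

print("skewness before:", round(skew_raw, 2))
print("skewness after log1p:", round(skew_log, 2))
```

If a transform like this were applied to the target, predictions would need to be mapped back with `np.expm1` before computing RMSE in the original units.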

In [21]:
### Data Distribution

sns.pairplot(df2, diag_kind='kde')

plt.show()

In [22]:
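## Data Distribution

# Pairplot using sns, colored by S&P 500 membership.
# A continuous hue such as sales would create one color level per distinct value.

sns.pairplot(df2, diag_kind='kde', hue='sp500');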


In [23]:
### Check correlations between the numeric columns (sp500 is a string column and is excluded)

df_cor = df2.select_dtypes(include='number').corr()

plt.figure(figsize=(8,6))

sns.heatmap(df_cor, annot=True, fmt = '.2f', cmap='coolwarm')

Out[23]: <AxesSubplot:>
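
A heatmap is convenient to look at, but the strongest pairs can also be listed directly from the correlation matrix. The sketch below does this on synthetic data; the columns `a`, `b`, `c` and the 0.8 threshold are illustrative assumptions, not values from the notebook:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
n = 200
driver = rng.normal(size=n)

# Hypothetical columns: a and b share a common driver, c is independent.
toy = pd.DataFrame({
    "a": driver + 0.1 * rng.normal(size=n),
    "b": driver + 0.1 * rng.normal(size=n),
    "c": rng.normal(size=n),
})

corr = toy.corr()

# Keep only the upper triangle so each pair appears once and the diagonal is dropped.
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
pairs = upper.stack().dropna().sort_values(ascending=False)

strong = pairs[pairs.abs() > 0.8]   # 0.8 is an arbitrary illustrative threshold
print(strong)
```

Applied to the notebook's data, the same pattern would use `df2.select_dtypes(include='number').corr()` in place of `toy.corr()`.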


1.2 Impute null values if present? Do you think scaling is necessary in this case? (8 marks)

In [24]:
df2.isnull().sum()

Out[24]: sales 0

capital 0

patents 0

randd 0

employment 0

sp500 0

tobinq 0

value 0

institutions 0

dtype: int64

In [25]:
df2[df2.isin([0])].stack(0)

Out[25]: 1 randd 0.0

6 patents 0.0

7 randd 0.0

18 patents 0.0

22 patents 0.0

...

744 patents 0.0

746 randd 0.0

749 patents 0.0

randd 0.0

751 patents 0.0

Length: 280, dtype: object

In [26]:
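# Median imputation for any numeric column that still has missing values.
# df2 has none left after the mean fill of tobinq in In [14], so this loop changes nothing.
for column in df2.columns:
    if df2[column].dtype != 'object':
        median = df2[column].median()
        df2[column] = df2[column].fillna(median)

# Note: this inspects the original df, not df2, so tobinq still reports its 21 missing values
df.isnull().sum()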


Out[26]: Unnamed: 0 0

sales 0

capital 0

patents 0

randd 0

employment 0

sp500 0

tobinq 21

value 0

institutions 0

dtype: int64

In [27]:
df2.dtypes

Out[27]: sales float64

capital float64

patents int64

randd float64

employment float64

sp500 object

tobinq float64

value float64

institutions float64

dtype: object

In [28]:
#from sklearn.preprocessing import StandardScaler

#sc = StandardScaler()

# get numeric data

#num_d = df2.select_dtypes(exclude=['object'])

# update the cols with their normalized values

#df2[num_d.columns] = sc.fit_transform(num_d)
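
The scaling cell above is commented out. For ordinary least squares with an intercept, standardizing the predictors rescales the coefficients but leaves the fitted values, and therefore R-squared and RMSE, unchanged. The sketch below checks this on synthetic data using plain NumPy; all names and numbers in it are assumptions made for illustration, and the z-score step mirrors what `StandardScaler` does:

```python
import numpy as np

rng = np.random.default_rng(42)
n = 300

# Hypothetical predictors on very different scales, plus a noisy linear target.
X = np.column_stack([rng.normal(1000, 300, n), rng.normal(5, 2, n)])
y = 3.0 + 0.02 * X[:, 0] + 4.0 * X[:, 1] + rng.normal(0, 1, n)

def fit_predict(features, target):
    """Fit OLS with an intercept via least squares; return coefficients and fitted values."""
    design = np.column_stack([np.ones(len(features)), features])
    coef, *_ = np.linalg.lstsq(design, target, rcond=None)
    return coef, design @ coef

def r_squared(target, fitted):
    rss = np.sum((target - fitted) ** 2)
    tss = np.sum((target - target.mean()) ** 2)
    return 1 - rss / tss

# z-score each column (population standard deviation, ddof=0)
X_scaled = (X - X.mean(axis=0)) / X.std(axis=0)

coef_raw, fitted_raw = fit_predict(X, y)
coef_scaled, fitted_scaled = fit_predict(X_scaled, y)

print("R^2 raw:   ", round(r_squared(y, fitted_raw), 6))
print("R^2 scaled:", round(r_squared(y, fitted_scaled), 6))
print("slopes raw:   ", coef_raw[1:])
print("slopes scaled:", coef_scaled[1:])
```

Scaling would still matter for comparing coefficient magnitudes across predictors, or for regularized or distance-based models, which this notebook does not use.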

In [29]:
df2.head()

Out[29]:
         sales      capital  patents        randd  employment sp500     tobinq         value  institutions
0   826.995050   161.603986       10   382.078247    2.306000    no  11.049511   1625.453755         80.27
1   407.753973   122.101012        2     0.000000    1.860000    no   0.844187    243.117082         59.02
2  8407.845588  6221.144614      138  3296.700439   49.659005   yes   5.205257  25865.233800         47.70
3   451.000010   266.899987        1    83.540161    3.071000    no   0.305221     63.024630         26.88
4   174.927981   140.124004        2    14.233637    1.947000    no   1.063300     67.406408         49.46

In [30]:
#outlier check

In [31]:
df2.isnull().sum()

Out[31]: sales 0

capital 0

patents 0

randd 0

employment 0

sp500 0


tobinq 0

value 0

institutions 0

dtype: int64

In [32]:
cols = ['sales' ,'capital', 'patents', 'randd', 'employment','tobinq','value','insti

for i in cols:
    sns.boxplot(df2[i])
    plt.show()


In [33]:
cont=df2.dtypes[(df2.dtypes!='uint8') & (df2.dtypes!='object')].index

In [34]:
def remove_outlier(col):
    # Return the lower and upper IQR fences (Q1 - 1.5*IQR, Q3 + 1.5*IQR) for a column
    Q1, Q3 = np.percentile(col, [25, 75])
    IQR = Q3 - Q1
    lower_range = Q1 - (1.5 * IQR)
    upper_range = Q3 + (1.5 * IQR)
    return lower_range, upper_range

In [35]:
# Cap each numeric column at its IQR fences instead of dropping rows
for column in df2[cont].columns:
    lr, ur = remove_outlier(df2[column])
    df2[column] = np.where(df2[column] > ur, ur, df2[column])
    df2[column] = np.where(df2[column] < lr, lr, df2[column])
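
The capping step above replaces values beyond the IQR fences with the fence values. The same idea can be sketched with `Series.clip`; the helper name `iqr_bounds` and the `toy` data are hypothetical and only illustrate the arithmetic:

```python
import numpy as np
import pandas as pd

def iqr_bounds(values, k=1.5):
    """Return the (lower, upper) fences Q1 - k*IQR and Q3 + k*IQR."""
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Hypothetical column with one extreme value.
toy = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 100.0])

lower, upper = iqr_bounds(toy)
capped = toy.clip(lower=lower, upper=upper)   # values outside the fences are set to the fences

print("fences:", lower, upper)
print("max before / after:", toy.max(), capped.max())
```

Here Q1 = 3 and Q3 = 7, so the fences are -3 and 13, and the value 100 is capped at 13 while every other value is left unchanged. Note that capping the target `sales` as well as the predictors, as the notebook does, changes what the model is asked to predict for the largest firms.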

In [36]:
cols = ['sales' ,'capital', 'patents', 'randd', 'employment','tobinq','value','insti

for i in cols:
    sns.boxplot(df2[i])
    plt.show()


In [37]:
# Correlations after outlier capping, numeric columns only
df2_cor = df2.select_dtypes(include='number').corr()

plt.figure(figsize=(8,6))

sns.heatmap(df2_cor, annot=True, fmt = '.2f', cmap='coolwarm')

Out[37]: <AxesSubplot:>


1.3 Encode the data (having string values) for Modelling. Data Split: Split the data into test and
train (70:30). Apply Linear regression. Performance Metrics: Check the performance of
Predictions on Train and Test sets using Rsquare, RMSE. (8 marks)

In [38]:
### Converting categorical to dummy variables in data
data = pd.get_dummies(df2, columns=['sp500'])

data.head()

Out[38]: sales capital patents randd employment tobinq value institutions

0 826.995050 161.603986 10.00 351.191114 2.306000 6.153448 1625.453755 80.27

1 407.753973 122.101012 2.00 0.000000 1.860000 0.844187 243.117082 59.02

2 4371.988416 2610.499299 27.25 351.191114 23.733752 5.205257 4980.010044 47.70

3 451.000010 266.899987 1.00 83.540161 3.071000 0.305221 63.024630 26.88

4 174.927981 140.124004 2.00 14.233637 1.947000 1.063300 67.406408 49.46

In [39]:
data.columns

Out[39]: Index(['sales', 'capital', 'patents', 'randd', 'employment', 'tobinq', 'value',

'institutions', 'sp500_no', 'sp500_yes'],

dtype='object')

In [40]:
#### Train/ Test split - Unrequried column already drop, changing name for self comf
data_model = data

data_model.columns

Out[40]: Index(['sales', 'capital', 'patents', 'randd', 'employment', 'tobinq', 'value',

'institutions', 'sp500_no', 'sp500_yes'],

dtype='object')


In [41]:
data_model.head()

Out[41]: sales capital patents randd employment tobinq value institutions

0 826.995050 161.603986 10.00 351.191114 2.306000 6.153448 1625.453755 80.27

1 407.753973 122.101012 2.00 0.000000 1.860000 0.844187 243.117082 59.02

2 4371.988416 2610.499299 27.25 351.191114 23.733752 5.205257 4980.010044 47.70

3 451.000010 266.899987 1.00 83.540161 3.071000 0.305221 63.024630 26.88

4 174.927981 140.124004 2.00 14.233637 1.947000 1.063300 67.406408 49.46

In [42]:
data_model.info()

<class 'pandas.core.frame.DataFrame'>

RangeIndex: 759 entries, 0 to 758

Data columns (total 10 columns):

# Column Non-Null Count Dtype

--- ------ -------------- -----

0 sales 759 non-null float64

1 capital 759 non-null float64

2 patents 759 non-null float64

3 randd 759 non-null float64

4 employment 759 non-null float64

5 tobinq 759 non-null float64

6 value 759 non-null float64

7 institutions 759 non-null float64

8 sp500_no 759 non-null uint8

9 sp500_yes 759 non-null uint8

dtypes: float64(8), uint8(2)

memory usage: 49.0 KB

In [43]:
# Copy all the predictor variables into X dataframe

X = data_model.drop('sales', axis=1)

# Copy target into the y dataframe.

y = data_model[['sales']]

In [44]:
X.head()

Out[44]: capital patents randd employment tobinq value institutions sp500_no sp

0 161.603986 10.00 351.191114 2.306000 6.153448 1625.453755 80.27 1

1 122.101012 2.00 0.000000 1.860000 0.844187 243.117082 59.02 1

2 2610.499299 27.25 351.191114 23.733752 5.205257 4980.010044 47.70 0

3 266.899987 1.00 83.540161 3.071000 0.305221 63.024630 26.88 1

4 140.124004 2.00 14.233637 1.947000 1.063300 67.406408 49.46 1

In [45]:
X.shape

Out[45]: (759, 9)

In [46]:

y.head()

Out[46]: sales

0 826.995050

1 407.753973

2 4371.988416

3 451.000010

4 174.927981

In [47]:
y.shape

Out[47]: (759, 1)

In [48]:
# Split X and y into training and test set in 70:30 ratio

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3 , random_sta
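
The split line above is cut off in this export, so the `random_state` value is not visible. As a rough sketch of what a 70:30 split of 759 rows amounts to, the following uses a NumPy permutation. The seed and the choice to round the test share up are assumptions of this sketch; it is not a reproduction of the notebook's actual split:

```python
import math
import numpy as np

n_rows = 759            # rows in the notebook's data
test_fraction = 0.3

# Rounding the test share up is an assumption of this sketch.
n_test = math.ceil(test_fraction * n_rows)
n_train = n_rows - n_test

rng = np.random.default_rng(1)   # arbitrary seed, not the notebook's random_state
order = rng.permutation(n_rows)
test_idx, train_idx = order[:n_test], order[n_test:]

print(n_train, n_test)
```

The resulting 531 training rows agree with the "No. Observations: 531" line in the OLS summary further down.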

In [49]:
## Linear Regression Model

# invoke the LinearRegression function and find the bestfit model on training data

regression_model = LinearRegression()

regression_model.fit(X_train,y_train)

Out[49]: LinearRegression()

In [50]:
# Let us explore the coefficients for each of the independent attributes

for idx, col_name in enumerate(X_train.columns):
    print("The coefficient for {} is {}".format(col_name, regression_model.coef_[0][

The coefficient for capital is 0.4061544084041557

The coefficient for patents is -4.6473268486727575

The coefficient for randd is 0.6398846045072621

The coefficient for employment is 78.61372479076532

The coefficient for tobinq is -39.92578934013681

The coefficient for value is 0.24462524514528727

The coefficient for institutions is 0.21743855519970573

The coefficient for sp500_no is -83.06604336852845

The coefficient for sp500_yes is 83.06604336852763

In [51]:
# Let us check the intercept for the model

intercept = regression_model.intercept_[0]

print("The intercept for our model is {}".format(intercept))

The intercept for our model is 155.8971701239957

In [52]:
# R square on training data

regression_model.score(X_train, y_train)

Out[52]: 0.9358806629736066

In [53]:
# R square on testing data


regression_model.score(X_test, y_test)

Out[53]: 0.924129439335239

In [54]:
# RMSE on training data (the model is already fitted, so it is not refitted here)

predicted_train = regression_model.predict(X_train)

np.sqrt(metrics.mean_squared_error(y_train, predicted_train))

Out[54]: 394.6129494572075

In [71]:
# RMSE on testing data (predictions come from the model fitted on the training set)

predicted_test = regression_model.predict(X_test)

np.sqrt(metrics.mean_squared_error(y_test, predicted_test))

Out[71]: 399.74321332112794
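
RMSE is the square root of the mean of the squared prediction errors, so it is expressed in the same units as `sales`. A minimal sketch of the calculation follows; the helper name `rmse` and the toy inputs are hypothetical, while the two RMSE figures are the ones reported in the cells above:

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error: sqrt(mean((y_true - y_pred)^2))."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Hypothetical values, only to exercise the function.
toy_error = rmse([3.0, 5.0, 7.0], [2.0, 5.0, 9.0])
print(round(toy_error, 4))

# RMSE values reported in the cells above (training and test).
rmse_train, rmse_test = 394.6129494572075, 399.74321332112794
relative_gap = (rmse_test - rmse_train) / rmse_train
print("test RMSE exceeds training RMSE by {:.1%}".format(relative_gap))
```

A gap of roughly 1% between training and test RMSE, together with the similar R-squared values, suggests the model is not overfitting badly on this split.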

In [55]:
from statsmodels.stats.outliers_influence import variance_inflation_factor

vif = [variance_inflation_factor(X.values, ix) for ix in range(X.shape[1])]

In [56]:
# Print the VIF next to each predictor name
for column, value in zip(X.columns, vif):
    print(column, "--->", value)

capital ---> 5.884834435358601

patents ---> 2.5564811032960173

randd ---> 2.9241166081719343

employment ---> 5.289087439090918

tobinq ---> 1.4736588698814541

value ---> 6.0730692748610045

institutions ---> 1.2923225457814675

sp500_no ---> 5.627713456806028

sp500_yes ---> 7.007866608862636
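
The variance inflation factor of a predictor is 1 / (1 - R_j^2), where R_j^2 comes from regressing that predictor on all the others. statsmodels' `variance_inflation_factor` works on the design matrix exactly as passed, so any constant column has to be supplied by the caller. The sketch below is a hand-rolled NumPy version that always includes an intercept in the auxiliary regression; it is an illustration on synthetic columns (`x1`, `x2`, `x3` are hypothetical), not the statsmodels implementation, and its numbers are not comparable to the output above:

```python
import numpy as np

def vif(features):
    """VIF_j = 1 / (1 - R_j^2), regressing column j on the other columns plus an intercept."""
    features = np.asarray(features, dtype=float)
    n, p = features.shape
    out = []
    for j in range(p):
        target = features[:, j]
        others = np.column_stack([np.ones(n), np.delete(features, j, axis=1)])
        coef, *_ = np.linalg.lstsq(others, target, rcond=None)
        resid = target - others @ coef
        r2 = 1.0 - (resid @ resid) / np.sum((target - target.mean()) ** 2)
        out.append(1.0 / (1.0 - r2))
    return np.array(out)

rng = np.random.default_rng(7)
n = 500
x1 = rng.normal(size=n)
x2 = x1 + 0.3 * rng.normal(size=n)   # hypothetical column strongly related to x1
x3 = rng.normal(size=n)              # hypothetical independent column

values = vif(np.column_stack([x1, x2, x3]))
print(values.round(2))
```

In this toy setup the two related columns get large VIFs while the independent one stays near 1, which is the pattern to look for when reading the list above.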

In [57]:
### Using Statsmodel library

data_train = pd.concat([X_train, y_train], axis=1)

data_train.head()

Out[57]: capital patents randd employment tobinq value institutions sp500_no s

626 1315.696256 15.0 73.275818 16.472000 1.657513 2231.870118 31.47 1

333 15.258002 2.0 9.252643 0.566000 0.381755 9.877838 21.69 1

257 538.188036 20.0 87.388641 6.627000 2.126738 1019.443780 69.64 1

173 807.215091 0.0 68.900185 7.607001 3.151469 2221.768944 69.69 0

242 402.508010 2.0 0.000000 1.550000 2.154388 358.040202 85.42 1

In [58]:
data_train.columns

Out[58]: Index(['capital', 'patents', 'randd', 'employment', 'tobinq', 'value',

'institutions', 'sp500_no', 'sp500_yes', 'sales'],

dtype='object')


In [59]:
import statsmodels.formula.api as smf

lm1 = smf.ols(formula= 'sales ~ capital + patents + randd + employment+ tobinq + val


lm1.params

Out[59]: Intercept 103.931447

capital 0.406154

patents -4.647327

randd 0.639885

employment 78.613725

tobinq -39.925789

value 0.244625

institutions 0.217439

sp500_no -31.100320

sp500_yes 135.031767

dtype: float64

In [60]:
print(lm1.summary()) #Inferential statistics

OLS Regression Results

==============================================================================

Dep. Variable: sales R-squared: 0.936

Model: OLS Adj. R-squared: 0.935

Method: Least Squares F-statistic: 952.4

Date: Fri, 21 Jan 2022 Prob (F-statistic): 1.05e-305

Time: 10:53:57 Log-Likelihood: -3927.7

No. Observations: 531 AIC: 7873.

Df Residuals: 522 BIC: 7912.

Df Model: 8

Covariance Type: nonrobust

================================================================================

coef std err t P>|t| [0.025 0.975]

--------------------------------------------------------------------------------

Intercept 103.9314 42.150 2.466 0.014 21.128 186.735

capital 0.4062 0.042 9.651 0.000 0.323 0.489

patents -4.6473 2.789 -1.666 0.096 -10.127 0.833

randd 0.6399 0.232 2.753 0.006 0.183 1.096

employment 78.6137 4.765 16.498 0.000 69.252 87.975

tobinq -39.9258 12.145 -3.288 0.001 -63.784 -16.067

value 0.2446 0.026 9.592 0.000 0.195 0.295

institutions 0.2174 0.902 0.241 0.810 -1.555 1.990

sp500_no -31.1003 25.504 -1.219 0.223 -81.203 19.003

sp500_yes 135.0318 49.490 2.728 0.007 37.808 232.256

==============================================================================

Omnibus: 185.527 Durbin-Watson: 1.966

Prob(Omnibus): 0.000 Jarque-Bera (JB): 1284.253

Skew: 1.351 Prob(JB): 1.34e-279

Kurtosis: 10.123 Cond. No. 2.47e+19

==============================================================================

Notes:

[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

[2] The smallest eigenvalue is 5.53e-30. This might indicate that there are
strong multicollinearity problems or that the design matrix is singular.
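
Note [2] and the condition number of 2.47e+19 are consistent with the dummy variable trap: `sp500_no` and `sp500_yes` always sum to 1, which duplicates the intercept. This also fits the scikit-learn coefficients seen earlier, where the two dummies are equal and opposite (about -83.07 and +83.07). A minimal sketch on a made-up frame (the `toy` data and the `design_rank` helper are hypothetical) shows the rank deficiency and one common remedy, `drop_first=True`:

```python
import numpy as np
import pandas as pd

# Hypothetical frame with a two-level categorical column, mirroring sp500 (yes / no).
toy = pd.DataFrame({
    "x": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    "sp500": ["no", "yes", "no", "no", "yes", "no"],
})

def design_rank(frame):
    """Return (rank, number of columns) of the design matrix with an intercept prepended."""
    design = np.column_stack([np.ones(len(frame)), frame.astype(float).to_numpy()])
    return int(np.linalg.matrix_rank(design)), design.shape[1]

full = pd.get_dummies(toy, columns=["sp500"])                      # sp500_no and sp500_yes
reduced = pd.get_dummies(toy, columns=["sp500"], drop_first=True)  # sp500_yes only

print("both dummies (rank, columns):", design_rank(full))
print("drop_first   (rank, columns):", design_rank(reduced))
```

With both dummies the design matrix has four columns but only rank three; dropping one level restores full rank. Doing the same in the notebook would change the intercept and the remaining dummy coefficient, but not the fitted values.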

In [61]:
# Mean squared error on the test set: the average of the squared differences
# between the predicted and the actual sales

mse = metrics.mean_squared_error(y_test, regression_model.predict(X_test))

In [62]:
# The square root of the mean squared error (RMSE) is the typical size of a prediction error, in the units of sales

import math

math.sqrt(mse)


Out[62]: 399.743213321128

In [63]:
# Model score - R^2, the coefficient of determination

# R^2 = 1 - RSS / TSS

regression_model.score(X_test, y_test)

Out[63]: 0.924129439335239
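
The formula in the comment above can be checked by hand. The sketch below computes R-squared from RSS and TSS on toy numbers and then derives the adjusted R-squared from figures reported in this notebook (training R-squared, 531 training observations, Df Model = 8). The function names and the toy inputs are assumptions made for illustration:

```python
import numpy as np

def r_squared(y_true, y_pred):
    """R^2 = 1 - RSS / TSS."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    rss = np.sum((y_true - y_pred) ** 2)          # residual sum of squares
    tss = np.sum((y_true - y_true.mean()) ** 2)   # total sum of squares
    return float(1.0 - rss / tss)

def adjusted_r_squared(r2, n_obs, n_predictors):
    """Adjusted R^2 penalizes R^2 for the number of predictors."""
    return 1.0 - (1.0 - r2) * (n_obs - 1) / (n_obs - n_predictors - 1)

# Hypothetical values, only to exercise the formula.
toy_r2 = r_squared([2.0, 4.0, 6.0, 8.0], [2.5, 3.5, 6.5, 7.5])
print(toy_r2)

# Figures reported in this notebook: training R^2, 531 rows, Df Model = 8.
adj = adjusted_r_squared(0.9358806629736066, 531, 8)
print(round(adj, 3))
```

The adjusted value rounds to 0.935, matching the "Adj. R-squared" line of the OLS summary.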

In [64]:
# Predict sales for the held-out test set

y_pred = regression_model.predict(X_test)

In [65]:
plt.scatter(y_test['sales'], y_pred)

Out[65]: <matplotlib.collections.PathCollection at 0x2182b2904f0>

In [66]:
### ITERATION 2

import statsmodels.formula.api as smf

lm2 = smf.ols(formula= 'sales ~ capital + patents + randd + employment+ tobinq + val


lm2.params

Out[66]: Intercept 103.931447

capital 0.406154

patents -4.647327

randd 0.639885

employment 78.613725

tobinq -39.925789

value 0.244625

institutions 0.217439

sp500_no -31.100320

sp500_yes 135.031767

dtype: float64

In [67]:
print(lm2.summary()) #Inferential statistics

OLS Regression Results

==============================================================================

Dep. Variable: sales R-squared: 0.936

Model: OLS Adj. R-squared: 0.935

Method: Least Squares F-statistic: 952.4

Date: Fri, 21 Jan 2022 Prob (F-statistic): 1.05e-305

Time: 10:53:58 Log-Likelihood: -3927.7


No. Observations: 531 AIC: 7873.

Df Residuals: 522 BIC: 7912.

Df Model: 8

Covariance Type: nonrobust

================================================================================

coef std err t P>|t| [0.025 0.975]

--------------------------------------------------------------------------------

Intercept 103.9314 42.150 2.466 0.014 21.128 186.735

capital 0.4062 0.042 9.651 0.000 0.323 0.489

patents -4.6473 2.789 -1.666 0.096 -10.127 0.833

randd 0.6399 0.232 2.753 0.006 0.183 1.096

employment 78.6137 4.765 16.498 0.000 69.252 87.975

tobinq -39.9258 12.145 -3.288 0.001 -63.784 -16.067

value 0.2446 0.026 9.592 0.000 0.195 0.295

institutions 0.2174 0.902 0.241 0.810 -1.555 1.990

sp500_no -31.1003 25.504 -1.219 0.223 -81.203 19.003

sp500_yes 135.0318 49.490 2.728 0.007 37.808 232.256

==============================================================================

Omnibus: 185.527 Durbin-Watson: 1.966

Prob(Omnibus): 0.000 Jarque-Bera (JB): 1284.253

Skew: 1.351 Prob(JB): 1.34e-279

Kurtosis: 10.123 Cond. No. 2.47e+19

==============================================================================

Notes:

[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

[2] The smallest eigenvalue is 5.53e-30. This might indicate that there are
strong multicollinearity problems or that the design matrix is singular.

In [68]:
# concatenate X and y into a single dataframe

data_train = pd.concat([X_train, y_train], axis=1)

data_test=pd.concat([X_test,y_test],axis=1)

data_train.head()

Out[68]: capital patents randd employment tobinq value institutions sp500_no s

626 1315.696256 15.0 73.275818 16.472000 1.657513 2231.870118 31.47 1

333 15.258002 2.0 9.252643 0.566000 0.381755 9.877838 21.69 1

257 538.188036 20.0 87.388641 6.627000 2.126738 1019.443780 69.64 1

173 807.215091 0.0 68.900185 7.607001 3.151469 2221.768944 69.69 0

242 402.508010 2.0 0.000000 1.550000 2.154388 358.040202 85.42 1

In [69]:
data_test.head()

Out[69]: capital patents randd employment tobinq value institutions sp500_no s

480 50.688001 1.0 47.173386 1.147000 1.006168 34.516077 34.92 1

622 80.960002 3.0 50.251263 3.400000 1.259892 164.840772 18.88 1

638 1119.000008 19.0 78.623947 18.988003 1.900413 2114.826950 47.94 0

389 68.742010 3.0 44.827785 1.204000 2.262480 82.287341 24.65 1

748 308.770949 2.0 79.026939 3.264000 1.741800 533.056000 16.05 1

In [70]:
# Print the fitted equation term by term
for i, j in np.array(lm2.params.reset_index()):
    print('({}) * {} +'.format(round(j, 2), i), end=' ')

(103.93) * Intercept + (0.41) * capital + (-4.65) * patents + (0.64) * randd + (78.61) * employment + (-39.93) * tobinq + (0.24) * value + (0.22) * institutions + (-31.1) * sp500_no + (135.03) * sp500_yes +

1.4 Inference: Based on these predictions, what are the business insights and recommendations.
(6 marks)

In [ ]:

In [ ]:

