Predictive Modelling Alternative Firm Level PDF

1/21/22, 10:10 PM Predictive Modelling - Secondary
1.1 Read the data and do exploratory data analysis. Describe the data briefly. (Check the null
values, data types, shape, EDA). Perform Univariate and Bivariate Analysis. (8 marks)
In [1]:
### Loading nesscessary library for the model
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn import metrics
import matplotlib.pyplot as plt
import matplotlib.style
import scipy.stats as stats
from warnings import filterwarnings
filterwarnings('ignore')
In [2]:
df = pd.read_csv('Firm_level_data (1) (1) (1).csv')
In [3]:
df.head()
Out[3]: Unnamed:
sales capital patents randd employment sp500 tobinq
0
0 0 826.995050 161.603986 10 382.078247 2.306000 no 11.049511 162
1 1 407.753973 122.101012 2 0.000000 1.860000 no 0.844187 24
2 2 8407.845588 6221.144614 138 3296.700439 49.659005 yes 5.205257 2586
3 3 451.000010 266.899987 1 83.540161 3.071000 no 0.305221 6
4 4 174.927981 140.124004 2 14.233637 1.947000 no 1.063300 6
In [4]:
df.tail()
Out[4]: Unnamed:
sales capital patents randd employment sp500 tobinq
0
754 754 1253.900196 708.299935 32 412.936157 22.100002 yes 0.697454 267.11
755 755 171.821025 73.666008 1 0.037735 1.684000 no NaN 228.47
756 756 202.726967 123.926991 13 74.861099 1.460000 no 5.229723 580.43
757 757 785.687944 138.780992 6 0.621750 2.900000 yes 1.625398 309.93
758 758 22.701999 14.244999 5 18.574360 0.197000 no 2.213070 18.94
In [5]:
df.shape
Out[5]: (759, 10)
In [6]:
df.info()
localhost:8888/nbconvert/html/Predictive Modelling - Secondary.ipynb?download=false 1/26

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 759 entries, 0 to 758
Data columns (total 10 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Unnamed: 0 759 non-null int64
1 sales 759 non-null float64
2 capital 759 non-null float64
3 patents 759 non-null int64
4 randd 759 non-null float64
5 employment 759 non-null float64
6 sp500 759 non-null object
7 tobinq 738 non-null float64
8 value 759 non-null float64
9 institutions 759 non-null float64
dtypes: float64(7), int64(2), object(1)
memory usage: 59.4+ KB
In [7]:
### Data Description
df.describe(include = 'all').T
Out[7]: count unique top freq mean std min 25% 50%
Unnamed: 0 759.0 NaN NaN NaN 379.0 219.248717 0.0 189.5 379.0
sales 759.0 NaN NaN NaN 2689.705158 8722.060124 0.138 122.92 448.577082
capital 759.0 NaN NaN NaN 1977.747498 6466.704896 0.057 52.650501 202.179023
patents 759.0 NaN NaN NaN 25.831357 97.259577 0.0 1.0 3.0
randd 759.0 NaN NaN NaN 439.938074 2007.397588 0.0 4.628262 36.864136
employment 759.0 NaN NaN NaN 14.164519 43.321443 0.006 0.9275 2.924
sp500 759 2 no 542 NaN NaN NaN NaN NaN
tobinq 738.0 NaN NaN NaN 2.79491 3.366591 0.119001 1.018783 1.680303
value 759.0 NaN NaN NaN 2732.73475 7071.072362 1.971053 103.593946 410.793529
institutions 759.0 NaN NaN NaN 43.02054 21.685586 0.0 25.395 44.11
In [8]:
### Null value check
df.isnull().sum()
Out[8]: Unnamed: 0 0
sales 0
capital 0
patents 0
randd 0
employment 0
sp500 0
tobinq 21
value 0
institutions 0
dtype: int64
In [9]:
# Percentage of missing values
100 * df.isnull().sum() / len(df)
Out[9]: Unnamed: 0 0.000000
sales 0.000000

capital 0.000000
patents 0.000000
randd 0.000000
employment 0.000000
sp500 0.000000
tobinq 2.766798
value 0.000000
institutions 0.000000
dtype: float64
In [10]:
df2=df.drop(['Unnamed: 0'], axis = 1)
In [11]:
df2.head()
Out[11]: sales capital patents randd employment sp500 tobinq value in
0 826.995050 161.603986 10 382.078247 2.306000 no 11.049511 1625.453755
1 407.753973 122.101012 2 0.000000 1.860000 no 0.844187 243.117082
2 8407.845588 6221.144614 138 3296.700439 49.659005 yes 5.205257 25865.233800
3 451.000010 266.899987 1 83.540161 3.071000 no 0.305221 63.024630
4 174.927981 140.124004 2 14.233637 1.947000 no 1.063300 67.406408
In [12]:
### check for duplicates in data
dups = df2.duplicated()
print('Number of duplicate rows = %d' % (dups.sum()))
Number of duplicate rows = 0
In [13]:
### unique values for categorical variables
for column in df2.columns:
if df2[column].dtype == 'object':
print(column.upper(),': ',df2[column].nunique())
print(df2[column].value_counts().sort_values())
print('\n')
SP500 : 2
yes 217
no 542
Name: sp500, dtype: int64
In [14]:
# Treating Null value
mean_value=df2['tobinq'].mean()
df2['tobinq'].fillna(value=mean_value, inplace=True)
print('Updated Dataframe:')
print(df2)
Updated Dataframe:
sales capital patents randd employment sp500 \
0 826.995050 161.603986 10 382.078247 2.306000 no
1 407.753973 122.101012 2 0.000000 1.860000 no
2 8407.845588 6221.144614 138 3296.700439 49.659005 yes
3 451.000010 266.899987 1 83.540161 3.071000 no
4 174.927981 140.124004 2 14.233637 1.947000 no
.. ... ... ... ... ... ...
754 1253.900196 708.299935 32 412.936157 22.100002 yes

755 171.821025 73.666008 1 0.037735 1.684000 no
756 202.726967 123.926991 13 74.861099 1.460000 no
757 785.687944 138.780992 6 0.621750 2.900000 yes
758 22.701999 14.244999 5 18.574360 0.197000 no
tobinq value institutions
0 11.049511 1625.453755 80.27
1 0.844187 243.117082 59.02
2 5.205257 25865.233800 47.70
3 0.305221 63.024630 26.88
4 1.063300 67.406408 49.46
.. ... ... ...
754 0.697454 267.119487 33.50
755 2.794910 228.475701 46.41
756 5.229723 580.430741 42.25
757 1.625398 309.938651 61.39
758 2.213070 18.940140 7.50
[759 rows x 9 columns]
In [15]:
df2.isnull().sum()
Out[15]: sales 0
capital 0
patents 0
randd 0
employment 0
sp500 0
tobinq 0
value 0
institutions 0
dtype: int64
In [16]:
df2.describe()
Out[16]: sales capital patents randd employment tobinq val
count 759.000000 759.000000 759.000000 759.000000 759.000000 759.000000 759.0000
mean 2689.705158 1977.747498 25.831357 439.938074 14.164519 2.794910 2732.7347
std 8722.060124 6466.704896 97.259577 2007.397588 43.321443 3.319629 7071.0723
min 0.138000 0.057000 0.000000 0.000000 0.006000 0.119001 1.9710
25% 122.920000 52.650501 1.000000 4.628262 0.927500 1.036000 103.5939
50% 448.577082 202.179023 3.000000 36.864136 2.924000 1.741800 410.7935
75% 1822.547366 1075.790020 11.500000 143.253403 10.050001 3.082979 2054.1603
max 135696.788200 93625.200560 1220.000000 30425.255860 710.799925 20.000000 95191.5911
In [17]:
##Univariate
def plot_distribution(df2, cols=5, width=20, height=15, hspace=0.2, wspace=0.5):
plt.style.use('seaborn-whitegrid')
fig, axes = plt.subplots(nrows=4,ncols=2)
fig.set_size_inches(12, 20)
a = sns.distplot(df2['sales'] , ax=axes[0][0])
a.set_title("sales Distribution",fontsize=10)
a = sns.boxplot(df2['sales'] , orient = "v" , ax=axes[0][1])
a = sns.distplot(df2['patents'] , ax=axes[1][0])

a.set_title("patents Distribution",fontsize=10)
a = sns.boxplot(df2['patents'] , orient = "v" , ax=axes[1][1])
a = sns.distplot(df2['capital'] , ax=axes[2][0])
a.set_title("capital Distribution",fontsize=10)
a = sns.boxplot(df2['capital'] , orient = "v" , ax=axes[2][1])
a.set_title("capital Distribution",fontsize=10)
a = sns.distplot(df2['randd'] , ax=axes[3][0])
a.set_title("randd Distribution",fontsize=10)
a = sns.boxplot(df2['randd'] , orient = "v" , ax=axes[3][1])
a.set_title("randd Distribution",fontsize=10)
a = sns.distplot(df2['employment'] , ax=axes[0][0])
a.set_title("employment Distribution",fontsize=10)
a = sns.boxplot(df2['employment'] , orient = "v" , ax=axes[0][1])
a.set_title("employment Distribution",fontsize=10)
a = sns.distplot(df2['tobinq'] , ax=axes[1][0])
a.set_title("tobinq Distribution",fontsize=10)
a = sns.boxplot(df2['tobinq'] , orient = "v" , ax=axes[1][1])
a.set_title("tobinq Distribution",fontsize=10)
a = sns.distplot(df2['value'] , ax=axes[2][0])
a.set_title("value Distribution",fontsize=10)
a = sns.boxplot(df2['value'] , orient = "v" , ax=axes[2][1])
a.set_title("value Distribution",fontsize=10)
Out[17]: Text(0.5, 1.0, 'value Distribution')

In [18]:
def plot_distribution(df, cols=5, width=20, height=15, hspace=0.2, wspace=0.5):
plt.style.use('seaborn-whitegrid')
fig, axes = plt.subplots(nrows=4,ncols=2)
fig.set_size_inches(12, 20)
a = sns.distplot(df2['sales'] , ax=axes[0][0])

a = sns.boxplot(df2['sales'] , orient = "v" , ax=axes[0][1])
a = sns.distplot(df2['patents'] , ax=axes[1][0])
a = sns.boxplot(df2['patents'] , orient = "v" , ax=axes[1][1])
a = sns.distplot(df2['institutions'] , ax=axes[2][0])
a.set_title("institutions",fontsize=10)
a = sns.boxplot(df2['institutions'] , orient = "v" , ax=axes[2][1])
a.set_title("institutions",fontsize=10)
Out[18]: Text(0.5, 1.0, 'institutions')

In [19]:
df2.columns
Out[19]: Index(['sales', 'capital', 'patents', 'randd', 'employment', 'sp500', 'tobinq',
'value', 'institutions'],
dtype='object')

In [20]: df2.skew()
Out[20]: sales 9.219023
capital 7.555091
patents 7.766943
randd 10.270483
employment 9.068875
tobinq 3.332006
value 6.075996
institutions -0.168071
dtype: float64
In [21]:
### Data Distribution
sns.pairplot(df2, diag_kind='kde')
plt.show()
In [22]:
## Data Distribution
# Pairplot using sns
sns.pairplot(df2 ,diag_kind='kde' ,hue='sales');

In [23]:
### checking for Correlations
df_cor = df2.corr()
plt.figure(figsize=(8,6))
sns.heatmap(df_cor, annot=True, fmt = '.2f', cmap='coolwarm')
Out[23]: <AxesSubplot:>

1.2 Impute null values if present? Do you think scaling is necessary in this case? (8 marks)
In [24]:
df2.isnull().sum()
Out[24]: sales 0
capital 0
patents 0
randd 0
employment 0
sp500 0
tobinq 0
value 0
institutions 0
dtype: int64
In [25]:
df2[df2.isin([0])].stack(0)
Out[25]: 1 randd 0.0
6 patents 0.0
7 randd 0.0
18 patents 0.0
22 patents 0.0
...
744 patents 0.0
746 randd 0.0
749 patents 0.0
randd 0.0
751 patents 0.0
Length: 280, dtype: object
In [26]:
for column in df2.columns:
if df2[column].dtype != 'object':
median = df2[column].median()
df2[column] = df2[column].fillna(median)
df.isnull().sum()

Out[26]: Unnamed: 0 0
sales 0
capital 0
patents 0
randd 0
employment 0
sp500 0
tobinq 21
value 0
institutions 0
dtype: int64
In [27]:
df2.dtypes
Out[27]: sales float64
capital float64
patents int64
randd float64
employment float64
sp500 object
tobinq float64
value float64
institutions float64
dtype: object
In [28]:
#from sklearn.preprocessing import StandardScaler
#sc = StandardScaler()
# get numeric data
#num_d = df2.select_dtypes(exclude=['object'])
# update the cols with their normalized values
#df2[num_d.columns] = sc.fit_transform(num_d)
In [29]:
df2.head()
Out[29]: sales capital patents randd employment sp500 tobinq value in
0 826.995050 161.603986 10 382.078247 2.306000 no 11.049511 1625.453755
1 407.753973 122.101012 2 0.000000 1.860000 no 0.844187 243.117082
2 8407.845588 6221.144614 138 3296.700439 49.659005 yes 5.205257 25865.233800
3 451.000010 266.899987 1 83.540161 3.071000 no 0.305221 63.024630
4 174.927981 140.124004 2 14.233637 1.947000 no 1.063300 67.406408
In [30]:
#outlier check
In [31]:
df2.isnull().sum()
Out[31]: sales 0
capital 0
patents 0
randd 0
employment 0
sp500 0

tobinq 0
value 0
institutions 0
dtype: int64
In [32]:
cols = ['sales' ,'capital', 'patents', 'randd', 'employment','tobinq','value','insti

for i in cols:
sns.boxplot(df2[i])
plt.show()


In [33]:
cont=df2.dtypes[(df2.dtypes!='uint8') & (df2.dtypes!='object')].index
In [34]:
def remove_outlier(col):
sorted(col)
Q1,Q3=np.percentile(col,[25,75])
IQR=Q3-Q1
lower_range= Q1-(1.5 * IQR)
upper_range= Q3+(1.5 * IQR)
return lower_range, upper_range
In [35]:
for column in df2[cont].columns:
lr,ur=remove_outlier(df2[column])
df2[column]=np.where(df2[column]>ur,ur,df2[column])
df2[column]=np.where(df2[column]<lr,lr,df2[column])
In [36]:
cols = ['sales' ,'capital', 'patents', 'randd', 'employment','tobinq','value','insti

for i in cols:
sns.boxplot(df2[i])
plt.show()



In [37]:
df2_cor = df2.corr()
plt.figure(figsize=(8,6))
sns.heatmap(df2_cor, annot=True, fmt = '.2f', cmap='coolwarm')
Out[37]: <AxesSubplot:>

1.3 Encode the data (having string values) for Modelling. Data Split: Split the data into test
and
train (70:30). Apply Linear regression. Performance Metrics: Check the performance
of
Predictions on Train and Test sets using Rsquare, RMSE. (8 marks)
In [38]:
### Converting categorical to dummy variables in data
data = pd.get_dummies(df2, columns=['sp500'])
data.head()
Out[38]: sales capital patents randd employment tobinq value institutions
0 826.995050 161.603986 10.00 351.191114 2.306000 6.153448 1625.453755 80.27
1 407.753973 122.101012 2.00 0.000000 1.860000 0.844187 243.117082 59.02
2 4371.988416 2610.499299 27.25 351.191114 23.733752 5.205257 4980.010044 47.70
3 451.000010 266.899987 1.00 83.540161 3.071000 0.305221 63.024630 26.88
4 174.927981 140.124004 2.00 14.233637 1.947000 1.063300 67.406408 49.46
In [39]:
data.columns
Out[39]: Index(['sales', 'capital', 'patents', 'randd', 'employment', 'tobinq', 'value',
'institutions', 'sp500_no', 'sp500_yes'],
dtype='object')
In [40]:
#### Train/ Test split - Unrequried column already drop, changing name for self comf
data_model = data
data_model.columns
Out[40]: Index(['sales', 'capital', 'patents', 'randd', 'employment', 'tobinq', 'value',
'institutions', 'sp500_no', 'sp500_yes'],
dtype='object')

In [41]: data_model.head()
Out[41]: sales capital patents randd employment tobinq value institutions
0 826.995050 161.603986 10.00 351.191114 2.306000 6.153448 1625.453755 80.27
1 407.753973 122.101012 2.00 0.000000 1.860000 0.844187 243.117082 59.02
2 4371.988416 2610.499299 27.25 351.191114 23.733752 5.205257 4980.010044 47.70
3 451.000010 266.899987 1.00 83.540161 3.071000 0.305221 63.024630 26.88
4 174.927981 140.124004 2.00 14.233637 1.947000 1.063300 67.406408 49.46
In [42]:
data_model.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 759 entries, 0 to 758
Data columns (total 10 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 sales 759 non-null float64
1 capital 759 non-null float64
2 patents 759 non-null float64
3 randd 759 non-null float64
4 employment 759 non-null float64
5 tobinq 759 non-null float64
6 value 759 non-null float64
7 institutions 759 non-null float64
8 sp500_no 759 non-null uint8
9 sp500_yes 759 non-null uint8
dtypes: float64(8), uint8(2)
memory usage: 49.0 KB
In [43]:
# Copy all the predictor variables into X dataframe
X = data_model.drop('sales', axis=1)
# Copy target into the y dataframe.
y = data_model[['sales']]
In [44]:
X.head()
Out[44]: capital patents randd employment tobinq value institutions sp500_no sp
0 161.603986 10.00 351.191114 2.306000 6.153448 1625.453755 80.27 1
1 122.101012 2.00 0.000000 1.860000 0.844187 243.117082 59.02 1
2 2610.499299 27.25 351.191114 23.733752 5.205257 4980.010044 47.70 0
3 266.899987 1.00 83.540161 3.071000 0.305221 63.024630 26.88 1
4 140.124004 2.00 14.233637 1.947000 1.063300 67.406408 49.46 1
In [45]:
X.shape
Out[45]: (759, 9)
In [46]:
y.head()
Out[46]: sales
0 826.995050
1 407.753973
2 4371.988416
3 451.000010
4 174.927981
In [47]:
y.shape
Out[47]: (759, 1)
In [48]:
# Split X and y into training and test set in 70:30 ratio
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3 , random_sta
In [49]:
## Linear Regression Model
# invoke the LinearRegression function and find the bestfit model on training data
regression_model = LinearRegression()
regression_model.fit(X_train,y_train)
Out[49]: LinearRegression()
In [50]:
# Let us explore the coefficients for each of the independent attributes
for idx, col_name in enumerate(X_train.columns):
print("The coefficient for {} is {}".format(col_name, regression_model.coef_[0][
The coefficient for capital is 0.4061544084041557
The coefficient for patents is -4.6473268486727575
The coefficient for randd is 0.6398846045072621
The coefficient for employment is 78.61372479076532
The coefficient for tobinq is -39.92578934013681
The coefficient for value is 0.24462524514528727
The coefficient for institutions is 0.21743855519970573
The coefficient for sp500_no is -83.06604336852845
The coefficient for sp500_yes is 83.06604336852763
In [51]:
# Let us check the intercept for the model
intercept = regression_model.intercept_[0]
print("The intercept for our model is {}".format(intercept))
The intercept for our model is 155.8971701239957
In [52]:
# R square on training data
regression_model.score(X_train, y_train)
Out[52]: 0.9358806629736066
In [53]:
# R square on testing data

regression_model.score(X_test, y_test)
Out[53]: 0.924129439335239
In [54]:
#RMSE on Training data
predicted_train=regression_model.fit(X_train, y_train).predict(X_train)
np.sqrt(metrics.mean_squared_error(y_train,predicted_train))
Out[54]: 394.6129494572075
In [71]:
#RMSE on Testing data
predicted_test=regression_model.fit(X_train, y_train).predict(X_test)
np.sqrt(metrics.mean_squared_error(y_test,predicted_test))
Out[71]: 399.74321332112794
In [55]:
from statsmodels.stats.outliers_influence import variance_inflation_factor
vif = [variance_inflation_factor(X.values, ix) for ix in range(X.shape[1])]
In [56]:
i=0
for column in X.columns:
if i < 11:
print (column ,"--->", vif[i])
i = i+1
capital ---> 5.884834435358601
patents ---> 2.5564811032960173
randd ---> 2.9241166081719343
employment ---> 5.289087439090918
tobinq ---> 1.4736588698814541
value ---> 6.0730692748610045
institutions ---> 1.2923225457814675
sp500_no ---> 5.627713456806028
sp500_yes ---> 7.007866608862636
In [57]:
### Using Statsmodel library
data_train = pd.concat([X_train, y_train], axis=1)
data_train.head()
Out[57]: capital patents randd employment tobinq value institutions sp500_no s
626 1315.696256 15.0 73.275818 16.472000 1.657513 2231.870118 31.47 1
333 15.258002 2.0 9.252643 0.566000 0.381755 9.877838 21.69 1
257 538.188036 20.0 87.388641 6.627000 2.126738 1019.443780 69.64 1
173 807.215091 0.0 68.900185 7.607001 3.151469 2221.768944 69.69 0
242 402.508010 2.0 0.000000 1.550000 2.154388 358.040202 85.42 1
In [58]:
data_train.columns
Out[58]: Index(['capital', 'patents', 'randd', 'employment', 'tobinq', 'value',
'institutions', 'sp500_no', 'sp500_yes', 'sales'],
dtype='object')

In [59]: import statsmodels.formula.api as smf
lm1 = smf.ols(formula= 'sales ~ capital + patents + randd + employment+ tobinq + val

lm1.params
Out[59]: Intercept 103.931447
capital 0.406154
patents -4.647327
randd 0.639885
employment 78.613725
tobinq -39.925789
value 0.244625
sp500_no -31.100320
sp500_yes 135.031767
dtype: float64
In [60]:
print(lm1.summary()) #Inferential statistics
OLS Regression Results
==============================================================================
Dep. Variable: sales R-squared: 0.936
Model: OLS Adj. R-squared: 0.935
Method: Least Squares F-statistic: 952.4
Date: Fri, 21 Jan 2022 Prob (F-statistic): 1.05e-305
Time: 10:53:57 Log-Likelihood: -3927.7
No. Observations: 531 AIC: 7873.
Df Residuals: 522 BIC: 7912.
Df Model: 8
Covariance Type: nonrobust
================================================================================
coef std err t P>|t| [0.025 0.975]
--------------------------------------------------------------------------------
Intercept 103.9314 42.150 2.466 0.014 21.128 186.735
capital 0.4062 0.042 9.651 0.000 0.323 0.489
patents -4.6473 2.789 -1.666 0.096 -10.127 0.833
randd 0.6399 0.232 2.753 0.006 0.183 1.096
employment 78.6137 4.765 16.498 0.000 69.252 87.975
tobinq -39.9258 12.145 -3.288 0.001 -63.784 -16.067
value 0.2446 0.026 9.592 0.000 0.195 0.295
institutions 0.2174 0.902 0.241 0.810 -1.555 1.990
sp500_no -31.1003 25.504 -1.219 0.223 -81.203 19.003
sp500_yes 135.0318 49.490 2.728 0.007 37.808 232.256
==============================================================================
Omnibus: 185.527 Durbin-Watson: 1.966
Prob(Omnibus): 0.000 Jarque-Bera (JB): 1284.253
Skew: 1.351 Prob(JB): 1.34e-279
Kurtosis: 10.123 Cond. No. 2.47e+19
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly spe
cified.
[2] The smallest eigenvalue is 5.53e-30. This might indicate that there are
strong multicollinearity problems or that the design matrix is singular.
In [61]:
# Let us check the sum of squared errors by predicting value of y for test cases and
# subtracting from the actual y for the test cases
mse = np.mean((regression_model.predict(X_test)-y_test)**2)
In [62]:
# underroot of mean_sq_error is standard deviation i.e. avg variance between predict

import math
math.sqrt(mse)

Out[62]: 399.743213321128
In [63]:
# Model score - R2 or coeff of determinant
# R^2=1–RSS / TSS
regression_model.score(X_test, y_test)
Out[63]: 0.924129439335239
In [64]:
# predict mileage (mpg) for a set of attributes not in the training or test set
y_pred = regression_model.predict(X_test)
In [65]:
plt.scatter(y_test['sales'], y_pred)
Out[65]: <matplotlib.collections.PathCollection at 0x2182b2904f0>
In [66]:
### ITERATION 2
import statsmodels.formula.api as smf
lm2 = smf.ols(formula= 'sales ~ capital + patents + randd + employment+ tobinq + val

lm2.params
Out[66]: Intercept 103.931447
capital 0.406154
patents -4.647327
randd 0.639885
employment 78.613725
tobinq -39.925789
value 0.244625
sp500_no -31.100320
sp500_yes 135.031767
dtype: float64
In [67]:
print(lm2.summary()) #Inferential statistics
OLS Regression Results
==============================================================================
Dep. Variable: sales R-squared: 0.936
Model: OLS Adj. R-squared: 0.935
Method: Least Squares F-statistic: 952.4
Date: Fri, 21 Jan 2022 Prob (F-statistic): 1.05e-305
Time: 10:53:58 Log-Likelihood: -3927.7

No. Observations: 531 AIC: 7873.
Df Residuals: 522 BIC: 7912.
Df Model: 8
Covariance Type: nonrobust
================================================================================
coef std err t P>|t| [0.025 0.975]
--------------------------------------------------------------------------------
Intercept 103.9314 42.150 2.466 0.014 21.128 186.735
capital 0.4062 0.042 9.651 0.000 0.323 0.489
patents -4.6473 2.789 -1.666 0.096 -10.127 0.833
randd 0.6399 0.232 2.753 0.006 0.183 1.096
employment 78.6137 4.765 16.498 0.000 69.252 87.975
tobinq -39.9258 12.145 -3.288 0.001 -63.784 -16.067
value 0.2446 0.026 9.592 0.000 0.195 0.295
institutions 0.2174 0.902 0.241 0.810 -1.555 1.990
sp500_no -31.1003 25.504 -1.219 0.223 -81.203 19.003
sp500_yes 135.0318 49.490 2.728 0.007 37.808 232.256
==============================================================================
Omnibus: 185.527 Durbin-Watson: 1.966
Prob(Omnibus): 0.000 Jarque-Bera (JB): 1284.253
Skew: 1.351 Prob(JB): 1.34e-279
Kurtosis: 10.123 Cond. No. 2.47e+19
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly spe
cified.
[2] The smallest eigenvalue is 5.53e-30. This might indicate that there are
strong multicollinearity problems or that the design matrix is singular.
In [68]:
# concatenate X and y into a single dataframe
data_train = pd.concat([X_train, y_train], axis=1)
data_test=pd.concat([X_test,y_test],axis=1)
data_train.head()
626 1315.696256 15.0 73.275818 16.472000 1.657513 2231.870118 31.47 1
333 15.258002 2.0 9.252643 0.566000 0.381755 9.877838 21.69 1
257 538.188036 20.0 87.388641 6.627000 2.126738 1019.443780 69.64 1
173 807.215091 0.0 68.900185 7.607001 3.151469 2221.768944 69.69 0
242 402.508010 2.0 0.000000 1.550000 2.154388 358.040202 85.42 1
In [69]:
data_test.head()
480 50.688001 1.0 47.173386 1.147000 1.006168 34.516077 34.92 1
622 80.960002 3.0 50.251263 3.400000 1.259892 164.840772 18.88 1
638 1119.000008 19.0 78.623947 18.988003 1.900413 2114.826950 47.94 0
389 68.742010 3.0 44.827785 1.204000 2.262480 82.287341 24.65 1
748 308.770949 2.0 79.026939 3.264000 1.741800 533.056000 16.05 1
In [70]:
for i,j in np.array(lm2.params.reset_index()):

print('({}) * {} +'.format(round(j,2),i),end=' ')
(103.93) * Intercept + (0.41) * capital + (-4.65) * patents + (0.64) * randd + (78.6

1) * employment + (-39.93) * tobinq + (0.24) * value + (0.22) * institutions + (-31.
1) * sp500_no + (135.03) * sp500_yes +
1.4 Inference: Based on these predictions, what are the business insights and
recommendations.
(6 marks)
In [ ]:

In [ ]:


Predictive Modelling Alternative Firm Level PDF

Uploaded by

Document Information

Original Title

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

Predictive Modelling Alternative Firm Level PDF

Uploaded by

Copyright:

Available Formats

1/21/22, 10:10 PM Predictive Modelling - Secondary

import matplotlib.pyplot as plt

import seaborn as sns

from sklearn.model_selection import train_test_split

from sklearn.linear_model import LinearRegression

from sklearn import metrics

import matplotlib.pyplot as plt

import scipy.stats as stats

from warnings import filterwarnings

0 0 826.995050 161.603986 10 382.078247 2.306000 no 11.049511 162

1 1 407.753973 122.101012 2 0.000000 1.860000 no 0.844187 24

2 2 8407.845588 6221.144614 138 3296.700439 49.659005 yes 5.205257 2586

3 3 451.000010 266.899987 1 83.540161 3.071000 no 0.305221 6

4 4 174.927981 140.124004 2 14.233637 1.947000 no 1.063300 6

754 754 1253.900196 708.299935 32 412.936157 22.100002 yes 0.697454 267.11

755 755 171.821025 73.666008 1 0.037735 1.684000 no NaN 228.47

756 756 202.726967 123.926991 13 74.861099 1.460000 no 5.229723 580.43

757 757 785.687944 138.780992 6 0.621750 2.900000 yes 1.625398 309.93

758 758 22.701999 14.244999 5 18.574360 0.197000 no 2.213070 18.94

Out[5]: (759, 10)

localhost:8888/nbconvert/html/Predictive Modelling - Secondary.ipynb?download=false 1/26

RangeIndex: 759 entries, 0 to 758

Data columns (total 10 columns):

# Column Non-Null Count Dtype

--- ------ -------------- -----

0 Unnamed: 0 759 non-null int64

1 sales 759 non-null float64

2 capital 759 non-null float64

3 patents 759 non-null int64

4 randd 759 non-null float64

5 employment 759 non-null float64

6 sp500 759 non-null object

7 tobinq 738 non-null float64

8 value 759 non-null float64

9 institutions 759 non-null float64

dtypes: float64(7), int64(2), object(1)

memory usage: 59.4+ KB

sp500 759 2 no 542 NaN NaN NaN NaN NaN

100 * df.isnull().sum() / len(df)

Out[9]: Unnamed: 0 0.000000

localhost:8888/nbconvert/html/Predictive Modelling - Secondary.ipynb?download=false 2/26

Out[11]: sales capital patents randd employment sp500 tobinq value in

0 826.995050 161.603986 10 382.078247 2.306000 no 11.049511 1625.453755

1 407.753973 122.101012 2 0.000000 1.860000 no 0.844187 243.117082

2 8407.845588 6221.144614 138 3296.700439 49.659005 yes 5.205257 25865.233800

3 451.000010 266.899987 1 83.540161 3.071000 no 0.305221 63.024630

4 174.927981 140.124004 2 14.233637 1.947000 no 1.063300 67.406408

print('Number of duplicate rows = %d' % (dups.sum()))

Number of duplicate rows = 0

for column in df2.columns:

Name: sp500, dtype: int64

sales capital patents randd employment sp500 \

0 826.995050 161.603986 10 382.078247 2.306000 no

1 407.753973 122.101012 2 0.000000 1.860000 no

2 8407.845588 6221.144614 138 3296.700439 49.659005 yes

3 451.000010 266.899987 1 83.540161 3.071000 no

4 174.927981 140.124004 2 14.233637 1.947000 no

.. ... ... ... ... ... ...

754 1253.900196 708.299935 32 412.936157 22.100002 yes

localhost:8888/nbconvert/html/Predictive Modelling - Secondary.ipynb?download=false 3/26

755 171.821025 73.666008 1 0.037735 1.684000 no

756 202.726967 123.926991 13 74.861099 1.460000 no

757 785.687944 138.780992 6 0.621750 2.900000 yes