5/7/2019    InsurancePredictionLinR

Insurance Price Prediction

Case study using Linear Regression

By Abhi Sheth

Linear regression is a linear model, i.e. a model that assumes a linear relationship between the input variables (x) and the single output variable (y). More specifically, the output variable (y) can be calculated from a linear combination of the input variables (x).

Import Libraries

In [1]:
    import numpy as np
    import pandas as pd
    import matplotlib.pyplot as plt
    import mpl_toolkits
    import seaborn as sns
    %matplotlib inline

* pandas (https://pandas.pydata.org/) - library that we will use for loading and displaying the data in a table
* numpy (http://www.numpy.org/) - library that we will use for linear algebra operations
* matplotlib - library that we will use for plotting the data

Data

--> The data relates to the hospital treatment of patients. The goal is to predict the treatment charges of the patients.

--> This dataset provides patient information. It includes 1338 records and 7 fields.

In [2]:
    data = pd.read_csv('insurance.csv')
    print(data.shape)
    print(list(data.columns))

    (1338, 7)
    ['age', 'sex', 'bmi', 'children', 'smoker', 'region', 'charges']

Here we can see the first 5 rows of the data.

In [3]:
    data.head()

Out[3]:
       age     sex     bmi  children smoker     region      charges
    0   19  female  27.900         0    yes  southwest  16884.92400
    1   18    male  33.770         1     no  southeast   1725.55230
    2   28    male  33.000         3     no  southeast   4449.46200
    3   33    male  22.705         0     no  northwest  21984.47061
    4   32    male  28.880         0     no  northwest   3866.85520

Input Variables

1. age :- Customer's age in completed years.
2. sex :- Whether the patient is male or female.
3. bmi :- Body mass index of the patient.
4. children :- Number of children the patient has.
5. smoker :- Does the patient smoke?
6. region :- Residential area of the patient.
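To make the linear-combination idea concrete for these six inputs, here is a minimal sketch in plain Python. The text categories are first turned into integer codes, then the prediction is an intercept plus a weighted sum of the features. The weights and the intercept are hypothetical values chosen only for illustration (a fitted model learns them from the data), and the helper names `encode` and `predict_charge` are assumptions of this sketch, not part of the notebook. The integer codes mirror the encoded tables shown further down in the notebook, written here as ints.

```python
# Feature order follows the input-variable list above.
FEATURES = ["age", "sex", "bmi", "children", "smoker", "region"]

# Integer codes for the text categories (an assumption of this sketch).
CODES = {
    "sex": {"male": 0, "female": 1},
    "smoker": {"no": 0, "yes": 1},
    "region": {"northeast": 1, "northwest": 2, "southeast": 3, "southwest": 4},
}

# HYPOTHETICAL coefficients, one per feature, chosen only for illustration.
WEIGHTS = {"age": 250.0, "sex": -100.0, "bmi": 300.0,
           "children": 500.0, "smoker": 24000.0, "region": -350.0}
INTERCEPT = -12000.0


def encode(patient):
    """Return the patient's features as a list of numbers in FEATURES order."""
    row = []
    for name in FEATURES:
        value = patient[name]
        # Categorical fields are looked up in CODES; numeric fields pass through.
        row.append(CODES[name][value] if name in CODES else value)
    return row


def predict_charge(patient):
    """Linear model: intercept + sum of (weight * feature value)."""
    x = encode(patient)
    return INTERCEPT + sum(WEIGHTS[name] * xi for name, xi in zip(FEATURES, x))


# First row of the dataset as shown by data.head().
patient = {"age": 19, "sex": "female", "bmi": 27.9,
           "children": 0, "smoker": "yes", "region": "southwest"}

print(encode(patient))          # numeric feature vector
print(predict_charge(patient))  # weighted sum under the hypothetical weights
```

Training a linear regression amounts to choosing the weights and intercept so that these weighted sums come as close as possible to the observed charges.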
Predict variable (desired target):

7. charges :- Total treatment charges.

In [4]:
    data.info()

    RangeIndex: 1338 entries, 0 to 1337
    Data columns (total 7 columns):
    age         1338 non-null int64
    sex         1338 non-null object
    bmi         1338 non-null float64
    children    1338 non-null int64
    smoker      1338 non-null object
    region      1338 non-null object
    charges     1338 non-null float64
    dtypes: float64(2), int64(2), object(3)
    memory usage: 73.2+ KB

In [5]:
    data.isnull().sum()

Out[5]:
    age         0
    sex         0
    bmi         0
    children    0
    smoker      0
    region      0
    charges     0
    dtype: int64

Here we can see that the data has no null values and is well organized.

The region column has the following categories:

In [6]:
    print(data['region'].unique())

    ['southwest' 'southeast' 'northwest' 'northeast']

The children column has the following categories:

In [7]:
    print(data['children'].unique())

    [0 1 3 2 5 4]

Data Exploration

In [8]:
    data['region'].value_counts().plot(kind="bar")
    plt.title("patients by region")
    plt.xlabel("region")
    plt.ylabel("patients")
    sns.despine()

[Figure: bar chart "patients by region"; x-axis: region, y-axis: patients]

The bar chart shows that the southeast region has the most patients.

In [9]:
    data.charges.hist()

[Figure: histogram of charges; x-axis from 0 to about 60000]

Here is a histogram of treatment charges.

--> Next are scatter plots comparing each attribute with the target column, using the first 20 rows.

In [10]:
    plt.scatter(data.age.head(20), data.charges.head(20))
    plt.title("charges vs age")

Out[10]:
    Text(0.5,1,'charges vs age')

[Figure: scatter plot of charges against age]

In [11]:
    plt.scatter(data.sex.head(20), data.charges.head(20), marker="*")
    plt.title("charges vs sex")

Out[11]:
    Text(0.5,1,'charges vs sex')

[Figure: scatter plot of charges against sex (female, male)]

In [12]:
    plt.scatter(data.bmi.head(20), data.charges.head(20))
    plt.title("charges vs bmi")

Out[12]:
    Text(0.5,1,'charges vs bmi')

[Figure: scatter plot of charges against bmi]

In [13]:
    plt.scatter(data.children.head(20), data.charges.head(20))
    plt.title("charges vs children")

Out[13]:
    Text(0.5,1,'charges vs children')

[Figure: scatter plot of charges against children]

In [14]:
    plt.scatter(data.smoker.head(20), data.charges.head(20))
    plt.title("charges vs smoker")

Out[14]:
    Text(0.5,1,'charges vs smoker')

[Figure: scatter plot of charges against smoker (no, yes)]

In [15]:
    plt.scatter(data.region.head(20), data.charges.head(20))
    plt.title("charges vs region")

Out[15]:
    Text(0.5,1,'charges vs region')

[Figure: scatter plot of charges against region (northeast, northwest, southeast, southwest)]

In [16]:
    # Encode smoker as a numeric code: no -> '0', yes -> '1'
    pgr = data['smoker'].replace({'no': '0', 'yes': '1'})
    data['smoker'] = pgr
    data.head()

Out[16]:
       age     sex     bmi  children smoker     region      charges
    0   19  female  27.900         0      1  southwest  16884.92400
    1   18    male  33.770         1      0  southeast   1725.55230
    2   28    male  33.000         3      0  southeast   4449.46200
    3   33    male  22.705         0      0  northwest  21984.47061
    4   32    male  28.880         0      0  northwest   3866.85520

In [17]:
    # Encode region as a numeric code
    abc = data['region'].replace({'northeast': '1', 'northwest': '2', 'southeast': …
    data['region'] = abc
    data.head()

Out[17]:
       age     sex     bmi  children smoker region      charges
    0   19  female  27.900         0      1      4  16884.92400
    1   18    male  33.770         1      0      3   1725.55230
    2   28    male  33.000         3      0      3   4449.46200
    3   33    male  22.705         0      0      2  21984.47061
    4   32    male  28.880         0      0      2   3866.85520

In [18]:
    # Encode sex as a numeric code: male -> '0', female -> '1'
    xyz = data['sex'].replace({'male': '0', 'female': '1'})
    data['sex'] = xyz
    data.head()

Out[18]:
       age sex     bmi  children smoker region      charges
    0   19   1  27.900         0      1      4  16884.92400
    1   18   0  33.770         1      0      3   1725.55230
    2   28   0  33.000         3      0      3   4449.46200
    3   33   0  22.705         0      0      2  21984.47061
    4   32   0  28.880         0      0      2   3866.85520

In [19]:
    # Input features: every column except the target
    train = data.drop(['charges'], axis=1)
    train.head()

Out[19]:
       age sex     bmi  children smoker region
    0   19   1  27.900         0      1      4
    1   18   0  33.770         1      0      3
    2   28   0  33.000         3      0      3
    3   33   0  22.705         0      0      2
    4   32   0  28.880         0      0      2

In [20]:
    # Target column
    labels = data['charges']
    labels.head()

Out[20]:
    0    16884.92400
    1     1725.55230
    2     4449.46200
    3    21984.47061
    4     3866.85520
    Name: charges, dtype: float64

--> Import the sklearn library to use the linear regression model.

In [21]:
    from sklearn.linear_model import LinearRegression
    from sklearn.cross_validation import train_test_split

    C:\Users\Abhi\Anaconda\lib\site-packages\sklearn\cross_validation.py:41: DeprecationWarning: This module was deprecated in version 0.18 in favor of the model_selection module into which all the refactored classes and functions are moved. Also note that the interface of the new CV iterators are different from that of this module. This module will be removed in 0.20.
      "This module will be removed in 0.20.", DeprecationWarning)

In [22]:
    reg = LinearRegression()

Split the Data Into Training and Test Sets

In this step we split the dataset into training and testing subsets (in an 80/20 proportion). The training set is used to fit the linear model. The testing set is used to validate it: all rows in the testing set are new to the model, so we can check how accurate its predictions are.

In [23]:
    x_train, x_test, y_train, y_test = train_test_split(train, labels, test_size…

In [24]:
    reg.fit(x_train, y_train)

Out[24]:
    LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normaliz…

In [25]:
    print(x_train.shape)
    print(y_train.shape)
    print(x_test.shape)
    print(y_test.shape)

    (1070, 6)
    (1070,)
    (268, 6)
    (268,)

In [26]:
    # User input data to predict charges for
    d = pd.read_csv('a_test.csv')
    d1 = pd.DataFrame(d)

In [27]:
    y_pred = reg.predict(d1)
    print(y_pred)

    [ 4674.26886506 37382.69130141]

So these are the predicted treatment charges for the user input data.

Accuracy using score() and gradient boosting

In [28]:
    print('Accuracy of linear regression classifier on test set: ',(reg.score(x_test,…
    print('Accuracy of linear regression classifier on train set:',(reg.score(x_trai…

    Accuracy of linear regression classifier on test set:  0.7445422986536503
    Accuracy of linear regression classifier on train set: 0.7519923667088932

In [29]:
    from sklearn.ensemble import GradientBoostingRegressor
    clf = GradientBoostingRegressor(n_estimators=400, max_depth=5, min_samples_split=2, …
    clf.fit(x_train, y_train)

Out[29]:
    GradientBoostingRegressor(alpha=0.9, criterion='friedman_mse', init=None,
                 learning_rate=0.2, loss='ls', max_depth=5, max_features=None,
                 max_leaf_nodes=None, min_impurity_decreas…
                 min_impurity_split=None, min_samples_leaf=1,
                 min_samples_split=2, min_weight_fraction_leaf…
                 n_estimators=400, presort='auto', random_state=None,
                 subsample=1.0, verbose=0, warm_start=False)

In [30]:
    clf.score(x_test, y_test)

Out[30]:
    0.7971318757647123

In [31]:
    clf.score(x_train, y_train)

Out[31]:
    0.9998251935700382
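A note on what these numbers mean: for scikit-learn regressors, score() returns the coefficient of determination R², not a classification accuracy. The sketch below shows that formula in plain Python, as an illustration of the quantity rather than scikit-learn's own implementation. The function name `r_squared` and the sample values are made up for this example.

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("y_true and y_pred must be non-empty and equally long")
    mean_y = sum(y_true) / len(y_true)
    # Residual sum of squares: how far the predictions are from the targets.
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    # Total sum of squares: how far the targets are from their own mean.
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    if ss_tot == 0:
        raise ValueError("R^2 is undefined when all targets are identical")
    return 1.0 - ss_res / ss_tot


# Made-up charges and predictions, purely for illustration.
y_true = [1000.0, 2000.0, 3000.0, 4000.0]
y_pred = [1100.0, 1900.0, 3200.0, 3800.0]

print(r_squared(y_true, y_pred))
```

A value of 1.0 means the predictions match the targets exactly, and 0.0 means the model does no better than always predicting the mean. Read this way, the gradient boosting scores above (about 0.9998 on the training set against about 0.797 on the test set) suggest the model fits the training rows far more closely than unseen rows, which is a typical sign of overfitting.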
