
 

Outlier detection - Tukey IQR


https://towardsdatascience.com/local-outlier-factor-for-anomaly-detection-cc0c770d2ebe

Identifies extreme values in data

Outliers are defined as:


Values below Q1 - 1.5*(Q3 - Q1) or above Q3 + 1.5*(Q3 - Q1), where Q1 and Q3 are the first and third quartiles and Q3 - Q1 is the interquartile range (IQR)
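A minimal sketch of the rule in pandas (assuming pima_df is already loaded, and using 'Age' purely as an example column):

# Tukey IQR fences for a single column, e.g. 'Age'
q1, q3 = pima_df['Age'].quantile([0.25, 0.75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = pima_df['Age'][(pima_df['Age'] < lower) | (pima_df['Age'] > upper)]
print("fences:", lower, upper, "-> flagged", outliers.size, "values")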

Distance from the mean, measured in standard deviations, is another common method for detecting extreme values,
but it can be problematic:

It assumes normality
It is sensitive to very extreme values

from IPython.display import Image


Image(filename='Images/tukeyiqr.jpg')

# Box and whisker plots (univariate plots)
# Use these to spot outliers in the dataset
import matplotlib.pyplot as plt

X = pima_df.loc[:,'Pregnancies':'Age']
X.plot(kind='box', subplots=True, layout=(3, 3), figsize=(15, 10))
plt.show()
Why 1.5 times the width of the box for the outliers? Why does that particular value demarcate the
difference between "acceptable" and "unacceptable" values?
Because, when John Tukey introduced the box-and-whisker plot in 1977 to display these values, he picked
1.5×IQR as the demarcation line for outliers. This has worked well, so we've continued using that value ever
since. If you go further into statistics, you'll find that, for bell-curve-shaped data, this rule flags only a small
fraction of points as outliers, typically well under one percent.
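That claim is easy to check: for a normal distribution the quartiles sit at about ±0.6745 standard deviations, so the Tukey fences fall near ±2.7 standard deviations. A quick sketch using scipy.stats:

from scipy.stats import norm

q1, q3 = norm.ppf(0.25), norm.ppf(0.75)    # quartiles of the standard normal, about +/-0.6745
iqr = q3 - q1                              # about 1.349
upper_fence = q3 + 1.5 * iqr               # about 2.698 standard deviations
outlier_fraction = 2 * (1 - norm.cdf(upper_fence))
print(round(100 * outlier_fraction, 2), "% of normal data lies beyond the Tukey fences")  # about 0.7%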

Removing Outliers Using Standard Deviation


Our approach was to remove the outlier points by eliminating any points above (Mean + 2.75 * SD) or below
(Mean - 2.75 * SD) before plotting the frequencies.

import numpy as np

X = pima_df.loc[:,'Pregnancies':'Age']
# flag Age values more than 2.75 standard deviations from the mean
outlier_df = X['Age'][(X['Age'] - X['Age'].mean()).abs() > 2.75 * X['Age'].std()]
print(outlier_df)
out_indices = outlier_df.index
print(out_indices)

123   69
221   66
363   67
453   72
459   81
489   67
495   66
537   67
552   66
666   70
674   68
684   69
759   66
Name: Age, dtype: int64
Int64Index([123, 221, 363, 453, 459, 489, 495, 537, 552, 666, 674, 684, 759],
dtype='int64')

# Note: this assigns NaN to every column of the flagged rows, not just 'Age'
X.loc[out_indices] = np.nan
print(X)
# replace the flagged rows with the median of 'Age' (again, across all columns)
mid = X['Age'].median()
X.loc[out_indices] = mid
print(X)

    Pregnancies Glucose BloodPressure SkinThickness     Insulin \


0           6.0   148.0     72.000000       35.00000 155.548223  
1           1.0     85.0     66.000000       29.00000 155.548223  
2           8.0   183.0     64.000000       29.15342 155.548223  
3           1.0     89.0     66.000000       23.00000   94.000000  
4           0.0   137.0     40.000000       35.00000 168.000000  
5           5.0   116.0     74.000000       29.15342 155.548223  
6           3.0     78.0     50.000000       32.00000   88.000000  
7           10.0   115.0     72.405184       29.15342 155.548223  
8           2.0   197.0     70.000000       45.00000 543.000000  
9           8.0   125.0     96.000000       29.15342 155.548223  
10           4.0   110.0     92.000000       29.15342 155.548223  
11         10.0   168.0     74.000000       29.15342 155.548223  
12         10.0   139.0     80.000000       29.15342 155.548223  
13           1.0   189.0     60.000000       23.00000 846.000000  
14           5.0   166.0     72.000000       19.00000 175.000000  
15           7.0   100.0     72.405184       29.15342 155.548223  
16           0.0   118.0     84.000000       47.00000 230.000000  
17           7.0   107.0     74.000000       29.15342 155.548223  
18           1.0   103.0     30.000000       38.00000   83.000000  
19           1.0   115.0     70.000000       30.00000   96.000000  
20           3.0   126.0     88.000000       41.00000 235.000000  
21           8.0     99.0     84.000000       29.15342 155.548223  
22           7.0   196.0     90.000000       29.15342 155.548223  
23           9.0   119.0     80.000000       35.00000 155.548223  
24         11.0   143.0     94.000000       33.00000 146.000000  
25         10.0   125.0     70.000000       26.00000 115.000000  
26           7.0   147.0     76.000000       29.15342 155.548223  
27           1.0     97.0     66.000000       15.00000 140.000000  
28         13.0   145.0     82.000000       19.00000 110.000000  
29           5.0   117.0     92.000000       29.15342 155.548223  
..           ...     ...           ...           ...         ...  
738         2.0     99.0     60.000000       17.00000 160.000000  
739         1.0   102.0     74.000000       29.15342 155.548223  
740         11.0   120.0     80.000000       37.00000 150.000000  
741         3.0   102.0     44.000000       20.00000   94.000000  
742         1.0   109.0     58.000000       18.00000 116.000000  
743         9.0   140.0     94.000000       29.15342 155.548223  
744         13.0   153.0     88.000000       37.00000 140.000000  
745         12.0   100.0     84.000000       33.00000 105.000000  
746         1.0   147.0     94.000000       41.00000 155.548223  
747         1.0     81.0     74.000000       41.00000   57.000000  
748         3.0   187.0     70.000000       22.00000 200.000000  
749         6.0   162.0     62.000000       29.15342 155.548223  
750         4.0   136.0     70.000000       29.15342 155.548223  
751         1.0   121.0     78.000000       39.00000   74.000000  
752         3.0   108.0     62.000000       24.00000 155.548223  
753         0.0   181.0     88.000000       44.00000 510.000000  
754         8.0   154.0     78.000000       32.00000 155.548223  
755         1.0   128.0     88.000000       39.00000 110.000000  
756         7.0   137.0     90.000000       41.00000 155.548223  
757         0.0   123.0     72.000000       29.15342 155.548223  
758         1.0   106.0     76.000000       29.15342 155.548223  
759         NaN     NaN           NaN           NaN         NaN  
760         2.0     88.0     58.000000       26.00000   16.000000  
761         9.0   170.0     74.000000       31.00000 155.548223  
762         9.0     89.0     62.000000       29.15342 155.548223  
763         10.0   101.0     76.000000       48.00000 180.000000  
764         2.0   122.0     70.000000       27.00000 155.548223  
765         5.0   121.0     72.000000       23.00000 112.000000  
766         1.0   126.0     60.000000       29.15342 155.548223  
767         1.0     93.0     70.000000       31.00000 155.548223  

          BMI DiabetesPedigreeFunction   Age  


0   33.600000                     0.627 50.0  
1   26.600000                     0.351 31.0  
2   23.300000                     0.672 32.0  
3   28.100000                     0.167 21.0  
4   43.100000                     2.288 33.0  
5   25.600000                     0.201 30.0  
6   31.000000                     0.248 26.0  
7   35.300000                     0.134 29.0  
8   30.500000                     0.158 53.0  
9   32.457464                     0.232 54.0  
10   37.600000                     0.191 30.0  
11   38.000000                     0.537 34.0  
12   27.100000                     1.441 57.0  
13   30.100000                     0.398 59.0  
14   25.800000                     0.587 51.0  
15   30.000000                     0.484 32.0  
16   45.800000                     0.551 31.0  
17   29.600000                     0.254 31.0  
18   43.300000                     0.183 33.0  
19   34.600000                     0.529 32.0  
20   39.300000                     0.704 27.0  
21   35.400000                     0.388 50.0  
22   39.800000                     0.451 41.0  
23   29.000000                     0.263 29.0  
24   36.600000                     0.254 51.0  
25   31.100000                     0.205 41.0  
26   39.400000                     0.257 43.0  
27   23.200000                     0.487 22.0  
28   22.200000                     0.245 57.0  
29   34.100000                     0.337 38.0  
..         ...                       ...   ...  
738 36.600000                     0.453 21.0  
739 39.500000                     0.293 42.0  
740 42.300000                     0.785 48.0  
741 30.800000                     0.400 26.0  
742 28.500000                     0.219 22.0  
743 32.700000                     0.734 45.0  
744 40.600000                     1.174 39.0  
745 30.000000                     0.488 46.0  
746 49.300000                     0.358 27.0  
747 46.300000                     1.096 32.0  
748 36.400000                     0.408 36.0  
749 24.300000                     0.178 50.0  
750 31.200000                     1.182 22.0  
751 39.000000                     0.261 28.0  
752 26.000000                     0.223 25.0  
753 43.300000                     0.222 26.0  
754 32.400000                     0.443 45.0  
755 36.500000                     1.057 37.0  
756 32.000000                     0.391 39.0  
757 36.300000                     0.258 52.0  
758 37.500000                     0.197 26.0  
759       NaN                       NaN   NaN  
760 28.400000                     0.766 22.0  
761 44.000000                     0.403 43.0  
762 22.500000                     0.142 33.0  
763 32.900000                     0.171 63.0  
764 36.800000                     0.340 27.0  
765 26.200000                     0.245 30.0  
766 30.100000                     0.349 47.0  
767 30.400000                     0.315 23.0  

[768 rows x 8 columns]


    Pregnancies Glucose BloodPressure SkinThickness     Insulin \
0           6.0   148.0     72.000000       35.00000 155.548223  
1           1.0     85.0     66.000000       29.00000 155.548223  
2           8.0   183.0     64.000000       29.15342 155.548223  
3           1.0     89.0     66.000000       23.00000   94.000000  
4           0.0   137.0     40.000000       35.00000 168.000000  
5           5.0   116.0     74.000000       29.15342 155.548223  
6           3.0     78.0     50.000000       32.00000   88.000000  
7           10.0   115.0     72.405184       29.15342 155.548223  
8           2.0   197.0     70.000000       45.00000 543.000000  
9           8.0   125.0     96.000000       29.15342 155.548223  
10           4.0   110.0     92.000000       29.15342 155.548223  
11         10.0   168.0     74.000000       29.15342 155.548223  
12         10.0   139.0     80.000000       29.15342 155.548223  
13           1.0   189.0     60.000000       23.00000 846.000000  
14           5.0   166.0     72.000000       19.00000 175.000000  
15           7.0   100.0     72.405184       29.15342 155.548223  
16           0.0   118.0     84.000000       47.00000 230.000000  
17           7.0   107.0     74.000000       29.15342 155.548223  
18           1.0   103.0     30.000000       38.00000   83.000000  
19           1.0   115.0     70.000000       30.00000   96.000000  
20           3.0   126.0     88.000000       41.00000 235.000000  
21           8.0     99.0     84.000000       29.15342 155.548223  
22           7.0   196.0     90.000000       29.15342 155.548223  
23           9.0   119.0     80.000000       35.00000 155.548223  
24         11.0   143.0     94.000000       33.00000 146.000000  
25         10.0   125.0     70.000000       26.00000 115.000000  
26           7.0   147.0     76.000000       29.15342 155.548223  
27           1.0     97.0     66.000000       15.00000 140.000000  
28         13.0   145.0     82.000000       19.00000 110.000000  
29           5.0   117.0     92.000000       29.15342 155.548223  
..           ...     ...           ...           ...         ...  
738         2.0     99.0     60.000000       17.00000 160.000000  
739         1.0   102.0     74.000000       29.15342 155.548223  
740         11.0   120.0     80.000000       37.00000 150.000000  
741         3.0   102.0     44.000000       20.00000   94.000000  
742         1.0   109.0     58.000000       18.00000 116.000000  
743         9.0   140.0     94.000000       29.15342 155.548223  
744         13.0   153.0     88.000000       37.00000 140.000000  
745         12.0   100.0     84.000000       33.00000 105.000000  
746         1.0   147.0     94.000000       41.00000 155.548223  
747         1.0     81.0     74.000000       41.00000   57.000000  
748         3.0   187.0     70.000000       22.00000 200.000000  
749         6.0   162.0     62.000000       29.15342 155.548223  
750         4.0   136.0     70.000000       29.15342 155.548223  
751         1.0   121.0     78.000000       39.00000   74.000000  
752         3.0   108.0     62.000000       24.00000 155.548223  
753         0.0   181.0     88.000000       44.00000 510.000000  
754         8.0   154.0     78.000000       32.00000 155.548223  
755         1.0   128.0     88.000000       39.00000 110.000000  
756         7.0   137.0     90.000000       41.00000 155.548223  
757         0.0   123.0     72.000000       29.15342 155.548223  
758         1.0   106.0     76.000000       29.15342 155.548223  
759         29.0     29.0     29.000000       29.00000   29.000000  
760         2.0     88.0     58.000000       26.00000   16.000000  
761         9.0   170.0     74.000000       31.00000 155.548223  
762         9.0     89.0     62.000000       29.15342 155.548223  
763         10.0   101.0     76.000000       48.00000 180.000000  
764         2.0   122.0     70.000000       27.00000 155.548223  
765         5.0   121.0     72.000000       23.00000 112.000000  
766         1.0   126.0     60.000000       29.15342 155.548223  
767         1.0     93.0     70.000000       31.00000 155.548223  

          BMI DiabetesPedigreeFunction   Age  


0   33.600000                     0.627 50.0  
1   26.600000                     0.351 31.0  
2   23.300000                     0.672 32.0  
3   28.100000                     0.167 21.0  
4   43.100000                     2.288 33.0  
5   25.600000                     0.201 30.0  
6   31.000000                     0.248 26.0  
7   35.300000                     0.134 29.0  
8   30.500000                     0.158 53.0  
9   32.457464                     0.232 54.0  
10   37.600000                     0.191 30.0  
11   38.000000                     0.537 34.0  
12   27.100000                     1.441 57.0  
13   30.100000                     0.398 59.0  
14   25.800000                     0.587 51.0  
15   30.000000                     0.484 32.0  
16   45.800000                     0.551 31.0  
17   29.600000                     0.254 31.0  
18   43.300000                     0.183 33.0  
19   34.600000                     0.529 32.0  
20   39.300000                     0.704 27.0  
21   35.400000                     0.388 50.0  
22   39.800000                     0.451 41.0  
23   29.000000                     0.263 29.0  
24   36.600000                     0.254 51.0  
25   31.100000                     0.205 41.0  
26   39.400000                     0.257 43.0  
27   23.200000                     0.487 22.0  
28   22.200000                     0.245 57.0  
29   34.100000                     0.337 38.0  
..         ...                       ...   ...  
738 36.600000                     0.453 21.0  
739 39.500000                     0.293 42.0  
740 42.300000                     0.785 48.0  
741 30.800000                     0.400 26.0  
742 28.500000                     0.219 22.0  
743 32.700000                     0.734 45.0  
744 40.600000                     1.174 39.0  
745 30.000000                     0.488 46.0  
746 49.300000                     0.358 27.0  
747 46.300000                     1.096 32.0  
748 36.400000                     0.408 36.0  
749 24.300000                     0.178 50.0  
750 31.200000                     1.182 22.0  
751 39.000000                     0.261 28.0  
752 26.000000                     0.223 25.0  
753 43.300000                     0.222 26.0  
754 32.400000                     0.443 45.0  
755 36.500000                     1.057 37.0  
756 32.000000                     0.391 39.0  
757 36.300000                     0.258 52.0  
758 37.500000                     0.197 26.0  
759 29.000000                   29.000 29.0  
760 28.400000                     0.766 22.0  
761 44.000000                     0.403 43.0  
762 22.500000                     0.142 33.0  
763 32.900000                     0.171 63.0  
764 36.800000                     0.340 27.0  
765 26.200000                     0.245 30.0  
766 30.100000                     0.349 47.0  
767 30.400000                     0.315 23.0  

[768 rows x 8 columns]

X.plot(kind='box',subplots=True,layout=(3,3),figsize=(15,10))
plt.show()

# Plotting all of your data: bee swarm plots
import seaborn as sns

plt.figure(figsize=(15, 8))
_ = sns.swarmplot(data=pima_df)
plt.show()
 

import seaborn as sns


import matplotlib.pyplot as plt
plt.figure(figsize=(15, 6))  
_ = sns.swarmplot(x='Age',y='Glucose',hue='Outcome',data=pima_df)
plt.show()

#!pip install plotnine

Reference for exploring different smoothers in ggplot2


import warnings
warnings.filterwarnings("ignore")
from plotnine import *

ggplot(pima_df, aes(x='Age', y='Glucose', colour='Outcome')) + geom_point() + stat_smooth()

<ggplot: (7547461505)>

(ggplot(pima_df, aes(x='Age', y='Glucose', colour='BloodPressure'))
 + geom_point() + stat_smooth() + facet_wrap('~Outcome'))
 

<ggplot: (294288713)>

(ggplot(pima_df, aes(x='Age', y='Pregnancies'))
 + geom_point(aes(color='BMI')) + facet_wrap('~Outcome') + stat_smooth())
 

<ggplot: (294281529)>

# Correlation between the feature columns

corr = pima_df.loc[:, pima_df.columns != 'Outcome'].corr()
plt.figure(figsize=(12, 12))
sns.heatmap(corr, annot=True, cmap="Blues")

<matplotlib.axes._subplots.AxesSubplot at 0x1c1e1a6510>

 
We can observe that there are correlations between some columns:
Age is highly correlated with Pregnancies
Insulin is correlated with Glucose
Skin thickness is correlated with BMI

sns.lmplot(x='Age', y = 'Pregnancies', hue = 'Outcome', data = pima_df)

<seaborn.axisgrid.FacetGrid at 0x1c1e19bbd0>

 
 

sns.lmplot(x='Insulin', y = 'Glucose', hue = 'Outcome', data = pima_df)

<seaborn.axisgrid.FacetGrid at 0x1c1e3ee9d0>

 
sns.lmplot(x='BMI', y = 'SkinThickness', hue = 'Outcome', data = pima_df)

<seaborn.axisgrid.FacetGrid at 0x10f8e4390>

# Visualise a pairplot using seaborn, which plots each attribute against every other attribute
sns.pairplot(pima_df[['Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness',
                      'Insulin', 'BMI', 'DiabetesPedigreeFunction', 'Age']])

<seaborn.axisgrid.PairGrid at 0x10fb49750>

 
4. Data Scaling
Many machine learning algorithms expect the input (and sometimes the output) variables to be on a comparable
scale. Scaling can help methods that weight inputs in order to make a prediction, such as linear regression and
logistic regression, and it is practically required in methods that combine weighted inputs in complex ways,
such as artificial neural networks and deep learning.

We will discuss:

1. Normalize Data
2. Standardize Data
3. When to Normalize and Standardize

1. Normalize Data
Normalization can refer to different techniques depending on context. Here, we use normalization to refer
to rescaling an input variable to the range between 0 and 1. Normalization requires that you know the
minimum and maximum values for each attribute. This can be estimated from training data or specified
directly if you have deep knowledge of the problem domain. You can easily estimate the minimum and
maximum values for each attribute in a dataset by enumerating through the values.

Once we have estimates of the maximum and minimum allowed values for each column, we can normalize
the raw data to the range 0 and 1. The calculation to normalize a single value for a column is:
scaled value = (value - min)/(max - min)
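To make the formula concrete, here is a minimal hand-rolled version (a sketch; the 'Glucose' column is just an example) that should agree with the MinMaxScaler output below:

# Min-max scaling of a single column by hand, e.g. 'Glucose'
col = pima_df['Glucose']
col_min, col_max = col.min(), col.max()
scaled = (col - col_min) / (col_max - col_min)
print(scaled.min(), scaled.max())   # 0.0 and 1.0 by construction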

np.set_printoptions(precision=3)
array = np.array(pima_df.values)
print("== Generating data sets ==")

print("diabetes_attr: unchanged, original attributes")


diabetes_attr = array[:,0:8]
label = array[:,8] #unchanged across preprocessing?
diabetes_df = pd.DataFrame(diabetes_attr)

== Generating data sets ==


diabetes_attr: unchanged, original attributes

print("Normalized_attributes: range of 0 to 1")


from sklearn import preprocessing as preproc
scaler = preproc.MinMaxScaler().fit(diabetes_attr)
normalized_attr = scaler.transform(diabetes_attr)
normalized_df = pd.DataFrame(normalized_attr)
print(normalized_df.describe())

Normalized_attributes: range of 0 to 1
              0           1           2           3           4           5 \
count 768.000000 768.000000 768.000000 768.000000 768.000000 768.000000  
mean     0.226180   0.501205   0.493930   0.240798   0.170130   0.291564  
std     0.198210   0.196361   0.123432   0.095554   0.102189   0.140596  
min     0.000000   0.000000   0.000000   0.000000   0.000000   0.000000  
25%     0.058824   0.359677   0.408163   0.195652   0.129207   0.190184  
50%     0.176471   0.470968   0.491863   0.240798   0.170130   0.290389  
75%     0.352941   0.620968   0.571429   0.271739   0.170130   0.376278  
max     1.000000   1.000000   1.000000   1.000000   1.000000   1.000000  

              6           7  
count 768.000000 768.000000  
mean     0.168179   0.204015  
std     0.141473   0.196004  
min     0.000000   0.000000  
25%     0.070773   0.050000  
50%     0.125747   0.133333  
75%     0.234095   0.333333  
max     1.000000   1.000000  

2. Standardize Data

Standardization is a rescaling technique that centers the distribution of the data on the value 0 and scales
the standard deviation to the value 1. Together, the mean and the standard deviation can be used to
summarize a normal distribution, also called the Gaussian distribution or bell curve. It requires that the
mean and standard deviation of the values for each column be known prior to scaling. As with normalizing
above, we can estimate these values from training data, or use domain knowledge to specify them.

The standard deviation describes the average spread of values from the mean. It can be calculated as the
square root of the sum of squared differences between each value and the mean, divided by the number of
values minus 1.

Once the mean and standard deviation are calculated, we can easily standardize each value. The calculation
to standardize a single value for a column is:
standardized value = (value - mean)/stdev
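As a quick check of the formula (again a sketch on the example 'Glucose' column): note that sklearn's scale() uses the population standard deviation (ddof=0), while pandas' std() defaults to ddof=1:

# Standardizing one column by hand
col = pima_df['Glucose']
standardized = (col - col.mean()) / col.std(ddof=0)   # ddof=0 to match preproc.scale below
print(round(standardized.mean(), 6), round(standardized.std(ddof=0), 6))   # ~0.0 and 1.0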

print("standardized_attr: mean of 0 and stdev of 1")


#scaler = preproc.StandardScaler().fit(diabetes_attr)
#standardized_attr = scaler.transform(diabetes_attr)
standardized_attr = preproc.scale(diabetes_attr)
standardized_df = pd.DataFrame(standardized_attr)
print(standardized_df.describe())

standardized_attr: mean of 0 and stdev of 1


0 1 2 3 4 \
count 7.680000e+02 7.680000e+02 7.680000e+02 7.680000e+02 7.680000e+02
mean 2.544261e-17 -3.301757e-16 6.966722e-16 6.866252e-16 -2.352033e-16
std 1.000652e+00 1.000652e+00 1.000652e+00 1.000652e+00 1.000652e+00
min -1.141852e+00 -2.554131e+00 -4.004245e+00 -2.521670e+00 -1.665945e+00
25% -8.448851e-01 -7.212214e-01 -6.953060e-01 -4.727737e-01 -4.007289e-01
50% -2.509521e-01 -1.540881e-01 -1.675912e-02 8.087936e-16 -3.345079e-16
75% 6.399473e-01 6.103090e-01 6.282695e-01 3.240194e-01 -3.345079e-16
max 3.906578e+00 2.541850e+00 4.102655e+00 7.950467e+00 8.126238e+00

5 6 7
count 7.680000e+02 7.680000e+02 7.680000e+02
mean 3.090699e-16 2.398978e-16 1.857600e-16
std 1.000652e+00 1.000652e+00 1.000652e+00
min -2.075119e+00 -1.189553e+00 -1.041549e+00
25% -7.215397e-01 -6.889685e-01 -7.862862e-01
50% -8.363615e-03 -3.001282e-01 -3.608474e-01
75% 6.029301e-01 4.662269e-01 6.602056e-01
max 5.042087e+00 5.883565e+00 4.063716e+00

3. When to Normalize and Standardize


Standardization is a scaling technique that assumes your data conforms to a normal distribution. If a given
data attribute is normal or close to normal, this is probably the scaling method to use. It is good practice to
record the summary statistics used in the standardization process so that you can apply them when
standardizing future data that you may want to use with your model. Normalization is a scaling
technique that does not assume any specific distribution.

If your data is not normally distributed, consider normalizing it prior to applying your machine learning
algorithm. It is good practice to record the minimum and maximum values for each column used in the
normalization process, again, in case you need to normalize new data in the future to be used with your
model.
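One simple way to follow that advice (a sketch; the file name is illustrative and assumes joblib is available) is to persist the fitted scaler object so that the recorded minima/maxima, or means and standard deviations, are reapplied to any future data:

import joblib

# save the MinMaxScaler fitted earlier so the same min/max values are reused later
joblib.dump(scaler, 'pima_minmax_scaler.pkl')

# ... later, when new observations arrive ...
scaler_loaded = joblib.load('pima_minmax_scaler.pkl')
new_scaled = scaler_loaded.transform(diabetes_attr[:5])   # the first 5 rows stand in for unseen data
print(new_scaled)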

Handling Imbalanced Class Data

print("=== undersampling majority class by purging ===")

# Separate majority and minority classes


df_majority = pima_df[pima_df['Outcome']==0]
df_minority = pima_df[pima_df['Outcome']==1]

=== undersampling majority class by purging ===

print("df_minority['class'].size", df_minority['Outcome'].size)
from sklearn.utils import resample

# Downsample the majority class
df_majority_downsampled = resample(df_majority,
                                   replace=False,                          # sample without replacement
                                   n_samples=df_minority['Outcome'].size,  # match minority class size
                                   random_state=7)                         # reproducible results

# Combine minority class with downsampled majority class


df_downsampled = pd.concat([df_majority_downsampled, df_minority])

("df_minority['class'].size", 268)

print("undersampled", df_downsampled.groupby('Outcome').size())
df_downsampled = df_downsampled.sample(frac=1).reset_index(drop=True)
undersampling_attr = np.array(df_downsampled.values[:, 0:8])
undersampling_label = np.array(df_downsampled.values[:, 8])

('undersampled', Outcome
0    268
1    268
dtype: int64)

#!pip install imblearn

print("=== oversampling minority class with SMOTE ===")


from imblearn.over_sampling import SMOTE
sm = SMOTE(random_state=7)
x_val = pima_df.values[:,0:8]
y_val = pima_df.values[:,8]
X_res, y_res = sm.fit_sample(x_val, y_val)

features = ['Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI',
            'DiabetesPedigreeFunction', 'Age']
oversampled_df = pd.DataFrame(X_res)
oversampled_df.columns = features
oversampled_df = oversampled_df.assign(label=np.asarray(y_res))
oversampled_df = oversampled_df.sample(frac=1).reset_index(drop=True)

oversampling_attr = oversampled_df.values[:, 0:8]
oversampling_label = oversampled_df.values[:, 8]
print("oversampled_df", oversampled_df.groupby('label').size())

=== oversampling minority class with SMOTE ===


('oversampled_df', label
0.0 500
1.0 500
dtype: int64)

print("== treating missing values by purging or imputing ==")

## missing.arff
print("=== Assuming zero indicates missing values ===")
print("missing values by count")
print((pima_df[['Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness', 'Insulin',
                'BMI', 'DiabetesPedigreeFunction', 'Age']] == 0).sum())
print("=== purging ===")

# make a copy of the original data set
dataset_cp = pima_df.copy(deep=True)
# treat zeros as missing in columns where zero is not a valid value
dataset_cp[['Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI']] = \
    dataset_cp[['Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI']].replace(0, np.NaN)
== treating missing values by purging or imputing ==
=== Assuming zero indicates missing values ===
missing values by count
Pregnancies 111
Glucose 0
BloodPressure 0
SkinThickness 0
Insulin 0
BMI 0
DiabetesPedigreeFunction 0
Age 0
dtype: int64
=== purging ===

# drop the rows that contain missing values
dataset_missing = dataset_cp.dropna()

# summarize the number of rows and columns in the copied dataset
print(dataset_cp.shape)

missing_attr = np.array(dataset_missing.values[:, 0:8])
missing_label = np.array(dataset_missing.values[:, 8])

print("=== imputing by replacing missing values with mean column values ===")

dataset_impute = dataset_cp.fillna(dataset_cp.mean())
# count the number of NaN values in each column
print(dataset_impute.isnull().sum())

print("== addressing class imbalance under or over sampling ==")

impute_attr = np.array(dataset_impute.values[:,0:8])

(768, 9)
=== imputing by replacing missing values with mean column values ===
Pregnancies 0
Glucose 0
BloodPressure 0
SkinThickness 0
Insulin 0
BMI 0
DiabetesPedigreeFunction 0
Age 0
Outcome 0
dtype: int64
== addressing class imbalance under or over sampling ==

Dimensionality reduction using PCA


Principal component analysis (PCA) is a technique that transforms a dataset of many features into
principal components that "summarize" the variance that underlies the data

Each principal component is calculated by finding the linear combination of features that maximizes
variance, while also ensuring zero correlation with the previously calculated principal components

Use cases for modeling:

One of the most common dimensionality reduction techniques
Use it if there are too many features or if the observation-to-feature ratio is poor
It is also a potentially good option if there are many highly correlated variables in your dataset

Unfortunately, PCA makes models considerably harder to interpret

PCA as dimensionality reduction

Using PCA for dimensionality reduction involves zeroing out one or more of the smallest principal
components, resulting in a lower-dimensional projection of the data that preserves the maximal data
variance.
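To make that concrete, a small sketch (reusing the diabetes_attr array defined above; keeping 2 components is an arbitrary choice) projects the data down and reconstructs it, reporting how much variance the retained components preserve:

from sklearn.decomposition import PCA

# project onto the top 2 components, then map back to the original 8-dimensional space
pca2 = PCA(n_components=2)
projected = pca2.fit_transform(diabetes_attr)
reconstructed = pca2.inverse_transform(projected)
print("variance retained:", pca2.explained_variance_ratio_.sum())
print("mean squared reconstruction error:", np.mean((diabetes_attr - reconstructed) ** 2))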
Choosing the number of components

A vital part of using PCA in practice is estimating how many components are needed to describe the data.
This can be determined by looking at the cumulative explained variance ratio as a function of the number of
components:

# Use PCA from sklearn.decompostion to find principal components


from sklearn.decomposition import PCA
pca = PCA()
X_pca = pd.DataFrame(pca.fit_transform(pima_df))
plt.plot(np.cumsum(pca.explained_variance_ratio_))
plt.xlabel('number of components')
plt.ylabel('cumulative explained variance');

 
pca = PCA(n_components=5)
pca.fit(diabetes_attr)
diabetes_attr_pca = pca.transform(diabetes_attr)
print("original shape:   ", diabetes_attr.shape)
print("transformed shape:", diabetes_attr_pca.shape)

('original shape: ', (768, 8))


('transformed shape:', (768, 5))

pca.fit(normalized_attr)
normalized_attr_pca = pca.transform(normalized_attr)

pca.fit(standardized_attr)
standardized_attr_pca = pca.transform(standardized_attr)

pca.fit(impute_attr)
impute_attr_pca = pca.transform(impute_attr)

pca.fit(missing_attr)
missing_attr_pca = pca.transform(missing_attr)

pca.fit(undersampling_attr)
undersampling_attr_pca = pca.transform(undersampling_attr)

pca.fit(oversampling_attr)
oversampling_attr_pca = pca.transform(oversampling_attr)

Evaluate Algorithms

print(" == Evaluate Some Algorithms == ")
# Split-out validation dataset
print(" == Create a Validation Dataset: Split-out validation dataset == ")

# Test options and evaluation metric
print(" == Test Harness: Test options and evaluation metric == ")
seed = 7
scoring = 'accuracy'

== Evaluate Some Algorithms ==


== Create a Validation Dataset: Split-out validation dataset ==
== Test Harness: Test options and evaluation metric ==

# Spot Check Algorithms without feature reduction


# algo eval imports
from sklearn import model_selection
from sklearn.metrics import accuracy_score
from sklearn.linear_model import LogisticRegression
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC

# significance tests
import scipy.stats as stats
import math

print("== Build Models: build and evaluate models, Spot Check Algorithms ==")
datasets = []
datasets.append(('diabetes_attr', diabetes_attr, label))
datasets.append(('normalized_attr', normalized_attr, label))
datasets.append(('standardized_attr', standardized_attr, label))
datasets.append(('impute_attr', impute_attr, label))
datasets.append(('missing_attr', missing_attr, missing_label))
datasets.append(('undersampling_attr', undersampling_attr, undersampling_label))
datasets.append(('oversampling_attr', oversampling_attr, oversampling_label))

models = []
models.append(('LR', LogisticRegression()))  # based on imbalanced datasets and default parameters
models.append(('LDA', LinearDiscriminantAnalysis()))
models.append(('KNN', KNeighborsClassifier()))
models.append(('CART', DecisionTreeClassifier()))
models.append(('NB', GaussianNB()))
models.append(('RF', RandomForestClassifier()))
models.append(('SVM', SVC()))

print("eval metric: " + scoring)


for dataname, attributes, target in datasets:
   # evaluate each model in turn
   results = []
   names = []
   print("= " + dataname + " = ")
   print("algorithm,mean,std,signficance,p-val")
   for name, model in models:
       kfold = model_selection.KFold(n_splits=10, random_state=seed)
       cv_results = model_selection.cross_val_score(model, attributes, target,
cv=kfold, scoring=scoring)
       results.append(cv_results)
       #print("cv_results")
       #print(cv_results)
       #print(results[0])
       names.append(name)

       t, prob = stats.ttest_rel(a= cv_results,b= results[0])


       #print("LR vs ", name, t,prob)
       # Below 0.05, significant. Over 0.05, not significant.
       # http://blog.minitab.com/blog/understanding-statistics/what-can-you-say-when-your-p-value-is-greater-than-005
       statistically_different = (prob < 0.05)

       msg = "%s: %f (%f) %s %f" % (name, cv_results.mean(), cv_results.std(),
                                    statistically_different, prob)
       print(msg)

   # Compare Algorithms
   print(" == Select Best Model, Compare Algorithms == ")
   fig = plt.figure()
   fig.suptitle('Algorithm Comparison for ' + dataname)
   ax = fig.add_subplot(111)
   plt.boxplot(results)
   plt.ylabel(scoring)
   ax.set_xticklabels(names)
   plt.show()

== Build Models: build and evaluate models, Spot Check Algorithms ==


eval metric: accuracy
= diabetes_attr =
algorithm,mean,std,signficance,p-val
LR: 0.765636 (0.047532) False nan
LDA: 0.766951 (0.052975) False 0.820491
KNN: 0.713534 (0.064980) True 0.012597
CART: 0.687474 (0.063816) True 0.002141
NB: 0.747386 (0.043583) False 0.203854
RF: 0.744737 (0.064272) False 0.184573
SVM: 0.651025 (0.072141) True 0.000537
== Select Best Model, Compare Algorithms ==

 
= normalized_attr =
algorithm,mean,std,signficance,p-val
LR: 0.765619 (0.046566) False nan
LDA: 0.766951 (0.052975) False 0.828238
KNN: 0.748701 (0.062006) False 0.235048
CART: 0.700496 (0.048400) True 0.001043
NB: 0.747386 (0.043583) False 0.132240
RF: 0.746036 (0.058189) False 0.061152
SVM: 0.770813 (0.052488) False 0.309233
== Select Best Model, Compare Algorithms ==

= standardized_attr =
algorithm,mean,std,signficance,p-val
LR: 0.770813 (0.051248) False nan
LDA: 0.766951 (0.052975) False 0.526999
KNN: 0.738278 (0.039157) True 0.030019
CART: 0.687440 (0.063132) True 0.001104
NB: 0.747386 (0.043583) False 0.061474
RF: 0.757707 (0.060612) False 0.356660
SVM: 0.753913 (0.044789) True 0.022590
== Select Best Model, Compare Algorithms ==

 
= impute_attr =
algorithm,mean,std,signficance,p-val
LR: 0.764320 (0.048484) False nan
LDA: 0.766951 (0.052975) False 0.675096
KNN: 0.713534 (0.064980) True 0.014497
CART: 0.696617 (0.055419) True 0.000589
NB: 0.747386 (0.043583) False 0.243960
RF: 0.764234 (0.057085) False 0.996000
SVM: 0.651025 (0.072141) True 0.000669
== Select Best Model, Compare Algorithms ==

 
= missing_attr =
algorithm,mean,std,signficance,p-val
LR: 0.764337 (0.047320) False nan
LDA: 0.766951 (0.052975) False 0.640480
KNN: 0.713534 (0.064980) True 0.014492
CART: 0.691353 (0.063152) True 0.002221
NB: 0.747386 (0.043583) False 0.220344
RF: 0.734330 (0.062398) True 0.037339
SVM: 0.651025 (0.072141) True 0.000570
== Select Best Model, Compare Algorithms ==

= undersampling_attr =
algorithm,mean,std,signficance,p-val
LR: 0.749895 (0.053692) False nan
LDA: 0.751747 (0.071419) False 0.897521
KNN: 0.694165 (0.071292) True 0.009508
CART: 0.663941 (0.081059) True 0.016613
NB: 0.710936 (0.081741) False 0.058574
RF: 0.720335 (0.059445) False 0.210477
SVM: 0.458910 (0.072346) True 0.000001
== Select Best Model, Compare Algorithms ==

 
= oversampling_attr =
algorithm,mean,std,signficance,p-val
LR: 0.755000 (0.051039) False nan
LDA: 0.749000 (0.045486) False 0.111373
KNN: 0.768000 (0.027129) False 0.481468
CART: 0.762000 (0.028213) False 0.677050
NB: 0.713000 (0.046054) True 0.001323
RF: 0.814000 (0.045869) True 0.002612
SVM: 0.713000 (0.043829) False 0.090454
== Select Best Model, Compare Algorithms ==

# Spot Check Algorithms after feature reduction with pca


# algo eval imports
from sklearn import model_selection
from sklearn.metrics import accuracy_score
from sklearn.linear_model import LogisticRegression
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC

# significance tests
import scipy.stats as stats
import math

print("== Build Models: build and evaluate models, Spot Check Algorithms ==")
datasets = []
datasets.append(('diabetes_attr', diabetes_attr_pca, label))
datasets.append(('normalized_attr', normalized_attr_pca, label))
datasets.append(('standardized_attr', standardized_attr_pca, label))
datasets.append(('impute_attr', impute_attr, label))  # note: this uses the un-reduced attributes, not impute_attr_pca
datasets.append(('missing_attr', missing_attr_pca, missing_label))
datasets.append(('undersampling_attr', undersampling_attr_pca, undersampling_label))
datasets.append(('oversampling_attr', oversampling_attr_pca, oversampling_label))

models = []
models.append(('LR', LogisticRegression()))  # based on imbalanced datasets and default parameters
models.append(('LDA', LinearDiscriminantAnalysis()))
models.append(('KNN', KNeighborsClassifier()))
models.append(('CART', DecisionTreeClassifier()))
models.append(('NB', GaussianNB()))
models.append(('RF', RandomForestClassifier()))
models.append(('SVM', SVC()))

print("eval metric: " + scoring)


for dataname, attributes, target in datasets:
   # evaluate each model in turn
   results = []
   names = []
   print("= " + dataname + " = ")
   print("algorithm,mean,std,signficance,p-val")
   for name, model in models:
       kfold = model_selection.KFold(n_splits=10, random_state=seed)
       cv_results = model_selection.cross_val_score(model, attributes, target,
cv=kfold, scoring=scoring)
       results.append(cv_results)
       #print("cv_results")
       #print(cv_results)
       names.append(name)

       t, prob = stats.ttest_rel(a= cv_results,b= results[0])


       #print("LR vs ", name, t,prob)
       # Below 0.05, significant. Over 0.05, not significant.
       statistically_different = (prob < 0.05)
       msg = "%s: %f (%f) %s %f" % (name, cv_results.mean(), cv_results.std(),
                                    statistically_different, prob)
       print(msg)

   # Compare Algorithms
   print(" == Select Best Model, Compare Algorithms == ")
   fig = plt.figure()
   fig.suptitle('Algorithm Comparison for ' + dataname)
   ax = fig.add_subplot(111)
   plt.boxplot(results)
   plt.ylabel(scoring)
   ax.set_xticklabels(names)
   plt.show()

== Build Models: build and evaluate models, Spot Check Algorithms ==


eval metric: accuracy
= diabetes_attr =
algorithm,mean,std,signficance,p-val
LR: 0.760492 (0.049736) False nan
LDA: 0.753947 (0.051575) False 0.297575
KNN: 0.717413 (0.066119) True 0.023485
CART: 0.670540 (0.066640) True 0.001672
NB: 0.747454 (0.048375) False 0.195421
RF: 0.713551 (0.053480) True 0.009030
SVM: 0.651025 (0.072141) True 0.000815
== Select Best Model, Compare Algorithms ==

 
= normalized_attr =
algorithm,mean,std,signficance,p-val
LR: 0.763004 (0.052644) False nan
LDA: 0.759091 (0.049164) False 0.432778
KNN: 0.720010 (0.064269) True 0.001669
CART: 0.654802 (0.061094) True 0.000207
NB: 0.742208 (0.045907) False 0.140785
RF: 0.740858 (0.063580) False 0.123441
SVM: 0.772095 (0.054777) True 0.009535
== Select Best Model, Compare Algorithms ==

= standardized_attr =
algorithm,mean,std,signficance,p-val
LR: 0.748701 (0.033960) False nan
LDA: 0.742208 (0.031646) False 0.272912
KNN: 0.718763 (0.051160) True 0.033641
CART: 0.704323 (0.043981) True 0.007545
NB: 0.721343 (0.035560) True 0.035025
RF: 0.716131 (0.047187) True 0.008683
SVM: 0.733083 (0.046566) False 0.179971
== Select Best Model, Compare Algorithms ==

 
= impute_attr =
algorithm,mean,std,signficance,p-val
LR: 0.765636 (0.047532) False nan
LDA: 0.766951 (0.052975) False 0.820491
KNN: 0.713534 (0.064980) True 0.012597
CART: 0.687389 (0.049055) True 0.000286
NB: 0.747386 (0.043583) False 0.203854
RF: 0.751299 (0.053382) False 0.169036
SVM: 0.651025 (0.072141) True 0.000537
== Select Best Model, Compare Algorithms ==

 
= missing_attr =
algorithm,mean,std,signficance,p-val
LR: 0.760492 (0.049736) False nan
LDA: 0.753947 (0.051575) False 0.297575
KNN: 0.717413 (0.066119) True 0.023485
CART: 0.663995 (0.055143) True 0.000938
NB: 0.747454 (0.048375) False 0.195421
RF: 0.731716 (0.058688) False 0.065855
SVM: 0.651025 (0.072141) True 0.000815
== Select Best Model, Compare Algorithms ==

= undersampling_attr =
algorithm,mean,std,signficance,p-val
LR: 0.716387 (0.058158) False nan
LDA: 0.716282 (0.060502) False 0.983147
KNN: 0.690426 (0.067903) False 0.208882
CART: 0.673620 (0.065603) False 0.115618
NB: 0.701398 (0.058597) False 0.206335
RF: 0.686513 (0.074474) False 0.179856
SVM: 0.451468 (0.063015) True 0.000002
== Select Best Model, Compare Algorithms ==

 
= oversampling_attr =
algorithm,mean,std,signficance,p-val
LR: 0.711000 (0.038588) False nan
LDA: 0.717000 (0.040262) False 0.051003
KNN: 0.762000 (0.023580) True 0.000407
CART: 0.742000 (0.049960) False 0.135066
NB: 0.708000 (0.050951) False 0.802536
RF: 0.767000 (0.024920) True 0.001139
SVM: 0.684000 (0.052192) False 0.261975
== Select Best Model, Compare Algorithms ==
