
 

Outlier detection - Tukey IQR


https://towardsdatascience.com/local-outlier-factor-for-anomaly-detection-cc0c770d2ebe

Identifies extreme values in data

Outliers are defined as:


Values below Q1 - 1.5*(Q3 - Q1) or above Q3 + 1.5*(Q3 - Q1), where Q1 and Q3 are the first and third quartiles and Q3 - Q1 is the interquartile range (IQR)
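A minimal sketch of the rule in pandas (assuming pima_df is already loaded, and using 'Age' purely as an example column):

# Tukey IQR fences for a single column, e.g. 'Age'
q1, q3 = pima_df['Age'].quantile([0.25, 0.75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = pima_df['Age'][(pima_df['Age'] < lower) | (pima_df['Age'] > upper)]
print("fences:", lower, upper, "-> flagged", outliers.size, "values")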

Distance from the mean, measured in standard deviations, is another common method for detecting extreme values,
but it can be problematic:

It assumes normality
It is sensitive to very extreme values

from IPython.display import Image


Image(filename='Images/tukeyiqr.jpg')

# Box and whisker plots (univariate plots)
# Use these to spot outliers in the dataset
import matplotlib.pyplot as plt

X = pima_df.loc[:,'Pregnancies':'Age']
X.plot(kind='box', subplots=True, layout=(3, 3), figsize=(15, 10))
plt.show()
Why 1.5 times the width of the box for the outliers? Why does that particular value demarcate the
difference between "acceptable" and "unacceptable" values?
Because, when John Tukey introduced the box-and-whisker plot in 1977 to display these values, he picked
1.5×IQR as the demarcation line for outliers. This has worked well, so we've continued using that value ever
since. If you go further into statistics, you'll find that, for bell-curve-shaped data, this rule flags only a small
fraction of points as outliers, typically well under one percent.
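That claim is easy to check: for a normal distribution the quartiles sit at about ±0.6745 standard deviations, so the Tukey fences fall near ±2.7 standard deviations. A quick sketch using scipy.stats:

from scipy.stats import norm

q1, q3 = norm.ppf(0.25), norm.ppf(0.75)    # quartiles of the standard normal, about +/-0.6745
iqr = q3 - q1                              # about 1.349
upper_fence = q3 + 1.5 * iqr               # about 2.698 standard deviations
outlier_fraction = 2 * (1 - norm.cdf(upper_fence))
print(round(100 * outlier_fraction, 2), "% of normal data lies beyond the Tukey fences")  # about 0.7%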

Removing Outliers Using Standard Deviation


Our approach was to remove the outlier points by eliminating any points above (Mean + 2.75 * SD) or below
(Mean - 2.75 * SD) before plotting the frequencies.

import numpy as np

X = pima_df.loc[:,'Pregnancies':'Age']
# flag Age values more than 2.75 standard deviations from the mean
outlier_df = X['Age'][(X['Age'] - X['Age'].mean()).abs() > 2.75 * X['Age'].std()]
print(outlier_df)
out_indices = outlier_df.index
print(out_indices)

123   69
221   66
363   67
453   72
459   81
489   67
495   66
537   67
552   66
666   70
674   68
684   69
759   66
Name: Age, dtype: int64
Int64Index([123, 221, 363, 453, 459, 489, 495, 537, 552, 666, 674, 684, 759],
dtype='int64')

# Note: this assigns NaN to every column of the flagged rows, not just 'Age'
X.loc[out_indices] = np.nan
print(X)
# replace the flagged rows with the median of 'Age' (again, across all columns)
mid = X['Age'].median()
X.loc[out_indices] = mid
print(X)

    Pregnancies Glucose BloodPressure SkinThickness     Insulin \


0           6.0   148.0     72.000000       35.00000 155.548223  
1           1.0     85.0     66.000000       29.00000 155.548223  
2           8.0   183.0     64.000000       29.15342 155.548223  
3           1.0     89.0     66.000000       23.00000   94.000000  
4           0.0   137.0     40.000000       35.00000 168.000000  
5           5.0   116.0     74.000000       29.15342 155.548223  
6           3.0     78.0     50.000000       32.00000   88.000000  
7           10.0   115.0     72.405184       29.15342 155.548223  
8           2.0   197.0     70.000000       45.00000 543.000000  
9           8.0   125.0     96.000000       29.15342 155.548223  
10           4.0   110.0     92.000000       29.15342 155.548223  
11         10.0   168.0     74.000000       29.15342 155.548223  
12         10.0   139.0     80.000000       29.15342 155.548223  
13           1.0   189.0     60.000000       23.00000 846.000000  
14           5.0   166.0     72.000000       19.00000 175.000000  
15           7.0   100.0     72.405184       29.15342 155.548223  
16           0.0   118.0     84.000000       47.00000 230.000000  
17           7.0   107.0     74.000000       29.15342 155.548223  
18           1.0   103.0     30.000000       38.00000   83.000000  
19           1.0   115.0     70.000000       30.00000   96.000000  
20           3.0   126.0     88.000000       41.00000 235.000000  
21           8.0     99.0     84.000000       29.15342 155.548223  
22           7.0   196.0     90.000000       29.15342 155.548223  
23           9.0   119.0     80.000000       35.00000 155.548223  
24         11.0   143.0     94.000000       33.00000 146.000000  
25         10.0   125.0     70.000000       26.00000 115.000000  
26           7.0   147.0     76.000000       29.15342 155.548223  
27           1.0     97.0     66.000000       15.00000 140.000000  
28         13.0   145.0     82.000000       19.00000 110.000000  
29           5.0   117.0     92.000000       29.15342 155.548223  
..           ...     ...           ...           ...         ...  
738         2.0     99.0     60.000000       17.00000 160.000000  
739         1.0   102.0     74.000000       29.15342 155.548223  
740         11.0   120.0     80.000000       37.00000 150.000000  
741         3.0   102.0     44.000000       20.00000   94.000000  
742         1.0   109.0     58.000000       18.00000 116.000000  
743         9.0   140.0     94.000000       29.15342 155.548223  
744         13.0   153.0     88.000000       37.00000 140.000000  
745         12.0   100.0     84.000000       33.00000 105.000000  
746         1.0   147.0     94.000000       41.00000 155.548223  
747         1.0     81.0     74.000000       41.00000   57.000000  
748         3.0   187.0     70.000000       22.00000 200.000000  
749         6.0   162.0     62.000000       29.15342 155.548223  
750         4.0   136.0     70.000000       29.15342 155.548223  
751         1.0   121.0     78.000000       39.00000   74.000000  
752         3.0   108.0     62.000000       24.00000 155.548223  
753         0.0   181.0     88.000000       44.00000 510.000000  
754         8.0   154.0     78.000000       32.00000 155.548223  
755         1.0   128.0     88.000000       39.00000 110.000000  
756         7.0   137.0     90.000000       41.00000 155.548223  
757         0.0   123.0     72.000000       29.15342 155.548223  
758         1.0   106.0     76.000000       29.15342 155.548223  
759         NaN     NaN           NaN           NaN         NaN  
760         2.0     88.0     58.000000       26.00000   16.000000  
761         9.0   170.0     74.000000       31.00000 155.548223  
762         9.0     89.0     62.000000       29.15342 155.548223  
763         10.0   101.0     76.000000       48.00000 180.000000  
764         2.0   122.0     70.000000       27.00000 155.548223  
765         5.0   121.0     72.000000       23.00000 112.000000  
766         1.0   126.0     60.000000       29.15342 155.548223  
767         1.0     93.0     70.000000       31.00000 155.548223  

          BMI DiabetesPedigreeFunction   Age  


0   33.600000                     0.627 50.0  
1   26.600000                     0.351 31.0  
2   23.300000                     0.672 32.0  
3   28.100000                     0.167 21.0  
4   43.100000                     2.288 33.0  
5   25.600000                     0.201 30.0  
6   31.000000                     0.248 26.0  
7   35.300000                     0.134 29.0  
8   30.500000                     0.158 53.0  
9   32.457464                     0.232 54.0  
10   37.600000                     0.191 30.0  
11   38.000000                     0.537 34.0  
12   27.100000                     1.441 57.0  
13   30.100000                     0.398 59.0  
14   25.800000                     0.587 51.0  
15   30.000000                     0.484 32.0  
16   45.800000                     0.551 31.0  
17   29.600000                     0.254 31.0  
18   43.300000                     0.183 33.0  
19   34.600000                     0.529 32.0  
20   39.300000                     0.704 27.0  
21   35.400000                     0.388 50.0  
22   39.800000                     0.451 41.0  
23   29.000000                     0.263 29.0  
24   36.600000                     0.254 51.0  
25   31.100000                     0.205 41.0  
26   39.400000                     0.257 43.0  
27   23.200000                     0.487 22.0  
28   22.200000                     0.245 57.0  
29   34.100000                     0.337 38.0  
..         ...                       ...   ...  
738 36.600000                     0.453 21.0  
739 39.500000                     0.293 42.0  
740 42.300000                     0.785 48.0  
741 30.800000                     0.400 26.0  
742 28.500000                     0.219 22.0  
743 32.700000                     0.734 45.0  
744 40.600000                     1.174 39.0  
745 30.000000                     0.488 46.0  
746 49.300000                     0.358 27.0  
747 46.300000                     1.096 32.0  
748 36.400000                     0.408 36.0  
749 24.300000                     0.178 50.0  
750 31.200000                     1.182 22.0  
751 39.000000                     0.261 28.0  
752 26.000000                     0.223 25.0  
753 43.300000                     0.222 26.0  
754 32.400000                     0.443 45.0  
755 36.500000                     1.057 37.0  
756 32.000000                     0.391 39.0  
757 36.300000                     0.258 52.0  
758 37.500000                     0.197 26.0  
759       NaN                       NaN   NaN  
760 28.400000                     0.766 22.0  
761 44.000000                     0.403 43.0  
762 22.500000                     0.142 33.0  
763 32.900000                     0.171 63.0  
764 36.800000                     0.340 27.0  
765 26.200000                     0.245 30.0  
766 30.100000                     0.349 47.0  
767 30.400000                     0.315 23.0  

[768 rows x 8 columns]


    Pregnancies Glucose BloodPressure SkinThickness     Insulin \
0           6.0   148.0     72.000000       35.00000 155.548223  
1           1.0     85.0     66.000000       29.00000 155.548223  
2           8.0   183.0     64.000000       29.15342 155.548223  
3           1.0     89.0     66.000000       23.00000   94.000000  
4           0.0   137.0     40.000000       35.00000 168.000000  
5           5.0   116.0     74.000000       29.15342 155.548223  
6           3.0     78.0     50.000000       32.00000   88.000000  
7           10.0   115.0     72.405184       29.15342 155.548223  
8           2.0   197.0     70.000000       45.00000 543.000000  
9           8.0   125.0     96.000000       29.15342 155.548223  
10           4.0   110.0     92.000000       29.15342 155.548223  
11         10.0   168.0     74.000000       29.15342 155.548223  
12         10.0   139.0     80.000000       29.15342 155.548223  
13           1.0   189.0     60.000000       23.00000 846.000000  
14           5.0   166.0     72.000000       19.00000 175.000000  
15           7.0   100.0     72.405184       29.15342 155.548223  
16           0.0   118.0     84.000000       47.00000 230.000000  
17           7.0   107.0     74.000000       29.15342 155.548223  
18           1.0   103.0     30.000000       38.00000   83.000000  
19           1.0   115.0     70.000000       30.00000   96.000000  
20           3.0   126.0     88.000000       41.00000 235.000000  
21           8.0     99.0     84.000000       29.15342 155.548223  
22           7.0   196.0     90.000000       29.15342 155.548223  
23           9.0   119.0     80.000000       35.00000 155.548223  
24         11.0   143.0     94.000000       33.00000 146.000000  
25         10.0   125.0     70.000000       26.00000 115.000000  
26           7.0   147.0     76.000000       29.15342 155.548223  
27           1.0     97.0     66.000000       15.00000 140.000000  
28         13.0   145.0     82.000000       19.00000 110.000000  
29           5.0   117.0     92.000000       29.15342 155.548223  
..           ...     ...           ...           ...         ...  
738         2.0     99.0     60.000000       17.00000 160.000000  
739         1.0   102.0     74.000000       29.15342 155.548223  
740         11.0   120.0     80.000000       37.00000 150.000000  
741         3.0   102.0     44.000000       20.00000   94.000000  
742         1.0   109.0     58.000000       18.00000 116.000000  
743         9.0   140.0     94.000000       29.15342 155.548223  
744         13.0   153.0     88.000000       37.00000 140.000000  
745         12.0   100.0     84.000000       33.00000 105.000000  
746         1.0   147.0     94.000000       41.00000 155.548223  
747         1.0     81.0     74.000000       41.00000   57.000000  
748         3.0   187.0     70.000000       22.00000 200.000000  
749         6.0   162.0     62.000000       29.15342 155.548223  
750         4.0   136.0     70.000000       29.15342 155.548223  
751         1.0   121.0     78.000000       39.00000   74.000000  
752         3.0   108.0     62.000000       24.00000 155.548223  
753         0.0   181.0     88.000000       44.00000 510.000000  
754         8.0   154.0     78.000000       32.00000 155.548223  
755         1.0   128.0     88.000000       39.00000 110.000000  
756         7.0   137.0     90.000000       41.00000 155.548223  
757         0.0   123.0     72.000000       29.15342 155.548223  
758         1.0   106.0     76.000000       29.15342 155.548223  
759         29.0     29.0     29.000000       29.00000   29.000000  
760         2.0     88.0     58.000000       26.00000   16.000000  
761         9.0   170.0     74.000000       31.00000 155.548223  
762         9.0     89.0     62.000000       29.15342 155.548223  
763         10.0   101.0     76.000000       48.00000 180.000000  
764         2.0   122.0     70.000000       27.00000 155.548223  
765         5.0   121.0     72.000000       23.00000 112.000000  
766         1.0   126.0     60.000000       29.15342 155.548223  
767         1.0     93.0     70.000000       31.00000 155.548223  

          BMI DiabetesPedigreeFunction   Age  


0   33.600000                     0.627 50.0  
1   26.600000                     0.351 31.0  
2   23.300000                     0.672 32.0  
3   28.100000                     0.167 21.0  
4   43.100000                     2.288 33.0  
5   25.600000                     0.201 30.0  
6   31.000000                     0.248 26.0  
7   35.300000                     0.134 29.0  
8   30.500000                     0.158 53.0  
9   32.457464                     0.232 54.0  
10   37.600000                     0.191 30.0  
11   38.000000                     0.537 34.0  
12   27.100000                     1.441 57.0  
13   30.100000                     0.398 59.0  
14   25.800000                     0.587 51.0  
15   30.000000                     0.484 32.0  
16   45.800000                     0.551 31.0  
17   29.600000                     0.254 31.0  
18   43.300000                     0.183 33.0  
19   34.600000                     0.529 32.0  
20   39.300000                     0.704 27.0  
21   35.400000                     0.388 50.0  
22   39.800000                     0.451 41.0  
23   29.000000                     0.263 29.0  
24   36.600000                     0.254 51.0  
25   31.100000                     0.205 41.0  
26   39.400000                     0.257 43.0  
27   23.200000                     0.487 22.0  
28   22.200000                     0.245 57.0  
29   34.100000                     0.337 38.0  
..         ...                       ...   ...  
738 36.600000                     0.453 21.0  
739 39.500000                     0.293 42.0  
740 42.300000                     0.785 48.0  
741 30.800000                     0.400 26.0  
742 28.500000                     0.219 22.0  
743 32.700000                     0.734 45.0  
744 40.600000                     1.174 39.0  
745 30.000000                     0.488 46.0  
746 49.300000                     0.358 27.0  
747 46.300000                     1.096 32.0  
748 36.400000                     0.408 36.0  
749 24.300000                     0.178 50.0  
750 31.200000                     1.182 22.0  
751 39.000000                     0.261 28.0  
752 26.000000                     0.223 25.0  
753 43.300000                     0.222 26.0  
754 32.400000                     0.443 45.0  
755 36.500000                     1.057 37.0  
756 32.000000                     0.391 39.0  
757 36.300000                     0.258 52.0  
758 37.500000                     0.197 26.0  
759 29.000000                   29.000 29.0  
760 28.400000                     0.766 22.0  
761 44.000000                     0.403 43.0  
762 22.500000                     0.142 33.0  
763 32.900000                     0.171 63.0  
764 36.800000                     0.340 27.0  
765 26.200000                     0.245 30.0  
766 30.100000                     0.349 47.0  
767 30.400000                     0.315 23.0  

[768 rows x 8 columns]

X.plot(kind='box',subplots=True,layout=(3,3),figsize=(15,10))
plt.show()

# Plotting all of your data: bee swarm plots
import seaborn as sns

plt.figure(figsize=(15, 8))
_ = sns.swarmplot(data=pima_df)
plt.show()
 

import seaborn as sns


import matplotlib.pyplot as plt
plt.figure(figsize=(15, 6))  
_ = sns.swarmplot(x='Age',y='Glucose',hue='Outcome',data=pima_df)
plt.show()

#!pip install plotnine

Reference for exploring different smoothers in ggplot2


import warnings
warnings.filterwarnings("ignore")
from plotnine import *

ggplot(pima_df, aes(x='Age', y='Glucose', colour='Outcome')) + geom_point() + stat_smooth()

<ggplot: (7547461505)>

(ggplot(pima_df, aes(x='Age', y='Glucose', colour='BloodPressure'))
 + geom_point() + stat_smooth() + facet_wrap('~Outcome'))
 

<ggplot: (294288713)>

(ggplot(pima_df, aes(x='Age', y='Pregnancies'))
 + geom_point(aes(color='BMI')) + facet_wrap('~Outcome') + stat_smooth())
 

<ggplot: (294281529)>

# Correlation between the feature columns

corr = pima_df.loc[:, pima_df.columns != 'Outcome'].corr()
plt.figure(figsize=(12, 12))
sns.heatmap(corr, annot=True, cmap="Blues")

<matplotlib.axes._subplots.AxesSubplot at 0x1c1e1a6510>

 
We can observe that there are correlations between some columns:
Age is highly correlated with Pregnancies
Insulin is correlated with Glucose
Skin thickness is correlated with BMI

sns.lmplot(x='Age', y = 'Pregnancies', hue = 'Outcome', data = pima_df)

<seaborn.axisgrid.FacetGrid at 0x1c1e19bbd0>

 
 

sns.lmplot(x='Insulin', y = 'Glucose', hue = 'Outcome', data = pima_df)

<seaborn.axisgrid.FacetGrid at 0x1c1e3ee9d0>

 
sns.lmplot(x='BMI', y = 'SkinThickness', hue = 'Outcome', data = pima_df)

<seaborn.axisgrid.FacetGrid at 0x10f8e4390>

# Visualise a pairplot using seaborn, which plots each attribute against every other attribute
sns.pairplot(pima_df[['Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness',
                      'Insulin', 'BMI', 'DiabetesPedigreeFunction', 'Age']])

<seaborn.axisgrid.PairGrid at 0x10fb49750>

 
4. Data Scaling
Many machine learning algorithms expect the input (and sometimes the output) variables to be on a comparable
scale. Scaling can help methods that weight inputs in order to make a prediction, such as linear regression and
logistic regression, and it is practically required in methods that combine weighted inputs in complex ways,
such as artificial neural networks and deep learning.

We will discuss:

1. Normalize Data
2. Standardize Data
3. When to Normalize and Standardize

1. Normalize Data
Normalization can refer to different techniques depending on context. Here, we use normalization to refer
to rescaling an input variable to the range between 0 and 1. Normalization requires that you know the
minimum and maximum values for each attribute. This can be estimated from training data or specified
directly if you have deep knowledge of the problem domain. You can easily estimate the minimum and
maximum values for each attribute in a dataset by enumerating through the values.

Once we have estimates of the maximum and minimum allowed values for each column, we can normalize
the raw data to the range 0 and 1. The calculation to normalize a single value for a column is:
scaled value = (value - min)/(max - min)
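To make the formula concrete, here is a minimal hand-rolled version (a sketch; the 'Glucose' column is just an example) that should agree with the MinMaxScaler output below:

# Min-max scaling of a single column by hand, e.g. 'Glucose'
col = pima_df['Glucose']
col_min, col_max = col.min(), col.max()
scaled = (col - col_min) / (col_max - col_min)
print(scaled.min(), scaled.max())   # 0.0 and 1.0 by construction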

np.set_printoptions(precision=3)
array = np.array(pima_df.values)
print("== Generating data sets ==")

print("diabetes_attr: unchanged, original attributes")


diabetes_attr = array[:,0:8]
label = array[:,8] #unchanged across preprocessing?
diabetes_df = pd.DataFrame(diabetes_attr)

== Generating data sets ==


diabetes_attr: unchanged, original attributes

print("Normalized_attributes: range of 0 to 1")


from sklearn import preprocessing as preproc
scaler = preproc.MinMaxScaler().fit(diabetes_attr)
normalized_attr = scaler.transform(diabetes_attr)
normalized_df = pd.DataFrame(normalized_attr)
print(normalized_df.describe())

Normalized_attributes: range of 0 to 1
              0           1           2           3           4           5 \
count 768.000000 768.000000 768.000000 768.000000 768.000000 768.000000  
mean     0.226180   0.501205   0.493930   0.240798   0.170130   0.291564  
std     0.198210   0.196361   0.123432   0.095554   0.102189   0.140596  
min     0.000000   0.000000   0.000000   0.000000   0.000000   0.000000  
25%     0.058824   0.359677   0.408163   0.195652   0.129207   0.190184  
50%     0.176471   0.470968   0.491863   0.240798   0.170130   0.290389  
75%     0.352941   0.620968   0.571429   0.271739   0.170130   0.376278  
max     1.000000   1.000000   1.000000   1.000000   1.000000   1.000000  

              6           7  
count 768.000000 768.000000  
mean     0.168179   0.204015  
std     0.141473   0.196004  
min     0.000000   0.000000  
25%     0.070773   0.050000  
50%     0.125747   0.133333  
75%     0.234095   0.333333  
max     1.000000   1.000000  

2. Standardize Data

Standardization is a rescaling technique that centers the distribution of the data on the value 0 and scales
the standard deviation to the value 1. Together, the mean and the standard deviation can be used to
summarize a normal distribution, also called the Gaussian distribution or bell curve. It requires that the
mean and standard deviation of the values for each column be known prior to scaling. As with normalizing
above, we can estimate these values from training data, or use domain knowledge to specify them.

The standard deviation describes the average spread of values from the mean. It can be calculated as the
square root of the sum of squared differences between each value and the mean, divided by the number of
values minus 1.

Once the mean and standard deviation are calculated, we can easily standardize each value. The calculation
to standardize a single value for a column is:
standardized value = (value - mean)/stdev
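As a quick check of the formula (again a sketch on the example 'Glucose' column): note that sklearn's scale() uses the population standard deviation (ddof=0), while pandas' std() defaults to ddof=1:

# Standardizing one column by hand
col = pima_df['Glucose']
standardized = (col - col.mean()) / col.std(ddof=0)   # ddof=0 to match preproc.scale below
print(round(standardized.mean(), 6), round(standardized.std(ddof=0), 6))   # ~0.0 and 1.0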

print("standardized_attr: mean of 0 and stdev of 1")


#scaler = preproc.StandardScaler().fit(diabetes_attr)
#standardized_attr = scaler.transform(diabetes_attr)
standardized_attr = preproc.scale(diabetes_attr)
standardized_df = pd.DataFrame(standardized_attr)
print(standardized_df.describe())

standardized_attr: mean of 0 and stdev of 1


0 1 2 3 4 \
count 7.680000e+02 7.680000e+02 7.680000e+02 7.680000e+02 7.680000e+02
mean 2.544261e-17 -3.301757e-16 6.966722e-16 6.866252e-16 -2.352033e-16
std 1.000652e+00 1.000652e+00 1.000652e+00 1.000652e+00 1.000652e+00
min -1.141852e+00 -2.554131e+00 -4.004245e+00 -2.521670e+00 -1.665945e+00
25% -8.448851e-01 -7.212214e-01 -6.953060e-01 -4.727737e-01 -4.007289e-01
50% -2.509521e-01 -1.540881e-01 -1.675912e-02 8.087936e-16 -3.345079e-16
75% 6.399473e-01 6.103090e-01 6.282695e-01 3.240194e-01 -3.345079e-16
max 3.906578e+00 2.541850e+00 4.102655e+00 7.950467e+00 8.126238e+00

5 6 7
count 7.680000e+02 7.680000e+02 7.680000e+02
mean 3.090699e-16 2.398978e-16 1.857600e-16
std 1.000652e+00 1.000652e+00 1.000652e+00
min -2.075119e+00 -1.189553e+00 -1.041549e+00
25% -7.215397e-01 -6.889685e-01 -7.862862e-01
50% -8.363615e-03 -3.001282e-01 -3.608474e-01
75% 6.029301e-01 4.662269e-01 6.602056e-01
max 5.042087e+00 5.883565e+00 4.063716e+00

3. When to Normalize and Standardize


Standardization is a scaling technique that assumes your data conforms to a normal distribution. If a given
data attribute is normal or close to normal, this is probably the scaling method to use. It is good practice to
record the summary statistics used in the standardization process so that you can apply them when
standardizing future data that you may want to use with your model. Normalization is a scaling
technique that does not assume any specific distribution.

If your data is not normally distributed, consider normalizing it prior to applying your machine learning
algorithm. It is good practice to record the minimum and maximum values for each column used in the
normalization process, again, in case you need to normalize new data in the future to be used with your
model.
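One simple way to follow that advice (a sketch; the file name is illustrative and assumes joblib is available) is to persist the fitted scaler object so that the recorded minima/maxima, or means and standard deviations, are reapplied to any future data:

import joblib

# save the MinMaxScaler fitted earlier so the same min/max values are reused later
joblib.dump(scaler, 'pima_minmax_scaler.pkl')

# ... later, when new observations arrive ...
scaler_loaded = joblib.load('pima_minmax_scaler.pkl')
new_scaled = scaler_loaded.transform(diabetes_attr[:5])   # the first 5 rows stand in for unseen data
print(new_scaled)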

Handling Imbalanced Class Data

print("=== undersampling majority class by purging ===")

# Separate majority and minority classes


df_majority = pima_df[pima_df['Outcome']==0]
df_minority = pima_df[pima_df['Outcome']==1]

=== undersampling majority class by purging ===

print("df_minority['class'].size", df_minority['Outcome'].size)
from sklearn.utils import resample

# Downsample the majority class
df_majority_downsampled = resample(df_majority,
                                   replace=False,                          # sample without replacement
                                   n_samples=df_minority['Outcome'].size,  # match minority class size
                                   random_state=7)                         # reproducible results

# Combine minority class with downsampled majority class


df_downsampled = pd.concat([df_majority_downsampled, df_minority])

("df_minority['class'].size", 268)

print("undersampled", df_downsampled.groupby('Outcome').size())
df_downsampled = df_downsampled.sample(frac=1).reset_index(drop=True)
undersampling_attr = np.array(df_downsampled.values[:, 0:8])
undersampling_label = np.array(df_downsampled.values[:, 8])

('undersampled', Outcome
0    268
1    268
dtype: int64)

#!pip install imblearn

print("=== oversampling minority class with SMOTE ===")


from imblearn.over_sampling import SMOTE
sm = SMOTE(random_state=7)
x_val = pima_df.values[:,0:8]
y_val = pima_df.values[:,8]
X_res, y_res = sm.fit_sample(x_val, y_val)

features = ['Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI',
            'DiabetesPedigreeFunction', 'Age']
oversampled_df = pd.DataFrame(X_res)
oversampled_df.columns = features
oversampled_df = oversampled_df.assign(label=np.asarray(y_res))
oversampled_df = oversampled_df.sample(frac=1).reset_index(drop=True)

oversampling_attr = oversampled_df.values[:, 0:8]
oversampling_label = oversampled_df.values[:, 8]
print("oversampled_df", oversampled_df.groupby('label').size())

=== oversampling minority class with SMOTE ===


('oversampled_df', label
0.0 500
1.0 500
dtype: int64)

print("== treating missing values by purging or imputing ==")

## missing.arff
print("=== Assuming zero indicates missing values ===")
print("missing values by count")
print((pima_df[['Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness', 'Insulin',
                'BMI', 'DiabetesPedigreeFunction', 'Age']] == 0).sum())
print("=== purging ===")

# make a copy of the original data set
dataset_cp = pima_df.copy(deep=True)
# treat zeros as missing in columns where zero is not a valid value
dataset_cp[['Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI']] = \
    dataset_cp[['Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI']].replace(0, np.NaN)
== treating missing values by purging or imputing ==
=== Assuming zero indicates missing values ===
missing values by count
Pregnancies 111
Glucose 0
BloodPressure 0
SkinThickness 0
Insulin 0
BMI 0
DiabetesPedigreeFunction 0
Age 0
dtype: int64
=== purging ===

# drop the rows that contain missing values
dataset_missing = dataset_cp.dropna()

# summarize the number of rows and columns in the copied dataset
print(dataset_cp.shape)

missing_attr = np.array(dataset_missing.values[:, 0:8])
missing_label = np.array(dataset_missing.values[:, 8])

print("=== imputing by replacing missing values with mean column values ===")

dataset_impute = dataset_cp.fillna(dataset_cp.mean())
# count the number of NaN values in each column
print(dataset_impute.isnull().sum())

print("== addressing class imbalance under or over sampling ==")

impute_attr = np.array(dataset_impute.values[:,0:8])

(768, 9)
=== imputing by replacing missing values with mean column values ===
Pregnancies 0
Glucose 0
BloodPressure 0
SkinThickness 0
Insulin 0
BMI 0
DiabetesPedigreeFunction 0
Age 0
Outcome 0
dtype: int64
== addressing class imbalance under or over sampling ==

Dimensionality reduction using PCA


Principal component analysis (PCA) is a technique that transforms a dataset of many features into
principal components that "summarize" the variance that underlies the data

Each principal component is calculated by finding the linear combination of features that maximizes
variance, while also ensuring zero correlation with the previously calculated principal components

Use cases for modeling:

One of the most common dimensionality reduction techniques
Use it if there are too many features or if the observation-to-feature ratio is poor
It is also a potentially good option if there are many highly correlated variables in your dataset

Unfortunately, PCA makes models considerably harder to interpret

PCA as dimensionality reduction

Using PCA for dimensionality reduction involves zeroing out one or more of the smallest principal
components, resulting in a lower-dimensional projection of the data that preserves the maximal data
variance.
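To make that concrete, a small sketch (reusing the diabetes_attr array defined above; keeping 2 components is an arbitrary choice) projects the data down and reconstructs it, reporting how much variance the retained components preserve:

from sklearn.decomposition import PCA

# project onto the top 2 components, then map back to the original 8-dimensional space
pca2 = PCA(n_components=2)
projected = pca2.fit_transform(diabetes_attr)
reconstructed = pca2.inverse_transform(projected)
print("variance retained:", pca2.explained_variance_ratio_.sum())
print("mean squared reconstruction error:", np.mean((diabetes_attr - reconstructed) ** 2))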
Choosing the number of components

A vital part of using PCA in practice is estimating how many components are needed to describe the data.
This can be determined by looking at the cumulative explained variance ratio as a function of the number of
components:

# Use PCA from sklearn.decompostion to find principal components


from sklearn.decomposition import PCA
pca = PCA()
X_pca = pd.DataFrame(pca.fit_transform(pima_df))
plt.plot(np.cumsum(pca.explained_variance_ratio_))
plt.xlabel('number of components')
plt.ylabel('cumulative explained variance');

 
pca = PCA(n_components=5)
pca.fit(diabetes_attr)
diabetes_attr_pca = pca.transform(diabetes_attr)
print("original shape:   ", diabetes_attr.shape)
print("transformed shape:", diabetes_attr_pca.shape)

('original shape: ', (768, 8))


('transformed shape:', (768, 5))

pca.fit(normalized_attr)
normalized_attr_pca = pca.transform(normalized_attr)

pca.fit(standardized_attr)
standardized_attr_pca = pca.transform(standardized_attr)

pca.fit(impute_attr)
impute_attr_pca = pca.transform(impute_attr)

pca.fit(missing_attr)
missing_attr_pca = pca.transform(missing_attr)

pca.fit(undersampling_attr)
undersampling_attr_pca = pca.transform(undersampling_attr)

pca.fit(oversampling_attr)
oversampling_attr_pca = pca.transform(oversampling_attr)

Evaluate Algorithms

print(" == Evaluate Some Algorithms == ")
# Split-out validation dataset
print(" == Create a Validation Dataset: Split-out validation dataset == ")

# Test options and evaluation metric
print(" == Test Harness: Test options and evaluation metric == ")
seed = 7
scoring = 'accuracy'

== Evaluate Some Algorithms ==


== Create a Validation Dataset: Split-out validation dataset ==
== Test Harness: Test options and evaluation metric ==

# Spot Check Algorithms without feature reduction


# algo eval imports
from sklearn import model_selection
from sklearn.metrics import accuracy_score
from sklearn.linear_model import LogisticRegression
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC

# significance tests
import scipy.stats as stats
import math

print("== Build Models: build and evaluate models, Spot Check Algorithms ==")
datasets = []
datasets.append(('diabetes_attr', diabetes_attr, label))
datasets.append(('normalized_attr', normalized_attr, label))
datasets.append(('standardized_attr', standardized_attr, label))
datasets.append(('impute_attr', impute_attr, label))
datasets.append(('missing_attr', missing_attr, missing_label))
datasets.append(('undersampling_attr', undersampling_attr, undersampling_label))
datasets.append(('oversampling_attr', oversampling_attr, oversampling_label))

models = []
models.append(('LR', LogisticRegression()))  # based on imbalanced datasets and default parameters
models.append(('LDA', LinearDiscriminantAnalysis()))
models.append(('KNN', KNeighborsClassifier()))
models.append(('CART', DecisionTreeClassifier()))
models.append(('NB', GaussianNB()))
models.append(('RF', RandomForestClassifier()))
models.append(('SVM', SVC()))

print("eval metric: " + scoring)


for dataname, attributes, target in datasets:
   # evaluate each model in turn
   results = []
   names = []
   print("= " + dataname + " = ")
   print("algorithm,mean,std,signficance,p-val")
   for name, model in models:
       kfold = model_selection.KFold(n_splits=10, random_state=seed)
       cv_results = model_selection.cross_val_score(model, attributes, target,
cv=kfold, scoring=scoring)
       results.append(cv_results)
       #print("cv_results")
       #print(cv_results)
       #print(results[0])
       names.append(name)

       t, prob = stats.ttest_rel(a= cv_results,b= results[0])


       #print("LR vs ", name, t,prob)
       # Below 0.05, significant. Over 0.05, not significant.
       # http://blog.minitab.com/blog/understanding-statistics/what-can-you-say-when-your-p-value-is-greater-than-005
       statistically_different = (prob < 0.05)

       msg = "%s: %f (%f) %s %f" % (name, cv_results.mean(), cv_results.std(),
                                    statistically_different, prob)
       print(msg)

   # Compare Algorithms
   print(" == Select Best Model, Compare Algorithms == ")
   fig = plt.figure()
   fig.suptitle('Algorithm Comparison for ' + dataname)
   ax = fig.add_subplot(111)
   plt.boxplot(results)
   plt.ylabel(scoring)
   ax.set_xticklabels(names)
   plt.show()

== Build Models: build and evaluate models, Spot Check Algorithms ==


eval metric: accuracy
= diabetes_attr =
algorithm,mean,std,signficance,p-val
LR: 0.765636 (0.047532) False nan
LDA: 0.766951 (0.052975) False 0.820491
KNN: 0.713534 (0.064980) True 0.012597
CART: 0.687474 (0.063816) True 0.002141
NB: 0.747386 (0.043583) False 0.203854
RF: 0.744737 (0.064272) False 0.184573
SVM: 0.651025 (0.072141) True 0.000537
== Select Best Model, Compare Algorithms ==

 
= normalized_attr =
algorithm,mean,std,signficance,p-val
LR: 0.765619 (0.046566) False nan
LDA: 0.766951 (0.052975) False 0.828238
KNN: 0.748701 (0.062006) False 0.235048
CART: 0.700496 (0.048400) True 0.001043
NB: 0.747386 (0.043583) False 0.132240
RF: 0.746036 (0.058189) False 0.061152
SVM: 0.770813 (0.052488) False 0.309233
== Select Best Model, Compare Algorithms ==

= standardized_attr =
algorithm,mean,std,signficance,p-val
LR: 0.770813 (0.051248) False nan
LDA: 0.766951 (0.052975) False 0.526999
KNN: 0.738278 (0.039157) True 0.030019
CART: 0.687440 (0.063132) True 0.001104
NB: 0.747386 (0.043583) False 0.061474
RF: 0.757707 (0.060612) False 0.356660
SVM: 0.753913 (0.044789) True 0.022590
== Select Best Model, Compare Algorithms ==

 
= impute_attr =
algorithm,mean,std,signficance,p-val
LR: 0.764320 (0.048484) False nan
LDA: 0.766951 (0.052975) False 0.675096
KNN: 0.713534 (0.064980) True 0.014497
CART: 0.696617 (0.055419) True 0.000589
NB: 0.747386 (0.043583) False 0.243960
RF: 0.764234 (0.057085) False 0.996000
SVM: 0.651025 (0.072141) True 0.000669
== Select Best Model, Compare Algorithms ==

 
= missing_attr =
algorithm,mean,std,signficance,p-val
LR: 0.764337 (0.047320) False nan
LDA: 0.766951 (0.052975) False 0.640480
KNN: 0.713534 (0.064980) True 0.014492
CART: 0.691353 (0.063152) True 0.002221
NB: 0.747386 (0.043583) False 0.220344
RF: 0.734330 (0.062398) True 0.037339
SVM: 0.651025 (0.072141) True 0.000570
== Select Best Model, Compare Algorithms ==

= undersampling_attr =
algorithm,mean,std,signficance,p-val
LR: 0.749895 (0.053692) False nan
LDA: 0.751747 (0.071419) False 0.897521
KNN: 0.694165 (0.071292) True 0.009508
CART: 0.663941 (0.081059) True 0.016613
NB: 0.710936 (0.081741) False 0.058574
RF: 0.720335 (0.059445) False 0.210477
SVM: 0.458910 (0.072346) True 0.000001
== Select Best Model, Compare Algorithms ==

 
= oversampling_attr =
algorithm,mean,std,signficance,p-val
LR: 0.755000 (0.051039) False nan
LDA: 0.749000 (0.045486) False 0.111373
KNN: 0.768000 (0.027129) False 0.481468
CART: 0.762000 (0.028213) False 0.677050
NB: 0.713000 (0.046054) True 0.001323
RF: 0.814000 (0.045869) True 0.002612
SVM: 0.713000 (0.043829) False 0.090454
== Select Best Model, Compare Algorithms ==

# Spot Check Algorithms after feature reduction with pca


# algo eval imports
from sklearn import model_selection
from sklearn.metrics import accuracy_score
from sklearn.linear_model import LogisticRegression
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC

# significance tests
import scipy.stats as stats
import math

print("== Build Models: build and evaluate models, Spot Check Algorithms ==")
datasets = []
datasets.append(('diabetes_attr', diabetes_attr_pca, label))
datasets.append(('normalized_attr', normalized_attr_pca, label))
datasets.append(('standardized_attr', standardized_attr_pca, label))
datasets.append(('impute_attr', impute_attr, label))  # note: this uses the un-reduced attributes, not impute_attr_pca
datasets.append(('missing_attr', missing_attr_pca, missing_label))
datasets.append(('undersampling_attr', undersampling_attr_pca, undersampling_label))
datasets.append(('oversampling_attr', oversampling_attr_pca, oversampling_label))

models = []
models.append(('LR', LogisticRegression()))  # based on imbalanced datasets and default parameters
models.append(('LDA', LinearDiscriminantAnalysis()))
models.append(('KNN', KNeighborsClassifier()))
models.append(('CART', DecisionTreeClassifier()))
models.append(('NB', GaussianNB()))
models.append(('RF', RandomForestClassifier()))
models.append(('SVM', SVC()))

print("eval metric: " + scoring)


for dataname, attributes, target in datasets:
   # evaluate each model in turn
   results = []
   names = []
   print("= " + dataname + " = ")
   print("algorithm,mean,std,signficance,p-val")
   for name, model in models:
       kfold = model_selection.KFold(n_splits=10, random_state=seed)
       cv_results = model_selection.cross_val_score(model, attributes, target,
cv=kfold, scoring=scoring)
       results.append(cv_results)
       #print("cv_results")
       #print(cv_results)
       names.append(name)

       t, prob = stats.ttest_rel(a= cv_results,b= results[0])


       #print("LR vs ", name, t,prob)
       # Below 0.05, significant. Over 0.05, not significant.
       statistically_different = (prob < 0.05)
       msg = "%s: %f (%f) %s %f" % (name, cv_results.mean(), cv_results.std(),
                                    statistically_different, prob)
       print(msg)

   # Compare Algorithms
   print(" == Select Best Model, Compare Algorithms == ")
   fig = plt.figure()
   fig.suptitle('Algorithm Comparison for ' + dataname)
   ax = fig.add_subplot(111)
   plt.boxplot(results)
   plt.ylabel(scoring)
   ax.set_xticklabels(names)
   plt.show()

== Build Models: build and evaluate models, Spot Check Algorithms ==


eval metric: accuracy
= diabetes_attr =
algorithm,mean,std,signficance,p-val
LR: 0.760492 (0.049736) False nan
LDA: 0.753947 (0.051575) False 0.297575
KNN: 0.717413 (0.066119) True 0.023485
CART: 0.670540 (0.066640) True 0.001672
NB: 0.747454 (0.048375) False 0.195421
RF: 0.713551 (0.053480) True 0.009030
SVM: 0.651025 (0.072141) True 0.000815
== Select Best Model, Compare Algorithms ==

 
= normalized_attr =
algorithm,mean,std,signficance,p-val
LR: 0.763004 (0.052644) False nan
LDA: 0.759091 (0.049164) False 0.432778
KNN: 0.720010 (0.064269) True 0.001669
CART: 0.654802 (0.061094) True 0.000207
NB: 0.742208 (0.045907) False 0.140785
RF: 0.740858 (0.063580) False 0.123441
SVM: 0.772095 (0.054777) True 0.009535
== Select Best Model, Compare Algorithms ==

= standardized_attr =
algorithm,mean,std,signficance,p-val
LR: 0.748701 (0.033960) False nan
LDA: 0.742208 (0.031646) False 0.272912
KNN: 0.718763 (0.051160) True 0.033641
CART: 0.704323 (0.043981) True 0.007545
NB: 0.721343 (0.035560) True 0.035025
RF: 0.716131 (0.047187) True 0.008683
SVM: 0.733083 (0.046566) False 0.179971
== Select Best Model, Compare Algorithms ==

 
= impute_attr =
algorithm,mean,std,signficance,p-val
LR: 0.765636 (0.047532) False nan
LDA: 0.766951 (0.052975) False 0.820491
KNN: 0.713534 (0.064980) True 0.012597
CART: 0.687389 (0.049055) True 0.000286
NB: 0.747386 (0.043583) False 0.203854
RF: 0.751299 (0.053382) False 0.169036
SVM: 0.651025 (0.072141) True 0.000537
== Select Best Model, Compare Algorithms ==

 
= missing_attr =
algorithm,mean,std,signficance,p-val
LR: 0.760492 (0.049736) False nan
LDA: 0.753947 (0.051575) False 0.297575
KNN: 0.717413 (0.066119) True 0.023485
CART: 0.663995 (0.055143) True 0.000938
NB: 0.747454 (0.048375) False 0.195421
RF: 0.731716 (0.058688) False 0.065855
SVM: 0.651025 (0.072141) True 0.000815
== Select Best Model, Compare Algorithms ==

= undersampling_attr =
algorithm,mean,std,signficance,p-val
LR: 0.716387 (0.058158) False nan
LDA: 0.716282 (0.060502) False 0.983147
KNN: 0.690426 (0.067903) False 0.208882
CART: 0.673620 (0.065603) False 0.115618
NB: 0.701398 (0.058597) False 0.206335
RF: 0.686513 (0.074474) False 0.179856
SVM: 0.451468 (0.063015) True 0.000002
== Select Best Model, Compare Algorithms ==

 
= oversampling_attr =
algorithm,mean,std,signficance,p-val
LR: 0.711000 (0.038588) False nan
LDA: 0.717000 (0.040262) False 0.051003
KNN: 0.762000 (0.023580) True 0.000407
CART: 0.742000 (0.049960) False 0.135066
NB: 0.708000 (0.050951) False 0.802536
RF: 0.767000 (0.024920) True 0.001139
SVM: 0.684000 (0.052192) False 0.261975
== Select Best Model, Compare Algorithms ==
