Let's understand: What is Costa Rica all about? What is the IDB? What is a Proxy Means Test? And what is the focus of the majority of the social programs that help Costa Rica?
2.2 IDB
The Inter-American Development Bank (IDB) is a cooperative development bank founded in 1959
to accelerate the economic and social development of its Latin American and Caribbean member
countries.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import plotly.graph_objs as go
from plotly.offline import init_notebook_mode, iplot
from sklearn.model_selection import train_test_split, cross_val_score, KFold
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import LabelEncoder
import warnings
warnings.filterwarnings("ignore")
init_notebook_mode(connected=True)
[ ]: train_data = pd.read_csv("train.csv")
test_data = pd.read_csv("test.csv")
[ ]: train_data.head()
[ ]: train_data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9557 entries, 0 to 9556
Columns: 143 entries, Id to Target
dtypes: float64(8), int64(130), object(5)
memory usage: 10.4+ MB
[ ]: train_data.shape
[ ]: (9557, 143)
[ ]: train_data['Target'].value_counts()
[ ]: 4 5996
2 1597
3 1209
1 755
Name: Target, dtype: int64
[ ]: test_data.head()
[ ]: test_data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 23856 entries, 0 to 23855
Columns: 142 entries, Id to agesq
dtypes: float64(8), int64(129), object(5)
memory usage: 25.8+ MB
[ ]: test_data.shape
[ ]: (23856, 142)
3.0.4 OBSERVATION:
Looking at the train and test datasets, we notice the following:
Train dataset: Rows: 9557 entries, 0 to 9556; Columns: 143 entries, Id to Target; dtypes: float64(8), int64(130), object(5)
Test dataset: Rows: 23856 entries, 0 to 23855; Columns: 142 entries, Id to agesq; dtypes: float64(8), int64(129), object(5)
The important piece of information here is that the 'Target' feature is absent from the test dataset.
There are 5 object type, 130 (train set) / 129 (test set) integer type, and 8 float type features. Let's
look at those features next.
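As a quick sketch of how such a dtype breakdown is produced, here is the same pattern on a toy frame (the columns are illustrative stand-ins, not the real dataset):

```python
import pandas as pd

# Toy frame mirroring the structure: an object id, an integer flag, a float measure
df = pd.DataFrame({
    "Id": ["ID_1", "ID_2"],        # object
    "rooms": [3, 4],               # int64
    "v2a1": [80000.0, None],       # float64 (the None makes the column float)
})

# One-line breakdown of column dtypes, as summarised by .info() above
counts = df.dtypes.value_counts()
print(counts)
```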
Integer Type:
Index(['hacdor', 'rooms', 'hacapo', 'v14a', 'refrig', 'v18q', 'r4h1', 'r4h2',
'r4h3', 'r4m1',
…
'area1', 'area2', 'age', 'SQBescolari', 'SQBage', 'SQBhogar_total',
'SQBedjefe', 'SQBhogar_nin', 'agesq', 'Target'],
dtype='object', length=130)
Float Type:
Index(['v2a1', 'v18q1', 'rez_esc', 'meaneduc', 'overcrowding',
'SQBovercrowding', 'SQBdependency', 'SQBmeaned'],
dtype='object')
Object Type:
Index(['Id', 'idhogar', 'dependency', 'edjefe', 'edjefa'], dtype='object')
3.0.6 Note:
So far, we have explored the train and test datasets. The data we are provided with might not be in proper shape; most of the time it is messy. We have to make sure the data is in good form and shape before we build the model, or the model's accuracy will suffer and the model will be useless.
Therefore, let's dive into the complete data and make it ready for model building.
[ ]: rez_esc 7928
v18q1 7342
v2a1 6860
SQBmeaned 5
meaneduc 5
dtype: int64
[ ]: print(per_null_train.head())
rez_esc 82.954902
v18q1 76.823271
v2a1 71.779847
meaneduc 0.052318
SQBmeaned 0.052318
dtype: float64
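The counts and percentages above follow the usual `isnull().sum()` pattern; a minimal sketch on a toy frame (column names are illustrative stand-ins):

```python
import pandas as pd
import numpy as np

# Toy frame with missing values
df = pd.DataFrame({
    "rez_esc": [np.nan, np.nan, 1.0, np.nan],
    "v2a1":    [80000.0, np.nan, 130000.0, 200000.0],
    "rooms":   [3, 4, 5, 5],
})

# Absolute null counts per column, largest first
null_counts = df.isnull().sum().sort_values(ascending=False)
# The same counts as a percentage of all rows
per_null = null_counts / len(df) * 100
print(null_counts)
print(per_null)
```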
3.1.4 4.2 Data Preparation
Let's apply common sense to data cleaning. Data cleaning has to be approached by looking at the datatypes first of all. Here we are able to see 5 columns with null values in the entire dataset. Let's look into the datatypes and use common sense to fill the null values, transforming the dataset into a proper form and shape.
Fields with missing values:
rez_esc = Years behind in school
SQBmeaned = Square of the mean years of education of adults (>=18) in the household
3.1.5 4.2.1 Let’s explore the null values with respect to datatypes
4.2.1.1 Any null value in Integer datatype variables?
[ ]: train_data.select_dtypes('int64').head()
[ ]: hacdor rooms hacapo v14a refrig v18q r4h1 r4h2 r4h3 r4m1 … \
0 0 3 0 1 1 0 0 1 1 0 …
1 0 4 0 1 1 1 0 1 1 0 …
2 0 8 0 1 1 0 0 0 0 0 …
3 0 5 0 1 1 1 0 2 2 1 …
4 0 5 0 1 1 1 0 2 2 1 …
[ ]: null_counts=train_data.select_dtypes('int64').isnull().sum()
null_counts[null_counts > 0]
[ ]: Series([], dtype: int64)
[ ]: train_data.select_dtypes('float64').head()
[ ]: … SQBdependency SQBmeaned
0 0.0 100.0
1 64.0 144.0
2 64.0 121.0
3 1.0 121.0
4 1.0 121.0
[ ]: null_counts=train_data.select_dtypes('float64').isnull().sum()
null_counts[null_counts > 0]
[ ]: v2a1 6860
v18q1 7342
rez_esc 7928
meaneduc 5
SQBmeaned 5
dtype: int64
3.2 OBSERVATION:
Looking at the different types of data and the null values for each feature, we found the following:
1. No null values for integer type features.
2. No null values for object type features.
3. For float types, however, we observed null values in:
rez_esc = Years behind in school
v18q1 = Number of tablets household owns
v2a1 = Monthly rent payment
meaneduc = Average years of education for adults (18+)
SQBmeaned = Square of the mean years of education of adults (>=18) in the household
Before treating the null values, it is good to check whether any column contains mixed values. It is known that any column with mixed values should be of the 'object' datatype.
The object datatype columns are:
[‘Id’, ‘idhogar’, ‘dependency’, ‘edjefe’, ‘edjefa’]
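One way to confirm which object columns actually hold mixed values is to coerce them to numeric and inspect what fails; a sketch on toy values:

```python
import pandas as pd

# Toy object column mixing numeric strings with yes/no flags,
# like 'dependency', 'edjefe' and 'edjefa' in this dataset
s = pd.Series(["yes", "no", ".5", "2", "8"])

# Entries that fail numeric conversion are the non-numeric "mixed" values
non_numeric = s[pd.to_numeric(s, errors="coerce").isnull()]
print(non_numeric.unique())
```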
[ ]: # Object datatypes:
train_data.select_dtypes('object').columns
[ ]: Index(['Id', 'idhogar', 'dependency', 'edjefe', 'edjefa'], dtype='object')
3.2.2 4.3.1 Let's also fix the columns with mixed values, if there are any.
[ ]: train_data['Id'].value_counts()
[ ]: ID_84e4721d4 1
ID_90f0db8b9 1
ID_e5fcc8bda 1
ID_a2aa2c6c5 1
ID_2e489189e 1
..
ID_7e513a2ad 1
ID_0741bb9cd 1
ID_118150b9d 1
ID_877d9a4a4 1
ID_f1fe0dbb5 1
Name: Id, Length: 9557, dtype: int64
[ ]: train_data['idhogar'].value_counts()
[ ]: fd8a6d014 13
ae6cf0558 12
0c7436de6 12
4476ccd4c 11
6b35cdcf0 11
..
8161a5155 1
3d56274f9 1
d5471169a 1
1bc617b23 1
60fcd83c7 1
Name: idhogar, Length: 2988, dtype: int64
[ ]: train_data['dependency'].value_counts()
[ ]: yes 2192
no 1747
.5 1497
2 730
1.5 713
.33333334 598
.66666669 487
8 378
.25 260
3 236
4 100
.75 98
.2 90
.40000001 84
1.3333334 84
2.5 77
5 24
3.5 18
.80000001 18
1.25 18
2.25 13
.71428573 12
1.75 11
.83333331 11
.22222222 11
1.2 11
.2857143 9
.60000002 8
1.6666666 8
.16666667 7
6 7
Name: dependency, dtype: int64
[ ]: train_data['edjefe'].value_counts()
[ ]: no 3762
6 1845
11 751
9 486
3 307
15 285
8 257
7 234
5 222
14 208
17 202
2 194
4 137
16 134
yes 123
12 113
10 111
13 103
21 43
18 19
19 14
20 7
Name: edjefe, dtype: int64
[ ]: train_data['edjefa'].value_counts()
[ ]: no 6230
6 947
11 399
9 237
8 217
15 188
7 179
5 176
3 152
4 136
14 120
16 113
10 96
2 84
17 76
12 72
yes 69
13 52
21 5
19 4
18 3
20 2
Name: edjefa, dtype: int64
3.3 Observation:
Looking at the above results, we can see that there are 3 object datatype columns with mixed values:
[‘dependency’, ‘edjefe’, ‘edjefa’]
According to the documentation for these columns:
• dependency: Dependency rate, calculated = (number of members of the household younger
than 19 or older than 64)/(number of members of the household between 19 and 64)
• edjefe: years of education of male head of household, based on the interaction of escolari
(years of education), head of household and gender, yes=1 and no=0
• edjefa: years of education of female head of household, based on the interaction of escolari
(years of education), head of household and gender, yes=1 and no=0
For these three variables, "yes" = 1 and "no" = 0. We can correct the variables using a
mapping and convert them to floats.
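The mapping-and-cast step can be sketched on a toy column (the sample values come from the 'dependency' output above):

```python
import pandas as pd

mapping = {"yes": 1, "no": 0}

# Toy 'dependency'-style column: yes/no flags mixed with numeric strings
s = pd.Series(["yes", "no", ".5", "2"])

# Replace the flags, then cast everything to float
fixed = s.replace(mapping).astype(float)
print(fixed.tolist())
```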
[ ]: mapping = {'yes': 1, 'no': 0}
for df in [train_data, test_data]:
    df['dependency'] = df['dependency'].replace(mapping).astype(float)
    df['edjefe'] = df['edjefe'].replace(mapping).astype(float)
    df['edjefa'] = df['edjefa'].replace(mapping).astype(float)
[ ]: train_data.select_dtypes('object').columns
[ ]: Index(['Id', 'idhogar'], dtype='object')
[ ]: pd.set_option('display.max_columns', None)
[ ]: train_data.describe()
[ ]: v2a1 hacdor rooms hacapo v14a \
count 2.697000e+03 9557.000000 9557.000000 9557.000000 9557.000000
mean 1.652316e+05 0.038087 4.955530 0.023648 0.994768
std 1.504571e+05 0.191417 1.468381 0.151957 0.072145
min 0.000000e+00 0.000000 1.000000 0.000000 0.000000
25% 8.000000e+04 0.000000 4.000000 0.000000 1.000000
50% 1.300000e+05 0.000000 5.000000 0.000000 1.000000
75% 2.000000e+05 0.000000 6.000000 0.000000 1.000000
max 2.353477e+06 1.000000 11.000000 1.000000 1.000000
… (summary statistics for the remaining columns elided) …
Target
count 9557.000000
mean 3.302292
std 1.009565
min 1.000000
25% 3.000000
50% 4.000000
75% 4.000000
max 4.000000
[ ]: train_data['rez_esc'].value_counts()
[ ]: 0.0 1211
1.0 227
2.0 98
3.0 55
4.0 29
5.0 9
Name: rez_esc, dtype: int64
‘rez_esc’ - years behind in school. For the families with a null value, it is possible that
they have no children currently in school. Let's test this by comparing the ages of
those who have a missing value in this column with the ages of those who do not.
[ ]: # The column related to 'rez_esc' (years behind in school) is 'age' (age in years).
# Let's look at the rows with non-null values first.
train_data[train_data['rez_esc'].notnull()]['age'].describe()
[ ]: count 1629.000000
mean 12.258441
std 3.218325
min 7.000000
25% 9.000000
50% 12.000000
75% 15.000000
max 17.000000
Name: age, dtype: float64
[ ]: # From the above, we see that 'rez_esc' has a value when age is between 7 and 17.
# Let's confirm by looking at the ages where 'rez_esc' is null.
train_data.loc[train_data['rez_esc'].isnull()]['age'].describe()
[ ]: count 7928.000000
mean 38.833249
std 20.989486
min 0.000000
25% 24.000000
50% 38.000000
75% 54.000000
max 97.000000
Name: age, dtype: float64
3.4 Statement:
Anyone younger than 7 or older than 17 presumably has no years behind, so the value should be set to 0. For this variable, if an individual has a missing value and is over 19, or is younger than 7, we can set it to zero.
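The conditional fill described above can be sketched as follows (toy data):

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({
    "age":     [5, 10, 30, 60],
    "rez_esc": [np.nan, 2.0, np.nan, np.nan],
})

# Outside the 7-17 school-age window, a missing 'years behind' is taken as 0
mask = df["rez_esc"].isnull() & ((df["age"] < 7) | (df["age"] > 17))
df.loc[mask, "rez_esc"] = 0
print(df["rez_esc"].tolist())
```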
3.4.1 Cross Checking once if there’s any null value when the age is between 7 and 17
with respect to null values of ‘years behind school’ column
[ ]: train_data.loc[train_data['rez_esc'].isnull()
               & (train_data['age'] >= 7) & (train_data['age'] <= 17), 'age'].describe()
[ ]: count 1.0
mean 10.0
std NaN
min 10.0
25% 10.0
50% 10.0
75% 10.0
max 10.0
Name: age, dtype: float64
There is one value that has Null for the ‘behind in school’ column with age between 7 and 17 with
the age 10. Let’s find the complete record of this household.
[ ]: train_data[(train_data['age'] ==10) & train_data['rez_esc'].isnull()].head()
v18q1 r4h1 r4h2 r4h3 r4m1 r4m2 r4m3 r4t1 r4t2 r4t3 tamhog \
2514 1.0 0 1 1 1 1 2 1 2 3 3
… (intermediate columns elided) …
tipovivi5 computer television mobilephone qmobilephone lugar1 \
2514 0 1 0 1 2 1
[ ]: # From the above we see one record where 'rez_esc' is null for a school-age child.
# Let's use the rule above to fix the data.
for df in [train_data, test_data]:
    df['rez_esc'].fillna(value=0, inplace=True)
train_data['rez_esc'].isnull().sum()
[ ]: 0
3.4.2 4.3.2.2 Null Value treatment for ‘v18q1’ column (total nulls: 7342) - Number
of tablets household owns
It is fair to say that if a household owns a tablet, the count would be some positive number; if not, the count would be zero.
In our dataset, we have the column 'v18q', which indicates whether or not the household owns a tablet.
Since we are looking at things from the household perspective, it only makes sense to check this at the household level, so we'll only select the rows for the head of household.
• ‘parentesco1’, =1 if household head
[ ]: # Heads of household
heads = train_data.loc[train_data['parentesco1'] == 1].copy()
heads.groupby('v18q')['v18q1'].apply(lambda x: x.isnull().sum())
[ ]: v18q
0 2318
1 0
Name: v18q1, dtype: int64
[ ]: # Looking at the above, it makes sense that when the 'owns a tablet' flag is 0,
# there is no count of tablets the household owns.
# Let's fill all the null values with 0.
for df in [train_data, test_data]:
    df['v18q1'].fillna(value=0, inplace=True)
train_data['v18q1'].isnull().sum()
[ ]: 0
3.4.3 4.3.2.3 Null Value treatment for ‘v2a1’ (total nulls: 6860) = Monthly Rent
Payment
Why only the null values? Let's look at a few rows with nulls in 'v2a1'.
Columns related to Monthly rent payment are listed down:
• tipovivi1, =1 own and fully paid house
• tipovivi2, “=1 own, paying in installments”
• tipovivi3, =1 rented
• tipovivi4, =1 precarious
• tipovivi5, “=1 other(assigned, borrowed)”
[ ]: data = train_data[train_data['v2a1'].isnull()].head()
data
r4h1 r4h2 r4h3 r4m1 r4m2 r4m3 r4t1 r4t2 r4t3 tamhog tamviv \
2 0 0 0 0 1 1 0 1 1 1 1
13 0 1 1 0 1 1 0 2 2 2 2
14 0 1 1 0 1 1 0 2 2 2 2
26 0 1 1 0 0 0 0 1 1 1 1
32 0 4 4 0 1 1 0 5 5 5 5
… (intermediate columns elided) …
television mobilephone qmobilephone lugar1 lugar2 lugar3 lugar4 \
2 0 0 0 1 0 0 0
13 0 1 1 1 0 0 0
14 0 1 1 1 0 0 0
26 0 1 2 1 0 0 0
32 0 1 3 1 0 0 0
Target
2 4
13 4
14 4
26 4
32 4
[ ]: columns = ['tipovivi1', 'tipovivi2', 'tipovivi3', 'tipovivi4', 'tipovivi5']
data[columns]

# Plot of the home ownership variables for households missing rent payments
train_data.loc[train_data['v2a1'].isnull(), columns].sum().plot.bar(figsize=(10, 8), color='black')
plt.xticks([0, 1, 2, 3, 4],
           ['Owns and Paid Off', 'Owns and Paying', 'Rented', 'Precarious', 'Other'],
           rotation=20)
[ ]: # Looking at the above, it makes sense that when the house is fully paid off,
# there will be no monthly rent payment. Let's fill those nulls with 0.
for df in [train_data, test_data]:
    df['v2a1'].fillna(value=0, inplace=True)
train_data[['v2a1']].isnull().sum()
[ ]: v2a1 0
dtype: int64
3.4.4 4.3.2.4 Null Value treatment for ‘meaneduc’ (total nulls: 5) = Average years of
education for adults (18+)
Let’s look at the columns related to average years of education for adults (18+):
• edjefe, years of education of male head of household, based on the interaction of escolari
(years of education), head of household and gender, yes=1 and no=0
• edjefa, years of education of female head of household, based on the interaction of escolari
(years of education), head of household and gender, yes=1 and no=0
• instlevel1, =1 no level of education
• instlevel2, =1 incomplete primary
[ ]: data = train_data[train_data['meaneduc'].isnull()].head()
columns=['edjefe','edjefa','instlevel1','instlevel2']
data[columns][data[columns]['instlevel1']>0].describe()
[ ]: # From the above, we find that 'meaneduc' is null where no education information
# is available. Let's fill these nulls with 0.
for df in [train_data, test_data]:
    df['meaneduc'].fillna(value=0, inplace=True)
train_data['meaneduc'].isnull().sum()
[ ]: 0
3.4.5 4.3.2.5 Null Value treatment for SQBmeaned (total nulls: 5) = Square of the
mean years of education of adults (>=18) in the household
This column is just the square of the mean years of education of adults.
[ ]: data = train_data[train_data['SQBmeaned'].isnull()].head()
columns=['edjefe','edjefa','instlevel1','instlevel2']
data[columns][data[columns]['instlevel1']>0].describe()
[ ]: edjefe edjefa instlevel1 instlevel2
count 0.0 0.0 0.0 0.0
mean NaN NaN NaN NaN
std NaN NaN NaN NaN
min NaN NaN NaN NaN
25% NaN NaN NaN NaN
50% NaN NaN NaN NaN
75% NaN NaN NaN NaN
max NaN NaN NaN NaN
[ ]: # From the above, we find that 'SQBmeaned' is null in the same rows as 'meaneduc'.
# Let's fill these nulls with 0 as well.
for df in [train_data, test_data]:
    df['SQBmeaned'].fillna(value=0, inplace=True)
train_data['SQBmeaned'].isnull().sum()
[ ]: 0
Almost all the null values have been treated in line with our understanding of the data.
The 'Target' column takes four levels:
1 = extreme poverty
2 = moderate poverty
3 = vulnerable households
4 = non-vulnerable households
5.0.1 Different Levels of Poverty Household Groups
[ ]: target = train_data['Target'].value_counts().to_frame()
levels = ["Non Vulnerable", "Moderate Poverty", "Vulnerable", "Extreme Poverty"]
trace = go.Bar(y=target.Target, x=levels, marker=dict(color='orange', opacity=0.6))
data = [trace]
layout = go.Layout(title='Household Poverty Levels')  # a simple layout for the figure
fig = go.Figure(data=data, layout=layout)
iplot(fig)
Looking at the above bar chart, it is evident that we are dealing with an imbalanced class problem. There are many more households that classify as non-vulnerable than in any other category. The extreme poverty class is the smallest, which is encouraging in itself, but it also gives the model the fewest examples to learn from.
Hence, we can say that the dataset is imbalanced.
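The degree of imbalance can be quantified directly from the class counts shown earlier:

```python
import pandas as pd

# Class counts taken from the value_counts() output above
target = pd.Series([4] * 5996 + [2] * 1597 + [3] * 1209 + [1] * 755)

# Share of each class in the training data
proportions = target.value_counts(normalize=True)
print(proportions.round(3))

# Majority-to-minority ratio: non-vulnerable vs extreme poverty
imbalance = target.value_counts().max() / target.value_counts().min()
print(round(imbalance, 1))
```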
5.1 Note:
The raw data contains a mix of both household and individual characteristics. Therefore, for the
individual data, we will have to find a way to aggregate this for each household. Some of the
individuals belong to a household with no head of household which means that unfortunately we
can’t use this data for training.
A few columns to note:
• Id: a unique identifier for each individual, this should not be a feature that we use!
• idhogar: a unique identifier for each household. This variable is not a feature, but will be
used to group individuals by household as all individuals in a household will have the same
identifier.
• parentesco1: indicates if this person is the head of the household.
• Target: the label, which should be equal for all members in a household
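A sketch of the head-of-household correction implied by these notes, on toy data where one member's label disagrees with the head's:

```python
import pandas as pd

# Toy households; parentesco1 == 1 marks the head of household
df = pd.DataFrame({
    "idhogar":     ["h1", "h1", "h1", "h2", "h2"],
    "parentesco1": [1, 0, 0, 1, 0],
    "Target":      [4, 2, 4, 3, 3],   # one member of h1 disagrees with its head
})

# Propagate each head's Target to every member of the household
heads = df[df["parentesco1"] == 1].set_index("idhogar")["Target"]
df["Target"] = df["idhogar"].map(heads)
print(df["Target"].tolist())
```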
5.1.1 Let's see if records belonging to the same household have the same target
[ ]: # Group by household and check the number of unique Target values
all_equal = train_data.groupby('idhogar')['Target'].apply(lambda x: x.nunique() == 1)
not_equal = all_equal[all_equal != True]
print('There are {} households where the family members do not all have the same target.'.format(len(not_equal)))

There are 85 households where the family members do not all have the same target.
[ ]: # Let's check one household
train_data[train_data['idhogar'] == not_equal.index[0]][['idhogar', 'parentesco1', 'Target']]
[ ]: # Let's use the Target value of the parent record (head of the household) and
# update the rest. But before that, let's check for households without a head.
households_head = train_data.groupby('idhogar')['parentesco1'].sum()
households_no_head = train_data.loc[train_data['idhogar'].isin(households_head[households_head == 0].index)]
[ ]: # Find households without a head where the Target values differ
households_no_head_equal = households_no_head.groupby('idhogar')['Target'].apply(lambda x: x.nunique() == 1)
[ ]: # Set each mismatched household's Target to the head of household's Target
for household in not_equal.index:
    true_target = int(train_data[(train_data['idhogar'] == household)
                                 & (train_data['parentesco1'] == 1.0)]['Target'])
    train_data.loc[train_data['idhogar'] == household, 'Target'] = true_target

# Group by household again and re-check the number of unique Target values
all_equal = train_data.groupby('idhogar')['Target'].apply(lambda x: x.nunique() == 1)
not_equal = all_equal[all_equal != True]
print('There are {} households where the family members do not all have the same target.'.format(len(not_equal)))

There are 0 households where the family members do not all have the same target.
[ ]: print(train_data.shape)
# Drop the squared columns, which duplicate information in the base features
cols = ['SQBescolari', 'SQBage', 'SQBhogar_total', 'SQBedjefe', 'SQBhogar_nin',
        'SQBovercrowding', 'SQBdependency', 'SQBmeaned', 'agesq']
for df in [train_data, test_data]:
    df.drop(columns=cols, inplace=True)
print(train_data.shape)

(9557, 143)
(9557, 134)
[ ]: # (truncated) list of individual-level columns, ending with:
# 'parentesco11', 'parentesco12', 'instlevel1', 'instlevel2', 'instlevel3',
# 'instlevel4', 'instlevel5', 'instlevel6', 'instlevel7', 'instlevel8',
# 'instlevel9', 'mobilephone']
[ ]: (2973, 98)
[ ]: # Upper triangle of the correlation matrix
corr_matrix = train_data.corr()
upper = corr_matrix.where(np.triu(np.ones(corr_matrix.shape), k=1).astype(bool))

# Find feature columns with a correlation greater than 0.95
to_drop = [column for column in upper.columns if any(abs(upper[column]) > 0.95)]
to_drop
7 Observations:
There are several variables here having to do with the size of the household:
• r4t3, total persons in the household
• tamhog, size of the household
• tamviv, number of persons living in the household
• hhsize, household size
• hogar_total, # of total individuals in the household
These variables are all highly correlated with one another.
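The correlation-pruning pattern used for these variables can be sketched on synthetic data (column names here are illustrative):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
base = rng.normal(size=200)
df = pd.DataFrame({
    "hhsize": base,
    "tamhog": base + rng.normal(scale=0.01, size=200),  # near-duplicate of hhsize
    "age":    rng.normal(size=200),                     # independent feature
})

# Keep only the upper triangle so each pair is counted once
corr = df.corr()
upper = corr.where(np.triu(np.ones(corr.shape), k=1).astype(bool))

# Columns whose correlation with an earlier column exceeds 0.95
to_drop = [c for c in upper.columns if any(abs(upper[c]) > 0.95)]
print(to_drop)
```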
[ ]: cols = ['tamhog', 'hogar_total', 'r4t3']
for df in [train_data, test_data]:
    df.drop(columns=cols, inplace=True)
train_data.shape
[ ]: (9557, 131)
[ ]: (9557, 39)
[ ]: to_drop
[ ]: ['female']
[ ]: # 'female' is simply the opposite of 'male'! We can remove the 'male' flag.
for df in [train_data, test_data]:
    df.drop(columns='male', inplace=True)
train_data.shape
[ ]: (9557, 130)
[ ]: train_data.shape
[ ]: (9557, 129)
[ ]: train_data.shape
[ ]: (9557, 127)
Now the data is completely prepared and all the unwanted columns are removed. We are all set to build the model.
[ ]: X_features = train_data.iloc[:, 0:-1]
y_features = train_data.iloc[:, -1]
print(X_features.shape)
print(y_features.shape)

(9557, 126)
(9557,)

[ ]: X_train, X_test, y_train, y_test = train_test_split(
    X_features, y_features, test_size=0.2, random_state=1)
rmclassifier = RandomForestClassifier()
[ ]: rmclassifier.fit(X_train,y_train)
[ ]: y_predict = rmclassifier.predict(X_test)
[[ 133 3 0 21]
 [ 0 290 1 26]
 [ 0 1 191 41]
 [ 0 3 1 1201]]
Report:
precision recall f1-score support
… (classification report truncated) …
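A self-contained sketch of how the confusion matrix and report above are produced (toy labels, not the real predictions):

```python
from sklearn.metrics import confusion_matrix, classification_report

# Toy true labels and predictions over the four poverty levels (illustrative only)
y_true = [1, 1, 2, 2, 3, 3, 4, 4, 4, 4]
y_pred = [1, 4, 2, 2, 3, 4, 4, 4, 4, 4]

cm = confusion_matrix(y_true, y_pred)   # rows = true class, columns = predicted class
print(cm)
print(classification_report(y_true, y_pred))
```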
[ ]: y_predict_testdata = rmclassifier.predict(test_data)
[ ]: y_predict_testdata
[ ]: y_predict_testdata.shape
[ ]: (23856,)
7.1.2 5.2 Check the accuracy using random forest with cross-validation.
[ ]: from sklearn.model_selection import KFold, cross_val_score
kfold = KFold(n_splits=5, shuffle=True, random_state=10)  # assuming a 5-fold split
rmclassifier = RandomForestClassifier(random_state=10, n_jobs=-1)
print(cross_val_score(rmclassifier, X_features, y_features, cv=kfold, scoring='accuracy'))
results = cross_val_score(rmclassifier, X_features, y_features, cv=kfold, scoring='accuracy')
print(results.mean() * 100)
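The cross-validation step can be sketched end-to-end on synthetic data (the 5-fold split here is an assumption):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import KFold, cross_val_score

# Synthetic stand-in for the prepared features and 4-class target
X, y = make_classification(n_samples=300, n_classes=4, n_informative=6,
                           random_state=10)

kfold = KFold(n_splits=5, shuffle=True, random_state=10)
clf = RandomForestClassifier(random_state=10, n_jobs=-1)
scores = cross_val_score(clf, X, y, cv=kfold, scoring="accuracy")
print(scores)
print(scores.mean() * 100)
```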