Cleaning Data
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
cr = pd.read_csv("CreditRisk.csv")
cr.columns
3 LP001006 Male Yes 0.0 Not Graduate No 2583
cr.head(10)
https://colab.research.google.com/drive/1TlZut8ZGAyUfKNV3npX7OuLYVqTj7ws_#scrollTo=0fsSaPwLFAwV&printMode=true 1/18
12/24/21, 5:34 PM CLEANING_DATA.ipynb - Colaboratory
3 LP001006 Male Yes 0.0 Not Graduate No 2583
cr.tail(10)  # last 10 records
4 LP001008 Male No 0.0 Graduate No 6000
7 LP001014 Male Yes 4.0 Graduate No 3036
8 LP001018 Male Yes 2.0 Graduate No 4006
9 LP001020 Male Yes 1.0 Graduate No 12841
972 LP002954 Male Yes 2.0 Not Graduate No 313
973 LP002962 Male No 0.0 Graduate No 400
974 LP002965 Female Yes 0.0 Graduate No 855
976 LP002971 Male Yes 4.0 Not Graduate Yes 400
(981, 13)
25% 0.000000 2875.000000 0.000000 100.000000 360.000000
50% 0.000000 3800.000000 1110.000000 126.000000 360.000000
75% 2.000000 5516.000000 2365.000000 162.000000 360.000000
Semiurban 349
Urban 342
Rural 290
Name: Property_Area, dtype: int64
Male 775
Female 182
Name: Gender, dtype: int64
cr['Gender'].value_counts()
Male 775
Female 182
Name: Gender, dtype: int64
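The two counts above add up to 957, which is fewer than the 981 rows reported for the frame, because `value_counts()` leaves out missing values by default. The sketch below shows how to make the gap visible. The toy `gender` Series is a hypothetical stand-in, not data from CreditRisk.csv.

```python
import numpy as np
import pandas as pd

# Hypothetical toy column; the values are NOT taken from CreditRisk.csv
gender = pd.Series(["Male", "Female", np.nan, "Male", np.nan], name="Gender")

counts = gender.value_counts()                       # NaN is excluded by default
counts_with_na = gender.value_counts(dropna=False)   # NaN gets its own row
n_missing = gender.isna().sum()                      # direct count of missing entries

print(counts)
print(counts_with_na)
print(n_missing)
```

Comparing `counts.sum()` with `len(gender)` is a quick way to spot columns that still need cleaning.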
     Loan_ID   Gender  Married  Dependents  Education     Self_Employed  ApplicantIncom.
1    LP001003  Male    Yes      1.0         Graduate      No             458
2    LP001005  Male    Yes      0.0         Graduate      Yes            300
3    LP001006  Male    Yes      0.0         Not Graduate  No             258
967  LP002920  Male    Yes      0.0         Graduate      No             511
968  LP002921  Male    Yes      4.0         Not Graduate  No             531
969  LP002932  Male    Yes      4.0         Graduate      No             760
976  LP002971  Male    Yes      4.0         Not Graduate  Yes            400
abc = ((cr.ApplicantIncome > 3000) & (cr.Gender == "Male") & (cr.Married == "Yes"))
df3 = cr[abc]
df3
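The filter above combines three conditions into one boolean mask. A minimal sketch of the same pattern on a small frame follows. The rows in `toy` are hypothetical and only the column names follow the notebook. The `query` variant is an alternative spelling, not something the notebook uses.

```python
import pandas as pd

# Hypothetical rows; only the column names follow the notebook
toy = pd.DataFrame({
    "Gender": ["Male", "Female", "Male", "Male"],
    "Married": ["Yes", "Yes", "No", "Yes"],
    "ApplicantIncome": [5849, 4583, 6000, 2583],
})

# Each comparison needs its own parentheses because & binds
# more tightly than > and ==
mask = (
    (toy["ApplicantIncome"] > 3000)
    & (toy["Gender"] == "Male")
    & (toy["Married"] == "Yes")
)
selected = toy[mask]

# The same filter written as a query string
selected_q = toy.query(
    "ApplicantIncome > 3000 and Gender == 'Male' and Married == 'Yes'"
)

print(selected)
```

Bracket access such as `toy["Gender"]` also works when a column name contains spaces or clashes with a DataFrame method, which attribute access does not.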
972 LP002954 Male Yes 2.0 Not Graduate No 313
976 LP002971 Male Yes 4.0 Not Graduate Yes 400
5179.795107033639
cr.mean(numeric_only=True)  # average the numeric columns only; string columns are skipped
Dependents 0.881799
ApplicantIncome 5179.795107
CoapplicantIncome 1601.916330
LoanAmount 142.511530
Loan_Amount_Term 342.201873
Credit_History 0.835920
dtype: float64
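The means above cover only the numeric columns. In recent pandas releases (2.0 and later), calling `mean()` on a frame that also holds string columns is expected to raise a `TypeError` unless the numeric columns are selected explicitly. The sketch below shows two ways to do that on a hypothetical toy frame.

```python
import pandas as pd

# Hypothetical toy frame with one string column and two numeric columns
toy = pd.DataFrame({
    "Loan_ID": ["LP1", "LP2", "LP3"],        # made-up IDs
    "ApplicantIncome": [3000, 4000, 5000],
    "LoanAmount": [100.0, None, 140.0],      # None becomes NaN
})

# Option 1: ask mean() to ignore non-numeric columns
means = toy.mean(numeric_only=True)

# Option 2: select the numeric columns first, then aggregate
means_alt = toy.select_dtypes(include="number").mean()

print(means)
```

Missing values are skipped by default, so the `LoanAmount` mean is computed from the two values that are present.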
cr.ApplicantIncome.median()
3800.0
cr.ApplicantIncome.sum()
5081379
cr.ApplicantIncome.max()
81000
cr.ApplicantIncome.min()
0
cr.Gender.value_counts()
Male 775
Female 182
Name: Gender, dtype: int64
# GROUP BY FUNCTION

cr.groupby('Gender').ApplicantIncome.agg(['count', 'min', 'max', 'mean'])
        min  max  mean      min  max    mean         min  max      mean         min   m
Gender
Female  0.0  4.0  0.531073  0    19484  4458.906593  0.0  41667.0  1132.604396  9.0   6
Male    0.0  4.0  0.960317  0    81000  5256.925161  0.0  33837.0  1716.340542  17.0  6
cr.groupby(["Gender", "Married"]).ApplicantIncome.mean()
Gender Married
Female No 4394.153226
Yes 4501.736842
Male No 4912.764151
Yes 5390.440285
Name: ApplicantIncome, dtype: float64
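The grouped summaries above can be given readable column names with named aggregation, and a two-level result can be pivoted into a small table with `unstack`. This is a sketch on hypothetical rows. The output names `n`, `lowest`, `highest` and `average` are made up for illustration.

```python
import pandas as pd

# Hypothetical rows; only the column names follow the notebook
toy = pd.DataFrame({
    "Gender": ["Male", "Male", "Female", "Female"],
    "Married": ["Yes", "No", "Yes", "Yes"],
    "ApplicantIncome": [6000, 4000, 3000, 5000],
})

# Named aggregation: output_name=(input_column, function)
summary = toy.groupby("Gender").agg(
    n=("ApplicantIncome", "count"),
    lowest=("ApplicantIncome", "min"),
    highest=("ApplicantIncome", "max"),
    average=("ApplicantIncome", "mean"),
)

# Grouping by two keys gives a Series with a two-level index;
# unstack moves the Married level into the columns
by_two = toy.groupby(["Gender", "Married"])["ApplicantIncome"].mean()
wide = by_two.unstack("Married")

print(summary)
print(wide)
```

Combinations that never occur in the data, such as unmarried women in this toy frame, show up as `NaN` in the unstacked table.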
cr.iloc[:, :-1]
3 LP001006 Male Yes 0.0 Not Graduate No 2583
976 LP002971 Male Yes 4.0 Not Graduate Yes 4009
7 LP001014 Male Yes 4.0 Graduate No 3036
8 LP001018 Male Yes 2.0 Graduate No 4006
9 LP001020 Male Yes 1.0 Graduate No 12841
2 Male 0.0 Yes
3 Male 0.0 No
4 Male 0.0 No
6 Male 0.0 No
7 Male 4.0 No
8 Male 2.0 No
9 Male 1.0 No
0 Male 0.0 No
1 Male 1.0 No
3 Male 0.0 No
4 Male 0.0 No
6 Male 0.0 No
7 Male 4.0 No
8 Male 2.0 No
9 Male 1.0 No
cr.iloc[:, 2:6]
0 No 0.0 Graduate No
4 No 0.0 Graduate No
981 rows × 4 columns
cr.loc[:, ['ApplicantIncome', 'Self_Employed']]
ApplicantIncome Self_Employed
0 5849 No
1 4583 No
2 3000 Yes
3 2583 No
4 6000 No
977 4158 No
978 3250 No
979 5000 No
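The cells above mix position-based selection (`iloc`) with label-based selection (`loc`). A short sketch of the difference on a hypothetical four-column frame:

```python
import pandas as pd

# Hypothetical frame; the values are made up for illustration
toy = pd.DataFrame({
    "Loan_ID": ["LP1", "LP2", "LP3"],
    "Gender": ["Male", "Male", "Female"],
    "ApplicantIncome": [5849, 4583, 3000],
    "Loan_Status": ["Y", "N", "Y"],
})

all_but_last = toy.iloc[:, :-1]   # every column except the last one
middle = toy.iloc[:, 1:3]         # positions 1 and 2; the stop position 3 is excluded
by_name = toy.loc[:, ["ApplicantIncome", "Gender"]]    # explicit list of labels
label_slice = toy.loc[:, "Gender":"ApplicantIncome"]   # label slices include both endpoints

print(middle.columns.tolist())
print(label_slice.columns.tolist())
```

Position slices follow the usual Python half-open convention, while label slices are inclusive at both ends. That difference is a common source of off-by-one mistakes.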
cr1.head()
3 LP001006 Male Yes 0.0 Not Graduate No 2583
cr.columns
cr3.head()
3 LP001006 Male Yes 0.0 Not Graduate No 2583
cr.head()
3 LP001006 Male Yes 0.0 Not Graduate No 2583
cr.head()
3 Male Yes 0.0 Not Graduate No 2583
Missing data occurs commonly in many data analysis applications. One of the goals of pandas
is to make working with missing data as painless as possible.
import pandas as pd
import numpy as np
string_data = pd.Series(['aardvark', 'artichoke', np.nan, 'avocado'])
string_data
0 aardvark
1 artichoke
2 NaN
3 avocado
dtype: object
string_data.isnull()
0 False
1 False
2 True
3 False
dtype: bool
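from numpy import nan as NA  # NA is used as shorthand for np.nan in the cells that follow
data = pd.Series([1, NA, 3.5, NA, 7])
data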
0 1.0
1 NaN
2 3.5
3 NaN
4 7.0
dtype: float64
data.dropna()  # equivalent to data[data.notnull()]
0 1.0
2 3.5
4 7.0
dtype: float64
# Passing how='all' will only drop rows that are all NA:
data.dropna(how='all')
0 1.0
2 3.5
4 7.0
dtype: float64
# Set the last element to NA (on a DataFrame, passing axis=1 to dropna drops columns in the same way)
data[4] = NA
data
0 1.0
1 NaN
2 3.5
3 NaN
4 NaN
dtype: float64
data.dropna(axis=0, how='all')
0 1.0
2 3.5
dtype: float64
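On a Series, `how` and `axis` make little difference because every entry is a single value. Their effect is easier to see on a DataFrame. The sketch below uses a hypothetical 4×3 frame. The `thresh` argument is part of `DataFrame.dropna` but is not used in the notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with a mix of complete, partial and empty rows
frame = pd.DataFrame([
    [1.0, 6.5, 3.0],
    [1.0, np.nan, np.nan],
    [np.nan, np.nan, np.nan],
    [np.nan, 6.5, 3.0],
])

complete_rows = frame.dropna()          # keep rows with no NA at all
not_all_na = frame.dropna(how="all")    # drop only rows where every value is NA

frame[3] = np.nan                       # add a column (label 3) that is entirely NA
without_empty_cols = frame.dropna(axis=1, how="all")  # drop all-NA columns
at_least_two = frame.dropna(thresh=2)   # keep rows with at least 2 non-NA values

print(complete_rows)
print(not_all_na)
print(without_empty_cols)
print(at_least_two)
```

Each call returns a new object and leaves `frame` unchanged unless the result is assigned back.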
0 1 2
df.iloc[:4, 1] = NA
df.iloc[:2, 2] = NA
df
0 1 2
3 -1.128460 NaN -1.124696
df.fillna(0)
df.fillna(0, inplace=True)  # modifies df in place and returns None
df
0 1 2
Argument  Description
value     Scalar value or dict-like object to use to fill missing values
limit     For forward and backward filling, maximum number of consecutive periods to fill
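A brief sketch of how the `value` and `limit` arguments behave, using a hypothetical two-column frame. Forward filling is written with `DataFrame.ffill`, which accepts the same `limit` argument. That choice is an assumption about which spelling to prefer, not something taken from the notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with runs of missing values
frame = pd.DataFrame({
    "a": [1.0, np.nan, np.nan, np.nan],
    "b": [np.nan, 2.0, np.nan, 4.0],
})

# value as a dict: a different fill value per column
per_column = frame.fillna({"a": 0.0, "b": -1.0})

# Forward fill: carry the last valid observation downward
forward = frame.ffill()

# limit=1: fill at most one consecutive missing value per gap
forward_limited = frame.ffill(limit=1)

print(per_column)
print(forward)
print(forward_limited)
```

A leading `NaN`, as in column `b`, has no earlier value to carry forward and therefore stays missing after a forward fill.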
df = pd.DataFrame(np.random.randn(6, 3))
df.iloc[2:, 1] = NA
df.iloc[4:, 2] = NA
df
0 1 2
cr.isnull().sum()
Gender1 24
Married 3
Dependents 25
Education 0
Self_Employed 55
ApplicantIncome1 0
CoapplicantIncome 0
LoanAmount 27
Loan_Amount_Term 20
Credit_History 79
Property_Area 0
Loan_Status 0
dtype: int64
cr.Gender1 = cr.Gender1.fillna("Male")
cr.Married = cr.Married.fillna("Yes")
cr.Dependents = cr.Dependents.fillna(0)
cr.Self_Employed = cr.Self_Employed.fillna("No")  # use the existing "No" label; "no" would create a third category
cr.Loan_Amount_Term = cr.Loan_Amount_Term.fillna(cr.Loan_Amount_Term.mean())
cr.Credit_History = cr.Credit_History.fillna(0)
cr.isnull().sum()
Gender1 0
Married 0
Dependents 0
Education 0
Self_Employed 0
ApplicantIncome1 0
CoapplicantIncome 0
LoanAmount 0
Loan_Amount_Term 0
Credit_History 0
Property_Area 0
Loan_Status 0
dtype: int64
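The cells above fill each column with a hand-picked constant. One alternative is to derive the fill values from the data itself, for example the most frequent label for categorical columns and the median for skewed numeric ones. The sketch below illustrates that idea on hypothetical rows. The choice of mode and median is an assumption for illustration, not the notebook's method.

```python
import numpy as np
import pandas as pd

# Hypothetical rows; only the column names follow the notebook
toy = pd.DataFrame({
    "Gender": ["Male", np.nan, "Female", "Male"],
    "LoanAmount": [100.0, np.nan, 120.0, 500.0],
    "Credit_History": [1.0, 1.0, np.nan, 0.0],
})

# Compute one fill value per column from the observed data
fill_values = {
    "Gender": toy["Gender"].mode()[0],                  # most frequent label
    "LoanAmount": toy["LoanAmount"].median(),           # robust to large outliers
    "Credit_History": toy["Credit_History"].mode()[0],  # most frequent flag
}

filled = toy.fillna(fill_values)   # returns a new frame; toy is unchanged

print(fill_values)
print(filled.isna().sum())
```

Keeping the fill values in a single dictionary makes the imputation choices easy to review and to reapply to new data.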
# cr.dropna()
cr.head(20)
3 0 1 0.0 Not Graduate No 2583
6 0 1 0.0 Not Graduate No 2333
16 0 0 1.0 Not Graduate No 3596
18 0 1 0.0 Not Graduate No 4887
cr = pd.read_csv(r"CreditRisk.csv")
cr.head()
cr.shape
(981, 20)