Data Analysis 1
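import pandas as pd

# data = pd.read_excel('Sample_sheet.xlsx')
# Raw string so the backslashes in the Windows path are not read as escape sequences
data = pd.read_csv(r'D:\Study_python\DataAnalysis1 (1)\Sample_sheet.csv', encoding='utf-8')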
data
Size of data : 80
Data types :
S. No. int64
Names object
Years of Experience int64
Domain object
Relevant Experience float64
Income(USD) float64
Marital Status object
Number of siblings int64
dtype: object
Statistical data analysis :
         S. No.  Years of Experience  Relevant Experience   Income(USD)  Number of siblings
count  10.00000            10.000000             9.000000      9.000000                10.0
mean    5.50000             7.800000             5.555556  14888.888889                 3.0
std     3.02765             5.287301             3.745368   5883.120865                 0.0
min     1.00000             0.000000             1.000000   7500.000000                 3.0
25%     3.25000             3.500000             3.000000   9500.000000                 3.0
50%     5.50000             8.000000             5.000000  15000.000000                 3.0
75%     7.75000            11.500000             6.000000  19500.000000                 3.0
max    10.00000            15.000000            13.000000  24000.000000                 3.0
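The cells that printed the size, data types and summary statistics are not shown, so the following is only a minimal sketch of how such output can be produced. It assumes a small frame rebuilt from values printed elsewhere in this notebook; the name `sample` is hypothetical, and the text columns are left out, so its size differs from the 80 reported above.

```python
import numpy as np
import pandas as pd

# Hypothetical frame rebuilt from numeric values printed in this notebook
sample = pd.DataFrame({
    'Years of Experience': [8, 5, 10, 12, 15, 3, 0, 8, 15, 2],
    'Relevant Experience': [6, 4, 3, 10, 5, 3, np.nan, 5, 13, 1],
    'Income(USD)': [20000, 15000, 18000, 24000, 8000, 9500, np.nan,
                    7500, 19500, 12500],
    'Number of siblings': [3] * 10,
})

# DataFrame.size is rows * columns, not the number of rows
print('Size of data :', sample.size)

print('Data types :')
print(sample.dtypes)

# describe() skips NaN, which is why some counts are 9 rather than 10
summary = sample.describe()
print('Statistical data analysis :')
print(summary)
```

A count below the number of rows in `describe()` is an early hint that a column has missing values.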
data
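# Correlation between the numeric columns ('S. No.' is only a row label, so drop it)
# numeric_only=True skips text columns such as Names and Domain (pandas 1.5 or later)
data.drop('S. No.', axis=1).corr(numeric_only=True)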
data
Exercise : Add a column to your dataframe that is negatively correlated with
Income, then compute the correlation.
data['Expenditure'] = [1000, 1520, 1200, 800, 2000, 1800, 0, 1769, 450, 150]
data
   S. No.     Names  Years of Experience         Domain  Relevant Experience
0       1      John                    8     Automotive                  6.0
1       2     Jason                    5  Entertainment                  4.0
2       3     Maria                   10        Banking                  3.0
3       4     Jacob                   12      Insurance                 10.0
4       5     Sarah                   15      Logistics                  5.0
5       6  Angelina                    3         Travel                  3.0
6       7   Krishna                    0            NaN                  NaN
7       8      Adam                    8           Food                  5.0
8       9   Deepika                   15             IT                 13.0
9      10      Alan                    2          Space                  1.0
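One way to approach the exercise is to check the sign of the correlation instead of assuming it. The sketch below is only illustrative: `frame` and the `Shortfall` column are hypothetical, while the Income and Expenditure values are copied from this notebook.

```python
import numpy as np
import pandas as pd

income = pd.Series([20000, 15000, 18000, 24000, 8000, 9500, np.nan,
                    7500, 19500, 12500], name='Income(USD)')
frame = income.to_frame()

# Hypothetical column that falls by one unit for every unit Income rises
frame['Shortfall'] = 30000 - frame['Income(USD)']

# The exercise's own column, copied from the cell above
frame['Expenditure'] = [1000, 1520, 1200, 800, 2000, 1800, 0, 1769, 450, 150]

# Pearson correlation; rows where either value is NaN are ignored pairwise
corr = frame.corr()
print(corr)

income_vs_shortfall = corr.loc['Income(USD)', 'Shortfall']
income_vs_expenditure = corr.loc['Income(USD)', 'Expenditure']
```

A column built as an exact decreasing linear function of Income gives a correlation of -1; a hand-typed column such as Expenditure meets the exercise's goal as long as its entry in the Income row is negative.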
Missing Values
data
data.isna().any()
S. No. False
Names False
Years of Experience False
Domain True
Relevant Experience True
Income(USD) True
Marital Status False
Number of siblings False
Expenditure False
dtype: bool
data.isna().sum()
S. No. 0
Names 0
Years of Experience 0
Domain 1
Relevant Experience 1
Income(USD) 1
Marital Status 0
Number of siblings 0
Expenditure 0
dtype: int64
data['Domain'].fillna('Government')
0 Automotive
1 Entertainment
2 Banking
3 Insurance
4 Logistics
5 Travel
6 Government
7 Food
8 IT
9 Space
Name: Domain, dtype: object
data['Income(USD)'].fillna(3000)
0 20000.0
1 15000.0
2 18000.0
3 24000.0
4 8000.0
5 9500.0
6 3000.0
7 7500.0
8 19500.0
9 12500.0
Name: Income(USD), dtype: float64
data['Income(USD)'].fillna(data['Income(USD)'].mean())
0 20000.000000
1 15000.000000
2 18000.000000
3 24000.000000
4 8000.000000
5 9500.000000
6 14888.888889
7 7500.000000
8 19500.000000
9 12500.000000
Name: Income(USD), dtype: float64
data['Income(USD)'].fillna(data['Income(USD)'].median())
0 20000.0
1 15000.0
2 18000.0
3 24000.0
4 8000.0
5 9500.0
6 15000.0
7 7500.0
8 19500.0
9 12500.0
Name: Income(USD), dtype: float64
data[['Income(USD)']]
Income(USD)
0 20000.0
1 15000.0
2 18000.0
3 24000.0
4 8000.0
5 9500.0
6 NaN
7 7500.0
8 19500.0
9 12500.0
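The frame above still shows NaN in row 6 because `fillna` returns a new Series and leaves `data` unchanged. A minimal sketch of keeping the result, using a hypothetical stand-alone frame `df` that holds the Income values printed above:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'Income(USD)': [20000, 15000, 18000, 24000, 8000, 9500,
                                   np.nan, 7500, 19500, 12500]})

# fillna builds a new Series; the column inside df is not modified
filled = df['Income(USD)'].fillna(df['Income(USD)'].median())
missing_before = df['Income(USD)'].isna().sum()

# Assign the result back to the column to keep the imputed value
df['Income(USD)'] = filled
missing_after = df['Income(USD)'].isna().sum()

print(missing_before, missing_after)
```

The median is used here only as an example; a constant or the mean can be assigned back the same way.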
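from sklearn.impute import SimpleImputer

# Assumed definitions: the cell that created these imputers is missing,
# so the strategies and the fill value are inferred from the outputs below
imp_constant = SimpleImputer(strategy='constant', fill_value=1111)
imp_mean = SimpleImputer(strategy='mean')
imp_median = SimpleImputer(strategy='median')
imp_mode = SimpleImputer(strategy='most_frequent')

imp_constant.fit_transform(data[['Income(USD)']])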
array([[20000.],
[15000.],
[18000.],
[24000.],
[ 8000.],
[ 9500.],
[ 1111.],
[ 7500.],
[19500.],
[12500.]])
imp_mean.fit_transform(data[['Income(USD)']])
array([[20000. ],
[15000. ],
[18000. ],
[24000. ],
[ 8000. ],
[ 9500. ],
[14888.88888889],
[ 7500. ],
[19500. ],
[12500. ]])
imp_mean.statistics_
array([14888.88888889])
imp_mean.fit_transform(data[['Relevant Experience']])
imp_mean.statistics_
array([5.55555556])
imp_median.fit_transform(data[['Income(USD)']])
array([[20000.],
[15000.],
[18000.],
[24000.],
[ 8000.],
[ 9500.],
[15000.],
[ 7500.],
[19500.],
[12500.]])
imp_mode.fit_transform(data[['Income(USD)']])
array([[20000.],
[15000.],
[18000.],
[24000.],
[ 8000.],
[ 9500.],
[ 7500.],
[ 7500.],
[19500.],
[12500.]])
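The `imp_mode` output fills row 6 with 7500 even though every income occurs exactly once. Assuming `imp_mode` uses a most-frequent strategy, this matches scikit-learn's documented behaviour of returning the smallest value when several values tie. The sketch below reproduces the idea with plain pandas rather than the imputer itself; `income` is a hypothetical stand-alone Series.

```python
import numpy as np
import pandas as pd

income = pd.Series([20000, 15000, 18000, 24000, 8000, 9500, np.nan,
                    7500, 19500, 12500], name='Income(USD)')

# mode() returns every tied value, sorted ascending, with NaN ignored
modes = income.mode()
print(modes)

# Taking the first entry picks the smallest of the tied values
filled = income.fillna(modes.iloc[0])
print(filled)
```

When all values are distinct, a mode-based fill is effectively a fill with the minimum, so the mean or median is usually a more meaningful choice for a continuous column like income.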
data
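import json

# Serialize a dictionary to a JSON string
my_json = json.dumps(my_dictionary)
# Read a JSON file keyed by row label, then transpose so the keys become columns
pd.read_json('sample.json', orient='index').T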
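One common reason for reading with `orient='index'` and then transposing is a dictionary whose lists have different lengths. The following is a minimal sketch under that assumption; the dictionary contents and the temporary file standing in for `sample.json` are hypothetical.

```python
import json
import os
import tempfile

import pandas as pd

# Hypothetical dictionary whose lists have different lengths
my_dictionary = {'batch_a': [1, 2, 3], 'batch_b': [4, 5]}
my_json = json.dumps(my_dictionary)

# Write the JSON text to a temporary file standing in for 'sample.json'
path = os.path.join(tempfile.mkdtemp(), 'sample.json')
with open(path, 'w') as handle:
    handle.write(my_json)

# With orient='index' each key becomes a row and short lists are padded
# with NaN; transposing then turns the keys into columns
frame = pd.read_json(path, orient='index').T
print(frame)
```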
Detecting Duplicates
data_ = pd.DataFrame({'Name': ['Krishna', 'Adam', 'Adam', 'Alan', 'Krishna'],
                      'DOB': ['January', 'March', 'March', 'May', 'December'],
                      'Age': [29, 44, 45, 12, 39]})
data_
data_['Name']
0 Krishna
1 Adam
2 Adam
3 Alan
4 Krishna
Name: Name, dtype: object
data_['Name'].duplicated()
0 False
1 False
2 True
3 False
4 True
Name: Name, dtype: bool
data_['Name'].drop_duplicates()
0 Krishna
1 Adam
3 Alan
Name: Name, dtype: object
data_.drop_duplicates()
data_
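On this frame, `drop_duplicates()` without arguments compares whole rows, and the two Adam rows differ in Age, so no row is an exact duplicate. The sketch below shows how the `subset` and `keep` parameters change what counts as a duplicate; the result names are hypothetical.

```python
import pandas as pd

data_ = pd.DataFrame({'Name': ['Krishna', 'Adam', 'Adam', 'Alan', 'Krishna'],
                      'DOB': ['January', 'March', 'March', 'May', 'December'],
                      'Age': [29, 44, 45, 12, 39]})

# Rows must match in every column to count as duplicates
full_row = data_.drop_duplicates()

# Compare on Name only and keep the first row for each name
by_name = data_.drop_duplicates(subset=['Name'])

# Compare on Name and DOB together and keep the last matching row
by_name_dob = data_.drop_duplicates(subset=['Name', 'DOB'], keep='last')

print(len(full_row), len(by_name), len(by_name_dob))

# drop_duplicates returns a new frame; data_ itself is unchanged
print(len(data_))
```

To keep a deduplicated result, assign it to a name, for example `data_ = data_.drop_duplicates(subset=['Name'])`.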