Intro to Python
Data Dictionary
1. NDB_No - Nutrition database number
2. Shrt_Desc - Short description
3. Water_(g) - water in grams per 100 grams
4. Energ_Kcal - Energy in Kcal
5. Protein_(g) - Protein
6. Lipid_Tot_(g) - Total Lipid
7. Ash_(g) - Ash
8. Carbohydrt_(g) - Carbohydrate, by difference
9. Fiber_TD_(g) - Fiber, total dietary
10. Sugar_Tot_(g) - Total Sugars
11. Calcium_(mg) - Calcium
12. Iron_(mg) - Iron
13. Magnesium_(mg) - Magnesium
14. Phosphorus_(mg) - Phosphorus
15. Potassium_(mg) - Potassium
16. Zinc_(mg) - Zinc
17. Copper_(mg) - Copper
18. Manganese_(mg) - Manganese
19. Selenium_(µg) - Selenium
20. Sodium_(mg) - Sodium
In [4]:
#Import all the necessary modules
The CSV file that we are going to load contains Unicode data. We will pass an additional encoding
parameter so that pandas can load the CSV.
In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
sns.set(color_codes = True)
%matplotlib inline
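The loading step itself does not appear in this export, so the following is only a minimal sketch of reading a CSV with an explicit encoding. The file name `nutrition_sample.csv`, the `latin-1` encoding, and the two sample rows are assumptions made for illustration; the real file name and encoding are not given in the source.

```python
import os
import tempfile

import pandas as pd

# Hypothetical sample standing in for the real file. The header contains a
# non-ASCII character (µ), which is why the encoding matters.
sample = (
    "NDB_No,Shrt_Desc,Selenium_(µg)\n"
    '1001,"BUTTER,WITH SALT",1.0\n'
    '1002,"BUTTER,WHIPPED,W/ SALT",0.0\n'
)

# Write the sample to a temporary file using Latin-1 (an assumed encoding).
path = os.path.join(tempfile.mkdtemp(), "nutrition_sample.csv")
with open(path, "w", encoding="latin-1") as fh:
    fh.write(sample)

# Pass the same encoding to read_csv so the bytes are decoded correctly.
sample_df = pd.read_csv(path, encoding="latin-1")
print(sample_df.columns.tolist())
print(sample_df.shape)
```

If the encoding passed to `read_csv` does not match the one the file was saved with, pandas may either raise a decoding error or load garbled column names, so it is worth checking the column list right after loading.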
In [6]:
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 600 entries, 0 to 599
Data columns (total 20 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 NDB_No 600 non-null int64
1 Shrt_Desc 600 non-null object
2 Water_(g) 600 non-null float64
3 Energ_Kcal 600 non-null int64
4 Protein_(g) 600 non-null float64
5 Lipid_Tot_(g) 599 non-null float64
6 Ash_(g) 599 non-null float64
7 Carbohydrt_(g) 600 non-null float64
8 Fiber_TD_(g) 600 non-null float64
9 Sugar_Tot_(g) 600 non-null float64
10 Calcium_(mg) 600 non-null int64
11 Iron_(mg) 600 non-null float64
12 Magnesium_(mg) 600 non-null int64
13 Phosphorus_(mg) 600 non-null int64
14 Potassium_(mg) 600 non-null int64
15 Sodium_(mg) 600 non-null int64
16 Zinc_(mg) 600 non-null float64
17 Copper_mg) 600 non-null float64
18 Manganese_(mg) 600 non-null float64
19 Selenium_(µg) 600 non-null float64
dtypes: float64(12), int64(7), object(1)
memory usage: 93.9+ KB
The head function is used to view the first records of a DataFrame. The number of records to display is given in
the parentheses.
In [ ]:
Out[ ]:
   NDB_No  Shrt_Desc               Water_(g)  Energ_Kcal  Protein_(g)  Lipid_Tot_(g)  Ash_(g)  Carbohydrt_(g)  Fiber_TD_(g)  ...
0  1001    BUTTER,WITH SALT        15.87      717         0.85         81.11          2.11     0.06            0.0           ...
1  1002    BUTTER,WHIPPED,W/ SALT  16.72      718         0.49         78.30          1.62     2.87            0.0           ...
2  1003    BUTTER OIL,ANHYDROUS    0.24       876         0.28         99.48          0.00     0.00            0.0           ...
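As a small illustration of how the row count in the parentheses works, here is a sketch on a hypothetical three-row frame (`demo`) built from values shown in the output above. It is a stand-in, not the full dataset.

```python
import pandas as pd

# Hypothetical stand-in frame using the first three records shown above.
demo = pd.DataFrame({
    "NDB_No": [1001, 1002, 1003],
    "Shrt_Desc": ["BUTTER,WITH SALT", "BUTTER,WHIPPED,W/ SALT", "BUTTER OIL,ANHYDROUS"],
    "Energ_Kcal": [717, 718, 876],
})

first_two = demo.head(2)     # the first 2 rows
default_head = demo.head()   # no argument: up to the first 5 rows
last_one = demo.tail(1)      # tail counts from the end instead

print(first_two)
print(last_one)
```

When the frame has fewer rows than requested, `head` and `tail` simply return all of them.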
In [10]:
df.tail(20)
Out[10]:
     NDB_No  Shrt_Desc                                          Water_(g)  Energ_Kcal  Protein_(g)  Lipid_Tot_(g)  Ash_(g)  Carbohydrt_(g)  Fiber_TD_(g)
582  3936    INF FORMULA,PBM PRODUC,STO BRA,RTF (FORMERLY W...  88.10      63          1.40         3.50           0.61     6.39            0.0
583  3937    INF FORM,PBM PROD,STORE BRAND,LC,NOT REC (FORM...  76.00      130         2.90         7.00           0.61     13.90           0.0
584  3938    INF FORMULA,PBM PRODUCTS,STORE BRAND,PDR           2.00       524         12.00        28.00          2.00     56.00           0.0
585  3939    INF FORMULA,PBM PRODUCTS,STORE BRAND,SOY,RTF       88.00      63          1.80         3.50           0.61     6.09            0.0
586  3940    INF FORMU,PBM PRODU,STORE BR,SOY,LIQ CONC,NOT ...  76.00      126         3.60         7.00           1.22     12.18           0.0
587  3941    INF FORMULA,PBM PRODUCTS,STORE BRAND,SOY,PDR       2.00       508         13.60        27.20          5.00     52.20           0.0
588  3942    INF FORMULA, MEAD JOHNS, ENFAMI, AR LIPIL, RTF...  87.00      68          1.71         3.49           0.33     7.53            0.0
590  3944    INF FORM,ABBOT NUT,SIMIL NEOSU,RTF,W/ ARA & DHA    87.00      69          1.92         3.77           0.38     6.94            0.0
596  3950    INF FORMULA,ABBOTT NUTR,SIMILAC,ADVANC,W/ IRON...  2.25       522         10.89        28.87          3.26     54.73           0.0
597  3951    INF FORMU,ABBO NUTR,SIMILAC,ADVAN,W/ IRON,LIQ ...  76.16      127         2.64         6.89           0.79     13.52           0.0
The shape property returns the dimensions of the DataFrame: the first value is the number of rows and the second
value is the number of columns.
In [13]:
df.shape
Out[13]:
(600, 20)
In [ ]:
print("The number of rows is", data_df.shape[0], "\n", "The number of columns is", data_df.shape[1])
In [16]:
df[df.isna().any(axis=1)]
Out[16]:
NDB_No Shrt_Desc Water_(g) Energ_Kcal Protein_(g) Lipid_Tot_(g) Ash_(g) Carbohydrt_(g) Fiber_TD_(g) Sugar_Tot_(g)
3 1004 CHEESE,BLUE 42.41 353 21.40 28.74 NaN 2.34 0.0 0.50
4 1005 CHEESE,BRICK 41.11 371 23.24 NaN 3.18 2.79 0.0 0.51
In [18]:
df=df.dropna()
In [ ]:
data_df=data_df.dropna()
There are two missing values in the data set, one in each of the variables Lipid_Tot_(g) and Ash_(g). We need to
drop the rows with these missing values before proceeding, and we use the dropna method to do so.
In [ ]:
data_df.isnull().sum()
Out[ ]:
NDB_No 0
Shrt_Desc 0
Water_(g) 0
Energ_Kcal 0
Protein_(g) 0
Lipid_Tot_(g) 0
Ash_(g) 0
Carbohydrt_(g) 0
Fiber_TD_(g) 0
Sugar_Tot_(g) 0
Calcium_(mg) 0
Iron_(mg) 0
Magnesium_(mg) 0
Phosphorus_(mg) 0
Potassium_(mg) 0
Sodium_(mg) 0
Zinc_(mg) 0
Copper_mg) 0
Manganese_(mg) 0
Selenium_(µg) 0
dtype: int64
We can now see that there are no missing values in our data set.
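The check-and-drop sequence used above can be sketched on a small hypothetical frame (`demo`) that mirrors the two incomplete rows, CHEESE,BLUE and CHEESE,BRICK. Only three columns are kept to make the result easy to read.

```python
import numpy as np
import pandas as pd

# Hypothetical frame mirroring the two incomplete rows shown earlier.
demo = pd.DataFrame({
    "Shrt_Desc": ["BUTTER,WITH SALT", "CHEESE,BLUE", "CHEESE,BRICK"],
    "Lipid_Tot_(g)": [81.11, 28.74, np.nan],
    "Ash_(g)": [2.11, np.nan, 3.18],
})

missing_per_column = demo.isnull().sum()         # NaN count for each column
rows_with_gaps = demo[demo.isna().any(axis=1)]   # rows holding at least one NaN
cleaned = demo.dropna()                          # returns a new frame; demo is unchanged

print(missing_per_column)
print(rows_with_gaps)
print(cleaned.shape)
```

Note that `dropna` keeps the original index labels of the surviving rows. That is why later outputs in this notebook jump from index 2 straight to index 5.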
In [19]:
df.describe()
Out[19]:
NDB_No Water_(g) Energ_Kcal Protein_(g) Lipid_Tot_(g) Ash_(g) Carbohydrt_(g) Fiber_TD_(g) Sugar_Tot_(g) Calc
count 598.000000 598.000000 598.000000 598.000000 598.000000 598.000000 598.000000 598.000000 598.000000
mean 2305.448161 60.415870 186.521739 8.640385 8.849649 2.542040 19.395920 2.587458 9.122726
std 1081.807819 31.929817 158.540534 10.939719 12.262311 6.494483 21.522166 7.594803 13.927320
min 1001.000000 0.200000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
25% 1173.250000 39.167500 63.000000 1.702500 0.697500 0.490000 5.560000 0.000000 0.560000
50% 2052.500000 76.180000 106.000000 3.900000 3.475000 0.890000 10.840000 0.000000 4.820000
75% 3190.750000 85.987500 314.500000 12.385000 13.000000 3.260000 19.147500 1.200000 10.692500
max 3953.000000 99.900000 876.000000 84.080000 99.480000 99.800000 86.680000 53.200000 74.460000
In [22]:
test = df['Protein_(g)'].head(10)
test
Out[22]:
0 0.85
1 0.49
2 0.28
5 20.75
6 19.80
7 25.18
8 22.87
9 23.37
10 23.76
11 11.12
Name: Protein_(g), dtype: float64
In [24]:
df.iloc[0:5,10:15]
Out[24]:
   Calcium_(mg)  Iron_(mg)  Magnesium_(mg)  Phosphorus_(mg)  Potassium_(mg)
0  24            0.02       2               24               24
1  23            0.05       1               24               41
2  4             0.00       0               3                5
In [25]:
df['Sodium_(mg)'].dtype
Out[25]:
dtype('int64')
In [ ]:
data_df_new.shape
In [ ]:
#dimension will be checked using the shape
data_df.shape
Out[ ]:
(598, 21)
In [ ]:
data_df.head(5)
Out[ ]:
   NDB_No  Shrt_Desc               Water_(g)  Energ_Kcal  Protein_(g)  Lipid_Tot_(g)  Ash_(g)  Carbohydrt_(g)  Fiber_TD_(g)  ...
0  1001    BUTTER,WITH SALT        15.87      717         0.85         81.11          2.11     0.06            0.0           ...
1  1002    BUTTER,WHIPPED,W/ SALT  16.72      718         0.49         78.30          1.62     2.87            0.0           ...
2  1003    BUTTER OIL,ANHYDROUS    0.24       876         0.28         99.48          0.00     0.00            0.0           ...
5 rows × 21 columns
Create a subset of the dataset where Energ_Kcal is less
than 500. What is the dimension of this new dataset?
We will create a new DataFrame by applying the subset condition to the original data_df, and then use the shape
property to check its dimension.
In [ ]:
data_df_new=data_df[data_df['Energ_Kcal']<500]
The subset has fewer rows than the original: 36 rows have Energ_Kcal of 500 or more, so 562 rows remain (598 - 36 = 562).
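The filter above relies on a boolean mask. Here is a minimal sketch on a hypothetical three-row frame (`demo`); the third row is invented to show what happens at exactly 500.

```python
import pandas as pd

# Two values come from rows shown earlier; the third row is hypothetical
# and sits exactly on the 500 boundary.
demo = pd.DataFrame({
    "Shrt_Desc": ["BUTTER,WITH SALT", "CHEESE,BLUE", "hypothetical item"],
    "Energ_Kcal": [717, 353, 500],
})

mask = demo["Energ_Kcal"] < 500     # True where the condition holds
below_500 = demo[mask]              # rows kept by the filter
at_or_above_500 = demo[~mask]       # the complement, including exactly 500

print(below_500.shape)
print(at_or_above_500.shape)
```

Because the condition is a strict `<`, a row with exactly 500 kcal falls into the excluded part, and the two parts together always add up to the original row count.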
Here we use sorting to rank the records, with the sort_values method to order the data. When a higher value should
get a higher rank (for example, ranking by higher energy values), we sort the data in descending order so that the
largest value comes first.
When a lower value should get a higher rank, we sort the data in ascending order so that the smallest value comes
first.
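Both sort directions described above can be sketched on a hypothetical three-row frame (`demo`) built from the butter records shown earlier. The `energy_rank` column name is introduced only for this illustration.

```python
import pandas as pd

# Hypothetical stand-in frame using values from the first three records.
demo = pd.DataFrame({
    "Shrt_Desc": ["BUTTER,WITH SALT", "BUTTER,WHIPPED,W/ SALT", "BUTTER OIL,ANHYDROUS"],
    "Water_(g)": [15.87, 16.72, 0.24],
    "Energ_Kcal": [717, 718, 876],
})

# Higher value ranks higher: descending order puts the largest value first.
highest_energy_first = demo.sort_values(by="Energ_Kcal", ascending=False)

# Lower value ranks higher: ascending order (the default) puts the smallest first.
lowest_water_first = demo.sort_values(by="Water_(g)")

# rank() assigns rank numbers without reordering the rows.
demo["energy_rank"] = demo["Energ_Kcal"].rank(ascending=False).astype(int)

print(highest_energy_first)
print(lowest_water_first)
print(demo)
```

`sort_values` returns a new frame and leaves the original row order untouched, so the result has to be assigned or displayed to be of any use.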
In [ ]:
df.sort_values(by='Water_(g)').head(10)
Out[ ]:
     NDB_No  Shrt_Desc                                    Water_(g)  Energ_Kcal  Protein_(g)  Lipid_Tot_(g)  Ash_(g)  Carbohydrt_(g)
311  2074    SEASONING MIX,DRY,SAZON,CORIANDER & ANNATTO  0.20       0           0.00         0.00           99.80    0.00
175  1206    CREAM SUB,FLAV,PDR                           1.52       482         0.68         21.47          0.79     75.42
443  3184    BABYFOOD,CRL,WHL WHEAT,W/ APPLS,DRY FORT     1.70       402         6.60         4.80           3.70     83.20
122  1134    EGG,WHL,DRIED,STABILIZED,GLUCOSE RED         1.87       615         48.17        43.95          3.63     2.38
10 rows × 21 columns
In [ ]:
In [ ]:
data_df_cheese.shape
Out[ ]:
(74, 21)
In [ ]:
data_df_cheese.describe()
Out[ ]:
NDB_No Water_(g) Energ_Kcal Protein_(g) Lipid_Tot_(g) Ash_(g) Carbohydrt_(g) Fiber_TD_(g) Sugar_Tot_(g) Calcium
count 74.000000 74.000000 74.000000 74.000000 74.000000 74.000000 74.000000 74.000000 74.000000 74.
mean 1102.351351 48.021892 302.945946 20.956216 22.258378 4.102838 4.697568 0.005405 2.050541 632.
std 95.658702 14.181385 98.547797 6.423714 9.960468 1.730940 5.791562 0.028148 3.257006 296.
min 1006.000000 13.440000 72.000000 6.150000 0.000000 1.020000 0.000000 0.000000 0.000000 53.
25% 1024.250000 39.305000 264.000000 17.045000 19.480000 3.272500 2.010000 0.000000 0.322500 497.
50% 1042.500000 44.550000 330.000000 21.760000 25.320000 3.790000 3.300000 0.000000 0.955000 676.
75% 1181.750000 52.415000 372.500000 24.502500 29.800000 5.195000 5.395000 0.000000 2.365000 759.
max 1271.000000 82.480000 466.000000 37.860000 35.590000 8.030000 42.650000 0.200000 23.670000 1375.
Using the cut function on the water variable, divide the whole data
into 6 bins and list the summary statistics of all 6
bins.
In [ ]:
# Work on a copy so that adding the bins column does not modify data_df
data_bins = data_df.copy()
In [ ]:
data_bins['bins']=pd.cut(data_bins['Water_(g)'],6, labels =["A", "B", "C","E","F","G"])
In [ ]:
data_bins.head(5)
Out[ ]:
   NDB_No  Shrt_Desc               Water_(g)  Energ_Kcal  Protein_(g)  Lipid_Tot_(g)  Ash_(g)  Carbohydrt_(g)  Fiber_TD_(g)  ...
0  1001    BUTTER,WITH SALT        15.87      717         0.85         81.11          2.11     0.06            0.0           ...
1  1002    BUTTER,WHIPPED,W/ SALT  16.72      718         0.49         78.30          1.62     2.87            0.0           ...
2  1003    BUTTER OIL,ANHYDROUS    0.24       876         0.28         99.48          0.00     0.00            0.0           ...
5 rows × 22 columns
In [ ]:
data_bins["bins"].value_counts()
Out[ ]:
G 193
F 175
A 127
C 52
E 41
B 10
Name: bins, dtype: int64
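The uneven counts above follow from how `pd.cut` works with an integer number of bins: it splits the observed range into intervals of equal width, not into groups of equal size. A minimal sketch on a hypothetical six-value series (`water`) makes the interval edges visible; the values are chosen to span the same 0.2 to 99.9 range as the Water_(g) column.

```python
import pandas as pd

# Hypothetical water values spanning the same range as the Water_(g) column.
water = pd.Series([0.2, 15.87, 42.41, 60.0, 76.18, 99.9], name="Water_(g)")

# With an integer, pd.cut builds that many equal-width intervals over the range.
intervals = pd.cut(water, 6)

# The same cut, with the interval names replaced by the labels used above.
labelled = pd.cut(water, 6, labels=["A", "B", "C", "E", "F", "G"])

print(intervals.cat.categories)            # the six interval edges
print(labelled.value_counts(sort=False))   # counts per label, in label order
```

A label with no values in its interval still appears with a count of 0. If bins with roughly equal counts are wanted instead, `pd.qcut` is the quantile-based alternative.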
In [ ]:
data_bins.groupby("bins").describe().T
Out[ ]:
bins A B C E F G
In [ ]:
A=data_bins[data_bins["bins"]=="A"]
B=data_bins[data_bins["bins"]=="B"]
C=data_bins[data_bins["bins"]=="C"]
E=data_bins[data_bins["bins"]=="E"]
F=data_bins[data_bins["bins"]=="F"]
G=data_bins[data_bins["bins"]=="G"]
In [ ]:
A.describe() # You can do for others
Out[ ]:
NDB_No Water_(g) Energ_Kcal Protein_(g) Lipid_Tot_(g) Ash_(g) Carbohydrt_(g) Fiber_TD_(g) Sugar_Tot_(g) Calc
count 127.000000 127.000000 127.000000 127.000000 127.000000 127.000000 127.000000 127.000000 127.000000
mean 2427.000000 5.964094 405.181102 16.981339 16.340157 7.032441 53.572756 9.881102 20.913150
std 1006.346851 3.634829 123.142286 15.983079 16.559315 12.780734 21.142774 14.068132 23.569114
min 1001.000000 0.200000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
25% 2003.500000 2.990000 324.000000 9.055000 4.320000 2.925000 50.360000 0.000000 0.395000
50% 2035.000000 5.440000 391.000000 12.010000 12.600000 4.000000 55.830000 1.400000 7.270000
75% 3214.500000 8.460000 505.500000 18.030000 26.855000 7.010000 68.540000 14.700000 51.000000
max 3950.000000 16.720000 876.000000 84.080000 99.480000 99.800000 86.680000 53.200000 74.460000 22
In [ ]:
X=[A,B,C,E,F,G]
for i in X:
    print(i.describe())
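As an alternative to building six separate frames by hand, the same per-bin summaries can be produced by iterating over a groupby object. This is only a sketch on a hypothetical six-row frame (`demo`) using values from rows shown earlier; because its water range differs from the full data, the bin edges here are not the same as in the notebook.

```python
import pandas as pd

# Hypothetical stand-in frame; values are taken from rows shown earlier.
demo = pd.DataFrame({
    "Water_(g)": [0.24, 15.87, 16.72, 42.41, 41.11, 88.10],
    "Energ_Kcal": [876, 717, 718, 353, 371, 63],
})
demo["bins"] = pd.cut(demo["Water_(g)"], 6, labels=["A", "B", "C", "E", "F", "G"])

# observed=True skips labels that have no rows in this small sample.
for label, group in demo.groupby("bins", observed=True):
    print(label)
    print(group["Energ_Kcal"].describe())

# A single statistic per bin, without an explicit loop.
mean_energy = demo.groupby("bins", observed=True)["Energ_Kcal"].mean()
print(mean_energy)
```

The loop yields each label together with its sub-frame, so the printed summaries are identified by bin, which the plain list of frames above does not provide.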
EPW
count 127.000000
mean 131.393840
std 325.416276
min 0.000000
25% 36.045815
50% 72.244898
75% 165.333333
max 3650.000000
NDB_No Water_(g) Energ_Kcal Protein_(g) Lipid_Tot_(g) \
count 10.000000 10.000000 10.000000 10.000000 10.000000
mean 1291.200000 26.282000 416.200000 21.058000 27.528000
std 610.900938 5.029674 115.451577 15.006813 22.019365
min 1023.000000 17.940000 315.000000 0.820000 2.240000
25% 1034.250000 23.237500 344.250000 7.107500 12.982500
50% 1120.000000 27.935000 402.500000 29.115000 27.140000
75% 1153.500000 29.122500 418.750000 31.480000 31.215000
max 3019.000000 33.190000 717.000000 37.860000 81.110000
EPW
count 10.000000
mean 16.917441
std 8.489844
min 10.971787
25% 12.462685
50% 14.511953
75% 17.114903
max 39.966555
NDB_No Water_(g) Energ_Kcal Protein_(g) Lipid_Tot_(g) \
count 52.000000 52.000000 52.000000 52.000000 52.000000
mean 1127.615385 42.910769 336.288462 19.719423 25.033269
std 103.775239 4.096820 46.150791 6.898110 6.883962
min 1006.000000 35.480000 200.000000 2.110000 2.860000
25% 1029.750000 39.242500 308.500000 17.812500 22.262500
50% 1102.500000 42.670000 345.500000 22.025000 26.300000
75% 1238.250000 46.492500 369.500000 24.365000 30.010000
max 1306.000000 50.010000 410.000000 27.350000 33.820000
EPW
count 52.000000
mean 7.990496
std 1.744799
min 4.155412
25% 6.718986
50% 7.913460
75% 9.482712
max 11.341632
NDB_No Water_(g) Energ_Kcal Protein_(g) Lipid_Tot_(g) \
count 41.000000 41.000000 41.000000 41.000000 41.000000
mean 1225.097561 57.147561 236.097561 11.525366 14.667317
std 250.416234 5.581205 66.178095 9.071350 10.715499
min 1007.000000 50.060000 101.000000 0.050000 0.000000
25% 1072.000000 51.800000 176.000000 3.200000 3.080000
50% 1190.000000 56.440000 254.000000 12.000000 14.100000
75% 1243.000000 61.820000 288.000000 18.520000 22.820000
max 2051.000000 66.440000 350.000000 32.140000 36.080000
EPW
count 41.000000
mean 4.236746
std 1.414591
min 1.551221
25% 2.846975
50% 4.373033
75% 5.405937
max 6.651463
NDB_No Water_(g) Energ_Kcal Protein_(g) Lipid_Tot_(g) \
count 175.000000 175.000000 175.000000 175.000000 175.000000
mean 2379.885714 77.783600 105.457143 5.172114 4.262629
std 1123.918049 3.955846 34.991413 4.345693 4.625422
min 1012.000000 66.860000 60.000000 0.100000 0.000000
25% 1183.000000 75.045000 78.000000 2.080000 0.300000
50% 3011.000000 79.000000 98.000000 3.600000 2.700000
75% 3227.500000 81.240000 128.000000 8.085000 6.970000
max 3952.000000 83.200000 208.000000 15.690000 19.520000
EPW
count 175.000000
mean 1.378592
std 0.526554
min 0.721154
25% 0.979906
50% 1.230272
75% 1.685755
max 2.946455
NDB_No Water_(g) Energ_Kcal Protein_(g) Lipid_Tot_(g) \
count 193.000000 193.000000 193.000000 193.000000 193.000000
mean 2757.367876 87.678187 53.357513 2.055285 1.515855
std 947.352355 2.418911 14.250061 1.843721 1.508434
min 1059.000000 83.300000 0.000000 0.000000 0.000000
25% 2053.000000 86.340000 44.000000 0.810000 0.180000
50% 3105.000000 87.540000 56.000000 1.800000 1.000000
75% 3267.000000 88.600000 63.000000 3.140000 2.910000
max 3953.000000 99.900000 97.000000 10.900000 6.930000
EPW
count 193.000000
mean 0.612304
std 0.172423
min 0.000000
25% 0.508475
50% 0.638468
75% 0.726846
max 1.163209
In [ ]: