Data Dictionary Data Dictionary: Set The Working Directory Set The Working Directory

Problem Statement - Food Nutrition
Intro to Python
Data Dictionary
1. NDB_No - Nutrition database number
2. Shrt_Desc - Short description
3. Water_(g) - water in grams per 100 grams
4. Energ_Kcal - Energy in Kcal
5. Protein_(g) - Protein
6. Lipid_Tot_(g) - Total Lipid
7. Ash_(g) - Ash
8. Carbohydrt_(g) - Carbohydrate, by difference
9. Fiber_TD_(g) - Fiber, total dietary
10. Sugar_Tot_(g) - Total Sugars
11. Calcium_(mg) - Calcium
12. Iron_(mg) - Iron
13. Magnesium_(mg) - Magnesium
14. Phosphorus_(mg) - Phosphorus
15. Potassium_(mg) - Potassium
16. Zinc_(mg) - Zinc
17. Copper_(mg) - Copper
18. Manganese_(mg) - Manganese
19. Selenium_(µg) - Selenium
20. Sodium_(mg) - Sodium
In [4]:
#Import all the necessary modules
import os
Set the working directory

In [2]:
os.getcwd()
#os.chdir('C:\\Users\\MyPC\\Desktop\\DataCamp_exercises\\GL_Mentoring\\Python')
Out[2]:
'C:\\Users\\Aman Prakash'
Import Excel file

Load the data file into a Python DataFrame using the pandas read_excel method.
The CSV that we are going to load contains Unicode data, so we will use an additional encoding
parameter, which allows pandas to load the CSV.
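A minimal sketch of loading a file with an explicit encoding, assuming the data is read as a CSV with read_csv. The file name nutrition_sample.csv, the Latin-1 encoding and the two sample rows are assumptions made for illustration only; the real file and its encoding may differ.

```python
import os
import tempfile

import pandas as pd

# Hypothetical stand-in for the real data file: two rows written in Latin-1,
# so the "µ" in the last column name is not valid UTF-8.
sample = (
    'NDB_No,Shrt_Desc,Selenium_(µg)\n'
    '1001,"BUTTER,WITH SALT",1.0\n'
    '1002,"BUTTER,WHIPPED,W/ SALT",0.0\n'
)
path = os.path.join(tempfile.mkdtemp(), "nutrition_sample.csv")
with open(path, "w", encoding="latin-1") as f:
    f.write(sample)

# read_csv assumes UTF-8 by default; passing the encoding the file was
# actually written in lets pandas decode the header correctly.
df_sample = pd.read_csv(path, encoding="latin-1")
print(df_sample.columns.tolist())
```

If the file really is an Excel workbook, the encoding question usually does not arise in the same way, because the workbook format stores text as Unicode.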
In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
sns.set(color_codes = True)
%matplotlib inline
In [6]:
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 600 entries, 0 to 599
Data columns (total 20 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 NDB_No 600 non-null int64
1 Shrt_Desc 600 non-null object
2 Water_(g) 600 non-null float64
3 Energ_Kcal 600 non-null int64
4 Protein_(g) 600 non-null float64
5 Lipid_Tot_(g) 599 non-null float64
6 Ash_(g) 599 non-null float64
7 Carbohydrt_(g) 600 non-null float64
8 Fiber_TD_(g) 600 non-null float64
9 Sugar_Tot_(g) 600 non-null float64
10 Calcium_(mg) 600 non-null int64
11 Iron_(mg) 600 non-null float64
12 Magnesium_(mg) 600 non-null int64
13 Phosphorus_(mg) 600 non-null int64
14 Potassium_(mg) 600 non-null int64
15 Sodium_(mg) 600 non-null int64
16 Zinc_(mg) 600 non-null float64
17 Copper_mg) 600 non-null float64
18 Manganese_(mg) 600 non-null float64
19 Selenium_(µg) 600 non-null float64
dtypes: float64(12), int64(7), object(1)
memory usage: 93.9+ KB
View First 10 rows

In [8]:
df.head(10)
Out[8]:
NDB_No Shrt_Desc Water_(g) Energ_Kcal Protein_(g) Lipid_Tot_(g) Ash_(g) Carbohydrt_(g) Fiber_TD_(g) Sugar_To
0 1001 BUTTER,WITH SALT 15.87 717 0.85 81.11 2.11 0.06 0.0
1 1002 BUTTER,WHIPPED,W/ SALT 16.72 718 0.49 78.30 1.62 2.87 0.0
2 1003 BUTTER OIL,ANHYDROUS 0.24 876 0.28 99.48 0.00 0.00 0.0
3 1004 CHEESE,BLUE 42.41 353 21.40 28.74 NaN 2.34 0.0
4 1005 CHEESE,BRICK 41.11 371 23.24 NaN 3.18 2.79 0.0
5 1006 CHEESE,BRIE 48.42 334 20.75 27.68 2.70 0.45 0.0
6 1007 CHEESE,CAMEMBERT 51.80 300 19.80 24.26 3.68 0.46 0.0
7 1008 CHEESE,CARAWAY 39.28 376 25.18 29.20 3.28 3.06 0.0
8 1009 CHEESE,CHEDDAR 37.02 404 22.87 33.31 3.71 3.09 0.0
9 1010 CHEESE,CHESHIRE 37.65 387 23.37 30.60 3.60 4.78 0.0
The head function is used to view the top records. The number of records to view is given in the
parentheses.
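A small sketch of how the row count argument behaves, using a made-up 30-row frame (toy is an illustrative name, not part of the notebook). tail works the same way from the other end of the frame.

```python
import pandas as pd

# Made-up frame with 30 rows, only to show how many rows come back.
toy = pd.DataFrame({"value": range(30)})

print(len(toy.head()))    # no argument: pandas falls back to its default of 5
print(len(toy.head(10)))  # first 10 rows
print(len(toy.tail(20)))  # last 20 rows
```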
View Last 20 records

The tail function is used to view the last records. The number of records to view is given in the
parentheses.
In [10]:
df.tail(20)
Out[10]:
NDB_No Shrt_Desc Water_(g) Energ_Kcal Protein_(g) Lipid_Tot_(g) Ash_(g) Carbohydrt_(g) Fiber_TD_(g)
580 3934 BABYFOOD,CORN & SWT POTATOES,STR 82.62 68 1.26 0.28 0.61 15.23 1.8
581 3935 INF FORM, ABBO NUTR, SIMIL, ALIMENT, ADVAN, RT... 87.32 67 1.80 3.63 0.48 6.77 0.0
582 3936 INF FORMULA,PBM PRODUC,STO BRA,RTF (FORMERLY W... 88.10 63 1.40 3.50 0.61 6.39 0.0
583 3937 INF FORM,PBM PROD,STORE BRAND,LC,NOT REC (FORM... 76.00 130 2.90 7.00 0.61 13.90 0.0
584 3938 INF FORMULA,PBM PRODUCTS,STORE BRAND,PDR 2.00 524 12.00 28.00 2.00 56.00 0.0
585 3939 INF FORMULA,PBM PRODUCTS,STORE BRAND,SOY,RTF 88.00 63 1.80 3.50 0.61 6.09 0.0
586 3940 INF FORMU,PBM PRODU,STORE BR,SOY,LIQ CONC,NOT ... 76.00 126 3.60 7.00 1.22 12.18 0.0
587 3941 INF FORMULA,PBM PRODUCTS,STORE BRAND,SOY,PDR 2.00 508 13.60 27.20 5.00 52.20 0.0
588 3942 INF FORMULA, MEAD JOHNS, ENFAMI, AR LIPIL, RTF... 87.00 68 1.71 3.49 0.33 7.53 0.0
589 3943 INF FORMULA, MEA JOHNSO, ENFAMI, AR LIPIL, PDR... 3.00 509 12.73 25.97 2.46 56.03 0.0
590 3944 INF FORM,ABBOT NUT,SIMIL NEOSU,RTF,W/ ARA & DHA 87.00 69 1.92 3.77 0.38 6.94 0.0
591 3945 INF FORMULA, ABBOTT, SIMILAC, NEOSURE, PDR, W/... 2.25 520 14.42 28.33 3.26 51.75 0.0
592 3946 INF FORMULA, ABB NUTR, SIMI,SENS (LACT FRE) RT... 87.00 68 1.48 3.74 0.75 7.40 0.0
593 3947 INF FORMULA, ABB NUT,SIMIL,SENS,(LACT FR),LIQ ... 75.88 128 2.73 6.95 0.80 13.64 0.0
594 3948 INF FORMULA, ABB NUTR,SIMIL,SENS,(LACTO FR), P... 2.20 520 11.13 28.14 3.00 55.67 0.0
595 3949 INF FORMULA, ABBOTT NUTR, SIMILAC, ADVANCE, W/... 87.79 66 1.36 3.70 0.38 6.77 0.0
596 3950 INF FORMULA,ABBOTT NUTR,SIMILAC,ADVANC,W/ IRON... 2.25 522 10.89 28.87 3.26 54.73 0.0
597 3951 INF FORMU,ABBO NUTR,SIMILAC,ADVAN,W/ IRON,LIQ ... 76.16 127 2.64 6.89 0.79 13.52 0.0
598 3952 INF FORMULA, ABB NUTR, SIMIL, ISOMIL, ADVA W/ ... 75.81 128 3.13 6.98 0.88 13.21 0.0
599 3953 INF FORMULA, ABBO NUTR,SIMIL,ISOMIL, ADVANCE W... 87.54 66 1.61 3.59 0.66 6.70 0.0
Display the shape of the DataFrame: the first value is the number of rows and the second value is the
number of columns.
In [13]:
df.shape
Out[13]:
(600, 20)
In [ ]:
print("The number of rows are ",data_df.shape[0],"\n ","The number of columns are",data_df.shape[1])
The number of rows are 600

The number of columns are 20
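Because shape is an ordinary (rows, columns) tuple, it can also be unpacked into two names. A short sketch on a made-up frame (toy, n_rows and n_cols are illustrative names):

```python
import pandas as pd

# Made-up 3 x 2 frame.
toy = pd.DataFrame({"a": [1, 2, 3], "b": [4.0, 5.0, 6.0]})

# shape is a tuple, so tuple unpacking gives both dimensions at once.
n_rows, n_cols = toy.shape
print(f"The number of rows is {n_rows}\nThe number of columns is {n_cols}")
```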
Check for missing values

In [15]:
df.isnull().sum()
Out[15]:
NDB_No 0
Shrt_Desc 0
Water_(g) 0
Energ_Kcal 0
Protein_(g) 0
Lipid_Tot_(g) 1
Ash_(g) 1
Carbohydrt_(g) 0
Fiber_TD_(g) 0
Sugar_Tot_(g) 0
Calcium_(mg) 0
Iron_(mg) 0
Magnesium_(mg) 0
Phosphorus_(mg) 0
Potassium_(mg) 0
Sodium_(mg) 0
Zinc_(mg) 0
Copper_mg) 0
Manganese_(mg) 0
Selenium_(µg) 0
dtype: int64
In [16]:
df[df.isna().any(axis=1)]
Out[16]:
NDB_No Shrt_Desc Water_(g) Energ_Kcal Protein_(g) Lipid_Tot_(g) Ash_(g) Carbohydrt_(g) Fiber_TD_(g) Sugar_Tot_(g)
3 1004 CHEESE,BLUE 42.41 353 21.40 28.74 NaN 2.34 0.0 0.50
4 1005 CHEESE,BRICK 41.11 371 23.24 NaN 3.18 2.79 0.0 0.51
In [18]:
df=df.dropna()
In [ ]:
data_df=data_df.dropna()
There are two missing values in the data set, one in each of the variables Lipid_Tot_(g) and Ash_(g). We need to
drop the rows that contain them before proceeding, and we use the dropna method to do so.
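The detect-then-drop sequence can be sketched on a three-row frame built from the cheese rows shown earlier (toy, rows_with_nan and cleaned are illustrative names):

```python
import numpy as np
import pandas as pd

# Three rows taken from the output above; two of them have one NaN each.
toy = pd.DataFrame({
    "Shrt_Desc": ["CHEESE,BLUE", "CHEESE,BRICK", "CHEESE,BRIE"],
    "Lipid_Tot_(g)": [28.74, np.nan, 27.68],
    "Ash_(g)": [np.nan, 3.18, 2.70],
})

print(toy.isnull().sum())                     # missing values per column
rows_with_nan = toy[toy.isna().any(axis=1)]   # rows with at least one NaN
cleaned = toy.dropna()                        # drops any row containing a NaN
print(cleaned.shape)
```

Note that dropna removes a whole row even when only one of its values is missing, which is why two missing values cost two rows here.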
In [ ]:
data_df.isnull().sum()
Out[ ]:
NDB_No 0
Shrt_Desc 0
Water_(g) 0
Energ_Kcal 0
Protein_(g) 0
Lipid_Tot_(g) 0
Ash_(g) 0
Carbohydrt_(g) 0
Fiber_TD_(g) 0
Sugar_Tot_(g) 0
Calcium_(mg) 0
Iron_(mg) 0
Magnesium_(mg) 0
Phosphorus_(mg) 0
Potassium_(mg) 0
Sodium_(mg) 0
Zinc_(mg) 0
Copper_mg) 0
Manganese_(mg) 0
Selenium_(µg) 0
dtype: int64
We can now see that there are no missing values in our data set.
Summary of the data

To see the summary of any DataFrame we use the describe function.
In [19]:
df.describe()
Out[19]:
NDB_No Water_(g) Energ_Kcal Protein_(g) Lipid_Tot_(g) Ash_(g) Carbohydrt_(g) Fiber_TD_(g) Sugar_Tot_(g) Calc
count 598.000000 598.000000 598.000000 598.000000 598.000000 598.000000 598.000000 598.000000 598.000000
mean 2305.448161 60.415870 186.521739 8.640385 8.849649 2.542040 19.395920 2.587458 9.122726
std 1081.807819 31.929817 158.540534 10.939719 12.262311 6.494483 21.522166 7.594803 13.927320
min 1001.000000 0.200000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
25% 1173.250000 39.167500 63.000000 1.702500 0.697500 0.490000 5.560000 0.000000 0.560000
50% 2052.500000 76.180000 106.000000 3.900000 3.475000 0.890000 10.840000 0.000000 4.820000
75% 3190.750000 85.987500 314.500000 12.385000 13.000000 3.260000 19.147500 1.200000 10.692500
max 3953.000000 99.900000 876.000000 84.080000 99.480000 99.800000 86.680000 53.200000 74.460000
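The result of describe is itself a DataFrame, so a single statistic can be picked out with loc. A short sketch on four energy values taken from the rows shown earlier (toy and summary are illustrative names):

```python
import pandas as pd

toy = pd.DataFrame({
    "Shrt_Desc": ["BUTTER,WITH SALT", "BUTTER,WHIPPED,W/ SALT",
                  "BUTTER OIL,ANHYDROUS", "CHEESE,BRIE"],
    "Energ_Kcal": [717, 718, 876, 334],
})

# With mixed column types, describe() summarises only the numeric columns.
summary = toy.describe()
print(summary.loc["mean", "Energ_Kcal"])
```

This is why Shrt_Desc does not appear in the summary table above.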
Create a vector “test” using the top 10 values of variable Protein_(g)

We pass the variable name inside the square brackets of the DataFrame and then use the head function to retrieve the
top 10 records for that variable.
In [22]:
test= df['Protein_(g)'].head(10)
test
Out[22]:
0 0.85
1 0.49
2 0.28
5 20.75
6 19.80
7 25.18
8 22.87
9 23.37
10 23.76
11 11.12
Name: Protein_(g), dtype: float64
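The index in the output above jumps from 2 to 5 because dropna removed rows 3 and 4 but kept the original index labels. A minimal sketch of that behaviour, using values from the rows shown earlier; renumbering with reset_index is an optional extra step, not something the notebook does:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "Protein_(g)": [0.85, 0.49, 0.28, 21.40, 23.24, 20.75],
    "Lipid_Tot_(g)": [81.11, 78.30, 99.48, 28.74, np.nan, 27.68],
    "Ash_(g)": [2.11, 1.62, 0.00, np.nan, 3.18, 2.70],
})

kept = toy.dropna()                        # rows 3 and 4 disappear
print(kept["Protein_(g)"].index.tolist())  # labels are not renumbered

# Optional: renumber from 0 if contiguous labels are wanted.
renumbered = kept.reset_index(drop=True)
```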
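Select the top 5 rows of initial 5 variables in a matrix format

To fetch the top 5 rows of 5 variables we use iloc and pass positional index ranges for both rows and columns. The column range 10:15 used in the cell below selects the variables at positions 10 to 14 (Calcium_(mg) to Potassium_(mg)); the initial 5 variables correspond to the range 0:5.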
In [24]:
df.iloc[0:5,10:15]
Out[24]:
Calcium_(mg) Iron_(mg) Magnesium_(mg) Phosphorus_(mg) Potassium_(mg)
0 24 0.02 2 24 24
1 23 0.05 1 24 41
2 4 0.00 0 3 5
5 184 0.50 20 188 152
6 388 0.33 20 347 187
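A short sketch of how iloc ranges map to positions, on a made-up frame with generic column names (toy, first_block and later_block are illustrative names). Both ranges are half-open, so 0:5 means positions 0 to 4.

```python
import pandas as pd

# Made-up frame: 8 rows, 12 columns named col0 ... col11.
toy = pd.DataFrame({f"col{i}": range(8) for i in range(12)})

first_block = toy.iloc[0:5, 0:5]    # first 5 rows, first 5 columns
later_block = toy.iloc[0:5, 10:12]  # first 5 rows, columns at positions 10 and 11

print(first_block.columns.tolist())
print(later_block.columns.tolist())
```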
What is the datatype of the Sodium_(mg) variable

To check the data type of any variable we use the dtype property.
In [25]:
df['Sodium_(mg)'].dtype
Out[25]:
dtype('int64')
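For comparison, dtype works on a single column while dtypes lists every column at once. A small sketch with made-up values (toy is an illustrative name):

```python
import pandas as pd

# Made-up values: whole numbers for sodium, decimals for zinc.
toy = pd.DataFrame({"Sodium_(mg)": [714, 827, 2], "Zinc_(mg)": [0.09, 0.05, 0.01]})

print(toy["Sodium_(mg)"].dtype)  # one column
print(toy.dtypes)                # all columns
```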
Create a new variable “EPW” by dividing Energ_Kcal by Water_(g); what is the dimension of the new dataset?
In [ ]:
data_df_new=data_df[data_df['Energ_Kcal']<500]
In [ ]:
data_df_new.shape
In [ ]:
#dimension will be checked using the shape
data_df.shape
Out[ ]:
(598, 21)
In [ ]:
data_df.head(5)
Out[ ]:
0 1001 BUTTER,WITH SALT 15.87 717 0.85 81.11 2.11 0.06 0.0
1 1002 BUTTER,WHIPPED,W/ SALT 16.72 718 0.49 78.30 1.62 2.87 0.0
2 1003 BUTTER OIL,ANHYDROUS 0.24 876 0.28 99.48 0.00 0.00 0.0
5 1006 CHEESE,BRIE 48.42 334 20.75 27.68 2.70 0.45 0.0
5 rows × 21 columns
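A minimal sketch of the EPW step, assuming EPW is the element-wise ratio Energ_Kcal / Water_(g). The two-row frame is a stand-in built from values shown earlier, and toy is an illustrative name.

```python
import pandas as pd

toy = pd.DataFrame({
    "Shrt_Desc": ["BUTTER,WITH SALT", "CHEESE,BRIE"],
    "Water_(g)": [15.87, 48.42],
    "Energ_Kcal": [717, 334],
})

# Column-wise division; assigning to a new name adds one column.
toy["EPW"] = toy["Energ_Kcal"] / toy["Water_(g)"]
print(toy.shape)
```

Adding one column leaves the row count unchanged and raises the column count by one, which matches the move from 20 to 21 columns in the shape shown above.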
Create a subset of the dataset, where the Energ_Kcal is less
than 500, what is the dimension of this new dataset?
We will create a new data frame by using the subset conditions from the original data_df and will use shape
property for checking the dimension.
In [ ]:
data_df_new=data_df[data_df['Energ_Kcal']<500]
The subset has fewer rows because 36 rows do not satisfy Energ_Kcal < 500, leaving 598 - 36 = 562 rows.
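The filtering step can be sketched with an explicit boolean mask. The frame is made up for illustration (the row with exactly 500 kcal is hypothetical, included to show that the boundary value is excluded); toy, mask and subset are illustrative names.

```python
import pandas as pd

toy = pd.DataFrame({
    "Shrt_Desc": ["BUTTER,WITH SALT", "CHEESE,BLUE", "HYPOTHETICAL ITEM", "CHEESE,CAMEMBERT"],
    "Energ_Kcal": [717, 353, 500, 300],
})

mask = toy["Energ_Kcal"] < 500   # True for rows to keep
subset = toy[mask]

print(subset.shape)              # dimension of the subset
print((~mask).sum())             # number of rows filtered out
```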
Find the top 10 products based on the following

1. Higher the Energ_Kcal, higher the ranking
Here we rank by energy value, using the sort_values method to order the data. Since a higher value should get
a higher rank, we sort the data in descending order so that the highest value comes first.
2. Lower the water content, higher the ranking
Since a lower value should get a higher rank, we sort the data in ascending order so that the lowest value
comes first.
In [ ]:
# Lowest water content first; keep the top 10
df.sort_values(by='Water_(g)', ascending=True).head(10)
Out[ ]:
NDB_No Shrt_Desc Water_(g) Energ_Kcal Protein_(g) Lipid_Tot_(g) Ash_(g) Carbohydrt_(g) Fib
311 2074 SEASONING MIX,DRY,SAZON,CORIANDER & ANNATTO 0.20 0 0.00 0.00 99.80 0.00
295 2047 SALT,TABLE 0.20 0 0.00 0.00 99.80 0.00
2 1003 BUTTER OIL,ANHYDROUS 0.24 876 0.28 99.48 0.00 0.00
63 1070 DESSERT TOPPING,POWDERED 1.47 577 4.90 39.92 1.17 52.54
175 1206 CREAM SUB,FLAV,PDR 1.52 482 0.68 21.47 0.79 75.42
443 3184 BABYFOOD,CRL,WHL WHEAT,W/ APPLS,DRY FORT 1.70 402 6.60 4.80 3.70 83.20
122 1134 EGG,WHL,DRIED,STABILIZED,GLUCOSE RED 1.87 615 48.17 43.95 3.63 2.38
587 3941 INF FORMULA,PBM PRODUCTS,STORE BRAND,SOY,PDR 2.00 508 13.60 27.20 5.00 52.20
535 3821 INF FORMULA. MEAD JOHNSON, PREGESTIMIL, W/IRON... 2.00 519 14.00 28.00 3.20 52.80
584 3938 INF FORMULA,PBM PRODUCTS,STORE BRAND,PDR 2.00 524 12.00 28.00 2.00 56.00
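Both rankings can be sketched on a four-row frame built from values shown earlier (toy, top_energy and top_dry are illustrative names). nlargest is shown as an alternative that sorts and truncates in one call; it is an addition here, not something the notebook uses.

```python
import pandas as pd

toy = pd.DataFrame({
    "Shrt_Desc": ["BUTTER,WITH SALT", "SALT,TABLE", "BUTTER OIL,ANHYDROUS", "CHEESE,BRIE"],
    "Water_(g)": [15.87, 0.20, 0.24, 48.42],
    "Energ_Kcal": [717, 0, 876, 334],
})

# 1. Higher energy ranks first: descending sort, then keep the top rows.
top_energy = toy.sort_values(by="Energ_Kcal", ascending=False).head(2)

# 2. Lower water ranks first: ascending sort, then keep the top rows.
top_dry = toy.sort_values(by="Water_(g)", ascending=True).head(2)

# Alternative for ranking 1: sort and truncate in a single call.
top_energy_alt = toy.nlargest(2, "Energ_Kcal")

print(top_energy["Shrt_Desc"].tolist())
print(top_dry["Shrt_Desc"].tolist())
```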
Create a subset of the data where the product description (Shrt_Desc) contains “CHEESE” and list down the summary statistics of the subset

To find a specific value we use the contains method. We pass the string to be located in the DataFrame
variable inside the parentheses.
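A minimal sketch of the text filter on a four-row frame built from values shown earlier (toy, is_cheese and cheese are illustrative names). regex=False is an optional choice made here so that the search string is treated as plain text.

```python
import pandas as pd

toy = pd.DataFrame({
    "Shrt_Desc": ["BUTTER,WITH SALT", "CHEESE,BLUE", "CHEESE,BRIE", "SALT,TABLE"],
    "Energ_Kcal": [717, 353, 334, 0],
})

# str.contains returns a boolean Series: True where the text includes "CHEESE".
is_cheese = toy["Shrt_Desc"].str.contains("CHEESE", regex=False)
cheese = toy[is_cheese]

print(cheese.shape)
print(cheese.describe())
```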
In [ ]:
data_df_cheese=data_df[data_df['Shrt_Desc'].str.contains('CHEESE')]
In [ ]:
data_df_cheese.shape
Out[ ]:
(74, 21)
In [ ]:
data_df_cheese.describe()
Out[ ]:
NDB_No Water_(g) Energ_Kcal Protein_(g) Lipid_Tot_(g) Ash_(g) Carbohydrt_(g) Fiber_TD_(g) Sugar_Tot_(g) Calcium
count 74.000000 74.000000 74.000000 74.000000 74.000000 74.000000 74.000000 74.000000 74.000000 74.
mean 1102.351351 48.021892 302.945946 20.956216 22.258378 4.102838 4.697568 0.005405 2.050541 632.
std 95.658702 14.181385 98.547797 6.423714 9.960468 1.730940 5.791562 0.028148 3.257006 296.
min 1006.000000 13.440000 72.000000 6.150000 0.000000 1.020000 0.000000 0.000000 0.000000 53.
25% 1024.250000 39.305000 264.000000 17.045000 19.480000 3.272500 2.010000 0.000000 0.322500 497.
50% 1042.500000 44.550000 330.000000 21.760000 25.320000 3.790000 3.300000 0.000000 0.955000 676.
75% 1181.750000 52.415000 372.500000 24.502500 29.800000 5.195000 5.395000 0.000000 2.365000 759.
max 1271.000000 82.480000 466.000000 37.860000 35.590000 8.030000 42.650000 0.200000 23.670000 1375.
Using the cut function on water variable divide the whole data
into 6 bins, list down the summary statistics of all the 6
bins.
In [ ]:
# Work on a copy so that adding the bins column does not modify data_df
data_bins=data_df.copy()
In [ ]:
data_bins['bins']=pd.cut(data_bins['Water_(g)'],6, labels =["A", "B", "C","E","F","G"])
In [ ]:
data_bins.head(5)
Out[ ]:
0 1001 BUTTER,WITH SALT 15.87 717 0.85 81.11 2.11 0.06 0.0
1 1002 BUTTER,WHIPPED,W/ SALT 16.72 718 0.49 78.30 1.62 2.87 0.0
2 1003 BUTTER OIL,ANHYDROUS 0.24 876 0.28 99.48 0.00 0.00 0.0
5 1006 CHEESE,BRIE 48.42 334 20.75 27.68 2.70 0.45 0.0

In [ ]:
data_bins["bins"].value_counts()
Out[ ]:
G 193
F 175
A 127
C 52
E 41
B 10
Name: bins, dtype: int64
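The counts above are uneven because passing an integer to cut produces bins of equal width, not bins with equal numbers of rows. A small sketch on six water values (taken from or in the range of the data above; water, bins and edges are illustrative names) that also returns the bin edges:

```python
import pandas as pd

water = pd.Series([0.2, 15.87, 42.41, 51.80, 76.18, 99.9], name="Water_(g)")
labels = ["A", "B", "C", "E", "F", "G"]

# An integer bin count splits the min-to-max range into equal-width intervals;
# retbins=True also returns the computed edges.
bins, edges = pd.cut(water, 6, labels=labels, retbins=True)

print(bins.tolist())
print(edges)
```

When bins with roughly equal row counts are wanted instead, pd.qcut is the quantile-based counterpart.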
We can either do this in one go, as shown below
In [ ]:
data_bins.groupby("bins").describe().T
Out[ ]:
bins A B C E F G
count 127.000000 10.000000 52.000000 41.000000 175.000000 193.000000
mean 2427.000000 1291.200000 1127.615385 1225.097561 2379.885714 2757.367876
NDB_No std 1006.346851 610.900938 103.775239 250.416234 1123.918049 947.352355
min 1001.000000 1023.000000 1006.000000 1007.000000 1012.000000 1059.000000
25% 2003.500000 1034.250000 1029.750000 1072.000000 1183.000000 2053.000000
... ... ... ... ... ... ... ...
min 0.000000 10.971787 4.155412 1.551221 0.721154 0.000000
25% 36.045815 12.462685 6.718986 2.846975 0.979906 0.508475
EPW 50% 72.244898 14.511953 7.913460 4.373033 1.230272 0.638468
75% 165.333333 17.114903 9.482712 5.405937 1.685755 0.726846
max 3650.000000 39.966555 11.341632 6.651463 2.946455 1.163209
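A compact sketch of the grouped summary for a single column, on a made-up frame with two bins (toy and per_bin are illustrative names; the energy values are for illustration only). Iterating over the groupby object gives each bin's rows without building the subsets by hand.

```python
import pandas as pd

toy = pd.DataFrame({
    "bins": ["A", "A", "G", "G", "G"],
    "Energ_Kcal": [717, 876, 56, 44, 63],
})

# One row of summary statistics per bin for the selected column.
per_bin = toy.groupby("bins")["Energ_Kcal"].describe()
print(per_bin)

# Each iteration yields the bin label and the rows that fall in that bin.
for label, group in toy.groupby("bins"):
    print(label, len(group), group["Energ_Kcal"].mean())
```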
Or we can create subsets of the data on the basis of the bins
In [ ]:
A=data_bins[data_bins["bins"]=="A"]
B=data_bins[data_bins["bins"]=="B"]
C=data_bins[data_bins["bins"]=="C"]
E=data_bins[data_bins["bins"]=="E"]
F=data_bins[data_bins["bins"]=="F"]
G=data_bins[data_bins["bins"]=="G"]
In [ ]:
A.describe() # You can do for others
Out[ ]:
count 127.000000 127.000000 127.000000 127.000000 127.000000 127.000000 127.000000 127.000000 127.000000
mean 2427.000000 5.964094 405.181102 16.981339 16.340157 7.032441 53.572756 9.881102 20.913150
std 1006.346851 3.634829 123.142286 15.983079 16.559315 12.780734 21.142774 14.068132 23.569114
min 1001.000000 0.200000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
25% 2003.500000 2.990000 324.000000 9.055000 4.320000 2.925000 50.360000 0.000000 0.395000
50% 2035.000000 5.440000 391.000000 12.010000 12.600000 4.000000 55.830000 1.400000 7.270000
75% 3214.500000 8.460000 505.500000 18.030000 26.855000 7.010000 68.540000 14.700000 51.000000
max 3950.000000 16.720000 876.000000 84.080000 99.480000 99.800000 86.680000 53.200000 74.460000 22
Or loop over the subsets.
In [ ]:
X=[A,B,C,E,F,G]
for i in X:
    print(i.describe())
NDB_No Water_(g) Energ_Kcal Protein_(g) Lipid_Tot_(g) \

count 127.000000 127.000000 127.000000 127.000000 127.000000
mean 2427.000000 5.964094 405.181102 16.981339 16.340157
std 1006.346851 3.634829 123.142286 15.983079 16.559315
min 1001.000000 0.200000 0.000000 0.000000 0.000000
25% 2003.500000 2.990000 324.000000 9.055000 4.320000
50% 2035.000000 5.440000 391.000000 12.010000 12.600000
75% 3214.500000 8.460000 505.500000 18.030000 26.855000
max 3950.000000 16.720000 876.000000 84.080000 99.480000
Ash_(g) Carbohydrt_(g) Fiber_TD_(g) Sugar_Tot_(g) Calcium_(mg) \

count 127.000000 127.000000 127.000000 127.000000 127.000000
mean 7.032441 53.572756 9.881102 20.913150 618.330709
std 12.780734 21.142774 14.068132 23.569114 530.134186
min 0.000000 0.000000 0.000000 0.000000 0.000000
25% 2.925000 50.360000 0.000000 0.395000 236.500000
50% 4.000000 55.830000 1.400000 7.270000 470.000000
75% 7.010000 68.540000 14.700000 51.000000 912.000000
max 99.800000 86.680000 53.200000 74.460000 2240.000000
Iron_(mg) Magnesium_(mg) Phosphorus_(mg) Potassium_(mg) \

count 127.000000 127.000000 127.000000 127.000000
mean 17.172126 136.440945 358.480315 935.023622
std 21.835832 137.197902 263.466297 801.534456
min 0.000000 0.000000 0.000000 5.000000
25% 3.195000 41.000000 169.500000 500.500000
50% 9.260000 90.000000 299.000000 636.000000
75% 20.470000 180.500000 449.000000 1263.500000
max 123.600000 711.000000 1349.000000 4740.000000
Sodium_(mg) Zinc_(mg) Copper_mg) Manganese_(mg) Selenium_(µg) \

count 127.000000 127.000000 127.000000 127.000000 127.000000
mean 858.181102 3.765433 0.485024 3.675102 26.554331
std 3848.331336 2.743998 0.452203 8.656279 37.762413
min 0.000000 0.010000 0.000000 0.000000 0.000000
25% 51.000000 1.795000 0.138500 0.124500 6.750000
50% 151.000000 3.860000 0.386000 0.677000 14.100000
75% 336.500000 5.145000 0.639000 2.660000 27.300000
max 38758.000000 17.460000 2.467000 60.127000 208.100000
EPW
count 127.000000
mean 131.393840
std 325.416276
min 0.000000
25% 36.045815
50% 72.244898
75% 165.333333
max 3650.000000
count 10.000000 10.000000 10.000000 10.000000 10.000000
mean 1291.200000 26.282000 416.200000 21.058000 27.528000
std 610.900938 5.029674 115.451577 15.006813 22.019365
min 1023.000000 17.940000 315.000000 0.820000 2.240000
25% 1034.250000 23.237500 344.250000 7.107500 12.982500
50% 1120.000000 27.935000 402.500000 29.115000 27.140000
75% 1153.500000 29.122500 418.750000 31.480000 31.215000
max 3019.000000 33.190000 717.000000 37.860000 81.110000

count 10.000000 10.000000 10.000000 10.000000 10.000000
mean 3.925000 21.312000 0.200000 17.788000 683.700000
std 2.586694 29.027888 0.632456 27.866648 485.891632
min 0.090000 0.060000 0.000000 0.060000 18.000000
25% 1.762500 2.432500 0.000000 0.452500 259.250000
50% 4.010000 3.520000 0.000000 0.850000 874.000000
75% 6.302500 44.277500 0.000000 37.847500 1050.750000
max 7.180000 76.610000 2.000000 68.650000 1253.000000

count 10.00000 10.000000 10.000000 10.000000
mean 0.64300 31.900000 464.300000 164.100000
std 0.56362 17.175241 304.899637 131.363321
min 0.02000 2.000000 23.000000 24.000000
25% 0.17500 23.000000 208.000000 82.250000
50% 0.63000 35.000000 616.000000 94.500000
75% 0.85750 43.250000 720.250000 279.000000
max 1.88000 54.000000 760.000000 371.000000

count 10.000000 10.000000 10.000000 10.000000 10.000000
mean 772.700000 2.017000 0.107500 0.111500 13.520000
std 732.488612 1.513848 0.195091 0.220075 11.120332
min 11.000000 0.090000 0.004000 0.002000 1.000000
25% 127.500000 0.827500 0.019500 0.008750 3.400000
50% 568.500000 2.085000 0.032000 0.020000 14.500000
75% 1418.750000 3.080000 0.039250 0.059000 20.575000
max 1804.000000 4.200000 0.627000 0.700000 34.400000
EPW
count 10.000000
mean 16.917441
std 8.489844
min 10.971787
25% 12.462685
50% 14.511953
75% 17.114903
max 39.966555
count 52.000000 52.000000 52.000000 52.000000 52.000000
mean 1127.615385 42.910769 336.288462 19.719423 25.033269
std 103.775239 4.096820 46.150791 6.898110 6.883962
min 1006.000000 35.480000 200.000000 2.110000 2.860000
25% 1029.750000 39.242500 308.500000 17.812500 22.262500
50% 1102.500000 42.670000 345.500000 22.025000 26.300000
75% 1238.250000 46.492500 369.500000 24.365000 30.010000
max 1306.000000 50.010000 410.000000 27.350000 33.820000

count 52.000000 52.000000 52.000000 52.000000 52.000000
mean 4.000577 8.339615 0.396154 4.217308 615.557692
std 1.806044 11.989693 1.615120 6.739612 277.814045
min 0.390000 0.120000 0.000000 0.000000 63.000000
25% 3.202500 1.977500 0.000000 0.487500 538.750000
50% 3.860000 3.075000 0.000000 1.175000 676.500000
75% 5.037500 8.380000 0.000000 5.567500 718.500000
max 8.030000 42.860000 9.300000 25.000000 1375.000000

count 52.000000 52.000000 52.000000 52.000000
mean 0.401923 28.307692 456.057692 147.384615
std 0.305255 12.983456 195.152973 89.547699
min 0.000000 10.000000 26.000000 41.000000
25% 0.180000 23.750000 392.750000 84.000000
50% 0.365000 27.000000 462.000000 125.500000
75% 0.567500 29.000000 546.500000 188.000000
max 1.620000 100.000000 875.000000 455.000000

count 52.000000 52.000000 52.000000 52.000000 52.000000
mean 793.519231 2.711154 0.061327 0.128462 15.911538
std 486.562955 1.104816 0.095742 0.227085 8.216809
min 21.000000 0.210000 0.008000 0.008000 1.100000
25% 603.000000 2.285000 0.025000 0.013750 14.500000
50% 702.000000 2.960000 0.032500 0.027500 14.500000
75% 1007.250000 3.512500 0.042250 0.063500 18.075000
max 1809.000000 4.440000 0.564000 0.677000 43.400000
EPW
count 52.000000
mean 7.990496
std 1.744799
min 4.155412
25% 6.718986
50% 7.913460
75% 9.482712
max 11.341632
count 41.000000 41.000000 41.000000 41.000000 41.000000
mean 1225.097561 57.147561 236.097561 11.525366 14.667317
std 250.416234 5.581205 66.178095 9.071350 10.715499
min 1007.000000 50.060000 101.000000 0.050000 0.000000
25% 1072.000000 51.800000 176.000000 3.200000 3.080000
50% 1190.000000 56.440000 254.000000 12.000000 14.100000
75% 1243.000000 61.820000 288.000000 18.520000 22.820000
max 2051.000000 66.440000 350.000000 32.140000 36.080000

count 41.000000 41.000000 41.000000 41.000000 41.000000
mean 2.716829 12.317317 0.804878 8.076098 310.390244
std 2.509094 11.064190 2.607676 8.315843 309.052170
min 0.180000 0.000000 0.000000 0.000000 0.000000
25% 0.760000 2.980000 0.000000 0.560000 90.000000
50% 1.580000 10.600000 0.000000 6.700000 134.000000
75% 3.680000 17.130000 0.200000 12.650000 529.000000
max 10.370000 39.640000 14.000000 33.040000 1109.000000

count 41.000000 41.000000 41.000000 41.000000
mean 1.027073 29.878049 327.780488 166.170732
std 2.840742 36.821322 306.890413 116.825704
min 0.000000 1.000000 6.000000 18.000000
25% 0.090000 10.000000 83.000000 97.000000
50% 0.190000 16.000000 256.000000 132.000000
75% 0.530000 28.000000 484.000000 204.000000
max 17.450000 160.000000 1024.000000 609.000000

count 41.000000 41.000000 41.000000 41.000000 41.000000
mean 580.000000 1.545366 0.085561 0.297707 17.307317
std 725.226482 1.269847 0.160070 0.375448 18.935752
min 4.000000 0.010000 0.000000 0.001000 0.000000
25% 61.000000 0.310000 0.018000 0.011000 2.300000
50% 146.000000 1.640000 0.027000 0.080000 12.700000
75% 1000.000000 2.500000 0.077000 0.677000 19.300000
max 3487.000000 4.110000 0.732000 1.719000 56.900000
EPW
count 41.000000
mean 4.236746
std 1.414591
min 1.551221
25% 2.846975
50% 4.373033
75% 5.405937
max 6.651463
count 175.000000 175.000000 175.000000 175.000000 175.000000
mean 2379.885714 77.783600 105.457143 5.172114 4.262629
std 1123.918049 3.955846 34.991413 4.345693 4.625422
min 1012.000000 66.860000 60.000000 0.100000 0.000000
25% 1183.000000 75.045000 78.000000 2.080000 0.300000
50% 3011.000000 79.000000 98.000000 3.600000 2.700000
75% 3227.500000 81.240000 128.000000 8.085000 6.970000
max 3952.000000 83.200000 208.000000 15.690000 19.520000

count 175.000000 175.000000 175.000000 175.000000 175.000000
mean 0.879371 11.888057 0.498286 6.842343 94.822857
std 0.857238 6.400474 1.338659 6.095994 79.846674
min 0.140000 0.000000 0.000000 0.000000 3.000000
25% 0.525000 6.830000 0.000000 0.340000 16.000000
50% 0.790000 13.210000 0.000000 6.400000 91.000000
75% 1.030000 17.070000 0.650000 11.565000 136.000000
max 10.300000 23.530000 14.100000 21.000000 351.000000

count 175.000000 175.000000 175.000000 175.000000
mean 1.487200 14.777143 95.714286 155.342857
std 2.604518 14.799485 67.594366 82.264622
min 0.000000 0.000000 3.000000 18.000000
25% 0.115000 8.500000 45.000000 116.000000
50% 0.350000 11.000000 96.000000 138.000000
75% 1.745000 16.000000 124.000000 182.500000
max 14.420000 100.000000 523.000000 668.000000

count 175.000000 175.00000 175.000000 175.000000 175.000000
mean 93.805714 0.82040 0.067914 0.324766 9.161143
std 287.822297 0.92372 0.056852 0.329973 12.514446
min 0.000000 0.01000 0.000000 0.000000 0.000000
25% 31.000000 0.33000 0.027000 0.019000 2.200000
50% 47.000000 0.67000 0.056000 0.065000 3.500000
75% 80.000000 1.06000 0.092000 0.677000 9.650000
max 3663.000000 7.50000 0.329000 1.176000 43.400000
EPW
count 175.000000
mean 1.378592
std 0.526554
min 0.721154
25% 0.979906
50% 1.230272
75% 1.685755
max 2.946455
count 193.000000 193.000000 193.000000 193.000000 193.000000
mean 2757.367876 87.678187 53.357513 2.055285 1.515855
std 947.352355 2.418911 14.250061 1.843721 1.508434
min 1059.000000 83.300000 0.000000 0.000000 0.000000
25% 2053.000000 86.340000 44.000000 0.810000 0.180000
50% 3105.000000 87.540000 56.000000 1.800000 1.000000
75% 3267.000000 88.600000 63.000000 3.140000 2.910000
max 3953.000000 99.900000 97.000000 10.900000 6.930000

count 193.000000 193.000000 193.000000 193.000000 193.000000
mean 0.593057 8.097513 0.775130 4.526995 51.766839
std 0.660497 3.208327 0.988003 3.666433 51.241082
min 0.000000 0.000000 0.000000 0.000000 0.000000
25% 0.330000 6.090000 0.000000 1.250000 10.000000
50% 0.480000 7.550000 0.100000 4.460000 28.000000
75% 0.700000 10.300000 1.300000 7.180000 73.000000
max 8.040000 16.000000 6.800000 14.400000 208.000000

count 193.000000 193.000000 193.000000 193.000000
mean 0.537150 10.585492 45.155440 125.056995
std 1.036343 9.302452 35.711469 76.524791
min 0.000000 0.000000 0.000000 0.000000
25% 0.140000 5.000000 15.000000 75.000000
50% 0.320000 9.000000 35.000000 115.000000
75% 0.530000 13.000000 63.000000 156.000000
max 11.870000 64.000000 157.000000 738.000000

count 193.000000 193.000000 193.000000 193.000000 193.000000
mean 52.284974 0.363368 0.048036 0.326145 4.057513
std 189.002857 0.267369 0.050270 0.324148 9.247283
min 0.000000 0.000000 0.000000 0.000000 0.000000
25% 6.000000 0.120000 0.022000 0.014000 0.400000
50% 24.000000 0.360000 0.040000 0.140000 1.800000
75% 47.000000 0.500000 0.053000 0.677000 2.800000
max 2348.000000 1.180000 0.385000 1.264000 43.400000
EPW
count 193.000000
mean 0.612304
std 0.172423
min 0.000000
25% 0.508475
50% 0.638468
75% 0.726846
max 1.163209
In [ ]:
