Introduction

This data set comes from data.world, an online data-sharing site. It was initially collected in 2014 and revised in 2015. It covers 8,790 food items and lists the amount of each nutrient they contain. The original source is the United States Department of Agriculture.

Here is the link to my data: https://data.world/awram/food-nutritional-values

In [1]:
import pandas as pd
df = pd.read_excel('https://query.data.world/s/h5xr3mwavgls7t2g65zgtpgy2yvz4y?dws=00000')

df
Out[1]:
NDB_No Shrt_Desc Water_(g) Energ_Kcal Protein_(g) Lipid_Tot_(g) Ash_(g) Carbohydrt_(g) Fiber_TD_(g) Sugar_Tot_(g) ... Vit_K_(µg) FA_Sat_(g) FA_Mono_(g) FA_Poly_(g) Cholestrl_(mg) GmWt_1 GmWt_Desc1 GmWt_2 GmWt_Desc2 Refuse_Pct
0 1001 BUTTER,WITH SALT 15.87 717 0.85 81.11 2.11 0.06 0.0 0.06 ... 7.0 51.368 21.021 3.043 215.0 5.00 1 pat, (1" sq, 1/3" high) 14.2 1 tbsp 0.0
1 1002 BUTTER,WHIPPED,W/ SALT 16.72 718 0.49 78.30 1.62 2.87 0.0 0.06 ... 4.6 45.390 19.874 3.331 225.0 3.80 1 pat, (1" sq, 1/3" high) 9.4 1 tbsp 0.0
2 1003 BUTTER OIL,ANHYDROUS 0.24 876 0.28 99.48 0.00 0.00 0.0 0.00 ... 8.6 61.924 28.732 3.694 256.0 12.80 1 tbsp 205.0 1 cup 0.0
3 1004 CHEESE,BLUE 42.41 353 21.40 28.74 5.11 2.34 0.0 0.50 ... 2.4 18.669 7.778 0.800 75.0 28.35 1 oz 17.0 1 cubic inch 0.0
4 1005 CHEESE,BRICK 41.11 371 23.24 29.68 3.18 2.79 0.0 0.51 ... 2.5 18.764 8.598 0.784 94.0 132.00 1 cup, diced 113.0 1 cup, shredded 0.0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
8785 83110 MACKEREL,SALTED 43.00 305 18.50 25.10 13.40 0.00 0.0 0.00 ... 7.8 7.148 8.320 6.210 95.0 80.00 1 piece, (5-1/2" x 1-1/2" x 1/2") 17.0 1 cubic inch, boneless 0.0
8786 90240 SCALLOP,(BAY&SEA),CKD,STMD 70.25 111 20.54 0.84 2.97 5.41 0.0 0.00 ... 0.0 0.218 0.082 0.222 41.0 85.00 3 oz NaN NaN 0.0
8787 90480 SYRUP,CANE 26.00 269 0.00 0.00 0.86 73.14 0.0 73.20 ... 0.0 0.000 0.000 0.000 0.0 21.00 1 serving NaN NaN 0.0
8788 90560 SNAIL,RAW 79.20 90 16.10 1.40 1.30 2.00 0.0 0.00 ... 0.1 0.361 0.259 0.252 50.0 85.00 3 oz NaN NaN 0.0
8789 93600 TURTLE,GREEN,RAW 78.50 89 19.80 0.50 1.20 0.00 0.0 0.00 ... 0.1 0.127 0.088 0.170 50.0 85.00 3 oz NaN NaN 0.0

8790 rows × 53 columns

Preprocessing

In [2]:
# Imports

import numpy as np
import pandas as pd
import seaborn as sns
from sklearn.preprocessing import StandardScaler
import matplotlib.pyplot as plt

In the next two cells I filter the data set down to the columns I am most interested in and then remove any rows with missing values. Comparing the shape of the dataframe before and after shows that 595 rows had a missing value in at least one of the selected columns: the row count drops from 8,790 to 8,195 while the column count stays at 8.

In [3]:
columns = ["NDB_No", "Shrt_Desc", "Water_(g)", "Energ_Kcal", "Protein_(g)", "Lipid_Tot_(g)", "Carbohydrt_(g)", "Fiber_TD_(g)"]
df = df[columns]
df.shape
Out[3]:
(8790, 8)
In [4]:
df = df.dropna()
df.shape
Out[4]:
(8195, 8)
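
The two shapes above show how many rows were dropped, but not which columns the missing values came from. A minimal sketch of how that could be checked, using a small hypothetical frame (`toy`, with made-up gaps in the fiber column) rather than the real data:

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the filtered nutrition data.
toy = pd.DataFrame({
    "Shrt_Desc": ["A", "B", "C", "D"],
    "Energ_Kcal": [717, 353, 269, 90],
    "Fiber_TD_(g)": [0.0, np.nan, 0.0, np.nan],
})

# Count missing values per column before dropping anything.
missing_per_column = toy.isna().sum()
print(missing_per_column)

# Drop incomplete rows and report how many were removed.
cleaned = toy.dropna()
rows_removed = len(toy) - len(cleaned)
print(f"rows removed: {rows_removed}")
```

Running `df.isna().sum()` on the real dataframe before `dropna()` would show which of the eight selected columns account for the 595 dropped rows.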

The next cell adds a categorical (Boolean) column to the data frame. This column is True if the item contains any protein and False if not.

In [5]:
df = df.assign(has_Protein=df["Protein_(g)"] > 0)
df
Out[5]:
      NDB_No  Shrt_Desc                   Water_(g)  Energ_Kcal  Protein_(g)  Lipid_Tot_(g)  Carbohydrt_(g)  Fiber_TD_(g)  has_Protein
0       1001  BUTTER,WITH SALT                15.87         717         0.85          81.11            0.06           0.0         True
1       1002  BUTTER,WHIPPED,W/ SALT          16.72         718         0.49          78.30            2.87           0.0         True
2       1003  BUTTER OIL,ANHYDROUS             0.24         876         0.28          99.48            0.00           0.0         True
3       1004  CHEESE,BLUE                     42.41         353        21.40          28.74            2.34           0.0         True
4       1005  CHEESE,BRICK                    41.11         371        23.24          29.68            2.79           0.0         True
...      ...  ...                               ...         ...          ...            ...             ...           ...          ...
8785   83110  MACKEREL,SALTED                 43.00         305        18.50          25.10            0.00           0.0         True
8786   90240  SCALLOP,(BAY&SEA),CKD,STMD      70.25         111        20.54           0.84            5.41           0.0         True
8787   90480  SYRUP,CANE                      26.00         269         0.00           0.00           73.14           0.0        False
8788   90560  SNAIL,RAW                       79.20          90        16.10           1.40            2.00           0.0         True
8789   93600  TURTLE,GREEN,RAW                78.50          89        19.80           0.50            0.00           0.0         True

8195 rows × 9 columns

The next cells define a function that flags outliers using the 1.5 * IQR rule: a value is considered an outlier if it lies more than 1.5 * IQR below Q1 or more than 1.5 * IQR above Q3. The function is then applied to each column of interest to count how many outliers it contains.

In [6]:
def is_outlier(x):
    # Flag values more than 1.5 * IQR below Q1 or above Q3
    q25, q75 = x.quantile([.25, .75])
    iqr = q75 - q25
    return (x < q25 - 1.5 * iqr) | (x > q75 + 1.5 * iqr)

outliers_water = is_outlier(df["Water_(g)"])
outliers_water.sum()
Out[6]:
0
In [7]:
outliers_energy = is_outlier(df["Energ_Kcal"])
outliers_energy.sum()
Out[7]:
136
In [8]:
outliers_protein = is_outlier(df["Protein_(g)"])
outliers_protein.sum()
Out[8]:
41
In [9]:
outliers_lipid = is_outlier(df["Lipid_Tot_(g)"])
outliers_lipid.sum()
Out[9]:
454
In [10]:
outliers_carb = is_outlier(df["Carbohydrt_(g)"])
outliers_carb.sum()
Out[10]:
25
In [11]:
outliers_fiber = is_outlier(df["Fiber_TD_(g)"])
outliers_fiber.sum()
Out[11]:
734

After running the outlier function on all the quantitative columns of interest, we find the following numbers of outliers:

  • Water: 0
  • Energy: 136
  • Protein: 41
  • Total Lipids: 454
  • Carbohydrates: 25
  • Fiber (dietary): 734

It is interesting that the water column doesn't have any. This tells us that no food item has a water content far enough below Q1 or above Q3 to be flagged by the 1.5 * IQR rule.
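
One way to see why water has no outliers is to compute the fences directly. The sketch below plugs in the quartiles that `describe()` reports for `Water_(g)` later in this notebook (Q1 = 28.255, Q3 = 77.2); the variable names are only for illustration.

```python
# Quartiles of Water_(g) as reported by describe() in this notebook.
q1, q3 = 28.255, 77.2

iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
print(f"IQR = {iqr:.3f}")
print(f"fences = ({lower_fence:.4f}, {upper_fence:.4f})")

# The column's min and max (0 and 100, also from describe()) both sit inside the fences.
no_outliers_possible = lower_fence < 0 and upper_fence > 100
print(no_outliers_possible)
```

Because the lower fence is negative and the upper fence is above 100, no value between 0 and 100 can be flagged, so a count of 0 is expected for this column.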

We should also recognize that such a large number of outliers in the fiber column means the fiber content of the foods in our data set is very unevenly distributed: most foods contain little or no fiber, while a minority contain far more. The total lipids column also has a large number of outliers but is not as extreme a case as fiber.

Because this data set is so large (8,195 foods after removing rows with missing values), even a seemingly high number of outliers makes up a small share of the data: the 136 energy outliers are only about 1.7% of the rows. For the fiber column, about 9.0% of the data points are considered outliers.
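
As a quick check on these shares, the sketch below divides each outlier count reported above by the 8,195 rows that remain after `dropna()`. The dictionary and variable names are my own, not part of the original cells.

```python
# Outlier counts reported by the cells above.
outlier_counts = {
    "Water": 0,
    "Energy": 136,
    "Protein": 41,
    "Total Lipids": 454,
    "Carbohydrates": 25,
    "Fiber (dietary)": 734,
}
n_rows = 8195  # rows remaining after dropna(), per Out[4]

# Share of rows flagged as outliers, in percent.
outlier_share = {name: 100 * count / n_rows for name, count in outlier_counts.items()}
for name, pct in outlier_share.items():
    print(f"{name}: {pct:.2f}%")
```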

Summary Data Analysis

The following cells show summary statistics (count, mean, standard deviation, minimum, quartiles, and maximum) and relevant graphs (empirical CDF and histogram) for each of the columns of interest. The units for the energy column are kcal; all other columns are in grams.
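
For readers unfamiliar with the empirical CDF, here is a rough sketch of what `kind="ecdf"` plots, computed by hand with NumPy. The sample values are hypothetical; in the notebook the input would be a column such as `df["Water_(g)"]`.

```python
import numpy as np

# Hypothetical sample values (no ties, to keep the example simple).
values = np.array([78.5, 0.0, 42.41, 100.0, 15.87])

# ECDF: at each sorted value, the fraction of observations less than or equal to it.
x_sorted = np.sort(values)
ecdf = np.arange(1, len(x_sorted) + 1) / len(x_sorted)

for x, p in zip(x_sorted, ecdf):
    print(f"{x:7.2f} -> {p:.1f}")
```

Steep stretches of the curve mark ranges where many foods are concentrated, and flat stretches mark ranges with few foods.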

In [12]:
# Water
df["Water_(g)"].describe()
Out[12]:
count    8195.000000
mean       53.626142
std        30.808801
min         0.000000
25%        28.255000
50%        62.670000
75%        77.200000
max       100.000000
Name: Water_(g), dtype: float64
In [13]:
sns.displot(df["Water_(g)"], kind="ecdf");
In [14]:
sns.displot(df["Water_(g)"], kind="hist");

From these graphs we see that a large number of foods contain 0 g of water, and that the foods that do contain water don't vary too drastically in amount.

In [15]:
# Energy
df["Energ_Kcal"].describe()
Out[15]:
count    8195.000000
mean      228.613179
std       169.683531
min         0.000000
25%        95.000000
50%       193.000000
75%       341.000000
max       902.000000
Name: Energ_Kcal, dtype: float64
In [16]:
sns.displot(df["Energ_Kcal"], kind="ecdf");
In [17]:
sns.displot(df["Energ_Kcal"], kind="hist");

From these graphs we see that more foods in the data set fall between 0 and 400 kcal than between 400 and 800 kcal, meaning it is more common for foods to be lower in calories.

In [18]:
# Protein
df["Protein_(g)"].describe()
Out[18]:
count    8195.000000
mean       11.419850
std        10.516845
min         0.000000
25%         2.400000
50%         8.180000
75%        20.050000
max        88.320000
Name: Protein_(g), dtype: float64
In [19]:
sns.displot(df["Protein_(g)"], kind="ecdf");
In [20]:
sns.displot(df["Protein_(g)"], kind="hist");

These graphs show that more foods have low amounts of protein than very high amounts. A large number of foods have no protein at all.

In [21]:
# Total Lipid
df["Lipid_Tot_(g)"].describe()
Out[21]:
count    8195.000000
mean       10.660617
std        15.803175
min         0.000000
25%         1.050000
50%         5.330000
75%        13.875000
max       100.000000
Name: Lipid_Tot_(g), dtype: float64
In [22]:
sns.displot(df["Lipid_Tot_(g)"], kind="ecdf");
In [23]:
sns.displot(df["Lipid_Tot_(g)"], kind="hist");

These graphs show that there are a lot of foods that have no lipids and very few that have a very high total lipid content.

In [24]:
# Carbohydrates
df["Carbohydrt_(g)"].describe()
Out[24]:
count    8195.000000
mean       22.435128
std        27.494874
min         0.000000
25%         0.030000
50%         9.570000
75%        37.640000
max       100.000000
Name: Carbohydrt_(g), dtype: float64
In [25]:
sns.displot(df["Carbohydrt_(g)"], kind="ecdf");
In [26]:
sns.displot(df["Carbohydrt_(g)"], kind="hist");

These graphs show that many foods in the data set have 0 g of carbohydrates and that the foods that do have carbohydrates are fairly evenly spread out in their amounts.

In [27]:
# Fiber
df["Fiber_TD_(g)"].describe()
Out[27]:
count    8195.000000
mean        2.187126
std         4.383311
min         0.000000
25%         0.000000
50%         0.700000
75%         2.600000
max        79.000000
Name: Fiber_TD_(g), dtype: float64
In [28]:
sns.displot(df["Fiber_TD_(g)"], kind="ecdf");
In [29]:
sns.displot(df["Fiber_TD_(g)"], kind="hist");

These graphs show that there is a very high number of foods in the data set that have 0g of fiber. Most foods that do have fiber have between 1 and 10 grams and there are also some foods that have a very high fiber content.

Correlations

  1. Analyzing the correlation between protein and water content.
In [30]:
columns1 = ["Protein_(g)", "Water_(g)"]
sns.pairplot(data=df[columns1])
Out[30]:
<seaborn.axisgrid.PairGrid at 0x127b6e690>
In [31]:
df[columns1].corr()
Out[31]:
             Protein_(g)  Water_(g)
Protein_(g)     1.000000  -0.089458
Water_(g)      -0.089458   1.000000
In [32]:
df[columns1].corr("spearman")
Out[32]:
             Protein_(g)  Water_(g)
Protein_(g)     1.000000  -0.285706
Water_(g)      -0.285706   1.000000

The Pearson correlation coefficient between protein and water is close to zero, which suggests they do not have a strong linear relationship. I also looked at the Spearman coefficient, which is based on ranks, to make sure outliers were not too strongly affecting the result; it is further from zero but still indicates only a weak association. It is negative, which suggests that foods with lower water content tend to have higher protein content.
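
The two coefficients can disagree because Pearson measures linear association while Spearman measures monotonic association through ranks. The sketch below illustrates this on hypothetical data (`toy`, where `y` is the cube of `x`); it is not drawn from the nutrition data.

```python
import pandas as pd

# Hypothetical data: y grows monotonically with x, but not linearly.
toy = pd.DataFrame({"x": range(1, 11)})
toy["y"] = toy["x"] ** 3

# Same calls as in the notebook cells, applied to the toy frame.
pearson = toy.corr(method="pearson").loc["x", "y"]
spearman = toy.corr(method="spearman").loc["x", "y"]
print(f"Pearson:  {pearson:.3f}")
print(f"Spearman: {spearman:.3f}")
```

Here the ranks of `x` and `y` agree perfectly, so Spearman is larger than Pearson. A gap like the one between -0.09 and -0.29 above may therefore point to a relationship that is monotonic but not linear, in addition to any effect of outliers.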

  2. Analyzing the correlation between energy and carbohydrates.
In [33]:
columns2 = ["Energ_Kcal", "Carbohydrt_(g)"]
sns.pairplot(data=df[columns2])
Out[33]:
<seaborn.axisgrid.PairGrid at 0x127a02a90>
In [34]:
df[columns2].corr()
Out[34]:
                Energ_Kcal  Carbohydrt_(g)
Energ_Kcal        1.000000        0.494455
Carbohydrt_(g)    0.494455        1.000000
In [35]:
df[columns2].corr("spearman")
Out[35]:
                Energ_Kcal  Carbohydrt_(g)
Energ_Kcal        1.000000        0.373511
Carbohydrt_(g)    0.373511        1.000000

The Pearson correlation coefficient between energy and carbohydrates suggests a moderate positive association: higher carbohydrate content tends to go with higher energy content. I also looked at the Spearman coefficient to make sure outliers were not too strongly affecting the result; it is weaker than the Pearson coefficient but still positive.

  3. Analyzing the correlation between protein and fiber content.
In [36]:
columns3 = ["Protein_(g)", "Fiber_TD_(g)"]
sns.pairplot(data=df[columns3])
Out[36]:
<seaborn.axisgrid.PairGrid at 0x13018ab90>
In [37]:
df[columns3].corr()
Out[37]:
              Protein_(g)  Fiber_TD_(g)
Protein_(g)      1.000000     -0.081103
Fiber_TD_(g)    -0.081103      1.000000
In [38]:
df[columns3].corr("spearman")
Out[38]:
              Protein_(g)  Fiber_TD_(g)
Protein_(g)       1.00000      -0.31047
Fiber_TD_(g)     -0.31047       1.00000

The Pearson correlation coefficient between protein and fiber is close to zero, which suggests essentially no linear association. I also looked at the Spearman coefficient to make sure outliers were not too strongly affecting the result; it is weakly negative, which would mean higher protein content is associated with lower fiber content.

Discussion

My Questions:

  1. Does a high amount of water, fiber, and energy make a food item more likely to contain protein? (categorical)
  2. Can the amounts of protein, carbohydrates, and fiber in a food item be used to predict whether its energy content is above the average for this data set?
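
The first question can use the `has_Protein` column created earlier as its target. For the second, a Boolean target would need to be built in a similar way. A minimal sketch, assuming "above average" means above the mean of `Energ_Kcal`; the column name `above_avg_energy` is hypothetical, and the toy frame holds four rows copied from the table above:

```python
import pandas as pd

# Four rows copied from the notebook's table, used here as a toy frame.
toy = pd.DataFrame({
    "Shrt_Desc": ["BUTTER,WITH SALT", "CHEESE,BLUE", "SYRUP,CANE", "SNAIL,RAW"],
    "Energ_Kcal": [717, 353, 269, 90],
})

# Assumption: "above average" is measured against the mean energy of the frame.
mean_energy = toy["Energ_Kcal"].mean()

# Hypothetical target column: True when a food's energy exceeds the mean.
toy = toy.assign(above_avg_energy=toy["Energ_Kcal"] > mean_energy)
print(mean_energy)
print(toy)
```

On the full data set the threshold would be the mean reported by `describe()` for `Energ_Kcal` (about 228.6 kcal).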