Introduction
This data came from the website data.world. It was initially collected in 2014 and then revised in 2015. It includes 8,790 food items and lists the amount of each nutrient they contain. The original source is the United States Department of Agriculture.
Here is the link to my data: https://data.world/awram/food-nutritional-values
import pandas as pd
df = pd.read_excel('https://query.data.world/s/h5xr3mwavgls7t2g65zgtpgy2yvz4y?dws=00000')
df
| | NDB_No | Shrt_Desc | Water_(g) | Energ_Kcal | Protein_(g) | Lipid_Tot_(g) | Ash_(g) | Carbohydrt_(g) | Fiber_TD_(g) | Sugar_Tot_(g) | ... | Vit_K_(µg) | FA_Sat_(g) | FA_Mono_(g) | FA_Poly_(g) | Cholestrl_(mg) | GmWt_1 | GmWt_Desc1 | GmWt_2 | GmWt_Desc2 | Refuse_Pct |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1001 | BUTTER,WITH SALT | 15.87 | 717 | 0.85 | 81.11 | 2.11 | 0.06 | 0.0 | 0.06 | ... | 7.0 | 51.368 | 21.021 | 3.043 | 215.0 | 5.00 | 1 pat, (1" sq, 1/3" high) | 14.2 | 1 tbsp | 0.0 |
1 | 1002 | BUTTER,WHIPPED,W/ SALT | 16.72 | 718 | 0.49 | 78.30 | 1.62 | 2.87 | 0.0 | 0.06 | ... | 4.6 | 45.390 | 19.874 | 3.331 | 225.0 | 3.80 | 1 pat, (1" sq, 1/3" high) | 9.4 | 1 tbsp | 0.0 |
2 | 1003 | BUTTER OIL,ANHYDROUS | 0.24 | 876 | 0.28 | 99.48 | 0.00 | 0.00 | 0.0 | 0.00 | ... | 8.6 | 61.924 | 28.732 | 3.694 | 256.0 | 12.80 | 1 tbsp | 205.0 | 1 cup | 0.0 |
3 | 1004 | CHEESE,BLUE | 42.41 | 353 | 21.40 | 28.74 | 5.11 | 2.34 | 0.0 | 0.50 | ... | 2.4 | 18.669 | 7.778 | 0.800 | 75.0 | 28.35 | 1 oz | 17.0 | 1 cubic inch | 0.0 |
4 | 1005 | CHEESE,BRICK | 41.11 | 371 | 23.24 | 29.68 | 3.18 | 2.79 | 0.0 | 0.51 | ... | 2.5 | 18.764 | 8.598 | 0.784 | 94.0 | 132.00 | 1 cup, diced | 113.0 | 1 cup, shredded | 0.0 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
8785 | 83110 | MACKEREL,SALTED | 43.00 | 305 | 18.50 | 25.10 | 13.40 | 0.00 | 0.0 | 0.00 | ... | 7.8 | 7.148 | 8.320 | 6.210 | 95.0 | 80.00 | 1 piece, (5-1/2" x 1-1/2" x 1/2") | 17.0 | 1 cubic inch, boneless | 0.0 |
8786 | 90240 | SCALLOP,(BAY&SEA),CKD,STMD | 70.25 | 111 | 20.54 | 0.84 | 2.97 | 5.41 | 0.0 | 0.00 | ... | 0.0 | 0.218 | 0.082 | 0.222 | 41.0 | 85.00 | 3 oz | NaN | NaN | 0.0 |
8787 | 90480 | SYRUP,CANE | 26.00 | 269 | 0.00 | 0.00 | 0.86 | 73.14 | 0.0 | 73.20 | ... | 0.0 | 0.000 | 0.000 | 0.000 | 0.0 | 21.00 | 1 serving | NaN | NaN | 0.0 |
8788 | 90560 | SNAIL,RAW | 79.20 | 90 | 16.10 | 1.40 | 1.30 | 2.00 | 0.0 | 0.00 | ... | 0.1 | 0.361 | 0.259 | 0.252 | 50.0 | 85.00 | 3 oz | NaN | NaN | 0.0 |
8789 | 93600 | TURTLE,GREEN,RAW | 78.50 | 89 | 19.80 | 0.50 | 1.20 | 0.00 | 0.0 | 0.00 | ... | 0.1 | 0.127 | 0.088 | 0.170 | 50.0 | 85.00 | 3 oz | NaN | NaN | 0.0 |
8790 rows × 53 columns
Preprocessing
# Imports
import numpy as np
import pandas as pd
import seaborn as sns
from sklearn.preprocessing import StandardScaler
import matplotlib.pyplot as plt
In the next two cells I filter the data set down to the columns I am most interested in, then remove any rows with missing values. Comparing the shape of the data frame before and after shows that 595 rows had a missing value in at least one of the selected columns: the row count drops from 8,790 to 8,195.
columns = ["NDB_No", "Shrt_Desc", "Water_(g)", "Energ_Kcal", "Protein_(g)", "Lipid_Tot_(g)", "Carbohydrt_(g)", "Fiber_TD_(g)"]
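# .copy() gives an independent data frame, so later edits do not trigger SettingWithCopyWarning
df = df[columns].copy()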
df.shape
(8790, 8)
df = df.dropna()
df.shape
(8195, 8)
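To see which columns are responsible for the dropped rows, the missing values can be counted per column before calling dropna. The following is a minimal sketch on a made-up miniature table (the `toy` frame and its values are assumptions for illustration, not rows from the real data set):

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the nutrient table; values are invented for illustration.
toy = pd.DataFrame({
    "Shrt_Desc": ["A", "B", "C", "D"],
    "Water_(g)": [15.9, np.nan, 0.2, 42.4],
    "Fiber_TD_(g)": [0.0, 1.2, np.nan, np.nan],
})

# Count missing values in each column before dropping anything.
missing_per_column = toy.isna().sum()

# Drop incomplete rows and record how many rows were removed.
cleaned = toy.dropna()
rows_removed = len(toy) - len(cleaned)

print(missing_per_column)
print("rows removed:", rows_removed)
```

Running the same two steps on the real data frame would show whether the missing values are concentrated in one column or spread across several.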
The next cell adds a categorical column to the data frame. This column is True if the item contains protein and False if not.
df["has_Protein"] = df["Protein_(g)"] > 0
df
| | NDB_No | Shrt_Desc | Water_(g) | Energ_Kcal | Protein_(g) | Lipid_Tot_(g) | Carbohydrt_(g) | Fiber_TD_(g) | has_Protein |
---|---|---|---|---|---|---|---|---|---|
0 | 1001 | BUTTER,WITH SALT | 15.87 | 717 | 0.85 | 81.11 | 0.06 | 0.0 | True |
1 | 1002 | BUTTER,WHIPPED,W/ SALT | 16.72 | 718 | 0.49 | 78.30 | 2.87 | 0.0 | True |
2 | 1003 | BUTTER OIL,ANHYDROUS | 0.24 | 876 | 0.28 | 99.48 | 0.00 | 0.0 | True |
3 | 1004 | CHEESE,BLUE | 42.41 | 353 | 21.40 | 28.74 | 2.34 | 0.0 | True |
4 | 1005 | CHEESE,BRICK | 41.11 | 371 | 23.24 | 29.68 | 2.79 | 0.0 | True |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
8785 | 83110 | MACKEREL,SALTED | 43.00 | 305 | 18.50 | 25.10 | 0.00 | 0.0 | True |
8786 | 90240 | SCALLOP,(BAY&SEA),CKD,STMD | 70.25 | 111 | 20.54 | 0.84 | 5.41 | 0.0 | True |
8787 | 90480 | SYRUP,CANE | 26.00 | 269 | 0.00 | 0.00 | 73.14 | 0.0 | False |
8788 | 90560 | SNAIL,RAW | 79.20 | 90 | 16.10 | 1.40 | 2.00 | 0.0 | True |
8789 | 93600 | TURTLE,GREEN,RAW | 78.50 | 89 | 19.80 | 0.50 | 0.00 | 0.0 | True |
8195 rows × 9 columns
The next cells define a function that flags outliers using the 1.5 * IQR rule: a value is considered an outlier if it lies more than 1.5 * IQR below Q1 or more than 1.5 * IQR above Q3. The function is then applied to each column of interest to count how many outliers it contains.
def is_outlier(x):
    Q25, Q75 = x.quantile([.25,.75])
    I = Q75 - Q25
    return (x < Q25 - 1.5*I) | (x > Q75 + 1.5*I)
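As a small worked example of the 1.5 * IQR rule, the sketch below computes the two fences explicitly for a short made-up series. The helper name `iqr_fences` and the sample values are assumptions introduced only for this illustration:

```python
import pandas as pd

def iqr_fences(x):
    """Return the lower and upper 1.5 * IQR fences for a numeric Series."""
    q25, q75 = x.quantile([0.25, 0.75])
    iqr = q75 - q25
    return q25 - 1.5 * iqr, q75 + 1.5 * iqr

# Invented values: eight evenly spaced points and one extreme value.
values = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 50.0])

lower, upper = iqr_fences(values)

# A value is flagged when it falls outside the fences.
flags = (values < lower) | (values > upper)

print(lower, upper)       # Q1 = 3, Q3 = 7, IQR = 4, so the fences are -3.0 and 13.0
print(int(flags.sum()))   # only the value 50.0 lies outside the fences
```

Looking at the fences themselves, rather than only the count of flagged values, makes it easier to judge whether a column can produce outliers at all.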
outliers_water = is_outlier(df["Water_(g)"])
outliers_water.sum()
0
outliers_energy = is_outlier(df["Energ_Kcal"])
outliers_energy.sum()
136
outliers_protein = is_outlier(df["Protein_(g)"])
outliers_protein.sum()
41
outliers_lipid = is_outlier(df["Lipid_Tot_(g)"])
outliers_lipid.sum()
454
outliers_carb = is_outlier(df["Carbohydrt_(g)"])
outliers_carb.sum()
25
outliers_fiber = is_outlier(df["Fiber_TD_(g)"])
outliers_fiber.sum()
734
After running the outlier function on all the quantitative columns of interest, we find the following numbers of outliers:
- Water: 0
- Energy: 136
- Protein: 41
- Total Lipids: 454
- Carbohydrates: 25
- Fiber (dietary): 734
It is interesting that the water category doesn't have any. The quartiles reported in the summary statistics (Q1 = 28.255, Q3 = 77.2) give fences of about -45.2 and 150.6, which lie outside the observed range of water values (0 to 100 g), so no food item can be flagged under this rule.
We should also recognize that the large number of outliers in the fiber category points to a strongly right-skewed distribution: most foods contain little or no fiber, while a minority contain much more. The total lipids category also has a large number of outliers, but it is not as extreme a case as fiber.
Because this data set is so large (8,195 foods after removing rows with missing values), even a seemingly high number of outliers makes up a small share of the data: the 136 energy outliers are only about 1.66% of the rows. For the fiber category, about 8.96% of the data points are considered outliers.
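The outlier shares can be checked with a few lines of arithmetic. This sketch uses the outlier counts listed above and the 8,195 rows reported after dropping missing values; the dictionary and variable names are assumptions chosen for the example:

```python
# Outlier counts as reported in the list above.
outlier_counts = {
    "Water": 0,
    "Energy": 136,
    "Protein": 41,
    "Total Lipids": 454,
    "Carbohydrates": 25,
    "Fiber (dietary)": 734,
}

# Number of rows left after removing rows with missing values.
n_rows = 8195

# Share of rows flagged as outliers, in percent, for each column.
shares = {name: 100 * count / n_rows for name, count in outlier_counts.items()}

for name, pct in shares.items():
    print(f"{name}: {pct:.2f}%")
```

Dividing by the row count of the cleaned data frame matters here, because the outlier function was applied after the incomplete rows were removed.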
Summary Data Analysis
The following cells show summary statistics (count, mean, standard deviation, minimum, each quartile, and maximum) and relevant graphs (empirical CDF and histogram) for each of the columns of interest. The energy column is measured in kcal; all other columns are in grams.
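As an aside, the same summary statistics can be gathered for several columns in one table instead of one cell per column. The sketch below is only an illustration on a four-row frame whose values are copied from rows shown earlier (butter, blue cheese, snail, cane syrup); the names `toy` and `summary` are assumptions:

```python
import pandas as pd

# Four rows copied from the displayed table, two columns only.
toy = pd.DataFrame({
    "Water_(g)": [15.87, 42.41, 79.20, 26.00],
    "Energ_Kcal": [717, 353, 90, 269],
})

# describe() summarizes every numeric column; .T puts one column per row.
summary = toy.describe().T

print(summary)
```

The transposed layout makes it easier to compare, for example, the medians or maxima of different nutrients side by side.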
# Water
df["Water_(g)"].describe()
count    8195.000000
mean       53.626142
std        30.808801
min         0.000000
25%        28.255000
50%        62.670000
75%        77.200000
max       100.000000
Name: Water_(g), dtype: float64
sns.displot(df["Water_(g)"], kind="ecdf");
sns.displot(df["Water_(g)"], kind="hist");
From these graphs we see that a large number of foods contain 0 g of water, and that those that do contain water don't vary in amount too drastically.
# Energy
df["Energ_Kcal"].describe()
count    8195.000000
mean      228.613179
std       169.683531
min         0.000000
25%        95.000000
50%       193.000000
75%       341.000000
max       902.000000
Name: Energ_Kcal, dtype: float64
sns.displot(df["Energ_Kcal"], kind="ecdf");
sns.displot(df["Energ_Kcal"], kind="hist");
From these graphs we see that more foods in the data set fall between 0 and 400 kcal than between 400 and 800 kcal, meaning it is more common for foods to be lower in calories.
# Protein
df["Protein_(g)"].describe()
count    8195.000000
mean       11.419850
std        10.516845
min         0.000000
25%         2.400000
50%         8.180000
75%        20.050000
max        88.320000
Name: Protein_(g), dtype: float64
sns.displot(df["Protein_(g)"], kind="ecdf");
sns.displot(df["Protein_(g)"], kind="hist");
These graphs show that more foods have low amounts of protein than very high amounts. A large number of foods have no protein at all.
# Total Lipid
df["Lipid_Tot_(g)"].describe()
count    8195.000000
mean       10.660617
std        15.803175
min         0.000000
25%         1.050000
50%         5.330000
75%        13.875000
max       100.000000
Name: Lipid_Tot_(g), dtype: float64
sns.displot(df["Lipid_Tot_(g)"], kind="ecdf");
sns.displot(df["Lipid_Tot_(g)"], kind="hist");
These graphs show that there are a lot of foods that have no lipids and very few that have a very high total lipid content.
# Carbohydrates
df["Carbohydrt_(g)"].describe()
count    8195.000000
mean       22.435128
std        27.494874
min         0.000000
25%         0.030000
50%         9.570000
75%        37.640000
max       100.000000
Name: Carbohydrt_(g), dtype: float64
sns.displot(df["Carbohydrt_(g)"], kind="ecdf");
sns.displot(df["Carbohydrt_(g)"], kind="hist");
These graphs show that many foods in the data set have 0 g of carbohydrates and that foods that do have carbohydrates are fairly evenly spread out in their amounts.
# Fiber
df["Fiber_TD_(g)"].describe()
count    8195.000000
mean        2.187126
std         4.383311
min         0.000000
25%         0.000000
50%         0.700000
75%         2.600000
max        79.000000
Name: Fiber_TD_(g), dtype: float64
sns.displot(df["Fiber_TD_(g)"], kind="ecdf");
sns.displot(df["Fiber_TD_(g)"], kind="hist");
These graphs show that there is a very high number of foods in the data set that have 0g of fiber. Most foods that do have fiber have between 1 and 10 grams and there are also some foods that have a very high fiber content.
Correlations
- Analyzing the correlation between protein and water content.
columns1 = ["Protein_(g)", "Water_(g)"]
sns.pairplot(data=df[columns1])
df[columns1].corr()
| | Protein_(g) | Water_(g) |
---|---|---|
Protein_(g) | 1.000000 | -0.089458 |
Water_(g) | -0.089458 | 1.000000 |
df[columns1].corr("spearman")
| | Protein_(g) | Water_(g) |
---|---|---|
Protein_(g) | 1.000000 | -0.285706 |
Water_(g) | -0.285706 | 1.000000 |
The Pearson correlation coefficient between protein and water is close to zero, which suggests they do not have a strong linear correlation. I looked at the Spearman coefficient to check that outliers were not too strongly affecting the results; it is further from zero (about -0.29) but still indicates only a weak association. It is negative, which suggests that lower water content tends to go with higher protein content.
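To illustrate why the two coefficients can disagree, the sketch below uses an invented data set in which `y` always increases with `x` but one value is extreme. Spearman is computed here as the Pearson correlation of the ranks, which is its definition; the column names and numbers are assumptions for illustration only:

```python
import pandas as pd

# Invented data: y rises with x every time, but the last value is extreme.
toy = pd.DataFrame({
    "x": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    "y": [1.0, 2.0, 3.0, 4.0, 5.0, 600.0],
})

# Pearson measures linear association and is pulled down by the extreme point.
pearson = toy["x"].corr(toy["y"])

# Spearman is the Pearson correlation of the ranks, so only the ordering matters.
spearman = toy["x"].rank().corr(toy["y"].rank())

print(round(pearson, 3), round(spearman, 3))
```

Because the ordering of `y` matches the ordering of `x` exactly, the rank-based coefficient is 1 even though the linear coefficient is noticeably lower. A gap between the two in real data is a hint that the relationship is monotonic but not linear, or that a few extreme values are influencing the Pearson result.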
- Analyzing the correlation between energy and carbohydrates.
columns2 = ["Energ_Kcal", "Carbohydrt_(g)"]
sns.pairplot(data=df[columns2])
df[columns2].corr()
| | Energ_Kcal | Carbohydrt_(g) |
---|---|---|
Energ_Kcal | 1.000000 | 0.494455 |
Carbohydrt_(g) | 0.494455 | 1.000000 |
df[columns2].corr("spearman")
| | Energ_Kcal | Carbohydrt_(g) |
---|---|---|
Energ_Kcal | 1.000000 | 0.373511 |
Carbohydrt_(g) | 0.373511 | 1.000000 |
The Pearson correlation coefficient between energy and carbohydrates (about 0.49) suggests a moderate positive association. This means that higher carbohydrate content is correlated with higher energy content. I looked at the Spearman coefficient to check that outliers were not too strongly affecting the results; it was weaker than the Pearson coefficient (about 0.37) but still moderately positive.
- Analyzing the correlation between protein and fiber content.
columns3 = ["Protein_(g)", "Fiber_TD_(g)"]
sns.pairplot(data=df[columns3])
df[columns3].corr()
| | Protein_(g) | Fiber_TD_(g) |
---|---|---|
Protein_(g) | 1.000000 | -0.081103 |
Fiber_TD_(g) | -0.081103 | 1.000000 |
df[columns3].corr("spearman")
| | Protein_(g) | Fiber_TD_(g) |
---|---|---|
Protein_(g) | 1.00000 | -0.31047 |
Fiber_TD_(g) | -0.31047 | 1.00000 |
The Pearson correlation coefficient between protein and fiber (about -0.08) is close to zero, which suggests essentially no linear association. I looked at the Spearman coefficient to check that outliers were not too strongly affecting the results; it was weakly negative (about -0.31), which would mean higher protein content is associated with lower fiber content.
Discussion
My Questions:
- Does a food item having a high amount of water, fiber, and energy increase the chances it has protein? (categorical)
- Can the amount of protein, carbohydrates, and fiber a food item has be used to predict if it is above average (of those in this data set) in its amount of energy?
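Both questions need a True/False target column. The sketch below shows one way such targets might be built, using four rows copied from the table displayed earlier (butter, blue cheese, snail, cane syrup). The column name `above_avg_energy` is a hypothetical name introduced here, and the mean is taken over these four rows only, not over the full data set:

```python
import pandas as pd

# Four rows copied from the displayed table; the full data set has 8,195 rows.
toy = pd.DataFrame({
    "Shrt_Desc": ["BUTTER,WITH SALT", "CHEESE,BLUE", "SNAIL,RAW", "SYRUP,CANE"],
    "Energ_Kcal": [717, 353, 90, 269],
    "Protein_(g)": [0.85, 21.40, 16.10, 0.00],
})

# Target for the first question: does the item contain any protein?
toy["has_Protein"] = toy["Protein_(g)"] > 0

# Target for the second question (hypothetical column name):
# is the item's energy above the mean energy of the rows in this frame?
toy["above_avg_energy"] = toy["Energ_Kcal"] > toy["Energ_Kcal"].mean()

print(toy[["Shrt_Desc", "has_Protein", "above_avg_energy"]])
```

With targets defined this way, either question becomes a binary classification problem on the remaining nutrient columns.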