Diet and Covid-19

Motto: "Let food be thy medicine, and let medicine be thy food"



  • Introduction and motivation
  • Requirements (machine learning libraries)
  • Dataset
    • Data analysis and description
    • Distribution of the data
    • Data analysis by country
  • Machine learning algorithm
  • Conclusions and discussion

Introduction and motivation

Since the spring of 2020, the entire world has been facing an ongoing pandemic caused by the SARS-CoV-2 virus. The symptoms of COVID-19 vary widely: some people don't even know that they are infected, while others face life-threatening symptoms. As the virus is new, researchers still have many questions to answer: how does it spread, which medicines should be used to treat the symptoms, and how can we prevent infection?

It is believed that the virus spreads through the air and via contaminated surfaces. To prevent infection, authorities recommend social distancing, wearing face masks in public, frequent hand washing and disinfecting surfaces. Also, multiple vaccines have been developed and distributed to the population. For infected persons, there are some treatments that address COVID-19 symptoms, but there are no drugs that inhibit the virus.

Some specialists also recommend taking vitamins (vitamin C and vitamin D3) to support the immune system. It is known that our eating habits and lifestyle make a great difference to a healthy immune system. Therefore, the main objective of this project is to analyse whether there is a correlation between diet (which is said to influence the immune system) and COVID-19 cases in different regions around the world.

In this project I want to study the relationship between the diet of different countries and their COVID-19 cases and COVID-19-related deaths. It is said that a healthy diet (lots of fruit and vegetables, reduced consumption of alcohol and animal products) and a good physiological state (neither obese nor undernourished) can help boost the immune system and therefore increase one's resistance to the virus.

I hope that after finishing this tutorial, the reader will understand the steps required to carry out a data science project and will be more aware of how a healthy diet can help boost the immune system.

Requirements (machine learning libraries)

I used the following machine learning libraries:

  • pandas 1.1.5 - for loading and manipulating csv files
  • matplotlib 3.3.3 and seaborn - for creating and displaying various charts
  • plotly - for the interactive choropleth map
  • numpy 1.19.5 - for working with arrays
  • scikit-learn 0.24.1 - for various machine learning models and for metrics to evaluate my models
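
If you want to compare your environment with the versions listed above, a small check like the following can help. This is only a sketch: the dictionary of listed versions is copied from the list above, seaborn has no listed version, and the helper names are my own.

```python
# Print installed versions so they can be compared with the ones listed above.
import importlib

# import name -> version listed in this tutorial (None where no version is given)
listed_versions = {
    "pandas": "1.1.5",
    "matplotlib": "3.3.3",
    "seaborn": None,
    "numpy": "1.19.5",
    "sklearn": "0.24.1",  # scikit-learn is imported as "sklearn"
}

installed_versions = {}
for module_name, listed in listed_versions.items():
    try:
        module = importlib.import_module(module_name)
        installed_versions[module_name] = getattr(module, "__version__", "unknown")
    except ImportError:
        installed_versions[module_name] = "not installed"
    print(f"{module_name}: installed={installed_versions[module_name]}, listed={listed}")
```

Newer versions will probably work too, but some calls may behave differently from what is shown in this notebook.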
In [ ]:
!pip install seaborn
!pip install plotly
In [30]:
import seaborn
import sklearn
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import plotly.express as plt_exp
%matplotlib inline


Dataset

The data that I will work with for this project were downloaded from Kaggle.
The author of this dataset gathered nutrition data (energy intake, fat and protein quantity etc.), obesity and undernourishment rates, as well as the most up-to-date confirmed/deaths/recovered/active COVID-19 cases around the world.
The data were scraped from different websites:

  • Food and Agriculture Organization of the United Nations (FAO) website
  • Population Reference Bureau (PRB) website
  • Johns Hopkins Center for Systems Science and Engineering (CSSE) website
  • USDA Center for Nutrition Policy and Promotion diet intake

Data analysis and description

The first step is to load the data, handle missing values and perform exploratory data analysis.

As mentioned in the introductory section, we are interested in the relationship between a healthy diet and an appropriate physiological state on the one hand, and COVID-19 cases and deaths on the other.

My assumption of a healthy diet is:

  • consuming lots of fruits and vegetables
  • reduced consumption of animal products (meat)
  • reduced consumption of alcohol
  • reduced consumption of sugar and sweeteners.

The available data related to the physiological state of the inhabitants are the obesity and undernourishment rates.

With this in mind, I will only analyse the following columns related to the diet: Fruits, Sweeteners, Alcoohol, Vegetables, Obesity, Undernourished, Meat, Vegetal_products, Animal_products.

I noticed that rows contain missing values, so I will remove those countries from the analysis.

In [32]:
# Load the data into a pandas DataFrame
df = pd.read_csv("Food_Supply_Quantity_kg_Data.csv")
# rename some columns that I will be using frequently
df.rename(columns = {"Miscellaneous":"Misc","Milk - Excluding Butter":"Dairy",
                    "Sugar & Sweeteners":"Sweeteners", "Fruits - Excluding Wine":"Fruits",
                    "Unit (all except Population)":"Unit",
                     "Alcoholic Beverages": "Alcoohol",
                    "Animal fats": "Animal_fats",
                    "Animal Products": "Animal_products",
                    "Vegetal Products": "Vegetal_products",
                     "Fish, Seafood": "Fish"
                    }, inplace=True)

countries_all = len(df)
# drop rows (countries) with missing values
df = df.dropna()
countries_valid = len(df)
print('Removed {} countries from {} as they had missing information. Working with data from {} countries.'.format(countries_all - countries_valid, countries_all, countries_valid))

# keep only the columns we are interested in; copy so that later assignments do not act on a view
columns_of_interest = ['Sweeteners', 'Alcoohol', 'Fruits', 'Animal_fats', 'Animal_products', 'Vegetal_products', 'Obesity', 'Undernourished', 'Meat',
                      'Deaths', 'Confirmed', 'Population', 'Vegetables', 'Dairy', 'Fish', 'Country']
df = df[columns_of_interest].copy()

# The Undernourished column stores some values as the string '<2.5';
# treat them as 2.5 and convert the whole column to float
df['Undernourished'] = df['Undernourished'].replace('<2.5', '2.5').astype(float)
Removed 16 countries from 170 as they had missing information. Working with data from 154 countries.
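
Before dropping rows, it can be useful to see which columns actually contain the missing values. The sketch below shows this on a small made-up DataFrame (the values are hypothetical and are not taken from the Kaggle file), together with the same '<2.5' handling used above. Note that replacing '<2.5' with 2.5 is an approximation: the true value is only known to be below 2.5.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; the real values come from the Kaggle CSV.
toy = pd.DataFrame({
    "Country": ["A", "B", "C", "D"],
    "Obesity": [22.1, np.nan, 8.4, 30.2],
    "Undernourished": ["<2.5", "12.3", np.nan, "4.0"],
})

# Count the missing values per column before dropping anything
missing_per_column = toy.isna().sum()
print(missing_per_column)

# Drop incomplete rows, then turn the '<2.5' marker into the number 2.5
clean = toy.dropna().copy()
clean["Undernourished"] = pd.to_numeric(clean["Undernourished"].replace("<2.5", "2.5"))
print(clean)
```

If most of the missing values were concentrated in a column that is not used in the analysis, selecting the columns of interest before calling dropna() would keep more countries.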

Distribution of the data

Now, let's look at the statistical properties of these numerical columns and visualize their distribution. First, the summary statistics:

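In [33]:
# Summary statistics of the numerical columns (the Country column is skipped automatically)
df.describe()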
Sweeteners Alcoohol Fruits Animal_fats Animal_products Vegetal_products Obesity Undernourished Meat Deaths Confirmed Population Vegetables Dairy Fish
count 154.000000 154.000000 154.000000 154.000000 154.000000 154.000000 154.000000 154.000000 154.000000 154.000000 154.000000 1.540000e+02 154.000000 154.000000 154.000000
mean 2.732413 3.020031 5.534741 0.227173 12.121764 37.874880 18.449351 11.324026 3.202042 0.039882 2.060157 4.796579e+07 5.971525 6.737858 1.283970
std 1.511877 2.404583 3.167282 0.284108 6.039498 6.039735 9.519483 11.771718 1.642639 0.049285 2.396535 1.639258e+08 3.491491 5.118950 1.170999
min 0.366600 0.000000 0.659600 0.002200 1.739100 23.113200 2.100000 2.500000 0.356000 0.000000 0.000312 7.200000e+04 0.857000 0.096300 0.035000
25% 1.704675 0.906350 3.402050 0.040600 6.752650 32.869525 8.250000 2.500000 1.864250 0.002086 0.141688 3.403500e+06 3.480075 2.035425 0.535750
50% 2.539850 2.733450 4.939050 0.116850 11.830350 38.168200 21.300000 7.050000 3.255250 0.012576 1.035083 1.058950e+07 4.956050 5.522500 1.003750
75% 3.630550 4.680600 6.807550 0.255075 17.122875 43.244575 25.700000 15.075000 4.306275 0.069680 3.534770 3.383650e+07 7.789925 10.963900 1.698925
max 9.725900 15.370600 19.302800 1.355900 26.886500 48.258500 45.500000 59.600000 8.092900 0.185428 10.408199 1.402385e+09 16.701900 20.837800 8.795900
In [34]:
# Plot the distribution of the numerical columns that I'm analyzing

excluded = {'Population', 'Deaths', 'Confirmed', 'Country'}
# use a list (not a set) so that the order of the subplots is deterministic
features_to_display = [col for col in columns_of_interest if col not in excluded]

num_cols = 3
num_rows = -(-len(features_to_display) // num_cols)  # ceiling division, so no feature is left out
fig, axes = plt.subplots(num_rows, num_cols, figsize=(15, 15))  # create a grid of plots with 3 columns

for ax, feature in zip(axes.ravel(), features_to_display):
    ax.hist(df[feature], bins=10)
    ax.set_title(feature)  # label each histogram with its column name

Based on these plots, we can notice that some features follow a normal distribution, while others have a more "skewed" distribution. These will help us choose the appropriate standardization algorithm for these variables.
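
The visual impression of "normal" versus "skewed" can also be quantified with the sample skewness, which pandas exposes as the skew() method. The sketch below uses synthetic columns (not the real dataset) to show what the numbers look like and how a log transform reduces a long right tail. The column names are made up for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Synthetic columns, for illustration only
toy = pd.DataFrame({
    "roughly_normal": rng.normal(loc=20, scale=5, size=1000),
    "right_skewed": rng.lognormal(mean=0, sigma=1, size=1000),
})

# Skewness close to 0 suggests a symmetric distribution,
# a large positive value suggests a long right tail
skewness = toy.skew()
print(skewness)

# A log transform often pulls in a long right tail
log_skew = np.log1p(toy["right_skewed"]).skew()
print("after log1p:", log_skew)
```

On the real data, the same call would be df[features_to_display].skew(). Strongly skewed features are candidates for a log or power transform before standardization.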

Finally, let's look at the correlation between the features that we are analysing:

In [7]:
plt.figure(figsize=(16, 16))
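# compute the correlation on the numerical columns only (Country is a string column)
seaborn.heatmap(df.select_dtypes(include='number').corr(), annot=True, cmap='Blues')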

From the correlation matrix above we can notice that COVID-19 cases are correlated with the consumption of animal products and animal fats, alcohol and dairy, and with obesity. I would have expected sweetener consumption to have a much more significant impact.

Anyway, it seems that a healthy diet might indeed help your immune system.
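
A heatmap with many features can be hard to read, so it sometimes helps to extract only the correlations with the target column and sort them. The sketch below does this on synthetic data (the numbers are generated, not taken from the dataset) and also computes the Spearman rank correlation, which is less sensitive to skewed features and outliers than the default Pearson correlation.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n = 200
obesity = rng.uniform(2, 45, size=n)
# Synthetic data: Confirmed depends on Obesity by construction, Fruits does not
toy = pd.DataFrame({
    "Obesity": obesity,
    "Fruits": rng.uniform(0.5, 19, size=n),
    "Confirmed": 0.1 * obesity + rng.normal(0, 1, size=n),
})

# Correlation of every feature with the target, for both methods
pearson = toy.corr(method="pearson")["Confirmed"].drop("Confirmed")
spearman = toy.corr(method="spearman")["Confirmed"].drop("Confirmed")
summary = pd.DataFrame({"pearson": pearson, "spearman": spearman})
print(summary.sort_values("pearson", ascending=False))
```

Keep in mind that a correlation between country-level averages does not show that diet causes the difference in cases.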

Data analysis by country

Finally, let's analyse how the COVID-19 cases are distributed geographically.

First, let's plot the COVID-19 incidence (confirmed cases and deaths) by country.

In [8]:
df_confirmed_country = df[['Confirmed', 'Deaths', 'Country']]

#sort dataframe by confirmed cases
df_confirmed_country = df_confirmed_country.sort_values(by='Confirmed', ascending=False)

plt.figure(figsize = (12, 32))
seaborn.barplot(x = df_confirmed_country['Confirmed'], y=df_confirmed_country['Country'])

plt.xlabel("Confirmed cases")
plt.title("Confirmed cases per country")

It can be noticed that the countries with the highest number of confirmed cases are located in Europe. The USA is also among the countries with the most cases. This may be due to the fact that these countries perform more COVID-19 tests, have a high population density, and their inhabitants tend to travel more.
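
With more than 150 countries, the bar chart above is very tall. When only the most affected countries are of interest, the table can be restricted to the top entries with nlargest(). The sketch below uses made-up values and placeholder country names, and the choice of top_n is arbitrary.

```python
import pandas as pd

# Hypothetical values, for illustration only
toy = pd.DataFrame({
    "Country": ["A", "B", "C", "D", "E"],
    "Confirmed": [0.5, 7.2, 3.1, 9.8, 0.02],
    "Deaths": [0.01, 0.12, 0.05, 0.15, 0.001],
})

# Keep only the top_n countries by confirmed cases, already sorted in descending order
top_n = 3
top_confirmed = toy.nlargest(top_n, "Confirmed")[["Country", "Confirmed", "Deaths"]]
print(top_confirmed)
```

The resulting DataFrame can be passed to seaborn.barplot in the same way as df_confirmed_country.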

In [9]:
# Also display a choropleth map to have an idea of how the covid-cases  are distributed geographically
fig = plt_exp.choropleth(df, locations="Country",
                    locationmode='country names',