Since the spring of 2020, the entire world has been facing an ongoing pandemic caused by the SARS-CoV-2 virus. The symptoms of COVID-19 vary widely: some people don't even know that they are infected, while others face life-threatening symptoms. As the virus is new, researchers still have a lot of questions to answer: how does it spread, which medicines should be used to treat the symptoms, and how can we prevent infection?
It is believed that the virus spreads through the air and via contaminated surfaces. To limit its spread, authorities recommend social distancing, wearing face masks in public, frequent hand washing and disinfecting surfaces. In addition, multiple vaccines have been developed and distributed to the population. For infected persons, there are some treatments that address COVID-19 symptoms, but there are no drugs that inhibit the virus.
Some specialists also recommend taking vitamins (vitamin C and vitamin D3) to support the immune system. It is known that our eating habits and lifestyle make a great difference for a healthy immune system. Therefore, the main objective of this project is to analyse whether there is a correlation between diet (which is said to influence our immune system) and the COVID cases in different regions around the world.
In this project I want to study the relationship between COVID-19 cases and COVID-19 related deaths, and the diet of different countries. It is said that a healthy diet (lots of fruit and vegetables, reduced consumption of alcohol and animal products) and a healthy physiological state (neither obese nor undernourished) can help boost the immune system and therefore increase one's resistance to the virus.
I hope that after finishing this tutorial, the reader will understand the steps required to do a data science project, and will become more aware of how a healthy diet can help boost the immune system.
I used the following data analysis, visualization and machine learning libraries:
!pip install seaborn
!pip install plotly
import seaborn
import sklearn
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import plotly.express as plt_exp
%matplotlib inline
The data that I will work with in this project were downloaded from Kaggle: https://www.kaggle.com/mariaren/covid19-healthy-diet-dataset
The author of this dataset gathered nutrition data (energy intake, fat and protein quantity, etc.), obesity and undernourishment rates, as well as the most up-to-date confirmed/deaths/recovered/active COVID cases around the world.
The data was scraped from different websites:
The first step is to load the data, handle missing values and perform exploratory data analysis.
As mentioned in the introductory section, we are interested in the relationship between a healthy diet and an appropriate physiological state on the one hand, and the COVID-19 cases and deaths on the other.
My assumption of a healthy diet is:
The available data related to the physiological state of the inhabitants are the obesity and undernourishment rates.
With this in mind, I will only analyse the following columns related to diet and physiological state: Fruits, Sweeteners, Alcoohol, Vegetables, Obesity, Undernourished, Meat, Vegetal_products, Animal_products.
I noticed that some rows contain missing values, so I will remove those countries from the analysis.
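Before dropping rows, it can help to see which columns actually contain the gaps. The following is a minimal sketch on a small made-up table (the `toy` name and all values are invented for illustration and are not taken from the dataset); the same two calls could be applied to `df` once it is loaded.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the real table; the values are invented for illustration only
toy = pd.DataFrame({
    "Country": ["A", "B", "C", "D"],
    "Obesity": [20.1, np.nan, 5.3, 30.2],
    "Confirmed": [1.2, 0.4, np.nan, 2.5],
    "Fruits": [5.0, 6.1, 4.2, 3.3],
})

# Number of missing values in each column
missing_per_column = toy.isna().sum()
print(missing_per_column)

# Rows (countries) that dropna() would remove
rows_with_missing = toy[toy.isna().any(axis=1)]
print(rows_with_missing["Country"].tolist())
```

Looking at the per-column counts first shows whether a single column is responsible for most of the removed countries.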
# Load the data into a pandas DataFrame
df = pd.read_csv("Food_Supply_Quantity_kg_Data.csv")
# rename some columns that I will be using frequently
df.rename(columns = {"Miscellaneous":"Misc","Milk - Excluding Butter":"Dairy",
"Sugar & Sweeteners":"Sweeteners", "Fruits - Excluding Wine":"Fruits",
"Unit (all except Population)":"Unit",
"Alcoholic Beverages": "Alcoohol",
"Animal fats": "Animal_fats",
"Animal Products": "Animal_products",
"Vegetal Products": "Vegetal_products",
"Fish, Seafood": "Fish"
}, inplace=True)
countries_all = len(df)
# drop missing values
df = df.dropna()
countries_valid = len(df)
print('Removed {} countries from {} as they had missing information. Working with data from {} countries.'.format(countries_all - countries_valid, countries_all, countries_valid))
# keep only the columns we are interested in
columns_of_interest = ['Sweeteners', 'Alcoohol', 'Fruits', 'Animal_fats', 'Animal_products', 'Vegetal_products', 'Obesity', 'Undernourished', 'Meat',
'Deaths', 'Confirmed', 'Population', 'Vegetables', 'Dairy', 'Fish', 'Country']
df = df[columns_of_interest].copy()  # copy, so that the assignment below does not act on a view of the original frame
# The Undernourished column contains the string '<2.5' for some countries; treat it as 2.5 and convert the column to float
df['Undernourished'] = df['Undernourished'].replace('<2.5', '2.5').astype(float)
Now, let's look at the statistical properties of the data. First, I will print summary statistics for the numerical columns; then I will visualize their distributions.
df.describe()
# Plot the distribution of the numerical columns that I'm analyzing
features_to_display = sorted(set(columns_of_interest).difference({'Population', 'Deaths', 'Confirmed', 'Country'}))
num_cols = 3
num_rows = -(-len(features_to_display) // num_cols)  # ceiling division, so that every feature gets a subplot
fig, _ = plt.subplots(num_rows, num_cols, figsize=(15, 15))  # create a grid of plots with 3 columns
for idx, feature in enumerate(features_to_display):
    plt.subplot(num_rows, num_cols, idx + 1)  # "activate" the plot at index idx + 1 in the grid (num_rows, num_cols)
    plt.hist(df[feature], bins=10)
    plt.title(feature)
plt.show()
Based on these plots, we can notice that some features follow an approximately normal distribution, while others have a more "skewed" distribution. This will help us choose the appropriate standardization algorithm for these variables.
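One way to make "skewed" more precise is to compute the sample skewness of each column, which is close to 0 for a symmetric distribution and positive for a long right tail. The sketch below uses synthetic data, an arbitrary threshold of 1 and a log transform as one possible remedy (all three are assumptions for illustration, not choices made in this project); on the real data the call would be `df[features_to_display].skew()`.

```python
import numpy as np
import pandas as pd

# Synthetic columns: one roughly normal, one with a long right tail
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "symmetric": rng.normal(loc=10.0, scale=2.0, size=1000),
    "right_skewed": rng.exponential(scale=2.0, size=1000),
})

# Sample skewness of every column
skewness = toy.skew()
print(skewness)

# Columns whose absolute skewness exceeds the (arbitrary) threshold of 1
skewed_features = skewness[skewness.abs() > 1.0].index.tolist()
print(skewed_features)

# A log transform is one common way to reduce right skew before scaling
log_skew = np.log1p(toy["right_skewed"]).skew()
print(log_skew)
```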
Next, let's look at the correlation between the features that we are analysing:
plt.figure(figsize=(16, 16))
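# correlate the numeric columns only ('Country' holds text and cannot be correlated)
seaborn.heatmap(df.select_dtypes(include='number').corr(), annot=True, cmap='Blues')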
plt.show()
From the correlation matrix above we can see that COVID cases are correlated with the consumption of animal products and fats, alcohol and dairy, and with obesity. I would have expected sweetener consumption to show a much stronger correlation.
Anyway, it seems that a healthy diet might indeed help your immune system.
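A large heatmap can be hard to read, so a complementary view is to list the correlation of every feature with a single target column, ordered by strength. This is a minimal sketch on made-up numbers (the `toy` table and its values are invented); on the real data the same steps could be applied to the numeric columns of `df` with `Confirmed` or `Deaths` as the target.

```python
import pandas as pd

# Invented values, for illustration only
toy = pd.DataFrame({
    "Obesity": [10.0, 20.0, 30.0, 40.0, 50.0],
    "Vegetables": [50.0, 42.0, 28.0, 21.0, 9.0],
    "Fish": [3.0, 1.0, 4.0, 1.0, 5.0],
    "Confirmed": [1.0, 2.1, 2.9, 4.2, 5.0],
})

# Pearson correlation of each feature with the target column
corr = toy.corr()["Confirmed"].drop("Confirmed")

# Order the features by absolute correlation, strongest first
order = corr.abs().sort_values(ascending=False).index
corr_with_target = corr.loc[order]
print(corr_with_target)
```

Sorting by absolute value keeps strong negative correlations near the top instead of pushing them to the end of the list.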
Finally, let's analyse how the COVID-19 cases are distributed geographically.
First, let's plot the COVID-19 incidence (confirmed cases and deaths) by country.
df_confirmed_country = df[['Confirmed', 'Deaths', 'Country']]
#sort dataframe by confirmed cases
df_confirmed_country = df_confirmed_country.sort_values(by='Confirmed', ascending=False)
plt.figure(figsize = (12, 32))
seaborn.barplot(x = df_confirmed_country['Confirmed'], y=df_confirmed_country['Country'])
plt.xlabel("Confirmed cases")
plt.ylabel("Country")
plt.title("Confirmed cases per country")
plt.show()
The countries with the highest numbers of deaths are located in Europe, and the USA is also among the countries with the most cases. This may be due to the fact that these countries perform more COVID-19 tests, have a high population density, and that people tend to travel more in these regions.
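The bar chart above is sorted by confirmed cases, so to check the statement about deaths one can rank the countries by the `Deaths` column instead. A minimal sketch on invented numbers (the country names and values below are placeholders); on the real data the same `nlargest` call could be applied to `df_confirmed_country`.

```python
import pandas as pd

# Placeholder countries and values, for illustration only
toy = pd.DataFrame({
    "Country": ["A", "B", "C", "D", "E"],
    "Confirmed": [5.0, 3.0, 8.0, 1.0, 6.0],
    "Deaths": [0.10, 0.30, 0.05, 0.02, 0.20],
})

# Three countries with the highest value in the Deaths column
top_deaths = toy.nlargest(3, "Deaths")[["Country", "Deaths"]]
print(top_deaths)
```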
# Also display a choropleth map to have an idea of how the covid-cases are distributed geographically
fig = plt_exp.choropleth(df, locations="Country",
color="Confirmed",
locationmode='country names',
color_continuous_scale=plt_exp.colors.sequential.Plasma)
fig.show()