Hello everyone! In this blog we are going to explore some of the most used and simplest plots in data analysis. If you have gotten your hands dirty playing with data, you have probably come across at least one of these plots. In Python, we have been making these plots with Matplotlib. On top of that, we have tools like Seaborn (built on top of Matplotlib) that give us nice graphs. But those are not interactive plots. Plotly is all about interactivity!
This blog will be updated frequently.
This blog was prepared and run on Google Colab. If you are trying to run the code on your local computer, please install Plotly first with pip install plotly (you can visit the official link if you want), then install cufflinks with pip install cufflinks.
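Before going further, it can help to confirm that both packages are importable. The snippet below is a minimal sketch using only the standard library; the helper name `missing_packages` is made up for this example and is not part of Plotly or cufflinks.

```python
from importlib.util import find_spec


def missing_packages(names):
    """Return the top-level package names that cannot be imported here."""
    return [name for name in names if find_spec(name) is None]


# Hypothetical helper: prints whichever of the two packages still needs `pip install`.
print(missing_packages(["plotly", "cufflinks"]))
```

An empty list means both packages were found.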
import pandas as pd
import numpy as np
import warnings
from plotly.offline import init_notebook_mode, iplot
import plotly.figure_factory as ff
import cufflinks
import plotly.io as pio
cufflinks.go_offline()
cufflinks.set_config_file(world_readable=True, theme='pearl')
pio.renderers.default = "colab" # should change by looking into pio.renderers
pd.options.display.max_columns = None
# pd.options.display.max_rows = None
pio.renderers
Renderers configuration
-----------------------
    Default renderer: 'colab'
    Available renderers:
        ['plotly_mimetype', 'jupyterlab', 'nteract', 'vscode', 'notebook', 'notebook_connected', 'kaggle', 'azure', 'colab', 'cocalc', 'databricks', 'json', 'png', 'jpeg', 'jpg', 'svg', 'pdf', 'browser', 'firefox', 'chrome', 'chromium', 'iframe', 'iframe_connected', 'sphinx_gallery', 'sphinx_gallery_png']
If you are running Plotly on Colab, use pio.renderers.default = "colab"; otherwise choose a renderer from the list above according to your environment.
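If the same notebook is run in more than one place, the renderer could be picked automatically. The function below is only a sketch: `pick_renderer` is a made-up name, and it assumes that a Colab session has `google.colab` loaded and that a regular Jupyter kernel has `ipykernel` loaded.

```python
import sys


def pick_renderer(modules=None):
    """Guess a Plotly renderer name from the modules that are already loaded.

    Assumption: `google.colab` is present on Colab and `ipykernel` in a
    Jupyter kernel; anything else falls back to opening the browser.
    """
    modules = sys.modules if modules is None else modules
    if "google.colab" in modules:
        return "colab"
    if "ipykernel" in modules:
        return "notebook"
    return "browser"


print(pick_renderer())
```

The result could then be assigned with pio.renderers.default = pick_renderer(). All three names it returns appear in the list of available renderers.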
df = pd.read_csv("https://covid.ourworldindata.org/data/owid-covid-data.csv")
df["date"] = pd.to_datetime(df.date)
df
| | iso_code | continent | location | date | total_cases | new_cases | new_cases_smoothed | total_deaths | new_deaths | new_deaths_smoothed | total_cases_per_million | new_cases_per_million | new_cases_smoothed_per_million | total_deaths_per_million | new_deaths_per_million | new_deaths_smoothed_per_million | reproduction_rate | icu_patients | icu_patients_per_million | hosp_patients | hosp_patients_per_million | weekly_icu_admissions | weekly_icu_admissions_per_million | weekly_hosp_admissions | weekly_hosp_admissions_per_million | new_tests | total_tests | total_tests_per_thousand | new_tests_per_thousand | new_tests_smoothed | new_tests_smoothed_per_thousand | positive_rate | tests_per_case | tests_units | total_vaccinations | people_vaccinated | people_fully_vaccinated | total_boosters | new_vaccinations | new_vaccinations_smoothed | total_vaccinations_per_hundred | people_vaccinated_per_hundred | people_fully_vaccinated_per_hundred | total_boosters_per_hundred | new_vaccinations_smoothed_per_million | new_people_vaccinated_smoothed | new_people_vaccinated_smoothed_per_hundred | stringency_index | population | population_density | median_age | aged_65_older | aged_70_older | gdp_per_capita | extreme_poverty | cardiovasc_death_rate | diabetes_prevalence | female_smokers | male_smokers | handwashing_facilities | hospital_beds_per_thousand | life_expectancy | human_development_index | excess_mortality_cumulative_absolute | excess_mortality_cumulative | excess_mortality | excess_mortality_cumulative_per_million |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | AFG | Asia | Afghanistan | 2020-02-24 | 5.0 | 5.0 | NaN | NaN | NaN | NaN | 0.126 | 0.126 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 8.33 | 39835428.0 | 54.422 | 18.6 | 2.581 | 1.337 | 1803.987 | NaN | 597.029 | 9.59 | NaN | NaN | 37.746 | 0.5 | 64.83 | 0.511 | NaN | NaN | NaN | NaN |
1 | AFG | Asia | Afghanistan | 2020-02-25 | 5.0 | 0.0 | NaN | NaN | NaN | NaN | 0.126 | 0.000 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 8.33 | 39835428.0 | 54.422 | 18.6 | 2.581 | 1.337 | 1803.987 | NaN | 597.029 | 9.59 | NaN | NaN | 37.746 | 0.5 | 64.83 | 0.511 | NaN | NaN | NaN | NaN |
2 | AFG | Asia | Afghanistan | 2020-02-26 | 5.0 | 0.0 | NaN | NaN | NaN | NaN | 0.126 | 0.000 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 8.33 | 39835428.0 | 54.422 | 18.6 | 2.581 | 1.337 | 1803.987 | NaN | 597.029 | 9.59 | NaN | NaN | 37.746 | 0.5 | 64.83 | 0.511 | NaN | NaN | NaN | NaN |
3 | AFG | Asia | Afghanistan | 2020-02-27 | 5.0 | 0.0 | NaN | NaN | NaN | NaN | 0.126 | 0.000 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 8.33 | 39835428.0 | 54.422 | 18.6 | 2.581 | 1.337 | 1803.987 | NaN | 597.029 | 9.59 | NaN | NaN | 37.746 | 0.5 | 64.83 | 0.511 | NaN | NaN | NaN | NaN |
4 | AFG | Asia | Afghanistan | 2020-02-28 | 5.0 | 0.0 | NaN | NaN | NaN | NaN | 0.126 | 0.000 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 8.33 | 39835428.0 | 54.422 | 18.6 | 2.581 | 1.337 | 1803.987 | NaN | 597.029 | 9.59 | NaN | NaN | 37.746 | 0.5 | 64.83 | 0.511 | NaN | NaN | NaN | NaN |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
157699 | ZWE | Africa | Zimbabwe | 2022-01-23 | 228254.0 | 75.0 | 310.857 | 5294.0 | 2.0 | 6.714 | 15124.000 | 4.969 | 20.597 | 350.778 | 0.133 | 0.445 | 0.58 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 1541.0 | 1824420.0 | 120.885 | 0.102 | 3912.0 | 0.259 | 0.0795 | 12.6 | tests performed | 7512903.0 | 4242647.0 | 3270256.0 | NaN | 6117.0 | 10631.0 | 49.78 | 28.11 | 21.67 | NaN | 704.0 | 5182.0 | 0.034 | NaN | 15092171.0 | 42.729 | 19.6 | 2.822 | 1.882 | 1899.775 | 21.4 | 307.846 | 1.82 | 1.6 | 30.7 | 36.791 | 1.7 | 61.49 | 0.571 | NaN | NaN | NaN | NaN |
157700 | ZWE | Africa | Zimbabwe | 2022-01-24 | 228541.0 | 287.0 | 297.286 | 5305.0 | 11.0 | 6.714 | 15143.017 | 19.016 | 19.698 | 351.507 | 0.729 | 0.445 | 0.58 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 4913.0 | 1829333.0 | 121.211 | 0.326 | 4043.0 | 0.268 | 0.0735 | 13.6 | tests performed | 7517985.0 | 4245063.0 | 3272922.0 | NaN | 5082.0 | 10273.0 | 49.81 | 28.13 | 21.69 | NaN | 681.0 | 5009.0 | 0.033 | NaN | 15092171.0 | 42.729 | 19.6 | 2.822 | 1.882 | 1899.775 | 21.4 | 307.846 | 1.82 | 1.6 | 30.7 | 36.791 | 1.7 | 61.49 | 0.571 | NaN | NaN | NaN | NaN |
157701 | ZWE | Africa | Zimbabwe | 2022-01-25 | 228776.0 | 235.0 | 330.857 | 5316.0 | 11.0 | 8.286 | 15158.588 | 15.571 | 21.922 | 352.236 | 0.729 | 0.549 | 0.58 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 7525574.0 | 4248576.0 | 3276998.0 | NaN | 7589.0 | 9579.0 | 49.86 | 28.15 | 21.71 | NaN | 635.0 | 4638.0 | 0.031 | NaN | 15092171.0 | 42.729 | 19.6 | 2.822 | 1.882 | 1899.775 | 21.4 | 307.846 | 1.82 | 1.6 | 30.7 | 36.791 | 1.7 | 61.49 | 0.571 | NaN | NaN | NaN | NaN |
157702 | ZWE | Africa | Zimbabwe | 2022-01-26 | 228943.0 | 167.0 | 293.714 | 5321.0 | 5.0 | 7.857 | 15169.653 | 11.065 | 19.461 | 352.567 | 0.331 | 0.521 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 15092171.0 | 42.729 | 19.6 | 2.822 | 1.882 | 1899.775 | 21.4 | 307.846 | 1.82 | 1.6 | 30.7 | 36.791 | 1.7 | 61.49 | 0.571 | NaN | NaN | NaN | NaN |
157703 | ZWE | Africa | Zimbabwe | 2022-01-27 | 229096.0 | 153.0 | 220.571 | 5324.0 | 3.0 | 6.857 | 15179.791 | 10.138 | 14.615 | 352.766 | 0.199 | 0.454 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 15092171.0 | 42.729 | 19.6 | 2.822 | 1.882 | 1899.775 | 21.4 | 307.846 | 1.82 | 1.6 | 30.7 | 36.791 | 1.7 | 61.49 | 0.571 | NaN | NaN | NaN | NaN |
157704 rows × 67 columns
The first step of any data analysis is checking for missing values. Here we count the missing values in each column and their share of all rows.
total = df.isnull().sum().sort_values(ascending = False)
percent = (df.isnull().sum()/df.isnull().count()).sort_values(ascending = False)
mdf = pd.concat([total, percent], axis=1, keys=['Total', 'Percent'])
mdf = mdf.reset_index()
mdf
| | index | Total | Percent |
---|---|---|---|
0 | weekly_icu_admissions | 153311 | 0.972144 |
1 | weekly_icu_admissions_per_million | 153311 | 0.972144 |
2 | excess_mortality_cumulative_per_million | 152284 | 0.965632 |
3 | excess_mortality | 152284 | 0.965632 |
4 | excess_mortality_cumulative_absolute | 152284 | 0.965632 |
... | ... | ... | ... |
62 | total_cases | 2850 | 0.018072 |
63 | population | 1038 | 0.006582 |
64 | date | 0 | 0.000000 |
65 | location | 0 | 0.000000 |
66 | iso_code | 0 | 0.000000 |
67 rows × 3 columns
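The same summary can be written a little more compactly. This is a sketch on a made-up toy frame rather than the real dataset; it relies on the fact that the mean of a boolean column is the fraction of True values.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the COVID data; the values are made up.
toy = pd.DataFrame({
    "location": ["A", "B", "C", "D"],
    "total_cases": [1.0, np.nan, 3.0, 4.0],
    "icu_patients": [np.nan, np.nan, np.nan, 7.0],
})

# isna().sum() counts missing rows per column; isna().mean() gives their share,
# because True counts as 1 and False as 0.
summary = pd.DataFrame({
    "Total": toy.isna().sum(),
    "Percent": toy.isna().mean(),
}).sort_values("Total", ascending=False)

print(summary)
```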
mdf.query("Total>100000").iplot(kind='pie',labels = "index",
values="Total", textinfo="percent+label",
title='Top Columns with Missing Values', hole = 0.5)
The plot above looks a little cluttered. We can clean it up by not passing textinfo.
mdf.query("Total>100000").iplot(kind='pie',labels = "index",
values="Total",
title='Top Columns with Missing Values', hole = 0.5)
The location field of our data contains country names, continent names and the world total, so we will skip the aggregate locations first. Then we will calculate the total for each day by grouping on date.
Let's first plot a simple line chart with only new cases. We can always add more lines to it.
todf = df[~df.location.isin(["Lower middle income", "North America", "World", "Asia", "Europe",
"European Union", "Upper middle income",
"High income", "South America"])]
tdf = todf.groupby("date").aggregate(new_cases=("new_cases", "sum"),
new_deaths = ("new_deaths", "sum"),
new_vaccinations = ("new_vaccinations", "sum"),
new_tests = ("new_tests", "sum")
).reset_index()
tdf.iplot(kind="line",
y="new_cases",
x="date",
xTitle="Date",
width=2,
yTitle="new_cases",
title="New Cases from Jan 2020 to Jan 2022")
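The hand-written list of aggregate locations used in the filter is easy to leave incomplete. A possible alternative, sketched here on made-up rows, is to keep only rows that have a continent value. This assumes that aggregate rows such as "World" or "High income" leave the continent column empty, which should be checked against the real data before relying on it.

```python
import numpy as np
import pandas as pd

# Made-up rows that mimic the layout of the dataset.
# Assumption: aggregate rows have no `continent` value.
toy = pd.DataFrame({
    "location": ["Nepal", "India", "Asia", "World", "High income"],
    "continent": ["Asia", "Asia", np.nan, np.nan, np.nan],
    "date": pd.to_datetime(["2022-01-01"] * 5),
    "new_cases": [10.0, 90.0, 100.0, 100.0, 0.0],
})

# Keep country-level rows only, then sum per day.
countries = toy[toy["continent"].notna()]
daily = countries.groupby("date", as_index=False).agg(new_cases=("new_cases", "sum"))

print(daily)
```

Without the filter, the toy total would count the same cases three times (once per country, once for Asia, once for World).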
The line chart looks good, but now let's plot multiple lines at the same time on the same figure.
tdf.iplot(kind="line",
y=["new_deaths", "new_vaccinations", "new_tests"],
x="date",
xTitle="Date",
width=2,
yTitle="Cases",
title="Cases from Jan 2020 to Jan 2022")
It does not look that good because new_deaths is barely visible next to the much larger values of the other two columns. Let's draw them in subplots so that we can see each line distinctly.
tdf.iplot(kind="line",
y=["new_deaths", "new_vaccinations", "new_tests"],
x="date",
xTitle="Date",
width=2,
yTitle="Cases",
subplots=True,
title="Cases from Jan 2020 to Jan 2022")
Now it's better.
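Subplots are one way around the difference in scale. Another option, shown here only as a sketch on made-up numbers, is to rescale every column to the 0 to 1 range before plotting, so that all lines share one axis. Note that this hides the absolute values, so it suits comparing shapes rather than magnitudes.

```python
import pandas as pd

# Made-up daily values with very different magnitudes.
toy = pd.DataFrame({
    "new_deaths": [1.0, 2.0, 3.0],
    "new_tests": [1000.0, 3000.0, 2000.0],
})

# Min-max scaling: each column is mapped so its minimum is 0 and its maximum is 1.
scaled = (toy - toy.min()) / (toy.max() - toy.min())

print(scaled)
```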
We can even plot a variable on a secondary y-axis. Now let's plot new tests and new vaccinations together.
tdf.iplot(kind="line",
y=["new_vaccinations"],
secondary_y = "new_tests",
x="date",
xTitle="Date",
width=2,
yTitle="new_vaccinations",
secondary_y_title="new_tests",
title="Cases from Jan 2020 to Jan 2022")
tdf.iplot(kind="scatter",
y="new_deaths", x='new_cases',
mode='markers',
yTitle="New Deaths", xTitle="New Cases",
title="New Deaths vs New Cases")
It seems that most of the deaths happened while cases were low.
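An impression read off a scatter plot can be checked with a couple of numbers. The sketch below uses made-up values and an arbitrary definition of "low" (days below the median number of cases); the variable names are illustrative only.

```python
import pandas as pd

# Made-up daily values; on the real data `toy` would be the aggregated frame.
toy = pd.DataFrame({
    "new_cases": [100.0, 200.0, 300.0, 400.0],
    "new_deaths": [2.0, 4.0, 6.0, 8.0],
})

# Share of all deaths that fall on days with below-median cases.
low_case_days = toy["new_cases"] < toy["new_cases"].median()
share_of_deaths = toy.loc[low_case_days, "new_deaths"].sum() / toy["new_deaths"].sum()

# Pearson correlation between the two columns.
r = toy["new_cases"].corr(toy["new_deaths"])

print(share_of_deaths, r)
```

A share well above 0.5 would support the claim; a share below it would suggest the dense cluster of points near the origin is misleading.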
We can even add a secondary y-axis here. Let's visualize new tests along with them.
tdf.iplot(kind="scatter",
x="new_deaths", y='new_cases',
secondary_y="new_tests",
secondary_y_title="New Tests",
mode='markers',
xTitle="New Deaths", yTitle="New Cases",
title="New Deaths vs New Cases")
We can even use subplots here.
tdf.iplot(kind="scatter",
x="new_deaths", y='new_cases',
secondary_y="new_tests",
secondary_y_title="New Tests",
mode='markers',
subplots=True,
xTitle="New Deaths", yTitle="New Cases",
title="New Deaths vs New Cases")
How about plotting the top 20 countries where the most deaths have occurred?
But first, aggregate the data by taking the maximum of the total deaths column for each location. Since total_deaths is a running total, its maximum per location is the most recent count. Thanks to the authors of this dataset, we do not have to get our hands dirty much. Then take the top 20 by using nlargest.
tdf = df[~df.location.isin(["Lower middle income", "North America", "World", "Asia", "Europe",
"European Union", "Upper middle income",
"High income", "South America"])].groupby("location").aggregate(total_deaths=("total_deaths", "max"),
total_cases = ("total_cases", "max"),
total_tests = ("total_tests", "max")).reset_index()
topdf = tdf.nlargest(20, "total_deaths")
topdf.iplot(kind="bar", x="location",
y="total_deaths",
theme="polar",
xTitle="Countries", yTitle="Total Deaths",
title="Top 20 Countries according to total deaths")
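The aggregation step is easier to follow on a tiny example. The frame below is made up: each location has two rows of a cumulative column, the maximum per location picks the latest total, and `nlargest` keeps the biggest ones.

```python
import pandas as pd

# Made-up cumulative totals, two dates per location.
toy = pd.DataFrame({
    "location": ["A", "A", "B", "B", "C", "C"],
    "total_deaths": [1.0, 5.0, 2.0, 3.0, 10.0, 12.0],
})

# One row per location holding the largest (latest) cumulative value.
per_country = toy.groupby("location", as_index=False).agg(
    total_deaths=("total_deaths", "max")
)

# Keep the two locations with the highest totals, largest first.
top2 = per_country.nlargest(2, "total_deaths")

print(top2)
```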
The bar chart looks great. We can play with the theme too.
We can even make it horizontal.
topdf.iplot(kind="bar", x="location",
y="total_deaths",
theme="polar", orientation='h',
xTitle="Countries", yTitle="Total Deaths",
title="Top 20 Countries according to total deaths")
We can even plot multiple bars at the same time. In Seaborn, we would do this with hue, but here we only have to pass a list of columns to y. Let's plot bars of total deaths, total cases and total tests.
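To make the comparison with Seaborn concrete: hue works on long-format data, where one column names the metric and another holds the value. The sketch below reshapes a made-up wide frame with `melt` to show what that extra step looks like; the column names `metric` and `value` are chosen for this example.

```python
import pandas as pd

# Made-up wide frame: one column per metric.
wide = pd.DataFrame({
    "location": ["A", "B"],
    "total_deaths": [5, 3],
    "total_cases": [100, 80],
})

# Long format: one row per (location, metric) pair.
long = wide.melt(id_vars="location", var_name="metric", value_name="value")

print(long)
```

With cufflinks the wide frame can be used as it is, which is what the next call does.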
topdf.iplot(kind="bar", x="location",
y=["total_deaths", "total_cases", "total_tests"],
theme="polar",
xTitle="Countries", yTitle="Total Deaths",
title="Top 20 Countries according to total deaths")
But total deaths is not clearly visible, so let's try a different bar mode. We can choose one of 'stack', 'group', 'overlay' and 'relative'.
topdf.iplot(kind="bar", x="location",
y=["total_deaths", "total_cases", "total_tests"],
theme="polar",
barmode="overlay",
xTitle="Countries", yTitle="Total Deaths",
title="Top 20 Countries according to total deaths")
But it is still not clear. One solution is to plot in subplots.
topdf.iplot(kind="bar", x="location",
y=["total_deaths", "total_cases", "total_tests"],
theme="polar",
barmode="overlay",
xTitle="Countries", yTitle="Total Deaths",
subplots=True,
title="Top 20 Countries according to total deaths")
Much better.
How about viewing the distribution of total tests done?
tdf.iplot(kind="hist",
bins=50,
colors=["red"],
keys=["total_tests"],
title="Total tests Histogram")
To see histograms of other columns in the same figure, we pass them in keys.
tdf.iplot(kind="hist",
bins=100,
colors=["red"],
keys=["total_tests", "total_cases", "total_deaths"],
title="Multiple Histogram")
It does not look good because the columns are heavily skewed and have very different ranges. Let's visualize them in separate plots.
tdf.iplot(kind="hist",
subplots=True,
keys=["total_tests", "total_cases", "total_deaths"],
title="Multiple Histogram")
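When a few very large values dominate a column, taking a logarithm before building the histogram is a common remedy. The sketch below demonstrates the effect with NumPy on made-up numbers; it only applies to strictly positive values, since the logarithm of zero is not finite.

```python
import numpy as np

# Made-up, heavily right-skewed totals: one huge value dominates.
totals = np.array([10, 100, 1_000, 10_000, 1_000_000], dtype=float)

# Equal-width bins on the raw scale put almost everything into the first bin.
raw_counts, _ = np.histogram(totals, bins=5)

# On a log10 scale the same values spread across the bins much more evenly.
log_counts, _ = np.histogram(np.log10(totals), bins=5)

print(raw_counts)
print(log_counts)
```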
How about viewing outliers in data?
tdf.iplot(kind="box",
keys=["total_tests", "total_cases", "total_deaths"],
boxpoints="outliers",
x="location",
xTitle="Columns", title="Box Plot Tests, Cases and Deaths")
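The points a box plot draws as outliers are usually those lying more than 1.5 times the interquartile range beyond the quartiles. The sketch below applies that rule by hand with pandas on made-up numbers. Plotly computes its quartiles with its own method, so the exact cut-offs in the figure may differ slightly from these.

```python
import pandas as pd

# Made-up values with one obvious outlier.
values = pd.Series([1, 2, 3, 4, 5, 100], name="total_deaths")

# Quartiles and interquartile range.
q1, q3 = values.quantile(0.25), values.quantile(0.75)
iqr = q3 - q1

# The 1.5 * IQR fences; anything outside them is flagged.
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = values[(values < lower) | (values > upper)]

print(outliers)
```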