A General Way to Perform an EDA
Introduction
Hello everyone, welcome back to another blog, where we will explore different ideas and concepts one could apply while performing an EDA. In simple words, this blog is a walk-through of a typical EDA process, which might include (in top-down order):
- Data Loading: From various sources (remote, local) and various formats (excel, csv, sql etc.)
- Data Check: This is a very important task where we check the data type (numerical, categorical, binary, etc.) of each field. We also focus on the number of missing values.
- Data Transformation: This includes filling null values or removing them from the table. We also do some data type conversions if required.
- Descriptive Analysis: This is the heart of any EDA. Here we do lots of statistical tasks: finding the mean, median, quartiles, mode, distributions, and relationships between fields. We also draw different plots to support the analysis. This is sometimes enough to give insights about the data, but if the data is rich and we need more insights and assumptions, we have to do Inferential Analysis.
- Inferential Analysis: This task is sometimes included in the EDA part, but most of the time we do inferential analysis along with model development. However, we do perform different tests (e.g. the Chi-Square Test) to calculate feature importance. Here we often run tests based on hypotheses and samples drawn from the population.
While walking through these major steps, we will try to answer different analysis questions, like how many times some categorical value appeared, how a field is distributed over dates, what the performance is over certain cases, and so on.
Please follow this link for the interactive version of this blog: General Way of Doing EDA.
Data Loading
Installing Libraries
!pip install autoviz
!pip install seaborn
!pip install plotly
!pip install cufflinks
!pip install pandas
- Autoviz is for automatic visualization, but it is heavy and resource hungry.
- Seaborn is built on top of matplotlib and is best for making rich static plots.
- Plotly is for interactive visualization.
- Cufflinks is for connecting pandas and plotly.
- Pandas is for data analysis.
Importing Libraries
With the libraries installed, import them like below:
import autoviz
from autoviz.AutoViz_Class import AutoViz_Class
from pandas_profiling import ProfileReport
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
import warnings
from plotly.offline import init_notebook_mode, iplot
import plotly.figure_factory as ff
import cufflinks
import plotly.io as pio
cufflinks.go_offline()
cufflinks.set_config_file(world_readable=True, theme='pearl')
pio.renderers.default = "notebook" # should change by looking into pio.renderers
pd.options.display.max_columns = None
%matplotlib inline
In the above step, we have told cufflinks to make plotly plots available offline. And if we are working locally in a Jupyter Notebook, we should make sure pio.renderers.default = "notebook" is set (check pio.renderers for the available options).
Reading File
To make things easier, I am reading the file from local storage; it was downloaded from Kaggle.
According to the author, the data was collected by 3 IoT devices placed under different environmental conditions. These environmental conditions play a major role in the analysis later on.
device | environmental conditions |
---|---|
00:0f:00:70:91:0a | stable conditions, cooler and more humid |
1c:bf:ce:15:ec:4d | highly variable temperature and humidity |
b8:27:eb:bf:9d:51 | stable conditions, warmer and dryer |
df=pd.read_csv("iot_telemetry_data.csv")
Viewing Shape of Data
How many rows and columns are there?
df.shape
(405184, 9)
There are only 9 columns but lots of rows.
Viewing Top Data
df.head()
ts | device | co | humidity | light | lpg | motion | smoke | temp | |
---|---|---|---|---|---|---|---|---|---|
0 | 1.594512e+09 | b8:27:eb:bf:9d:51 | 0.004956 | 51.000000 | False | 0.007651 | False | 0.020411 | 22.700000 |
1 | 1.594512e+09 | 00:0f:00:70:91:0a | 0.002840 | 76.000000 | False | 0.005114 | False | 0.013275 | 19.700001 |
2 | 1.594512e+09 | b8:27:eb:bf:9d:51 | 0.004976 | 50.900000 | False | 0.007673 | False | 0.020475 | 22.600000 |
3 | 1.594512e+09 | 1c:bf:ce:15:ec:4d | 0.004403 | 76.800003 | True | 0.007023 | False | 0.018628 | 27.000000 |
4 | 1.594512e+09 | b8:27:eb:bf:9d:51 | 0.004967 | 50.900000 | False | 0.007664 | False | 0.020448 | 22.600000 |
Data Check
Viewing Data Types
df.dtypes
ts float64
device object
co float64
humidity float64
light bool
lpg float64
motion bool
smoke float64
temp float64
dtype: object
It seems that we have float data in most of the columns. According to the author, the columns are defined as:
column | description | units |
---|---|---|
ts | timestamp of event | epoch |
device | unique device name | string |
co | carbon monoxide | ppm (%) |
humidity | humidity | percentage |
light | light detected? | boolean |
lpg | liquid petroleum gas | ppm (%) |
motion | motion detected? | boolean |
smoke | smoke | ppm (%) |
temp | temperature | Fahrenheit |
Checking Missing Values
This is very crucial, as missing values can lead to false assumptions, and sometimes we have to remove or replace them. Let's check how many of the columns have missing values.
total = df.isnull().sum().sort_values(ascending = False)
percent = (df.isnull().sum()/df.isnull().count()).sort_values(ascending = False)
mdf = pd.concat([total, percent], axis=1, keys=['Total', 'Percent'])
mdf = mdf.reset_index()
mdf
index | Total | Percent | |
---|---|---|---|
0 | ts | 0 | 0.0 |
1 | device | 0 | 0.0 |
2 | co | 0 | 0.0 |
3 | humidity | 0 | 0.0 |
4 | light | 0 | 0.0 |
5 | lpg | 0 | 0.0 |
6 | motion | 0 | 0.0 |
7 | smoke | 0 | 0.0 |
8 | temp | 0 | 0.0 |
It seems that there is no missing data in our dataset, which is great. But what about outliers? Outliers also play a huge role in making data modeling a tough task. That check falls under the Descriptive Analysis part.
Data Transformation
Since we do not have missing data, there is not much to do besides converting the timestamp to a datetime. But we might need to transform our data based on the outliers later.
Datetime
Let's convert the timestamp to a datetime, because we will visualize some sort of time series later on.
from datetime import datetime
df["date"]= df.ts.apply(datetime.fromtimestamp)
Device Name
Let's make our device names a little more readable. We create a new column device_name and fill it with the environment description mapped from the device id.
d={"00:0f:00:70:91:0a":"cooler,more,humid",
"1c:bf:ce:15:ec:4d":"variable temp/humidity",
"b8:27:eb:bf:9d:51":"stable, warmer, dry"}
df["device_name"] = df.device.apply(lambda x: d[x])
df
ts | device | co | humidity | light | lpg | motion | smoke | temp | date | device_name | |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1.594512e+09 | b8:27:eb:bf:9d:51 | 0.004956 | 51.000000 | False | 0.007651 | False | 0.020411 | 22.700000 | 2020-07-12 05:46:34.385975 | stable, warmer, dry |
1 | 1.594512e+09 | 00:0f:00:70:91:0a | 0.002840 | 76.000000 | False | 0.005114 | False | 0.013275 | 19.700001 | 2020-07-12 05:46:34.735568 | cooler,more,humid |
2 | 1.594512e+09 | b8:27:eb:bf:9d:51 | 0.004976 | 50.900000 | False | 0.007673 | False | 0.020475 | 22.600000 | 2020-07-12 05:46:38.073573 | stable, warmer, dry |
3 | 1.594512e+09 | 1c:bf:ce:15:ec:4d | 0.004403 | 76.800003 | True | 0.007023 | False | 0.018628 | 27.000000 | 2020-07-12 05:46:39.589146 | variable temp/humidity |
4 | 1.594512e+09 | b8:27:eb:bf:9d:51 | 0.004967 | 50.900000 | False | 0.007664 | False | 0.020448 | 22.600000 | 2020-07-12 05:46:41.761235 | stable, warmer, dry |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
405179 | 1.595203e+09 | 00:0f:00:70:91:0a | 0.003745 | 75.300003 | False | 0.006247 | False | 0.016437 | 19.200001 | 2020-07-20 05:48:33.162015 | cooler,more,humid |
405180 | 1.595203e+09 | b8:27:eb:bf:9d:51 | 0.005882 | 48.500000 | False | 0.008660 | False | 0.023301 | 22.200000 | 2020-07-20 05:48:33.576561 | stable, warmer, dry |
405181 | 1.595203e+09 | 1c:bf:ce:15:ec:4d | 0.004540 | 75.699997 | True | 0.007181 | False | 0.019076 | 26.600000 | 2020-07-20 05:48:36.167959 | variable temp/humidity |
405182 | 1.595203e+09 | 00:0f:00:70:91:0a | 0.003745 | 75.300003 | False | 0.006247 | False | 0.016437 | 19.200001 | 2020-07-20 05:48:36.979522 | cooler,more,humid |
405183 | 1.595203e+09 | b8:27:eb:bf:9d:51 | 0.005914 | 48.400000 | False | 0.008695 | False | 0.023400 | 22.200000 | 2020-07-20 05:48:37.264313 | stable, warmer, dry |
Descriptive Analysis
Descriptive Statistics is all about describing the data in the terms of some numbers, charts, graphs or plots. In descriptive statistics, our focus will be on the summary of the data like mean, spread, quartiles, percentiles and so on.
Let's get a little deeper into the descriptive analysis. Here we will measure (a quick numeric sketch follows the list):
- Central tendency, which focuses on the average.
- Variability (measure of dispersion), which focuses on how far the data has spread.
- Distribution (frequency distribution), which focuses on the number of times something occurred.
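These three measures can be computed directly with pandas before any plotting. A quick sketch using the temp column as an example:

# central tendency
print(df.temp.mean())    # mean
print(df.temp.median())  # median
print(df.temp.mode()[0]) # mode
# variability
print(df.temp.std())                        # standard deviation
print(df.temp.quantile([0.25, 0.5, 0.75]))  # quartiles
# frequency distribution, bucketed into 10 bins
print(df.temp.value_counts(bins=10))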
Frequency Distribution
How many observations are there for each device?
df.groupby("device_name").ts.count().rename("Counts").reset_index().iplot(kind="pie", labels="device_name", values="Counts")
Insights
There seems to be a high number of records from the device that was kept in the stable, warmer and dry place.
What is the distribution of a field over time?
This question could also be answered with time series analysis, but we are not focusing on that in this blog.
We have a datetime column already prepared from the timestamp. Let's use that column here.
cols = [i for i in df.columns if i not in ["date", "ts", "device", "device_name"]]
for c in cols:
    plt.figure(figsize=(15, 10))
    sns.scatterplot(data=df, x="date", y=c, hue="device_name")
    plt.title(label=f"Distribution of {c} over time for each Device")
    plt.show()
Insights
- There seem to be high spikes in the CO recorded by the cooler, more humid place's device.
- Humidity seems normal for all 3 devices, but the flow is not normal for the device in the cooler, more humid place.
- LPG seems to be decreasing for the cooler, more humid place's device and increasing for the stable, warmer, dry place's device.
- And so on.
What is the distribution of each column?
df.co.iplot(kind="hist", xTitle="ppm in %", yTitle="Frequency", title="Frequency Distribution of CO")
# df.co.plot(kind="hist", title="Frequency Distribution of CO")
plt.figure(figsize=(8,5))
sns.distplot(df.co, kde=False, color='red', bins=100)
plt.title('Frequency Distribution of CO As a Whole', fontsize=18)
plt.xlabel('Units in ppm (%)', fontsize=16)
plt.ylabel('Frequency', fontsize=16)
plt.figure(figsize=(18,10))
sns.histplot(data=df, x="co", hue="device_name")
plt.title('Frequency Distribution of CO with Device', fontsize=18)
plt.xlabel('Units in ppm (%)', fontsize=16)
plt.ylabel('Frequency', fontsize=16)
CO Insights
- It seems that there is a huge number of CO readings between 0.004 and 0.006 ppm.
- There are some readings around 0.012 too, which might be outliers in our case; we will later visualize them per device.
- The device starting with b8 seems to have read the most CO. This device was placed in stable, warmer and dryer conditions.
All
Let's visualize a histogram of each field based on the device name.
for c in [i for i in df.columns if i not in ["date", "ts", "device", "device_name"]]:
    plt.figure(figsize=(18,10))
    sns.histplot(data=df, x=c, hue="device_name")
    plt.title(f'Frequency Distribution of {c}', fontsize=18)
    plt.xlabel(f'Values of {c}', fontsize=16)
    plt.ylabel('Frequency', fontsize=16)
    plt.show()
All Insights
- Temp seems to have the widest distribution for the device which was in the variable temperature place.
- Smoke seems to have the widest distribution for the device which was in the cooler place.
- LPG seems to have the widest distribution for the device which was in the cooler place.
Central Tendency
Let's view the summary of each numerical field first.
Overall Insights
df[[i for i in df.columns if i not in ["date", "ts", "device", "device_name"]]].describe()
co | humidity | lpg | smoke | temp | |
---|---|---|---|---|---|
count | 405184.000000 | 405184.000000 | 405184.000000 | 405184.000000 | 405184.000000 |
mean | 0.004639 | 60.511694 | 0.007237 | 0.019264 | 22.453987 |
std | 0.001250 | 11.366489 | 0.001444 | 0.004086 | 2.698347 |
min | 0.001171 | 1.100000 | 0.002693 | 0.006692 | 0.000000 |
25% | 0.003919 | 51.000000 | 0.006456 | 0.017024 | 19.900000 |
50% | 0.004812 | 54.900000 | 0.007489 | 0.019950 | 22.200000 |
75% | 0.005409 | 74.300003 | 0.008150 | 0.021838 | 23.600000 |
max | 0.014420 | 99.900002 | 0.016567 | 0.046590 | 30.600000 |
cols = [i for i in df.columns if i not in ["date", "ts", "device", "device_name", "motion", "light"]]
df[cols].iplot(kind="box", subplots=True)
- By hovering over each subplot, we can get the min, max, mean and median values.
- Looking at how the horizontal lines are placed, we can make assumptions about how skewed the data is; we can also quantify this, as sketched below.
- It seems that there is high deviation in temperature and humidity, which means there could be outliers.
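To back the visual impression with a number, pandas can compute the skewness of each column directly (positive means right-skewed, negative means left-skewed):

# skewness of the numeric columns selected above
df[cols].skew()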
Insights Based on Device
Since our overall data might be biased, we have to look at insights per device. But why biased? The reasons are:
- Each device was in a distinct environment.
- Each device has a different number of recordings.
df[[i for i in df.columns if i not in ["date", "ts", "device"]]].groupby("device_name").describe()
co | humidity | lpg | smoke | temp | ||||||||||||||||||||||||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
count | mean | std | min | 25% | 50% | 75% | max | count | mean | std | min | 25% | 50% | 75% | max | count | mean | std | min | 25% | 50% | 75% | max | count | mean | std | min | 25% | 50% | 75% | max | count | mean | std | min | 25% | 50% | 75% | max | |
device_name | ||||||||||||||||||||||||||||||||||||||||
cooler,more,humid | 111815.0 | 0.003527 | 0.001479 | 0.001171 | 0.002613 | 0.003230 | 0.004116 | 0.014420 | 111815.0 | 75.444361 | 1.975801 | 1.1 | 74.400002 | 75.400002 | 76.500000 | 99.900002 | 111815.0 | 0.005893 | 0.001700 | 0.002693 | 0.004815 | 0.005613 | 0.006689 | 0.016567 | 111815.0 | 0.015489 | 0.004809 | 0.006692 | 0.012445 | 0.014662 | 0.017682 | 0.046590 | 111815.0 | 19.362552 | 0.643786 | 0.0 | 19.100000 | 19.4 | 19.700001 | 20.200001 |
stable, warmer, dry | 187451.0 | 0.005560 | 0.000559 | 0.004646 | 0.005079 | 0.005439 | 0.005993 | 0.007955 | 187451.0 | 50.814077 | 1.888926 | 45.1 | 49.600000 | 50.900000 | 52.100000 | 63.300000 | 187451.0 | 0.008306 | 0.000599 | 0.007301 | 0.007788 | 0.008183 | 0.008778 | 0.010774 | 187451.0 | 0.022288 | 0.001720 | 0.019416 | 0.020803 | 0.021931 | 0.023640 | 0.029422 | 187451.0 | 22.279969 | 0.481902 | 21.0 | 21.900000 | 22.3 | 22.600000 | 24.100000 |
variable temp/humidity | 105918.0 | 0.004183 | 0.000320 | 0.003391 | 0.003931 | 0.004089 | 0.004391 | 0.006224 | 105918.0 | 61.910247 | 8.944792 | 1.6 | 55.599998 | 59.599998 | 65.300003 | 92.000000 | 105918.0 | 0.006764 | 0.000373 | 0.005814 | 0.006470 | 0.006657 | 0.007009 | 0.009022 | 105918.0 | 0.017895 | 0.001055 | 0.015224 | 0.017064 | 0.017592 | 0.018589 | 0.024341 | 105918.0 | 26.025511 | 2.026427 | 0.0 | 24.299999 | 25.9 | 27.299999 | 30.600000 |
It is hard to get any insights from the above table. Let's view it by looping.
for d in df.device_name.unique():
    df.query(f"device_name=='{d}'")[cols].iplot(kind="box", subplots=True, title=f"Box Plot of device placed at {d}")
If we observe the plots closely, a significant difference between the devices can be seen for the column temp.
- The CO recorded seems to be higher for the device placed at variable temperature, but the spread is largest for the stable, warmer and dry one.
- The humidity recorded seems to be lower for the stable, warmer, dry place's device.
- The LPG recorded seems to be well spread for the stable, warmer place's device.
- The smoke recorded seems to be well spread for the stable, warmer, dry place's device.
- Temperature is self-explanatory: it is lower for the cooler place. And so on.
Finding Outliers
for c in cols:
    plt.figure(figsize=(15,8))
    sns.boxplot(x="device_name", y=c, data=df)
    plt.title(label=f"Box Plot of {c}")
    plt.show()
Outliers are the points that lie away from the horizontal bars (the whiskers). To put numbers on these observations, we can count the points beyond the whiskers, as sketched after this list.
- There seem to be many outliers in co for the device which was in the cooler, more humid place.
- There seem to be many outliers in humidity for the devices which were in the cooler, more humid and variable temp/humidity places.
- There seem to be many outliers in lpg for the device which was in the cooler, more humid place.
- There seem to be many outliers in smoke for the device which was in the cooler, more humid place.
- There seem to be many outliers in temp for the device which was in the variable temp/humidity place.
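A minimal numeric check of the same thing, counting points outside the 1.5 * IQR whiskers (the common rule of thumb that matches the default whiskers of the box plots above):

for c in cols:
    q1, q3 = df[c].quantile(0.25), df[c].quantile(0.75)
    iqr = q3 - q1
    outside = (df[c] < q1 - 1.5 * iqr) | (df[c] > q3 + 1.5 * iqr)
    print(c, outside.sum())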
Correlations
Let's find Pearson's correlation, whose range lies from -1 to 1. A value of -1 means perfectly negatively correlated, whereas +1 means perfectly positively correlated.
Overall
df.corr().iplot(kind="heatmap")
df.corr()
ts | co | humidity | light | lpg | motion | smoke | temp | |
---|---|---|---|---|---|---|---|---|
ts | 1.000000 | 0.025757 | 0.017752 | -0.020868 | 0.014178 | -0.006911 | 0.016349 | 0.074443 |
co | 0.025757 | 1.000000 | -0.656750 | -0.230197 | 0.997331 | -0.000706 | 0.998192 | 0.110905 |
humidity | 0.017752 | -0.656750 | 1.000000 | 0.079703 | -0.672113 | -0.009826 | -0.669863 | -0.410427 |
light | -0.020868 | -0.230197 | 0.079703 | 1.000000 | -0.208926 | 0.033594 | -0.212969 | 0.747485 |
lpg | 0.014178 | 0.997331 | -0.672113 | -0.208926 | 1.000000 | 0.000232 | 0.999916 | 0.136396 |
motion | -0.006911 | -0.000706 | -0.009826 | 0.033594 | 0.000232 | 1.000000 | 0.000062 | 0.037649 |
smoke | 0.016349 | 0.998192 | -0.669863 | -0.212969 | 0.999916 | 0.000062 | 1.000000 | 0.131891 |
temp | 0.074443 | 0.110905 | -0.410427 | 0.747485 | 0.136396 | 0.037649 | 0.131891 | 1.000000 |
- CO has a high positive correlation with LPG and smoke, and a negative one with light and humidity.
- Humidity has a negative correlation with smoke, temp, LPG and CO, which means that as humidity increases these fields decrease.
- Light has a high correlation with temp.
- Smoke has a high correlation with LPG and CO, but a negative one with humidity.
- And so on.
For Each Device
Again, our overall data is biased, so we have to analyze it further for each distinct device.
for d in df.device_name.unique():
    corr = df.query(f"device_name=='{d}'").corr()
    print(corr)
    corr.iplot(kind="heatmap", title=f"Correlation of fields for device at {d}")
ts co humidity light lpg motion \
ts 1.000000 0.696208 0.042347 -0.112667 0.703218 -0.009851
co 0.696208 1.000000 -0.077022 -0.095929 0.999845 -0.003513
humidity 0.042347 -0.077022 1.000000 -0.042066 -0.079296 -0.007169
light -0.112667 -0.095929 -0.042066 1.000000 -0.096124 0.007202
lpg 0.703218 0.999845 -0.079296 -0.096124 1.000000 -0.003606
motion -0.009851 -0.003513 -0.007169 0.007202 -0.003606 1.000000
smoke 0.701994 0.999895 -0.078891 -0.096093 0.999995 -0.003590
temp 0.149731 -0.035695 -0.372977 0.008124 -0.033369 -0.000086
smoke temp
ts 0.701994 0.149731
co 0.999895 -0.035695
humidity -0.078891 -0.372977
light -0.096093 0.008124
lpg 0.999995 -0.033369
motion -0.003590 -0.000086
smoke 1.000000 -0.033786
temp -0.033786 1.000000
ts co humidity light lpg motion \
ts 1.000000 -0.322829 0.298280 -0.034300 -0.331622 0.004054
co -0.322829 1.000000 -0.221073 -0.048450 0.994789 -0.005022
humidity 0.298280 -0.221073 1.000000 -0.169963 -0.227099 0.022255
light -0.034300 -0.048450 -0.169963 1.000000 -0.047746 0.018596
lpg -0.331622 0.994789 -0.227099 -0.047746 1.000000 -0.005482
motion 0.004054 -0.005022 0.022255 0.018596 -0.005482 1.000000
smoke -0.330315 0.996474 -0.226195 -0.047971 0.999835 -0.005404
temp 0.043851 -0.296603 0.293223 -0.053637 -0.301287 0.001910
smoke temp
ts -0.330315 0.043851
co 0.996474 -0.296603
humidity -0.226195 0.293223
light -0.047971 -0.053637
lpg 0.999835 -0.301287
motion -0.005404 0.001910
smoke 1.000000 -0.300719
temp -0.300719 1.000000
ts co humidity light lpg motion smoke \
ts 1.000000 -0.165952 -0.012370 NaN -0.167243 -0.007758 -0.167018
co -0.165952 1.000000 -0.313322 NaN 0.999907 0.013455 0.999937
humidity -0.012370 -0.313322 1.000000 NaN -0.314211 -0.011879 -0.314058
light NaN NaN NaN NaN NaN NaN NaN
lpg -0.167243 0.999907 -0.314211 NaN 1.000000 0.013532 0.999997
motion -0.007758 0.013455 -0.011879 NaN 0.013532 1.000000 0.013518
smoke -0.167018 0.999937 -0.314058 NaN 0.999997 0.013518 1.000000
temp 0.320340 0.044866 -0.397001 NaN 0.044504 0.021263 0.044566
temp
ts 0.320340
co 0.044866
humidity -0.397001
light NaN
lpg 0.044504
motion 0.021263
smoke 0.044566
temp 1.000000
- One valuable insight can be found in the first output, where there is a high correlation of time with smoke and CO for the device at the stable and warmer place.
- But there is a negative correlation between time and smoke for the other two devices.
Conclusion from Descriptive Analysis
As we can see from the plots and correlation values above, we cannot make any judgment based on the overall data, because the relationship between fields is different for each place. The same happens in the real world: we often have to subdivide the data and perform distinct tests and operations on each part. Now we will move on to the next part of our analysis, which is Inferential Analysis.
Inferential Data Analysis
In Inferential Statistics, we take a step forward from the descriptive information we have and try to make some inferences or predictions. In the general case, we try to prove, estimate and hypothesize something by taking a sample from the population. Mainly, in inferential statistics our focus is on drawing conclusions about something.
From our descriptive analysis, we know that the correlation values between fields differ for each device, so let's focus our tests and hypotheses on that. There are lots of things we could infer and test here; the sky is the limit. Also, looking at the plots over time, there was a distinct grouping of each field for distinct devices.
In all of inferential analysis, there are mainly two things we do:
- Making inferences or predictions about the population. Example: the average age of the passengers is 29 years.
- Making and testing hypotheses about the population. Example: whether the survival rate of one gender differs from another's.
Sampling
Sampling is the concept of taking a small part of the population data with (or without) the hope of capturing the central tendency of the population. Sampling is done when the size of the population is large.
Sampling is very popular in risk analysis. For example, if a bulb company manufactures bulbs, then in order to find their durability it often takes a small sample and tests it. Similarly, in data collection methods like questionnaires, we often make assumptions based on a small amount of data and try to claim something about the population. If we want to find out the ratio of smokers among males/females, we collect a small amount of data, perform some tests to reach a conclusion, and apply that to the population.
While working with a sample, two terms are used to represent sample and population metrics:
- Statistic: a measure or metric of the sample, e.g. the sample average CO.
- Parameter: a measure or metric of the population, e.g. the population average CO.
Problems with Sampling
- A sample is simply a subset drawn from the population, and its size should always be smaller than the population's. One major problem with sampling is that the mean and variance of the sample might not resemble those of the population. This is often called sampling error.
In pandas, we can take a sample easily. So let's take a sample of size 10k from the population of size 405184.
sample1=df.sample(n=10000)
sd = sample1[cols].describe()
sd
co | humidity | lpg | smoke | temp | |
---|---|---|---|---|---|
count | 10000.000000 | 10000.000000 | 10000.000000 | 10000.000000 | 10000.000000 |
mean | 0.004658 | 60.409360 | 0.007259 | 0.019327 | 22.423650 |
std | 0.001250 | 11.344686 | 0.001443 | 0.004084 | 2.650209 |
min | 0.001171 | 4.200000 | 0.002693 | 0.006692 | 5.900000 |
25% | 0.003924 | 51.000000 | 0.006462 | 0.017042 | 19.900000 |
50% | 0.004828 | 54.600000 | 0.007508 | 0.020004 | 22.200000 |
75% | 0.005460 | 74.300003 | 0.008206 | 0.021998 | 23.400000 |
max | 0.013180 | 91.599998 | 0.015524 | 0.043461 | 30.600000 |
pod = df[cols].describe()
pod
co | humidity | lpg | smoke | temp | |
---|---|---|---|---|---|
count | 405184.000000 | 405184.000000 | 405184.000000 | 405184.000000 | 405184.000000 |
mean | 0.004639 | 60.511694 | 0.007237 | 0.019264 | 22.453987 |
std | 0.001250 | 11.366489 | 0.001444 | 0.004086 | 2.698347 |
min | 0.001171 | 1.100000 | 0.002693 | 0.006692 | 0.000000 |
25% | 0.003919 | 51.000000 | 0.006456 | 0.017024 | 19.900000 |
50% | 0.004812 | 54.900000 | 0.007489 | 0.019950 | 22.200000 |
75% | 0.005409 | 74.300003 | 0.008150 | 0.021838 | 23.600000 |
max | 0.014420 | 99.900002 | 0.016567 | 0.046590 | 30.600000 |
The result will be different each time, because a fresh random sample is drawn on every run. But let's find the difference between the sample statistics and the population parameters.
pod-sd
co | humidity | lpg | smoke | temp | |
---|---|---|---|---|---|
count | 3.951840e+05 | 395184.000000 | 3.951840e+05 | 395184.000000 | 395184.000000 |
mean | -1.939797e-05 | 0.102334 | -2.223391e-05 | -0.000063 | 0.030337 |
std | 8.081408e-08 | 0.021803 | 7.320333e-07 | 0.000002 | 0.048138 |
min | 0.000000e+00 | -3.100000 | 0.000000e+00 | 0.000000 | -5.900000 |
25% | -5.239762e-06 | 0.000000 | -6.243145e-06 | -0.000018 | 0.000000 |
50% | -1.696835e-05 | 0.300000 | -1.909604e-05 | -0.000054 | 0.000000 |
75% | -5.135261e-05 | 0.000000 | -5.590530e-05 | -0.000160 | 0.200000 |
max | 1.240580e-03 | 8.300003 | 1.043818e-03 | 0.003129 | 0.000000 |
In the above table, we can see that some values are higher for the population while others are higher for the sample.
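As an aside, if we want the sample (and hence this table) to be reproducible across runs, we can pin the seed; 42 below is an arbitrary choice:

# fixing random_state makes the draw repeatable
sample1 = df.sample(n=10000, random_state=42)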
Estimation
While working with prediction/hypothesis in inferential analysis, we often have to deal with two types of estimates:
- Point Estimation: a single-value estimate, for example that the sample mean CO is equal to the population mean CO.
- Interval Estimation: an estimate of a value within some range. For example, confidence intervals are used in tests like Chi-Square and the t-test. In the example above, we saw that the trend of each field differs per device, but is it significantly different enough that we should treat each device as distinct?
In our case, a point estimate could be that the sample mean of temp equals the population mean. An interval estimate could be that the population mean of temp lies within 5% on either side of the sample mean.
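A small sketch of an interval estimate: a 95% confidence interval for the population mean of temp, computed from our sample with the normal approximation (reasonable here since n is 10000):

import scipy.stats as stats
sample_mean = sample1.temp.mean()
standard_error = sample1.temp.std() / np.sqrt(len(sample1))
low, high = stats.norm.interval(0.95, loc=sample_mean, scale=standard_error)
print(f"95% CI for mean temp: [{low:.3f}, {high:.3f}]")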
Test
Once we have taken samples and made some estimates, our next step is to check whether we can actually make such claims, so we test our assumption. This step is known as testing.
There are lots of tests, based on the nature of the estimation, calculation and prediction, but all of them can be divided into 3 categories:
- Comparison Test
- Correlation Test
- Regression Test
Based on parameters, we can also categorize tests into two groups:
- Parametric Tests: tests in which we work with parameters like the mean and variance. One example is the t-test.
- Non-Parametric Tests: these tests are non-parametric because they do not use parameters in the hypothesis. One example is the Mann-Whitney U test.
Based on the level of measurement of the data (nominal, ordinal, interval or ratio), we can choose the best test for it.
Terms Widely Used in Testing
- Confidence Interval: a confidence interval is all about giving some room for error, and it is often used with tests. For example, we might set our hypothesis to be that the average CO recorded by the first device is not more than 2% away from that of the second device. Here we are giving some room for possible error.
- Confidence Level: it sounds similar to the confidence interval, but it is not the same; the two terms are related, though. The confidence level tells us the probability that the sample statistic or estimated parameter lies within the confidence interval. For example, if we set the confidence level to 95% (i.e. a 5% significance level), then we are claiming that if 100 tests are done, at most 5 will predict wrongly; in other words, out of 100 tests, 95 will have the estimated value lie within the confidence interval.
- Hypothesis: as the term suggests, a hypothesis is something that we assume will happen. In hypothesis testing, we pit different hypotheses against the default, or null, hypothesis. The hypotheses set against the default are known as alternative hypotheses.
Comparison Test
This kind of test is mostly done when we compare parameters or metrics between different samples, or between a sample and the population. Generally we perform parametric tests here.
Test | Parametric | Comparison With | No. Samples |
---|---|---|---|
t-test | Yes | Mean, Variance | 2 |
ANOVA | Yes | Variance, Mean | 3+ |
Mann-Whitney U (Wilcoxon Rank Sum) | No | Sum of rankings | 2 |
Wilcoxon Signed Rank | No | Distributions | 2 |
Kruskal-Wallis H | No | Mean Rankings | 3+ |
Mood’s Median | No | Medians | 2+ |
Is the mean value of each field the same for each device's recorded data?
ANOVA means Analysis of Variance. This test is used when we have to compare statistics between more than two samples; if we have only two samples, we use the t-test.
Let's run the test assuming an alpha value (significance level) of 5%: we accept that up to 5 predictions out of 100 may be wrong.
- Null Hypothesis: there is no difference in the mean values across devices.
- Alternate Hypothesis: there is a significant difference in the mean values across devices.
import statsmodels.api as sm
from statsmodels.formula.api import ols
from statsmodels.stats.anova import anova_lm
# formula = 'len ~ C(supp) + C(dose) + C(supp):C(dose)'
for c in cols:
    formula = f'{c} ~ device_name'
    model = ols(formula, data=df).fit()
    aov_table = anova_lm(model, typ=2)
    print(aov_table)
sum_sq df F PR(>F)
device_name 0.319266 2.0 206081.057177 0.0
Residual 0.313859 405181.0 NaN NaN
sum_sq df F PR(>F)
device_name 4.276879e+07 2.0 904472.329682 0.0
Residual 9.579674e+06 405181.0 NaN NaN
sum_sq df F PR(>F)
device_name 0.439852 2.0 219945.965812 0.0
Residual 0.405145 405181.0 NaN NaN
sum_sq df F PR(>F)
device_name 3.506427 2.0 217991.815523 0.0
Residual 3.258695 405181.0 NaN NaN
sum_sq df F PR(>F)
device_name 2.425356e+06 2.0 936247.353097 0.0
Residual 5.248123e+05 405181.0 NaN NaN
It seems that the p-value is smaller than 5%, thus we reject the null hypothesis and claim that there is a significant difference in the mean values of the fields per device. But let's also run ANOVA from SciPy's stats; the result should be the same.
import scipy.stats as stats
for c in cols:
    devs = df.device_name.unique()
    groups = df.groupby("device_name").groups
    co0 = df[c][groups[devs[0]]]
    co1 = df[c][groups[devs[1]]]
    co2 = df[c][groups[devs[2]]]
    print(stats.f_oneway(co0, co1, co2))
F_onewayResult(statistic=206081.05717747274, pvalue=0.0)
F_onewayResult(statistic=904472.329681998, pvalue=0.0)
F_onewayResult(statistic=219945.96581178883, pvalue=0.0)
F_onewayResult(statistic=217991.81552333018, pvalue=0.0)
F_onewayResult(statistic=936247.3530974094, pvalue=0.0)
Correlation Test
Correlation tests are done to calculate the strength of the association between data.
Test | Parametric | Data Type |
---|---|---|
Pearson’s r | Yes | Interval/Ratio |
Spearman’s r | No | Ordinal/Interval/Ratio |
Chi Square Test of Independence | No | Nominal/Ordinal |
Pearson's r is more statistically powerful than Spearman's, but Spearman's test is appropriate for ordinal data and for interval/ratio data that does not meet Pearson's assumptions.
The Chi-Square Test of Independence is the only one of these that can be used with nominal variables.
Pearson’s and Spearman’s Test
Pearson’s Test For Linear Relationship Between Variables
The coefficient returns a value between -1 and 1 that represents the limits of correlation, from a full negative correlation to a full positive correlation. A value of 0 means no correlation. The value must be interpreted: often a value below -0.5 or above 0.5 indicates a notable correlation, while values in between suggest a less notable correlation.
A formula is:
$$r = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n}(x_i - \bar{x})^2}\,\sqrt{\sum_{i=1}^{n}(y_i - \bar{y})^2}}$$
From Wikipedia.
We have already done this test in the Descriptive Analysis part.
Spearman's Correlation: Non-Linear Relationship Between Two Variables
Two variables may be related by a nonlinear relationship, such that the relationship is stronger or weaker across the distribution of the variables. In this case, Spearman's correlation is used.
Pearson correlation assumes the data is normally distributed, whereas Spearman's does not make any assumption about the distribution of the data. That is the main difference between the two.
When all ranks are distinct, it reduces to:
$$\rho = 1 - \frac{6 \sum_{i} d_i^2}{n(n^2 - 1)}$$
where $d_i$ is the difference between the two ranks of observation $i$. From Wikipedia.
df.corr("spearman")
ts | co | humidity | light | lpg | motion | smoke | temp | |
---|---|---|---|---|---|---|---|---|
ts | 1.000000 | 0.077576 | 0.051555 | -0.020867 | 0.077576 | -0.006917 | 0.077576 | 0.055377 |
co | 0.077576 | 1.000000 | -0.764622 | -0.337479 | 1.000000 | -0.003210 | 1.000000 | 0.121469 |
humidity | 0.051555 | -0.764622 | 1.000000 | 0.210620 | -0.764622 | -0.006705 | -0.764622 | -0.334038 |
light | -0.020867 | -0.337479 | 0.210620 | 1.000000 | -0.337479 | 0.033594 | -0.337479 | 0.713951 |
lpg | 0.077576 | 1.000000 | -0.764622 | -0.337479 | 1.000000 | -0.003210 | 1.000000 | 0.121469 |
motion | -0.006917 | -0.003210 | -0.006705 | 0.033594 | -0.003210 | 1.000000 | -0.003210 | 0.033095 |
smoke | 0.077576 | 1.000000 | -0.764622 | -0.337479 | 1.000000 | -0.003210 | 1.000000 | 0.121469 |
temp | 0.055377 | 0.121469 | -0.334038 | 0.713951 | 0.121469 | 0.033095 | 0.121469 | 1.000000 |
Insights
Some notable insights:
- High +ve correlation of co with lpg and smoke.
- High -ve correlation of humidity with co and lpg.
- High +ve correlation of light with temp.
- And so on.
Chi Square Test
The Chi-square test is a non-parametric statistic, also called a distribution free test. Non-parametric tests should be used when any one of the following conditions pertains to the data:
- The level of measurement of all the variables is nominal or ordinal.
- The sample sizes of the study groups are unequal; for the χ2 the groups may be of equal size or unequal size whereas some parametric tests require groups of equal or approximately equal size.
- The original data were measured at an interval or ratio level, but violate one of the following assumptions of a parametric test:
- The distribution of the data was seriously skewed or kurtotic (parametric tests assume approximately normal distribution of the dependent variable), and thus the researcher must use a distribution free statistic rather than a parametric statistic.
- The data violate the assumptions of equal variance or homoscedasticity.
- For any of a number of reasons (1), the continuous data were collapsed into a small number of categories, and thus the data are no longer interval or ratio.
Note:
- Null Hypothesis (H0): the two variables are not dependent (there is no association between the two variables).
- Alternate Hypothesis (H1): there is a relationship between the variables.
- If Statistic >= Critical Value: significant result, reject the null hypothesis (H0); dependent.
- If Statistic < Critical Value: not significant result, fail to reject the null hypothesis (H0); independent.
In terms of a p-value and a chosen significance level (alpha), the test can be interpreted as follows:
- If p-value <= alpha: significant result, reject null hypothesis (H0), dependent.
- If p-value > alpha: not significant result, fail to reject null hypothesis (H0), independent.
We do not have nominal data here, thus we will not perform this test on our dataset yet.
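Still, for reference, the mechanics look like this in SciPy. This is purely an illustration on a made-up 2x2 contingency table (say, gender vs. smoker counts), not on our telemetry data:

import scipy.stats as stats
# hypothetical contingency table of observed counts
table = [[90, 60],
         [30, 120]]
stat, p, dof, expected = stats.chi2_contingency(table)
print(f"statistic={stat:.2f}, p-value={p:.4f}, dof={dof}")
# p <= alpha (e.g. 0.05) -> reject H0: the variables appear dependent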
Collinearity vs Multicollinearity
Correlation and collinearity are similar things, with a few differences:
- Correlation measures the strength and direction of the relationship between two fields in our data.
- Collinearity is a situation where two fields are linearly associated (highly correlated) and both are used as predictors for the target.
- Multicollinearity is a case of collinearity where a feature has a linear relationship with two or more other features.
While training ML models, it is important that we remove the features that exhibit multicollinearity, and we can do so by calculating the VIF (Variance Inflation Factor). The VIF determines the strength of the correlation of a variable with the other variables: it calculates how much the variance of a coefficient is inflated because of its linear dependencies with other predictors. Hence its name.
Referenced from here.
The quantity $1 - R^2$ is known as the tolerance, and the VIF is its reciprocal: $\mathrm{VIF} = \frac{1}{1 - R^2}$.
R-squared (R2) is a statistical measure that represents the proportion of the variance for a dependent variable that’s explained by an independent variable or variables in a regression model
Referenced from here.
Interpreting VIF:
- 1 — features are not correlated
- 1<VIF<5 — features are moderately correlated
- VIF>5 — features are highly correlated
- VIF>10 — high correlation between features and is cause for concern
from sklearn.linear_model import LinearRegression
def calculate_vif(df, features):
    vif, tolerance = {}, {}
    # all the features that you want to examine
    for feature in features:
        # extract all the other features you will regress against
        X = [f for f in features if f != feature]
        X, y = df[X], df[feature]
        # extract r-squared from the fit
        r2 = LinearRegression().fit(X, y).score(X, y)
        # calculate tolerance
        tolerance[feature] = 1 - r2
        # calculate VIF
        vif[feature] = 1 / tolerance[feature]
    # return VIF DataFrame
    return pd.DataFrame({'VIF': vif, 'Tolerance': tolerance})
Now let's calculate the VIF of our columns against each other.
calculate_vif(df=df, features=[c for c in df.columns if c not in ["device", "device_name", "date"]])
VIF | Tolerance | |
---|---|---|
ts | 1.124875e+00 | 8.889881e-01 |
co | 8.709637e+04 | 1.148153e-05 |
humidity | 3.618642e+00 | 2.763468e-01 |
light | 4.123083e+00 | 2.425369e-01 |
lpg | 1.872582e+06 | 5.340219e-07 |
motion | 1.001580e+00 | 9.984225e-01 |
smoke | 2.765493e+06 | 3.615991e-07 |
temp | 4.835901e+00 | 2.067867e-01 |
In the above table, we can see that:
- LPG, smoke and CO have a high correlation with the other features, and thus they are our features of concern.
- Temp and humidity also seem to have a fair amount of correlation, but timestamp and motion do not seem to have strong relationships.
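As a cross-check, statsmodels ships a ready-made VIF helper that should broadly agree with our hand-rolled function. A sketch, restricted for illustration to the five continuous sensor fields (add_constant supplies the intercept):

from statsmodels.stats.outliers_influence import variance_inflation_factor
import statsmodels.api as sm

X = sm.add_constant(df[["co", "humidity", "lpg", "smoke", "temp"]])
for i, name in enumerate(X.columns):
    print(name, variance_inflation_factor(X.values, i))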
Let's remove co as a highly correlated feature and calculate the VIF again to see what effect it has.
calculate_vif(df=df, features=[c for c in df.columns if c not in ["device", "device_name", "date", "co"]])
VIF | Tolerance | |
---|---|---|
ts | 1.085452 | 0.921275 |
humidity | 3.079545 | 0.324723 |
light | 4.023971 | 0.248511 |
lpg | 7206.123295 | 0.000139 |
motion | 1.001578 | 0.998424 |
smoke | 7185.519140 | 0.000139 |
temp | 4.833069 | 0.206908 |
The change can be seen in the fact that the VIF of LPG and smoke has also decreased. It is clear that these 3 fields have high collinearity with each other. Now let's instead remove the feature smoke and calculate the VIF.
calculate_vif(df=df, features=[c for c in df.columns if c not in ["device", "device_name", "date", "smoke"]])
VIF | Tolerance | |
---|---|---|
ts | 1.081658 | 0.924507 |
co | 226.300545 | 0.004419 |
humidity | 3.096043 | 0.322993 |
light | 4.017793 | 0.248893 |
lpg | 231.395623 | 0.004322 |
motion | 1.001578 | 0.998425 |
temp | 4.832018 | 0.206953 |
The changes are reflected even more here, and it is clear that smoke has more collinearity with the others than co does. But let's check once more by removing lpg and calculating the VIF.
calculate_vif(df=df, features=[c for c in df.columns if c not in ["device", "device_name", "date", "lpg"]])
VIF | Tolerance | |
---|---|---|
ts | 1.080860 | 0.925189 |
co | 335.166702 | 0.002984 |
humidity | 3.101372 | 0.322438 |
light | 4.016575 | 0.248968 |
motion | 1.001578 | 0.998425 |
smoke | 341.732959 | 0.002926 |
temp | 4.831788 | 0.206963 |
The effects are similar to the case where we removed smoke.
Regression Tests
Regression tests are done when we try to estimate some parameter. If we have one dependent and one independent variable, we use simple linear regression, like $y = mx + c$. If we have multiple independent variables, it is multiple linear regression. Besides linear regression, there is also logistic regression, which tries to classify between two classes.
A regression test examines whether a change in the independent variable has any effect on the dependent variable.
Test | Predictor | Outcome |
---|---|---|
Simple Linear | 1 interval/ratio | 1 interval/ratio |
Multi Linear | 2+ interval/ratio | 1 interval/ratio |
Logistic regression | 1+ | 1 binary |
Nominal regression | 1+ | 1 nominal |
Ordinal regression | 1+ | 1 ordinal |
The linear relationships between features have already been discovered (the rise in CO has something to do with LPG and smoke), so we can skip this test for now; still, a quick sketch of what it would look like follows.
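Had we run it, a simple linear regression test could look like the sketch below, regressing smoke on lpg (chosen because of their very high correlation); the coefficient p-values are what the test interprets:

from statsmodels.formula.api import ols
model = ols("smoke ~ lpg", data=df).fit()
print(model.params)    # intercept and slope
print(model.pvalues)   # p-values for the coefficients
print(model.rsquared)  # R-squared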
Using Autoviz for Fast EDA
Autoviz is an auto-EDA tool which performs lots of EDA steps, plots graphs and provides some valuable insights. However, manual EDA always gives more insight if we have the time to perform one. And using Pandas Profiling, we can get insights like correlations described in sentences.
av = AutoViz_Class()
dfa = av.AutoViz("iot_telemetry_data.csv")
max_rows_analyzed is smaller than dataset shape 405184...
randomly sampled 150000 rows from read CSV file
Shape of your Data Set loaded: (150000, 9)
############## C L A S S I F Y I N G V A R I A B L E S ####################
Classifying variables in data set...
9 Predictors classified...
No variables removed since no ID or low-information variables found in data set
Since Number of Rows in data 150000 exceeds maximum, randomly sampling 150000 rows for EDA...
Number of All Scatter Plots = 21
Time to run AutoViz = 35 seconds
###################### AUTO VISUALIZATION Completed ########################
As we can see in the above outputs, there are lots of plots to find outliers, relationships and so on. We did most of this manually in the earlier steps, but if we are in a hurry and want to grasp insights as soon as possible, Autoviz is highly recommended. Pandas Profiling is even richer, and it gives us an interactive way to move between different aspects of EDA like correlations, null counts, plots and so on. But this post doesn't contain that result, because this is a static blog. :)
ProfileReport(dfa)