Taking Data Apps into WebApp: Using Streamlit, Plotly, and Python

29 minute read

Introduction

Over the past two stories about a dataset and the insights it has to offer, we explored several areas. To point out a few:

  1. We did EDA using descriptive and inferential statistics to find strong evidence, relationships, and facts about the data.
  2. We used some of the valuable insights from the EDA and tried to classify the possible environment that the properties reflect. For example, we tried to predict the value of CO based on Smoke and LPG.

In this part, we will take those experiments into a web app where we can tweak different aspects of our experiment, by building a simple yet powerful web app using Streamlit. Streamlit is a free tool available in Python that allows us to build data apps faster.

Making Things Ready

  • Please install Streamlit by running pip install streamlit.
  • Once installed, make sure the streamlit command is available on the system PATH by running streamlit --version. If it prints a version, we are ready to go.
  • Please install Plotly (pip install plotly), as we will be making interactive plots with it.
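
The same check can also be done from inside Python. The snippet below is a minimal sketch using only the standard library; the helper name installed_version is made up for this illustration and is not part of Streamlit or Plotly.

```python
from importlib import metadata


def installed_version(package):
    """Return the installed version of `package`, or None if it is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


# Report the two packages this tutorial relies on.
for name in ("streamlit", "plotly"):
    version = installed_version(name)
    print(f"{name}: {version if version else 'not installed'}")
```

If either line reports "not installed", run the matching pip install command before continuing.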

Getting Data Ready

For this purpose, we will be working with the Room Occupancy Detection data, which is similar to the previous data. There are 3 text files in CSV format: datatraining.txt, datatest.txt and datatest2.txt. Let's read them using Pandas and convert the date column to datetime. The column Occupancy contains a binary value (0/1), which will be our label later on.

import pandas as pd
import cufflinks
import plotly.io as pio
import warnings
import numpy as np
cufflinks.go_offline()
cufflinks.set_config_file(world_readable=True, theme='pearl')
pio.renderers.default = "notebook"
warnings.simplefilter("ignore")
train = pd.read_csv("https://github.com/LuisM78/Occupancy-detection-data/raw/master/datatraining.txt")
train["date"]=pd.to_datetime(train.date)
train
date Temperature Humidity Light CO2 HumidityRatio Occupancy
1 2015-02-04 17:51:00 23.18 27.2720 426.0 721.250000 0.004793 1
2 2015-02-04 17:51:59 23.15 27.2675 429.5 714.000000 0.004783 1
3 2015-02-04 17:53:00 23.15 27.2450 426.0 713.500000 0.004779 1
4 2015-02-04 17:54:00 23.15 27.2000 426.0 708.250000 0.004772 1
5 2015-02-04 17:55:00 23.10 27.2000 426.0 704.500000 0.004757 1
... ... ... ... ... ... ... ...
8139 2015-02-10 09:29:00 21.05 36.0975 433.0 787.250000 0.005579 1
8140 2015-02-10 09:29:59 21.05 35.9950 433.0 789.500000 0.005563 1
8141 2015-02-10 09:30:59 21.10 36.0950 433.0 798.500000 0.005596 1
8142 2015-02-10 09:32:00 21.10 36.2600 433.0 820.333333 0.005621 1
8143 2015-02-10 09:33:00 21.10 36.2000 447.0 821.000000 0.005612 1
test1 = pd.read_csv("https://github.com/LuisM78/Occupancy-detection-data/raw/master/datatest.txt")
test1["date"]=pd.to_datetime(test1.date)
test1

date Temperature Humidity Light CO2 HumidityRatio Occupancy
140 2015-02-02 14:19:00 23.700000 26.272000 585.200000 749.200000 0.004764 1
141 2015-02-02 14:19:59 23.718000 26.290000 578.400000 760.400000 0.004773 1
142 2015-02-02 14:21:00 23.730000 26.230000 572.666667 769.666667 0.004765 1
143 2015-02-02 14:22:00 23.722500 26.125000 493.750000 774.750000 0.004744 1
144 2015-02-02 14:23:00 23.754000 26.200000 488.600000 779.000000 0.004767 1
... ... ... ... ... ... ... ...
2800 2015-02-04 10:38:59 24.290000 25.700000 808.000000 1150.250000 0.004829 1
2801 2015-02-04 10:40:00 24.330000 25.736000 809.800000 1129.200000 0.004848 1
2802 2015-02-04 10:40:59 24.330000 25.700000 817.000000 1125.800000 0.004841 1
2803 2015-02-04 10:41:59 24.356667 25.700000 813.000000 1123.000000 0.004849 1
2804 2015-02-04 10:43:00 24.408333 25.681667 798.000000 1124.000000 0.004860 1
test2 = pd.read_csv("https://github.com/LuisM78/Occupancy-detection-data/raw/master/datatest2.txt")
test2["date"]=pd.to_datetime(test2.date)
test2
date Temperature Humidity Light CO2 HumidityRatio Occupancy
1 2015-02-11 14:48:00 21.7600 31.133333 437.333333 1029.666667 0.005021 1
2 2015-02-11 14:49:00 21.7900 31.000000 437.333333 1000.000000 0.005009 1
3 2015-02-11 14:50:00 21.7675 31.122500 434.000000 1003.750000 0.005022 1
4 2015-02-11 14:51:00 21.7675 31.122500 439.000000 1009.500000 0.005022 1
5 2015-02-11 14:51:59 21.7900 31.133333 437.333333 1005.666667 0.005030 1
... ... ... ... ... ... ... ...
9748 2015-02-18 09:15:00 20.8150 27.717500 429.750000 1505.250000 0.004213 1
9749 2015-02-18 09:16:00 20.8650 27.745000 423.500000 1514.500000 0.004230 1
9750 2015-02-18 09:16:59 20.8900 27.745000 423.500000 1521.500000 0.004237 1
9751 2015-02-18 09:17:59 20.8900 28.022500 418.750000 1632.000000 0.004279 1
9752 2015-02-18 09:19:00 21.0000 28.100000 409.000000 1864.000000 0.004321 1

Let's look over these data and take the necessary actions.

test2.date.describe()
count                    9752
unique                   9752
top       2015-02-15 15:04:59
freq                        1
first     2015-02-11 14:48:00
last      2015-02-18 09:19:00
Name: date, dtype: object
test1.date.describe()
count                    2665
unique                   2665
top       2015-02-03 14:45:59
freq                        1
first     2015-02-02 14:19:00
last      2015-02-04 10:43:00
Name: date, dtype: object
train.date.describe()
count                    8143
unique                   8143
top       2015-02-07 20:26:59
freq                        1
first     2015-02-04 17:51:00
last      2015-02-10 09:33:00
Name: date, dtype: object

Looking at the dates in each dataframe, train covers 2015-02-04 to 2015-02-10, test1 covers 2015-02-02 to 2015-02-04, and test2 covers 2015-02-11 to 2015-02-18. Since test2 follows right after train, it might be a good idea to concatenate train and test2, but let's explore that later on.
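
Reading the ranges off describe() works, but the comparison can also be made explicit. The following is a small sketch on made-up frames (the names first, second, third and the helpers date_span and spans_overlap are illustrative only), showing how to extract each span and test whether two spans overlap.

```python
import pandas as pd


def date_span(df, column="date"):
    """Return the (earliest, latest) timestamp found in `column`."""
    return df[column].min(), df[column].max()


def spans_overlap(a, b):
    """True if two (start, end) spans share at least one instant."""
    return a[0] <= b[1] and b[0] <= a[1]


# Tiny synthetic frames standing in for the three files (values are made up).
frames = {
    "first": pd.DataFrame({"date": pd.to_datetime(["2015-02-02", "2015-02-04"])}),
    "second": pd.DataFrame({"date": pd.to_datetime(["2015-02-04", "2015-02-10"])}),
    "third": pd.DataFrame({"date": pd.to_datetime(["2015-02-11", "2015-02-18"])}),
}
spans = {name: date_span(df) for name, df in frames.items()}

for name, (start, end) in spans.items():
    print(f"{name}: {start} -> {end}")
```

Applied to the real dataframes, the same two helpers would confirm which files are contiguous in time before concatenating them.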

Exploratory Data Analysis

Missing Values

dfs = {"train":train,"test1":test1,"test2":test2}
for df in dfs.values():
    total = df.isnull().sum().sort_values(ascending = False)
    percent = (df.isnull().sum()/df.isnull().count()).sort_values(ascending = False)
    mdf = pd.concat([total, percent], axis=1, keys=['Total', 'Percent'])
    mdf = mdf.reset_index()
    print(mdf)
           index  Total  Percent
0           date      0      0.0
1    Temperature      0      0.0
2       Humidity      0      0.0
3          Light      0      0.0
4            CO2      0      0.0
5  HumidityRatio      0      0.0
6      Occupancy      0      0.0
           index  Total  Percent
0           date      0      0.0
1    Temperature      0      0.0
2       Humidity      0      0.0
3          Light      0      0.0
4            CO2      0      0.0
5  HumidityRatio      0      0.0
6      Occupancy      0      0.0
           index  Total  Percent
0           date      0      0.0
1    Temperature      0      0.0
2       Humidity      0      0.0
3          Light      0      0.0
4            CO2      0      0.0
5  HumidityRatio      0      0.0
6      Occupancy      0      0.0

It seems that there are no missing values in any column. Let's look at the distribution of each column.
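
Because the real data has no gaps, the summary above is all zeros. As a hedged illustration of what the same kind of report looks like when something is missing, here is a compact variant on a made-up frame; the function name missing_summary and the column name Fraction are choices made for this sketch.

```python
import numpy as np
import pandas as pd


def missing_summary(df):
    """Count and fraction of missing values per column, largest first."""
    total = df.isnull().sum()
    summary = pd.DataFrame({"Total": total, "Fraction": total / len(df)})
    return summary.sort_values("Total", ascending=False)


# Made-up frame with one gap in column "b".
demo = pd.DataFrame({"a": [1.0, 2.0, 3.0, 4.0], "b": [1.0, np.nan, 3.0, 4.0]})
report = missing_summary(demo)
print(report)
```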

Summary of Each Variable

To compare the distribution of values in each dataframe, we will plot box plots side by side, with histograms below them. Please ignore the import time and fig.write_image(..) parts.

from plotly.subplots import make_subplots
import plotly.graph_objects as go
import time

titles = list(dfs.keys())

for c in train.columns:
    if c!="date":
        fig = make_subplots(rows=2,cols=3, subplot_titles=titles, )
        fig.add_trace(go.Box(y=train[c].tolist(), name=titles[0]), row=1, col=1)
        fig.add_trace(go.Box(y=test1[c].tolist(), name = titles[1]), row=1, col=2)
        fig.add_trace(go.Box(y=test2[c].tolist(), name = titles[2]), row=1, col=3)
        
        fig.add_trace(go.Histogram(y=train[c].tolist(), name=titles[0]), row=2, col=1)
        fig.add_trace(go.Histogram(y=test1[c].tolist(), name = titles[1]), row=2, col=2)
        fig.add_trace(go.Histogram(y=test2[c].tolist(), name = titles[2]), row=2, col=3)
        fig.update_layout(height=600, width=800, title_text=f"Box and Distribution of {c}")
        fig.show()
        fig.write_image(f"summary_{c}.png")

Looking at the histograms and box plots of the different columns, we can see that the descriptive properties of the three dataframes are not identical. Thus, we might need some kind of data transformation if our model does not perform well.
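
The visual comparison can be backed by numbers. The sketch below places the describe() summary of one column from several frames side by side; the frames and the helper compare_column are made up for illustration, and the same call should work on the dfs dictionary used in this post.

```python
import pandas as pd


def compare_column(frames, column):
    """Put the describe() summary of one column from each frame side by side."""
    return pd.DataFrame({name: df[column].describe() for name, df in frames.items()})


# Two made-up frames whose "Light" values sit on very different levels.
frames = {
    "a": pd.DataFrame({"Light": [0.0, 10.0, 20.0]}),
    "b": pd.DataFrame({"Light": [100.0, 110.0, 120.0]}),
}
table = compare_column(frames, "Light")
print(table)
```

A large gap between the mean or quartile rows of two columns is the numeric counterpart of box plots that do not line up.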

Correlation

Let's see whether the correlations between variables are the same across the dataframes. If they are, we are on the bright side.

fig = make_subplots(rows=1,cols=3, subplot_titles=titles)
fig.add_trace(go.Heatmap(z=train.corr(), y=train.corr().columns,x=train.corr().index, name=titles[0]), row=1, col=1)
fig.add_trace(go.Heatmap(z=test1.corr(), x=train.corr().index, name = titles[1]), row=1, col=2)
fig.add_trace(go.Heatmap(z=test2.corr(), x=train.corr().index, name = titles[2]), row=1, col=3)
fig.show()

fig.write_image(f"corr.png")

The correlations look very similar across all 3 dataframes.
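
"Similar" can be quantified as the largest absolute difference between two correlation matrices. This is a sketch on synthetic data, not a result for the occupancy files; max_corr_gap is a hypothetical helper name.

```python
import numpy as np
import pandas as pd


def max_corr_gap(df_a, df_b, columns):
    """Largest absolute difference between two correlation matrices."""
    corr_a = df_a[columns].corr()
    corr_b = df_b[columns].corr()
    return float((corr_a - corr_b).abs().to_numpy().max())


rng = np.random.default_rng(0)
x = rng.normal(size=200)
frame_a = pd.DataFrame({"x": x, "y": 2 * x})    # perfectly correlated
frame_b = pd.DataFrame({"x": x, "y": -2 * x})   # perfectly anti-correlated

gap_same = max_corr_gap(frame_a, frame_a, ["x", "y"])
gap_diff = max_corr_gap(frame_a, frame_b, ["x", "y"])
print(gap_same, gap_diff)
```

A gap close to 0 means the two frames agree on every pairwise correlation; the worst possible gap is 2.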

Taking EDA to Streamlit App

Please create a project folder and, inside it, create a Python file main.py. This will be our main file, where we will build all these plots and take our plots and analysis into the web app.

First Streamlit App

We will read our files and put them in a cache so that we won't have to read them again whenever our app changes.

import streamlit as st
import numpy as np
import pandas as pd
import cufflinks

@st.cache
def get_data():
    train = pd.read_csv("https://github.com/LuisM78/Occupancy-detection-data/raw/master/datatraining.txt")
    train["date"]=pd.to_datetime(train.date)
    
    test1 = pd.read_csv("https://github.com/LuisM78/Occupancy-detection-data/raw/master/datatest.txt")
    test1["date"]=pd.to_datetime(test1.date)
    
    test2 = pd.read_csv("https://github.com/LuisM78/Occupancy-detection-data/raw/master/datatest2.txt")
    test2["date"]=pd.to_datetime(test2.date)
    
    dfs = {"train":train,"test1":test1,"test2":test2}
    return dfs

dfs = get_data()
st.dataframe(dfs["train"])

In the above code, we read the 3 files, put them in a dictionary named dfs, and returned it. The @st.cache decorator caches the returned data so that we won't need to reload it whenever the app reruns (newer Streamlit releases deprecate st.cache in favor of st.cache_data). Then we displayed the train dataframe in the app. The app should look like below:

For the next step, we will add a few select boxes and then the analysis parts.

import streamlit as st
import numpy as np
import pandas as pd
import cufflinks
from plotly.subplots import make_subplots
import plotly.graph_objects as go

@st.cache
def get_data():
    train = pd.read_csv("https://github.com/LuisM78/Occupancy-detection-data/raw/master/datatraining.txt")
    train["date"]=pd.to_datetime(train.date)
    
    test1 = pd.read_csv("https://github.com/LuisM78/Occupancy-detection-data/raw/master/datatest.txt")
    test1["date"]=pd.to_datetime(test1.date)
    
    test2 = pd.read_csv("https://github.com/LuisM78/Occupancy-detection-data/raw/master/datatest2.txt")
    test2["date"]=pd.to_datetime(test2.date)
    
    dfs = {"train":train,"test1":test1,"test2":test2}
    return dfs

dfs = get_data()
sidebar = st.sidebar


# select modes, EDA, Clustering, Regression and Classification
mode = sidebar.selectbox("Select a mode.",options=["EDA", "Clustering", "Regression", "Classification"])

# If selected EDA, show EDA related plots
if mode=="EDA":
    
    # if selected show the data
    show_data = sidebar.checkbox("Show data")
    if show_data:
        # if selected, show train data
        if sidebar.checkbox("Show Train data"):
            st.markdown("### Train Data")
            st.dataframe(dfs["train"])
        
        # if selected, show test1 data
        if sidebar.checkbox("Show Test1 data"):
            st.markdown("### Test1 Data")
            st.dataframe(dfs["test1"])
            
        # if selected, show test2 data
        if sidebar.checkbox("Show Test2 data"):
            st.markdown("### Test2 Data")
            st.dataframe(dfs["test2"])
    
    # if selected, show the comparison data
    show_comparison = sidebar.checkbox("Show comparison")
    if show_comparison:
        
        # make a multiselect to select the columns to compare
        selected = sidebar.multiselect("Select Columns ", [d for d in dfs["train"].columns if d not in ["date"]])
        
        
        titles=list(dfs.keys())
        train = dfs["train"]
        test1 = dfs["test1"]
        test2 = dfs["test2"]
        
        if selected:
            st.markdown(f"### Selected Columns: {', '.join(selected)}")
            
            for c in selected:
                fig = make_subplots(rows=2,cols=3, subplot_titles=titles, )
                fig.add_trace(go.Box(y=train[c].tolist(), name=titles[0]), row=1, col=1)
                fig.add_trace(go.Box(y=test1[c].tolist(), name = titles[1]), row=1, col=2)
                fig.add_trace(go.Box(y=test2[c].tolist(), name = titles[2]), row=1, col=3)
                
                fig.add_trace(go.Histogram(y=train[c].tolist(), name=titles[0]), row=2, col=1)
                fig.add_trace(go.Histogram(y=test1[c].tolist(), name = titles[1]), row=2, col=2)
                fig.add_trace(go.Histogram(y=test2[c].tolist(), name = titles[2]), row=2, col=3)
                fig.update_layout(height=600, width=800, title_text=f"Box and Distribution of {c}")
                st.plotly_chart(fig)
        
        # if selected show correlation
        show_corr = sidebar.checkbox("Show Correlation")
        if show_corr:
            st.markdown("### Correlation")
            fig = make_subplots(rows=1,cols=3, subplot_titles=titles)
            fig.add_trace(go.Heatmap(z=train.corr(), y=train.corr().columns,x=train.corr().index, name=titles[0]), row=1, col=1)
            fig.add_trace(go.Heatmap(z=test1.corr(), x=train.corr().index, name = titles[1]), row=1, col=2)
            fig.add_trace(go.Heatmap(z=test2.corr(), x=train.corr().index, name = titles[2]), row=1, col=3)
            st.plotly_chart(fig)

In the above code, we have added everything we did in the EDA into the web app, with a comment above each part of the code that needs explanation. Now our app looks like below:

Clustering

Now we want to cluster our data based on the features we have. We already know that there are two classes in the data (Occupancy), but let's see whether any clusters can be found.

KMeans Clustering

Let's first cluster using the default features and see how it performs on the train dataframe.

from sklearn.cluster import KMeans


clusters = 5
features = ["Temperature", "Humidity", "CO2", "HumidityRatio","Light"][:-2]

inertias = []

for c in range(2,clusters+1):
    tdf = train.copy()
    X = tdf[features].to_numpy()
    
    colors=['red','green','blue','magenta','black','yellow']
    model = KMeans(n_clusters=c)
    model.fit(X)
    y_kmeans = model.predict(X)
    tdf["cluster"] = y_kmeans
    inertias.append((c,model.inertia_))
    
    trace0 = go.Scatter(x=tdf[features[0]],y=tdf[features[1]],mode='markers',  marker=dict(
        color=tdf.cluster.apply(lambda x: colors[x]),
        colorscale='Viridis',
        showscale=True
    ),name="Cluster Points")
    trace1 = go.Scatter(x=model.cluster_centers_[:, 0], y=model.cluster_centers_[:, 1],mode='markers', marker=dict(
        color=colors,
        size=20,
        showscale=True
    ),name="Cluster Mean")
        
    data7 = go.Data([trace0, trace1])
    fig = go.Figure(data=data7)
    fig.update_layout(title=f"Cluster Size {c}")
    fig.show()
    fig.write_image(f"kmeans_{c}.png")

inertias=np.array(inertias).reshape(-1,2)
performance = go.Scatter(x=inertias[:,0], y=inertias[:,1])
layout = go.Layout(
    title="Cluster Number vs Inertia",
    xaxis=dict(
        title="Ks"
    ),
    yaxis=dict(
        title="Inertia"
    ) ) 
fig=go.Figure(data=go.Data([performance]))
fig.update_layout(layout)
fig.show()
fig.write_image(f"kmeans_cvi{c}.png")

Looking at the inertia plot, the inertia decreases only slowly from k = 3 onward. But we already know that the data comes from two occupancy classes. The cluster plots do not look great because we used multiple features for clustering while the plot is 2D. Now let's try dimensionality reduction and see how it performs.
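
The "elbow" reading of an inertia plot is easier to see on data where the answer is known. The sketch below uses synthetic blobs with two well-separated centers (an assumption of this example, not a property of the occupancy data) and records how much inertia each extra cluster removes.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Two clearly separated synthetic groups (centers chosen for illustration).
X_demo, _ = make_blobs(
    n_samples=300, centers=[[0, 0], [10, 10]], cluster_std=1.0, random_state=0
)

inertia_by_k = {}
for k in range(1, 6):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X_demo)
    inertia_by_k[k] = km.inertia_

# Drop in inertia gained by going from k-1 to k clusters.
drops = {k: inertia_by_k[k - 1] - inertia_by_k[k] for k in range(2, 6)}
best_k = max(drops, key=drops.get)
print(inertia_by_k)
print("largest drop at k =", best_k)
```

Past the true number of groups, each additional cluster only shaves off a little inertia, which is the flattening described above.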

PCA for Dimensionality Reduction

PCA (Principal Component Analysis) projects high-dimensional data onto a smaller number of uncorrelated components that retain most of the variance.
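
As a small illustration of the idea before applying it to the real features, the sketch below runs PCA on three made-up features, two of which carry almost the same signal. The data and variable names are assumptions of this example.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
base = rng.normal(size=(500, 1))
# Three made-up features: two noisy copies of one signal, one independent.
X_demo = np.hstack([
    base + 0.05 * rng.normal(size=(500, 1)),
    base + 0.05 * rng.normal(size=(500, 1)),
    rng.normal(size=(500, 1)),
])

# Standardize first so that no feature dominates because of its scale.
X_std = StandardScaler().fit_transform(X_demo)
pca_demo = PCA(n_components=2).fit(X_std)
components = pca_demo.transform(X_std)
ratio = pca_demo.explained_variance_ratio_
print(components.shape, ratio)
```

Since two of the three features are nearly redundant, two components are enough to keep almost all of the variance.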

from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
import matplotlib.pyplot as plt

X = train[features]
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)


# Create a PCA instance: pca
pca = PCA(n_components=3)
principalComponents = pca.fit_transform(X_scaled)
# Plot the explained variances
feat = range(pca.n_components_)
plt.bar(feat, pca.explained_variance_ratio_, color='black')
plt.xlabel('PCA features')
plt.ylabel('variance %')
plt.xticks(feat)
# Save components to a DataFrame
PCA_components = pd.DataFrame(principalComponents, columns=list(range(len(feat))))
plt.show()

plt.scatter(PCA_components[1], PCA_components[2], alpha=.1, color='black')
plt.xlabel('PCA 1')
plt.ylabel('PCA 2')


Looking at the plots of the components, we can see some kind of clustering. So let's try clustering on them now.


clusters = 5
features = ["Temperature", "Humidity", "CO2", "HumidityRatio","Light"][:-2]

inertias = []

for c in range(2,clusters+1):
    X = PCA_components[[1,2]]
    
    colors=['red','green','blue','magenta','black','yellow']
    model = KMeans(n_clusters=c)
    model.fit(X)
    y_kmeans = model.predict(X)
    tdf["cluster"] = y_kmeans
    inertias.append((c,model.inertia_))
    
    trace0 = go.Scatter(x=X[1],y=X[2],mode='markers',  marker=dict(
        color=tdf.cluster.apply(lambda x: colors[x]),
        colorscale='Viridis',
        showscale=True
    ),name="Cluster Points")
    trace1 = go.Scatter(x=model.cluster_centers_[:, 0], y=model.cluster_centers_[:, 1],mode='markers', marker=dict(
        color=colors,
        size=20,
        showscale=True
    ),name="Cluster Mean")
        
    data7 = go.Data([trace0, trace1])
    fig = go.Figure(data=data7)
    fig.update_layout(title=f"Cluster Size {c}")
    fig.show()
    fig.write_image(f"pca_kmeans_{c}1.png")

inertias=np.array(inertias).reshape(-1,2)
performance = go.Scatter(x=inertias[:,0], y=inertias[:,1])
layout = go.Layout(
    title="Cluster Number vs Inertia",
    xaxis=dict(
        title="Ks"
    ),
    yaxis=dict(
        title="Inertia"
    ) ) 
fig=go.Figure(data=go.Data([performance]))
fig.update_layout(layout)
fig.show()
fig.write_image(f"pca_kmeans_cvi.png")

Now the clusters are much easier to see, and 2 clusters look like a reasonable choice. Before adding this to the web app, let's do KMedoids first.

KMedoids Clustering

We already covered the theory in the previous part, so here we will just import KMedoids from sklearn_extra and use it just as we did in the previous part.

from sklearn_extra.cluster import KMedoids


clusters = 5
features = ["Temperature", "Humidity", "CO2", "HumidityRatio","Light"][:-2]

inertias = []

for c in range(2,clusters+1):
    X = PCA_components[[1,2]]
    
    colors=['red','green','blue','magenta','black','yellow']
    model = KMedoids(n_clusters=c)
    model.fit(X)
    y_kmeans = model.predict(X)
    tdf["cluster"] = y_kmeans
    inertias.append((c,model.inertia_))
    
    trace0 = go.Scatter(x=X[1],y=X[2],mode='markers',  marker=dict(
        color=tdf.cluster.apply(lambda x: colors[x]),
        colorscale='Viridis',
        showscale=True
    ),name="Cluster Points")
    trace1 = go.Scatter(x=model.cluster_centers_[:, 0], y=model.cluster_centers_[:, 1],mode='markers', marker=dict(
        color=colors,
        size=20,
        showscale=True
    ),name="Cluster Mean")
        
    data7 = go.Data([trace0, trace1])
    fig = go.Figure(data=data7)
    fig.update_layout(title=f"Cluster Size {c}")
    fig.show()
    fig.write_image(f"kmedoids_{c}.png")

inertias=np.array(inertias).reshape(-1,2)
performance = go.Scatter(x=inertias[:,0], y=inertias[:,1])
layout = go.Layout(
    title="Cluster Number vs Inertia",
    xaxis=dict(
        title="Ks"
    ),
    yaxis=dict(
        title="Inertia"
    ) ) 
fig=go.Figure(data=go.Data([performance]))
fig.update_layout(layout)
fig.show()
fig.write_image(f"kmedoids_kvi.png")

Let's add this to the Streamlit app now.

Taking Clustering to Streamlit App

Since we have already made a select box for the modes, we will put all of the clustering code inside the Clustering mode.

from sklearn_extra.cluster import KMedoids
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

if mode=="Clustering":
    features = ["Temperature", "Humidity", "CO2", "HumidityRatio","Light"][:-2]
    st.markdown("## Clustering Mode Selected")
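    hline = "---"  # assumed definition (Markdown horizontal rule); hline is not defined in the code shown
    st.markdown(hline)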
    
    # select a  clustering algorithm
    calg = sidebar.selectbox("Select a clustering algorithm", ["K-Medoids","K-Means"])
    
    # select number of clusters (capped at 6 because the colors list below has only six entries)
    ks = sidebar.slider("Select number of clusters", min_value=2, max_value=6, value=2)
    
    # select a dataframe to apply cluster on
    data_type = sidebar.selectbox("Select a dataframe:", ["Train","Test1","Test2"])
    st.markdown(f"## Dataframe selected {data_type}")
    udf = dfs[data_type.lower()]
    
    # if selected kmedoids, do respective operations
    if calg == "K-Medoids":  
        st.markdown("### K-Medoids Clustering")      
        
        # if using PCA or not
        use_pca = sidebar.radio("Use PCA?",["Yes","No"])
        # if not using pca, do default clustering
        if use_pca=="No":
            st.markdown("### Not Using PCA")
            inertias = []
            for c in range(1,ks+1):
                tdf = udf.copy()
                X = tdf[features]                
                colors=['red','green','blue','magenta','black','yellow']
                model = KMedoids(n_clusters=c)
                model.fit(X)
                y_kmeans = model.predict(X)
                tdf["cluster"] = y_kmeans
                inertias.append((c,model.inertia_))
                
                trace0 = go.Scatter(x=tdf[features[0]],y=tdf[features[1]],mode='markers',  marker=dict(
                    color=tdf.cluster.apply(lambda x: colors[x]),
                    colorscale='Viridis',
                    showscale=True
                ),name="Cluster Points")
                trace1 = go.Scatter(x=model.cluster_centers_[:, 0], y=model.cluster_centers_[:, 1],mode='markers', marker=dict(
                    color=colors,
                    size=20,
                    showscale=True
                ),name="Cluster Mean")
                    
                data7 = go.Data([trace0, trace1])
                fig = go.Figure(data=data7)
                fig.update_layout(height=600, width=800, title=f"Cluster Size {c}")
                st.plotly_chart(fig)

            inertias=np.array(inertias).reshape(-1,2)
            performance = go.Scatter(x=inertias[:,0], y=inertias[:,1])
            layout = go.Layout(
                title="Cluster Number vs Inertia",
                xaxis=dict(
                    title="Ks"
                ),
                yaxis=dict(
                    title="Inertia"
                ) ) 
            fig=go.Figure(data=go.Data([performance]))
            fig.update_layout(layout)
            st.plotly_chart(fig)
        
        # if using pca, use pca to reduce dimensionality and then do clustering    
        if use_pca=="Yes":
            st.markdown("### Using PCA")
            tdf=udf.copy()
            
            X = udf[features]
            scaler = StandardScaler()
            X_scaled = scaler.fit_transform(X)
            
            pca = PCA(n_components=3)
            principalComponents = pca.fit_transform(X_scaled)
            feat = list(range(pca.n_components_))
            PCA_components = pd.DataFrame(principalComponents, columns=list(range(len(feat))))
            choosed_component = sidebar.multiselect("Choose Components",feat,default=[1,2])
            choosed_component=[int(i) for i in choosed_component]
            inertias = []
            for c in range(1,ks+1):
                X = PCA_components[choosed_component]
                
                colors=['red','green','blue','magenta','black','yellow']
                model = KMedoids(n_clusters=c)
                model.fit(X)
                y_kmeans = model.predict(X)
                tdf["cluster"] = y_kmeans
                inertias.append((c,model.inertia_))
                
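                # plot the first two chosen components (assumes at least two are selected)
                trace0 = go.Scatter(x=X[choosed_component[0]],y=X[choosed_component[1]],mode='markers',  marker=dict(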
                    color=tdf.cluster.apply(lambda x: colors[x]),
                    colorscale='Viridis',
                    showscale=True
                ),name="Cluster Points")
                trace1 = go.Scatter(x=model.cluster_centers_[:, 0], y=model.cluster_centers_[:, 1],mode='markers', marker=dict(
                    color=colors,
                    size=20,
                    showscale=True
                ),name="Cluster Mean")
                    
                data7 = go.Data([trace0, trace1])
                fig = go.Figure(data=data7)
                fig.update_layout(title=f"Cluster Size {c}")
                st.plotly_chart(fig)

            inertias=np.array(inertias).reshape(-1,2)
            performance = go.Scatter(x=inertias[:,0], y=inertias[:,1])
            layout = go.Layout(
                title="Cluster Number vs Inertia",
                xaxis=dict(
                    title="Ks"
                ),
                yaxis=dict(
                    title="Inertia"
                ) ) 
            fig=go.Figure(data=go.Data([performance]))
            fig.update_layout(layout)
            st.plotly_chart(fig)
    # if chosen KMeans, do respective operations
    if calg == "K-Means":
        st.markdown("### K-Means Clustering")        
        use_pca = sidebar.radio("Use PCA?",["Yes","No"])
        if use_pca=="No":
            st.markdown("### Not Using PCA")
            inertias = []
            for c in range(1,ks+1):
                tdf = udf.copy()
                X = tdf[features]                
                colors=['red','green','blue','magenta','black','yellow']
                model = KMeans(n_clusters=c)
                model.fit(X)
                y_kmeans = model.predict(X)
                tdf["cluster"] = y_kmeans
                inertias.append((c,model.inertia_))
                
                trace0 = go.Scatter(x=tdf[features[0]],y=tdf[features[1]],mode='markers',  marker=dict(
                    color=tdf.cluster.apply(lambda x: colors[x]),
                    colorscale='Viridis',
                    showscale=True
                ),name="Cluster Points")
                trace1 = go.Scatter(x=model.cluster_centers_[:, 0], y=model.cluster_centers_[:, 1],mode='markers', marker=dict(
                    color=colors,
                    size=20,
                    showscale=True
                ),name="Cluster Mean")
                    
                data7 = go.Data([trace0, trace1])
                fig = go.Figure(data=data7)
                fig.update_layout(height=600, width=800, title=f"Cluster Size {c}")
                st.plotly_chart(fig)

            inertias=np.array(inertias).reshape(-1,2)
            performance = go.Scatter(x=inertias[:,0], y=inertias[:,1])
            layout = go.Layout(
                title="Cluster Number vs Inertia",
                xaxis=dict(
                    title="Ks"
                ),
                yaxis=dict(
                    title="Inertia"
                ) ) 
            fig=go.Figure(data=go.Data([performance]))
            fig.update_layout(layout)
            st.plotly_chart(fig)
            
        if use_pca=="Yes":
            st.markdown("### Using PCA")
            tdf=udf.copy()
            
            X = udf[features]
            scaler = StandardScaler()
            X_scaled = scaler.fit_transform(X)
            
            pca = PCA(n_components=3)
            principalComponents = pca.fit_transform(X_scaled)
            feat = list(range(pca.n_components_))
            PCA_components = pd.DataFrame(principalComponents, columns=list(range(len(feat))))
            choosed_component = sidebar.multiselect("Choose Components",feat,default=[1,2])
            choosed_component=[int(i) for i in choosed_component]
            inertias = []
            for c in range(1,ks+1):
                X = PCA_components[choosed_component]
                
                colors=['red','green','blue','magenta','black','yellow']
                model = KMeans(n_clusters=c)
                model.fit(X)
                y_kmeans = model.predict(X)
                tdf["cluster"] = y_kmeans
                inertias.append((c,model.inertia_))
                
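                # plot the first two chosen components (assumes at least two are selected)
                trace0 = go.Scatter(x=X[choosed_component[0]],y=X[choosed_component[1]],mode='markers',  marker=dict(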
                    color=tdf.cluster.apply(lambda x: colors[x]),
                    colorscale='Viridis',
                    showscale=True
                ),name="Cluster Points")
                trace1 = go.Scatter(x=model.cluster_centers_[:, 0], y=model.cluster_centers_[:, 1],mode='markers', marker=dict(
                    color=colors,
                    size=20,
                    showscale=True
                ),name="Cluster Mean")
                    
                data7 = go.Data([trace0, trace1])
                fig = go.Figure(data=data7)
                fig.update_layout(title=f"Cluster Size {c}")
                st.plotly_chart(fig)

            inertias=np.array(inertias).reshape(-1,2)
            performance = go.Scatter(x=inertias[:,0], y=inertias[:,1])
            layout = go.Layout(
                title="Cluster Number vs Inertia",
                xaxis=dict(
                    title="Ks"
                ),
                yaxis=dict(
                    title="Inertia"
                ) ) 
            fig=go.Figure(data=go.Data([performance]))
            fig.update_layout(layout)
            st.plotly_chart(fig)          
            

Please refer to the comments for an explanation of the code above. The web app should look something like below:

Now we will move on to the Regression part and implement it in our app.

Regression

In this part, we will perform linear regression to predict the occupancy from the other features. The metric, computed with model.score, is the R2 score.
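
For a regressor, model.score returns the coefficient of determination, R2 = 1 - SS_res / SS_tot. The sketch below recomputes it by hand on made-up data and compares it with scikit-learn's value; r2_manual is a helper written for this illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression


def r2_manual(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot


rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 2))
# Made-up binary target, loosely tied to the first feature.
y_demo = (X_demo[:, 0] + 0.3 * rng.normal(size=200) > 0).astype(int)

reg = LinearRegression().fit(X_demo, y_demo)
score_sklearn = reg.score(X_demo, y_demo)
score_manual = r2_manual(y_demo, reg.predict(X_demo))
print(score_sklearn, score_manual)
```

Note that R2 treats the 0/1 label as a continuous quantity, so it measures how much of the label's variance the linear fit explains rather than how many rows are classified correctly.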

from sklearn.linear_model import LinearRegression

model = LinearRegression()
features = ["Temperature", "Humidity", "CO2", "HumidityRatio","Light"]
dfs['btrain'] = pd.concat([train,test2])
xtest = dfs["test1"][features].to_numpy()
ytest = dfs["test1"]["Occupancy"]

for ddn,d in dfs.items():
    if ddn!="test1":
        print(ddn)
        X = d[features].to_numpy().reshape(-1,len(features))
        y = d["Occupancy"]

        model.fit(X,y)
        print(f"Model R2: {model.score(X,y)}")
        print(f"Test R2: {model.score(xtest,ytest)}")
    
train
Model R2: 0.8580749633459134
Test R2: 0.8714317856126421
test2
Model R2: 0.8952863420051961
Test R2: 0.8658567155646273
btrain
Model R2: 0.8693410187120196
Test R2: 0.8649947193268359

Looking at the results above, all three training sets give a similar test R2 of about 0.87. train has the highest test score (0.871), while test2 has the highest score on its own training data (0.895).

features = ["Temperature", "Humidity", "CO2", "HumidityRatio","Light"]
# the loop variable must be ddn; the filter and the printed label both use it
for ddn,d in dfs.items():
    if ddn!="test1":
        print(ddn)
        X = d[features].to_numpy().reshape(-1,len(features))
        y = d["Occupancy"]
        scaler = StandardScaler()
        X_scaled = scaler.fit_transform(X)


        # Create a PCA instance: pca
        pca = PCA(n_components=3)
        principalComponents = pca.fit_transform(X_scaled)
        feat = range(pca.n_components_)
        PCA_components = pd.DataFrame(principalComponents, columns=list(feat))

        model=LinearRegression()
        model.fit(PCA_components.to_numpy().reshape(-1,len(feat)),y)

        print(f"Model R2: {model.score(PCA_components.to_numpy().reshape(-1,len(feat)),y)}")
        #print(f"Test R2: {model.score(xtest,ytest)}")
train
Model R2: 0.8533646108336054
test2
Model R2: 0.6389469109358212
btrain
Model R2: 0.6305745433548783

It seems that our best model is still the default linear regression, but let's take PCA into the Streamlit app anyway.
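The PCA experiment above leaves the test score commented out. One likely reason is that xtest is still in the original feature space, while the model was fit on principal components. A possible way around this is to chain the scaler, PCA, and regression in a scikit-learn pipeline, so that the transforms learned on the training data are reused on the test data. The sketch below uses synthetic arrays as a stand-in for the sensor features; the data, the weights w, and the variable names are assumptions for illustration only.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic stand-in for the five sensor features (not the real dataset)
X_train = rng.normal(size=(200, 5))
X_test = rng.normal(size=(50, 5))
w = np.array([0.5, -0.2, 0.1, 0.0, 0.8])
y_train = (X_train @ w > 0).astype(int)
y_test = (X_test @ w > 0).astype(int)

# The scaler and PCA are fit on the training data only;
# score() applies the same fitted transforms before predicting.
pipe = make_pipeline(StandardScaler(), PCA(n_components=3), LinearRegression())
pipe.fit(X_train, y_train)

train_r2 = pipe.score(X_train, y_train)
test_r2 = pipe.score(X_test, y_test)
print(f"Train R2: {train_r2:.3f}, Test R2: {test_r2:.3f}")
```

With the real data, X_train and X_test would be the feature columns of the chosen train and test dataframes.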

Taking Regression to Streamlit App

from sklearn.linear_model import LinearRegression, LogisticRegression, Ridge, Lasso, ElasticNet


if mode == "Regression":
    features = ["Temperature", "Humidity", "CO2", "HumidityRatio","Light"]
    algorithm = sidebar.selectbox("Choose Algorithm",["Linear Regression","Ridge Regression","Lasso Regression","Elastic Net"])
    st.markdown(f"### Chosen {algorithm}")
    models = {"Linear Regression":LinearRegression(), "Ridge Regression":Ridge(), "Lasso Regression":Lasso(), "Elastic Net":ElasticNet()}
    model = models[algorithm]
    
    train_df = sidebar.selectbox("Choose Train Data",data_list)
    test_df = sidebar.selectbox("Choose Test Data",[i for i in data_list if i != train_df])
    selected = sidebar.multiselect("Choose Features",features,default=features)
       
    xtrain = dfs[train_df][selected].to_numpy().reshape(-1,len(selected))
    xtest = dfs[test_df][selected].to_numpy().reshape(-1,len(selected))
    ytrain = dfs[train_df]["Occupancy"].to_numpy()
    ytest = dfs[test_df]["Occupancy"].to_numpy()

    model.fit(xtrain,ytrain)
    st.markdown(f"Train R2 Score: {model.score(xtrain,ytrain)}")
    st.markdown(f"Test R2 Score: {model.score(xtest,ytest)}")
    
  • We imported a few regression algorithms from sklearn.
  • We made a select box to choose an algorithm.
  • We made select boxes to choose the train and test data.
  • We made a multiselect box to choose the features used to build the model.
  • Then we trained a model using the selected data, features, and algorithm.
  • Finally, we printed the train and test R2 scores.
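The steps above can also be sketched outside Streamlit, which makes the model-selection logic easier to test. In this sketch the names MODEL_CLASSES and fit_and_score are hypothetical, and the arrays are synthetic rather than the occupancy data. Storing the classes instead of instances means each call builds a fresh, unfitted model.

```python
import numpy as np
from sklearn.linear_model import ElasticNet, Lasso, LinearRegression, Ridge

# Map the display names used in the select box to estimator classes
MODEL_CLASSES = {
    "Linear Regression": LinearRegression,
    "Ridge Regression": Ridge,
    "Lasso Regression": Lasso,
    "Elastic Net": ElasticNet,
}


def fit_and_score(algorithm, xtrain, ytrain, xtest, ytest):
    """Fit the chosen algorithm and return (model, train R2, test R2)."""
    model = MODEL_CLASSES[algorithm]()  # fresh instance with default settings
    model.fit(xtrain, ytrain)
    return model, model.score(xtrain, ytrain), model.score(xtest, ytest)


# Synthetic data with a known linear relationship (illustration only)
rng = np.random.default_rng(42)
xtrain = rng.normal(size=(100, 3))
xtest = rng.normal(size=(40, 3))
coef = np.array([1.0, -2.0, 0.5])
ytrain = xtrain @ coef
ytest = xtest @ coef

results = {}
for name in MODEL_CLASSES:
    _, train_r2, test_r2 = fit_and_score(name, xtrain, ytrain, xtest, ytest)
    results[name] = (train_r2, test_r2)
    print(f"{name}: train R2 = {train_r2:.3f}, test R2 = {test_r2:.3f}")
```

In the app, the algorithm argument would come from the select box and the arrays from the chosen dataframes and features.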

Let's add a few more lines of code to show the coefficients and take user input for a prediction:

    if sidebar.checkbox("Show Coefficients"):
        st.markdown("#### Showing Coefficients and Intercept")
        st.write(f"Coeffs: {model.coef_}")
        st.write(f"Intercept: {model.intercept_}")
    if sidebar.checkbox("Show Prediction"):
        st.markdown("#### Showing Prediction")
        input_values = [float((st.number_input(t))) for t in selected]
        prediction = model.predict([input_values])
        st.write(f"Predicted {prediction}")

All Codes

Below is all the code we have written up to now.

import streamlit as st
import numpy as np
import pandas as pd
import cufflinks
from plotly.subplots import make_subplots
import plotly.graph_objects as go
from sklearn_extra.cluster import KMedoids
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression, LogisticRegression, Ridge, Lasso, ElasticNet

hline="--"*40

@st.cache
def get_data():
    train = pd.read_csv("https://github.com/LuisM78/Occupancy-detection-data/raw/master/datatraining.txt")
    train["date"]=pd.to_datetime(train.date)
    
    test1 = pd.read_csv("https://github.com/LuisM78/Occupancy-detection-data/raw/master/datatest.txt")
    test1["date"]=pd.to_datetime(test1.date)
    
    test2 = pd.read_csv("https://github.com/LuisM78/Occupancy-detection-data/raw/master/datatest2.txt")
    test2["date"]=pd.to_datetime(test2.date)
    
    dfs = {"train":train,"test1":test1,"test2":test2}
    return dfs

dfs = get_data()
data_list = list(dfs.keys())
sidebar = st.sidebar

# select modes, EDA, Clustering, Regression and Classification
mode = sidebar.selectbox("Select a mode.",options=["EDA", "Clustering", "Regression", "Classification"])
st.markdown(f"### {mode} Mode Selected")
st.markdown(hline)
    

# If selected EDA, show EDA related plots
if mode=="EDA":
    # if selected show the data
    show_data = sidebar.checkbox("Show data")
    if show_data:
        # if selected, show train data
        if sidebar.checkbox("Show Train data"):
            st.markdown("### Train Data")
            st.dataframe(dfs["train"])
        
        # if selected, show test1 data
        if sidebar.checkbox("Show Test1 data"):
            st.markdown("### Test1 Data")
            st.dataframe(dfs["test1"])
            
        # if selected, show test2 data
        if sidebar.checkbox("Show Test2 data"):
            st.markdown("### Test2 Data")
            st.dataframe(dfs["test2"])
    
    # if selected, show the comparison data
    show_comparison = sidebar.checkbox("Show comparison")
    if show_comparison:
        
        # make a multiselect to select the columns to compare
        selected = sidebar.multiselect("Select Columns ", [d for d in dfs["train"].columns if d not in ["date"]])
        
        
        titles=list(dfs.keys())
        train = dfs["train"]
        test1 = dfs["test1"]
        test2 = dfs["test2"]
        
        if selected:
            st.markdown(f"### Selected Columns: {', '.join(selected)}")
            
            for c in selected:
                fig = make_subplots(rows=2,cols=3, subplot_titles=titles, )
                fig.add_trace(go.Box(y=train[c].tolist(), name=titles[0]), row=1, col=1)
                fig.add_trace(go.Box(y=test1[c].tolist(), name = titles[1]), row=1, col=2)
                fig.add_trace(go.Box(y=test2[c].tolist(), name = titles[2]), row=1, col=3)
                
                fig.add_trace(go.Histogram(y=train[c].tolist(), name=titles[0]), row=2, col=1)
                fig.add_trace(go.Histogram(y=test1[c].tolist(), name = titles[1]), row=2, col=2)
                fig.add_trace(go.Histogram(y=test2[c].tolist(), name = titles[2]), row=2, col=3)
                fig.update_layout(height=600, width=800, title_text=f"Box and Distribution of {c}")
                st.plotly_chart(fig)
        
        # if selected show correlation
        show_corr = sidebar.checkbox("Show Correlation")
        if show_corr:
            st.markdown("### Correlation")
            fig = make_subplots(rows=1,cols=3, subplot_titles=titles)
            fig.add_trace(go.Heatmap(z=train.corr(), y=train.corr().columns,x=train.corr().index, name=titles[0]), row=1, col=1)
            fig.add_trace(go.Heatmap(z=test1.corr(), x=train.corr().index, name = titles[1]), row=1, col=2)
            fig.add_trace(go.Heatmap(z=test2.corr(), x=train.corr().index, name = titles[2]), row=1, col=3)
            st.plotly_chart(fig)
if mode=="Clustering":
    features = ["Temperature", "Humidity", "CO2", "HumidityRatio","Light"][:-2]
    
    # select a  clustering algorithm
    calg = sidebar.selectbox("Select a clustering algorithm", ["K-Medoids","K-Means"])
    
    # select number of clusters
    ks = sidebar.slider("Select number of clusters", min_value=2, max_value=10, value=2)
    
    # select a dataframe to apply cluster on
    data_type = sidebar.selectbox("Select a dataframe:", ["Train","Test1","Test2"])
    st.markdown(f"## Dataframe selected {data_type}")
    udf = dfs[data_type.lower()]
    
    # if selected kmedoids, do respective operations
    if calg == "K-Medoids":  
        st.markdown("### K-Medoids Clustering")      
        
        # if using PCA or not
        use_pca = sidebar.radio("Use PCA?",["Yes","No"])
        # if not using pca, do default clustering
        if use_pca=="No":
            st.markdown("### Not Using PCA")
            inertias = []
            for c in range(1,ks+1):
                tdf = udf.copy()
                X = tdf[features]                
                colors=['red','green','blue','magenta','black','yellow']
                model = KMedoids(n_clusters=c)
                model.fit(X)
                y_kmeans = model.predict(X)
                tdf["cluster"] = y_kmeans
                inertias.append((c,model.inertia_))
                
                trace0 = go.Scatter(x=tdf[features[0]],y=tdf[features[1]],mode='markers',  marker=dict(
                    color=tdf.cluster.apply(lambda x: colors[x % len(colors)]),
                    colorscale='Viridis',
                    showscale=True
                ),name="Cluster Points")
                trace1 = go.Scatter(x=model.cluster_centers_[:, 0], y=model.cluster_centers_[:, 1],mode='markers', marker=dict(
                    color=colors,
                    size=20,
                    showscale=True
                ),name="Cluster Mean")
                    
                data7 = go.Data([trace0, trace1])
                fig = go.Figure(data=data7)
                fig.update_layout(height=600, width=800, title=f"Cluster Size {c}")
                st.plotly_chart(fig)

            inertias=np.array(inertias).reshape(-1,2)
            performance = go.Scatter(x=inertias[:,0], y=inertias[:,1])
            layout = go.Layout(
                title="Cluster Number vs Inertia",
                xaxis=dict(
                    title="Ks"
                ),
                yaxis=dict(
                    title="Inertia"
                ) ) 
            fig=go.Figure(data=go.Data([performance]))
            fig.update_layout(layout)
            st.plotly_chart(fig)
        
        # if using pca, use pca to reduce dimensionality and then do clustering    
        if use_pca=="Yes":
            st.markdown("### Using PCA")
            tdf=udf.copy()
            
            X = udf[features]
            scaler = StandardScaler()
            X_scaled = scaler.fit_transform(X)
            
            pca = PCA(n_components=3)
            principalComponents = pca.fit_transform(X_scaled)
            feat = list(range(pca.n_components_))
            PCA_components = pd.DataFrame(principalComponents, columns=list(range(len(feat))))
            choosed_component = sidebar.multiselect("Choose Components",feat,default=[1,2])
            choosed_component=[int(i) for i in choosed_component]
            inertias = []
            for c in range(1,ks+1):
                X = PCA_components[choosed_component]
                
                colors=['red','green','blue','magenta','black','yellow']
                model = KMedoids(n_clusters=c)
                model.fit(X)
                y_kmeans = model.predict(X)
                tdf["cluster"] = y_kmeans
                inertias.append((c,model.inertia_))
                
                trace0 = go.Scatter(x=X[choosed_component[0]],y=X[choosed_component[1]],mode='markers',  marker=dict(
                    color=tdf.cluster.apply(lambda x: colors[x % len(colors)]),
                    colorscale='Viridis',
                    showscale=True
                ),name="Cluster Points")
                trace1 = go.Scatter(x=model.cluster_centers_[:, 0], y=model.cluster_centers_[:, 1],mode='markers', marker=dict(
                    color=colors,
                    size=20,
                    showscale=True
                ),name="Cluster Mean")
                    
                data7 = go.Data([trace0, trace1])
                fig = go.Figure(data=data7)
                fig.update_layout(title=f"Cluster Size {c}")
                st.plotly_chart(fig)

            inertias=np.array(inertias).reshape(-1,2)
            performance = go.Scatter(x=inertias[:,0], y=inertias[:,1])
            layout = go.Layout(
                title="Cluster Number vs Inertia",
                xaxis=dict(
                    title="Ks"
                ),
                yaxis=dict(
                    title="Inertia"
                ) ) 
            fig=go.Figure(data=go.Data([performance]))
            fig.update_layout(layout)
            st.plotly_chart(fig)
    # if chosen KMeans, do respective operations
    if calg == "K-Means":
        st.markdown("### K-Means Clustering")        
        use_pca = sidebar.radio("Use PCA?",["Yes","No"])
        if use_pca=="No":
            st.markdown("### Not Using PCA")
            inertias = []
            for c in range(1,ks+1):
                tdf = udf.copy()
                X = tdf[features]                
                colors=['red','green','blue','magenta','black','yellow']
                model = KMeans(n_clusters=c)
                model.fit(X)
                y_kmeans = model.predict(X)
                tdf["cluster"] = y_kmeans
                inertias.append((c,model.inertia_))
                
                trace0 = go.Scatter(x=tdf[features[0]],y=tdf[features[1]],mode='markers',  marker=dict(
                    color=tdf.cluster.apply(lambda x: colors[x % len(colors)]),
                    colorscale='Viridis',
                    showscale=True
                ),name="Cluster Points")
                trace1 = go.Scatter(x=model.cluster_centers_[:, 0], y=model.cluster_centers_[:, 1],mode='markers', marker=dict(
                    color=colors,
                    size=20,
                    showscale=True
                ),name="Cluster Mean")
                    
                data7 = go.Data([trace0, trace1])
                fig = go.Figure(data=data7)
                fig.update_layout(height=600, width=800, title=f"Cluster Size {c}")
                st.plotly_chart(fig)

            inertias=np.array(inertias).reshape(-1,2)
            performance = go.Scatter(x=inertias[:,0], y=inertias[:,1])
            layout = go.Layout(
                title="Cluster Number vs Inertia",
                xaxis=dict(
                    title="Ks"
                ),
                yaxis=dict(
                    title="Inertia"
                ) ) 
            fig=go.Figure(data=go.Data([performance]))
            fig.update_layout(layout)
            st.plotly_chart(fig)
            
        if use_pca=="Yes":
            st.markdown("### Using PCA")
            tdf=udf.copy()
            
            X = udf[features]
            scaler = StandardScaler()
            X_scaled = scaler.fit_transform(X)
            
            pca = PCA(n_components=3)
            principalComponents = pca.fit_transform(X_scaled)
            feat = list(range(pca.n_components_))
            PCA_components = pd.DataFrame(principalComponents, columns=list(range(len(feat))))
            choosed_component = sidebar.multiselect("Choose Components",feat,default=[1,2])
            choosed_component=[int(i) for i in choosed_component]
            inertias = []
            for c in range(1,ks+1):
                X = PCA_components[choosed_component]
                
                colors=['red','green','blue','magenta','black','yellow']
                model = KMeans(n_clusters=c)
                model.fit(X)
                y_kmeans = model.predict(X)
                tdf["cluster"] = y_kmeans
                inertias.append((c,model.inertia_))
                
                trace0 = go.Scatter(x=X[choosed_component[0]],y=X[choosed_component[1]],mode='markers',  marker=dict(
                    color=tdf.cluster.apply(lambda x: colors[x % len(colors)]),
                    colorscale='Viridis',
                    showscale=True
                ),name="Cluster Points")
                trace1 = go.Scatter(x=model.cluster_centers_[:, 0], y=model.cluster_centers_[:, 1],mode='markers', marker=dict(
                    color=colors,
                    size=20,
                    showscale=True
                ),name="Cluster Mean")
                    
                data7 = go.Data([trace0, trace1])
                fig = go.Figure(data=data7)
                fig.update_layout(title=f"Cluster Size {c}")
                st.plotly_chart(fig)

            inertias=np.array(inertias).reshape(-1,2)
            performance = go.Scatter(x=inertias[:,0], y=inertias[:,1])
            layout = go.Layout(
                title="Cluster Number vs Inertia",
                xaxis=dict(
                    title="Ks"
                ),
                yaxis=dict(
                    title="Inertia"
                ) ) 
            fig=go.Figure(data=go.Data([performance]))
            fig.update_layout(layout)
            st.plotly_chart(fig)
            
if mode == "Regression":
    features = ["Temperature", "Humidity", "CO2", "HumidityRatio","Light"]
    algorithm = sidebar.selectbox("Choose Algorithm",["Linear Regression","Ridge Regression","Lasso Regression","Elastic Net"])
    st.markdown(f"### Chosen {algorithm}")
    models = {"Linear Regression":LinearRegression(), "Ridge Regression":Ridge(), "Lasso Regression":Lasso(), "Elastic Net":ElasticNet()}
    model = models[algorithm]
    
    train_df = sidebar.selectbox("Choose Train Data",data_list)
    test_df = sidebar.selectbox("Choose Test Data",[i for i in data_list if i != train_df])
    selected = sidebar.multiselect("Choose Features",features,default=features)
       
    xtrain = dfs[train_df][selected].to_numpy().reshape(-1,len(selected))
    xtest = dfs[test_df][selected].to_numpy().reshape(-1,len(selected))
    ytrain = dfs[train_df]["Occupancy"].to_numpy()
    ytest = dfs[test_df]["Occupancy"].to_numpy()

    model.fit(xtrain,ytrain)
    st.markdown(f"Train R2 Score: {model.score(xtrain,ytrain)}")
    st.markdown(f"Test R2 Score: {model.score(xtest,ytest)}")
    
    if sidebar.checkbox("Show Coefficients"):
        st.markdown("#### Showing Coefficients and Intercept")
        st.write(f"Coeffs: {model.coef_}")
        st.write(f"Intercept: {model.intercept_}")
    if sidebar.checkbox("Show Prediction"):
        st.markdown("#### Showing Prediction")
        input_values = [float((st.number_input(t))) for t in selected]
        prediction = model.predict([input_values])
        st.write(f"Predicted {prediction}")
    

Classification

Taking Classification to Streamlit App

So far we have created the EDA, Clustering, and Regression modes; now it is time to create the Classification mode. We already covered most of the repeated trial-and-error steps in the sections above, so here we will jump right into the implementation.

from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import confusion_matrix, f1_score, accuracy_score

if mode == "Classification":
    features = ["Temperature", "Humidity", "CO2", "HumidityRatio","Light"]
    algorithm = sidebar.selectbox("Choose Algorithm",["Logistic Regression","KNN","Decision Tree","Random Forest", "Ada Boost"])
    st.markdown(f"### Chosen {algorithm}")
    models = {"Logistic Regression":LogisticRegression(), "KNN":KNeighborsClassifier(), "Decision Tree":DecisionTreeClassifier(), "Random Forest":RandomForestClassifier(), "Ada Boost":AdaBoostClassifier()}
    model = models[algorithm]
    
    train_df = sidebar.selectbox("Choose Train Data",data_list)
    test_df = sidebar.selectbox("Choose Test Data",[i for i in data_list if i != train_df])
    selected = sidebar.multiselect("Choose Features",features,default=features)
       
    xtrain = dfs[train_df][selected].to_numpy().reshape(-1,len(selected))
    xtest = dfs[test_df][selected].to_numpy().reshape(-1,len(selected))
    ytrain = dfs[train_df]["Occupancy"].to_numpy()
    ytest = dfs[test_df]["Occupancy"].to_numpy()

    model.fit(xtrain,ytrain)
    # for classifiers, model.score returns the mean accuracy, not R2
    st.markdown(f"##### Mean Accuracy (model.score): Train = {model.score(xtrain,ytrain) : .2f}, Test = {model.score(xtest,ytest) : .2f}")
    
    train_pred = model.predict(xtrain)
    test_pred = model.predict(xtest)
    
    cm_train = confusion_matrix(ytrain,train_pred,labels=[0,1])
    cm_test = confusion_matrix(ytest,test_pred,labels=[0,1])
    
    train_f1 = f1_score(ytrain,train_pred,average="macro")
    test_f1 = f1_score(ytest,test_pred,average="macro")
    train_acc = accuracy_score(ytrain,train_pred)
    test_acc = accuracy_score(ytest,test_pred)
    
    st.markdown(f"##### F1 Score: Train = {train_f1 : .2f}, Test = {test_f1 : .2f}")
    st.markdown(f"##### Accuracy Score: Train = {train_acc : .2f}, Test = {test_acc : .2f}")
    
    fig = make_subplots(rows=1,cols=2, subplot_titles=["Train","Test"])
    labels = ["Vacant","Occupied"]
    fig1 = go.Heatmap(z=cm_train, y=labels, x=labels, name="Train")
    fig2 = go.Heatmap(z=cm_test, y=labels, x=labels, name="Test")
    
    fig.add_trace(fig1, row=1, col=1)
    fig.add_trace(fig2, row=1, col=2)
    fig.update_layout({"title":"Confusion Matrix","xaxis": {"title": "Predicted value"},
        "yaxis": {"title": "Real value"}})
    st.plotly_chart(fig)
    
    if sidebar.checkbox("Show Coefficients"):
        st.markdown("#### Showing Coefficients and Intercept")
        # only linear models such as Logistic Regression expose coef_ and intercept_
        try:
            st.write(f"Coeffs: {model.coef_}")
            st.write(f"Intercept: {model.intercept_}")
        except AttributeError:
            st.write("Coeffs: Not Available")
    if sidebar.checkbox("Show Prediction"):
        st.markdown("#### Showing Prediction")
        input_values = [float((st.number_input(t))) for t in selected]
        prediction = model.predict([input_values])
        st.write(f"Predicted {prediction}")
    

What we are doing in the code above is:

  • Select the classification algorithm and create an instance of its class.
  • Choose the features and prepare the data using those features.
  • Train the model and show the train/test metrics.
  • Show the confusion matrices.
If everything is fine, our app should look like the one below:

All Codes

import streamlit as st
import numpy as np
import pandas as pd
import cufflinks
from plotly.subplots import make_subplots
import plotly.graph_objects as go
from sklearn_extra.cluster import KMedoids
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression, LogisticRegression, Ridge, Lasso, ElasticNet
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import confusion_matrix, f1_score, accuracy_score

hline="--"*40

@st.cache
def get_data():
    train = pd.read_csv("https://github.com/LuisM78/Occupancy-detection-data/raw/master/datatraining.txt")
    train["date"]=pd.to_datetime(train.date)
    
    test1 = pd.read_csv("https://github.com/LuisM78/Occupancy-detection-data/raw/master/datatest.txt")
    test1["date"]=pd.to_datetime(test1.date)
    
    test2 = pd.read_csv("https://github.com/LuisM78/Occupancy-detection-data/raw/master/datatest2.txt")
    test2["date"]=pd.to_datetime(test2.date)
    
    dfs = {"train":train,"test1":test1,"test2":test2}
    return dfs

dfs = get_data()
data_list = list(dfs.keys())
sidebar = st.sidebar

# select modes, EDA, Clustering, Regression and Classification
mode = sidebar.selectbox("Select a mode.",options=["EDA", "Clustering", "Regression", "Classification"])
st.markdown(f"### {mode} Mode Selected")
st.markdown(hline)
    

# If selected EDA, show EDA related plots
if mode=="EDA":
    # if selected show the data
    show_data = sidebar.checkbox("Show data")
    if show_data:
        # if selected, show train data
        if sidebar.checkbox("Show Train data"):
            st.markdown("### Train Data")
            st.dataframe(dfs["train"])
        
        # if selected, show test1 data
        if sidebar.checkbox("Show Test1 data"):
            st.markdown("### Test1 Data")
            st.dataframe(dfs["test1"])
            
        # if selected, show test2 data
        if sidebar.checkbox("Show Test2 data"):
            st.markdown("### Test2 Data")
            st.dataframe(dfs["test2"])
    
    # if selected, show the comparison data
    show_comparison = sidebar.checkbox("Show comparison")
    if show_comparison:
        
        # make a multiselect to select the columns to compare
        selected = sidebar.multiselect("Select Columns ", [d for d in dfs["train"].columns if d not in ["date"]])
        
        
        titles=list(dfs.keys())
        train = dfs["train"]
        test1 = dfs["test1"]
        test2 = dfs["test2"]
        
        if selected:
            st.markdown(f"### Selected Columns: {', '.join(selected)}")
            
            for c in selected:
                fig = make_subplots(rows=2,cols=3, subplot_titles=titles, )
                fig.add_trace(go.Box(y=train[c].tolist(), name=titles[0]), row=1, col=1)
                fig.add_trace(go.Box(y=test1[c].tolist(), name = titles[1]), row=1, col=2)
                fig.add_trace(go.Box(y=test2[c].tolist(), name = titles[2]), row=1, col=3)
                
                fig.add_trace(go.Histogram(y=train[c].tolist(), name=titles[0]), row=2, col=1)
                fig.add_trace(go.Histogram(y=test1[c].tolist(), name = titles[1]), row=2, col=2)
                fig.add_trace(go.Histogram(y=test2[c].tolist(), name = titles[2]), row=2, col=3)
                fig.update_layout(height=600, width=800, title_text=f"Box and Distribution of {c}")
                st.plotly_chart(fig)
        
        # if selected show correlation
        show_corr = sidebar.checkbox("Show Correlation")
        if show_corr:
            st.markdown("### Correlation")
            fig = make_subplots(rows=1,cols=3, subplot_titles=titles)
            fig.add_trace(go.Heatmap(z=train.corr(), y=train.corr().columns,x=train.corr().index, name=titles[0]), row=1, col=1)
            fig.add_trace(go.Heatmap(z=test1.corr(), x=train.corr().index, name = titles[1]), row=1, col=2)
            fig.add_trace(go.Heatmap(z=test2.corr(), x=train.corr().index, name = titles[2]), row=1, col=3)
            st.plotly_chart(fig)
if mode=="Clustering":
    features = ["Temperature", "Humidity", "CO2", "HumidityRatio","Light"][:-2]
    
    # select a  clustering algorithm
    calg = sidebar.selectbox("Select a clustering algorithm", ["K-Medoids","K-Means"])
    
    # select number of clusters
    ks = sidebar.slider("Select number of clusters", min_value=2, max_value=10, value=2)
    
    # select a dataframe to apply cluster on
    data_type = sidebar.selectbox("Select a dataframe:", ["Train","Test1","Test2"])
    st.markdown(f"## Dataframe selected {data_type}")
    udf = dfs[data_type.lower()]
    
    # if selected kmedoids, do respective operations
    if calg == "K-Medoids":  
        st.markdown("### K-Medoids Clustering")      
        
        # if using PCA or not
        use_pca = sidebar.radio("Use PCA?",["Yes","No"])
        # if not using pca, do default clustering
        if use_pca=="No":
            st.markdown("### Not Using PCA")
            inertias = []
            for c in range(1,ks+1):
                tdf = udf.copy()
                X = tdf[features]                
                colors=['red','green','blue','magenta','black','yellow']
                model = KMedoids(n_clusters=c)
                model.fit(X)
                y_kmeans = model.predict(X)
                tdf["cluster"] = y_kmeans
                inertias.append((c,model.inertia_))
                
                trace0 = go.Scatter(x=tdf[features[0]],y=tdf[features[1]],mode='markers',  marker=dict(
                    color=tdf.cluster.apply(lambda x: colors[x % len(colors)]),
                    colorscale='Viridis',
                    showscale=True
                ),name="Cluster Points")
                trace1 = go.Scatter(x=model.cluster_centers_[:, 0], y=model.cluster_centers_[:, 1],mode='markers', marker=dict(
                    color=colors,
                    size=20,
                    showscale=True
                ),name="Cluster Mean")
                    
                data7 = go.Data([trace0, trace1])
                fig = go.Figure(data=data7)
                fig.update_layout(height=600, width=800, title=f"Cluster Size {c}")
                st.plotly_chart(fig)

            inertias=np.array(inertias).reshape(-1,2)
            performance = go.Scatter(x=inertias[:,0], y=inertias[:,1])
            layout = go.Layout(
                title="Cluster Number vs Inertia",
                xaxis=dict(
                    title="Ks"
                ),
                yaxis=dict(
                    title="Inertia"
                ) ) 
            fig=go.Figure(data=go.Data([performance]))
            fig.update_layout(layout)
            st.plotly_chart(fig)
        
        # if using pca, use pca to reduce dimensionality and then do clustering    
        if use_pca=="Yes":
            st.markdown("### Using PCA")
            tdf=udf.copy()
            
            X = udf[features]
            scaler = StandardScaler()
            X_scaled = scaler.fit_transform(X)
            
            pca = PCA(n_components=3)
            principalComponents = pca.fit_transform(X_scaled)
            feat = list(range(pca.n_components_))
            PCA_components = pd.DataFrame(principalComponents, columns=list(range(len(feat))))
            choosed_component = sidebar.multiselect("Choose Components",feat,default=[1,2])
            choosed_component=[int(i) for i in choosed_component]
            inertias = []
            for c in range(1,ks+1):
                X = PCA_components[choosed_component]
                
                colors=['red','green','blue','magenta','black','yellow']
                model = KMedoids(n_clusters=c)
                model.fit(X)
                y_kmeans = model.predict(X)
                tdf["cluster"] = y_kmeans
                inertias.append((c,model.inertia_))
                
                trace0 = go.Scatter(x=X[choosed_component[0]],y=X[choosed_component[1]],mode='markers',  marker=dict(
                    color=tdf.cluster.apply(lambda x: colors[x % len(colors)]),
                    colorscale='Viridis',
                    showscale=True
                ),name="Cluster Points")
                trace1 = go.Scatter(x=model.cluster_centers_[:, 0], y=model.cluster_centers_[:, 1],mode='markers', marker=dict(
                    color=colors,
                    size=20,
                    showscale=True
                ),name="Cluster Mean")
                    
                data7 = go.Data([trace0, trace1])
                fig = go.Figure(data=data7)
                fig.update_layout(title=f"Cluster Size {c}")
                st.plotly_chart(fig)

            inertias=np.array(inertias).reshape(-1,2)
            performance = go.Scatter(x=inertias[:,0], y=inertias[:,1])
            layout = go.Layout(
                title="Cluster Number vs Inertia",
                xaxis=dict(
                    title="Ks"
                ),
                yaxis=dict(
                    title="Inertia"
                ) ) 
            fig=go.Figure(data=go.Data([performance]))
            fig.update_layout(layout)
            st.plotly_chart(fig)
    # if chosen KMeans, do respective operations
    if calg == "K-Means":
        st.markdown("### K-Means Clustering")        
        use_pca = sidebar.radio("Use PCA?",["Yes","No"])
        if use_pca=="No":
            st.markdown("### Not Using PCA")
            inertias = []
            for c in range(1,ks+1):
                tdf = udf.copy()
                X = tdf[features]                
                colors=['red','green','blue','magenta','black','yellow']
                model = KMeans(n_clusters=c)
                model.fit(X)
                y_kmeans = model.predict(X)
                tdf["cluster"] = y_kmeans
                inertias.append((c,model.inertia_))
                
                trace0 = go.Scatter(x=tdf[features[0]],y=tdf[features[1]],mode='markers',  marker=dict(
                    color=tdf.cluster.apply(lambda x: colors[x % len(colors)]),
                    colorscale='Viridis',
                    showscale=True
                ),name="Cluster Points")
                trace1 = go.Scatter(x=model.cluster_centers_[:, 0], y=model.cluster_centers_[:, 1],mode='markers', marker=dict(
                    color=colors,
                    size=20,
                    showscale=True
                ),name="Cluster Mean")
                    
                data7 = go.Data([trace0, trace1])
                fig = go.Figure(data=data7)
                fig.update_layout(height=600, width=800, title=f"Cluster Size {c}")
                st.plotly_chart(fig)

            inertias=np.array(inertias).reshape(-1,2)
            performance = go.Scatter(x=inertias[:,0], y=inertias[:,1])
            layout = go.Layout(
                title="Cluster Number vs Inertia",
                xaxis=dict(
                    title="Ks"
                ),
                yaxis=dict(
                    title="Inertia"
                ) ) 
            fig=go.Figure(data=[performance])
            fig.update_layout(layout)
            st.plotly_chart(fig)
            
        if use_pca=="Yes":
            st.markdown("### Using PCA")
            tdf=udf.copy()
            
            X = udf[features]
            scaler = StandardScaler()
            X_scaled = scaler.fit_transform(X)
            
            pca = PCA(n_components=3)
            principalComponents = pca.fit_transform(X_scaled)
            feat = list(range(pca.n_components_))
            PCA_components = pd.DataFrame(principalComponents, columns=list(range(len(feat))))
            choosed_component = sidebar.multiselect("Choose Components",feat,default=[1,2])
            choosed_component=[int(i) for i in choosed_component]
            inertias = []
            for c in range(1,ks+1):
                X = PCA_components[choosed_component]
                
                colors=['red','green','blue','magenta','black','yellow']
                model = KMeans(n_clusters=c)
                model.fit(X)
                y_kmeans = model.predict(X)
                tdf["cluster"] = y_kmeans
                inertias.append((c,model.inertia_))
                
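                # plot the first two chosen components instead of hard-coded columns 1 and 2
                trace0 = go.Scatter(x=X[choosed_component[0]],y=X[choosed_component[1]],mode='markers',  marker=dict(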
                    color=tdf.cluster.apply(lambda x: colors[x]),
                    colorscale='Viridis',
                    showscale=True
                ),name="Cluster Points")
                trace1 = go.Scatter(x=model.cluster_centers_[:, 0], y=model.cluster_centers_[:, 1],mode='markers', marker=dict(
                    color=colors,
                    size=20,
                    showscale=True
                ),name="Cluster Mean")
                    
                data7 = [trace0, trace1]
                fig = go.Figure(data=data7)
                fig.update_layout(title=f"Cluster Size {c}")
                st.plotly_chart(fig)

            inertias=np.array(inertias).reshape(-1,2)
            performance = go.Scatter(x=inertias[:,0], y=inertias[:,1])
            layout = go.Layout(
                title="Cluster Number vs Inertia",
                xaxis=dict(
                    title="Ks"
                ),
                yaxis=dict(
                    title="Inertia"
                ) ) 
            fig=go.Figure(data=[performance])
            fig.update_layout(layout)
            st.plotly_chart(fig)
            
if mode == "Regression":
    features = ["Temperature", "Humidity", "CO2", "HumidityRatio","Light"]
    algorithm = sidebar.selectbox("Choose Algorithm",["Linear Regression","Ridge Regression","Lasso Regression","Elastic Net"])
    st.markdown(f"### Chosen {algorithm}")
    models = {"Linear Regression":LinearRegression(), "Ridge Regression":Ridge(), "Lasso Regression":Lasso(), "Elastic Net":ElasticNet()}
    model = models[algorithm]
    
    train_df = sidebar.selectbox("Choose Train Data",data_list)
    test_df = sidebar.selectbox("Choose Test Data",[i for i in data_list if i != train_df])
    selected = sidebar.multiselect("Choose Features",features,default=features)
       
    xtrain = dfs[train_df][selected].to_numpy().reshape(-1,len(selected))
    xtest = dfs[test_df][selected].to_numpy().reshape(-1,len(selected))
    ytrain = dfs[train_df]["Occupancy"].to_numpy()
    ytest = dfs[test_df]["Occupancy"].to_numpy()

    model.fit(xtrain,ytrain)
    st.markdown(f"Train R2 Score: {model.score(xtrain,ytrain) : .2f}")
    st.markdown(f"Test R2 Score: {model.score(xtest,ytest) : .2f}")
    
    if sidebar.checkbox("Show Coefficients"):
        st.markdown("#### Showing Coefficients and Intercept")
        st.write(f"Coeffs: {model.coef_}")
        st.write(f"Intercept: {model.intercept_}")
    if sidebar.checkbox("Show Prediction"):
        st.markdown("#### Showing Prediction")
        input_values = [float((st.number_input(t))) for t in selected]
        prediction = model.predict([input_values])
        st.write(f"Predicted {prediction}")

if mode == "Classification":
    features = ["Temperature", "Humidity", "CO2", "HumidityRatio","Light"]
    algorithm = sidebar.selectbox("Choose Algorithm",["Logistic Regression","KNN","Decision Tree","Random Forest", "Ada Boost"])
    st.markdown(f"### Chosen {algorithm}")
    models = {"Logistic Regression":LogisticRegression(), "KNN":KNeighborsClassifier(), "Decision Tree":DecisionTreeClassifier(), "Random Forest":RandomForestClassifier(), "Ada Boost":AdaBoostClassifier()}
    model = models[algorithm]
    
    train_df = sidebar.selectbox("Choose Train Data",data_list)
    test_df = sidebar.selectbox("Choose Test Data",[i for i in data_list if i != train_df])
    selected = sidebar.multiselect("Choose Features",features,default=features)
       
    xtrain = dfs[train_df][selected].to_numpy().reshape(-1,len(selected))
    xtest = dfs[test_df][selected].to_numpy().reshape(-1,len(selected))
    ytrain = dfs[train_df]["Occupancy"].to_numpy()
    ytest = dfs[test_df]["Occupancy"].to_numpy()

    model.fit(xtrain,ytrain)
    # for classifiers, model.score returns mean accuracy, not R2
    st.markdown(f"##### Model Score (Mean Accuracy): Train = {model.score(xtrain,ytrain) : .2f}, Test = {model.score(xtest,ytest) : .2f}")
    
    train_pred = model.predict(xtrain)
    test_pred = model.predict(xtest)
    
    cm_train = confusion_matrix(ytrain,train_pred,labels=[0,1])
    cm_test = confusion_matrix(ytest,test_pred,labels=[0,1])
    
    train_f1 = f1_score(ytrain,train_pred,average="macro")
    test_f1 = f1_score(ytest,test_pred,average="macro")
    train_acc = accuracy_score(ytrain,train_pred)
    test_acc = accuracy_score(ytest,test_pred)
    
    st.markdown(f"##### F1 Score: Train = {train_f1 : .2f}, Test = {test_f1 : .2f}")
    st.markdown(f"##### Accuracy Score: Train = {train_acc : .2f}, Test = {test_acc : .2f}")
    
    fig = make_subplots(rows=1,cols=2, subplot_titles=["Train","Test"])
    labels = ["Vacant","Occupied"]
    fig1 = go.Heatmap(z=cm_train, y=labels, x=labels, name="Train")
    fig2 = go.Heatmap(z=cm_test, y=labels, x=labels, name="Test")
    
    fig.add_trace(fig1, row=1, col=1)
    fig.add_trace(fig2, row=1, col=2)
    fig.update_layout(title="Confusion Matrix")
    # label the axes of both subplots, not only the first one
    fig.update_xaxes(title_text="Predicted value")
    fig.update_yaxes(title_text="Real value")
    st.plotly_chart(fig)
    
    if sidebar.checkbox("Show Coefficients"):
        st.markdown("#### Showing Coefficients and Intercept")
        try:
            st.write(f"Coeffs: {model.coef_}")
            st.write(f"Intercept: {model.intercept_}")
        except AttributeError:
            st.write("Coeffs: Not Available")
    if sidebar.checkbox("Show Prediction"):
        st.markdown("#### Showing Prediction")
        input_values = [float((st.number_input(t))) for t in selected]
        prediction = model.predict([input_values])
        st.write(f"Predicted {prediction}")
    
    
    

Add Inference Mode

For this, we need to save a model during the Regression or Classification phase and then upload it in Inference mode to test it. Here is what we will do:

  • Accept a model file upload and read the number of features from it.
  • Create a form with as many input fields as there are features, to accept the input values.
  • Create a submit button that passes those input values to the loaded model and prints the result.
if mode in ["Regression", "Classification"]:    
    filename=sidebar.text_input("Enter File Name",value="model.sav")
    save = sidebar.button("Save Model") 
    if save and model is not None:
        # use a context manager so the file handle is closed after writing
        with open(filename, 'wb') as f:
            pickle.dump(model, f)
        sidebar.markdown("Saved Model!")

if mode =="Inference":
    model=None
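    # always show the uploader; fall back to the last uploaded copy when nothing new is uploaded
    file = sidebar.file_uploader("Upload Model File", accept_multiple_files=False)
    if file:
        # only unpickle files you trust: pickle can run arbitrary code on load
        model = pickle.load(file)
        with open("temp_model.sav", 'wb') as f:
            pickle.dump(model, f)
        st.markdown("Model Loaded.")
    elif "temp_model.sav" in os.listdir():
        with open("temp_model.sav", 'rb') as f:
            model = pickle.load(f)
    if model is not None: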
        st.markdown(f"Loaded {type(model).__name__} Model!")
        nfeatures="Not Known"
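        try:
            # n_features_in_ is set on scikit-learn estimators when they are fitted
            nfeatures=model.n_features_in_
        except AttributeError:
            # otherwise infer the count from the coefficient shape, if there is one
            if hasattr(model, "coef_"):
                nfeatures = model.coef_.shape[-1]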
                
        st.markdown(f"Number of features: {nfeatures}")
            
        if nfeatures!="Not Known":
            form1 = st.form(key='my_form')

            st.markdown("#### Showing Prediction")
            input_values = [float((form1.number_input(f"Feature {t} Value"))) for t in range(nfeatures)]
            
            submit= form1.form_submit_button("Predict")
            prediction=None
            if submit:
                prediction = model.predict([input_values])
            st.markdown(f"Input Values: {input_values}")
            st.write(f"Predicted {prediction}")
        

Now we should be able to see a text field to enter a file name, followed by a Save Model button, in both Regression and Classification mode.
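To see the save-and-load round trip on its own, outside Streamlit, here is a minimal sketch. It assumes a scikit-learn version where fitted estimators expose `n_features_in_`, and it uses made-up toy data and an in-memory buffer in place of the real dataset and the uploaded file, so treat the names and values as illustrative only.

```python
import io
import pickle

import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical toy data standing in for the occupancy features (not the real dataset).
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

model = LogisticRegression().fit(X, y)

# Serialize to an in-memory buffer, which behaves like an uploaded file object.
buffer = io.BytesIO()
pickle.dump(model, buffer)
buffer.seek(0)

# Load the model back; only unpickle data you trust.
loaded = pickle.load(buffer)

# Fitted scikit-learn estimators record how many features they were trained on.
n_features = loaded.n_features_in_

# One row of input values, as the inference form would collect them.
input_values = [0.5] * n_features
prediction = loaded.predict([input_values])
print(f"Features: {n_features}, Predicted: {prediction}")
```

The loaded model should give the same predictions as the one that was saved, which is what lets the Inference mode work with nothing more than the uploaded file.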

Let's save some models, load them in Inference mode, and try to predict from them. Our output should look like the one below:

Conclusion

It has been a long ride over the past few projects: we did some EDA, then explored possible ML models to predict and cluster the data. In this part, we did all of that from a web app, and we can now also make predictions from the web app by entering values into a form.

