Taking Data Apps into WebApp: Using Streamlit, Plotly, and Python

29 minute read

Introduction

In the previous two parts of this series, we followed a dataset on its journey toward giving up its insights. To recap:

We performed EDA using descriptive and inferential statistics to find strong evidence, relationships, and facts about the data.
We used some of the valuable insights from the EDA to classify the environment that the readings reflect. For example, we tried to predict the value of CO from Smoke and LPG.

In this part, we will take those experiments into a web app where we can tweak different aspects of the experiment. We will build a simple yet powerful app using Streamlit, a free Python library that lets us build data apps quickly.

Making Things Ready

Install Streamlit with pip install streamlit.
Once installed, make sure the streamlit command is available on your system path by running streamlit --version. If it prints a version, we are ready to go.
Install Plotly as well (pip install plotly), since we will build our interactive plots with it.
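
As a quick sanity check after the steps above, the following sketch reports which packages cannot be imported in the current environment. The helper name missing_packages is ours, not part of any library, and the list of package names is simply what this post uses.

```python
import importlib.util


def missing_packages(names):
    """Return the subset of `names` that cannot be imported here."""
    return [n for n in names if importlib.util.find_spec(n) is None]


# Import names used in this post; an empty list means everything is installed.
print(missing_packages(["streamlit", "plotly", "pandas"]))
```

If the printed list is not empty, install the listed packages with pip before continuing.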

Getting Data Ready

For this purpose, we will work with the Room Occupancy Detection data, which is similar to the data used in the previous parts. There are three text files in CSV format: datatraining.txt, datatest.txt, and datatest2.txt. Let's read them with Pandas and convert the date column to datetime. The Occupancy column contains a binary value (0/1), which will be our label later on.

import pandas as pd
import cufflinks
import plotly.io as pio
import warnings
import numpy as np
cufflinks.go_offline()
cufflinks.set_config_file(world_readable=True, theme='pearl')
pio.renderers.default = "notebook"
warnings.simplefilter("ignore")

train = pd.read_csv("https://github.com/LuisM78/Occupancy-detection-data/raw/master/datatraining.txt")
train["date"]=pd.to_datetime(train.date)
train

	date	Temperature	Humidity	Light	CO2	HumidityRatio	Occupancy
1	2015-02-04 17:51:00	23.18	27.2720	426.0	721.250000	0.004793	1
2	2015-02-04 17:51:59	23.15	27.2675	429.5	714.000000	0.004783	1
3	2015-02-04 17:53:00	23.15	27.2450	426.0	713.500000	0.004779	1
4	2015-02-04 17:54:00	23.15	27.2000	426.0	708.250000	0.004772	1
5	2015-02-04 17:55:00	23.10	27.2000	426.0	704.500000	0.004757	1
...	...	...	...	...	...	...	...
8139	2015-02-10 09:29:00	21.05	36.0975	433.0	787.250000	0.005579	1
8140	2015-02-10 09:29:59	21.05	35.9950	433.0	789.500000	0.005563	1
8141	2015-02-10 09:30:59	21.10	36.0950	433.0	798.500000	0.005596	1
8142	2015-02-10 09:32:00	21.10	36.2600	433.0	820.333333	0.005621	1
8143	2015-02-10 09:33:00	21.10	36.2000	447.0	821.000000	0.005612	1

test1 = pd.read_csv("https://github.com/LuisM78/Occupancy-detection-data/raw/master/datatest.txt")
test1["date"]=pd.to_datetime(test1.date)
test1

	date	Temperature	Humidity	Light	CO2	HumidityRatio	Occupancy
140	2015-02-02 14:19:00	23.700000	26.272000	585.200000	749.200000	0.004764	1
141	2015-02-02 14:19:59	23.718000	26.290000	578.400000	760.400000	0.004773	1
142	2015-02-02 14:21:00	23.730000	26.230000	572.666667	769.666667	0.004765	1
143	2015-02-02 14:22:00	23.722500	26.125000	493.750000	774.750000	0.004744	1
144	2015-02-02 14:23:00	23.754000	26.200000	488.600000	779.000000	0.004767	1
...	...	...	...	...	...	...	...
2800	2015-02-04 10:38:59	24.290000	25.700000	808.000000	1150.250000	0.004829	1
2801	2015-02-04 10:40:00	24.330000	25.736000	809.800000	1129.200000	0.004848	1
2802	2015-02-04 10:40:59	24.330000	25.700000	817.000000	1125.800000	0.004841	1
2803	2015-02-04 10:41:59	24.356667	25.700000	813.000000	1123.000000	0.004849	1
2804	2015-02-04 10:43:00	24.408333	25.681667	798.000000	1124.000000	0.004860	1

test2 = pd.read_csv("https://github.com/LuisM78/Occupancy-detection-data/raw/master/datatest2.txt")
test2["date"]=pd.to_datetime(test2.date)
test2

	date	Temperature	Humidity	Light	CO2	HumidityRatio	Occupancy
1	2015-02-11 14:48:00	21.7600	31.133333	437.333333	1029.666667	0.005021	1
2	2015-02-11 14:49:00	21.7900	31.000000	437.333333	1000.000000	0.005009	1
3	2015-02-11 14:50:00	21.7675	31.122500	434.000000	1003.750000	0.005022	1
4	2015-02-11 14:51:00	21.7675	31.122500	439.000000	1009.500000	0.005022	1
5	2015-02-11 14:51:59	21.7900	31.133333	437.333333	1005.666667	0.005030	1
...	...	...	...	...	...	...	...
9748	2015-02-18 09:15:00	20.8150	27.717500	429.750000	1505.250000	0.004213	1
9749	2015-02-18 09:16:00	20.8650	27.745000	423.500000	1514.500000	0.004230	1
9750	2015-02-18 09:16:59	20.8900	27.745000	423.500000	1521.500000	0.004237	1
9751	2015-02-18 09:17:59	20.8900	28.022500	418.750000	1632.000000	0.004279	1
9752	2015-02-18 09:19:00	21.0000	28.100000	409.000000	1864.000000	0.004321	1
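
The three reads above repeat the same two steps, so they could be wrapped in a small helper. The sketch below is one way to do it; load_occupancy is a name we made up, and the in-memory sample (with its quoted header and leading row number) is our assumption about the file layout, used here so that the example runs without a network connection.

```python
import io

import pandas as pd


def load_occupancy(source):
    """Read one occupancy file (path, URL or file-like) and parse its date column."""
    df = pd.read_csv(source)
    df["date"] = pd.to_datetime(df["date"])
    return df


# Tiny stand-in for datatraining.txt: two rows taken from the table above.
# Each row has one more field than the header, so pandas uses the first
# field as the index.
sample = io.StringIO(
    '"date","Temperature","Humidity","Light","CO2","HumidityRatio","Occupancy"\n'
    '"1","2015-02-04 17:51:00",23.18,27.272,426,721.25,0.004793,1\n'
    '"2","2015-02-04 17:51:59",23.15,27.2675,429.5,714,0.004783,1\n'
)
demo = load_occupancy(sample)
print(demo.dtypes)
```

The same function accepts any of the three URLs used above in place of the sample.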

Let's look over the data and take any necessary action.

test2.date.describe()

count                    9752
unique                   9752
top       2015-02-15 15:04:59
freq                        1
first     2015-02-11 14:48:00
last      2015-02-18 09:19:00
Name: date, dtype: object

test1.date.describe()

count                    2665
unique                   2665
top       2015-02-03 14:45:59
freq                        1
first     2015-02-02 14:19:00
last      2015-02-04 10:43:00
Name: date, dtype: object

train.date.describe()

count                    8143
unique                   8143
top       2015-02-07 20:26:59
freq                        1
first     2015-02-04 17:51:00
last      2015-02-10 09:33:00
Name: date, dtype: object

Looking at the dates in each dataframe, train covers 4 to 10 February 2015, test1 covers 2 to 4 February, and test2 covers 11 to 18 February. Since train and test2 are consecutive in time, concatenating them might be the best idea, but let's explore that later.
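
To double-check that the three periods do not overlap, we can compare their first and last timestamps directly. This is a small sketch: spans_overlap is a hypothetical helper, and the timestamps are copied from the describe() outputs above.

```python
import pandas as pd


def spans_overlap(a, b):
    """True if the [min, max] ranges of two datetime Series intersect."""
    return a.min() <= b.max() and b.min() <= a.max()


# First and last timestamps reported by describe() for each dataframe.
train_dates = pd.Series(pd.to_datetime(["2015-02-04 17:51:00", "2015-02-10 09:33:00"]))
test1_dates = pd.Series(pd.to_datetime(["2015-02-02 14:19:00", "2015-02-04 10:43:00"]))
test2_dates = pd.Series(pd.to_datetime(["2015-02-11 14:48:00", "2015-02-18 09:19:00"]))

print(spans_overlap(train_dates, test1_dates))
print(spans_overlap(train_dates, test2_dates))
```

With the real dataframes, train.date and test2.date can be passed in the same way.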

Exploratory Data Analysis

Missing Values

dfs = {"train":train,"test1":test1,"test2":test2}
for df in dfs.values():
    total = df.isnull().sum().sort_values(ascending = False)
    percent = (df.isnull().sum()/df.isnull().count()).sort_values(ascending = False)
    mdf = pd.concat([total, percent], axis=1, keys=['Total', 'Percent'])
    mdf = mdf.reset_index()
    print(mdf)

           index  Total  Percent
         date      0      0.0
  Temperature      0      0.0
     Humidity      0      0.0
        Light      0      0.0
          CO2      0      0.0
HumidityRatio      0      0.0
    Occupancy      0      0.0
           index  Total  Percent
         date      0      0.0
  Temperature      0      0.0
     Humidity      0      0.0
        Light      0      0.0
          CO2      0      0.0
HumidityRatio      0      0.0
    Occupancy      0      0.0
           index  Total  Percent
         date      0      0.0
  Temperature      0      0.0
     Humidity      0      0.0
        Light      0      0.0
          CO2      0      0.0
HumidityRatio      0      0.0
    Occupancy      0      0.0

There are no missing values in any column. Let's look at the distribution of each column.
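
Because our files have nothing missing, the table above is all zeros. To see what the same summary looks like when values are missing, here is a sketch on a toy dataframe. The function name missing_summary and the toy values are ours; the Percent column keeps the name used above, although it is a fraction between 0 and 1.

```python
import numpy as np
import pandas as pd


def missing_summary(df):
    """Count and fraction of missing values per column, largest first."""
    total = df.isnull().sum()
    summary = pd.DataFrame({"Total": total, "Percent": total / len(df)})
    return summary.sort_values("Total", ascending=False)


# Toy data: two of the four Light readings are missing.
toy = pd.DataFrame({
    "Light": [426.0, np.nan, 433.0, np.nan],
    "CO2": [721.25, 714.0, 713.5, 708.25],
})
summary = missing_summary(toy)
print(summary)
```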

Summary of Each Variable

To compare the distribution of values across the dataframes, we will plot box plots and histograms side by side. Please ignore the import time and fig.write_image(..) parts.

from plotly.subplots import make_subplots
import plotly.graph_objects as go
import time

titles = list(dfs.keys())

for c in train.columns:
    if c!="date":
        fig = make_subplots(rows=2,cols=3, subplot_titles=titles, )
        fig.add_trace(go.Box(y=train[c].tolist(), name=titles[0]), row=1, col=1)
        fig.add_trace(go.Box(y=test1[c].tolist(), name = titles[1]), row=1, col=2)
        fig.add_trace(go.Box(y=test2[c].tolist(), name = titles[2]), row=1, col=3)
        
        fig.add_trace(go.Histogram(y=train[c].tolist(), name=titles[0]), row=2, col=1)
        fig.add_trace(go.Histogram(y=test1[c].tolist(), name = titles[1]), row=2, col=2)
        fig.add_trace(go.Histogram(y=test2[c].tolist(), name = titles[2]), row=2, col=3)
        fig.update_layout(height=600, width=800, title_text=f"Box and Distribution of {c}")
        fig.show()
        fig.write_image(f"summary_{c}.png")

Looking at the histograms and box plots of the different columns, we can see that the descriptive properties of the three dataframes are not identical. We might therefore need some kind of data transformation if our model does not perform well.
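
One common transformation for this situation is to standardize every dataframe with the mean and standard deviation of the training data. We do not use it in this post; the sketch below only illustrates the idea, with a made-up helper name and toy numbers.

```python
import pandas as pd


def standardize_like_train(train_col, other_col):
    """Z-score `other_col` using the mean and std of `train_col`."""
    return (other_col - train_col.mean()) / train_col.std()


# Toy columns: the training values have mean 2.0 and sample std 1.0.
toy_train = pd.Series([1.0, 2.0, 3.0])
toy_test = pd.Series([2.0, 4.0])
scaled = standardize_like_train(toy_train, toy_test)
print(scaled.tolist())
```

Using the training statistics for every dataframe keeps the test data on the same scale the model saw during training.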

Correlation

Let's check whether the correlations between variables are the same across the dataframes. If they are, we are on the bright side.

fig = make_subplots(rows=1,cols=3, subplot_titles=titles)
fig.add_trace(go.Heatmap(z=train.corr(), y=train.corr().columns,x=train.corr().index, name=titles[0]), row=1, col=1)
fig.add_trace(go.Heatmap(z=test1.corr(), x=train.corr().index, name = titles[1]), row=1, col=2)
fig.add_trace(go.Heatmap(z=test2.corr(), x=train.corr().index, name = titles[2]), row=1, col=3)
fig.show()

fig.write_image(f"corr.png")

The correlations look very similar across all 3 dataframes.
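
Judging heatmaps by eye is subjective, so a single number can help. The sketch below computes the largest absolute difference between two correlation matrices. max_corr_gap is a hypothetical helper, and the toy dataframes only demonstrate its two extremes.

```python
import pandas as pd


def max_corr_gap(df_a, df_b):
    """Largest absolute difference between the correlation matrices of two dataframes."""
    a = df_a.select_dtypes(include="number").corr()
    b = df_b.select_dtypes(include="number").corr()
    return float((a - b).abs().to_numpy().max())


# Toy data: y moves with x in `same`, and against x in `opposite`.
same = pd.DataFrame({"x": [1.0, 2.0, 3.0, 4.0], "y": [2.0, 4.0, 6.0, 8.0]})
opposite = pd.DataFrame({"x": [1.0, 2.0, 3.0, 4.0], "y": [8.0, 6.0, 4.0, 2.0]})

print(max_corr_gap(same, same * 2))   # rescaling does not change correlation
print(max_corr_gap(same, opposite))   # correlation flips from +1 to -1
```

With the real data, max_corr_gap(train, test1) close to 0 would back up what the heatmaps suggest.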

Taking EDA to Streamlit App

Please create a project folder and, inside it, a Python file named main.py. This will be our main file: it will hold all these plots and take our analysis into the web app. The app can then be started with streamlit run main.py.

First Streamlit App

We will read our files and cache the result so that we won't have to read them again every time the app reruns.
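
To see in isolation what a cache buys us, here is a sketch that uses functools.lru_cache from the standard library as a stand-in. It is not how Streamlit implements its cache; it only shows the effect that the body of a cached function runs once, however often it is called. The names get_data_stub and calls are made up for this illustration.

```python
from functools import lru_cache

calls = {"count": 0}


@lru_cache(maxsize=None)
def get_data_stub():
    """Pretend to be an expensive load and count how often the body really runs."""
    calls["count"] += 1
    return ("train", "test1", "test2")


get_data_stub()
get_data_stub()  # served from the cache, the body does not run again
print(calls["count"])
```

The app code below applies the same idea through Streamlit's own decorator.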

import streamlit as st
import numpy as np
import pandas as pd
import cufflinks

@st.cache
def get_data():
    train = pd.read_csv("https://github.com/LuisM78/Occupancy-detection-data/raw/master/datatraining.txt")
    train["date"]=pd.to_datetime(train.date)
    
    test1 = pd.read_csv("https://github.com/LuisM78/Occupancy-detection-data/raw/master/datatest.txt")
    test1["date"]=pd.to_datetime(test1.date)
    
    test2 = pd.read_csv("https://github.com/LuisM78/Occupancy-detection-data/raw/master/datatest2.txt")
    test2["date"]=pd.to_datetime(test2.date)
    
    dfs = {"train":train,"test1":test1,"test2":test2}
    return dfs

dfs = get_data()
st.dataframe(dfs["train"])

In the code above, we read the 3 files, put them in a dictionary named dfs, and returned it. The @st.cache decorator caches the returned value so that the data is not reloaded whenever the app reruns. (Newer Streamlit releases deprecate st.cache in favor of st.cache_data.) Then we display the dataframe in the app. The app should look like below:

For the next step, we will add a few select boxes and then the analysis parts.

import streamlit as st
import numpy as np
import pandas as pd
import cufflinks
from plotly.subplots import make_subplots
import plotly.graph_objects as go

@st.cache
def get_data():
    train = pd.read_csv("https://github.com/LuisM78/Occupancy-detection-data/raw/master/datatraining.txt")
    train["date"]=pd.to_datetime(train.date)
    
    test1 = pd.read_csv("https://github.com/LuisM78/Occupancy-detection-data/raw/master/datatest.txt")
    test1["date"]=pd.to_datetime(test1.date)
    
    test2 = pd.read_csv("https://github.com/LuisM78/Occupancy-detection-data/raw/master/datatest2.txt")
    test2["date"]=pd.to_datetime(test2.date)
    
    dfs = {"train":train,"test1":test1,"test2":test2}
    return dfs

dfs = get_data()
sidebar = st.sidebar


# select modes, EDA, Clustering, Regression and Classification
mode = sidebar.selectbox("Select a mode.",options=["EDA", "Clustering", "Regression", "Classification"])

# If selected EDA, show EDA related plots
if mode=="EDA":
    
    # if selected show the data
    show_data = sidebar.checkbox("Show data")
    if show_data:
        # if selected, show train data
        if sidebar.checkbox("Show Train data"):
            st.markdown("### Train Data")
            st.dataframe(dfs["train"])
        
        # if selected, show test1 data
        if sidebar.checkbox("Show Test1 data"):
            st.markdown("### Test1 Data")
            st.dataframe(dfs["test1"])
            
        # if selected, show test2 data
        if sidebar.checkbox("Show Test2 data"):
            st.markdown("### Test2 Data")
            st.dataframe(dfs["test2"])
    
    # if selected, show the comparison data
    show_comparison = sidebar.checkbox("Show comparison")
    if show_comparison:
        
        # make a multiselect to select the columns to compare
        selected = sidebar.multiselect("Select Columns ", [d for d in dfs["train"].columns if d not in ["date"]])
        
        
        titles=list(dfs.keys())
        train = dfs["train"]
        test1 = dfs["test1"]
        test2 = dfs["test2"]
        
        if selected:
            st.markdown(f"### Selected Columns: {', '.join(selected)}")
            
            for c in selected:
                fig = make_subplots(rows=2,cols=3, subplot_titles=titles, )
                fig.add_trace(go.Box(y=train[c].tolist(), name=titles[0]), row=1, col=1)
                fig.add_trace(go.Box(y=test1[c].tolist(), name = titles[1]), row=1, col=2)
                fig.add_trace(go.Box(y=test2[c].tolist(), name = titles[2]), row=1, col=3)
                
                fig.add_trace(go.Histogram(y=train[c].tolist(), name=titles[0]), row=2, col=1)
                fig.add_trace(go.Histogram(y=test1[c].tolist(), name = titles[1]), row=2, col=2)
                fig.add_trace(go.Histogram(y=test2[c].tolist(), name = titles[2]), row=2, col=3)
                fig.update_layout(height=600, width=800, title_text=f"Box and Distribution of {c}")
                st.plotly_chart(fig)
        
        # if selected show correlation
        show_corr = sidebar.checkbox("Show Correlation")
        if show_corr:
            st.markdown("### Correlation")
            fig = make_subplots(rows=1,cols=3, subplot_titles=titles)
            fig.add_trace(go.Heatmap(z=train.corr(), y=train.corr().columns,x=train.corr().index, name=titles[0]), row=1, col=1)
            fig.add_trace(go.Heatmap(z=test1.corr(), x=train.corr().index, name = titles[1]), row=1, col=2)
            fig.add_trace(go.Heatmap(z=test2.corr(), x=train.corr().index, name = titles[2]), row=1, col=3)
            st.plotly_chart(fig)

In the code above, we have moved everything we did in the EDA into a web app. Comments are placed above the parts of the code that need explanation. Now our app looks like below:

Clustering

Now we want to cluster our data based on the features we have. We already know that there are two classes in the data (occupied or not), but let's see whether any clusters show up on their own.

KMeans Clustering

Let's first cluster on the default features and see how it performs on the train dataframe.

from sklearn.cluster import KMeans


clusters = 5
features = ["Temperature", "Humidity", "CO2", "HumidityRatio","Light"][:-2]

inertias = []

for c in range(2,clusters+1):
    tdf = train.copy()
    X = tdf[features].to_numpy()
    
    colors=['red','green','blue','magenta','black','yellow']
    model = KMeans(n_clusters=c)
    model.fit(X)
    y_kmeans = model.predict(X)
    tdf["cluster"] = y_kmeans
    inertias.append((c,model.inertia_))
    
    trace0 = go.Scatter(x=tdf[features[0]],y=tdf[features[1]],mode='markers',  marker=dict(
        color=tdf.cluster.apply(lambda x: colors[x]),
        colorscale='Viridis',
        showscale=True
    ),name="Cluster Points")
    trace1 = go.Scatter(x=model.cluster_centers_[:, 0], y=model.cluster_centers_[:, 1],mode='markers', marker=dict(
        color=colors,
        size=20,
        showscale=True
    ),name="Cluster Mean")
        
    data7 = go.Data([trace0, trace1])
    fig = go.Figure(data=data7)
    fig.update_layout(title=f"Cluster Size {c}")
    fig.show()
    fig.write_image(f"kmeans_{c}.png")

inertias=np.array(inertias).reshape(-1,2)
performance = go.Scatter(x=inertias[:,0], y=inertias[:,1])
layout = go.Layout(
    title="Cluster Number vs Inertia",
    xaxis=dict(
        title="Ks"
    ),
    yaxis=dict(
        title="Inertia"
    ) ) 
fig=go.Figure(data=go.Data([performance]))
fig.update_layout(layout)
fig.show()
fig.write_image(f"kmeans_cvi{c}.png")

Looking at the inertia plot, the inertia decreases only slowly beyond k = 3. However, we already know that the data comes from two occupancy classes. The cluster plots do not look great because we cluster on multiple features while the plot is only 2D. Now let's try dimensionality reduction and see how it performs.
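
Since we lean on inertia to judge the clusters, it helps to see what the number means. For KMeans, inertia is the sum of squared distances from each point to its nearest cluster centre. The sketch below recomputes it by hand on synthetic blobs (not our occupancy data) and compares it with model.inertia_. The n_init and random_state values are arbitrary choices for reproducibility.

```python
import numpy as np
from sklearn.cluster import KMeans

# Two well-separated synthetic blobs in 2D.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.5, (50, 2)), rng.normal(5, 0.5, (50, 2))])

model = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

# Squared distance from every point to every centre, then keep the nearest.
d2 = ((X[:, None, :] - model.cluster_centers_[None, :, :]) ** 2).sum(axis=2)
manual_inertia = d2.min(axis=1).sum()

print(manual_inertia, model.inertia_)
```

Adding clusters can only bring points closer to some centre, which is why the curve keeps falling and we look for the point where it flattens.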

PCA for Dimensionality Reduction

PCA projects high-dimensional data onto a smaller number of uncorrelated components that retain most of the variance.

from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
import matplotlib.pyplot as plt

X = train[features]
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)


# Create a PCA instance: pca
pca = PCA(n_components=3)
principalComponents = pca.fit_transform(X_scaled)

# Plot the explained variances
feat = range(pca.n_components_)
plt.bar(feat, pca.explained_variance_ratio_, color='black')
plt.xlabel('PCA features')
plt.ylabel('variance %')
plt.xticks(feat)

# Save components to a DataFrame
PCA_components = pd.DataFrame(principalComponents, columns=list(range(len(feat))))
plt.show()

plt.scatter(PCA_components[1], PCA_components[2], alpha=.1, color='black')
plt.xlabel('PCA 1')
plt.ylabel('PCA 2')
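
The bar chart plots pca.explained_variance_ratio_, the share of total variance carried by each component. To make that quantity concrete, the sketch below computes the same kind of ratio with plain NumPy on synthetic data (not our occupancy data): standardize the columns, take the singular values, square them, and normalize. The function name is ours.

```python
import numpy as np


def explained_variance_ratio(X):
    """Standardize columns, then return each principal component's share of variance."""
    Z = (X - X.mean(axis=0)) / X.std(axis=0)
    s = np.linalg.svd(Z, compute_uv=False)  # singular values, largest first
    var = s ** 2
    return var / var.sum()


# Synthetic features: the second column nearly duplicates the first,
# the third is independent noise.
rng = np.random.default_rng(0)
a = rng.normal(size=500)
X_demo = np.column_stack([a, a + rng.normal(scale=0.05, size=500), rng.normal(size=500)])

ratios = explained_variance_ratio(X_demo)
print(ratios.round(3))
```

Because two of the three columns carry almost the same information, the last component explains close to nothing, and dropping it loses very little.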

Looking at the plot of the components, we can see some kind of clustering. So let's run the clustering on them now.

clusters = 5
features = ["Temperature", "Humidity", "CO2", "HumidityRatio","Light"][:-2]

inertias = []

for c in range(2,clusters+1):
    X = PCA_components[[1,2]]
    
    colors=['red','green','blue','magenta','black','yellow']
    model = KMeans(n_clusters=c)
    model.fit(X)
    y_kmeans = model.predict(X)
    tdf["cluster"] = y_kmeans
    inertias.append((c,model.inertia_))
    
    trace0 = go.Scatter(x=X[1],y=X[2],mode='markers',  marker=dict(
        color=tdf.cluster.apply(lambda x: colors[x]),
        colorscale='Viridis',
        showscale=True
    ),name="Cluster Points")
    trace1 = go.Scatter(x=model.cluster_centers_[:, 0], y=model.cluster_centers_[:, 1],mode='markers', marker=dict(
        color=colors,
        size=20,
        showscale=True
    ),name="Cluster Mean")
        
    data7 = go.Data([trace0, trace1])
    fig = go.Figure(data=data7)
    fig.update_layout(title=f"Cluster Size {c}")
    fig.show()
    fig.write_image(f"pca_kmeans_{c}1.png")

inertias=np.array(inertias).reshape(-1,2)
performance = go.Scatter(x=inertias[:,0], y=inertias[:,1])
layout = go.Layout(
    title="Cluster Number vs Inertia",
    xaxis=dict(
        title="Ks"
    ),
    yaxis=dict(
        title="Inertia"
    ) ) 
fig=go.Figure(data=go.Data([performance]))
fig.update_layout(layout)
fig.show()
fig.write_image(f"pca_kmeans_cvi.png")

Now the clusters are much easier to see, and we can go with 2 clusters. Before adding this to the web app, let's try KMedoids first.

KMedoids Clustering

We already covered the theory in the previous part, so here we will just import KMedoids from sklearn_extra and use it the same way as before.

from sklearn_extra.cluster import KMedoids


clusters = 5
features = ["Temperature", "Humidity", "CO2", "HumidityRatio","Light"][:-2]

inertias = []

for c in range(2,clusters+1):
    X = PCA_components[[1,2]]
    
    colors=['red','green','blue','magenta','black','yellow']
    model = KMedoids(n_clusters=c)
    model.fit(X)
    y_kmeans = model.predict(X)
    tdf["cluster"] = y_kmeans
    inertias.append((c,model.inertia_))
    
    trace0 = go.Scatter(x=X[1],y=X[2],mode='markers',  marker=dict(
        color=tdf.cluster.apply(lambda x: colors[x]),
        colorscale='Viridis',
        showscale=True
    ),name="Cluster Points")
    trace1 = go.Scatter(x=model.cluster_centers_[:, 0], y=model.cluster_centers_[:, 1],mode='markers', marker=dict(
        color=colors,
        size=20,
        showscale=True
    ),name="Cluster Mean")
        
    data7 = go.Data([trace0, trace1])
    fig = go.Figure(data=data7)
    fig.update_layout(title=f"Cluster Size {c}")
    fig.show()
    fig.write_image(f"kmedoids_{c}.png")

inertias=np.array(inertias).reshape(-1,2)
performance = go.Scatter(x=inertias[:,0], y=inertias[:,1])
layout = go.Layout(
    title="Cluster Number vs Inertia",
    xaxis=dict(
        title="Ks"
    ),
    yaxis=dict(
        title="Inertia"
    ) ) 
fig=go.Figure(data=go.Data([performance]))
fig.update_layout(layout)
fig.show()
fig.write_image(f"kmedoids_kvi.png")

Let's add this to the Streamlit app now.

Taking Clustering to Streamlit App

Since we have already made a select box for the mode, we will put all of the clustering code inside the Clustering branch.

from sklearn_extra.cluster import KMedoids
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

if mode=="Clustering":
    features = ["Temperature", "Humidity", "CO2", "HumidityRatio","Light"][:-2]
    st.markdown("## Clustering Mode Selected")
    st.markdown("---")  # horizontal rule
    
    # select a  clustering algorithm
    calg = sidebar.selectbox("Select a clustering algorithm", ["K-Medoids","K-Means"])
    
    # select number of clusters
    ks = sidebar.slider("Select number of clusters", min_value=2, max_value=6, value=2)  # at most 6: the colors list below has six entries
    
    # select a dataframe to apply cluster on
    data_type = sidebar.selectbox("Select a dataframe:", ["Train","Test1","Test2"])
    st.markdown(f"## Dataframe selected {data_type}")
    udf = dfs[data_type.lower()]
    
    # if selected kmedoids, do respective operations
    if calg == "K-Medoids":  
        st.markdown("### K-Medoids Clustering")      
        
        # if using PCA or not
        use_pca = sidebar.radio("Use PCA?",["Yes","No"])
        # if not using pca, do default clustering
        if use_pca=="No":
            st.markdown("### Not Using PCA")
            inertias = []
            for c in range(1,ks+1):
                tdf = udf.copy()
                X = tdf[features]                
                colors=['red','green','blue','magenta','black','yellow']
                model = KMedoids(n_clusters=c)
                model.fit(X)
                y_kmeans = model.predict(X)
                tdf["cluster"] = y_kmeans
                inertias.append((c,model.inertia_))
                
                trace0 = go.Scatter(x=tdf[features[0]],y=tdf[features[1]],mode='markers',  marker=dict(
                    color=tdf.cluster.apply(lambda x: colors[x]),
                    colorscale='Viridis',
                    showscale=True
                ),name="Cluster Points")
                trace1 = go.Scatter(x=model.cluster_centers_[:, 0], y=model.cluster_centers_[:, 1],mode='markers', marker=dict(
                    color=colors,
                    size=20,
                    showscale=True
                ),name="Cluster Mean")
                    
                data7 = go.Data([trace0, trace1])
                fig = go.Figure(data=data7)
                fig.update_layout(height=600, width=800, title=f"Cluster Size {c}")
                st.plotly_chart(fig)

            inertias=np.array(inertias).reshape(-1,2)
            performance = go.Scatter(x=inertias[:,0], y=inertias[:,1])
            layout = go.Layout(
                title="Cluster Number vs Inertia",
                xaxis=dict(
                    title="Ks"
                ),
                yaxis=dict(
                    title="Inertia"
                ) ) 
            fig=go.Figure(data=go.Data([performance]))
            fig.update_layout(layout)
            st.plotly_chart(fig)
        
        # if using pca, use pca to reduce dimensionality and then do clustering    
        if use_pca=="Yes":
            st.markdown("### Using PCA")
            tdf=udf.copy()
            
            X = udf[features]
            scaler = StandardScaler()
            X_scaled = scaler.fit_transform(X)
            
            pca = PCA(n_components=3)
            principalComponents = pca.fit_transform(X_scaled)
            feat = list(range(pca.n_components_))
            PCA_components = pd.DataFrame(principalComponents, columns=list(range(len(feat))))
            choosed_component = sidebar.multiselect("Choose Components",feat,default=[1,2])
            choosed_component=[int(i) for i in choosed_component]
            inertias = []
            for c in range(1,ks+1):
                X = PCA_components[choosed_component]
                
                colors=['red','green','blue','magenta','black','yellow']
                model = KMedoids(n_clusters=c)
                model.fit(X)
                y_kmeans = model.predict(X)
                tdf["cluster"] = y_kmeans
                inertias.append((c,model.inertia_))
                
                trace0 = go.Scatter(x=X[choosed_component[0]],y=X[choosed_component[1]],mode='markers',  marker=dict(
                    color=tdf.cluster.apply(lambda x: colors[x]),
                    colorscale='Viridis',
                    showscale=True
                ),name="Cluster Points")
                trace1 = go.Scatter(x=model.cluster_centers_[:, 0], y=model.cluster_centers_[:, 1],mode='markers', marker=dict(
                    color=colors,
                    size=20,
                    showscale=True
                ),name="Cluster Mean")
                    
                data7 = go.Data([trace0, trace1])
                fig = go.Figure(data=data7)
                fig.update_layout(title=f"Cluster Size {c}")
                st.plotly_chart(fig)

            inertias=np.array(inertias).reshape(-1,2)
            performance = go.Scatter(x=inertias[:,0], y=inertias[:,1])
            layout = go.Layout(
                title="Cluster Number vs Inertia",
                xaxis=dict(
                    title="Ks"
                ),
                yaxis=dict(
                    title="Inertia"
                ) ) 
            fig=go.Figure(data=go.Data([performance]))
            fig.update_layout(layout)
            st.plotly_chart(fig)
    # if chosen KMeans, do respective operations
    if calg == "K-Means":
        st.markdown("### K-Means Clustering")        
        use_pca = sidebar.radio("Use PCA?",["Yes","No"])
        if use_pca=="No":
            st.markdown("### Not Using PCA")
            inertias = []
            for c in range(1,ks+1):
                tdf = udf.copy()
                X = tdf[features]                
                colors=['red','green','blue','magenta','black','yellow']
                model = KMeans(n_clusters=c)
                model.fit(X)
                y_kmeans = model.predict(X)
                tdf["cluster"] = y_kmeans
                inertias.append((c,model.inertia_))
                
                trace0 = go.Scatter(x=tdf[features[0]],y=tdf[features[1]],mode='markers',  marker=dict(
                    color=tdf.cluster.apply(lambda x: colors[x]),
                    colorscale='Viridis',
                    showscale=True
                ),name="Cluster Points")
                trace1 = go.Scatter(x=model.cluster_centers_[:, 0], y=model.cluster_centers_[:, 1],mode='markers', marker=dict(
                    color=colors,
                    size=20,
                    showscale=True
                ),name="Cluster Mean")
                    
                data7 = go.Data([trace0, trace1])
                fig = go.Figure(data=data7)
                fig.update_layout(height=600, width=800, title=f"Cluster Size {c}")
                st.plotly_chart(fig)

            inertias=np.array(inertias).reshape(-1,2)
            performance = go.Scatter(x=inertias[:,0], y=inertias[:,1])
            layout = go.Layout(
                title="Cluster Number vs Inertia",
                xaxis=dict(
                    title="Ks"
                ),
                yaxis=dict(
                    title="Inertia"
                ) ) 
            fig=go.Figure(data=go.Data([performance]))
            fig.update_layout(layout)
            st.plotly_chart(fig)
            
        if use_pca=="Yes":
            st.markdown("### Using PCA")
            tdf=udf.copy()
            
            X = udf[features]
            scaler = StandardScaler()
            X_scaled = scaler.fit_transform(X)
            
            pca = PCA(n_components=3)
            principalComponents = pca.fit_transform(X_scaled)
            feat = list(range(pca.n_components_))
            PCA_components = pd.DataFrame(principalComponents, columns=list(range(len(feat))))
            choosed_component = sidebar.multiselect("Choose Components",feat,default=[1,2])
            choosed_component=[int(i) for i in choosed_component]
            inertias = []
            for c in range(1,ks+1):
                X = PCA_components[choosed_component]
                
                colors=['red','green','blue','magenta','black','yellow']
                model = KMeans(n_clusters=c)
                model.fit(X)
                y_kmeans = model.predict(X)
                tdf["cluster"] = y_kmeans
                inertias.append((c,model.inertia_))
                
                trace0 = go.Scatter(x=X[choosed_component[0]],y=X[choosed_component[1]],mode='markers',  marker=dict(
                    color=tdf.cluster.apply(lambda x: colors[x]),
                    colorscale='Viridis',
                    showscale=True
                ),name="Cluster Points")
                trace1 = go.Scatter(x=model.cluster_centers_[:, 0], y=model.cluster_centers_[:, 1],mode='markers', marker=dict(
                    color=colors,
                    size=20,
                    showscale=True
                ),name="Cluster Mean")
                    
                data7 = go.Data([trace0, trace1])
                fig = go.Figure(data=data7)
                fig.update_layout(title=f"Cluster Size {c}")
                st.plotly_chart(fig)

            inertias=np.array(inertias).reshape(-1,2)
            performance = go.Scatter(x=inertias[:,0], y=inertias[:,1])
            layout = go.Layout(
                title="Cluster Number vs Inertia",
                xaxis=dict(
                    title="Ks"
                ),
                yaxis=dict(
                    title="Inertia"
                ) ) 
            fig=go.Figure(data=go.Data([performance]))
            fig.update_layout(layout)
            st.plotly_chart(fig)          
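
A note on the colors list used in every branch above: it has six entries, so a cluster label of 6 or more would raise an IndexError when used as an index. One way to stay safe for any number of clusters is to wrap around the palette. The sketch below shows the idea; color_for is a hypothetical helper, not something the app code defines.

```python
colors = ['red', 'green', 'blue', 'magenta', 'black', 'yellow']


def color_for(label, palette=colors):
    """Map a cluster label to a colour, wrapping around when labels outnumber colours."""
    return palette[label % len(palette)]


print([color_for(k) for k in range(8)])
```

In the app, tdf.cluster.apply(color_for) could then replace the lambda that indexes colors directly, at the cost of some clusters sharing a colour.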
            

Please refer to the comments for an explanation of the code above. The web app should look something like below:

Now we will move on to the Regression part and implement it in our app.

Regression

In this part, we will fit a linear regression that tries to predict Occupancy from the other features. We evaluate it with model.score, which returns the R2 score for regressors.

from sklearn.linear_model import LinearRegression

model = LinearRegression()
features = ["Temperature", "Humidity", "CO2", "HumidityRatio","Light"]
dfs['btrain'] = pd.concat([train,test2])
xtest = dfs["test1"][features].to_numpy()
ytest = dfs["test1"]["Occupancy"]

for ddn,d in dfs.items():
    if ddn!="test1":
        print(ddn)
        X = d[features].to_numpy().reshape(-1,len(features))
        y = d["Occupancy"]

        model.fit(X,y)
        print(f"Model R2: {model.score(X,y)}")
        print(f"Test R2: {model.score(xtest,ytest)}")
    

train
Model R2: 0.8580749633459134
Test R2: 0.8714317856126421
test2
Model R2: 0.8952863420051961
Test R2: 0.8658567155646273
btrain
Model R2: 0.8693410187120196
Test R2: 0.8649947193268359

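Looking at the results above, all three training sets give similar scores. test2 has the highest training R2 (0.895), while the model trained on train has the highest test R2 (0.871), slightly ahead of btrain (0.865).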
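
For reference, the R2 score reported by model.score is 1 minus the ratio of the residual sum of squares to the total sum of squares. A minimal sketch with NumPy, using a made-up function name and toy labels, shows the two landmark values:

```python
import numpy as np


def r2_score_manual(y_true, y_pred):
    """R2 = 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = ((y_true - y_pred) ** 2).sum()
    ss_tot = ((y_true - y_true.mean()) ** 2).sum()
    return 1.0 - ss_res / ss_tot


y_toy = [0, 0, 1, 1]
print(r2_score_manual(y_toy, y_toy))          # perfect prediction -> 1.0
print(r2_score_manual(y_toy, [0.5] * 4))      # always predicting the mean -> 0.0
```

So a score around 0.87 means the model removes most, but not all, of the error we would make by always predicting the average occupancy.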

features = ["Temperature", "Humidity", "CO2", "HumidityRatio","Light"]
for ddn,d in dfs.items():
    # fit on every dataframe in dfs: train, test1, test2 and btrain
    print(ddn)
    X = d[features].to_numpy().reshape(-1,len(features))
    y = d["Occupancy"]
    scaler = StandardScaler()
    X_scaled = scaler.fit_transform(X)

    # Create a PCA instance: pca
    pca = PCA(n_components=3)
    principalComponents = pca.fit_transform(X_scaled)
    feat = range(pca.n_components_)
    PCA_components = pd.DataFrame(principalComponents, columns=list(feat))

    model=LinearRegression()
    model.fit(PCA_components.to_numpy().reshape(-1,len(feat)),y)

    print(f"Model R2: {model.score(PCA_components.to_numpy().reshape(-1,len(feat)),y)}")
    #print(f"Test R2: {model.score(xtest,ytest)}")

train
Model R2: 0.8533646108336054
test1
Model R2: 0.8786596706469451
test2
Model R2: 0.6389469109358212
btrain
Model R2: 0.6305745433548783

It seems that our best model comes from the default linear regression, but let's still take PCA into the Streamlit app.
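
The PCA loop above leaves the test score commented out, presumably because xtest was never scaled and projected in the same way as the training data. One possible way around that is a scikit-learn pipeline, which applies the fitted scaler and PCA to any new data before scoring. The sketch below uses synthetic data, and make_fake_split is a hypothetical helper standing in for the real data frames.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)

def make_fake_split(n_rows):
    """Hypothetical stand-in for one occupancy data frame: 5 features, 0/1 target."""
    X = rng.normal(size=(n_rows, 5))
    y = (X[:, 0] + X[:, 1] > 0).astype(float)
    return X, y

X_train, y_train = make_fake_split(300)
X_test, y_test = make_fake_split(100)

# Scale -> project to 3 components -> regress, all fitted on the training data only.
pca_model = make_pipeline(StandardScaler(), PCA(n_components=3), LinearRegression())
pca_model.fit(X_train, y_train)

# The same fitted scaler and PCA are reused for the test data.
train_r2 = pca_model.score(X_train, y_train)
test_r2 = pca_model.score(X_test, y_test)
projected = pca_model[:-1].transform(X_test)  # test rows in PCA space

print(f"Train R2: {train_r2:.3f}, Test R2: {test_r2:.3f}")
print(f"Projected test shape: {projected.shape}")
```

Because the scaler and PCA are fitted only on the training data, the test score stays an honest estimate.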

Taking Regression to Streamlit App

from sklearn.linear_model import LinearRegression, LogisticRegression, Ridge, Lasso, ElasticNet


if mode == "Regression":
    features = ["Temperature", "Humidity", "CO2", "HumidityRatio","Light"]
    algorithm = sidebar.selectbox("Choose Algorithm",["Linear Regression","Ridge Regression","Lasso Regression","Elastic Net"])
    st.markdown(f"### Chosen {algorithm}")
    models = {"Linear Regression":LinearRegression(), "Ridge Regression":Ridge(), "Lasso Regression":Lasso(), "Elastic Net":ElasticNet()}
    model = models[algorithm]
    
    train_df = sidebar.selectbox("Choose Train Data",data_list)
    test_df = sidebar.selectbox("Choose Test Data",[i for i in data_list if i != train_df])
    selected = sidebar.multiselect("Choose Features",features,default=features)
       
    xtrain = dfs[train_df][selected].to_numpy().reshape(-1,len(selected))
    xtest = dfs[test_df][selected].to_numpy().reshape(-1,len(selected))
    ytrain = dfs[train_df]["Occupancy"].to_numpy()
    ytest = dfs[test_df]["Occupancy"].to_numpy()

    model.fit(xtrain,ytrain)
    st.markdown(f"Train R2 Score: {model.score(xtrain,ytrain)}")
    st.markdown(f"Test R2 Score: {model.score(xtest,ytest)}")
    

We imported a few regression algorithms from sklearn.
We made a select box to choose an algorithm.
We made select boxes to choose the train and test data.
We made a multiselect box to choose the features used to build the model.
We then trained a model using the selected data, features and algorithm.
Finally, we printed the train and test R2 scores.
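
The steps above can also be kept in a plain function so that they can be tried outside Streamlit. This is only a sketch: fit_and_score and toy_frame are hypothetical names, the data is synthetic, and the empty-selection check is an addition that the app code does not have.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

def fit_and_score(model, dfs, train_key, test_key, selected, target="Occupancy"):
    """Fit model on dfs[train_key] and return (train_r2, test_r2)."""
    if not selected:
        # A multiselect can be emptied by the user, which would leave no features.
        raise ValueError("Select at least one feature.")
    xtrain = dfs[train_key][selected].to_numpy()
    xtest = dfs[test_key][selected].to_numpy()
    ytrain = dfs[train_key][target].to_numpy()
    ytest = dfs[test_key][target].to_numpy()
    model.fit(xtrain, ytrain)
    return model.score(xtrain, ytrain), model.score(xtest, ytest)

rng = np.random.default_rng(1)

def toy_frame(n):
    """Synthetic frame with two of the article's column names (made-up values)."""
    light = rng.uniform(0, 600, size=n)
    co2 = rng.uniform(400, 1500, size=n)
    occupancy = (light > 300).astype(int)
    return pd.DataFrame({"Light": light, "CO2": co2, "Occupancy": occupancy})

toy_dfs = {"train": toy_frame(200), "test1": toy_frame(80)}
train_r2, test_r2 = fit_and_score(
    LinearRegression(), toy_dfs, "train", "test1", ["Light", "CO2"]
)
print(f"Train R2: {train_r2:.3f}, Test R2: {test_r2:.3f}")
```

In the app, the same function could be called with the values returned by the sidebar widgets.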

Let's add a few more lines of code to show the coefficients and to take user input for a prediction:

    if sidebar.checkbox("Show Coefficients"):
        st.markdown("#### Showing Coefficients and Intercept")
        st.write(f"Coeffs: {model.coef_}")
        st.write(f"Intercept: {model.intercept_}")
    if sidebar.checkbox("Show Prediction"):
        st.markdown("#### Showing Prediction")
        input_values = [float((st.number_input(t))) for t in selected]
        prediction = model.predict([input_values])
        st.write(f"Predicted {prediction}")
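
A note on the prediction step: model.predict expects a 2D input (rows by features), which is why input_values is wrapped in a list. Since Occupancy is 0/1 while linear regression returns a continuous number, one option is to threshold the output. The sketch below uses toy data, and the 0.5 cutoff is an assumption rather than something the app does.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Toy training data (assumption): a single Light-like feature and a 0/1 label.
X = np.array([[0.0], [100.0], [200.0], [400.0], [500.0], [600.0]])
y = np.array([0, 0, 0, 1, 1, 1])
model = LinearRegression().fit(X, y)

# One value per selected feature, as collected by st.number_input in the app.
input_values = [450.0]

# Wrapping in a list gives shape (1, n_features): one row to predict.
raw = model.predict([input_values])
print(raw.shape)         # (1,)
print(round(raw[0], 3))  # about 0.821 for this toy data

# Hypothetical post-processing: turn the continuous output into a 0/1 label.
label = int(raw[0] >= 0.5)
print(label)             # 1
```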

All Codes

Below is all the code we have written up to now.

import streamlit as st
import numpy as np
import pandas as pd
import cufflinks
from plotly.subplots import make_subplots
import plotly.graph_objects as go
from sklearn_extra.cluster import KMedoids
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression, LogisticRegression, Ridge, Lasso, ElasticNet

hline="--"*40

# st.cache_data replaces st.cache, which is deprecated in recent Streamlit releases
@st.cache_data
def get_data():
    train = pd.read_csv("https://github.com/LuisM78/Occupancy-detection-data/raw/master/datatraining.txt")
    train["date"]=pd.to_datetime(train.date)
    
    test1 = pd.read_csv("https://github.com/LuisM78/Occupancy-detection-data/raw/master/datatest.txt")
    test1["date"]=pd.to_datetime(test1.date)
    
    test2 = pd.read_csv("https://github.com/LuisM78/Occupancy-detection-data/raw/master/datatest2.txt")
    test2["date"]=pd.to_datetime(test2.date)
    
    dfs = {"train":train,"test1":test1,"test2":test2}
    return dfs

dfs = get_data()
data_list = list(dfs.keys())
sidebar = st.sidebar

# select modes, EDA, Clustering, Regression and Classification
mode = sidebar.selectbox("Select a mode.",options=["EDA", "Clustering", "Regression", "Classification"])
st.markdown(f"### {mode} Mode Selected")
st.markdown(hline)
    

# If selected EDA, show EDA related plots
if mode=="EDA":
    # if selected show the data
    show_data = sidebar.checkbox("Show data")
    if show_data:
        # if selected, show train data
        if sidebar.checkbox("Show Train data"):
            st.markdown("### Train Data")
            st.dataframe(dfs["train"])
        
        # if selected, show test1 data
        if sidebar.checkbox("Show Test1 data"):
            st.markdown("### Test1 Data")
            st.dataframe(dfs["test1"])
            
        # if selected, show test2 data
        if sidebar.checkbox("Show Test2 data"):
            st.markdown("### Test2 Data")
            st.dataframe(dfs["test2"])
    
    # if selected, show the comparision data
    show_comparison = sidebar.checkbox("Show comparison")
    if show_comparison:
        
        # make a multiselect to select the columns to compare
        selected = sidebar.multiselect("Select Columns ", [d for d in dfs["train"].columns if d not in ["date"]])
        
        
        titles=list(dfs.keys())
        train = dfs["train"]
        test1 = dfs["test1"]
        test2 = dfs["test2"]
        
        if selected:
            st.markdown(f"### Selected Columns: {', '.join(selected)}")
            
            for c in selected:
                fig = make_subplots(rows=2,cols=3, subplot_titles=titles, )
                fig.add_trace(go.Box(y=train[c].tolist(), name=titles[0]), row=1, col=1)
                fig.add_trace(go.Box(y=test1[c].tolist(), name = titles[1]), row=1, col=2)
                fig.add_trace(go.Box(y=test2[c].tolist(), name = titles[2]), row=1, col=3)
                
                fig.add_trace(go.Histogram(y=train[c].tolist(), name=titles[0]), row=2, col=1)
                fig.add_trace(go.Histogram(y=test1[c].tolist(), name = titles[1]), row=2, col=2)
                fig.add_trace(go.Histogram(y=test2[c].tolist(), name = titles[2]), row=2, col=3)
                fig.update_layout(height=600, width=800, title_text=f"Box and Distribution of {c}")
                st.plotly_chart(fig)
        
        # if selected show correlation
        show_corr = sidebar.checkbox("Show Correlation")
        if show_corr:
            st.markdown("### Correlation")
            fig = make_subplots(rows=1,cols=3, subplot_titles=titles)
            fig.add_trace(go.Heatmap(z=train.corr(), y=train.corr().columns,x=train.corr().index, name=titles[0]), row=1, col=1)
            fig.add_trace(go.Heatmap(z=test1.corr(), x=train.corr().index, name = titles[1]), row=1, col=2)
            fig.add_trace(go.Heatmap(z=test2.corr(), x=train.corr().index, name = titles[2]), row=1, col=3)
            st.plotly_chart(fig)
if mode=="Clustering":
    features = ["Temperature", "Humidity", "CO2", "HumidityRatio","Light"][:-2]
    
    # select a  clustering algorithm
    calg = sidebar.selectbox("Select a clustering algorithm", ["K-Medoids","K-Means"])
    
    # select number of clusters
    # capped at 6 because only six cluster colors are defined below
    ks = sidebar.slider("Select number of clusters", min_value=2, max_value=6, value=2)
    
    # select a dataframe to apply cluster on
    data_type = sidebar.selectbox("Select a dataframe:", ["Train","Test1","Test2"])
    st.markdown(f"## Dataframe selected {data_type}")
    udf = dfs[data_type.lower()]
    
    # if selected kmedoids, do respective operations
    if calg == "K-Medoids":  
        st.markdown("### K-Medoids Clustering")      
        
        # if using PCA or not
        use_pca = sidebar.radio("Use PCA?",["Yes","No"])
        # if not using pca, do default clustering
        if use_pca=="No":
            st.markdown("### Not Using PCA")
            inertias = []
            for c in range(1,ks+1):
                tdf = udf.copy()
                X = tdf[features]                
                colors=['red','green','blue','magenta','black','yellow']
                model = KMedoids(n_clusters=c)
                model.fit(X)
                y_kmeans = model.predict(X)
                tdf["cluster"] = y_kmeans
                inertias.append((c,model.inertia_))
                
                trace0 = go.Scatter(x=tdf[features[0]],y=tdf[features[1]],mode='markers',  marker=dict(
                    color=tdf.cluster.apply(lambda x: colors[x]),
                    colorscale='Viridis',
                    showscale=True
                ),name="Cluster Points")
                trace1 = go.Scatter(x=model.cluster_centers_[:, 0], y=model.cluster_centers_[:, 1],mode='markers', marker=dict(
                    color=colors,
                    size=20,
                    showscale=True
                ),name="Cluster Mean")
                    
                data7 = go.Data([trace0, trace1])
                fig = go.Figure(data=data7)
                fig.update_layout(height=600, width=800, title=f"Cluster Size {c}")
                st.plotly_chart(fig)

            inertias=np.array(inertias).reshape(-1,2)
            performance = go.Scatter(x=inertias[:,0], y=inertias[:,1])
            layout = go.Layout(
                title="Cluster Number vs Inertia",
                xaxis=dict(
                    title="Ks"
                ),
                yaxis=dict(
                    title="Inertia"
                ) ) 
            fig=go.Figure(data=go.Data([performance]))
            fig.update_layout(layout)
            st.plotly_chart(fig)
        
        # if using pca, use pca to reduce dimensionality and then do clustering    
        if use_pca=="Yes":
            st.markdown("### Using PCA")
            tdf=udf.copy()
            
            X = udf[features]
            scaler = StandardScaler()
            X_scaled = scaler.fit_transform(X)
            
            pca = PCA(n_components=3)
            principalComponents = pca.fit_transform(X_scaled)
            feat = list(range(pca.n_components_))
            PCA_components = pd.DataFrame(principalComponents, columns=list(range(len(feat))))
            choosed_component = sidebar.multiselect("Choose Components",feat,default=[1,2])
            choosed_component=[int(i) for i in choosed_component]
            inertias = []
            for c in range(1,ks+1):
                X = PCA_components[choosed_component]
                
                colors=['red','green','blue','magenta','black','yellow']
                model = KMedoids(n_clusters=c)
                model.fit(X)
                y_kmeans = model.predict(X)
                tdf["cluster"] = y_kmeans
                inertias.append((c,model.inertia_))
                
                trace0 = go.Scatter(x=X[choosed_component[0]],y=X[choosed_component[1]],mode='markers',  marker=dict(
                    color=tdf.cluster.apply(lambda x: colors[x]),
                    colorscale='Viridis',
                    showscale=True
                ),name="Cluster Points")
                trace1 = go.Scatter(x=model.cluster_centers_[:, 0], y=model.cluster_centers_[:, 1],mode='markers', marker=dict(
                    color=colors,
                    size=20,
                    showscale=True
                ),name="Cluster Mean")
                    
                data7 = go.Data([trace0, trace1])
                fig = go.Figure(data=data7)
                fig.update_layout(title=f"Cluster Size {c}")
                st.plotly_chart(fig)

            inertias=np.array(inertias).reshape(-1,2)
            performance = go.Scatter(x=inertias[:,0], y=inertias[:,1])
            layout = go.Layout(
                title="Cluster Number vs Inertia",
                xaxis=dict(
                    title="Ks"
                ),
                yaxis=dict(
                    title="Inertia"
                ) ) 
            fig=go.Figure(data=go.Data([performance]))
            fig.update_layout(layout)
            st.plotly_chart(fig)
    # if chosen KMeans, do respective operations
    if calg == "K-Means":
        st.markdown("### K-Means Clustering")        
        use_pca = sidebar.radio("Use PCA?",["Yes","No"])
        if use_pca=="No":
            st.markdown("### Not Using PCA")
            inertias = []
            for c in range(1,ks+1):
                tdf = udf.copy()
                X = tdf[features]                
                colors=['red','green','blue','magenta','black','yellow']
                model = KMeans(n_clusters=c)
                model.fit(X)
                y_kmeans = model.predict(X)
                tdf["cluster"] = y_kmeans
                inertias.append((c,model.inertia_))
                
                trace0 = go.Scatter(x=tdf[features[0]],y=tdf[features[1]],mode='markers',  marker=dict(
                    color=tdf.cluster.apply(lambda x: colors[x]),
                    colorscale='Viridis',
                    showscale=True
                ),name="Cluster Points")
                trace1 = go.Scatter(x=model.cluster_centers_[:, 0], y=model.cluster_centers_[:, 1],mode='markers', marker=dict(
                    color=colors,
                    size=20,
                    showscale=True
                ),name="Cluster Mean")
                    
                data7 = go.Data([trace0, trace1])
                fig = go.Figure(data=data7)
                fig.update_layout(height=600, width=800, title=f"Cluster Size {c}")
                st.plotly_chart(fig)

            inertias=np.array(inertias).reshape(-1,2)
            performance = go.Scatter(x=inertias[:,0], y=inertias[:,1])
            layout = go.Layout(
                title="Cluster Number vs Inertia",
                xaxis=dict(
                    title="Ks"
                ),
                yaxis=dict(
                    title="Inertia"
                ) ) 
            fig=go.Figure(data=go.Data([performance]))
            fig.update_layout(layout)
            st.plotly_chart(fig)
            
        if use_pca=="Yes":
            st.markdown("### Using PCA")
            tdf=udf.copy()
            
            X = udf[features]
            scaler = StandardScaler()
            X_scaled = scaler.fit_transform(X)
            
            pca = PCA(n_components=3)
            principalComponents = pca.fit_transform(X_scaled)
            feat = list(range(pca.n_components_))
            PCA_components = pd.DataFrame(principalComponents, columns=list(range(len(feat))))
            choosed_component = sidebar.multiselect("Choose Components",feat,default=[1,2])
            choosed_component=[int(i) for i in choosed_component]
            inertias = []
            for c in range(1,ks+1):
                X = PCA_components[choosed_component]
                
                colors=['red','green','blue','magenta','black','yellow']
                model = KMeans(n_clusters=c)
                model.fit(X)
                y_kmeans = model.predict(X)
                tdf["cluster"] = y_kmeans
                inertias.append((c,model.inertia_))
                
                trace0 = go.Scatter(x=X[choosed_component[0]],y=X[choosed_component[1]],mode='markers',  marker=dict(
                    color=tdf.cluster.apply(lambda x: colors[x]),
                    colorscale='Viridis',
                    showscale=True
                ),name="Cluster Points")
                trace1 = go.Scatter(x=model.cluster_centers_[:, 0], y=model.cluster_centers_[:, 1],mode='markers', marker=dict(
                    color=colors,
                    size=20,
                    showscale=True
                ),name="Cluster Mean")
                    
                data7 = go.Data([trace0, trace1])
                fig = go.Figure(data=data7)
                fig.update_layout(title=f"Cluster Size {c}")
                st.plotly_chart(fig)

            inertias=np.array(inertias).reshape(-1,2)
            performance = go.Scatter(x=inertias[:,0], y=inertias[:,1])
            layout = go.Layout(
                title="Cluster Number vs Inertia",
                xaxis=dict(
                    title="Ks"
                ),
                yaxis=dict(
                    title="Inertia"
                ) ) 
            fig=go.Figure(data=go.Data([performance]))
            fig.update_layout(layout)
            st.plotly_chart(fig)
            
if mode == "Regression":
    features = ["Temperature", "Humidity", "CO2", "HumidityRatio","Light"]
    algorithm = sidebar.selectbox("Choose Algorithm",["Linear Regression","Ridge Regression","Lasso Regression","Elastic Net"])
    st.markdown(f"### Chosen {algorithm}")
    models = {"Linear Regression":LinearRegression(), "Ridge Regression":Ridge(), "Lasso Regression":Lasso(), "Elastic Net":ElasticNet()}
    model = models[algorithm]
    
    train_df = sidebar.selectbox("Choose Train Data",data_list)
    test_df = sidebar.selectbox("Choose Test Data",[i for i in data_list if i != train_df])
    selected = sidebar.multiselect("Choose Features",features,default=features)
       
    xtrain = dfs[train_df][selected].to_numpy().reshape(-1,len(selected))
    xtest = dfs[test_df][selected].to_numpy().reshape(-1,len(selected))
    ytrain = dfs[train_df]["Occupancy"].to_numpy()
    ytest = dfs[test_df]["Occupancy"].to_numpy()

    model.fit(xtrain,ytrain)
    st.markdown(f"Train R2 Score: {model.score(xtrain,ytrain)}")
    st.markdown(f"Test R2 Score: {model.score(xtest,ytest)}")
    
    if sidebar.checkbox("Show Coefficients"):
        st.markdown("#### Showing Coefficients and Intercept")
        st.write(f"Coeffs: {model.coef_}")
        st.write(f"Intercept: {model.intercept_}")
    if sidebar.checkbox("Show Prediction"):
        st.markdown("#### Showing Prediction")
        input_values = [float((st.number_input(t))) for t in selected]
        prediction = model.predict([input_values])
        st.write(f"Predicted {prediction}")
    

Classification

Taking Classification to Streamlit App

So far we have built the EDA, Clustering and Regression modes; now it is time to build the Classification mode. Since we already covered most of the repetitive try-outs in the sections above, in this one we will jump right into the implementation.

from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import confusion_matrix, f1_score, accuracy_score

if mode == "Classification":
    features = ["Temperature", "Humidity", "CO2", "HumidityRatio","Light"]
    algorithm = sidebar.selectbox("Choose Algorithm",["Logistic Regression","KNN","Decision Tree","Random Forest", "Ada Boost"])
    st.markdown(f"### Chosen {algorithm}")
    models = {"Logistic Regression":LogisticRegression(), "KNN":KNeighborsClassifier(), "Decision Tree":DecisionTreeClassifier(), "Random Forest":RandomForestClassifier(), "Ada Boost":AdaBoostClassifier()}
    model = models[algorithm]
    
    train_df = sidebar.selectbox("Choose Train Data",data_list)
    test_df = sidebar.selectbox("Choose Test Data",[i for i in data_list if i != train_df])
    selected = sidebar.multiselect("Choose Features",features,default=features)
       
    xtrain = dfs[train_df][selected].to_numpy().reshape(-1,len(selected))
    xtest = dfs[test_df][selected].to_numpy().reshape(-1,len(selected))
    ytrain = dfs[train_df]["Occupancy"].to_numpy()
    ytest = dfs[test_df]["Occupancy"].to_numpy()

    model.fit(xtrain,ytrain)
    st.markdown(f"##### Mean Accuracy (model.score): Train = {model.score(xtrain,ytrain) : .2f}, Test = {model.score(xtest,ytest) : .2f}")
    
    train_pred = model.predict(xtrain)
    test_pred = model.predict(xtest)
    
    cm_train = confusion_matrix(ytrain,train_pred,labels=[0,1])
    cm_test = confusion_matrix(ytest,test_pred,labels=[0,1])
    
    train_f1 = f1_score(ytrain,train_pred,average="macro")
    test_f1 = f1_score(ytest,test_pred,average="macro")
    train_acc = accuracy_score(ytrain,train_pred)
    test_acc = accuracy_score(ytest,test_pred)
    
    st.markdown(f"##### F1 Score: Train = {train_f1 : .2f}, Test = {test_f1 : .2f}")
    st.markdown(f"##### Accuracy Score: Train = {train_acc : .2f}, Test = {test_acc : .2f}")
    
    fig = make_subplots(rows=1,cols=2, subplot_titles=["Train","Test"])
    labels = ["Vacant","Occupied"]
    fig1 = go.Heatmap(z=cm_train, y=labels, x=labels, name="Train")
    fig2 = go.Heatmap(z=cm_test, y=labels, x=labels, name="Test")
    
    fig.add_trace(fig1, row=1, col=1)
    fig.add_trace(fig2, row=1, col=2)
    fig.update_layout({"title":"Confusion Matrix","xaxis": {"title": "Predicted value"},
        "yaxis": {"title": "Real value"}})
    st.plotly_chart(fig)
    
    if sidebar.checkbox("Show Coefficients"):
        st.markdown("#### Showing Coefficients and Intercept")
        try:
            st.write(f"Coeffs: {model.coef_}")
            st.write(f"Intercept: {model.intercept_}")
        except AttributeError:
            st.write("Coeffs: Not Available")
    if sidebar.checkbox("Show Prediction"):
        st.markdown("#### Showing Prediction")
        input_values = [float((st.number_input(t))) for t in selected]
        prediction = model.predict([input_values])
        st.write(f"Predicted {prediction}")
    

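What the code above does:

It selects the classification algorithm and instantiates the matching model.
It lets us choose features and prepares the train/test data from them.
It trains the model and shows train/test metrics (model score, F1 and accuracy).
It shows the confusion matrices for the train and test data as heatmaps.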
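
To see what these metrics report, here is a small sketch on hand-made labels (toy values, not model output from the occupancy data). It also checks that a classifier's score method is the mean accuracy, using a decision tree on a made-up single feature.

```python
from sklearn.metrics import accuracy_score, confusion_matrix, f1_score
from sklearn.tree import DecisionTreeClassifier

# Hand-made labels: 0 = Vacant, 1 = Occupied
y_true = [0, 0, 0, 1, 1, 1]
y_pred = [0, 0, 1, 1, 1, 0]

# Rows are the real labels, columns are the predicted labels.
cm = confusion_matrix(y_true, y_pred, labels=[0, 1])
print(cm)  # [[2 1]
           #  [1 2]]

acc = accuracy_score(y_true, y_pred)                  # 4 correct out of 6
macro_f1 = f1_score(y_true, y_pred, average="macro")  # mean of the per-class F1
print(f"Accuracy: {acc:.2f}, Macro F1: {macro_f1:.2f}")  # both 0.67 here

# For classifiers, score() is the mean accuracy (not R2).
X_toy = [[0], [1], [2], [3], [4], [5]]
clf = DecisionTreeClassifier(random_state=0).fit(X_toy, y_true)
print(clf.score(X_toy, y_true) == accuracy_score(y_true, clf.predict(X_toy)))  # True
```

With rows as real values and columns as predicted values, the off-diagonal cells of the heatmap are the misclassified samples.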

If everything is fine, our app should look like the one below:

All Codes

import streamlit as st
import numpy as np
import pandas as pd
import cufflinks
from plotly.subplots import make_subplots
import plotly.graph_objects as go
from sklearn_extra.cluster import KMedoids
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression, LogisticRegression, Ridge, Lasso, ElasticNet
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import confusion_matrix, f1_score, accuracy_score

hline="--"*40

# st.cache_data replaces st.cache, which is deprecated in recent Streamlit releases
@st.cache_data
def get_data():
    train = pd.read_csv("https://github.com/LuisM78/Occupancy-detection-data/raw/master/datatraining.txt")
    train["date"]=pd.to_datetime(train.date)
    
    test1 = pd.read_csv("https://github.com/LuisM78/Occupancy-detection-data/raw/master/datatest.txt")
    test1["date"]=pd.to_datetime(test1.date)
    
    test2 = pd.read_csv("https://github.com/LuisM78/Occupancy-detection-data/raw/master/datatest2.txt")
    test2["date"]=pd.to_datetime(test2.date)
    
    dfs = {"train":train,"test1":test1,"test2":test2}
    return dfs

dfs = get_data()
data_list = list(dfs.keys())
sidebar = st.sidebar

# select modes, EDA, Clustering, Regression and Classification
mode = sidebar.selectbox("Select a mode.",options=["EDA", "Clustering", "Regression", "Classification"])
st.markdown(f"### {mode} Mode Selected")
st.markdown(hline)
    

# If selected EDA, show EDA related plots
if mode=="EDA":
    # if selected show the data
    show_data = sidebar.checkbox("Show data")
    if show_data:
        # if selected, show train data
        if sidebar.checkbox("Show Train data"):
            st.markdown("### Train Data")
            st.dataframe(dfs["train"])
        
        # if selected, show test1 data
        if sidebar.checkbox("Show Test1 data"):
            st.markdown("### Test1 Data")
            st.dataframe(dfs["test1"])
            
        # if selected, show test2 data
        if sidebar.checkbox("Show Test2 data"):
            st.markdown("### Test2 Data")
            st.dataframe(dfs["test2"])
    
    # if selected, show the comparision data
    show_comparison = sidebar.checkbox("Show comparison")
    if show_comparison:
        
        # make a multiselect to select the columns to compare
        selected = sidebar.multiselect("Select Columns ", [d for d in dfs["train"].columns if d not in ["date"]])
        
        
        titles=list(dfs.keys())
        train = dfs["train"]
        test1 = dfs["test1"]
        test2 = dfs["test2"]
        
        if selected:
            st.markdown(f"### Selected Columns: {', '.join(selected)}")
            
            for c in selected:
                fig = make_subplots(rows=2,cols=3, subplot_titles=titles, )
                fig.add_trace(go.Box(y=train[c].tolist(), name=titles[0]), row=1, col=1)
                fig.add_trace(go.Box(y=test1[c].tolist(), name = titles[1]), row=1, col=2)
                fig.add_trace(go.Box(y=test2[c].tolist(), name = titles[2]), row=1, col=3)
                
                fig.add_trace(go.Histogram(y=train[c].tolist(), name=titles[0]), row=2, col=1)
                fig.add_trace(go.Histogram(y=test1[c].tolist(), name = titles[1]), row=2, col=2)
                fig.add_trace(go.Histogram(y=test2[c].tolist(), name = titles[2]), row=2, col=3)
                fig.update_layout(height=600, width=800, title_text=f"Box and Distribution of {c}")
                st.plotly_chart(fig)
        
        # if selected show correlation
        show_corr = sidebar.checkbox("Show Correlation")
        if show_corr:
            st.markdown("### Correlation")
            fig = make_subplots(rows=1,cols=3, subplot_titles=titles)
            fig.add_trace(go.Heatmap(z=train.corr(), y=train.corr().columns,x=train.corr().index, name=titles[0]), row=1, col=1)
            fig.add_trace(go.Heatmap(z=test1.corr(), x=train.corr().index, name = titles[1]), row=1, col=2)
            fig.add_trace(go.Heatmap(z=test2.corr(), x=train.corr().index, name = titles[2]), row=1, col=3)
            st.plotly_chart(fig)
if mode=="Clustering":
    features = ["Temperature", "Humidity", "CO2", "HumidityRatio","Light"][:-2]
    
    # select a  clustering algorithm
    calg = sidebar.selectbox("Select a clustering algorithm", ["K-Medoids","K-Means"])
    
    # select number of clusters
    # capped at 6 because only six cluster colors are defined below
    ks = sidebar.slider("Select number of clusters", min_value=2, max_value=6, value=2)
    
    # select a dataframe to apply cluster on
    data_type = sidebar.selectbox("Select a dataframe:", ["Train","Test1","Test2"])
    st.markdown(f"## Dataframe selected {data_type}")
    udf = dfs[data_type.lower()]
    
    # if selected kmedoids, do respective operations
    if calg == "K-Medoids":  
        st.markdown("### K-Medoids Clustering")      
        
        # if using PCA or not
        use_pca = sidebar.radio("Use PCA?",["Yes","No"])
        # if not using pca, do default clustering
        if use_pca=="No":
            st.markdown("### Not Using PCA")
            inertias = []
            for c in range(1,ks+1):
                tdf = udf.copy()
                X = tdf[features]                
                colors=['red','green','blue','magenta','black','yellow']
                model = KMedoids(n_clusters=c)
                model.fit(X)
                y_kmeans = model.predict(X)
                tdf["cluster"] = y_kmeans
                inertias.append((c,model.inertia_))
                
                trace0 = go.Scatter(x=tdf[features[0]],y=tdf[features[1]],mode='markers',  marker=dict(
                    color=tdf.cluster.apply(lambda x: colors[x]),
                    colorscale='Viridis',
                    showscale=True
                ),name="Cluster Points")
                trace1 = go.Scatter(x=model.cluster_centers_[:, 0], y=model.cluster_centers_[:, 1],mode='markers', marker=dict(
                    color=colors,
                    size=20,
                    showscale=True
                ),name="Cluster Mean")
                    
                data7 = go.Data([trace0, trace1])
                fig = go.Figure(data=data7)
                fig.update_layout(height=600, width=800, title=f"Cluster Size {c}")
                st.plotly_chart(fig)

            inertias=np.array(inertias).reshape(-1,2)
            performance = go.Scatter(x=inertias[:,0], y=inertias[:,1])
            layout = go.Layout(
                title="Cluster Number vs Inertia",
                xaxis=dict(
                    title="Ks"
                ),
                yaxis=dict(
                    title="Inertia"
                ) ) 
            fig=go.Figure(data=go.Data([performance]))
            fig.update_layout(layout)
            st.plotly_chart(fig)
        
        # if using pca, use pca to reduce dimensionality and then do clustering    
        if use_pca=="Yes":
            st.markdown("### Using PCA")
            tdf=udf.copy()
            
            X = udf[features]
            scaler = StandardScaler()
            X_scaled = scaler.fit_transform(X)
            
            pca = PCA(n_components=3)
            principalComponents = pca.fit_transform(X_scaled)
            feat = list(range(pca.n_components_))
            PCA_components = pd.DataFrame(principalComponents, columns=list(range(len(feat))))
            choosed_component = sidebar.multiselect("Choose Components",feat,default=[1,2])
            choosed_component=[int(i) for i in choosed_component]
            inertias = []
            for c in range(1,ks+1):
                X = PCA_components[choosed_component]
                
                colors=['red','green','blue','magenta','black','yellow']
                model = KMedoids(n_clusters=c)
                model.fit(X)
                y_kmeans = model.predict(X)
                tdf["cluster"] = y_kmeans
                inertias.append((c,model.inertia_))
                
                trace0 = go.Scatter(x=X[choosed_component[0]],y=X[choosed_component[1]],mode='markers',  marker=dict(
                    color=tdf.cluster.apply(lambda x: colors[x]),
                    colorscale='Viridis',
                    showscale=True
                ),name="Cluster Points")
                trace1 = go.Scatter(x=model.cluster_centers_[:, 0], y=model.cluster_centers_[:, 1],mode='markers', marker=dict(
                    color=colors,
                    size=20,
                    showscale=True
                ),name="Cluster Mean")
                    
                data7 = go.Data([trace0, trace1])
                fig = go.Figure(data=data7)
                fig.update_layout(title=f"Cluster Size {c}")
                st.plotly_chart(fig)

            inertias=np.array(inertias).reshape(-1,2)
            performance = go.Scatter(x=inertias[:,0], y=inertias[:,1])
            layout = go.Layout(
                title="Cluster Number vs Inertia",
                xaxis=dict(
                    title="Ks"
                ),
                yaxis=dict(
                    title="Inertia"
                ) ) 
            fig=go.Figure(data=go.Data([performance]))
            fig.update_layout(layout)
            st.plotly_chart(fig)
    # if chosen KMeans, do respective operations
    if calg == "K-Means":
        st.markdown("### K-Means Clustering")        
        use_pca = sidebar.radio("Use PCA?",["Yes","No"])
        if use_pca=="No":
            st.markdown("### Not Using PCA")
            inertias = []
            for c in range(1,ks+1):
                tdf = udf.copy()
                X = tdf[features]                
                colors=['red','green','blue','magenta','black','yellow']
                model = KMeans(n_clusters=c)
                model.fit(X)
                y_kmeans = model.predict(X)
                tdf["cluster"] = y_kmeans
                inertias.append((c,model.inertia_))
                
                trace0 = go.Scatter(x=tdf[features[0]],y=tdf[features[1]],mode='markers',  marker=dict(
                    color=tdf.cluster.apply(lambda x: colors[x]),
                    colorscale='Viridis',
                    showscale=True
                ),name="Cluster Points")
                trace1 = go.Scatter(x=model.cluster_centers_[:, 0], y=model.cluster_centers_[:, 1],mode='markers', marker=dict(
                    color=colors,
                    size=20,
                    showscale=True
                ),name="Cluster Mean")
                    
                data7 = go.Data([trace0, trace1])
                fig = go.Figure(data=data7)
                fig.update_layout(height=600, width=800, title=f"Cluster Size {c}")
                st.plotly_chart(fig)

            inertias=np.array(inertias).reshape(-1,2)
            performance = go.Scatter(x=inertias[:,0], y=inertias[:,1])
            layout = go.Layout(
                title="Cluster Number vs Inertia",
                xaxis=dict(
                    title="Ks"
                ),
                yaxis=dict(
                    title="Inertia"
                ) ) 
            # go.Data is deprecated; a plain list of traces is enough
            fig=go.Figure(data=[performance])
            fig.update_layout(layout)
            st.plotly_chart(fig)
            
        if use_pca=="Yes":
            st.markdown("### Using PCA")
            tdf=udf.copy()
            
            X = udf[features]
            scaler = StandardScaler()
            X_scaled = scaler.fit_transform(X)
            
            pca = PCA(n_components=3)
            principalComponents = pca.fit_transform(X_scaled)
            feat = list(range(pca.n_components_))
            PCA_components = pd.DataFrame(principalComponents, columns=list(range(len(feat))))
            choosed_component = sidebar.multiselect("Choose Components",feat,default=[1,2])
            choosed_component=[int(i) for i in choosed_component]
            inertias = []
            for c in range(1,ks+1):
                X = PCA_components[choosed_component]
                
                colors=['red','green','blue','magenta','black','yellow']
                model = KMeans(n_clusters=c)
                model.fit(X)
                y_kmeans = model.predict(X)
                tdf["cluster"] = y_kmeans
                inertias.append((c,model.inertia_))
                
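                # plot the first two chosen components instead of hard-coding columns 1 and 2
                trace0 = go.Scatter(x=X[choosed_component[0]],y=X[choosed_component[1]],mode='markers',  marker=dict(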
                    color=tdf.cluster.apply(lambda x: colors[x]),
                    colorscale='Viridis',
                    showscale=True
                ),name="Cluster Points")
                trace1 = go.Scatter(x=model.cluster_centers_[:, 0], y=model.cluster_centers_[:, 1],mode='markers', marker=dict(
                    color=colors,
                    size=20,
                    showscale=True
                ),name="Cluster Mean")
                    
                # go.Data is deprecated; pass the traces as a plain list
                fig = go.Figure(data=[trace0, trace1])
                fig.update_layout(title=f"Cluster Size {c}")
                st.plotly_chart(fig)

            inertias=np.array(inertias).reshape(-1,2)
            performance = go.Scatter(x=inertias[:,0], y=inertias[:,1])
            layout = go.Layout(
                title="Cluster Number vs Inertia",
                xaxis=dict(
                    title="Ks"
                ),
                yaxis=dict(
                    title="Inertia"
                ) ) 
            # go.Data is deprecated; a plain list of traces is enough
            fig=go.Figure(data=[performance])
            fig.update_layout(layout)
            st.plotly_chart(fig)
            
if mode == "Regression":
    features = ["Temperature", "Humidity", "CO2", "HumidityRatio","Light"]
    algorithm = sidebar.selectbox("Choose Algorithm",["Linear Regression","Ridge Regression","Lasso Regression","Elastic Net"])
    st.markdown(f"### Chosen {algorithm}")
    models = {"Linear Regression":LinearRegression(), "Ridge Regression":Ridge(), "Lasso Regression":Lasso(), "Elastic Net":ElasticNet()}
    model = models[algorithm]
    
    train_df = sidebar.selectbox("Choose Train Data",data_list)
    test_df = sidebar.selectbox("Choose Test Data",[i for i in data_list if i != train_df])
    selected = sidebar.multiselect("Choose Features",features,default=features)
       
    xtrain = dfs[train_df][selected].to_numpy().reshape(-1,len(selected))
    xtest = dfs[test_df][selected].to_numpy().reshape(-1,len(selected))
    ytrain = dfs[train_df]["Occupancy"].to_numpy()
    ytest = dfs[test_df]["Occupancy"].to_numpy()

    model.fit(xtrain,ytrain)
    st.markdown(f"Train R2 Score: {model.score(xtrain,ytrain) : .2f}")
    st.markdown(f"Test R2 Score: {model.score(xtest,ytest) : .2f}")
    
    if sidebar.checkbox("Show Coefficients"):
        st.markdown("#### Showing Coefficients and Intercept")
        st.write(f"Coeffs: {model.coef_}")
        st.write(f"Intercept: {model.intercept_}")
    if sidebar.checkbox("Show Prediction"):
        st.markdown("#### Showing Prediction")
        input_values = [float((st.number_input(t))) for t in selected]
        prediction = model.predict([input_values])
        st.write(f"Predicted {prediction}")

if mode == "Classification":
    features = ["Temperature", "Humidity", "CO2", "HumidityRatio","Light"]
    algorithm = sidebar.selectbox("Choose Algorithm",["Logistic Regression","KNN","Decision Tree","Random Forest", "Ada Boost"])
    st.markdown(f"### Chosen {algorithm}")
    models = {"Logistic Regression":LogisticRegression(), "KNN":KNeighborsClassifier(), "Decision Tree":DecisionTreeClassifier(), "Random Forest":RandomForestClassifier(), "Ada Boost":AdaBoostClassifier()}
    model = models[algorithm]
    
    train_df = sidebar.selectbox("Choose Train Data",data_list)
    test_df = sidebar.selectbox("Choose Test Data",[i for i in data_list if i != train_df])
    selected = sidebar.multiselect("Choose Features",features,default=features)
       
    xtrain = dfs[train_df][selected].to_numpy().reshape(-1,len(selected))
    xtest = dfs[test_df][selected].to_numpy().reshape(-1,len(selected))
    ytrain = dfs[train_df]["Occupancy"].to_numpy()
    ytest = dfs[test_df]["Occupancy"].to_numpy()

    model.fit(xtrain,ytrain)
    # for classifiers, model.score returns mean accuracy, not R2
    st.markdown(f"##### Mean Accuracy (model.score): Train = {model.score(xtrain,ytrain) : .2f}, Test = {model.score(xtest,ytest) : .2f}")
    
    train_pred = model.predict(xtrain)
    test_pred = model.predict(xtest)
    
    cm_train = confusion_matrix(ytrain,train_pred,labels=[0,1])
    cm_test = confusion_matrix(ytest,test_pred,labels=[0,1])
    
    train_f1 = f1_score(ytrain,train_pred,average="macro")
    test_f1 = f1_score(ytest,test_pred,average="macro")
    train_acc = accuracy_score(ytrain,train_pred)
    test_acc = accuracy_score(ytest,test_pred)
    
    st.markdown(f"##### F1 Score: Train = {train_f1 : .2f}, Test = {test_f1 : .2f}")
    st.markdown(f"##### Accuracy Score: Train = {train_acc : .2f}, Test = {test_acc : .2f}")
    
    fig = make_subplots(rows=1,cols=2, subplot_titles=["Train","Test"])
    labels = ["Vacant","Occupied"]
    fig1 = go.Heatmap(z=cm_train, y=labels, x=labels, name="Train")
    fig2 = go.Heatmap(z=cm_test, y=labels, x=labels, name="Test")
    
    fig.add_trace(fig1, row=1, col=1)
    fig.add_trace(fig2, row=1, col=2)
    fig.update_layout(title="Confusion Matrix")
    # label the axes of both subplots, not only the first one
    fig.update_xaxes(title_text="Predicted value")
    fig.update_yaxes(title_text="Real value")
    st.plotly_chart(fig)
    
    if sidebar.checkbox("Show Coefficients"):
        st.markdown("#### Showing Coefficients and Intercept")
        try:
            st.write(f"Coeffs: {model.coef_}")
            st.write(f"Intercept: {model.intercept_}")
        except AttributeError:
            # tree-based and KNN models do not expose coef_ / intercept_
            st.write("Coeffs: Not Available")
    if sidebar.checkbox("Show Prediction"):
        st.markdown("#### Showing Prediction")
        input_values = [float((st.number_input(t))) for t in selected]
        prediction = model.predict([input_values])
        st.write(f"Predicted {prediction}")
    
    
    

Add Inference Mode

To add inference, we need to save a model during the Regression or Classification phase and then upload it in Inference mode to test it. We will:

Accept a model file upload and read the number of features the model expects.
Create a form with one input field per feature to accept the input values.
Add a submit button that passes those input values to the loaded model and prints the result.

if mode in ["Regression", "Classification"]:    
    filename=sidebar.text_input("Enter File Name",value="model.sav")
    save = sidebar.button("Save Model") 
    if save and model:
        # use a context manager so the file is closed after writing
        with open(filename, 'wb') as f:
            pickle.dump(model, f)
        sidebar.markdown("Saved Model!")

if mode =="Inference":
    model=None
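    # always offer the uploader; fall back to the cached copy if nothing is uploaded
    file = sidebar.file_uploader("Upload Model File", accept_multiple_files=False)
    if file:
        model = pickle.load(file)
        with open("temp_model.sav", 'wb') as f:
            pickle.dump(model, f)
        st.markdown("Model Loaded.")
    elif "temp_model.sav" in os.listdir():
        with open("temp_model.sav", 'rb') as f:
            model = pickle.load(f)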
    if model:
        st.markdown(f"Loaded {type(model).__name__} Model!")
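        nfeatures="Not Known"
        try:
            # n_features_in_ is set by scikit-learn estimators when they are fit
            nfeatures=model.n_features_in_
        except AttributeError:
            # fall back to the coefficient shape for linear models
            nfeatures = model.coef_.shape[-1]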
                
        st.markdown(f"Number of features: {nfeatures}")
            
        if nfeatures!="Not Known":
            form1 = st.form(key='my_form')

            st.markdown("#### Showing Prediction")
            input_values = [float((form1.number_input(f"Feature {t} Value"))) for t in range(nfeatures)]
            
            submit= form1.form_submit_button("Predict")
            prediction=None
            if submit:
                prediction = model.predict([input_values])
            st.markdown(f"Input Values: {input_values}")
            st.write(f"Predicted {prediction}")
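
The save-and-load round trip can also be checked outside Streamlit. The sketch below is only an illustration under a few assumptions: the tiny dataset is made up, `io.BytesIO` stands in for the file object returned by the uploader, and the feature count is read from `n_features_in_`, which scikit-learn estimators set when they are fit.

```python
import io
import pickle

import numpy as np
from sklearn.linear_model import LogisticRegression

# Made-up toy data: two features and a binary label (illustrative only).
X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])
y = np.array([0, 0, 1, 1])

model = LogisticRegression().fit(X, y)

# Serialize into an in-memory buffer; BytesIO plays the role of the uploaded file.
buffer = io.BytesIO()
pickle.dump(model, buffer)
buffer.seek(0)

# Load the model back, as Inference mode does with the uploaded file.
loaded = pickle.load(buffer)

# Read how many inputs the model expects, then predict for one sample.
n_features = loaded.n_features_in_
sample = [[1.0, 0.0]]
prediction = loaded.predict(sample)
print(type(loaded).__name__, n_features, prediction)
```

If the loaded model gives the same predictions as the original one on the training data, the saved file is safe to use in the Inference form.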
        

Now we should see a text field for the file name and a Save Model button in both Regression and Classification modes.

Let's save some models, load them in Inference mode, and try predicting with them. Our output should look like below:

Conclusion

It has been a long ride over the past few projects: we did some EDA, then explored possible ML models to predict and cluster the data. In this part, we brought all of that into a web app, and we can now make predictions from the web app by entering values into a form.

Quassarian Viper
