Writing a Deep Neural Network from Scratch in Python
Contents
 Updates:
 1.1 What I am covering on this blog?
 2. Warning
 3. Steps
 3.1 Creating a FF layer class.
 3.1.1 As usual, importing necessary requirements.
 3.1.2 Next, create a class and initialize it with possible parameters.
 3.1.3 Now prepare the activation functions. For beginning, we will use only few.
 3.1.4 Next create a method to perform activation.
 3.1.5 Next create a method to set the new weight vector.
 3.1.6 Next create a method to get total parameters of this layer:
 3.1.7 Now create a method which will call the above get_parameters and set_n_input and also do an additional task.
 3.1.8 Finally, last but not least, a backpropagation method of this layer.
 3.2 Writing a stacking class
 3.2.1 Initializing a class.
 3.2.2 Writing a method for stacking layers.
 3.2.3 Lets write a method for a summary. And yes we will test it right now.
 3.2.4 Train Method
 3.2.5 Training Method
 3.2.6 Write a feedforward method.
 3.2.7 Next we need a method to find error. We have few error functions on our assumption.
 3.2.8 Backpropagate method:
 4. Let’s do something interesting.
Let's write a multi-layer deep neural network from scratch in Python. But why write a neural network from scratch when there are already tools like Keras? It is a good exercise as well as a refresher on how neural networks work. To build one from scratch, we need some knowledge of calculus, linear algebra and OOP (as used in this blog).
I am not using gists for the code, so don't panic if you find unfriendly text formats. I have also written this blog in the Markdown of Jupyter Notebook, so the formats are a bit different. But the truth is, the class we will be building will be just like keras. Yes, Keras!
Updates:
 2020/05/29: Published blog.
 2022/11/10: Fixed errors in derivative.
1.1 What I am covering on this blog?
 Honestly, this is yet another scary blog about writing a neural network from scratch, but I am leaving out all the complex mathematics (and giving links to it at the end).
 This blog will also act as the prerequisite for the Convolutional Neural Network from scratch, which I will write about in the next blog.
 Doing MNIST classification using softmax cross entropy and the GD optimizer.
 Saving and loading the model.
For this code, I will be using:
 numpy
 matplotlib for plotting
 pandas for just the summary
 time for viewing time
2. Warning
I am not going to make the next Keras or Tensorflow here. This is only going to be a simple multilayer neural network from scratch.
These days we have many ML frameworks to choose from, from high level to low level. PyTorch has recently gained huge popularity, but for beginners Keras is the best choice. Writing ML code and a neural network from scratch, however, is always challenging and complex, even for intermediate programmers. The mathematics behind the cute ML frameworks is scary. But once we understand the prerequisites of ML, coding a neural network from scratch is a good idea.
3. Steps
 Create a FF layer class.
 Create a NN class which will bind the FF layers together and also handle training.
OOP is an awesome feature of Python: using an object of the FF layer class, we can access its attributes and methods anywhere, at any time.
3.1 Creating a FF layer class.
3.1.1 As usual, importing necessary requirements.
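import numpy as np
import pandas as pd  # used later for the model summary
import time  # used later for timing each epoch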
3.1.2 Next, create a class and initialize it with possible parameters.
class FFL:
    def __init__(self, input_shape=None, neurons=1, bias=None, weights=None, activation=None, is_bias=True):
        np.random.seed(100)  # fixed seed so the random initial weights are reproducible
        self.input_shape = input_shape
        self.neurons = neurons
        self.isbias = is_bias
        self.name = ""
        self.w = weights
        self.b = bias
        if input_shape is not None:
            self.output_shape = neurons
            # "is not None" because comparing a NumPy array with != None is element-wise
            self.weights = weights if weights is not None else np.random.randn(self.input_shape, neurons)
            # weights plus biases; the parentheses stop the conditional from swallowing the whole sum
            self.parameters = self.input_shape * self.neurons + (self.neurons if self.isbias else 0)
        if is_bias:
            self.biases = bias if bias is not None else np.random.randn(neurons)
        else:
            self.biases = 0
        self.out = None
        self.input = None
        self.error = None
        self.delta = None
        activations = ["relu", "sigmoid", "tanh", "softmax"]
        self.delta_weights = 0
        self.delta_biases = 0
        self.pdelta_weights = 0
        self.pdelta_biases = 0
        if activation not in activations and activation is not None:
            raise ValueError(f"Activation function not recognised. Use one of {activations} instead.")
        else:
            self.activation = activation
 input_shape: The number of inputs, i.e. the number of neurons in the previous layer.
 neurons: How many neurons are on this layer?
 activation: What activation function to use?
 bias: A bias value, used if is_bias is True.
 is_bias: Will we use bias?
3.1.2.1 Inside __init__
 self.name: To store the name of this layer.
 self.weights: The connection strengths (weights) from the previous layer to this layer. Taken from np.random.randn(n_input, neurons) if not given.
 self.biases: The bias values of this layer.
 self.out: Output of this layer.
 self.input: Input to this layer. It is the input data for the input layer, and the output of the previous layer for all others.
 self.error: Error term of this layer.
 self.delta_weights: \begin{equation}\delta{w}\end{equation}
 self.delta_biases: \begin{equation}\delta{b}\end{equation}
 self.pdelta_weights: Previous self.delta_weights
 self.pdelta_biases: Previous self.delta_biases
 activations: A list of possible activation functions. If the given activation function is not recognised, raise an error.
 self.activation: A variable to store the activation function of this layer.
3.1.3 Now prepare the activation functions. For beginning, we will use only few.
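    def activation_fn(self, r):
        """
        A method of FFL which contains the operation and definition of a given activation function.
        """
        if self.activation is None or self.activation == "linear":
            return r
        if self.activation == 'relu':  # relu
            return np.maximum(r, 0)
        if self.activation == 'tanh':  # tanh
            return np.tanh(r)
        if self.activation == 'sigmoid':  # sigmoid
            return 1 / (1 + np.exp(-r))
        if self.activation == "softmax":  # stable softmax
            # subtract the max before exponentiating to avoid overflow;
            # working along the last axis keeps each row normalised when a whole batch is passed
            r = r - np.max(r, axis=-1, keepdims=True)
            s = np.exp(r)
            return s / np.sum(s, axis=-1, keepdims=True)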
Recall the mathematics,
\begin{equation} i. tanh(soma) = \frac{exp^{(soma)} - exp^{(-soma)}}{exp^{(soma)} + exp^{(-soma)}} \end{equation}
\begin{equation} ii. linear(soma) = soma \end{equation}
\begin{equation} iii. sigmoid(soma) = \frac{1}{1 + exp^{(-soma)}} \end{equation}
\begin{equation} iv. relu(soma) = \max(0, soma) \end{equation}
\begin{equation} v. softmax(x_j) = \frac{exp^{(x_j)}}{\sum_{i=1}^n{exp^{(x_i)}}} \end{equation}
\begin{equation} Where, soma = XW + \theta \end{equation}
Here W is the weight matrix of shape (n, w), X is the input of shape (m, n) and 𝜃 is the bias term of shape (w,), i.e. one bias per neuron.
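To make the shapes concrete, here is a minimal sketch of the linear combination and a numerically stable softmax. The names X, W, theta and stable_softmax are only illustrative and are not part of the layer class.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 3))   # m = 4 examples, n = 3 features
W = rng.normal(size=(3, 2))   # n = 3 inputs, w = 2 neurons
theta = rng.normal(size=2)    # one bias per neuron

soma = X @ W + theta          # shape (4, 2); theta is broadcast over the rows

def stable_softmax(z):
    # subtracting the row maximum does not change the result but avoids overflow
    z = z - np.max(z, axis=-1, keepdims=True)
    e = np.exp(z)
    return e / np.sum(e, axis=-1, keepdims=True)

probs = stable_softmax(soma)                       # every row sums to 1
big = stable_softmax(np.array([1000.0, 1001.0]))   # stays finite
```

Without the subtraction, np.exp(1000.0) would overflow to inf and the division would produce nan.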
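    def activation_dfn(self, r):
        """
        A method of FFL to find the derivative of a given activation function.
        Here r is the output of the layer, i.e. the already activated value.
        """
        if self.activation is None or self.activation == "linear":
            return np.ones(r.shape)
        if self.activation == 'tanh':
            return 1 - r ** 2
        if self.activation == 'sigmoid':
            return r * (1 - r)
        if self.activation == 'softmax':
            soft = self.activation_fn(r)
            return soft * (1 - soft)
        if self.activation == 'relu':
            # r is the relu output, so r > 0 exactly where the input was positive;
            # np.where returns a new array instead of overwriting the stored output
            return np.where(r > 0, 1.0, 0.0)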
Let’s revise a bit of calculus.
3.1.3.2 Why do we need a derivative?
Well, if you are here then you already know that gradient descent is based upon the derivatives(gradients) of activation functions and errors. So we need to perform this derivative. But you are on your own to perform calculations. I will also explain the gradient descent later.
\begin{equation} i. \frac{d(linear(x))}{d(x)} = 1 \end{equation}
\begin{equation} ii. \frac{d(sigmoid(x))}{d(x)} = sigmoid(x)(1 - sigmoid(x)) \end{equation}
\begin{equation} iii. \frac{d(tanh(x))}{d(x)} = 1 - tanh^2(x) \end{equation}
\begin{equation} iv. \frac{d(relu(x))}{d(x)} = 1\ if\ x>0\ else\ 0 \end{equation}
\begin{equation}
v. \frac{d(softmax(x_j))}{d(x_k)} = softmax(x_j)(1 - softmax(x_j)) \space when \space j = k \space else
\space -softmax({x_j}).softmax({x_k})
\end{equation}
For the sake of simplicity, we use the case of j = k for softmax.
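If you would rather check these derivatives than trust them, a finite-difference comparison is a quick way to do it. The sketch below is only an illustration (the helper sigmoid and the step h are my own choices) and compares the sigmoid and tanh formulas against a central difference.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

x = np.linspace(-3, 3, 7)
h = 1e-5  # small step for the central difference

# numerical derivative: (f(x + h) - f(x - h)) / (2h)
num_sig = (sigmoid(x + h) - sigmoid(x - h)) / (2 * h)
ana_sig = sigmoid(x) * (1 - sigmoid(x))

num_tanh = (np.tanh(x + h) - np.tanh(x - h)) / (2 * h)
ana_tanh = 1 - np.tanh(x) ** 2

print(np.max(np.abs(num_sig - ana_sig)))
print(np.max(np.abs(num_tanh - ana_tanh)))
```

Both printed differences should be tiny, which confirms the analytical forms.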
3.1.4 Next create a method to perform activation.
    def apply_activation(self, x):
        soma = np.dot(x, self.weights) + self.biases
        self.out = self.activation_fn(soma)
        return self.out  # returned so that the stacking class can pass it to the next layer
This method takes the input vector x and performs the linear combination and then applies activation function to this value. The soma term is the total input to this node.
3.1.5 Next create a method to set the new weight vector.
This method is called when this layer is hidden. If a layer is hidden, we won't give it an input shape but only the number of neurons on this layer. So the input shape must be set from the previous layer, and the weights must then be initialized to match. This method is used when we stack the layers to make a sequential model.
    def set_n_input(self):
        # input_shape has been filled in by the stacking class before this call
        self.weights = self.w if self.w is not None else np.random.normal(size=(self.input_shape, self.neurons))
I think we have made a simple Feedforward layer. Now is the time for us to create a class which can stack these layers together and also perform operations like train.
3.1.6 Next create a method to get total parameters of this layer:
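    def get_parameters(self):
        # parentheses are needed: without them the whole sum becomes 0 when there is no bias
        self.parameters = self.input_shape * self.neurons + (self.neurons if self.isbias else 0)
        return self.parameters
The total parameters of a layer is the total number of weights plus the total number of biases.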
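As a quick worked example of that count, here is a standalone sketch. The function count_parameters is hypothetical and only mirrors the formula above.

```python
def count_parameters(n_input, neurons, use_bias=True):
    weights = n_input * neurons          # one weight per input-neuron connection
    biases = neurons if use_bias else 0  # one bias per neuron, if biases are used
    return weights + biases

print(count_parameters(784, 10))         # 7850
print(count_parameters(784, 10, False))  # 7840
```

One thing to keep in mind when writing this as a one-liner: in Python, `a * b + c if flag else 0` parses as `(a * b + c) if flag else 0`, so the conditional part needs its own parentheses.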
3.1.7 Now create a method which will call the above get_parameters and set_n_input and also do an additional task.
    def set_output_shape(self):
        self.set_n_input()
        self.output_shape = self.neurons
        self.get_parameters()
This method will be called from the stacking class. I have made this method identical to the one in the CNN layers.
3.1.8 Finally, last but not least, a backpropagation method of this layer.
Note that every layer has a different way of passing error backwards. I have done CNN from scratch hence I am making this article to support that one also.
    def backpropagate(self, nx_layer):
        self.error = np.dot(nx_layer.weights, nx_layer.delta)
        self.delta = self.error * self.activation_dfn(self.out)
        self.delta_weights += self.delta * np.atleast_2d(self.input).T
        self.delta_biases += self.delta
Here, nx_layer is the next layer. Let me share a little equation from Tom M. Mitchell's ML book (page 80+).
If the layer is the output layer, then its error is the final error: \begin{equation} \delta_j = -\frac{d(E_j)}{d(o_j)} f^1(o_j) \end{equation} And for all hidden and input layers: \begin{equation} \delta_j = -\frac{d(E_j)}{d(net_j)} = f^1(o_j) \sum_{k=downstream(j)} \delta_k w_{kj} \end{equation}
Note that if this layer is the output layer, then the error will be the final error and we will not call this method. The term 𝑑(𝐸𝑗)/𝑑(𝑜𝑗) is the derivative of the error function w.r.t. the output, and f^1 denotes the first derivative of the activation function. I will share some explanations later under Gradient Descent.
Going back to our method backpropagate: it is called only when this layer is not the final layer, otherwise the next layer won't exist. Let's take a look at self.error. It is brought to this layer from the layer immediately after it, called downstream(j) here. Then we find the delta term. For that we need the first derivative of the activation function of this layer, which we compute from the layer's output. Once delta for this layer is found, we get delta_weights by multiplying delta with this layer's most recent input. Similarly, delta_biases is just the term delta. Note that the length of delta equals the total number of neurons on this layer.
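The shapes involved in this step are easy to get wrong, so here is a small standalone sketch of the same three lines with made-up sizes. All variable names and sizes here are illustrative assumptions, and a tanh hidden layer is assumed.

```python
import numpy as np

rng = np.random.default_rng(1)
n_in, n_hidden, n_out = 4, 3, 2

layer_input = rng.normal(size=n_in)               # most recent input to the hidden layer
hidden_out = np.tanh(rng.normal(size=n_hidden))   # output of the hidden layer
next_weights = rng.normal(size=(n_hidden, n_out)) # weights of the next layer
next_delta = rng.normal(size=n_out)               # delta of the next layer

error = next_weights @ next_delta                 # shape (n_hidden,)
delta = error * (1 - hidden_out ** 2)             # tanh derivative written in terms of the output
delta_weights = delta * np.atleast_2d(layer_input).T  # shape (n_in, n_hidden)

print(error.shape, delta.shape, delta_weights.shape)
```

The broadcasted product on the last line is the outer product of the input and delta, so np.outer(layer_input, delta) gives the same matrix.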
3.2 Writing a stacking class
AHHHH long journey Aye!!
We will name it NN. And we will perform all training operations under this class.
3.2.1 Initializing a class.
(Note: any up-front guess of how many attributes we need will be wrong; you might use fewer than you initialized, or add more later on.) Please follow the comments in the code below for explanation.
class NN:
    def __init__(self):
        self.layers = []  # a list to stack all the layers
        self.info_df = {}  # this dictionary will store the information of our model
        self.column = ["LName", "Input", "Output", "Activation", "Bias"]  # this list will be used as the header of our summary
        self.parameters = 0  # how many parameters do we have?
        self.optimizer = ""  # what optimizer are we using?
        self.loss = ""  # what loss function are we using?
        self.all_loss = {}  # loss through every epoch, needed for visualizing
        self.lr = 1  # learning rate
        self.metrics = []
        self.av_optimizers = ["sgd", "iterative", "momentum", "rmsprop", "adagrad", "adam", "adamax", "adadelta"]  # available optimizers
        self.av_metrics = ["mse", "accuracy", "cse"]  # available metrics
        self.av_loss = ["mse", "cse"]  # available loss functions
        self.iscompiled = False  # if model is compiled
        self.batch_size = 8  # batch size of input
        self.mr = 0.0001  # momentum rate, often called velocity
        self.all_acc = {}  # all accuracy
        self.eps = 1e-8  # epsilon, often used to avoid divide by 0.
And hold on, we will write all optimizers from scratch too.
3.2.2 Writing a method for stacking layers.
    def add(self, layer):
        if len(self.layers) > 0:
            prev_layer = self.layers[-1]  # the most recently added layer
            if prev_layer.name != "Input Layer":
                prev_layer.name = f"Hidden Layer{len(self.layers) - 1}"
            if layer.input_shape is None:
                layer.input_shape = prev_layer.output_shape
                layer.set_output_shape()
            layer.name = "Output Layer"
            if prev_layer.neurons != layer.input_shape and layer.input_shape is not None:
                raise ValueError(f"This layer '{layer.name}' must have neurons={prev_layer.neurons} because '{prev_layer.name}' has output of {prev_layer.neurons}.")
        else:
            layer.name = "Input Layer"
        self.layers.append(layer)
Lots of dumb things are happening in this method. It takes a layer object and stacks it after the previous layer.
First, we check whether self.layers already holds any layers. If it does, we set prev_layer to the last of them. If the name of prev_layer is not "Input Layer", we rename it to "Hidden Layer" followed by its index. If this layer's input shape is None, we set it to the output shape (number of neurons) of prev_layer, because any hidden layer takes the output of the previous layer as its input. Then we call the set_output_shape method for weight initialization and other tasks. Note that the number of bias terms is equal to the number of neurons, so we don't have to set them like this. But if this layer's input shape is given and it doesn't match the number of neurons of the previous layer, the model is invalid and we throw an error.
Second, if we have 0 layers, then this is obviously the input layer. We name it so.
Finally, we make a stack of layers (not the stack data structure, just a list) by appending the layer to the list of layers.
3.2.3 Lets write a method for a summary. And yes we will test it right now.
    def summary(self):
        lname = []
        linput = []
        lneurons = []
        lactivation = []
        lisbias = []
        self.parameters = 0  # reset so that calling summary twice does not double the count
        for layer in self.layers:
            lname.append(layer.name)
            linput.append(layer.input_shape)
            lneurons.append(layer.neurons)
            lactivation.append(layer.activation)
            lisbias.append(layer.isbias)
            self.parameters += layer.parameters
        model_dict = {"Layer Name": lname, "Input": linput, "Neurons": lneurons, "Activation": lactivation, "Bias": lisbias}
        model_df = pd.DataFrame(model_dict).set_index("Layer Name")
        print(model_df)
        print("Total Parameters: ", self.parameters)
I am taking help from the pandas library here: instead of hand-writing table-like output, why not use a table? Nothing huge is happening here. We create a separate list for layer name, input shape, neurons, activation and bias, and append every layer's value to it. After collecting every attribute value from every layer, we create a dictionary with the right keys. Then BAAAAM! We create a dataframe and set its index to Layer Name.
Let's write an example:
model = NN()
model.add(FFL(input_shape=28*28, neurons=10, activation="softmax"))  # a positional argument cannot follow a keyword argument
model.summary()
If there are no errors, then let’s proceed.
3.2.4 Train Method
After all, what use are all those fancy methods if we still have no train method?
But before that, let's create a method to check if our dataset meets the requirements of the model.
    def check_trainnable(self, X, Y):
        if self.iscompiled == False:
            raise ValueError("Model is not compiled.")
        if len(X) != len(Y):
            raise ValueError("Length of training input and label is not equal.")
        if X[0].shape[0] != self.layers[0].input_shape:
            layer = self.layers[0]
            raise ValueError(f"'{layer.name}' expects input of {layer.input_shape} while {X[0].shape[0]} is given.")
        if Y.shape[1] != self.layers[-1].neurons:  # the label width must match the output layer
            op_layer = self.layers[-1]
            raise ValueError(f"'{op_layer.name}' expects output of {op_layer.neurons} while {Y.shape[1]} is given.")
This method takes the training input and labels, and if all is good then we can walk proudly to the train method. First we check if the model is compiled. Model compilation is done by another method, which is presented below. Then there are other error cases; please see the message inside each ValueError for explanation.
Let's write a compiling method, shall we?
What this method should do is prepare an optimizer, a loss function, the learning rate and so on.
    def compile_model(self, lr=0.01, mr = 0.001, opt = "sgd", loss = "mse", metrics=['mse']):
        if opt not in self.av_optimizers:
            raise ValueError(f"Optimizer is not understood, use one of {self.av_optimizers}.")
        for m in metrics:
            if m not in self.av_metrics:
                raise ValueError(f"Metrics is not understood, use one of {self.av_metrics}.")
        if loss not in self.av_loss:
            raise ValueError(f"Loss function is not understood, use one of {self.av_loss}.")
        self.loss = loss
        self.lr = lr
        self.mr = mr
        self.metrics = metrics
        self.iscompiled = True
        self.optimizer = Optimizer(layers=self.layers, name=opt, learning_rate=lr, mr=mr)
        self.optimizer = self.optimizer.opt_dict[opt]
This method is under development, but the important part here is the last two lines. Optimizer(layers=self.layers, name=opt, learning_rate=lr, mr=mr) is a class which encapsulates all our optimizers. When the class is instantiated, it also initializes the terms that the chosen optimizer needs. I will provide that code also, but first let's take a glimpse.
class Optimizer:
    def __init__(self, layers, name=None, learning_rate = 0.01, mr=0.001):
        self.name = name
        self.learning_rate = learning_rate
        self.mr = mr
        keys = ["sgd"]
        values = [self.sgd]
        self.opt_dict = {keys[i]:values[i] for i in range(len(keys))}
        if name != None and name in keys:
            self.opt_dict[name](layers=layers, training=False)
    def sgd(self, layers, learning_rate=0.01, beta=0.001, training=True):
        learning_rate = self.learning_rate
        for l in layers:
            if l.parameters !=0:
                if training:
                    l.weights += l.pdelta_weights*self.mr + l.delta_weights * learning_rate
                    l.biases += l.pdelta_biases*self.mr + l.delta_biases * learning_rate
                    l.pdelta_weights = l.delta_weights
                    l.pdelta_biases = l.delta_biases
We will be using only Gradient Descent here. I will provide code for the other optimizers in another blog. The thing to note is that we create each key as a plain string naming the optimizer, and its value is the corresponding method.
3.2.4.1 Gradient Descent
(Refer to chapter 4 (page 80) of Machine Learning by Tom M. Mitchell.)
For the weight update, we use this concept along with backpropagation. Let's first prepare the notation. \begin{equation} E_j\ is\ error\ function. \end{equation} \begin{equation} net_j\ is\ soma\ i.e. XW + \theta\ or\ \sum_i{w_{ji}x_{ji}} \end{equation} \begin{equation} o_j\ is\ the\ output\ of\ unit\ j\ due\ to\ the\ activation\ function\ i.e.\ o_j = f(net_j) \end{equation} \begin{equation} t_j\ is\ target\ for\ j \end{equation} \begin{equation} w_{ji}\ is\ the\ weight\ of\ the\ i^{th}\ input\ to\ unit\ j. \end{equation} \begin{equation} x_{ji}\ is\ the\ i^{th}\ input\ to\ unit\ j. \end{equation}
Note that 𝑑(𝐸𝑗)/𝑑(𝑤𝑗𝑖) varies with the case, depending on whether the jth unit is an output unit or an internal (hidden) unit.

Case 1: j is an output unit. \begin{equation} \frac{d(E_j)}{d(w_{ji})} = \frac{d(E_j)}{d(net_j)} \frac{d(net_j)}{d(w_{ji})} \end{equation} \begin{equation} \ = \frac{d(E_j)}{d(net_j)} x_{ji} \end{equation} \begin{equation} \ = \frac{d(E_j)}{d(o_j)} \frac{d(o_j)}{d(net_j)}x_{ji} \end{equation} \begin{equation} \ = \frac{d(E_j)}{d(o_j)} f^1(o_j) x_{ji} \end{equation}
Case 2: j is a hidden unit.
We have to refer to the set of all units immediately downstream of unit j (i.e. all units whose direct inputs include the output of unit j), denoted by downstream(j). net_j can influence the network output only through downstream(j).
\begin{equation} \frac{d(E_j)}{d(net_j)} = \sum_{k=downstream(j)} \frac{d(E)}{d(net_k)} \frac{d(net_k)}{d(net_j)} \end{equation} \begin{equation} \ = \sum_{k=downstream(j)} -\delta_k \frac{d(net_k)}{d(o_j)} \frac{d(o_j)}{d(net_j)} \end{equation} \begin{equation} \ = \sum_{k=downstream(j)} -\delta_k w_{kj} f^1(o_j) \end{equation} \begin{equation} \ reordering\ terms, \end{equation} \begin{equation} \ \delta_j = -\frac{d(E_j)}{d(net_j)} = f^1(o_j) \sum_{k=downstream(j)} \delta_k w_{kj} \end{equation}
And the weight update term for all units is:
\begin{equation}
\triangle w_{ji} = \alpha \delta_j x_{ji}
\end{equation}
\begin{equation}
\ when\ momentum\ term\ is\ applied,
\end{equation}
\begin{equation}
\triangle w_{ji}(n) = \alpha \delta_j x_{ji} + \beta \triangle w_{ji}(n-1)
\end{equation}
\begin{equation}
\ \alpha\ is\ learning\ rate\ and\ \beta\ is\ momentum\ rate
\end{equation}
\begin{equation}
\delta_j\ formula\ varies\ with\ the\ unit\ being\ output\ or\ internal.
\end{equation}
\begin{equation}
w_{ji} = w_{ji} + \triangle w_{ji}
\end{equation}
The sign is + because \delta_j is already the negative gradient, which matches the += used in the sgd method.
The Gradient Descent algorithm will be easier to understand after we specify the activation function and the loss function, which I will be covering in the parts below.
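Before moving on, here is a tiny sketch of the update rule with a momentum term, applied to a one-dimensional error E(w) = (w - 3)^2. This is not the optimizer of this blog; the function minimise, the target 3.0 and the default rates are assumptions made only for illustration.

```python
def minimise(alpha=0.1, beta=0.5, steps=200):
    """Gradient descent with momentum on E(w) = (w - 3)^2."""
    w, prev_dw = 0.0, 0.0
    for _ in range(steps):
        delta = -2.0 * (w - 3.0)              # delta is the negative gradient of E
        dw = alpha * delta + beta * prev_dw   # current step plus a fraction of the previous step
        w += dw                               # plus sign, since delta is already negative gradient
        prev_dw = dw
    return w

w_final = minimise()
w_plain = minimise(beta=0.0)  # the same rule without momentum
print(w_final, w_plain)
```

Both runs end up very close to 3, the minimum of E. The momentum term only changes the path taken to get there.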
3.2.5 Training Method
    def train(self, X, Y, epochs, show_every=1, batch_size = 32, shuffle=True):
        self.check_trainnable(X, Y)
        self.batch_size = batch_size
        t1 = time.time()
        len_batch = int(len(X)/batch_size)
        batches = []
        curr_ind = np.arange(0, len(X), dtype=np.int32)
        if shuffle:
            np.random.shuffle(curr_ind)
        if len(curr_ind) % batch_size != 0:
            len_batch+=1
        batches = np.array_split(curr_ind, len_batch)
        for e in range(epochs):
            err = []
            for batch in batches:
                curr_x, curr_y = X[batch], Y[batch]
                b = 0
                batch_loss = 0
                for x, y in zip(curr_x, curr_y):
                    out = self.feedforward(x)
                    loss, error = self.apply_loss(y, out)
                    batch_loss += loss
                    err.append(error)
                    update = False
                    # update on the last example of the batch;
                    # batches from array_split can be smaller than batch_size
                    if b == len(curr_x) - 1:
                        update = True
                        loss = batch_loss/batch_size
                    self.backpropagate(loss, update)
                    b+=1
            if e % show_every == 0:
                out = self.feedforward(X)
                loss, error = self.apply_loss(Y, out)
                out_activation = self.layers[-1].activation
                print(out_activation)
                if out_activation == "softmax":
                    pred = out.argmax(axis=1) == Y.argmax(axis=1)
                elif out_activation == "sigmoid":
                    pred = out > 0.7
                elif out_activation == None:
                    pred = abs(Y - out) < 0.000001
                self.all_loss[e] = round(error.mean(), 4)
                self.all_acc[e] = round(pred.mean() * 100, 4)
                print(f"Time: {round(time.time() - t1, 3)}sec")
                t1 = time.time()
                print('Epoch: #%s, Loss: %f' % (e, round(error.mean(), 4)))
                print(f"Accuracy: {round(pred.mean() * 100, 4)}%")
Alright folks, this is the train method. I hope you are not scared of the size. Some major steps:
 Check if the dataset is trainable or not.
 Start a timer (or should we start the timer after making batches?).
 Create the indices of the dataset.
 If shuffle, then we shuffle the indices.
 Then we create the indices for each batch. Most batches have the same size; for the odd cases, np.array_split does the work.
 On every epoch:
  For each batch:
   For each x, y in the batch:
    Feedforward the example (method is given below).
    Find the loss for the last layer and the error (method is given below).
    Add the loss to the batch loss.
    If the current example is the last of the batch, then we will update the parameters.
    Backpropagate the error of the current example (the backpropagate method is given below).
  If we want to show on this epoch:
   Feedforward the whole train set and take the training output.
   Find the train error.
   Find the accuracy.
   Take the average of error and accuracy and show them.
   Store the loss and accuracy of this epoch (we will visualize them later).
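The batching step in the list above can be tried in isolation. The sketch below uses a hypothetical helper, make_batches, with a fixed seed chosen only for reproducibility; it follows the same idea of shuffling the indices and splitting them with np.array_split.

```python
import numpy as np

def make_batches(n_samples, batch_size, shuffle=True, seed=0):
    indices = np.arange(n_samples)
    if shuffle:
        np.random.default_rng(seed).shuffle(indices)
    n_batches = int(np.ceil(n_samples / batch_size))  # round up so no sample is dropped
    return np.array_split(indices, n_batches)         # batches differ in size by at most one

batches = make_batches(10, 4)
sizes = [len(b) for b in batches]
print(sizes)  # [4, 3, 3]
```

Notice that with 10 samples and a batch size of 4, two of the three batches hold only 3 samples. So a check such as "is this the last example of the batch" is safer when it uses the actual length of the batch rather than the requested batch size.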
3.2.6 Write a feedforward method.
    def feedforward(self, x):
        for l in self.layers:
            l.input = x
            x = l.apply_activation(x)
            l.out = x
        return x
Nothing strange is happening here. We take the input vector of a single example and pass it to the first layer. We set the input of that layer to x and get the output of the layer, and also set out of the layer to the output given by its apply_activation method. Note that we need the output of every layer for backpropagation, and the output of one layer acts as the input to the next. When there are no layers left, we return the output of the last layer (the output layer) as the output for this input.
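The same chaining can be sketched without any classes. Here each layer is just a (weights, biases, activation) tuple; the function forward and the layer sizes are made up for illustration only.

```python
import numpy as np

def forward(x, layers):
    outputs = []                # keep every layer's output, as backpropagation needs them
    for W, b, act in layers:
        x = act(x @ W + b)      # the output of one layer is the input of the next
        outputs.append(x)
    return x, outputs

rng = np.random.default_rng(2)
layers = [
    (rng.normal(size=(5, 4)), np.zeros(4), np.tanh),        # hidden layer: 5 inputs, 4 neurons
    (rng.normal(size=(4, 3)), np.zeros(3), lambda z: z),    # linear output layer: 3 neurons
]
y, outs = forward(rng.normal(size=5), layers)
print(y.shape, [o.shape for o in outs])
```

The final output has one value per neuron of the last layer, and the stored intermediate outputs line up with the layers in order.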
3.2.7 Next we need a method to find error. We have few error functions on our assumption.
    def apply_loss(self, y, out):
        if self.loss == "mse":
            loss = y - out
            mse = np.mean(np.square(loss))
            return loss, mse
        if self.loss == 'cse':
            # Requires out to be probability values.
            if len(out) == len(y) == 1:  # binary CSE
                cse = -(y * np.log(out) + (1 - y) * np.log(1 - out))
                loss = (y / out - (1 - y) / (1 - out))
            else:  # categorical CSE
                if self.layers[-1].activation == "softmax":
                    # if the output layer's function is softmax then the loss is y - out;
                    # check the derivation of softmax and cross entropy with derivative
                    loss = y - out
                    loss = loss / self.layers[-1].activation_dfn(out)
                else:
                    y = np.float64(y)
                    out += self.eps
                    loss = (np.nan_to_num(y / out) - np.nan_to_num((1 - y) / (1 - out)))
                cse = -np.sum((y * np.nan_to_num(np.log(out)) + (1 - y) * np.nan_to_num(np.log(1 - out))))
            return loss, cse
The code is pretty weird but the math is cute.

MSE (Mean Squared Error): the mean of the squared error. \begin{equation} E = \frac{1}{m} \sum_{i=1}^m(t_i - o_i)^2 \end{equation} where o is the output of the model and t is the target or true label.
CSE (Cross Entropy): good for penalizing bad predictions more. \begin{equation} E = -\frac{1}{m}\sum_{i=1}^{m} \left[ y^i log(h_{\theta}(x^i)) + (1-y^i)log(1-h_{\theta}(x^i)) \right] \end{equation} The loss value returned by apply_loss (its first return value) is the term required for gradient descent, i.e. the negative derivative of E with respect to the output, up to a constant factor. It will be clear by viewing Gradient Descent.
Recall the delta term from Gradient Descent. As the delta term depends upon the derivative of the error function w.r.t. the weight, we need to find it. In fact, our target is to find the term 𝑑(𝐸𝑗)/𝑑(𝑜𝑗). It is not that hard, by the way.
i. MSE
\begin{equation}
\frac{d(E_j)}{d(o_j)} = \frac{d\frac{1}{m} \sum_{i=1}^m(t_i - o_i)^2}{d(o_j)}
\end{equation}
\begin{equation}
above\ term\ is\ 0\ for\ all\ except\ i=j
\end{equation}
\begin{equation}
\therefore\ \frac{d(E_j)}{d(o_j)} = \frac{d\frac{1}{m} (t_j - o_j)^2}{d(o_j)}
\end{equation}
\begin{equation}
\ = -\frac{2}{m}(t_j - o_j)
\end{equation}
\begin{equation}
dropping\ the\ constant\ \frac{2}{m}\ (it\ is\ absorbed\ by\ the\ learning\ rate),\ -\frac{d(E_j)}{d(o_j)} = (t_j - o_j)
\end{equation}
\begin{equation}
\ and\ term\ \delta_j = -\frac{d(E_j)}{d(net_j)} = (t_j - o_j) f^1(o_j)
\end{equation}
ii. CSE
I am skipping the long derivation, but note that d(log(x))/d(x) = 1/x.
\begin{equation}
E = -\frac{1}{m}\sum_{i=1}^{m} \left[ t_i log(o_i) + (1-t_i)log(1-o_i) \right]
\end{equation}
\begin{equation}
\ now\ the\ term\ \frac{d(E_j)}{d(o_j)} = -\frac{t_j}{o_j} + \frac{1-t_j}{1-o_j}\ will\ be\ calculated\ (again\ dropping\ the\ constant\ \frac{1}{m}).
\end{equation}
Now going back to our code, what if we have the activation function softmax for the output layer? We will be using its derivative as softmax(1 - softmax). Here softmax is o. So if we rearrange terms, 𝑑(𝐸𝑗)/𝑑(𝑜𝑗) = (o - t)/(o(1 - o)). Multiplying by the derivative o(1 - o), the term 𝑑(𝐸𝑗)/𝑑(𝑛𝑒𝑡𝑗) is simply (o - t) when using softmax and cross entropy, which is why the code works with its negative, y - out.
np.nan_to_num turns nan values into 0 and infinite values (which we get from log(0) or from a division by zero) into very large finite numbers.
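The (o - t) result is easy to verify numerically. The sketch below is an illustration under one stated assumption: it uses the categorical form of cross entropy, E = -sum(t * log(o)), together with the full softmax, rather than the element-wise form used above. The input values z and the one-hot target t are made up.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

def cross_entropy(z, t):
    # categorical cross entropy of softmax(z) against a one-hot target t
    return -np.sum(t * np.log(softmax(z)))

z = np.array([0.5, -1.0, 2.0])   # net input of the output layer
t = np.array([0.0, 0.0, 1.0])    # one-hot target
h = 1e-6
eye = np.eye(3)

# central difference of E with respect to each net input
numeric = np.array([
    (cross_entropy(z + h * eye[k], t) - cross_entropy(z - h * eye[k], t)) / (2 * h)
    for k in range(3)
])
analytic = softmax(z) - t        # the (o - t) result

print(np.max(np.abs(numeric - analytic)))
```

The printed difference should be close to zero, so the gradient with respect to the net input really is output minus target.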
3.2.8 Backpropagate method:
    def backpropagate(self, loss, update = True):
        for i in reversed(range(len(self.layers))):
            layer = self.layers[i]
            if layer == self.layers[-1]:  # the output layer takes the loss directly
                layer.error = loss
                layer.delta = layer.error * layer.activation_dfn(layer.out)
                layer.delta_weights += layer.delta * np.atleast_2d(layer.input).T
                layer.delta_biases += layer.delta
            else:
                nx_layer = self.layers[i+1]
                layer.backpropagate(nx_layer)
            if update:
                layer.delta_weights /= self.batch_size
                layer.delta_biases /= self.batch_size
        if update:
            self.optimizer(self.layers)
            self.zerograd()
This method is called per example, on every batch, on every epoch. We pass the loss of the model and the update flag; the method then runs over every layer and updates the delta terms for all parameters. More simply:
 For every layer from output to input:
  If this layer is the output layer, find the delta term now.
  If this layer isn't the output layer, call the backpropagate method of that layer and pass the next layer to it. (I have already provided an individual backpropagate method for the Feedforward layer.)
  If we want to update the parameters now, then average the delta terms.
 If we are updating, then call the optimizer method. If we look back at the compile_model method, we can see that self.optimizer holds a reference to a method of the Optimizer class. We pass all the layers to it.
 Now that we have updated our parameters, we need to zero all the gradient terms. So we have another method, zerograd.
    def zerograd(self):
        for l in self.layers:
            l.delta_weights = 0
            l.delta_biases = 0
It is pretty simple here. But once we are working with more than one type of layer, it will get messy.
from keras.datasets import mnist
(x_train, y_train), (x_test, y_test) = mnist.load_data()
x = x_train.reshape(-1, 28 * 28)  # flatten every 28x28 image into a vector of 784 values
x = (x - x.mean(axis=1).reshape(-1, 1)) / x.std(axis=1).reshape(-1, 1)  # standardize each image
y = pd.get_dummies(y_train).to_numpy()  # one hot encoded labels
m = NN()
m.add(FFL(784, 10, activation="softmax"))
m.compile_model(lr=0.01, opt="sgd", loss="cse", mr= 0.001)
m.summary()
m.train(x[:], y[:], epochs=100, batch_size=32)
This is a classification problem.
I am using keras just for getting the MNIST dataset. We can get the MNIST data from the official website also. Then we normalize our data by subtracting the mean and dividing by the corresponding standard deviation, thanks to NumPy. Then we convert our labels to one-hot encoding using the pandas method get_dummies and convert the result to a NumPy array. We create an object of the NN class and add a feedforward layer with input shape 784 and 10 neurons, with softmax as the activation function. Since each MNIST example is 28x28, we flatten every image into a single vector of length 28*28. The softmax function is very useful for classification problems and is usually used on the last layer. Something like below happens, but accuracy increases very slowly.
Time: 19.278sec
Epoch: #20, Loss: 3208783.038700
Accuracy: 88.55%
We can also make our labels a one-hot encoded vector using the method below:
def one_hot_encoding(lbl, classes):
    encoded = np.zeros((len(lbl), classes))
    c = sorted(set(lbl))  # sorted so that column j always means the j-th smallest label
    if len(c) != classes:
        raise ValueError("Number of classes is not equal to unique labels.")
    for i in range(len(lbl)):
        for j in range(len(c)):
            if c[j] == lbl[i]:
                encoded[i, j] = 1
    return encoded
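For integer labels that already run from 0 to classes - 1, which is the case for MNIST digits, a much shorter alternative is to index an identity matrix. This is only a sketch under that assumption, and the name one_hot is my own.

```python
import numpy as np

def one_hot(labels, n_classes):
    labels = np.asarray(labels)
    return np.eye(n_classes)[labels]  # row k of the identity matrix is the encoding of label k

encoded = one_hot([2, 0, 1, 2], 3)
print(encoded)
```

Each row has a single 1 at the position of its label, and no Python loop is needed.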
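With a model like the one below, accuracy was great. (Note that opt="adam" needs the adam optimizer, which is not part of the Optimizer glimpse shown above.)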
m = NN()
m.add(FFL(784, 100, activation="sigmoid"))
m.add(FFL(100, 10, activation="softmax"))
m.compile_model(lr=0.01, opt="adam", loss="cse", mr= 0.001)
m.summary()
m.train(x[:], y[:], epochs=100, batch_size=32)
4. Let’s do something interesting.
4.1 Preparing Train/Validate data
Up to now, we have done only training. But it is not a good idea to boast about the training accuracy alone. We need to use validation data also. For that, let's modify a few of our methods. First, we will add the following attributes to the __init__ method of our NN.
        self.train_loss = {}  # to store train loss per show_every
        self.val_loss = {}  # to store val loss per show_every
        self.train_acc = {}  # to store train acc per show_every
        self.val_acc = {}  # to store val acc per show_every
Next, change the train method as below:
    def train(self, X, Y, epochs, show_every=1, batch_size = 32, shuffle=True, val_split=0.1, val_x=None, val_y=None):
        self.check_trainnable(X, Y)
        self.batch_size = batch_size
        t1 = time.time()
        curr_ind = np.arange(0, len(X), dtype=np.int32)
        if shuffle:
            np.random.shuffle(curr_ind)
        if val_x is not None and val_y is not None:
            self.check_trainnable(val_x, val_y)
            print("\nValidation data found.\n")
        else:
            # hold out val_split of the samples, drawn without replacement
            val_ex = np.random.choice(len(X), int(len(X) * val_split), replace=False)
            val_x, val_y = X[val_ex], Y[val_ex]
            curr_ind = curr_ind[~np.isin(curr_ind, val_ex)]
        print(f"\nTotal {len(X)} samples.\nTraining samples: {len(curr_ind)} Validation samples: {len(val_x)}.")
        # pad the indices with repeats so that every batch has exactly batch_size samples
        if len(curr_ind) % batch_size != 0:
            nx = batch_size - len(curr_ind) % batch_size
            curr_ind = np.hstack([curr_ind, curr_ind[:nx]])
        len_batch = len(curr_ind) // batch_size
        batches = np.split(curr_ind, len_batch)  # len_batch sections of batch_size indices each
        print(f"Total {len_batch} batches, each batch has {batch_size} samples.\n")
        for e in range(epochs):
            err = []
            for batch in batches:
                curr_x, curr_y = X[batch], Y[batch]
                b = 0
                batch_loss = 0
                for x, y in zip(curr_x, curr_y):
                    out = self.feedforward(x)
                    loss, error = self.apply_loss(y, out)
                    batch_loss += loss
                    err.append(error)
                    update = False
                    if b == batch_size - 1:
                        update = True
                        loss = batch_loss/batch_size
                    self.backpropagate(loss, update)
                    b+=1
            if e % show_every == 0:
                train_out = self.feedforward(X[curr_ind])
                train_loss, train_error = self.apply_loss(Y[curr_ind], train_out)
                out_activation = self.layers[-1].activation
                val_out = self.feedforward(val_x)
                val_loss, val_error = self.apply_loss(val_y, val_out)
                if out_activation == "softmax":
                    train_acc = train_out.argmax(axis=1) == Y[curr_ind].argmax(axis=1)
                    val_acc = val_out.argmax(axis=1) == val_y.argmax(axis=1)
                elif out_activation == "sigmoid":
                    train_acc = train_out > 0.7
                    val_acc = val_out > 0.7
                elif out_activation == None:
                    train_acc = abs(Y[curr_ind] - train_out) < 0.000001
                    val_acc = abs(val_y - val_out) < 0.000001
                self.train_loss[e] = round(train_error.mean(), 4)
                self.train_acc[e] = round(train_acc.mean() * 100, 4)
                self.val_loss[e] = round(val_error.mean(), 4)
                self.val_acc[e] = round(val_acc.mean()*100, 4)
                print(f"Epoch: {e}, Time: {round(time.time() - t1, 3)}sec")
                print(f"Train Loss: {round(train_error.mean(), 4)} Train Accuracy: {round(train_acc.mean() * 100, 4)}%")
                print(f'Val Loss: {(round(val_error.mean(), 4))} Val Accuracy: {round(val_acc.mean() * 100, 4)}% \n')
                t1 = time.time()
The pseudo code or explanation of the above code is:
 Check that the training data is trainable.
 Prepare indices from 0 to the number of examples.
 If validation data is given in val_x, val_y then check its trainability too.
 Else, split the prepared indices into train and validation sets.
 First we get the number of examples for validation, then pick random indices for them and take the corresponding data.
 We also edit curr_ind so that it no longer contains the validation indices. Instead of copying the actual data, I am using only indices because of memory.
 Then train just as in the process above.
 Every show_every epochs, we pass the entire training data and get the accuracy and loss, and do the same for the validation set.
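The index handling described above can also be done without the `while` loop. The sketch below is an alternative, assuming a random permutation is acceptable for picking validation examples. `split_indices` and `make_batches` are hypothetical helper names, not methods of our NN class.

```python
import numpy as np

def split_indices(n_samples, val_split=0.1, shuffle=True, seed=None):
    """Return (train_idx, val_idx) as disjoint index arrays."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n_samples) if shuffle else np.arange(n_samples)
    n_val = int(n_samples * val_split)
    # First n_val indices go to validation, the rest to training.
    return idx[n_val:], idx[:n_val]

def make_batches(indices, batch_size):
    """Split indices into ceil(len / batch_size) nearly equal batches."""
    n_batches = -(-len(indices) // batch_size)  # ceiling division
    return np.array_split(indices, n_batches)

X = np.arange(100).reshape(50, 2)  # 50 dummy examples
train_idx, val_idx = split_indices(len(X), val_split=0.1, seed=0)
batches = make_batches(train_idx, batch_size=8)

# Only the indices are stored; the data is sliced when a batch is needed.
first_batch_x = X[batches[0]]
print(len(train_idx), len(val_idx), len(batches))
```

Because `np.array_split` makes the batches nearly equal, some of them can hold fewer than `batch_size` examples, so any per-batch logic should use the actual batch length.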
4.2 Let’s add some visualizing methods
def visualize(self):
    # loss curves
    plt.figure(figsize=(10,10))
    k = list(self.train_loss.keys())
    v = list(self.train_loss.values())
    plt.plot(k, v, "g")
    k = list(self.val_loss.keys())
    v = list(self.val_loss.values())
    plt.plot(k, v, "r")
    plt.xlabel("Epochs")
    plt.ylabel(self.loss)
    plt.legend(["Train Loss", "Val Loss"])
    plt.title("Loss vs Epoch")
    plt.show()

    # accuracy curves
    plt.figure(figsize=(10,10))
    k = list(self.train_acc.keys())
    v = list(self.train_acc.values())
    plt.plot(k, v, "g")
    k = list(self.val_acc.keys())
    v = list(self.val_acc.values())
    plt.plot(k, v, "r")
    plt.xlabel("Epochs")
    plt.ylabel("Accuracy")
    plt.title("Acc vs epoch")
    plt.legend(["Train Acc", "Val Acc"])
    plt.grid(True)
    plt.show()
Nothing strange is happening here. We are only plotting the keys (epochs) and values of the previously stored train/val accuracy and loss. If we set show_every=1 then every epoch gets a point, so the graph looks best.
5 Finally
My version of the final Feedforward Deep Neural Network will be given on the link and in the meantime, I am gonna share my results.
import pandas as pd
from keras.datasets import mnist

(x_train, y_train), (x_test, y_test) = mnist.load_data()
# flatten each 28x28 image to a row of 784 values, then standardize every row
x = x_train.reshape(-1, 28 * 28)
x = (x - x.mean(axis=1).reshape(-1, 1)) / x.std(axis=1).reshape(-1, 1)
y = pd.get_dummies(y_train).to_numpy()
xt = x_test.reshape(-1, 28 * 28)
xt = (xt - xt.mean(axis=1).reshape(-1, 1)) / xt.std(axis=1).reshape(-1, 1)
yt = pd.get_dummies(y_test).to_numpy()
m = NN()
m.add(FFL(784, 10, activation='sigmoid'))
m.add(FFL(10, 10, activation="softmax"))
m.compile_model(lr=0.01, opt="adam", loss="cse", mr=0.001)
m.summary()
m.train(x[:], y[:], epochs=10, batch_size=32, val_x=xt, val_y=yt)
m.visualize()
6 Bonus Topics
6.1 Saving Model on JSON File
import os
import json

def save_model(self, path="model.json"):
    """
        path: where to save a model including filename
        saves Json file on given path.
    """
    dict_model = {"model": str(type(self).__name__)}
    to_save = ["name", "isbias", "neurons", "input_shape", "output_shape", "weights", "biases", "activation", "parameters"]
    for l in self.layers:
        current_layer = vars(l)
        values = {"type": str(type(l).__name__)}
        for key, value in current_layer.items():
            if key in to_save:
                # NumPy arrays are not JSON serializable, so convert them to lists
                if key in ["weights", "biases"]:
                    value = value.tolist()
                values[key] = value
        dict_model[l.name] = values
    json_dict = json.dumps(dict_model)
    with open(path, mode="w") as f:
        f.write(json_dict)

save_model(m)
Note that we are not saving the parameters in encrypted form, nor are we saving them in separate files.
 We want to save everything in JSON format, so we create a dictionary first.
 vars(obj) gives us a dictionary built from the attribute: value structure of the class object.
 We save only the few things necessary to use a model. to_save is a list of all the attributes that we need to make predictions with a model.
 We still haven’t implemented a way to check whether the saved model is compiled or not. But we do need a predict method.
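To see why `vars` and `tolist` are needed, here is a small self-contained sketch. `TinyLayer` is a made-up stand-in for our layer class and holds only a few attributes. The point is that `json.dumps` cannot serialize a NumPy array directly, so arrays are converted to nested lists first, and attributes outside `to_save` are skipped.

```python
import json
import numpy as np

class TinyLayer:
    """Hypothetical stand-in for a layer, used only for this demo."""
    def __init__(self):
        self.name = "ffl_0"
        self.neurons = 2
        self.weights = np.array([[0.1, 0.2], [0.3, 0.4]])
        self.delta = np.zeros(2)  # training-only state we do not want to save

layer = TinyLayer()
to_save = ["name", "neurons", "weights"]

values = {}
for key, value in vars(layer).items():  # vars() gives {attribute: value}
    if key in to_save:
        if isinstance(value, np.ndarray):
            value = value.tolist()  # ndarray -> nested Python lists
        values[key] = value

json_text = json.dumps({layer.name: values})
print(json_text)
```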
6.2 Loading a JSON Model
def load_model(path="model.json"):
    """
        path: path of model file including filename
        returns: a model
    """
    models = {"NN": NN}
    layers = {"FFL": FFL}
    # layers = {"FFL": FFL, "Conv2d": Conv2d, "Dropout": Dropout, "Flatten": Flatten, "Pool2d": Pool2d}
    with open(path, "r") as f:
        dict_model = json.load(f)
    model = dict_model["model"]
    model = models[model]()
    for layer, params in dict_model.items():
        if layer != "model":
            lyr = layers[params["type"]](neurons=params["neurons"])  # create a layer obj
            # JSON gives nested lists back, so convert them to NumPy arrays
            if params.get("weights"):
                lyr.weights = np.array(params["weights"])
            if params.get("biases"):
                lyr.biases = np.array(params["biases"])
            lyr.name = layer
            lyr.activation = params["activation"]
            lyr.isbias = params["isbias"]
            lyr.input_shape = params["input_shape"]
            lyr.output_shape = params["output_shape"]
            lyr.neurons = params["neurons"]
            lyr.parameters = params["parameters"]
            model.layers.append(lyr)
    return model

m = load_model()
 Nothing is strange here. But a few things to note: FFL and NN are stored in the dictionaries as classes (not objects), which we call later.
 The model is created on the line model = models[model]().
 The first test of whether our model works can be seen from m.summary().
 Next, try to use the predict(x) method.
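The lookup-then-call idea can be shown in isolation. In this sketch `DemoLayer` and the JSON string are invented for illustration. The dictionary stores the class itself, and calling the looked-up value creates the object. The weights come back from JSON as nested lists, so they are converted to a NumPy array here, on the assumption that the layer's math expects arrays.

```python
import json
import numpy as np

class DemoLayer:
    """Hypothetical layer with just enough state for the demo."""
    def __init__(self, neurons):
        self.neurons = neurons
        self.weights = None

# Map the saved type name to the class itself (no parentheses, nothing created yet).
layer_types = {"DemoLayer": DemoLayer}

saved = '{"layer_0": {"type": "DemoLayer", "neurons": 2, "weights": [[1.0, 2.0], [3.0, 4.0]]}}'
params = json.loads(saved)["layer_0"]

layer_cls = layer_types[params["type"]]       # look up the class by name
layer = layer_cls(neurons=params["neurons"])  # calling it creates the object
layer.weights = np.array(params["weights"])   # lists -> ndarray

print(type(layer).__name__, layer.weights.shape)
```

Adding a new layer type later only needs one more entry in the dictionary.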
6.3 Predict Method
def predict(self, X):
    out = []
    for x in X:
        out.append(self.feedforward(x))
    return out
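`predict` returns the raw network outputs, one per example. For a softmax output layer, turning these into class labels is one `argmax` away. The sketch below uses a made-up `FakeModel` whose `feedforward` applies softmax to a simple invented score, because the exact output shape of our own `feedforward` is an assumption here. The `reshape(len(X), -1)` is there to cope with per-example outputs of shape `(n_classes,)` or `(1, n_classes)`.

```python
import numpy as np

class FakeModel:
    """Hypothetical stand-in: feedforward returns 3-class softmax probabilities."""
    def feedforward(self, x):
        scores = np.array([x.sum(), 1.0, 0.0])  # invented scores for the demo
        e = np.exp(scores - scores.max())
        return e / e.sum()

    def predict(self, X):
        out = []
        for x in X:
            out.append(self.feedforward(x))
        return out

model = FakeModel()
X = np.array([[2.0, 1.0], [-1.0, -1.0], [0.2, 0.3]])
y_true = np.array([0, 1, 1])

probs = np.array(model.predict(X)).reshape(len(X), -1)  # (n_examples, n_classes)
labels = probs.argmax(axis=1)                           # most probable class per row
accuracy = (labels == y_true).mean() * 100
print(labels, f"{accuracy:.2f}%")
```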
Now, this is where this blog ends, but I have written another blog too, Convolutional Neural Network from Scratch. I hope you’ve found this blog useful and that it will help when you try to write your own version of a neural network from scratch.
7 References and Credits
 Optimizers were referenced from here
 About Softmax Activation Function and Crossentropy
 Machine Learning by Tom M Mitchell
 Tensorflow For Dummies by Matthew Scarpino
 Artificial Intelligence Deep Learning Machine Learning Tutorials(Most awesome repository.)
 Grokking Deep Learning by Andrew Trask