Building an OCR for Handwritten Devanagari Characters

6 minute read

Using Keras, OpenCV, and NumPy to build a simple OCR.

Inspiration

Devanagari is widely used across India and Nepal. It is also the national script of Nepal, so back in 2018 I thought of building an OCR for our script as a project. I had no clue how to do it, but I knew some basics of Machine Learning. I finally started working on it in 2019 and finished in 3 months. In the end it became my school project.

Recognition Of Devanagari Character

Requirements

Some basic knowledge of Machine Learning. For the coding, you will need Keras 2.x, OpenCV 4.x, NumPy and Matplotlib.

Introduction

Devanagari is the national script of Nepal and is also used widely throughout India. It contains 10 numerals (०, १, २, ३, ४, ५, ६, ७, ८, ९) and 36 consonants (क, ख, ग, घ, ङ, च, छ, ज, झ, ञ, ट, ठ, ड, ढ, ण, त, थ, द, ध, न, प, फ, ब, भ, म, य, र, ल, व, श, ष, स, ह, क्ष, त्र, ज्ञ). Some consonants are compound characters formed by combining others; however, throughout this project I treated them as single characters.

The required dataset is publicly available at the linked page. Huge credit goes to the team who collected the dataset and made it public.

Dataset Preparation

We could create our own version of the dataset, but why spend all that time when one has already been collected? The images are grayscale with a 2-pixel margin on each side. I didn't know much about Keras's ImageDataGenerator back then, so I converted all the image files to a CSV file with the first column as the label and the remaining 1024 columns as pixel values. Today, however, I highly recommend using ImageDataGenerator.
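
If you want to skip the CSV step, here is a minimal sketch of the ImageDataGenerator route. The directory layout and paths are my assumptions for illustration, not the original project's code; it expects the dataset extracted into one folder per character of 32 by 32 grayscale images.

from keras.preprocessing.image import ImageDataGenerator

# Assumed (hypothetical) layout: dataset/Train/<character_name>/*.png, 32x32 grayscale images
datagen = ImageDataGenerator(rescale=1./255, validation_split=0.2)

train_gen = datagen.flow_from_directory(
    'dataset/Train',            # hypothetical path to the extracted dataset
    target_size=(32, 32),
    color_mode='grayscale',
    class_mode='categorical',
    batch_size=32,
    subset='training')

val_gen = datagen.flow_from_directory(
    'dataset/Train',
    target_size=(32, 32),
    color_mode='grayscale',
    class_mode='categorical',
    batch_size=32,
    subset='validation')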

Model Preparation

For the model, I picked a simple CNN. The Keras summary of the model is given below.

Keras Model Summary

Model Training

  • Loss: Categorical Cross Entropy
  • Optimizer: SGD
  • Batch size: 32
  • Epochs: 100
  • Validation split: 0.2
  • Train time: 37.86 minutes on Google Colab
  • Test accuracy: 99.29%

I tried various models over several weeks to train a good one, and ended up with the best results using the parameters above. The images below show how the model behaved during training.

Model Accuracy
Model Loss

My GitHub repository contains the top 3 models (cnn0, cnn1, cnn2) and their code; please follow this link.
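
For reference, here is a minimal sketch of compiling and training a model with the hyperparameters listed above. The CSV file name and the layer sizes are illustrative assumptions on my part; the actual architectures (cnn0, cnn1, cnn2) are in the repository.

import numpy as np
from keras.models import Sequential
from keras.layers import Conv2D, MaxPooling2D, Flatten, Dense, Dropout
from keras.optimizers import SGD
from keras.utils import to_categorical

# Hypothetical CSV from the dataset step: first column is the label, the remaining 1024 are pixel values
data = np.genfromtxt('dataset.csv', delimiter=',', skip_header=1)
x_train = data[:, 1:].reshape(-1, 32, 32, 1) / 255.0
y_train = to_categorical(data[:, 0], num_classes=46)

# Illustrative CNN for 32x32 grayscale inputs and 46 classes; not the exact cnn2 architecture
model = Sequential([
    Conv2D(32, (3, 3), activation='relu', input_shape=(32, 32, 1)),
    MaxPooling2D((2, 2)),
    Conv2D(64, (3, 3), activation='relu'),
    MaxPooling2D((2, 2)),
    Flatten(),
    Dense(128, activation='relu'),
    Dropout(0.5),
    Dense(46, activation='softmax'),
])

model.compile(loss='categorical_crossentropy', optimizer=SGD(), metrics=['accuracy'])
model.fit(x_train, y_train, batch_size=32, epochs=100, validation_split=0.2)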

Image Processing

Training a model alone will not make an OCR system, and we can't feed real-world images to the model without pre-processing. Here is the complete image pre-processing code.

import cv2
import numpy as np


def preprocess(bgr_img):  # expects a grayscale image
    img = bgr_img[:]
    blur = cv2.GaussianBlur(img, (5, 5), 0)
    # Otsu's threshold converts the image to binary and inverts black and white
    ret, th_img = cv2.threshold(blur, 0, 255, cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)
    rows, cols = th_img.shape
    # sample 5 pixels along the top-left diagonal to guess the background color
    bg_test = np.array([th_img[i][i] for i in range(5)])
    if bg_test.all() == 0:
        text_color = 255
    else:
        text_color = 0
    # localization: find the top/bottom and left/right borders of the text
    # (borders() and segmentation() are defined below)
    tb = borders(th_img, text_color)
    lr = borders(th_img.T, text_color)
    dummy = int(np.average((tb[2], lr[2]))) + 2
    template = th_img[tb[0]+dummy:tb[1]-dummy, lr[0]+dummy:lr[1]-dummy]
    # segmentation: split the cropped text into individual characters
    segments = segmentation(template, text_color)
    return segments, template, th_img, text_color

First we copy the input image (OpenCV reads color images in BGR, but here we work on the grayscale version) and then convert it to a binary image using OpenCV's threshold function. Thresholding always reduces the complexity of the task because we are left with only two pixel values, 0 and 255. Another important step is identifying the background and foreground pixels. This is tricky because sometimes the text will be white on a black background and sometimes the reverse, so we need a trick that always tells which is which. Here I checked only 5 pixels from the top-left corner. This idea will not always work, but most of the time it is a good approximation.

Next we need to find the ROI (region of interest), because the text might be located anywhere in the image. We must find its exact position and crop it before doing further processing. After that we do segmentation. Here I used only NumPy for image cropping and segmentation, which sounds funny but is true.

Finding ROI

def borders(here_img, thresh):
    """Find the first and last rows that contain text, plus a safety margin."""
    size = here_img.shape
    # runs of foreground shorter than this are treated as noise
    check = int(115 * size[0] / 600)
    image = here_img[:]
    top, bottom = 0, size[0] - 1
    shape = size
    # a row filled with the text color, used to test whether an image row contains any foreground
    bg = np.repeat(thresh, shape[1])
    # scan from the top: the text starts where foreground persists longer than the noise limit
    count = 0
    for row in range(1, shape[0]):
        if np.equal(bg, image[row]).any():
            count += 1
        else:
            count = 0
        if count >= check:
            top = row - check
            break
    # scan from the bottom in the same way
    shape = image.shape
    bg = np.repeat(thresh, shape[1])
    count = 0
    rows = np.arange(1, shape[0])
    for row in rows[::-1]:
        if np.equal(bg, image[row]).any():
            count += 1
        else:
            count = 0
        if count >= check:
            bottom = row + count
            break
    # keep a 2-pixel margin only if it stays inside the image
    d1 = (top - 2) >= 0
    d2 = (bottom + 2) < size[0]
    if d1 and d2:
        b = 2
    else:
        b = 0
    return (top, bottom, b)
Working of Cropping

Finding the real position of the text is another problem, because real-world images can contain a lot of noise and there is always a chance of detecting false shapes. To handle this I wrote a small formula: runs of pixels shorter than a noise threshold are neglected. We keep checking from the top of the image, and whenever we find foreground pixels persisting for more than the noise threshold, we crop the image from the position current_row - noise_value. We do the same for the other 3 sides.

Cropped Image

Segmentation

def segmentation(bordered, thresh):
    try:
        shape = bordered.shape
        # background gaps narrower than this are not treated as character boundaries
        check = int(50 * shape[0] / 320)
        image = bordered[:]
        # drop the top rows containing the 'Dika' (head line), then work column-wise
        image = image[check:].T
        shape = image.shape
        # a column of pure background, used to detect empty columns
        bg = np.repeat(255 - thresh, shape[1])
        bg_keys = []
        for row in range(1, shape[0]):
            if np.equal(bg, image[row]).all():
                bg_keys.append(row)
        lenkeys = len(bg_keys) - 1
        new_keys = [bg_keys[1], bg_keys[-1]]
        # keep only the gaps that are wide enough to separate two characters
        for i in range(1, lenkeys):
            if (bg_keys[i+1] - bg_keys[i]) > check:
                new_keys.append(bg_keys[i])
        new_keys = sorted(new_keys)
        # slice the original cropped image at each segmentation column
        segmented_templates = []
        first = 0
        for key in new_keys[1:]:
            segment = bordered.T[first:key]
            segmented_templates.append(segment.T)
            first = key
        last_segment = bordered.T[new_keys[-1]:]
        segmented_templates.append(last_segment.T)
        return segmented_templates
    except Exception:
        # if anything fails (e.g. a single character with no gaps), return the whole image
        return [bordered]

Now the craziest part: image segmentation using NumPy. We take a copy of the cropped image and remove the topmost part of the text, which in Nepali we call the 'Dika'. By doing this we actually get some space between the characters. Then I wrote a general formula that works for images of any size.

Removed ‘Dika’

Next we loop through each column and check whether the column consists entirely of background pixels. If it does, it might be the right place to segment, but there is always a risk of keeping a large margin, so the segments are sent to the borders function again to remove the unwanted background space. After doing so, we get the exact column numbers at which to slice the original cropped image. I've called these slices segmented_templates here.

Image segmentation

Localization

def localize(main_image, gray_img, localized, bc, show):
    # use the segment as a grayscale template
    template = localized
    width, height = template.shape[::-1]  # get the width and height
    # match the template using cv2.matchTemplate
    match = cv2.matchTemplate(gray_img, template, cv2.TM_CCOEFF_NORMED)
    threshold = 0.8
    position = np.where(match >= threshold)  # get the locations of the template in the image
    for point in zip(*position[::-1]):  # draw a rectangle around the matched template
        cv2.rectangle(main_image, point, (point[0] + width, point[1] + height), (255 - bc, 0, bc), 2)
    return main_image

Localization is the concept of finding the exact position of the text in the image and drawing a border around it. We pass each of the previous segments as a template and use OpenCV's template matching to find the exact position where that segment matches. Of course the template should match 100%, but I've set the threshold to 0.8 here. Whenever a template matches, I draw a rectangle around the matched portion of the original image using OpenCV's rectangle drawing.

Localizing

Add Border

def detect_text(main_image, gray_img, localized, bc):
    # resize the segment, then pad it with a constant border so it matches the training images
    cimg = cv2.resize(localized, (30, 30))
    bordersize = 1
    nimg = cv2.copyMakeBorder(cimg, top=bordersize, bottom=bordersize, left=bordersize, right=bordersize, borderType=cv2.BORDER_CONSTANT, value=[255 - bc, 0, 0])
    return main_image, nimg

Now our image must be converted to 32 by 32, because our training data is also 32 by 32. But resizing the segment directly to that shape would mostly cause the prediction to fail. The reason is that the training images have a 2-pixel margin of background around the character, so we need a margin here as well: the segment is resized to 30 by 30 and then padded with a constant border to bring it to 32 by 32.

Prediction With Trained Model

import numpy as np
from keras.models import model_from_json
from keras.models import load_model


def prediction(img):
    # load the architecture from JSON and create the model
    json_file = open('cnn2\cnn2.json', 'r')
    loaded_model_json = json_file.read()
    json_file.close()
    loaded_model = model_from_json(loaded_model_json)
    # load the weights into the new model
    loaded_model.load_weights("cnn2\cnn2.h5")
    loaded_model.save('cnn.hdf5')
    loaded_model = load_model('cnn.hdf5')
    # class labels in the same order as the training labels
    characters = '०,१,२,३,४,५,६,७,८,९,क,ख,ग,घ,ङ,च,छ,ज,झ,ञ,ट,ठ,ड,ढ,ण,त,थ,द,ध,न,प,फ,ब,भ,म,य,र,ल,व,श,ष,स,ह,क्ष,त्र,ज्ञ'
    characters = characters.split(',')
    # normalize the 32x32 segment and reshape it to the model's input shape
    x = np.asarray(img, dtype=np.float32).reshape(1, 32, 32, 1) / 255
    output = loaded_model.predict(x)
    output = output.reshape(46)
    predicted = np.argmax(output)
    devanagari_label = characters[predicted]
    success = output[predicted] * 100
    return devanagari_label, success

Each preprocessed segment passed to this function gets back a predicted label and a confidence value. We use these in the recognition method.

Recognition Of Segments

from preprocess import preprocess, detect_text, localize
from predictor import prediction
import numpy as np
import matplotlib.pyplot as plt
import cv2


def recognition(gray_image, show):
    segments, template, th_img, text_color = preprocess(gray_image)
    labels = []
    accuracy = []
    show_img = gray_image[:]
    for segment in segments:
        recimg, bimg = detect_text(show_img, th_img, segment, text_color)
        label, sure = prediction(bimg)
        # accept the segment only when the prediction is confident enough
        if sure > 80:
            labels.append(str(label))
            accuracy.append(sure)
            show_img = localize(show_img, th_img, segment, text_color, show)
    char = labels
    accuracy = np.average(accuracy)
    char = ''.join(char)
    # if the per-segment predictions are weak, treat the whole text as a single character
    if accuracy < 80:
        recimg, bimg = detect_text(show_img, th_img, template, text_color)
        show_img = localize(show_img, th_img, template, text_color, show)
        char, accuracy = prediction(bimg)
    if show == 'show':
        plt.imshow(show_img)
        plt.title('Detecting')
        plt.xticks([])
        plt.yticks([])
        plt.show()
    else:
        cv2.imshow('Detecting..', cv2.cvtColor(show_img, cv2.COLOR_GRAY2BGR))
    print('The prediction accuracy for ', char, ' is ', "%.2f" % round(accuracy, 2), '%')

No matter how hard we code, there will always be false-positive predictions, but we can try to reduce them. False positives tend to occur when the image quality is low or when the entire text should really be treated as a single character. So a segment's prediction is accepted only if its confidence is above 80%, and to prevent localization of false segments, localization is done only after a segment has been accepted. If the average confidence is still below 80%, the entire text is treated as a single character and predicted again.

Recognition and Localization of a Word

Camera For Realtime

import cv2
from recognition import recognition
import numpy as np
import time
import matplotlib.pyplot as plt


def camera(flag):
    orig = 1
    cap = cv2.VideoCapture(0)
    # fractions of the frame that define the capture rectangle
    tr = 0.1
    br = 0.8
    lc = 0.1
    rc = 0.8
    f = 0
    while flag:
        ret, frame = cap.read()
        if ret:
            # key events: z/x set the direction, a/s/d/w move the rectangle edges
            s = cv2.waitKey(2) & 0xFF
            if chr(s) == 'x':
                f = -1
            if chr(s) == 'z':
                f = 1
            if chr(s) == 'a':
                tr = tr + 0.1 * f
            if chr(s) == 'd':
                br = br + 0.1 * f
            if chr(s) == 's':
                lc = lc + 0.1 * f
            if chr(s) == 'w':
                rc = rc + 0.1 * f
            s_x, s_y = np.shape(frame)[0] * tr, np.shape(frame)[1] * lc
            e_x, e_y = np.shape(frame)[1] * br, np.shape(frame)[0] * rc
            s_x, s_y = np.int32(s_x), np.int32(s_y)
            e_x, e_y = np.int32(e_x), np.int32(e_y)
            gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
            ogray = gray[:]
            gray = gray[s_y:e_y, s_x:e_x]
            if s == 32:  # space to capture an image and do recognition
                time1 = time.time()
                plt.imshow(frame)
                plt.show()
                recognition(gray, 'show')
                print("In %f" % (time.time() - time1), 'sec')
            if s == 13:  # enter to do realtime recognition
                orig = 0
                cv2.destroyWindow('Project DCR')
                print("Doing RT...")
                recognition(ogray, 'no')
            else:
                if orig != 0:
                    show = frame[:]
                    text = "Press 'space' to take a photo and 'enter' to do realtime(slow)."
                    text1 = "Make sure the character is inside rectangle."
                    text2 = "Press a/s/d/w for change size of rectangle and z/x to increase/decrease."
                    cv2.putText(show, text1, (15, 50), cv2.FONT_HERSHEY_COMPLEX, 0.75, (0, 100, 200))
                    cv2.putText(show, text2, (15, 70), cv2.FONT_HERSHEY_COMPLEX, 0.5, (50, 20, 255))
                    cv2.rectangle(show, (s_x, s_y), (e_x, e_y), (0, 255, 0), 2)
                    cv2.putText(show, text, (15, 30), cv2.FONT_HERSHEY_SIMPLEX, 0.6, (15, 0, 255), lineType=cv2.LINE_AA)
                    cv2.imshow('Project DCR', show)
        else:
            print('Trying.....\n')
            continue
        if s == 27:  # esc to quit
            break
    cap.release()
    cv2.destroyAllWindows()

A real-time OCR needs a camera, so I used OpenCV's video capture for real-time image capture. In the code above I wrote quite a bit of logic to do some interesting things: the camera view shows a rectangular box whose size and position can be adjusted from the keyboard, and only the portion of the frame inside the box is sent to the recognition process. I mapped keys such as the spacebar to capture an image, the enter key to run real-time recognition, and so on.

Combining It all

from recognition import recognition
import cv2
import matplotlib.pyplot as plt
from video_test import camera
import time

try:
    test = input('Please enter the image directory with name.\n')
    test = cv2.imread(test, 0)
    plt.imshow(cv2.cvtColor(test, cv2.COLOR_GRAY2RGB))
    plt.xticks([])
    plt.yticks([])
    plt.show()
    time1 = time.time()
    in_img = recognition(test, 'show')
    print("In %f" % (time.time() - time1), 'sec')
except:
    print("Image not found now turning to video mode.\n")
    try:
        camera(True)
    except:
        print('Something is wrong. Try with more stable, less noise and clear picture.\n')

Now it is time to integrate all the modules. The user can pass the location of an image on local storage, and if the image doesn't exist, the program switches to camera mode.

Overall System Process

I have also tried Android app development using TensorFlow Lite, but that work is still paused. I am planning to write a web app as well.

Thank you so much for reading this article. Please follow this GitHub link for the entire project and its documentation.

Find me on Twitter.

Find me on LinkedIn.

Find me on Youtube.
