
Robust Text Detection and Recognition Engine for Scene Images

f:id:chowagiken_uzzal:20190424175803p:plain I'm Uzzal Podder, a machine learning engineer at Chowagiken. In this post, I want to talk about text detection and recognition in scene images.

Written text is a remarkable innovation in human history: it can transfer knowledge from country to country and even from generation to generation. In this modern era, computers have revolutionized almost every aspect of life. Yet although computers are very good at processing digitally formatted text, and even at optical character recognition on scanned documents, they are still not smart enough to read text precisely from scene text images.

Scene text images are found everywhere. Because they contain rich semantic information, many computer vision applications, including self-driving cars and smartphone apps, need to parse the text in natural scene images to understand the corresponding environment.

In this blog, we will build an engine that can isolate text regions and also recognize the text in a scene text image. It can be considered a very robust system because you can apply it to any language, or even to any character-like shape. Here we will use wristwatch photos as our scene text images and try to recognize the text on the watch face.

What are the Challenges in Scene Text Detection and Recognition?

f:id:chowagiken_uzzal:20190424175148p:plain f:id:chowagiken_uzzal:20190515182304p:plain f:id:chowagiken_uzzal:20190515182203p:plain

To read text from scanned document images, Optical Character Recognition (OCR) has been used in industry for several years with high accuracy. But OCR fails if the image is a scene image, because Scene Text Recognition (STR) comes with several extra challenges:

  • Text is scattered across the scene image. Unlike a scanned document, the number of text lines and the line spacing in a scene image are unpredictable.
  • Scene text may contain specially designed characters or decoration, and may also vary widely in angle, color and font size.

How Does This Engine Work?

https://www.research.ibm.com/haifa/projects/imt/video/images/scheme.png

There are two main parts.

  1. Take a scene image and isolate the candidate text objects from the other objects in the image using a neural network.
  2. Pass those isolated candidate text objects to another deep neural network, which recognizes the character(s). (A high-level sketch of this pipeline is shown below.)
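
At a very high level, the pipeline looks roughly like this. This is only an illustrative sketch; detect_text, crop_region and recognize_text are hypothetical names standing in for the EAST and CRNN steps implemented later in this post.

def read_scene_text(image):
    # Stage 1: EAST-style detection returns candidate text boxes
    boxes = detect_text(image)
    results = []
    for box in boxes:
        # Cut out one candidate text region from the original image
        crop = crop_region(image, box)
        # Stage 2: CRNN-style recognition turns the crop into a string
        results.append(recognize_text(crop))
    return results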

Implementation

For the first part, we need a text detection model. For this we will use the EAST deep neural network architecture.

For the second part, we need a text recognition model. For this we will use the CRNN architecture, which is a combination of a convolutional neural network and a recurrent neural network.

Text Detection with EAST:

EAST takes a large scene image (768 pixels or larger) and outputs dense per-pixel predictions of words or text lines. Unlike classical text detection models, EAST can isolate text regions without sub-tasks like candidate proposal, text region formation and word partition. One forward pass is enough to detect text; the only post-processing step is Non-Maximum Suppression (NMS). This is why the name EAST stands for Efficient and Accurate Scene Text detector.

According to the EAST paper, the model can be decomposed into three parts: a feature extractor stem, a feature-merging branch and an output layer.

f:id:chowagiken_uzzal:20190425145550p:plain

The feature extractor stem is a convolutional network (generally pre-trained on the ImageNet dataset) with interleaved convolution and pooling layers.

In the feature-merging branch, at each merging stage the feature map from the previous stage is first fed to an unpooling layer to double its size, and then concatenated with the current feature map from the stem. The last merging stage produces the final feature map of the merging branch and feeds it to the output layer.
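
As a rough illustration only (not the code used in training below), one merging stage could be sketched with Keras layers like this, assuming prev_feature is the output of the previous merging stage and stem_feature is the corresponding feature map from the stem:

import tensorflow as tf

def merge_stage(prev_feature, stem_feature, channels):
    # Unpool: upsample the previous stage's feature map to double its spatial size
    x = tf.keras.layers.UpSampling2D(size=(2, 2))(prev_feature)
    # Concatenate with the feature map coming from the feature extractor stem
    x = tf.keras.layers.Concatenate(axis=-1)([x, stem_feature])
    # 1x1 convolution to reduce channels, then 3x3 convolution to fuse the features
    x = tf.keras.layers.Conv2D(channels, 1, padding='same', activation='relu')(x)
    x = tf.keras.layers.Conv2D(channels, 3, padding='same', activation='relu')(x)
    return x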

The final output layer contains several conv1×1 operations to project 32 channels of feature maps into 1 channel of score map and a multi-channel geometry map. f:id:chowagiken_uzzal:20190425155055p:plain

The geometry map output can be either RBOX or QUAD. For RBOX, the geometry is represented by 4 channels of an axis-aligned bounding box (AABB) R plus 1 channel for the rotation angle θ. For QUAD Q, 8 numbers are used to denote the coordinate shift from the four corner vertices of the quadrangle to the pixel location.

How to Train EAST?

We will use the TensorFlow implementation of EAST (argman/EAST).

We will use a Docker container because a CUDA-enabled environment and the other library dependencies are very easy to set up in Docker. TensorFlow also provides good official Docker images.

First, let's pull a fresh TensorFlow Docker image:

docker pull tensorflow/tensorflow:latest-gpu-py3-jupyter

Now run the image with the NVIDIA runtime. To use TensorBoard and Jupyter Notebook we map ports 8888 and 6006 to unused ports on the host. Jupyter Notebook is not necessary here, but it may be useful if you want to modify some code in your container. We also mount a host directory inside Docker so that we can access that storage without copying it into the container.

docker run --runtime=nvidia -it \
  -v /share/data/:/home  -w /home   \
  -p 8998:8888  -p 6007:6006  \
  tensorflow/tensorflow:latest-gpu-py3-jupyter bash

Now clone the EAST repository. It's better to clone into the mounted folder because you can then use an external editor to edit the source code easily, and you don't have to commit your image every time you modify your code.

cd /home
git clone https://github.com/argman/EAST.git
cd EAST


Dataset pre-process for EAST

The dataset structure for EAST is very simple. In a single folder you put your images, and for each image you need a text file containing the text box coordinate information. Keep in mind that for each training example, the file name must be the same for the image and the text file (except for the file extension).

f:id:chowagiken_uzzal:20190425182251p:plain

Inside a text file it may look like this. f:id:chowagiken_uzzal:20190425183809p:plain
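
For reference, the training code in argman/EAST reads ICDAR-2015-style annotation files: one text box per line, the four corner coordinates followed by the transcription. The coordinates and words below are made-up examples of that layout:

377,117,463,117,465,130,378,130,CASIO
493,115,519,115,519,131,493,131,G-SHOCK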

If you want to train your own custom model, you obviously need to annotate text box boundaries like the above, so you need an annotation tool. We use LabelImg. You can install it from https://github.com/tzutalin/labelImg

But the output of a LabelImg annotation is an .xml file, which is not suitable for EAST, so we need to convert it to a .txt file. f:id:chowagiken_uzzal:20190425185559p:plain You can do this on your own, but I have created a script for it: https://gitlab.com/chowagiken/dataset_preprocessing_scripts/blob/master/EAST/main_east_preprocess.py

# The helper functions below come from main_east_preprocess.py (linked above).
labelimg_annotations_dir = '/home/data/watch/'
dataset_target_dir = '/home/new_data/east_dataset/'

# Collect the image paths and the LabelImg .xml annotations
image_dic, xml_label_dic = get_labelimg_annotations(labelimg_annotations_dir)

# Convert the .xml annotations to EAST-style .txt labels
east_labels_dic = convert_xml_to_east_txt(xml_label_dic)

# Copy the images and the new .txt files to the target dataset folder
copy_to_target(image_dic, east_labels_dic, dataset_target_dir)

Training EAST

At this moment our dataset is ready and we can start training. But it is obviously better to start from pre-trained weights. For the convolutional part, we will initialize its weights from http://download.tensorflow.org/models/resnet_v1_50_2016_08_28.tar.gz. Download and extract it.

We will also use a checkpoint that has already been trained on the ICDAR 2013 + ICDAR 2015 datasets. Download and unzip it: east_icdar2015_resnet_v1_50_rbox.zip (Google Drive).

Assuming you are in EAST root directory (/home/EAST/)

python multigpu_train.py --gpu_list=0 --input_size=512 --batch_size_per_gpu=14 \
  --text_scale=512 --geometry=RBOX --learning_rate=0.0001 --num_readers=24 \
  --checkpoint_path=/home/data/east_icdar2015_resnet_v1_50_rbox/ \
  --training_data_path=/home/new_data/east_dataset/ \
  --pretrained_model_path=/home/data/resnet_v1_50.ckpt

Training should start smoothly. You can see the training log in TensorBoard. Open another terminal and attach to the container; let's assume the container ID is 8ba3fa2a2ddb.

docker exec -it 8ba3fa2a2ddb bash
cd /home/data/east_icdar2015_resnet_v1_50_rbox/
tensorboard --logdir=.

As we have mapped container port 6006 to host port 6007, open http://localhost:6007

f:id:chowagiken_uzzal:20190507184450p:plain

Evaluating EAST

After training, our model will be saved in /home/data/east_icdar2015_resnet_v1_50_rbox/. Let's test it.

From the Docker container terminal:

python eval.py \
    --test_data_path=/home/test_data/image/ \
    --gpu_list=0 \
    --checkpoint_path=/home/data/east_icdar2015_resnet_v1_50_rbox/ \
    --output_dir=/home/east_result/

f:id:chowagiken_uzzal:20190425193817p:plain


Text Recognition with CRNN:

Text recognition from scene images has been a long-standing research topic in computer vision. Here we will use CRNN, which is well suited to image-based sequence recognition. In the CRNN architecture there is no need for character segmentation or horizontal scale normalization, so it can naturally handle sequences of arbitrary length.

The network architecture of CRNN consists of three components, from bottom to top: convolutional layers, recurrent layers, and a transcription layer.

The convolutional layers automatically extract a feature sequence from each input image. On top of the convolutional network, a recurrent network is built to make a prediction for each frame of the feature sequence output by the convolutional layers. The transcription layer at the top of CRNN translates the per-frame predictions of the recurrent layers into a label sequence.

f:id:chowagiken_uzzal:20190507131222p:plain
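
To make those three components concrete, here is a rough Keras-style sketch of the same idea (illustration only; the model we actually train below is the original Torch implementation, and a fixed input width of 100 pixels is assumed just to keep the sketch simple):

import tensorflow as tf

def build_crnn_sketch(num_classes=36):
    # Grayscale input of height 32; width fixed to 100 for simplicity
    inputs = tf.keras.Input(shape=(32, 100, 1))

    # Convolutional layers: turn the image into a width-wise feature map
    x = tf.keras.layers.Conv2D(64, 3, padding='same', activation='relu')(inputs)
    x = tf.keras.layers.MaxPooling2D((2, 2))(x)
    x = tf.keras.layers.Conv2D(128, 3, padding='same', activation='relu')(x)
    x = tf.keras.layers.MaxPooling2D((2, 2))(x)            # -> (8, 25, 128)

    # Each column of the feature map becomes one frame of the sequence
    x = tf.keras.layers.Permute((2, 1, 3))(x)              # -> (25, 8, 128)
    x = tf.keras.layers.Reshape((25, 8 * 128))(x)          # -> (25, 1024)

    # Recurrent layers: predict a label distribution for every frame
    x = tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(256, return_sequences=True))(x)
    x = tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(256, return_sequences=True))(x)

    # Per-frame softmax over the character set plus a CTC "blank" class;
    # the transcription layer corresponds to CTC decoding of these outputs
    outputs = tf.keras.layers.Dense(num_classes + 1, activation='softmax')(x)
    return tf.keras.Model(inputs, outputs)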

How to Train CRNN?

Like EAST, we will use Docker to simplify the environment setup. The official implementation of CRNN has a Dockerfile, but we found that its image takes too much time to load cudnn (require('cudnn')). (See this issue: require 'cutorch' takes long time · Issue #475 · torch/cutorch · GitHub.)

So we will use a different Docker image in which that slow-loading bug has been resolved. Inside the rremani/cuda_crnn_torch Docker image, everything is ready along with the official CRNN code. (You can of course build your own Docker image from the original repository https://github.com/bgshih/crnn if you want.)

First, pull the Docker image from https://hub.docker.com/r/rremani/cuda_crnn_torch

docker pull rremani/cuda_crnn_torch

Now run the container:

docker run --runtime=nvidia -e NVIDIA_VISIBLE_DEVICES=0 -it \
  -v /home/uzzal/data:/tmp \
  -w /opt/crnn/src \
  rremani/cuda_crnn_torch /bin/bash

Here the -e NVIDIA_VISIBLE_DEVICES=0 flag gives the container access only to the first GPU (if you don't have multiple GPUs, you can skip this flag).

Dataset pre-process for CRNN

The official CRNN implementation expects an LMDB database. LMDB is generally the database of choice when using Caffe with large datasets; it uses memory-mapped files, which gives much better I/O performance. (If the lmdb Python library is not installed, install it with pip install lmdb.)

The repository provides a script, tool/create_dataset.py, to convert your images and the corresponding target labels into an LMDB database. Please use Python 2.7 when running this script, because there are some Unicode encoding issues under Python 3.6. Also use ASCII characters for your labels: the source code in this repository only accepts alphanumeric characters (a-z and 0-9). If you want to add extra characters, you have to modify the config file (for example model/crnn_demo/config.lua) and the str2label function in src/utilities.lua.

To create the dataset, you have to pass two lists to the createDataset function in tool/create_dataset.py. The first list contains the image file paths, and the second list contains the corresponding labels.

/home/uzzal/crnn_data/img1.jpg  ---> apple
/home/uzzal/crnn_data/img2.jpg  ---> mango
/home/uzzal/crnn_data/img3.jpg  ---> apple

It is better to remove non-alphanumeric characters from the labels. In my case, after scanning the image files, pairing them with their labels and checking the encoding, I saved the result in a temporary NumPy file, image_label_list.npy (one way to build such a file is sketched below). Then I load that file in a script and call the createDataset function twice, once for the training data and once for the test data. Here tst_lim is the number of test examples, used to split the data into test and train sets.
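
How you build image_label_list.npy depends entirely on how your data is organized. As a hedged example only, if each image file name started with its label (e.g. apple_001.jpg), the array consumed by the script below could be built like this:

import os
import numpy as np

# Hypothetical layout: the label is the part of the file name before the first underscore.
data_dir = '/home/uzzal/crnn_data/'
pairs = []
for fname in sorted(os.listdir(data_dir)):
    if fname.lower().endswith(('.jpg', '.png')):
        label = os.path.splitext(fname)[0].split('_')[0]
        # Column 0 holds the label, column 1 holds the image path,
        # matching how the array is sliced in the next script.
        pairs.append([label, os.path.join(data_dir, fname)])

np.save('/home/uzzal/image_label_list.npy', np.array(pairs))

With the file in place, the following script loads it and builds the train and validation databases.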

import os
import numpy as np
# Run this from the tool/ directory of the CRNN repository (or adjust the import)
# so that createDataset from create_dataset.py is available.
from create_dataset import createDataset

img_lbl_data = np.load('/home/uzzal/image_label_list.npy').astype('str')
np.random.shuffle(img_lbl_data)

tst_lim = 1024                        # number of examples reserved for validation
target_dir = '../out/lmdb_dataset/'

image_path_list = img_lbl_data[:, 1].tolist()   # column 1: image paths
label_list = img_lbl_data[:, 0].tolist()        # column 0: labels

createDataset(os.path.join(target_dir, 'train'),
              image_path_list[tst_lim:-1],
              label_list[tst_lim:-1])

createDataset(os.path.join(target_dir, 'val'),
              image_path_list[0:tst_lim],
              label_list[0:tst_lim])

This will generate our dataset:

f:id:chowagiken_uzzal:20190507160326p:plain
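
If you want to sanity-check the generated database, you can read a few entries back with the lmdb library. The key names below ('num-samples', 'label-%09d') follow the convention used by tool/create_dataset.py, as far as I can tell, so treat this as a quick verification sketch rather than an official API:

import lmdb

# Open the generated training database read-only and inspect a couple of entries.
env = lmdb.open('../out/lmdb_dataset/train', readonly=True, lock=False)
with env.begin() as txn:
    num_samples = int(txn.get('num-samples'.encode()))
    print('number of samples:', num_samples)
    # Keys are 1-indexed and zero-padded to 9 digits in the CRNN tooling.
    first_label_key = ('label-%09d' % 1).encode()
    print('first label:', txn.get(first_label_key))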

Create a model directory under model/, for example model/foo_model. Then create a configuration file config.lua under that model directory. You can copy model/crnn_demo/config.lua and modify lines 14-15 to point to your own database location.

f:id:chowagiken_uzzal:20190507165816p:plain

Now go to src/ and execute th main_train.lua ../model/foo_model/. Your training should start.

Model snapshots and log files will be saved into the model directory.

Evaluating CRNN

There is a demo script at src/demo.lua. Update the config file path and the trained model file path, and also change your test image path.

f:id:chowagiken_uzzal:20190507180705p:plain f:id:chowagiken_uzzal:20190507180816p:plain

Now test it using th demo.lua.

CRNN will then try to recognize the text using the trained model:

f:id:chowagiken_uzzal:20190507180328p:plain

Combining EAST and CRNN

We have successfully trained and run both EAST and CRNN, for text detection and text recognition respectively. So far so good. It's time to integrate them so that EAST and CRNN can work sequentially without intermediate manual steps. What we want is a pipeline where the user gives an image as input and receives the detected text on that image as output. We need to modify some code in both EAST and CRNN. Let's separate the task into two parts: an EAST node and a CRNN node. f:id:chowagiken_uzzal:20190513171307p:plain

Modifications for the EAST node.

We need an HTTP server which receives an image via an HTTP POST request and then calls the EAST detection task. To make it more efficient, we cache the deep neural network weights at the very first request; after that, every request reuses the cached weights, which makes the pipeline faster.

Create an east_weight directory in the EAST root folder and copy your trained weights there.

Create an east_server.py from https://gitlab.com/snippets/1856321/. We took run_demo_server.py and customized it according to our needs. It is a Flask server which runs EAST, passes the isolated images to CRNN, and receives the result from CRNN.
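
The weight-caching idea itself is just lazy initialization on the first request. Below is a minimal sketch of that pattern; load_east_model is a hypothetical helper standing in for the graph-building and checkpoint-restoring code in the actual server script:

# Minimal lazy-loading sketch: the model is restored once and reused by later requests.
_model_cache = {}

def get_detector():
    if 'east' not in _model_cache:
        # load_east_model() is a hypothetical helper that builds the TensorFlow graph
        # and restores the checkpoint from the east_weight directory.
        _model_cache['east'] = load_east_model('east_weight/')
    return _model_cache['east']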

The call_crnn function crops the text regions and saves them to a shared location, then calls the make_req function with the saved location and the number of images. Instead of sending all the cropped images generated by EAST over HTTP, we just share the file location. This is not mandatory; we did it for simplicity. You can send the images in the HTTP POST request if you prefer.

def call_crnn(text_lines_dict, img, save_dir):
    img_dir = os.path.join('/home/PROJECTS/crnn/tmp_crnn/', save_dir)

    # cropping and saving text regions
    for counter, cords in enumerate(text_lines_dict):
        txt_img = img[int(cords['y0']):int(cords['y2']), int(cords['x0']): int(cords['x2'])]

        os.makedirs(img_dir, exist_ok=True)
        cv2.imwrite(os.path.join( img_dir, str(counter)+'.png'), txt_img)

    img_num = str(len(text_lines_dict)-1)
    out_for_crnn = '/home/PROJECTS/crnn/tmp_crnn/' + save_dir + '/'
    out = make_req(out_for_crnn, img_num)

    return out.split('\n')


The make_req function creates an HTTP request to the CRNN node with the shared location and the number of images as parameters, and waits for the response. You may need to install the requests Python library to make requests to the CRNN node (use pip install requests).

def make_req(imgdir, imgnum ):
    url = 'http://crnn_node:5000/' # Set destination URL here
    payload = {"imgdir": imgdir, "imgnum": imgnum,}
    header = {"Content-type": "application/json",}
    response_decoded_json = requests.post(url, data=json.dumps(payload), headers=header)
    res = response_decoded_json.text
    return res

You may have noticed that we use an HTTP POST request just to send two strings. An HTTP GET request might be more suitable here, but we use POST because you may later want to send the cropped images themselves.

For this project we used a very simple HTML template: Index template ($1856358) · Snippets · GitLab. Please put it at EAST/templates/index.html.


Now save this Docker container's state as an image:

docker commit container_id east_node:v1


Modifications for the CRNN node.

We will use a very simple Flask server to receive requests from EAST. This server receives the shared file location and the number of images, then builds the command-line arguments and calls CRNN.

Install Flask and create a very simple request handler, crnn_server.py:

from flask import Flask, abort, request
import json
import subprocess

app = Flask(__name__)


def call_subprocess(out_for_crnn, img_num ):

    docker_cmd = ['th', 'multi_demo.lua',
                  '-imgPath', out_for_crnn,
                  '-imgNumber', img_num]

    print('Requested command: ', docker_cmd)

    process = subprocess.Popen(docker_cmd, stdout=subprocess.PIPE)
    out, err = process.communicate()
    out = out.decode("utf-8")
    out = out.split('\n')

    return out


@app.route('/', methods=['POST'])
def foo():

    if not request.json:
        abort(400)
    print('request ', request.json)
    imgdir = request.json['imgdir']
    imgnum = request.json['imgnum']
    return json.dumps(call_subprocess(imgdir, imgnum))


if __name__ == '__main__':
    app.run(host='0.0.0.0', port=5000, debug=True)


At this moment CRNN can receive HTTP requests, but we also need to modify the crnn/src/demo.lua code, because we have to handle multiple images based on the shared file location and the number of images.

Create a new crnn/src/multi_demo.lua from crnn/src/demo.lua with the following modifications.

cmd = torch.CmdLine()
cmd:option('-imgPath','../data/','image full path')
cmd:option('-imgNumber','0','Number of images')
cmd:option('-modelDir','../model/crnn_demo/','crnn model directory')
cmd:option('-modelName','crnn_demo_model.t7','pre-trained model name')
config = cmd:parse(arg)

local imgPath = config.imgPath
local modelDir = config.modelDir
local imgNumber = config.imgNumber

paths.dofile(paths.concat(modelDir, 'config.lua'))
local modelLoadPath = paths.concat(modelDir, config.modelName)

gConfig = getConfig()
gConfig.modelDir = modelDir
gConfig.maxT = 0
local model, criterion = createModel(gConfig)
local snapshot = torch.load(modelLoadPath)
loadModelState(model, snapshot)
model:evaluate()

local imagePath = imgPath

loop_len = tonumber(imgNumber)
for i = 0, loop_len, 1
do
    imageFullPath = imagePath .. i .. '.png'
    local img = loadAndResizeImage(imageFullPath)
    local text, raw = recognizeImageLexiconFree(model, img)
    print(string.format('%s', text))
end


Now save this Docker container's state as an image:

docker commit container_id crnn_node:v1


Connecting EAST with CRNN

There are many ways for two Docker containers to communicate. Here we will use the link method: first we run the CRNN container with a specified name, then we run the EAST container and link it to that name.

docker run --runtime=nvidia -e NVIDIA_VISIBLE_DEVICES=0 -it \
  --name crnn_node \
  -v /share/personal/uzzal:/home  \
  -w /opt/crnn/src \
  -p 5013:5000 -p 8813:8888 \
  crnn_node:v1 /bin/bash  
docker run --runtime=nvidia -e NVIDIA_VISIBLE_DEVICES=1 -it \
  --name east_server  --link crnn_node \
  -v /share/personal/uzzal:/home  \
  -w /home \
  -p 5014:5001 -p 8814:8889 -p 6013:6006 \
  east_node:v1 bash  


In the crnn_node container, run the Flask server: python crnn_server.py

In the east_node container, run the Flask server: python east_server.py --port=5001


Now open a browser and go to host:5014. Upload a test image and you will get a response with the detected text on that image.

Use cases

  • Scene image text detection
  • Custom shaped character detection
  • Brand name/Logo detection

Limitations

  • Dataset pre-processing is expensive, especially for CRNN
  • EAST is a fairly heavy network and relatively slower than some newer models such as TextBoxes++
  • CRNN takes a long time to converge

References

[1704.03155v2] EAST: An Efficient and Accurate Scene Text Detector

[1801.02765] TextBoxes++: A Single-Shot Oriented Scene Text Detector

[1507.05717] An End-to-End Trainable Neural Network for Image-based Sequence Recognition and Its Application to Scene Text Recognition

IBM Research | Video AI Technologies