''The 2nd Shandong Province Data Application Innovation and Entrepreneurship Competition-Main Arena-Inspection Report Recognition''

Preliminary && Baseline

Competition analysis

The competition provides only a dataset for local testing, not training data, and it encourages the use of open datasets. Therefore, this repo completes the task by running inference with open-source models and their pretrained weights. I divided the problem into two parts:

1. Text recognition (position + content)
2. Extraction of effective information (information filtering and combination)

Solution

First, use public models and weights to detect text positions and recognize text content.

Here are the URLs of several public OCR projects for reference:

[EasyOCR](https://github.com/JaidedAI/EasyOCR)

(Ready-to-use OCR with 80+ supported languages and all popular writing scripts, including Latin, Chinese, Arabic, Devanagari, and Cyrillic.)

[tensorflow](https://github.com/xiaofengShi/CHINESE-OCR)

(Uses TensorFlow for natural-scene text detection, and Keras/PyTorch to implement CTPN + CRNN + CTC for variable-length scene text recognition.)

[chineseocr](https://github.com/chineseocr/chineseocr)

(Based on YOLOv3 and CRNN for Chinese natural-scene text detection and recognition.)

[PaddleOCR](https://github.com/PaddlePaddle/PaddleOCR)

(Awesome multilingual OCR toolkits based on PaddlePaddle: a practical ultra-lightweight OCR system that provides data annotation and synthesis tools and supports training and deployment on server, mobile, embedded, and IoT devices.)

[chineseocr_lite](https://github.com/ouyanghuiyu/chineseocr_lite)

[PytorchOCR](https://github.com/WenmuZhou/PytorchOCR)

This repo uses PaddleOCR for both detection and recognition.

Filtering and combining: for extracting the effective information, you are free to try a variety of methods. Here I offer two simple ideas.
1. Find the dividing line, then shift up and down from it to extract the effective information.

   You can detect the line with OpenCV:
```
 cv2.Canny
```
2. Find the heading keyword directly first, and extract the information according to the coordinate position of the keyword.
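
For the first idea, the sketch below shows one possible reading of "find the dividing line, then shift up and down". It is not the repo's code, and it swaps the `cv2.Canny` edge detector for a plain row-projection test so that it runs without OpenCV. The function names, the `min_fill` threshold, and the `(top, bottom, text)` box layout are all assumptions made for illustration.

```python
def find_rule_rows(binary_image, min_fill=0.8):
    """Return the y indices of rows that look like a horizontal dividing line.

    binary_image is a list of rows; each pixel is 1 (dark) or 0 (light).
    A row counts as a line when at least min_fill of its pixels are dark.
    """
    rows = []
    for y, row in enumerate(binary_image):
        if row and sum(row) / len(row) >= min_fill:
            rows.append(y)
    return rows


def texts_between(boxes, upper_y, lower_y):
    """Keep the text of boxes lying entirely between two dividing lines.

    boxes is a list of (top, bottom, text) tuples (hypothetical layout).
    """
    return [text for top, bottom, text in boxes
            if upper_y < top and bottom < lower_y]
```

With the line positions known, moving up or down from a line amounts to choosing which pair of neighbouring lines bounds the text you want.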

This repo temporarily provides the second method as a baseline.
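
As a rough illustration of the keyword idea (not the repo's actual implementation), the sketch below assumes the OCR output has already been simplified to a list of `(box, text)` pairs, where `box` is four `[x, y]` corner points. It looks up a heading keyword and joins the text of every box that sits to its right on the same row. The function names and the same-row rule are hypothetical.

```python
def box_bounds(box):
    """Return (left, top, right, bottom) for a box given as four [x, y] points."""
    xs = [point[0] for point in box]
    ys = [point[1] for point in box]
    return min(xs), min(ys), max(xs), max(ys)


def extract_right_of_keyword(items, keyword):
    """Join the text of boxes to the right of the first box containing keyword.

    items is a list of (box, text) pairs. A box is treated as being on the
    same row when its vertical center falls inside the keyword box's height.
    Returns an empty string when the keyword is not found.
    """
    for box, text in items:
        if keyword in text:
            _, top, right, bottom = box_bounds(box)
            break
    else:
        return ""

    same_row = []
    for box, text in items:
        left, box_top, _, box_bottom = box_bounds(box)
        center_y = (box_top + box_bottom) / 2
        if left >= right and top <= center_y <= bottom:
            same_row.append((left, text))

    # Read the row left to right.
    return "".join(text for _, text in sorted(same_row))
```

A real report layout will likely need extra rules, for example values printed below the heading instead of beside it.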

Usage: ( Paddle Pipeline)

Configuration

Ubuntu 18.04, CUDA 10.1, cuDNN 7.6.5+, Python 3.6.12

Clone this repo:

git clone https://github.com/Complicateddd/PaddlePL.git

Install library (GPU):

cd PaddlePL/
pip install -r requirements.txt

If you do not have a GPU, install the CPU libraries instead:

cd PaddlePL/
pip install -r requirements_cpu.txt

Download Test Data

Download the test data from the link below, then put it in ./data/img

[TestData](http://data.sd.gov.cn/cmpt/competion/shandong.html)

Inference Demo: (GPU)

python run.py ./data/img/ submit.csv

Inference Demo: (CPU) first set self.use_gpu=False in ./hyper_config.py, then run:

python run.py ./data/img/ submit.csv

Eval (ROUGE-L)
```
python rouge.py
```
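
For readers unfamiliar with the metric, here is a minimal sketch of the standard ROUGE-L F-score, built on the longest common subsequence (LCS) between a predicted string and a reference string. It is an illustration only: the contents of `rouge.py` are not shown here, so the character-level comparison and the `beta=1.0` weighting are assumptions, and the competition's official scorer may differ.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two sequences."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l(candidate, reference, beta=1.0):
    """ROUGE-L F-score between two strings, compared character by character.

    beta weights recall against precision; beta=1.0 is an assumed default.
    """
    if not candidate or not reference:
        return 0.0
    lcs = lcs_length(candidate, reference)
    if lcs == 0:
        return 0.0
    precision = lcs / len(candidate)
    recall = lcs / len(reference)
    return (1 + beta ** 2) * precision * recall / (recall + beta ** 2 * precision)
```

Comparing characters rather than whitespace-separated words is a natural fit for Chinese text, which has no spaces between words.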

Result && Evaluation

| | Image Number | ROUGE-L |
|---|---|---|
| Offline | 100 | ~0.78 |
| Online | unknown | ~0.74 |

Note:

This repo provides only a simple pipeline; there is much more for you to explore. You are welcome to get in touch with me! If this repo (PaddlePL) is helpful to you, a star or fork would be great motivation.
