Revisiting Scene Text Recognition: A Data Perspective

Union14M is a large scene text recognition (STR) dataset collected from 17 publicly available datasets. It contains 4M labeled images (Union14M-L) and 10M unlabeled images (Union14M-U), and is intended to support more in-depth analysis for the STR community.

arXiv preprint Gradio demo Open In Colab

Introduction Download MAERec

What's New

1. Introduction

  • Scene Text Recognition (STR) is a fundamental task in computer vision that aims to recognize text in natural images. STR has developed rapidly in recent years, and recent state-of-the-art methods show a trend of accuracy saturation on six commonly used benchmarks (IC13, IC15, SVT, IIIT5K, SVTP, CUTE80). This is a promising result, but it also raises a question: are we done with STR, or does the lack of challenges in current benchmarks merely conceal the drawbacks of existing methods in real-world scenarios?
  • To explore the challenges that STR models still face, we consolidate a large-scale STR dataset for analysis and identify seven open challenges. Furthermore, we propose a challenge-driven benchmark to facilitate the future development of STR. Additionally, we reveal that using massive unlabeled data through self-supervised pre-training can remarkably enhance the performance of STR models in real-world scenarios, suggesting a practical solution for STR from a data perspective. We hope this work can spark future research beyond the realm of existing data paradigms.

2. Contents

3. Union14M Dataset

3.1. Union14M-L

  • Union14M-L contains 4M images collected from 14 publicly available datasets. See Source Datasets for the details of the 14 datasets. We adopt several strategies to refine the naive concatenation of the 14 datasets, including:
    • Cropping: We crop each image using its minimal axis-aligned bounding box.
    • De-duplication: Some datasets contain duplicate images, which we remove.
  • We also categorize the images in Union14M-L into five difficulty levels using an error voting method.
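The cropping and de-duplication steps above can be sketched roughly as follows. This is a minimal illustration and not the authors' pipeline: the function names, the list-of-rows image representation, and the use of a SHA-256 digest to detect byte-identical duplicates are all assumptions made for the example.

```python
import hashlib
import math
from typing import Iterable, List, Sequence, Tuple

Point = Tuple[float, float]
Box = Tuple[int, int, int, int]  # (left, top, right, bottom); right/bottom are exclusive


def min_axis_aligned_box(polygon: Sequence[Point]) -> Box:
    """Smallest axis-aligned rectangle (in whole pixels) containing every vertex."""
    xs = [x for x, _ in polygon]
    ys = [y for _, y in polygon]
    return (math.floor(min(xs)), math.floor(min(ys)),
            math.ceil(max(xs)), math.ceil(max(ys)))


def crop(image: List[List[int]], box: Box) -> List[List[int]]:
    """Crop a row-major image (a list of pixel rows), clamping the box to the image."""
    left, top, right, bottom = box
    height, width = len(image), len(image[0])
    left, top = max(left, 0), max(top, 0)
    right, bottom = min(right, width), min(bottom, height)
    return [row[left:right] for row in image[top:bottom]]


def deduplicate(images: Iterable[bytes]) -> List[bytes]:
    """Keep the first occurrence of each byte-identical image (assumed criterion)."""
    seen = set()
    unique = []
    for data in images:
        digest = hashlib.sha256(data).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(data)
    return unique
```

An exact hash only catches byte-identical files; near-duplicates (resized or re-encoded copies) would need a different criterion, which this README does not specify.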

3.2. Union14M-U

  • The optimal solution to improve the performance of STR in real-world scenarios is to utilize more data for training. However, labeling text images is both costly and time-intensive, given that it involves annotating sequences and requires specialized language expertise. Therefore, it is desirable to investigate the potential of utilizing unlabeled data via self-supervised learning for STR. To this end, we collect 10M unlabeled images from 3 large datasets using an IoU voting method.
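This README names the IoU voting method without describing it, so the following is only a sketch of one plausible reading: several text detectors are run on the same image, and a detected box is kept only when every other detector produces a box that overlaps it strongly. The number of detectors, the `0.7` threshold, and the function names are assumptions for illustration.

```python
from typing import List, Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def iou_vote(detections: Sequence[Sequence[Box]], threshold: float = 0.7) -> List[Box]:
    """Keep boxes from the first detector that every other detector confirms.

    `detections` holds one list of boxes per detector; the threshold value
    is a hypothetical choice, not taken from this README.
    """
    reference, *others = detections
    kept = []
    for box in reference:
        confirmed = all(
            any(iou(box, other) >= threshold for other in boxes)
            for boxes in others
        )
        if confirmed:
            kept.append(box)
    return kept
```

Requiring agreement between detectors trades recall for precision: fewer regions survive, but those that do are less likely to be false detections, which matters when no human labels are available to check them.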

3.3. Union14M-Benchmark

  • We raise seven open challenges for STR in real-world scenarios, and propose a challenge-driven benchmark to facilitate the future development.

3.4. Download

| Datasets | One Drive | Baidu Yun |
| --- | --- | --- |
| Union14M-L & Union14M-Benchmark (12GB) | One Drive | Baidu Yun |
| Union14M-U (36.63GB) | One Drive | Baidu Yun |
| 6 Common Benchmarks (17.6MB) | One Drive | Baidu Yun |
  • Union14M is organized as follows:

    Structure of Union14M-L & Union14M-Benchmark
    |--Union14M-L
      |--full_images
        |--art_curve # Images collected from the 14 datasets
        |--art_scene
        |--COCOTextV2
        |--...
      |--train_annos
        |--mmocr-0.x # annotation in mmocr0.x format
          |--train_challenging.jsonl # challenging subset
          |--train_easy.jsonl # easy subset
          |--train_hard.jsonl # hard subset
          |--train_medium.jsonl # medium subset
          |--train_normal.jsonl # normal subset
          |--val_annos.jsonl # validation subset
        |--mmocr1.0.x # annotation in mmocr1.0 format
          |--...
      |--Union14M-Benchmarks
        |--artistic
          |--imgs
          |--annotation.json # annotation in mmocr1.0 format
          |--annotation.jsonl # annotation in mmocr0.x format
        |--...
    
    Structure of Union14M-U

    We store the images in LMDB format. Union14M-U is organized as follows:

    |--Union14M-U
      |--book32_lmdb
      |--cc_lmdb
      |--openvino_lmdb
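As a rough illustration of how the `mmocr-0.x` `.jsonl` annotation files in the Union14M-L tree might be consumed, the sketch below parses one JSON object per line. It assumes each record carries `filename` and `text` keys, which is the usual MMOCR 0.x recognition layout but is not confirmed by this README; the helper name is hypothetical.

```python
import json
from typing import Iterable, Iterator, Tuple


def iter_annotations(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (image filename, transcription) pairs from JSON-lines text.

    Assumes every non-empty line is a JSON object with "filename" and
    "text" keys; adjust the keys if the downloaded files differ.
    """
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        record = json.loads(line)
        yield record["filename"], record["text"]


# Example with in-memory lines; for a real file, pass an open file handle, e.g.
# open("Union14M-L/train_annos/mmocr-0.x/train_easy.jsonl", encoding="utf-8").
sample = [
    '{"filename": "art_curve/0001.jpg", "text": "Hello"}',
    '{"filename": "art_scene/0002.jpg", "text": "World!"}',
]
pairs = list(iter_annotations(sample))
```

Streaming the file line by line keeps memory use flat, which helps with subsets that hold millions of entries.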
    

4. STR Models trained on Union14M-L

  • We train several STR models on Union14M-L using MMOCR-1.0.

4.1. Checkpoints

  • Evaluated on both common benchmarks and Union14M-Benchmark. Accuracies (WAICS) in $\color{grey}{grey}$ are from the original implementations (trained on synthetic datasets), and accuracies in $\color{green}{green}$ are from models trained on Union14M-L. All re-trained models are trained to predict upper- and lower-case text, symbols, and spaces.

    | Models | Checkpoint | IIIT5K | SVT | IC13-1015 | IC15-2077 | SVTP | CUTE80 | Avg. |
    | --- | --- | --- | --- | --- | --- | --- | --- | --- |
    | ASTER | GoogleDrive / BaiduYun / OneDrive | $\color{grey}{93.57}$ \ $\color{green}{94.37}$ | $\color{grey}{89.49}$ \ $\color{green}{89.03}$ | $\color{grey}{92.81}$ \ $\color{green}{93.60}$ | $\color{grey}{76.65}$ \ $\color{green}{78.57}$ | $\color{grey}{80.62}$ \ $\color{green}{80.93}$ | $\color{grey}{85.07}$ \ $\color{green}{90.97}$ | $\color{grey}{86.37}$ \ $\color{green}{88.07}$ |
    | ABINet | GoogleDrive / BaiduYun / OneDrive | $\color{grey}{95.23}$ \ $\color{green}{97.30}$ | $\color{grey}{90.57}$ \ $\color{green}{96.45}$ | $\color{grey}{93.69}$ \ $\color{green}{95.52}$ | $\color{grey}{78.86}$ \ $\color{green}{85.36}$ | $\color{grey}{84.03}$ \ $\color{green}{89.77}$ | $\color{grey}{84.37}$ \ $\color{green}{94.79}$ | $\color{grey}{87.79}$ \ $\color{green}{93.20}$ |
    | NRTR | Google Drive / BaiduYun / OneDrive | $\color{grey}{91.50}$ \ $\color{green}{96.73}$ | $\color{grey}{88.25}$ \ $\color{green}{93.20}$ | $\color{grey}{93.69}$ \ $\color{green}{95.57}$ | $\color{grey}{72.32}$ \ $\color{green}{80.74}$ | $\color{grey}{77.83}$ \ $\color{green}{83.57}$ | $\color{grey}{75.00}$ \ $\color{green}{92.01}$ | $\color{grey}{83.09}$ \ $\color{green}{90.30}$ |
    | SATRN | Google Drive / BaiduYun / OneDrive | $\color{grey}{96.00}$ \ $\color{green}{97.27}$ | $\color{grey}{91.96}$ \ $\color{green}{95.36}$ | $\color{grey}{96.06}$ \ $\color{green}{96.85}$ | $\color{grey}{80.31}$ \ $\color{green}{87.14}$ | $\color{grey}{88.37}$ \ $\color{green}{90.39}$ | $\color{grey}{89.93}$ \ $\color{green}{96.18}$ | $\color{grey}{90.43}$ \ $\color{green}{93.89}$ |
    | SAR | Google Drive / BaiduYun / OneDrive | $\color{grey}{95.33}$ \ $\color{green}{97.07}$ | $\color{grey}{88.41}$ \ $\color{green}{93.66}$ | $\color{grey}{93.69}$ \ $\color{green}{95.76}$ | $\color{grey}{76.02}$ \ $\color{green}{82.19}$ | $\color{grey}{83.26}$ \ $\color{green}{86.98}$ | $\color{grey}{90.28}$ \ $\color{green}{92.01}$ | $\color{grey}{87.83}$ \ $\color{green}{91.27}$ |
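For readers unfamiliar with the metric, WAICS is commonly read as word accuracy ignoring case and symbols. The sketch below shows one way such a score could be computed. The normalization used here (lowercasing and keeping only ASCII letters and digits) is an assumption for illustration, and the exact rule in the evaluation code may differ.

```python
import re
from typing import Sequence

# Assumed normalization: drop everything except ASCII digits and lowercase letters.
_NON_ALNUM = re.compile(r"[^0-9a-z]")


def normalize(text: str) -> str:
    """Lowercase the text and strip symbols and whitespace."""
    return _NON_ALNUM.sub("", text.lower())


def waics(predictions: Sequence[str], targets: Sequence[str]) -> float:
    """Percentage of words whose normalized prediction equals the normalized label."""
    if len(predictions) != len(targets):
        raise ValueError("predictions and targets must have the same length")
    if not targets:
        return 0.0
    correct = sum(normalize(p) == normalize(t) for p, t in zip(predictions, targets))
    return 100.0 * correct / len(targets)
```

Under this reading, a prediction of `Hello!` counts as correct for the label `hello`, while `cat` versus `cap` does not, since a whole word must match after normalization.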

5. MAERec

  • MAERec is a scene text recognition model composed of a ViT backbone and an auto-regressive Transformer decoder. It shows outstanding performance in scene text recognition, especially when pre-trained on Union14M-U with MAE.

  • Results of MAERec on six common benchmarks and Union14M-Benchmarks

  • Predictions of MAERec on some challenging examples
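To make "auto-regressive style" concrete, here is a minimal greedy decoding loop. It is a toy sketch and not MAERec's implementation: `step` stands in for the Transformer decoder (which in the real model would also attend to ViT image features), and the token ids, `max_len`, and `toy_step` are all hypothetical.

```python
from typing import Callable, List, Sequence


def greedy_decode(
    step: Callable[[List[int]], Sequence[float]],
    bos: int,
    eos: int,
    max_len: int = 25,
) -> List[int]:
    """Generate tokens one at a time, feeding each prediction back as input."""
    tokens = [bos]
    for _ in range(max_len):
        scores = step(tokens)  # one score per vocabulary entry
        next_token = max(range(len(scores)), key=lambda i: scores[i])
        if next_token == eos:
            break
        tokens.append(next_token)
    return tokens[1:]  # drop the start token


# Toy vocabulary: 0=<bos>, 1=<eos>, 2="a", 3="b", 4="c".
def toy_step(prefix: List[int]) -> Sequence[float]:
    """Stand-in decoder that always spells "cab" and then stops."""
    script = [4, 2, 3, 1]
    target = script[len(prefix) - 1]
    return [1.0 if i == target else 0.0 for i in range(5)]


decoded = greedy_decode(toy_step, bos=0, eos=1)
```

Each step conditions on all previously emitted characters, which is what distinguishes an auto-regressive decoder from one that predicts every character in parallel.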

5.1. Pre-training

5.2. Fine-tuning

5.3. Evaluation

  • If you want to evaluate MAERec on benchmarks, check evaluation

5.4. Inferencing

  • If you want to run MAERec inference on your own images, check inferencing

5.5. Demo

  • We also provide a Gradio app for MAERec, which can be used to run inference on your own pictures. You can run it locally or try it on 🤗HuggingFace Spaces.
  • To run it locally, use the following commands:
      1. Install gradio and download the pretrained weights
      pip install gradio
      wget https://download.openmmlab.com/mmocr/textdet/dbnetpp/dbnetpp_resnet50-oclip_fpnc_1200e_icdar2015/dbnetpp_resnet50-oclip_fpnc_1200e_icdar2015_20221101_124139-4ecb39ac.pth -O dbnetpp.pth
      wget https://github.com/Mountchicken/Union14M/releases/download/Checkpoint/maerec_b_union14m.pth -O maerec_b.pth
      2. Run the gradio app
      python tools/gradio_app.py \
        --rec_config mmocr-dev-1.x/configs/textrecog/maerec/maerec_b_union14m.py \
        --rec_weight ${PATH_TO_MAEREC_B} \
        --det_config mmocr-dev-1.x/configs/textdet/dbnetpp/dbnetpp_resnet50-oclip_fpnc_1200e_icdar2015.py \
        --det_weight ${PATH_TO_DBNETPP}

6. License

7. Acknowledgment

  • We sincerely thank all the constructors of the 17 datasets used in Union14M, and also the developers of MMOCR.

8. Citation

@inproceedings{jiang2023revisiting,
      title={Revisiting Scene Text Recognition: A Data Perspective}, 
      author={Qing Jiang and Jiapeng Wang and Dezhi Peng and Chongyu Liu and Lianwen Jin},
      booktitle={Proceedings of the IEEE/CVF international conference on computer vision},
      year={2023},
}