DocXChain V0.95
yue kun committed Oct 20, 2023
1 parent 4eb14b9 commit 324d1c3
Showing 15 changed files with 326 additions and 71 deletions.
26 changes: 26 additions & 0 deletions .gitignore
@@ -0,0 +1,26 @@
__pycache__
CIHP_PGN
*.pkl
*.bmp
*.pdf
train_log
train_logs
*.swp
net_*
*.tar
*best-epoch-*
*epoch-*
.DS_Store
.nfs*
.ipynb_checkpoints
*.model
*.gz
others
images
tmp*.py
*.dlc
*.ipynb*
*.pyc
__pycache__
.idea/
*.new
54 changes: 43 additions & 11 deletions Applications/DocXChain/README.md
@@ -2,46 +2,64 @@

## Introduction

<font color=#FFA500 size=10> ***"Making Every Document Literally Accessible to Machines"*** </font>
<font color=#FFA500 size=3> ***"Make Every Unstructured Document Literally Accessible to Machines"*** </font>

Documents have been playing a critically important role in the daily work, study and life of people around the world. Billions, if not trillions, of documents in different forms are created, viewed, processed, transmitted and stored every day, either physically or digitally. However, not all documents in the digital world can be directly accessed by machines (including computers and other automatic equipment), as only a minor portion of the documents can be successfully parsed with low-level procedures. For instance, the [Adobe Extract APIs](https://developer.adobe.com/document-services/docs/overview/pdf-extract-api/) are able to directly convert the metadata of born-digital PDF files into HTML-like trees, but will completely fail when handling PDFs generated from photographs produced by scanners or images captured by cameras. Therefore, if one would like to make documents that are not born-digital accessible to machines, a powerful toolset for extracting elements from documents is of the essence.
Documents are ubiquitous, since they are excellent carriers for recording and spreading information across space and time. Documents have been playing a critically important role in the daily work, study and life of people all over the world. Every day, billions of documents in different forms are created, viewed, processed, transmitted and stored around the world, either physically or digitally. However, not all documents in the digital world can be directly accessed by machines (including computers and other automatic equipment), as only a portion of the documents can be successfully parsed with low-level procedures. For instance, the [Adobe Extract APIs](https://developer.adobe.com/document-services/docs/overview/pdf-extract-api/) are able to directly convert the metadata of born-digital PDF files into HTML-like trees, but would completely fail when handling PDFs generated from photographs produced by scanners or images captured by cameras. Therefore, if one would like to make documents that are not born-digital conveniently and instantly accessible to machines, a powerful toolset for extracting the structures and contents from such unstructured documents is of the essence.

DocXChain is a powerful open-source toolchain for document parsing, which can convert the rich contents in ***unstructured documents***, such as text, tables and charts, into ***structured representations*** that are readable and manipulable by machines. Currently, basic capabilities, including text detection, text recognition, table structure recognition and layout analysis, are provided. In addition, upon these basic capabilities, we also build typical pipelines, i.e., text reading, table parsing, and document structurization, to drive more complicated applications related to documents in real-world scenarios.
DocXChain is a powerful open-source toolchain for document parsing, which can convert the rich information in ***unstructured documents***, such as text, tables and charts, into ***structured representations*** that are readable and manipulable by machines. Currently, basic capabilities, including text detection, text recognition, table structure recognition and layout analysis, are provided. In addition, upon these basic capabilities, we also build typical pipelines, i.e., text reading, table parsing, and document structurization, to drive more complicated applications related to documents in real-world scenarios.

DocXChain is designed and developed with the original aspiration of ***promoting the level of digitization and structurization for documents***. In the future, we will go beyond pure document parsing capabilities, to explore more possibilities, e.g., combining DocXChain with large language models (LLMs) to perform document information extraction (IE), question answering (QA) and retrieval-augmented generation (RAG).

**Notice 1:** In this project, we adopt the ***broad concept of document***, meaning DocXChain can support various kinds of documents, including regular documents (such as books, academic papers and business forms), street view photos, presentations and even screenshots.
For more details, please refer to the [technical report](https://arxiv.org/abs/2310.12430) of DocXChain.

**Notice 2:** We also provide commercial products (online APIs) for document parsing on Alibaba Cloud. Please visit the [homepage of DocMind](https://docmind.console.aliyun.com/doc-overview), if you are interested.
**Notice 1:** In this project, we adopt the ***broad concept of documents***, meaning DocXChain can support various kinds of documents, including regular documents (such as books, academic papers and business forms), street view photos, presentations and even screenshots.

**Notice 2:** You are welcome to experience our online PoC system [DocMaster](https://www.modelscope.cn/studios/damo/DocMaster/summary), which combines basic document parsing capabilities with LLMs to realize precise document information extraction and question answering.

**Notice 3:** We also provide commercial products (online APIs) for document parsing on Alibaba Cloud. Please visit the [homepage of DocMind](https://docmind.console.aliyun.com/doc-overview), if you are interested.

## Core Ideology

The core design ideas of DocXChain are summarized as follows:
- **Object:** The central objects of DocXChain are ***documents***, rather than ***LLMs***.
- **Concision:** The capabilities for document parsing are presented in a "modules + pipelines" fashion, while unnecessary abstraction and encapsulation are abandoned.
- **Compatibility:** This toolchain can be readily integrated with existing tools, libraries or models (such as LangChain and ChatGPT), to build more powerful systems that can accomplish more complicated and challenging tasks.

## Qualitative Examples

* Example of General Text Reading:

![DocXChain_text_reading_example](./resources/DocXChain_text_reading_example.png)

* Example of Table Parsing:

![DocXChain_table_parsing_example](./resources/DocXChain_table_parsing_example.png)

* Example of Document Structurization:

![DocXChain_document_structurization_example](./resources/DocXChain_document_structurization_example.png)

## Installation

* Python version >= 3.7
* Install basic requirements (Python version >= 3.7):

```
pip install -r requirements.txt
```

**[Important]** Install ModelScope as well as related frameworks and libraries (such as PyTorch and TensorFlow). Please refer to the [GitHub homepage of ModelScope](https://github.com/modelscope/modelscope) for more details regarding the installation instructions.
* **[Important]** Install ModelScope as well as related frameworks and libraries (such as PyTorch and TensorFlow). Please refer to the [GitHub homepage of ModelScope](https://github.com/modelscope/modelscope) for more details regarding the installation instructions.

Install ImageMagick (needed to load PDF):
* Install ImageMagick (needed to load PDFs):
```bash
apt-get update
apt-get install libmagickwand-dev
pip install Wand
sed -i '/disable ghostscript format types/,+6d' /etc/ImageMagick-6/policy.xml # run this command if the following message occurs: "wand.exceptions.PolicyError: attempt to perform an operation not allowed by the security policy `PDF'"
```

Download the layout analysis model (a homebrewed model provided by us):
* Download the layout analysis model (a homebrewed model provided by us):
```bash
wget -c -t 100 -P /home/ https://github.com/AlibabaResearch/AdvancedLiterateMachinery/releases/download/v1.2.0-docX-release/DocXLayout_230829.pth
wget -c -t 100 -P /home/ https://github.com/AlibabaResearch/AdvancedLiterateMachinery/releases/download/v1.2.0-docX-release/DocXLayout_231012.pth
```

## Inference
@@ -53,6 +71,20 @@ python example.py table_parsing <document_file_path> <output_file_path> # task: table parsing
python example.py document_structurization <document_file_path> <output_file_path> # task: document structurization
```
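
A minimal programmatic sketch of the same pipeline, distilled from the example.py changes in this commit. The initialize, run, visualize and release calls and the recognition model ID come straight from the diff; the text detection model ID and the exact config keys for this pipeline are assumptions, since the corresponding hunk is abridged above.

```python
# Sketch only: drive the general text reading pipeline from Python.
# ASSUMPTIONS: the text detection model ID below and the exact config keys
# are inferred from the document_structurization example in example.py;
# verify against the full file before relying on them.
import cv2

from pipelines.general_text_reading import GeneralTextReading
from utilities.visualization import general_text_reading_visualization

configs = dict()

text_detection_configs = dict()
text_detection_configs['from_modelscope_flag'] = True
text_detection_configs['model_path'] = 'damo/cv_resnet18_ocr-detection-line-level_damo'  # assumed model ID
configs['text_detection_configs'] = text_detection_configs

text_recognition_configs = dict()
text_recognition_configs['from_modelscope_flag'] = True
text_recognition_configs['model_path'] = 'damo/cv_convnextTiny_ocr-recognition-general_damo'
configs['text_recognition_configs'] = text_recognition_configs

image = cv2.imread('document.png')  # example.py uses a load_document() helper that also handles PDFs

text_reader = GeneralTextReading(configs)                               # initialize
final_result = text_reader(image)                                       # run
output_image = general_text_reading_visualization(final_result, image)  # visualize
text_reader.release()                                                   # release

cv2.imwrite('output.png', output_image)
```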

## Citation

If you find our work beneficial, please cite:

```
@article{DocXChain2023,
title={{DocXChain: A Powerful Open-Source Toolchain for Document Parsing and Beyond}},
author={Cong Yao},
journal={ArXiv},
year={2023},
url={https://arxiv.org/abs/2310.12430}
}
```

## *License*

DocXChain is released under the terms of the [Apache License, Version 2.0](LICENSE).
42 changes: 27 additions & 15 deletions Applications/DocXChain/example.py
@@ -14,6 +14,7 @@
from pipelines.general_text_reading import GeneralTextReading
from pipelines.table_parsing import TableParsing
from pipelines.document_structurization import DocumentStructurization
from utilities.visualization import *

def general_text_reading_example(image):

@@ -36,13 +37,16 @@ def general_text_reading_example(image):
# run
final_result = text_reader(image)

# display
output_image = image.copy()
if True:
print (final_result)

# visualize
output_image = general_text_reading_visualization(final_result, image)

# release
text_reader.release()

return output_image, final_result
return final_result, output_image

def table_parsing_example(image):

@@ -70,13 +74,16 @@ def table_parsing_example(image):
# run
final_result = table_parser(image)

# display
output_image = image.copy()
if True:
print (final_result)

# visualize
output_image = table_parsing_visualization(final_result, image)

# release
table_parser.release()

return output_image, final_result
return final_result, output_image


def document_structurization_example(image):
@@ -86,7 +93,7 @@

layout_analysis_configs = dict()
layout_analysis_configs['from_modelscope_flag'] = False
layout_analysis_configs['model_path'] = '/home/DocXLayout_230829.pth' # layout analysis model is NOT from modelscope
layout_analysis_configs['model_path'] = '/home/DocXLayout_231012.pth' # note that: currently the layout analysis model is NOT from modelscope
configs['layout_analysis_configs'] = layout_analysis_configs

text_detection_configs = dict()
@@ -96,7 +103,7 @@

text_recognition_configs = dict()
text_recognition_configs['from_modelscope_flag'] = True
text_recognition_configs['model_path'] = 'damo/cv_convnextTiny_ocr-recognition-general_damo' # alternatives: 'damo/cv_convnextTiny_ocr-recognition-scene_damo', 'damo/cv_convnextTiny_ocr-recognition-document_damo', 'damo/cv_convnextTiny_ocr-recognition-handwritten_damo'
text_recognition_configs['model_path'] = 'damo/cv_convnextTiny_ocr-recognition-document_damo' # alternatives: 'damo/cv_convnextTiny_ocr-recognition-scene_damo', 'damo/cv_convnextTiny_ocr-recognition-general_damo', 'damo/cv_convnextTiny_ocr-recognition-handwritten_damo'
configs['text_recognition_configs'] = text_recognition_configs

# initialize
@@ -105,13 +112,16 @@
# run
final_result = document_structurizer(image)

# display
output_image = image.copy()
if True:
print (final_result)

# visualize
output_image = document_structurization_visualization(final_result, image)

# release
document_structurizer.release()

return output_image, final_result
return final_result, output_image

# main routine
def main():
@@ -137,18 +147,20 @@
image = load_document(args.document_path)

# process
output_image = None
if image is not None:
if args.task == 'general_text_reading':
general_text_reading_example(image)
final_result, output_image = general_text_reading_example(image)
elif args.task == 'table_parsing':
table_parsing_example(image)
final_result, output_image = table_parsing_example(image)
else: # args.task == 'document_structurization'
document_structurization_example(image)
final_result, output_image = document_structurization_example(image)
else:
print ("Failed to load the document file!")

# output
cv2.imwrite(args.output_path, image)
if output_image is not None:
cv2.imwrite(args.output_path, output_image)

# finish
now = datetime.datetime.now(tz)
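A change worth noting in example.py above: the default recognition model for document structurization switches from the general variant to the document variant. The four model IDs named in the diff are interchangeable ModelScope models, one per document type; a small sketch of selecting among them (the dictionary and its keys are illustrative, not part of the codebase):

```python
# Illustrative helper (not in the codebase): the four ModelScope recognition
# model IDs listed in example.py, keyed by the document type they target.
RECOGNITION_MODELS = {
    'general': 'damo/cv_convnextTiny_ocr-recognition-general_damo',
    'scene': 'damo/cv_convnextTiny_ocr-recognition-scene_damo',
    'document': 'damo/cv_convnextTiny_ocr-recognition-document_damo',
    'handwritten': 'damo/cv_convnextTiny_ocr-recognition-handwritten_damo',
}

text_recognition_configs = dict()
text_recognition_configs['from_modelscope_flag'] = True
text_recognition_configs['model_path'] = RECOGNITION_MODELS['document']  # the new default in this commit
```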
27 changes: 22 additions & 5 deletions Applications/DocXChain/modules/layout_analysis.py
@@ -4,9 +4,7 @@
import os
import sys
import numpy as np
import datetime
import time
import cv2
import json

BASE_DIR = os.path.dirname(__file__)
sys.path.append(BASE_DIR + '/../../../DocumentUnderstanding/DocXLayout')
@@ -29,13 +27,22 @@ def __init__(self, configs):

# initialize and launch module
if configs['from_modelscope_flag'] is True:
self.layout_analyser = None # (20230912) currently we only support models from Advanced Literate Machinery (https://github.com/AlibabaResearch/AdvancedLiterateMachinery)
self.layout_analyser = None # (20230912) currently we only support layout analysis model from Advanced Literate Machinery (https://github.com/AlibabaResearch/AdvancedLiterateMachinery)
else:
params = {
'model_file': configs['model_path'],
'debug': 0, # 1: save vis results, 0: don't save
}


# load map information
map_info = json.load(open(BASE_DIR + '/../../../DocumentUnderstanding/DocXLayout/map_info.json'))
category_map = {}
for cate, idx in map_info["huntie"]["primary_map"].items():
category_map[idx] = cate

self.category_map = category_map

# initialize
self.layout_analyser = DocXLayoutPredictor(params)


@@ -62,6 +69,16 @@ def __call__(self, image):

return result

def mapping(self, index):
"""
Description:
            return the category name for the given index
"""

category = self.category_map[index]

return category

def release(self):
"""
Description:
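The map_info.json loading added above inverts the name-to-index table so that the new mapping() method can translate predicted class indices back to category names. A minimal sketch of that inversion (the category names below are hypothetical; the real ones live in DocXLayout's map_info.json under "huntie" and "primary_map"):

```python
# Sketch of the inversion performed in LayoutAnalysis.__init__.
# The primary_map here is hypothetical; the real one is read from map_info.json.
primary_map = {'title': 0, 'figure': 1, 'table': 2}  # category name -> index

category_map = {idx: cate for cate, idx in primary_map.items()}  # index -> category name

assert category_map[2] == 'table'  # what mapping(2) would then return
```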
4 changes: 1 addition & 3 deletions Applications/DocXChain/modules/text_recognition.py
@@ -51,11 +51,10 @@ def __call__(self, image, detections):
if self.text_recognizer is not None:
# recognize the text instances one by one
result = []
for i in range(detections.shape[0]): # this part can be accelerated via parallelization (leave for future work)
for i in range(detections.shape[0]): # this part can be largely accelerated via parallelization (leave for future work)
pts = self.order_point(detections[i])
image_crop = self.crop_image(image, pts)
rec = self.text_recognizer(image_crop)
#result.append([rec, ','.join([str(e) for e in list(pts.reshape(-1))])])
result.append(rec)

return result
@@ -78,7 +77,6 @@ def recognize_cropped_image(self, cropped_image):
# perform text recognition
if self.text_recognizer is not None:
# recognize the text instance
#result = self.text_recognizer(cropped_image)['text']
result = self.text_recognizer(cropped_image)

return result
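The recognition loop in __call__ processes cropped text instances one by one, and the updated comment leaves parallelization as future work. A hedged sketch of one way to do it with a thread pool, assuming (unverified) that the underlying ModelScope recognizer is thread-safe:

```python
# Sketch only (not in the codebase): recognize cropped text instances
# concurrently. ASSUMES self.text_recognizer is thread-safe; if it is not,
# a pool of recognizer instances or batched inference would be needed instead.
from concurrent.futures import ThreadPoolExecutor

def recognize_parallel(self, image, detections, max_workers=4):
    def recognize_one(i):
        pts = self.order_point(detections[i])      # as in __call__
        image_crop = self.crop_image(image, pts)
        return self.text_recognizer(image_crop)

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(recognize_one, range(detections.shape[0])))
```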
