DocXChain V0.95
yue kun committed Oct 20, 2023
1 parent 4eb14b9 commit 324d1c3
Showing 15 changed files with 326 additions and 71 deletions.
26 changes: 26 additions & 0 deletions .gitignore
@@ -0,0 +1,26 @@
__pycache__
CIHP_PGN
*.pkl
*.bmp
*.pdf
train_log
train_logs
*.swp
net_*
*.tar
*best-epoch-*
*epoch-*
.DS_Store
.nfs*
.ipynb_checkpoints
*.model
*.gz
others
images
tmp*.py
*.dlc
*.ipynb*
*.pyc
__pycache__
.idea/
*.new
54 changes: 43 additions & 11 deletions Applications/DocXChain/README.md
@@ -2,46 +2,64 @@

## Introduction

<font color=#FFA500 size=10> ***"Making Every Document Literally Accessible to Machines"*** </font>
<font color=#FFA500 size=3> ***"Make Every Unstructured Document Literally Accessible to Machines"*** </font>

Documents have been playing a critically important role in the daily work, study and life of people around the world. Billions, if not trillions, of documents in different forms are created, viewed, processed, transmitted and stored every day, either physically or digitally. However, not all documents in the digital world can be directly accessed by machines (including computers and other automatic equipment), as only a minor portion of the documents can be successfully parsed with low-level procedures. For instance, the [Adobe Extract APIs](https://developer.adobe.com/document-services/docs/overview/pdf-extract-api/) are able to directly convert the metadata of born-digital PDF files into HTML-like trees, but will completely fail when handling PDFs generated from photographs produced by scanners or images captured by cameras. Therefore, if one would like to make documents that are not born-digital accessible to machines, a powerful toolset for extracting elements from documents is of the essence.
Documents are ubiquitous, since they are excellent carriers for recording and spreading information across space and time. Documents have been playing a critically important role in the daily work, study and life of people all over the world. Every day, billions of documents in different forms are created, viewed, processed, transmitted and stored around the world, either physically or digitally. However, not all documents in the digital world can be directly accessed by machines (including computers and other automatic equipment), as only a portion of the documents can be successfully parsed with low-level procedures. For instance, the [Adobe Extract APIs](https://developer.adobe.com/document-services/docs/overview/pdf-extract-api/) are able to directly convert the metadata of born-digital PDF files into HTML-like trees, but would completely fail when handling PDFs generated from photographs produced by scanners or images captured by cameras. Therefore, if one would like to make documents that are not born-digital conveniently and instantly accessible to machines, a powerful toolset for extracting the structures and contents from such unstructured documents is of the essence.

DocXChain is a powerful open-source toolchain for document parsing, which can convert the rich contents in ***unstructured documents***, such as text, tables and charts, into ***structured representations*** that are readable and manipulable by machines. Currently, basic capabilities, including text detection, text recognition, table structure recognition and layout analysis, are provided. In addition, upon these basic capabilities, we also build typical pipelines, i.e., text reading, table parsing, and document structurization, to drive more complicated applications related to documents in real-world scenarios.
DocXChain is a powerful open-source toolchain for document parsing, which can convert the rich information in ***unstructured documents***, such as text, tables and charts, into ***structured representations*** that are readable and manipulable by machines. Currently, basic capabilities, including text detection, text recognition, table structure recognition and layout analysis, are provided. In addition, upon these basic capabilities, we also build typical pipelines, i.e., text reading, table parsing, and document structurization, to drive more complicated applications related to documents in real-world scenarios.

DocXChain is designed and developed with the original aspiration of ***promoting the level of digitization and structurization for documents***. In the future, we will go beyond pure document parsing capabilities, to explore more possibilities, e.g., combining DocXChain with large language models (LLMs) to perform document information extraction (IE), question answering (QA) and retrieval-augmented generation (RAG).

**Notice 1:** In this project, we adopt the ***broad concept of document***, meaning DocXChain can support various kinds of documents, including regular documents (such as books, academic papers and business forms), street view photos, presentations and even screenshots.
For more details, please refer to the [technical report](https://arxiv.org/abs/2310.12430) of DocXChain.

**Notice 2:** We also provide commercial products (online APIs) for document parsing on Alibaba Cloud. Please visit the [homepage of DocMind](https://docmind.console.aliyun.com/doc-overview), if you are interested.
**Notice 1:** In this project, we adopt the ***broad concept of documents***, meaning DocXChain can support various kinds of documents, including regular documents (such as books, academic papers and business forms), street view photos, presentations and even screenshots.

**Notice 2:** You are welcome to experience our online PoC system [DocMaster](https://www.modelscope.cn/studios/damo/DocMaster/summary), which combines basic document parsing capabilities with LLMs to realize precise document information extraction and question answering.

**Notice 3:** We also provide commercial products (online APIs) for document parsing on Alibaba Cloud. Please visit the [homepage of DocMind](https://docmind.console.aliyun.com/doc-overview), if you are interested.

## Core Ideology

The core design ideas of DocXChain are summarized as follows:
- **Object:** The central objects of DocXChain are ***documents***, rather than ***LLMs***.
- **Concision:** The capabilities for document parsing are presented in a "modules + pipelines" fashion, while unnecessary abstraction and encapsulation are abandoned.
- **Compatibility:** This toolchain can be readily integrated with existing tools, libraries or models (such as LangChain and ChatGPT), to build more powerful systems that can accomplish more complicated and challenging tasks.

## Qualitative Examples

* Example of General Text Reading:

![DocXChain_text_reading_example](./resources/DocXChain_text_reading_example.png)

* Example of Table Parsing:

![DocXChain_table_parsing_example](./resources/DocXChain_table_parsing_example.png)

* Example of Document Structurization:

![DocXChain_document_structurization_example](./resources/DocXChain_document_structurization_example.png)

## Installation

* Python version >= 3.7
* Install basic requirements (Python version >= 3.7):

```
pip install -r requirements.txt
```

**[Important]** Install ModelScope as well as related frameworks and libraries (such as PyTorch and TensorFlow). Please refer to the [GitHub homepage of ModelScope](https://github.com/modelscope/modelscope) for more details regarding the installation instructions.
* **[Important]** Install ModelScope as well as related frameworks and libraries (such as PyTorch and TensorFlow). Please refer to the [GitHub homepage of ModelScope](https://github.com/modelscope/modelscope) for more details regarding the installation instructions.

Install ImageMagick (needed to load PDF):
* Install ImageMagick (needed to load PDFs):
```bash
apt-get update
apt-get install libmagickwand-dev
pip install Wand
sed -i '/disable ghostscript format types/,+6d' /etc/ImageMagick-6/policy.xml # run this command if the following message occurs: "wand.exceptions.PolicyError: attempt to perform an operation not allowed by the security policy `PDF'"
```

Download the layout analysis model (a homebrewed model provided by us):
* Download the layout analysis model (a homebrewed model provided by us):
```bash
wget -c -t 100 -P /home/ https://github.com/AlibabaResearch/AdvancedLiterateMachinery/releases/download/v1.2.0-docX-release/DocXLayout_230829.pth
wget -c -t 100 -P /home/ https://github.com/AlibabaResearch/AdvancedLiterateMachinery/releases/download/v1.2.0-docX-release/DocXLayout_231012.pth
```

## Inference
@@ -53,6 +71,20 @@ python example.py table_parsing <document_file_path> <output_file_path> # task: table parsing
python example.py document_structurization <document_file_path> <output_file_path> # task: document structurization
```
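
A minimal programmatic sketch of the same pipeline, distilled from the example.py changes in this commit. The initialize, run, visualize and release calls and the recognition model ID come straight from the diff; the text detection model ID and the exact config keys for this pipeline are assumptions, since the corresponding hunk is abridged above.

```python
# Sketch only: drive the general text reading pipeline from Python.
# ASSUMPTIONS: the text detection model ID below and the exact config keys
# are inferred from the document_structurization example in example.py;
# verify against the full file before relying on them.
import cv2

from pipelines.general_text_reading import GeneralTextReading
from utilities.visualization import general_text_reading_visualization

configs = dict()

text_detection_configs = dict()
text_detection_configs['from_modelscope_flag'] = True
text_detection_configs['model_path'] = 'damo/cv_resnet18_ocr-detection-line-level_damo'  # assumed model ID
configs['text_detection_configs'] = text_detection_configs

text_recognition_configs = dict()
text_recognition_configs['from_modelscope_flag'] = True
text_recognition_configs['model_path'] = 'damo/cv_convnextTiny_ocr-recognition-general_damo'
configs['text_recognition_configs'] = text_recognition_configs

image = cv2.imread('document.png')  # example.py uses a load_document() helper that also handles PDFs

text_reader = GeneralTextReading(configs)                               # initialize
final_result = text_reader(image)                                       # run
output_image = general_text_reading_visualization(final_result, image)  # visualize
text_reader.release()                                                   # release

cv2.imwrite('output.png', output_image)
```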

## Citation

If you find our work beneficial, please cite:

```
@article{DocXChain2023,
title={{DocXChain: A Powerful Open-Source Toolchain for Document Parsing and Beyond}},
author={Cong Yao},
journal={ArXiv},
year={2023},
url={https://arxiv.org/abs/2310.12430}
}
```

## *License*

DocXChain is released under the terms of the [Apache License, Version 2.0](LICENSE).
42 changes: 27 additions & 15 deletions Applications/DocXChain/example.py
@@ -14,6 +14,7 @@
from pipelines.general_text_reading import GeneralTextReading
from pipelines.table_parsing import TableParsing
from pipelines.document_structurization import DocumentStructurization
from utilities.visualization import *

def general_text_reading_example(image):

@@ -36,13 +37,16 @@ def general_text_reading_example(image):
# run
final_result = text_reader(image)

# display
output_image = image.copy()
if True:
print (final_result)

# visualize
output_image = general_text_reading_visualization(final_result, image)

# release
text_reader.release()

return output_image, final_result
return final_result, output_image

def table_parsing_example(image):

@@ -70,13 +74,16 @@ def table_parsing_example(image):
# run
final_result = table_parser(image)

# display
output_image = image.copy()
if True:
print (final_result)

# visualize
output_image = table_parsing_visualization(final_result, image)

# release
table_parser.release()

return output_image, final_result
return final_result, output_image


def document_structurization_example(image):
@@ -86,7 +93,7 @@

layout_analysis_configs = dict()
layout_analysis_configs['from_modelscope_flag'] = False
layout_analysis_configs['model_path'] = '/home/DocXLayout_230829.pth' # layout analysis model is NOT from modelscope
layout_analysis_configs['model_path'] = '/home/DocXLayout_231012.pth' # note that: currently the layout analysis model is NOT from modelscope
configs['layout_analysis_configs'] = layout_analysis_configs

text_detection_configs = dict()
@@ -96,7 +103,7 @@

text_recognition_configs = dict()
text_recognition_configs['from_modelscope_flag'] = True
text_recognition_configs['model_path'] = 'damo/cv_convnextTiny_ocr-recognition-general_damo' # alternatives: 'damo/cv_convnextTiny_ocr-recognition-scene_damo', 'damo/cv_convnextTiny_ocr-recognition-document_damo', 'damo/cv_convnextTiny_ocr-recognition-handwritten_damo'
text_recognition_configs['model_path'] = 'damo/cv_convnextTiny_ocr-recognition-document_damo' # alternatives: 'damo/cv_convnextTiny_ocr-recognition-scene_damo', 'damo/cv_convnextTiny_ocr-recognition-general_damo', 'damo/cv_convnextTiny_ocr-recognition-handwritten_damo'
configs['text_recognition_configs'] = text_recognition_configs

# initialize
@@ -105,13 +112,16 @@
# run
final_result = document_structurizer(image)

# display
output_image = image.copy()
if True:
print (final_result)

# visualize
output_image = document_structurization_visualization(final_result, image)

# release
document_structurizer.release()

return output_image, final_result
return final_result, output_image

# main routine
def main():
@@ -137,18 +147,20 @@
image = load_document(args.document_path)

# process
output_image = None
if image is not None:
if args.task == 'general_text_reading':
general_text_reading_example(image)
final_result, output_image = general_text_reading_example(image)
elif args.task == 'table_parsing':
table_parsing_example(image)
final_result, output_image = table_parsing_example(image)
else: # args.task == 'document_structurization'
document_structurization_example(image)
final_result, output_image = document_structurization_example(image)
else:
print ("Failed to load the document file!")

# output
cv2.imwrite(args.output_path, image)
if output_image is not None:
cv2.imwrite(args.output_path, output_image)

# finish
now = datetime.datetime.now(tz)
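A change worth noting in example.py above: the default recognition model for document structurization switches from the general variant to the document variant. The four model IDs named in the diff are interchangeable ModelScope models, one per document type; a small sketch of selecting among them (the dictionary and its keys are illustrative, not part of the codebase):

```python
# Illustrative helper (not in the codebase): the four ModelScope recognition
# model IDs listed in example.py, keyed by the document type they target.
RECOGNITION_MODELS = {
    'general': 'damo/cv_convnextTiny_ocr-recognition-general_damo',
    'scene': 'damo/cv_convnextTiny_ocr-recognition-scene_damo',
    'document': 'damo/cv_convnextTiny_ocr-recognition-document_damo',
    'handwritten': 'damo/cv_convnextTiny_ocr-recognition-handwritten_damo',
}

text_recognition_configs = dict()
text_recognition_configs['from_modelscope_flag'] = True
text_recognition_configs['model_path'] = RECOGNITION_MODELS['document']  # the new default in this commit
```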
27 changes: 22 additions & 5 deletions Applications/DocXChain/modules/layout_analysis.py
@@ -4,9 +4,7 @@
import os
import sys
import numpy as np
import datetime
import time
import cv2
import json

BASE_DIR = os.path.dirname(__file__)
sys.path.append(BASE_DIR + '/../../../DocumentUnderstanding/DocXLayout')
@@ -29,13 +27,22 @@ def __init__(self, configs):

# initialize and launch module
if configs['from_modelscope_flag'] is True:
self.layout_analyser = None # (20230912) currently we only support models from Advanced Literate Machinery (https://github.com/AlibabaResearch/AdvancedLiterateMachinery)
self.layout_analyser = None # (20230912) currently we only support layout analysis model from Advanced Literate Machinery (https://github.com/AlibabaResearch/AdvancedLiterateMachinery)
else:
params = {
'model_file': configs['model_path'],
'debug': 0, # 1: save vis results, 0: don't save
}


# load map information
map_info = json.load(open(BASE_DIR + '/../../../DocumentUnderstanding/DocXLayout/map_info.json'))
category_map = {}
for cate, idx in map_info["huntie"]["primary_map"].items():
category_map[idx] = cate

self.category_map = category_map

# initialize
self.layout_analyser = DocXLayoutPredictor(params)


@@ -62,6 +69,16 @@ def __call__(self, image):

return result

def mapping(self, index):
"""
Description:
            return the category name for the given index
"""

category = self.category_map[index]

return category

def release(self):
"""
Description:
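The map_info.json loading added above inverts the name-to-index table so that the new mapping() method can translate predicted class indices back to category names. A minimal sketch of that inversion (the category names below are hypothetical; the real ones live in DocXLayout's map_info.json under "huntie" and "primary_map"):

```python
# Sketch of the inversion performed in LayoutAnalysis.__init__.
# The primary_map here is hypothetical; the real one is read from map_info.json.
primary_map = {'title': 0, 'figure': 1, 'table': 2}  # category name -> index

category_map = {idx: cate for cate, idx in primary_map.items()}  # index -> category name

assert category_map[2] == 'table'  # what mapping(2) would then return
```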
4 changes: 1 addition & 3 deletions Applications/DocXChain/modules/text_recognition.py
@@ -51,11 +51,10 @@ def __call__(self, image, detections):
if self.text_recognizer is not None:
# recognize the text instances one by one
result = []
for i in range(detections.shape[0]): # this part can be accelerated via parallelization (leave for future work)
for i in range(detections.shape[0]): # this part can be largely accelerated via parallelization (leave for future work)
pts = self.order_point(detections[i])
image_crop = self.crop_image(image, pts)
rec = self.text_recognizer(image_crop)
#result.append([rec, ','.join([str(e) for e in list(pts.reshape(-1))])])
result.append(rec)

return result
@@ -78,7 +77,6 @@ def recognize_cropped_image(self, cropped_image):
# perform text recognition
if self.text_recognizer is not None:
# recognize the text instance
#result = self.text_recognizer(cropped_image)['text']
result = self.text_recognizer(cropped_image)

return result
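The recognition loop in __call__ processes cropped text instances one by one, and the updated comment leaves parallelization as future work. A hedged sketch of one way to do it with a thread pool, assuming (unverified) that the underlying ModelScope recognizer is thread-safe:

```python
# Sketch only (not in the codebase): recognize cropped text instances
# concurrently. ASSUMES self.text_recognizer is thread-safe; if it is not,
# a pool of recognizer instances or batched inference would be needed instead.
from concurrent.futures import ThreadPoolExecutor

def recognize_parallel(self, image, detections, max_workers=4):
    def recognize_one(i):
        pts = self.order_point(detections[i])      # as in __call__
        image_crop = self.crop_image(image, pts)
        return self.text_recognizer(image_crop)

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(recognize_one, range(detections.shape[0])))
```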
