simple_pdf: works with newer PyMuPDF versions

Add pypdf text extraction backend.
Update README for pypdf, and also for the fact that PyMuPDF is now easy to install.
Remove embedded XML file from PDF test file tests/pdf/akretion_france-test.pdf.
OCA · Oct 20, 2023 · commit 9de1261
1 parent 1ec3e46

Showing 7 changed files with 59 additions and 21 deletions.
diff --git a/account_invoice_import_simple_pdf/__manifest__.py b/account_invoice_import_simple_pdf/__manifest__.py
@@ -12,9 +12,13 @@
     "maintainers": ["alexis-via"],
     "website": "https://github.com/OCA/edi",
     "depends": ["account_invoice_import"],
-    # "excludes": ["account_invoice_import_invoice2data"],
     "external_dependencies": {
-        "python": ["pdfplumber", "regex", "dateparser"],
+        "python": [
+            "pdfplumber",
+            "regex",
+            "dateparser",
+            "pypdf>=3.1.0",
+        ],
         "deb": ["libmupdf-dev", "mupdf", "mupdf-tools", "poppler-utils"],
     },
     "data": [

diff --git a/account_invoice_import_simple_pdf/readme/CONFIGURE.rst b/account_invoice_import_simple_pdf/readme/CONFIGURE.rst
@@ -9,6 +9,7 @@ If you want to force Odoo to use a specific text extraction method, go to the me
   #. pdftotext.lib
   #. pdftotext.cmd
   #. pdfplumber
+  #. pypdf
 
 In this configuration, Odoo will only use the selected text extraction method and, if it fails, it will display an error message.
 

diff --git a/account_invoice_import_simple_pdf/readme/INSTALL.rst b/account_invoice_import_simple_pdf/readme/INSTALL.rst
@@ -1,33 +1,27 @@
 The most important technical component of this module is the tool that converts the PDF to text. Converting PDF to text is not an easy job. As outlined in this `blog post <https://dida.do/blog/how-to-extract-text-from-pdf>`_, different tools can give quite different results. The best results are usually achieved with tools based on a PDF viewer, which exclude pure-python tools. But pure-python tools are easier to install than tools based on a PDF viewer. It is important to understand that, if you change the PDF to text tool, you will certainly have a slightly different text output, which may oblige you to update the field extraction rule, which can be time-consuming if you have already configured many vendors.
 
-The module supports 4 different extraction methods:
+The module supports 5 different extraction methods:
 
 1. `PyMuPDF <https://github.com/pymupdf/PyMuPDF>`_ which is a Python binding for `MuPDF <https://mupdf.com/>`_, a lightweight PDF toolkit/viewer/renderer published under the AGPL licence by the company `Artifex Software <https://artifex.com/>`_.
 #. `pdftotext python library <https://pypi.org/project/pdftotext/>`_, which is a python binding for the pdftotext tool.
 #. `pdftotext command line tool <https://en.wikipedia.org/wiki/Pdftotext>`_, which is based on `poppler <https://poppler.freedesktop.org/>`_, a PDF rendering library used by `xpdf <https://www.xpdfreader.com/>`_ and `Evince <https://wiki.gnome.org/Apps/Evince/FrequentlyAskedQuestions>`_ (the PDF reader of `Gnome <https://www.gnome.org/>`_).
 #. `pdfplumber <https://pypi.org/project/pdfplumber/>`_, which is a python library built on top the of the python library `pdfminer.six <https://pypi.org/project/pdfminer.six/>`_. pdfplumber is a pure-python solution, so it's very easy to install on all OSes.
+#. `pypdf <https://github.com/py-pdf/pypdf/>`_, which is one of the most common PDF libraries for Python. pypdf is a pure-Python solution, so it's very easy to install on all OSes.
 
-PyMuPDF and pdftotext both give a very good text output. So far, I can't say which one is best. pdfplumber often gives lower-quality text output, but its advantage is that it's a pure-Python solution, so you will always be able to install it whatever your technical environnement is.
+PyMuPDF and pdftotext both give a very good text output. So far, I can't say which one is best. pdfplumber and pypdf often give lower-quality text output, but their advantage is that they are pure-Python libraries, so you will always be able to install them whatever your technical environment is.
 
 You can choose one extraction method and only install the tools/libs for that method.
 
 Install PyMuPDF
 ~~~~~~~~~~~~~~~
 
-To install **PyMuPDF**, if you use Debian (Bullseye aka v11 or higher) or Ubuntu (20.04 or higher), run the following command:
+Install it via pip:
 
 .. code::
 
-  sudo apt install python3-fitz
+  sudo pip3 install --upgrade pymupdf
 
-You can also install it via pip:
-
-.. code::
-
-  sudo pip3 install --upgrade PyMuPDF
-
-
-but beware that *PyMuPDF* is just a binding on MuPDF, so it will require MuPDF and all the development libs required to compile the binding. That's why *PyMuPDF* is much easier to install via the packages of your Linux distribution (package name **python3-fitz** on Debian/Ubuntu, but the package name may be different in other distributions) than with pip.
+Beware that *PyMuPDF* is not a pure-Python library: it uses MuPDF, which is written in C. If a Python wheel for your OS, CPU architecture and Python version is available on PyPI (check the `list of PyMuPDF wheels <https://pypi.org/project/PyMuPDF/#files>`_ on PyPI), it will install smoothly. Otherwise, the installation via pip will require MuPDF and all its development libs to compile the binding.
 
 Install pdftotext python lib
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~
@@ -64,6 +58,16 @@ To install the **pdfplumber** python lib, run:
 
   sudo pip3 install --upgrade pdfplumber
 
+Install pypdf
+~~~~~~~~~~~~~
+
+To install the **pypdf** python lib, run:
+
+.. code::
+
+  sudo pip3 install --upgrade pypdf
+
+
 Other requirements
 ~~~~~~~~~~~~~~~~~~
 

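Since INSTALL.rst lets you install only the libs for the extraction method you choose, it can help to check which Python backends are actually importable in a given environment. The following is a minimal sketch, not part of this commit: the function name `available_backends` and the mapping keys are hypothetical, and the module names (`fitz` for PyMuPDF, `pdftotext`, `pdfplumber`, `pypdf`) are taken from the imports visible in the diff. The pdftotext command line tool is not a Python module, so it is not covered here.

```python
import importlib.util

# Hypothetical mapping: label -> importable module name.
# "fitz" is the module name under which the diff uses PyMuPDF.
BACKEND_MODULES = {
    "pymupdf": "fitz",
    "pdftotext.lib": "pdftotext",
    "pdfplumber": "pdfplumber",
    "pypdf": "pypdf",
}


def available_backends(modules=BACKEND_MODULES):
    """Return the labels whose module can be found, without importing it."""
    return [
        label
        for label, module_name in modules.items()
        if importlib.util.find_spec(module_name) is not None
    ]


if __name__ == "__main__":
    print("Importable text extraction backends:", available_backends())
```

Running it prints only the backends that are importable in the current Python environment; it does not check library versions such as the `pypdf>=3.1.0` constraint.
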
diff --git a/account_invoice_import_simple_pdf/tests/pdf/akretion_france-test.pdf b/account_invoice_import_simple_pdf/tests/pdf/akretion_france-test.pdf
diff --git a/account_invoice_import_simple_pdf/tests/test_invoice_import.py b/account_invoice_import_simple_pdf/tests/test_invoice_import.py
@@ -555,17 +555,22 @@ def test_complete_import(self):
         self.assertEqual(float_compare(iline.price_unit, 1509, precision_digits=2), 0)
         inv.unlink()
 
-    def test_complete_import_pdfplumber(self):
+    def _complete_import_specific_method(self, method):
         icpo = self.env["ir.config_parameter"]
         key = "invoice_import_simple_pdf.pdf2txt"
-        method = "pdfplumber"
         configp = icpo.search([("key", "=", key)])
         if configp:
             configp.write({"value": method})
         else:
             icpo.create({"key": key, "value": method})
         self.test_complete_import()
 
+    def test_specific_python_methods(self):
+        # test only pure-python methods
+        # because we are sure they work on the GitHub test environment
+        self._complete_import_specific_method("pdfplumber")
+        self._complete_import_specific_method("pypdf")
+
     def test_test_mode(self):
         self.partner_ak.write(
             {

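The refactored test above sets the config parameter to a method name and reruns the complete import once per method. One possible variation, sketched here outside Odoo under stated assumptions, is to wrap each method in `unittest`'s `subTest` so that a failure is reported per method. The class name, the `params` dict (a stand-in for `ir.config_parameter`) and `_complete_import` (a stand-in for `test_complete_import`) are hypothetical and only illustrate the pattern.

```python
import unittest

PDF2TXT_KEY = "invoice_import_simple_pdf.pdf2txt"


class SpecificMethodSketch(unittest.TestCase):
    def setUp(self):
        self.params = {}  # hypothetical stand-in for ir.config_parameter
        self.seen = []

    def _complete_import(self):
        # Stand-in for test_complete_import(): record the active method.
        self.seen.append(self.params[PDF2TXT_KEY])

    def test_pure_python_methods(self):
        for method in ("pdfplumber", "pypdf"):
            # Each method is reported separately if it fails.
            with self.subTest(method=method):
                self.params[PDF2TXT_KEY] = method
                self._complete_import()
        self.assertEqual(self.seen, ["pdfplumber", "pypdf"])


def run_sketch():
    """Run the sketch test case and return the unittest result object."""
    loader = unittest.defaultTestLoader
    suite = loader.loadTestsFromTestCase(SpecificMethodSketch)
    return unittest.TextTestRunner(verbosity=0).run(suite)
```
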
diff --git a/account_invoice_import_simple_pdf/wizard/account_invoice_import.py b/account_invoice_import_simple_pdf/wizard/account_invoice_import.py
@@ -28,6 +28,10 @@
     import pdftotext
 except ImportError:
     logger.debug("Cannot import pdftotext")
+try:
+    import pypdf
+except ImportError:
+    logger.debug("Cannot import pypdf")
 
 
 class AccountInvoiceImport(models.TransientModel):
@@ -50,13 +54,13 @@ def _simple_pdf_text_extraction_pymupdf(self, fileobj, test_info):
             pages = []
             doc = fitz.open(fileobj.name)
             for page in doc:
-                pages.append(page.getText("text"))
+                pages.append(page.get_text())
             res = {
                 "all": "\n\n".join(pages),
                 "first": pages and pages[0] or "",
             }
-            logger.info("Text extraction made with PyMuPDF")
-            test_info["text_extraction"] = "pymupdf"
+            logger.info("Text extraction made with PyMuPDF %s", fitz.__version__)
+            test_info["text_extraction"] = "pymupdf %s" % fitz.__version__
         except Exception as e:
             logger.warning("Text extraction with PyMuPDF failed. Error: %s", e)
         return res
@@ -76,8 +80,23 @@ def _simple_pdf_text_extraction_pdfplumber(self, fileobj, test_info):
                 "all": "\n\n".join(pages),
                 "first": pages and pages[0] or "",
             }
-        test_info["text_extraction"] = "pdfplumber"
-        logger.info("Text extraction made with pdfplumber")
+        test_info["text_extraction"] = "pdfplumber %s" % pdfplumber.__version__
+        logger.info("Text extraction made with pdfplumber %s", pdfplumber.__version__)
+        return res
+
+    @api.model
+    def _simple_pdf_text_extraction_pypdf(self, fileobj, test_info):
+        res = False
+        reader = pypdf.PdfReader(fileobj.name)
+        pages = []
+        for pdf_page in reader.pages:
+            pages.append(pdf_page.extract_text())
+            res = {
+                "all": "\n\n".join(pages),
+                "first": pages and pages[0] or "",
+            }
+        test_info["text_extraction"] = "pypdf %s" % pypdf.__version__
+        logger.info("Text extraction made with pypdf %s", pypdf.__version__)
         return res
 
     @api.model
@@ -147,6 +166,8 @@ def _simple_pdf_text_extraction_specific_tool(
             res = self._simple_pdf_text_extraction_pdftotext_cmd(fileobj, test_info)
         elif specific_tool == "pdfplumber":
             res = self._simple_pdf_text_extraction_pdfplumber(fileobj, test_info)
+        elif specific_tool == "pypdf":
+            res = self._simple_pdf_text_extraction_pypdf(fileobj, test_info)
         else:
             raise UserError(
                 _(
@@ -195,6 +216,8 @@ def simple_pdf_text_extraction(self, file_data, test_info):
                 res = self._simple_pdf_text_extraction_pdftotext_cmd(fileobj, test_info)
             if not res:
                 res = self._simple_pdf_text_extraction_pdfplumber(fileobj, test_info)
+            if not res:
+                res = self._simple_pdf_text_extraction_pypdf(fileobj, test_info)
             if not res:
                 raise UserError(
                     _(

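The last hunk appends pypdf to a chain of `if not res:` fallbacks. The same idea can be sketched outside Odoo as a loop over extractors. This is an illustrative sketch, not the module's implementation: `extract_with_fallback` and `pypdf_pages` are hypothetical names, and unlike the diff (where only the PyMuPDF method is shown wrapped in `try/except`), the sketch treats any exception in any backend as a failed attempt. The pypdf calls used (`PdfReader`, `reader.pages`, `extract_text()`) are the ones that appear in the diff.

```python
def pypdf_pages(path):
    """Extract text with pypdf, returning the same dict shape as the diff."""
    import pypdf  # imported lazily so the sketch loads without pypdf

    reader = pypdf.PdfReader(path)
    pages = [page.extract_text() for page in reader.pages]
    return {"all": "\n\n".join(pages), "first": pages[0] if pages else ""}


def extract_with_fallback(path, extractors):
    """Try each (name, extractor) pair in order; first truthy result wins."""
    for name, extractor in extractors:
        try:
            res = extractor(path)
        except Exception:
            # Assumption of this sketch: a failing backend is simply skipped.
            res = False
        if res:
            return name, res
    raise RuntimeError("No text extraction method succeeded")
```

A call such as `extract_with_fallback("invoice.pdf", [("pypdf", pypdf_pages)])` would return the backend name together with the extracted text, assuming pypdf is installed and the hypothetical file `invoice.pdf` exists.
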
diff --git a/requirements.txt b/requirements.txt
@@ -5,6 +5,7 @@ invoice2data
 ovh
 pdfplumber
 phonenumbers
+pypdf>=3.1.0
 pyyaml
 regex
 xmlschema