simple_pdf: works with newer PyMuPDF versions
Add pypdf text extraction backend
Update README for pypdf, and also for the fact that PyMuPDF is now easy to install
Remove embedded XML file from PDF test file tests/pdf/akretion_france-test.pdf !
alexis-via committed Oct 20, 2023
1 parent 1ec3e46 commit 9de1261
Showing 7 changed files with 59 additions and 21 deletions.
8 changes: 6 additions & 2 deletions account_invoice_import_simple_pdf/__manifest__.py
@@ -12,9 +12,13 @@
"maintainers": ["alexis-via"],
"website": "https://github.com/OCA/edi",
"depends": ["account_invoice_import"],
# "excludes": ["account_invoice_import_invoice2data"],
"external_dependencies": {
"python": ["pdfplumber", "regex", "dateparser"],
"python": [
"pdfplumber",
"regex",
"dateparser",
"pypdf>=3.1.0",
],
"deb": ["libmupdf-dev", "mupdf", "mupdf-tools", "poppler-utils"],
},
"data": [
1 change: 1 addition & 0 deletions account_invoice_import_simple_pdf/readme/CONFIGURE.rst
@@ -9,6 +9,7 @@ If you want to force Odoo to use a specific text extraction method, go to the me
#. pdftotext.lib
#. pdftotext.cmd
#. pdfplumber
#. pypdf

In this configuration, Odoo will only use the selected text extraction method and, if it fails, it will display an error message.

28 changes: 16 additions & 12 deletions account_invoice_import_simple_pdf/readme/INSTALL.rst
@@ -1,33 +1,27 @@
The most important technical component of this module is the tool that converts the PDF to text. Converting PDF to text is not an easy job. As outlined in this `blog post <https://dida.do/blog/how-to-extract-text-from-pdf>`_, different tools can give quite different results. The best results are usually achieved with tools based on a PDF viewer, which exclude pure-python tools. But pure-python tools are easier to install than tools based on a PDF viewer. It is important to understand that, if you change the PDF to text tool, you will certainly have a slightly different text output, which may oblige you to update the field extraction rule, which can be time-consuming if you have already configured many vendors.

The module supports 4 different extraction methods:
The module supports 5 different extraction methods:

1. `PyMuPDF <https://github.com/pymupdf/PyMuPDF>`_ which is a Python binding for `MuPDF <https://mupdf.com/>`_, a lightweight PDF toolkit/viewer/renderer published under the AGPL licence by the company `Artifex Software <https://artifex.com/>`_.
#. `pdftotext python library <https://pypi.org/project/pdftotext/>`_, which is a python binding for the pdftotext tool.
#. `pdftotext command line tool <https://en.wikipedia.org/wiki/Pdftotext>`_, which is based on `poppler <https://poppler.freedesktop.org/>`_, a PDF rendering library used by `xpdf <https://www.xpdfreader.com/>`_ and `Evince <https://wiki.gnome.org/Apps/Evince/FrequentlyAskedQuestions>`_ (the PDF reader of `Gnome <https://www.gnome.org/>`_).
#. `pdfplumber <https://pypi.org/project/pdfplumber/>`_, which is a python library built on top of the python library `pdfminer.six <https://pypi.org/project/pdfminer.six/>`_. pdfplumber is a pure-python solution, so it's very easy to install on all OSes.
#. `pypdf <https://github.com/py-pdf/pypdf/>`_, which is one of the most common PDF libs for Python. pypdf is a pure-python solution, so it's very easy to install on all OSes.

PyMuPDF and pdftotext both give a very good text output. So far, I can't say which one is best. pdfplumber often gives lower-quality text output, but its advantage is that it's a pure-Python solution, so you will always be able to install it whatever your technical environment is.
PyMuPDF and pdftotext both give a very good text output. So far, I can't say which one is best. pdfplumber and pypdf often give lower-quality text output, but their advantage is that they are pure-Python libraries, so you will always be able to install them whatever your technical environment is.

You can choose one extraction method and only install the tools/libs for that method.
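
Since only one method needs to be installed, it can help to check which ones are usable on a given host before configuring Odoo. The snippet below is a minimal sketch and not part of this commit: `available_backends` is a hypothetical helper, the dictionary keys are illustrative labels, and it assumes PyMuPDF is importable as `fitz` (the name used by the wizard code in this diff).

```python
import importlib.util
import shutil


def available_backends():
    """Report which text extraction backends look usable on this host."""

    def has_module(name):
        # find_spec returns None when a top-level module is not installed
        try:
            return importlib.util.find_spec(name) is not None
        except (ImportError, ValueError):
            return False

    return {
        # Assumption: PyMuPDF is imported as "fitz", as in the wizard code
        "pymupdf": has_module("fitz"),
        "pdftotext.lib": has_module("pdftotext"),
        # The command line tool is detected by looking it up on the PATH
        "pdftotext.cmd": shutil.which("pdftotext") is not None,
        "pdfplumber": has_module("pdfplumber"),
        "pypdf": has_module("pypdf"),
    }
```

Calling `available_backends()` returns a dict of booleans. It only detects presence, not whether the installed version is recent enough (for example `pypdf>=3.1.0`).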

Install PyMuPDF
~~~~~~~~~~~~~~~

To install **PyMuPDF**, if you use Debian (Bullseye aka v11 or higher) or Ubuntu (20.04 or higher), run the following command:
Install it via pip:

.. code::
sudo apt install python3-fitz
sudo pip3 install --upgrade pymupdf
You can also install it via pip:

.. code::
sudo pip3 install --upgrade PyMuPDF
but beware that *PyMuPDF* is just a binding on MuPDF, so it will require MuPDF and all the development libs required to compile the binding. That's why *PyMuPDF* is much easier to install via the packages of your Linux distribution (package name **python3-fitz** on Debian/Ubuntu, but the package name may be different in other distributions) than with pip.
Beware that *PyMuPDF* is not a pure-python library: it uses MuPDF, which is written in C. If a python wheel for your OS, CPU architecture and Python version is available on pypi (check the `list of PyMuPDF wheels <https://pypi.org/project/PyMuPDF/#files>`_ on pypi), it will install smoothly. Otherwise, the installation via pip will require MuPDF and all its development libs to compile the binding.

Install pdftotext python lib
~~~~~~~~~~~~~~~~~~~~~~~~~~~~
@@ -64,6 +58,16 @@ To install the **pdfplumber** python lib, run:
sudo pip3 install --upgrade pdfplumber
Install pypdf
~~~~~~~~~~~~~

To install the **pypdf** python lib, run:

.. code::
sudo pip3 install --upgrade pypdf
Other requirements
~~~~~~~~~~~~~~~~~~

Binary file not shown.
@@ -555,17 +555,22 @@ def test_complete_import(self):
self.assertEqual(float_compare(iline.price_unit, 1509, precision_digits=2), 0)
inv.unlink()

def test_complete_import_pdfplumber(self):
def _complete_import_specific_method(self, method):
icpo = self.env["ir.config_parameter"]
key = "invoice_import_simple_pdf.pdf2txt"
method = "pdfplumber"
configp = icpo.search([("key", "=", key)])
if configp:
configp.write({"value": method})
else:
icpo.create({"key": key, "value": method})
self.test_complete_import()

def test_specific_python_methods(self):
# test only pure-python methods
# because we are sure they work on the Github test environment
self._complete_import_specific_method("pdfplumber")
self._complete_import_specific_method("pypdf")

def test_test_mode(self):
self.partner_ak.write(
{
33 changes: 28 additions & 5 deletions account_invoice_import_simple_pdf/wizard/account_invoice_import.py
@@ -28,6 +28,10 @@
import pdftotext
except ImportError:
logger.debug("Cannot import pdftotext")
try:
import pypdf
except ImportError:
logger.debug("Cannot import pypdf")

class AccountInvoiceImport(models.TransientModel):
@@ -50,13 +54,13 @@ def _simple_pdf_text_extraction_pymupdf(self, fileobj, test_info):
pages = []
doc = fitz.open(fileobj.name)
for page in doc:
pages.append(page.getText("text"))
pages.append(page.get_text())

res = {
"all": "\n\n".join(pages),
"first": pages and pages[0] or "",
}
logger.info("Text extraction made with PyMuPDF")
test_info["text_extraction"] = "pymupdf"
logger.info("Text extraction made with PyMuPDF %s", fitz.__version__)
test_info["text_extraction"] = "pymupdf %s" % fitz.__version__

except Exception as e:
logger.warning("Text extraction with PyMuPDF failed. Error: %s", e)
return res
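
The hunk above switches from `page.getText("text")` to `page.get_text()`, which is what makes the method work with newer PyMuPDF versions. The commit simply adopts the new name. If older PyMuPDF releases also had to be supported, one possible approach is sketched below. `page_text` is a hypothetical helper that is not part of this commit, and it assumes both method names accept the `"text"` option as first argument.

```python
def page_text(page):
    """Return the plain text of a PyMuPDF page, whichever method name it exposes."""
    # Prefer the snake_case method used by the new code in this commit
    getter = getattr(page, "get_text", None)
    if getter is None:
        # Fall back to the camelCase name used by the old code
        getter = getattr(page, "getText")
    return getter("text")
```

Inside the loop over `doc`, `pages.append(page_text(page))` would then replace the direct method call.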
@@ -76,8 +80,23 @@ def _simple_pdf_text_extraction_pdfplumber(self, fileobj, test_info):
"all": "\n\n".join(pages),
"first": pages and pages[0] or "",
}
test_info["text_extraction"] = "pdfplumber"
logger.info("Text extraction made with pdfplumber")
test_info["text_extraction"] = "pdfplumber %s" % pdfplumber.__version__
logger.info("Text extraction made with pdfplumber %s", pdfplumber.__version__)
return res

@api.model
def _simple_pdf_text_extraction_pypdf(self, fileobj, test_info):
res = False
reader = pypdf.PdfReader(fileobj.name)
pages = []
for pdf_page in reader.pages:
pages.append(pdf_page.extract_text())
res = {
"all": "\n\n".join(pages),
"first": pages and pages[0] or "",
}
test_info["text_extraction"] = "pypdf %s" % pypdf.__version__
logger.info("Text extraction made with pypdf %s", pypdf.__version__)
return res

@api.model
@@ -147,6 +166,8 @@ def _simple_pdf_text_extraction_specific_tool(
res = self._simple_pdf_text_extraction_pdftotext_cmd(fileobj, test_info)
elif specific_tool == "pdfplumber":
res = self._simple_pdf_text_extraction_pdfplumber(fileobj, test_info)
elif specific_tool == "pypdf":
res = self._simple_pdf_text_extraction_pypdf(fileobj, test_info)
else:
raise UserError(
_(
@@ -195,6 +216,8 @@ def simple_pdf_text_extraction(self, file_data, test_info):
res = self._simple_pdf_text_extraction_pdftotext_cmd(fileobj, test_info)
if not res:
res = self._simple_pdf_text_extraction_pdfplumber(fileobj, test_info)
if not res:
res = self._simple_pdf_text_extraction_pypdf(fileobj, test_info)

if not res:
raise UserError(
_(
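
The automatic mode in `simple_pdf_text_extraction` tries each method in turn and raises an error only when all of them return nothing, with pypdf now added as the last fallback. The standalone sketch below illustrates that pattern outside Odoo. `build_result` and `extract_with_fallback` are hypothetical names, the extractors are plain callables returning a list of page strings, and unlike the wizard code (where only the PyMuPDF method catches exceptions, and a `UserError` is raised) this sketch catches every extractor error and raises a `RuntimeError`.

```python
def build_result(pages):
    """Bundle per-page text in the same shape as the methods in this diff."""
    return {
        "all": "\n\n".join(pages),
        "first": pages[0] if pages else "",
    }


def extract_with_fallback(path, extractors):
    """Try each (name, callable) pair in order; return the first non-empty result."""
    for name, extractor in extractors:
        try:
            pages = extractor(path)
        except Exception:
            # A failing backend should not prevent trying the next one
            continue
        if pages:
            return name, build_result(pages)
    raise RuntimeError("No text extraction method succeeded for %s" % path)
```

The order of the `extractors` list encodes the preference, so the viewer-based tools would come first and the pure-Python libraries last.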
1 change: 1 addition & 0 deletions requirements.txt
@@ -5,6 +5,7 @@ invoice2data
ovh
pdfplumber
phonenumbers
pypdf>=3.1.0
pyyaml
regex
xmlschema
